Delimited text files
In a delimited text file, each line contains a record. The columns are separated by a delimiter character. In cases where a field may contain the delimiter character, it may be quoted or escaped.
The two most common variants of delimited text formats are CSV (Comma Separated Values) and TSV (Tab Separated Values). In CSV files, columns are separated by a comma. Fields that contain commas or newline characters are quoted using double quote characters. When a quoted field itself contains double quote characters, they are replaced with two successive double quotes. In TSV files, the tab character is used as the column delimiter. Tab characters are not allowed inside fields in standard TSV format.
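The CSV quoting rule can be sketched in a few lines of plain .NET code. This is only an illustration of the rule described above, not the library's implementation; `QuoteCsvField` is a hypothetical helper name.

```csharp
using System;

// Quote a single field following the CSV rules described above.
// QuoteCsvField is a hypothetical helper, not part of the library.
static string QuoteCsvField(string field)
{
    // Only fields containing a comma, a double quote, or a line break need quoting.
    bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
    if (!needsQuotes)
        return field;

    // Double every embedded quote, then wrap the whole field in quotes.
    return "\"" + field.Replace("\"", "\"\"") + "\"";
}

Console.WriteLine(QuoteCsvField("plain"));        // plain
Console.WriteLine(QuoteCsvField("Smith, John"));  // "Smith, John"
Console.WriteLine(QuoteCsvField("say \"hi\""));   // "say ""hi"""
```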
Reading and writing delimited text files is implemented by the DelimitedTextFile and DelimitedTextStream classes.
Delimited text options
The DelimitedTextOptions class defines the options available when reading from delimited text files. It inherits from TextOptions, and has several properties in addition to those defined in the TextOptions class. These are listed in the table below:
Property | Description |
---|---|
| | The character to use to separate columns. The default is a comma (,). |
| | The string to use to end a line. The default is a carriage return followed by a line feed. |
Quote | A QuoteUsage value that specifies when fields should be quoted. The default is AsNeededForColumnType. The possible options are listed later in this section. |
| | The character to use when quoting fields. The default is a double quote ("). |
QuoteEscapeMethod | A QuoteEscapeMethod value that specifies how to handle quote characters inside a quoted field. The default is to double the quote character. |
| | The character to use when escaping a quote inside a quoted field. Any character following the escape character is assumed to be part of the field. The default is the backslash character (\). This value is ignored unless QuoteEscapeMethod is EscapeCharacter. |
The Quote property specifies when a field should be quoted. The possible values are listed below:
Value | Description |
---|---|
Never | Fields are never quoted. This is the default for tab separated values. |
AsNeeded | Every field of every row is scanned individually for the column delimiter, end of line characters, the quote character, and the escape character. If the string representation of the field value contains such a character, then the field is quoted. |
AsNeededForColumnType | A determination is made for each column type whether its values may need to be quoted. If so, the field is always quoted. This is the default for CSV. |
Always | All fields are quoted. |
This value is used when writing files, but it also affects reading: knowing the conditions under which fields may be quoted can speed up the processing of the text.
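The difference between the AsNeeded and AsNeededForColumnType strategies can be sketched as follows. Both functions are hypothetical helpers written for illustration. The per-type rule shown here (quote everything except a few numeric types) is a simplifying assumption and may not match the library's actual decision.

```csharp
using System;

// AsNeeded: inspect the text of each individual field for special characters.
static bool FieldNeedsQuotes(string text, char delimiter, char quote, char escape) =>
    text.IndexOfAny(new[] { delimiter, quote, escape, '\r', '\n' }) >= 0;

// AsNeededForColumnType: decide once per column, from its element type alone.
// Assumption: only these numeric types are treated as never needing quotes.
static bool ColumnTypeNeedsQuotes(Type elementType) =>
    !(elementType == typeof(int) || elementType == typeof(long)
      || elementType == typeof(double) || elementType == typeof(decimal));

Console.WriteLine(FieldNeedsQuotes("New York", ',', '"', '\\'));      // False
Console.WriteLine(FieldNeedsQuotes("Portland, OR", ',', '"', '\\'));  // True
Console.WriteLine(ColumnTypeNeedsQuotes(typeof(double)));             // False
Console.WriteLine(ColumnTypeNeedsQuotes(typeof(string)));             // True
```

The per-field check is more precise but has to scan every value. The per-type check makes one decision per column, at the cost of quoting some fields that would not strictly need it.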
Sometimes a quoted field will contain the quote character itself. There are two ways to deal with this situation. Either the quote character is doubled (the default), or an escape character is used.
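To make the two escape methods concrete, here is a minimal sketch of a single-line field splitter that supports both. It is an illustration in plain .NET, not the library's parser. `SplitLine` and its parameters are hypothetical names, and quoted fields that span several lines are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Split one line into fields. When escape is null, a doubled quote inside a
// quoted field stands for one quote; otherwise the character that follows the
// escape character is taken literally.
static List<string> SplitLine(string line, char delimiter = ',', char quote = '"', char? escape = null)
{
    var fields = new List<string>();
    var current = new StringBuilder();
    bool inQuotes = false;

    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        if (inQuotes)
        {
            if (escape.HasValue && c == escape.Value && i + 1 < line.Length)
            {
                current.Append(line[++i]);      // escaped character, taken literally
            }
            else if (c == quote)
            {
                if (!escape.HasValue && i + 1 < line.Length && line[i + 1] == quote)
                {
                    current.Append(quote);      // doubled quote becomes one literal quote
                    i++;
                }
                else
                {
                    inQuotes = false;           // closing quote
                }
            }
            else
            {
                current.Append(c);
            }
        }
        else if (c == quote)
        {
            inQuotes = true;                    // opening quote
        }
        else if (c == delimiter)
        {
            fields.Add(current.ToString());     // end of the current field
            current.Clear();
        }
        else
        {
            current.Append(c);
        }
    }
    fields.Add(current.ToString());
    return fields;
}

var doubled = SplitLine("1,\"He said \"\"hi\"\", then left\"");
var escaped = SplitLine("1,\"He said \\\"hi\\\", then left\"", escape: '\\');
Console.WriteLine(doubled[1]);   // He said "hi", then left
Console.WriteLine(escaped[1]);   // He said "hi", then left
```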
Several predefined options objects have been defined. The Csv field of the DelimitedTextOptions class defines the options for standard CSV files, where fields are delimited by commas and non-numeric columns quoted by double quotes. The CsvWithoutHeader field is similar but omits column headers. Similarly, the Tsv and TsvWithoutHeader fields define the options for standard tab-delimited (TSV) files, with or without headers.
The CsvForCulture(CultureInfo, Boolean) method returns an options object tailored to a specific culture. If the culture uses a comma as the decimal point or as the thousands separator, then a semicolon is used as the column delimiter. The method takes two arguments. The first is a CultureInfo object. The second is optional: a boolean value that indicates whether the data includes column headers. The default is true.
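The delimiter rule can be sketched with the standard NumberFormatInfo class. The sketch follows the sentence above literally, and the library's actual logic may differ. `ChooseDelimiter` is a hypothetical helper. The number formats are built by hand so the example does not depend on the culture data installed on the machine; in practice one would pass something like `CultureInfo.GetCultureInfo("de-DE").NumberFormat`.

```csharp
using System;
using System.Globalization;

// Pick a column delimiter that cannot be confused with number formatting.
// ChooseDelimiter is a hypothetical helper mirroring the rule described above.
static char ChooseDelimiter(NumberFormatInfo format) =>
    format.NumberDecimalSeparator == "," || format.NumberGroupSeparator == ","
        ? ';'
        : ',';

// German-style numbers: decimal comma, period as the thousands separator.
var germanStyle = new NumberFormatInfo { NumberDecimalSeparator = ",", NumberGroupSeparator = "." };
// A format that does not use a comma at all.
var noComma = new NumberFormatInfo { NumberDecimalSeparator = ".", NumberGroupSeparator = "'" };

Console.WriteLine(ChooseDelimiter(germanStyle));  // ;
Console.WriteLine(ChooseDelimiter(noComma));      // ,
```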
Reading delimited text files
The DelimitedTextFile class contains static methods for reading data frames, vectors, and matrices from a file in delimited text format.
The ReadDataFrame method reads a data frame from a file. The method takes two arguments. The first argument specifies the source of the data. This may be a string containing the path to the file, or a Stream that has been opened for reading. If a filename is given, it may be the path to a local file, or the URI of a resource on the Internet. The second argument is a DelimitedTextOptions object. It is optional. If it is omitted or null, standard CSV format is assumed.
Data frames read in this way always have a column index of strings (the column names) and a row index of row numbers (64-bit signed integers). Any row index stored in the file is essentially lost. To keep the stored index information, pass the types of the row and column keys as generic type arguments to the ReadDataFrame method. The stored indexes are then converted to the requested types as needed.
The example below reads a data frame from a CSV file. Its row index is of type DateTime. It then reads a second data frame from a fictitious URL:
var df1 = DelimitedTextFile.ReadDataFrame<DateTime, string>(@"c:\data.csv");
var df2 = DelimitedTextFile.ReadDataFrame(
"http://www.example.com/sample.tsv", DelimitedTextOptions.Tsv);
Similar methods exist for reading vectors and matrices. The ReadVector method reads a vector from the file. It takes one type argument that is required: the element type of the vector to read. The first actual argument is once again the path to the file or Internet resource, or a stream. The second argument is either a DelimitedTextOptions object, or an integer array that contains the positions of the column breaks.
The ReadMatrix method reads a matrix from the file. It has the same arguments and overloads as ReadVector. The element type must be supplied as a generic type argument. The actual arguments are the path to the file or resource, or the stream to read from, and optionally whether the element type should match exactly.
var vector1 = DelimitedTextFile.ReadVector<double>(@"c:\vector.csv");
var culture = CultureInfo.GetCultureInfo("de-DE");
var options = DelimitedTextOptions.CsvForCulture(culture);
var matrix1 = DelimitedTextFile.ReadMatrix<double>(
"http://www.example.com/german.csv", options);
The ReadComplexVector and ReadComplexMatrix methods read a complex vector and matrix from the file, respectively. These methods are identical to their real counterparts, except that the number of columns in the file must be twice the number of columns in the final object. This is because the real and imaginary parts of the complex values are stored in separate columns. So, a file storing a complex vector should have two columns, while a file storing a complex matrix with 5 columns should have 10 columns total.
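The column-count rule can be illustrated by combining the cells of one row into complex values. This is a sketch in plain .NET, not the library's reader. It assumes that each complex column occupies two adjacent file columns, real part first. The text above only states that the parts are stored in separate columns, so the actual layout may differ. `ParseComplexRow` is a hypothetical helper.

```csharp
using System;
using System.Globalization;
using System.Numerics;

// Combine the numeric cells of one row into complex values.
// Assumption: columns come in adjacent (real, imaginary) pairs.
static Complex[] ParseComplexRow(string line, char delimiter = ',')
{
    string[] cells = line.Split(delimiter);
    if (cells.Length % 2 != 0)
        throw new FormatException("Expected an even number of columns.");

    var result = new Complex[cells.Length / 2];
    for (int j = 0; j < result.Length; j++)
    {
        double re = double.Parse(cells[2 * j], CultureInfo.InvariantCulture);
        double im = double.Parse(cells[2 * j + 1], CultureInfo.InvariantCulture);
        result[j] = new Complex(re, im);
    }
    return result;
}

// One row of a complex matrix with 2 columns is stored as 4 file columns.
Complex[] row = ParseComplexRow("1.5,-2,0,3.25");
Console.WriteLine(row.Length);                        // 2
Console.WriteLine(row[1] == new Complex(0, 3.25));    // True
```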
Writing delimited text files
The Write method is used to write one or more data frames, vectors, or matrices to a file. The method has many overloads.
The first argument always specifies the destination in one of two ways. It can be a string that contains the path to the file. If the file exists, it is overwritten. If it doesn't exist, then it is created. Alternatively, the destination can be specified as a Stream.
The second argument always specifies the object(s) to be written. This can be a single data frame, matrix, or vector. It can also be a sequence of data frames, matrices, or vectors, or a dictionary that maps names to objects.
The third argument is a DelimitedTextOptions object that specifies how the data should be written. This argument is optional. If omitted, standard CSV format is used.
In the example code below, we write a data frame to a file, and then a matrix to a stream.
DelimitedTextFile.Write(@"c:\data.csv", df1);
// File.Create truncates an existing file; File.OpenWrite would leave stale trailing bytes.
using (var stream = File.Create(@"c:\output.csv"))
{
    DelimitedTextFile.Write(stream, matrix1);
}
Using delimited text data streams
Delimited data streams are implemented by the DelimitedTextStream class. This class has no constructors. Instead, use one of the methods of the DelimitedTextFile class. Streams can be opened for reading only.
Opening files for reading
The Open(String, DelimitedTextOptions) method opens a file or stream for reading. This method has two overloads that take two arguments. The first is a string or a stream. If it is a string, it is the path to the file that should be opened, or the URI of a network or Internet resource. If it is a stream, then it specifies the data stream that the objects should be read from. The second argument specifies the options used to read the data in the file, and is of type DelimitedTextOptions. This argument is optional. If it is omitted or null, standard CSV format is assumed.
The methods for reading objects from streams are similar to those of the DelimitedTextFile class, but with fewer arguments.
Reading from streams
The ReadDataFrame method reads a data frame from the stream.
Data frames read in this way always have a column index of strings (the column names) and a row index of row numbers (64-bit signed integers). Any row index stored in the file is essentially lost. To keep the stored index information, pass the types of the row and column keys as generic type arguments to the ReadDataFrame method. The stored indexes are then converted to the requested types as needed.
The example below reads a data frame from a CSV file at a fictitious URL. Its row index is of type DateTime.
using (var s1 = DelimitedTextFile.Open("http://www.example.com/sample.csv"))
{
var df1 = s1.ReadDataFrame<DateTime, string>();
}
Similar methods exist for reading vectors and matrices. The ReadVector<T> method reads a vector from the file. It takes one type argument that is required: the element type of the vector to read. This method takes one argument which is optional: a boolean value that specifies whether the element type of the stored vector should match the specified element type exactly. The default is false, which means that the read operation will succeed as long as the stored element type can be cast to the requested element type.
The ReadMatrix<T> method reads a matrix from the file. It has the same arguments and overloads as ReadVector<T>. The element type must be supplied as a generic type argument. The one actual argument is optional. It specifies whether the element type should match exactly.
using (var s2 = DelimitedTextFile.Open(@"c:\vector.csv"))
{
var vector1 = s2.ReadVector<double>();
}
Opening streams for writing
There are two methods that can be used to create a delimited text stream for writing. The Create(String, Boolean, Boolean) method opens a file for writing. The first argument is a string that is the path to the file that should be opened. If the file exists, its contents are destroyed. If the file does not exist, it is created. The optional second argument is a boolean value that specifies whether the data should be compressed. The default is true. The optional third argument is also a boolean value that specifies whether the data should be written out in human-readable ASCII format. The default is false.
The Append(Stream, Boolean, Boolean) method opens a stream using an existing writable stream. The first argument is the stream to write the objects to. The second and third arguments are optional. They are boolean values that specify whether the data should be compressed, and whether the data should be written in ASCII format.
Writing objects
The Write method is used to write a vector or matrix to a file in delimited text format. The method has many overloads.
The first argument always specifies the object(s) to be written. This can be a vector or a matrix. Both real and complex vectors and matrices are supported. The following code creates a new CSV file, and writes a matrix to it:
using (var stream = DelimitedTextFile.Create(@"c:\data.csv"))
{
stream.Write(matrix1);
}