[csv] shorten and move quote rules to first example #5286

Open: wants to merge 1 commit into base: master
112 changes: 40 additions & 72 deletions csv.md
@@ -1,94 +1,62 @@
---
language: CSV
name: CSV
contributors:
- [Timon Erhart, 'https://github.com/turbotimon/']
- [Timon Erhart, 'https://github.com/turbotimon/']
---

CSV (Comma-Separated Values) is a lightweight file format used to store tabular
data in plain text, designed for easy data exchange between programs,
particularly spreadsheets and databases. Its simplicity and human readability
have made it a cornerstone of data interoperability. It is often used for
moving data between programs with incompatible or proprietary formats.

While RFC 4180 provides a standard for the format, in practice, the term "CSV"
is often used more broadly to refer to any text file that:

- Can be interpreted as tabular data
- Uses a delimiter to separate fields (columns)
- Uses line breaks to separate records (rows)
- Optionally includes a header in the first row
CSV (Comma-Separated Values) is a file format used to store tabular
data in plain text.

```csv
Name, Age, DateOfBirth
Alice, 30, 1993-05-14
Bob, 25, 1998-11-02
Charlie, 35, 1988-03-21
Name,Age,DateOfBirth,Comment
Alice,30,1993-05-14,
Bob,25,1998-11-02,
Eve,,,data might be missing because it's just text
"Charlie Brown",35,1988-03-21,strings can be quoted
"Louis XIV, King of France",76,1638-09-05,strings containing commas must be quoted
"Walter ""The Danger"" White",52,1958-09-07,quotes are escaped by doubling them up
Joe Smith,33,1990-06-02,"multi line strings
span multiple lines
there are no escape characters"
```

## Delimiters for Rows and Columns
The first row may be a header of field names, or there may be no header, in
which case the first line is already data.
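
The quoting rules and the optional header can be tried out with Python's
standard `csv` module. This is a minimal sketch, not part of the original
page: the sample text is a shortened copy of the rows above, and treating
row 0 as the header is a decision made by the code, not by the file.

```python
import csv
import io

# Shortened copy of the example rows (assumption: row 0 is a header).
text = (
    'Name,Age,DateOfBirth,Comment\n'
    'Eve,,,data might be missing\n'
    '"Louis XIV, King of France",76,1638-09-05,commas need quotes\n'
    '"Walter ""The Danger"" White",52,1958-09-07,doubled quotes\n'
    'Joe Smith,33,1990-06-02,"multi line strings\nspan multiple lines"\n'
)

# csv.reader undoes the quoting; every field comes back as a string.
rows = list(csv.reader(io.StringIO(text)))
header, records = rows[0], rows[1:]

for record in records:
    # Pair each value with its field name from the header row.
    print(dict(zip(header, record)))
```

Missing values arrive as empty strings, and the quoted line break stays
inside a single field.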

Rows are typically separated by line breaks (`\n` or `\r\n`), while columns
(fields) are separated by a specific delimiter. Although commas are the most
common delimiter for fields, other characters, such as semicolons (`;`), are
commonly used in regions where commas are decimal separators (e.g., Germany).
Tabs (`\t`) are also used as delimiters in some cases, with such files often
referred to as "TSV" (Tab-Separated Values).
## Delimiters

Example using semicolons as delimiter and comma for decimal separator:
Rows are separated by line breaks (`\n` or `\r\n`); columns are separated by commas.

```csv
Name; Age; Grade
Alice; 30; 50,50
Bob; 25; 45,75
Charlie; 35; 60,00
```
Tabs (`\t`) are sometimes used instead of commas; such files are called "TSV"
(Tab-Separated Values) files. They are easier to paste into Excel.

## Data Types

CSV files do not inherently define data types. Numbers and dates are stored as
plain text, and their interpretation depends on the software importing the
file. Typically, data is interpreted as follows:
Other characters are occasionally used. For example, semicolons (`;`) are common
in parts of Europe, where the comma serves as the
[decimal separator](https://en.wikipedia.org/wiki/Decimal_separator) instead of the point.

```csv
Data, Comment
100, Interpreted as a number (integer)
100.00, Interpreted as a number (floating-point)
2024-12-03, Interpreted as a date or a string (depending on the parser)
Hello World, Interpreted as text (string)
"1234", Interpreted as text instead of a number
Name;Age;Grade
Alice;30;50,50
Bob;25;45,75
Charlie;35;60,00
```
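
A semicolon-delimited file like the one above can be read by telling the
parser which delimiter to expect. The following is a sketch using Python's
`csv` module; the `grades` dictionary and the comma-to-point replacement are
illustrative assumptions about how a reader might handle the decimal comma.

```python
import csv
import io

text = "Name;Age;Grade\nAlice;30;50,50\nBob;25;45,75\nCharlie;35;60,00\n"

# delimiter=";" switches the field separator; the first row names the fields.
reader = csv.DictReader(io.StringIO(text), delimiter=";")

grades = {}
for row in reader:
    # Swap the decimal comma for a point before converting to float.
    grades[row["Name"]] = float(row["Grade"].replace(",", "."))

print(grades)
```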

## Quoting Strings and Special Characters
## Data Types

Quoting strings is only required if the string contains the delimiter, special
characters, or otherwise could be interpreted as a number. However, it is
often considered good practice to quote all strings to enhance readability and
robustness.
CSV files do not inherently define data types. Numbers and dates are stored as
text. Interpreting and parsing them is left to the software reading the file.
Typically, data is interpreted as follows:

```csv
Quoting strings examples,
Unquoted string,
"Optionally quoted string (good practice)",
"If it contains the delimiter, it needs to be quoted",
"Also, if it contains special characters like \n newlines or \t tabs",
"The quoting "" character itself typically is escaped by doubling the quote ("")",
"or in some systems with a backslash \" (like other escapes)",
Data,Comment
100,Interpreted as a number (integer)
100.00,Interpreted as a number (floating-point)
2024-12-03,Interpreted as a date or a string (depending on the parser)
Hello World,Interpreted as text (string)
"1234",Interpreted as text instead of a number
```
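
Because everything is text, type interpretation happens in the reading
program. Here is a rough sketch of one possible strategy in Python;
`guess_type` is a hypothetical helper invented for this illustration, not a
standard function.

```python
import csv
import io
from datetime import date


def guess_type(raw):
    """Hypothetical helper: try int, then float, then ISO date, else keep str."""
    for convert in (int, float, date.fromisoformat):
        try:
            return convert(raw)
        except ValueError:
            continue
    return raw


text = 'Data\n100\n100.00\n2024-12-03\nHello World\n"1234"\n'

# Skip the header row, then convert the single field of each record.
rows = list(csv.reader(io.StringIO(text)))[1:]
values = [guess_type(row[0]) for row in rows]
print(values)
```

Note that Python's `csv.reader` removes the quotes before returning a field,
so in this sketch the quoted `"1234"` is indistinguishable from an unquoted
one and is converted to an integer. Whether quotes force a text
interpretation depends on the parser.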

However, make sure that for one document, the quoting method is consistent.
For example, the last two examples of quoting with either "" or \" would
not be consistent and could cause problems.

## Encoding

Different encodings are used. Most modern CSV files use UTF-8 encoding, but
older systems might use others like ASCII or ISO-8859.

If the file is transferred or shared between different systems, it is a good
practice to explicitly define the encoding used, to avoid issues with
character misinterpretation.

## More Resources
## Further reading

+ [Wikipedia](https://en.wikipedia.org/wiki/Comma-separated_values)
+ [RFC 4180](https://datatracker.ietf.org/doc/html/rfc4180)
* [Wikipedia](https://en.wikipedia.org/wiki/Comma-separated_values)
* [RFC 4180](https://datatracker.ietf.org/doc/html/rfc4180)