The sed (Stream Editor) FAQ - 4.12. How do I parse a comma-delimited (CSV) data file?

The sed FAQ
Prev	Home	Next

4.12. How do I parse a comma-delimited (CSV) data file?

Comma-delimited data files can come in several forms, requiring increasing levels of complexity in parsing and handling. They are often referred to as CSV files (for "comma separated values") and occasionally as SDF files (for "standard data format"). Note that some vendors use "SDF" to refer to variable-length records with comma-separated fields which are "double-quoted" if they contain character values, while other vendors use "SDF" to designate fixed-length records with fixed-length, nonquoted fields! (For help with fixed-length fields, see question 4.23)

The term "CSV" became a de-facto standard when Microsoft Excel used it as an optional output file format.

Here are 4 different forms you may encounter in comma-delimited data:

(a) No quotes, no internal commas

       1001,John Smith,PO Box 123,Chicago,IL,60699
       1002,Mary Jones,320 Main,Denver,CO,84100,

(b) Like (a), with quotes around each field

       "1003","John Smith","PO Box 123","Chicago","IL","60699"
       "1004","Mary Jones","320 Main","Denver","CO","84100"

       "1005","Tom Hall, Jr.","61 Ash Ct.","Niles","OH","44446"
       "1006","Bob Davis","429 Pine, Apt. 5","Boston","MA","02128"

(d) Like (c), with embedded commas and quotes

       "1007","Sue "Red" Smith","19 Main","Troy","MI","48055"
       "1008","Joe "Hey, guy!" Hall","POB 44","Reno","NV","89504"

In each example above, we have 7 fields and 6 commas which function as field separators. Case (c) is a very typical form of these data files, with double quotes used to enclose each field and to protect internal commas (such as "Tom Hall, Jr.") from interpretation as field separators. However, many times the data may include both embedded quotation marks as well as embedded commas, as seen by case (d), above.

Case (d) is the closest to Microsoft CSV format. However, the Microsoft CSV format allows embedded newlines within a double-quoted field. If embedded newlines within fields are a possibility for your data, you should consider using something other than sed to work with the data file.

Before handling a comma-delimited data file, make sure that you fully understand its format and check the integrity of the data. Does each line contain the same number of fields? Should certain fields be composed only of numbers or of two-letter state abbreviations in all caps? Sed (or awk or perl) should be used to validate the integrity of the data file before you attempt to alter it or extract particular fields from the file.

After ensuring that each line has a valid number of fields, use sed to locate and modify individual fields, using the \(...\) grouping command where needed.

In case (a):

     sed 's/^[^,]*,[^,]*,[^,]*,[^,]*,/.../'
             ^     ^     ^
             |     |     |_ 3rd field
             |     |_______ 2nd field
             |_____________ 1st field

     # Unix script to delete the second field for case (a)
     sed 's/^\([^,]*\),[^,]*,/\1,,/' file

     # Unix script to change field 1 to 9999 for case (a)
     sed 's/^[^,]*,/9999,/' file

In cases (b) and (c):

     sed 's/^"[^"]*","[^"]*","[^"]*","[^"]*",/.../'
              1st--   2nd--   3rd--   4th--

     # Unix script to delete the second field for case (c)
     sed 's/^\("[^"]*"\),"[^"]*",/\1,"",/' file

     # Unix script to change field 1 to 9999 for case (c)
     sed 's/^"[^"]*",/"9999",/' file

In case (d):

One way to parse such files is to replace the 3-character field separator "," with an unused character like the tab or vertical bar. (Technically, the field separator is only the comma while the fields are surrounded by "double quotes", but the net effect is that fields are separated by quote-comma-quote, with quote characters added to the beginning and end of each record.) Search your datafile first to make sure that your character appears nowhere in it!

     sed -n '/|/p' file        # search for any instance of '|'
     # if it's not found, we can use the '|' to separate fields

Then replace the 3-character field separator and parse as before:

     # sed script to delete the second field for case (d)
     s/","/|/g;                  # global change of "," to bar
     s/^\([^|]*\)|[^|]|/\1||/;   # delete 2nd field
     s/|/","/g;                  # global change of bar back to ","
     #---end of script---

     # sed script to change field 1 to 9999 for case (d)
     # Remember to accommodate leading and trailing quote marks
     s/","/|/g;
     s/^[^|]*|/"9999|/;
     s/|/","/g;
     #---end of script---

Note that this technique works only if each and every field is surrounded with double quotes, including empty fields.

The following solution is for more complex examples of (d), such as: not all fields contain "double-quote" marks, or the presence of embedded "double-quote" marks within fields, or extraneous whitespace around field delimiters. (Thanks to Greg Ubben for this script!)

     # sed script to convert case (d) to bar-delimited records
     s/^ *\(.*[^ ]\) *$/|\1|/;
     s/" *, */"|/g;
     : loop
     s/| *\([^",|][^,|]*\) *, */|\1|/g;
     s/| *, */|\1|/g;
     t loop
     s/  *|/|/g;
     s/|  */|/g;
     s/^|\(.*\)|$/\1/;
     #---end of script---

For example, it turns this (which is badly-formed but legal):

   first,"",unquoted ,""this" is, quoted " ,, sub "quote" inside, f", lone  " empty:

into this:

   first|""|unquoted|""this" is, quoted "||sub "quote" inside|f"|lone  "   empty:

Note that the script preserves the "double-quote" marks, but changes only the commas where they are used as field separators. I have used the vertical bar "|" because it's easier to read, but you may change this to another field separator if you wish.

If your CSV datafile is more complex, it would probably not be worth the effort to write it in sed. For such a case, you should use Perl with a dedicated CSV module (there are at least two recent CSV parsers available from CPAN).

The sed FAQ
Prev	Home	Next