4.12. How do I parse a comma-delimited (CSV) data file?
Comma-delimited data files can come in several forms, requiring
increasing levels of complexity in parsing and handling. They are
often referred to as CSV files (for "comma separated values") and
occasionally as SDF files (for "standard data format"). Note that
some vendors use "SDF" to refer to variable-length records with
comma-separated fields which are "double-quoted" if they contain
character values, while other vendors use "SDF" to designate
fixed-length records with fixed-length, nonquoted fields! (For help
with fixed-length fields, see question 4.23)
The term "CSV" became a de-facto standard when Microsoft Excel used
it as an optional output file format.
Here are 4 different forms you may encounter in comma-delimited data:
(a) No quotes, no internal commas
1001,John Smith,PO Box 123,Chicago,IL,60699
1002,Mary Jones,320 Main,Denver,CO,84100,
(b) Like (a), with quotes around each field
"1003","John Smith","PO Box 123","Chicago","IL","60699"
"1004","Mary Jones","320 Main","Denver","CO","84100"
(c) Like (b), with embedded commas
"1005","Tom Hall, Jr.","61 Ash Ct.","Niles","OH","44446"
"1006","Bob Davis","429 Pine, Apt. 5","Boston","MA","02128"
(d) Like (c), with embedded commas and quotes
"1007","Sue "Red" Smith","19 Main","Troy","MI","48055"
"1008","Joe "Hey, guy!" Hall","POB 44","Reno","NV","89504"
In each example above, we have 7 fields and 6 commas which function
as field separators. Case (c) is a very typical form of these data
files, with double quotes used to enclose each field and to protect
internal commas (such as "Tom Hall, Jr.") from interpretation as
field separators. However, many times the data may include both
embedded quotation marks as well as embedded commas, as seen by
case (d), above.
Case (d) is the closest to Microsoft CSV format. However, the
Microsoft CSV format allows embedded newlines within a
double-quoted field. If embedded newlines within fields are a
possibility for your data, you should consider using something
other than sed to work with the data file.
Before handling a comma-delimited data file, make sure that you
fully understand its format and check the integrity of the data.
Does each line contain the same number of fields? Should certain
fields be composed only of numbers or of two-letter state
abbreviations in all caps? Sed (or awk or perl) should be used to
validate the integrity of the data file before you attempt to alter
it or extract particular fields from the file.
After ensuring that each line has a valid number of fields, use sed
to locate and modify individual fields, using the \(...\) grouping
command where needed.
In case (a):
sed 's/^[^,]*,[^,]*,[^,]*,[^,]*,/.../'
^ ^ ^
| | |_ 3rd field
| |_______ 2nd field
|_____________ 1st field
# Unix script to delete the second field for case (a)
sed 's/^\([^,]*\),[^,]*,/\1,,/' file
# Unix script to change field 1 to 9999 for case (a)
sed 's/^[^,]*,/9999,/' file
In cases (b) and (c):
sed 's/^"[^"]*","[^"]*","[^"]*","[^"]*",/.../'
1st-- 2nd-- 3rd-- 4th--
# Unix script to delete the second field for case (c)
sed 's/^\("[^"]*"\),"[^"]*",/\1,"",/' file
# Unix script to change field 1 to 9999 for case (c)
sed 's/^"[^"]*",/"9999",/' file
In case (d):
One way to parse such files is to replace the 3-character field
separator "," with an unused character like the tab or vertical
bar. (Technically, the field separator is only the comma while the
fields are surrounded by "double quotes", but the net effect is
that fields are separated by quote-comma-quote, with quote
characters added to the beginning and end of each record.) Search
your datafile first to make sure that your character appears
nowhere in it!
sed -n '/|/p' file # search for any instance of '|'
# if it's not found, we can use the '|' to separate fields
Then replace the 3-character field separator and parse as before:
# sed script to delete the second field for case (d)
s/","/|/g; # global change of "," to bar
s/^\([^|]*\)|[^|]|/\1||/; # delete 2nd field
s/|/","/g; # global change of bar back to ","
#---end of script---
# sed script to change field 1 to 9999 for case (d)
# Remember to accommodate leading and trailing quote marks
s/","/|/g;
s/^[^|]*|/"9999|/;
s/|/","/g;
#---end of script---
Note that this technique works only if each and every field is
surrounded with double quotes, including empty fields.
The following solution is for more complex examples of (d), such
as: not all fields contain "double-quote" marks, or the presence of
embedded "double-quote" marks within fields, or extraneous
whitespace around field delimiters. (Thanks to Greg Ubben for this
script!)
# sed script to convert case (d) to bar-delimited records
s/^ *\(.*[^ ]\) *$/|\1|/;
s/" *, */"|/g;
: loop
s/| *\([^",|][^,|]*\) *, */|\1|/g;
s/| *, */|\1|/g;
t loop
s/ *|/|/g;
s/| */|/g;
s/^|\(.*\)|$/\1/;
#---end of script---
For example, it turns this (which is badly-formed but legal):
first,"",unquoted ,""this" is, quoted " ,, sub "quote" inside, f", lone " empty:
into this:
first|""|unquoted|""this" is, quoted "||sub "quote" inside|f"|lone " empty:
Note that the script preserves the "double-quote" marks, but
changes only the commas where they are used as field separators. I
have used the vertical bar "|" because it's easier to read, but you
may change this to another field separator if you wish.
If your CSV datafile is more complex, it would probably not be
worth the effort to write it in sed. For such a case, you should
use Perl with a dedicated CSV module (there are at least two recent
CSV parsers available from CPAN).