awk is a powerful tool for processing CSV files. Here’s a breakdown of how to use it for common CSV tasks:

Statements are terminated by semicolons, newlines or right braces.

Important Notes

  • For very complex CSV files with nested quotes, escaped characters, or other intricacies, using a dedicated CSV parsing tool is highly recommended. awk is best for simpler CSV structures.
  • The examples above assume a consistent CSV format. Inconsistent formatting can lead to incorrect results.
  • Always test your awk commands on a small sample of your CSV file first to ensure they are working as expected.

Setting the Field Separator

CSV files use commas (,) as delimiters. You tell awk to use a comma as the field separator with the -F option:

awk -F, '{print $1, $2}' your_file.csv

This will print the first and second fields of each line in your_file.csv.

Printing Specific Fields

  • $1, $2, $3, etc., represent the first, second, third, and so on, fields.
  • $0 represents the entire line.
awk -F, '{print $1}' your_file.csv       # Print the first field
awk -F, '{print $2, $4}' your_file.csv    # Print the second and fourth fields
awk -F, '{print $0}' your_file.csv       # Print the entire line

Printing with Custom Formatting

You can combine fields with strings and other formatting:

awk -F, '{print "Name:", $1, "Age:", $2}' your_file.csv

This will print "Name: [value of first field] Age: [value of second field]" for each line.

Filtering Records

You can filter lines based on conditions:

awk -F, '$3 > 30 {print $1, $2, $3}' your_file.csv  # Print lines where the third field is greater than 30
awk -F, '$1 == "John" {print $0}' your_file.csv     # Print lines where the first field is "John"
awk -F, '$2 ~ /Smith/ {print $0}' your_file.csv    # Print lines where the second field contains "Smith" (using a regular expression)

Calculating Totals and Averages

awk -F, '{sum += $3} END {print "Total:", sum, "Average:", sum/NR}' your_file.csv
  • This calculates the sum of the third field.
  • NR is the total number of records (lines).
  • The END block is executed after processing all lines.

Adding a Header

If your CSV doesn’t have a header, you can add one:

awk -F, 'NR==1 {print "Name,Age,Salary"} {print $1,$2,$3}' your_file.csv
  • NR==1 checks if it’s the first record.

Handling Quoted Fields

CSV files often have quoted fields. awk’s default field splitting might not handle this correctly. For more robust CSV parsing, consider using a dedicated CSV parser like csvkit or Python’s csv module. However, for simple cases, you can sometimes use FPAT (field pattern) for quoted fields if your gawk version supports it:

gawk -F, -v FPAT='"([^"]*)"|([^,]*)' '{print $1, $2}' your_file.csv

This FPAT setting attempts to handle quoted fields (content between double quotes) and unquoted fields. Be careful, and test this thoroughly, as complex CSV files might require a dedicated parser.

Example CSV (your_file.csv):

Name,Age,Salary
John,30,50000
Jane,25,60000
Peter,35,70000
Mary,28,55000

Example Usage and Output:

awk -F, '{print $1, $3}' your_file.csv

Output:

Name Salary
John 50000
Jane 60000
Peter 70000
Mary 55000