Bash Pattern Scan (awk)
Process and analyze text data by columns
📊 What is awk?
awk is a text-processing tool that reads input line by line and splits each line into fields (columns). It is well suited to extracting specific fields, performing calculations, and formatting output from structured text data.
# Print first column from a file
awk '{print $1}' data.txt
# Print specific columns
awk '{print $1, $3}' file.txt
awk Basics
Print Columns
Extract specific fields
awk '{print $1}' file
Pattern Match
Process matching lines
awk '/pattern/ {print}' file
Calculations
Perform math operations
awk '{sum+=$1} END {print sum}' file
Custom Delimiter
Specify field separator
awk -F',' '{print $1}' file
🔹 Printing Columns
By default, AWK splits each input record (line) into fields separated by runs of whitespace, so columns can be addressed directly. The variable $1 refers to the first field, $2 to the second, and $0 holds the entire line. This model suits tabular data from logs, command output (like ps or df), and configuration files. For example, awk '{print $NF}' file.txt prints the last field of each line, because NF holds the number of fields on the current line.
# Print first column
awk '{print $1}' users.txt
# Print multiple columns
awk '{print $1, $3}' data.txt
# Print with custom text
awk '{print "Name:", $1, "Age:", $2}' people.txt
# Print entire line
awk '{print $0}' file.txt
Input (users.txt):
John 25 Engineer
Sarah 30 Designer
Mike 28 Developer
Output (awk '{print $1, $3}'):
John Engineer
Sarah Designer
Mike Developer
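As a small sketch of how whitespace splitting, NF, and $NF behave, the snippet below uses made-up sample data with uneven spacing and a tab. The file contents and the variable names are assumptions for illustration only.

```shell
# Hypothetical sample data: leading spaces, repeated spaces, a tab, and one longer line
users=$(mktemp)
printf '  John   25 Engineer\nSarah 30\tDesigner\nMike 28 Developer Remote\n' > "$users"

# NF is the number of fields on the current line; $NF is the last field
last_fields=$(awk '{print NF, $NF}' "$users")
echo "$last_fields"

rm -f "$users"
```

Leading whitespace and repeated separators do not create empty fields under the default separator, which is why the first two lines both report three fields.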
🔹 Pattern Matching
AWK can selectively process lines by specifying a pattern before the action block, integrating filtering and extraction. Patterns can be regular expressions, string matches, or even conditions based on field values. For example, awk '$3 > 100 {print $1, $3}' data.txt only processes lines where the third field exceeds 100. This reduces the need for preliminary grep or sed commands, leading to cleaner, more efficient one-liners and scripts that perform complex data selection in a single pass.
# Print lines containing "error"
awk '/error/ {print}' app.log
# Print first column of matching lines
awk '/Engineer/ {print $1}' users.txt
# Match and format output
awk '/^John/ {print "Found:", $1, $2}' data.txt
# Multiple patterns
awk '/error|warning/ {print $0}' system.log
Output:
John
Found: John 25
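Patterns do not have to be whole-line regular expressions. The sketch below, using invented sample rows, shows a numeric condition on a field and a regular expression applied to a single field with the ~ operator.

```shell
# Hypothetical data: product, region, units
data=$(mktemp)
printf 'widget east 120\ngadget west 80\ngizmo east 250\n' > "$data"

# Numeric condition as the pattern: only rows whose third field exceeds 100
big=$(awk '$3 > 100 {print $1, $3}' "$data")

# Regex matched against one field instead of the whole line
west=$(awk '$2 ~ /^west$/ {print $1}' "$data")

echo "$big"
echo "$west"
rm -f "$data"
```

Matching on a specific field avoids false positives when the same text appears in another column.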
🔹 Using Custom Delimiters
The -F flag sets AWK's input field separator, which lets it parse data that is not whitespace-delimited. Common delimiters include commas for simple CSVs (-F','), colons for files like /etc/passwd (-F':'), and multi-character separators, which AWK interprets as regular expressions. Note that -F',' splits on every comma, so it does not handle quoted CSV fields that contain commas. This flexibility matters in ETL (Extract, Transform, Load) work, where data arrives in varied formats from different sources, and splitting fields correctly is the basis for any later analysis and reporting.
# Use comma as delimiter (CSV files)
awk -F',' '{print $1, $2}' data.csv
# Use colon as delimiter (like /etc/passwd)
awk -F':' '{print $1, $3}' /etc/passwd
# Use tab as delimiter
awk -F'\t' '{print $1}' data.tsv
# Multiple character delimiter
awk -F'::' '{print $1}' file.txt
Input (data.csv):
John,25,Engineer
Sarah,30,Designer
Output (awk -F',' '{print $1, $3}'):
John Engineer
Sarah Designer
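The input separator has an output counterpart, OFS. The following sketch, on made-up data, converts comma-separated input to tab-separated output and also shows a separator written as a regular expression.

```shell
# Hypothetical CSV without quoted fields
csv=$(mktemp)
printf 'John,25,Engineer\nSarah,30,Designer\n' > "$csv"

# FS sets the input separator; OFS is what print places between fields.
# Assigning $1=$1 makes awk rebuild $0 with OFS.
tsv=$(awk 'BEGIN {FS=","; OFS="\t"} {$1=$1; print}' "$csv")

# A separator longer than one character is treated as a regular expression,
# so a bracket expression can split on either ":" or ";"
mixed=$(printf 'a:b;c\n' | awk -F'[:;]' '{print $2}')

echo "$tsv"
echo "$mixed"
rm -f "$csv"
```

Setting FS in a BEGIN block is equivalent to passing -F on the command line, and it keeps both separators in one place.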
🔹 Performing Calculations
AWK functions as a capable data analysis tool by supporting arithmetic operations and maintaining variables across lines. You can compute sums, averages, minimums, maximums, and custom aggregations. For instance, to calculate the total and average of the second column: awk '{sum+=$2} END {print "Total:", sum, "Average:", sum/NR}' file.txt. This capability eliminates the need to import data into external spreadsheet software for basic statistics, allowing quick insights directly from the command line or within shell scripts.
# Sum all values in first column
awk '{sum += $1} END {print "Total:", sum}' numbers.txt
# Calculate average (guarding against empty input)
awk '{sum += $1; count++} END {if (count > 0) print "Average:", sum/count}' data.txt
# Find maximum value (seeded from the first line, so negative numbers work too)
awk 'NR==1 || $1+0 > max {max = $1+0} END {print "Max:", max}' nums.txt
# Multiple calculations
awk '{sum+=$2; count++} END {print "Total:", sum, "Count:", count}' sales.txt
Input (numbers.txt):
10
20
30
40
Output:
Total: 100
Average: 25
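Several aggregates can be collected in one pass. This is a minimal sketch on sample numbers; the variable names v, min, and max are chosen for illustration, and min and max are seeded from the first record rather than from a fixed starting value.

```shell
# Hypothetical input: one number per line
nums=$(mktemp)
printf '10\n20\n30\n40\n' > "$nums"

stats=$(awk '
  { v = $1 + 0; sum += v }        # force numeric value and accumulate
  NR == 1 { min = max = v }       # seed min and max from the first record
  v < min { min = v }
  v > max { max = v }
  END {
    # NR still holds the total record count here; skip output for empty input
    if (NR > 0) print "Total:", sum, "Average:", sum / NR, "Min:", min, "Max:", max
  }
' "$nums")

echo "$stats"
rm -f "$nums"
```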
🔹 Conditional Processing
Conditional logic in AWK, using if, else if, and else, enables sophisticated data-driven processing flows. Conditions can compare fields, match regular expressions, or check line numbers. This allows for tasks like categorizing log entries by severity, filtering dataset rows based on multiple criteria, or applying different formatting rules to various data types. Such conditional processing makes AWK scripts powerful for cleaning, validating, and transforming raw data into structured, actionable information.
# Print if value is greater than 25
awk '{if($2 > 25) print $1, $2}' data.txt
# Print different messages based on value
awk '{if($2 >= 30) print $1, "Senior"; else print $1, "Junior"}' ages.txt
# Multiple conditions
awk '{if($2>20 && $2<30) print $0}' data.txt
# Count matching conditions (count+0 prints 0 instead of an empty string when nothing matches)
awk '{if($3=="Engineer") count++} END {print count+0}' users.txt
Output:
Sarah 30
Mike 28
Sarah Senior
Mike Junior
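An if / else if / else chain can sort lines into categories, such as log entries by severity. The log format below (severity as the first field) is an assumption made for this sketch.

```shell
# Hypothetical log: severity keyword in the first field
log=$(mktemp)
printf 'ERROR disk full\nWARN retry 3\nINFO started\nERROR timeout\n' > "$log"

summary=$(awk '
  {
    if ($1 == "ERROR")
      errors++
    else if ($1 == "WARN")
      warnings++
    else
      other++
  }
  # Adding 0 turns an unset counter into 0 rather than an empty string
  END { print "errors=" (errors + 0), "warnings=" (warnings + 0), "other=" (other + 0) }
' "$log")

echo "$summary"
rm -f "$log"
```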
🔹 BEGIN and END Blocks
The BEGIN block executes once before any input is read, ideal for printing report headers, initializing counters, or setting variables. The END block runs once after all input has been processed, perfect for displaying grand totals, summaries, or final calculations. These blocks provide a clear program structure, separating initialization, main processing, and cleanup phases. This structure enhances script readability, maintainability, and is a hallmark of well-organized AWK programs for production data pipelines.
# Print header before processing
awk 'BEGIN {print "Name\tAge"} {print $1, $2}' data.txt
# Print footer after processing
awk '{print $0} END {print "--- End of File ---"}' file.txt
# Initialize variables and print summary
awk 'BEGIN {sum=0} {sum+=$1} END {print "Total:", sum}' numbers.txt
# Both BEGIN and END
awk 'BEGIN {print "Report"} {count++} END {print "Lines:", count}' file.txt
Output:
Name	Age
John 25
Sarah 30
--- End of File ---
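The three phases can be combined into one small report. The sketch below assumes a two-column file of item names and quantities; the data and column labels are invented for illustration.

```shell
# Hypothetical input: item and quantity
sales=$(mktemp)
printf 'apples 3\npears 5\nplums 2\n' > "$sales"

report=$(awk '
  BEGIN { OFS = "\t"; print "Item", "Qty" }   # runs once, before any input
  { print $1, $2; total += $2 }               # runs for every line
  END { print "Total", total }                # runs once, after the last line
' "$sales")

echo "$report"
rm -f "$sales"
```

Setting OFS in BEGIN means the header, the rows, and the footer all use the same tab separator.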
🔹 Built-in Variables
AWK's built-in variables offer metadata and state information that adapts processing dynamically. NR (Number of Records) counts all lines processed so far. NF (Number of Fields) changes per line, useful for validating data consistency or accessing the last field ($NF). FILENAME provides the name of the current input file, enabling multi-file processing within a single script. Leveraging these variables allows for more robust scripts that handle edge cases and provide informative output, such as line numbers for error tracking.
Common Variables:
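- NR: Number of records (lines) read so far, counted across all input files
- FNR: Line number within the current file (restarts at 1 for each file)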
- NF: Number of fields in current line
- $NF: Last field in current line
- FILENAME: Current filename
- FS: Field separator (default: whitespace)
- OFS: Output field separator
# Print line numbers
awk '{print NR, $0}' file.txt
# Print number of fields per line
awk '{print "Line", NR, "has", NF, "fields"}' data.txt
# Print last field
awk '{print $NF}' file.txt
# Print filename and per-file line number (FNR restarts at 1 for each file; NR keeps counting)
awk '{print FILENAME, FNR, $0}' *.txt
Output:
1 John 25 Engineer
2 Sarah 30 Designer
Line 1 has 3 fields
Line 2 has 3 fields
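To illustrate multi-file processing and field-count validation, here is a sketch using two throwaway files. The file names and the expected count of three fields are assumptions.

```shell
# Two hypothetical input files in a temporary directory
dir=$(mktemp -d)
printf 'a b c\nd e\n' > "$dir/one.txt"
printf 'f g h\n' > "$dir/two.txt"

# NR keeps counting across files; FNR restarts at 1 for each file
counts=$(cd "$dir" && awk '{print FILENAME, NR, FNR}' one.txt two.txt)

# Report lines that do not have the expected 3 fields
bad=$(cd "$dir" && awk 'NF != 3 {print FILENAME ":" FNR ": expected 3 fields, got " NF}' one.txt two.txt)

echo "$counts"
echo "$bad"
rm -rf "$dir"
```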
🔹 Formatting Output
AWK's printf statement provides fine-grained control over output layout, surpassing the simple print. It uses format specifiers like %s for strings, %d for integers, and %f for floats, with width and precision controls. For example, awk '{printf "| %-15s | %10.2f |\n", $1, $2}' creates a neatly aligned table. This is essential for generating reports, logs, or any output where consistent column alignment and professional presentation are required for readability or further automated processing.
# Format with specific width
awk '{printf "%-10s %5d\n", $1, $2}' data.txt
# Format numbers with decimals
awk '{printf "%.2f\n", $1}' numbers.txt
# Create aligned columns
awk '{printf "%-15s %-10s %5d\n", $1, $2, $3}' users.txt
# Format with tabs
awk '{printf "%s\t%s\t%s\n", $1, $2, $3}' file.txt
Output:
John          25
Sarah         30
Mike          28
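A short sketch of width and precision in printf, with sample names and numbers made up for the purpose: %-8s left-aligns a string in 8 characters, and %6.1f right-aligns a number in 6 characters with one decimal place.

```shell
# Hypothetical two-column input piped straight into awk
table=$(printf 'John 25\nSarah 30.5\n' | awk '{printf "|%-8s|%6.1f|\n", $1, $2}')
echo "$table"
```

Unlike print, printf does not add a newline on its own, so the format string ends with \n.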
🔹 Practical awk Examples
AWK excels in real-world scenarios like log analysis, data summarization, and format conversion. System administrators use it to parse Apache logs for top IP addresses, developers use it to analyze code metrics, and data engineers use it for quick pre-processing of datasets. An example command: awk '{req[$7]++} END {for (page in req) print req[page], page}' access.log | sort -rn | head -10 shows the ten most frequently requested URLs. These practical applications demonstrate AWK's role as a versatile and efficient text-processing Swiss Army knife.
# Calculate total sales from CSV
awk -F',' '{sum+=$3} END {print "Total Sales: $"sum}' sales.csv
# Extract email addresses
awk -F',' '{print $2}' contacts.csv
# Process log file and count errors (count+0 prints 0 when there are none)
awk '/ERROR/ {count++} END {print "Errors:", count+0}' app.log
# Generate report with header and footer
awk 'BEGIN {print "User Report\n---"} {print $1, $2} END {print "---\nTotal:", NR}' users.txt
# Filter and format data
awk -F':' '$3 >= 1000 {printf "%-15s %5d\n", $1, $3}' /etc/passwd
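The most-requested-URLs command from this section can be tried on a small invented log. The sketch assumes the common Apache access log layout, where the request path is the seventh whitespace-separated field; the IP addresses, dates, and paths are sample values.

```shell
# Hypothetical access log in a common Apache-style layout
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.2 - - [10/Oct/2024:13:55:37 +0000] "GET /about.html HTTP/1.1" 200 128
10.0.0.1 - - [10/Oct/2024:13:55:38 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.3 - - [10/Oct/2024:13:55:39 +0000] "GET /index.html HTTP/1.1" 304 0
10.0.0.2 - - [10/Oct/2024:13:55:40 +0000] "GET /about.html HTTP/1.1" 200 128
10.0.0.4 - - [10/Oct/2024:13:55:41 +0000] "GET /missing HTTP/1.1" 404 0
EOF

# Count requests per path in an associative array, then rank the counts
top=$(awk '{req[$7]++} END {for (page in req) print req[page], page}' "$log" | sort -rn | head -2)

echo "$top"
rm -f "$log"
```

The for (page in req) loop visits keys in no particular order, which is why the ranking is left to sort.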