Bash Pipelines

Connecting commands for powerful data processing

🔗 What are Pipelines?

Pipelines connect multiple commands together, passing the output of one command as input to the next. They enable powerful data processing by chaining simple commands to perform complex tasks efficiently.

#!/bin/bash
# Simple pipeline
ls -l | grep '\.txt$'

Output:

-rw-r--r-- 1 user group 1234 Jan 1 file.txt

Key Pipeline Concepts

| Pipe Operator: connect command outputs to inputs
    cmd1 | cmd2

🔄 Data Flow: output flows left to right
    cat file | sort

⛓️ Chaining: connect multiple commands
    cmd1 | cmd2 | cmd3

🎯 Filtering: process and filter data streams
    ps aux | grep bash

🔹 Basic Pipeline Usage

The pipe operator (|) connects commands by feeding the standard output of the left command to the standard input of the right command. This enables data transformation workflows without temporary files. For example, ls -l | grep '\.txt$' | wc -l lists files, keeps the entries whose names end in .txt, then counts them. Pipelines embody the Unix philosophy of composing small, specialized tools into complex operations.

#!/bin/bash
# basic-pipe.sh

# List files and count them
ls | wc -l

# Show processes and search for specific one
ps aux | grep bash

# Display file content and search (case-insensitive, so "Error" also matches)
cat file.txt | grep -i "error"

Output:

15
user 1234 0.0 0.1 bash
Error: Connection failed
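
To make the "standard output" part concrete: a pipe carries only stdout, and stderr bypasses it unless it is redirected first. The following is a minimal sketch; emit_both is a hypothetical helper written for this illustration, not part of the examples above.

```shell
#!/bin/bash
# emit_both is a hypothetical helper: one line to stdout, one to stderr.
emit_both() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Only stdout enters the pipe. stderr is discarded here (2>/dev/null)
# just to keep the terminal quiet; it would never reach wc either way.
stdout_only=$(emit_both 2>/dev/null | wc -l)

# 2>&1 merges stderr into stdout before the pipe, so wc sees both lines.
merged=$(emit_both 2>&1 | wc -l)

echo "stdout only: $stdout_only line(s), merged: $merged line(s)"
```

The first count is 1 and the second is 2, which is why error messages from an early stage often appear on screen even though later stages never see them.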

🔹 Chaining Multiple Commands

A pipeline can chain any number of commands, each processing the output of the previous one, which enables multi-stage data transformations. The stages run concurrently and stream data between them, so this approach avoids intermediate files and usually yields cleaner, more maintainable scripts. Each command specializes in one transformation, following the single responsibility principle, and the stages combine into a complete processing workflow.

#!/bin/bash
# chain-commands.sh

# List, filter, sort, and count (the dot is escaped so it matches a literal ".")
ls -l | grep '\.txt$' | sort | wc -l

# Process log file: extract, sort, unique, count
cat access.log | grep "ERROR" | sort | uniq -c | sort -nr

# Find large files
du -h | sort -hr | head -5

Output:

8
42 ERROR: Database connection
15 ERROR: Timeout
1.2G /home/user/videos

🔹 Common Pipeline Patterns

A handful of pipeline patterns cover many frequent tasks: sorting with removal of duplicates, counting occurrences, finding top results, and filtering data. The pattern sort | uniq -c | sort -nr counts unique items and ranks them by frequency; the first sort is required because uniq only collapses adjacent duplicate lines. Knowing these idioms makes it quick to build pipelines for log analysis, data cleaning, and system administration without reinventing solutions.

#!/bin/bash
# common-patterns.sh

# Sort and remove duplicates
cat names.txt | sort | uniq

# Count word frequency
cat document.txt | tr ' ' '\n' | sort | uniq -c | sort -nr

# Find top 10 largest files
find . -type f -exec du -h {} + | sort -hr | head -10

# Extract and count IP addresses
cat access.log | awk '{print $1}' | sort | uniq -c | sort -nr

Output:

Alice
Bob
45 the
32 and
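
The counting idiom can be tried without any input files. This sketch wraps it in a hypothetical helper named word_freq (an illustrative name, not from the original examples) and feeds it a fixed sentence.

```shell
#!/bin/bash
# word_freq is a hypothetical helper: reads text on stdin and prints
# "count word" lines, most frequent first.
word_freq() {
    tr -s '[:space:]' '\n' |   # one word per line, squeezing repeated whitespace
        sort |                 # bring identical words together for uniq
        uniq -c |              # prefix each distinct word with its count
        sort -nr               # highest count first
}

printf 'the cat and the dog and the bird\n' | word_freq | head -2
```

The two lines printed are the counts for "the" (3) and "and" (2). Dropping the first sort would leave the repeated words non-adjacent, and uniq -c would report each occurrence separately.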

🔹 Using Tee Command

The tee command reads from standard input and writes simultaneously to standard output and specified files, splitting the data stream. This preserves intermediate results while continuing the pipeline, ideal for debugging complex transformations or when you need to capture data at multiple processing stages. tee enables both real-time monitoring through continued output and permanent storage of intermediate results.

#!/bin/bash
# tee-demo.sh

# Save the full listing to files.txt, then keep only .txt entries on screen
ls -l | tee files.txt | grep '\.txt$'

# Save to multiple files
echo "Important data" | tee file1.txt file2.txt

# Append to file while displaying
date | tee -a log.txt

Output:

-rw-r--r-- 1 user group 1234 file.txt
Important data
Mon Jan 1 12:00:00 UTC 2024
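
As a sketch of capturing an intermediate stage, the example below saves the raw stream with tee while a later stage deduplicates it. The file names raw.txt and unique.txt and the sample data are assumptions made for illustration.

```shell
#!/bin/bash
# Work in a throwaway directory so the sketch leaves nothing behind.
workdir=$(mktemp -d)

# tee writes the unmodified stream to raw.txt and also passes it on to sort.
printf 'pear\napple\npear\n' |
    tee "$workdir/raw.txt" |
    sort -u > "$workdir/unique.txt"

raw_lines=$(wc -l < "$workdir/raw.txt")
unique_lines=$(wc -l < "$workdir/unique.txt")
echo "raw: $raw_lines line(s), unique: $unique_lines line(s)"

rm -rf "$workdir"
```

raw.txt keeps all three input lines while unique.txt holds two, so the saved copy shows exactly what the later stage received.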

🔹 Pipeline with Xargs

The xargs command builds and executes commands from standard input, converting pipeline output into command-line arguments. This bridges programs that expect command-line arguments with those that produce pipeline output. xargs can process items in parallel (-P flag) and handle argument limits efficiently, making it essential for bulk operations and parallel processing of large datasets.

#!/bin/bash
# xargs-demo.sh

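# Find and delete files (NUL-delimited so names with spaces stay intact)
find . -name "*.tmp" -print0 | xargs -0 rm -f

# Process multiple files (the glob is expanded by the shell, not parsed from ls)
printf '%s\0' *.txt | xargs -0 -I {} cp {} backup/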

# Parallel processing
cat urls.txt | xargs -P 4 -I {} curl -O {}

Output:

[Files deleted]
[Files copied to backup/]
[URLs downloaded in parallel]
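
File names containing spaces are a common stumbling block with xargs, because it splits its input on whitespace by default. The sketch below assumes a find and xargs that support -print0 and -0 (the GNU and BSD versions do) and runs in a temporary directory with made-up file names.

```shell
#!/bin/bash
workdir=$(mktemp -d)
touch "$workdir/plain.tmp" "$workdir/with space.tmp" "$workdir/keep.txt"

# -print0 and -0 separate names with NUL bytes, so "with space.tmp"
# reaches rm as one argument instead of being split into two.
find "$workdir" -name '*.tmp' -print0 | xargs -0 rm -f

remaining=$(ls "$workdir")
echo "remaining: $remaining"

rm -rf "$workdir"
```

Only keep.txt is left afterwards. With plain find | xargs rm, the name containing a space would be passed as two nonexistent paths and the file would survive.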

🔹 Pipeline Exit Status

By default, a pipeline's exit status is that of its last command. With set -o pipefail, the status becomes that of the rightmost command that exited non-zero, or zero if every command succeeded. This keeps failures in early pipeline stages from going unnoticed. Combined with set -e, it lets a script stop as soon as any stage of a processing chain fails, preventing silent errors.

#!/bin/bash
# pipeline-status.sh

# Default behavior - only last command status matters
false | true
echo "Exit status: $?"  # Prints 0: true was the last command

# With pipefail - any failure causes pipeline to fail
set -o pipefail
false | true
echo "Exit status: $?"  # Prints 1: false failed earlier in the pipeline

Output:

Exit status: 0
Exit status: 1
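
When it matters which stage failed, Bash also records every stage's status in the PIPESTATUS array. A minimal sketch follows; the variable name statuses is chosen for illustration.

```shell
#!/bin/bash
# PIPESTATUS is a Bash array holding the exit status of each stage
# of the most recent foreground pipeline.
false | true
statuses=("${PIPESTATUS[@]}")   # copy immediately; the next command overwrites it

echo "per-stage statuses: ${statuses[*]}"
```

This prints "per-stage statuses: 1 0". PIPESTATUS is specific to Bash, so scripts run by other shells cannot rely on it.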

🔹 Advanced Pipeline Example

Several pipeline techniques can be combined to solve real-world problems such as log analysis, data extraction, and reporting. For example, grep "ERROR" app.log | cut -d' ' -f3 | sort | uniq -c | sort -nr | head -10 finds the ten most frequent error types, assuming the error type is the third space-separated field of each line. Pipelines like this replace what would otherwise require a custom program.

#!/bin/bash
# advanced-pipeline.sh

# Analyze log file: find errors, count, sort, save top 10
# (assumes the error type is the fifth whitespace-separated field)
grep "ERROR" /var/log/app.log |
    awk '{print $5}' |
    sort |
    uniq -c |
    sort -nr |
    head -10 |
    tee error-summary.txt

echo "Analysis complete!"

Output:

156 DatabaseError
89 TimeoutError
45 ConnectionError
Analysis complete!
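
The same analysis can be tried without access to /var/log/app.log. This sketch uses a hypothetical helper, top_errors, and a small sample log whose layout (date, time, service, level, error type, message) is an assumption chosen so that the error type lands in field 5.

```shell
#!/bin/bash
# top_errors is a hypothetical helper. It assumes the error type is the
# fifth whitespace-separated field of each ERROR line.
top_errors() {
    local logfile=$1 limit=${2:-10}
    grep "ERROR" "$logfile" |
        awk '{print $5}' |
        sort |
        uniq -c |
        sort -nr |
        head -n "$limit"
}

# Sample log in the assumed layout.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2024-01-01 12:00:00 app ERROR TimeoutError upstream timed out
2024-01-01 12:00:01 app INFO Started worker
2024-01-01 12:00:02 app ERROR DatabaseError connection lost
2024-01-01 12:00:03 app ERROR DatabaseError connection lost
EOF

summary=$(top_errors "$sample" 1)
echo "$summary"
rm -f "$sample"
```

With this sample the single line reported is DatabaseError with a count of 2. For a real log, the field number passed to awk has to match that log's actual layout.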
