Text Processing Tools

Overview

Linux provides a set of text processing tools, including sed, awk, diff, patch, comm, and join. They filter, transform, and compare text directly from the command line, which makes them well suited to pipelines and scripts.

sed - Stream Editor

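sed (Stream Editor) reads its input line by line, applies editing commands to each line, and writes the result to standard output. The input file itself is left unchanged unless the -i option is used.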

Basic Syntax

bash
sed [options] 'command' file
sed [options] -e 'command1' -e 'command2' file
sed [options] -f script_file file

Common Options

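Option      Description
-n          Silent mode: suppress automatic printing of each line
-e          Add a command to the script (may be repeated)
-f          Read commands from a script file
-i          Edit the file in place
-i.bak      Edit in place, keeping the original as a backup with the .bak suffix
-r / -E     Use extended regular expressions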

Substitute Command s

bash
# Basic substitution (first match per line)
$ sed 's/old/new/' file.txt

# Global substitution
$ sed 's/old/new/g' file.txt

# Case-insensitive (the i/I flag is a GNU sed extension)
$ sed 's/old/new/gi' file.txt

# Substitute only the 2nd match on each line
$ sed 's/old/new/2' file.txt

# Show substituted lines
$ sed -n 's/old/new/p' file.txt

# Edit file in-place
$ sed -i 's/old/new/g' file.txt

# Backup and edit
$ sed -i.bak 's/old/new/g' file.txt
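
To see how the flags differ, the following sketch runs three of the substitutions above on a single sample line. The sample text and the variable names (`line`, `first`, `all`, `second`) are made up for illustration.

```bash
# Hypothetical sample line containing "old" several times
line='old old OLD old'

first=$(printf '%s\n' "$line" | sed 's/old/new/')    # first match only
all=$(printf '%s\n' "$line" | sed 's/old/new/g')     # every match
second=$(printf '%s\n' "$line" | sed 's/old/new/2')  # second match only

printf '%s\n' "$first" "$all" "$second"
# new old OLD old
# new new OLD new
# old new OLD old
```

The uppercase OLD is left alone in all three cases because matching is case-sensitive by default.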

Addresses and Ranges

bash
# Specify line numbers
$ sed '3s/old/new/' file.txt       # Line 3
$ sed '1,5s/old/new/' file.txt     # Lines 1-5
$ sed '3,$s/old/new/' file.txt     # Line 3 to end

# Lines matching pattern
$ sed '/pattern/s/old/new/' file.txt

# Range pattern
$ sed '/start/,/end/s/old/new/' file.txt
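
A pattern range starts at the first line matching the first pattern and ends at the next line matching the second one, both inclusive. The sketch below applies a range substitution to a small, made-up block of settings; the `[start]` and `[end]` markers and the variable names are assumptions for illustration.

```bash
# Hypothetical settings; [start] and [end] delimit the range
config=$(printf '%s\n' 'debug=old' '[start]' 'mode=old' 'level=old' '[end]' 'trace=old')

# Substitute only from the line matching /start/ through the line matching /end/
ranged=$(printf '%s\n' "$config" | sed '/start/,/end/s/old/new/')

printf '%s\n' "$ranged"
# debug=old
# [start]
# mode=new
# level=new
# [end]
# trace=old
```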

Delete Command d

bash
# Delete specified line
$ sed '3d' file.txt                # Delete line 3
$ sed '1,5d' file.txt              # Delete lines 1-5
$ sed '$d' file.txt                # Delete last line

# Delete matching lines
$ sed '/pattern/d' file.txt

# Delete empty lines
$ sed '/^$/d' file.txt

# Delete comment lines
$ sed '/^#/d' file.txt

Print Command p

bash
# Print specified lines
$ sed -n '3p' file.txt             # Print line 3
$ sed -n '1,5p' file.txt           # Print lines 1-5
$ sed -n '$p' file.txt             # Print last line

# Print matching lines
$ sed -n '/pattern/p' file.txt

# Print line numbers
$ sed -n '=' file.txt

Insert and Append

bash
# Insert before specified line
$ sed '3i\New line content' file.txt

# Append after specified line
$ sed '3a\New line content' file.txt

# Insert/append before/after matching lines
$ sed '/pattern/i\Insert content' file.txt
$ sed '/pattern/a\Append content' file.txt
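
The one-line `i\text` and `a\text` forms used above are accepted by GNU sed; other implementations may require the text on a separate line. As a minimal sketch under that GNU assumption, with made-up sample lines:

```bash
# Assumes GNU sed, which accepts the text on the same line as i\ and a\
body=$(printf '%s\n' 'alpha' 'beta' 'gamma')

# Insert a line before, and append a line after, every line matching /beta/
edited=$(printf '%s\n' "$body" |
  sed -e '/beta/i\before beta' -e '/beta/a\after beta')

printf '%s\n' "$edited"
```

With GNU sed the matching line ends up surrounded by the two new lines, while `alpha` and `gamma` pass through unchanged.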

Replace Entire Line c

bash
# Replace specified line
$ sed '3c\New content' file.txt

# Replace matching lines
$ sed '/pattern/c\New content' file.txt

Multiple Commands

bash
# Separate with semicolons
$ sed 's/a/A/g; s/b/B/g' file.txt

# Use -e option
$ sed -e 's/a/A/g' -e 's/b/B/g' file.txt

# Use braces for grouping
$ sed '/pattern/{s/old/new/; s/foo/bar/}' file.txt
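
When the command list grows, it can be easier to keep it in a script file and load it with -f. The following sketch assumes hypothetical file names (`cleanup.sed`, `app.conf`) inside a temporary directory.

```bash
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

# One sed command per line; lines starting with # are comments
cat > "$tmpdir/cleanup.sed" <<'EOF'
# Drop comment lines, drop empty lines, strip trailing spaces
/^#/d
/^$/d
s/ *$//
EOF

# Hypothetical input file
printf '%s\n' '# header' '' 'name = demo   ' 'port = 8080' > "$tmpdir/app.conf"

cleaned=$(sed -f "$tmpdir/cleanup.sed" "$tmpdir/app.conf")
printf '%s\n' "$cleaned"
# name = demo
# port = 8080
```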

Advanced Techniques

bash
# Use different delimiters
$ sed 's|/usr/local|/opt|g' file.txt
$ sed 's#http://#https://#g' file.txt

# Reference matched content
$ sed 's/\(.*\)/【\1】/' file.txt        # \1 is the first captured group; wraps each line in brackets
$ sed 's/[0-9][0-9]*/(&)/' file.txt    # & is the whole match; parenthesizes the first number

# Case conversion (\u, \U, and \L are GNU sed extensions)
$ sed 's/\b[a-z]/\u&/g' file.txt       # Capitalize the first letter of each word
$ sed 's/.*/\U&/' file.txt             # All uppercase
$ sed 's/.*/\L&/' file.txt             # All lowercase
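
Numbered groups become more useful when there is more than one of them. The sketch below, using made-up input strings, swaps two captured groups with -E and wraps a number using `&`.

```bash
# -E enables extended regular expressions, so groups need no backslashes
swapped=$(printf '%s\n' 'Doe, Jane' | sed -E 's/^([^,]+), *(.+)$/\2 \1/')

# & stands for the whole matched text; [0-9][0-9]* requires at least one digit
marked=$(printf '%s\n' 'order 42 shipped' | sed 's/[0-9][0-9]*/(&)/')

printf '%s\n' "$swapped" "$marked"
# Jane Doe
# order (42) shipped
```

Requiring at least one digit matters: a pattern that can match the empty string, such as `[0-9]*`, matches at the very start of the line and would insert empty parentheses there.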

Practical Examples

bash
# Delete HTML tags
$ sed 's/<[^>]*>//g' file.html

# Delete leading whitespace (\t inside brackets is a GNU sed extension)
$ sed 's/^[ \t]*//' file.txt

# Delete trailing whitespace
$ sed 's/[ \t]*$//' file.txt

# Add line numbers
$ sed = file.txt | sed 'N;s/\n/\t/'

# Add blank line after each line
$ sed 'G' file.txt

# Merge consecutive blank lines into one
$ sed '/^$/N;/\n$/D' file.txt
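
Where portability matters, the POSIX character class `[[:space:]]` can stand in for `[ \t]`. A minimal sketch that trims both ends of a made-up input line:

```bash
# printf expands \t to a tab, so the input has mixed leading and trailing whitespace
trimmed=$(printf '  \tpadded value \t \n' |
  sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')

# Brackets make the remaining whitespace (if any) visible
printf '[%s]\n' "$trimmed"
# [padded value]
```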

awk - Pattern Processing Language

awk is a pattern-driven text processing language. It splits each input line into fields, which makes it especially suitable for structured, column-oriented data.

Basic Syntax

bash
awk 'pattern { action }' file
awk -F separator 'pattern { action }' file

Built-in Variables

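Variable       Description
$0             Entire current record (line)
$1, $2, ...    Nth field of the current record
NF             Number of fields in the current record
NR             Number of records read so far, across all input files
FNR            Record number within the current file
FS             Input field separator
OFS            Output field separator
RS             Input record separator
ORS            Output record separator
FILENAME       Name of the current input file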

Basic Operations

bash
# Print all lines
$ awk '{print}' file.txt
$ awk '{print $0}' file.txt

# Print specified fields
$ awk '{print $1}' file.txt
$ awk '{print $1, $3}' file.txt

# Specify separator
$ awk -F ':' '{print $1}' /etc/passwd
$ awk -F ',' '{print $1, $2}' file.csv

# Print line numbers
$ awk '{print NR, $0}' file.txt
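
The operations above combine naturally. As a small sketch on made-up passwd-style records, the following prints the record number, the first field, and the last field (`$NF`, the field whose index is the field count):

```bash
# Hypothetical passwd-style records, colon-separated
users=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'alice:x:1000:1000:Alice:/home/alice:/bin/zsh' |
  awk -F ':' '{print NR, $1, $NF}')

printf '%s\n' "$users"
# 1 root /bin/bash
# 2 alice /bin/zsh
```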

Pattern Matching

bash
# Match regular expressions
$ awk '/pattern/' file.txt
$ awk '/pattern/ {print $1}' file.txt

# Conditional match
$ awk '$1 > 100' file.txt
$ awk '$1 == "value"' file.txt
$ awk 'NR > 5' file.txt

# Range match
$ awk '/start/,/end/' file.txt

# Field match
$ awk '$1 ~ /pattern/' file.txt
$ awk '$1 !~ /pattern/' file.txt

BEGIN and END

bash
# Execute before processing
$ awk 'BEGIN {print "Start processing"} {print}' file.txt

# Execute after processing
$ awk '{print} END {print "Processing complete"}' file.txt

# Set variables
$ awk 'BEGIN {FS=":"; OFS="\t"} {print $1, $3}' /etc/passwd

# Count lines
$ awk 'END {print NR}' file.txt
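
BEGIN, a main rule, and END can be combined into a small report. The sketch below uses made-up `name:x:uid` records and a hypothetical counter `n`: BEGIN sets the separators and prints a header, the main rule selects rows, and END prints a total.

```bash
report=$(printf '%s\n' 'root:x:0' 'alice:x:1000' 'bob:x:1001' |
  awk 'BEGIN {FS = ":"; OFS = "\t"; print "user", "uid"}   # runs once, before any input
       $3 >= 1000 {print $1, $3; n++}                      # runs for each matching record
       END {print "regular users:", n + 0}')               # runs once, after all input

printf '%s\n' "$report"
```

Writing `n + 0` forces a numeric 0 when no record matched, since an unset awk variable would otherwise print as an empty string.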

Arithmetic Operations

bash
# Basic operations
$ awk '{print $1 + $2}' file.txt
$ awk '{sum = $1 + $2; print sum}' file.txt

# Sum
$ awk '{sum += $1} END {print sum}' file.txt

# Average
$ awk '{sum += $1} END {print sum/NR}' file.txt

# Maximum (use < instead of > for the minimum); seeding from the first record also handles negative values
$ awk 'NR==1 || $1>max {max=$1} END {print max}' file.txt
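
Sum, average, minimum, and maximum can be gathered in a single pass. This sketch seeds `min` and `max` from the first record rather than from 0, so it also behaves sensibly when every value is negative; the input numbers are made up.

```bash
stats=$(printf '%s\n' -5 12 3 -20 |
  awk 'NR == 1 {min = max = $1}                 # seed from the first value
       {sum += $1
        if ($1 < min) min = $1
        if ($1 > max) max = $1}
       END {if (NR > 0)                         # avoid dividing by zero on empty input
              printf "sum=%d avg=%.2f min=%d max=%d\n", sum, sum / NR, min, max}')

printf '%s\n' "$stats"
# sum=-10 avg=-2.50 min=-20 max=12
```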

String Functions

bash
# Length
$ awk '{print length($1)}' file.txt

# Substring
$ awk '{print substr($1, 1, 3)}' file.txt

# Split
$ awk '{split($1, arr, "-"); print arr[1]}' file.txt

# Substitute
$ awk '{gsub(/old/, "new"); print}' file.txt

# Case conversion
$ awk '{print toupper($1)}' file.txt
$ awk '{print tolower($1)}' file.txt

# Find
$ awk '{if (index($0, "pattern") > 0) print}' file.txt

Control Structures

bash
# if-else
$ awk '{if ($1 > 100) print "Large"; else print "Small"}' file.txt

# for loop
$ awk '{for (i=1; i<=NF; i++) print $i}' file.txt

# while loop
$ awk '{i=1; while (i<=NF) {print $i; i++}}' file.txt

# Arrays
$ awk '{count[$1]++} END {for (k in count) print k, count[k]}' file.txt
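
awk arrays are associative, and `for (k in count)` visits keys in an unspecified order, so the output is usually piped through sort when a stable order matters. A minimal sketch on made-up request lines:

```bash
counts=$(printf '%s\n' 'GET /a' 'POST /b' 'GET /c' 'GET /a' |
  awk '{count[$1]++}                              # tally by first field
       END {for (k in count) print k, count[k]}' |
  sort)                                           # make the order deterministic

printf '%s\n' "$counts"
# GET 3
# POST 1
```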

Formatted Output

bash
# printf
$ awk '{printf "%-10s %5d\n", $1, $2}' file.txt

# Format specifiers
# %s     String
# %d     Integer
# %f     Floating point
# %-10s  Left-align within a width of 10
# %5d    Right-align within a width of 5 (the default alignment)
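
The effect of width and alignment is easiest to see with visible column borders. In this sketch the `|` characters are added only to mark where each field ends, and the input rows are made up.

```bash
table=$(printf '%s\n' 'apple 3' 'kiwi 12' |
  awk '{printf "%-10s|%5d|\n", $1, $2}')   # name left-aligned in 10, count right-aligned in 5

printf '%s\n' "$table"
# apple     |    3|
# kiwi      |   12|
```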

Practical Examples

bash
# Count word frequency
$ awk '{for(i=1;i<=NF;i++) count[$i]++} END {for(w in count) print count[w], w}' file.txt | sort -rn

# Calculate total file size
$ ls -l | awk '{sum += $5} END {print sum}'

# Process simple CSV (no quoted fields containing commas)
$ awk -F ',' '{print $1 "\t" $2}' file.csv

# Extract IPs from log
$ awk '{print $1}' access.log | sort | uniq -c | sort -rn

# Conditional statistics
$ awk '$3 > 1000 {count++} END {print count}' file.txt

# Join every 3 lines into one tab-separated line
$ awk 'ORS=NR%3?"\t":"\n"' file.txt
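
The word-frequency one-liner treats "The" and "the" as different words. A possible variant, sketched here on made-up text, lowercases each word with tolower() before counting and breaks ties alphabetically so the output is deterministic.

```bash
freq=$(printf '%s\n' 'the cat and the hat' 'The end' |
  awk '{for (i = 1; i <= NF; i++) count[tolower($i)]++}
       END {for (w in count) print count[w], w}' |
  LC_ALL=C sort -k1,1nr -k2,2)   # by count descending, then by word

printf '%s\n' "$freq"
# 3 the
# 1 and
# 1 cat
# 1 end
# 1 hat
```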

diff and patch

diff - Compare Files

bash
# Basic comparison
$ diff file1.txt file2.txt

# Unified format
$ diff -u file1.txt file2.txt

# Side-by-side
$ diff -y file1.txt file2.txt

# Ignore whitespace
$ diff -w file1.txt file2.txt

# Recursive directory comparison
$ diff -r dir1/ dir2/

# Generate patch
$ diff -u old.txt new.txt > changes.patch

patch - Apply Patch

bash
# Apply patch
$ patch < changes.patch

# Specify file
$ patch file.txt < changes.patch

# Reverse (undo)
$ patch -R < changes.patch

# Dry run
$ patch --dry-run < changes.patch
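
The diff and patch commands are typically used together. The following sketch walks through a full round trip in a temporary directory: generate a patch, check it with a dry run, apply it, then reverse it. All file names are hypothetical, and `--dry-run` assumes GNU patch.

```bash
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

printf '%s\n' 'alpha' 'beta' 'gamma' > "$tmpdir/old.txt"
printf '%s\n' 'alpha' 'BETA' 'gamma' 'delta' > "$tmpdir/new.txt"

# diff exits with status 1 when the files differ; only a status above 1 is an error
diff -u "$tmpdir/old.txt" "$tmpdir/new.txt" > "$tmpdir/changes.patch" || [ "$?" -eq 1 ]

cp "$tmpdir/old.txt" "$tmpdir/work.txt"
patch --dry-run "$tmpdir/work.txt" < "$tmpdir/changes.patch"   # check only, no changes
patch "$tmpdir/work.txt" < "$tmpdir/changes.patch"             # apply
applied=no
cmp -s "$tmpdir/work.txt" "$tmpdir/new.txt" && applied=yes

patch -R "$tmpdir/work.txt" < "$tmpdir/changes.patch"          # undo
reverted=no
cmp -s "$tmpdir/work.txt" "$tmpdir/old.txt" && reverted=yes

printf 'applied=%s reverted=%s\n' "$applied" "$reverted"
```

The exit status of diff is worth remembering in scripts that run with `set -e`, since a successful comparison of differing files would otherwise abort the script.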

comm - Compare Sorted Files

bash
# Display three columns: lines only in file1, lines only in file2, lines in both
# (both inputs must already be sorted)
$ comm file1.txt file2.txt

# Show only lines common to both files
$ comm -12 file1.txt file2.txt

# Show only lines unique to file1
$ comm -23 file1.txt file2.txt
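
comm expects sorted input, so unsorted lists are usually passed through sort first. A minimal sketch with two made-up lists written to hypothetical files `a.txt` and `b.txt`:

```bash
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

# Hypothetical unsorted lists, sorted before comparison
printf '%s\n' cherry apple banana | sort > "$tmpdir/a.txt"
printf '%s\n' date banana cherry | sort > "$tmpdir/b.txt"

common=$(comm -12 "$tmpdir/a.txt" "$tmpdir/b.txt")   # suppress columns 1 and 2
only_a=$(comm -23 "$tmpdir/a.txt" "$tmpdir/b.txt")   # suppress columns 2 and 3

printf 'common:\n%s\nonly in a:\n%s\n' "$common" "$only_a"
```

Each digit in the option names a column to hide, which is why -12 leaves only the lines present in both files.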

join - Join Files

bash
# Join on the first field of each file (both inputs must be sorted on the join field)
$ join file1.txt file2.txt

# Specify join fields: field 2 of file1, field 1 of file2
$ join -1 2 -2 1 file1.txt file2.txt

# Specify separator
$ join -t ':' file1.txt file2.txt
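
join behaves like an inner join on a key column: only keys present in both files appear in the output. The sketch below uses two made-up, colon-separated files that are already sorted on the first field.

```bash
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

# Hypothetical id:name and id:role tables, sorted by id
printf '%s\n' '1:alice' '2:bob' '3:carol' > "$tmpdir/users.txt"
printf '%s\n' '1:admin' '3:dev' > "$tmpdir/roles.txt"

joined=$(join -t ':' "$tmpdir/users.txt" "$tmpdir/roles.txt")

printf '%s\n' "$joined"
# 1:alice:admin
# 3:carol:dev
```

The record for id 2 is dropped because it has no counterpart in the second file.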

Summary

This chapter introduced powerful Linux text processing tools:

  • sed: Stream editor, suitable for line-oriented substitution, deletion, and insertion
  • awk: Pattern processing language, suitable for structured, field-based data
  • diff/patch: File comparison and patching
  • comm/join: Line-by-line comparison and field-based joining of sorted files

Mastering sed and awk will greatly improve your text processing efficiency.

