Mastering UNIX Text Processing: Tips and Tricks

 
Introduction to UNIX Text Processing
 
Bo Yoo and Karthik Jagadeesh
19 Oct 2016
 
Stanford UNIX resources
 
Host: cardinal.stanford.edu
To connect from Unix/Linux/Mac:
Open a terminal:
ssh user@myth.stanford.edu
ssh user@cardinal.stanford.edu
ssh user@corn.stanford.edu
To connect from Windows:
SecureCRT/SecureFX (software.stanford.edu)
PuTTY (http://goo.gl/s0itD)
 
Many useful text processing UNIX commands
 
awk bzcat cat column cut grep
head join sed sort tail tee tr
uniq wc zcat …
 
UNIX commands work together via text
streams.
 
Example usage and others available at
http://tldp.org/LDP/abs/html/textproc.html
http://en.wikipedia.org/wiki/Cat_%28Unix%29#Other
 
 
 
Huge suite of tools
 
Knowing UNIX commands eliminates
having to reinvent the wheel
 
For one of last year's homework questions,
which asked for a simple file sort, submissions
used:
35 lines of Python
19 lines of Perl
73 lines of Java
1 line of UNIX commands
 
 
Anatomy of a UNIX command
 
command [options] [FILE1] [FILE2]
Options can be combined: -n 1 -g -c = -n1 -gc
Output is directed to "standard output" (stdout).
If no input file is specified, input comes from "standard input" (stdin).
"-" also means stdin in a file list.
To view the usage:
command --help
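
As a rough illustration of the "-" convention, the sketch below pipes one line through stdin and places it between two files. The file names head.txt and tail.txt and their contents are made up for the example.

```shell
# Hypothetical sample files, created in a scratch directory.
tmpdir=$(mktemp -d)
printf 'first\n' > "$tmpdir/head.txt"
printf 'last\n' > "$tmpdir/tail.txt"

# cat reads head.txt, then stdin (the "-"), then tail.txt.
combined=$(printf 'middle\n' | cat "$tmpdir/head.txt" - "$tmpdir/tail.txt")
echo "$combined"

rm -rf "$tmpdir"
```

The piped line ends up between the contents of the two files, because "-" holds the position of stdin in the file list.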
 
The real power of UNIX commands comes
from combinations through piping ("|")
Pipes are used to pass the output of one
program (stdout) as the input (stdin) to another
Pipe character is <Shift>-\
grep "CS273a" grades.txt | sort -k 2,2gr | uniq

grep "CS273a" grades.txt : find all lines in the file that have "CS273a" in them somewhere
sort -k 2,2gr : sort those lines by second column, in numerical order, highest to lowest
uniq : remove duplicates and print to standard output
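
To try the pipeline end to end, here is a minimal sketch that builds a small grades.txt first. The course names and scores are invented for illustration, and LC_ALL=C is set only to make the sort order reproducible.

```shell
export LC_ALL=C
tmpdir=$(mktemp -d)

# Made-up data: course in column 1, score in column 2.
printf '%s\n' \
  'CS273a 82' \
  'CS229 91' \
  'CS273a 95' \
  'CS273a 82' \
  'CS273a 7.5e1' > "$tmpdir/grades.txt"

# Keep CS273a lines, sort by column 2 (general numeric, descending),
# then collapse adjacent identical lines.
result=$(grep "CS273a" "$tmpdir/grades.txt" | sort -k 2,2gr | uniq)
echo "$result"

rm -rf "$tmpdir"
```

The two identical "CS273a 82" lines become adjacent after sorting, so uniq keeps one of them; the -g key also orders 7.5e1 as 75.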
 
Output redirection (>, >>)

Instead of writing everything to standard
output, we can write (>) or append (>>) to a file

grep "CS273a" allClasses.txt > CS273aInfo.txt
 
cat addlInfo.txt >> CS273aInfo.txt
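
The difference between > and >> can be checked with a small sketch like the one below. The file contents are assumptions made up for the example; only the file names come from the slide.

```shell
tmpdir=$(mktemp -d)
printf 'CS273a Mon\nCS229 Tue\n' > "$tmpdir/allClasses.txt"
printf 'CS273a room TBD\n' > "$tmpdir/addlInfo.txt"

# > creates the file, or truncates it if it already exists.
grep "CS273a" "$tmpdir/allClasses.txt" > "$tmpdir/CS273aInfo.txt"
# >> appends to the end of the existing file.
cat "$tmpdir/addlInfo.txt" >> "$tmpdir/CS273aInfo.txt"

contents=$(cat "$tmpdir/CS273aInfo.txt")
line_count=$(wc -l < "$tmpdir/CS273aInfo.txt")
echo "$contents"

rm -rf "$tmpdir"
```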
 
 
UCSC KENT SOURCE UTILITIES
 
http://genomewiki.ucsc.edu/index.php/Kent_source_utilities
 
/afs/ir/class/cs273a/bin/
Many C programs in this directory manipulate
sequences or chromosome ranges
Run programs with no arguments to see help message
overlapSelect [OPTION]… selectFile inFile outFile
Many useful options to alter how overlaps are computed
Output (outFile) is all inFile elements that overlap any selectFile elements
[Diagram: selectFile, inFile, outFile]
 
Kent Source and MySQL

Linux + Mac Binaries
http://hgdownload.soe.ucsc.edu/admin/exe/
Using MySQL on browser
http://genome.ucsc.edu/goldenPath/help/mysql.html
 
Interacting with UCSC Genome
Browser MySQL Tables
 
 
Command line:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -Ne "<STMT>"

e.g.
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -Ne \
  "select count(*) from hg18.knownGene;"
 
+-------+
| 66803 |
+-------+
http://dev.mysql.com/doc/refman/5.1/en/tutorial.html
 
 
Other operations with bash/shell
 
http://www.catonmat.net/blog/set-operations-in-unix-shell/
Bash noclobber: prevent overwriting files
$ set -o noclobber
$ echo "Can we overwrite it again?" >file.txt
-bash: file.txt: cannot overwrite existing file
$ echo "Can we overwrite it again?" >| file.txt
Bash process substitution ("dual pipes"; tricky, be careful)
sort <(cat file1) <(cat file2)
 
SPECIFIC UNIX COMMANDS
 
 
 
man, whatis, apropos
 
man: UNIX program that invokes the manual written
for a particular program
man sort
Shows all info about the program sort
Hit <space> to scroll down, "q" to exit
whatis sort
Shows the one-line description of each manual page
named "sort"
apropos sort
Shows all programs that have "sort" in their names or
short descriptions
 
cat
 
Concatenates files and prints them to
standard output
cat [OPTION] [FILE]…

Example: concatenating a file with lines A, B, C, D and a file with lines 1, 2, 3 prints
A
B
C
D
1
2
3

Variants for compressed input files:
zcat (.gz files)
bzcat (.bz2 files)
 
head, tail
 
head: first ten lines
tail: last ten lines
-n option: number of lines
For tail, -n +K means line K to the end.
head -n 5 : first five lines
tail -n 73 : last 73 lines
tail -n +10 | head -n 5 : lines 10-14
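
A quick way to check these ranges is to use numbered lines as stand-in input. The sketch below assumes seq is available and uses it to generate the numbers 1 to 20, one per line.

```shell
# First five and last three lines of a 20-line stream.
first_five=$(seq 1 20 | head -n 5)
last_three=$(seq 1 20 | tail -n 3)

# tail -n +10 starts at line 10; head -n 5 then keeps lines 10-14.
middle=$(seq 1 20 | tail -n +10 | head -n 5)
echo "$middle"
```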
 
 
cut
 
Prints selected parts of lines from each file to
standard output
cut [OPTION]… [FILE]…
-d  Choose delimiter between columns (default TAB)
-f  Fields to print
-f1,7 : fields 1 and 7
-f1-4,7,11-13 : fields 1,2,3,4,7,11,12,13
 
cut example

file.txt (line 1 is tab-delimited; line 3 has a tab after CS but spaces before a):
CS       273       a
CS.273.a
CS       273  a

cut -f1,3 file.txt
         =
cat file.txt | cut -f1,3

CS       a
CS.273.a
CS

cut -d '.' -f1,3 file.txt

CS       273       a
CS.a
CS       273  a

In general, you should make sure your file columns
are all delimited with the same character(s) before
applying cut!
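
One possible way to normalize delimiters before cut is tr -s, sketched below. The sample file is made up and uses runs of spaces rather than tabs; this is one option among several, not the only approach.

```shell
tmpdir=$(mktemp -d)
# Columns separated by varying runs of spaces (no tabs).
printf 'CS   273   a\nCS 273 a\n' > "$tmpdir/file.txt"

# tr -s ' ' '\t' translates spaces to tabs and squeezes each run
# into a single tab, so cut's default delimiter sees three clean fields.
fields=$(tr -s ' ' '\t' < "$tmpdir/file.txt" | cut -f1,3)
echo "$fields"

rm -rf "$tmpdir"
```

Both lines now yield fields 1 and 3 separated by a single tab.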
 
wc
 
Print line, word, and character (byte) counts
for each file, and totals of each if more than
one file specified
wc [OPTION]… [FILE]…
-l  Print only line counts
 
sort
 
Sorts lines of a text file; fields are separated by whitespace by default
(use -t to choose another delimiter, such as a tab)
-k m,n  sorts by columns m to n (1-based)
-g  sorts by general numerical value (can handle
scientific format)
-r  sorts in descending order
sort -k1,1gr -k2,3
Sort on field 1 numerically (high to low because of r).
Break ties on field 2 alphabetically.
Break further ties on field 3 alphabetically.
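
The key specification above can be tried on a small tab-separated file. In this sketch the rows (score, last name, first name) are invented, -t is given a literal tab, and LC_ALL=C is set only so that the alphabetical order is reproducible.

```shell
export LC_ALL=C
tmpdir=$(mktemp -d)
tab=$(printf '\t')

# Made-up rows: score <TAB> last name <TAB> first name.
printf '90\tSmith\tBob\n100\tJones\tAnn\n90\tSmith\tAl\n90\tAdams\tZoe\n' \
  > "$tmpdir/scores.txt"

# Field 1 numerically, high to low; ties broken on fields 2-3 alphabetically.
sorted=$(sort -t "$tab" -k1,1gr -k2,3 "$tmpdir/scores.txt")
echo "$sorted"

rm -rf "$tmpdir"
```

Because the g and r modifiers are attached to the first key only, the tie-breaking key on fields 2-3 stays in ascending alphabetical order.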
 
 
uniq
 
Discard all but one of successive identical lines
from input and print to standard output
-d  Only print duplicate lines
-i  Ignore case in comparison
-u  Only print unique lines
uniq example

file.txt:
CS 273a
CS 273a
TA: Cory McLean
CS 273a

uniq file.txt

CS 273a
TA: Cory McLean
CS 273a

uniq -u file.txt

TA: Cory McLean
CS 273a

uniq -d file.txt

CS 273a

In general, you probably want to make sure your file
is sorted before applying uniq!
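
The effect of sorting first can be seen by counting output lines with and without sort. This sketch recreates the four-line file.txt from the example in a scratch directory.

```shell
tmpdir=$(mktemp -d)
printf 'CS 273a\nCS 273a\nTA: Cory McLean\nCS 273a\n' > "$tmpdir/file.txt"

# Unsorted: the last "CS 273a" is not adjacent to the first two,
# so uniq keeps it as a separate line.
unsorted_count=$(uniq "$tmpdir/file.txt" | wc -l)

# Sorted: identical lines become adjacent, so each appears once.
sorted_count=$(sort "$tmpdir/file.txt" | uniq | wc -l)

echo "unsorted: $unsorted_count, sorted: $sorted_count"
rm -rf "$tmpdir"
```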
 
grep
 
Search for lines that contain a word or match
a regular expression
grep [options] PATTERN [FILE…]
-i  Ignore case
-v  Output lines that do not match
-f <FILE>  Read patterns from a file (1 per line)
-E  Extended regex grep (= egrep)
grep example

grep -E '^CS[[:space:]]+273$' file

Search through "file" for lines that start with CS,
then have one or more spaces (or tabs),
and end with 273

file:
CS 273a
CS273
CS        273
cs  273
CS<TAB>273

Output:
CS        273
CS<TAB>273
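
The pattern can be checked against a small test file. The sketch below assumes the last input line is CS, a tab, then 273, and counts the matching lines with -c.

```shell
tmpdir=$(mktemp -d)
# Five candidate lines; the last one uses a tab between CS and 273.
printf 'CS 273a\nCS273\nCS        273\ncs  273\nCS\t273\n' > "$tmpdir/file"

# Single quotes keep the shell from touching $ and the brackets.
matches=$(grep -E '^CS[[:space:]]+273$' "$tmpdir/file")
count=$(grep -c -E '^CS[[:space:]]+273$' "$tmpdir/file")
echo "$matches"

rm -rf "$tmpdir"
```

Only the lines with at least one space or tab between an upper-case CS and a trailing 273 match; adding -i would also accept the "cs  273" line.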
tr
Translate or delete characters from standard
input to standard output
tr [OPTION]… SET1 [SET2]
-d  Delete chars in SET1, don't translate

file.txt:
This
is an
Example.

cat file.txt | tr '\n' ','

This,is an,Example.,
sed: stream editor
Most common use is a string replace.
sed -e "s/SEARCH/REPLACE/g"

file.txt:
This
is an
Example.

cat file.txt | sed -e "s/is/EEE/g"

ThEEE
EEE an
Example.
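
The trailing g matters: without it, sed replaces only the first match on each line. A small sketch with a made-up one-line input shows the difference.

```shell
# Without g: only the first "is" on the line (inside "This") is replaced.
first_only=$(printf 'This is an example.\n' | sed -e 's/is/EEE/')

# With g: every "is" on the line is replaced.
all=$(printf 'This is an example.\n' | sed -e 's/is/EEE/g')

echo "$first_only"
echo "$all"
```

Note that the pattern matches inside words as well, which is why "This" becomes "ThEEE" in both cases.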
 
join
 
Join lines of two files on a common field
join [OPTION]… FILE1 FILE2
-1  Specify which column of FILE1 to join on
-2  Specify which column of FILE2 to join on
Important: FILE1 and FILE2 must already be
sorted on their join fields!
join example

file1.txt (sorted on column 2):
Bejerano               CS273a
Villeneuve             DB210
Batzoglou              DB273a

file2.txt (sorted on column 1):
CS229             Machine Learning
CS273a           Comp Tour Hum Gen.
DB210            Devel. Biol.

join -1 2 -2 1 file1.txt file2.txt

CS273a  Bejerano    Comp Tour Hum Gen.
DB210   Villeneuve  Devel. Biol.
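
Since join expects sorted input, a common pattern is to sort each file on its own join field first. The sketch below recreates the two example files with single spaces as separators (an assumption about the original layout) and sets LC_ALL=C so that sort and join agree on the ordering.

```shell
export LC_ALL=C
tmpdir=$(mktemp -d)
printf 'Bejerano CS273a\nVilleneuve DB210\nBatzoglou DB273a\n' > "$tmpdir/file1.txt"
printf 'CS273a Comp Tour Hum Gen.\nCS229 Machine Learning\nDB210 Devel. Biol.\n' \
  > "$tmpdir/file2.txt"

# Sort file1 on column 2 and file2 on column 1 (their join fields).
sort -k2,2 "$tmpdir/file1.txt" > "$tmpdir/file1.sorted"
sort -k1,1 "$tmpdir/file2.txt" > "$tmpdir/file2.sorted"

# Join column 2 of file1 with column 1 of file2.
joined=$(join -1 2 -2 1 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")
echo "$joined"

rm -rf "$tmpdir"
```

Rows whose join field appears in only one file (CS229 and DB273a here) are dropped, and the join field is printed first on each output line.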
 
SHELL SCRIPTING
 
 
 
Common shells
 
Two common shells: bash and tcsh
Run ps to see which you are using.
Multiple UNIX commands can be combined into
a single shell script.

script.sh
#!/bin/bash
set -beEu -o pipefail
cat "$1" "$2" > tmp.txt
paste tmp.txt "$3" > "$4"
export A="Value"

script.csh
#!/bin/tcsh -e
cat $1 $2 > tmp.txt
paste tmp.txt $3 > $4
setenv A "Value"

-e (in "set -beEu" and in "#!/bin/tcsh -e") means die on error.

Scripts must first be set to be executable:
% chmod u+x script.sh script.csh

Command prompt
% ./script.sh file1.txt file2.txt file3.txt out.txt
% ./script.csh file1.txt file2.txt file3.txt out.txt

http://www.faqs.org/docs/bashman/bashref_toc.html
http://www.the4cs.com/~corin/acm/tutorial/unix/tcsh-help.html
for loop
# BASH for loop to print 1,2,3 on separate lines
for i in `seq 1 3`
do
      echo ${i}
done
# TCSH for loop to print 1,2,3 on separate lines
foreach i ( `seq 1 3` )
      echo ${i}
end
The backquote (`) is a special quote character, usually left of
"1" on the keyboard, that indicates we
should execute the command within
the quotes
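
In bash and other POSIX shells, $(command) is an equivalent and more easily nested way to write the same command substitution. The sketch below is a variation on the loop above; the variable name collected is made up for the example.

```shell
# $(seq 1 3) is replaced by the output of seq, just like `seq 1 3`.
collected=""
for i in $(seq 1 3)
do
    collected="$collected$i,"
done
echo "$collected"
```

The loop appends each number and a comma, so the final value is 1,2,3, with a trailing comma.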
 
SCRIPTING LANGUAGES
 
 
 
awk
 
A quick-and-easy shell scripting language
http://www.grymoire.com/Unix/Awk.html
Treats each line of a file as a record, and splits
fields by whitespace
Fields referenced as $1, $2, $3, … ($0 is entire
line)
 
Anatomy of an awk script.
 
awk 'BEGIN {…} {…} END {…}'

BEGIN {…} : before first line
{…} : once per line
END {…} : after last line
 
awk example
 
Output the lines where column 3 is less than
column 5 in a comma-delimited file.  Output a
summary line at the end.
 
awk -F',' 'BEGIN{ct=0;}
{ if ($3 < $5) { print $0; ct=ct+1; } }
END { print "TOTAL LINES: " ct; }'
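
To see the script run, it needs an input file. The sketch below feeds it a three-line comma-delimited file whose name (data.csv) and contents are invented for the example.

```shell
tmpdir=$(mktemp -d)
# Made-up rows; column 3 is smaller than column 5 in rows 1 and 3 only.
printf 'a,b,1,c,5\nd,e,9,f,2\ng,h,3,i,4\n' > "$tmpdir/data.csv"

summary=$(awk -F',' 'BEGIN { ct = 0 }
{ if ($3 < $5) { print $0; ct = ct + 1 } }
END { print "TOTAL LINES: " ct }' "$tmpdir/data.csv")
echo "$summary"

rm -rf "$tmpdir"
```

Fields that look like numbers are compared numerically, so the first and third rows are printed, followed by the summary line.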
Useful things from awk
 
Make sure fields are delimited with tabs (to be
used by cut, sort, join, etc.)
awk '{print $1 "\t" $2 "\t" $3}' whiteDelim.txt > tabDelim.txt

Good string processing using substr, index, length
functions
awk '{print substr($1, 1, 10)}' longNames.txt > shortNames.txt

substr(string to manipulate, start position, length):
substr("helloworld", 4, 3) = "low"

index("helloworld", "low") = 4

length("helloworld") = 10

index("helloworld", "notpresent") = 0
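
These string functions can be tried directly from the shell with a BEGIN-only awk program, which needs no input file. The last command is a variation on the tab-delimiting one-liner that sets OFS (awk's output field separator) instead of concatenating "\t" by hand; the sample input line is made up.

```shell
# awk string positions are 1-based; index returns 0 when nothing is found.
sub=$(awk 'BEGIN { print substr("helloworld", 4, 3) }')
idx=$(awk 'BEGIN { print index("helloworld", "low") }')
len=$(awk 'BEGIN { print length("helloworld") }')
missing=$(awk 'BEGIN { print index("helloworld", "notpresent") }')
echo "$sub $idx $len $missing"

# Re-delimit whitespace-separated fields with tabs via OFS.
tabbed=$(printf 'a   b  c\n' | awk 'BEGIN { OFS = "\t" } { print $1, $2, $3 }')
echo "$tabbed"
```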
 
Python
 
A scripting language with many useful constructs
Easier to read than Perl
http://wiki.python.org/moin/BeginnersGuide
http://docs.python.org/tutorial/index.html
 
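Call a python program from the command line:
python myProg.py
The interpreter sessions that follow were run under Python 2; a few results
(such as round() returning a float, or the method lists printed by dir())
differ in Python 3.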
 
Number types
 
Numbers:  int, float
>>> f = 4.7
>>> i = int(f)
>>> j = round(f)
>>> i
4
>>> j
5.0
>>> i*j
20.0
>>> 2**i
16
 
 
Strings
 
>>> dir("")
[…, 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith',
'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isdigit',
'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust',
'lower', 'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rstrip',
'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title',
'translate', 'upper', 'zfill']
>>> s = "hi how are you?"
>>> len(s)
15
>>> s[5:10]
'w are'
>>> s.find("how")
3
>>> s.find("CS273")
-1
>>> s.split(" ")
['hi', 'how', 'are', 'you?']
>>> s.startswith("hi")
True
>>> s.replace("hi", "hey buddy,")
'hey buddy, how are you?'
>>> "   extraBlanks     ".strip()
'extraBlanks'
 
Lists
 
A container that holds zero or more objects in
sequential order
>>> dir([])
[…, 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove',
'reverse', 'sort']
>>> myList = ["hi", "how", "are", "you?"]
>>> myList[0]
'hi'
>>> len(myList)
4
>>> for word in myList:
...     print(word[0:2])
...
hi
ho
ar
yo
>>> nums = [1,2,3,4]
>>> squares = [n*n for n in nums]
>>> squares
[1, 4, 9, 16]
 
Dictionaries
 
A container like a list, except the key can be
any hashable object, such as a string (instead of a non-negative integer)
>>> dir({})
[…, 'clear', 'copy', 'fromkeys', 'get', 'has_key', 'items',
'iteritems', 'iterkeys', 'itervalues', 'keys', 'pop',
'popitem', 'setdefault', 'update', 'values']
>>> fruits = {"apple": True, "banana": True}
>>> fruits["apple"]
True
>>> fruits.get("apple", "Not a fruit!")
True
>>> fruits.get("carrot", "Not a fruit!")
'Not a fruit!'
>>> fruits.items()
[('apple', True), ('banana', True)]
 
Reading from files
 
file.txt:
Hello, world!
This is a file-reading
     example.

>>> openFile = open("file.txt", "r")
>>> allLines = openFile.readlines()
>>> openFile.close()
>>> allLines
['Hello, world!\n', 'This is a file-reading\n', '\texample.\n']
 
Writing to files
 
>>> writer = open("file2.txt", "w")
>>> writer.write("Hello again.\n")
>>> name = "Cory"
>>> writer.write("My name is %s, what's yours?\n" % name)
>>> writer.close()

file2.txt:
Hello again.
My name is Cory, what's yours?
 
Creating functions
 
def compareParameters(param1, param2):
    if param1 < param2:
        return -1
    elif param1 > param2:
        return 1
    else:
        return 0
 
 
def factorial(n):
    if n < 0:
        return None
    elif n == 0:
        return 1
    else:
        retval = 1
        num = 1
        while num <= n:
            retval = retval*num
            num = num + 1
        return retval
 
 
Example program

Example.py
#!/usr/bin/env python
import sys    # Required to read arguments from command line

# Expect exactly two arguments: the input file and the output file
if len(sys.argv) != 3:
    print("Wrong number of arguments supplied to Example.py")
    sys.exit(1)

# Read every line of the input file into a list
inFile = open(sys.argv[1], "r")
allLines = inFile.readlines()
inFile.close()

# Copy the lines, unchanged, to the output file
outFile = open(sys.argv[2], "w")
for line in allLines:
    outFile.write(line)

outFile.close()

python Example.py file1 file2

sys.argv = ['Example.py', 'file1', 'file2']