CSE 15L, Hacker edition
or, The Joy of Regex

This material was developed for a workshop on regex, efficient text editing, and various command-line tricks, hosted by Eve Security on April 29, 2016. It aims to demystify the command line and teach common power-user shortcuts. Each section explains a tool or concept and illustrates it with exercises.

TODO TODO TODO

1 Problems

  • Find all files with phone numbers.
  • Editor scripting: automatically fix indentation for a whole web project where HTML, CSS, and JavaScript are all mushed into the same file (use Emacs web-mode, write Lisp).
  • Standardize filenames in a directory.
  • dd a disk image onto a thumb drive.

From http://matt.might.net/articles/what-cs-majors-should-know/

  • Find the five folders in a given directory consuming the most space.
  • Report duplicate MP3s (by file contents, not file name) on a computer.
  • Take a list of names whose first and last names have been lower-cased, and properly recapitalize them.
  • Find all words in English that have x as their second letter, and n as their second-to-last.
  • Directly route your microphone input over the network to another computer's speaker.
  • Replace all spaces in a filename with underscore for a given directory.
  • Report the last ten errant accesses to the web server coming from a specific IP address.

2 Regular Expressions

How to use find-replace properly.

[Figure: xkcd_regex.png]

2.1 Math

A regular expression \(R\) is defined by the grammar

\begin{align*} R &\coloneqq \{ a\;|\; a \in\Sigma \} &\text{ chars in alphabet}\\ &\;|\; \varepsilon &\text{ \{""\} }\\ &\;|\; \emptyset &\text{ \{\} }\\ &\;|\; R \cup R &\text{ or }\\ &\;|\; R \circ R &\text{ concat}\\ &\;|\; R^* &\text{ Kleene's star}\\ \end{align*}

Some mathematical syntactic sugar:

\begin{align*} \{R_1,R_2,\ldots\} &\;\rightsquigarrow\; R_1\cup R_2\cup \ldots\\ R^+ &\;\rightsquigarrow\; R\circ R^*\\ R^? &\;\rightsquigarrow\; R\cup\varepsilon\\ \{R_1,R_2,\ldots\}^c &\;\rightsquigarrow\; \Sigma \setminus \{R_1,R_2,\ldots\} \\ R^n &\;\rightsquigarrow\; \underbrace{R\circ R \circ\ldots\circ R}_{n \text{ times}} \\ R^{\{n,m\}}&\;\rightsquigarrow\; R^n\cup R^{n+1} \cup \ldots\cup R^m \\ \end{align*}
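As a small worked example of the sugar (an illustration added here, not part of the workshop material), assume \(\Sigma\) contains the digits and the hyphen, and write \(D\) for \(\{0,1,\ldots,9\}\). A phone number of the assumed shape 555-123-4567 with an optional area code then desugars as follows:

\begin{align*} (D^3\circ \text{-})^? \circ D^3 \circ \text{-} \circ D^4 &\;\rightsquigarrow\; ((D^3\circ \text{-}) \cup \varepsilon) \circ D^3 \circ \text{-} \circ D^4\\ D^3 &\;\rightsquigarrow\; D\circ D\circ D\\ \end{align*}

In extended regex syntax the same expression would read ([0-9]{3}-)?[0-9]{3}-[0-9]{4}.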

2.2 Regex explained

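  • [abc] is equivalent to \(a\cup b \cup c\); plain abc is the concatenation \(a\circ b\circ c\)
  • * + ? mean the same as in the math above: zero or more, one or more, and zero or one repetitions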
  • | is 'OR'
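  • NOTE: in POSIX Basic regex (BRE), { } ( ) only get their special meaning when escaped as \{ \} \( \), and + ? | are ordinary characters (GNU tools accept \+ \? \| as extensions). In POSIX Extended regex (ERE) all of these are special without a backslash.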
  • . is any char (except newline)
  • {n,m} is \(n\) reps to \(m\) reps, leave \(m\) blank to leave the upper bound unspecified
  • ^ marks the beginning and $ marks the end of a line
  • \<char> escapes a character.
    • \t \n \r are tab, linefeed, and carriage return respectively
  • \(...\) captures a group (have to escape the parens in Emacs). Then you can refer to the group in the replace expr by \<N> (1-indexed)
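  • Boundary: \b matches at a word boundary and is zero-width (it consumes no characters), so \bcat\b matches cat as a whole word but not the cat inside concatenate. TODO TODO TODO
  • Classes: see https://www.emacswiki.org/emacs/RegularExpression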
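The syntax above can be tried directly at the prompt. The following is a minimal sketch, assuming a grep and sed that accept -E for extended syntax (GNU and BSD both do); the function names and sample strings are made up for illustration.

```shell
# Basic (BRE) vs Extended (ERE): the same "exactly two a's" pattern both ways.
two_as_bre() { grep '^a\{2\}$'; }
two_as_ere() { grep -E '^a{2}$'; }

# Bracket class, {n,m}, and anchors: a whole line shaped like MM-DD-YYYY.
is_date() { grep -E '^[0-9]{2}-[0-9]{2}-[0-9]{4}$'; }

# Capture groups and 1-indexed backreferences in the replacement:
# rewrite MM-DD-YYYY as DD-MM-YYYY.
swap_date() { sed -E 's/([0-9]{2})-([0-9]{2})-([0-9]{4})/\2-\1-\3/'; }

printf 'a\naa\naaa\n' | two_as_bre
printf '04-29-2016\nnot a date\n' | is_date
printf 'due 04-29-2016\n' | swap_date
```

The three pipelines print aa, 04-29-2016, and due 29-04-2016 respectively. Note that is_date only checks the shape of the line, not whether the month and day are valid.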

2.3 Exercises

  • Find all files with phone numbers
  • Find words with 20 letters or more
  • Find all words in English that have x as their second letter, and n as their second-to-last.
  • Replace all spaces in a filename with underscore for a given directory.
  • Reformat dates from MM-DD-YYYY to DD-MM-YYYY
  • Update all copyrights from Copyright (c) 2015 to Copyright (c) 2016 in a directory of files.
  • Verify that a password contains at least one upper case char, one lower case char, a digit, a symbol, and is \(>8\) chars long.
  • Find consecutive identical words, delete all but one, i.e. foo foo foo bar \(\rightsquigarrow\) foo bar
  • Delete trailing whitespace
  • Uglify a source file by compressing it into a single line, removing as much whitespace as you can.
  • Pad numbers in a seq of files, i.e. chap1.pdf chap2.pdf ... chap14.pdf chap15.pdf \(\rightsquigarrow\) chap01.pdf chap02.pdf ... chap14.pdf chap15.pdf
  • Replace all C99 comments with C89 comments

    M-x replace-regexp RET //\(.*\)$ RET /* \1 */
    
  • Extract contents enclosed in an html tag. Why can you not match/extract from multiple nested tags?
  • You just torrented an entire season of a TV show, and the files have fucked up names. Luckily, they're fucked up in a regular way (heh). Bulk rename and clean up the files.
  • Find all gmail usernames from a bunch of email addresses.
  • Implement a fast (linear time) regexp matcher.
  • What is PCRE? Figure out how to install the damn PCRE headers on a Mac.
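The Emacs replace-regexp solution to the C99 comment exercise carries over to sed almost unchanged. This is a sketch under the same assumptions as the Emacs version (one // comment per line, no // inside string literals such as URLs, and no */ inside the comment text); the function name is made up.

```shell
# Turn C99 line comments (// ...) into C89 block comments (/* ... */).
# sed uses BRE by default, so the capturing parens are escaped,
# just like in the Emacs command.
# "|" is the s-command delimiter so the slashes need no escaping.
c99_to_c89() {
  sed 's|//\(.*\)$|/* \1 */|'
}

printf 'int x; // counter\n' | c99_to_c89
```

The captured group keeps the space that followed the //, so the output has two spaces after the opening /*. Lines without a // pass through untouched.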

3 Shell commands and shell scripting

The shell is full of surprises and shortcuts designed to save you time. For a full listing, look up man builtin.

  • Redirection: standard in/out
  • Composing little programs (most are in GNU coreutils, see info coreutils; rev and more/less ship separately)
    • mv
    • cp
    • cat
    • rev
    • more/less
    • echo
    • cut
    • wc
    • uniq
    • seq
    • sort
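A short sketch of how these pieces compose, using only redirection and programs from the list (plus mktemp for a scratch file). The function names and sample inputs are invented for illustration.

```shell
# Last dot-separated field of each input line, e.g. a file extension.
# cut cannot count fields from the right, so reverse, take field 1, reverse back.
extension() { rev | cut -d. -f1 | rev; }

# Number of distinct lines on stdin. uniq only collapses adjacent duplicates,
# which is why sort comes first.
distinct_count() { sort | uniq | wc -l; }

# Redirection: > sends stdout to a file, < feeds a file to stdin.
tmp=$(mktemp)
seq 1 5 > "$tmp"
wc -l < "$tmp"
rm -f "$tmp"

printf 'notes.txt\narchive.tar.gz\n' | extension
printf 'b\na\nb\n' | distinct_count
```

Each function reads stdin and writes stdout, so they can be chained with | in any order that makes sense, which is the whole point of the "little programs" style.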

4 Tools

4.1 find

4.2 grep

4.3 awk

4.4 processes: ps, killall, kill

4.5 make

4.6 wget

  1. Download all lectures from http://people.orie.cornell.edu/shmoys/or630/#handouts

Command: wget -4 -nH --cut-dirs=3 -r --no-parent -l1 "${l}"

Problem: plain ol' wget doesn't work (redirect error).

Workaround: look at the HTML, extract all the links with Emacs or sed into a file named lecs (one URL per line), then run the command on each:

while read -r l; do
  wget -4 -nH --cut-dirs=3 -r --no-parent -l1 "${l}"
done < lecs
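The link-extraction step can be sketched with grep and sed. This assumes every link is written as a double-quoted href="..." attribute and that grep supports -o (GNU and BSD grep do); page.html is a hypothetical saved copy of the handouts page, and extract_links is a made-up name.

```shell
# Print the value of every double-quoted href="..." attribute on stdin,
# one per line. grep -o prints only the matching part of each line;
# sed then strips the leading href=" and the trailing quote.
extract_links() {
  grep -o 'href="[^"]*"' | sed 's/^href="//; s/"$//'
}

# Hypothetical usage: build the lecs file from a saved copy of the page.
# extract_links < page.html > lecs
```

Relative links come out as written in the HTML, so they would need the site's base URL prepended before wget can fetch them. This is also a regex over HTML, so it is only as reliable as the page's markup is regular.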

4.7 scp and rsync

5 "Advanced" plain text editing

  • Org-mode
  • Markdown
  • Pandoc
  • Jekyll, static site generators

6 References and links

Footnotes:

1: with examples drawn from personal experience.

Date: April 29, 2016

Author: Matt Chan

Created: 2016-04-16 Sat 20:36
