Regular Expressions

To see if a line has the string "abc" in it we can do this:

if (index($line, "abc") >= 0) {

Searching for a string or a more complex pattern is very useful, and Perl offers another, much more powerful (and, of course, more concise!) way to do it:

if ($line =~ /abc/) {

This should be read "if line matches /abc/". The "=~" is the binding operator (often just called the match operator): it takes a scalar on its left and a pattern on its right. It returns true if the pattern matches and false otherwise.
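As a small sketch of the two tests side by side (the sample strings and the helper names has_abc_index and has_abc_match are made up for this illustration), both approaches should agree on plain strings. The last line also shows "!~", the negated form of "=~".

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.
sub has_abc_index { my ($line) = @_; return index($line, "abc") >= 0 ? 1 : 0; }
sub has_abc_match { my ($line) = @_; return $line =~ /abc/ ? 1 : 0; }

for my $line ("xxabcxx", "a b c") {
    printf "%-8s index: %d  match: %d\n",
        $line, has_abc_index($line), has_abc_match($line);
}

# "!~" is the negated form: true when the pattern does NOT match.
print "no abc in 'hello'\n" if "hello" !~ /abc/;
```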

Between the two slashes comes the pattern, which can be very complex, with LOTS of special characters that have special meanings. Another term for these patterns is "regular expressions", although they are not really very 'regular'! That term comes from theoretical computer science. "Pattern" or "Rule" is somehow more appropriate, but you will often see "regular expression" or "regexp".

Developing the patterns can be a real challenge since they are a little language unto themselves. This bit of code is a useful way to test and develop a pattern:

while (my $line = <DATA>) {
    if ($line =~ /abc/) {
        print "matched: $line";
    }
}
__DATA__
hello
abracadabra
abcdefg

The lines after the __DATA__ token are a representative sample of the expected input. They are easily read with the <DATA> construct. You don't need to do an 'open' at all! With this in place you can keep fiddling with the pattern until you are satisfied it is correct for both the lines it matches AND the ones it doesn't match.

You can apply some syntactic sugar to make the above more concise, if you wish.

while (<DATA>) {
    if (/abc/) {
        print "matched: $_";
    }
}

If a pattern appears alone without the "=~" operator then it applies to the (you guessed it) default variable $_. A further simplification would yield:

while (<DATA>) {
    print "matched: $_" if /abc/;
}

So ... what is a pattern? There are 4 concepts in the construction of a pattern: Atoms, Anchors, Repetition, and Alternation.

Atoms

An atom matches a single character. Letters and digits (a-z, A-Z, 0-9) each match themselves literally.

/abc/
/9B3/
/hello/

A period '.' matches ANY character (except a newline). It is like a wildcard when playing poker.

/a.c/    # abc, axc, aEc, ...
/a..d/   # a12d, aqqd, aGGd, ...

If you want to match a real dot precede the '.' with a backslash.

/3\.14159/

A character class matches any one character from a set of characters. You enclose the set of characters in square brackets [ ] like so:

/[abc]/         # matches a, b, or c
/[abcABC]/
/[0-9]/         # ranges are okay
/[a-zA-Z]/      # any letter
/[a-z][0-9]/

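There are some abbreviations for common character classes. The sets shown are for plain ASCII text; \s also covers a few other white space characters such as carriage return and form feed.

\d   [0-9]           # digits
\w   [a-zA-Z0-9_]    # word character
\s   [\t\n ]         # white space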

If the first character in a character class is a hat/circumflex '^' then the atom matches any character that is NOT in the set.

[^abc] # matches anything EXCEPT a, b, or c

More abbreviations:

\D   [^0-9]
\W   [^a-zA-Z0-9_]
\S   [^\t\n ]
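To get a feel for character classes, here is a minimal sketch that reports which of four sample patterns match a few strings. The helper name classes_matched and the sample strings are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: lists which of four sample patterns match a string.
sub classes_matched {
    my ($s) = @_;
    my @hits;
    push @hits, '[a-z][0-9]' if $s =~ /[a-z][0-9]/;   # lowercase letter, then a digit
    push @hits, '\d'         if $s =~ /\d/;           # a digit anywhere
    push @hits, '\s'         if $s =~ /\s/;           # white space anywhere
    push @hits, '[^a-z]'     if $s =~ /[^a-z]/;       # anything but a lowercase letter
    return join ' ', @hits;
}

for my $s ("b7", "B7", "_x", "a b", "abc") {
    printf "%-5s %s\n", "'$s'", classes_matched($s);
}
```

Note that a class only asks for ONE matching character somewhere in the string, so 'b7' matches [^a-z] because of the 7.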

Anchors

To match at a particular place in the line there are some special anchors:

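/^abc/   # matches abc at the beginning of the line
/abc$/   # matches abc at the end of the line

Note that the hat '^' serves two purposes in patterns: as the first character inside square brackets it negates a character class, and outside them it anchors the match to the beginning of the line.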
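A short sketch of the anchors in action follows. The helper name where_abc is hypothetical. The last sample includes a trailing newline, as a line read from a file would, to show that '$' still matches just before it.

```perl
use strict;
use warnings;

# Hypothetical helper: reports where 'abc' sits in a line.
sub where_abc {
    my ($line) = @_;
    return "whole line" if $line =~ /^abc$/;
    return "start"      if $line =~ /^abc/;
    return "end"        if $line =~ /abc$/;
    return "middle"     if $line =~ /abc/;
    return "none";
}

for my $line ("abcdef", "xyzabc", "xxabcxx", "abc", "abc\n") {
    (my $shown = $line) =~ s/\n/\\n/;    # make the newline visible when printing
    print "$shown: ", where_abc($line), "\n";
}
```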

Repetition

There are three pattern characters used for repetition - a kind of loop construct:

/z+/   # matches one or more z's in a row
/z*/   # matches zero or more z's in a row
/z?/   # matches zero or one z - i.e. the z is optional.

These are also called "quantifiers" in some books.

These quantifiers apply only to the immediately preceding atom. If you want one to apply to more than just that atom you can use parentheses to group atoms:

/(ab)+/ # one or more sequences of 'ab'.

A common example:

/ab.*yz/   # 'ab' followed by
           # ANY AMOUNT OF ANYTHING followed by 'yz'.

You will see '.*' a lot to skip over things you don't care about.
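To see exactly how much text each quantifier grabs, here is a minimal sketch. The helper matched_text is hypothetical: it wraps the pattern in an extra pair of parentheses and returns what was matched. Two things are worth noticing: '.*' takes as much as it can (it runs to the LAST 'yz'), and 'z*' happily matches zero z's.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text a pattern matched, or undef if none.
sub matched_text {
    my ($line, $pattern) = @_;
    return $line =~ /($pattern)/ ? $1 : undef;
}

print matched_text("buzzzz!", 'z+'), "\n";                 # the run of z's
print matched_text("ababab-ab", '(ab)+'), "\n";            # repeated group
print matched_text("colour", 'colou?r'), "\n";             # optional u present
print matched_text("color", 'colou?r'), "\n";              # optional u absent
print matched_text("xxab 1 yz 2 yz 3", 'ab.*yz'), "\n";    # runs to the last yz
print "[", matched_text("abc", 'z*'), "]\n";               # empty match: []
```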

Alternation

The pipe '|' introduces an OR into patterns:

/hello|goodbye/ # matches hello OR goodbye

The pipe serves as a larger grouping construct as well: everything to its left is one alternative and everything to its right is the other. The above is NOT the same as:

/hell[og]oodbye/
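Because the pipe splits the whole pattern, anchors belong to only one side unless you group with parentheses. The sketch below illustrates this; the helper check and the variables $loose and $strict are names invented for this example, and qr// is simply Perl's way of storing a compiled pattern in a variable.

```perl
use strict;
use warnings;

# Hypothetical helper: "yes" if the string matches the compiled pattern.
sub check {
    my ($s, $re) = @_;
    return $s =~ $re ? "yes" : "no";
}

my $loose  = qr/^hello|goodbye$/;     # (starts with hello) OR (ends with goodbye)
my $strict = qr/^(hello|goodbye)$/;   # the whole line is hello OR goodbye

for my $s ("hello", "hello world", "say goodbye") {
    printf "%-12s loose: %-3s strict: %s\n",
        $s, check($s, $loose), check($s, $strict);
}
```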

There is a great deal more to what you can put inside a pattern but this will suffice for the moment. Learn these things well - they will be used over and over again.

What Matched?

If you surround a portion of the pattern with parentheses then once a match has been verified you can see exactly what matched that portion. For example:

$line = "the_count = 45";
if ($line =~ /(\w+)\s*=\s*(\d+)/) {
    print "name: $1 and number: $2\n";
}

See the ( ) surrounding \w+ and \d+? When there is a successful match the name and the number are put in the variables $1 and $2.
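One caution: $1 and $2 are only meaningful after a SUCCESSFUL match, so always test the match before using them. A minimal sketch, with a hypothetical helper parse_assignment and invented sample lines:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from the example above.
sub parse_assignment {
    my ($line) = @_;
    if ($line =~ /(\w+)\s*=\s*(\d+)/) {
        # $1 and $2 are only trustworthy inside this branch.
        return "name: $1 and number: $2";
    }
    return "no match";
}

for my $line ("the_count = 45", "x=7", "count = many") {
    print "$line -> ", parse_assignment($line), "\n";
}
```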

Scalar vs List Context

A regular expression match has a value. In scalar context the value is a boolean that says whether the match succeeded or not. The many previous examples illustrate this.

If the match is put in list context it returns the list of captured substrings ($1, $2, etc.). This is best illustrated by rewriting the above example:

$line = "the_count = 45";
if (($name, $number) = $line =~ /(\w+)\s*=\s*(\d+)/) {
    print "name: $name and number: $number\n";
}

This makes for very clear and compact code, yes?
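The 'if' works because a failed match in list context returns an empty list, and a list assignment used as a condition counts the items on its right-hand side. Here is a sketch of the same idea packaged as a function; split_assignment is a hypothetical name and the sample lines are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (name, number), or an empty list on failure.
sub split_assignment {
    my ($line) = @_;
    my @parts = $line =~ /(\w+)\s*=\s*(\d+)/;   # list context: the captures
    return @parts;
}

for my $line ("the_count = 45", "no digits here") {
    if (my ($name, $number) = split_assignment($line)) {
        print "name: $name and number: $number\n";
    }
    else {
        print "no match: $line\n";
    }
}
```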

Substitutions

Regular expressions can also be used to modify scalars.

$line = "Now is the time for the election";
$line =~ s/e/E/;

The above is a substitute statement. It will replace the first 'e' in $line with 'E' resulting in:

Now is thE time for the election

Appending a 'g' will do the replacements globally:

$line =~ s/e/E/g;

Which results in:

Now is thE timE for thE ElEction

Between the first two slashes in the substitute statement you can put ANY arbitrarily complex pattern.

$line =~ s/[aeiou]/X/g;

will result in:

NXw Xs thX tXmX fXr thX XlXctXXn

Without the "$var =~" the substitution will take place on $_, as you might expect by now.
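A substitution also has a value: the number of replacements it made (or false if it made none). The sketch below shows that, along with a substitution applied to $_ inside a loop. The helper replace_vowels and the array @words are names invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (number of replacements, new line).
sub replace_vowels {
    my ($line) = @_;
    my $count = ($line =~ s/[aeiou]/X/g) || 0;   # count, or false if none
    return ($count, $line);
}

my ($count, $result) = replace_vowels("Now is the time for the election");
print "$count replacements: $result\n";

# With no "$var =~" the substitution works on $_,
# which here is each array element in turn.
my @words = ("see", "tree");
for (@words) {
    s/e/E/;    # first 'e' only
}
print "@words\n";
```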

Other Modifiers

In addition to the 'g' modifier above there are also 'i', 'x' and 'e'.

Case Insensitivity

Ignoring the case of letters in a pattern is very simple to do:

if ($name =~ /Charles/i) { ... }

See the little 'i' after the closing slash? It stands for insensitive - specifically case insensitivity. The pattern will match Charles, CHARLES and even ChArLeS!
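A quick sketch to confirm this, using a hypothetical helper is_charles and some invented names. The 'i' can also be combined with 'g' in a substitution, as the last lines show.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the name contains "charles" in any mix of case.
sub is_charles {
    my ($name) = @_;
    return $name =~ /Charles/i ? 1 : 0;
}

for my $name ("Charles", "CHARLES", "ChArLeS", "Charlie") {
    printf "%-8s %s\n", $name, is_charles($name) ? "matches" : "does not match";
}

# Modifiers combine: replace every spelling of fred with Fred.
my $text = "fred and FRED";
$text =~ s/fred/Fred/gi;
print "$text\n";
```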

Expanding the Regular Expression

With the 'x' modifier you can eXpand the pattern by adding white space and comments to make it more readable. Instead of:

if (/^\s*([a-z]+)\s*=\s*(\d+)\s*$/) {
    print "var: $1 num: $2\n";
}

you can do this:

if (/
    ^\s*
    ([a-z]+)    # variable
    \s*=\s*     # equals
    (\d+)       # number
    \s*$
    /x
) {
    print "var: $1 num: $2\n";
}

The important difference is the 'x' after the final slash. This makes the cryptic expression much clearer.

This also illustrates a conventional exception to the indentation rules. Note the placement of the closing parenthesis and the opening brace '{'. The new code alignment rule is this: if the boolean expression of an 'if' or 'while' extends across more than one line, then the parenthesis and brace ') {' are not put on the last line (where they might be lost in the visual shuffle). Instead, they are aligned with the keyword on a line by themselves.
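One consequence of 'x' to keep in mind: since white space in the pattern is now ignored, a space you actually want to match must be written some other way, for example as \s or inside square brackets. A minimal sketch, with hypothetical helper names matches_loose and matches_exact:

```perl
use strict;
use warnings;

# Under /x the space between the words is ignored, so this is really /helloworld/.
sub matches_loose { my ($s) = @_; return $s =~ /hello world/x ? 1 : 0; }

# Inside square brackets a space still counts, so this requires a real space.
sub matches_exact { my ($s) = @_; return $s =~ /hello [ ] world/x ? 1 : 0; }

for my $s ("helloworld", "hello world") {
    printf "%-12s loose: %d  exact: %d\n",
        "'$s'", matches_loose($s), matches_exact($s);
}
```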

Evaluating the Regular Expression

With the 'e' modifier the replacement portion of a substitute statement will be Evaluated as an arbitrary Perl expression and the result will be used as the replacement string. Like so:

$line = "hello there 3 and more 67";
$line =~ s/(\d+)/ $1 * 2 /eg;

results in:

hello there 6 and more 134

The ability to evaluate arbitrary Perl code is a very powerful extension of the regular expression machinery. All kinds of fancy things become possible!

The 'g' above caused a global replacement.
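To see what the 'e' is really doing, compare the same substitution with and without it. Without 'e' the replacement is just a string, so the multiplication is never carried out. The three helper names below are hypothetical, and the sprintf variant is only one more example of an expression you might evaluate.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.
sub double_with_e {
    my ($s) = @_;
    $s =~ s/(\d+)/ $1 * 2 /eg;                  # replacement is Perl code
    return $s;
}

sub double_without_e {
    my ($s) = @_;
    $s =~ s/(\d+)/$1 * 2/g;                     # replacement is plain text
    return $s;
}

sub pad_numbers {
    my ($s) = @_;
    $s =~ s/(\d+)/ sprintf("%03d", $1) /eg;     # any expression will do
    return $s;
}

my $line = "hello there 3 and more 67";
print double_with_e($line), "\n";
print double_without_e($line), "\n";
print pad_numbers($line), "\n";
```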

Exercises

Using the test program to read lines from DATA, make regular expressions to match:
- ab followed by a digit followed by a capital letter
- xy followed by ANY two characters and then an F.
- lines beginning with a 'G'
- lines that end with a digit
- lines containing the name 'fred' or 'Fred' or 'FRED' or ... (ignore the case of letters).
- A Perl identifier
  (Begins with a letter followed by any number of letters, digits or underscores).

Describe (in English) what this code does:

while (<DATA>) {
    next unless /\S/;
    next if /^\s*#/;
    print;
}

Make a program to identify lines as one of these categories:
- US Zip codes
  95060, 90034, 89102
- Canadian Postal Codes
  V4E 2E3, F4E 6T5
- Phone numbers
  (800) 234-5555, (831) 342-0987
- IP addresses
  12.31.132.111, 45.32.16.4
  (don't worry about ensuring 0-255 for the octets)
- Assignment statements
  abc = 456, abc = 234, we =21, abcekj=5
- web addresses
  http:// as a prefix
- Unix full pathnames
  /usr/bin/perl
- Other
  anything else
Have at least 20 lines of input with 2 or 3 examples of each category. Don't try to be overly precise. Just get some practice with patterns.

Read lines from DATA and print to STDOUT. Skip entirely blank lines and lines that begin with a sharp. Some lines will be "definition lines". Use the definitions later when the variable appears surrounded by percent signs. Definition lines begin with a name followed by an equal sign. The rest of the line is the value. For example:

-- input --
# definitions
age = 23
year = 1979
name = Mary Smith

# lines to process and output:
Her name is %name% and she is %age% years old.
%name% was born in %year%.
-- output --
Her name is Mary Smith and she is 23 years old.
Mary Smith was born in 1979.

Read lines from DATA and print them followed by the number of vowels in the line. Do this with regular expressions by deleting all non-vowels in the line. All that will remain will be the vowels so the length of the line will be the number of vowels! Include both upper and lower case vowels.

-- input --
HELLO how are you doing?
I'm fine.
-- output --
HELLO how are you doing?: 9
I'm fine: 3
