Regular Expressions

To see if a line has the string "abc" in it we can do this:

if (index($line, "abc") >= 0) {
Searching for a string or a more complex pattern is so useful that Perl offers another, much more powerful (and, of course, more concise!) way to do it:

if ($line =~ /abc/) {
This should be read "if line matches /abc/". The "=~" is the match operator; it takes a scalar on its left and a pattern on its right. It returns true if the pattern matches, otherwise false.
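As a rough side-by-side check, the sketch below runs both tests over a few sample lines and collects the ones each test accepts. The sample lines and the array names @by_index and @by_match are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample lines, chosen only for illustration.
my @lines = ("hello", "abracadabra", "abcdefg");

my (@by_index, @by_match);
for my $line (@lines) {
    # index() returns the position of the substring, or -1 if it is absent.
    push @by_index, $line if index($line, "abc") >= 0;

    # =~ is true when the pattern matches anywhere in $line.
    push @by_match, $line if $line =~ /abc/;
}

print "index found: @by_index\n";
print "match found: @by_match\n";
```

For a plain string like "abc" the two tests agree; the pattern form becomes more useful once the thing being searched for is not a fixed string.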

Between the two slashes comes the pattern, which can be very complex with LOTS of characters that have special meanings. Another term for these patterns is "regular expressions", although they are not really very 'regular'! That term comes from theoretical computer science. "Pattern" or "rule" is arguably more appropriate, but you will often see "regular expression" or "regexp".

Developing the patterns can be a real challenge since they are a little language unto themselves. This bit of code is a useful way to test and develop a pattern:


while ($line = <DATA>) {
    if ($line =~ /abc/) {
        print "matched: $line";
    }
}
__DATA__
hello
abracadabra
abcdefg
The lines after the __DATA__ token are a representative sample of the expected input. They are easily read with the <DATA> construct. You don't need to do an 'open' at all! With this in place you can keep fiddling with the pattern until you are satisfied it is correct for both the lines it matches AND the ones it doesn't match.

You can apply some syntactic sugar to make the above more concise, if you wish.


while (<DATA>) {
    if (/abc/) {
        print "matched: $_";
    }
}
If a pattern appears alone without the "=~" operator then it applies to the (you guessed it) default variable $_. A further simplification would yield:

while (<DATA>) {
    print "matched: $_" if /abc/;
}
So ... what is a pattern? There are four concepts in the construction of a pattern: Atoms, Anchors, Repetition, and Alternation.

Atoms

An atom matches a single character. The letters a-z and A-Z and the digits 0-9 each match themselves literally.

/abc/
/9B3/
/hello/
A period '.' matches ANY character (except a newline). It is like a wild card in poker.

/a.c/        #  abc, axc, aEc, ...
/a..d/        #  a12d, aqqd, aGGd, ...
If you want to match a real dot precede the '.' with a backslash.

/3\.14159/
A character class matches any one character from a set of characters. You enclose the set of characters in square brackets [ ] like so:

/[abc]/          # matches a, b, or c
/[abcABC]/
/[0-9]/          # ranges are okay
/[a-zA-Z]/       # any letter
/[a-z][0-9]/
There are some abbreviations for common character classes:

\d        [0-9]           # digits
\w        [a-zA-Z0-9_]    # word character
\s        [ \t\n\r\f]     # white space
If the first character in a character class is a hat/circumflex '^' then the atom matches any character that is NOT in the set.

[^abc]        # matches anything EXCEPT a, b, or c
More abbreviations:

\D        [^0-9]
\W        [^a-zA-Z0-9_]
\S        [^ \t\n\r\f]
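One way to get a feel for the abbreviations is to try single characters against them. The sketch below uses a hypothetical helper, classify, that reports which of a few classes a character falls into; the helper name and the labels it returns are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: list which classes match a one-character string.
sub classify {
    my ($char) = @_;
    my @names;
    push @names, 'digit'   if $char =~ /\d/;
    push @names, 'word'    if $char =~ /\w/;
    push @names, 'space'   if $char =~ /\s/;
    push @names, 'not-abc' if $char =~ /[^abc]/;   # negated class
    return join ',', @names;
}

for my $char ('7', 'q', '_', ' ', 'b') {
    printf "'%s' => %s\n", $char, classify($char);
}
```

Note that a digit also counts as a word character, and that 'b' is the only sample rejected by [^abc].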

Anchors

To match at a particular place in the line there are some special anchors:

/^abc/    # matches abc at the beginning of the line
/abc$/    # matches abc at the end of the line
Note that the hat '^' serves two purposes in patterns.
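The two roles of the hat can be seen together in a small sketch. The helper anchor_report and its labels are hypothetical; it simply records which anchored patterns match a given string.

```perl
use strict;
use warnings;

# Hypothetical helper: report which anchored patterns match $s.
sub anchor_report {
    my ($s) = @_;
    my @hits;
    push @hits, 'start' if $s =~ /^abc/;     # abc at the beginning
    push @hits, 'end'   if $s =~ /abc$/;     # abc at the end
    push @hits, 'whole' if $s =~ /^abc$/;    # nothing but abc
    # The first ^ is an anchor; the second, inside [ ], negates the class.
    push @hits, 'starts-with-other' if $s =~ /^[^abc]/;
    return join ',', @hits;
}

for my $s ("abcdef", "xyzabc", "xabcx", "abc") {
    print "$s: ", anchor_report($s), "\n";
}
```

Using both anchors at once, as in /^abc$/, is a common way to insist that the pattern accounts for the entire line.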

Repetition

There are three pattern characters used for repetition - a kind of loop construct:

/z+/    # matches one or more z's in a row
/z*/    # matches zero or more z's in a row
/z?/    # matches zero or one z - i.e. the z is optional.
These are also called "quantifiers" in some books.

These quantifiers apply only to the immediately preceding atom. If you want one to apply to more than a single atom you can use parentheses to group atoms:


/(ab)+/        # one or more sequences of 'ab'.
A common example:

/ab.*yz/    # 'ab' followed by
            # ANY AMOUNT OF ANYTHING followed by 'yz'.
You will see '.*' a lot to skip over things you don't care about.
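To see how a quantifier attaches to a single atom versus a group, the following sketch tries four short strings against anchored patterns. The sample strings and the %report hash are made up for illustration; the anchors are there so that each pattern has to account for the whole string.

```perl
use strict;
use warnings;

my %report;   # hypothetical: string => patterns that matched it
for my $s ("ab", "abbb", "abab", "a") {
    my @hits;
    push @hits, 'ab+'   if $s =~ /^ab+$/;     # a, then one or more b
    push @hits, '(ab)+' if $s =~ /^(ab)+$/;   # one or more 'ab' pairs
    push @hits, 'ab*'   if $s =~ /^ab*$/;     # a, then zero or more b
    push @hits, 'ab?'   if $s =~ /^ab?$/;     # a, then an optional b
    $report{$s} = join ' ', @hits;
    print "$s: $report{$s}\n";
}
```

Only the grouped form accepts "abab", because without the parentheses the '+' repeats just the 'b'.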

Alternation

The pipe '|' introduces an OR into patterns:

/hello|goodbye/        # matches hello OR goodbye
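The pipe binds more loosely than anything else in a pattern, so it separates whole alternatives: everything to its left is one choice and everything to its right is the other. The above is NOT the same as: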

/hell[og]oodbye/
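A small sketch of the difference, using made-up sample strings and a hypothetical %report hash. Both patterns are anchored here, and the alternation is wrapped in parentheses so that the anchors apply to both choices.

```perl
use strict;
use warnings;

my %report;   # hypothetical: string => which pattern accepted it
for my $s ("hello", "goodbye", "helloodbye", "hellgoodbye") {
    my @hits;
    push @hits, 'alternation' if $s =~ /^(hello|goodbye)$/;
    push @hits, 'class'       if $s =~ /^hell[og]oodbye$/;
    $report{$s} = join ' ', @hits;
    print "$s: $report{$s}\n";
}
```

The character class chooses between single characters only, so it accepts the odd strings "helloodbye" and "hellgoodbye" and rejects both real words.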
There is a great deal more to what you can put inside a pattern but this will suffice for the moment. Learn these things well - they will be used over and over again.

What Matched?

If you surround a portion of the pattern with parentheses then once a match has been verified you can see what exactly matched that portion. For example:

$line = "the_count  =  45";
if ($line =~ /(\w+)\s*=\s*(\d+)/) {
    print "name: $1 and number: $2\n";
}
See the ( ) surrounding \w+ and \d+? When there is a successful match the name and the number are put in the variables $1 and $2.
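Because $1 and $2 are only set by a successful match (after a failed match they keep whatever an earlier match left in them), it is safest to read them inside the 'if'. A minimal sketch, with a hypothetical helper named parse_assignment that wraps the pattern above:

```perl
use strict;
use warnings;

# Hypothetical helper: return (name, number) or an empty list.
sub parse_assignment {
    my ($line) = @_;
    if ($line =~ /(\w+)\s*=\s*(\d+)/) {
        return ($1, $2);    # read the captures only after a successful match
    }
    return;                 # no match: nothing to report
}

my ($name, $number) = parse_assignment("the_count  =  45");
print "name: $name and number: $number\n";

my @none = parse_assignment("no equals sign here");
print "second line gave ", scalar(@none), " values\n";
```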

Substitutions


$line = "Now is the time for the election";
$line =~ s/e/E/;
The above is a substitute statement. It will replace the first 'e' in $line with 'E' resulting in:

Now is thE time for the election
Appending a 'g' will do the replacements globally:

$line =~ s/e/E/g;
Which results in:

Now is thE timE for thE ElEction
Between the first two slashes in the substitute statement you can put ANY arbitrarily complex pattern.

$line =~ s/[aeiou]/X/g;
will result in:

NXw Xs thX tXmX fXr thX XlXctXXn
Without the "$var =~" the substitution will take place on $_ as you might expect.
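As an illustration of the $_ form, the sketch below applies the vowel substitution to each element of an array. It relies on the fact that a 'for' loop makes $_ an alias for the current element, so the array itself is changed; the array name and contents are made up.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my @lines = ("Now is the time", "for the election");

for (@lines) {
    # No "$var =~" here, so the substitution works on $_,
    # which is an alias for the current element of @lines.
    s/[aeiou]/X/g;
}

print "$_\n" for @lines;
```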

Modifiers

In addition to the 'g' modifier above there are also 'i', 'x' and 'e'.

Case Insensitivity

Ignoring case of letters in a pattern is very simple to do:

if ($name =~ /Charles/i) {
    ...
}
See the little 'i' after the closing slash? It stands for insensitive - specifically case insensitivity. The pattern will match Charles, CHARLES and even ChArLeS!
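A quick way to try this, using a made-up list of names; note that the pattern itself can be written in any case once 'i' is present.

```perl
use strict;
use warnings;

# Hypothetical list of names to test against the pattern.
my @names = ("Charles", "CHARLES", "ChArLeS", "charlie");

my @hits;
for my $name (@names) {
    push @hits, $name if $name =~ /charles/i;
}
print "matched: @hits\n";
```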

Expanding the Regular Expression

With the 'x' modifier you can eXpand the pattern by adding white space and comments to make it more readable. Instead of:

if (/^\s*([a-z]+)\s*=\s*(\d+)\s*$/) {
    print "var: $1 num: $2\n";
}
you can do this:

if (/

    ^\s*
    ([a-z]+)     # variable
    \s*=\s*      # equals
    (\d+)        # number
    \s*$

    /x)
{
    print "var: $1 num: $2\n";
}
The important difference is the 'x' after the final slash. This makes the cryptic expression much clearer. This also illustrates a conventional exception to the indentation rules. Note the placement of the opening brace '{'. The new rule is this: If the boolean expression of an 'if' or 'while' extends across more than one line then the opening brace '{' is not put on the last line (where it might be lost in the visual shuffle). Instead, it is aligned with the keyword on a line by itself.
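One consequence worth keeping in mind: under 'x' a space typed in the pattern no longer matches a space in the data, so literal white space has to be written as \s (or escaped). A minimal sketch with a made-up test string and hypothetical variable names:

```perl
use strict;
use warnings;

my $text = "hello world";   # hypothetical test string

# Without x the space in the pattern is an ordinary atom.
my $plain   = $text =~ /hello world/     ? 1 : 0;

# With x the space is ignored, so this looks for "helloworld".
my $ignored = $text =~ /hello world/x    ? 1 : 0;

# With x, use \s where real white space is expected in the data.
my $escaped = $text =~ /hello \s world/x ? 1 : 0;

print "plain=$plain ignored=$ignored escaped=$escaped\n";
```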

Evaluating the Replacement

With the 'e' modifier the replacement portion of a substitute statement will be evaluated as an arbitrary Perl expression and the result will be used as the replacement string. Like so:

$line = "hello there 3 and more 67";
$line =~ s/(\d+)/ $1 * 2 /eg;
results in:

hello there 6 and more 134
The 'g' caused a global replacement.
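To make the effect of 'e' easier to see, the sketch below runs the same kind of substitution with and without it, and also calls a function (sprintf) in the replacement. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line = "3 apples and 12 pears";   # hypothetical input

# Without e the replacement is just a string with $1 interpolated.
my $literal = $line;
$literal =~ s/(\d+)/$1 * 2/g;

# With e the replacement is Perl code whose value is substituted.
my $doubled = $line;
$doubled =~ s/(\d+)/ $1 * 2 /eg;

# Any expression will do, including a function call.
my $padded = $line;
$padded =~ s/(\d+)/ sprintf("%03d", $1) /eg;

print "$literal\n$doubled\n$padded\n";
```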

Exercises

  1. Using the test program to read lines from DATA make regular expressions to match:

    • ab followed by a digit followed by a capital letter

    • xy followed by ANY two characters and then an F.

    • lines beginning with a 'G'

    • lines that end with a digit

    • lines containing the name 'fred' or 'Fred' or 'FRED' or ... (ignore the case of letters).

    • A perl identifier
      (Begins with a letter followed by any number of letters, digits or underscores).

  2. Describe (in English) what this code does:
    while (<DATA>) {
        next unless /\S/;
        next if /^\s*#/;
        print;
    }
    
  3. Make a program to identify lines as one of these categories:

    • US Zip codes
      95060, 90034, 89102

    • Canadian Postal Codes
      V4E 2E3, F4E 6T5

    • Phone numbers
      (800) 234-5555, (831) 342-0987

    • IP addresses
      12.31.132.111, 45.32.16.4
      (don't worry about ensuring 0-255 for the octets)

    • Assignment statements

      
      abc = 456
      abc = 234
      we   =21
      abcekj=5
      
    • web addresses
      http:// as a prefix

    • Unix full pathnames
      /usr/bin/perl

    • Other
      anything else

    Have at least 20 lines of input with 2 or 3 examples of each category. Don't try to be overly precise. Just get some practice with patterns.

  4. Read lines from DATA and print to STDOUT. Skip entirely blank lines and lines that begin with a sharp sign '#'. Some lines will be "definition lines". Use the definitions later when the variable appears surrounded by percent signs. Definition lines begin with a name followed by an equals sign. The rest of the line is the value. For example:

    -- input --
    # definitions
    age = 23
    year = 1979
    name = Mary Smith
    
    # lines to process and output:
    Her name is %name% and she is %age% years old.
    %name% was born in %year%.
    
    -- output --
    Her name is Mary Smith and she is 23 years old.
    Mary Smith was born in 1979.
    
  5. Read lines from DATA and print them followed by the number of vowels in the line. Do this with regular expressions by deleting all non-vowels in the line. All that will remain will be the vowels so the length of the line will be the number of vowels! Include both upper and lower case vowels.

    -- input --
    HELLO how are you doing?
    I'm fine.
    
    -- output --
    HELLO how are you doing?: 9
    I'm fine.: 3
    
