Regular Expressions

To see if a line has the string "abc" in it we can do this:
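A minimal sketch of such a check without patterns, assuming Perl's built-in index function (which returns -1 when the substring is absent). The variable names and the sample line are hypothetical:

```perl
use strict;
use warnings;

my $line = "xyzabcdef";    # hypothetical sample line

# index returns the position of the substring, or -1 if it is absent
my $found = index($line, "abc") >= 0;
print "found abc\n" if $found;
```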
Searching like this for a string or a more complex
pattern is very useful, and Perl offers another,
much more powerful (and, of course, more concise!)
way to do it:
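A sketch of the pattern-matching form, assuming the same hypothetical $line variable:

```perl
use strict;
use warnings;

my $line = "xyzabcdef";    # hypothetical sample line

my $matched = 0;
if ($line =~ /abc/) {      # read: "if line matches /abc/"
    $matched = 1;
    print "matched\n";
}
```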
This should be read "if line matches /abc/".
The "=~" is the match operator that takes
a scalar on its left and a pattern on the right.
It returns true if the pattern matches, otherwise false.
Between the two slashes comes the pattern, which can be very complex with LOTS of special characters with special meanings. Another term for these patterns is "regular expressions" although they are not really very 'regular'! That term comes from theoretical computer science. "Pattern" or "Rule" is somehow more appropriate but you will often see "regular expression" or "regexp".

Developing the patterns can be a real challenge since they are a little language unto themselves. This bit of code is a useful way to test and develop a pattern:
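A minimal sketch of such a test harness. The pattern /abc/ and the sample lines are placeholders, and the two arrays are an assumption added here so the results can be inspected after the loop:

```perl
use strict;
use warnings;

my (@matched, @missed);       # collected so both outcomes can be reviewed

while (my $line = <DATA>) {
    chomp $line;
    if ($line =~ /abc/) {     # the pattern under development
        push @matched, $line;
    }
    else {
        push @missed, $line;
    }
}

print "matched: $_\n" for @matched;
print "missed:  $_\n" for @missed;

__DATA__
abc at the start
the alphabet begins abcdef
nothing to see here
ABC in capitals
```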
The lines after the __DATA__ token are
a representative sample
of the expected input. They are easily read with
the <DATA> construct. You don't need to do an 'open' at all!
With this in place you can keep fiddling with
the pattern until you are satisfied it is correct
for both the lines it matches AND the ones
it doesn't match.
You can apply some syntactic sugar to make the above more concise, if you wish.
If a pattern appears alone without the "=~" operator
then it applies to the (you guessed it) default variable $_.
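A sketch of the shorter form, assuming the sample lines are held in a hypothetical @lines array rather than the DATA section. A "for" loop without a loop variable puts each element in $_, just as "while (<DATA>)" does for each input line:

```perl
use strict;
use warnings;

my @lines = ("xabcx", "nothing here", "abc");   # hypothetical sample input
my @hits;

for (@lines) {          # each element is aliased to $_
    if (/abc/) {        # same as: if ($_ =~ /abc/)
        push @hits, $_;
    }
}
print "$_\n" for @hits;
```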
A further simplification would yield:
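One possible simplification, assuming a statement modifier ("... if /abc/") is what is meant. The @lines and @hits arrays are hypothetical:

```perl
use strict;
use warnings;

my @lines = ("xabcx", "nothing here", "abc");   # hypothetical sample input
my @hits;

for (@lines) {
    push @hits, $_ if /abc/;    # the pattern is applied to $_
}
print "$_\n" for @hits;
```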
So ... what is a pattern?
There are 4 concepts in the construction of
a pattern: Atoms, Anchors, Repetition, and Alternation.
Atoms

An atom matches a single character. The letters a-z, A-Z, 0-9 each match themselves literally.

    /abc/
    /9B3/
    /hello/

A period '.' matches ANY character (except a newline). It is like a wildcard when playing poker.

    /a.c/     # abc, axc, aEc, ...
    /a..d/    # a12d, aqqd, aGGd, ...

If you want to match a real dot precede the '.' with a backslash.

    /3\.14159/

A character class matches any one character from a set of characters. You enclose the set of characters in square brackets [ ] like so:

    /[abc]/         # matches a, b, or c
    /[abcABC]/
    /[0-9]/         # ranges are okay
    /[a-zA-Z]/      # any letter
    /[a-z][0-9]/

There are some abbreviations for common character classes:

    \d    [0-9]           # digits
    \w    [a-zA-Z0-9_]    # word character
    \s    [\t\n ]         # white space

If the first character in a character class is a hat/circumflex '^' then the atom matches any character that is NOT in the set.

    [^abc]    # matches anything EXCEPT a, b, or c

More abbreviations:

    \D    [^0-9]
    \W    [^a-zA-Z0-9_]
    \S    [^\t\n ]

Anchors

To match at a particular place in the line there are some special anchors:

    /^abc/    # matches abc at the beginning of the line
    /abc$/    # matches abc at the end of the line

Note that the hat '^' serves two purposes in patterns.

Repetition

There are three pattern characters used for repetition - a kind of loop construct:

    /z+/    # matches one or more z's in a row
    /z*/    # matches zero or more z's in a row
    /z?/    # matches zero or one z - i.e. the z is optional.

These are also called "quantifiers" in some books.

These quantifiers apply only to the immediately preceding atom. If you want it to apply to more than just the one atom you can use parentheses to group atoms:

    /(ab)+/    # one or more sequences of 'ab'.

A common example: you will see '.*' a lot to skip over things you don't care about.
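The atoms, anchors, and quantifiers described so far can be tried out with a small table of cases. This is only a sketch: the test strings are made up, and the qr// operator (which stores a pattern in a variable) is used here purely for convenience:

```perl
use strict;
use warnings;

# Each entry: [ pattern, test string, expected result (1 = match, 0 = no match) ]
my @cases = (
    [ qr/a.c/,        "axc",       1 ],   # '.' matches any one character
    [ qr/3\.14159/,   "3x14159",   0 ],   # '\.' matches only a real dot
    [ qr/[a-z][0-9]/, "q7",        1 ],   # a letter followed by a digit
    [ qr/[^abc]/,     "abc",       0 ],   # nothing outside the set a, b, c
    [ qr/^abc/,       "xabc",      0 ],   # abc is not at the beginning
    [ qr/abc$/,       "xabc",      1 ],   # abc is at the end
    [ qr/(ab)+/,      "ababab",    1 ],   # one or more sequences of 'ab'
    [ qr/z?/,         "no zed",    1 ],   # zero occurrences still match
    [ qr/^\d+\s\w+$/, "42 apples", 1 ],   # digits, white space, word characters
);

my $failures = 0;
for my $case (@cases) {
    my ($pattern, $string, $expected) = @$case;
    my $got = $string =~ $pattern ? 1 : 0;
    $failures++ if $got != $expected;
    printf "%-20s %-12s %s\n", $pattern, "'$string'", $got ? "match" : "no match";
}
print "unexpected results: $failures\n";
```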
Alternation

The pipe '|' introduces an OR into patterns:

    /hello|goodbye/    # matches hello OR goodbye

The pipe serves as a larger grouping construct as well. The above is NOT the same as:

    /hell[og]oodbye/

There is a great deal more to what you can put inside a pattern but this will suffice for the moment. Learn these things well - they will be used over and over again.

What Matched?

If you surround a portion of the pattern with parentheses then once a match has been verified you can see what exactly matched that portion. For example:
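A sketch of capturing with parentheses, assuming a hypothetical input line that holds a name and a number:

```perl
use strict;
use warnings;

my $line = "Name: Charles  Number: 42";    # hypothetical input line
my ($name, $number);

if ($line =~ /Name: (\w+)\s+Number: (\d+)/) {
    ($name, $number) = ($1, $2);           # $1 = first ( ), $2 = second ( )
    print "name=$name number=$number\n";
}
```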
See the ( ) surrounding \w+ and \d+?
When there is a successful match the name and
the number are put in the variables $1 and $2.
Substitutions

    $line = "Now is the time for the election";
    $line =~ s/e/E/;

The above is a substitute statement. It will replace the first 'e' in $line with 'E' resulting in:

    Now is thE time for the election

Appending a 'g' will do the replacements globally:

    $line =~ s/e/E/g;

Which results in:

    Now is thE timE for thE ElEction

Between the first two slashes in the substitute statement you can put ANY arbitrarily complex pattern.

    $line =~ s/[aeiou]/X/g;

will result in:

    NXw Xs thX tXmX fXr thX XlXctXXn

Without the "$var =~" the substitution will take place on $_ as you might expect.

Modifiers

In addition to the 'g' modifier above there are also 'i', 'x' and 'e'.

Case Insensitivity

Ignoring case of letters in a pattern is very simple to do:
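A sketch of a case-insensitive match, assuming a hypothetical list of names to test against:

```perl
use strict;
use warnings;

my @names = ("Charles", "CHARLES", "ChArLeS", "Charlotte");   # hypothetical sample names
my @hits;

for my $name (@names) {
    if ($name =~ /charles/i) {    # the trailing i makes the match ignore case
        push @hits, $name;
    }
}
print "$_\n" for @hits;
```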
See the little 'i' after the closing slash?
It stands for insensitive - specifically case insensitivity.
The pattern will match Charles, CHARLES and even ChArLeS!
Expanding the Regular Expression

With the 'x' modifier you can eXpand the pattern by putting white space and comments in it to make it more readable. Instead of:
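A compact pattern of the kind meant here might look like the following. The "name = number" input format and the variable names are assumptions made purely for illustration:

```perl
use strict;
use warnings;

my $line = "  width = 42  ";    # hypothetical input line
my ($name, $number);

if ($line =~ /^\s*(\w+)\s*=\s*(\d+)\s*$/) {
    ($name, $number) = ($1, $2);
    print "$name is $number\n";
}
```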
you can do this:
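A sketch of an expanded pattern, assuming a hypothetical "name = number" input line. With the 'x' modifier, white space inside the pattern is ignored and '#' starts a comment:

```perl
use strict;
use warnings;

my $line = "  width = 42  ";    # hypothetical input line
my ($name, $number);

if ($line =~ /
        ^ \s*        # optional leading white space
        (\w+)        # the name
        \s* = \s*    # an equals sign, optionally surrounded by white space
        (\d+)        # the number
        \s* $        # optional trailing white space
    /x)
{
    ($name, $number) = ($1, $2);
    print "$name is $number\n";
}
```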
The important difference is the 'x' after the
final slash. This makes the cryptic expression
much clearer. This also illustrates a conventional exception
to the indentation rules. Note the placement of
the opening brace '{'. The new rule is this: If the boolean expression
of an 'if' or 'while' extends across more than one line
then the opening brace '{' is not put on the last line (where
it might be lost in the visual shuffle). Instead, it is
aligned with the keyword on a line by itself.
Evaluating the Regular Expression

With the 'e' modifier the replacement portion of a substitute statement will be evaluated as an arbitrary Perl expression and the result will be used as the replacement string. Like so:

    $line = "hello there 3 and more 67";
    $line =~ s/(\d+)/ $1 * 2 /eg;

results in:

    hello there 6 and more 134

The 'g' caused a global replacement.

Exercises