 |
Perl Pattern Matching
|
|
Patterns |
Patterns are subject to an additional level of interpretation as a
regular expression. This is done as a second pass, after variables are
interpolated, so that regular expressions may be incorporated into the
pattern from the variables. If this is not what you want, use \Q to interpolate
a variable literally. |
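As a minimal sketch of the two passes described above, the following compares a variable interpolated as a regular expression with the same variable interpolated literally via \Q. The variable names and sample strings are illustrative assumptions, not part of the original text.

```perl
use strict;
use warnings;

my $text     = "price: 3.50 (approx)";
my $as_regex = "3.50";    # illustrative value containing a metacharacter

# Interpolated as a regular expression: '.' matches any character.
my $loose  = ("3x50" =~ /$as_regex/)       ? 1 : 0;

# Interpolated literally: \Q escapes metacharacters up to \E.
my $strict = ("3x50" =~ /\Q$as_regex\E/)   ? 1 : 0;
my $exact  = ($text  =~ /\Q$as_regex\E/)   ? 1 : 0;

print "loose=$loose strict=$strict exact=$exact\n";
```

Without \Q the "." in the variable acts as a wildcard, so "3x50" matches; with \Q only the literal text "3.50" matches.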
?PATTERN? |
This is just like the /pattern/ search, except that it matches only
once between calls to the reset() operator. This is a useful optimization
when you only want to see the first occurrence of something in each file
of a set of files, for instance. Only ?? patterns local to the current
package are reset. |
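The "first occurrence per file" idea might be sketched as follows. The file names and contents are hypothetical in-memory data, and the match is written as m?PATTERN? on the assumption of a recent Perl 5, where the bare ?PATTERN? form is no longer accepted.

```perl
use strict;
use warnings;

# Hypothetical data standing in for the lines of two files.
my %files = (
    'a.txt' => [ "alpha",      "TODO one", "TODO two"  ],
    'b.txt' => [ "TODO three", "beta",     "TODO four" ],
);

my @first_hits;
for my $name (sort keys %files) {
    for my $line (@{ $files{$name} }) {
        # Succeeds at most once until reset() is called.
        push @first_hits, "$name: $line" if $line =~ m?TODO?;
    }
    reset;    # re-arm the ?? patterns of the current package
}

print "$_\n" for @first_hits;
```

Only the first TODO line of each data set is collected, because the pattern stops matching after its first success and is re-armed by reset() between sets.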
m/PATTERN/gimosx |
Searches a string for a pattern match, and in a scalar context returns
true (1) or false (''). If no string is specified via the =~ or !~ operator,
the $_ string is searched. The string
specified with =~ can be a variable or the result of an expression evaluation.
The initial 'm' can be omitted if '/' is used for the delimiters. With the 'm',
any non-alphanumeric, non-whitespace character can be used as the delimiter. |
The modifier options are: |
g |
Match globally, i.e. find all occurrences. |
i |
Do case-insensitive pattern matching. |
m |
Treat string as multiple lines - default is to assume just a single
line in the string (no embedded newlines). See $*. |
o |
Only compile pattern once, even if variables within it change. |
s |
Treat string as single line, so that . also matches a newline. |
x |
Use extended regular expressions. Whitespace that is not backslashed
or within a character class is ignored, allowing the regular expression
to be broken into more readable parts with embedded comments. |
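The effect of the modifiers can be illustrated with a short sketch. The sample string and variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;

my $text = "First line\nsecond LINE\n";

# /m: ^ matches at the start of each embedded line; /g: find all of them.
my @starts = $text =~ /^(\w+)/mg;

# /i with /g: count case-insensitive occurrences.
my $ci = () = $text =~ /line/ig;

# /s: '.' also matches a newline, so the match may span both lines.
my $span   = $text =~ /First.*LINE/s ? 1 : 0;
my $nospan = $text =~ /First.*LINE/  ? 1 : 0;

# /x: whitespace and comments inside the pattern are ignored.
my ($word) = $text =~ /
    (\w+)       # capture a word
    \s+ LINE    # followed by whitespace and the literal LINE
/x;

print "@starts | $ci | $span $nospan | $word\n";
```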
In a list context, the pattern match returns the portions of the target
string that match the expressions within the pattern in brackets. In a
scalar context with the /g modifier, each iteration identifies the next match (pos()
holds the position where the previous match on the variable ended). |
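A brief sketch of the two contexts, using an invented record string: list context returns the bracketed portions, while scalar context with /g steps through the matches one at a time.

```perl
use strict;
use warnings;

my $record = "width=80 height=24";    # illustrative input

# List context without /g: the bracketed portions of a single match.
my ($key, $value) = $record =~ /(\w+)=(\d+)/;

# List context with /g: the bracketed portions of every match.
my %size = $record =~ /(\w+)=(\d+)/g;

# Scalar context with /g: each evaluation finds the next match, and
# pos() reports the offset just past the end of the most recent one.
my @ends;
while ($record =~ /\d+/g) {
    push @ends, pos($record);
}

print "$key=$value; height=$size{height}; ends at @ends\n";
```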
q/STRING/, 'STRING' |
A single-quoted, literal string, default delimiters are single quotes
('...'). Backslashes are ignored, unless followed by the delimiter or another
backslash, in which case the delimiter or backslash is interpolated. |
qq/STRING/, "STRING" |
A double-quoted, interpolated string, default delimiters are double
quotes ("..."). |
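The difference between the two quoting forms can be seen in a small sketch (the variable names are illustrative):

```perl
use strict;
use warnings;

my $name = "world";

my $single  = q/Hello, $name\n/;     # no interpolation, backslash kept
my $double  = qq/Hello, $name\n/;    # $name and \n are interpolated
my $escaped = q/a\/b\\c/;            # \/ gives the delimiter, \\ gives one backslash

print length($single), " ", length($double), " ", $escaped, "\n";
```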
qx/STRING/, `STRING` |
A string which is interpolated and then executed as a system command.
The collected standard output of the command is returned. In scalar context,
it comes back as a single (potentially multi-line) string. In list context,
returns a list of lines (depending on how the $/
delimiter is specified).
$today = qx{ date }; |
qw/STRING/ |
Returns a list of the words extracted out of STRING, using embedded
whitespace as the word delimiters. It is exactly equivalent to:
split(' ', q/STRING/); |
Some frequently seen examples:
use POSIX qw( setlocale localeconv );
@EXPORT = qw( foo bar baz ); |
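The stated equivalence with split can be checked directly; this is only a sketch with made-up words.

```perl
use strict;
use warnings;

my @from_qw    = qw( foo bar baz );
my @from_split = split(' ', q/ foo bar baz /);

# Both lists hold the same three words; surrounding whitespace is dropped.
print "@from_qw\n@from_split\n";
```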
s/PATTERN/REPLACEMENT/egimosx |
Searches a string for a pattern, and if found, replaces that pattern
with the replacement text and returns the number of substitutions made.
Otherwise it returns false (0). |
If no string is specified via the =~ or !~ operator, the $_ variable
is searched and modified. (The string specified with =~ must be a scalar
variable, an array element, a hash element, or an assignment to one of
those, i.e. an lvalue.) |
If the delimiter chosen is single quote, no variable interpolation
is done on either the PATTERN or the REPLACEMENT. Otherwise, if the PATTERN
contains a $ that looks like a variable rather than an end-of-string test,
the variable will be interpolated into the pattern at run-time. If you
only want the pattern compiled once the first time the variable is interpolated,
use the /o option. If the pattern evaluates to a null string, the most
recently executed (and successfully compiled) regular expression is used
instead. |
The modifier options are (see m/PATTERN/ above
for more detailed descriptions of common modifiers): |
e |
Evaluate the right side as an expression. |
g |
Match globally, i.e. all occurrences. |
i |
Case-insensitive pattern matching. |
m |
Treat string as multiple lines. |
o |
Only compile pattern once, even if variables within it change. |
s |
Treat string as single line. |
x |
Use extended regular expressions |
Any non-alphanumeric, non-whitespace delimiter may replace the slashes.
If single quotes are used, no interpretation is done on the replacement
string, although the /e modifier overrides this. In Perl 4, backquote
delimiters made the replacement string a command to execute, whose output
was used as the actual replacement text; Perl 5 treats backquotes as
ordinary delimiters. If the PATTERN is delimited by bracketing quotes, the
REPLACEMENT has its own pair of quotes, which may or may not be bracketing
quotes, e.g. s(foo)(bar) or s<foo>/bar/ . A /e will cause the replacement
portion to be interpreted as a full-fledged Perl expression and eval()ed
right then and there. It is, however, syntax checked at compile-time. |
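The delimiter rules and the /e modifier can be sketched as below. The strings are invented for illustration; the sketch assumes Perl 5 behaviour.

```perl
use strict;
use warnings;

# Bracketing delimiters: PATTERN and REPLACEMENT each get their own pair.
my $path = "/usr/bin/perl";
(my $local = $path) =~ s{/usr/bin}{/usr/local/bin};

# /e: the replacement is evaluated as a Perl expression.
my $price = "cost: 21 units";
my $n = ($price =~ s/(\d+)/$1 * 2/e);    # $n is the number of substitutions

# Single-quote delimiters: the replacement is taken literally.
my $literal = "cost: 21 units";
$literal =~ s'(\d+)'$1 * 2';

print "$local\n$price ($n)\n$literal\n";
```

Note that $path itself is left unchanged, because the substitution is applied to the copy assigned to $local.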
Examples:
s/\bgreen\b/mauve/g;                  # don't change wintergreen
$path =~ s|/usr/bin|/usr/local/bin|;
s/Login: $foo/Login: $bar/;           # run-time pattern
($foo = $bar) =~ s/this/that/;
$count = ($paragraph =~ s/Mister\b/Mr./g);
$_ = 'abc123xyz';
s/\d+/$&*2/e;                         # yields 'abc246xyz'
s/\d+/sprintf("%5d",$&)/e;            # yields 'abc  246xyz'
s/\w/$& x 2/eg;                       # yields 'aabbcc  224466xxyyzz'
s/%(.)/$percent{$1}/g;                # change percent escapes; no /e
s/%(.)/$percent{$1} || $&/ge;         # expr now, so /e
s/^=(\w+)/&pod($1)/ge;                # use function call
# /e's can even nest; this will expand
# simple embedded variables in $_
s/(\$\w+)/$1/eeg;
# Delete C comments.
$program =~ s {
/\* (?# Match the opening delimiter.)
.*? (?# Match a minimal number of characters.)
\*/ (?# Match the closing delimiter.)
} []gsx;
s/^\s*(.*?)\s*$/$1/;                  # trim white space
s/([^ ]*) *([^ ]*)/$2 $1/;            # reverse 1st two fields |
Occasionally, you can't just use a /g to get all the changes to occur.
Here are two common cases:
# put commas in the right places in an integer
1 while s/(.*\d)(\d\d\d)/$1,$2/g;       # perl4
1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/g;   # perl5
# expand tabs to 8-column spacing
1 while s/\t+/' ' x (length($&)*8 - length($`)%8)/e; |
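The two loops above can be wrapped into small helpers to make their behaviour easier to test. The function names commify and expand_tabs are hypothetical and are used here only for illustration; the patterns are the Perl 5 ones from the examples.

```perl
use strict;
use warnings;

# Hypothetical helper: insert thousands separators into an integer string.
# Each pass adds one comma, so the substitution is repeated until it fails.
sub commify {
    my ($n) = @_;
    1 while $n =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/;
    return $n;
}

# Hypothetical helper: expand tabs to 8-column spacing. The width of each
# run depends on the text before it, so it must be recomputed every pass.
sub expand_tabs {
    my ($line) = @_;
    1 while $line =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
    return $line;
}

print commify("1234567"), "\n";
print expand_tabs("a\tb"), "\n";
```

A single /g pass is not enough in either case because each replacement changes the text that the next replacement has to be measured against.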
tr/SEARCHLIST/REPLACEMENTLIST/cds
y/SEARCHLIST/REPLACEMENTLIST/cds |
Replaces all occurrences of the characters found in the search list
with the corresponding character in the replacement list. It returns the
number of characters replaced or deleted. If no string is specified via
the =~ or !~ operator, the $_ string is translated. (The string specified
with =~ must be a scalar variable, an array element, or an assignment to
one of those, i.e. an lvalue.) For sed devotees, y is provided as a synonym
for tr . If the SEARCHLIST is delimited by bracketing quotes, the REPLACEMENTLIST
has its own pair of quotes, which may or may not be bracketing quotes,
e.g. tr[A-Z][a-z] or tr(+-*/)/ABCD/ . |
Options are: |
c |
Complement the SEARCHLIST. |
d |
Delete found but unreplaced characters. |
s |
Squash duplicate replaced characters. |
If the /c modifier is specified, the SEARCHLIST character set is complemented.
If the /d modifier is specified, any characters specified by SEARCHLIST
not found in REPLACEMENTLIST are deleted. (Note that this is slightly more
flexible than the behavior of some tr programs, which delete anything they
find in the SEARCHLIST, period.) If the /s modifier is specified, sequences
of characters that were translated to the same character are squashed down
to a single instance of the character. |
If the /d modifier is used, the REPLACEMENTLIST is always interpreted
exactly as specified. Otherwise, if the REPLACEMENTLIST is shorter than
the SEARCHLIST, the final character is replicated till it is long enough.
If the REPLACEMENTLIST is null, the SEARCHLIST is replicated. This latter
is useful for counting characters in a class or for squashing character
sequences in a class. |
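The options and the list-length rules can be sketched as follows; the sample strings and variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# Null REPLACEMENTLIST: the SEARCHLIST is replicated, so this only counts.
my $sky   = "* . * * .";
my $stars = ($sky =~ tr/*//);

# Plain translation of a copy.
(my $upper = "hello") =~ tr/a-z/A-Z/;

# /s: runs of the same translated character are squashed to one.
(my $squashed = "bookkeeper") =~ tr/a-z//s;

# /c complements the SEARCHLIST and /d deletes: keep only the digits.
(my $digits = "tel: 555-0100") =~ tr/0-9//cd;

# Short REPLACEMENTLIST without /d: its final character is replicated.
(my $short = "abcde") =~ tr/a-e/xy/;

print "$stars $upper $squashed $digits $short\n";
```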
Examples:
$ARGV[1] =~ tr/A-Z/a-z/;          # canonicalize to lower case
$cnt = tr/*/*/;                   # count the stars in $_
$cnt = $sky =~ tr/*/*/;           # count the stars in $sky
$cnt = tr/0-9//;                  # count the digits in $_
tr/a-zA-Z//s;                     # bookkeeper -> bokeper
($HOST = $host) =~ tr/a-z/A-Z/;
tr/a-zA-Z/ /cs;                   # change non-alphas to single space
tr [\200-\377]
   [\000-\177];                   # delete 8th bit |
Note that because the translation table is built at compile time, neither
the SEARCHLIST nor the REPLACEMENTLIST is subjected to double quote interpolation.
That means that if you want to use variables, you must use an eval():
eval "tr/$oldlist/$newlist/";
die $@ if $@;
eval "tr/$oldlist/$newlist/, 1" or die $@; |
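A runnable variant of the eval() approach, applied to a named variable rather than $_, might look like this. The list contents and the $text variable are illustrative assumptions, and the lists are assumed to come from a trusted source, since they are compiled as Perl code.

```perl
use strict;
use warnings;

# Illustrative, trusted values; never build this string from untrusted input.
my ($oldlist, $newlist) = ("a-c", "A-C");
my $text = "abcabc xyz";

# The string is compiled at run time, after the variables are interpolated.
# \$text is escaped so that the variable itself, not its value, is translated.
my $count = eval "\$text =~ tr/$oldlist/$newlist/";
die $@ if $@;

print "$count characters translated: $text\n";
```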
Regular Expressions |
The patterns used in pattern matching are regular expressions that
follow the rules laid out below. |
Any single character (or series of characters) matches directly, unless
it is a metacharacter with a special meaning. You can cause characters
which normally function as metacharacters to be interpreted literally by
prefixing them with a "\" (e.g. "\." matches a ".", not any character;
"\\" matches a "\"). A series of characters matches that series of characters
in the target string, so the pattern zyxwv would match "zyxwv" in
the target string. |
The following metacharacters are supported: |
\ |
Quote the next metacharacter. It also introduces escape
sequences (\n, \t etc.; for \b see below), ASCII characters
('\nnn' for octal and '\xnn' for hex), and ASCII control characters ('\cx').
'\1', '\2' and so on (a backslash followed by the number n) match again the
text that the nth subpattern actually matched, not the subpattern's complete
set of rules. |
^ |
Match just the beginning of the string, or with the /m modifier the
beginning of any embedded line |
. |
Match any character (except newline unless the /s
modifier is used) |
$ |
Match just the end of the string, or with the /m modifier the end of
any embedded line |
| |
Alternation - to match any one of a set of patterns, usually grouped
in brackets. |
() |
Grouping of subpatterns, numbered automatically left to right by the
sequence of their opening parenthesis. |
[] |
Character class, matching any of the characters in the enclosed list.
'^' as the first character in the list negates the class, which then
matches any character not in the list. |
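A few of the metacharacters above in action, as a minimal sketch with invented sample strings:

```perl
use strict;
use warnings;

# '.' matches any character; '\.' matches only a literal dot.
my $dot_any     = "axb" =~ /a.b/  ? 1 : 0;
my $dot_literal = "axb" =~ /a\.b/ ? 1 : 0;

# Alternation grouped in brackets, anchored with ^ and $.
my @pets = grep { /^(cat|dog)s?$/ } qw(cat dogs bird);

# Negated character class: a run of characters that are neither vowels nor whitespace.
my ($consonants) = "rhythm and blues" =~ /([^aeiou\s]+)/;

# With /m, $ matches at the end of every embedded line.
my @line_ends = "line one\nline two" =~ /(\w+)$/mg;

print "$dot_any $dot_literal | @pets | $consonants | @line_ends\n";
```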
The following quantifiers are supported: |
* |
Match 0 or more times (equivalent to {0,}) |
+ |
Match 1 or more times (equivalent to {1,}) |
? |
Match 0 or 1 times (equivalent to {0,1}) |
{n} |
Match exactly n times |
{n,} |
Match at least n times |
{n,m} |
Match at least n but not more than m times |
By default, patterns that are quantified as above match as many times as
possible without causing the rest of the match to fail. To match the
fewest number of times instead (for example, so that repeated matches of the
enclosing pattern are found separately), suffix the quantifier with '?',
e.g. '*?', '{n,m}?'. |
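The difference between the default and the minimal form of a quantifier can be sketched as follows (the input strings are made up for the example):

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Default: as much as possible, so the match runs to the last '>'.
my ($greedy)  = $html =~ /<(.+)>/;

# With '?': as little as possible, so the match stops at the first '>'.
my ($minimal) = $html =~ /<(.+?)>/;

# Minimal matching lets /g find each tag separately.
my @tags = $html =~ /<(.+?)>/g;

# The same applies to counted quantifiers.
my ($most)   = "aaaaa" =~ /(a{2,3})/;
my ($fewest) = "aaaaa" =~ /(a{2,3}?)/;

print "$greedy\n$minimal\n@tags\n$most $fewest\n";
```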
Regular expressions also support the following constructs: |
Single character matches |
\w |
a "word" character (alphanumeric plus "_") |
\W |
a non-word character |
\s |
a whitespace character |
\S |
a non-whitespace character |
\d |
a digit character |
\D |
a non-digit character |
Zero width matches |
\b |
a word boundary |
\B |
a non-(word boundary) |
\A |
beginning of the string (not embedded newlines) |
\Z |
end of the string (not embedded newlines) |
\G |
where previous m//g left off |
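The zero-width constructs match positions rather than characters. A short sketch, using invented strings, of \b, \A and \G:

```perl
use strict;
use warnings;

# \b: only the standalone word matches, not the tail of "wintergreen".
my @green = "wintergreen green" =~ /\bgreen\b/g;

# ^ honours /m and matches at each embedded line; \A matches only at the
# very beginning of the string.
my $text          = "one\ntwo";
my @line_starts   = $text =~ /^(\w+)/mg;
my @string_starts = $text =~ /\A(\w+)/mg;

# \G: continue exactly where the previous m//g on the same string stopped.
my $s = "aaabbb";
$s =~ /a+/g;                                  # leaves pos($s) at 3
my $rest = $s =~ /\G(b+)/g ? $1 : undef;

print scalar(@green), " | @line_starts | @string_starts | $rest\n";
```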
Brackets delimit sub-patterns, allowing the resultant matches in the
target string to be referenced using either \1 ... \n within the pattern
itself, or $1 ... $n outside of the pattern. If the '(' is followed by
'?:', it delimits a subpattern without the match being
saved. |
$+ |
the text matched by the last bracket that matched (useful when there are alternatives) |
$& |
the matched string |
$` |
everything before the matched string |
$' |
everything after the matched string |
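The bracket references and the match variables can be seen together in a small sketch; the sample sentences and variable names are assumptions.

```perl
use strict;
use warnings;

my $line = "the cat sat on the the mat";
my ($doubled, $before, $match, $after);

# \1 inside the pattern repeats whatever the first bracket matched.
if ($line =~ /\b(\w+) \1\b/) {
    ($doubled, $before, $match, $after) = ($1, $`, $&, $');
}

# $+ holds the text of the last bracket that matched, whichever alternative it was in.
my $rev;
$rev = $+ if "Revision: 7" =~ /Version: (.*)|Revision: (.*)/;

# (?:...) groups without saving, so only two values come back.
my @parts = "2024-01-15" =~ /(\d+)(?:-|\/)(\d+)/;

print "[$before][$match][$after] $doubled $rev @parts\n";
```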
The extension syntax for regular expressions uses a pair of brackets
where the first character within the brackets is a question mark '(?...)'. |
(?#comment) |
A comment - ignored |
(?:regexp) |
Groups the pattern as with brackets, but doesn't generate back references |
(?=regexp) |
A zero-width positive lookahead assertion. For example, /\w+(?=\t)/
matches a word followed by a tab, without including the tab in $&. |
(?!regexp) |
A zero-width negative lookahead assertion. For example:
/abc(?!xyz)/
matches any occurrence of "abc" that isn't followed by "xyz". This
cannot be used for lookbehind: /(?!abc)xyz/ will not find an occurrence
of "xyz" that is preceded by something which is not "abc". The (?!abc)
only ensures that the next thing is not "abc" - and it's not, it's "xyz",
so "abcxyz" will match. You would have to do something like /(?!abc)...xyz/
for that, but there's the case where the "xyz" does not have three characters
before it. This could be covered by:
/(?:(?!abc)...|^.{0,2})xyz/
It may be easier to say:
if (/xyz/ && $` !~ /abc$/) |
(?imsx) |
One or more embedded pattern-match modifiers. Useful for patterns that
are specified in a table somewhere, some of which want to be case sensitive,
and some of which don't. The case insensitive ones merely need to include
(?i) at the front of the pattern. For example:
$pattern = "abcxyz";
if ( /$pattern/i ) { print "matched\n" }
# more flexible:
$pattern = "(?i)abcxyz";
if ( /$pattern/ ) { print "matched\n" } |
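The extension constructs above can be exercised in one short sketch. The sample strings, the pattern table and the helper name xyz_without_abc are hypothetical, introduced only for illustration.

```perl
use strict;
use warnings;

# (?=...): the tab must follow the word but is not part of the match.
my ($word) = "name\tvalue" =~ /(\w+(?=\t))/;

# (?!...): "abc" not followed by "xyz".
my @hits = grep { /abc(?!xyz)/ } qw(abcdef abcxyz);

# Lookahead is not lookbehind: this still matches "abcxyz".
my $still = "abcxyz" =~ /(?!abc)xyz/ ? 1 : 0;

# Hypothetical helper: "xyz" not preceded by "abc", including the case
# where fewer than three characters come before it.
sub xyz_without_abc {
    my ($s) = @_;
    return $s =~ /(?:(?!abc)...|^.{0,2})xyz/ ? 1 : 0;
}

# (?i) lets each pattern in a table choose its own case sensitivity.
my @table   = ("abcxyz", "(?i)abcxyz");
my @results = map { "ABCXYZ" =~ /$_/ ? 1 : 0 } @table;

print "$word | @hits | $still | @results\n";
```

Embedding the modifier in the pattern string means the code that applies the table needs no per-entry flags.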
|
|