Perl Pattern Matching

Patterns

Patterns are subject to an additional level of interpretation as a regular expression. This is done as a second pass, after variables are interpolated, so that regular expressions may be incorporated into the pattern from the variables. If this is not what you want, use \Q to interpolate a variable literally.
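A minimal sketch of the difference, assuming a hypothetical search string $needle that happens to contain a metacharacter (the variable names and sample values are illustrative only):

```perl
use strict;
use warnings;

# $needle and @candidates are illustrative values, not part of the text above.
my $needle     = 'a.c';
my @candidates = ('abc', 'a.c');

# Interpolated as a regular expression: '.' matches any character.
my @as_regex = grep { /$needle/ } @candidates;         # ('abc', 'a.c')

# Interpolated literally: \Q backslashes the '.', \E ends the quoting.
my @as_literal = grep { /\Q$needle\E/ } @candidates;   # ('a.c')
```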

?PATTERN?

This is just like the /pattern/ search, except that it matches only once between calls to the reset() operator. This is a useful optimization when you only want to see the first occurrence of something in each file of a set of files, for instance. Only ?? patterns local to the current package are reset.
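The once-only behaviour can be sketched as follows. The helper first_foo and the sample lines are hypothetical, and the sketch uses the m?...? spelling because newer Perl releases no longer accept the bare ?...? form:

```perl
use strict;
use warnings;

# Hypothetical helper: collect the lines on which the once-only pattern fires.
sub first_foo {
    my @hits;
    for (@_) {
        push @hits, $_ if m?foo?;   # matches at most once until reset()
    }
    return @hits;
}

my @lines  = ('foo 1', 'bar', 'foo 2');
my @first  = first_foo(@lines);   # ('foo 1')
my @second = first_foo(@lines);   # empty: the pattern has already matched
reset;                            # re-arm the ?? patterns of this package
my @third  = first_foo(@lines);   # ('foo 1') again
```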

m/PATTERN/gimosx

Searches a string for a pattern match, and in a scalar context returns true (1) or false (''). If no string is specified via the =~ or !~ operator, the $_ string is searched. The string specified with =~ can be a variable or the result of an expression evaluation. The initial 'm' can be omitted if '/' is used as the delimiter; when the 'm' is present, any non-alphanumeric, non-whitespace character can serve as the delimiter.

The modifier options are:

g	Match globally, i.e. find all occurrences.
i	Do case-insensitive pattern matching.
m	Treat string as multiple lines - default is to assume just a single line in the string (no embedded newlines). See $*.
o	Only compile pattern once, even if variables within it change.
s	Treat string as single line.
x	Use extended regular expressions. Whitespace that is not backslashed or within a character class is ignored, allowing the regular expression to be broken into more readable parts with embedded comments.
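As a small illustration of the /x and /i modifiers, here is a sketch that pulls a year and month out of a made-up input line (the variable names and the sample string are assumptions):

```perl
use strict;
use warnings;

my $line = 'Date:   2024 - 05';   # illustrative input

# /x: whitespace and comments inside the pattern are ignored,
# so literal whitespace has to be written as \s.
my ($year, $month) = $line =~ m{
    (\d{4})     # four-digit year
    \s* - \s*   # separator, optionally padded
    (\d{2})     # two-digit month
}x;

# /i: letter case is ignored.
my $has_date = ($line =~ /^date:/i) ? 1 : 0;
```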

In a list context, the pattern match returns the portions of the target string that match the expressions within the pattern in brackets. With the /g modifier in a scalar context, each iteration identifies the next match, and pos() returns the offset at which the previous match on that variable ended.
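The context rules can be sketched like this, using made-up sample strings:

```perl
use strict;
use warnings;

# List context, no /g: the bracketed portions are returned.
my ($y, $m, $d) = '2024-05-17' =~ /(\d+)-(\d+)-(\d+)/;

# List context with /g: every match is returned at once.
my $text = 'a1b22c333';
my @all  = $text =~ /\d+/g;      # ('1', '22', '333')

# Scalar context with /g: one match per iteration; pos() reports
# the offset just past the most recent match.
my @ends;
while ($text =~ /\d+/g) {
    push @ends, pos($text);      # 2, 5, 9
}
```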

q/STRING/, 'STRING'

A single-quoted, literal string; the default delimiters are single quotes ('...'). Backslashes are ignored, unless followed by the delimiter or another backslash, in which case the delimiter or backslash is interpolated.

qq/STRING/, "STRING"

A double-quoted, interpolated string; the default delimiters are double quotes ("...").
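A short sketch contrasting the two quoting forms; $name and the sample strings are illustrative:

```perl
use strict;
use warnings;

my $name = 'World';   # illustrative value

my $literal      = q{Hello $name\n};    # no interpolation: 13 characters
my $interpolated = qq{Hello $name\n};   # "Hello World" plus a newline

# In q//, a backslash survives unless it precedes the delimiter or a backslash.
my $slash     = q/a\/b/;    # a/b
my $backslash = q/a\\b/;    # a\b
```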

qx/STRING/, `STRING`

A string which is interpolated and then executed as a system command. The collected standard output of the command is returned. In scalar context, it comes back as a single (potentially multi-line) string. In list context, returns a list of lines (depending on how the $/ delimiter is specified).
    $today = qx{ date };
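The difference between the two contexts might look like this. The sketch assumes a POSIX-style shell in which echo is available; on other systems the command would need to change:

```perl
use strict;
use warnings;

# Assumes a POSIX-style shell in which `echo` is available.
my $output = qx{echo one; echo two};   # scalar context: one two-line string
my @lines  = qx{echo one; echo two};   # list context: one element per line
chomp @lines;                          # strip the trailing $/ from each line
```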

qw/STRING/

Returns a list of the words extracted out of STRING, using embedded whitespace as the word delimiters. It is exactly equivalent to:
    split(' ', q/STRING/);

Some frequently seen examples:
    use POSIX qw( setlocale localeconv );
    @EXPORT = qw( foo bar baz );
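The equivalence with split can be checked directly; the word list here is just the one from the example above:

```perl
use strict;
use warnings;

my @from_qw    = qw( foo bar baz );
my @from_split = split(' ', q/ foo bar baz /);   # leading whitespace is skipped
```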

s/PATTERN/REPLACEMENT/egimosx

Searches a string for a pattern, and if found, replaces that pattern with the replacement text and returns the number of substitutions made. Otherwise it returns false (0).

If no string is specified via the =~ or !~ operator, the $_ variable is searched and modified. (The string specified with =~ must be a scalar variable, an array element, a hash element, or an assignment to one of those, i.e. an lvalue.)

If the delimiter chosen is single quote, no variable interpolation is done on either the PATTERN or the REPLACEMENT. Otherwise, if the PATTERN contains a $ that looks like a variable rather than an end-of-string test, the variable will be interpolated into the pattern at run-time. If you only want the pattern compiled once the first time the variable is interpolated, use the /o option. If the pattern evaluates to a null string, the most recently executed (and successfully compiled) regular expression is used instead.

The modifier options are (see m/PATTERN/ above for more detailed descriptions of common modifiers):

e	Evaluate the right side as an expression.
g	Match globally, i.e. all occurrences.
i	Case-insensitive pattern matching.
m	Treat string as multiple lines.
o	Only compile pattern once, even if variables within it change.
s	Treat string as single line.
x	Use extended regular expressions.

Any non-alphanumeric, non-whitespace delimiter may replace the slashes. If single quotes are used, no interpretation is done on the replacement string (the /e modifier overrides this, however). In Perl 4, backquote delimiters caused the replacement string to be run as a command whose output was used as the actual replacement text; Perl 5 treats backquotes as ordinary delimiters. If the PATTERN is delimited by bracketing quotes, the REPLACEMENT has its own pair of quotes, which may or may not be bracketing quotes, e.g. s(foo)(bar) or s<foo>/bar/. A /e will cause the replacement portion to be interpreted as a full-fledged Perl expression and eval()ed right then and there. It is, however, syntax checked at compile-time.

Examples:
    s/\bgreen\b/mauve/g;                # don't change wintergreen
    $path =~ s|/usr/bin|/usr/local/bin|;
    s/Login: $foo/Login: $bar/; # run-time pattern
    ($foo = $bar) =~ s/this/that/;
    $count = ($paragraph =~ s/Mister\b/Mr./g);
    $_ = 'abc123xyz';
    s/\d+/$&*2/e;               # yields 'abc246xyz'
    s/\d+/sprintf("%5d",$&)/e; # yields 'abc  246xyz'
    s/\w/$& x 2/eg;             # yields 'aabbcc  224466xxyyzz'
    s/%(.)/$percent{$1}/g;      # change percent escapes; no /e
    s/%(.)/$percent{$1} || $&/ge;       # expr now, so /e
    s/^=(\w+)/&pod($1)/ge;      # use function call
    # /e's can even nest; this will expand
    # simple embedded variables in $_
    s/(\$\w+)/$1/eeg;
    # Delete C comments.
    $program =~ s {
        /\*     (?# Match the opening delimiter.)
        .*?     (?# Match a minimal number of characters.)
        \*/     (?# Match the closing delimiter.)
    } []gsx;
    s/^\s*(.*?)\s*$/$1/;        # trim white space
    s/([^ ]*) *([^ ]*)/$2 $1/; # reverse 1st two fields
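A few of the idioms above, put into a self-contained form. The sample strings and the %percent table are hypothetical; only the substitution forms come from the examples:

```perl
use strict;
use warnings;

# Return value: the number of substitutions made.
my $paragraph = 'Mister Smith met Mister Jones';
my $count = ($paragraph =~ s/Mister\b/Mr./g);   # 2

# Copy first, then modify the copy; the original stays untouched.
my $bar = 'this and this';
(my $foo = $bar) =~ s/this/that/;               # 'that and this'

# /e: the replacement is an expression. %percent is a hypothetical table.
my %percent = (u => 'user', p => '%');
my $msg = 'hi %u, 100%p done, %z';
$msg =~ s/%(.)/$percent{$1} || $&/ge;           # unknown escapes are kept
```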

Occasionally, you can't just use a /g to get all the changes to occur. Here are two common cases:
    # put commas in the right places in an integer
    1 while s/(.*\d)(\d\d\d)/$1,$2/g;      # perl4
    1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/g; # perl5
    # expand tabs to 8-column spacing
    1 while s/\t+/' ' x (length($&)*8 - length($`)%8)/e;
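Wrapped in subroutines, the two loops might be used as follows. The helper names commify and expand_tabs are assumptions made for this sketch; the substitutions themselves are the perl5 forms shown above:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the perl5 comma idiom.
sub commify {
    my ($n) = @_;
    1 while $n =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/g;
    return $n;
}

# Hypothetical helper wrapping the tab-expansion idiom.
sub expand_tabs {
    my ($line) = @_;
    1 while $line =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
    return $line;
}

my $big  = commify(1234567);      # '1,234,567'
my $wide = expand_tabs("a\tb");   # 'a', seven spaces, 'b'
```

Each pass of the comma loop can only insert a comma in front of the last run of three digits that is not yet followed by one, which is why a single /g pass is not enough.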

tr/SEARCHLIST/REPLACEMENTLIST/cds
y/SEARCHLIST/REPLACEMENTLIST/cds

Translates all occurrences of the characters found in the search list into the corresponding characters in the replacement list. It returns the number of characters replaced or deleted. If no string is specified via the =~ or !~ operator, the $_ string is translated. (The string specified with =~ must be a scalar variable, an array element, or an assignment to one of those, i.e. an lvalue.) For sed devotees, y is provided as a synonym for tr. If the SEARCHLIST is delimited by bracketing quotes, the REPLACEMENTLIST has its own pair of quotes, which may or may not be bracketing quotes, e.g. tr[A-Z][a-z] or tr(+-*/)/ABCD/.

Options are:

c	Complement the SEARCHLIST.
d	Delete found but unreplaced characters.
s	Squash duplicate replaced characters.

If the /c modifier is specified, the SEARCHLIST character set is complemented. If the /d modifier is specified, any characters specified by SEARCHLIST not found in REPLACEMENTLIST are deleted. (Note that this is slightly more flexible than the behavior of some tr programs, which delete anything they find in the SEARCHLIST, period.) If the /s modifier is specified, sequences of characters that were translated to the same character are squashed down to a single instance of the character.

If the /d modifier is used, the REPLACEMENTLIST is always interpreted exactly as specified. Otherwise, if the REPLACEMENTLIST is shorter than the SEARCHLIST, the final character is replicated till it is long enough. If the REPLACEMENTLIST is null, the SEARCHLIST is replicated. This latter is useful for counting characters in a class or for squashing character sequences in a class.

Examples:
    $ARGV[1] =~ tr/A-Z/a-z/;    # canonicalize to lower case
    $cnt = tr/*/*/;             # count the stars in $_
    $cnt = $sky =~ tr/*/*/;     # count the stars in $sky
    $cnt = tr/0-9//;            # count the digits in $_
    tr/a-zA-Z//s;               # bookkeeper -> bokeper
    ($HOST = $host) =~ tr/a-z/A-Z/;
    tr/a-zA-Z/ /cs;             # change non-alphas to single space
    tr [\200-\377]
       [\000-\177];             # delete 8th bit
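The modifiers and the return value can be seen together in a short sketch. The sample strings are made up; the tr forms mirror the examples above:

```perl
use strict;
use warnings;

my $phone = '+1 (555) 010-9999';   # illustrative value

# Empty REPLACEMENTLIST without /d: nothing changes, the count is returned.
my $digits = ($phone =~ tr/0-9//);                  # 11

# /c complements the SEARCHLIST and /d deletes: keep only the digits.
(my $bare = $phone) =~ tr/0-9//cd;                  # '15550109999'

# /s squashes runs of identical translated characters.
(my $squashed = 'bookkeeper') =~ tr/a-zA-Z//s;      # 'bokeper'

# /c plus /s: each run of non-letters becomes a single space.
(my $words = 'hello, world!!') =~ tr/a-zA-Z/ /cs;   # 'hello world '
```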

Note that because the translation table is built at compile time, neither the SEARCHLIST nor the REPLACEMENTLIST is subjected to double quote interpolation. That means that if you want to use variables, you must use an eval():
    eval "tr/$oldlist/$newlist/";
    die $@ if $@;
    eval "tr/$oldlist/$newlist/, 1" or die $@;
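A sketch of the eval() approach wrapped in a subroutine. The name translate is hypothetical, and the sketch assumes the lists come from a trusted source and contain no '/' character, since they are pasted into Perl source text:

```perl
use strict;
use warnings;

# Hypothetical helper. The lists are pasted into Perl source, so they must
# come from a trusted place and must not contain the '/' delimiter.
sub translate {
    my ($text, $oldlist, $newlist) = @_;
    eval "\$text =~ tr/$oldlist/$newlist/, 1" or die $@;
    return $text;
}

my $shifted = translate('hello', 'a-y', 'b-z');   # 'ifmmp'
```

The trailing ", 1" makes the eval return true even when tr changes nothing, so a false result reliably signals a compilation error.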

Regular Expressions

The patterns used in pattern matching are regular expressions that follow the rules laid out below.

Any single character (or series of characters) matches directly, unless it is a metacharacter with a special meaning. You can cause characters which normally function as metacharacters to be interpreted literally by prefixing them with a "\" (e.g. "\." matches a ".", not any character; "\\" matches a "\"). A series of characters matches that series of characters in the target string, so the pattern zyxwv would match "zyxwv" in the target string.

The following metacharacters are supported:

\	Quote the next metacharacter. A backslash also introduces escape sequences (\n, \t etc., apart from \b - see below), ASCII characters ('\nnn' for octal and '\xnn' for hex), and ASCII control characters ('\cx'). A backslash followed by a digit n ('\1', '\2', ...) matches again the text that the nth subpattern actually matched (not its complete set of rules).
^	Match just the beginning of the string, or with the /m modifier the beginning of any embedded line
.	Match any character (except newline unless the /s modifier is used)
$	Match just the end of the string, or with the /m modifier the end of any embedded line
|	Alternation - to match any one of a set of patterns, usually grouped in brackets.
()	Grouping of subpatterns, numbered automatically left to right by the sequence of their opening parenthesis.
[]	Character class, matching any of the characters in the enclosed list. A '^' as the first character in the list negates the class, so that it matches any character not in the list.
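The metacharacters can be exercised with a few one-line matches. The sample strings are illustrative, and each result is stored as 1 or 0 to make the outcome explicit:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

my @string_start = $text =~ /^(\w+)/g;    # ('first'): ^ is start of string only
my @line_starts  = $text =~ /^(\w+)/mg;   # ('first', 'second') with /m

my $dot_plain  = ($text =~ /first.*second/)  ? 1 : 0;   # 0: '.' stops at newline
my $dot_single = ($text =~ /first.*second/s) ? 1 : 0;   # 1: /s lets '.' match it

my $escaped = ('3x14' =~ /^3\.14$/)     ? 1 : 0;   # 0: '\.' is a literal dot
my $either  = ('grey' =~ /^gr(a|e)y$/)  ? 1 : 0;   # 1: alternation in a group
my $negated = ('b'    =~ /^[^aeiou]$/)  ? 1 : 0;   # 1: not a vowel
```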

The following quantifiers are supported:

*	Match 0 or more times (equivalent to {0,})
+	Match 1 or more times (equivalent to {1,})
?	Match 0 or 1 times (equivalent to {0,1})
{n}	Match exactly n times
{n,}	Match at least n times
{n,m}	Match at least n but not more than m times

By default, a pattern quantified as above matches as many times as possible without causing the rest of the match to fail. To match the fewest number of times instead (for instance, so that each of several occurrences is matched separately rather than as one long span), suffix the quantifier with '?', e.g. '*?', '{n,m}?'.
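The difference between maximal and minimal matching is easiest to see side by side. The markup string below is just sample data:

```perl
use strict;
use warnings;

my $html = '<b>bold</b> and <i>italic</i>';   # illustrative input

my ($greedy)  = $html =~ /<(.+)>/;    # 'b>bold</b> and <i>italic</i'
my ($minimal) = $html =~ /<(.+?)>/;   # 'b'

my ($most)   = 'aaaa' =~ /(a{2,3})/;    # 'aaa'
my ($fewest) = 'aaaa' =~ /(a{2,3}?)/;   # 'aa'

# With /g, minimal matching finds each tag separately.
my @tags = $html =~ /<(.+?)>/g;         # ('b', '/b', 'i', '/i')
```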

Regular expressions also support the following constructs:

Single character matches:

\w	a "word" character (alphanumeric plus "_")
\W	a non-word character
\s	a whitespace character
\S	a non-whitespace character
\d	a digit character
\D	a non-digit character

Zero width matches:

\b	a word boundary
\B	a non-(word boundary)
\A	beginning of the string (not embedded newlines)
\Z	end of the string (not embedded newlines)
\G	where previous m//g left off
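A brief sketch of some of these constructs in use, with made-up sample strings:

```perl
use strict;
use warnings;

my @words = 'one, two  three' =~ /\w+/g;           # ('one', 'two', 'three')

# \b keeps 'green' from matching inside 'wintergreen'.
my @green = 'wintergreen green' =~ /\bgreen\b/g;   # one match

# \A never matches after an embedded newline, even with /m; ^ does.
my $caret = ("a\nb" =~ /^b/m)  ? 1 : 0;   # 1
my $bigA  = ("a\nb" =~ /\Ab/m) ? 1 : 0;   # 0

# \G continues exactly where the previous m//g on the string stopped.
my $s     = 'aaabbb';
my $first = $s =~ /a+/g;         # scalar context: leaves pos($s) at 3
my $ok    = $s =~ /\G(b+)/g;     # anchored at offset 3
my $rest  = $1;                  # 'bbb'
```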

Brackets delimit sub-patterns, allowing the resultant matches in the target string to be referenced using either \1 ... \n within the pattern itself, or $1 ... $n outside of the pattern. If the '(' is followed by '?:', the brackets delimit a subpattern without the match being saved.

$+	the text matched by the last bracket that matched (useful when there are alternatives and you don't know which one matched)
$&	the matched string
$`	everything before the matched string
$'	everything after the matched string
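The back references and match variables can be sketched as follows; the sample strings are illustrative:

```perl
use strict;
use warnings;

my ($before, $matched, $after);
if ('hello world' =~ /o w/) {
    ($before, $matched, $after) = ($`, $&, $');   # 'hell', 'o w', 'orld'
}

# \1 inside the pattern finds a doubled word; $1 outside reads it back.
my $doubled;
if ('this is is a test' =~ /\b(\w+) \1\b/) {
    $doubled = $1;                                # 'is'
}

# $+ is the last bracket that took part in the match.
my $last;
if ('bar' =~ /(foo)|(bar)/) {
    $last = $+;                                   # 'bar'
}
```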

The extension syntax for regular expressions uses a pair of brackets where the first character within the brackets is a question mark '(?...)'.

(?#comment)	A comment - ignored
(?:regexp)	Groups the pattern as with brackets, but doesn't generate back references
(?=regexp)	A zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in $&.
(?!regexp)	A zero-width negative lookahead assertion. For example, /abc(?!xyz)/ matches any occurrence of "abc" that isn't followed by "xyz". This cannot be used for lookbehind: /(?!abc)xyz/ will not find an occurrence of "xyz" that is preceded by something other than "abc". The (?!abc) only asserts that the next thing is not "abc", and it isn't, it's "xyz", so "abcxyz" will match. You would have to write something like /(?!abc)...xyz/ instead, but that misses the case where "xyz" has fewer than three characters before it. That case can be covered by /(?:(?!abc)...|^.{0,2})xyz/. It may be easier to say: if (/xyz/ && $` !~ /abc$/)
(?imsx)	One or more embedded pattern-match modifiers. Useful for patterns that are specified in a table somewhere, some of which want to be case sensitive, and some of which don't. The case insensitive ones merely need to include (?i) at the front of the pattern. For example:
    $pattern = "abcxyz";
    if ( /$pattern/i ) { }
    # more flexible:
    $pattern = "(?i)abcxyz";
    if ( /$pattern/ ) { }
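The extension constructs can be tried out with a handful of sample strings. This is a sketch under stated assumptions: the inputs are made up, and the emulated lookbehind uses ^.{0,2} for the short-prefix case so that an "xyz" at the very start of the string is covered too:

```perl
use strict;
use warnings;

# (?=...) looks ahead without consuming: the tab is not part of $&.
my $word;
if ("word\tnext" =~ /\w+(?=\t)/) { $word = $&; }   # 'word'

# (?:...) groups without capturing, so the year is still $1.
my ($year) = 'May 2024' =~ /(?:Apr|May) (\d+)/;

# A leading (?!abc) is not a lookbehind: this still matches.
my $naive = ('abcxyz' =~ /(?!abc)xyz/) ? 1 : 0;    # 1

# Emulated lookbehind; ^.{0,2} covers "xyz" with fewer than three
# characters in front of it, including none at all.
my @ok = grep { /(?:(?!abc)...|^.{0,2})xyz/ } ('abcxyz', 'defxyz', 'xyz');

# (?i) carries the case-insensitivity inside the pattern itself.
my $pattern = '(?i)abcxyz';
my $ci = ('ABCxyz' =~ /$pattern/) ? 1 : 0;         # 1
```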