Grouping with Regex

Grouping is an insanely powerful regex feature that is often underutilized. Sure, a lot of PHP developers know that if you wrap part of a pattern in '()' then you can pull that chunk out using preg_match and the optional $matches variable, but there's so much more you can do. Direct capturing is just the most obvious use.

First, though, it's important to realize that parentheses in regexes are for grouping. In a pattern you can group sections to limit the scope of quantifiers, set alternations between sets, and capture blocks for later parsing. Capturing subpatterns, which PHP's native preg_* functions rely on, is merely a side effect of that last use.
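Here's a quick sketch of the first two uses, scoping a quantifier and scoping an alternation. The subject strings are made up for illustration.

```php
<?php
// Without a group, the quantifier applies only to the last character:
// 'a' followed by one or more 'b'.
$ungrouped = preg_match('/^ab+$/', 'ababab');          // 0
// With a group, the quantifier applies to the whole 'ab' sequence.
$grouped = preg_match('/^(ab)+$/', 'ababab');          // 1

// Without a group, alternation splits the entire pattern in two:
// either '^gray' or 'grey$'.
$looseAlt = preg_match('/^gray|grey$/', 'grayscale');  // 1
// With a group, alternation is limited to what is inside the parentheses.
$scopedAlt = preg_match('/^gr(a|e)y$/', 'grayscale');  // 0
```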

Non-Capturing Groups

The default capture behavior can get annoying after a while. If you're trying to parse an HTML element and want to detect either single or double quotes around attributes, or even have optional attributes, capturing every group clutters up $matches. You can make a group non-capturing by putting '?:' immediately after the opening parenthesis.

    $string = '<a href="url" title="click_me">anchor</a>';

    $messy = '/<a href=("|\')(\w+)("|\')';
    $messy .= '( title=("|\')(\w+)("|\'))?>(\w+)<\/a>/';
    preg_match($messy, $string, $matches);
    // $matches = [$string, '"', 'url', '"',
    //     ' title="click_me"', '"', 'click_me', '"', 'anchor'];
    // note: preg_* always returns the full match in [0], with the groups after it

    $clean = '/<a href=(?:"|\')(\w+)(?:"|\')';
    $clean .= '(?: title=(?:"|\')(\w+)(?:"|\'))?>(\w+)<\/a>/';
    // the '?:' makes each of those a non-capturing group
    preg_match($clean, $string, $matches);
    // $matches = [$string, 'url', 'click_me', 'anchor'];

The same sort of logic applies when doing alternation between groups. By default each individual group is captured with its own index, which can get messy. If you open the outer group with '?|' (a branch reset), then every alternative reuses the same group numbers, so only a single capture is returned for the section.

    $string = 'Mt Ives near Lake Ives';

    $messy = '/((Mount(?:ain)?)|(Mt\.?))/';
    preg_match($messy, $string, $matches);
    // $matches = ['Mt', 'Mt', '', 'Mt'];

    $clean = '/(?|(Mount(?:ain)?)|(Mt\.?))/';
    preg_match($clean, $string, $matches);
    // $matches = ['Mt', 'Mt'];
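Where this pays off is when the code reading the result shouldn't care which branch matched. A minimal sketch, assuming two made-up date formats and a hypothetical helper named extractYear:

```php
<?php
// Hypothetical helper: returns the year from 'YYYY-MM-DD' or 'DD/MM/YYYY',
// or null if the string is in neither format.
function extractYear(string $date): ?string
{
    // Both branches define group 1, so the caller never has to check
    // which alternative actually matched.
    $pattern = '/^(?|(\d{4})-\d{2}-\d{2}|\d{2}\/\d{2}\/(\d{4}))$/';

    return preg_match($pattern, $date, $m) === 1 ? $m[1] : null;
}

$iso = extractYear('2009-11-28');   // '2009'
$euro = extractYear('28/11/2009');  // '2009'
```

Without the '?|', the second format would put the year in group 2 and leave group 1 empty.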

Named Groups

Indexed groups are great for referencing, though they can get overwhelming when you're dealing with complex patterns. Plus, having named keys is just good practice for readability. You name a capture by opening the group with '?P<name>', and the value then shows up in $matches under that key. Unfortunately, PHP's preg_* functions still return the indexed groups as well as the named ones, creating almost twice as many entries, which doesn't really simplify things.

    $string = '<span class="caption">words</span>';

    $clean = '/<span class="(?P<class>\w+)">words<\/span>/';
    preg_match($clean, $string, $matches);
    // $matches = [0 => $string, 'class' => 'caption', 1 => 'caption'];
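If the duplicate numeric entries bother you, one way around them (a sketch, not the only option) is to filter $matches down to its string keys after the fact. The second named group, 'text', is added here just for illustration.

```php
<?php
$string = '<span class="caption">words</span>';
$pattern = '/<span class="(?P<class>\w+)">(?P<text>\w+)<\/span>/';
preg_match($pattern, $string, $matches);

// Keep only the entries whose key is a string, dropping 0, 1, 2, ...
$named = array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
// $named = ['class' => 'caption', 'text' => 'words'];
```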

Atomic Groupings

Backtracking is what lets a regex engine give up part of a match and try another path, and it opens up some powerful options. It is also one of the slowest parts of regex matching. Luckily, you can limit backtracking within groups by using '?>'. With this symbol you can make the group selfish (the usual term is atomic), meaning that once the group has matched and the parser has moved on, it refuses to backtrack into the group to try anything else. Note that an atomic group does not capture.

    $string = "Only cool if you're brogrammers from sun-up to sun-down";

    $slow = '/\b(brogrammer|bro)\b/';
    preg_match($slow, $string);
    // the first option matches 'brogrammer', but \b fails at the trailing 's'
    // then it backtracks into the group and tries the second option,
    // which matches 'bro' and fails at \b as well

    $fast = '/\b(?>brogrammer|bro)\b/';
    preg_match($fast, $string);
    // once the first option matches, the group throws away the second option
    // when \b fails at 's' the attempt fails outright without trying 'bro'
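Atomic groups can also change what matches, not just how fast it fails, so the order of the alternatives matters. A small sketch with the shorter option listed first (the subject string is made up):

```php
<?php
$subject = 'brogrammer';

// Normal group: 'bro' matches, \b fails before 'g', so the engine backtracks
// into the group and tries 'brogrammer', which succeeds.
$normal = preg_match('/\b(bro|brogrammer)\b/', $subject, $m1);    // 1

// Atomic group: once 'bro' has matched, the other alternative is discarded,
// so the failing \b cannot be recovered from and there is no match.
$atomic = preg_match('/\b(?>bro|brogrammer)\b/', $subject, $m2);  // 0
```

Putting the longest alternative first, as in the patterns above, avoids that surprise.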

There are some other pieces to atomic grouping and backtracking that I'll be posting about separately. Regex optimization is a great topic to explore, one that opens up a lot of possibilities, and grouping is just one of several options available.

Lookarounds

Sometimes you may want to look around the cursor without moving it. Lookarounds, both positive and negative, give you that ability. For positive there is '?=' for lookahead and '?<=' for lookbehind, and for negative there is '?!' for lookahead and '?<!' for lookbehind. They are especially useful for checking placement or for minor optimization.

    $string = 'Words, more words, and even more words.';

    $pattern = '/(\w++)(?![,\.])/';
    // the pattern goes first, then the subject
    preg_match_all($pattern, $string, $matches);
    // $matches[1] = ['more', 'and', 'even', 'more'];
    // only matches words that are not followed by punctuation
    // the possessive '++' stops \w from giving back letters to dodge the lookahead
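Lookbehinds work the same way in the other direction. A minimal sketch, using a made-up receipt string, that grabs numbers based on whether a dollar sign sits in front of them:

```php
<?php
$string = 'Coffee $45, bagel $30, tip 12';

// (?<=\$) requires a dollar sign just before the digits without consuming it.
preg_match_all('/(?<=\$)\d+/', $string, $matches);
// $matches[0] = ['45', '30'];

// (?<!\$) is the negative form. The \b keeps the engine from starting
// in the middle of a number such as the '5' in '$45'.
preg_match_all('/\b(?<!\$)\d+/', $string, $unpriced);
// $unpriced[0] = ['12'];
```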

Conditionals

This is one of my favorite grouping options. Not every regex engine supports conditionals, but PHP's PCRE does, and they give you the power to write basic if/else logic in regex. The format is '(?(if)then|else)', and you can use a lookaround or the number of a capturing group in the test area.

    $string = 'HTTP/1.x 200 OK' . "\n";
    $string .= 'Date: Sat, 28 Nov 2009 04:36:25 GMT' . "\n";
    $string .= 'Expires: Sat, 28 Nov 2009 05:36:25 GMT';

    $pattern = '/(?:(HTTP)\/\d\.[a-z]|(Date|Expires):)';
    // if group 1 (HTTP) matched, expect a status code, else a date string
    $pattern .= ' (?(1)(\d+)|([A-Za-z ,\d:]+))/';
    preg_match_all($pattern, $string, $matches);
    // $matches[3][0] = '200';
    // $matches[4][1] = 'Sat, 28 Nov 2009 04:36:25 GMT';
    // $matches[4][2] = 'Sat, 28 Nov 2009 05:36:25 GMT';
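The 'else' part is optional. A smaller sketch of the same idea, assuming a hypothetical helper named isBalanced that accepts a word either bare or fully double-quoted:

```php
<?php
// Hypothetical helper: true for word or "word", false if a quote
// appears on only one side.
function isBalanced(string $value): bool
{
    // (")? optionally captures an opening quote as group 1.
    // (?(1)") demands a closing quote only if group 1 took part in the match.
    return preg_match('/^(")?\w+(?(1)")$/', $value) === 1;
}

$bare = isBalanced('word');       // true
$quoted = isBalanced('"word"');   // true
$lopsided = isBalanced('"word');  // false
```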

Grouping is pretty awesome. Of course, you can use groups the typical way, to capture and backreference and such, but parentheses have a lot of other uses too. Once you get the hang of some of them, especially atomic grouping and lookarounds, writing nicely optimized patterns becomes a lot easier.