Optimizing Regex

Many people avoid using regex in their application because of speed issues. When you're developing a site and are looking for sub-second response times a regex that takes a few hundred milliseconds is not very appealing. Especially when that slow pattern is being matched against several dozen configurations. Luckily, there are several easy ways to optimize a regex.

Speed issues can usually be attributed to the backtracking logic that engines utilize. Backtracking is when, due to a mismatch, the parser steps backwards to attempt to fit the pattern. Here's an example.

  1. $string = '<span class="broken-attribute>words</span>';

  2. $pattern = '/<span class="[^"]+">[^<]+<\/span>/';

The pattern will first match the literals from left to right, from the words 'span' and 'class' and the opening double quote. Then it will enter the first character class, a negated one, and the rest of the original string matches. Which is great (if a bit greedy), except the pattern is not done yet. After the character class the engine will look at the literal double quote. There is no double quote at the end of the string. So the engine will backtrack, taking one previously matched character out at a time. It isn't until this set is run out that the pattern will actually fail, which is correct.

  1. {literal} < s p a n \s c l a s s = "

  2. {character class} b r o k e n - a t t r i b u t e > w o r d s < / s p a n >

  3. {backtracked} > n a p s / < s d r o w > e t u b i r t t a - n e k o r b

  4. // backtracking was looking for the second literal " in the pattern

This was a simplified example, but it demonstrates how engines backtrack. There was no need to backtrack here, not with the negated character class, and we can control and restrict backtracking with several options.

Atomic Groupings

I brought up atomic groupings in my my post about regex grouping. I explained it as a way to adjust backtracking with alternations. The example in that post is pretty solid (yay brogrammers!) so I'll reuse it here.

  1. $string = "Only cool if you're brogramming from sun-up to sun-down";

  2. $slow = "/\b(brogrammer|bro)\b/";

  3. // this will try to match the first option first and fail at 'e'

  4. // than it will backtrack and try the second option

  5. $fast = "/\b(?>brogrammer|bro)\b/";

  6. // as soon as the first option matches 'b' it assumes the first option

  7. // when it fails at 'e' and completely fails and does not look at second

Ordering is important with atomic grouping. Once the alternation is fully matched, be it on the word brogramming, bro, or brogrammer, the engine considers the group done. Going back to the group if the second word break '\b' fails would be considered backtracking. Thus, '/\b(?>bro|brogrammer)\b/' would not match the string 'brogrammer'.

Possessive Quantifiers

By default, quantifers are greedy. They will try to match as much as possible, be they '+', '*', or '?'. You can make them lazy, though this doesn't affect the potentially slow backtracking behavior. The other option is to make them stubborn. Stubborn quantifers are greedy and will refuse to give up matches during backtracking, similar yet more universal than atomic groupings. To demonstrate, let's add some possessivenss to the first example.

  1. $string = '<span class="broken-attribute>words</span>';

  2. $slow = '/<span class="[^"]+">[^<]+<\/span>/';

  3. // this will try to eat up the entire string in the first negated character class

  4. // after, it will backtrack all the way before failing

  5. $fast = '/<span class="[^"]++">[^<]++<\/span>/';

  6. // this will again try to eat up the entire string in the first class

  7. // after, it will notice that there is no double quote at the end and give up

  8. // no backtracking!

Possessive quantifiers are very useful because they are so universal. You can use them like '++', '*+', or '?+' and wherever normal quantifiers would be used.

Non-Capturing Groups

For a much smaller bit of optimization you can also make unnecessary captures non-capturning. This will save the processing engine some memory usage, though the savings is usually quite small compared to the number of times an engine may backtrack. It's generally good practice to only capture what you need to keep the patterns simple anyways, with or without the minor memory benefit.

Backtracking is good - without it, many of the features of regex engines would not be possible. However, as with any good feature, misuse of the tool can lead to inefficiency and catastrophe. Most good features, anyways. By using atomic grouping and possessive quantifiers you can limit how much a pattern will backtrack and avoid long-running regexes. To debug and analyze patterns I recommend using a robust tool like regex101 and frequently checking on how many steps the pattern will take on example input.