Regular Expressions

Overview

Several AutoMate activities, notably Find text and Replace text, allow the use of "regular expressions" (abbreviated regex or regexp). A regular expression is a compact notation for describing a text pattern, which makes a wide range of searches possible with a single expression. This article describes the regular expression syntax used in AutoMate.

Fundamentals

Match anywhere - By default, a regular expression matches a substring anywhere inside the string to be searched. For example, the regular expression abc matches abc123, 123abc, and 123abcxyz. To require the match to occur only at the beginning or end, use an anchor.
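The "match anywhere" behavior can be sketched in Python. Python's re module is used here only as a stand-in for illustration; it is not the engine inside AutoMate, and details may differ. The test strings are the ones from the paragraph above, plus one hypothetical non-matching string.

```python
import re

# re.search scans the whole subject, so the pattern may match anywhere in it.
pattern = re.compile(r"abc")
subjects = ["abc123", "123abc", "123abcxyz", "ab123"]

# Map each subject to whether the pattern was found somewhere inside it.
found = {s: bool(pattern.search(s)) for s in subjects}
print(found)
```

Only ab123 fails, because it never contains the full sequence abc.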

Escaped characters - Most characters like abc123 can be used literally inside a regular expression. However, the characters \.*?+[{|()^$ must be preceded by a backslash to be seen as literal. For example, \. is a literal period and \\ is a literal backslash. Escaping can be avoided by using \Q...\E. For example: \QLiteral Text\E.
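A short Python sketch of escaping follows. It is an illustration under assumptions: Python's re module does not support \Q...\E, so re.escape is shown as the closest Python counterpart, and the sample strings are made up.

```python
import re

# With a backslash the period is literal; without it the dot matches any character.
literal_dot = re.search(r"3\.14", "pi is 3.14")
unescaped_dot = re.search(r"3.14", "3x14")
escaped_miss = re.search(r"3\.14", "3x14")

# re.escape backslash-escapes the special characters in a string so that
# the whole string is treated as literal text (similar in spirit to \Q...\E).
quoted = re.escape("1+1=2?")
quoted_hit = re.search(quoted, "is 1+1=2? yes")

print(literal_dot.group(), unescaped_dot.group(), escaped_miss, quoted_hit.group())
```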

Case-sensitive - By default, regular expressions are case-sensitive. This can be changed via the "i" option. For example, the pattern i)abc searches for "abc" without regard to case. See options for other modifiers.
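The effect of the "i" option can be illustrated in Python. Note the assumption: the i)abc prefix form is the syntax described in this article, while Python expresses the same option as a flag argument or as an inline (?i) group. The sample text is hypothetical.

```python
import re

text = "The ABC of regex"

# Default: case-sensitive, so lowercase "abc" is not found.
sensitive = re.search(r"abc", text)

# Case-insensitive via a flag argument...
insensitive = re.search(r"abc", text, re.IGNORECASE)

# ...or via an inline option at the start of the pattern.
inline = re.search(r"(?i)abc", text)

print(sensitive, insensitive.group(), inline.group())
```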

Common Syntax & Symbols

Syntax/Symbol

Description

.

A dot or period matches any single character (except newline: `r and `n). For example, ab. matches abc and abz and ab_ .

*

An asterisk matches zero or more of the preceding character, class, or sub-pattern. For example, a* matches ab and aaab. It also matches at the very beginning of any string that contains no "a" at all.

Wildcard: The dot-star pattern .* is one of the most permissive because it matches zero or more occurrences of any character (except newline: `r and `n). For example, abc.*123 matches abcAnything123 as well as abc123.
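The asterisk and dot-star behavior can be sketched in Python (used only as a stand-in engine). One difference to keep in mind: by default Python's dot excludes only the line feed character, so the newline case below uses a line feed.

```python
import re

# a* may match the empty string, so it succeeds at position 0 even with no "a".
m_empty = re.search(r"a*", "xyz")
m_run = re.search(r"a*", "aaab")

# .* bridges any text between the two literals, including no text at all.
m_any = re.search(r"abc.*123", "abcAnything123")
m_none = re.search(r"abc.*123", "abc123")

# The dot does not cross a newline by default.
m_newline = re.search(r"abc.*123", "abc\n123")

print(repr(m_empty.group()), m_run.group(), m_any.group(), m_none.group(), m_newline)
```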

?

A question mark matches zero or one of the preceding character, class, or sub-pattern. Think of this as "the preceding item is optional". For example, colou?r matches both color and colour because the "u" is optional.
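A minimal Python sketch of the optional item, using the colou?r pattern from the description. The third word is a hypothetical counterexample with two u's.

```python
import re

pattern = re.compile(r"colou?r")

# The "u" may appear zero times or once, but not twice.
words = ["color", "colour", "colouur"]
results = [bool(pattern.search(w)) for w in words]
print(results)
```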

+

A plus sign matches one or more of the preceding character, class, or sub-pattern. For example, a+ matches ab and aaab. Unlike a* and a?, however, the pattern a+ does not match at the beginning of strings that lack an "a" character.
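The contrast between a+ and a* can be checked with a short Python sketch (Python's re module standing in for illustration only):

```python
import re

# a+ needs at least one "a"; it takes the whole run when one exists.
plus_hit = re.search(r"a+", "aaab")
plus_miss = re.search(r"a+", "xyz")

# a* is satisfied by zero occurrences, so it matches empty text at position 0.
star_empty = re.search(r"a*", "xyz")

print(plus_hit.group(), plus_miss, repr(star_empty.group()))
```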

{min,max}

Matches between min and max occurrences of the preceding character, class, or sub-pattern. For example, a{1,2} matches ab but only the first two a's in aaab. Also, {3} means exactly 3 occurrences, and {3,} means 3 or more occurrences.

NOTE: The specified numbers must be less than 65536, and the first must be less than or equal to the second.
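A brief Python sketch of the counted forms. The subject strings other than ab and aaab are made up for illustration.

```python
import re

# {1,2}: one or two occurrences; the match takes as many as allowed.
single = re.search(r"a{1,2}", "ab")
one_or_two = re.search(r"a{1,2}", "aaab")

# {3}: exactly three; {3,}: three or more.
exactly_three = re.search(r"a{3}", "aaaab")
three_or_more = re.search(r"a{3,}", "aaaab")
too_few = re.search(r"a{3}", "aab")

print(single.group(), one_or_two.group(), exactly_three.group(),
      three_or_more.group(), too_few)
```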

[...]

Classes of characters: The square brackets enclose a list or range of characters (or both). For example, [abc] means "any single character that is either a, b or c". Using a dash in between creates a range; for example, [a-z] means "any single character that is between lowercase a and z (inclusive)". Lists and ranges may be combined; for example [a-zA-Z0-9_] means "any single character that is alphanumeric or underscore". A character class may be followed by *, ?, +, or {min,max}. For example, [0-9]+ matches one or more occurrences of any digit; thus it matches xyz123 but not abcxyz.

The following POSIX named sets are also supported via the form [[:xxx:]], where xxx is one of the following words: alnum, alpha, ascii (0-127), blank (space or tab), cntrl (control character), digit (0-9), xdigit (hex digit), print, graph (print excluding space), punct, lower, upper, space (whitespace), word (same as \w).

Within a character class, characters do not need to be escaped except when they have special meaning inside a class (i.e. [\^a], [a\-b], [a\]], and [\\a]).
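Character classes can be sketched in Python as follows. An assumption to note: Python's re module does not support the POSIX [[:xxx:]] named sets, so only lists, ranges, and escaped class characters are shown. The subject strings other than xyz123 and abcxyz are hypothetical.

```python
import re

# A range followed by + matches a run of digits.
digits = re.search(r"[0-9]+", "xyz123")
no_digits = re.search(r"[0-9]+", "abcxyz")

# Lists and ranges combined: alphanumerics plus underscore.
word_run = re.search(r"[a-zA-Z0-9_]+", "--user_42--")

# Inside a class, ^ - ] and \ are escaped to be taken literally.
specials = re.findall(r"[\^\-\]\\]", r"a^b-c]d\e")

print(digits.group(), no_digits, word_run.group(), specials)
```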

[^...]

Matches any single character that is not in the class. For example, [^/]* matches zero or more occurrences of any character that is not a forward-slash; applied to http://, it matches http: and stops at the first slash. Similarly, [^0-9xyz] matches any single character that isn't a digit and isn't the letter x, y, or z.
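A small Python sketch of negated classes. The second subject string is made up for illustration.

```python
import re

# [^/]* consumes everything up to, but not including, the first forward-slash.
before_slash = re.match(r"[^/]*", "http://")

# [^0-9xyz] keeps only characters that are neither digits nor x, y, z.
kept = re.findall(r"[^0-9xyz]", "a1xb2yc3z")

print(before_slash.group(), kept)
```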

\d

Matches any single digit (equivalent to the class [0-9]). Conversely, capital \D means "any non-digit". This and the other two below can also be used inside a class; for example, [\d.-] means "any single digit, period, or minus sign".

\s

Matches any single whitespace character, mainly space, tab, and newline (`r and `n). Conversely, capital \S means "any non-whitespace character".

\w

Matches any single "word" character, namely alphanumeric or underscore. This is equivalent to [a-zA-Z0-9_]. Conversely, capital \W means "any non-word character".
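The three shorthand classes \d, \s, and \w can be tried out in Python. One assumption is made explicit here: the re.ASCII flag is passed so that \d and \w mean [0-9] and [a-zA-Z0-9_] as described above, since Python otherwise also accepts non-ASCII digits and letters. The sample strings are hypothetical.

```python
import re

text = "Order 66: total -3.5 units"

# \d matches one digit at a time.
digits = re.findall(r"\d", text, re.ASCII)

# \d inside a class: runs of digits, periods, or minus signs.
numbers = re.findall(r"[\d.-]+", text, re.ASCII)

# \w+ collects word characters, including the underscore.
words = re.findall(r"\w+", "snake_case, two words", re.ASCII)

# \s+ treats spaces, tabs, and line breaks alike.
fields = re.split(r"\s+", "a b\tc\r\nd")

print(digits, numbers, words, fields)
```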

^ and $

Circumflex (^) and dollar sign ($) are called anchors because they don't consume any characters; instead, they tie the pattern to the beginning or end of the string being searched.

^ may appear at the beginning of a pattern to require the match to occur at the very beginning of a line. For example, ^abc matches abc123 but not 123abc.

$ may appear at the end of a pattern to require the match to occur at the very end of a line. For example, abc$ matches 123abc but not abc123.

The two anchors may be combined. For example, ^abc$ matches only abc (i.e. there must be no other characters before or after it).

If the text being searched contains multiple lines, the anchors can be made to apply to each line rather than the text as a whole by means of the "m" option. For example, m)^abc$ matches 123`r`nabc`r`n789. But without the "m" option, it wouldn't match.
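The anchors and the multiline option can be sketched in Python. Two assumptions apply: Python writes the "m" option as re.MULTILINE rather than the m) prefix, and it treats only the line feed as a line break, so the multi-line sample below uses line feeds instead of `r`n.

```python
import re

start_hit = re.search(r"^abc", "abc123")
start_miss = re.search(r"^abc", "123abc")
end_hit = re.search(r"abc$", "123abc")
end_miss = re.search(r"abc$", "abc123")

# Three lines of text; "abc" stands alone on the middle line.
text = "123\nabc\n789"

# Without the multiline option the anchors refer to the text as a whole.
whole_text = re.search(r"^abc$", text)

# With it, ^ and $ apply to each line.
per_line = re.search(r"^abc$", text, re.MULTILINE)

print(bool(start_hit), start_miss, bool(end_hit), end_miss, whole_text, per_line.group())
```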

\b

\b means "word boundary", which is like an anchor because it doesn't consume any characters. It requires the current character's status as a word character (\w) to be the opposite of the previous character's. It is typically used to avoid accidentally matching a word that appears inside some other word. For example, \bcat\b doesn't match catfish, but it matches cat regardless of what punctuation and whitespace surrounds it. Capital \B is the opposite: it requires that the current character not be at a word boundary.
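A brief Python sketch of word boundaries. The subject strings other than catfish are hypothetical.

```python
import re

whole_word = re.compile(r"\bcat\b")
subjects = ["catfish", "a cat.", "(cat)", "concat"]

# Only a stand-alone "cat" matches, whatever punctuation surrounds it.
hits = [bool(whole_word.search(s)) for s in subjects]

# \B is the opposite: "cat" must be embedded in a longer word on both sides.
embedded = re.search(r"\Bcat\B", "concatenate")
not_embedded = re.search(r"\Bcat\B", "cat")

print(hits, embedded.group(), not_embedded)
```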

|

The vertical bar separates two or more alternatives. A match occurs if any of the alternatives is satisfied. For example, gray|grey matches both gray and grey. Similarly, the pattern gr(a|e)y does the same thing with the help of the parentheses described below.
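The equivalence of the two alternation forms can be checked with a short Python sketch. The word groy is a made-up counterexample.

```python
import re

words = ["gray", "grey", "groy"]

# Alternatives written out in full...
plain = [bool(re.fullmatch(r"gray|grey", w)) for w in words]

# ...and the same choice limited to one letter by parentheses.
grouped = [bool(re.fullmatch(r"gr(a|e)y", w)) for w in words]

print(plain, grouped)
```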

(...)

Items enclosed in parentheses are most commonly used to:

  • Determine the order of evaluation. For example, (Sun|Mon|Tues|Wednes|Thurs|Fri|Satur)day matches the name of any day.
  • Apply *, ?, +, or {min,max} to a series of characters rather than just one. For example, (abc)+ matches one or more occurrences of the string "abc"; thus it matches abcabc123 but not ab123 or bc123.
  • Capture a sub-pattern such as the dot-star in abc(.*)xyz. For example, RegExMatch() stores the substring that matches each sub-pattern in its output array. Similarly, RegExReplace() allows the substring that matches each sub-pattern to be reinserted into the result via backreferences like $1. To use the parentheses without the side-effect of capturing a sub-pattern, specify ?: as the first two characters inside the parentheses; for example: (?:.*).
  • Change options on-the-fly. For example, (?im) turns on the case-insensitive and multiline options for the remainder of the pattern (or sub-pattern if it occurs inside a sub-pattern). Conversely, (?-im) would turn them both off. All options are supported except DPS`r`n`a.
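The four uses of parentheses listed above can be sketched in Python. This is an illustration under assumptions, not AutoMate's own API: Python's re.sub stands in for a replace operation, Python writes backreferences in the replacement as \1 rather than $1, and the sample strings are made up.

```python
import re

# Grouping: the alternatives share the common "day" suffix.
day = re.search(r"(Sun|Mon|Tues|Wednes|Thurs|Fri|Satur)day", "See you Wednesday")

# A quantifier applied to a whole sequence of characters.
repeated = re.search(r"(abc)+", "abcabc123")

# Capturing: group 1 holds whatever the dot-star matched.
captured = re.search(r"abc(.*)xyz", "abc-middle-xyz")

# Backreferences reinsert captured text into a replacement.
swapped = re.sub(r"(\w+)@(\w+)", r"\2 at \1", "user@host")

# (?:...) groups without capturing, so no sub-pattern is stored.
uncaptured = re.search(r"abc(?:.*)xyz", "abc-middle-xyz")

# (?im) turns on case-insensitive and multiline matching inside the pattern.
inline = re.findall(r"(?im)^abc$", "ABC\nabc\nxabc")

print(day.group(), repeated.group(), captured.group(1), swapped,
      uncaptured.groups(), inline)
```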