Using Regular Expressions

Applies To: AutoMate 6, AutoMate 5
Published: 2/26/05, modified December 17, 2007

Introduction

Several AutoMate actions, notably the Find Text and Replace Text actions, allow the use of "regular expressions". A regular expression is a compact notation that describes a pattern of text, so a single expression can find many different strings. This article describes the regular expression syntax used in AutoMate 5 and 6.

Finding Text

Text can be found in a string using a regular expression by specifying a "match expression". A Match expression operates on one line of text at a time, so no match can span multiple lines of text. Match regular expressions are composed of the following:

  • Period (.) Matches any single character except newline. Internally, a newline is two characters in a fixed order: <carriage return> followed by <linefeed>. To match a newline, you must specify it explicitly.
  • Caret (^) Matches at the beginning of a line only. A ^ occurring anywhere in the match expression (except within a character class) is interpreted this way, which allows meaningful use of ^ in combination with grouping or alternation (see below).
  • Dollar sign ($) Matches at the end of a line only. As with ^, the $ character retains its special meaning anywhere within the expression (except in a character class).
  • Backslash (\) Followed by a single character, matches that character literally. For example, \* matches an asterisk, \\ matches a backslash, and \$ matches a dollar sign.
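The operators above behave much like their counterparts in other regular expression engines. The following is a minimal sketch using Python's re module as a stand-in, not AutoMate's own engine; the helper name find_in_lines and the sample text are made up for illustration. It searches each line separately to mirror the one-line-at-a-time rule.

```python
import re

def find_in_lines(pattern, text):
    """Return (line_number, matched_text) for the first match on each line."""
    results = []
    # Split on CR LF so that every line is searched on its own
    # and no match can span two lines.
    for number, line in enumerate(text.split("\r\n"), start=1):
        match = re.search(pattern, line)
        if match:
            results.append((number, match.group()))
    return results

text = "cost: $5\r\n5 * 3\r\nend."

# '.' matches any single character; '^' anchors to the start of the line.
print(find_in_lines(r"^c.st", text))   # [(1, 'cost')]
# A backslash makes the following character literal; '$' anchors to the end.
print(find_in_lines(r"\$5$", text))    # [(1, '$5')]
print(find_in_lines(r"\*", text))      # [(2, '*')]
print(find_in_lines(r"\.$", text))     # [(3, '.')]
```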

The following sequences have special meaning:

  • \s space (ASCII #32)
  • \t tab (ASCII #9)
  • \b backspace (ASCII #8)
  • \r return (ASCII #13)
  • \l linefeed (ASCII #10)
  • \n newline (#13 followed by #10)
  • \p pipe character |
  • \w word delimiter. Matches any of \t\s!"&()*+,-./:;<=>?@[\]^`{|}~
  • \h hex character. Matches any of 0123456789ABCDEF

The special characters above should be used to produce instances of blanks and tabs. Case is ALWAYS significant when using the special characters. Thus \s matches a space while \S matches a capital letter S.
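Several of these sequences mean something different in other regular expression dialects, which is easy to trip over. As a rough illustration only, the table below maps each sequence listed above to a pattern with the same meaning in Python's re module. The names WORD_DELIMITERS and AUTOMATE_TO_PYTHON are hypothetical and are not part of AutoMate.

```python
import re

# The word delimiter characters listed for \w above (tab, space, punctuation).
WORD_DELIMITERS = "\t !\"&()*+,-./:;<=>?@[\\]^`{|}~"

# Hypothetical lookup table: AutoMate sequence -> equivalent Python pattern.
AUTOMATE_TO_PYTHON = {
    r"\s": " ",           # a single space, not "any whitespace" as in Python
    r"\t": "\t",          # tab
    r"\b": "\x08",        # backspace, not a word boundary as in Python
    r"\r": "\r",          # carriage return
    r"\l": "\n",          # linefeed
    r"\n": "\r\n",        # newline = carriage return followed by linefeed
    r"\p": r"\|",         # literal pipe character
    r"\w": "[" + re.escape(WORD_DELIMITERS) + "]",  # one word delimiter
    r"\h": "[0-9A-F]",    # one hex character, uppercase only as listed
}

# Split a string wherever a word delimiter occurs.
delimiter = re.compile(AUTOMATE_TO_PYTHON[r"\w"])
print(delimiter.split("alpha,beta;gamma"))   # ['alpha', 'beta', 'gamma']

# Check for a run of hex characters.
hex_run = re.compile(AUTOMATE_TO_PYTHON[r"\h"] + "+")
print(bool(hex_run.fullmatch("1F")))         # True
```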

A single character not otherwise endowed with special meaning matches that character. Thus z matches a single instance of the letter z.

A string enclosed in brackets [] specifies a character class. Any single character in the string is matched. For example, [abc] matches an a, b, or c. Ranges of ASCII letters and numbers can be abbreviated as, for example, [a-z0-9]. If the first symbol following the [ is a caret (^), then a negative character class is specified. In this case, the class matches any character EXCEPT those enclosed in the brackets. For example, [^a-z] matches everything except lowercase letters (and newlines).

The special characters defined above may be used inside of character classes with the exception of \n, \w and \h, which are shorthand for their own character classes. If the characters - or ] are to be used literally inside of a character class, they should be preceded by the escape character \. Note that *?+(){}!^$#& are not special characters when found inside a character class.
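Character classes use the same bracket notation in most regular expression dialects, so they can be tried out quickly elsewhere. The sketch below uses Python's re module as a stand-in for AutoMate's engine. One known difference is that a negative class in Python also matches newline characters, whereas the article states that AutoMate's does not.

```python
import re

# Any single character from the set.
print(re.findall(r"[abc]", "cabbage"))     # ['c', 'a', 'b', 'b', 'a']

# Ranges of letters and digits.
print(re.findall(r"[a-z0-9]", "A1b2"))     # ['1', 'b', '2']

# Negative class: everything except lowercase letters.
print(re.findall(r"[^a-z]", "aB3 c"))      # ['B', '3', ' ']

# '-' and ']' are escaped to use them literally inside a class.
print(re.findall(r"[\-\]]", "a-b]c"))      # ['-', ']']

# Closure characters lose their special meaning inside a class.
print(re.findall(r"[*?+]", "2+2*3?"))      # ['+', '*', '?']
```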

Using Closures
A regular expression followed by * matches zero or more consecutive occurrences of that expression. This is referred to as a closure. Thus ba*b matches the string bb (no instances of a), bab (one instance), or baaaaaab (several instances).

A regular expression followed by a + matches one or more consecutive occurrences of that expression. This is another type of closure. In this case ba+b will not match bb, but it will match bab or baaaaaab.

A regular expression followed by a ? matches zero or one occurrence of that expression. This is another closure. Here, ba?b will match bb or bab, but not baaaaaab.
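The three closures can be compared side by side. This is a small sketch that uses Python's re module, where *, + and ? carry the same meaning, rather than AutoMate itself; the variable names are illustrative.

```python
import re

samples = ["bb", "bab", "baaaaaab"]

# For each closure, collect the sample strings that match in their entirety.
results = {}
for pattern in (r"ba*b", r"ba+b", r"ba?b"):
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]
    print(pattern, results[pattern])

# ba*b ['bb', 'bab', 'baaaaaab']
# ba+b ['bab', 'baaaaaab']
# ba?b ['bb', 'bab']
```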

Concatenated Expressions
Two concatenated regular expressions match text that matches the first expression, followed immediately by text that matches the second. Thus (abc)(def) matches the string abcdef.

Alternation
Two regular expressions separated by | match text that matches either the first expression or the second. This is referred to as alternation. Any number of regular expressions can be strung together in this way. Alternatives are tested in order from left to right, and the first one that matches is used. The remaining alternatives are then skipped.
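Because the first alternative that matches wins, the order of the alternatives can change the result. The sketch below shows this with Python's re module, which also tests alternatives from left to right; it is an analogy, not AutoMate's engine. Each alternative is wrapped in parentheses so that the expression reads the same way under both sets of precedence rules.

```python
import re

text = "category"

# "cat" is listed first and matches, so "category" is never tried.
short_first = re.match(r"(cat)|(category)", text).group()
print(short_first)   # cat

# Listing the longer alternative first lets it win.
long_first = re.match(r"(category)|(cat)", text).group()
print(long_first)    # category
```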

Grouping Expressions
A regular expression enclosed in parentheses () matches whatever the enclosed expression matches. Parentheses are used to provide grouping, and may be nested to arbitrary depth. Open and close parentheses must be balanced. For example, the following two expressions are not equivalent, and the second probably expresses what was intended:

  • PROCEDURE|FUNCTION
  • (PROCEDURE)|(FUNCTION)

Because alternation takes precedence over concatenation, the first expression is equivalent to:

  • PROCEDUR(E|F)UNCTION

The second expression matches either of the two words in their entirety.
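The difference between the two readings can be checked directly. In the sketch below, Python's re module stands in for AutoMate. Note that Python gives | the lowest precedence, so the unparenthesized form is written out explicitly as PROCEDUR(E|F)UNCTION to reproduce the interpretation described above; the helper name matches is illustrative.

```python
import re

# How the article says PROCEDURE|FUNCTION is interpreted without parentheses.
ungrouped = re.compile(r"PROCEDUR(E|F)UNCTION")
# The parenthesized form that matches either whole word.
grouped = re.compile(r"(PROCEDURE)|(FUNCTION)")

def matches(regex, text):
    """True if the whole of text matches regex."""
    return regex.fullmatch(text) is not None

print(matches(ungrouped, "PROCEDURE"))         # False
print(matches(ungrouped, "PROCEDURFUNCTION"))  # True
print(matches(grouped, "PROCEDURE"))           # True
print(matches(grouped, "FUNCTION"))            # True
```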

Tagged Matches
A regular expression enclosed in curly braces {} forms a tagged match word. Whatever was matched within the braces may be referred to by a Replace expression in a manner to be described. Tagged match words may not be nested. Open and close braces must be balanced. A maximum of nine tagged match words can be referenced by the Replace expression. Note that the use of curly braces in Replace expressions is meaningless. However, these expressions share an expression interpreter with the Match expressions, so no exception is raised. For example, consider the expression:

  • b{a*}b

If the string being tested is 'bab', then the tagged match word contains a single 'a'. If the string being tested is 'baaaaaab', then the tagged match word contains 'aaaaaa'. If the string tested is 'bb', then the tagged match word is empty.
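Tagged match words correspond closely to what many other dialects call capture groups. As an analogy only, the sketch below writes b{a*}b in Python's re syntax, where parentheses rather than curly braces capture text; the variable names are illustrative.

```python
import re

# Python equivalent of the AutoMate expression b{a*}b.
tagged = re.compile(r"b(a*)b")

# Build the three test strings from the article: bab, baaaaaab, bb.
for text in ("bab", "b" + "a" * 6 + "b", "bb"):
    word = tagged.fullmatch(text).group(1)
    print(text, "->", repr(word))

# bab -> 'a'
# baaaaaab -> 'aaaaaa'
# bb -> ''
```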

Order of Precedence
Regular expressions are interpreted from left to right. The order of precedence of operators at the same parenthesis level is [], then *+!, then |, and then concatenation.

Tag braces are interpreted strictly from left to right and do not control precedence in any way. The first tagged match word found is given a tag of 1, the second a tag of 2, and so on up to a maximum tag of 9. The tag number that each word receives is based on when it is encountered in the line. If tags are skipped over as a result of alternation, then any remaining tags in a line receive shifted tag numbers. For example, consider the expression:

  • (FUNCTION)|({PROCEDURE})\s+{[^\s(]+}

If a line contains the word PROCEDURE then the word following PROCEDURE has a tag number of 2. If a line contains the word FUNCTION, then the word following FUNCTION has a tag number of 1. It is up to the user to take advantage of this behavior. Generally, it is good practice to surround an entire set of alternates with tag markers:

  • {(FUNCTION)|(PROCEDURE)}\s+{[^\s(]+}
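The shifting tag numbers can be simulated to see why the second form is safer. The following is a rough sketch in Python, whose re module numbers groups by position and leaves skipped groups empty instead of renumbering them. The helper automate_style_tags is hypothetical; it renumbers only the groups that took part in the match, which approximates the behavior described above. A literal space stands in for AutoMate's \s.

```python
import re

def automate_style_tags(match):
    """Number only the groups that took part in the match, starting at 1."""
    taken = [g for g in match.groups() if g is not None]
    return {number: text for number, text in enumerate(taken, start=1)}

# Approximates (FUNCTION)|({PROCEDURE})\s+{[^\s(]+}: only PROCEDURE is tagged.
shifting = re.compile(r"(?:FUNCTION|(PROCEDURE)) +([^ (]+)")
# Approximates {(FUNCTION)|(PROCEDURE)}\s+{[^\s(]+}: the keyword is always tagged.
stable = re.compile(r"(FUNCTION|PROCEDURE) +([^ (]+)")

print(automate_style_tags(shifting.search("FUNCTION Area(r)")))
# {1: 'Area'}
print(automate_style_tags(shifting.search("PROCEDURE Draw(x)")))
# {1: 'PROCEDURE', 2: 'Draw'}
print(automate_style_tags(stable.search("FUNCTION Area(r)")))
# {1: 'FUNCTION', 2: 'Area'}
```

With the second pattern the name always lands in tag 2, regardless of which keyword was found.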

Replacing Text
Replace regular expressions are constructed the same way as Match regular expressions, but the number of operators is reduced. The replacement process occurs in the following manner:

The Match expression finds a string of text that starts at the leftmost position in the input line that matches, and continues to the rightmost position that matches. The string of matched text is operated upon by the Replace expression. The Match expression is then tried again on the input, starting at the first position beyond the previous match string. This repeats until the end of the line is reached.
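The replacement loop described above can be sketched in a few lines. The version below is written in Python with its re module and is an illustration of the process, not AutoMate's implementation. The function name replace_all is made up, the replacement text uses Python's \1 notation for the first tagged word rather than AutoMate's Replace syntax, and the handling of empty matches is an assumption added to keep the loop from stalling.

```python
import re

def replace_all(pattern, replacement, line):
    """Replace every match in one line, scanning from left to right."""
    regex = re.compile(pattern)
    pieces = []
    pos = 0
    while pos <= len(line):
        # Find the leftmost match at or after pos; closures are greedy,
        # so the match extends as far to the right as it can.
        match = regex.search(line, pos)
        if match is None:
            break
        pieces.append(line[pos:match.start()])    # text before the match
        pieces.append(match.expand(replacement))  # replacement for the match
        if match.end() == match.start():
            # Assumption: on an empty match, copy one character and move on.
            if match.start() < len(line):
                pieces.append(line[match.start()])
            pos = match.start() + 1
        else:
            # Resume at the first position beyond the previous match.
            pos = match.end()
    pieces.append(line[pos:])                     # text after the last match
    return "".join(pieces)

print(replace_all(r"ba*b", "X", "bab bb baab"))   # X X X
print(replace_all(r"b(a*)b", r"[\1]", "baab"))    # [aa]
```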