
Regular Expressions

Regular expressions provide powerful find-and-replace functionality.

[1] char matches itself, unless it is a special character (metachar): . \ [ ] * + ^ $

[2] . matches any character.

[3] \ matches the character following it, except when followed by a left or right round bracket, a digit 1 to 9 or a left or right angle bracket. (see [7], [8] and [9]) It is used as an escape character for all other meta-characters, and itself. When used in a set ([4]), it is treated as an ordinary character.

[4] [set] matches one of the characters in the set. If the first character in the set is "^", it matches a character NOT in the set, i.e. it complements the set. The shorthand S-E specifies the range of characters from S up to E, inclusive. The special characters "]" and "-" have no special meaning if they appear as the first characters in the set.

Examples:

[a-z]     any lowercase alpha
[^]-]     any char except ] and -
[^A-Z]    any char except uppercase alpha
[a-zA-Z]  any alpha
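As a rough illustration of the set rules, the sketch below uses Python's `re` module, whose bracket expressions follow the same conventions for ranges, complement, and a leading "]". Python is an assumption made for the example; it is not the engine this page documents, and the helper name `matching_chars` is hypothetical.

```python
import re

def matching_chars(pattern, text):
    """Return the characters of text matched by a one-character pattern (hypothetical helper)."""
    return [ch for ch in text if re.fullmatch(pattern, ch)]

sample = "aZ]-9"

lowercase = matching_chars(r"[a-z]", sample)          # lowercase letters only
not_bracket_dash = matching_chars(r"[^]-]", sample)   # everything except ] and -
not_upper = matching_chars(r"[^A-Z]", sample)         # everything except uppercase letters
letters = matching_chars(r"[a-zA-Z]", sample)         # any letter
```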

[5] * any regular expression of the form [1] to [4], followed by the closure character (*), matches zero or more occurrences of that form.

[6] + same as [5], except it matches one or more.
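The difference between the two closures can be sketched with Python's `re` module, which treats * and + the same way for this simple case. Python is used here only as a convenient stand-in for the engine described on this page.

```python
import re

candidates = ("ac", "abc", "abbbc")

# b* allows zero or more b characters between a and c.
star_matches = [s for s in candidates if re.fullmatch(r"ab*c", s)]

# b+ requires at least one b, so "ac" is rejected.
plus_matches = [s for s in candidates if re.fullmatch(r"ab+c", s)]
```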

[7] a regular expression in the form [1] to [10], enclosed as \(form\) matches what form matches. The enclosure creates a set of tags, used for [8] and for pattern substitution. The tagged forms are numbered starting from 1.

[8] a \ followed by a digit 1 to 9 matches whatever a previously tagged regular expression ([7]) matched.
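A tagged form combined with a back-reference can find a repeated word. The sketch below is written for Python's `re` module, which is an assumption for illustration only. Note the syntax difference: Python writes groups with bare parentheses, so the pattern `\([a-z]+\) \1` in this page's syntax becomes `([a-z]+) \1` in Python.

```python
import re

# Page syntax: \([a-z]+\) \1    Python syntax: ([a-z]+) \1
doubled = re.compile(r"([a-z]+) \1")

hit = doubled.search("it is is here")   # "is" is followed by a second "is"
miss = doubled.search("it is here")     # no word is immediately repeated

repeated_text = hit.group(0)   # the whole match
tagged_text = hit.group(1)     # what tag 1 captured
```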

[9] \< \> a regular expression starting with a \< construct and/or ending with a \> construct restricts the pattern matching to the beginning of a word and/or the end of a word. A word is defined to be a character string beginning and/or ending with the characters A-Z a-z 0-9 and _. It must also be preceded and/or followed by any character outside those mentioned.
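The effect of the word constructs can be approximated in Python, with one caveat: Python's `re` module does not treat \< and \> as word anchors, so the sketch below substitutes `\b`, which matches at either edge of a word. This is an approximation under that assumption, not the behaviour of the engine documented here.

```python
import re

text = "cat concat cats cat_nap"

# Page syntax \<cat\> : "cat" as a whole word only.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]

# Page syntax \<cat : "cat" at the beginning of a word.
word_start = [m.start() for m in re.finditer(r"\bcat", text)]

# Page syntax cat\> : "cat" at the end of a word.
word_end = [m.start() for m in re.finditer(r"cat\b", text)]
```

The underscore counts as a word character, so "cat_nap" does not contain "cat" as a whole word.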

[10] a composite regular expression xy where x and y are in the form [1] to [10] matches the longest match of x followed by a match for y.

[11] ^ $ a regular expression starting with a ^ character and/or ending with a $ character restricts the pattern matching to the beginning of the line and/or the end of the line (anchors). Elsewhere in the pattern, ^ and $ are treated as ordinary characters.
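Line anchors can be sketched in Python as follows. Two assumptions apply: Python needs the `re.MULTILINE` flag before ^ and $ match at every line rather than only at the ends of the whole string, and, unlike the engine described here, Python keeps ^ and $ special everywhere in a pattern, so they must be escaped to match literally.

```python
import re

lines = "alpha\nbeta alpha\nalpha beta"

# ^alpha : only where "alpha" begins a line.
at_line_start = [m.start() for m in re.finditer(r"^alpha", lines, flags=re.MULTILINE)]

# alpha$ : only where "alpha" ends a line.
at_line_end = [m.start() for m in re.finditer(r"alpha$", lines, flags=re.MULTILINE)]
```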

Example: to replace comments that begin with a ; and run to the end of the line with comments enclosed within (* *):

Change:

;This is a comment

To:

(* This is a comment *)

 

Specify:

Find (a semicolon, then all characters up to the end of the line, tagged as tag #1):

;\(.*\)$

Replace with (left parenthesis, asterisk, space, the tagged characters, space, asterisk, right parenthesis):

(* \1 *)
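The same substitution can be tried outside the editor. The sketch below assumes Python's `re` module, where groups are written with bare parentheses, so the find pattern `;\(.*\)$` becomes `;(.*)$`. The replacement text is unchanged. The function name `convert_comment` is hypothetical.

```python
import re

def convert_comment(line):
    """Rewrite ';text' as '(* text *)' (hypothetical helper)."""
    # Page syntax: find ;\(.*\)$   replace with (* \1 *)
    return re.sub(r";(.*)$", r"(* \1 *)", line)

converted = convert_comment(";This is a comment")
unchanged = convert_comment("no comment here")
```

Because the find pattern does not start with ^, it also rewrites a comment that follows other text on the same line.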

Acknowledgements:

Regular expression pattern matching and replacement By: Ozan S. Yigit (oz) Dept. of Computer Science York University
Original code available from http://www.cs.yorku.ca/~oz/ Translation to C++ by Neil Hodgson neilh@scintilla.org
These routines are the PUBLIC DOMAIN equivalents of regex routines as found in 4.nBSD UN*X, with minor extensions.
These routines are derived from various implementations found in software tools books, and Conroy's grep. They are NOT derived from licensed/restricted software.
For more interesting/academic/complicated implementations, see Henry Spencer's regexp routines, or GNU Emacs pattern matching module.