RegEx Tutorial

Regular Expressions (RegEx or RegExp) are special text strings for describing a search pattern. A RegEx either matches or does not match the string being searched: a match test returns true if it matches and false if it does not.
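As a minimal sketch of the match / no-match idea, assuming Python's standard `re` module rather than the PHP functions used later in this tutorial, a hypothetical helper `matches` (not part of any library) can turn a search result into true or false:

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if pattern is found anywhere in text, else False."""
    # re.search returns a match object on success and None on failure.
    return re.search(pattern, text) is not None


print(matches("cat", "concatenate"))  # True
print(matches("dog", "concatenate"))  # False
```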

In RegEx, the following characters have special meaning. To match the literal character, these characters have to be escaped by preceding them with “\”.

  • ^ = beginning of a string, or negation of a character range when it is the first character inside square brackets “[]”.
  • $ = end of string.
  • *+? = denote repetition: * is zero or more, + is one or more, and ? is zero or one. “a*” matches “”, “a”, “aa”, “aaa”, etc. “a+” does not match “”, but matches “a”, “aa”, “aaa”, etc. And “a?” matches only “” and “a”. One caveat: the star is “greedy”, meaning it will catch the biggest string it can. “_.*_” will match all of “_beginning_middle_end_some_more_big_string_”. To make it match the shortest possible string each time, add ? after the star. Thus “_.*?_” will match “_beginning_” first, then, continuing after that match, “_end_”, and so on. To keep part of the match when replacing, enclose that part in parentheses and refer to it in the replacement as “$1”. For example, we can search for “_(.*?)_” and replace it with “<italic>$1</italic>”. This will change “_beginning_” to “<italic>beginning</italic>”.
  • . = any single character: alphanumeric, space, or other (in many engines a newline is excluded by default).
  • [] = a single character. What's inside defines what kind of character it can be. Ranges of characters are denoted by “-”. “[a-zA-Z0-9]” means any alphanumeric character. “[^a-zA-Z0-9]” means any NON-alphanumeric character, such as a period, space, slash, parenthesis, quote, question mark, etc.
  • () = groups several characters into one unit. “(abc)” matches a single occurrence of “abc”, but “(abc)+” matches “abc”, “abcabc”, “abcabcabc”, etc.
  • | = means OR, but is not often needed for single characters because “(a|b|c)” matches the same as “[a-c]”. It is useful for matching either of several strings longer than one character, such as “(facile|easy)”.
  • {} = denotes a repetition range. “\.[a-zA-Z]{2,3}” means a literal period followed by an alphabetic string of 2-3 characters, useful for domain name endings. Do not use “-” inside “{}”; the “,” separates the bounds of the range inside “{}”. Inside “[]”, use “-” for ranges; a “,” inside “[]” is a literal comma.
  • \ = used to escape all the above special characters to mean their literal. Use it for ^.[$()|*+?{\. In PHP 3, if you would like to find or match the literal character “\” in a string, use “\\” in your RegEx. I think in PHP 4+ you don't have to escape the literal “\” character. To make it more confusing, escaping some alphabetic characters gives them a special, non-literal meaning: “\s” means a whitespace character, “\n” a new line, “\r” a carriage return, “\t” a tab, and “\f” a form feed.
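
The special characters listed above can be tried out in a short sketch. It assumes Python's standard `re` module instead of PHP; the patterns are the ones from the list, but the sample strings and variable names are made up for illustration. Note that Python writes the captured group in a replacement as `\1` where the text above uses “$1”.

```python
import re

# Repetition: * is zero or more, + is one or more, ? is zero or one.
star_matches_empty = re.fullmatch(r"a*", "") is not None       # True
plus_matches_empty = re.fullmatch(r"a+", "") is not None       # False
optional_matches_aa = re.fullmatch(r"a?", "aa") is not None    # False

text = "_beginning_middle_end_"

# Greedy: .* takes as much as it can, so a single match spans everything.
greedy = re.findall(r"_.*_", text)    # ['_beginning_middle_end_']

# Lazy: .*? stops at the nearest closing underscore. Matches do not
# overlap, so the search resumes after the underscore that closed the first.
lazy = re.findall(r"_.*?_", text)     # ['_beginning_', '_end_']

# Capture group: Python's replacement syntax is \1 where the text uses $1.
italic = re.sub(r"_(.*?)_", r"<italic>\1</italic>", "_beginning_")

# Negated character class: every non-alphanumeric character.
non_alnum = re.findall(r"[^a-zA-Z0-9]", "a.b c?")   # ['.', ' ', '?']

# Grouping with repetition, and alternation between longer strings.
repeated_group = re.fullmatch(r"(abc)+", "abcabcabc") is not None    # True
either_word = re.fullmatch(r"(facile|easy)", "easy") is not None     # True

# Repetition range: a literal period followed by 2-3 letters at the end.
ending = re.search(r"\.[a-zA-Z]{2,3}$", "example.com").group()       # '.com'

# Escaping: \\ in the pattern matches one literal backslash.
has_backslash = re.search(r"\\", "C:\\dir") is not None              # True

print(greedy, lazy, italic, non_alnum, ending)
```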

To validate an email address, which can include underscores and dashes, the RegEx using the PHP POSIX-type function “eregi” would look like this:

  eregi("^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$",$email)
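
For readers without access to “eregi” (it was removed in PHP 7), the same pattern can be sketched in Python. This is only an illustration under that assumption: the function name `is_valid_email` is hypothetical, and `re.IGNORECASE` stands in for the case insensitivity of “eregi”.

```python
import re

# Same pattern as the eregi() call above, compiled once.
# re.IGNORECASE plays the role of eregi's case insensitivity.
EMAIL_PATTERN = re.compile(
    r"^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$",
    re.IGNORECASE,
)


def is_valid_email(email: str) -> bool:
    """Return True if email matches the tutorial's pattern, else False."""
    return EMAIL_PATTERN.match(email) is not None


print(is_valid_email("John.Doe@Example.COM"))  # True
print(is_valid_email("user@localhost"))        # False
print(is_valid_email("user@example.info"))     # False
```

The last call shows a limit of the pattern as written: “{2,3}” only accepts endings of two or three letters, so longer endings such as “.info” are rejected.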

Other Notes

Notice that “eregi()” is case insensitive (as opposed to “ereg()”), and thus [a-z] really means [a-zA-Z] in the “eregi()” function. The POSIX-type regex functions are not binary safe. Unlike the POSIX functions, other RegEx functions (such as PHP's Perl-compatible ones) usually require delimiters around the pattern; the default is the forward slash “/”.
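
The effect of case insensitivity on a range like [a-z] can be seen in a small sketch, again assuming Python's `re` module, where the `re.IGNORECASE` flag stands in for the difference between “ereg()” and “eregi()”. Python patterns are plain strings, so no delimiters appear here.

```python
import re

pattern = r"^[a-z]+$"

# Without the flag, [a-z] covers lowercase letters only.
case_sensitive = re.search(pattern, "HELLO") is not None                   # False

# With the flag, [a-z] behaves like [a-zA-Z].
case_insensitive = re.search(pattern, "HELLO", re.IGNORECASE) is not None  # True

print(case_sensitive, case_insensitive)
```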

Useful regular expression tutorial links: