computers:regex_tutorial [2015/03/24 05:42] (current)
 +====== RegEx Tutorial ======
 +Regular Expressions (RegEx or RegExp) are special text strings that describe a search pattern. A RegEx either matches or does not match the string being searched: a match returns true, and a failed match returns false.
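As a minimal sketch of the match / no-match idea, assuming Python's standard `re` module rather than php (the helper name `matches` is made up for illustration):

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern is found anywhere in text, False otherwise."""
    return re.search(pattern, text) is not None

print(matches(r"cat", "concatenate"))  # "cat" occurs inside the word
print(matches(r"dog", "concatenate"))  # no "dog" anywhere in the word
```

The first call reports a match and the second does not; other languages expose the same yes/no result through their own functions.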
 +In RegEx, the following characters have special meaning. To match one of them as a literal character, escape it by preceding it with "\".
 +  * ^ = beginning of a string, or negation of a character range when it is the first character inside square brackets "[]".
 +  * $ = end of string.
 +  * *+? = denote repetition: * is zero or more, + is one or more, and ? is zero or one. "a*" matches "", "a", "aa", "aaa", etc. "a+" does not match "", but matches "a", "aa", "aaa", etc. And "a?" only matches "" and "a". There is a catch: the star is "greedy", meaning it grabs the longest string it can. "_.*_" will match all of "_beginning_middle_end_some_more_big_string_". To make it match the shortest possible string each time, add ? after the star. Thus "_.*?_" will match "_beginning_", then "_end_", and so on; "_middle_" is skipped because its opening underscore was already consumed by the first match. To replace only the delimiters and keep the text between them, enclose that text in parentheses and refer to it as "$1" in the replacement. For example, use "_(.*?)_" as the pattern and "<italic>$1</italic>" as the replacement. This changes "_beginning_" to "<italic>beginning</italic>".
 +  * . = any single character: alphanumeric, space, or other (in most engines, anything except a newline).
 +  * [] = a single character. What's inside defines what kind of character it can be. Ranges of characters are denoted by "-". "[a-zA-Z0-9]" means any alphanumeric character. "[^a-zA-Z0-9]" means any NON-alphanumeric character, such as a period, space, slash, parenthesis, quote, question mark, etc.
 +  * () = groups more than one character into a unit. "(abc)" matches a single "abc" occurrence, but "(abc)+" matches "abc", "abcabc", "abcabcabc", etc.
 +  * | = means OR, but is not often needed for single characters because "(a|b|c)" is the same as "[a-c]". It is useful for matching alternative strings longer than one character, such as "(facile|easy)".
 +  * {} = denotes a repetition range. "\.[a-zA-Z]{2,3}" means a literal dot followed by an alphabetic string of 2-3 characters, useful for domain names. Do not use "-" inside "{}"; the "," marks the range inside "{}". Inside "[]", however, only "-" marks a range, and "," is a literal comma.
 +  * \ = escapes any of the above special characters so that it is taken literally. Use it for ^.[$()|*+?{\. In php3, to find or match the literal character "\" in a string, use "\\" in your RegEx. I think in php4+ you don't have to escape the literal "\" character. To make it more confusing, escaping some alphabetic characters gives them a special meaning instead: "\s" means whitespace, "\n" new line, "\r" carriage return, "\t" tab, and "\f" form feed.
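The special characters listed above can be tried out with a short script. This is only a sketch, assuming Python's standard `re` module instead of php; the variable names are invented for the example, and Python writes the first captured group as `\1` in a replacement where the text above uses "$1".

```python
import re

text = "_beginning_middle_end_some_more_big_string_"

# Greedy: ".*" grabs as much as possible, so the whole string is one match.
greedy = re.findall(r"_.*_", text)

# Lazy: ".*?" stops at the first closing underscore. Each match consumes
# both of its underscores, so the next match starts after them.
lazy = re.findall(r"_.*?_", text)

# Capture group plus back-reference: keep the text between the underscores
# and replace only the delimiters.
italic = re.sub(r"_(.*?)_", r"<italic>\1</italic>", "_beginning_")

# Group repetition, alternation, and a {min,max} repetition range.
abc_runs = re.fullmatch(r"(abc)+", "abcabcabc") is not None
either = re.findall(r"(facile|easy)", "facile means easy")
tld = re.search(r"\.[a-zA-Z]{2,3}$", "example.org").group(0)

# Negated character class: remove everything that is not alphanumeric.
alnum_only = re.sub(r"[^a-zA-Z0-9]", "", "a.b c/d?")

print(greedy)
print(lazy)
print(italic)
print(abc_runs, either, tld, alnum_only)
```

Printing `greedy` and `lazy` side by side makes the difference between "_.*_" and "_.*?_" easy to see on the sample string.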
 +To validate an email address, which can include underscores and dashes, a RegEx using the php POSIX-type function "eregi" would look like this:
 +<code php>
 +  // Note: eregi() was deprecated in PHP 5.3 and removed in PHP 7.0.
 +  eregi("^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$", $email)
 +</code>
 +====== Other Notes ======
 +Notice that "eregi()" is case insensitive (as opposed to "ereg()"), and thus [a-z] really means [a-zA-Z] in the "eregi()" function.
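The case-insensitive email check can also be sketched outside php. The example below assumes Python's standard `re` module, with `re.IGNORECASE` standing in for the behaviour of "eregi()"; the names `EMAIL_RE` and `is_valid_email` are hypothetical.

```python
import re

# The same pattern as the eregi() call, compiled with IGNORECASE so that
# [a-z] also accepts upper-case letters.
EMAIL_RE = re.compile(
    r"^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$",
    re.IGNORECASE,
)

def is_valid_email(email: str) -> bool:
    """Return True if the address fits the tutorial's email pattern."""
    return EMAIL_RE.match(email) is not None

print(is_valid_email("John.Doe@Example.COM"))
print(is_valid_email("user@example.info"))
print(is_valid_email("no-at-sign.example.com"))
```

Because the final group is limited to {2,3} letters, this pattern rejects addresses whose last domain label is longer, such as ".info".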
 +POSIX-type regex functions are not binary safe. Outside the POSIX functions, a RegEx usually has to be wrapped in delimiters; the most common choice is the forward slash "/".
 +Useful regular expression tutorial links:
 +  * http://weblogtoolscollection.com/regex/regex.php