Original in English by Guido Socher
Regular expressions can be found in many advanced editors like vi and emacs, in the programs grep/egrep and languages like awk, perl and sed.
Regular Expressions are used for advanced context sensitive searches and text modifications. A Regular Expression is a formal description of a template to be matched against a text string.
Several years ago I watched someone use regular expressions and was fascinated. Text editing and searching tasks that would normally take hours could be done in just a few seconds. Yet I did not understand a word of the expressions on the screen. They looked like a strange combination of dots, slashes, stars and some other characters. Still, I was determined to learn how they worked, and I soon found that they are quite easy to use. They follow a few simple syntax rules.
Although regular expressions are widespread in the Unix world, there is no such thing as 'the standard regular expression language'. There are rather several different dialects. For example, there are two types of grep programs: grep and egrep. Both use regular expressions with slightly different capabilities. Perl probably has the most complete set of regular expressions. Fortunately, all of them follow the same principles. Once you understand the basic idea, it is easy to learn the details of the different dialects.
This article will introduce you to the basics and you can look in the manual pages of the different programs to learn about the different aspects and capabilities of the program.
Let's say you have a phone list of a company and it looks like this:
    Phone Name  ID
    ...
    ...
    3412  Bob   123
    3834  Jonny 333
    1248  Kate  634
    1423  Tony  567
    2567  Peter 435
    3567  Alice 535
    1548  Kerry 534
    ...
It is a company with 500 people. The data is kept in a plain ASCII text file. People whose phone number has a 1 as the first digit work in building 1. Who is working in building 1?
Regular Expressions can answer that:

    grep '^1' phonelist.txt
or
    egrep '^1' phonelist.txt
or
    perl -ne 'print if (/^1/)' phonelist.txt
In words this means, search for all lines that start with a one. The "^" matches the beginning of a line. It forces the whole expression to match only if a line has a one as the first character.
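Here is a small sketch you can paste into a shell to try this without creating a file. The `phonelist` function is a hypothetical helper of mine that prints the sample rows and stands in for phonelist.txt; `building_one` is likewise just an illustrative name.

```shell
# Hypothetical helper: prints the sample rows of the phone list
# (stands in for phonelist.txt).
phonelist() {
  printf '%s\n' \
    '3412  Bob   123' \
    '3834  Jonny 333' \
    '1248  Kate  634' \
    '1423  Tony  567' \
    '2567  Peter 435' \
    '3567  Alice 535' \
    '1548  Kerry 534'
}

# Keep only the lines whose first character is a 1 (building 1).
building_one() {
  phonelist | grep '^1'
}

building_one
```

With the sample rows this should list Kate, Tony and Kerry.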
The basic building block of a regular expression is the single-character pattern. It matches just this character. An example of a single-character pattern is the 1 in the example above. It just matches a one in the text.
Another example of single-character patterns is:

    egrep 'Kerry' phonelist.txt

This pattern consists only of single-character patterns (the letters K, e, r, r and y).
Characters can be grouped together in a set. A set is written as a pair of square brackets with a list of characters between them. A set as a whole is also a single-character pattern: it matches exactly one character of the text, and that character must be one of those listed. Some examples:
[abc] is a single-character pattern that matches either the letter a, b or c.

[ab0-9] is a single-character pattern that matches either a or b or a digit in the ASCII range from zero to nine.

[a-zA-Z0-9\-] matches a single character that is either an upper case or lower case letter, a digit or the minus sign.

Let's try it:

    egrep '^1[348]' phonelist.txt
This searches for lines that start with 13 or 14 or 18.
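To see the set in action, here is a sketch on the sample rows. I use `grep -E`, which accepts the same expressions as egrep; the `phonelist` helper and the function name are my own assumptions for illustration.

```shell
# Hypothetical helper: prints the sample rows of the phone list.
phonelist() {
  printf '%s\n' \
    '3412  Bob   123' \
    '3834  Jonny 333' \
    '1248  Kate  634' \
    '1423  Tony  567' \
    '2567  Peter 435' \
    '3567  Alice 535' \
    '1548  Kerry 534'
}

# A 1 at the start of the line, followed by exactly one of 3, 4 or 8.
starts_13_14_18() {
  phonelist | grep -E '^1[348]'
}

starts_13_14_18
```

Only Tony (1423) qualifies here: 1248 and 1548 start with a 1, but their second digit is not in the set.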
We have already seen that most ASCII characters match just themselves, but some ASCII characters have a special meaning. The opening square bracket, for example, starts a set. Inside a set, the "-" has the special meaning of a range. To take away the special meaning of a special character you can precede it with a backslash. The minus sign in [a-zA-Z0-9\-] is an example of this. There are also some dialects of the regexp language where special characters start with a backslash. In that case you need to remove the backslash to get the normal meaning.
The dot is an important special character. It matches everything except the newline character. Example:
    grep '^.2' phonelist.txt
or
    egrep '^.2' phonelist.txt
This searches for lines with a 2 at the second position and anything as the first character.
Sets can be inverted by starting the set definition with "[^" instead of "[". Here the "^" no longer means the beginning of the line; the combination of "[" and "^" indicates the inverted set.
[0-9] is a single-character pattern that matches a digit in the ASCII range from zero to nine.

[^0-9] matches any single NON-digit character.

[^abc] matches any single character that is not an a, b or c.

. (the dot) matches everything except the newline. It is the same as [^\n], where \n is the newline character.

To match all lines that do NOT start with a 1 we could write:

    grep '^[^1]' phonelist.txt
or
    egrep '^[^1]' phonelist.txt
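The dot and the inverted set can be tried side by side. This is only a sketch on the sample rows; `phonelist` and the two function names are hypothetical helpers, not part of any tool.

```shell
# Hypothetical helper: prints the sample rows of the phone list.
phonelist() {
  printf '%s\n' \
    '3412  Bob   123' \
    '3834  Jonny 333' \
    '1248  Kate  634' \
    '1423  Tony  567' \
    '2567  Peter 435' \
    '3567  Alice 535' \
    '1548  Kerry 534'
}

# Any first character, then a 2 in the second position.
second_digit_two() {
  phonelist | grep '^.2'
}

# First character is anything except a 1.
not_building_one() {
  phonelist | grep '^[^1]'
}

second_digit_two
not_building_one
```

Keep in mind that on the real file the header line would also pass the second filter, since "Phone" does not start with a 1.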
We already saw in the previous part that "^" matches the beginning of a line. Anchors are special regexp characters that match a position in the text rather than a character of the text.

    ^   Match the beginning of a line
    $   Match the end of a line
To look for people with the company ID number 567 in our phonelist.txt we would use:
egrep '567$' phonelist.txt
This looks for lines with the number 567 at the end of the line.
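The anchor matters here because 567 also occurs inside some phone numbers. The following sketch compares the search with and without the "$"; the helper and function names are assumptions of mine.

```shell
# Hypothetical helper: prints the sample rows of the phone list.
phonelist() {
  printf '%s\n' \
    '3412  Bob   123' \
    '3834  Jonny 333' \
    '1248  Kate  634' \
    '1423  Tony  567' \
    '2567  Peter 435' \
    '3567  Alice 535' \
    '1548  Kerry 534'
}

# Without an anchor: 567 anywhere in the line.
anywhere_567() {
  phonelist | grep '567'
}

# With the end-of-line anchor: 567 must be the last three characters.
id_567() {
  phonelist | grep '567$'
}

anywhere_567
id_567
```

The unanchored search also returns Peter and Alice because their phone numbers contain 567; the anchored one returns only Tony.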
A multiplier determines how often a single-character pattern must
occur in the text.
description | grep | egrep | perl | vi | vim | vile | elvis | emacs |
---|---|---|---|---|---|---|---|---|
zero or more times | * | * | * | * | * | * | * | * |
one or more times | \{1,\} | + | + | | \+ | \+ | \+ | + |
zero or one time | \? | ? | ? | | \= | \? | \= | ? |
n to m times | \{n,m\} | {n,m} | \{n,m\} |
Note: The various VIs have the magic option set to work as shown above.
An example from the phone list:
    ....
    1248  Kate  634
    ....
    1548  Kerry 534
    ....
To match a line that starts with a 1, has some digits, at least one space and a name that starts with a K we can write:
    grep '^1[0-9]\{1,\} \{1,\}K' phonelist.txt
or use * and repeat [0-9] and the space:
    grep '^1[0-9][0-9]*  *K' phonelist.txt
or
    egrep '^1[0-9]+ +K' phonelist.txt
or
    perl -ne 'print if (/^1[0-9]+ +K/)' phonelist.txt
The multiplier multiplies the occurrence of the preceding single-character pattern. So "23*4" does NOT mean "2, then 3, then anything, then 4" (that would be "23.*4"). It means "one 2, then zero or more 3s, then one 4".
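A sketch of both points: the "one or more" multiplier on the phone list, and what "23*4" really accepts. The helper, the function names and the four test strings are my own; I added "^" and "$" to the second pattern so that the whole line has to match.

```shell
# Hypothetical helper: prints the sample rows of the phone list.
phonelist() {
  printf '%s\n' \
    '3412  Bob   123' \
    '3834  Jonny 333' \
    '1248  Kate  634' \
    '1423  Tony  567' \
    '2567  Peter 435' \
    '3567  Alice 535' \
    '1548  Kerry 534'
}

# egrep dialect: + means "one or more".
k_names_ere() {
  phonelist | grep -E '^1[0-9]+ +K'
}

# grep dialect: the same thing written as \{1,\}.
k_names_bre() {
  phonelist | grep '^1[0-9]\{1,\} \{1,\}K'
}

# 23*4: one 2, zero or more 3s, one 4.
two_threes_four() {
  printf '%s\n' 24 234 2334 2354 | grep -E '^23*4$'
}

k_names_ere
k_names_bre
two_threes_four
```

Both dialects should return Kate and Kerry. Of the four test strings, 2354 is the only one rejected, because the star applies to the 3 alone and does not stand for "anything".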
It is also important to note that these multipliers are greedy. This means that the first multiplier in the pattern extends the match as far to the right as possible.
The expression ^1.*4 would match the whole line 1548 Kerry 534 from the start until the very last 4. It does NOT match only the 154.
This does not make much difference for grep but is important for editing and substitution.
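One way to make the greedy match visible is to let sed put brackets around whatever the expression matched. This is a sketch; `show_greedy` is a hypothetical name, and the "&" in a sed replacement stands for the matched text.

```shell
# Wrap the text matched by ^1.*4 in square brackets.
show_greedy() {
  printf '%s\n' '1548  Kerry 534' | sed 's/^1.*4/[&]/'
}

show_greedy
```

The brackets end up around the whole line, not just around 154, because .* runs on to the last 4 it can find.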
The Parentheses as Memory construct does not change the way an expression matches but instead causes the enclosed part of the text to be remembered, so that it may be referred to later on in the expression.
The remembered part is available via variables. The first
Parentheses as Memory construct is available via variable one, the
second Parentheses as Memory construct is available via variable two and so on.
program name | parentheses syntax | variable syntax |
---|---|---|
grep | \(\) | \1 |
egrep | () | \1 |
perl | () | \1 or ${1} |
vi,vim,vile,elvis | \(\) | \1 |
emacs | \(\) | \1 |
Example:
The expression [a-z][a-z] would match two lower case letters.
Now we can use these variables to search for patterns like the text 'otto':

    egrep '([a-z])([a-z])\2\1'

For otto, the variable \1 contains the letter o and \2 the letter t. The expression would also match the name anna but not yxyx.
Parentheses as Memory constructs are not so much used for finding names like otto and anna but rather for editing and substitution.
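Still, the otto example is easy to try. The sketch below uses the grep syntax from the table, \(\) and \1; the function name and the word kerry as an extra non-matching input are my own choices.

```shell
# Two lower case letters, then the same two letters in reverse order.
mirrored_words() {
  printf '%s\n' otto anna yxyx kerry | grep '\([a-z]\)\([a-z]\)\2\1'
}

mirrored_words
```

Only otto and anna come back. In yxyx the second pair repeats the first one instead of mirroring it.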
To do editing you need an editor like vi or emacs, or you can use e.g. perl.
In emacs you use M-x query-replace-regexp or you can put the query-replace-regexp command on some function key. Alternatively you can also use the command replace-regexp. The query-replace-regexp command is interactive, the other is not.
In vi the substitution command :%s/ / /gc is used. The percent refers to the ex-range 'whole file' and can be replaced by any appropriate range. E.g. in vim you type shift-v, mark an area and then use the substitution on that area only. I don't explain more about vim here as this would be a tutorial on its own. The 'gc' is the interactive version. The non-interactive version is :%s/ / /g
Interactive means that you are prompted at each match on whether or not to execute the substitution.
In perl you can use
perl -pe 's/ / /g'
Let's look at a few examples. The numbering plan in our company has changed and all phone numbers that start with a 1 get a 2 inserted after the second digit.
For example, 1423 should become 14223.

The old list:

    Phone Name  ID
    ...
    3412  Bob   123
    3834  Jonny 333
    1248  Kate  634
    1423  Tony  567
    2567  Peter 435
    3567  Alice 535
    1548  Kerry 534
    ...
Here is how to do it:
    vi:    s/^\(1.\)/\12/g
    emacs: ^\(1.\) replaced by \12
    perl:  perl -pe 's/^(1.)/${1}2/g' phonelist.txt

Now the new phone list looks like this:

    Phone Name  ID
    ...
    3412  Bob   123
    3834  Jonny 333
    12248  Kate  634
    14223  Tony  567
    2567  Peter 435
    3567  Alice 535
    15248  Kerry 534
    ...
Perl can handle more memory variables than just \1 to \9, so \12 would refer to the 12th variable, which is of course empty. To solve this we simply write ${1}.
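If you want to try the renumbering non-interactively, sed accepts the same substitution as the vi command above; there \12 is read as \1 followed by a literal 2. This is a sketch with a hypothetical `phonelist` helper in place of phonelist.txt, and `renumber` is just an illustrative name.

```shell
# Hypothetical helper: prints the sample rows of the old phone list.
phonelist() {
  printf '%s\n' \
    '3412  Bob   123' \
    '3834  Jonny 333' \
    '1248  Kate  634' \
    '1423  Tony  567' \
    '2567  Peter 435' \
    '3567  Alice 535' \
    '1548  Kerry 534'
}

# Remember the leading 1 and the digit after it, then append a 2.
renumber() {
  sed 's/^\(1.\)/\12/'
}

phonelist | renumber
```

Lines that do not start with a 1 pass through unchanged.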
Now the alignment in the list is a bit disturbed. How can you fix it? You could just test whether there is a space in the 5th position and insert another one:
    vi:    s/^\(....\) /\1  /g
    emacs: '^\(....\) ' replaced by '\1  '
    perl:  perl -pe 's/^(....) /${1}  /g' phonelist.txt

Now the phone list looks like this:

    Phone Name  ID
    ...
    3412   Bob   123
    3834   Jonny 333
    12248  Kate  634
    14223  Tony  567
    2567   Peter 435
    3567   Alice 535
    15248  Kerry 534
    ...
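The same fix can be sketched with sed. I assume here that the columns are separated by two spaces, as in a typical fixed-width list; `renumbered` is a hypothetical helper that prints the list after the renumbering step, and `realign` is an illustrative name.

```shell
# Hypothetical helper: the list after the renumbering step.
renumbered() {
  printf '%s\n' \
    '3412  Bob   123' \
    '3834  Jonny 333' \
    '12248  Kate  634' \
    '14223  Tony  567' \
    '2567  Peter 435' \
    '3567  Alice 535' \
    '15248  Kerry 534'
}

# Four characters followed by a space: put the four characters back,
# followed by two spaces instead of one.
realign() {
  sed 's/^\(....\) /\1  /'
}

renumbered | realign
```

Lines with a five digit number have a digit in the 5th position, so the expression does not match them and they are left alone.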
A colleague has manually edited the list and accidentally inserted some spaces at the beginning of some lines. How can we remove them?
    Phone Name  ID
    ...
    3412   Bob   123
      3834   Jonny 333
    12248  Kate  634
    14223  Tony  567
        2567   Peter 435
    3567   Alice 535
     15248  Kerry 534
    ...

This should correct it:

    vi:    s/^  *// (there are 2 spaces, as we do not have a +)
    emacs: '^ +' replaced by the empty string
    perl:  perl -pe 's/^ +//' phonelist.txt
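Here is a sketch of the clean-up with sed, written the vi way with two spaces and a star. The three input lines and their indentation are made up for the example, and `strip_leading_spaces` is a hypothetical name.

```shell
# One space followed by zero or more spaces at the start of the line
# is replaced by nothing.
strip_leading_spaces() {
  sed 's/^  *//'
}

printf '%s\n' \
  '  3834   Jonny 333' \
  '12248  Kate  634' \
  '    2567   Peter 435' | strip_leading_spaces
```

Spaces inside the line are not touched, because the "^" ties the match to the beginning of the line.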
You are writing a program and you have the variables temp and temporary. Now you would like to replace variable temp by the variable named counter. If the string temp is just replaced with counter then temporary becomes counterorary which is not really what you want.
Regular expressions can do it. Just replace temp([^o]) with counter\1. That is, temp followed by a character that is not the letter o. (An alternative solution would be to use word boundaries, but we have not discussed this kind of anchoring pattern.)
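As a sketch, the replacement can be tried with sed in the grep-style syntax. The line of program text is invented for the example and `rename_temp` is a hypothetical name.

```shell
# temp followed by any character except o becomes counter followed
# by that same character.
rename_temp() {
  sed 's/temp\([^o]\)/counter\1/g'
}

printf '%s\n' 'temp = temp + temporary;' | rename_temp
```

Both uses of temp are renamed and temporary is left alone. Note the limits of this simple approach: a temp at the very end of a line has no following character and is therefore not replaced, and a name like temperature would still be changed because its next letter is not an o.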
I hope that this article has caught your interest. Now you might want to have a look at the man pages and documentation of your favorite editor and learn the details.
There are also more special characters, e.g. the alternation, which is a kind of "or", and the word boundaries mentioned above.
Have fun, happy editing.