Regex tutorial (Part 4) by Gerd Ewald

How to Use Regular Expressions (4)
Translated from his original German version by Gerd Ewald
edited by Marck D. Pearlstone


6. Special Elements - Part 2

6.1 Assertion
6.2 Backreference
6.3 Conditional Regular Expressions
6.4 Options, Modifier
6.5 Specials
6.6 Overview and Summary

Did you think chapter 6 would be more complicated than chapter 5? No, calm down: chapter 6 will deal with some elements that may be a bit more complex but on the whole these elements are not too difficult to learn.

6.1 Assertion

When I wrote the German version of this chapter I first had to answer a question: what does 'assertion' mean? The question arose because there is no German word for this part of regex terminology: it's simply called 'assertion'. Even a look into Friedl's book didn't help: he doesn't use this expression. He calls this element of regexian "lookahead". Ok then, let's look at what an assertion can do for us in regular expressions.

An assertion can be used to find out whether characters precede or follow the matched part of a string. Well, that's nothing new. We could do that without an assertion. But: the assertion checks the characters without including them in the match: it does not "eat" the characters.

Let us explain this with an example: we want to match the string 'foo' only if it is followed by 'bar', but we do not want 'bar' to be part of the match! Without an assertion we would have used "foobar", but this regex matches the whole of 'foobar'. If we use the assertion "(?=", the regex looks like "foo(?=bar)". The match now is just 'foo'.

Of course there are negative look-aheads or assertions "(?!" too. Ok, let's try this one on our more or less senseless example: "foo(?!bar)" means that 'foo' will be matched except when 'bar' follows but only 'foo' is the resulting matched string. So it matches 'foo' within 'foolish' while 'foobar' doesn't match at all.
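The two look-ahead examples can be tried outside TB as well. This is only a minimal sketch, assuming Python's re module as a stand-in for the regex machine; the sample strings are made up for illustration.

```python
import re

# Positive look-ahead: 'foo' only when 'bar' follows; 'bar' is not consumed.
m = re.search(r"foo(?=bar)", "a foobar here")
lookahead_match = m.group(0)   # just 'foo'
lookahead_end = m.end()        # the match stops before 'bar'

# Negative look-ahead: 'foo' only when 'bar' does NOT follow.
in_foolish = re.search(r"foo(?!bar)", "foolish")
in_foobar = re.search(r"foo(?!bar)", "foobar")

print(lookahead_match, lookahead_end, in_foolish.group(0), in_foobar)
```

The end position shows that 'bar' was checked but not "eaten": the match ends right after 'foo'.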

But be careful, there is a trap: one might think that "(?!foo)bar" only matches 'bar' if 'foo' does not precede it. Wrong!! This regex matches every 'bar', with no exception. The assertion is a look-ahead: it checks the characters that follow the current position, and those are exactly the three characters that "bar" has to match next. Whenever they are 'bar' they cannot be 'foo', so the assertion never rejects anything. Ok? Did you understand this?

What we need is a look-behind assertion! And, as if by magic, here is one: "(?<=" is a positive look-behind assertion and "(?<!" is the negative version.

Example:
"(?<!foo)bar" matches 'bar' if 'foo' does not precede it
"(?<=foo)bar" matches 'bar' only if 'foo' precedes it.

You can use assertions in alternatives, e.g.: "(?<=proba|possi)bility" So, 'bility' is matched if either 'proba' or 'possi' precedes this. But it would match 'bility' as well if 'impossi' preceded the string. But I think you expected that ;-))
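To see the trap and the look-behind assertions side by side, here is a small sketch. It assumes Python's re module instead of the regex machine in TB, and the test words are only examples.

```python
import re

text = "foobar"

# The trap: the look-ahead inspects the same three characters that 'bar'
# has to match, so it can never find 'foo' there.
trap = re.search(r"(?!foo)bar", text)

# Look-behind assertions check what precedes the current position.
neg_behind = re.search(r"(?<!foo)bar", text)        # 'foo' precedes: no match
pos_behind = re.search(r"(?<=foo)bar", text)        # 'foo' precedes: match
neg_behind_ok = re.search(r"(?<!foo)bar", "crowbar")

# Alternatives inside a look-behind.
bility = re.compile(r"(?<=proba|possi)bility")
words = ["probability", "impossibility", "ability"]
hits = [w for w in words if bility.search(w)]

print(trap.group(0), neg_behind, pos_behind.start(), hits)
```

As expected, 'impossibility' is among the hits because 'possi' directly precedes 'bility' there too.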

There is a restriction on the search patterns allowed in look-behind assertions: the length must be fixed, which means you can't use quantifiers of variable length such as "+" or "*". "(?<=\d+,)\d\d" would produce a syntax error. The different branches of alternatives in a look-behind may have different lengths, but each length has to be fixed! (Look-ahead assertions are not restricted in this way.)
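The fixed-length rule can be observed in other engines too. The following sketch assumes Python's re module, which is not the engine used in TB; as far as I know it is even stricter and also rejects branches of different lengths, so only equal-length branches are shown. The helper name compiles is made up for this example.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

variable_lookbehind = compiles(r"(?<=\d+,)\d\d")      # variable length: rejected
fixed_lookbehind = compiles(r"(?<=\d{2},)\d\d")       # fixed length: accepted
variable_lookahead = compiles(r"\d\d(?=,\d+)")        # look-ahead is not restricted

print(variable_lookbehind, fixed_lookbehind, variable_lookahead)
```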

Furthermore, branches of different (but still fixed) lengths are permitted only at the top level of the assertion. "What's that?" Have a look at the example:

"(?<=ab(c|de))" is not permitted, "(?<=abc|abde)" is permitted. The second opening parenthesis makes us leave the top level, causing the assertion to become unpredictable for the regex machine.

You may use assertions in a sequence and you may nest them:

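Example: "(?<=\d{2}\.\d{2})(?<!00\.00)\s+Payment" matches ' Payment' (including the leading whitespace) if it is preceded by an amount, as long as that amount is not '00.00'. Both assertions are look-behinds, and they are checked one after the other at the same position.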

Or: "(?<=(?<!im)possi)bility" matches 'bility' only if preceded by 'possi'. But it won't match if preceded by 'impossi'.
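A sequence of assertions can be sketched like this. The example assumes Python's re module rather than TB, uses two look-behind assertions one after the other, and the sample lines are invented.

```python
import re

# First look-behind: an amount like 12.50 must precede.
# Second look-behind: that amount must not be 00.00.
pattern = re.compile(r"(?<=\d{2}\.\d{2})(?<!00\.00)\s+Payment")

paid = pattern.search("Total 12.50 Payment received")
zero = pattern.search("Total 00.00 Payment received")

print(repr(paid.group(0)), zero)
```

Both assertions look at the same five characters in front of the whitespace; neither of them becomes part of the match.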


6.2 Backreference

In an earlier chapter we learned something about subpatterns. These were characters that were stored in some kind of variables. I wrote "... are stored in a temporary variable for further use…". Well, now we will see what 'further use' means:

Let's take a regex to explain what I mean: "(sens|respons)e and \1ibility". The first part will find either 'sense' or 'response'. The 'sens' or 'respons' it matches is stored in the first subpattern (you remember - the first opening parenthesis?). Then it has to be followed by 'e and '.

Next comes "\1". This means: use the content of the first subpattern as part of the search pattern. Because this is followed by 'ibility', the regex continues with either 'sensibility' or 'responsibility', depending on what was found at the beginning of the string. So 'sense and sensibility' or 'response and responsibility' would have a successful match but never 'sense and responsibility'.
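The backreference can be checked with a few lines of code. This is a minimal sketch that assumes Python's re module in place of the regex machine in TB.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

same1 = pattern.fullmatch("sense and sensibility")
same2 = pattern.fullmatch("response and responsibility")
mixed = pattern.fullmatch("sense and responsibility")   # \1 is 'sens' here

print(same1.group(1), same2.group(1), mixed)
```

The content of the first subpattern is available as group 1 after the match, which is the same text that "\1" had to find again.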

Some restrictions: the backreference must not appear in the subpattern it is related to: "(a\1)" would never give a positive match. On the other hand, if the subpattern is followed by a quantifier, it is allowed: "(da|de\1)+" This matches 'dadadada' or 'dadeda' or 'dadedadadada'.


6.3 Conditional Regular Expressions

This is an element that is not very common: conditional regular expressions. The principle is: "If pattern A is found, look for pattern B; if not then look for pattern C".

The correct syntax is "(?(condition)Yes-Pattern|No-Pattern)" or "(?(condition)Yes-Pattern)"

But there is a restriction to the condition pattern: it has to be either a sequence of digits or an assertion.

What is it good for? Let's assume we have to extract a date from some text. But for whatever reason it could be a European DD.MM.YYYY or English DD, MMM YYYY formatted date. We only know that at the beginning of the line there is either 'Datum' for the European (German) version or 'Date' for the English one and the date terminates the line.

What we want is a regex that matches the English formatted date if it is preceded by 'Date', otherwise it should match the European formatted one:

"(?(?=^Date)Date:\s(\d+),\s([A-Za-z]{3})\s(\d{4})$|Datum:\s(\d{2}\.)(\d{2}\.)(\d{4})$)"

Note: the regex is wrapped due to layout reasons. All must be used as a single long line!

At the beginning of this chapter I told you that the condition has to be either an assertion or a sequence of digits. But these digits are backreferences. The condition wouldn't literally search for digits. So what does it all mean?

Example: we receive mails with 'Name' in the first line followed by a value that is that name. The name may change between mails, but we know that the name will occur again within the mail, and when it does it is related to an attribute we want to extract; let's call it the shoe size.

"Name:\s*(.*)?$" will find the name. This is followed by something we are not interested in. But then the name appears again followed by a colon and the shoe size, which are digits: ".*?(?(1):\s*(\d+))". Note that the condition "(1)" only tests whether subpattern 1 took part in the match; it does not compare the text again (a backreference "\1" in front of the colon would do that). Both parts combined:

"Name:\s*(.*)?$.*?(?(1):\s*(\d+))"

Have a try with the following text:

'Name: James Herriot
Bladibla
James Herriot: 9'

(Note: if you use the regex tester you have to switch on the Singleline option. We will learn about that in a while)
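The digit form of the condition can also be tried in isolation. The sketch below assumes Python's re module, which to my knowledge supports only a subpattern number (or name) as condition and not the assertion form; the mail address is a made-up example.

```python
import re

# (<)? optionally captures an opening bracket as subpattern 1.
# (?(1)>) demands the closing bracket only if subpattern 1 took part.
pattern = re.compile(r"^(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)$")

bracketed = pattern.search("<user@host.com>")
plain = pattern.search("user@host.com")
unclosed = pattern.search("<user@host.com")
unopened = pattern.search("user@host.com>")

print(bracketed.group(2), plain.group(2), unclosed, unopened)
```

The condition does not look at what subpattern 1 contains, only at whether it matched anything at all.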


6.4 Options, Modifier

We've had quite a lot of new vocabulary to learn in regexian. Now let's learn something about modifiers. What do they do? Well, they modify something. But what? Elements of regexian are modified by modifiers. I can hear you shout: "Oh no; after I've learned all that about regex and so many different elements". Don't worry: I am not going to explain all the possible modifiers or options; we will restrict ourselves to those that are essential and most important.

Look at the following options:

i for Caseless
The regex machine is forced to ignore the case of letters. It will search for the pattern ignoring the case (upper/lower) of the pattern

m for Multi-line
The regex machine usually takes the string as a whole line, no matter whether there are any newline characters (\n). The circumflex "^" that indicates the beginning of a line only matches the beginning of the string and the dollar "$" matches the end of the string instead of matching the end of a line or a terminating newline at the end. Go ahead, test it with the regex tester: uncheck the Multi-line option. Then enter the following wrapped text: '

This is a test that has
several lines
with a test
at the end of a line'

and the regex "test$". Nothing will be matched. Switch multi-line on and 'test' will be matched.
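The same experiment can be repeated without the regex tester. This sketch assumes Python's re module, where multi-line is likewise switched off by default and "(?m)" switches it on.

```python
import re

text = "This is a test that has\nseveral lines\nwith a test\nat the end of a line"

single = re.search(r"test$", text)        # '$' only matches at the end of the string
multi = re.search(r"(?m)test$", text)     # '$' also matches at the end of each line

# Count the newlines in front of the match to see which line it is on.
line_index = text[:multi.start()].count("\n")

print(single, multi.group(0), line_index)
```

The 'test' in the first line is never matched because it is followed by ' that has', not by a line end.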

When this option is active the regex machine will indeed recognize each newline character; the text now consists of multiple lines. This is important when we are going to check the entire text of a message in one hit.

s for DotAll
As we learned in one of the first chapters, the dot matches any character other than the newline character. Once this option is set, the dot matches newlines as well. But the dot is a special case: a negated character class that does not mention the newline matches it regardless of this option, e.g.: "[^x]" matches everything except the character 'x', and that includes any newline.
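A short sketch of the difference between the dot and a negated class, assuming Python's re module (its "(?s)" option behaves like DotAll here):

```python
import re

text = "first\nsecond"

dot_default = re.search(r"first.second", text)       # the dot refuses the newline
dot_all = re.search(r"(?s)first.second", text)       # with DotAll it matches
negated_class = re.search(r"first[^x]second", text)  # matches without any option

print(dot_default, dot_all is not None, negated_class is not None)
```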

x for Extended
When this option is enabled the regex machine ignores any whitespace character in a search pattern, except inside a character class. You are now also able to include remarks in the regex: a remark starts with an unescaped #-character and runs to the end of the line. To search for whitespace in this mode you have to escape it, "\ ", or use "\s". Furthermore you can define your own character class that searches for a blank, e.g.: "[ ]"
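Here is what a commented regex in extended mode might look like. It is only a sketch that assumes Python's re module, whose "(?x)" option follows the same idea; the date line is an invented example.

```python
import re

pattern = re.compile(r"""(?x)
    (\d{2}) \.      # day, followed by a literal dot
    (\d{2}) \.      # month, followed by a literal dot
    (\d{4})         # four-digit year
    \ at            # an escaped blank, then the word at
    [ ]             # a blank written as a character class
    \d{2}:\d{2}     # time of day
""")

m = pattern.search("Sent 24.12.2003 at 18:30")
print(m.group(0), m.groups())
```

All the layout whitespace and the remarks are ignored; only the escaped blank and the blank inside the character class are searched for.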

If you have a look in the regex tester's options menu you will find some more options or modifiers. I don't want to explain them all. There are some special options that are explained in books or other tutorials. They are not really necessary for a basic understanding of regular expressions. The four I explained above will be useful to you and sufficient for most purposes.

How do we switch them on? That is easy: you just enter the letter that indicates the option in parentheses in which a question mark precedes the letter: "(?" and ")". E.g.: "(?i)" switches on 'ignore case'. You may combine several options for example: "(?im)" means 'Caseless, Multi-line'. Furthermore you may switch options on or off: "(?im-sx)" switches caseless and multi-line on and Dotall and extended off.

If a letter appears both before and after the "-"-character then the option is switched off. You may use the options anywhere in the regex. They may appear at the beginning as well as in the middle.

"(?i)Test" is the same as "Te(?i)st". In the case where an option is set more than once in the top level part of the regex then the machine will use the setting that comes last in the search pattern. Although you may enter the options at any point in the regex I recommend that you do it at the beginning. (Other regex engines may apply an option placed in the middle only to the part of the pattern that follows it, which is one more reason to put options at the beginning.)

Well, there is no rule without an exception: if an option appears within a subpattern it will only apply to the subpattern: "(a(?i)b)c" matches 'abc' as well as 'aBc'.
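An option that is limited to one part of the regex can be sketched as follows. The example assumes Python's re module; newer Python versions do not accept a bare "(?i)" in the middle of a pattern, so the sketch uses the scoped spelling "(?i:...)", which limits the option to the enclosed part.

```python
import re

# Only the 'b' is matched caselessly; 'a' and 'c' keep their case.
pattern = re.compile(r"a(?i:b)c")

results = {s: pattern.fullmatch(s) is not None
           for s in ["abc", "aBc", "Abc", "abC"]}
print(results)
```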

Something to think about:
"(a(?i)b|c)" is the regex. Does it match a 'C' or only a 'c'? It is obvious that it matches 'aB'….


6.5 Specials

Ok, let's finish this chapter with some special elements and rules that will cross our path in TB only every now and then. I don't want to go into details; this chapter is more like a glossary to look up if a regex "behaves" oddly.

Meta-Characters
In the first chapter of this tutorial I mentioned the meta characters and listed ] and }. These two aren't actually meta characters. You may recall that I asked you to assume they were. If you searched for them as literals you wouldn't need to escape them. But, I always escape them to keep my regex easy to understand. I avoid errors caused by misunderstandings, which I will elaborate here:

Within a character class defined by "[" at the beginning only the following characters are meta characters:
\ to escape
^ to negate a character class but only if the circumflex is the first character to appear within the class
- to indicate a range
] to terminate the character class definition

Ok, now let's have a closer look at the following more or less senseless regex: "[Y-]345]". I wanted to define a class containing the range 'Y' to ']' and the digits 3, 4 and 5. But what happens? Does the regex match 'Z34' or parts of it? No! Instead try 'Y345]' or '-345]'. And here we are, it is matched. The only problem is... that is not what we wanted.

I am going to explain what happened: the first closing square bracket is interpreted as the end of the character class. The regex matches strings beginning with 'Y' or '-' followed by '345]'. Yes, "[Y-\]345]" is the correct solution. What have we learnt? Although "]" is not a metacharacter outside a character class, it's a good idea to escape it every time one searches for it as a literal.
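The difference between the two classes is easy to check. A minimal sketch, assuming Python's re module, which reads these two patterns the same way:

```python
import re

unescaped = re.compile(r"[Y-]345]")    # class [Y-] followed by the literal text 345]
escaped = re.compile(r"[Y-\]345]")     # one class: range Y to ] plus 3, 4 and 5

unescaped_hits = [s for s in ["Y345]", "-345]", "Z34"] if unescaped.search(s)]
escaped_hits = [c for c in ["Y", "Z", "]", "4", "-"] if escaped.fullmatch(c)]

print(unescaped_hits, escaped_hits)
```

With the escaped bracket the hyphen really defines a range, so '-' itself is no longer matched.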

Square brackets
Let's have a closer look at square brackets and special cases. Assuming that caseless is switched on then "[aeiou]" will match 'A' as well as 'a'. But "[^aeiou]" will match 'A' only when caseless is switched off.
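The behaviour of the negated class under caseless can be verified like this (a sketch assuming Python's re module, which treats this case the same way):

```python
import re

vowel_caseless = re.search(r"(?i)[aeiou]", "A")        # 'A' counts as a vowel
non_vowel_caseless = re.search(r"(?i)[^aeiou]", "A")   # so the negated class rejects it
non_vowel_cased = re.search(r"[^aeiou]", "A")          # without caseless 'A' is accepted

print(vowel_caseless is not None, non_vowel_caseless, non_vowel_cased is not None)
```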

Numbers and Digits
We use \d to try to find decimal digits. Of course it is possible to look for characters using hexadecimal or octal character codes. The regex "\x09" matches the character with the hexadecimal code 09.

Octal numbers are a bit more difficult. The syntax itself is quite easy: "\ddd", where each d is an octal digit. The regex searches for the character with an octal code of 'ddd'. Or, and now it gets a bit tricky, for a backreference. The regex machine takes any number lower than 10 as a backreference if the number is not inside a character class. Inside a character class, or if there are not enough parentheses to define a corresponding subpattern, the number is taken as an octal code.

Examples:
\040 is octal 'space'
\40 is octal unless there are enough (at least 40) parentheses defining subpatterns
\6 is always a backreference
\11 could either be a backreference or a 'tab'
\011 is always a 'tab'
\113 is always octal, because there are no more than 99 backreferences allowed
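The unambiguous forms can be tried out as follows. This sketch assumes Python's re module; its rules for the ambiguous cases such as \11 differ from the ones listed above, so only the clear-cut spellings are shown.

```python
import re

hex_tab = re.fullmatch(r"\x09", "\t")          # hexadecimal code 09 is the tab
octal_tab = re.fullmatch(r"\011", "\t")        # leading zero: always octal
octal_space = re.fullmatch(r"\040", " ")       # octal 40 is the space
backref = re.fullmatch(r"(a)(b)\2\1", "abba")  # single digits: backreferences

print(all(m is not None for m in (hex_tab, octal_tab, octal_space, backref)))
```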

And what is \0113??

Restrictions when using Regex
This is just to inform you about restrictions that we have to bear in mind when using regular expressions. You and I, as 'normal' users, won't reach these limits of regexian, but here they are: a Regex must not exceed 65535 bytes. There are no more than 99 subpatterns allowed. The total number of elements - like groups, assertions, options and conditionals - must not exceed 200. Furthermore the length of the entire text that is checked for the pattern is restricted as well, but we won't reach this limit in TB. It is restricted to the value of the system's largest positive integer. Because the Regex machine needs to reserve storage for subpatterns and for quantifiers with undefined length due to recursive processing, the maximum length available will be reduced. But, to be honest, this is not really of interest to TB-Users ;-)


6.6 Overview and Summary

This chapter explained some of the special features of regular expressions. We should know enough about regex by now to be able to use them in TB's macros.

Short summary:

An assertion can check whether characters appear before or after a search pattern without including these characters in the match. These are:
```
        positive lookahead-assertion (?=
        negative lookahead-assertion (?!
        positive lookbehind-assertion (?<=
        negative lookbehind-assertion (?<!
```
You are not allowed to use variable-length quantifiers in look-behind assertions - that would make them unpredictable to the regex machine.
Strings that are matched as subpatterns are available for further use within the same regex. These backreferences can be addressed by "\#" where # is a positional number indicating the parenthesis pair that defines the subpattern.
Assertions or Backreferences may be used to create conditional regex "(?(condition pattern)yes-pattern|no-pattern)"
The vocabulary of regexian can be modified using modifiers or options. They may precede the Regex in parentheses "(?modifier)". We discussed a few of them here:
```
      i for Caseless
      m for Multi-line
      s for DotAll
      x for Extended
```
We learned that some characters become metacharacters in character classes and behave differently.
A regex can search for hexadecimal or octal character codes. Remember though that there is a possible conflict with backreference numbers.
The length of a regex is limited, and so is the number of subpatterns. Even the text to which a Regex is applied must not exceed a certain length. This should only be of minor interest when using regex in TB.

And here are the exercises:

1. Try to define a regex that recognizes doubled words (e.g.: 'the the'). Don't forget that the second word may appear at the beginning of the next line. The Regex should only match words and not parts of words (not 'the theme') and it should ignore case.

2. Ok, now let's try to write a simple version of a subject cleaning regex. 'Re' can be followed by anything and a colon. Then the original subject appears. After that a space could follow and a former subject enveloped in parentheses introduced with 'was:' follows. We would like to extract the original subject. This is a very simple version of a cleaner.

3. We receive mail with order amounts of some product. We need the integer of these amounts (without the decimals). The amounts may be mailed in EUR or $. Well, the problem is that the symbol for the decimals is different in both systems. Furthermore, the sign indicating Thousands is different as well: #,###.##$ or #.###,##EUR. The regex should know which of the versions it has to match.

1:

The most important hints were the word wrapping (multiple lines) and caseless. These are options which we switch on: "(?im)". Finding words should be easy: "[a-z]+" We only look for words without digits. We do not have to define capital letters because we already switched "caseless" on. But we only want to search for whole words: so we have to use \b in front of the pattern. We can't use \b at the end of the word, because it is okay to end a sentence with a word and start the next sentence with the same word. So we need to allow at least one whitespace to follow. "(?im)\b[a-z]+\s+" This is the beginning: we still need something to find the second appearance. We have to store the result of the first match in a variable to have a backreference. Ok, second try: "(?im)(\b[a-z]+)\s+\1" But, this matches 'the theme', which we didn't want to have matched. But this is easy now: after the second appearance nothing but a word boundary may follow and that's it: "(?im)(\b[a-z]+)\s+\1\b"
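The final regex of this exercise can be tested with a few sample strings. A minimal sketch, assuming Python's re module instead of the regex tester; the sentences are invented.

```python
import re

doubled = re.compile(r"(?im)(\b[a-z]+)\s+\1\b")

same_line = doubled.search("Paris in the the spring")
next_line = doubled.search("end of the\nThe next line")   # wrapped, different case
part_of_word = doubled.search("the theme")                # must not match

print(same_line.group(1), next_line is not None, part_of_word)
```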

2: This is a very simple subject cleaner.
What we defined above as a possible subject should look like 'Re[2]: proper subject we want ;- (was: old subject)'

The beginning of the subject can be matched by "^Re(.*?):" - 'Re' at the beginning of the line, followed by anything or nothing if there is no counter. We could have done this with ".*". But unfortunately this is greedy and might match more than we want.

The original subject can be found after the colon; to make sure that we don't extract redundant whitespace we match them first. "^Re(.*?):\s*(.*?)" The original subject is stored in subpattern 2. Again there is a question mark to avoid greediness. What is missing? Ah, something that matches the old subject: otherwise it would be included in subpattern 2.

A whitespace and then an opening round bracket follows. Then, as we said, it is introduced with 'was:'. Because round brackets are metacharacters we have to escape the literal ones. But this old subject could be missing from time to time so we can't insist on its appearance: "\s*(\(was:.*\))*$"

Or as a whole:
"^Re(.*?):\s*(.*?)\s*(\(was:.*\))*$"

We use the "$"-character to make sure the regex reads the whole line. We can do that because we may expect that the subject is a single line. Those of you who aren't sure should use "\Z" for end of string instead.
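To check the cleaner, here is a small sketch. It assumes Python's re module, writes the literal round brackets around the old subject as "\(" and "\)", and uses invented subject lines.

```python
import re

cleaner = re.compile(r"^Re(.*?):\s*(.*?)\s*(\(was:.*\))*$")

with_old = cleaner.search("Re[2]: proper subject (was: old subject)")
without_old = cleaner.search("Re: proper subject")

# Subpattern 1 holds the counter (if any), subpattern 2 the original subject.
print(with_old.group(1), with_old.group(2), without_old.group(2))
```

In both cases subpattern 2 contains only the original subject, without the old subject and without surrounding whitespace.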

3: Ok, this is an exercise that looks quite artificial. But sometimes one needs stupid examples to clarify something (I remember some exercises in physics when I studied that assumed "one-dimensional cattle" or "weightless Christmas bulbs". These weren't much cleverer than my example ;-)) I agree that this could be done using a different regex but I wanted a conditional one: First of all we need an assertion that looks for something, two digits and a dollar sign: "(?=.*\d{2}\$)" If this exists, it should be the $-version: "([\d,]+)\.\d{2}\$" I simplified the problem and defined a character class that allows only digits and commas. (Well, here a mismatch is possible when there is an arbitrary string with digits, several commas then a dot and two digits followed by a dollar sign. Hmmm, well, ok, you're so clever? You improve it! *g*)

If there is no $-version the Regex should match the EUR-version: "([\d\.]+),\d{2}EUR", which uses the same simplification as above.

The full regex should look like: "(?(?=.*\d{2}\$)([\d,]+)\.\d{2}\$|([\d\.]+),\d{2}EUR)". Did you notice that the EUR-result is stored in the second subpattern while the $-version is stored in the first one?
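For those who want to experiment outside TB: Python's re module, as far as I know, does not accept an assertion as a condition. The sketch below therefore swaps in an equivalent spelling, "(?=A)B|(?!A)C" in place of "(?(?=A)B|C)"; this rewrite is an assumption of the example, not the regex from the exercise, and the order lines are invented.

```python
import re

# (?(?=A)B|C) rewritten as (?=A)B|(?!A)C
amount = re.compile(
    r"(?=.*\d{2}\$)([\d,]+)\.\d{2}\$"      # $-version: integer part in subpattern 1
    r"|(?!.*\d{2}\$)([\d.]+),\d{2}EUR"     # EUR-version: integer part in subpattern 2
)

dollars = amount.search("Order total: 1,234.56$")
euros = amount.search("Bestellsumme: 1.234,56EUR")

print(dollars.group(1), euros.group(2))
```

The subpattern numbers stay the same as in the conditional version: the $-result is in the first subpattern, the EUR-result in the second.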
