Fri, 06 Apr 2007
Regular Expressions
A regular expression is a string of characters that defines a set of one or more possible strings of characters. This sounds rather cryptic, and the concept is rather difficult, so it is probably best to introduce the concept using examples.
A brief(-ish) introduction
In the above definition, we are essentially saying that we can match a string of characters to another string of characters. We encounter this situation whenever we use the Find command in an application: suppose we want to know if the phrase "Wankel rotary engine" appears in a document. If we enter the string "Wankel rotary engine" into the Find dialogue of our editor, it is trivial to find any occurrence of the phrase in a document. The string "Wankel rotary engine" matches the string "Wankel rotary engine" (including the spaces). A string matching itself is a trivial example; the power of regular expressions comes from the fact that one string can match many other strings, in an organised, regular way.
Suppose that instead of searching for an exact string, we wish to find strings that vary but which have certain regular characteristics. An example would be telephone numbers. Imagine that we have been given the task of finding all of the telephone numbers that occur in a document. We can assume that they appear in a standard form for (UK) numbers, with an area code. An example would be "(01865) 776868". Of course, there are other ways to represent telephone numbers with strings (one could omit the area code, or leave out the brackets, or use dashes rather than spaces, etc.), but we will assume that this is the only form that we have to match. We can describe the string that represents a telephone number in words: a telephone number consists of an opening bracket, followed by the digits '01', followed by three more digits, followed by a closing bracket and a space, followed by six digits. It is not possible to ask a computer to search for this description of a string in English, but we can use another language, that of regular expressions, to tell the computer what to look for. To match this definition of telephone numbers, we can use the regular expression:
\(01[0-9]{3}\) [0-9]{6}
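The expression can be tried outside an editor as a quick check. The following is a minimal sketch in Python, whose re module accepts the same syntax for this pattern. The sample text, including the second telephone number, is made up for illustration.

```python
import re

# The pattern from the text: "(01", three more digits, ")", a space, six digits.
UK_PHONE = re.compile(r"\(01[0-9]{3}\) [0-9]{6}")

# Hypothetical sample text; only the first number comes from the article.
text = "Call (01865) 776868 or (01234) 567890, but not 01865 776868."

matches = UK_PHONE.findall(text)
print(matches)
```

The last number in the sample is skipped because it has no brackets, which is exactly the restriction described above.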
Now consider the case where we also have to find telephone numbers of the form "(0131) 226 7232". We can now expand our definition of a telephone number: a telephone number consists of an opening bracket, followed by the digits '01', followed by two or three more digits, followed by a closing bracket and a space, followed by three digits, followed by either three more digits or by a space and four more digits. Again, we can create a regular expression that will match strings like this:
\(01[0-9]{2,3}\) [0-9]{3}\s?[0-9]{3,4}
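A range of repetitions is written with a comma between the two counts, as in {2,3}. The sketch below, in Python, checks an expression of this shape against both number formats. The third sample string is invented to show a near miss.

```python
import re

# "01" plus two or three digits in brackets, then three digits,
# an optional space, and three or four more digits.
UK_PHONE = re.compile(r"\(01[0-9]{2,3}\) [0-9]{3}\s?[0-9]{3,4}")

samples = ["(01865) 776868", "(0131) 226 7232", "(01865) 7768"]
results = [bool(UK_PHONE.fullmatch(s)) for s in samples]
print(results)
```

Note that the expression is slightly more permissive than the description in words: it would also accept a space followed by only three digits.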
Regular expressions with Dreamweaver
Using regular expressions in Dreamweaver
What Dreamweaver gets right
I very much like Dreamweaver's large, multi-line input areas for Find and Replace terms. If you have a complicated query, especially one spanning several lines, it is much easier to check your input visually than in most text editors, which often provide only a small dialogue box. TextPad's Replace dialogue, for instance, has very small input boxes of only one line each.
Speed
Search and replace in Dreamweaver is powerful and easy to use, but it is also slow (even with a 3.0GHz Pentium 4 and 1GB of RAM). Often this slowness is of no practical consequence: if a search and replace encompasses just a few files, and relatively few matches are made, the limiting factor is likely to be the speed at which you can formulate and type the search and replace query, rather than Dreamweaver. However, if you are using regular expressions to update an entire site, which comprises hundreds of files, or if there are many thousands of matches for a term in a file, then the speed at which Dreamweaver executes a search and replace is likely to precipitate a coffee break.
Stability
Better search and replace
Later versions of Dreamweaver may well have better implemented search and replace, but I have not tried them and, if you have no control over the version of Dreamweaver that you use, this is of little help in any case. There are alternative tools available that improve upon Dreamweaver in some respects. I use a text editor called TextPad which is considerably faster than Dreamweaver for search and replace. A regex which takes 30 seconds to a minute in Dreamweaver is carried out almost instantaneously in TextPad.
Regular expression examples
Adding markup to plain text figure captions
Have a huge number (222) of text files, each of which contains the caption information for a figure. The caption information is just plain text, with no markup. Want to make the word 'Figure' and the figure number bold, leaving everything else alone.
Want to replace
Figure x.y caption text
with
<b>Figure x.y</b> caption text
where x and y are numbers between 1 and 19.
Find
(Figure [1-9][0-9]?\.[1-9][0-9]?)
Replace
<b>$1</b>
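For anyone running the same replacement from a script rather than from Dreamweaver, here is a rough Python equivalent. The caption text is made up, and note that Python writes the back-reference as \1 where Dreamweaver uses $1.

```python
import re

# Made-up caption; the real files each hold one caption like this.
caption = "Figure 12.3 Structure of an antibody molecule"

# Wrap "Figure x.y" in bold tags and leave the rest of the caption alone.
bolded = re.sub(r"(Figure [1-9][0-9]?\.[1-9][0-9]?)", r"<b>\1</b>", caption)
print(bolded)
```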
Converting spaces in attributes to underscores
Have id and link (href) attributes with spaces in them. Two problems:
- Need to escape spaces in URLs or they will be malformed.
- Can't have spaces in id properties (well, you can, but it won't validate as strict xhtml).
The practical solution to both of these problems is to replace the spaces in the attributes with underscores.
Want to replace
<li id="Fresh Frozen Plasma">
with
<li id="Fresh_Frozen_Plasma">
The id attribute has a maximum of three words, and a minimum of one.
First go
Find
<li id="(\w+)\s
Replace
<li id="$1_
Run the above search and replace n-1 times, where n is the maximum number of words in any id attribute.
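The repeated passes work because \w also matches the underscore, so on the second pass the group swallows the words already joined and stops at the next space. A small Python sketch of the loop follows; the function name and the max_words parameter are my own, for illustration only.

```python
import re

def underscore_li_ids(html: str, max_words: int = 3) -> str:
    """Apply the find/replace above max_words - 1 times."""
    for _ in range(max_words - 1):
        # Each pass turns the first remaining space in an id into an underscore.
        html = re.sub(r'<li id="(\w+)\s', r'<li id="\1_', html)
    return html

print(underscore_li_ids('<li id="Fresh Frozen Plasma">'))
```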
Second attempt
Find
<li id="(\w+)\s?(\w*)\s?(\w*)">
Replace
<li id="$1_$2_$3">
Then need two more Find/Replace operations
Find
__
Replace
leave blank
Find
_"
Replace
"
Equally, could have just run second find and replace operation twice.
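The second attempt and its two clean-up steps can be sketched in Python as follows. The function name and the two shorter sample ids in the usage line are invented for illustration.

```python
import re

def underscore_ids(html: str) -> str:
    # Step 1: join up to three words with underscores; missing words
    # leave empty groups, and therefore stray trailing underscores.
    html = re.sub(r'<li id="(\w+)\s?(\w*)\s?(\w*)">', r'<li id="\1_\2_\3">', html)
    # Step 2: one-word ids now end in two underscores.
    html = html.replace("__", "")
    # Step 3: two-word ids now end in one underscore before the quote.
    html = html.replace('_"', '"')
    return html

for sample in ('<li id="Fresh Frozen Plasma">', '<li id="Red Cells">', '<li id="Albumin">'):
    print(underscore_ids(sample))
```

One caveat: the clean-up steps are plain global replacements, so they would also strip a double underscore, or an underscore before a quote, anywhere else in the file.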
Replacing a custom paragraph class with a standard heading
Want to replace
<p class="main">Some text</p>
with
<h3>Some text</h3>
Find
<p class="main">(.*)</p>
Replace
<h3>$1</h3>
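Here is a sketch of this replacement in Python. The capture group below uses a lazy .*?, which is my assumption about what the group should contain, so that two paragraphs on the same line are converted separately rather than swallowed as one match. The second paragraph in the sample is made up.

```python
import re

html = '<p class="main">Some text</p><p class="main">More text</p>'

# Lazy .*? stops at the first closing </p> instead of the last one on the line.
converted = re.sub(r'<p class="main">(.*?)</p>', r"<h3>\1</h3>", html)
print(converted)
```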
Adding an id attribute to a list element
Want to add an id attribute to each list element in a document so that we can reference it from an external link. The id will be all of the text after the <li> tag to the end of the line (each list element occupies exactly one line, and the terminating tags are on another line).
Want to replace
<li> Text to end of line
with
<li id="Text to end of line"> Text to end of line
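The first character of the text after '>' must not be '<'. The group must also run to the end of the line rather than stopping at the first space, and must not include the leading space.
Find
<li>\s*([^<\s][^<\r\n]*)
Replace
<li id="$1"> $1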
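The following Python sketch shows one possible pattern that captures everything up to the end of the line. It skips any whitespace after the tag, refuses to start on '<', and then takes all characters up to the line break. The sample list, including the item that holds a link, is invented for illustration.

```python
import re

# Skip leading whitespace, then capture text that does not begin with "<"
# and runs to the end of the line.
LI_TEXT = re.compile(r"<li>\s*([^<\s][^<\r\n]*)")

def add_li_ids(html: str) -> str:
    return LI_TEXT.sub(r'<li id="\1"> \1', html)

sample = '<ul>\n<li> Text to end of line\n</li>\n<li><a href="x.html">link</a>\n</li>\n</ul>'
print(add_li_ids(sample))
```

The resulting id values still contain spaces, so they would need the underscore treatment described earlier before the page validates.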
Closing image tags
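This is to do with the dread topic of validation. Before XHTML there was no requirement to close tags (i.e. to make sure that every opening tag, e.g. <p>, was matched by a closing tag, e.g. </p>). In XHTML, empty elements such as <img> must be closed as well, which is done by ending the tag with />.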
Find
<img([^/>]*)>
Replace
<img$1 />
Note that the above example only works if there are no forward slashes anywhere in the image tag, so if the image path has a slash in it then you're outta luck here. I'll add a method that works for those kinds of paths when I can get around to it.
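In the meantime, one possible approach is to stop excluding slashes and instead match everything up to the closing '>', with an optional '/' just before it. This is a sketch in Python and has not been tested in Dreamweaver. It assumes no attribute value contains a '>' character, and the function name and sample tags are made up.

```python
import re

# Capture the attributes lazily, then allow optional whitespace and an
# optional "/" before the closing ">" so already-closed tags are left correct.
IMG_TAG = re.compile(r"<img\b([^>]*?)\s*/?>")

def close_img_tags(html: str) -> str:
    return IMG_TAG.sub(r"<img\1 />", html)

print(close_img_tags('<img src="/images/fig1.gif" alt="Fig 1">'))
print(close_img_tags('<img src="a.gif" />'))
```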
Replacing dummy links with id attributes
Find
<h2><a name="(\w*)">([\w\s]*)</a></h2>
Replace
<h2 id="$1">$2</h2>
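A quick Python sketch of this replacement follows, using made-up headings. Because the second group only allows word characters and whitespace, a heading containing punctuation is left untouched, which is worth knowing before running it over a whole site.

```python
import re

PATTERN = r'<h2><a name="(\w*)">([\w\s]*)</a></h2>'
REPLACEMENT = r'<h2 id="\1">\2</h2>'

plain = '<h2><a name="intro">A brief introduction</a></h2>'
punctuated = '<h2><a name="faq">What\'s new?</a></h2>'

converted = re.sub(PATTERN, REPLACEMENT, plain)
unchanged = re.sub(PATTERN, REPLACEMENT, punctuated)
print(converted)
print(unchanged)
```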
Replace a custom paragraph tag with standard header
Find
<p class="smallheading"><a name="([^"]*)" id="[^"]*"></a>([^<]*)</p>
Replace
<h3 id="$1">$2</h3>
Change class of dates
Have a series of news articles that include a date; at present, the date is contained in a separate paragraph, but the class used for that paragraph is the same used for plain vanilla paragraphs of text. Thus there is no way to specify a different type of text formatting to make the date more distinctive.
The date is in the form <p class="main">Month Year</p>, where Month is written out (e.g. November) and Year is a four-digit number (e.g. 2006).
Want to replace
<p class="main">Month Year</p>
with
<p class="date">Month Year</p>
Find
<p class="main">([^\s]*\s200[0-9]{1})</p>
Replace
<p class="date">$1</p>
Note that I have been lazy and the above example will only work for dates between 2000 and 2009. That was fine for my particular application, but the regex would need to be modified to cover other ranges.
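As an illustration of how the range could be widened, the Python sketch below accepts any year from 1900 to 2099. That range is my own assumption, as are the function name and the sample paragraphs.

```python
import re

# One word (the month), a space, then a year starting with 19 or 20.
DATE_PARA = re.compile(r'<p class="main">(\S+\s(?:19|20)[0-9]{2})</p>')

def mark_dates(html: str) -> str:
    return DATE_PARA.sub(r'<p class="date">\1</p>', html)

print(mark_dates('<p class="main">November 2006</p>'))
print(mark_dates('<p class="main">Ordinary text</p>'))
```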
Changing links to pass variables
Want to replace
<a href="/sub/chap01/f01-01.jpg">
with
<a href="/scripts/roitt/figure.asp?chap=01&fig=f01-01">
Find:
<a href="/sub/chap([0-9]{2})/([^.]*)\.jpg">
Replace:
<a href="/scripts/roitt/figure.asp?chap=$1&fig=$2">
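The same rewrite can be sketched in Python as below. The dot before jpg is escaped here so that it matches only a literal full stop, and the back-references are written \1 and \2 in Python's syntax.

```python
import re

html = '<a href="/sub/chap01/f01-01.jpg">'

# Group 1 is the two-digit chapter number, group 2 the file name without extension.
converted = re.sub(
    r'<a href="/sub/chap([0-9]{2})/([^.]*)\.jpg">',
    r'<a href="/scripts/roitt/figure.asp?chap=\1&fig=\2">',
    html,
)
print(converted)
```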
External links
- Introduction to Regular Expressions in Dreamweaver
- The Premier Web Site about Regular Expressions
- Regular Expression Tester