Regular Expressions (Regex)

Regular Expressions are complex looking sequences of characters that allow you to find certain patterns in a string. Once these patterns have been found they are often used in a search and replace type of operation. Each of the characters in a regular expression fulfills a certain function and depending on the sequence of these characters will result in a different outcome.

So Why Regex?

Recently I embarked on a project to create an automated approach to file matching (a.k.a fuzzy matching). What I found is that most fuzzy matching approaches worked on the premise that the address is written in the same way on different systems and that all you had to do was find differences in typographical errors (typos), or omissions/additions within a string. From my own experiences there are numerous instances where this won’t work as a universal approach.


EG IconAn apartment can have an address written in various ways:

– 3/F1 Albert st
– 3rd Floor, Flat 1 Albert street
– 3rd Floor 1st Flat Albert St
– Flat 13 Albert street


These are all the same address. I could try to do the standard approaches as one or more of the following:

  1. Change casing
  2. Remove Numbers
  3. Remove Letters
  4. Remove all spaces
  5. Remove non-alphanumeric characters
  6. Output elements of the address as a single text string (e.g. first initial, surname, address 1, postcode)
  7. Match using one of the standard algorithms to check misspellings, phonetic differences and edit distance between strings.

In most cases above this would be an improbable match.

What I tried to consider is what rules we follow when we do a mental address match (i.e. eye-ball 2 addresses and determine if they are the same). Our brains work overtime to analyse them – would it be possible to rewrite our brain’s processes as a series of rules? What I turned to was Regex. Regex has a lot of power and flexibility, and will do a lot more than your standard like statements. Once you get a grasp on the syntax and usage you will find it useful for all sorts of things.


Quick TipThe following is a link to a great one-page cheat sheet on Regular Expressions found on the following website Cheatography.com (opens in a new window). I suggest you print it off if you want to try and understand some of my examples below. It is a great reference all in one place.


One of the best examples of Regex  that I have seen comes from Jan Goyvaerts on his site called Regular-Expressions.info.
He claims that the following regular expression will validate 99% of email addresses.

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

If you want to review the article you can look here – Read Full Article (opens in a new window).

The article shows that Regex can be succinct and efficient but more importantly it also shows examples of where expressions can quickly become unruly.


Quick TipThe best way to learn about regular expressions is to get the one-pager above and then get your hands dirty. If you have a copy of SQL Server Management Studio you can include the Regex Class and start using it in your SQL queries. See the following page for implementation instructions. Then open a website with some simple examples (click here to open in new window) and practice them in your version of SQL Server Management studio.


Some examples of my approach:


— find all addresses with a letter prefix of ‘u’ or ‘a’ (unit or apartment) followed by a number (optional letter) (e.g. u 11a smith street => separate into ’11a’, ‘smith’, ‘street’)
— this is a lookahead assertion to say that if we find a letter after a number there shouldn’t be another letter after that
— (?=([^a-za-z\x27\x2f]{1}|\z))
— the “\x27” stands for an apostrophe… and will catch things like 11 o’brien street, which shouldn’t be a number/letter combination

update [schema].[tablename]
set ad_house_num = ltrim([schema].[regexmatch](‘(?<=\a(u|a|f)[\s-.]{0,3})(\d+)[a-za-z]{0,1}(?=([^a-za-z\x27]{1}|\z))’,working_line,1)),
working_line = ltrim(rtrim([schema].[regexreplace]([schema].[regexreplace](a.working_line,’\a(u|a|f)[\s-.]{0,3}(\d+)[a-za-z]{0,1}(?=([^a-za-z\x27]{1}|\z))’,”),’\a[\s/-]{0,4}’,”)))
from [schema].[tablename]  a
where [schema].[regexismatch](‘\a(u|a|f)[\s-.]{0,3}(\d+)[a-za-z]{0,1}(?=([^a-za-z\x27]{1}|\z))’,a.working_line,1) =1
and working_line is not null
and ad_house_num is null
and flg_rec_error is null


— now find most conventional approach “level #

update [schema].[tablename]
set working_line =  [schema].[regexreplace]([schema].[regexreplace](ltrim(rtrim([schema].[regexreplace](a.working_line,'[-./]{0,1}(level|floor|lvl|storey|story|deck)[\s-.]{0,3}(\d+)[-./]{0,3}’,’ ‘))),'(\s){2,}’,’ ‘),’\a[\s/-]{0,4}’,”),
ad_bldg_lvl = [schema].[regexreplace]([schema].[regexmatch](‘(level|floor|lvl|storey|story|deck)[\s-.]{0,3}(\d+)[a-za-z]{0,1}’,a.working_line,1),'[^0-9]’,”)
from [schema].[tablename] a
where [schema].[regexismatch](‘(level|floor|lvl|storey|story|deck)[\s-.]{0,3}(\d+)’,a.working_line,1)=1
and ad_bldg_lvl is null
and flg_rec_error is null;


— find all addresses with a number and a letter in the working_line and set working line to null (e.g. 11a smith street => separate into ’11a’, ‘smith’, ‘street’)

update [schema].[tablename]
set ad_street_num = [schema].[regexreplace](ltrim(rtrim([schema].[regexmatch](‘(\a((\d+)[\s/]{0,1}[a-za-z]{1}([^a-za-z\x27\x2f]{1}|\z)))’,working_line,1))),'[\s/]{1,}’,”),
ad_street_name = ltrim(rtrim([schema].[regexreplace]([schema].[regexreplace]([schema].[regexreplace](working_line,'(\w+)$’,”),'(\a((\d+)[\s/]{0,1}[a-za-z]{1}))’,”),'(\s){2,}’,’ ‘))),
ad_street_suffix = ltrim(rtrim([schema].[regexmatch](‘(\w+)$’,working_line,1))),
working_line = null
from [schema].[tablename]
where [schema].[regexismatch](‘(\a((\d+)[\s/]{0,1}[a-za-z]{1}([^a-za-z\x27\x2f]{1}|\z)))’,working_line,1)=1
and ad_street_num is null
and ad_street_name is null
and ad_street_suffix is null
and flg_rec_error is null
and [schema].[regexreplace](working_line,'(\d+)’,’xxxx’) not like ‘%xxxx%xxxx%’
and [schema].[regexismatch](‘\a\d+[\s]{1}[a-za-z]{1}’,working_line,1)=0  ;


— for any working line without a number but that has a recognised suffix split it into ad_street_name and ad_street_suffix (e.g. dawson highway => ‘dawson’, ‘hwy’)

update [schema].[tablename]
set ad_street_suffix = upper(b.d_shortname),
ad_street_name = ltrim(rtrim([schema].[regexreplace]([schema].[regexreplace]([schema].[regexreplace](working_line,'(\w+)$’,”),’\d+’,”),'(\s){2,}’,’ ‘))),
working_line = null
from [schema].[tablename] a
left outer join (select distinct shortname as d_shortname from [schema].[streetsuffix]) b on upper(ltrim(rtrim([schema].[regexmatch](‘(\w+)$’,a.working_line,1)))) = b.d_shortname
where b.d_shortname is not null
and ad_street_suffix is null
and a.working_line is not null
and a.working_line <> ”
and len([schema].[regexreplace](working_line,'[^0-9]{0,}’,”)) = 0 — checks that there aren’t any numbers in the working line


— grab rural mail boxes (australia only) and treat as if a po box (but keep rmb prefix)

update [schema].[tablename]
set ad_street_name = ‘rmb ‘ + [schema].[regexreplace]([schema].[regexreplace]([schema].[regexmatch](‘(?<=(rmb[\s/.-]{1,3}))(([a-za-z]{0,1}(\d+)[a-za-z]{1}(?![a-za-z\x27]))|([a-za-z]{0,1}(\d+)))’,([schema].[regexreplace](working_line2 ,'[\s]{2,}’,’ ‘)),1),'[\[\]()]{0,2}’,”),’c[.\s\x2f-]{0,3}p[.\s\x2f-]{0,3}’,”),
po_box_num = ‘rmb ‘ + [schema].[regexreplace]([schema].[regexmatch](‘(?<=(rmb[\s/.-]{1,3}))(([a-za-z]{0,1}(\d+)[a-za-z]{1}(?![a-za-z\x27]))|([a-za-z]{0,1}(\d+)))’,([schema].[regexreplace](working_line2 ,'[\s]{2,}’,’ ‘)),1),'[^a-za-z0-9]’,”),
working_line2 = case when [schema].[regexismatch](‘\a[a-za-z]{1,3}\z’,(ltrim(rtrim([schema].[regexreplace](working_line2 ,'[(]{0,1}((rmb[\s/.-]{1,3}))(([a-za-z]{0,1}(\d+)[a-za-z]{1}(?![a-za-z\x27]))|([a-za-z]{0,1}(\d+)))[)]{0,1}’,”)))),1)=1 then null
else ltrim(rtrim([schema].[regexreplace](working_line2 ,'[(]{0,1}((rmb[\s/.-]{1,3}))(([a-za-z]{0,1}(\d+)[a-za-z]{1}(?![a-za-z\x27]))|([a-za-z]{0,1}(\d+)))[)]{0,1}’,”))) end
from [schema].[tablename]
where
[schema].[regexismatch](‘(rmb[\s/.-]{1,3})(([a-za-z]{0,1}(\d+)[a-za-z]{1}(?![a-za-z\x27]))|([a-za-z]{0,1}(\d+)))’,([schema].[regexreplace](working_line2 ,'[\s]{2,}’,’ ‘)),1) = 1
and ad_street_name is null
and po_box_num is null;


— find any streets prefixed with the word number, no or # and remove it from working_line2

update [schema].[tablename]
set working_line2 = ltrim([schema].[regexreplace](working_line2 ,'(\a(number|no|num|#)[\s.-]{0,3})’,”))
from [schema].[tablename]
where [schema].[regexismatch](‘\a(number|no|num|#)[\s.-]{0,3}\d+’,working_line2 ,1)= 1;


— replace backslashes between numbers with forward slashes (i.e. 1\7 forsyth st => 1/7 forsyth st)

update [schema].[tablename] set working_line2 = [schema].[regexreplace](working_line2 ,'(?<=\a\d+)[\\]{1}(?=\d+)’,’/’)
from [schema].[tablename] where working_line2 is not null and working_line2 <> ”
and line_1 like ‘%\%’
and [schema].[regexismatch](‘\a\d+[\\]{1}\d+’,working_line2 ,1)=1;


— dissect all obvious addresses, only 1 number and at least 2 words (e.g. 11 smith street => separate into ’11’, ‘smith’, ‘street’)

update [schema].[tablename]
set ad_street_num = [schema].[regexmatch](‘\a(\d+)’,working_line2 ,1),
ad_street_name = ltrim(rtrim([schema].[regexreplace]([schema].[regexreplace]([schema].[regexreplace](working_line2 ,'(\w+)$’,”),’\d+’,”),'(\s){2,}’,’ ‘))),
ad_street_suffix = ltrim(rtrim([schema].[regexmatch](‘(\w+)$’,working_line2 ,1))),
working_line2 = null
from [schema].[tablename]
where [schema].[regexismatch](‘\a(\d+)[\s]{1}\w.*\w[\s]{1}\w.*\w[.]{0,1}(?!.*\d)’,working_line2 ,1)=1
and [schema].[regexreplace](working_line2 ,'(\d+)’,’xxxx’) not like ‘%xxxx%xxxx%’
and ad_street_num is null
and ad_street_name is null
and ad_street_suffix is null
and flg_rec_error is null;


— get house type

update [schema].[tablename]
set ad_house_type =
case when [schema].[regexismatch](‘(((unit|unt|ut|uni|uni t|un|unti|uniy|uniot|unir|unnit)[\s-.]{0,3})(\d+))|((?<=\a(u)[\s-.]{0,3})(\d+))’,a.working_line2 ,1) = 1 then ‘unit’
when [schema].[regexismatch](‘(((apartment|apt|apartmnt|apart|ap|appt|aprt|aprtmt|aprtmnt|app|appartment)[\s-.]{0,3})(\d+))|((?<=\a(a)[\s-.]{0,3})(\d+))’,a.working_line2 ,1)=1 then ‘apt’
when [schema].[regexismatch](‘(room|(\a(rm))|(([^a-za-z])rm))[\s-.]{0,3}(\d+)’,a.working_line2 ,1)=1 then ‘room’
when [schema].[regexismatch](‘((shop|shp)[\s-.]{0,3})(\d+)’,a.working_line2 ,1)=1 then ‘shop’
when [schema].[regexismatch](‘((flat)[\s-.]{0,3})(\d+)|((?<=\a(f)[\s-.]{0,3})(\d+))’,a.working_line2 ,1)=1 then ‘flat’
when [schema].[regexismatch](‘((duplex)[\s-.]{0,3})(\d+)’,a.working_line2 ,1)=1 then ‘duplex’
when [schema].[regexismatch](‘((farm|[^a-za-z]rm)[\s-.]{0,3})(\d+)’,a.working_line2 ,1)=1 then ‘farm’
when [schema].[regexismatch](‘((townhouse|town house|twn house|twn hse)[\s-.]{0,3})(\d+)’,a.working_line2 ,1)=1 then ‘townhouse’
when [schema].[regexismatch](‘((studio)[\s-.]{0,3})(\d+)’,a.working_line2 ,1)=1 then ‘studio’
when [schema].[regexismatch](‘((suite)[\s-.]{0,3})(\d+)’,a.working_line2 ,1)=1 then ‘suite’ end
from [schema].[tablename] a
where  [schema].[regexismatch](‘(((studio|unit|unt|ut|uni|uni t|un|unti|uniy|uniot|unir|unnit|apartment|apt|apartmnt|apart|ap|appt|aprt|aprtmt|aprtmnt|app|appartment|shop|shp|townhouse|town house|twn house|twn hse|flat|duplex|farm|suite|room|(\a(rm))|(([^a-za-z])rm)))[\s-.]{0,3}(\d+)[\s-.]{0,3}(\d+))|(((?<=\a(u|a|f)[\s-.]{0,3})(\d+)))’,a.working_line2 ,1)=1
and ad_house_type is null ;


— grab any obvious units,flats,apartments, etc that are at start of address line and separate into ad_house_num

update [schema].[tablename]
set ad_house_num = [schema].[regexmatch](‘((?<=\a(studio|unit|unt|ut|uni|uni t|un|unti|uniy|uniot|unir|unnit|apartment|apt|apartmnt|apart|ap|appt|aprt|aprtmt|aprtmnt|app|appartment|shop|shp|townhouse|town house|twn house|twn hse|flat|duplex|room|(\a(rm))|(([^a-za-z])rm)|suite)[\s-.]{0,3})(((\d+)[\s/.]{0,2}[a-za-z]{1}([^a-za-z\x27\x2f]{1}|\z))|(\d+)|((?<=\a(u|a|f)[\s-.]{0,3})(\d+)[a-za-z]{0,1})))’,working_line2 ,1),
working_line2 = ltrim(rtrim([schema].[regexreplace]([schema].[regexreplace](a.working_line2 ,'(\a(studio|unit|unt|ut|uni|uni t|un|unti|uniy|uniot|unir|unnit|apartment|apt|apartmnt|apart|ap|appt|aprt|aprtmt|aprtmnt|app|appartment|shop|shp|townhouse|town house|twn house|twn hse|flat|duplex|room|(\a(rm))|(([^a-za-z])rm)|suite)[\s-.]{0,3}(((\d+)[\s/.]{0,2}[a-za-z]{1}([^a-za-z\x27\x2f]{1}|\z))|(\d+)|((?<=\a(u|a|f)[\s-.]{0,3})(\d+)[a-za-z]{0,1})))’,”),’\a[\s/-]{0,4}’,”)))
from [schema].[tablename]  a
where [schema].[regexismatch](‘(\a(studio|unit|unt|ut|uni|uni t|un|unti|uniy|uniot|unir|unnit|apartment|apt|apartmnt|apart|ap|appt|aprt|aprtmt|aprtmnt|app|appartment|shop|shp|townhouse|town house|twn house|twn hse|flat|duplex|room|(\a(rm))|(([^a-za-z])rm)|suite)[\s-.]{0,3}(((\d+)[\s/.]{0,2}[a-za-z]{1}([^a-za-z\x27\x2f]{1}|\z))|(\d+)|((?<=\a(u|a|f)[\s-.]{0,3})(\d+)[a-za-z]{0,1})))’,a.working_line2 ,1) =1
and working_line2 is not null
and ad_house_num is null
and flg_rec_error is null
and mbr_code not in (select mbr_code from [schema].[tablename] where line_1 like ‘%unit [0-9]%’ and line_1 like ‘%flat [0-9]%’);


If you have any questions about any of this content please use the contact me form.