This site describes our customer data matching services. It also explains what is involved and the steps that we take to keep the service best in class.

The complete process includes, but is not limited to:

  • Name cleaning – change casing, remove non-alphanumeric characters and symbols, standardise common words
  • Shortening names – remove common words (AND, LTD, PTY, etc.)
  • Splitting addresses – complex regular expressions to fully compartmentalise address details (building, level, block, flat, apartment, etc.)
  • Cleaning addresses – standardise common street endings (street, st, road, rd, etc.)
  • Validating and formatting phone numbers
  • Validating and formatting email addresses
  • Profiling and auditing date of birth fields
  • Fuzzy match – are there typographic errors?
  • Use Soundex to see if the typed words would sound the same
  • Measuring the generalised edit distance between strings
  • Multiple levels of match (e.g. name+address; name+DOB+postcode; name+mobile+postcode) so that 95%+ of all matches are found through clever rules rather than relying solely on fuzzy logic.
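
To make a few of these steps concrete, here is a minimal Python sketch of name cleaning, name shortening, Soundex and an edit distance. It is an illustration only, not the production scripts: the word lists and function names are hypothetical, and the distance shown is the plain Levenshtein distance, a simpler stand-in for a generalised edit distance that weights different kinds of typing error differently.

```python
import re

# Hypothetical word lists; a real job would tailor these to the files.
NOISE_WORDS = {"AND", "LTD", "PTY"}
STANDARD_WORDS = {"LIMITED": "LTD", "PROPRIETARY": "PTY"}


def clean_name(name):
    """Upper-case, replace symbols with spaces, standardise common words."""
    name = name.upper().replace("&", " AND ")
    name = re.sub(r"[^A-Z0-9 ]", " ", name)  # drop anything but letters, digits, spaces
    return " ".join(STANDARD_WORDS.get(w, w) for w in name.split())


def shorten_name(cleaned):
    """Remove common words from an already cleaned name."""
    return " ".join(w for w in cleaned.split() if w not in NOISE_WORDS)


# Standard American Soundex letter groups.
SOUNDEX_CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Four-character Soundex code, or '' if the word has no A-Z letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = SOUNDEX_CODES.get(ch, "")  # vowels map to "" and reset prev
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]
```

With these definitions, clean_name("Smith & Jones Pty. Limited") gives "SMITH AND JONES PTY LTD", and shortening that gives "SMITH JONES". "Robert" and "Rupert" share the Soundex code R163, and "JOHNSON" is one edit away from "JOHNSTON".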

I have four dedicated servers up and running now and I’m fine-tuning the code so that it will be best in class. The purpose of these scripts is to:

  1. Determine if there are duplicate customers in a single file
  2. Determine if the same customers appear in two files
  3. Pinpoint areas of difference between two files (formatting, symbol use, abbreviations, casing, etc)
  4. Categorise the likelihood that the customer is the same person in both cases
  5. Do a data quality check of key fields – email, phone, date of birth, etc
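
As a rough illustration of the first and fourth points, the Python sketch below finds likely duplicates in a single file by trying several levels of match and labels each pair with the strongest level that matched. The field names, the order of the levels and the "high"/"medium" labels are all assumptions made for this example; the real rules are tailored to each job.

```python
from collections import defaultdict

# Hypothetical match levels, strongest first; field names are assumptions.
MATCH_LEVELS = [
    ("high", ("name", "dob", "postcode")),
    ("high", ("name", "mobile", "postcode")),
    ("medium", ("name", "address")),
]


def normalise(value):
    """Upper-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(str(value).upper().split())


def find_duplicates(records):
    """Return {(i, j): label} for record pairs that agree on every field of a level."""
    matches = {}
    for label, fields in MATCH_LEVELS:
        groups = defaultdict(list)
        for idx, rec in enumerate(records):
            key = tuple(normalise(rec.get(f) or "") for f in fields)
            if all(key):  # skip records with a blank or missing key field
                groups[key].append(idx)
        for members in groups.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    # setdefault keeps the label from the strongest level seen first
                    matches.setdefault((members[a], members[b]), label)
    return matches
```

The same idea extends to two files by building the keys from one file and looking up the records of the other. Pairs that no level catches are the ones left for the fuzzy techniques.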

Most of the work lies in determining the differences between the files before matching starts. That is why this tailored service will always be better than an automated approach. Repeat jobs will be cheaper because most of my work is done upfront, after which the processes are repeatable.

