Fuzzy Matching with SQL
Most database servers implement the SOUNDEX command as an algorithm for phonetic searches. SOUNDEX represents the letters of a word as a four-character string according to their sound in the English language. This simple algorithm sometimes delivers reasonably good results. Thus, for example, 'Smith' and 'Smythe' are recognised to be identical. The algorithm also delivers some good results in languages other than English. Thus, 'Maier', 'Mayer', 'Mayr' and 'Mair' are recognised to be identical.
However, the algorithm is configured to compare individual words, so that already the comparison of 'Ken Smith' and 'Smith Ken' will not appear in the results. In addition, this algorithm is language dependent. The representation of the examined words as a 4-character string is also rather rough, so that sometimes strange results are obtained. For example, 'Hilbert' and 'Heilbronn' or 'Knut' and 'Kant' are recognised to be duplicates.
An SQL query to dedupe a database using the SOUNDEX command could look like this:
SELECT tab1.id, tab1.name, tab2.id, tab2.name
FROM tablename tab1, tablename tab2
WHERE SOUNDEX(tab1.name)= SOUNDEX(tab2.name)
AND tab1.id<>tab2.id
AND tab1.id=(SELECT MAX(id) FROM tablename tab
WHERE SOUNDEX(tab.name)=SOUNDEX(tab1.name))
Seeing as the results of this query probably contain hits that are not actually hits, and you do not want to randomly delete or keep records from a duplicates group, the results cannot be processed any further without editing. There is also the fact that such a query always matches all of the records with each other, which of course lengthens the time required for the query.
Therefore, in most cases, the SOUNDEX command in SQL is not a feasible method to deduplicate a database. Only specialised tools that include an error-tolerant (fuzzy) matching can provide a satisfactory solution to this problem, such as DataQualityTools and DeduplicationWizard, available at www.DataQualityApps.com. You can read about how to use the DeduplicationWizard to search for duplicate addresses within a table in the article 'Deduplication with the DeduplicationWizard'. You can learn how to use the DataQualityTools to search for duplicate addresses between two tables in the article 'Processing black lists with DataQualityTools'. And in the article 'Scheduled deduplication with the BatchDeduplicator', you can read about how de-duplication can be automated.
Author: Thomas Hainke



