PostgreSQL: Deduping Data

As a matter of principle, one must differentiate between clear and fuzzy duplicates. Clear duplicates, i.e. dupes that are completely identical except for capitalisation, are relatively easy to find using SQL queries. For the following query, for example, PostgreSQL delivers all records that match the contents of the 'name' data field:

SELECT tab1.id, tab1.name, tab2.id, tab2.name
FROM tablename tab1, tablename tab2
WHERE tab1.name=tab2.name
AND tab1.id<>tab2.id
AND tab1.id=(SELECT MAX(id) FROM tablename tab
WHERE tab.name=tab1.name)

As one can see, this SQL query requires a column with an ID clearly identifying each record to ensure that a record is not compared with itself. Furthermore, this ID is required to ensure that the record with the biggest ID only appears in the column ‘tab1.id’, but not in the column ‘tab2.id’. This ensures that the record with the biggest ID from a duplicates group is not deleted. The IDs of the records that are to be deleted are written in column 'tab2.id'. This is how it looks when the results are integrated into a DELETE command for PostgreSQL:

DELETE FROM tablename 
WHERE id IN
(SELECT tab2.id
FROM tablename tab1, tablename tab2
WHERE tab1.name=tab2.name
AND tab1.id<>tab2.id
AND tab1.id=(SELECT MAX(id) FROM tablename tab
WHERE tab.name=tab1.name))

Of course, this SQL command can be easily extended to include other data fields in addition to the contents of the 'name' data field, for example, the data fields that contain the postal address.

You can read more about the options SQL provides to search for fuzzy duplicates in the article ‘Fuzzy matching withe SQL'. But only specialised tools that include an error-tolerant (fuzzy) matching can provide a satisfactory solution to this problem, such as DataQualityTools and DeduplicationWizard, available at www.DataQualityApps.com. You can read about how to use the DeduplicationWizard to search for duplicate addresses within a table in the article 'Deduplication with the DeduplicationWizard'. You can learn how to use the DataQualityTools to search for duplicate addresses between two tables in the article 'Processing black lists with DataQualityTools'. And in the article 'Scheduled deduplication with the BatchDeduplicator', you can read about how de-duplication can be automated.

Author: Thomas Hainke