Deduping an EXCEL file

Considering that duplicates can hardly be kept in check by hand in small databases, finding duplicates among large amounts of data, such as those found in databases managed with EXCEL, can only be accomplished using appropriate tools.

As a matter of principle, one must differentiate between clear and fuzzy duplicates. Clear duplicates, i.e. dupes that are completely identical except for capitalisation, are easy to find using EXCEL. EXCEL includes a function for this purpose.

  1. It can be found on the 'Data‘ tab in the ‘Data Tools’ and is called ‘Remove duplicates’.
  2. If only one single cell is marked when starting this function, the selection is automatically extended to the entire table. However, if one single column is marked, EXCEL asks if the selection should be extended to the whole table. If you indicate that the selection is not to be extended, then the duplicates found will only be deleted in the selected column. All of the other columns remain unchanged. If the selection is extended to the whole table, the entire row is deleted, even if not all of the columns were selected to be included in the deduplication.
  3. A list with the columns included in the selection can be found at the bottom of the dialogue for the configuration of the actual dedupe function. If the checkmark is set at ‘My data has headers’, these columns carry over the text from the first row as a designation. Otherwise, they are simply numbered consecutively. You can read more about how to insert column headings in the article 'Excel: Inserting column headings'. If the checkmark is set at one of the columns from the list, the contents of this column will be included in the search for duplicates, if not, it is ignored. Using the 'Select all‘ and 'Unselect all‘ buttons, these selections can easily be set or removed with a single mouse click.
  4. Once the dedupe function is configured, it can be started by clicking on 'OK‘.
  5. EXCEL then executes the function immediately. When EXCEL is finished, the user is informed of the number of duplicates found and how many rows will be left after deleting. Unfortunately, EXCEL does not say which records will be deleted or what they look like. As already mentioned above, EXCEL either deletes the entire row, no matter which columns were selected as criteria for the deduping when the entire table was selected, or only the part of the row whose columns are within the selection.

This way, clear duplicates can easily be deleted from tables using EXCEL. However, EXCEL does not have a function for deleting fuzzy duplicates. You can read about what causes duplicates and what they may look like in the article 'Duplicates and Addresses‘. But only specialised tools that include an error-tolerant (fuzzy) matching can provide a satisfactory solution to this problem, such as DataQualityTools and DeduplicationWizard, available at www.DataQualityApps.com. You can read about how to use the DeduplicationWizard to search for duplicate addresses within a table in the article 'Deduplication with the DeduplicationWizard'. You can learn how to use the DataQualityTools to search for duplicate addresses between two tables in the article 'Processing black lists with DataQualityTools'. And in the article 'Scheduled deduplication with the BatchDeduplicator', you can read about how de-duplication can be automated.

Author: Thomas Hainke