
Common Techniques to Employ in the Identification of Duplicate Data

Duplicate and orphan data plague every organization. Explore proven techniques for discovering, identifying, and eliminating unwanted data records.
Steve Novak, Vice President

Duplicate master data is a common problem in every organization. Identifying and resolving duplicate information on customers, suppliers, items, employees, and other entities is an important activity. Solving the duplicate issue can yield increased buying power, more accurate marketing campaigns, reduced maintenance costs, improved inventory management, and better forecasting: improvements across virtually every aspect of the business.

In a new implementation, if duplicate data is not resolved before or during the migration, it will either load silently into the new ERP application and cause problems once the application is in use, or it will error out during the load and cause problems because the downstream transaction conversions won't be able to load. In an ongoing data management process, it is important to prevent duplicates by defining the matching criteria and keeping scorecard metrics that verify the organization is preventing duplicates from adding to overall data degradation.

The basic concept for identifying a duplicate candidate is straightforward: standardize the fields that identify a unique entity (e.g., a single customer, vendor, or item) and compare them. In practice, depending on the type and cleanliness of the data, the standardization process can be complex.
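The standardize-and-compare concept can be sketched in a few lines. This is a minimal illustration, not a production matcher; the record fields ("name", "city") and the normalization steps are hypothetical examples.

```python
# Minimal "standardize, then compare" sketch: normalize the identifying
# fields, then group records whose normalized keys collide.
from collections import defaultdict

def normalize(value: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(value.lower().split())

def duplicate_candidates(records, key_fields):
    """Group records by normalized key; any group of 2+ is a candidate set."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(normalize(rec[f]) for f in key_fields)
        groups[key].append(rec)
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical customer records: the first two differ only in case/spacing.
customers = [
    {"name": "Acme Corp ", "city": "Dayton"},
    {"name": "ACME  corp", "city": "dayton"},
    {"name": "Globex", "city": "Springfield"},
]
```

Real projects layer the techniques described below (noise words, substitutions, phonetic encoding, and so on) into the `normalize` step.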

There are several techniques that are commonly used to identify duplicates within master data.

  • Noise Word Removal – The process of removing words that don’t add significance to the data or are often incorrect. Common examples of noise words are “the”, “of”, and “inc”.
  • Word Substitution – The process of replacing an existing word or phrase with another word or phrase. It is common to substitute names and abbreviations when identifying duplicates. For example, Tim might be replaced with Timothy, OZ with Ounce, a single quote with foot, and a double space with a single space. Additionally, the context of the word within the current field or in other fields can affect which substitution needs to take place.
  • Case Standardization – The process of making everything the same case.
  • Punctuation Removal – The process of removing all non-alphanumeric characters that don’t add any significance to the field value.
  • Phonetic Encoding – The process of encoding words based on how they sound. For example, “donut” and “doughnut” would be phonetically encoded to the same value. There are several phonetic encoding methods in common use, such as Soundex and Metaphone.
  • Address Standardization – The process of standardizing all of the components of an address prior to comparing values. Usually the process that checks for duplicates utilizes address validation techniques/software/services to make sure the address is valid and to ensure that all of the pieces of the address are formatted uniformly throughout the data.
  • Attribute Standardization – Attribute data can be used to determine a distinct entity, but inconsistencies can make identical records look different at first blush. For example, unit of measure could be recorded in pounds on one record and ounces on another, yet the quantities could be the same once the conversion factor is applied. Or two assemblies could have the same bill of material parts list attached to them. Standardized attribute data can then be used in the identification of duplicates.
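Several of the techniques above can be combined into a single standardization routine. The sketch below is illustrative only: the noise-word list and substitution map are made-up samples, and the phonetic encoder is a simplified American Soundex (production work typically uses library implementations such as those in the `jellyfish` package, or Metaphone variants).

```python
import re

NOISE_WORDS = {"the", "of", "inc", "llc", "co"}      # assumption: sample list
SUBSTITUTIONS = {"tim": "timothy", "oz": "ounce"}    # assumption: sample map

def standardize(text: str) -> str:
    """Case standardization, punctuation removal, noise words, substitution."""
    text = text.lower()                               # case standardization
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # punctuation removal
    words = []
    for w in text.split():
        if w in NOISE_WORDS:                          # noise word removal
            continue
        words.append(SUBSTITUTIONS.get(w, w))         # word substitution
    return " ".join(words)

def soundex(word: str) -> str:
    """Simplified American Soundex: first letter plus three digits.

    Assumes a non-empty alphabetic input; real implementations also handle
    the H/W separator rule, which this sketch omits.
    """
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             "m": "5", "n": "5", "r": "6"}
    word = word.lower()
    digits = [codes.get(c, "") for c in word]         # vowels/h/w -> ""
    out, prev = [], digits[0]
    for d in digits[1:]:
        if d and d != prev:                           # collapse repeats
            out.append(d)
        prev = d
    return (word[0].upper() + "".join(out) + "000")[:4]

print(standardize("The ACME Co., Inc."))              # -> "acme"
```

Comparing `soundex("Smith")` with `soundex("Smyth")` shows why phonetic keys catch spelling variants that exact matching misses: both encode to the same value.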

Beyond these techniques, there are others that might be used: ignoring all one- or two-character words, matching on the first X characters, alphabetizing the words within a field or across fields, executing rules in a variety of orders, treating a field as optional when it is not populated, merging the results of different match criteria, and many more. There are seemingly endless combinations of rules that can be applied when identifying duplicate candidates. It usually takes several iterations of rules and combinations to determine which rule set(s) are appropriate.
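Merging the results of different match criteria can be sketched as a union of candidate pairs, one set per rule. This is a hypothetical illustration: the rules shown (exact case-folded name, first five characters) and the sample records are invented for the example.

```python
# Each rule derives a blocking key; records sharing a key under ANY rule
# become a candidate pair. The union merges results across rules.
from itertools import combinations

def pairs_matching(records, key_fn):
    """All record-index pairs whose derived keys collide under one rule."""
    buckets = {}
    for i, rec in enumerate(records):
        buckets.setdefault(key_fn(rec), []).append(i)
    pairs = set()
    for idxs in buckets.values():
        pairs.update(combinations(idxs, 2))
    return pairs

rules = [
    lambda r: r["name"].lower(),        # rule 1: exact (case-folded) name
    lambda r: r["name"].lower()[:5],    # rule 2: match on first 5 characters
]
records = [
    {"name": "Johnson Controls"},
    {"name": "JOHNSON CTRLS"},
    {"name": "Acme"},
]
candidates = set().union(*(pairs_matching(records, rule) for rule in rules))
```

Here the first two records fail the exact-name rule but collide on the first five characters, so the merged candidate set still flags them for review.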

Additionally, if you have large enough datasets, machine learning algorithms coupled with some of the techniques above can further automate the identification and consolidation of duplicate data.

If you have questions on data management, governance, migration, or just love data, let's connect. My email is steve.novak@definian.com.
