
Common Techniques to Employ in the Identification of Duplicate Data

Duplicate and orphan data plague every organization. Explore proven techniques for discovering, identifying, and eliminating unwanted data records.
Steve Novak, Vice President

Duplicate master data is a common problem in every organization. Identifying and resolving duplicate information on customers, suppliers, items, employees, and other entities is an important activity. Solving the duplicate issue can yield increased buying power, more accurate marketing campaigns, reduced maintenance costs, improved inventory management, and better forecasting: improvements across virtually every aspect of the business.

In a new implementation, if duplicate data is not resolved before or during the migration, it will either load silently into the new ERP application and cause problems once the application is in use, or it will error out during the load and cause problems because the downstream transaction conversions won't be able to load. In an ongoing data management process, it is important to prevent duplicates by defining the matching criteria and keeping scorecard metrics that verify the organization is preventing duplicates from adding to overall data degradation.

The basic concept for identifying a duplicate candidate is straightforward: standardize the fields that identify a unique entity (e.g., a single customer, vendor, or item) and compare them. In practice, depending on the type and cleanliness of the data, the standardization process can be complex.
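The standardize-and-compare concept can be sketched in a few lines. This is a minimal illustration, not a production matcher; the record fields ("name", "city") and the normalization steps are hypothetical examples.

```python
# Minimal "standardize, then compare" sketch: normalize the identifying
# fields, then group records whose normalized keys collide.
from collections import defaultdict

def normalize(value: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(value.lower().split())

def duplicate_candidates(records, key_fields):
    """Group records by normalized key; any group of 2+ is a candidate set."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(normalize(rec[f]) for f in key_fields)
        groups[key].append(rec)
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical customer records: the first two differ only in case/spacing.
customers = [
    {"name": "Acme Corp ", "city": "Dayton"},
    {"name": "ACME  corp", "city": "dayton"},
    {"name": "Globex", "city": "Springfield"},
]
```

Real projects layer the techniques described below (noise words, substitutions, phonetic encoding, and so on) into the `normalize` step.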

There are several techniques that are commonly used to identify duplicates within master data.

  • Noise Word Removal – The process of removing words that don’t add significance to the data or are often incorrect. Common examples of noise words are “the”, “of”, and “inc”.
  • Word Substitution – The process of replacing an existing word or phrase with another word or phrase. It is common to substitute names and abbreviations when identifying duplicates. For example, Tim might be replaced with Timothy, OZ with Ounce, a single quote with foot, and a double space with a single space. Additionally, the context of the word within the current field or in other fields can affect which substitution needs to take place.
  • Case Standardization – The process of making everything the same case.
  • Punctuation Removal – The process of removing all non-alphanumeric characters that don’t add any significance to the field value.
  • Phonetic Encoding – The process of encoding words based on how they sound. For example, “donut” and “doughnut” would be phonetically encoded to the same value. There are several phonetic encoding methods in common use, such as Soundex and Metaphone.
  • Address Standardization – The process of standardizing all of the components of an address prior to comparing values. Usually the process that checks for duplicates utilizes address validation techniques/software/services to make sure the address is valid and to ensure that all of the pieces of the address are formatted uniformly throughout the data.
  • Attribute Standardization – Attribute data can be used to determine a distinct entity, but inconsistencies can make identical records look different at first blush. For example, unit of measure could be recorded in pounds on one record and ounces on another, yet the quantities could be the same once the conversion factor is applied. Or two assemblies could have the same bill of material parts list attached to them. Standardized attribute data can then be used in the identification of duplicates.
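Several of the techniques above can be combined into a single standardization routine. The sketch below is illustrative only: the noise-word list and substitution map are made-up samples, and the phonetic encoder is a simplified American Soundex (production work typically uses library implementations such as those in the `jellyfish` package, or Metaphone variants).

```python
import re

NOISE_WORDS = {"the", "of", "inc", "llc", "co"}      # assumption: sample list
SUBSTITUTIONS = {"tim": "timothy", "oz": "ounce"}    # assumption: sample map

def standardize(text: str) -> str:
    """Case standardization, punctuation removal, noise words, substitution."""
    text = text.lower()                               # case standardization
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # punctuation removal
    words = []
    for w in text.split():
        if w in NOISE_WORDS:                          # noise word removal
            continue
        words.append(SUBSTITUTIONS.get(w, w))         # word substitution
    return " ".join(words)

def soundex(word: str) -> str:
    """Simplified American Soundex: first letter plus three digits.

    Assumes a non-empty alphabetic input; real implementations also handle
    the H/W separator rule, which this sketch omits.
    """
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             "m": "5", "n": "5", "r": "6"}
    word = word.lower()
    digits = [codes.get(c, "") for c in word]         # vowels/h/w -> ""
    out, prev = [], digits[0]
    for d in digits[1:]:
        if d and d != prev:                           # collapse repeats
            out.append(d)
        prev = d
    return (word[0].upper() + "".join(out) + "000")[:4]

print(standardize("The ACME Co., Inc."))              # -> "acme"
```

Comparing `soundex("Smith")` with `soundex("Smyth")` shows why phonetic keys catch spelling variants that exact matching misses: both encode to the same value.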

Beyond these techniques, there are others that might be used: ignoring all one- or two-character words, matching on the first X characters, alphabetizing the words within a field or across fields, executing rules in a variety of orders, treating a field as optional when it is not populated, merging the results of different match criteria, and many more. There are seemingly endless combinations of rules that can be applied when identifying duplicate candidates. It usually takes several iterations of rules and combinations to determine which rule set(s) are appropriate.
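Merging the results of different match criteria can be sketched as a union of candidate pairs, one set per rule. This is a hypothetical illustration: the rules shown (exact case-folded name, first five characters) and the sample records are invented for the example.

```python
# Each rule derives a blocking key; records sharing a key under ANY rule
# become a candidate pair. The union merges results across rules.
from itertools import combinations

def pairs_matching(records, key_fn):
    """All record-index pairs whose derived keys collide under one rule."""
    buckets = {}
    for i, rec in enumerate(records):
        buckets.setdefault(key_fn(rec), []).append(i)
    pairs = set()
    for idxs in buckets.values():
        pairs.update(combinations(idxs, 2))
    return pairs

rules = [
    lambda r: r["name"].lower(),        # rule 1: exact (case-folded) name
    lambda r: r["name"].lower()[:5],    # rule 2: match on first 5 characters
]
records = [
    {"name": "Johnson Controls"},
    {"name": "JOHNSON CTRLS"},
    {"name": "Acme"},
]
candidates = set().union(*(pairs_matching(records, rule) for rule in rules))
```

Here the first two records fail the exact-name rule but collide on the first five characters, so the merged candidate set still flags them for review.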

Additionally, if you have large enough datasets, machine learning algorithms coupled with some of the techniques above can further automate the identification and consolidation of duplicate data.

If you have questions on data management, governance, migration, or just love data, let's connect. My email is steve.novak@definian.com.
