All articles

Common Techniques to Employ in the Identification of Duplicate Data

Common Techniques to Employ in the Identification of Duplicate Data
Duplicate and orphan data plague every organization. Explore proven techniques for discovering, identifying, and eliminating unwanted data records.
Tagged in:
Steve Novak
Steve
Novak
Vice President
View bio

Duplicate master data is a common problem across every organization. Identifying and resolving duplicate information on customers, suppliers, items, employees, etc. is an important activity. Solving the duplicate issue can yield increased buying power, more accurate marketing campaigns, reduced maintenance costs, improved inventory management, better forecasting; improvements across any aspect of the business.

In terms of a new implementation, if the duplicate data is not resolved before or during the migration, the duplicate data will either silently load into the new ERP application and cause problems when the application is put to use or will error out during the load and cause problems because the downstream transaction conversions won’t be able to load. On an ongoing data management process, it is important to prevent duplicates by defining the criteria and keeping scorecard metrics that ensure that the organization is preventing duplicates adding to overall data degradation.

The basic concept for identifying a duplicate candidate is straight forward. Standardize fields that identify a unique entity (e.g. a single customer, vendor, item, etc.) and compare. In practice, depending on the type/cleanliness of the data, the standardization process can be complex.

There are several techniques that are commonly used to identify duplicates within master data.

  • Noise Word Removal – The process of removing words that don’t add significance to the data or are often incorrect. Common examples of noise words are “the”, “of”, and “inc”.
  • Word Substitution - The process of replacing an existing word or phrase with another word or phrase. It is common to substitute names and abbreviations when identifying duplicates. For example, Tim would be replaced with Timothy, OZ would be replaced with Ounce, a single quote might be replaced with foot, and a double space might be replaced with a single space. Additionally, context of the words in the current field or other fields could affect the substitution that needs to take place.
  • Case Standardization – The process of making everything the same case.
  • Punctuation Removal – The process of removing all alpha or numeric values that don’t add any significance to the field value.
  • Phonetic Encoding – The process of encoding words based on how they sound. For example, “donut” and “doughnut” would be phonetically encoded to the same value. There are several types of phonetic encoding methods that are commonly used.
  • Address Standardization – The process of standardizing all of the components of an address prior to comparing values. Usually the process that checks for duplicates utilizes address validation techniques/software/services to make sure the address is valid and to ensure that all of the pieces of the address are formatted uniformly throughout the data.
  • Attribute Standardization - Attribute data can be used to determine a distinct entity, but inconsistencies could make them look different at first blush. For example Unit of measure could be in pounds and ounces, but those could be the same once the conversion factor is complete. Or there could be 2 Assemblies that have the same Bill of Material parts list attached to them. This attribute data can be used in the identification of duplicates.

Outside of these techniques there are others that might be used like ignoring all one or two character words, match on the first X number of characters, alphabetize the words in a field or across fields, execute rules in a variety of orders, not requiring a field if it is not populated, merging results of different match criteria together, and many more. There are seemingly endless combinations of rules that can be applied for identifying duplicate candidates. It usually takes several iterations of rules and combinations to determine which rule set(s) are appropriate.

Additionally, if you have large enough datasets, machine learning algorithms coupled with some of the techniques above can further automate the identification and consolidation of duplicate data.

If you have questions on data management, governance, migration, or just love data, let's connect. My email is steve.novak@definian.com.

Other articles

Clinical Transformation Is Not an IT Project: What Your Physicians and Operational Leaders Need From You Before the Initiative Launches

Clinical Transformation Is Not an IT Project: What Your Physicians and Operational Leaders Need From You Before the Initiative Launches

Healthcare
Clinical transformation fails when it is run as an IT project. What physicians and operational leaders need from the C-suite before the initiative launches.
The Five Fault Lines in Health System Clinical Operations That Transformation Must Address, and That Most Initiatives Miss

The Five Fault Lines in Health System Clinical Operations That Transformation Must Address, and That Most Initiatives Miss

Healthcare
Five recurring fault lines in health system clinical operations that transformation initiatives must address. Most miss them and stall in stabilization.
When Compliance Becomes a Data Problem: Why Your Greatest Regulatory Risk May Be Hiding in Your Data Infrastructure

When Compliance Becomes a Data Problem: Why Your Greatest Regulatory Risk May Be Hiding in Your Data Infrastructure

Healthcare
Compliance is a data problem before it is a legal problem. The regulatory exposures hiding in fragmented infrastructure that auditors are now empowered to find.
Client testimonial
The Definian team was great to work with. Professional, accommodating, organized, knowledgeable ... We could not have been as successful without you.
Senior Manager | Top Four Global Consulting Firm

Partners & Certifications

Ready to unleash the value in your data?