Data Spelunking: Simple Techniques to Start Understanding Data Landscapes

Every data migration begins with understanding the data landscape. Learn simple techniques and strategies to start exploring the data you are working with.
Mike Mulhern, Principal

"You'll never be able to figure out our data." That's what I was told. Forty-eight hours later, after putting in some evening hours, mission accomplished. This is not a hypothetical event but a recurring theme in what we do at Definian. Every company believes its data is so unique and different that only its own people can make sense of it. In some cases, they believe their data is so complex that it can't even be located, let alone pulled out of the underlying database. The truth, however, is that data can only be structured in so many ways. For a system to function, that structure must be logical. Whether it's financial, manufacturing, or payroll data, and whether it resides in a mainframe or a relational database, it's all just rows and columns. To make sense of it, one needs only to identify the important tables and then understand the relationships between them.

I've personally run into several projects where my client didn't know where the data of interest was stored. They could identify the database and the content they were looking for through the front end, but they couldn't speak to the names of the underlying tables or columns. This always presents a fun challenge and a great opportunity to add value. It’s a process I like to think of as “Data Spelunking” – searching through the database trying to see what you can find and going wherever it takes you.

Assuming the data is in a relational database, there are a couple of strategies you can employ to trek through the data and find what you're looking for. Both start with the system catalog at the heart of the database (or data dictionary, information schema, etc.) and are straightforward.

If the database has tables and columns with meaningful names and you're looking for something specific, using the catalog to search for likely table or column names can often lead you to the data quickly. For instance, if you were looking at an Oracle database supporting some system and wanted to find tables and columns related to vendors, you could try the following two queries (the first searching table names by themselves, the second searching column names):

SELECT OWNER, TABLE_NAME FROM ALL_TABLES WHERE TABLE_NAME LIKE '%VENDOR%'

SELECT OWNER, TABLE_NAME, COLUMN_NAME FROM ALL_TAB_COLUMNS WHERE COLUMN_NAME LIKE '%VENDOR%' ORDER BY TABLE_NAME

Varying the search term, along with common abbreviations of it, may enable you to pinpoint what you're looking for. This works extremely well for a system like PeopleSoft, where most of the names are very meaningful.
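For example, a single catalog query can cover several likely variants at once (the alternate patterns here are illustrative guesses for a vendor search, not a definitive list):

SELECT OWNER, TABLE_NAME, COLUMN_NAME FROM ALL_TAB_COLUMNS WHERE COLUMN_NAME LIKE '%VENDOR%' OR COLUMN_NAME LIKE '%VNDR%' OR COLUMN_NAME LIKE '%SUPPLIER%' ORDER BY TABLE_NAME, COLUMN_NAME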

Sometimes, however, that's not enough due to the sheer number of tables in the database, the number of tables sharing the same column names, or the need to identify many types of data. In such cases, the system catalog can be used to generate a list of tables with corresponding row counts for further analysis. You can actually use a SQL query against the system catalog to construct a whole series of queries that obtain counts for each table. For Oracle (as an example), you could use:

select 'Select ' || '''' || owner || '''' || ' as "Owner_Name", ' || '''' || table_name || '''' || ' as "Table_Name", Count(*) from ' || owner || '.' || table_name || ' union' from all_tables

This query returns a result that is actually a series of SQL queries, each followed by the word "union":

Select 'DEV' as "Owner_Name", 'PS_CUSTOMER' as "Table_Name", Count(*) from DEV.PS_CUSTOMER union

Select 'DEV' as "Owner_Name", 'PS_CUSTOMER_FSS' as "Table_Name", Count(*) from DEV.PS_CUSTOMER_FSS union

Select 'DEV' as "Owner_Name", 'PS_CUSTOMER_LANG' as "Table_Name", Count(*) from DEV.PS_CUSTOMER_LANG union

Select 'DEV' as "Owner_Name", 'PS_CUSTOPT_TEO1' as "Table_Name", Count(*) from DEV.PS_CUSTOPT_TEO1 union

Select 'DEV' as "Owner_Name", 'PS_CUSTOPT_TEO2' as "Table_Name", Count(*) from DEV.PS_CUSTOPT_TEO2 union

Select 'DEV' as "Owner_Name", 'PS_CUSTOPT_TEO3' as "Table_Name", Count(*) from DEV.PS_CUSTOPT_TEO3 union

Select 'DEV' as "Owner_Name", 'PS_CUSTOPT_TEOA' as "Table_Name", Count(*) from DEV.PS_CUSTOPT_TEOA union

Select 'DEV' as "Owner_Name", 'PS_CUST_ADDRESS' as "Table_Name", Count(*) from DEV.PS_CUST_ADDRESS union

Select 'DEV' as "Owner_Name", 'PS_CUST_ADDRSQ_LNG' as "Table_Name", Count(*) from DEV.PS_CUST_ADDRSQ_LNG union

Select 'DEV' as "Owner_Name", 'PS_CUST_ADDR_CNTCT' as "Table_Name", Count(*) from DEV.PS_CUST_ADDR_CNTCT union

Select 'DEV' as "Owner_Name", 'PS_CUST_ADDR_EXS' as "Table_Name", Count(*) from DEV.PS_CUST_ADDR_EXS union

Select 'DEV' as "Owner_Name", 'PS_CUST_ADDR_SEQ' as "Table_Name", Count(*) from DEV.PS_CUST_ADDR_SEQ union

You can then feed this result (minus the final "union") back into a SQL query tool approximately 500 rows at a time and obtain lists of tables along with the current row counts for those tables.

DEV PS_CUSTOMER 21914

DEV PS_CUSTOMER_FSS 14800

DEV PS_CUSTOMER_LANG 568

DEV PS_CUSTOPT_TEO1 0

DEV PS_CUSTOPT_TEO2 0

DEV PS_CUSTOPT_TEO3 0

DEV PS_CUSTOPT_TEOA 0

DEV PS_CUST_ADDRESS 29362

DEV PS_CUST_ADDRSQ_LNG 443

DEV PS_CUST_ADDR_CNTCT 0

DEV PS_CUST_ADDR_EXS 1236

DEV PS_CUST_ADDR_SEQ 27945


These lists can then be added to a spreadsheet, sorted based on row count or name, and quickly scanned to eliminate tables that won't be of interest.
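As a shortcut, if database statistics are reasonably current, Oracle's catalog already stores an estimated row count for each table, so a single query can approximate the same list without running the generated count queries at all (the figures are only as fresh as the last statistics gathering, so treat them as estimates):

SELECT OWNER, TABLE_NAME, NUM_ROWS FROM ALL_TABLES ORDER BY NUM_ROWS DESC NULLS LAST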

Generally, you will already have an expectation of data volumes and can quickly eliminate tables based on counts. For instance, it's unlikely that you're going to migrate data from tables with under 100 rows, or maybe even 500. Similarly, there may be some obvious upper bounds on counts for the type of data you're looking for. In most cases you don't have millions of items or customers; very high volumes are more likely transactional data or temporary tables used for system reporting.

The naming conventions of the tables will also allow you to pare the list down. Suffixes like "TMP" or "RPT" are typically giveaways that these tables are used by system processes and are not the sources of truth you're looking for. Sometimes you'll see date suffixes indicating that someone backed up a table. Scanning through the list, trends should begin to leap off the page. Additionally, you'll be able to pick out situations where multiple tables with similar names have identical row counts, allowing you to quickly speculate on relationships you'll want to research further.
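Where the application actually declares foreign keys (many don't), the catalog can also confirm a suspected relationship directly. On Oracle, a sketch along these lines lists each child table alongside the parent table it references (the 'DEV' owner is just the example schema used above):

SELECT A.TABLE_NAME, A.CONSTRAINT_NAME, B.TABLE_NAME AS REFERENCED_TABLE FROM ALL_CONSTRAINTS A JOIN ALL_CONSTRAINTS B ON A.R_CONSTRAINT_NAME = B.CONSTRAINT_NAME AND A.R_OWNER = B.OWNER WHERE A.CONSTRAINT_TYPE = 'R' AND A.OWNER = 'DEV' ORDER BY A.TABLE_NAME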

Using this process, you can generally reduce the entire database down to a list of likely tables in under an hour. From there, you can start the more tedious process of looking at the content and index structure of each remaining table. I've done it successfully many times over the last decade, and it should work for nearly any relational database with minor tweaks to the queries used.
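For the index-structure step, the catalog again does most of the work. On Oracle, a query like the following lists a table's indexes and their key columns, which usually reveals the table's natural keys (the owner and table name here are just one example from the list above):

SELECT INDEX_NAME, COLUMN_NAME, COLUMN_POSITION FROM ALL_IND_COLUMNS WHERE TABLE_OWNER = 'DEV' AND TABLE_NAME = 'PS_CUSTOMER' ORDER BY INDEX_NAME, COLUMN_POSITION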

Happy Spelunking!
