Anthem Consulting, LLC



About Us
Services
Seminars
Newsletter
Research
Links
Contact Us





Last Modified
March 25, 2003

Data Profiling

The foundation for the majority of Information Technology projects is the data. Whether Data Warehouse (DW), Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), or Sales Force Automation (SFA), the data loaded into the new application determines whether or not the Return On Investment for the project is realized. There are three possible sources of data for these projects. Your projects may use any combination of them.


  • Existing corporate legacy systems
  • External or 3rd party sources (e.g., outside mailing lists, credit scores)
  • Future data entry by a corporation's employees, applications, or customers (web forms)

Without a thorough understanding of all of the data from these disparate sources, a company runs the risk of loading a brand new system with inaccurate, incomplete, or superfluous data. Business decisions, based on this data, are jeapordized.

Data profiling is a process where data from existing systems is exhaustively analyzed and modeled to identify all issues, problems and characteristics regarding the data. This information is used to develop accurate cleansing, transformation, and design specifications. Data profiling accomplishes this in less time, less risk, and reduced costs while ensuring greater accuracy and quality.

Though each project may have different needs, there are generally five steps or dimensions of data profiling.

Domain Studies
Obtaining the basic characteristics of each column or field within a particular file. These characteristics include data types, null rules, minimum and maximum values, value sets, and patterns. Very helpful in identifying content and quality issues.

Structural or Dependency Inference
Inferring functional dependencies or relationships between columns or fields in a single file or table. These dependencies can help define the optimal structure of the file.

Redundancy Profiling
Identifying foreign key relationships, forced duplication of columns or fields, as well as violations of integrity constraints between files or tables. Very useful when trying to integrate two different systems with the same data (e.g., two Ordering systems).

Business Rule Profiling
Validating rules, required by the business, needed in entering or updating source data. Also encompasses flagging data that violate business rules. Required when designing new back office systems.

Dynamic or Periodic Profiling
Profiling data over time and comparing/trending new results with previous profiling results. This can be performed with either new load files (DW projects) or online transactions (EAI projects or data streams).

The results of data profiling can assist the rest of the project. A normalized "Legacy Model" can be constructed and used as the basis for a database design or comparison with a purchased system. Data cleansing and transformation rules are developed to assist the Extract-Cleanse-Transform-Load process. The testing phase can profile the populated target databases.

This is a basic overview of data profiling. To learn more about data profiling techniques and technology, sign up for our quarterly newsletter or one of our seminars.





©2002 Anthem Consulting, LLC All Rights Reserved