You want answers? In business intelligence solutions, the insights delivered by key reports and models are essential to executive decision making. But reports and models come from analyzing data, and bad data yields inaccurate results that degrade those decisions. This is often referred to as the ‘Garbage in, garbage out’ principle. If you’re looking for the best “truth” in your reports, you will need the cleanest data possible.
Bad data means that the data being collected, integrated, and stored in a database is unreliable or invalid. In the world of business intelligence solutions, bad data is misleading and creates obstacles in segmentation, targeting, messaging, and reporting on program performance.
Common Data Quality Issues
There are several reasons why data quality issues may arise prior to the integration of data into a centralized database.
- Invalid values populated in key data fields, or in fields that require specific values, create data integrity issues (e.g., invalid email addresses result in higher email bounce-back rates)
- Missing values in key data fields that are required to be populated break business rules logic
- Few or no requirements given to the channel partner for validating key data fields before sending data force the database vendor either to reject incoming files and wait for the channel partner to correct them, or to integrate bad data into the centralized database
- Minimal monitoring by the database vendor of the quality of data the channel partner sends, and of how that quality trends over time, lets data integrity issues go unnoticed
- The absence of automated processes to alert the database vendor to incoming file and/or data issues wastes time on manual review and significantly raises the risk of data issues caused by manual mistakes and human error
- Failure to use incident reports that capture data file transfer issues, how the issues were resolved, and the preventative measures taken to ensure they do not happen again
- Lack of version control for key documents used across multiple parties, or of a process to ensure that the most current version reaches all relevant parties, causes misunderstandings about how data is to be captured and integrated
A high percentage of data issues can be prevented from the start by instituting some best practices. Here are eight tips to ensure best practices in data quality management:
- Valid Response Values: The description of each data field in the data layout section should list the acceptable values whenever only certain values are allowed. For example, if the only possible answers to a particular survey question are 1, 2, 3, or 4, then any other value is invalid and should not be integrated into the centralized database. (The first sketch after this list shows one way to enforce rules like these.)
- Validation Requirements for Key Data Fields: Validation of key data fields such as Age, DOB, Phone and Email should be specified by the database vendor. For example, Age and DOB should fall within plausible ranges (an age of 700 years or a date of birth of 1/3/1785 should be flagged as erroneous). A phone number such as (123) 456-7890 may follow the format correctly and contain the right number of digits, but validating the area code adds a further check. Email validation should require an @ symbol, a period after the @ symbol, and a valid domain after the period, such as COM, NET, ORG, EDU, GOV, MIL, AERO, BIZ, COOP, INFO, MUSEUM, NAME, PRO, or US.
- Required versus Non-Required Data Fields: Distinguish between required and non-required data fields in the data layout section, indicating for each field whether it must be populated with a value. Depending on the industry and marketing tactic, some required fields, such as Age, State and Country, must be populated to avoid legal entanglements and ensure regulatory compliance. Further, missing data in key fields used for analyses, segmentation and targeting will negatively impact marketing initiatives and reporting results.
- File or Record Rejection Specifications: Specify the requirements the channel partner must follow to prevent full or partial record or file rejections by the database vendor. Channel partners play an active role in the quality of data being captured and sent; some need more supervision than others, and issues with the data may not surface until it is too late. Catching data quality issues early and monitoring channel partner performance will improve the capture and transfer of data. (The second sketch after this list illustrates one rejection workflow.)
- Automation for Flagging Bad Data: Automated processes for identifying and/or flagging invalid data in key fields, or data file transfer issues, eliminate the added risk of human error that comes with manually inspecting the quality of data in incoming files.
- End-to-End Testing: End-to-end testing between the channel partner and the database vendor should include validation of the data values populated in each data field, not just counts of records within each file. (See the third sketch after this list.)
- Incident Reporting: Specifications for incident reporting should cover who is responsible for documenting file transfer issues, details of the issue that was identified, how the issue was resolved, whether it is significant enough to warrant a change management request, and how it will be prevented in the future. (The fourth sketch after this list shows one way to structure such a report.)
- Version Control: When multiple parties are involved in the transfer and integration of data, it is imperative that all of them are aware of any changes to how data is captured and integrated over time. Formal documents that provide data specifications should be updated to reflect any changes, placed under version control, and distributed to all relevant parties.
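To make the first three tips concrete, here is a minimal sketch of a record-level validator in Python. The field names, allowed values, and plausibility ranges are hypothetical stand-ins for what a real data layout specification would define; a production system would load these rules from that specification rather than hard-coding them.

```python
import re
from datetime import date, datetime

# Hypothetical rules that would normally come from the data layout specification.
VALID_SURVEY_ANSWERS = {"1", "2", "3", "4"}      # tip 1: allowed response values
REQUIRED_FIELDS = ("age", "state", "country")    # tip 3: must be populated
VALID_DOMAINS = {"COM", "NET", "ORG", "EDU", "GOV", "MIL", "AERO",
                 "BIZ", "COOP", "INFO", "MUSEUM", "NAME", "PRO", "US"}
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.([A-Za-z]+)$")

def validate_record(record):
    """Return a list of validation errors for one incoming record."""
    errors = []

    # Tip 3: required fields must be populated.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append("missing required field: " + field)

    # Tip 1: survey answers are limited to the documented values.
    answer = record.get("survey_q1")
    if answer and answer not in VALID_SURVEY_ANSWERS:
        errors.append("invalid survey_q1 value: " + answer)

    # Tip 2: age must be numeric and fall within a plausible range.
    age = record.get("age")
    if age and (not age.isdigit() or not 0 <= int(age) <= 120):
        errors.append("implausible age: " + age)

    # Tip 2: date of birth must parse and be plausible.
    dob = record.get("dob")
    if dob:
        try:
            parsed = datetime.strptime(dob, "%m/%d/%Y").date()
            if parsed.year < 1900 or parsed > date.today():
                errors.append("implausible dob: " + dob)
        except ValueError:
            errors.append("malformed dob: " + dob)

    # Tip 2: email needs an @, a period after it, and a recognized domain.
    email = record.get("email")
    if email:
        match = EMAIL_RE.match(email)
        if not match or match.group(1).upper() not in VALID_DOMAINS:
            errors.append("invalid email: " + email)

    return errors
```

A record with an age of 700 or an email ending in an unrecognized domain comes back with a non-empty error list and can be held out of the centralized database.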
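Tips 4 and 5 can build on that same validator. The sketch below, with a hypothetical rejection threshold and logging standing in for a real alerting hook, screens an incoming CSV file, holds back flagged rows, and rejects the whole file automatically when too many rows fail, so nobody has to catch the problem by eye.

```python
import csv
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("intake")

REJECT_FILE_THRESHOLD = 0.10  # hypothetical: reject the file if over 10% of rows fail

def process_incoming(path):
    """Screen one incoming file; return True if it was accepted."""
    clean, rejected = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            errors = validate_record(row)   # validator from the previous sketch
            if errors:
                rejected.append((row, errors))
            else:
                clean.append(row)

    total = len(clean) + len(rejected)
    failure_rate = len(rejected) / total if total else 1.0
    if failure_rate > REJECT_FILE_THRESHOLD:
        # Full file rejection: alert immediately rather than integrating bad data.
        log.error("rejected %s: %.0f%% of rows failed validation",
                  path, 100 * failure_rate)
        return False

    # Partial rejection: clean rows proceed; flagged rows are held for review.
    log.info("accepted %s: %d clean rows, %d flagged rows",
             path, len(clean), len(rejected))
    return True
```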
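For tip 6, the point is that an end-to-end test asserts on the values in each field, not just the record counts. A minimal sketch, using a stubbed database lookup so the shape of the test is clear; the batch ID and sample rows are invented for illustration.

```python
def fetch_batch_from_database(batch_id):
    # Stand-in for a real query against the centralized database.
    return [{"age": "34", "state": "NY", "email": "jane@example.com"}]

def test_end_to_end_load():
    """Matching counts is necessary but not sufficient; values must survive too."""
    sent = [{"age": "34", "state": "NY", "email": "jane@example.com"}]
    received = fetch_batch_from_database("test_batch_001")

    assert len(received) == len(sent)              # the weak check: record counts
    for sent_row, received_row in zip(sent, received):
        for field, value in sent_row.items():      # the strong check: field values
            assert received_row.get(field) == value, field + " changed in transit"

test_end_to_end_load()  # would normally run under a test runner such as pytest
```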
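And for tip 7, even a lightweight structure keeps incident reports consistent. The dataclass below is one hypothetical way to capture the fields the tip calls for; the example values are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class IncidentReport:
    """One file transfer incident, capturing the fields tip 7 calls for."""
    reported_by: str                     # who documented the issue
    description: str                     # details on the issue that was identified
    resolution: str = ""                 # how the issue was resolved
    needs_change_request: bool = False   # significant enough for change management?
    prevention: str = ""                 # how the issue will be prevented in the future
    opened_at: datetime = field(default_factory=datetime.now)

# Invented example for illustration:
report = IncidentReport(
    reported_by="database vendor intake team",
    description="daily feed arrived with the DOB column shifted one position",
    resolution="file rejected; channel partner resent a corrected file",
    needs_change_request=False,
    prevention="added a header-order check to the automated intake screening",
)
```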
Using bad data, even with the most sophisticated business intelligence solutions, will yield erroneous reporting and cause a domino effect of poor executive decisions. On the flip side, using good data with the right business intelligence solutions will empower decision makers with the truth.