Business Intelligence is all about taking better decisions and decisions can only be as accurate as the data we have at the time of decision making. Data quality is mostly about ensuring there are no typos in data. E.g. if province name is “British Columbia” then it should be consistent everywhere and not use “BC”. At times data can be inconsistent e.g.
Receiving data in an txt / csv / excel / xml where the operator entered data by hand.
Data entry application has lots of open-ended text boxes without field validation in place.
Data is received from 3rd party company and they have different conventions.
DQS makes life a lot easier for both data stewards and developers. Data steward is a person who’s responsible for manually fixing the typos. Developer is a person who can use bunch of tools to automate data cleansing. DQS provides tools caters to both the data stewards and developers.
DQS comes with various components:
Server side knowledgebase.
Client side tool for creating adding data to the knowledgebase
SSIS Data Cleansing Transformation that can be used by SSIS developers for automating data cleansing
For working with DQS there are various steps involved:
Step 1: Create Knowledgebase
Knowledgebase is a repository for domains. Domain is basically a column or a field where you define the possible values for that domain. To create the knowledgebase we can use the DQS Client tool which looks like this:
Step 2: Add Domains to the Knowledgebase
Add at least one domain to the knowledgebase. Here I have added two domains.
Knowledgebase resides in SQL Server databases that are creating by DQS. Here’s a list of databases that’s usually created by DQS:
Step 3: Populate Domains
After creating the domains we have to add data to the domains either manually or we can import it from Excel spread sheet or SQL server. In case if you want to import data from Oracle or some other source then you want to convert it to Excel / SQL first by using SSIS or any ETL tool. Here I have populated the domain by hand:
If you want to import data from Excel / SQL then we can use the following option:
Step 4: Perform Data Cleansing
Once we have the knowledgebase in place we can perform data cleansing. We can either do it manually or we can automate it by using SSIS. If you want to clean it manually then we can use the same DQS client tool that we used for creating the knowledgebase. In DQS client tool we can create a project and specify Excel / SQL database as the source. When we run the project it will let us output both the original column value and the fixed column value. Then from here we can choose to save the fixed data to Excel / SQL database.
In case if you want to automate data cleansing as part of your ETL operation then we can use DQS Cleansing Transformation which is new in SSIS 2012. Here’s how it looks like:
DQS cleansing transformation makes use of the knowledgebase that we defined earlier on. It’s never 100% automation since we do have to update the knowledgebase whenever new typos are discovered. It’s on going activity and over period of time the knowledgebase eventually becomes mature enough to fix most of the typos.
DQS Pattern Matching
On top of fixing the typos DQS can also be used for finding patterns e.g. we could have a mailing address that’s combination of various fields like street number, avenue, postal code etc. I will leave the details for some other time.
SQL Server 2012 received the first iteration of DQS and it’s really cool. It makes fixing typos a lot easier both for data stewards and developers. It will be exciting to see further improvements in the next version.