The problem set here is what to do when faced with an input file that has columns with multiple types of content(Domains). For instance a single column may contain address, email, name or city state zip..
This example stems from a problem I faced recently with some banking data. We needed to load data into a Customer Dimension and MDM Tool and needed to get the data organized, in this case 8 input columns each with a different part of the address, name , company name or email. The goal was to send the data through an address correction tool as well, however simply concatenating all the columns was not an alternative.
Input File with multiple type of content in same column. The sample data was derived form the 2008 Adventure Works Sample. The TSQL for generating the sample is at the end of the article.
From a data profiling perspective the approach would be to identify if a column matches a particular pattern (address, email, city state zip, etc...) and then move the data into the appropriate labeled column
I have worked with Regular Expression in SSIS and decided to use them in a Script Component to acomplish the column pattern identification and then a Derived Column Transform to actually do the dynamic mapping.
Organized and properly mapped results.
The major considerations are to add the necessary output columns as Booleans in the Script Component Editor. I have provided the input file and package.
Here is the script. i did rely on Expresso to test the final Regular Expression.
' Microsoft SQL Server Integration Services Script Component' Write scripts using Microsoft Visual Basic 2008.' ScriptMain is the entry point class of the script.
Imports SystemImports System.DataImports System.MathImports Microsoft.SqlServer.Dts.Pipeline.WrapperImports Microsoft.SqlServer.Dts.Runtime.Wrapper
<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute> _<CLSCompliant(False)> _Public Class ScriptMainInherits UserComponentDim patAlphaUpper As String = "[^a-z0-9 ,@%&/#'.-]"Dim patAlphaLower As String = "[^A-Z0-9 ,@%&/#'.-]"Dim patNum As String = "[^A-Za-z ,@%&/#'.-]"Dim patSpecial As String = "[^A-Za-z0-9@&#]"'Dim patEmail As String = "\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b"Dim patEmail As String = "([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})"Dim patUSPhone As String = "^(?:(?<1>[(])?(?<AreaCode>[1-9]\d{2})(?(1)[)])(?(1)(?<2>[ ])|(?:(?<3>[-])|(?<4>[ ])))?)?(?<Prefix>[1-9]\d{2})(?(AreaCode)(?:(?(1)(?(2)[- ]|[-]?))|(?(3)[-])|(?(4)[- ]))|[- ]?)(?<Suffix>\d{4})$"Dim patAddress As String = "\w*\s*\d+\s+\w+\s*\w*"Dim patCityStateZip As String = "\w+\s+\w+\s+\d{5}"'Dim patPOBOX As String = "[poPO]+[BbOoXx]+\s*\d+"Dim patPOBOX As String = "^\b[P|p]*(OST|ost)*\.*\s*[O|o|0]*(ffice|FFICE)*\.*\s*[B|b][O|o|0][X|x]\b\s+\d+\s*$"Dim rgxAlphaUpper As New Text.RegularExpressions.Regex(patAlphaUpper)Dim rgxAlphalower As New Text.RegularExpressions.Regex(patAlphaLower)Dim rgxSpecial As New Text.RegularExpressions.Regex(patSpecial)Dim rgxNum As New Text.RegularExpressions.Regex(patNum)Dim rgxEmail As New Text.RegularExpressions.Regex(patEmail)Dim rgxPhone As New Text.RegularExpressions.Regex(patUSPhone)Dim rgxAddress As New Text.RegularExpressions.Regex(patAddress)Dim rgxCityStateZip As New Text.RegularExpressions.Regex(patCityStateZip)Dim rgxPOBOX As New Text.RegularExpressions.Regex(patPOBOX)Dim tagAlphaUpper As StringDim tagAlphaLower As StringDim tagNumber As StringDim ADDR1Work As StringDim ADDR2Work As StringDim ADDR3Work As StringDim ADDR4Work As StringDim ADDR5Work As StringDim ADDR6Work As StringDim ADDR7Work As StringDim ADDR8Work As String
Public Overrides Sub PreExecute()MyBase.PreExecute()'' Add your code here for preprocessing or remove if not needed'End Sub
Public Overrides Sub PostExecute()MyBase.PostExecute()'' Add your code here for postprocessing or remove if not needed' You can set read/write variables here, for example:' Me.Variables.MyIntVar = 100'End Sub
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
ADDR1Work = rgxSpecial.Replace(Row.ADDRLNE1TXT, " ")ADDR2Work = rgxSpecial.Replace(Row.ADDRLNE2TXT, " ")ADDR3Work = rgxSpecial.Replace(Row.ADDRLNE3TXT, " ")ADDR4Work = rgxSpecial.Replace(Row.ADDRLNE4TXT, " ")ADDR5Work = rgxSpecial.Replace(Row.ADDRLNE5TXT, " ")ADDR6Work = rgxSpecial.Replace(Row.ADDRLNE6TXT, " ")ADDR7Work = rgxSpecial.Replace(Row.ADDRLNE7TXT, " ")ADDR8Work = rgxSpecial.Replace(Row.ADDRLNE8TXT, " ")
Row.ADDR1Address = rgxAddress.IsMatch(ADDR1Work)Row.ADDR2Address = rgxAddress.IsMatch(ADDR2Work)Row.ADDR3Address = rgxAddress.IsMatch(ADDR3Work)Row.ADDR4Address = rgxAddress.IsMatch(ADDR4Work)Row.ADDR5Address = rgxAddress.IsMatch(ADDR5Work)Row.ADDR6Address = rgxAddress.IsMatch(ADDR6Work)Row.ADDR7Address = rgxAddress.IsMatch(ADDR7Work)Row.ADDR8Address = rgxAddress.IsMatch(ADDR8Work)Row.ADDR1CityStateZip = rgxCityStateZip.IsMatch(ADDR1Work)Row.ADDR2CityStateZip = rgxCityStateZip.IsMatch(ADDR2Work)Row.ADDR3CityStateZip = rgxCityStateZip.IsMatch(ADDR3Work)Row.ADDR4CityStateZip = rgxCityStateZip.IsMatch(ADDR4Work)Row.ADDR5CityStateZip = rgxCityStateZip.IsMatch(ADDR5Work)Row.ADDR6CityStateZip = rgxCityStateZip.IsMatch(ADDR6Work)Row.ADDR7CityStateZip = rgxCityStateZip.IsMatch(ADDR7Work)Row.ADDR8CityStateZip = rgxCityStateZip.IsMatch(ADDR8Work)Row.ADDR1POBOX = rgxPOBOX.IsMatch(ADDR1Work)Row.ADDR2POBOX = rgxPOBOX.IsMatch(ADDR2Work)Row.ADDR3POBOX = rgxPOBOX.IsMatch(ADDR3Work)Row.ADDR4POBOX = rgxPOBOX.IsMatch(ADDR4Work)Row.ADDR5POBOX = rgxPOBOX.IsMatch(ADDR5Work)Row.ADDR6POBOX = rgxPOBOX.IsMatch(ADDR6Work)Row.ADDR7POBOX = rgxPOBOX.IsMatch(ADDR7Work)Row.ADDR8POBOX = rgxPOBOX.IsMatch(ADDR8Work)Row.ADDR1Email = rgxEmail.IsMatch(Row.ADDRLNE1TXT)Row.ADDR2Email = rgxEmail.IsMatch(Row.ADDRLNE2TXT)Row.ADDR3Email = rgxEmail.IsMatch(Row.ADDRLNE3TXT)Row.ADDR4Email = rgxEmail.IsMatch(Row.ADDRLNE4TXT)Row.ADDR5Email = rgxEmail.IsMatch(Row.ADDRLNE5TXT)Row.ADDR5Email = rgxEmail.IsMatch(Row.ADDRLNE6TXT)Row.ADDR7Email = rgxEmail.IsMatch(Row.ADDRLNE7TXT)Row.ADDR8Email = rgxEmail.IsMatch(Row.ADDRLNE8TXT)
End Sub
End Class
After the script component applys the pattern matching(Regular Expression) we use the Derived Column with a series of Condition Expression to populat each column. Here is a sample for mapping the address column Address , the rest are included in the attached package.
Derived Column code:
ADDR_1_Address == TRUE && ADDR_1_CityStateZip == FALSE ? ADDR_LNE_1_TXT : ADDR_2_Address == TRUE && ADDR_2_CityStateZip == FALSE ? ADDR_LNE_2_TXT : ADDR_3_Address == TRUE && ADDR_3_CityStateZip == FALSE ? ADDR_LNE_3_TXT : ADDR_4_Address == TRUE && ADDR_4_CityStateZip == FALSE ? ADDR_LNE_4_TXT : ADDR_5_Address == TRUE && ADDR_5_CityStateZip == FALSE ? ADDR_LNE_5_TXT : ADDR_6_Address == TRUE && ADDR_6_CityStateZip == FALSE ? ADDR_LNE_6_TXT : ADDR_7_Address == TRUE && ADDR_7_CityStateZip == FALSE ? ADDR_LNE_7_TXT : ADDR_8_Address == TRUE && ADDR_8_CityStateZip == FALSE ? ADDR_LNE_8_TXT :
"No Address"
SELECT TOP (15) CAST(CustomerKey as Varchar(50)) as ADDR_LNE_1_TXT , ISNULL(DimCustomer.FirstName,' ' ) + ' ' + ISNULL(DimCustomer.MiddleName,'' ) + ' ' + ISNULL(DimCustomer.LastName,' ' ) as ADDR_LNE_2_TXT, ISNULL(DimCustomer.AddressLine1,' ' ) + ' ' + ISNULL(DimCustomer.AddressLine2,'' )as ADDR_LNE_3_TXT ,DimCustomer.EmailAddress as ADDR_LNE_4_TXT, ISNULL(DimGeography.City,' ' ) + ' ' + ISNULL(DimGeography.StateProvinceCode,'' ) + ' ' + ISNULL(DimGeography.PostalCode,'' )as ADDR_LNE_5_TXT , StateProvinceName as ADDR_LNE_6_TXT, DimGeography.CountryRegionCode as ADDR_LNE_7_TXT, EnglishEducation as ADDR_LNE_8_TXTINTO ADDRESS_PARSE_DEMOFROM DimCustomer INNER JOINDimGeography ON DimCustomer.GeographyKey = DimGeography.GeographyKeyWhere CustomerKey in ( 11021 , 11022 ,11023 ,11024 ,11036) Union All SELECT TOP (15) EnglishEducation as ADDR_LNE_1_TXT, ISNULL(DimCustomer.FirstName,' ' ) + ' ' + ISNULL(DimCustomer.MiddleName,'' ) + ' ' + ISNULL(DimCustomer.LastName,' ' ) as ADDR_LNE_2_TXT, ISNULL(DimCustomer.AddressLine1,' ' ) + ' ' + ISNULL(DimCustomer.AddressLine2,'' )as ADDR_LNE_3_TXT ,DimCustomer.EmailAddress as ADDR_LNE_4_TXT, ISNULL(DimGeography.City,' ' ) + ' ' + ISNULL(DimGeography.StateProvinceCode,'' ) + ' ' + ISNULL(DimGeography.PostalCode,'' )as ADDR_LNE_5_TXT , StateProvinceName as ADDR_LNE_6_TXT, DimGeography.CountryRegionCode as ADDR_LNE_7_TXT, CAST(CustomerKey as Varchar(50)) as ADDR_LNE_8_TXT
---INTO ADDRESS_PARSE_DEMOFROM DimCustomer INNER JOINDimGeography ON DimCustomer.GeographyKey = DimGeography.GeographyKeyWhere CustomerKey in (11021 , 11022 ,11023 ,11024 ,11036) Union All SELECT TOP (15) ISNULL(DimGeography.City,' ' ) + ' ' + ISNULL(DimGeography.StateProvinceCode,'' ) + ' ' + ISNULL(DimGeography.PostalCode,'' )as ADDR_LNE_1_TXT , StateProvinceName as ADDR_LNE_2_TXT, DimGeography.CountryRegionCode as ADDR_LNE_3_TXT, CAST(CustomerKey as Varchar(50)) as ADDR_LNE_4_TXT, EnglishEducation as ADDR_LNE_5_TXT, ISNULL(DimCustomer.FirstName,' ' ) + ' ' + ISNULL(DimCustomer.MiddleName,'' ) + ' ' + ISNULL(DimCustomer.LastName,' ' ) as ADDR_LNE_6_TXT, ISNULL(DimCustomer.AddressLine1,' ' ) + ' ' + ISNULL(DimCustomer.AddressLine2,'' )as ADDR_LNE_7_TXT ,DimCustomer.EmailAddress as ADDR_LNE_8_TXT
FROM DimCustomer INNER JOINDimGeography ON DimCustomer.GeographyKey = DimGeography.GeographyKeyWhere CustomerKey in ( 11040 ,11041 ,11042 ,11043, 11049)
Ira Warren Whiteside
Actuality Business Intelligence
"karo yaa na karo, koshish jaisa kuch nahi hai"
"Do, or do not. There is no try."
Master Data Services - SSIS MDM Load
Microsoft recently released Master Data Services in SQL Server 2008 R2 CTP. Without getting into the strategic benefits, definitions etc. of MDM(Master Data Management) this tool will be very helpful for any application in automating the setup and maintenance of look up or cross reference tables or utilizing reference data.
Kirk Haselden has provided a white paper on The What, Why, and How of Master Data Management
Nick Barclay: BI-Lingual has provided an excellent post on getting started with this tool. In his post he provide a walkthrough on how to set up a sample Geography Model and the associated Entities, Attributes and Hierarchies. In addition he provides the TSQL to load them. I have expanded on his work and provided a basic SSIS Package for loading the Model.
MDS_StagingLoadBarclayProject.dtsx
The approach was to use derived column transforms to set up the required metadata and aggregate transform to generate distinct results. Obviously Nick has done all the heavy lifting I thought it would be helpful to have an SSIS Package to compliment his very informative post.
Last I intentionally kept this package basic, they are many opportunities for improvement and making the entire process data driven, which I will explore next.
An additional reference for understanding MDM is David Loshin's book.
The great philosophical lesson I think I have learned is that “Perception is NOT Reality”. I have always used the phrase “Perception is Reality” and referred to Plato’s Allegory of the Cave, smugly interpreting it for people unfamiliar. It was not until the fear of losing my soul mate , my wife, my love Tessie , that I realized the true meaning. I would always relate it to the common conundrum of a person trying to understand why they were misunderstood or not appreciated, and what they could do to change the reality or perception of others. I have now been enlightened It is now obvious to me that I am an idiot. Mine and Tessie’s Perception/Reality was that her moms breast/lung/bone cancer and death was from second hand smoke, that is what the doctors told her ,so if she didn’t smoke and basically ate right and her sisters didn’t have cancer she was safe. Our reality was DNA. Perception is what a single person chooses to believe , base their decisions on and has nothing to do with Reality. I am an idiot. Everything I have learned about cancer in the last few months has been there for years , my knowledge has completely changed my Perception and has not altered our Reality, it has increased our awareness. I have read the allegory hundreds of times and only now have paid attention to one of the last paragraphs.
“And now consider what would happen if such a man were to descend again and seat himself on his old seat? Coming so suddenly out of the sun, would he not find his eyes blinded with the gloom of the place?
Certainly, he would.
And if he were forced to deliver his opinion again, touching the shadows aforesaid, and to enter the lists against those who had always been prisoners, while his sight continued dim and his eyes unsteady, - and if this process of initiation lasted a considerable time, - would he not be made a laughingstock, and would it not be said of him, that he had gone up only to come back again with his eyesight destroyed, and that it was not worth while even to attempt the ascent? And if anyone endeavored to set them free and carry them to the light, would they not go so far as to put him to death, if they could only manage to get him into their power?”
I always focused on the fact that the prisoner became enlightened and gained awareness and a different and I thought better Perception and changed his Reality, and I was very pleased with myself in comprehending this. The revelation to me now is that the experience(awareness) that changed his Perception of his Perception , did not change his Reality, it simply made him aware of his incorrect awarement and led to a quicker death. Apparently Perception is Perception and changes frequently, based on your current awareness and Reality is a constant and unchanging. How can it change , since as soon as you think about it it’s gone. Changing your Perception can prolong your ability to experience(gain knowledge/awareness) of Reality , it does not change Reality. I have been pondering lately if “Awareness is Reality”
In any event I have endeavored to increase our awareness of awarement regarding cancer and prolong our ability to experience reality.
Recently a client has described a need for real time (or near-real time) movement of data from their operational logistics application to an analytics/reporting visualization environment where presentation tools such as Microsoft Reporting Services, Business Objects or QlikView consume the data. First we will focus on the “Real Time” update scenarios. We can address presentation in a separate document. Based on the discussion we noted three possible approaches to solving this problem.
1. Replicate Tables using SQL Server Replication and use Business Objects directly against the OLTP.
2. Utilize a real time update process (Informatica/SQL Server) to incrementally update tables and use Business Objects directly against the OLTP.
3. Use a OLAP or Memory based approach, using various compression techniques, eliminating the need for additional relational tables and SQL based query tools.
In our discussion several vendors were mentioned:
1. Informatica via Informatica PowerExchange™:MS SQL Server Real-Time Option for updated relational tables. This is a change data capture tool.
2. QlikView – A Memory based reporting and querying tool.
3. SQL Server 2005 – Microsoft DBMS , OLAP and associated applications (SSIS – SQL Server Integration Services, SSAS – SQL Server Analysis Services)
There are other solutions that are not relational based such as Netezza or Sand Technology. I would suggest a brief review of Gartner current analysis as it relates to real time updates.
I would like to add the observation that the transfer problem is a separate logical problem from the analysis visualization problem. The three tools discussed all approach the problem completely differently. Only QlikView attempts to present a solution that combines the two.
There are two approaches to near-real time analytics. They are as follows:
I conducted a brief analysis of the vendors mentioned and have the following observations:
a. This would require replicating the existing database model and reporting directly against it or developing a dimensional model.
Different policies allow for quite different behaviors. For example:
The proactive caching settings can be set on each different Analysis Services object (dimension and partition). This provides great flexibility in cases where the policies and the tradeoffs between performance and latency vary for different parts of the cubes .
Microsoft Cubes in conjunction with the SSIS (Pipeline Architecture) and UDM (Universal Data Model) do not require any staging.
Recommendations: I would recommend a POC(Proof of Concept) and the creation of two scenarios, focusing on the two OLAP type options using QlikView and SQL Server 2005 Analysis Services. As part of the scenarios I suggest:
In reviewing the two tools it is apparent the extraction processes would be similar and SQL based, but relying on different scheduling and/or scheduling processes. Also it would seem using views for database would be rejected out of hands due to performance. A best practices Extraction Architecture would require a metadata driven application architecture that would be data aware of structure or content (domain) changes and automatically adjust load processes accordingly. Each solution would probably require replication. The focus of completing the scenarios would be to evaluate the gaps each of the recommended scenarios/tools have against Extraction Architecture requirements, effort and skill set required to complete the POC.
[1] Microsoft SQL Server 2005 analysis Services Performance Guide
[2] Actually QlikView requires significant manual coding for loading. Specifically QlikView assumes all like field names are keys and attempts to create joins, these must be hand coded for only the required fields FAQ 7 in their doc.
By Ira Warren Whiteside
A while back I was reading one of Jamie Thomson most excellent posts on Conchango blogs SSIS Junkie SSIS: Data Profiling Task: Part 7 - Functional Dependency and it occurred to me it would be interesting to provide a poor man’s Functional Dependency using on TSQL.
First let’s define Functional Dependency.
Determines the extent to which the values in one column (the dependent column) depend on the values in another column or set of columns (the determinant column) ; This profile also helps you identify problems in your data, such as values that are not valid. For example, you profile the dependency between a column that contains Country or Region Codes and a column that contains States/Provinces. The same Country or Region Codes should always have the same States/Provinces, but the profile discovers violations of this dependency.
Now let’s take a crack at writing our own. Naturally when facing a difficult problem in TSQL I turned to SQL Server Central vast library.
When analyzing the capabilities needed to determine one columns dependency on another you would obviously need recursion or the ability to link one row to the next. A brief search of existing articles SQL Server Central returned Linking to the Previous Row By David McKinney, 2008/03/13 and Common Table Expressions in SQL Server 2005 By Srinivas Sampath, 2007/03/09
Armed with these two articles I now had all the techniques I need to develop a query or stored procedure to provide Column Functional Dependency.
The heart of this effort is to utilize Common Table Expressions(CTE) or the “with” clause as defined on MSDN as:
“Specifies a temporary named result set, known as a common table expression (CTE). This is derived from a simple query and defined within the execution scope of a single SELECT, INSERT, UPDATE, or DELETE statement. This clause can also be used in a CREATE VIEW statement as part of its defining SELECT statement. A common table expression can include references to itself. This is referred to as a recursive common table expression.”
Basically the “with” clause provides you with the ability to logically create CTE’s or result sets ,logically develop your code and employ recursive techniques.
A detailed treatment of CTE’s is available at and Common Table Expressions in SQL Server 2005 By Srinivas Sampath, 2007/03/09
I have used the same test setup that Jamie Thomson provides for testing the SQL Server 2008 Data Profiling Task.
select ROW_NUMBER() OVER (partition by sp.Name order by a.PostalCode) RowNum
, a.PostalCode
, sp.Name AS StateProvinceName
, sp.CountryRegionCode
into Addresses
from Person.[Address] a
inner join Person.StateProvince sp
on a.StateProvinceID = sp.StateProvinceID
update Addresses
set CountryRegionCode = 'ZZ'
where RowNum = 1
This will create a test table called Addresses, which contains a Functional Dependency scenario for Province or States and Countries.
USE [AdventureWorks]
GO
/****** Object: Table [dbo].[DependencyColumnTable] Script Date: 01/14/2009 19:07:54 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[DependencyColumnTable]') AND type in (N'U'))
DROP TABLE [dbo].[DependencyColumnTable]
/****** Object: StoredProcedure [dbo].[uspGetBillOfMaterials] Script Date: 01/14/2009 14:26:33 ******/
SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
/****** Object: StoredProcedure [dbo].[ColumnDependency] Script Date: 01/14/2009 17:33:24 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ColumnDependency]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[ColumnDependency]
CREATE PROCEDURE [dbo].[ColumnDependency]
@TableName VARCHAR(128) ,
@Determinant VARCHAR(128) ,
@Dependent VARCHAR(128)
AS
Declare @SQL VarChar(MAX)
SELECT @SQL = ' ;WITH DependencyCompare AS ('
SELECT @SQL = @SQL + ' SELECT '
SELECT @SQL = @SQL + @Determinant + ' ,' + @Dependent + ', '
SELECT @SQL = @SQL + 'ROW_NUMBER() OVER (Partition BY ' + @Dependent + ' ORDER BY ' + @Determinant + ' ) AS rownumcalc FROM '
SELECT @SQL = @SQL + @TableName
SELECT @SQL = @SQL + ' ) SELECT
currow.' + @Dependent + ' AS curDet,
prevrow.' + @Dependent + ' AS prevDet,
nextrow.' + @Dependent + ' AS nextDet,
currow.' + @Determinant + ' AS curDep,
prevrow.' + @Determinant + ' AS prevDep,
nextrow.' + @Determinant
+ ' AS nextDep,
currow.rownumcalc ,
ViolationCount =
Case
When currow.' + @Determinant + ' = prevrow.' + @Determinant +
' and currow.' + @Dependent + ' = prevrow.' + @Dependent +
' then 0
When prevrow.' + @Dependent + ' is null
then 0
Else 1
End'
+
' INTO DependencyColumnTable
FROM
DependencyCompare currow '
+ ' LEFT JOIN DependencyCompare nextrow ON currow.rownumcalc = nextrow.rownumcalc - 1 AND currow.'
+ @Dependent
+ ' = nextrow.'
+ ' LEFT JOIN DependencyCompare prevrow ON currow.rownumcalc = prevrow.rownumcalc + 1 AND currow.'
+ ' = prevrow.'
+ ' Order by currow.'
Print @SQL
Exec (@SQL)
Exec ColumnDependency
@TableName = 'Addresses',
@Dependent = 'StateProvinceName',
@Determinant = 'CountryRegionCode '
The code here is basically leveraging the recursive capabilities of CTE. Our objective here is to compare current rows to previous row columns to determine when a Dependent Column value changes and the Determinant columns has NOT changes. This is a Violation. A detailed explanation of the technique of linking to previous records and be reviewed at SQL Server Central returned Linking to the Previous Row By David McKinney, 2008/03/13. One issue I had was needing to dynamically populate the From clause or Table.
I used the technique presented by Introduction to Dynamic SQL (Part 1) By Damian Maclennen on 20 June 2001. This technique relies in creating a @SQL variable and executing the variable as part of the stored procedure.
Create Procedure GenericTableSelect
@TableName VarChar(100)
Declare @SQL VarChar(1000)
SELECT @SQL = 'SELECT * FROM '
Exec ( @SQL)
`
As you can see here Alabama is our Determinant column which normally has an associated Dependent Values of US. You can see when the Dependent values changes to ZZ it is tagged a Violation Count as ‘1’
/* Sample Execution
exec ColumnDependencyOLAP
*/
/* This SP will create a Summary Table and calculate the Support and Vio;ation percentages.
/****** Object: StoredProcedure [dbo].[ColumnDependencyOLAP] Script Date: 01/14/2009 17:33:24 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ColumnDependencyOLAP]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[ColumnDependencyOLAP]
CREATE PROCEDURE [dbo].[ColumnDependencyOLAP]
SELECT @SQL = '
;WITH DependencyCompareDetail AS
(
Select count(ViolationCount) as Violations , curDet , curDep,
Case when ViolationCount = 0
then ''Support Percentage''
else ''Violation Percentage''
end as Violation ,
ROW_NUMBER() OVER (Partition BY curDet ORDER BY curDet ,curDep) AS rownumcalc
FROM [AdventureWorks].[dbo].[DependencyColumnTable]
--Where ViolationCountNull = ''N''
group by curDet, curDep , ViolationCount
)
,DependencyCompareSummary AS (
Select count(ViolationCount) as total , curDet ,
ROW_NUMBER() OVER (Partition BY curDet ORDER BY curDet ) AS rownumcalc
group by curDet
,DependencyCompareOLAP AS
Select
DependencyCompareDetail.curDet ,
DependencyCompareDetail.curDep,
DependencyCompareDetail.Violation,
DependencyCompareDetail.Violations ,
DependencyCompareSummary.total ,
ROW_NUMBER() OVER (Partition BY DependencyCompareDetail.curDet ORDER BY DependencyCompareDetail.curDep) AS rownumcalc
From DependencyCompareDetail
LEFT JOIN DependencyCompareSummary
ON DependencyCompareDetail.curDet =DependencyCompareSummary.curDet
curDet as Determinant,
curDep as Dependent,
Violation as ''Dependency Type'',
Violations as ''Dependeny Counts'' ,
total as ''Total Determinant Count'',
cast(cast(Violations AS Decimal ) /cast(total AS Decimal )* 100 as decimal (5, 2)) as PercentViolation
INTO DependencyCompareOLAP
from DependencyCompareOLAP
Order by curDet , curDep '
Now that we have created a table with the Functional Dependency Violations identified we want to create a table that can be used for Analysis or Reporting, containing the same percentages available in the SSIS 2008n Data Profiling Task Functional Option. Again we have relied on the CTE in order to create a reusable query that can create a single table that has the required for analysis such as Totlal Determinint , Dependents categroized by Supporting Dependents and Violating Dependents.
Here are the results from the SSIS 2008 Data Profiling Task for Functional Dependency.
As you can see the Support Percentages for Alabama are exactly the same.
I would also add from an analytical perspective you can quickly create a Pivot Table in Microsoft Excel 2007. Simply open Excel 2007, select the Data Tab .
Then select From Other Sources then follow the wizard and create a connection to the DependencyCompareOLAP table to create the Pivot table and Chart below. As you can see here is a quick way to determine the outliers like Alabama.
Or Manitoba in Canada
You also can add additional columns to the table and stored procedures, such as table name, date and time and create a repository that would allow you to track the Violations incrementally or over time, and provide proactive processes to respond to continuing anomalies.
Here are several additional links that were helpful.
2. Introduction to Dynamic SQL (Part 1) By Damian Maclennen on 20 June 2001
3. SQL Server 2005 Recursion and WITH Clause By Robbe D. Morris
4. The SQL WITH Clause, (Order Your Thoughts, Reuse Your Code) Submitted by Kevin Meade on Tue, 2007-06-19 14:59