The Issue
When you work on a data warehouse, one issue is simply that there are a great many tables you could be using in your analysis. Lots of tables means lots of columns, and most likely a lot of those columns will hold date data types. Anywhere there are dates, there is potential for absurd values to show up in them.
Where I work, a longstanding policy is to not allow vital date columns to support NULL values. Nevertheless, NULL values are needed, often for an end-date field, as a way to indicate that the record is still current. As a consequence, there is a practice of using the maximum possible date of 31st December 9999 as the pseudo-NULL value. Inevitably though, it seems that not all processes get written using the exact correct value, and so numerous instances of 1st January 9999 show up, along with other misunderstandings of what should be used.
While I have a fairly stock method for handling these things when I'm analysing end-date values, namely a NULLIF expression:
SELECT
NULLIF( Cln_Dt, ( '31/12/9999' (DATE, FORMAT 'dd/mm/yyyy') ) ) AS EnNulled_Cln_Dt
and which can easily be extended to handle a few such values:
SELECT
NULLIF(
NULLIF(
NULLIF(
Cln_Dt,
( '31/12/9999' (DATE, FORMAT 'dd/mm/yyyy') )
),
( '01/01/9999' (DATE, FORMAT 'dd/mm/yyyy') )
),
( '31/12/9000' (DATE, FORMAT 'dd/mm/yyyy') )
)
AS EnNulled_Cln_Dt
but ultimately the problem becomes one of knowing which values to explicitly write the code to handle.
Besides the specific date values, there is also the issue of knowing how often this kind of thing is occurring.
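To make that concrete, here's the kind of one-off check I might run by hand against a single date column. The names here (SpecificDbs.SomeTable and its End_Dt column) are made up for illustration, and the 150-year cut-off is the same arbitrary value the method below ends up using; the rest of this article is really about automating this same idea across every date column, and remembering which ones have already been looked at.
SELECT
End_Dt AS DateValue ,
COUNT( * ) AS PresenceCount
FROM
SpecificDbs.SomeTable
WHERE
EXTRACT( YEAR FROM End_Dt ) > EXTRACT( YEAR FROM CURRENT_DATE ) + 150
GROUP BY
End_Dt
ORDER BY
PresenceCount DESC
;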
The Absent Solution
Alas, the most likely thing you've thought as you're reading this is: surely any organisation with this kind of problem would acquire some kind of Data Quality tool and tackle it head on?
Yes, you are probably right to think that, and I'm certainly not going to argue against the idea. However I can honestly say that in over 20 years of doing data analysis work I've never known any actual progress to have been made on issues like this. Or maybe it has, but new processes for messing up data keep springing up.
- I could write a whole book on that topic, don't tempt me!
So what I present here is something that can be done by anyone with the ability to write SQL and sufficient access permissions to create and populate a holding table. I don't claim it is super-clever - it is just something I wrote in half an afternoon and confirmed that it works.
What you see is the SQL I wrote - slightly changed to not use names specific to my workplace - and then some annotations added to describe the code.
The Method
As I work on Teradata, I'll be using its syntax and features, namely:
- making a couple of tables
- making a number of views
- making a number of macros
Also, I will be using the "data dictionary" that Teradata provides - i.e. a way to query the system itself to find out what databases, tables and columns are on the system. In Teradata, this is done through a pseudo database called DBC and various views within it. For our purpose here we'll just use one of those, the one that lists all the columns, which is unsurprisingly called DBC.ColumnsV
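If you've not poked around in DBC before, a quick query against it looks something like this (SpecificDbs is a made-up database name; also note that columns belonging to Views will report a NULL ColumnType, which matters later on):
SELECT
DatabaseName ,
TableName ,
ColumnName ,
ColumnType
FROM
DBC.ColumnsV
WHERE
DatabaseName = 'SpecificDbs'
ORDER BY
TableName ,
ColumnName
;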
One other thing that I use for this is that my SQL tool allows me to run an SQL statement - one that has been written to itself generate yet more SQL as a set of return values - and then execute those return values. If you don't have that kind of feature in your tooling, that's OK, because you can usually manually copy the generated code over from being a result to being submitted as more statements to execute.
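As a tiny sketch of that "SQL that writes SQL" idea - not part of the method itself, and using DBC.TablesV and the made-up SpecificDbs purely for illustration - each row this returns is itself a runnable statement:
SELECT
'SELECT COUNT(*) AS RowTally FROM ' ||
TRIM( DatabaseName ) || '.' || TRIM( TableName ) ||
' ;' AS SqlStr
FROM
DBC.TablesV
WHERE
DatabaseName = 'SpecificDbs'
AND
TableKind = 'T'
;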
Corresponding to that, there are some issues about how to execute very large numbers of generated SQL statements and how to abandon a run partway through. Those issues will not be covered in this article, but the method is structured to cope with the after-effects of such interruptions.
Similarly, in other systems or dialects the implementation may need to use differently named features instead - e.g. "stored procedures" - and this article won't attempt to cover what those might be.
The SQL
Create Two Tables - Bank and Run
I will want two tables to hold findings:
- one for use during a run
- and one for storing the latest findings across all runs.
While I work on Teradata quite a lot, there can be slight variations among setups, so I have a stock method of using a "CREATE AS" to generate a prototype table. I then run a SHOW TABLE command to see what the DDL looks like and adapt it for the real table.
Here's my prototype table maker:
CREATE TABLE
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank
AS
(
SELECT
DatabaseName ,
TableName ,
ColumnName ,
CAST( NULL AS DATE) AS DateValue ,
CAST( NULL AS INTEGER ) AS PresenceCount ,
CAST( NULL AS DATE) AS LastCheckedAtDate
FROM
DBC.ColumnsV
WHERE
DatabaseName IS NULL
AND
TableName IS NULL
AND
ColumnName IS NULL
)
WITH
NO DATA
;
Next is how we see the DDL for the prototype table.
SHOW TABLE
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank
;
I clip the output of that to a text editor and make some small changes.
Next we drop the bank table - in case you're wondering, Teradata (still) does not provide a "drop if exists" statement in its SQL dialect.
DROP TABLE
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank
;
Next we make the real bank table - this is my customisation of the DDL from the prototype table. In this case I just added the UPI (unique primary index), which is the Teradata equivalent of a primary key setting. Another Teradata feature is SET TABLE, for which the other option is MULTISET TABLE, but I'm not going to explain the distinction here. I think "set tables" are a matter of good practice.
CREATE SET TABLE
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank ,
FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO,
MAP = TD_MAP1
(
DatabaseName VARCHAR(128) CHARACTER SET UNICODE NOT CASESPECIFIC,
TableName VARCHAR(128) CHARACTER SET UNICODE NOT CASESPECIFIC,
ColumnName VARCHAR(128) CHARACTER SET UNICODE NOT CASESPECIFIC,
DateValue DATE FORMAT 'yyyy-mm-dd',
PresenceCount INTEGER,
LastCheckedAtDate DATE FORMAT 'yyyy-mm-dd'
)
UNIQUE PRIMARY INDEX (
DatabaseName ,
TableName ,
ColumnName ,
DateValue
)
;
Next I have a statement to drop the per-run table, which I would only need to do here if I was re-doing this sequence, because otherwise the table doesn't yet exist.
DROP TABLE
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run
;
Next we make the per-run table - its structure is the same as the bank table's, and Teradata lets us do a simple clone of that structure.
CREATE TABLE
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run
AS
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank
WITH
NO DATA
;
Define Views and Macros
Next we define a View that exists just to give us a constant value of the number of days to wait before refreshing the check for extreme dates on a specific column. Teradata allows for a SELECT with no FROM clause, which makes this quite simple.
- If you're wondering why this is a View and not merely a matter of having the value quoted as a literal where it gets used: this method allows us to independently re-write the view to change the behaviour of our little system. While this is a small example, in other situations this approach of putting definitions in views can allow for quite sophisticated method variations.
REPLACE VIEW
myuid.Prefix_DQ_C_DaystoRefresh
AS
(
SELECT
150 AS DaystoRefresh
)
;
Next we define a View for how to get a list of date columns from the system. Here we are using knowledge about the DBC database on a Teradata system and its view holding metadata for the columns in every view and table.
REPLACE VIEW
myuid.Prefix_DQ_C_DTC_DateColumns_Potential
AS
(
SELECT
TRIM( C_V.DatabaseName ) AS Dbn ,
TRIM( C_V.TableName ) AS Tbn ,
TRIM( C_V.ColumnName ) AS Cln
FROM
DBC.ColumnsV AS C_V
WHERE
/* -- later need to find a way to get the real data types for columns in views
C_V.Type IN ( )
AND
*/
C_V.DatabaseName IN ( 'SpecificDbs' )
/* AND
TableName IS NULL */
AND
C_V.ColumnName LIKE '%_Dt'
)
;
Note that I've left myself some code in a bracketed comment in case I wanted to filter the columns by actually having a DATE data type. In my case, the specific database I wanted to run analyses on was a space of only Views, and DBC.ColumnsV does not hold the data types of view columns. Hence I've filtered them by assuming that their names would all end with "_Dt"
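For what it's worth, if the database being analysed held base tables rather than views, that commented-out filter could be made real by testing the type code instead of the name pattern - DATE columns carry the code 'DA' in DBC.ColumnsV. A variant of the view's inner SELECT might then look like this (a sketch, not something I actually ran, and SpecificDbs remains a made-up name):
SELECT
TRIM( C_V.DatabaseName ) AS Dbn ,
TRIM( C_V.TableName ) AS Tbn ,
TRIM( C_V.ColumnName ) AS Cln
FROM
DBC.ColumnsV AS C_V
WHERE
C_V.ColumnType = 'DA'
AND
C_V.DatabaseName IN ( 'SpecificDbs' )
;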
Next we define a View for how to get the list of date columns to inspect this time around. While the base of this is the view we just defined for fetching the metadata, my data warehouse is so large that this would give me thousands of columns to analyse. While I do want to analyse them all, on any day when I run this I won't want to re-analyse ones that I've done somewhat recently.
To that end, I compare against the Bank table and look at the date each column was last analysed. Similarly, if I have to interrupt a run, I don't want to re-analyse columns that I've just done but which are still pending my end-of-run step to merge into the Bank table.
Both of those checks are simply LEFT OUTER JOINs and a test for a match through them.
REPLACE VIEW
myuid.Prefix_DQ_C_DTC_DateColumns_BeyondDaystoRefresh
AS
(
SELECT
GetDateCols.Dbn ,
GetDateCols.Tbn ,
GetDateCols.Cln ,
( CURRENT_DATE - Overdue_Holdings.Max_LastCheckedAtDate ) AS DaysSinceLastCheck ,
CASE
WHEN DaysSinceLastCheck IS NULL THEN 'A'
ELSE 'B'
END AS LastCheckCategory
FROM
myuid.Prefix_DQ_C_DaystoRefresh AS D_T_R
CROSS JOIN
myuid.Prefix_DQ_C_DTC_DateColumns_Potential AS GetDateCols
LEFT OUTER JOIN
( -- Overdue_Holdings
SELECT
DatabaseName ,
TableName ,
ColumnName ,
MAX( LastCheckedAtDate ) AS Max_LastCheckedAtDate
FROM
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank
GROUP BY
DatabaseName ,
TableName ,
ColumnName
) AS Overdue_Holdings ON
Overdue_Holdings.DatabaseName = GetDateCols.Dbn
AND
Overdue_Holdings.TableName = GetDateCols.Tbn
AND
Overdue_Holdings.ColumnName = GetDateCols.Cln
LEFT OUTER JOIN
( -- During_Run = this is here to allow for the meta-SQL generation step to be stopped and re-started
SELECT
DatabaseName ,
TableName ,
ColumnName ,
MAX( LastCheckedAtDate ) AS Max_LastCheckedAtDate
FROM
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run
GROUP BY
DatabaseName ,
TableName ,
ColumnName
) AS During_Run ON
During_Run.DatabaseName = GetDateCols.Dbn
AND
During_Run.TableName = GetDateCols.Tbn
AND
During_Run.ColumnName = GetDateCols.Cln
WHERE
(
Overdue_Holdings.Max_LastCheckedAtDate IS NULL -- never checked before
OR
Overdue_Holdings.Max_LastCheckedAtDate + D_T_R.DaystoRefresh <= CURRENT_DATE -- last check is older than the refresh window
)
AND
During_Run.Max_LastCheckedAtDate IS NULL -- not already done in the current (possibly interrupted) run
)
;
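As an aside - not a step in the run itself, just a sanity-check sketch - a quick way to see how much work that leaves pending is to summarise the view before going any further:
SELECT
LastCheckCategory ,
COUNT( * ) AS ColumnsPending
FROM
myuid.Prefix_DQ_C_DTC_DateColumns_BeyondDaystoRefresh
GROUP BY
LastCheckCategory
;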
Next we define a View to provide a constant value for the number of years after the current year to treat as an extreme date. The setting here is quite arbitrary, but you should think carefully before setting it to zero or one - you might be surprised how often your data legitimately includes "next year" or even "next decade" values in various places.
REPLACE VIEW
myuid.Prefix_DQ_MetaSQL_V_YearsAfterCurrent
AS
(
SELECT
150 AS YearsToSkip_Int ,
TRIM( CAST( YearsToSkip_Int AS VARCHAR(8) ) ) AS YearsToSkip_Str
)
;
Next we define the View that will be the SQL script builder for analysing the columns. I've been writing this kind of thing for many years. While it's always a bit messy, hopefully you can read through the construction aspects and get a feel for what the SQL it generates will look like.
REPLACE VIEW
myuid.Prefix_DQ_MetaSQL_V_ExtremeDatesInColumns
AS
(
SELECT
(
'INSERT INTO ' ||
'TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run ' ||
'( DatabaseName, TableName, ColumnName, DateValue, PresenceCount, LastCheckedAtDate ) ' ||
'SELECT ' ||
'''' || Dbn || ''' AS Db_N , ' ||
'''' || Tbn || ''' AS Tb_N , ' ||
'''' || Cln || ''' AS Cl_N , ' ||
Cln || ' AS DateValue , ' ||
'COUNT( ' || Cln || ') AS PresenceCount , ' ||
'CURRENT_DATE AS LastCheckedAtDate ' ||
'FROM ' ||
Dbn || '.' || Tbn || ' ' ||
'WHERE ' ||
'EXTRACT( YEAR FROM ' || Cln || ') > EXTRACT( YEAR FROM CURRENT_DATE ) + ' || Y_A_C.YearsToSkip_Str || ' ' ||
'GROUP BY ' ||
Cln || ' ' ||
';'
) AS SqlStr ,
( LastCheckCategory || '#' ||
CAST( ( DaysSinceLastCheck (FORMAT '99999') ) AS CHAR(2) ) || ':' ||
( Dbn || '.' || Tbn || '.' || Cln )
) AS Sorter
FROM
myuid.Prefix_DQ_MetaSQL_V_YearsAfterCurrent AS Y_A_C
CROSS JOIN
myuid.Prefix_DQ_C_DTC_DateColumns_BeyondDaystoRefresh AS GetDateCols
)
;
While the Sorter column is both optional and arbitrary, the way I've constructed it here will let us prioritise columns that have not been analysed yet, followed by those done longest ago.
Do note that Teradata Views are not allowed to have an ORDER BY clause, so while we can provide the sorting column, actually applying it will have to wait for a Macro.
Next we have (surprise!) a Teradata Macro, which really doesn't do anything other than select just the generated SQL strings, in a prudent order.
Also note that the generated SQL will all be INSERTs into the Run table.
REPLACE MACRO
myuid.Prefix_DQ_MetaSQL_M_ExtremeDatesInColumns
AS
(
SELECT
SqlStr
FROM
myuid.Prefix_DQ_MetaSQL_V_ExtremeDatesInColumns
ORDER BY
Sorter
;
)
;
Being a SELECT, that macro will return rows when it gets run.
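To give a feel for what comes back, each returned SqlStr is a single long line that, reformatted here for readability and with made-up names (SpecificDbs, SomeTable, End_Dt), looks like this:
INSERT INTO
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run
( DatabaseName, TableName, ColumnName, DateValue, PresenceCount, LastCheckedAtDate )
SELECT
'SpecificDbs' AS Db_N ,
'SomeTable' AS Tb_N ,
'End_Dt' AS Cl_N ,
End_Dt AS DateValue ,
COUNT( End_Dt ) AS PresenceCount ,
CURRENT_DATE AS LastCheckedAtDate
FROM
SpecificDbs.SomeTable
WHERE
EXTRACT( YEAR FROM End_Dt ) > EXTRACT( YEAR FROM CURRENT_DATE ) + 150
GROUP BY
End_Dt
;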
Next we define a Macro for deleting all the rows in the Run table.
REPLACE MACRO
myuid.Prefix_DQ_R_ExtremeDatesInColumns_ClearRun
AS
(
DELETE FROM
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run
;
)
;
Next we define a Macro to perform the merge of the findings that were inserted into the Run table.
To old people like me, this uses the "new" MERGE INTO statement, but really it's been in Teradata for quite a long time now. For any Teradata users not familiar with it, you should note that its internal execution was written in a way that is much more efficient than either the older UPDATE or INSERT commands (but you'd need to read the Teradata doco to see why that is). The syntax and run-time logic of this command can take some getting used to, but in this situation it fits the bill perfectly, and hopefully is fairly apparent.
REPLACE MACRO
myuid.Prefix_DQ_R_ExtremeDatesInColumns_MergeRun
AS
(
MERGE INTO
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank AS T_Bnk
USING
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run AS T_Run ON
T_Run.DatabaseName = T_Bnk.DatabaseName
AND
T_Run.TableName = T_Bnk.TableName
AND
T_Run.ColumnName = T_Bnk.ColumnName
AND
T_Run.DateValue = T_Bnk.DateValue
WHEN MATCHED THEN UPDATE SET
PresenceCount = T_Run.PresenceCount ,
LastCheckedAtDate = T_Run.LastCheckedAtDate
WHEN NOT MATCHED THEN INSERT
(
DatabaseName ,
TableName ,
ColumnName ,
DateValue ,
PresenceCount ,
LastCheckedAtDate )
VALUES (
T_Run.DatabaseName ,
T_Run.TableName ,
T_Run.ColumnName ,
T_Run.DateValue ,
T_Run.PresenceCount ,
T_Run.LastCheckedAtDate )
;
)
;
Perform a Run
Now, having created our tables, and defined all the views and macros, we can actually do a run.
First, clear the Run table by running the macro for that.
EXECUTE
myuid.Prefix_DQ_R_ExtremeDatesInColumns_ClearRun
;
Next we run the macro that generates the SQL.
EXECUTE
myuid.Prefix_DQ_MetaSQL_M_ExtremeDatesInColumns
;
Assuming you don't have a tool that can automate this, you can just copy all that output as text and paste it back into wherever you submit your SQL.
Perhaps try just the first statement on its own first.
At this point, the design of our little system doesn't care how many of those rows you transplant over to being submitted. Perhaps keep doing a few until at least one indicates that it actually inserted something into the Run table.
Next, after we've run as many of those as we feel like doing, we run the macro to merge the Run discoveries into the Bank table.
EXECUTE
myuid.Prefix_DQ_R_ExtremeDatesInColumns_MergeRun
;
With all that done, we inspect the results now in the Bank table.
SELECT
*
FROM
TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank
ORDER BY
DatabaseName ,
TableName ,
ColumnName ,
PresenceCount DESC ,
DateValue DESC
;
Addendum
As it happened, I wrote all the above and ran it and it chugged along nicely, taking different amounts of time for various tables and the date columns in them.
However, when I inspected the results I realised they included something that I really wasn't interested in - they counted all the occurrences of the correct dummy date of 31 December 9999.
Because really, my interest was to show various people my findings of unusual values - for discussion about when to treat them as erroneous data and when to just treat them as alternate "NULL" end dates.
The 31 December 9999 dates being in the results aren't themselves a problem, but if I want to avoid having them in there at all, then here is an extra couple of lines to insert into the SQL-builder view (shown with the surrounding WHERE and EXTRACT lines for context) so it doesn't bother collecting those values.
'WHERE ' ||
'NULLIF( ' || Cln || ' , ( ''31/12/9999'' (DATE, FORMAT ''dd/mm/yyyy'') ) ) IS NOT NULL ' ||
'AND ' ||
'EXTRACT( YEAR FROM ' || Cln || ') > EXTRACT( YEAR FROM CURRENT_DATE ) + ' || Y_A_C.YearsToSkip_Str || ' ' ||
Summary
In a better world, problem dates would attract enthusiastic people and tools for correcting errant values and managing data in organised ways. But while you wait for that to spontaneously happen (perhaps don't hold your breath), you can at least use your own skills to get a handle on the scale of the problem, with just plain old SQL.