ETL tools can be extremely involved, especially with complex
data sets. At one time or another, many data management professionals have built
tools that have done the following:
- Taken data from multiple places.
- Transformed it (often significantly) into formats that other systems can accept.
- Loaded said data into new systems.
In this post, I discuss how to add some basic checkpoints
into tools to prevent things from breaking bad.
The Case for Checkpoints
Often, consultants are brought into organizations in need of
solving urgent data-related problems. Rather than gather requirements and
figure everything out, the client usually wants to just start building. Rarely
will people listen when consultants advocate the need to take a step back
before beginning our development efforts in earnest. While this is a bit of a
generalization, few non-technical folks understand:
- the building blocks required to create effective ETL tools
- the need to know what you need to do–before you actually have to do it
- the amount of rework required should developers have an incomplete or inaccurate understanding of what the client needs done
Clients with a need to get something done immediately don’t
want to wade through requirements; they want action–and now. The consultant who
doth protest too much runs the risk of irritating his/her clients, not to
mention being replaced. You’ll never hear me argue against understanding
as much as possible before creating an ETL tool, but that luxury isn’t always on offer.
Enter the Checkpoint
Checkpoints are simply invaluable tools for preventing
things from getting particularly messy. Even simple SQL SELECT statements
identifying potentially errant records can be enormously useful. For example, suppose I need to manipulate a large number of
financial transactions from disparate systems. Ultimately, these transactions
need to precisely balance against each other. Should one transaction be missing
or inaccurate, things can go awry. I might need to review the thirty or so
queries that transform the data, looking for an error on my end. This can be
time-consuming and even futile.
Enter the checkpoint. Before the client or I even run the
balancing routine, the ETL tool spits out a number of audits that identify major
issues before anything else happens. These include:
- Missing currencies
- Missing customer accounts
- Null values
- Duplicate records
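
As a rough illustration, the four audits above could be expressed as plain SELECT statements and run as a gate before the balancing routine. This is only a minimal sketch using Python's built-in sqlite3 module. The table names (transactions, currencies, accounts), the column names, and the helper names run_checkpoints and build_demo_db are all hypothetical, invented for this example rather than taken from any real client system.

```python
import sqlite3

# Hypothetical audit queries: every table and column name here is an
# assumption made for illustration. Each query returns the ids of
# transactions that look suspect.
AUDITS = {
    "missing_currency": """
        SELECT t.id FROM transactions t
        LEFT JOIN currencies c ON c.code = t.currency
        WHERE t.currency IS NOT NULL AND c.code IS NULL
    """,
    "missing_customer_account": """
        SELECT t.id FROM transactions t
        LEFT JOIN accounts a ON a.account_id = t.account_id
        WHERE t.account_id IS NOT NULL AND a.account_id IS NULL
    """,
    "null_values": """
        SELECT id FROM transactions
        WHERE amount IS NULL OR currency IS NULL OR account_id IS NULL
    """,
    "duplicate_records": """
        SELECT MIN(id) FROM transactions
        GROUP BY source_system, source_ref
        HAVING COUNT(*) > 1
    """,
}


def run_checkpoints(conn):
    """Run every audit; return {audit_name: [ids]} for audits that found rows."""
    failures = {}
    for name, sql in AUDITS.items():
        ids = [row[0] for row in conn.execute(sql)]
        if ids:
            failures[name] = ids
    return failures


def build_demo_db():
    """Create a small in-memory database with one problem row per audit."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE currencies (code TEXT PRIMARY KEY);
        CREATE TABLE accounts (account_id TEXT PRIMARY KEY);
        CREATE TABLE transactions (
            id INTEGER PRIMARY KEY,
            source_system TEXT,
            source_ref TEXT,
            account_id TEXT,
            currency TEXT,
            amount REAL
        );
        INSERT INTO currencies VALUES ('USD'), ('EUR');
        INSERT INTO accounts VALUES ('A1'), ('A2');
        INSERT INTO transactions VALUES
            (1, 'gl', 'r1', 'A1', 'USD', 100.0),
            (2, 'gl', 'r2', 'A2', 'XXX', -100.0),  -- unknown currency
            (3, 'ap', 'r3', 'A9', 'EUR', 50.0),    -- unknown account
            (4, 'ap', 'r4', 'A1', 'EUR', NULL),    -- null amount
            (5, 'gl', 'r1', 'A1', 'USD', 100.0);   -- same source ref as row 1
    """)
    return conn


if __name__ == "__main__":
    failures = run_checkpoints(build_demo_db())
    if failures:
        # Stop here: running the balancing routine now would waste time.
        for name, ids in failures.items():
            print(f"{name}: {len(ids)} suspect record(s), ids {ids}")
    else:
        print("No audit findings; proceed to balancing.")
```

The design choice is simply to keep each audit as a named query that returns offending rows, so an empty result means "nothing found" and anything else is a reason to pause before balancing.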
While the absence of results on these audits guarantees
nothing, any results tell both the client and me not to proceed until they are resolved. Consider
starting a round of golf only to realize on the third hole that you forgot
extra balls, your pitching wedge, and drinking water. You’re probably not going
to have a great round.
Sure, agile methods are valuable. However, one of the
chief limitations of iterative development is that you may well be building
something incorrectly or sub-optimally. While checkpoints offer no guarantee,
at least they can stop the bleeding before you waste a great deal of time
analyzing problems that don’t exist. Use them liberally; should they produce no
errors, you can always ignore them, armed with increased confidence that you’re
on the right track.