Thursday, October 28, 2010

Test Driven Development - Building a Magic Funnel

An automated data integration test suite is like a funnel that scoops up abstract, unstructured and downright weird knowledge from every corner of your problem domain, condenses it into a concise form, and pipes it out to consumers (i.e. development teams) who turn it into a working system.

It fulfils three important roles that are otherwise missing.

* To define a complete and unambiguous specification held in a form that is directly useful to technical people (n.b. a written spec is not directly useful because it is open to interpretation and assumes tacit knowledge about the problem)
* To create a single, well-understood vocabulary for discussing requirements and issues during development
* To give ourselves a fast, simple way to prove functional correctness

So how do you build the funnel?

Let's assume you have a basic technical framework in place that allows you to store test cases and run them through the system under test.  The question we need to answer is how to integrate it into your project lifecycle in a cost-effective manner.  The only way to do that is to use Test Driven Development (TDD).  Put very simply: write a failing test that describes a change before you write the code to implement that change.
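
As a concrete illustration, here is roughly what the first failing test might look like in Python, assuming a hypothetical run_mapping() entry point into the system under test (the name and the row format are illustrative, not from any particular tool):

    # Written BEFORE the transformation exists; run_mapping() is a
    # hypothetical entry point that pushes one source record through
    # the system under test and returns the target record.
    def test_destination_is_converted_to_sentence_case():
        source_row = {"flight_number": "BA123", "destination": "EDINBURGH"}
        target_row = run_mapping(source_row)
        # Fails (red) until the Sentence Case rule is implemented.
        assert target_row["dest_name"] == "Edinburgh"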

There is plenty of information out there on how to do TDD in Object Oriented systems.  I have worked through a lot of it and built systems of my own to learn the techniques (and do other useful things as well).  I found that Object Oriented languages allow much more modular code than ETL systems and give you much more control over the low-level details, so many of the techniques do not map perfectly into the Data Integration world.  It's certainly impossible, or at least highly impractical, to start from a single test case and build up a system from there.  For a start, a large part of the system already exists in the form of source and target systems and interfaces.

However, I have found that the principles and benefits of TDD are relevant and achievable in Data Integration when you tweak the practices a little.  Furthermore, data integration projects raise challenges around data quality and opaque interfaces to external systems that are not a concern in most OO systems.  I have found that building the funnel using TDD provides an elegant and proactive solution to these problems that integrates well with normal project activities and does not mandate an explicit 'phase', 'activity' or 'gate'.

Here's how to do it.

Before you write any code...

1. Create an automated test harness.
2. Work with the technical team, business users and analysts to define an initial set of test cases based on the current understanding of the business problem.  Make sure you can run all your test cases in one pass rather than pushing batches of test cases through the system under test (see the sketch after this list).
3. Roll technical test cases such as boundary conditions and known data quality issues into the test suite.
    - Do your initial data profiling here
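
To make the "one pass" idea in step 2 concrete, here is a minimal sketch of a single-pass runner; stage_inputs, run_system and fetch_result are placeholders for whatever your harness provides:

    # Hypothetical single-pass harness: stage ALL inputs, execute the
    # system under test once, then verify every expected result.
    def run_suite(cases, stage_inputs, run_system, fetch_result):
        stage_inputs([c["input"] for c in cases])  # write all source rows/files up front
        run_system()                               # one execution, not one run per case
        failures = []
        for c in cases:
            actual = fetch_result(c["key"])
            if actual != c["expected"]:
                failures.append((c["key"], c["expected"], actual))
        return failures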

At this point, you will find that you have driven out a lot of gaps in your understanding of the problem and the required solution simply by thinking it through at a detailed level.  Some of these will be misunderstandings between the business and technical people, while others will be areas the business users and analysts had not considered in their initial discussions.  The test framework has already been valuable and you don't even have any code yet!

You are now ready to start coding and you can move the test suite into "TDD" mode.

For every modification...

4. Check you have test cases for the feature.  If you don't, write them.
5. Run the suite to ensure the test fails (this is how you "test the test")
6. Write code to make the test pass - running the tests frequently to make sure you are on the right track
7. Once all the tests pass, refactor your code to make it production ready
8. Go to step (4) and start the next feature

You must expect issues to arise once you start development.  Some will be technical, others business-related, and the rest due to tests that were missed in the first pass.  It is very important to assign collective ownership of the test suite to the development team so everyone has the right and the ability to make changes as soon as they find a problem.  In practical terms, you should have a version control strategy to ensure changes are propagated around the team in a controlled fashion, but you should avoid assigning exclusive ownership or setting up any sort of approval mechanism that will create a bottleneck.

Periodically, you should run a smoke test with a full volume of production or production-like data through the system to drive out issues you have missed.  Similarly, if you have a lot of dependencies on other systems, you should build your test suite to stub out those dependencies but run regularly in a fully integrated environment to find new cases that should be added to your automated suite.

Footnote: Why you really can't create the tests after the code

Apart from the fact you have already lost the benefits of the funnel during development, it's actually impossible to create a complete test suite after the event.

Ask yourself a few questions...

* How can you validate the test suite tests what you think it does unless you have made the test fail?
* How can you be sure the test suite tests everything the code base does unless it grew alongside the code base?
* How can you inject test cases into a system that was never built to have test cases inserted into it?  Usually you will find a few tricky dependencies that you can't work around.

I'm not saying you should not try to build a test suite if you have to manage a codebase that does not have tests, but you should understand the limitations.  You should certainly not plan to build the tests after the code if you have the option.

Friday, September 17, 2010

Fundamentals of data testing: Setup and Teardown Patterns

A good automated test suite can set up a new environment, run the tests while retaining enough information to debug failures, and then tear down, or reset, the environment ready for the next time the test suite runs.  Data integration systems, by definition, deal with persistent data stores so the test suite needs to reflect that and deal with the inherent complications.

>> Edit: As Nick pointed out in the comments, it's useful to defer teardown and leave the test data intact after the test runs so you can debug any failures.  I usually work on the cycle of teardown -> setup -> test.

This article describes 3 patterns I have found useful for setup and teardown.

  • Truncate and Rebuild
  • Build from Copy
  • Test Data Set

They are not mutually exclusive and can be mixed and matched as needed so, for example, you might use a Truncate and Rebuild strategy in the filesystem and a Test Data Set approach in the database.

The Test Environment

The test environment consists of a mixture of code, persistent storage, applications, libraries and tools including, but not limited to...

  • The system under test and libraries on which it depends
  • The test framework
  • Source code repositories
  • Example data sets and expected results
  • Source, target and intermediate databases and tables
  • Source, target and intermediate files and directories
  • Business applications
  • ETL environments/tools
  • Data transport software - e.g. MQSeries and FTP
  • Scheduling software

The test suite must include enough of these elements to repeatedly verify the data-transformation logic and supporting infrastructure on demand, at the touch of a button.  You should keep it as simple as possible by including only the elements needed to perform the test.  For example, don’t bother running your tests within the scheduler unless the schedule is complex enough to justify the extra effort.

It is important to build a test suite that can run at any time without impacting other teams’ work.  Unit test suites must be under the direct control of the developer(s) working on the code so they can build test cases and perform the TDD cycle for every change they make to the code base.  “Higher” test environments might run on a scheduled basis (every hour, or after each commit, for example), but they should still be made as accessible as possible.

Real-world constraints

Unfortunately, it is not usually cost-effective to have dedicated access to all the elements needed (especially when first introducing these practices) so you will probably need to be a little creative about how you live alongside other users of the existing development and test systems at your disposal.

The following patterns can be used to create a test environment that coexists happily with other teams.

The “Truncate and Rebuild” Pattern

Overview:

A brute force approach where the test suite assumes it has free rein to delete existing artifacts.  It deletes everything it finds, builds everything it needs from scratch and deletes it all afterwards.

Mechanics:

Delete all existing artifacts and re-create them from source as you would when deploying to a new environment.

Advantages:

  • Very good at driving out deployment problems (missing files, bad checkins etc.)
  • The environment is guaranteed to be clean
  • You are unlikely to impact anyone else’s work
  • Simplifies the test framework because it does not need to worry about other users of a given resource

Disadvantages:

  • Often not cost effective - especially if applied to elements with expensive licences
  • Can mask real-world problems: real systems need to coexist with other applications, so a totally isolated test suite may miss defects caused by other systems’ activities
  • Must be used with care to avoid accidentally deleting someone else's work or artefacts that cannot be easily recovered
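
A minimal sketch of the pattern in Python, assuming a file-based working area, a SQLite-style database and a deploy_schema.sql build script (all illustrative):

    import os
    import shutil
    import sqlite3

    def truncate_and_rebuild(work_dir, db_path, ddl_script="deploy_schema.sql"):
        # Brute force: delete everything the suite owns...
        shutil.rmtree(work_dir, ignore_errors=True)
        os.makedirs(work_dir)
        if os.path.exists(db_path):
            os.remove(db_path)
        # ...then rebuild from source, exactly as a fresh deployment would.
        conn = sqlite3.connect(db_path)
        with open(ddl_script) as f:
            conn.executescript(f.read())
        conn.close()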

The “Build from Copy” Pattern

Overview:

A close cousin of Truncate and Rebuild.  Instead of building an element from scratch, it is created by copying it from a master location.

Mechanics:

Delete all existing artifacts and re-create them by copying from a master location.

Advantages:

  • As per Truncate and Rebuild
  • Easy to visualise and manage because the master copy of the artifact is available

Disadvantages:

  • As per Truncate and Rebuild

The “Test Data Set” Pattern

Overview:

The test framework is built to work alongside other users’ data by managing a data set that is easily distinguishable from, and does not overlap with, any other data in the system.  Thus it can be inserted, updated and deleted on demand without any undesirable side-effects.

Mechanics:

Records are ideally identified by primary key, or by some compound combination of secondary keys.  If this is not possible, a value in a comment field can be used - but this is less than optimal because it means you need to change the system under test to allow the test framework to inject its own comment value(s).

Keys must be chosen very carefully to ensure you can always identify them.  A useful technique is to pick values that fall way outside the normal range.  For instance, if a product code is an 8-digit integer in production, you can use a 4-digit product code for testing.
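
Teardown then reduces to deleting everything in the reserved range.  A sketch using the 4-digit convention above (the table and column names are invented for illustration):

    # Production product codes are 8 digits, so anything below 10000
    # must be test data and is safe to delete on teardown.
    TEST_KEY_CEILING = 10000

    def teardown_test_rows(conn):
        conn.execute(
            "DELETE FROM sales_fact WHERE product_code < ?",
            (TEST_KEY_CEILING,),
        )
        conn.commit()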

Advantages:

  • Cost effective because it uses existing resources.
  • Aids data discovery because the team need to think hard about the data to build the framework

Disadvantages:

  • Adds complexity to the test framework
  • Risk that teardown may miss some data and pollute the next test run

A Worked Example

Consider an ETL process that reads a csv file pushed from a source system using FTP into a location on the ETL server and loads it into a data warehouse star schema.   The warehouse is an established system that has been live for a number of years.  It runs on a relational database and the new import job is being developed as part of the normal release cycle alongside a handful of other, distinct, projects that share the same physical production, development and test environments.

Our new import populates a pre-existing fact table and adds 2 new dimensions.  We want to build a test framework for the import process that continues to be useful after the current project has finished so we must assume other users are going to hit the same tables at some point, even if it’s not an immediate concern.

[Figure: DataWarehouseExample - the source file, ETL process and warehouse star schema used in this example]

Starting with the source data, the test framework holds a set of input cases and corresponding result(s) for each case (how we store and manage them is beyond the scope of this article).  Let’s assume our ETL environment provides a mechanism for a tester to have a dedicated area of a filesystem to store test cases.  We can use Build from Copy to recreate source_data.csv by taking a copy of the test framework’s input cases each time the test suite runs.

Next, the ETL process under test is stored under version control.  Assuming we have a dedicated area where we can create working copies, we can simply delete all the code and re-create it from the version control system each time the test runs.  This is an example of Truncate and Rebuild.

Notice there is a subtle difference between Truncate and Rebuild and Build from Copy.  In the first case, we build the test environment programmatically, just as we would in a real deployment; with Build from Copy, the build process is synthetic and exists only in the test framework.
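
In code, the two setups from this example might look something like the sketch below; the paths, repository URL and version control tool are illustrative:

    import shutil
    import subprocess

    def setup_source_file():
        # Build from Copy: start each run from a pristine copy of the
        # framework's master input cases.
        shutil.copy("test_cases/master/source_data.csv", "landing/source_data.csv")

    def setup_etl_code():
        # Truncate and Rebuild: wipe the working copy and re-create it from
        # version control, exactly as a real deployment would.
        shutil.rmtree("work/etl", ignore_errors=True)
        subprocess.run(
            ["svn", "checkout", "http://repo.example.com/etl", "work/etl"],
            check=True,
        )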

Finally, the database that holds the warehouse is probably the most complicated part of the system because it is shared by all the project teams in the data warehousing group.  We need to solve 2 problems: how to manage the test data, and how to test the DDL that adds columns to the fact table and creates the new dimensions.

First, the test data.  We use the Test Data Set pattern to structure our input data set in such a way that it can be identified easily once it has been loaded to the data warehouse.  Thus we can also delete it easily when we tear down the test environment.

The DDL is a bit more tricky.  The dimensions can simply be deleted and re-created using Truncate and Rebuild.  We can also use Truncate and Rebuild on the fact table, but the teardown script needs to be sophisticated enough to drop only the new columns.  In both cases, the DDL tests should be optional because future users of the test framework will be testing additions to your business logic in a future release, so they can reasonably assume the star schema already exists.
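
A sketch of that asymmetric teardown (the table and column names are invented; substitute your own DDL):

    # The new dimensions belong to this project, so they can be dropped
    # outright; the fact table is shared, so only our new columns go.
    DDL_TEARDOWN = [
        "ALTER TABLE flight_fact DROP COLUMN aircraft_key",
        "ALTER TABLE flight_fact DROP COLUMN route_key",
        "DROP TABLE IF EXISTS aircraft_dim",
        "DROP TABLE IF EXISTS route_dim",
    ]

    def teardown_ddl(conn):
        for statement in DDL_TEARDOWN:
            conn.execute(statement)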

 

Monday, August 23, 2010

Fast Database Resets

The GOOS (Growing Object-Oriented Software, Guided by Tests) mailing list today included a discussion on a method of quickly resetting a database using stored procs.

It's Java-focussed, but the ideas are useful on all projects doing automated database testing.
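
As I understand it, the gist is to keep a stored procedure in the test database that empties every table the tests touch, so a reset becomes a single call rather than many round trips.  A hypothetical PostgreSQL-flavoured sketch (table names invented):

    # One-off setup: a stored procedure that empties every test table.
    CREATE_RESET_PROC = """
    CREATE OR REPLACE FUNCTION reset_test_data() RETURNS void AS $$
    BEGIN
        TRUNCATE TABLE staging_orders, fact_sales, dim_product;
    END;
    $$ LANGUAGE plpgsql;
    """

    def reset_database(conn):
        # A single server-side call replaces many client round trips.
        conn.execute("SELECT reset_test_data()")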

Friday, August 13, 2010

New name for the blog

Gradually, I have started thinking about what I do as Data Integration rather than Data Management so I have changed the title of my blog accordingly.

Fundamentals of data testing: Characteristics of a test suite

An automated test suite should be a living thing that is easy to update and easy to run. The test suite only adds value when it is run to find errors so it should be built to run many times in a day. This is just normal Agile thinking.

People who are not used to Agile sometimes think about creating an automated test suite that is open to a privileged few and used only at the end of the project, when testing is traditionally performed. At best, this leads to a situation where the suite cannot deliver maximum return on investment because it has limited chances to catch defects. At worst, the test suite will simply become an overhead that quickly falls out of use.

The most successful automated test solutions on data integration projects I have been involved with have been those to which business users, analysts, developers, testers and other relevant parties all provided input throughout the project. This means unexpected problems late in the project are easy to solve because everyone trusts that the test suite covers all possibilities. This reason alone is adequate return on the investment needed to set up the test suite.

The tests are run by the developers after every significant change and it’s not unusual for the test suite to be run 10 to 20 times in a day (as an aside, unit test suites in object development are run much more frequently, but data integration test suites are limited in speed because they deal with files and databases and it’s often hard to isolate and run the lowest-level code the developer writes).

Here, at a high level, is what works for me.

You need 3 things:

  • An environment you can teardown and rebuild repeatedly many times a day
  • A way of storing input data sets and expected results
  • A way of running the input data sets through the system and verifying the output matches the expected results

I usually just build this in the development environment where we have most control over what we can set up and delete and who is able to use it.
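
For the verification step, here is a sketch of the kind of check I mean: reduce what actually landed in the target and what the expected results say should have landed to comparable sets, then report the differences:

    def verify(actual_rows, expected_rows):
        # Reduce both sides to sets of tuples so row order is ignored.
        actual = set(map(tuple, actual_rows))
        expected = set(map(tuple, expected_rows))
        missing = expected - actual      # expected but never arrived in the target
        unexpected = actual - expected   # arrived in the target but was never expected
        return missing, unexpected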

The nature of the test suite I suggest is difficult to categorise. It’s like a unit test because it’s designed to be run by the developers and uses a small volume of data. But it’s also like a system and/or user acceptance test because it often runs a sequence of related jobs and should include all the cases we expect to find in the production environment. I suppose “unit test” is the most useful description for practical purposes.

One thing it definitely is not is an integration suite or a basis for application testing. A common mistake is to try to test the flow of the data through a number of systems. It can work under specific conditions, but usually it operates at just too high a level: your tests can’t run quickly, you are at the mercy of changes in the internals, workflows and input/output interfaces of the applications, and you are distracted from the real goal of building good data transformation code.

There is an obvious dilemma, though; what happens if a change to an application has an impact on the data transformations?

Well, it’s OK to run the application interfaces every so often. It’s just that you should try to avoid automating the testing through them. In fact, I strongly recommend you make sure you run the whole application end to end regularly so you can get real feedback on how the system is behaving. It can be manual and doesn’t need to be heavyweight - just enough to spot problems quickly and get enough detail to update the automated tests.

This is actually a similar problem to making sure your tests cover edge cases in the data, but I’ll cover that in a future post.

Fundamentals of data testing: Deterministic vs Heuristic

The style of test automation I find most useful when testing data integration systems is largely based on the “deterministic” testing style popularised by Agile methods. That means that for every value input, we need to be able to define an expected output from the system under test. This is appropriate for testing most situations that arise in data integration solutions, but there are occasional cases where we need to borrow techniques from systems based on heuristics.


For example, consider the following mapping rules:



Source Column      Rule                                              Target Column
(none)             System date formatted as “YYYYMMDD HH24:MI:SS”    creation_date
flight_number      ==> (copied unchanged)                            flight_number
destination        Convert to Sentence Case                          dest_name



The rules for populating flight_number and destination are deterministic. Flight number is straightforward because the value from the source is simply copied to the target. The rules for destination are slightly more involved, but we can easily construct test cases that define an expected result based on a given input - EDINBURGH -> Edinburgh, gLasgow -> Glasgow... and so on.
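
For instance, the destination rule reduces to a table of input/expected pairs the suite can assert directly (sentence_case() here is a stand-in for the transformation under test):

    def sentence_case(value):
        # Stand-in for the real transformation, for illustration only.
        return value[:1].upper() + value[1:].lower()

    # Deterministic cases: every input has exactly one correct output.
    SENTENCE_CASE_CASES = [
        ("EDINBURGH", "Edinburgh"),
        ("gLasgow", "Glasgow"),
        ("london", "London"),
    ]

    def test_sentence_case():
        for source, expected in SENTENCE_CASE_CASES:
            assert sentence_case(source) == expected, (source, expected)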


The "creation_date" rule is quite different. Timestamps taken from the system clock are non-deterministic because it’s impossible to define a value that will definitely match the system time at the moment the job is run. In some cases, you can get round this by injecting a known value into the system under test using a parameter (or some similar method) but it’s not always possible so we need to take a leaf out of heuristic systems thinking.


Heuristic systems use a combination of rules and data gathered at runtime to make decisions. A scheduling system used by an airline plots a schedule each day that takes into account the many factors affecting airline scheduling - plane location, crew availability, turnaround time, weather conditions, volcanic activity in Iceland, and so forth. There is no “right” answer to the question of how to arrange the planes, so it’s hard to test deterministically. Instead, the development team need to ensure that the correct decision-making flow is executed when deciding the schedule rather than trying to predict an outcome. Emily and Geoff Bache work on such a system and gave a presentation at Agile2008 explaining how their test suite works.


Back to our system datetime. We can borrow from heuristic concepts by plugging in a set of rules that allow us to validate the date generated. A reasonable heuristic rule for the system date might be to check the date is correctly formatted and within a few minutes of the moment the test suite was run.
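
A sketch of such a rule in Python, mirroring the “YYYYMMDD HH24:MI:SS” format from the mapping above (the five-minute tolerance is an arbitrary choice):

    from datetime import datetime

    def creation_date_is_plausible(value, tolerance_seconds=300):
        # Heuristic check: correct format and close to 'now', rather than
        # an exact expected value.
        try:
            stamp = datetime.strptime(value, "%Y%m%d %H:%M:%S")
        except ValueError:
            return False  # wrong format
        return abs((datetime.now() - stamp).total_seconds()) <= tolerance_seconds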


Take care though. Heuristic tests make your test suite more complicated because you need to capture and process more data in a heuristic-style test than in a deterministic test. Sometimes it may just not be worth the effort. Furthermore, it’s easy to write heuristic tests that are just as complex as the system being tested. Let’s face it, the risk of getting the system date call wrong is pretty remote, so maybe it’s easier to ignore those fields and focus on building watertight tests for the rules you can test deterministically. That’s usually where the real business benefits are found anyway.


Above all, you should always try to find a deterministic solution and you should expect to do so at least 95% of the time.

Agile Data Testing

Those of you who avidly follow my blog :-) will have noticed that there have been one or two quiet spells lately. I would love to say I have been living as a hermit on seaweed and shellfish scraped from the rocks while building the perfect suite of tools for Agile Data Integration but, really, life just got in the way of writing for the past couple of years.

However, I have not been idle. I've been working hard to define strategies for creating automated test solutions for ETL/Data Integration projects for the past 4 years and I have a lot of knowledge I want to share. Some of this has been captured in tools, but not as much as I would like so I am also looking for people to help me develop the tool capability in a warm, fuzzy, open-sourcy kind of way.

I'm about to publish some articles explaining what I have learned and where I am going. I know time is valuable, but I really, really want your feedback so we can start to develop more of a community around this topic. I'm going to start with the theory and get into the tooling once I have laid down my thoughts on the core concepts of what we are trying to achieve.

So please, please, please read the articles, be forthright in your comments and encourage me to write more.

My thanks.

Adrian