Adrian Mowat on Agile Data Integration: August 2010

Monday, August 23, 2010

Fast Database Resets

The GOOS (Growing Object Oriented Systems guided by Tests) mailing list today included a discussion on a method of quickly resetting a database using stored procs.

It's Java focussed, but the ideas are useful on all projects doing automated database testing.

Friday, August 13, 2010

New name for the blog

Gradually, I have started thinking about what I do as Data Integration rather than Data Management so I have changed the title of my blog accordingly.

Fundamentals of data testing: Characteristics of a test suite

An automated test suite should be a living thing that is easy to update and easy to run. The test suite only adds value when it is run to find errors so it should be built to run many times in a day. This is just normal Agile thinking.

People who are not used to Agile sometimes think about creating an automated test suite that is open to a privileged few and used only at the end of the project when testing is traditionally performed. At best, this leads to a situation where the suite cannot not deliver maximum return on investment because it has limited chances to catch defects. At worst, the test suite will simply become an overhead that quickly falls out of use.

The most successful automated test solutions on data integration projects I have been involved with have been those that business users, analysts, developers, testers and other relevant parties all provide input to throughout the project. This means unexpected problems late in the project are easy to solve because everyone trusts the test suite covers all possibilities. Just this reason alone is adequate return on the investment needed to setup the test suite.

The tests are run by the developers after every significant change and it’s not unusual for the test suite to be run 10 to 20 times in a day (as an aside unit test suites in object development are run much more frequently, but data integration test suites are limited in speed because they deal with files and databases and it’s often hard to isolate and run the lowest level code the developer writes).

Here, at a high level, is what works for me.

You need 3 things:

An environment you can teardown and rebuild repeatedly many times a day
A way of storing input data sets and expected results
A way of running the input data sets through the system and verifying the output matches the expected results

I usually just build this in the development environment where we have most control over what we can setup and delete and who is able to use it.

The nature of the test suite I suggest is difficult to categorise. It’s like a unit test because it’s designed to be run by the developers and uses a small volume of data. But it’s also like a system and/or user acceptance test because it often runs a sequence of related jobs and should include all the cases we expect to find in the production environment. I suppose “unit test” is the most useful description for practical purposes.

One thing it definitely is not is an integration suite or a basis for application testing. A common mistake is to try and test the flow of the data through a number of systems. It can work under specific conditions, but usually it’s working at just too high a level for your tests to run quickly and you are at the mercy of changes in the internals, workflows and input/output interfaces of the applications and distracts you from the real goal of building good data transformation code.

There is an obvious dilemma, though; what happens if a change to an application has an impact on the data transformations?

Well, it’s OK to run the application interfaces every so often. It’s just that you should try to avoid automating the testing through them. In fact, I strongly recommend you make sure you run the whole application end to end regularly so you can get real feedback on how the system in behaving. It can be manual and doesn’t need to be heavyweight - just enough to spot problems quickly and get enough detail to update the automated tests.

This is actually a similar problem to making sure your tests cover edge cases in the data, but I’ll cover that in a future post.

Fundamentals of data testing: Deterministic vs Heuristic

The style of test automation I find most useful when testing data integration systems is largely based on the “deterministic” testing style popularised by Agile methods. That means that for every value input, we need to be able to define an expected output from the system under test. This is appropriate for testing most situations that arise in data integration solutions but we find that there are occasional cases we will need to borrow from techniques in systems based on heuristics.

For example, consider the following mapping rules:

Source Column	Rule	Target Column
	System date formatted as “YYYYMMDD HH24:MI:SS”	creation_date
flight_number	==>	flight_number
destination	Convert to Sentence Case	dest_name

The rules for populating flight_number and destination are deterministic. Flight number is straightforward because the value from the source is simply copied to the target. The rules for destination are slightly more involved but we easily construct test cases that define an expected result based on a given input - EDINBURGH -> Edinburgh, gLasgow -> Glasgow... and so on.

The "creation_date" rule is quite different. Timestamps taken from the system clock are non-deterministic because it’s impossible to define a value that will definitely match the system time at the moment the job is run. In some cases, you can get round this by injecting a known value into the system under test using a parameter (or some similar method) but it’s not always possible so we need to take a leaf out of heuristic systems thinking.

Heuristic systems use a combination of rules and data gathered at runtime to make decisions. A scheduling system used by an airline plots a schedule for the airline each day that takes into account the many factors that affect airline scheduling - plane location, crew availability, turnaround time, weather conditions, volcanic activity in iceland, and so forth. There is no “right” answer to the question of how to arrange the planes so it’s hard to test deterministically. Instead, the development team need to ensure that the correct decision making flow is executed when deciding the schedule rather than trying to predict an outcome. Emily and Geoff Bache work on such as system and gave a presentation at Agile2008 explaining how their test suite works.

Back to our system datetime. We can borrow from heuristic concepts by plugging in a set of rules that allow us to validate the date generated. A reasonable heuristic rule for the system date might be to check the date is correctly formatted and within a few minutes of the moment the test suite was run.

Take care though. Heuristic tests make your test suite more complicated because you need capture and process more data in a heuristic style test than a deterministic test. Sometimes it may just not be worth the effort. Furthermore, it’s easy to write heuristic tests that are just as complex than the system being tested. Let’s face it, the risk of getting the system date call wrong is pretty remote so maybe it’s just easier to ignore those fields and focus building watertight tests for rules you can test deterministically. That’s usually where the real business benefits are found anyway.

Above all, you should always try to find a deterministic solution and you should expect to do so at least 95% of the time.

Agile Data Testing

Those of you who avidly follow my blog :-) will have noticed that there have been one or two quiet spells lately. I would love to say I have been living as a hermit living on seaweed and shellfish scraped from the rocks while building the perfect suite of tools for Agile Data Integration but, really, life just got in the way of writing the past couple of years.

However, I have not been idle. I've been working hard to define strategies for creating automated test solutions for ETL/Data Integration projects for the past 4 years and I have a lot of knowledge I want to share. Some of this has been captured in tools, but not as much as I would like so I am also looking for people to help me develop the tool capability in a warm, fuzzy, open-sourcy kind of way.

I'm about to publish some articles explaining what I have learned and where I am going. I know time is valuable, but I really, really want your feedback so we can start to develop more of a community around this topic. I'm going to start with the theory and get into the tooling once I have laid down my thoughts on the core concepts of what we are trying to achieve.

So please, please, please read the articles, be forthright in your comments and encourage me to write more.

My thanks.

Adrian