Adrian Mowat on Agile Data Integration: September 2010

A good automated test suite can setup a new environment, run the tests while retaining enough information to debug failures, and then teardown, or reset, the environment ready for next time the test suite runs. Data integration systems, by definition, deal with persistent data stores so the test suite needs to reflect that and deal with the inherent complications.

>> Edit: As Nick pointed out in the comments, it's useful to defer teardown and leave the test data intact after the test runs so you can debug any failures. I usually work on the cycle of teardown -> setup -> test

This article describes 3 patterns I have found useful for setup and teardown.

Truncate and Rebuild
Build from Copy
Test Data Set

They are not mutually exclusive and can be mixed and matched as needed so, for example, you might use a truncate and rebuild strategy in the filesystem and a test keys approach in the database.

The Test Environment

The test environment consists of mixture of code, persistent storage, applications and libraries and tools including, but not limited to...

The system under test and libraries on which it depends
The test framework
Source code repositories
Example data sets and expected results
Source, target and intermediate databases and tables
Source, target and intermediate files and directories
Business applications
ETL environments/tools
Data transport software - e.g. MQSeries and FTP
Scheduling software

The test suite must include enough of these elements to repeatedly verify the data-transformation logic and supporting infrastructure on-demand, at the touch of a button. You should keep it as simple as possible by only including only the elements needed to perform the test. For example, don’t bother running your tests within the scheduler unless the schedule is complex enough to justify the extra effort.

It is important to build a test suite that can run at any time without impacting other teams’ work. Unit test suites must be under the direct control of the developer(s) working on the code so they can build test cases and perform the TDD cycle for every change they make to the code base. “Higher” test environments might run on a scheduled basis (every hour, or after each commit, for example), but they still should be made as accessible as possible.

Real-world constraints

Unfortunately, it is not usually cost-effective to have dedicated access to all the elements needed (especially when first introducing these practices) so you will probably need to be a little creative about how you live alongside other users of the existing development and test systems at your disposal.

The following patterns can be used to create a test environment that coexists happily with other teams.

The “Truncate and Rebuild” Pattern

Overview:

A brute force approach where the test suite assumes it has free reign to delete existing artifacts. It deletes everything it finds, builds everything it needs from scratch and deletes it all afterwards.

Mechanics:

Delete all existing artifacts and re-create them from source as you would when deploying to a new environment.

Advantages:

Very good at driving out deployment problems (missing files, bad checkins etc)

The environment is guaranteed to be clean
You are unlikely to impact anyone else’s work
Simplifies the test framework because it does not need to worry about other users of a given resource.

Disadvantages:

Often not cost effective - especially if applied to elements with expensive licences
Can mask real world problems because real world systems need to coexist with other applications so a totally isolated test suite may miss defects caused by other system’s activities
Must be used with care to avoid accidentally deleting someone else's work or artefacts that cannot be easily recovered

The “Build from Copy” Pattern

Overview:

A close cousin of Truncate and Rebuild. Instead of building an element from scratch, it is created by copying from a master-location.

Mechanics:

Delete all existing artifacts and re-create them by copying from a master location.

Advantages:

As per Truncate and Rebuild
Easy to visualise and manage because the master copy of the artifact is available

Disadvantages:

As per Truncate and Rebuild

The “Test Data Set” Pattern

Overview:

The test framework is built to work alongside other users’ data by managing a data set that is easily distinguishable from, and does not overlap with, any other data in the system. Thus it can be inserted, updated and deleted on demand without any undesirable side-effects.

Mechanics:

Records are ideally identified by primary key, or some compound combination of secondary keys. If this is not possible, a value in a comment field can be used - but this is less than optimal because it means you need to change the system under test to allow the test framework to inject it’s own comment value(s).

Keys must be chosen very carefully to ensure you can always identify them. A useful technique is to pick values that fall way outside the normal range. For instance, if a product code is an 8 digit integer in production, you can use 4 digit product code for testing.

Advantages:

Cost effective because it uses existing resources.
Aids data discovery because the team need to think hard about the data to build the framework

Disadvantages:

Adds complexity to the test framework
Risk that teardown may miss some data and pollute the next test run

A Worked Example

Consider an ETL process that reads a csv file pushed from a source system using FTP into a location on the ETL server and loads it into a data warehouse star schema. The warehouse is an established system that has been live for a number of years. It runs on a relational database and the new import job is being developed as part of the normal release cycle alongside a handful of other, distinct, projects that share the same physical production, development and test environments.

Our new import populates a pre-existing fact table and adds 2 new dimensions. We want to build a test framework for the import process that continues to be useful after the current project has finished so we must assume other users are going to hit the same tables at some point, even if it’s not an immediate concern.

Starting with the source data, the test framework holds a set of input cases and corresponding result(s) for each case (how we store and manage them is beyond the scope of this article). Let’s assume out ETL environment provides a mechanism for a tester to have a dedicated area of a filesystem to store test cases. We can use Build from Copy to recreate source_data.csv by taking a copy of the test framework’s input cases each time the test suite runs.

Next, the ETL process under test is stored under version control. Assuming we have a dedicated area where we can create working copies, we can simply delete all the code and re-create it from the version control system each time the test runs. This is an example of Truncate and Rebuild.

Notice there is a subtle difference between Truncate and Rebuild and Build from Copy. In the first case, we build the test environment programmatically as we would in a real deployment, while when employing Build from Copy the build process is synthetic and only found in the test framework.

Finally, the database that holds the warehouse is probably the most complicated part of the system because it is shared by all the project teams in the data warehousing group. We need to solve 2 problems: and how to manage the test data, and how to test the DDL that adds columns to the Fact Table and creates the new Dimensions.

First, the test data. We use the Test Data Set pattern to structure our input data set in such a way that it can be identified easily once it has been loaded to the data warehouse. Thus we can can also delete it easily when we teardown the test environment.

The DDL is a bit more tricky. The dimensions can be simply deleted and re-created using Truncate and Rebuild. We can also use Truncate and Rebuild on the the fact table but the teardown script needs to be sophisticated enough to drop only the new columns. In both cases, the DDL tests should be optional because future users of the test framework will be testing additions to your business logic in a future release so they can reasonably assume the star schema already exists.

Adrian Mowat on Agile Data Integration

Friday, September 17, 2010

Fundamentals of data testing: Setup and Teardown Patterns

About Me