Adrian Mowat on Agile Data Integration
Thoughts on data integration projects and Agile by Adrian Mowat
Thursday, October 28, 2010
Test Driven Development - Building a Magic Funnel
It fulfils three important roles that are otherwise missing.
* To define a complete and unambiguous specification held in a form that is directly useful to technical people (n.b. a spec is not directly useful because it is open to interpretation and assumes tacit knowledge about the problem)
* To create single, well understood, vocabulary for discussing requirements and issues in development
* To give ourselves fast, simple way to prove functional correctness
So how do you build the funnel?
Let's assume you a basic technical framework in place that allows you to store test cases and run them through the system under test. The question we need to answer is how to integrate it into your project lifecycle in a cost effective manner. The only way to do that is to use Test Driven Development (TDD). Put very simply, write a failing test that describes a change before you write the code to implement that change.
There is plenty of information out there on how to do TDD in Object Oriented systems. I have worked through a lot of it and built systems of my own to learn the techniques (and do other useful things as well). I found that Object Oriented languages allow much more modular code than ETL systems and give you much more control over the low-level details. Many of the techniques do not map perfectly into the Data Integration world. It's certainly impossible, or at least highly impractical, to start from a single test case and build up a system from there. For a start, a large part of the system already exists in the form of source and target systems and interfaces.
However, I have found that the principles and benefits of doing TDD are relevant and achievable to Data Integration when you tweak the practices a little. Furthermore, data integration projects raise challenges around data quality and opaque interfaces to external systems that are not a concern on most OO systems. I have found that building the funnel using TDD provides an elegant and proactive solution to these problems that integrates well with normal project activities and does not mandate an explicit 'phase','activity' or 'gate'.
Here's how to do it.
Before you write any code...
1. Create an automated test harness.
2. Work with technical team, business users and analysts to define an initial set of test cases based on the current understanding of the business problem. Make sure you can run all your test cases in one pass and you don't need to run batches of test cases through the system under test.
3. Roll technical test cases such as boundary conditions and known data quality issues into the test suite.
- Do your initial data profiling here
At this point, you will find that you have driven out a lot of problems in your understanding of the problem and the solution required by simply thinking through the problem at a detailed level. Some of these will be misunderstandings between the business and technical people, while others will be areas the business users and analysts had not considered in their initial discussions. The test framework has already been valuable and you don't even have any code yet!
You are now ready to start coding and you can move the test suite into "TDD" mode.
For every modification...
4. Check you have test cases for the feature. If you don't, write them.
5. Run the suite to ensure the test fails (this is how you "test the test")
6. Write code to make the test pass - running the tests frequently to make sure you are on the right track
7. Once all the tests pass, refactor your code to make it production ready
8. Go to step (4) and start the next feature
You must expect issues to arise once you start development. Some will be technical, others business-related and the rest due to tests that were missed in the first pass. It is very important to assign collective ownership of the test suite to the development team so everyone has the right and ability to make changes as soon as they find a problem. In practical terms, you should have a version control strategy to ensure changes are propagated around the team in a controlled fashion, but you should avoid assigning ownership or setting up any sort of approval mechanism that will create a bottleneck.
Periodically, you should run a smoke test with a full volume of production or production-like data through the system to drive out issues you have missed. Similarly, if you have a lot of dependencies on other systems you should build your test suite to stub out those dependencies but run regularly in a fully integrated environment to find new cases that should be added to your automated suite.
Footnote: Why you really can't create the tests after the code
Apart from the fact you have already lost the benefits of the funnel during development, it's actually impossible to create a complete test suite after the event.
Ask yourself a few of questions...
* How can you validate the test suite tests what you think it does unless you have made the test fail?
* How can you be sure the test suite tests everything the code base does unless it grew alongside the code base?
* How can you inject test cases into a system that was never built to have test cases inserted into it? Usually you will find a few tricky dependencies that you can't work around.
I'm not saying you should not try to build a test suite if you have to manage a codebase that does not have tests, but you should understand the limitations. You should certainly not plan to build the tests after the code if you have the option.
Friday, September 17, 2010
Fundamentals of data testing: Setup and Teardown Patterns
A good automated test suite can setup a new environment, run the tests while retaining enough information to debug failures, and then teardown, or reset, the environment ready for next time the test suite runs. Data integration systems, by definition, deal with persistent data stores so the test suite needs to reflect that and deal with the inherent complications.
>> Edit: As Nick pointed out in the comments, it's useful to defer teardown and leave the test data intact after the test runs so you can debug any failures. I usually work on the cycle of teardown -> setup -> test
This article describes 3 patterns I have found useful for setup and teardown.
- Truncate and Rebuild
- Build from Copy
- Test Data Set
They are not mutually exclusive and can be mixed and matched as needed so, for example, you might use a truncate and rebuild strategy in the filesystem and a test keys approach in the database.
The Test Environment
The test environment consists of mixture of code, persistent storage, applications and libraries and tools including, but not limited to...
- The system under test and libraries on which it depends
- The test framework
- Source code repositories
- Example data sets and expected results
- Source, target and intermediate databases and tables
- Source, target and intermediate files and directories
- Business applications
- ETL environments/tools
- Data transport software - e.g. MQSeries and FTP
- Scheduling software
The test suite must include enough of these elements to repeatedly verify the data-transformation logic and supporting infrastructure on-demand, at the touch of a button. You should keep it as simple as possible by only including only the elements needed to perform the test. For example, don’t bother running your tests within the scheduler unless the schedule is complex enough to justify the extra effort.
It is important to build a test suite that can run at any time without impacting other teams’ work. Unit test suites must be under the direct control of the developer(s) working on the code so they can build test cases and perform the TDD cycle for every change they make to the code base. “Higher” test environments might run on a scheduled basis (every hour, or after each commit, for example), but they still should be made as accessible as possible.
Real-world constraints
Unfortunately, it is not usually cost-effective to have dedicated access to all the elements needed (especially when first introducing these practices) so you will probably need to be a little creative about how you live alongside other users of the existing development and test systems at your disposal.
The following patterns can be used to create a test environment that coexists happily with other teams.
The “Truncate and Rebuild” Pattern
Overview:
A brute force approach where the test suite assumes it has free reign to delete existing artifacts. It deletes everything it finds, builds everything it needs from scratch and deletes it all afterwards.
Mechanics:
Delete all existing artifacts and re-create them from source as you would when deploying to a new environment.
Advantages:
Very good at driving out deployment problems (missing files, bad checkins etc)
- The environment is guaranteed to be clean
- You are unlikely to impact anyone else’s work
- Simplifies the test framework because it does not need to worry about other users of a given resource.
Disadvantages:
- Often not cost effective - especially if applied to elements with expensive licences
- Can mask real world problems because real world systems need to coexist with other applications so a totally isolated test suite may miss defects caused by other system’s activities
- Must be used with care to avoid accidentally deleting someone else's work or artefacts that cannot be easily recovered
The “Build from Copy” Pattern
Overview:
A close cousin of Truncate and Rebuild. Instead of building an element from scratch, it is created by copying from a master-location.
Mechanics:
Delete all existing artifacts and re-create them by copying from a master location.
Advantages:
- As per Truncate and Rebuild
- Easy to visualise and manage because the master copy of the artifact is available
Disadvantages:
- As per Truncate and Rebuild
The “Test Data Set” Pattern
Overview:
The test framework is built to work alongside other users’ data by managing a data set that is easily distinguishable from, and does not overlap with, any other data in the system. Thus it can be inserted, updated and deleted on demand without any undesirable side-effects.
Mechanics:
Records are ideally identified by primary key, or some compound combination of secondary keys. If this is not possible, a value in a comment field can be used - but this is less than optimal because it means you need to change the system under test to allow the test framework to inject it’s own comment value(s).
Keys must be chosen very carefully to ensure you can always identify them. A useful technique is to pick values that fall way outside the normal range. For instance, if a product code is an 8 digit integer in production, you can use 4 digit product code for testing.
Advantages:
- Cost effective because it uses existing resources.
- Aids data discovery because the team need to think hard about the data to build the framework
Disadvantages:
- Adds complexity to the test framework
- Risk that teardown may miss some data and pollute the next test run
A Worked Example
Consider an ETL process that reads a csv file pushed from a source system using FTP into a location on the ETL server and loads it into a data warehouse star schema. The warehouse is an established system that has been live for a number of years. It runs on a relational database and the new import job is being developed as part of the normal release cycle alongside a handful of other, distinct, projects that share the same physical production, development and test environments.
Our new import populates a pre-existing fact table and adds 2 new dimensions. We want to build a test framework for the import process that continues to be useful after the current project has finished so we must assume other users are going to hit the same tables at some point, even if it’s not an immediate concern.
Starting with the source data, the test framework holds a set of input cases and corresponding result(s) for each case (how we store and manage them is beyond the scope of this article). Let’s assume out ETL environment provides a mechanism for a tester to have a dedicated area of a filesystem to store test cases. We can use Build from Copy to recreate source_data.csv by taking a copy of the test framework’s input cases each time the test suite runs.
Next, the ETL process under test is stored under version control. Assuming we have a dedicated area where we can create working copies, we can simply delete all the code and re-create it from the version control system each time the test runs. This is an example of Truncate and Rebuild.
Notice there is a subtle difference between Truncate and Rebuild and Build from Copy. In the first case, we build the test environment programmatically as we would in a real deployment, while when employing Build from Copy the build process is synthetic and only found in the test framework.
Finally, the database that holds the warehouse is probably the most complicated part of the system because it is shared by all the project teams in the data warehousing group. We need to solve 2 problems: and how to manage the test data, and how to test the DDL that adds columns to the Fact Table and creates the new Dimensions.
First, the test data. We use the Test Data Set pattern to structure our input data set in such a way that it can be identified easily once it has been loaded to the data warehouse. Thus we can can also delete it easily when we teardown the test environment.
The DDL is a bit more tricky. The dimensions can be simply deleted and re-created using Truncate and Rebuild. We can also use Truncate and Rebuild on the the fact table but the teardown script needs to be sophisticated enough to drop only the new columns. In both cases, the DDL tests should be optional because future users of the test framework will be testing additions to your business logic in a future release so they can reasonably assume the star schema already exists.
Monday, August 23, 2010
Fast Database Resets
It's Java focussed, but the ideas are useful on all projects doing automated database testing.
Friday, August 13, 2010
New name for the blog
Fundamentals of data testing: Characteristics of a test suite
- An environment you can teardown and rebuild repeatedly many times a day
- A way of storing input data sets and expected results
- A way of running the input data sets through the system and verifying the output matches the expected results
Fundamentals of data testing: Deterministic vs Heuristic
The style of test automation I find most useful when testing data integration systems is largely based on the “deterministic” testing style popularised by Agile methods. That means that for every value input, we need to be able to define an expected output from the system under test. This is appropriate for testing most situations that arise in data integration solutions but we find that there are occasional cases we will need to borrow from techniques in systems based on heuristics.
For example, consider the following mapping rules:
Source Column | Rule | Target Column |
| System date formatted as “YYYYMMDD HH24:MI:SS” | creation_date |
flight_number | ==> | flight_number |
destination | Convert to Sentence Case | dest_name |
The rules for populating flight_number and destination are deterministic. Flight number is straightforward because the value from the source is simply copied to the target. The rules for destination are slightly more involved but we easily construct test cases that define an expected result based on a given input - EDINBURGH -> Edinburgh, gLasgow -> Glasgow... and so on.
The "creation_date" rule is quite different. Timestamps taken from the system clock are non-deterministic because it’s impossible to define a value that will definitely match the system time at the moment the job is run. In some cases, you can get round this by injecting a known value into the system under test using a parameter (or some similar method) but it’s not always possible so we need to take a leaf out of heuristic systems thinking.
Heuristic systems use a combination of rules and data gathered at runtime to make decisions. A scheduling system used by an airline plots a schedule for the airline each day that takes into account the many factors that affect airline scheduling - plane location, crew availability, turnaround time, weather conditions, volcanic activity in iceland, and so forth. There is no “right” answer to the question of how to arrange the planes so it’s hard to test deterministically. Instead, the development team need to ensure that the correct decision making flow is executed when deciding the schedule rather than trying to predict an outcome. Emily and Geoff Bache work on such as system and gave a presentation at Agile2008 explaining how their test suite works.
Back to our system datetime. We can borrow from heuristic concepts by plugging in a set of rules that allow us to validate the date generated. A reasonable heuristic rule for the system date might be to check the date is correctly formatted and within a few minutes of the moment the test suite was run.
Take care though. Heuristic tests make your test suite more complicated because you need capture and process more data in a heuristic style test than a deterministic test. Sometimes it may just not be worth the effort. Furthermore, it’s easy to write heuristic tests that are just as complex than the system being tested. Let’s face it, the risk of getting the system date call wrong is pretty remote so maybe it’s just easier to ignore those fields and focus building watertight tests for rules you can test deterministically. That’s usually where the real business benefits are found anyway.
Above all, you should always try to find a deterministic solution and you should expect to do so at least 95% of the time.