What Data Are You Using for Testing?
Creating representative data sets is difficult and time-consuming. Developers and testers therefore often ask for live data from your web application to work with, because it is complete, current and representative. Make sure this is controlled - live data should never be used in development or testing. Live data are also sometimes requested for "running reports offline" - this can be just as risky, and just as illegal.
For example, I've seen a developer, while testing an email broadcast module, send thousands of test messages to their client's real customers because they were working with an exact copy of the live data.
Live data may contain personally identifiable information, authentication credentials or other sensitive information such as contract values or intellectual property. Three methods that can be of help are:
- Extracting a subset - to limit the number of records
- Anonymising records - to remove personally identifiable information
- Masking - to hide sensitive information
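As a rough illustration of how the three methods fit together, here is a minimal Python sketch. The field names (`name`, `email`, `account_number`) and the helper functions are hypothetical, chosen only for the example; a real system would need its own list of sensitive fields.

```python
import random


def extract_subset(records, size, seed=0):
    """Return a reproducible random sample of at most `size` records."""
    rng = random.Random(seed)
    return rng.sample(records, min(size, len(records)))


def anonymise(record, index):
    """Return a copy with direct identifiers replaced by synthetic values."""
    cleaned = dict(record)
    cleaned["name"] = f"Customer {index}"
    # The .invalid top-level domain is reserved, so mail can never be delivered.
    cleaned["email"] = f"customer{index}@example.invalid"
    return cleaned


def mask(value, visible=4, fill="*"):
    """Hide all but the last `visible` characters of a sensitive value."""
    text = str(value)
    if len(text) <= visible:
        return fill * len(text)
    return fill * (len(text) - visible) + text[-visible:]


def prepare_test_data(records, size, seed=0):
    """Subset, then anonymise, then mask (hypothetical field names)."""
    prepared = []
    for index, record in enumerate(extract_subset(records, size, seed), start=1):
        cleaned = anonymise(record, index)
        cleaned["account_number"] = mask(cleaned["account_number"])
        prepared.append(cleaned)
    return prepared
```

Using undeliverable addresses for the synthetic emails also guards against the kind of accidental broadcast described above.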
On smaller, less complex systems where the file and database structure and the data content are well understood, this could be undertaken manually or, more likely, with some pre-prepared scripts. If possible, the processing should be done on the devices hosting the existing data, so that untreated data does not have to be transferred elsewhere first. The extract can then be exported, treated and transferred off the server, over a secure transmission protocol, to a possibly less-secure destination.
In more complex systems such as customer relationship management (CRM) and enterprise resource planning (ERP) applications, extracting subsets, transforming and masking data can be very time-consuming, and it is complicated to maintain data integrity and difficult to ensure statistical consistency. One difficulty is the introduction of inconsistencies, for example in an insurer's data where the postcode and house insurance premium are inter-related. In these cases, tools can be purchased to help with the task. However, first understand the purposes for which the data extract will be used, and ensure that the tool can generate suitable data sets.
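One part of the integrity problem, keeping the same identifier consistent across several tables, can be sketched with a keyed hash. This is only an illustration under assumptions: the `customer_id` column and the helper names are hypothetical, and it does nothing for statistical relationships such as postcode and premium. The key would need to be kept away from the test environment and destroyed afterwards, otherwise the mapping could be recreated.

```python
import hashlib
import hmac


def pseudonymise(value, key, length=12):
    """Map a value to a stable token using HMAC-SHA256 with a secret key.

    The same value and key always give the same token, so joins between
    tables still work. Truncating to `length` hex characters is a
    convenience and slightly raises the (small) chance of collisions.
    """
    digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def pseudonymise_column(rows, column, key):
    """Return copies of `rows` with one column replaced by its token."""
    return [{**row, column: pseudonymise(row[column], key)} for row in rows]
```

For example, running `pseudonymise_column` over both a customers table and a policies table with the same key leaves the link between a customer and their policies intact, without carrying the real identifier into the test data.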
Ensure the extracted data sets cannot be reverse engineered back into the original data, and that they are tracked and disposed of securely at the end of their use. Don't forget that "data" can exist in formats other than database files and office documents, such as images, multimedia files, caches, logs and backups.
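Logs are an easy place to overlook. As a small sketch, assuming plain-text log lines, the following Python replaces anything that looks like an email address or an IPv4 address with a placeholder. The patterns are deliberately simple and will miss other identifiers (names, IPv6 addresses, session tokens), so treat it as a starting point rather than a complete scrubber.

```python
import re

# Simplified patterns: good enough for a first pass, not a full validator.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_line(line):
    """Replace email-like and IPv4-like substrings with placeholders."""
    line = EMAIL.sub("[email]", line)
    return IPV4.sub("[ip]", line)


def redact_lines(lines):
    """Apply redact_line to each line of a log extract."""
    return [redact_line(line) for line in lines]
```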
Are there legal restrictions? Under the Data Protection Act 1998 (DPA), you need to inform subjects of your intended uses for the data they provide. If they haven't agreed to its use for testing your systems, you mustn't use it in this way. Remember, if the data cannot be used to identify individuals, the DPA doesn't apply.
Do you have any experiences to share?
Update 7th November 2008: The question of whether IP addresses are personally identifiable data often arises. Comments by Peter Hustinx, the European Data Protection Supervisor, at the RSA Conference Europe 2008 are a useful reminder that nameless data, such as IP addresses, could be personal data and are thus protected by data protection legislation.
Posted on: 31 October 2008 at 08:03 hrs
