Data alright?

We may be wrong to focus on our data being right, when we should be more concerned that it is the right data. There is a congruence here with efficiency and effectiveness - doing the thing right and doing the right thing. We can put lots of effort into honing our processes and interactions so that they are efficient, but this can be to no avail if we are not doing the right thing. All very obvious you might think, but the reality is many people have conceptual problems with that abstract stuff we call data and will ask people like me to check if their data is right and miss out on the more important question of whether they hold the right data. I recently examined a database for an application that had been in use for some years. The quality of the data on most measures was excellent. The curious incident was what was not there, amongst which were email addresses and mobile phone numbers.

The data protection principles tell us that data should be right in both ways. The right data comes first in principle 3: "Personal data shall be adequate, relevant and not excessive...", or as my wife might specify it: "enough, but not too much". Principle 4 tells us the data should be right: "Personal data shall be accurate and, where necessary, kept up to date". In some ways we are struggling with terminology, as we don't have words that necessarily sum up our different dimensions of rightness with a clarity similar to efficient and effective, but I'm going to use pertinence to talk about the right data, and accuracy to talk about the data being right.

We can measure, quantify and report on both efficiency and data completeness and accuracy, and plot ways to fill any gaps in the existing record. The questions of effectiveness and whether we are recording the pertinent data are much more open and difficult to quantify. Whilst there may be some things that are definitely wrong, it is difficult to be certain that we ever have it as good as it could be, but we can be pretty certain that the world will move on and invalidate both our accuracy and pertinence. Focusing on pertinence more that accuracy, we move from the certainty of analysing the data quality to business analysis where we will need to take account of the dynamism of data and the business, and differing frames of reference for different users.

Within our companies, what is considered pertinent is going to vary by department. A contact record with just an email address may be considered a valid lead, and if you measure your sales or marketing effort on new leads then you will probably have lots of these. In the business world we may be able to extract some information from the domain, but the marketer tasked with deriving meaningful insight from this data has a lot to do, and would prefer all fields to be populated with accurate data. An operations person might be uninterested in the age of your contact, but would like the address and postal code to be correct. They may also be interested in any references stored to other systems, so they can link the data to the corresponding data elsewhere. An accounts person might be interested in something as basic as whether this is a new or existing customer, and the high level of duplicates in many systems shows this is not as simple a question as it sounds. Of course, you can try to enforce some mandatory fields, but the side effect of this can just be bad data as users enter something to get past the mandatory field that they see as stopping them doing their job. Mostly, bad data is worse than no data, as we may otherwise make assumptions based on that bad data.

Even in well controlled systems we may have problems where we try to categorise data. Too few categories and we lose precision in our insight. Too many and we similarly lose precision as the users will pick either the first one or an overall bucket. Enabling users to add their own categories may lead to chaos, as once you are beyond 30 or 40 categories users will add their own rather than try to find the closest match. I've seen many thousands of job functions in a system. Our categorisation, like our overall data, needs to be pertinent.

The world is dynamic, and our data will decay over time, both in accuracy and pertinence. General accuracy will probably decay faster than pertinence, but the way we reach customers and they find us keeps developing. You will need to review pertinence regularly. As with general accuracy, do not be afraid to throw away the old stuff. Not only will this help you see the wood for the trees, it is a data protection requirement. You may need to learn to tolerate some ambiguity, particularly as regards completeness. This can be difficult for those of us who have grown up with fully populated data, as well as making our query writing and reporting rather more complex, but the reality of the modern world is much more to start with minimal data and then update and enhance it when and where you can. This can be quite difficult, but at least if will be effective if you have focused on the right data before striving to get the data right.

John Davis
10 Jan 18.

Email: john.davis@cranfieldsoftware.co.uk

Telephone: +44 7973 504232

Data alright?