Three Cheers for Data Cleansers

A meme that would have been topical if I had made it in April

I had a dream the other night that I was trying to make a fruit salad, but in this dream our fruit bowl was doubling up as a storage bin for the compost heap, so half of it was full of luscious fruity goodness and the other half was full of horse manure. So there I was, standing in the kitchen, daintily picking out strawberries, blackberries, gooseberries and beetroots (beetroots were a fruit in this dream too. Look, I don’t make the rules…), and inspecting them to see if there were any flecks of poo on them before placing them carefully on the plate.

I’m not just pretending to have had this dream for the sake of a blog post, by the way. I promise you I would never have made up anything this disgusting deliberately. Let’s not think too much about what this says about my subconscious because I suspect I wouldn’t like the answer. It feels like a pretty good analogy for data cleansing though. Do I even need to explain the parallels? We want as much fresh and clean material as possible and we definitely don’t want any “extras”. And so we need to get busy cleansing.

Data cleansers are the unsung heroes of any migration project. It’s not a glamorous job. It doesn’t earn anyone a place in the limelight, but it requires superhuman levels of patience and attention to detail. Often there are lists that scroll endlessly off the bottom of the screen and each row needs to be unpicked one by one. I remember my first time seeing a data cleanser in action. He was living at his desk, shirtless, wild-eyed, pencils stuck in his hair, muttering, yelping at sudden noises, for months on end. And here’s the thing that made me realise the depth of his Sisyphean predicament: it was all the same enormous spreadsheet he was working on the entire time! It must be tempting, when you’re performing the same repetitive action over and over again, to cut corners, make assumptions, or just get hypnotised by the monotony and make mistakes, but he didn’t. He had the bug and he wasn’t going to let a single row get lost. Data Hero. His was an extreme case, but even under normal circumstances, the struggle is real.

More often than not, data cleansing is done in-house. In practice that often means it gets delegated to team admins because they are careful but they are also close enough to the coalface to be able to get information about the clients direct from the social workers. And they invariably do a cracking job. They’ve often spent years evangelising within their teams for good case recording, having to clear up inconsistencies every April when the statutory returns fall due, having to help social workers who can’t find Mrs Ellis because someone has keyed her in as Mrs Ellace; so they know good data makes a difference. But they have day-jobs too, so the project should make sure they are properly appreciated and supported and not taken for granted or overloaded.

Of course there are many other people who could be involved: senior admins (usually for the high-ticket items), system support officers (where special access is needed, such as merging duplicate clients), social workers (if it’s something that needs a professional opinion). Sometimes external data cleansers will be brought into the project team because there isn’t enough internal resource, but I worry that an outsider doesn’t always have the same stake in the end result, so you need to look for someone with the right attitude more than the right technical qualifications.

But whoever they are, all data cleansers are heroic in their own way, fighting an unending war against the law of entropy, armed only with a mouse, a keyboard and a strong cup of tea.

My own advice on data cleansing for migration projects is to forget everything you know about data cleansing. It tends to be something that people in social services departments only think of in the context of preparing for statutory returns, so the only data that gets looked at is the stuff that forms part of those returns: timescales for assessments, correct placement codes, things like that. This is completely understandable, of course, because there’s an obligation to get it right. However, in a data migration project this kind of detail is lower priority if it’s a priority at all. After all, you’ve already submitted the returns from 2013, so it is usually not a good use of time to backload every single manual change that was made to that return into the system in the run-up to migration. In the unlikely event that someone ever needs to know, they can dig out the 2013 return and have a look at that. In terms of my dream, the kind of abstract information found in a government return like this would just be a lightly besmirched beetroot. We should be more concerned about the delicious strawberries that have somehow, over time, got squished deep into the mire and need to be thoroughly rinsed down and carefully dried on the arm of a t-shirt before they can be placed in front of a dinner guest. Ah… OK, well I think we’ve established that there’s a reason why I’m a data migration engineer, not a chef. Let’s move on.

What kinds of things should we be focusing on for data cleansing? I think it’s usually best to prioritise the data nobody has ever needed to grapple with until now because it isn’t the subject of any stat returns, but which would cause major issues in the new system or make the social worker’s job harder in some concrete way if it were left unfixed. Some key examples would be:

  • Who’s who? Duplicate person records will cause absolute havoc if not sorted out, and each one will take a long time because you have to apply quite a lot of detective work to establish whether it’s a true duplicate or just two people with similar names. Start a process for detecting duplicates early and get someone meticulous to work on the list for as long as it takes, then revisit it periodically to pick up any new ones (there’s a rough sketch of one way to flag candidates just after this list).
  • Ambiguous casework. The classic situation here is where you need to migrate only children’s service data out of a database that includes both adults and children. How do you know which segments of casework to migrate and which to leave behind? It will often come down to who has recorded it, or some identifying detail of the referral that led to the case being open in the first place. So far so good, but in a lot of cases, we find that there are ambiguities and grey areas and someone needs to spend some time delineating where one set of involvements stops and the other starts.
  • Bad addresses. You’ll usually have to do some matching with AddressBase or the NLPG or some sort of gazetteer, but depending on the system some of the addresses might be too sloppily entered to be matchable, so someone will need to do some legwork, especially on current addresses, to ensure you know where to send letters and social workers know where they need to go on their home visits.
  • Bad numbers. Things like Unique Pupil Numbers (for children’s services) and NHS numbers (primarily in adults) will often be used as a link to other related databases within the department. Obviously there is no way of knowing for certain that a given number is wrong, but it’s usually possible to detect anomalies like duplicate IDs, IDs that don’t fit the allowable format given by the government, or a UPN listed against someone who is clearly too old ever to have had one. Once produced, the list can be passed on to someone with the patience to unravel the mysteries therein (the second sketch after this list shows one rough way such a list might be generated).
  • Bad dates. Again, we can’t know if a specific date is wrong, but people born in 0213 or 2089 will usually be easy to spot, for example. Then there are the people who are placed in older people’s teams but are only three years old.
  • Casework for deceased clients. This is usually more common in adult services than children’s, for obvious reasons. If there is a financial implication it’s more likely to have been spotted, but it’s surprising how often you’ll find ongoing services listed for deceased clients, to say nothing of ongoing assessments, allocations and so on. Closing these down as part of the migration is usually straightforward (simply put the date of death in as the end date), but it’s often something the business would rather tidy up sooner rather than later, of course.
  • General sloppiness. Things like squeezing three phone numbers into the same box instead of listing them separately or adding a surname like “Smith aka Jones” instead of adding Jones separately under the aliases section. These kinds of errors are mildly annoying and can affect the ability to search for a particular service user, but it’s seldom a high priority and realistically it’s quite rare that any local authority department has enough time to start putting things right to this level of detail unless it’s bad enough to cause validation errors in the new system.
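
For what it’s worth, the machine can at least shorten the duplicates list even if it can’t make the final call. Here’s a rough, throwaway sort of sketch in Python of the kind of thing I mean: it pairs up records that share a date of birth and have suspiciously similar surnames. The column names (person_id, surname, dob), the persons.csv export and the 0.85 threshold are all invented for the example, and comparing every pair like this only scales to a modest list; the point is just that the script produces a shortlist and a human does the detective work.

    import csv
    from difflib import SequenceMatcher

    def name_similarity(a, b):
        # Crude string similarity between two surnames, from 0.0 to 1.0.
        return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

    def candidate_duplicates(rows, threshold=0.85):
        # Pair up records that share a date of birth and have suspiciously
        # similar surnames. This only produces a shortlist for a human to
        # investigate; it decides nothing on its own.
        pairs = []
        for i, a in enumerate(rows):
            for b in rows[i + 1:]:
                if a["dob"] and a["dob"] == b["dob"] and \
                        name_similarity(a["surname"], b["surname"]) >= threshold:
                    pairs.append((a["person_id"], b["person_id"]))
        return pairs

    with open("persons.csv", newline="") as f:   # hypothetical export from the old system
        people = list(csv.DictReader(f))

    for left, right in candidate_duplicates(people):
        print(left, "might be the same person as", right)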
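
Similarly, for the bad numbers, bad dates and deceased-client bullets above, a lot of the list-building can be scripted even though the fixing usually can’t. NHS numbers carry a Modulus 11 check digit, and a UPN is (as far as I know) a check letter followed by twelve digits, so format anomalies are easy enough to flag. Again, this is only a sketch under assumed field names: it only checks the shape of a UPN rather than its check letter, and the output is meant to be dumped into a spreadsheet for a patient human being rather than acted on automatically.

    import re
    from datetime import date

    def nhs_number_ok(raw):
        # Modulus 11 check on a 10-digit NHS number (spaces and dashes ignored).
        digits = [int(c) for c in raw if c.isdigit()]
        if len(digits) != 10:
            return False
        total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
        check = 11 - (total % 11)
        if check == 11:
            check = 0
        return check != 10 and check == digits[9]

    def upn_shape_ok(raw):
        # A UPN should be a check letter followed by twelve digits; this only
        # checks the overall shape, not the check letter calculation.
        return re.fullmatch(r"[A-Z]\d{12}", raw.strip().upper()) is not None

    def dob_implausible(dob, today=None):
        # Dates of birth in the future, or before 1900, are worth a second look.
        today = today or date.today()
        return dob > today or dob.year < 1900

    def open_service_for_deceased(service_end_date, date_of_death):
        # An ongoing service (no end date) recorded for someone who has died.
        return date_of_death is not None and service_end_date is None

    # e.g. nhs_number_ok("943 476 5919") -> True; dob_implausible(date(2089, 1, 1)) -> True

The deceased-client check is the one case where the fix itself can often be automated at migration time, by writing the date of death in as the end date as mentioned above; everything else on the resulting list still needs a human with patience and a strong cup of tea.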

These are all areas where the scope for automatic cleansing might be limited, so a real human being will need to roll up their sleeves and get to grips with the case recording. And those people are heroes and we must cherish them!

I follow someone on LinkedIn who has recently written a book about data cleansing and it’s due out in a few weeks. I’m not sure how I’ve ended up following her because I have never met her or worked with her, but she seems to be very good at marketing herself so here I am in her network! She’s called Susan Walsh and her book is Between the Spreadsheets, which is a pleasingly cheeky title for what is arguably not the most entertaining subject in the world, and I must say, if she’s as good at data cleansing as she is at ensnaring new LinkedIn followers, this should be a good read. I’ll be interested to see what she has to say and how much of it is directly applicable to the kinds of project I usually work on. There is a section in the Practical Data Migration framework that deals with data quality rules, but it’s more from the point of view of meeting thresholds for migration into the new system, rather than the nitty-gritty detail of how you get to grips with actual… well, to continue my analogy, the actual fruit-depooping. I’m crossing my fingers for a solid, simple, replicable methodology that will make it easier for me to help customers see their way through to data victory. That’s the dream, Susan. No pressure!