Talend Open Studio – Product Review

Even Talend’s lime-green starfish logo on my desktop is most beguiling

When I did my first ever migration, I used a tool that another developer had built for the purpose in Microsoft Access. It was interesting, certainly simple enough for a beginner to use, and probably the right tool at the time, but it had some major flaws. It was slow and temperamental. Sometimes, when working with large text fields, I’d have to take a Long (an outmoded Oracle text type) and put it into a CLOB – both pretty awkward things to work with – but when you’re using an Access Note field as your bridge, all bets are off. It would chop off great chunks of text and not tell me there was a problem. “Oh yes, Colin”, it would say, cheerfully, “I’ve loaded all the rows, don’t you worry”. Only later would the testers discover that all the notes stopped abruptly at the 4,000-character mark. It was a bit picky about what inputs it would accept, too: CSV or nothing. In short, it had its limitations. These days, I need something with a bit more oomph, and I think I’ve found it in a fantastic product called Talend Open Studio.
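If I were doing that job again, I’d at least wrap a belt-and-braces check around the load. Here’s a minimal sketch of the idea in Python, assuming you can dump the notes from the source and the target to CSV first; the file names and column names are made up purely for illustration:

```python
import csv

# Hypothetical extracts: one CSV from the legacy system, one from the target,
# each with an "id" key column and a "notes" free-text column.
SOURCE_FILE = "legacy_notes.csv"
TARGET_FILE = "migrated_notes.csv"

def load_notes(path):
    """Return {id: notes} from a CSV with 'id' and 'notes' columns."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["id"]: row["notes"] for row in csv.DictReader(f)}

source = load_notes(SOURCE_FILE)
target = load_notes(TARGET_FILE)

for row_id, src_text in source.items():
    tgt_text = target.get(row_id)
    if tgt_text is None:
        print(f"{row_id}: missing from target entirely")
    elif len(tgt_text) < len(src_text):
        # The tell-tale sign of silent truncation: every broken row
        # stops at the same length (4,000 characters in my case).
        print(f"{row_id}: truncated at {len(tgt_text)} of {len(src_text)} chars")
```

A check like this would have turned my testers’ unpleasant surprise into a one-line report on the day of the load.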

Talend is an open source product built around the Eclipse platform. It is made up of several distinct components. I use the data integration tool and the data quality tool, but they come in Cloud, Big Data, MDM and various other cool flavours too, if you need those things. And the best part? The products are all free to use if you’re working at single-developer scale. The company even runs a very good online support system, which is open to the free customers too. So how the heck do they make money? Speaking as someone who once tried to get them to take some of mine, I’m not really sure, because their very lovely, helpful programmers didn’t seem to be all that interested. However, they must be doing it, because I’ve been to their office and it was very nice. The revenue comes from a fully integrated platform product which does cost extra and which allows you to move swiftly from data quality to data integration with zero faffing about, collaborate with other developers, and generally scale the whole thing up. I can’t tell you how utterly impressed I am with this attitude to distributing software. Charities and small outfits with relatively modest requirements can get by with a great product that they look after themselves, while there’s the option of going to the next level if you happen to belong to a larger enterprise.

This review is only of the free version of the two products I’ve used. They are all you need to run a data migration. Or rather, they’re all you need to handle the technical aspects. If you want to know what you’ll need for the more “projecty” aspects, get yourself a copy of “Practical Data Migration”.

Talend Data Quality Tool - Correlation Graphs
With pretty graphs like this, you can convince just about anyone of just about anything

Talend Open Studio for Data Quality

If you’re lucky enough to have any data quality in your organisation, this tool will find it! ;^)

The backbone of this version of Talend is similar to the integration tool’s: you get an Eclipse workspace with folders to store various sets of enquiries into the state of the data in the system. These can operate at several levels. You can analyse an entire database to see how many rows are in each table and what sort of general structure it has, or zoom in on a set of tables – the relationship between the values in two columns, say – or, at an even finer level, look at the patterns in the data itself. Do all the postcodes match the regex for a valid postcode? How many of the customer names are null? What’s the longest value in a text field? What are the outliers in a set of loan amounts? Are any of the dates insanely far in the future or past? Results are usually presented as bar charts, but there are other formats too, like the nice correlation shown in the picture. In each case, you can drill down to get a closer look at problem rows. This is a great way to convey to stakeholders what the issues are in the data and what needs to be addressed right at the start. The GUI is easy to use and – with a little fiddling around – you can soon find your way about and get some real work done.
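If you’re curious what those checks amount to under the bonnet, they’re mostly very simple logic. Here’s a minimal sketch in Python, purely illustrative and nothing to do with Talend’s own generated code; the rows, column names and the deliberately simplified postcode pattern are all invented for the example:

```python
import re
import statistics
from datetime import date

# Stand-in rows; in practice these would come from a database extract.
rows = [
    {"name": "A. Customer", "postcode": "SW1A 1AA", "loan": 1200.0, "dob": date(1975, 3, 1)},
    {"name": "B. Customer", "postcode": "EX4 4QJ",  "loan": 1100.0, "dob": date(1982, 7, 9)},
    {"name": None,          "postcode": "NOT A PC", "loan": 950000.0, "dob": date(2090, 1, 1)},
]

# Deliberately simplified UK postcode pattern, for illustration only.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")

null_names = sum(1 for r in rows if not r["name"])
bad_postcodes = [r for r in rows if not POSTCODE_RE.match(r["postcode"] or "")]
longest_name = max((len(r["name"]) for r in rows if r["name"]), default=0)
future_dobs = [r for r in rows if r["dob"] > date.today()]

# Crude outlier check: anything more than ten times the median loan amount.
median_loan = statistics.median(r["loan"] for r in rows)
outliers = [r for r in rows if r["loan"] > 10 * median_loan]

print(f"null names: {null_names}, bad postcodes: {len(bad_postcodes)}, "
      f"longest name: {longest_name}, future DOBs: {len(future_dobs)}, "
      f"loan outliers: {len(outliers)}")
```

The point of the tool, of course, is that it runs this sort of thing at scale, across whole schemas, and turns the results into charts you can put in front of stakeholders.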

Talend Data Integration Tool Workflow
The Talend Data Integration tool, showing a main migration controller calling off several subprocesses in turn

Talend Open Studio for Data Integration

When you’re ready to start extracting, this is the next step. The integration tool has connectors that can hook into most common database types and quite a few I hadn’t heard of. Values extracted from one database can be joined with external data, manipulated, mapped using a table from an Excel sheet and have functions applied to them before they are inserted into a table in the target database. Exception lists and logs can be dropped into a CSV or other flat file for a cleanser to look at, and a stored procedure called to kick off any cleansing routines you might be using. The separate steps can be joined together using connectors and orchestrated any way you like. I’ve seen people do this with Visual Studio, but I’ve never been in a standardised SQL Server-only environment where that’s been an option, so it’s really good to have the same functionality with a lot more flexibility (not to mention colour!) in this product. In short, it’s as easy as building Lego models but quite a bit more productive.
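The shape of that flow is nothing exotic, which is part of the appeal. Purely as an illustration (not Talend’s actual output, and with made-up file and column names), a hand-rolled version of the map-or-reject step might look like this:

```python
import csv

# Hypothetical files standing in for the source extract, a mapping table
# (the sort of thing you might keep in an Excel sheet) and the two outputs.
SOURCE = "legacy_cases.csv"        # columns: case_id, legacy_status
MAPPING = "status_mapping.csv"     # columns: legacy_status, target_status
LOADED = "cases_to_load.csv"
EXCEPTIONS = "status_exceptions.csv"

with open(MAPPING, newline="", encoding="utf-8") as f:
    status_map = {row["legacy_status"]: row["target_status"] for row in csv.DictReader(f)}

with open(SOURCE, newline="", encoding="utf-8") as src, \
     open(LOADED, "w", newline="", encoding="utf-8") as ok, \
     open(EXCEPTIONS, "w", newline="", encoding="utf-8") as bad:
    ok_writer = csv.DictWriter(ok, fieldnames=["case_id", "target_status"])
    bad_writer = csv.DictWriter(bad, fieldnames=["case_id", "legacy_status", "reason"])
    ok_writer.writeheader()
    bad_writer.writeheader()
    for row in csv.DictReader(src):
        mapped = status_map.get(row["legacy_status"])
        if mapped is None:
            # Unmappable values go to the exception list rather than failing silently.
            bad_writer.writerow({"case_id": row["case_id"],
                                 "legacy_status": row["legacy_status"],
                                 "reason": "no mapping"})
        else:
            ok_writer.writerow({"case_id": row["case_id"], "target_status": mapped})
```

Talend lets you wire that sort of step together graphically, chain dozens of them into a controller job, and swap the CSVs for real database connections.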

In general, I’ve found it pretty reliable, stable and trouble-free. It can be a little slow, especially if you’re working outside the LAN (hint: remote desktop!). I daresay performance depends quite a bit on the resources of the machine you’re on too, of course, but you can set jobs going in parallel so as to get everything loaded in whatever window you have available.

What else is there?

There are other software products available, of course, and in some cases you might need them. I’ll try to do some more tech reviews as and when I get a chance to get my hands on them. I must say, though, from what I’ve seen of the others, none is as well-rounded as this one. Pandora strikes me as too limited for proper ETL (although I’ll admit I’ve only seen an older version), and some of the others seem to be geared up to very specific challenges around the technical limitations of moving bulk data. None of them seems able to handle a complete, end-to-end process across this kind of diverse set of data sources, but by all means, if anyone knows of anything better, I’d love to hear about it. Not MS Access though. Sorry, been there, done that.


Data Vs People

Another mildly ranty one, I’m afraid:

Today some company was touting its wares on Twitter under the header “It’s unanimous: data holds the key to improved healthcare at reduced costs”, spammed into my timeline like so many sponsored tweets.

Now, to be fair, I shouldn’t really be singling out this company at all since a lot of software companies are similar, implying that with the right data tool, you can do healthcare (or whatever) on the cheap. I find it mildly irritating though, in this case, since it implies that boring things like – y’know – doctors, nurses, hospitals and so on are all a bit yesterday and that what really makes healthcare work is proprietary products that appeal to a certain sort of administrative mindset. Meh. Data is important but let’s not get above ourselves here: the heart of any health service is the people who work in it, their knowledge and their willingness to do the right thing.

Am I being unfair? Probably.

Data Migration Matters 8, 2 June 2015

Data Migration Matters is pretty well established now and this is my third year attending. You can read my last review here if you’re interested. As usual, it was fronted by Johny Morris, blogger, migration guru and author of the industry-standard manual on how to get the job done, Practical Data Migration. A few of the better-known suppliers were there, touting their wares: consulting firm Data Dynamics, Experian (who recently gobbled up X88 and their impressive data profiling and quality tool, Pandora), Vision Solutions and QBase.

Sponsor stalls at Data Migration Matters 8

I started by arriving late. It was my daughter’s birthday and I wanted to see her off to school after her presents were opened, so I ended up having to sneak in at the back and missed the big announcement that Pandora had obtained some sort of certification for PDMv2. I’m not wholly clear what that means, since PDM isn’t very prescriptive when it comes to tools. I suspect there’s a little harmless log-rolling going on there but since I missed that part of the speech, you probably shouldn’t take my word for it.

After some conference coffee, there were break-out sessions. I attended one on testing, which focused very heavily on the role of functional testing and skipped the earlier stages. That’s probably as it should be, since people tend to skip functional testing, assuming that if they’ve tested the new system and looked at the data, putting the two together will present no problem. Getting people to see the benefit of sitting down and running a “dress rehearsal” of their go-live to see what breaks can be a hard sell, and it’s resource intensive. People discussed how they go about this, how they sell it to the business and how they triage the results that come back. It was interesting to see that a few people seemed to have more or less given up on being able to do it at all. Not a safe way to go, that. Anyway, there is a discussion on Johny’s blog if you’d like to follow that line of thought further.

The second session was an overview of the PDM method, given by a rep from Pandora, but having seen it last year there weren’t any surprises. Next door, Vision Solutions were touting the benefits of their product as a way of doing minimal-downtime migrations. It seems like a good concept, actually, although not well suited to the kinds of jobs I undertake.

After lunch (sandwiches, little cakes), Dylan Jones of Data Migration Pro gave a recap of some of the themes of last year’s presentation, with a narrower focus, looking at some of the actual, real-world gaps in data that presented themselves during a large migration. Again, it’s funny to hear how the same issues come up whether you’re dealing with (in my case) a small adoption team or (in his) a division of a telecoms giant. The data is always scattered, nobody knows the whole story of what’s where, and nobody wants to hear that the spreadsheet they’ve used for years is ill-suited to prime the pump of a new system without a major reworking. Getting those points across requires tact and diplomacy. In fact, the “gaps” he talks about are what are commonly referred to as “errors”. Calling them gaps takes some of the blame out of the situation and emphasises their incompatibility with the new system, rather than implying everyone has been muddling through with a fairly shonky system thus far. This seems like sound reasoning.

Glen Vaal, of outsourcing behemoth Capita, took up some of the same themes in his speech, although I was a bit baffled by some of the things he said. He seemed to have a slightly odd idea of engagement with the business, beginning when he dissed the first of Johny’s four golden rules, “Data Migration is a Business Activity”. “So what?” he asked. Heresy! He also mentioned that he didn’t have access to any fancy DM products like Pandora, prompting me to pipe up that Talend was free and excellent. I don’t think I piped up loudly or clearly enough, though, and the poor man might have thought he was being heckled by someone in the process of choking on a leftover cheese bap. I really need to acquire some social skills. If you know him, please pass on my apologies.

Freebies
The Magic Smarties of Data Migration

Asking around in the breakout session later, while we were all swigging wine out of plastic cups (yeah, I’m so fancy!), I was quite interested to find that a few of the people I spoke to had done social care migration projects in the past, and I wondered how they’d got on. In the main they seemed to have been on the right track, starting with the idea of a cohort based on records retention policies and building out from there, looking to retire the old system entirely, which is good. I spoke to a guy at lunch who specialised in health care migrations, too, and I imagine he did a lot of the same kinds of things. I also met a friend of mine, a former colleague from three or four local authorities ago, who goes every year and is an all-round good egg in the same line of business as me. Chatting with a few of the consultants at the stalls, I tried them out on a particularly thorny problem I’ve been trying to decide how to deal with – moving, merging and converting a large number of flat files as part of a wider migration process. The only one who came up with anything, really, was the guy from QBase, who simply confirmed that the way I was thinking of tackling it (basically, using Talend’s toolset a bit creatively) was the best way in. That was reassuring, since it’s always good to know I’m not off on the wrong track entirely!

The final event of the day was an open forum in which people got to ask questions. Mostly it was straight down the line, apart from one disagreement about the role of the CDO (“Chief Data Officer”) as distinct from the CIO (“Chief Information Officer”). Most of those present felt that a data officer was a good thing and might help organisations understand their data requirements better, but Andrew Reichman cut through that consensus, pointing out that data and information are basically two different words for the same thing and there’s probably not much point in creating an extra tier of bureaucracy just because the IT department hasn’t got a grip on its own systems. I found that persuasive. There’s a bit of a tendency, I think, to create new roles to address perceived gaps rather than ask why the existing organisation isn’t handling them well. It’s something I see in local government more often than I would really like.

Epilogue – there’s more to life than migration

I drifted off at the end while the networking was still in full flow, and made my way home for the rest of the tenth birthday celebrations. I left front-line social care to work in case recording/IT support roles just a couple of months before she came into the world, so as well as my amazement that – holy crap! – I’ve been a dad for ten whole years, I reflect on how my life has changed since then. I enjoyed working directly with adults with learning disabilities, of course, but I’m glad to have made the change for all sorts of reasons. Different challenges, different ways of being. And oddly, it’s not as wild a change as you might think. Data systems, if they’re handled correctly, should free up front-line workers in social care to do excellent work, unencumbered by onerous paperwork and unwieldy systems, and if I can contribute to that happening in some small way then this job is really a continuation of, not a complete break from, the fifteen years at the coalface that preceded it.

Archiving #2: The Afterlife of Data

Hopefully, after reading the previous post on archiving, I’ve already convinced you that it’s not a good idea simply to make the legacy database read only and keep it in perpetuity “just in case”. In this post, I’m going to be thinking about maintaining the archive, and hopefully in the process you’ll get an idea of the implications your choice of archiving strategy will have.

What happens to your archive after you’ve gone live with the migration? Ideally, it becomes less and less relevant and can finally be forgotten entirely. But you might need to look back at it. Here are a few reasons I often come across in my projects:

  1. We might find that we missed some things in the first migration and need to go back and salvage them. If you do the migration right, this won’t happen, and if people are thinking this way it would be far, far better to examine why. Are they not fully satisfied with the scope? Are the test results too vague? If so, why are you going live? It’s a hard question to ask, but it’s easier than the alternative of crossing your fingers and hoping for the best.
  2. We might need it for audit. Yes, you might, and that’s a good reason. It’s helpful to be able to see what decisions were made (the process documents should be archived to show this) and what the records looked like prior to being extracted and transformed. This seems like an eminently sensible reason to hang onto them for a set period of time.
  3. We might get a subject access request about an old case. If it relates to a case in scope, you will have migrated it anyway, and that should cover anything you need. If it’s so old that it’s out of scope, that will be because your records retention policy deemed it no longer relevant and – under data protection legislation – you were obliged to destroy it. That being the case, you should be able to reply to the enquirer that you don’t have any data to disclose. In fact, as I understand it*, if you were able to answer the request for a case that was outside the data retention schedule, that would be reason enough for the person to lodge a complaint, wouldn’t it?
  4. We might just want to retain the old system in read-only form so we can go back to it, because there were some things we decided would be too expensive to migrate and left behind for simplicity. This is sheer madness. If you need the data, migrate it! If you don’t, you’re committing to keep [some part of] an old system indefinitely, with all the licensing, hardware and support costs involved. Not only that, but there has to be someone around who can query the data. That might mean retaining someone who can write queries against a long-forgotten schema, or even paying for staff to be trained on the older application and given logins so they can access it as well as the new one. And where does this stop? Let’s say ten years from now you decide to migrate to an even newer, shinier system. Now you have to ask: do you want to retrieve that data from the old archive (treating it as yet another source), or leave it behind in an increasingly outdated, unsupported application, running on an operating system that is long past its use-by date?

This last point is quite a deep one. Archived data that’s just left in its old format is rotting away like the salad in Jeremy Clarkson’s fridge. Some records relating to children need to be archived for a century or so. Think how much computer systems have changed in the last twenty years, let alone going back to the days of the Commodore 64. Do you think the database you’re migrating from will be intelligible to anyone twenty, thirty, forty years from now? Hell – as Ed Miliband would say – no! In fact, when I was discussing archiving for adoption records over the phone with a records expert from the National Archive (sorry, I’ve long since forgotten his name), he was unimpressed by any data format for long-term storage and said that, as far as he was concerned, if you had this stuff on paper it was probably best left on paper, because paper isn’t going to stop being readable over time. This goes against the grain, of course, but it’s a bloody good point. That’s why you need to plan to bring any data you have and need to retain onto your newest platform, so it’s never too close to the point of extinction.

Maintaining the Archive After Migration

OK, so you’ve taken my advice and only archived records that you’re planning to migrate anyway. Now, what happens in five years, when all your adult social care records, which were seven years old (and in scope according to policy) when you went live, are twelve years old (and therefore surplus to requirements)? If your data policy works, those records will have been purged from your production database, but there they are, still sitting in the archive. This presents data managers with a problem, since it will be awkward to delete the records from the archive. This is why you need to talk about a policy for retiring not just the legacy database but the archive copy of it too.
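Even a rough-and-ready check makes the point about records quietly ageing out. Here’s a minimal sketch, assuming (purely for illustration) that the archive is a SQLite file with a cases table recording when each case was closed as an ISO date, and a flat seven-year retention period:

```python
import sqlite3
from datetime import date, timedelta

RETENTION_YEARS = 7  # illustrative figure only; use your own retention schedule

# Hypothetical archive: a SQLite file with a "cases" table holding case_id
# and the date the case was closed, stored as ISO text (YYYY-MM-DD).
conn = sqlite3.connect("archive.db")
cutoff = (date.today() - timedelta(days=365 * RETENTION_YEARS)).isoformat()

# ISO-formatted dates compare correctly as strings, so a plain < works here.
overdue = conn.execute(
    "SELECT case_id, closed_date FROM cases WHERE closed_date < ?",
    (cutoff,),
).fetchall()

for case_id, closed_date in overdue:
    print(f"{case_id}: closed {closed_date}, now past retention and due for deletion")

conn.close()
```

The hard part isn’t the query; it’s agreeing in advance who runs it, how often, and what happens to the rows it flags.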

How long do you need to keep it for audit (point two above)? So long as that need is there I think it’s legitimate to keep the archive, but when that period is judged to have expired, I don’t see a reason to keep it any longer, and I would suggest you bin the whole thing at that point rather than get into how to maintain it separately from your new production system.

Summary

So, to sum up my approach to archiving:

  1. It should be a by-product of your migration process rather than some extra work you do. In other words, when you extract data from the source system you plan to structure it in a way that works as an archive.
  2. It should hold the original records you extracted, as they were before they were mapped for the new system.
  3. It should include things like mapping tables so you can relate old to new for audit purposes.
  4. It shouldn’t contain anything that is out of scope for migration due to data protection rules.
  5. It should be in a reasonably sensible format, such as a SQL database, ideally denormalised and stored in sensibly-named tables, so it can be queried without too much pain by someone without specialist knowledge (see the sketch after this list).
  6. It should be well understood what will be in it, what it’ll be used for and for how long.
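To make points 2, 3 and 5 a little more concrete, here’s a minimal sketch of the sort of structure I mean, using SQLite and entirely made-up table and column names: one denormalised table holding the records as they were extracted, plus a mapping table linking old keys to new:

```python
import sqlite3

# Entirely illustrative structure: one denormalised table holding records as
# they were extracted from the legacy system, plus an old-to-new key mapping.
conn = sqlite3.connect("migration_archive.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS archived_cases (
    legacy_case_id   TEXT PRIMARY KEY,
    client_name      TEXT,
    legacy_status    TEXT,   -- original value, before any transformation
    case_notes       TEXT,
    closed_date      TEXT
);

CREATE TABLE IF NOT EXISTS case_id_mapping (
    legacy_case_id   TEXT PRIMARY KEY,
    new_case_id      TEXT NOT NULL   -- the id assigned by the target system
);
""")
conn.commit()

# Querying it later needs no specialist knowledge of the old schema:
rows = conn.execute("""
    SELECT a.legacy_case_id, m.new_case_id, a.legacy_status
    FROM archived_cases a
    LEFT JOIN case_id_mapping m ON m.legacy_case_id = a.legacy_case_id
""").fetchall()
print(f"{len(rows)} archived cases, with their new ids where one was assigned")
conn.close()
```

Because the tables are flat and plainly named, an auditor (or a future you) can answer “what did this record look like before migration, and what did it become?” with one join.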

*= This phrase “As I understand it…” is important of course. After a lot of discussion on this point, I’m confident that what I’m writing is sound but I’m not a lawyer and this isn’t legal advice, so check with your organisation’s legal department before acting on any of it. The stock market can go down as well as up and if symptoms persist for more than 48 hours, consult a general practitioner.

Archiving #1: What is an Archive For?

I recently wrote a question on LinkedIn about archiving. Nobody replied*. It’s a bit unnerving when nobody replies to a question, because you’re never quite sure whether it’s such a fiendishly difficult and intractable problem that nobody can think of a solution, or it’s just so obvious that everyone is ignoring it out of sheer embarrassment. I’ll hope it’s the former and press on with some thoughts.

The question is about archiving. A local authority wants to migrate to a new system. It faces hard choices about what data it needs and what it intends to leave behind. At the end of this long process you get to the question of how to archive the old system, and the unspoken assumption pops out: “Well, we’re not doing away with the old system, are we? That will always be there to refer back to…” It’s at this point you realise that people haven’t been focused on the difficult subject of record retention at all, because in their minds they were always expecting to hang on to the old system as a store of historical data and just use the new one for case recording going forward. Don’t just let this slide – you need to examine this underlying assumption or you’ll definitely have problems later. It’s important to establish what is happening to the old database right from the start, because that decision informs a lot of the decisions you reach about the main migration. There aren’t any right answers – it’s up to the business to choose – but here are some questions to get the discussion started:

Are we even allowed to keep it?

In most cases, when you’re dealing with personal data, you will have a set of guidelines over what you have to retain and what you absolutely can’t retain. With children’s records that will often mean retaining everything forever. Adoption records, for example, need to be held for a hundred years, and it’s unlikely that any of your data pre-dates that! Cases with lower levels of involvement, though (people referred to the department where no further action was taken, for example), often have to be destroyed after shorter periods under the Data Protection Act. Most authorities have their own in-house records retention policy based on these rules, and there might even be a records officer whose job it is to interpret and enforce them. Get in touch with that person as soon as you can and try to marry up the policy’s requirements with the migration and archiving strategies you’re formulating.

Who’s going to look after the old database?

If you keep the old system, someone has to own and maintain it. Users have to be trained, software has to be licensed, servers kept running. This is such a can of worms that I’m going to cover it in another post.

What if we only archive a subset?

If the lazy approach of just leaving the old system running isn’t an option, what else can we do? Well, we could suggest that the LA retain a new database consisting of just the records that are within retention schedules, perhaps using original, untransformed data values, denormalised for ease of querying, and store that somewhere for later. Optionally, a simple front end can be slapped on it to enable users to access it easily when it’s needed.
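As a rough sketch of what “just the records within retention schedules” might look like at extract time, here’s the filtering step in Python; the retention periods, file names and column names are all invented for the example, and the real figures must come from your records retention policy:

```python
import csv
from datetime import date

# Illustrative retention periods in years, keyed by a hypothetical case type;
# a real schedule would come from the records retention policy, not from code.
RETENTION_YEARS = {"adoption": 100, "safeguarding": 35, "nfa_referral": 2}

def within_retention(case_type, closed_iso_date):
    """True if the record should still be held under the (made-up) schedule."""
    years = RETENTION_YEARS.get(case_type, 7)  # default period, also invented
    closed = date.fromisoformat(closed_iso_date)
    return (date.today() - closed).days < years * 365

with open("legacy_extract.csv", newline="", encoding="utf-8") as src, \
     open("archive_subset.csv", "w", newline="", encoding="utf-8") as out:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Keep only rows the retention schedule says we may (and must) hold.
        if within_retention(row["case_type"], row["closed_date"]):
            writer.writerow(row)
```

The filtering is trivial; the work is in agreeing the schedule itself with the records officer before you run anything.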

*= Update: In between my writing this article and proofreading it, a couple of people did pitch into the discussion and it became quite lively with some very well-made points.

Blame

One of my least-favourite traits in a project environment is arse-covering. The division that exists in a lot of methodologies between the project and the business tends to foster a mentality that says, “Well, the business aren’t engaging and there will probably be problems as a result, but at least we can show we did everything we were supposed to do”. When contractors are involved, it’s exacerbated further, because they don’t expect to be around when the excrement hits the air-conditioning.

Project teams succeed together or they fail together, and if anyone sounds like they’re planning who they’re going to blame after completion, that’s one of the surest signs that the project manager needs to step in. If the project manager is the one at fault, you’re all doomed and it’s time to look for a new contract. Leave now. Don’t stop to clear your locker, just go.