Exporting Form Data as PDFs

I’ve come across a few solutions to the problem of how to take form data from a variety of sources and turn it into flat files. This is quite often needed in migrations out of case recording systems. Forms may be used for assessments, referrals, and other processes, and the number of templates in use tends to be pretty high, so the effort involved in transferring answers cleanly into an equivalent in the new system is far too great to be considered seriously (remember the fourth rule in the Practical Data Migration methodology!). 99.999% of it is usually unneeded by reporting but needs to be easily accessible by the practitioner, so what’s needed is a general-purpose process that can take all the questions and answers and apply a standard format to them so that they are readable and easy on the eye.

Most of the methods I’ve seen in use seem to be slow (using the supplier’s own export feature, for example, I’ve seen one project creak on for months) or expensive (buying third-party software) or both. Luckily, our old friend Talend can be used for this sort of job with a little bit of cunning. The approach taken is to embed the form questions/answers in HTML tags, which are pretty easy to use and format, and then pass the result into a Java library that is capable of taking HTML and converting it to PDF. I’m sharing it online because I think it’s something that could be helpful to other LAs, and maybe save them a bit of cash. God knows, there’s little enough of it about in these austere times. Talend is free if you only use one developer (there is a paid version for more complex projects too), so it’s a cost-effective option. If you don’t have the skills in house and need someone to plumb it in for you, drop me a line to discuss terms.

In order to follow this tutorial you’ll need a few prerequisites:

  • Reasonably sound understanding of Talend
  • Decent SQL skills, probably including the ability to write functions in T-SQL, PL/SQL or whatever is your tipple of choice.
  • Ability to construct a reasonably sound html document.
  • No fear of Java. You don’t really need to know much about it, just not freak out if you see it.

OK? Still here? Shall we have a go?

Get hold of the Java Libraries

Download the zip file containing all the jar files from the latest version of openhtmltopdf here

Unzip the archive to a folder

Use tLibraryLoad to load all of the jar files one after the other. You’ll probably find that one of the jars – commons-logging – already exists in Talend so you won’t need that one, but the others all need loading like this:


Then you’ll need to build a workflow that follows this sort of pattern. (I’ll list the components in the order shown in the screenshot, with the names they have there, so you can relate the narrative to the flow, but obviously the names themselves don’t really matter so much.)

Make a Workflow


list document ids

An input listing all the document IDs. In my case, that’s an MS SQL Server input containing a query that has just two columns: the document ID and a text string I can use as a name. In reality, you’ll probably want some more metadata, such as the date of creation, who created it and so on, as most load scripts will want this kind of info for each file they load.

take each in turn

A tFlowToIterate component. In other words, it will take all the rows handed to it by the previous component and pass each one in turn to the following stage, so it can deal with them one at a time.

select rows as html

This is going to be by far the hardest component to build, since it is effectively trying to build an HTML document out of the various parts of the form, and the result must be complete and valid XHTML (very important, that – it has to be XHTML because the Java will want it to be valid XML as well as decently formatted HTML).

Let’s say you have a super-basic structure where each form has several sections, each section has several questions, and each question may or may not have an answer, which can be either a string or a date. You’d have something like this:



"select
s.id page_order,
q.id question_order,
'<span style=\"color: #00006B;font-weight: bold;padding: 5px;\">' +
dbo.dm_xml_cleanse(the_question) + '</span>' as opener,
coalesce(a.the_text, convert(varchar(32), thedate, 103)) as stringans,
'<br/>' closer
from forms f
inner join sections s on s.form_id = f.id
and f.id = "+((String)globalMap.get("form_id"))+"
inner join questions q on q.section_id = s.id
left join answers a on a.question_id = q.id "

In this code, form_id is just the name of the column in which the unique identifier of the form is held. That globalMap.get() call retrieves it in order to limit the output to only the sections/questions/answers relevant to the specific form instance being processed right now.

Note that each row has a section order and a question order (which I am assuming to be _in_ order – i.e., sorting on these two will get the questions into the order they appear in the front end). The text output is in three parts too: a beginning, a middle and an end. It doesn’t have to be like that, I just find it easier if I have the opening tag in one, the text in the second and the closing tag in the third.

Obviously the schema for this query will need to give all those text strings plenty of elbow-room, otherwise the text will be truncated and nobody wants that.

dbo.dm_xml_cleanse is a T-SQL function whose job is just to replace dodgy XML entities with better equivalents, and replace line breaks with <br/> tags. Obviously, if the text already has markup within it you’ll need to deal with that first.
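The T-SQL itself isn’t reproduced here, but the logic is simple enough to sketch. Here’s a rough Java equivalent (the class and method names are mine, purely for illustration): escape the characters that would break the XML, then turn line breaks into <br/> tags.

```java
// Hypothetical Java equivalent of the dbo.dm_xml_cleanse T-SQL function:
// escape XML-special characters, then turn line breaks into <br/> tags.
public class XmlCleanse {
    public static String cleanse(String text) {
        if (text == null) {
            return "";
        }
        return text
            .replace("&", "&amp;")   // must come first, or the other entities get double-escaped
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;")
            .replace("'", "&apos;")
            .replace("\r\n", "<br/>")
            .replace("\n", "<br/>");
    }

    public static void main(String[] args) {
        System.out.println(cleanse("Mood: \"good\" & settled\nSeen at home"));
        // Mood: &quot;good&quot; &amp; settled<br/>Seen at home
    }
}
```

As noted above, if the source text already contains markup of its own, you’d need to strip or handle that before applying this kind of blanket escaping.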

Of course, these tags on their own aren’t enough: you’ll also need to union the query with other sections, which will include things like a header section (section and question ID both set to 0 so it appears at the top) and a footer (ditto, but set to 100000000).

The header should have something like this in its text output

'<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">
<html xmlns=\"http://www.w3.org/1999/xhtml\">
<head>
<title>Migrated Form</title>
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />
<!-- insert stylesheet data here -->
</head>
<body>'

and you can add headings, divisions and other bits by unioning other queries in there: a query on only form and section with a 0 in the question_order field, for example, would give you a row at the top of each section which you could use for a header. You get the idea. Just stick to standard meat-and-potatoes HTML tags with some CSS to make it look more readable.

You can use images like this

<img src=\"file:///E:/mylocalfolder/pdf_logo.png\" style=\"float: right;\" />

For some reason, I couldn’t get this to work if the image was on a network drive, though – only a local one.

You could also look at odd css tricks that I guess were developed for printable pages and are not commonly used in standard web format. For me, the most useful was

page-break-before: always;

…which I used in the style tag of the division surrounding each new section in order to break it up better and stop a lot of ugly splits in mid-question

join three parts together

All this is doing is reuniting the three text fields in each row into one single field

sort in order

Sort on the section_order and question_order fields to get all the rows in the correct order with the header at the top, the footer at the bottom and so on

aggregate all rows

Uses a “list” aggregate function to amalgamate all the text fields in all the rows into one huge text string

Set Global Variable

Assigns that text string to a single (and very large!) variable that can then be re-used in various places. Obviously, make sure the list separator is a space and not any other character, otherwise it will bollox up the XML.
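In plain Java terms, the last two steps boil down to something like this (a minimal sketch – the class name and sample fragments are mine):

```java
import java.util.List;

// Sketch of what "aggregate all rows" plus "Set Global Variable" amount to:
// join every row's HTML fragment into one big string, using a space as the separator.
public class AggregateHtml {
    public static String aggregate(List<String> fragments) {
        // A space keeps the XHTML well-formed; any other separator would leak into the markup.
        return String.join(" ", fragments);
    }

    public static void main(String[] args) {
        System.out.println(aggregate(List.of("<body>", "<p>Q1: yes</p>", "</body>")));
        // <body> <p>Q1: yes</p> </body>
    }
}
```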


Create the File

A tJava component where the file is actually created. There’s a small amount of Java here that does the work.

First of all, on the imports tab, you need to write

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import com.openhtmltopdf.pdfboxout.PdfRendererBuilder;

Then, on the code tab, this code, which will actually write out a PDF file containing your form.

OutputStream os = new FileOutputStream("E:\\myfilepath\\"+((String)globalMap.get("filename"))+".pdf");
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.withHtmlContent(((String)globalMap.get("theHTML")), "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd");
builder.toStream(os);  // send the rendered PDF to the output stream
builder.run();         // perform the actual conversion
os.close();



“filename” is the column in the “list document ids” component that contains the name of the document to be created. Obviously this should be unique so either the file path or the name should probably contain the unique identifier for the form so that there’s no risk that one form overwrites another.

“theHTML” is the name of the huge text variable created in the “Set Global Variable” component. If the HTML isn’t valid XML, it’ll explode, so while you’re in test mode there’s an extra bit of code that can be used to write the HTML into a separate file before converting it to PDF. It’s best to paste this ABOVE the previous block:


File file = new File("E:\\pdf_exports\\asm\\"+((String)globalMap.get("filename"))+".htm");
if (!file.exists()) {
    file.createNewFile();
}
FileOutputStream fwr = new FileOutputStream(file);
byte[] toBytes = ((String)globalMap.get("theHTML")).getBytes();
fwr.write(toBytes);  // dump the raw HTML so it can be inspected and validated if the conversion fails
fwr.close();




If you get an error from the tJava Component, you should be able to find the file that was created just before the error. You can upload it to the W3C’s XML validator which will tell you what is wrong with the XHTML you constructed and you can keep tweaking till it goes through.
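If you’d rather catch malformed output before the renderer does, you can also check well-formedness locally using the XML parser built into the JDK. A minimal sketch (class and method names are mine):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Quick local check that a string is well-formed XML before handing it to the PDF renderer.
public class XhtmlCheck {
    public static boolean isWellFormed(String xhtml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return true;
        } catch (Exception e) {
            // any parse error means the XHTML needs fixing
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>fine</p>"));  // true
        System.out.println(isWellFormed("<p>unclosed"));  // false
    }
}
```

One caveat: if the string includes the DOCTYPE, the parser may try to resolve the external DTD over the network, so it’s simplest to run this check on the document body without it.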

Taking it Further

As you can see, this is a basic outline. To turn it into a complete working process you’ll probably want a bit more, such as an output that takes the metadata of each file created and stores it in a table (along with the file path), subject to confirming that the file was created successfully, so you can track what you’ve captured so far. Subsequent loads will then check that table, disregard any that have already been captured and not subsequently updated, and then start creating new files based on what’s left. If you have a gazillion forms they most likely won’t all fit into a single folder, so you’ll also need to create a function to construct a path to the folder each belongs in. This is mildly annoying because I found I couldn’t get Java to create the folders in a nice, easy way, so I ended up adding, just before the tJava component, a tFileTouch component to make a temporary file (this WILL create the folder tree automatically) and then a tFileDelete to get rid of the temp file, leaving just the folder behind so that the Java can drop the file in the right place.
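For what it’s worth, the java.io.File API does have a one-call way to create a folder tree – mkdirs() – which may work in your setup as an alternative to the tFileTouch/tFileDelete trick. A minimal sketch (the path is just an example):

```java
import java.io.File;

// Create the whole folder tree for an output file in one call.
public class MakeFolders {
    public static boolean ensureFolder(String path) {
        File target = new File(path);
        // mkdirs() creates any missing parent folders; returns false only if nothing was created
        return target.exists() || target.mkdirs();
    }

    public static void main(String[] args) {
        System.out.println(ensureFolder(System.getProperty("java.io.tmpdir") + "/pdf_exports/asm/12345"));
    }
}
```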

Building a List in Talend to Use as a Filter in Another Component

This is an answer to a question about Talend that was posted on StackOverflow. I wasn’t able to post as many pictures as I needed due to house rules in place, so I have moved the whole thing here and linked back to it. The general gist of the question was that there was a mahoosive table that would have eaten too much memory if its entire contents were dragged into Talend, so the user wanted to be able to take some values from a different table, string them out into a list, pop them into a context variable and then squirt them back into another SQL query so that he ended up with a smaller set of results to work with. With me so far? OK, read on!


This should be possible. I’m not working in MySQL but I have something roughly equivalent here that I think you should be able to adapt to your needs.

As you can see, I’ve got some data coming out of the table and getting filtered by tFilterRow_1 to only show the rows I’m interested in.


The next step is to limit it to just the field I want to use in the variable. I’ve used tMap_3 rather than a tFilterColumns because the field I’m using is a string and I wanted to be able to concatenate single quotes around it, but if you’re using an integer you might not need to do that. And if you have a lot of repetition, you might also want to get a tUniqRow in there as well to weed out the duplicates.


The next step is the one that does the magic. I’ve got a list of values like A1, A2, B1, B2 and so on, and I want to turn it into 'A1','A2','B1','B2' so I can slot it into my where clause. For this, I’ve used tAggregateRow_1, selecting “list” as the aggregate function to use.
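In plain Java, the tMap + tAggregateRow pair is doing something like this (a sketch only – the class and method names are mine):

```java
import java.util.List;
import java.util.stream.Collectors;

// Wrap each value in single quotes, then list-aggregate with commas,
// mirroring the tMap concatenation and the "list" aggregate function.
public class BuildInList {
    public static String build(List<String> values) {
        return values.stream()
                .map(v -> "'" + v + "'")
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(build(List.of("A1", "A2", "B1", "B2")));
        // 'A1','A2','B1','B2'
    }
}
```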


Next up, we want to take this list and put it into a context variable (I’ve already created the context variable in the metadata – you know how to do that, right? If not, here’s a quick rundown). Use another tMap component, feeding into a tContextLoad widget. tContextLoad always has two columns in its schema, so map the output of the tAggregateRows to the “value” column and enter the name of the variable in the “key”. In this example, my context variable is called MyList


Now your list is loaded as a text string and stored in the context variable, ready for retrieval. So open up a new input and embed the variable in the SQL code like this:

from MY_SECOND_TABLE where the_selected_row in ("+context.MyList+")"

It should be as easy as that, and when I whipped it up it worked first time, but let me know if you have any trouble and I’ll see what I can do.

Merging TIFFs into PDFs Using Only Free Software

I had a tricky problem a while ago and nobody seemed to know how to do it so when I worked it out, I thought it might be fun to post a how-to here for other people to crib from and take the credit. Wait, is this such a great idea? Oh well, never mind, here goes…

The challenge is to take a group of scanned pages from a document management system and prepare them for migration into Servelec Corelogic’s Frameworki/Mosaic product. The documents are scanned on a page-by-page basis as TIFFs, and the objective is to merge the pages into a single file, either as TIFFs or as PDFs in a new folder, with the paths held in a database table. In this example, I’ve used nConvert, which is largely free, although if you use it commercially you should buy a license. There’s another free program that I believe can do the same job, although I haven’t specifically tried it – namely Irfanview.

The general strategy is:

  • List the files and where they’re stored in the file system or EDRMS
  • Use T-SQL or PL/SQL to write a command-line function to group all the individual files (pages) together and merge them into a single file in the file system
  • Pass the location of the new file to the import process.

Starting in Talend Open Studio, the first step is to create a new job using the tFileList component as the starting point, to get a list of files in the folder you’re interested in.


Use an iterator to connect to the next step – a tFileProperties component, which you can use to get the file properties of each file in turn. Check the image below for the format to use. You can use this to store the details of all the files in a table called – in this example – FILE_FILESYSTEM.


To move to the next stage, I’ve used a T-SQL function to create a shell command that does two things: first, create a new folder for the files to live in, and second, invoke a third-party app called nConvert to merge the pages into a single file. In the command below, you can see the “md” command being used to create the folder; nConvert can then be called to either merge the files or to merge and convert them to PDFs.

cmd /c cd C:/test/smart_files/ &
md ID &
cd ID &
md 64398 &
nconvert -multi -out tiff -c 5 -o C:/test/smart_files/ID/64398/164994_v1.tif U:/00707000/00706853.tif U:/00707000/00706854.tif U:/00707000/00706855.tif U:/00707000/00706856.tif U:/00707000/00706857.tif U:/00707000/00706858.tif U:/00707000/00706859.tif U:/00707000/00706860.tif U:/00707000/00706861.tif U:/00707000/00706862.tif U:/00707000/00706863.tif U:/00707000/00706864.tif U:/00707000/00706865.tif U:/00707000/00706866.tif U:/00707000/00706867.tif U:/00707000/00706868.tif U:/00707000/00706869.tif U:/00707000/00706870.tif U:/00707000/00706871.tif U:/00707000/00706872.tif U:/00707000/00706873.tif U:/00707000/00706874.tif >>C:/test/output.txt

In the example above, I’m just merging them, but it’s simple to merge them as a PDF by just changing the output format to

-out pdf
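The T-SQL that assembles the command isn’t reproduced here, but the string-building follows a simple pattern. A rough Java sketch (the class name is mine; the switches are copied from the example above):

```java
import java.util.List;

// Assemble an nconvert merge command from an output path and a list of page files,
// mirroring what the T-SQL function concatenates.
public class NconvertCommand {
    public static String build(String outFile, List<String> pages) {
        StringBuilder cmd = new StringBuilder("nconvert -multi -out pdf -c 5 -o ").append(outFile);
        for (String page : pages) {
            cmd.append(" ").append(page);  // each source page goes on the end of the command
        }
        return cmd.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("C:/test/out.pdf", List.of("p1.tif", "p2.tif")));
        // nconvert -multi -out pdf -c 5 -o C:/test/out.pdf p1.tif p2.tif
    }
}
```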



The content of the table can then be split in two: the bulk of the table can be passed to the import process, while the last column – containing the output of the T-SQL function – is passed to a shell command using a tMap component:


into an iterator….


The iterator then passes the output of the function into a shell command and merges the files into a single file in the specified folder.

You now have a list of merged files in a format the import process can understand and a folder containing the merged files, all stored in the place in which the import process expects to find them. It should be straightforward to simply run the load procedure and scoop up the merged file into Mosaic.

Christmas Gift Ideas

It’s that time of year again, and project managers up and down the country are wondering what to put in their team members’ stockings. Well, have no fear, here’s my must-have gift-giving guide for the Data Guru who has everything.

1. A Better Computer

Is your data migration lead’s brow furrowed? Does he spend hours staring at his screen clenching and unclenching his fists as the record count ticks from 100 to 200 on a 10,000,000 record load? This might be a sign that the refurbished Pentium III laptop or the virtual box accessed through a dumb terminal that you thought would be so much more cost-effective than a new Dell wasn’t such a good choice after all.

As data flies in and out of it, headed for the target database, both of the test machine’s kilobytes fill up immediately and it starts furiously swapping to keep up. The lights dim, the smell of burning fills the air, the development computer fails to respond to mouse-clicks, the screen fades to grey. This is when that lovely, Christmassy scarlet colour can be seen in the cheeks of your colleague.

Why not log in to the purchasing portal and order a better computer? What it costs you will be more than made up in fees as work gets done more quickly and doesn’t spill over into extra days and evening work.

2. Talend Data Integration Suite

OK, so Open Studio is the best £0.00 you’ve ever spent, but there’s a whole other level of greatness!

3. The Force Awakens Tie-in Poster

The power of the Force (AKA The Disney Corp) has reached into the world of data migration, producing a system even more powerful than PDMv2, and now you can buy inspirational posters based on the movie script to help motivate your data migration lead to fight the power of the dark side.

4. Another Spreadsheet

This one is a perennial favourite, and ultimately what most data migration professionals are given every year. We’ve all seen this heart-warming yuletide scene: Late December, a few scant weeks before go-live, and the project team are pulling on their coats, ready to go down the pub for their Christmas do. As if suddenly remembering something, one of the BAs turns and says

“Oh by the way, I’ve just emailed you a spreadsheet the business have told me about. It has mission-critical data on it and they absolutely can’t go live without it. Merry Christmas!”

…and with that they are all gone, leaving the vision of a slowly turning egg-timer reflected in the tears of – one assumes – pure joy, streaming down the data migrator’s face.

Happy Christmas…. And remember, we’re making a list, we’re checking it twice….

Indiana Jones and the Last Spreadsheet

We recently introduced our ten-year-old daughter to the Indiana Jones movies. Even the fourth one, but let’s not talk about that. At the very end of the original Raiders of the Lost Ark, there’s a scene in which the Ark of the Covenant is boxed up and placed in a warehouse surrounded by what looks like tens of thousands of identical boxes. The modern equivalent of its desert resting place is not an underground tomb, guarded by snakes and poison darts, but a total immersion in endless, bureaucratic sameness. Now, I don’t know if you’ve ever tried to implement a “Spreadsheet Amnesty” but if you have, you’ll know it is exactly like that. Ex. Act. Ly.

For the uninitiated, a spreadsheet amnesty is an essential part of any data migration project. Essentially, the problem you have is that the old system you’re setting out to replace is total rubbish. After all, if it wasn’t, you wouldn’t be there. The staff are clever people with a job to do, and they can’t be hampered by this bad system, so they’ve created all manner of spreadsheets, Access databases, and who-knows-what to record all their day-to-day working data in. If you ignore it, you’ll be starting out with an incomplete system, missing key data. So, you have to ask everyone to identify their data sources so you can scoop them all in and use them to plug gaps in the main database. Sometimes, you might come across political impediments; in all likelihood, internal business teams have been waging a war of attrition against these spreadsheets for years, and I’ve heard of cases where, once those teams get onto the project board, there’s been a desire to exert some influence to “punish” the offenders by ruling these contraband data sources invalid and out of scope. If you go down that route, you are condemning the project to repeat the mistakes of the past. Hence the name “amnesty”. No blame. Everyone is welcome, and so is their data.

But politics aside, when you go out and trawl through the spreadsheets, it can be like searching for a needle in a large stack of identical needles. Or, if you prefer, a sacred relic in a warehouse full of fake sacred relics. The trouble is, there are often hundreds of spreadsheets, each seeming to capture a slightly different view of the same data, and a lot of the time it’s hard to pick out the ones that hold unique data being used to drive a specific business process, rather than just a mish-mash of items drawn from other places. I thought it would be helpful to list a few questions that are worth asking when deciding which box to pry open in your search for the ark.

So you visit the team room, and ask them to show you the spreadsheets they’re using day-to-day in addition to recording in the case recording system, which for the sake of argument, I’ll refer to as “Disappointech”. As each one is produced you ask:

Is this a report from Disappointech?

If the answer is yes, it’s not a data migration source. Whatever is in it must be in the core system already. As an aside, though, if this is used as part of business-as-usual, you need to make sure the business analysts who are configuring the software are aware of it, and that the new system will produce an equivalent report.

Is there anything here that’s not in Disappointech?

If the answer is no, the team might just have been compiling or copy/pasting the information into a spreadsheet for ease of use. Again, there’s nothing here that you need to migrate but it’s worth thinking how the new system can save them having to do such a tedious, time-consuming chore. Maybe a report could help?

Is it information you need to have in the new system?

OK, so we’ve established that this is unique information that can’t be found in Disappointech and only exists in the spreadsheet. But does it belong in the new system? If it’s about a social worker’s annual leave, say, or contributions to the coffee fund, or the staff Eurovision sweepstake then the answer is probably no and you can move on. More likely, it’s something borderline: information that’s pretty close to the type of data you need to bring in but not quite in scope, and you might have to get a project decision on which side of the fence it lies on so you don’t get into “scope creep”.

Is the data about a specific person? 

If the data is aggregated – for example, if it shows such-and-such a percentage of visits done on time in each month, but you can’t identify specific dates for a specific service user in there – then it is not likely to be usable. Most social care systems have to associate each item with a specific client record, and it isn’t possible to go from a statistic back to the source data that went into producing it.

What’s the quality like?

Having established that there’s something here that the project needs to have, you’ll need to assess whether it’s tidy enough to import. Is each service user’s Disappointech ID present and correct? Is each item on its own row? Do columns with dates in all have real dates or do some of them have things like “To Be Confirmed” or the dreaded “N/K”? Do the team store all the data in this one spreadsheet or do they have multiple copies, say one for each financial year?

If the quality is low, you’ve got a few choices. You could improvise and write some fairly complicated code that tries to work around the problems. That’s a high-risk strategy because it’s quite likely to need a lot of maintenance as new rows get added to the sheet, introducing new problems. A second option is to ask the team to transfer the data to a new template which you can set up with lots of validation, so that it has to be filled in in a certain way. Lastly, it’s always worth considering whether all this pain is necessary and whether – if the list is fairly short – they could manually transfer the data by keying it in on day 1, perhaps with a bit of help from someone who knows the new software well. It’s surprising how often this is the best solution for everyone.

OK, I think that covers Excel spreadsheets. I was going to write about how to cope with poorly designed Access databases as well, but instead, here is a visual metaphor to describe what that feels like:

“Abandoned To Their Fate” – File on 4, 2/6/2015

Earlier this month on BBC Radio 4, File on 4 broadcast an investigation into the state of provision for Care Leavers in the UK, ahead of the National Audit Office’s report on the same subject, which is due out later in the summer. I’ll link to the audio at the bottom of the page in case you’re interested.

The whole thing was deeply saddening, just because of the subject matter: young people – children really – who find themselves, through no fault of their own, in a situation where they have had to be taken away from home and placed in public care. The stories are many and varied, and of course children’s experiences differ wildly. Many will end up with utterly lovely carers, many will return home in a few days or weeks, but whatever the situation, it’s hard to imagine ever being able to come through that kind of separation completely emotionally unscathed, especially when it lasts for years and involves a series of placements in homes of varying quality. The programme focused on people at the upper end of that age group, in their mid-to-late teens, when they seem more independent, able to cope even with being temporarily housed in a B&B instead of a carer’s home, when perhaps people around them are starting to forget that they are needy and hurt and to see them as troublemakers or delinquent in some way. There was a real sense of damage done and lives dislocated in the stories the interviewees told of life on the fringes of the care system.

For those of us who work in and around the social care sector, we know that social work teams are incredibly hard-working and committed to their role. It’s not the sort of job you go into for the money and the glory (just ask the tabloids…) so if that’s a person’s chosen career you can usually assume there’s an underlying core of dedication there. The programme raised some interesting points though, both for social workers themselves about case management, and for policy makers about how resources are allocated to these services, because, after all, you can only stretch a team and a budget so far, no matter how well-managed it might be and how dedicated the individual staff members. I can’t pretend to have any answers on this of course.

What did strike me, as a data person, were the figures that came up in the discussion, comparing numbers of care leavers in suitable accommodation across various local authorities. The numbers vary wildly. Now, I’ve been around a bit and I’ve seen how some of this data is collected, across maybe a dozen local authorities in my time, to one degree or another, and I didn’t place a lot of significance in the numbers. Here are a few general impressions, picked up from the whirl of past experience:

Cases like this are usually managed by a specialist team, maybe serving all looked after children, or maybe even more specialised, covering care leavers from the age of fourteen or so (when preparation for adulthood begins in earnest) up to about twenty-five in some cases, depending on the situation. Often the teams are keeping their eye on the ball just fine, down at the local level in Looked After Reviews or Pathway Plan Reviews, where a young person’s situation is discussed. However, the government can’t really assimilate that level of detail. All it wants at year end for its SSDA903 Return are a few codes relating to whether or not the authority is in touch with the young person, whether they are properly housed, in employment and so on. Getting those codes is where the problem lies. A really green performance manager (and I must say, I’ve never met one) might just assume that whatever they get out of the central case recording system is the gospel. It isn’t, though, because in their creaky old legacy systems, either there’s nowhere in that system to record the codes needed (that’s often why I’m there: to help replace it with something better) or else they aren’t well understood. In rare cases where they are fully recorded, they might be out of date, or else other recording or interpretation errors have crept in as the social worker tries to relate the building they have just placed a child in to the list of options in the dropdown menu before them. More hard-nosed performance managers, with a detached, world-weary look, will chase people to update the system and trust that these entreaties are acted on, and of course there will be others who just give up and send round questionnaires to be filled in offline and sent back.
So, in short, when I hear that Council X has 100% suitable accommodation and Council Y has only 35%, I don’t necessarily assume that Council Y is worse than Council X, because there’s a complicated network of factors in play, and the difference might come down, largely, to the efficiency of the data-gathering machine. It tends to be that some areas of reporting are more scrutinised than others, and it might be that with more emphasis on Care Leavers recently in the press, both Council X and Council Y will dig into the data more deeply and apply thumbscrews to a couple of deputy managers to get more clarity on the numbers, as certainly happens when other figures (timeliness of Child Protection Reviews, say) dip below a hundred percent.

What lessons can I draw from this? Well, as someone involved in preparation for new systems, I guess there are two things. Firstly, with this kind of data, never assume it’s held in the system; it’s a prime example of the items to be addressed during the data migration project’s “Spreadsheet Amnesty”. It needs to be gathered in and assimilated into the body of the migration. Secondly, as far as possible, new systems should be designed in such a way that social workers don’t have to spend all their time “feeding the beast”. The recording system should be easy to complete, with no hidden backwaters, and all the relevant statistical data should be extracted from normal case work instead of needing extra recording. Why should I care about the poor social worker doing a little extra work? Well, because if the team is overstretched already by government cuts, every hour spent at a terminal is an hour less spent with the care leaver, helping them sort out a bed to sleep in, a college place, or whatever it might be. Whichever way you slice it, that has to be a good thing: for the care leaver, for the taxpayer and for the social worker. Social workers have feelings too, you know!

Simple case recording also enables more consistency. Whether or not you agree with the idea of government returns and their role in a centralised command-and-control system, it doesn’t help anyone if certain councils are named and shamed on File on 4, not because they are serving their customers poorly, but because stats are being collected haphazardly. Now, of course, I haven’t studied the councils named in the report. I’ve worked at a couple, but not recently enough to be able to say anything specific, and I’m certainly not here to take sides for or against them. All I know is my data-geek sense was tingling when I heard that particular part of the documentary, and that’s what made me sit down and plan this blog post.

Audio: File on 4 – Abandoned to Their Fate

Next month the National Audit Office is due to report on the outcomes for young people leaving care. There are claims that, under financial pressure, local authorities are pushing too many teenagers into independent living before they’re ready. File on 4 investigates new figures that suggest many young care leavers are failing to cope – with large numbers ending up in custody, homeless, sexually exploited or pregnant. Social services chiefs say the welfare of care-leavers must be a key priority for the new government. But who holds them to account when they fail those they are meant to have looked after? And, with more cuts on the way, can the system cope? Fran Abrams reveals how hands-off caring can have tragic consequences.

Talend Open Studio – Product Review

Even Talend’s lime-green starfish logo on my desktop is most beguiling

When I did my first ever migration, I used a tool that had been designed for the purpose by another developer in Microsoft Access. It was interesting and certainly simple enough for a beginner to use, and I think it was probably the right tool at the time, but it had some major flaws. It was slow and temperamental. Sometimes when working with large text fields, I’d have to take a LONG (an outmoded Oracle text type) and put it into a CLOB, both pretty awkward things to work with, but when you’re using an Access Memo field as your bridge, all bets are off. It would chop off great chunks of text but not tell me there was a problem. “Oh yes, Colin”, it would say, cheerfully, “I’ve loaded all the rows, don’t you worry”. Only later would I be told that when the testers looked, all the notes stopped abruptly at the 4,000-character mark. It was a bit picky about what inputs it would use, too: CSV or nothing. In short, it had its limitations. These days, I need something with a bit more oomph, and I think I’ve found it in a fantastic product called Talend Open Studio.

Talend is an open source product built around the Eclipse platform. It is made up of several distinct components. I use the data integration tool and the data quality tool, but they come in Cloud, Big Data, MDM and various other cool flavours too, if you need those things. And the best part? The products are all free to use if you are working at the scale of a single developer. The company even runs a very good online support system which is open to the free customers too. So how the heck do they make money? Speaking as someone who once tried to get them to take some of mine, I’m not really sure, because their very lovely, helpful programmers didn’t seem to be all that interested. However, they must be doing it, because I’ve been to their office and it was very nice. The revenue comes from a fully integrated platform product which does cost extra and which allows you to move swiftly from data quality to data integration with zero faffing about, collaborate with other developers, and generally scale the whole thing up. I can’t tell you how utterly impressed I am with this attitude to distributing software. Charities and small outfits with relatively modest requirements can get by with a great product that they look after themselves, while there’s the option of going to the next level if you happen to belong to a larger enterprise.

This review is only of the free version of the two products I’ve used. They are all you need to run a data migration. Or rather, they’re all you need to handle the technical aspects. If you want to know what you’ll need for the more “projecty” aspects, get yourself a copy of “Practical Data Migration”.

Talend Data Quality Tool - Correlation Graphs
With pretty graphs like this, you can convince just about anyone of just about anything

Talend Open Studio for Data Quality

If you’re lucky enough to have any data quality in your organisation, this tool will find it! ;^)

The backbone of this version of Talend is similar to the Integration tool. You get an Eclipse workspace with folders to store various sets of enquiries into the state of the data in the system. These can work at several levels. You can do an analysis of the entire database to see how many rows are in each table and what sort of general structure it has, or you can get closer to a set of tables, looking at the relationships between the values in two columns, say, or, at an even finer level, at the patterns in the data: do the postcodes all match the regex for a valid postcode, for example? How many of the customer names are null? What’s the longest value in a text field? What are the outliers in a set of loan amounts? Are any of the dates insanely far in the future or past? Results are usually presented in bar charts, but there are other formats too, like the nice correlation shown in the picture. In each case, you can drill down to get a closer look at problem rows. This is a great way to convey to stakeholders what the issues in the data are and what needs to be addressed right at the start. The GUI is easy to use and – with a little fiddling around – you can soon get to grips with it and get some real work done.
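To give a flavour of what the profiler is doing under the bonnet, here’s a rough equivalent of three of those checks sketched in plain Python over an in-memory SQLite table. The table, columns, sample rows and the (much simplified) postcode regex are all invented for the example; in Talend itself you’d set these up through the GUI rather than writing code.

```python
import re
import sqlite3

# A toy "customer" table standing in for the real system's data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, postcode TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("Alice", "SW1A 1AA"), (None, "XYZ"), ("Bob", "M1 1AE")],
)

# Null analysis: how many customer names are missing?
null_names = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE name IS NULL"
).fetchone()[0]

# Length analysis: longest value in a text field
max_len = conn.execute(
    "SELECT MAX(LENGTH(postcode)) FROM customer"
).fetchone()[0]

# Pattern analysis: which postcodes fail a simplified UK postcode regex?
uk_postcode = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}$")
bad_postcodes = [
    row[0]
    for row in conn.execute("SELECT postcode FROM customer")
    if not uk_postcode.match(row[0])
]

print(null_names)     # 1
print(max_len)        # 8
print(bad_postcodes)  # ['XYZ']
```

The drill-down the tool offers is essentially the same idea: once a check flags a count, you pull back the offending rows themselves to show the cleansers exactly what needs fixing.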

Talend Data Integration Tool Workflow
Talend Data Integration Tool showing a main migration controller kicking off several subprocesses in turn

Talend Open Studio for Data Integration

When you’re ready to start extracting, this is the next step. The integration tool has connectors that can hook into most common database types and quite a few I hadn’t heard of. Values extracted from one database can be joined with external data, manipulated, mapped using a table from an Excel sheet and have functions applied to them before they are reinserted into a table in the target database. Exception lists and logs can be dropped into a CSV or other flat file for a cleanser to look at, and a stored procedure called to kick off any cleansing routines you might be using. The separate steps can be joined together using connectors and orchestrated any way you like. I’ve seen people do this with Visual Studio, but I’ve never been in a standardised SQL Server-only environment where that’s been an option, so it’s really good to have the same functionality but with a lot more flexibility (not to mention colour!) in this product. In short, it’s as easy as building Lego models but quite a bit more productive.
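The extract → map → load → exceptions pattern is easy to sketch. Here’s a toy version in plain Python: every table name, the mapping and the CSV file name are invented for illustration, and the little dictionary stands in for the mapping table that, in Talend, would typically live in an Excel lookup sheet with the wiring done graphically.

```python
import csv
import sqlite3

# Source (legacy) and target databases, both in-memory for the sketch
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE legacy_placement (child_id INT, accom TEXT)")
src.executemany(
    "INSERT INTO legacy_placement VALUES (?, ?)",
    [(1, "FOSTER"), (2, "RESI"), (3, "???")],
)

tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE placement (child_id INT, accom_code TEXT)")

# Stand-in for a mapping table held in a spreadsheet
accom_map = {"FOSTER": "U1", "RESI": "K2"}

exceptions = []
for child_id, accom in src.execute(
    "SELECT child_id, accom FROM legacy_placement"
):
    code = accom_map.get(accom)
    if code is None:
        exceptions.append((child_id, accom))  # route to the cleanser
    else:
        tgt.execute("INSERT INTO placement VALUES (?, ?)", (child_id, code))

# Drop the unmappable rows into a flat file for a cleanser to look at
with open("exceptions.csv", "w", newline="") as f:
    csv.writer(f).writerows(exceptions)

loaded = tgt.execute("SELECT COUNT(*) FROM placement").fetchone()[0]
print(loaded)  # 2
```

What Talend adds over a script like this is the orchestration: each of those stages becomes a component on the canvas, joined up with connectors, so the whole flow is visible and reorderable at a glance.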

In general, I’ve found it pretty reliable, stable and trouble-free. It can be a little slow, especially if you’re working outside the LAN (hint: remote desktop!). I daresay performance also depends on the resources of the machine you’re on, but you can set jobs going in parallel so as to get them all done in whatever window you have available.

What else is there?

There are other software products available of course, and in some cases you might need them. I’ll try and do some more tech reviews as and when I get the chance. I must say though, from what I’ve seen of the others, none is as well-rounded as this one. Pandora strikes me as too limited for proper ETL (although I’ll admit I’ve only seen an older version), and some of the others seem to be geared up to very specific challenges around technical limitations of moving bulk data. None of them seem like they can do a complete, end-to-end process on anything like this kind of diverse set of data sources, but by all means, if anyone knows of anything better, I’d love to hear about it. Not MS Access though. Sorry, been there, done that.

Data Vs People

Another mildly ranty one, I’m afraid:

Today some company was touting its wares on Twitter under the header “It’s unanimous: data holds the key to improved healthcare at reduced costs”, which was spammed into my timeline like so many sponsored tweets.

Now, to be fair, I shouldn’t really be singling out this company at all since a lot of software companies are similar, implying that with the right data tool, you can do healthcare (or whatever) on the cheap. I find it mildly irritating though, in this case, since it implies that boring things like – y’know – doctors, nurses, hospitals and so on are all a bit yesterday and that what really makes healthcare work is proprietary products that appeal to a certain sort of administrative mindset. Meh. Data is important but let’s not get above ourselves here: the heart of any health service is the people who work in it, their knowledge and their willingness to do the right thing.

Am I being unfair? Probably.