Working with Excel Dates in Talend

Adding to the store of knowledge on the interwebs, here’s a page about something I was trying to do but couldn’t get Google to tell me how to do, so in case it’s useful to anyone else, I’m going to write it up.

So what’s the problem?

I’m trying to get a date out of an Excel spreadsheet (xlsx) that was created by exporting data from an Amazon DocDB using NoSQL Manager. The software saves dates in a fairly common Excel format: as long decimal numbers like these:

  • 44551.750000
  • 44552.375000

Now, I know I can convert these to dates in Excel itself using the format menu, but it’s an enormous file containing social work case notes, and every time I try to open it in the very useless Google Sheets app, it crashes. So I have to get it out of there into Talend, convert it into something more datelike using Java, and put it into my database, complete with time in hours and minutes.

Why the heck do the dates look like that?

Excel stores dates as numbers, mainly to make it easier to perform calculations on them. Excel is a spreadsheet, after all, so ease of number-crunching is a valuable thing. According to Microsoft:

Excel stores dates as sequential serial numbers so that they can be used in calculations. By default, January 1, 1900 is serial number 1, and January 1, 2008 is serial number 39448 because it is 39,447 days after January 1, 1900.

Microsoft Excel Documentation cited 12/04/2022

Having tried various native Talend functions, I’d given up on them, so I decided to work out the correct date based on the definition MS gave. I built a quick workflow, taking data out of a sheet containing just a small subset of the data, applying some voodoo to it in a tMap component and displaying the results in a tLogRow. This allowed me to play with it a bit before copying the code back into my main process.

Basically, I need to get a date representing the first of January 1900 and then count forward the number of days in the decimal number. Easy enough. There are a couple of pitfalls though. For a start, the function I was using (TalendDate.addDate) expects an integer as its second argument, so passing in a fractional number of days wasn’t going to fly. To solve this, I converted the decimal number of days to a number of minutes by multiplying by 1440 (that’s 60 minutes times 24 hours). Since we’re working with BigDecimal, 1440 has to be converted to a BigDecimal too (the input value is row19.DateOfEvent, so you can see where I’m starting from).

TalendDate.parseDate("yyyy-MM-dd HH:mm:ss","1900-01-01 00:00:00"),row19.DateOfEvent.multiply(BigDecimal.valueOf(1440))

Then, after multiplication, the result can be turned back into an integer and passed into TalendDate.addDate. Here’s what I ended up with:

TalendDate.addDate(TalendDate.parseDate("yyyy-MM-dd HH:mm:ss","1900-01-01 00:00:00"),row19.DateOfEvent.multiply(BigDecimal.valueOf(1440)).intValue(),"mm")

That’s close, but all the dates came out with an extra two days added on. There are two reasons for this. First, January 1 1900 is day 1, not day 0, so the count needs to start one day earlier. Second, Excel (for compatibility with Lotus 1-2-3) wrongly treats 1900 as a leap year, so its serial numbers allow for a 29 February 1900 that never existed, pushing everything out by another day. Effectively, what that means is that instead of counting from 01 January 1900 you need to count from 30 December 1899!

TalendDate.addDate(TalendDate.parseDate("yyyy-MM-dd HH:mm:ss","1899-12-30 00:00:00"),row19.DateOfEvent.multiply(BigDecimal.valueOf(1440)).intValue(),"mm")

…and so the dates came out right!
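If you want to sanity-check the arithmetic outside Talend, here’s the same calculation as a stand-alone Java sketch. It uses java.time rather than the TalendDate routines, so it isn’t a drop-in replacement for the tMap expression, and the serial number is just one of the examples from above:

```java
import java.math.BigDecimal;
import java.time.LocalDateTime;

public class ExcelSerialDate {
    public static void main(String[] args) {
        // Excel's effective "day zero", allowing for the day-1-not-day-0
        // offset and the 1900 leap-year bug, is 30 December 1899
        LocalDateTime base = LocalDateTime.of(1899, 12, 30, 0, 0);

        // one of the serial numbers from the spreadsheet
        BigDecimal serial = new BigDecimal("44551.750000");

        // convert fractional days to whole minutes (60 minutes x 24 hours = 1440)
        long minutes = serial.multiply(BigDecimal.valueOf(1440)).longValue();

        // count forward from the base date
        LocalDateTime result = base.plusMinutes(minutes);
        System.out.println(result); // prints 2021-12-21T18:00
    }
}
```

Running it for 44551.750000 gives 21 December 2021 at 18:00, which matches what Excel shows when you format the cell as a date.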

Anyway, I hope somebody out there in time and space finds this useful, and if you do, hello from 2022! It’s a bit fraught here at the moment. I hope it’s a more tranquil time for you, whenever you are!

Between the Spreadsheets – Review

So in my most recent blog post in… when was it? 1983, I think. OK, OK, I’m not a very prolific blogger, am I? Anyway, in my most recent blog post I mentioned briefly that I’d spotted someone on LinkedIn who was about to publish a book about life in the trenches of data cleansing. That book, “Between the Spreadsheets” by Susan Walsh, is out now and I’ve finally got a copy. I tried to order it from a proper shop but they couldn’t get hold of it, so I had to swallow my pride and buy it off Jeff Bezos. Ugh. As you can see, my copy drew some interest from the local fauna.

This is your system before data cleansing

So what’s it like and is it useful for people who work on social care projects? Well, it’s pretty good really and yes, in short, it is definitely going to be helpful if you’re going into any project where data quality is a bit shonky and it’s your first time thinking about how to go about correcting it. It isn’t designed with social care data in mind, of course, but I knew that when I bought it, and I’m only looking out for general lessons so that’s fine.

There’s some good stuff about the basics of cleansing: splitting names and addresses into separate fields for example, and dealing with common annoyances. That could be useful to tidy up a spreadsheet that has been used in the wild for years by some team that used to run its own process offline and now needs to be brought into the fold. It won’t get you all the way there; for example, the address splitting wouldn’t result in something you could just load into Mosaic as-is, but it’s a good start and shows how to use the tools available in Excel to unstick things.

A lot of the examples are about purchasing and how to standardise supplier names. More than anything else, I was reminded of the organisation lists you often have to tidy up in the process of setting up the services in Controcc or rates in Mosaic’s finance module. The challenge is usually that people have entered slightly different versions of the supplier name so you have to work out which are duplicates as opposed to, say, different branches of the same parent company. I have picked up a few tricks for sorting these out already but it’s not my main task so I haven’t had to spend as much time torturing the lists as the author has. So how does this torture progress? First, she sits them in the comfy chair of VLOOKUP and COUNTIF. Then she pokes them with the soft cushions of pivot tables and visualisations. They didn’t expect that. Finally, she shows how to handle larger, more recalcitrant data sets using a tool called Omniscope to really get medieval on their ass. I’d never used Omniscope, and although I’ve used some similar tools like Power BI, I haven’t ever used them in this specific way, so that was interesting to see.

Taxonomy gets a chapter to itself. That’s something that mainly comes up in social care systems where you have sub-lists, for example in things like document classes in a document management system or within the case management app itself. It’s an area that people seldom think of in a very consistent way. Often, one group of people will sit down and design something pretty good and then it gets passed around, but instead of being refined further, other managers tend to want to bolt on extra options specific to their own teams and so you end up with some values down at the bottom that aren’t necessarily bad but are out of keeping with the initial approach. The result is usually unsatisfactory. If not corrected, end users have to put up with a slow drip-drip of annoyance caused by menus that are just slightly too long, and a bit contradictory. There’s some good stuff here that can probably give you a sense of what you’re setting out to achieve.

One small gripe: I was surprised there wasn’t more about the data validation feature on Excel’s Data tab. If the team is going to continue working with the data in Excel (in other words, if you aren’t just dealing with data extracted from a database) then it seems like a really good idea to encode some of the quality rules into the cells so that they stay clean. I don’t think it was mentioned though. I admit, I skipped over a few sections that seemed less relevant to me, so I might have missed it, but it’s not in the index so I don’t think it’s there. It could probably have used a whole chapter to itself, but maybe these kinds of things aren’t as relevant in the kinds of projects she’s been involved in.

There’s more of course, and I’m not going to go into huge detail. For those of us who have been around the block a few times there won’t be many huge revelations but even I, a world-weary veteran of a thousand migration projects, picked up a few tips that I’ve mentally tucked away for later.

Blade Runner Tears In the Rain
I’ve seen things you people wouldn’t believe. Access Databases on fire off the shoulder of Orion.

I think the main use for it, though, is probably as something you can share with partners in the business who have been drafted in to help prepare the data for migration. These are the unsung heroes I wrote about in the last blog post: the people trying to whip their data into shape through sheer perseverance. If the book can help them figure out where to start and maybe save them a few hours of unnecessary sweat and bring them, in the end, to a better result, that would be great. And I think it just might. So in future I may press it into a few people’s hands and say in a hoarse, raspy voice “The answer you’re looking for is in Chapter Two”. Things always sound better in a hoarse, raspy voice, don’t you think?

Three Cheers for Data Cleansers

A meme that would have been topical if I had made it in April

I had a dream the other night that I was trying to make a fruit salad, but in this dream our fruit bowl was doubling up as a storage bin for the compost heap, so half of it was full of luscious fruity goodness and the other half was full of horse manure. So there I was, standing in the kitchen, daintily picking out strawberries, blackberries, gooseberries and beetroots (beetroots were a fruit in this dream too; look, I don’t make the rules…) and inspecting them to see if there were any flecks of poo on them before placing them carefully on the plate.

I’m not just pretending to have had this dream for the sake of a blog post, by the way. I promise you I would never have made up anything this disgusting deliberately. Let’s not think too much about what this says about my subconscious because I suspect I wouldn’t like the answer. It feels like a pretty good analogy for data cleansing though. Do I even need to explain the parallels? We want as much fresh and clean material as possible and we definitely don’t want any “extras”. And so we need to get busy cleansing.

Data cleansers are the unsung heroes of any migration project. It’s not a glamorous job. It doesn’t earn anyone a place in the limelight, but it requires superhuman levels of patience and attention to detail. Often there are lists that scroll endlessly off the bottom of the screen and each row needs to be unpicked one by one. I remember my first time seeing a data cleanser in action. He was living at his desk, shirtless, wild-eyed, pencils stuck in his hair, muttering, yelping at sudden noises, for months on end. And here’s the thing that made me realise the depth of his Sisyphean predicament: it was all the same enormous spreadsheet he was working on the entire time! It must be tempting, when you’re performing the same repetitive action over and over again, to cut corners, make assumptions, or just get hypnotised by the monotony and make mistakes, but he didn’t. He had the bug and he wasn’t going to let a single row get lost. Data Hero. His was an extreme case, but even under normal circumstances, the struggle is real.

More often than not, data cleansing is done in-house. In practice that often means it gets delegated to team admins because they are careful but they are also close enough to the coalface to be able to get information about the clients direct from the social workers. And they invariably do a cracking job. They’ve often spent years evangelising within their teams for good case recording, having to clear up inconsistencies every April when the statutory returns fall due, having to help social workers who can’t find Mrs Ellis because someone has keyed her in as Mrs Ellace; so they know good data makes a difference. But they have day-jobs too, so the project should make sure they are properly appreciated and supported and not taken for granted or overloaded.

Of course there are many other people who could be involved: senior admins (usually the high ticket items), system support officers (where special access is needed such as merging duplicate clients), social workers (if it’s something that needs a professional opinion). Sometimes external data cleansers will be brought into the project team because there isn’t enough internal resource, but I worry that an outsider doesn’t always have the same stake in the end result, so you need to look for someone with the right attitude more than the right technical qualifications.

But whoever they are, all data cleansers are heroic in their own way, fighting an unending war against the law of entropy, armed only with a mouse, a keyboard and a strong cup of tea.

My own advice on data cleansing for migration projects is to forget everything you know about data cleansing. It tends to be something that people in social services departments only think of in the context of preparing for statutory returns, so the only data that gets looked at is the stuff that forms part of those returns: timescales for assessments, correct placement codes, things like that. This is completely understandable of course, because there’s an obligation to get it right. However, in a data migration project this kind of detail is lower priority if it’s a priority at all. After all, you’ve already submitted the returns from 2013, so it is usually not a good use of time to backload every single manual change that was made to that return into the system in the run-up to migration. In the unlikely event that someone ever needs to know, they can dig out the 2013 return and have a look at that. In terms of my dream, the kind of abstract information found in a government return like this would just be a lightly besmirched beetroot. We should be more concerned about the delicious strawberries that have somehow, over time, got squished deep into the mire and need to be thoroughly rinsed down and carefully dried on the arm of a t-shirt before they can be placed in front of a dinner guest. Ah… OK, well I think we’ve established that there’s a reason why I’m a data migration engineer, not a chef. Let’s move on.

What kinds of things should we be focusing on for data cleansing? I think it’s usually best to prioritise the data nobody has ever needed to grapple with until now because it isn’t the subject of any stat returns, but it will cause major issues in the new system or it will make the social worker’s job harder in some concrete way if it were left unfixed. Some key examples would be:

  • Who’s who? Duplicate person records will cause absolute havoc if not sorted out and each one will take a long time because you have to apply quite a lot of detective work to establish if it’s a true duplicate or just two people with similar names. Start a process for detecting duplicates early and get someone meticulous to work on the list for as long as it takes, then revisit it periodically to pick up any new ones.
  • Ambiguous casework. The classic situation here is where you need to migrate only children’s service data out of a database that includes both adults and children. How do you know which segments of casework to migrate and which to leave behind? It will often come down to who has recorded it, or some identifying detail of the referral that led to the case being open in the first place. So far so good, but in a lot of cases, we find that there are ambiguities and grey areas and someone needs to spend some time delineating where one set of involvements stops and the other starts.
  • Bad addresses. You’ll usually have to do some matching with AddressBase or the NLPG or some other gazetteer, but depending on the system, some of the addresses might be too sloppily entered to be matchable, so someone will need to do some legwork, especially on current addresses, to ensure you know where to send letters and social workers know where they need to go on their home visit.
  • Bad numbers. Things like Unique Pupil Numbers (for children’s services) and NHS numbers (primarily in adults) will often be used as a link to other related databases within the department. Obviously there is no way of knowing for certain that a given number is wrong, but it’s usually possible to detect anomalies like duplicate IDs or IDs that don’t fit the allowable format given by the government, or a UPN listed against someone who is clearly too old ever to have had one. Once produced, the list can be passed on to someone with the patience to unravel the mysteries therein.
  • Bad dates. Again, we can’t know if a specific date is wrong, but people born in 0213 or 2089 will usually be easy to spot, for example. Then there are the people who are placed in older people’s teams but are only three years old.
  • Casework for deceased clients. These are usually more common in adult services than children’s for obvious reasons. If there is a financial implication it’s more likely to have been spotted but it’s surprising how often you’ll find ongoing services listed for deceased clients, to say nothing of ongoing assessments, allocations and so on. Closing these down in the process of migration is usually straightforward by simply putting in the date of death as an end date but it’s often something the business would rather tidy up sooner rather than later of course.
  • General sloppiness. Things like squeezing three phone numbers into the same box instead of listing them separately or adding a surname like “Smith aka Jones” instead of adding Jones separately under the aliases section. These kinds of errors are mildly annoying and can affect the ability to search for a particular service user, but it’s seldom a high priority and realistically it’s quite rare that any local authority department has enough time to start putting things right to this level of detail unless it’s bad enough to cause validation errors in the new system.

These are all areas where the scope for automatic cleansing might be limited, so a real human being will need to roll up their sleeves and get to grips with the case recording. And those people are heroes and we must cherish them!

I follow someone on Linkedin who has recently written a book about data cleansing and it’s due out in a few weeks. I’m not sure how I’ve ended up following her because I have never met her or worked with her, but she seems to be very good at marketing herself so here I am in her network! She’s called Susan Walsh and her book is Between the Spreadsheets, which is a pleasingly cheeky title for what is arguably not the most entertaining subject in the world, and I must say, if she’s as good at data cleansing as she is at ensnaring new Linkedin followers, this should be a good read. I’ll be interested to see what she has to say and how much of it is directly applicable to the kinds of project I usually work on. There is a section in the Practical Data Migration framework that deals with data quality rules, but it’s more from the point of view of meeting thresholds for migration into the new system, rather than the nitty gritty detail of how you get to grips with actual… well, to continue my analogy, the actual fruit-depooping. I’m crossing my fingers for a solid, simple, replicable methodology that will make it easier for me to help customers see their way through to data victory. That’s the dream, Susan. No pressure!


I’m re-doing an old site and wanted to rescue this page from it because although it’s a bit old, I think it’s worth preserving these “how-to” pages on the internet because there might be, somewhere, sometime, someone doing the same kind of thing who might find this useful, who knows…


Recently, I needed to set up a computer running Linux (Ubuntu, Gutsy Gibbon), so that it could be remotely controlled from a Windows laptop. The box was going to be disconnected from its keyboard, mouse and monitor, so all the screen output and all the keyboard and mouse input would have to go through Windows too. Ideally, an authorised user should be able to use the “headless” machine as if he or she were actually sitting in front of it even though, in reality, it was hidden away in a cupboard out of the way. At the same time, it was essential that non-authorised users should be shut out completely.

I found the online documentation good but widely dispersed, and I had to hunt around and ask a lot of questions to finally get all my penguins in a row, so I thought it would be a good idea to collect it all in one place for the benefit of anyone else who might be trying to achieve the same thing.

What is VNC?

VNC is a remote-control device for computers. You might have seen it – or variants of it – if you’ve ever had your computer repaired by someone using a remote control/remote desktop connection to take control of your mouse cursor and make changes that way. It’s really helpful if you want to be able to run more than two computers but don’t necessarily have space for both in the office. One can go in the cupboard under the stairs but you can still use it by simply opening up a new window on the other using VNC.

So what’s the catch? Well, there’s a pretty obvious one in that it allows you so much control that if you just leave it open for anyone to access then someone might be able to take over the machine and do pretty much whatever they like with it. In this discussion, the method I’ll be using hinges on forcing VNC to accept connections only if they come through a secure IP tunnel.

OK, So What’s an IP Tunnel?

VNC uses port 5900 by default, but there’s another protocol, Secure Shell (SSH), that uses port 22. IP tunneling allows the user to set up a connection via secure shell, which is encrypted and can be set to use strong authentication methods. The VNC traffic is then funnelled down the SSH connection and delivered to port 5900 at the far end, so you get exactly the same functionality but it’s all rock-solid secure.
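If you were connecting from another Linux (or Mac) machine rather than Windows, the whole tunnel could be set up with a single command; this is just a sketch, and the username and address are placeholders for your own details:

```shell
# Forward local port 5900 through SSH to port 5900 on the server.
# "myuser" and "192.168.1.10" are examples - substitute your own.
ssh -L 5900:localhost:5900 myuser@192.168.1.10

# Then, in another terminal, point a VNC viewer at the local end
# of the tunnel rather than at the server itself:
# vncviewer localhost:0
```

The Windows equivalent, using Putty’s GUI to do the same port forwarding, is described step by step later on.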

Before You Start

Firstly, I find wireless connections a bit ropey on Ubuntu, and since the network is going to be the only way in, make sure you have a decent wired connection to fall back on, or you’ll risk being locked out by an update or a change in config.

Secondly, with Ubuntu’s default settings, VNC will – for some reason – fail (because X will fail) if the monitor and keyboard aren’t plugged in. In other words, if you’re starting with a headless box, it may be hard to test some of these changes. This is dealt with later in this article, so you could skip ahead to the “Going Headless” section, or you could just borrow some I/O devices so you can check progress as you go along and debug things.

Thirdly, remember you’re changing important system files here, and we can’t be held responsible for anything that might go wrong along the way, so it’s your responsibility to take normal precautions to back everything up properly and have an exit strategy if it doesn’t turn out the way you expected.

And lastly, I’m assuming SSH is to be the only way in. If you’re running a telnet server as well it will potentially open another security hole, so you may want to consider what to do about that…

The VNC Server

To begin with, Ubuntu’s default VNC server, Vino, isn’t so hot and doesn’t start working till after you’ve logged in. There are plenty more available but I found x11vnc the best for what was needed, for various reasons. There are some good instructions on this page from the Ubuntu Forums (jnorth’s post, halfway down) but the author’s configuration must be different from mine because some of the file paths weren’t right. The order I used was:

1. Switch to the correct server. Enter each of the following two lines at the console and follow the on-screen prompts

sudo apt-get remove vino
sudo apt-get install x11vnc

2. Set a password for VNC. This will actually be one of three passwords you’ll need to use to access the box, so if you want to skip this stage as overkill, feel free!

sudo x11vnc -storepasswd

3. Make new configuration settings for Gnome

sudo gedit /etc/gdm/gdm.conf

and change #KillInitClients=true to KillInitClients=false

4. Add vnc to gdm’s initiation script

sudo gedit /etc/gdm/Init/Default

and add the following line (all on one line!) at the top, directly after the opening comments

/usr/bin/x11vnc -rfbauth /etc/vnc.passwd -noxdamage -forever -bg -o /var/log/x11vnc.log -rfbport 5900 -localhost

Note that at this point, you still don’t have a working VNC server. If you are planning to test that VNC is working correctly before setting up SSH, you need to temporarily remove the -localhost from the end of that line and restart the machine: its purpose is to block external connections, so that users have to come in via the secure shell before they can start work.

The SSH Server

1. To set up the SSH server on the headless machine type the following at a command prompt:

sudo apt-get install openssh-server

2. Then update the configuration

sudo gedit /etc/ssh/sshd_config

and check you have the following settings in the relevant parts of the file:
PubkeyAuthentication yes (should be like that already, but if it isn’t, change it!)
HostbasedAuthentication no (ditto)
PasswordAuthentication no

3. Bounce the server with

sudo /etc/init.d/ssh restart

4. Now let’s make some keys. Summon the resident locksmith with…

ssh-keygen -t dsa

which will prompt you for a file location (use .ssh/id_dsa) and a passphrase. This will be something you need to type in each time you connect to the box. You’ll recall I mentioned there were three passwords needed? Well, this is another, so again, if you think this is overkill you can skip it but you run the risk that anyone who gets access to your key (say by finding a thumb-drive you’ve left lying around) will have direct access to the server.

5. Navigate to the .ssh folder in your home directory and make a copy of the public key, id_dsa.pub, with the filename authorized_keys. Then copy the private key, id_dsa, to the client machine so it can be used by whoever is logging in.
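Assuming you accepted the default file locations in step 4, the copying in step 5 looks something like this (the scp destination is just an example; any secure way of moving the private key to the client will do):

```shell
cd ~/.ssh

# The PUBLIC key goes into authorized_keys on the server...
cp id_dsa.pub authorized_keys

# ...and the PRIVATE key is transferred to the client machine.
# "user" and "client-machine" are placeholders for your own details.
scp id_dsa user@client-machine:~/
```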

[Addendum] – SSH in Hardy seems to have rsa configured in as standard so you could try substituting “rsa” in place of “dsa” in steps 4 and 5 to make this bit work in Hardy.

Setting up SSH on Windows

1. First of all, you need two programs: Putty and Puttygen. The former is a Secure Shell Client and the latter is a key generator. Both are available for free download from the Putty download page.

2. Open Puttygen and go to File>Load Private Key. Browse to where you saved id_dsa. Enter the passphrase (assuming you used one!) and click “Save Private Key” to save it in some unobtrusive location. You should make a backup of the key in case you lose it, but obviously make sure you keep both copies secure!

3. Now open Putty and make the following changes:
Enter the name or IP of the host computer under “Host Name (or IP Address)”. The port number should be the default 22 unless you consciously changed it.
The protocol is SSH of course!
Use the tree menu on the left to get to Connections>SSH>Auth and click the “Browse” button to browse to where you saved the key file.
Again, in the tree menu, click on “Tunnels”.
Under “Add a new forwarded port”, add the following settings:
Source Port: 5900
Destination: Localhost:5900
* (Radio Button): Local
…and click “Add”
Now, using the tree menu again, return to “Session” at the top. Enter a name for the saved session – e.g., VNCTunnel – and click “Save”.


If all has gone according to plan, you should now be able to connect to your server to make sure everything works properly before you actually take the step of removing the I/O devices. If you don’t have any I/O devices and have been following the above using a terminal, see the “Going Headless” section below for the final configuration changes you’ll need to make before X will agree to play nicely with your VNC server.

1. Start by restarting the Linux box, just to make sure any changes you’ve made are picked up.

2. While that’s going on, download tightvnc from the tightvnc web site. It’s a handy little gizmo that works without even needing to be installed.

3. When you’re ready, open Putty, select your stored connection, VNCTunnel, or whatever you called it, and click “Open”. Enter your username (the username on the linux box whose home directory contains the public key for openssh) followed by the passphrase for your private key.

4. Now open tightvnc. The vnc server to use is localhost:0 (NB – not the address of the actual host. SSH is passing the data back and forth through the tunnel!). Then enter the password you used for vnc. If you don’t get prompted for a password then something went wrong! Finally, log into gnome.

Switching User Accounts

Note that you only have to establish the SSH connection once under one username and you’ll then be able to log into the box as often as you like, under any other username using the gnome desktop manager just as if you were sitting at the machine itself. However, I’ve found that the user switcher applet tends to fail, so you will probably find it’s best to log out as one user before logging in as the other.

Going Headless

This is potentially tricky because it involves editing the display settings in xorg.conf, and the settings vary from machine to machine, so your changes may need to be different from mine.

1. To be safe, back up the file before starting so it can be restored in an emergency. In the command window, type

sudo cp /etc/X11/xorg.conf /etc/X11/xorg.bak

2. Edit the X settings. This next line uses a command-line text editor since that might be all you can access if you’re shut out of Gnome at this point!

sudo nano /etc/X11/xorg.conf

3. Scroll down to the section which reads

Section "Files"

(which may or may not have some other lines in between the first and second), and either add the following just below it

Section "ServerFlags"
Option "AllowMouseOpenFail" "true"

…or, if you already have a section entitled “ServerFlags”, just add the second line somewhere inside the existing section.

4. According to the x11vnc FAQs that should be enough, but I found my X server wanted to go into low graphics mode because it couldn’t determine what resolution to use, which meant I was still unable to access VNC, so I also had to go down to the “Screen” section of xorg.conf and remove all the screen resolutions except the one I wanted to use. In other words:

Section "Screen"
<blah blah blah>
  SubSection "Display"
  Modes      "1024x768"

which did the trick.

[Addendum] Later, when I tried this same process on a different computer with a different (nvidia) graphics card, I found the VNC session would always default to 640×480 resolution if I started it with no monitor present. Switching from the proprietary driver to vesa helped a bit (800×600!) but I needed to be a bit more specific to get it the way I wanted, by setting refresh rates and passing some parameters to the device driver as well as stating a preference for the screen size. I’m sure there will be a lot of variation between graphics cards (in fact, I know there is, because I googled it!) but here’s what I used for my Device, Monitor and Screen settings, which seemed to work:

Section "Device"
  Identifier  "Configured Video Device"
  Driver  "vesa"
  Option  "NoDCC"  "True"
  Option  "IgnoreEDID"  "true"

Section "Monitor"
  Identifier  "Configured Monitor"
  Horizsync 30-81
  Vertrefresh 56-75

Section "Screen"
  Identifier  "Default Screen"
  Monitor  "Configured Monitor"
  Device  "Configured Video Device"
  Defaultdepth 24
    SubSection "Display"
    Modes  "1024x768"


Linux being a pretty eclectic creation, you’re more than likely to find that some aspect of your setup details will differ from what I’ve written above, but don’t be put off, because whatever your problem, you can be sure someone will have had the same one before, and Google will be able to help if you make your query specific enough. Please let me know if you have anything to add or any questions to ask.

Thanks – and good luck!


Pandas for Everyone

I bought this book today and I just love the name of it because it sounds like a Jeremy Corbyn manifesto pledge from the 2019 election that would have had the tory front bench making smug jokes about magic money trees. In this case, I suppose they’d have a point. Pandas are expensive and the government should seriously consider whether this panda distribution scheme is really a good use of public funds. This is why I’m a Lib Dem. They are sound on pandas.

Ah, sadly, that’s not it. “Pandas” is the name of a data analysis library for the Python programming language. I’ve been helping my daughter with Python projects for her IT homework – manipulating files, implementing password protection, writing simple games – so I’ve become moderately familiar with it, and it’s very pleasingly unfussy, which I like. Now I’m seeing that a lot of the jobs and projects that come up in my searches list Python as a desired skill, so building on what I have learned makes sense. I tend to stick to what I know, and that’s fine, it pays the bills, but times change and there’s no sense relying on the same techniques to last forever.

I’ll try and have something intelligent to say on the subject when I’ve read it. I should probably post here more than once every two years after all. Most of my blogging activity these days is dedicated to an even more obscure topic, namely learning Portuguese. Yep, I really know how to go for that mass appeal. I’ll soon be able to give up my job and live off the ad revenue on a yacht with another smaller yacht to house my to-be-read book pile.

Exporting Form Data as PDFs

I’ve come across a few solutions to the problem of how to take form data from a variety of sources and turn it into flat files. This is quite often needed in migrations out of case recording systems. Forms may be used for assessments, referrals, and other processes, and the number of templates used tends to be pretty high, so the work involved in transferring answers cleanly into an equivalent in the new system is far too great to be considered seriously (remember the fourth rule in the Practical Data Migration methodology!). 99.999% of it is usually unneeded by reporting but needs to be easily accessible by the practitioner, so what’s needed is a general-purpose process that can take all the questions and answers and apply a standard format to them, so that they are readable and easy on the eye.

Most of the methods I’ve seen in use seem to be slow (using the supplier’s own export feature, for example, I’ve seen one project creak on for months) or expensive (buying third party software) or both. Luckily, our old friend Talend can be used for this sort of job with a little bit of cunning. The approach taken is to embed the form questions/answers in html tags, which are pretty easy to use and format, and then pass the result into a java library that is capable of taking HTML and converting it to PDF. I’m sharing it online because I think it’s something that could be helpful to other LAs, and maybe save them a bit of cash. God knows, there’s little enough of it about in these austere times. Talend is free if you only use one developer (there is a paid version for more complex projects too), so it’s a cost-effective option. If you don’t have the skills in house and need someone to plumb it in for you, drop me a line to discuss terms.

In order to follow this tutorial you’ll need a few prerequisites:

  • Reasonably sound understanding of Talend
  • Decent SQL skills, probably including the ability to write functions in T-SQL, PL/SQL or whatever is your tipple of choice.
  • Ability to construct a reasonably sound html document.
  • No fear of Java. You don’t really need to know much about it, just not freak out if you see it.

OK? Still here? Shall we have a go?

Get hold of the Java Libraries

Download the zip file containing all the jar files for the latest version of openhtmltopdf from its releases page.

Unzip the archive to a folder

Use tLibraryLoad to load all of the jar files one after the other. You’ll probably find that one of the jars – commons-logging – already exists in Talend, so you won’t need that one, but the others all need loading like this:


Then you’ll need to build a workflow that follows this sort of pattern (I’ll list the components in the order shown in the screenshot, with the names they have there, so you can relate the narrative to the flow, but obviously the names themselves don’t really matter so much).

Make a Workflow


list document ids

An input listing all the document IDs. In my case, that’s an MS SQL Server input containing a query that has just two columns: the document ID and a text string which I can use as a name. In reality you’ll probably want some more metadata, such as the date of creation and who created it, as most load scripts will want this kind of info for each file they load.

take each in turn

A flow-to-iterate component. In other words, it will take all the rows handed to it by the previous component and pass each one in turn to the following stage so it can deal with each one, one at a time.

select rows as html

This is going to be by far the hardest component to build, since it is effectively building an HTML document out of the various parts of the form – one that must be complete and valid XHTML (very important, that: it has to be XHTML because the Java will want it to be valid XML as well as decently formatted HTML).

Let’s say you have a super-basic structure where each form has several sections, each section has several questions and each question may or may not have an answer, which can be either a string or a date. You’d have something like this:

"select page_order, question_order,
'<span style=\"color: #00006B; font-weight: bold; padding: 5px;\">' +
dbo.dm_xml_cleanse(the_question) + '</span>' as opener,
coalesce(a.the_text, convert(varchar(32), thedate, 103)) as stringans,
'<br/>' as closer
from forms f
inner join sections s on
inner join questions q on
left join answers a on "

In this code, form_id is just the name of the column in which the unique identifier of the form is held. The globalMap.get() function just retrieves it, in order to limit the output to only the sections/questions/answers relevant to the specific instance of the form it’s looking at right now.

Note that each row has a section order and a question order (which I am assuming to be _in_ order – i.e., sorting on these two will get the questions into the order they appear in the front end). The text output is in three parts too: a beginning, a middle and an end. It doesn’t have to be like that; I just find it easier if I have the opening tag in one, the text in the second and the closing tag in the third.

Obviously the schema for this query will need to give all those text strings plenty of elbow-room, otherwise the text will be truncated and nobody wants that.

dbo.dm_xml_cleanse is a t-sql function whose job is just to replace dodgy xml entities with better equivalents, and replace line breaks with <br/> tags. Obviously if the text already has markup within it you’ll need to deal with that first.
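For anyone who’d rather see the idea than the T-SQL, here’s a rough Java sketch of what a cleansing function like that does. The name and the exact substitutions here are my own assumptions for illustration, not the actual dm_xml_cleanse source:

```java
// Hypothetical Java equivalent of the dm_xml_cleanse idea: escape the XML
// special characters and turn line breaks into <br/> tags so free text can
// sit safely inside an XHTML document.
public class XmlCleanse {
    public static String cleanse(String text) {
        if (text == null) return "";
        return text
                .replace("&", "&amp;")   // ampersand must be escaped first
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;")
                .replace("\r\n", "<br/>")
                .replace("\n", "<br/>");
    }

    public static void main(String[] args) {
        // prints: Risk &lt; high &amp; rising<br/>See notes
        System.out.println(cleanse("Risk < high & rising\nSee notes"));
    }
}
```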

Of course, these tags on their own aren’t enough: you’ll also need to union the query with other sections, including a header (section and question order both set to 0 so it appears at the top) and a footer (ditto, but set to 100000000).

The header should have something like this in its text output

'<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">
<html xmlns=\"http://www.w3.org/1999/xhtml\">
<head>
<title>Migrated Form</title>
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />
<!-- insert stylesheet data here -->
</head>
<body> '

and you can add headings, divisions and other bits by unioning other queries in there: a query on only form and section with a 0 in the question_order field would give you a row at the top of each section which you could use as a heading, for example. You get the idea. Just stick to standard meat-and-potatoes HTML tags with some CSS to make it look more readable.

You can use images like this

<img src=\"file:///E:/mylocalfolder/pdf_logo.png\" style=\"float: right;\" />

For some reason, I couldn’t get this to work if the image was on a network drive though, only a local one.

You could also look at odd css tricks that I guess were developed for printable pages and are not commonly used in standard web format. For me, the most useful was

page-break-before: always;

…which I used in the style attribute of the division surrounding each new section, in order to break it up better and stop a lot of ugly splits in mid-question.

join three parts together

All this is doing is reuniting the three text fields in each row into one single field

sort in order

Sort on the section_order and question_order fields to get all the rows in the correct order with the header at the top, the footer at the bottom and so on

aggregate all rows

Uses a “list” aggregate function to amalgamate all the text fields in all the rows into one huge text string

Set Global Variable

Assigns that text string to a single (and very large!) variable that can then be re-used in various places. Obviously make sure the list separator is a space, not any other character, otherwise the separator will bollox up the XML.
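In plain-Java terms, what the aggregate-and-store steps amount to is something like this sketch (not the code Talend actually generates):

```java
import java.util.List;

public class HtmlAssembler {
    // Joins the per-row HTML fragments into one document string, using a
    // space as the separator -- any other separator character would land
    // between the tags and risk corrupting the XML.
    public static String assemble(List<String> fragments) {
        return String.join(" ", fragments);
    }

    public static void main(String[] args) {
        // prints: <body> <p>Q1: yes</p> </body>
        System.out.println(assemble(List.of("<body>", "<p>Q1: yes</p>", "</body>")));
    }
}
```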


Create the File

A tJava component where the file is actually created. There’s a small amount of Java here that does the work.

First of all, on the imports tab, you need to write

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import com.openhtmltopdf.pdfboxout.PdfRendererBuilder;


Then on the code tab, this code, which will actually write a pdf version of your form.

OutputStream os = new FileOutputStream("E:\\myfilepath\\" + ((String)globalMap.get("filename")) + ".pdf");
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.withHtmlContent(((String)globalMap.get("theHTML")), "");
builder.toStream(os);
builder.run();
os.close();

“filename” is the column in the “list document ids” component that contains the name of the document to be created. Obviously this should be unique, so either the file path or the name should probably contain the unique identifier for the form, so that there’s no risk of one form overwriting another.

“theHTML” is the name of the huge text variable created in the “Set Global Variable” component. If the HTML isn’t valid XML, it’ll explode, so while you’re in test mode there’s an extra bit of code that can be used to write the HTML into a separate file before converting it into pdf. It’s best to paste this ABOVE the previous block:


File file = new File("E:\\pdf_exports\\asm\\" + ((String)globalMap.get("filename")) + ".htm");
if (!file.exists()) {
    file.createNewFile();
}
FileOutputStream fwr = new FileOutputStream(file);
byte[] toBytes = ((String)globalMap.get("theHTML")).getBytes();
fwr.write(toBytes);
fwr.close();
If you get an error from the tJava Component, you should be able to find the file that was created just before the error. You can upload it to the W3C’s XML validator which will tell you what is wrong with the XHTML you constructed and you can keep tweaking till it goes through.
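If you’d rather catch bad XHTML inside the job instead of round-tripping through the validator, the JDK’s built-in XML parser can do a quick well-formedness check before the string goes anywhere near the renderer. This is a sketch of the idea, not part of the original job:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class XhtmlCheck {
    // Returns true if the string parses as well-formed XML -- the same
    // property the PDF renderer insists on before it will touch the document.
    public static boolean isWellFormed(String xhtml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Don't try to fetch the DTD over the network during the check.
            dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            dbf.newDocumentBuilder().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><body><p>ok</p></body></html>")); // true
        System.out.println(isWellFormed("<html><body><br></body></html>"));     // false: unclosed tag
    }
}
```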

Taking it Further

As you can see, this is a basic outline. To turn it into a complete working process you’ll probably want a bit more, such as an output that takes the metadata of each file created and stores it in a table (along with the file path), subject to confirming that the file was created successfully, so you can track what you’ve captured so far. Subsequent loads can then check that table, disregard any forms that have already been captured and not subsequently updated, and start creating new files based on what’s left. If you have a gazillion forms they most likely won’t all fit into a single folder, so you’ll also need a function to construct a path to the folder each file belongs in. This is mildly annoying because I found I couldn’t get Java to create the folders in a nice, easy way, so I ended up adding, just before the tJava component, a tFileTouch component to make a temporary file (this WILL create the folder tree automatically) and then a tFileDelete to get rid of the temp file, leaving just the folder behind so that the Java can drop the file in the right place.
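For what it’s worth, the standard library does have a one-call way to build a folder tree, which might save you the tFileTouch/tFileDelete dance depending on your Talend version. A sketch, offered as an alternative to test rather than a drop-in fix (the path shown is hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FolderMaker {
    // Creates every missing folder on the path, like the tFileTouch trick
    // but without the temporary file. Existing folders are left alone.
    public static Path ensureFolder(String dir) throws IOException {
        return Files.createDirectories(Paths.get(dir));
    }

    public static void main(String[] args) throws IOException {
        // Substitute the per-form folder your function computes.
        Path p = ensureFolder(System.getProperty("java.io.tmpdir") + "/pdf_exports/asm/64398");
        System.out.println(Files.isDirectory(p)); // true
    }
}
```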

Building a List in Talend to Use as a Filter in Another Component

This is an answer to a question about Talend that was posted on StackOverflow. I wasn’t able to post as many pictures as I needed due to house rules in place, so I have moved the whole thing here and linked back to it. The general gist of the question was that there was a mahoosive table that would have eaten too much memory if its entire contents were dragged into Talend, so the user wanted to take some values from a different table, string them out into a list, pop them into a context variable and then squirt them back into another SQL query, so that he ended up with a smaller set of results to work with. With me so far? OK, read on!


This should be possible. I’m not working in MySQL but I have something roughly equivalent here that I think you should be able to adapt to your needs.

As you can see, I’ve got some data coming out of the table and getting filtered by tFilterRow_1 to only show the rows I’m interested in.


The next step is to limit it to just the field I want to use in the variable. I’ve used tMap_3 rather than a tFilterColumns because the field I’m using is a string and I wanted to be able to concatenate single quotes around it, but if you’re using an integer you might not need to do that. And if your data contains a lot of repetition, you might also want to get a tUniqRow in there to weed out the duplicates.


The next step is the one that does the magic. I’ve got a list of values like A1, A2, B1, B2, one per row, and I want to turn it into 'A1','A2','B1','B2' so I can slot it into my where clause. For this, I’ve used tAggregateRow_1, selecting “list” as the aggregate function to use.
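If it helps to see the transformation outside Talend, the tMap-plus-tAggregateRow combination boils down to something like this plain-Java sketch:

```java
import java.util.List;
import java.util.stream.Collectors;

public class InListBuilder {
    // Wrap each value in single quotes and join with commas -- exactly the
    // shape an SQL IN (...) clause wants. Doubling any embedded single
    // quote keeps the literal valid.
    public static String toInList(List<String> values) {
        return values.stream()
                .map(v -> "'" + v.replace("'", "''") + "'")
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // prints: 'A1','A2','B1','B2'
        System.out.println(toInList(List.of("A1", "A2", "B1", "B2")));
    }
}
```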


Next up, we want to take this list and put it into a context variable (I’ve already created the context variable in the metadata – you know how to do that, right? If not, here’s a quick rundown). Use another tMap component, feeding into a tContextLoad widget. tContextLoad always has two columns in its schema, so map the output of the tAggregateRow to the “value” column and enter the name of the variable in the “key” column. In this example, my context variable is called MyList.


Now your list is loaded as a text string and stored in the context variable, ready for retrieval. So open up a new input and embed the variable in the SQL code like this

from MY_SECOND_TABLE where the_selected_row in (" + context.MyList + ")

It should be as easy as that, and when I whipped it up it worked first time, but let me know if you have any trouble and I’ll see what I can do.

Merging TIFFs into PDFs Using Only Free Software

I had a tricky problem a while ago and nobody seemed to know how to do it so when I worked it out, I thought it might be fun to post a how-to here for other people to crib from and take the credit. Wait, is this such a great idea? Oh well, never mind, here goes…

The challenge is to take a group of scanned pages from a document management system and prepare them for migration into Servelec Corelogic’s Frameworki/Mosaic product. The documents are scanned on a page-by-page basis as TIFFs, and the objective is to merge the pages into a single file, either as TIFFs or as PDFs in a new folder, with the paths held in a database table. In this example, I’ve used nConvert, which is largely free, although if you use it commercially you should buy a license. There’s another free program that I believe can do the same job, although I haven’t specifically tried it – namely Irfanview.

The general strategy is:

  • List the files and where they’re stored in the file system or EDRMS
  • Use T-SQL or PL/SQL to write a command-line function to group all the individual files (pages) together and merge them into a single file in the file system
  • Pass the location of the new file to the import process.

Starting in Talend Open Studio, the first step is to create as new job using the tFileList component as the starting point, to get a list of files in the folder you’re interested in.


Use an iterator to connect to the next step – a tFileProperties component, which you can use to get the file properties of each file in turn. Check the image below for the format to use. You can use this to store the details of all the files in a table called – in this example – FILE_FILESYSTEM.


To move to the next stage, I’ve used a T-SQL function to create a shell command that does two things: first, create a new folder for the files to live in, and second, invoke a third-party app called nConvert to merge the pages into a single file. In the command below, you can see the “md” command being used to create the folder, and then nConvert being called to either merge the files or merge and convert them to pdfs.

cmd /c cd C:/test/smart_files/ &
md ID &
cd ID &
md 64398 &
nconvert -multi -out tiff -c 5 -o C:/test/smart_files/ID/64398/164994_v1.tif U:/00707000/00706853.tif U:/00707000/00706854.tif U:/00707000/00706855.tif U:/00707000/00706856.tif U:/00707000/00706857.tif U:/00707000/00706858.tif U:/00707000/00706859.tif U:/00707000/00706860.tif U:/00707000/00706861.tif U:/00707000/00706862.tif U:/00707000/00706863.tif U:/00707000/00706864.tif U:/00707000/00706865.tif U:/00707000/00706866.tif U:/00707000/00706867.tif U:/00707000/00706868.tif U:/00707000/00706869.tif U:/00707000/00706870.tif U:/00707000/00706871.tif U:/00707000/00706872.tif U:/00707000/00706873.tif U:/00707000/00706874.tif >>C:/test/output.txt

In the example above I’m just merging them, but it’s simple to merge them as a pdf by just changing the format to

-out pdf



The content of the table can then be split in two. The bulk of it can be passed to the import process, while the last column – containing the output of the T-SQL function – is passed to a shell command using a tMap component:


into an iterator….


The iterator then passes the output of the function into a shell command and merges the files into a single file in the specified folder.
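Under the hood, that step is just Java handing the command string to the operating system. Here’s a minimal sketch of the equivalent in plain Java, using echo as a stand-in for the real nconvert command line:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;

public class ShellRunner {
    // Runs a command and returns its first line of output -- roughly what
    // the Talend shell-command step does with the generated string.
    public static String run(List<String> command) throws Exception {
        Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line = r.readLine();
            p.waitFor();
            return line;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the generated nconvert command line.
        System.out.println(run(List.of("echo", "merged"))); // prints: merged
    }
}
```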

You now have a list of merged files in a format the import process can understand and a folder containing the merged files, all stored in the place in which the import process expects to find them. It should be straightforward to simply run the load procedure and scoop up the merged file into Mosaic.

Christmas Gift Ideas

It’s that time of year again, and project managers up and down the country are wondering what to put in their team members’ stockings. Well, have no fear, here’s my must-have gift-giving guide for the Data Guru who has everything.

1. A Better Computer

Is your data migration lead’s brow furrowed? Does he spend hours staring at his screen clenching and unclenching his fists as the record count ticks from 100 to 200 on a 10,000,000 record load? This might be a sign that the refurbished Pentium III laptop or the virtual box accessed through a dumb terminal that you thought would be so much more cost-effective than a new Dell wasn’t such a good choice after all.

As data flies in and out of it, headed for the target database, both of the test machine’s kilobytes fill up immediately and it starts furiously swapping to keep up. The lights dim, the smell of burning fills the air, the development computer fails to respond to mouse-clicks, and the screen fades to grey. This is when that lovely, christmassy scarlet colour can be seen in the cheeks of your colleague.

Why not log in to the purchasing portal and order a better computer? What it costs you will be more than made up in fees as work gets done more quickly and doesn’t spill over into extra days and evening work.

2. Talend Data Integration Suite

OK, so Open Studio is the best £0.00 you’ve ever spent, but there’s a whole other level of greatness!

3. The Force Awakens Tie-in Poster

The power of the Force (AKA The Disney Corp) has reached into the world of data migration, producing a system even more powerful than PDMv2, and now you can buy inspirational posters based on the movie script to help motivate your data migration lead to fight the power of the dark side.

4. Another Spreadsheet

This one is a perennial favourite, and ultimately what most data migration professionals are given every year. We’ve all seen this heart-warming yuletide scene: Late December, a few scant weeks before go-live, and the project team are pulling on their coats, ready to go down the pub for their Christmas do. As if suddenly remembering something, one of the BAs turns and says

“Oh by the way, I’ve just emailed you a spreadsheet the business have told me about. It has mission-critical data on it and they absolutely can’t go live without it. Merry Christmas!”

…and with that they are all gone, leaving the vision of a slowly turning egg-timer reflected in the tears of – one assumes – pure joy, streaming down the data migrator’s face.

Happy Christmas…. And remember, we’re making a list, we’re checking it twice….