I had a tricky problem a while ago and nobody seemed to know how to solve it, so when I worked it out I thought it might be fun to post a how-to here for other people to crib from and take the credit. Wait, is this such a great idea? Oh well, never mind, here goes…
The challenge is to take a group of scanned pages from a document management system and prepare them for migration into Servelec Corelogic’s Frameworki/Mosaic product. The documents are scanned page by page as TIFFs, and the objective is to merge the pages of each document into a single file, either as a TIFF or as a PDF, in a new folder, with the paths held in a database table. In this example, I’ve used nConvert, which is largely free, although if you use it commercially you should buy a license. There’s another free program that I believe can do the same job, although I haven’t specifically tried it, namely IrfanView.
The general strategy is:
- List the files and where they’re stored in the file system or EDRMS
- Use T-SQL or PL/SQL to write a function that builds a command line to group all the individual files (pages) together and merge them into a single file in the file system
- Pass the location of the new file to the import process.
Starting in Talend Open Studio, the first step is to create a new job using the tFileList component as the starting point, to get a list of files in the folder you’re interested in.
Use an iterator to connect to the next step, a tFileProperties component, which you can use to get the properties of each file in turn. Check the image below for the format to use. You can use this to store the details of all the files in a table called, in this example, FILE_SILESYSTEM.
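If you want to see what this step produces without opening Talend, here is a rough Python sketch of the same idea: walk a folder and collect one row of properties per file. It is not what Talend runs internally, and the function name and the column names (abs_path, dirname, basename, size, mtime) are my own choices for illustration, so rename them to match your table.

```python
from datetime import datetime, timezone
from pathlib import Path


def list_file_properties(folder, pattern="*.tif"):
    """Return one dict per matching file, sorted by name so pages stay in order."""
    rows = []
    for path in sorted(Path(folder).glob(pattern)):
        if not path.is_file():
            continue  # skip sub-folders that happen to match the pattern
        full = path.resolve()
        info = full.stat()
        rows.append({
            "abs_path": str(full),          # hypothetical column names,
            "dirname": str(full.parent),    # chosen for this sketch only
            "basename": full.name,
            "size": info.st_size,
            "mtime": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        })
    return rows
```

Each dict maps onto one row of the table, ready to be inserted with whatever database driver you already use.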
To move to the next stage, I’ve used a T-SQL function to create a shell command that does two things: first, it creates a new folder for the files to live in, and second, it invokes a third-party app called nConvert to merge the pages into a single file. In the command below, you can see the “md” command being used to create the folder. nConvert can then be called either to merge the files or to merge and convert them to PDFs.
cmd /c cd C:/test/smart_files/ &
md ID &
cd ID &
md 64398 &
nconvert -multi -out tiff -c 5 -o C:/test/smart_files/ID/64398/164994_v1.tif U:/00707000/00706853.tif U:/00707000/00706854.tif U:/00707000/00706855.tif U:/00707000/00706856.tif U:/00707000/00706857.tif U:/00707000/00706858.tif U:/00707000/00706859.tif U:/00707000/00706860.tif U:/00707000/00706861.tif U:/00707000/00706862.tif U:/00707000/00706863.tif U:/00707000/00706864.tif U:/00707000/00706865.tif U:/00707000/00706866.tif U:/00707000/00706867.tif U:/00707000/00706868.tif U:/00707000/00706869.tif U:/00707000/00706870.tif U:/00707000/00706871.tif U:/00707000/00706872.tif U:/00707000/00706873.tif U:/00707000/00706874.tif >>C:/test/output.txt
In the example above, I’m just merging them into a multi-page TIFF, but it’s simple to merge them as a PDF instead by changing the format given to the -out option.
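To make the shape of that command easier to see, here is a small Python sketch that assembles the same string from its parts. It is a stand-in for the T-SQL function, not a copy of it. The function and parameter names are made up for this example, and the nConvert options are simply the ones in the command above.

```python
def build_merge_command(root, id_folder, doc_folder, out_name, pages,
                        out_format="tiff", log_path=None):
    """Build one cmd line that creates root/id_folder/doc_folder and merges pages into it.

    Assumption: none of the paths contain spaces, as in the example above;
    paths with spaces would need quoting.
    """
    if not pages:
        raise ValueError("at least one page is required")
    root = root.rstrip("/")
    target = f"{root}/{id_folder}/{doc_folder}/{out_name}"
    steps = [
        f"cmd /c cd {root}/",   # start in the root output folder
        f"md {id_folder}",      # create the first-level folder
        f"cd {id_folder}",
        f"md {doc_folder}",     # create the per-document folder
        # options copied from the example command, only the format is variable
        f"nconvert -multi -out {out_format} -c 5 -o {target} " + " ".join(pages),
    ]
    command = " & ".join(steps)
    if log_path:
        command += f" >>{log_path}"  # append nConvert's output to a log file
    return command


if __name__ == "__main__":
    print(build_merge_command(
        "C:/test/smart_files/", "ID", "64398", "164994_v1.tif",
        ["U:/00707000/00706853.tif", "U:/00707000/00706854.tif"],
        log_path="C:/test/output.txt",
    ))
```

Passing a different out_format, together with a matching file extension in out_name, is the only change needed to ask for another output type.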
The content of the table can then be split in two. First, the bulk of the table can be passed to the import process. Second, the final column, which holds the output of the T-SQL function, is routed through a tMap component into an iterator.
The iterator then passes each command generated by the function to a shell component, which runs it and merges the pages into a single file in the specified folder.
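Outside Talend, the iterator-plus-shell step boils down to a loop that runs one command per row. Here is a minimal Python sketch of that loop, assuming each row is a (document id, command string) pair. The function names are invented for this example, and the run parameter is only there so the loop can be tried without actually launching anything.

```python
import subprocess


def run_merge_commands(rows, run=subprocess.run):
    """Run each row's command through the shell; return (doc_id, return code) pairs."""
    results = []
    for doc_id, command in rows:
        # shell=True hands the whole string to the system shell (cmd.exe on
        # Windows), which is what lets the "&"-chained steps run as one line.
        completed = run(command, shell=True)
        results.append((doc_id, completed.returncode))
    return results


def failed(results):
    """Pick out the documents whose command exited with a non-zero code."""
    return [doc_id for doc_id, code in results if code != 0]
```

A non-zero return code is only a rough signal that something went wrong, so it is still worth reading the log file the command appends to.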
You now have a list of merged files in a format the import process can understand and a folder containing the merged files, all stored in the place in which the import process expects to find them. It should be straightforward to simply run the load procedure and scoop up the merged file into Mosaic.