Archive for 21 Jan 2008

Checksum for dataset comparison

When implementing software systems that take a file feed from an external source, you need to know whether the source data in the file has changed. If the data has not changed at all, there is no need to load the file into the company’s database.

More often than not, the publishing source has no automated mechanism to alert consumers (the clients needing the file) that the file has changed. Because of this, the consumer processes have to download the file (every day, maybe) from the publishing source’s FTP site and then compare the records in that file with the records stored in the company’s database.

The solution I have implemented does not avoid the unnecessary download, but it does avoid comparing the source file’s records with the company’s database.

One of the easiest ways to keep your database in sync with the latest version of the data from the producer is to use “checksum” functions. Unix provides checksum utilities (cksum, sum) as part of its standard OS build.

The steps to determine whether the currently retrieved file has changed are pretty simple:

1. Execute cksum (or sum or md5) on the previous copy of the file. Note: before downloading the latest copy, take a backup of the file currently saved on your file system so that it is not overwritten.

2. Execute cksum on the currently downloaded file.

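3. Compare the two checksums generated in the previous steps.

If the checksums are the same, you can be reasonably confident that the contents of both files are the same and skip the load. Two different files can produce the same checksum, but for an accidental change this is very unlikely. If the checksums differ, the file has changed and needs to be loaded.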
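The three steps above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names `crc32_of_file` and `file_changed` and the file paths are hypothetical, and `zlib.crc32` is a standard CRC-32, so its value will not match the number printed by the Unix cksum utility (which uses a different CRC variant). It plays the same role, though.

```python
import zlib


def crc32_of_file(path, chunk_size=65536):
    """Return the CRC-32 of a file, reading it in chunks to limit memory use."""
    crc = 0
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def file_changed(previous_path, current_path):
    """Steps 1-3: checksum the backup copy, checksum the new download, compare."""
    return crc32_of_file(previous_path) != crc32_of_file(current_path)
```

A load job could then call something like `file_changed("feed.previous", "feed.current")` (hypothetical file names) and run the database load only when it returns `True`.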

MD5 is stronger than cksum or sum: it produces a 128-bit digest rather than a 32-bit (or smaller) checksum, so two different files are far less likely to produce the same value. However, it is not part of the standard Solaris build, so you may get some resistance from your server administrator to installing an MD5 utility on your Solaris server. If you already have an MD5 utility on your server, I would suggest taking the MD5 route.
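If Python happens to be available on the server, its standard `hashlib` module can compute an MD5 digest without a separate MD5 utility. The sketch below is an assumption about how that might look, not part of the original solution; `md5_of_file` is a hypothetical helper name.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 digest of a file as a hex string, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `md5_of_file` for the previous and current copies works exactly like the cksum comparison: equal digests mean the load can be skipped.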
