Automation in Data Migration and Data Validation

In modern IT, whenever systems are upgraded, replaced, or consolidated, the first and most important thing that comes into the picture is the existing data in the system. Before the move to a new and more efficient system can go ahead, one has to take the necessary measures to preserve that data and have it up and running in the new system as well. This is where data migration techniques play a vital role. At first, the term data migration might sound trivial to most of you … “all one needs to do is keep a backup of the data and restore it in the new system!” Unfortunately, it is not as simple as that. Data migration entails real challenges, to name a few: high data volumes, complex mappings and transformations, and sometimes the need for real-time synchronization between the old and the new systems. Many tools on the market aim to ensure accurate data transfers, and Talend Open Studio is one of the most cost-effective among them. Talend is fast and supports a wide range of business application software, including accounting, ERP, e-commerce, and CRM systems; moreover, it can connect to Big Data infrastructures as well. Talend provides rich functionality that dramatically speeds project implementation and boosts developer productivity.

In the past, undertaking real-time data migration projects involved labor-intensive manual coding, which could be extremely time-consuming and expensive. Consequently, even after realizing that real-time data synchronization was the need of the hour, many organizations had to keep it on their “wish list”. We at Bista Solutions have explored the use of Talend and achieved data migration, be it one-time or real-time, with little to no human interference by automating the processes.

A major challenge in data migration projects is validating thousands of tables containing huge numbers of records within a defined time frame. In fact, Gartner has reported that more than 50% of data migration projects will exceed their budget and fail to achieve the desired result due to flawed execution.

During data migration, there is always a possibility of missing or corrupted data. It is therefore necessary to verify that the given data set has migrated completely and correctly, considering both historical and incremental data migration. Incremental loading of data, specifically, can be challenging because every database has its own particular structure. This makes it essential to test that the fields and jobs are loaded correctly and that files and databases are not corrupted. Having said this, manually testing billions of records is neither feasible nor realistic. And so, automating the validation processes to check the accuracy of migrated data is the best solution.

To automate the validation processes, we at Bista Solutions use Talend Open Studio to write scripts that compare and validate the data present in both the legacy and the new database. Through this post, we share a few of these techniques for achieving automation in data migration and data validation.
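Outside Talend, the same idea can be prototyped in plain Python. The sketch below is a hypothetical helper (it uses SQLite in-memory databases for illustration; a real validation job would connect to the legacy and new databases instead) that compares row counts first, then full row content keyed by the primary key:

```python
import sqlite3

def validate_table(legacy_conn, new_conn, table, key_column):
    """Compare one table between the legacy and the new database.

    Returns (ok, message): first checks row counts, then compares
    full rows keyed by the primary key so ordering does not matter.
    """
    count_sql = f"SELECT COUNT(*) FROM {table}"
    legacy_count = legacy_conn.execute(count_sql).fetchone()[0]
    new_count = new_conn.execute(count_sql).fetchone()[0]
    if legacy_count != new_count:
        return False, f"row count mismatch: {legacy_count} vs {new_count}"

    def fetch_rows(conn):
        # Select the key column first so row[0] is always the primary key.
        sql = f"SELECT {key_column}, * FROM {table}"
        return {row[0]: row for row in conn.execute(sql)}

    legacy_rows = fetch_rows(legacy_conn)
    new_rows = fetch_rows(new_conn)
    bad = [k for k, row in legacy_rows.items() if new_rows.get(k) != row]
    if bad:
        return False, f"{len(bad)} mismatched rows (e.g. key {bad[0]})"
    return True, "table matches"
```

In a real validation job the same comparison would loop over every migrated table, and the (ok, message) result could be written to a log table or mailed to stakeholders.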

Automation Tips and Tricks using Talend Open Studio for Data Migration:

Tip 1: Using tRunJob to run multiple Jobs in a single Job Design

A job in Talend is nothing but the logic we build to perform a specific operation. But what if you have several such individual jobs and want to execute them one after another? The first step of automation begins with placing several migration scripts, which we call Talend jobs, in one frame and executing the tasks in sequence. Talend Open Studio facilitates this by providing the “tRunJob” component, which lets you put one Talend job inside another so that two or more Talend jobs can be executed one after another in a row.

Say you are doing this data migration to support an ERP system that takes care of your daily data from Sales, Purchase, Inventory, Accounts, and many other areas that must be migrated in a fixed order because of database dependencies. If you have built a separate job to migrate the data for each of these tasks, the tRunJob component will be of great help: you can create a master sequence of these tasks as Talend subjobs and leave it for execution.

Here are snapshots of the system described above:

1. Master sequence of daily transactional data

Each child job (Sales, Purchase, Accounts) is placed in a tRunJob. A checkpoint is placed after the completion of every child job; it logs the execution activity to a given table, which helps in tracking the job status. At the end of the job, the tSendMail component sends mail to the authorized people, i.e., the stakeholders of the job, to mark the end of the master sequence job.
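The same master-sequence pattern can be mimicked outside Talend. A minimal Python sketch (hypothetical job commands; it assumes the child jobs have been exported as standalone scripts) runs each child job in order, records a checkpoint after each, and stops the sequence on the first failure:

```python
import subprocess
from datetime import datetime, timezone

def run_master_sequence(child_jobs, checkpoints):
    """Run exported child jobs in order, logging a checkpoint per job.

    child_jobs: list of shell commands (e.g. exported *_run.sh scripts).
    checkpoints: list collecting (command, return_code, timestamp) tuples.
    Returns True only if every child job succeeded.
    """
    for command in child_jobs:
        result = subprocess.run(command, shell=True)
        checkpoints.append(
            (command, result.returncode,
             datetime.now(timezone.utc).isoformat())
        )
        if result.returncode != 0:
            return False  # stop the sequence on the first failure
    return True
```

In the Talend version, each checkpoint is an insert into a tracking table, and the final notification comes from tSendMail rather than from the script itself.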

Master sequence of daily transactional data

2. Individual child jobs can have the main logic to be executed as shown below

Each individual child job contains the series of steps that achieves the desired data migration.

Individual child jobs can have the main logic to be executed as shown below

Tip 2: Scheduling a Talend Job to run at a given time interval

Once you’ve written your Talend job, you will likely want to run it every day, or even multiple times a day at specific intervals. Imagine how tiresome it would become after some time if you had to run the same job again and again, every day. It would be wonderful if the job could somehow run on its own. Thankfully, Talend provides exactly such a facility, and the process of having a job run by itself is called scheduling a job in Talend Open Studio.

In Talend Open Studio you can export your job in a number of ways, as listed below:

  1. Standalone Jobs
  2. Axis WebService (WAR)
  3. Axis WebService (ZIP)
  4. JBoss ESB
  5. Petals ESB
  6. OSGI Bundle for ESB

In our example snapshots, we have used the standalone type and exported the job as an archive file. There are a couple of other settings to configure while exporting, such as exporting the context parameters along with the job and selecting the build type as standalone.

Scheduling a Talend Job to run at given time interval

Once you have built your job as an archive file, you need to run it from the command line.

  • Running a Talend Job through command line in Windows

The archive file contains a bundle of executable files for different operating systems. For Windows, the .bat file is used. You can run this file with the following command:

            <jobName>_run.bat --context_param input=<not standard>

(if your job uses dynamic context parameters)

Now we can easily schedule this job through the Windows Task Scheduler: by setting a trigger, you select the desired time interval at which the job is expected to run.
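For illustration (hypothetical task name and job path), the same trigger can also be created from the command line with Windows’ built-in schtasks utility:

```shell
schtasks /Create /TN "TalendDailyMigration" ^
  /TR "C:\talend\jobs\DailyMigration\DailyMigration_run.bat" ^
  /SC HOURLY
```

This creates a task that runs the exported .bat file once every hour; the same options can be set through the Task Scheduler GUI shown below.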

Running a Talend Job through command line in Windows

  • Running a Talend Job through command line in Linux

To run the executable file on Linux, you can use the command given below:

bash <JobName>_run.sh --context_param input=<not standard>

To set up a task in Linux that runs on a schedule, you need to create a cron job. Type “crontab -e” to open the list of cron jobs; there you can schedule tasks, one per line, in the following format:

1 2 3 4 5 /path/to/command arg1 arg2

Each field means:

1: minute, 2: hour, 3: day of the month, 4: month, 5: day of the week

The following entry runs every day on an hourly basis.
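For illustration, a hypothetical crontab entry with that schedule (minute 0 of every hour, every day) could look like this, with the job’s output appended to a log file:

```shell
# min hour day-of-month month day-of-week  command
0 * * * * /opt/talend/jobs/DailyMigration_run.sh >> /var/log/talend_daily.log 2>&1
```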

Running a Talend Job through command line in Linux

Tip 3: A few other components in Talend that help achieve automation in data migration

One of the major advantages of Talend Open Studio is the variety of components it offers; few other ETL tools come close to this many options. Here are a few other components that help you run jobs without human interaction:

  1. tPreJob and tPostJob

When you automate your processes, you need to keep track of where your scripts start and where they end. These two components help control what happens before and after the content of a job executes: tPreJob is the entry point of the script, tPostJob is the exit point, and the core data migration logic is written between the two.

  2. tSendMail

The tSendMail component sends mail to the concerned stakeholders as and when required, for example to indicate the start and end of the process, to send important data to stakeholders, or to deliver daily reports as attachments.
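As a rough analogue outside Talend (hypothetical addresses; the SMTP host would be your own mail server), the kind of status mail tSendMail produces can be assembled with Python’s standard email library:

```python
import smtplib
from email.message import EmailMessage

def build_status_mail(job_name, status, sender, recipients):
    """Build a notification message like the one tSendMail would send."""
    msg = EmailMessage()
    msg["Subject"] = f"[{job_name}] migration job finished: {status}"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(
        f"The Talend job '{job_name}' ended with status: {status}."
    )
    return msg

def send_status_mail(msg, smtp_host="localhost"):
    # Requires a reachable SMTP server, so it is not exercised here.
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```

Attachments (e.g. daily reports) could be added with msg.add_attachment() before sending.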

  3. tPOP along with tFileInputMail

Consider a scenario in which the data migration for daily activities such as sales and purchases is performed on one server, while a visualization report on the health of these activities, built with tools like Tableau, Power BI, Jaspersoft, and so on, is generated on a different, remote server. Say that after every activity a CSV file is generated containing the records required for the visualization report on that other server. In this case, the user would ordinarily have to send the files to the other server manually. To help here, Talend offers the tPOP component along with tFileInputMail, which fetches email messages and attachment files from any mail server and downloads them automatically to the desired destination server, where they can then be used for the visualization reports.

Here is a snapshot of a system that uses all of the above automation components in Talend.

tPOP along with tFileInputMail


We can conclude by saying that Talend Open Studio can help automate most of your data migration tasks and thus ensure accurate data migration by reducing the errors introduced by human interaction. For more information on data migration using Talend Open Studio, you can get in touch with us.