This Saturday, #PCM17 takes place in Mainz, Germany. PCM17 is the Pentaho Community Meeting, which is held at different locations around the globe. As it happens “around the corner” this time, I will be there and I am very excited. This is the 10th edition and there are so many interesting talks. As there are two different tracks – Business and Technical – I will have a hard time deciding where to go, though I will mostly stick to the technical track.
There are talks about the separation of business and IT rules in ETL jobs, about “serverless” PDI and about machine learning, a topic I am particularly interested in.
And – hey – CERN is giving a talk, and if there is anybody in the world who generates a lot of data to handle, it’s CERN.
IT-Novum, the organizer of the event, will do extensive blogging, so I will just lean back and enjoy the show – don’t expect coverage on my blog.
Follow me on Twitter for comments, impressions and pictures.
Increase MySQL output to 80K rows/second in Pentaho Data Integration
One of our clients has a MySQL table with around 40M records. Loading the table took around 2.5 hours. When I was watching the statistics of the transformation, I noticed that the bottleneck was the write to the database: it was stuck at around 2,000 rows/second. You can imagine that it takes a long time to write 40M records at that speed.
I looked into ways to improve the speed. There were a couple of options:
- Tune MySQL for better insert performance
- Use the MySQL bulk loader step in PDI
- Write SQL statements to a file with PDI and read them with the mysql binary
When I discussed this with one of my contacts at Basis06, he mentioned they had faced a similar issue a while ago and that the speed can be boosted by some simple JDBC connection settings.
[[UPDATE 10/2018: In some environments – especially with a high network load – useServerPrepStmts=true is worth a try]]
These options should be entered in PDI on the connection: double-click the connection, go to Options and set the following values:
useServerPrepStmts=false
rewriteBatchedStatements=true
useCompression=true
The first option, useServerPrepStmts=false, makes the driver prepare statements on the client side, which is what enables the rewriting described next. The second option, rewriteBatchedStatements=true, will “fake” batch inserts on the client. Specifically, the insert statements:
INSERT INTO t (c1,c2) VALUES ('One',1);
INSERT INTO t (c1,c2) VALUES ('Two',2);
INSERT INTO t (c1,c2) VALUES ('Three',3);
will be rewritten into:
INSERT INTO t (c1,c2) VALUES ('One',1),('Two',2),('Three',3);
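Outside of PDI, this rewriting kicks in when rows are sent through a JDBC prepared-statement batch. A minimal sketch, assuming an already opened java.sql.Connection and the table t from the example above:

import java.sql.Connection;
import java.sql.PreparedStatement;

public class BatchInsert {
    // With rewriteBatchedStatements=true, Connector/J collapses this batch
    // into one multi-row INSERT like the rewritten statement shown above.
    static void insertBatch(Connection conn) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO t (c1, c2) VALUES (?, ?)")) {
            String[] names = {"One", "Two", "Three"};
            for (int i = 0; i < names.length; i++) {
                ps.setString(1, names[i]);
                ps.setInt(2, i + 1);
                ps.addBatch();   // queue the row client-side
            }
            ps.executeBatch();   // one round trip instead of three
        }
    }
}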
The third option, useCompression=true, compresses the traffic between the client and the MySQL server.
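In a plain Java program the same three settings can simply be appended to the JDBC URL – a minimal sketch, assuming a local MySQL instance and a hypothetical database name and credentials:

import java.sql.Connection;
import java.sql.DriverManager;

public class TunedConnection {
    public static void main(String[] args) throws Exception {
        // The options ride along as URL parameters to MySQL Connector/J –
        // the same thing PDI does under the hood with the Options tab.
        String url = "jdbc:mysql://localhost:3306/testdb"
                + "?useServerPrepStmts=false"
                + "&rewriteBatchedStatements=true"
                + "&useCompression=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "secret")) {
            System.out.println("Connected with batch rewriting and compression enabled.");
        }
    }
}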
Finally, I increased the number of copies of the output step to two, so that there are two threads inserting into the database.
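A rough equivalent outside PDI would be two writer threads, each with its own connection (JDBC connections should not be shared across threads) – a sketch reusing the hypothetical classes above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelWriters {
    public static void main(String[] args) {
        String url = "jdbc:mysql://localhost:3306/testdb"
                + "?useServerPrepStmts=false"
                + "&rewriteBatchedStatements=true"
                + "&useCompression=true";
        // Two threads mirror the two copies of the output step in PDI.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 2; i++) {
            pool.submit(() -> {
                try (Connection conn = DriverManager.getConnection(url, "user", "secret")) {
                    BatchInsert.insertBatch(conn);  // batch loop from the sketch above
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}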
All of this together increased the speed to around 84,000 rows per second. WOW!
Source: Julien Hofstede – Pentaho: Increase MySQL output to 80K rows/second in Pentaho Data Integration