PCM17 – the Pentaho Community Meeting 2017

Hi All,

This Saturday, #PCM17 takes place in Mainz, Germany. PCM17 is the Pentaho Community Meeting, which is held at different locations around the globe. As it happens “around the corner” this time, I will be there, and I am very excited. This is the 10th edition, and there are many interesting talks. Since there are two tracks – Business and Technical – I will have a hard time deciding where to go, but I will mostly stick to the technical track.

There are talks about separating business and IT rules in ETL jobs, “serverless” PDI, and Machine Learning – a topic I am especially interested in.

And – hey – CERN is giving a talk, and if there is anybody in the world who generates huge amounts of data to handle, it’s CERN.

IT-Novum, the organizer of the event, will do extensive blogging, so I will just lean back and enjoy the show – don’t expect a write-up on my blog.

Follow me on Twitter for comments, impressions and pictures.

Cheers

Andre

Re-Post: Julien Hofstede – Pentaho: Increase MySQL output to 80K rows/second in Pentaho Data Integration

Increase MySQL output to 80K rows/second in Pentaho Data Integration

One of our clients has a MySQL table with around 40M records. Loading the table took around 2.5 hours. When I was watching the statistics of the transformation, I noticed that the bottleneck was the write to the database: it was stuck at around 2,000 rows/second. You can imagine how long it takes to write 40M records at that speed.
I was looking into ways to improve the speed. There were a couple of options:
  1. Tune MySQL for better performance on Inserts
  2. Use the MySQL Bulk loader step in PDI
  3. Write SQL statements to a file with PDI and read them with the mysql binary

When I discussed this with one of my contacts at Basis06, it turned out they had faced a similar issue a while ago. He mentioned that the speed can be boosted by some simple JDBC connection settings.


useServerPrepStmts=false
rewriteBatchedStatements=true
useCompression=true

[[UPDATE 10/2018: In some environments – especially with a high network load – useServerPrepStmts=true is worth a try]]

These options should be entered in PDI on the connection: double-click the connection, go to Options, and set these values.
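If you prefer to set them directly on the connection string instead of the Options tab, the same parameters can be appended to the JDBC URL – hostname, port and database below are placeholders for your own environment:


jdbc:mysql://<hostname>:3306/<database>?useServerPrepStmts=false&rewriteBatchedStatements=true&useCompression=true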

Used together, useServerPrepStmts=false and rewriteBatchedStatements=true will “fake” batch inserts on the client. Specifically, the insert statements:


INSERT INTO t (c1,c2) VALUES ('One',1);
INSERT INTO t (c1,c2) VALUES ('Two',2);
INSERT INTO t (c1,c2) VALUES ('Three',3);

will be rewritten into:


INSERT INTO t (c1,c2) VALUES ('One',1),('Two',2),('Three',3);
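The same effect can be reproduced outside of PDI with plain JDBC. The following is a minimal sketch (URL, credentials and table are placeholders, assuming MySQL Connector/J is on the classpath) that queues inserts as a batch; with rewriteBatchedStatements=true the driver collapses the batch into a single multi-row INSERT like the one above:


import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; the options mirror the PDI connection settings
        String url = "jdbc:mysql://localhost:3306/test"
                + "?useServerPrepStmts=false&rewriteBatchedStatements=true&useCompression=true";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement("INSERT INTO t (c1,c2) VALUES (?,?)")) {
            String[] names = {"One", "Two", "Three"};
            for (int i = 0; i < names.length; i++) {
                ps.setString(1, names[i]);   // c1
                ps.setInt(2, i + 1);         // c2
                ps.addBatch();               // queue the row on the client
            }
            ps.executeBatch();               // driver rewrites the batch into one multi-row INSERT
        }
    }
}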

The third option, useCompression=true, compresses the traffic between the client and the MySQL server.

Finally, I increased the number of copies of the output step to 2 so that there are two threads inserting into the database.

All of this together increased the speed to around 84,000 rows per second! WOW!

 

Source: Julien Hofstede – Pentaho: Increase MySQL output to 80K rows/second in Pentaho Data Integration