Increase MySQL output to 80K rows/second in Pentaho Data Integration
One of our clients has a MySQL table with around 40M records. Loading the table took around 2.5 hours. When I was watching the statistics of the transformation, I noticed that the bottleneck was the write to the database: I was stuck at around 2,000 rows/second. You can imagine that it takes a long time to write 40M records at that speed.
I looked for ways to improve the speed. There were a couple of options:
- Tune MySQL for better performance on Inserts
- Use the MySQL Bulk loader step in PDI
- Write SQL statements to file with PDI and read them with mysql-binary
When I discussed this with one of my contacts at Basis06, it turned out they had faced a similar issue a while ago. He mentioned that the speed can be boosted with some simple JDBC connection settings.
These options should be entered in PDI on the connection. Double-click the connection, go to Options and set these values.
rewriteBatchedStatements=true will “fake” batch inserts on the client. Specifically, the insert statements:
INSERT INTO t (c1,c2) VALUES ('One',1);
INSERT INTO t (c1,c2) VALUES ('Two',2);
INSERT INTO t (c1,c2) VALUES ('Three',3);
will be rewritten into:
INSERT INTO t (c1,c2) VALUES ('One',1),('Two',2),('Three',3);
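To make the rewrite a bit more concrete, here is a minimal Python sketch of the idea: a batch of single-row inserts collapsed into one multi-row statement. It is only an illustration of the concept, not the driver's actual code. The function name rewrite_batch and the %s placeholder style are my own assumptions.

```python
from typing import Any, List, Sequence, Tuple


def rewrite_batch(table: str, columns: Sequence[str],
                  rows: Sequence[Sequence[Any]]) -> Tuple[str, List[Any]]:
    """Collapse a batch of single-row inserts into one multi-row INSERT.

    Returns the SQL text with one placeholder group per row, plus the
    flattened parameter list in matching order.
    """
    if not rows:
        raise ValueError("batch is empty")
    width = len(columns)
    if any(len(row) != width for row in rows):
        raise ValueError("every row needs one value per column")
    # One "(%s,%s)" group per row; values are bound separately, never inlined.
    group = "(" + ",".join(["%s"] * width) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ",".join(columns), ",".join([group] * len(rows)))
    params = [value for row in rows for value in row]
    return sql, params


# The three example rows from the statements above.
sql, params = rewrite_batch("t", ["c1", "c2"],
                            [("One", 1), ("Two", 2), ("Three", 3)])
print(sql)
print(params)
```

The point is that the server receives one statement instead of three, which saves a round trip per row.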
The third option, useCompression=true, compresses the traffic between the client and the MySQL server.
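Outside of PDI, the same options are normally passed as query parameters on the MySQL JDBC URL. The small Python sketch below only builds such a URL string so you can see what it looks like. The host, port and database name are placeholders I made up, and the helper mysql_jdbc_url is hypothetical.

```python
from urllib.parse import urlencode


def mysql_jdbc_url(host: str, port: int, database: str, options: dict) -> str:
    """Build a MySQL JDBC URL with driver options appended as a query string."""
    base = "jdbc:mysql://{}:{}/{}".format(host, port, database)
    return base + "?" + urlencode(options) if options else base


# Placeholder host and database; only the two options come from the text above.
url = mysql_jdbc_url("localhost", 3306, "dwh", {
    "rewriteBatchedStatements": "true",
    "useCompression": "true",
})
print(url)
```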
Finally, I increased the number of copies of the output step to 2, so that there are two threads inserting into the database.
All together, this increased the speed to around 84,000 rows a second! WOW!
Source: Julien Hofstede – Pentaho: Increase MySQL output to 80K rows/second in Pentaho Data Integration
I have been struggling with date/time calculations for the last couple of years, and meanwhile I have quite a collection that I would like to share. Note that I have avoided something like date_format(current_date,'%y-%m-01') because I don't find that very elegant.
Simple date calculations
Tomorrow
SELECT current_date + interval 1 day
Yesterday (you might guess...)
SELECT current_date - interval 1 day
A week ago
SELECT current_date - interval 1 week
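If you want to cross-check these outside of MySQL, the same arithmetic can be sketched in Python with a fixed reference date instead of current_date. This is just a sanity check under that assumption. The function name simple_dates and the example date are my own.

```python
from datetime import date, timedelta


def simple_dates(today: date) -> dict:
    """Mirror the simple MySQL interval expressions for a given date."""
    return {
        "tomorrow": today + timedelta(days=1),     # + interval 1 day
        "yesterday": today - timedelta(days=1),    # - interval 1 day
        "a_week_ago": today - timedelta(weeks=1),  # - interval 1 week
    }


# Fixed reference date so the result is reproducible.
for name, value in simple_dates(date(2024, 3, 15)).items():
    print(name, value)
```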
Rather complex date calculations
The first day of last month
SELECT last_day(current_date - interval 2 month) + interval 1 day
The last day of last month
SELECT last_day(current_date - interval 1 month)
The last day of last year
SELECT current_date - INTERVAL DAYOFYEAR(current_date) DAY
The first day of this year
SELECT current_date - INTERVAL DAYOFYEAR(current_date)-1 DAY
Monday of this week
SELECT current_date - INTERVAL weekday(current_date) day
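The more complex expressions are easy to get wrong around month and year boundaries, so here is a rough Python sketch that computes the same dates for a fixed reference date. It is meant as a way to double-check the queries, not as a replacement for them. The helper names are my own. Python's weekday() counts Monday as 0, just like MySQL's WEEKDAY(), so the last query (subtracting weekday(current_date) days) lands on the Monday of the current week.

```python
from datetime import date, timedelta


def first_day_of_month(d: date) -> date:
    """Return the first day of the month that d falls in."""
    return d.replace(day=1)


def complex_dates(today: date) -> dict:
    """Mirror the more complex MySQL date expressions for a given date."""
    # The day before the 1st of this month is the last day of last month.
    last_of_prev_month = first_day_of_month(today) - timedelta(days=1)
    first_of_year = date(today.year, 1, 1)
    return {
        "first_day_of_last_month": first_day_of_month(last_of_prev_month),
        "last_day_of_last_month": last_of_prev_month,
        "last_day_of_last_year": first_of_year - timedelta(days=1),
        "first_day_of_this_year": first_of_year,
        # weekday() is 0 on Monday, so this steps back to Monday.
        "monday_of_this_week": today - timedelta(days=today.weekday()),
    }


# A date in a leap year, so the end of February is exercised as well.
for name, value in complex_dates(date(2024, 3, 15)).items():
    print(name, value)
```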
If you have more to add, please feel free to put them into the comments and I will happily share them here.