| Revenue Source Veteran
Join Date: Oct 2005 Posts: 9,077 Jack of All Trades
CyberSpace
| Using GROUP BY WITH ROLLUP for Reporting Performance Optimization -
09-17-2007
Quite typical query for reporting applications is to find top X values. If you analyze Web Site logs you would look at most popular web pages or search engine keywords which bring you most of the traffic. If you're looking at ecommerce reporting you may be interested in best selling product or top sales people. This information may often need simple select query, however what if you would like to show percents not just absolute value ?
For illustration purposes I've created a syntetic table filled with some 30mil rows evenly spread in 10.000 groups. PLAIN TEXT
SQL: - [FONT='Courier New', Courier, monospace]CREATE TABLE `dt` ([/font]
- [FONT='Courier New', Courier, monospace] `grp` int(10) UNSIGNED NOT NULL,[/font]
- [FONT='Courier New', Courier, monospace] `slack` varchar(50) DEFAULT NULL[/font]
- [FONT='Courier New', Courier, monospace]) ENGINE=MyISAM DEFAULT CHARSET=latin1 [/font]
And I'm using some silly like query to illustrate some search is required, so count(*) can't be optimized away for MyISAM Tables.
To show percents for the values we need to know total number of rows which matches our where clause.
Most Simple Way Number One is to simply run two queries: PLAIN TEXT
SQL: - [FONT='Courier New', Courier, monospace]mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp ORDER BY cnt DESC LIMIT 10;[/font]
- [FONT='Courier New', Courier, monospace]+------+-----+[/font]
- [FONT='Courier New', Courier, monospace]| grp | cnt |[/font]
- [FONT='Courier New', Courier, monospace]+------+-----+[/font]
- [FONT='Courier New', Courier, monospace]| 9879 | 300 | [/font]
- [FONT='Courier New', Courier, monospace]| 3888 | 298 | [/font]
- [FONT='Courier New', Courier, monospace]| 1793 | 297 | [/font]
- [FONT='Courier New', Courier, monospace]| 2082 | 294 | [/font]
- [FONT='Courier New', Courier, monospace]| 9729 | 293 | [/font]
- [FONT='Courier New', Courier, monospace]| 3760 | 292 | [/font]
- [FONT='Courier New', Courier, monospace]| 2588 | 290 | [/font]
- [FONT='Courier New', Courier, monospace]| 7918 | 290 | [/font]
- [FONT='Courier New', Courier, monospace]| 4055 | 290 | [/font]
- [FONT='Courier New', Courier, monospace]| 6950 | 290 | [/font]
- [FONT='Courier New', Courier, monospace]+------+-----+[/font]
- [FONT='Courier New', Courier, monospace]10 rows IN SET (21.12 sec)[/font]
- [FONT='Courier New', Courier, monospace] [/font]
- [FONT='Courier New', Courier, monospace]mysql> SELECT count(*) cnt FROM dt WHERE slack LIKE "a%";[/font]
- [FONT='Courier New', Courier, monospace]+---------+[/font]
- [FONT='Courier New', Courier, monospace]| cnt |[/font]
- [FONT='Courier New', Courier, monospace]+---------+[/font]
- [FONT='Courier New', Courier, monospace]| 2352996 | [/font]
- [FONT='Courier New', Courier, monospace]+---------+[/font]
- [FONT='Courier New', Courier, monospace]1 row IN SET (18.71 sec) [/font]
This allows us to get results in about 40 seconds and as we can see is pretty inefficient - almost half of the time is spent counting the rows which are already traversed and counted for group by operation.
The obvious optimization is to get rid of LIMIT 10 and just fetch all groups and sum values on the application side. Which is the second way. PLAIN TEXT
SQL: - [FONT='Courier New', Courier, monospace]mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp ORDER BY cnt DESC; [/font]
- [FONT='Courier New', Courier, monospace]+-------+-----+[/font]
- [FONT='Courier New', Courier, monospace]| grp | cnt |[/font]
- [FONT='Courier New', Courier, monospace]+-------+-----+[/font]
- [FONT='Courier New', Courier, monospace]| 9879 | 300 | [/font]
- [FONT='Courier New', Courier, monospace]| 3888 | 298 | [/font]
- [FONT='Courier New', Courier, monospace]| 1793 | 297 | [/font]
- [FONT='Courier New', Courier, monospace]...[/font]
- [FONT='Courier New', Courier, monospace]| 7195 | 180 | [/font]
- [FONT='Courier New', Courier, monospace]| 6703 | 178 | [/font]
- [FONT='Courier New', Courier, monospace]| 10000 | 107 | [/font]
- [FONT='Courier New', Courier, monospace]| 0 | 105 | [/font]
- [FONT='Courier New', Courier, monospace]+-------+-----+[/font]
- [FONT='Courier New', Courier, monospace]10001 rows IN SET (21.30 sec) [/font]
We even can got read of sorting and add ORDER BY NULL so MySQL does not bother to sort results, though this has a little difference in this case as number of groups is relatively small: PLAIN TEXT
SQL: - [FONT='Courier New', Courier, monospace]mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp ORDER BY NULL; [/font]
- [FONT='Courier New', Courier, monospace]+-------+-----+[/font]
- [FONT='Courier New', Courier, monospace]| grp | cnt |[/font]
- [FONT='Courier New', Courier, monospace]+-------+-----+[/font]
- [FONT='Courier New', Courier, monospace]| 2257 | 251 | [/font]
- [FONT='Courier New', Courier, monospace]| 2418 | 266 | [/font]
- [FONT='Courier New', Courier, monospace]| 5842 | 258 | [/font]
- [FONT='Courier New', Courier, monospace]| 90 | 276 | [/font]
- [FONT='Courier New', Courier, monospace]...[/font]
- [FONT='Courier New', Courier, monospace]| 2595 | 243 | [/font]
- [FONT='Courier New', Courier, monospace]| 4446 | 239 | [/font]
- [FONT='Courier New', Courier, monospace]| 9802 | 233 | [/font]
- [FONT='Courier New', Courier, monospace]| 1861 | 227 | [/font]
- [FONT='Courier New', Courier, monospace]+-------+-----+[/font]
- [FONT='Courier New', Courier, monospace]10001 rows IN SET (21.39 sec) [/font]
This method indeed works great if you have relatively small number of groups which you can fetch on the application side (you can store result in temporary table and run sum() and sort query on that table instead if amount of groups is much larger).
The other idea came to my mind is using GROUP BY WITH ROLLUP so I can get grouped result set together with total value for
the groups: PLAIN TEXT
SQL: - [FONT='Courier New', Courier, monospace]mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp WITH rollup ORDER BY cnt DESC LIMIT 11;[/font]
- [FONT='Courier New', Courier, monospace]ERROR 1221 (HY000): Incorrect usage of CUBE/ROLLUP AND ORDER BY [/font]
Oops. Bad luck - for some reason you can't use order by together with ROLLUP. This does not make much sense to me and I find it quite inconvenient whenever it is MySQL limitation or SQL Standard. This would be extremely useful feature and I do not see good technical reaso not allowing it either.
OK Lets see how fast it is without order by (and limit): PLAIN TEXT
SQL: - [FONT='Courier New', Courier, monospace]mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp WITH rollup; [/font]
- [FONT='Courier New', Courier, monospace]+-------+---------+[/font]
- [FONT='Courier New', Courier, monospace]| grp | cnt |[/font]
- [FONT='Courier New', Courier, monospace]+-------+---------+[/font]
- [FONT='Courier New', Courier, monospace]| 0 | 105 | [/font]
- [FONT='Courier New', Courier, monospace]| 1 | 259 | [/font]
- [FONT='Courier New', Courier, monospace]| 2 | 200 | [/font]
- [FONT='Courier New', Courier, monospace]...[/font]
- [FONT='Courier New', Courier, monospace]| 9999 | 235 | [/font]
- [FONT='Courier New', Courier, monospace]| 10000 | 107 | [/font]
- [FONT='Courier New', Courier, monospace]| | 2352996 | [/font]
- [FONT='Courier New', Courier, monospace]+-------+---------+[/font]
- [FONT='Courier New', Courier, monospace]10002 rows IN SET (28.79 sec) [/font]
As you can see there is considerable penalty associalted with GROUP BY WITH ROLLUP in MySQL - it is significantly slower than standard group by even though the only thing it needs to do is maintain an extra count.
So if MySQL does not allow us to use ORDER BY together with GROUP BY WITH ROLLUP can we still do something to get the result set we want from single query ? Sure. Here is Way Number Three: PLAIN TEXT
SQL: - [FONT='Courier New', Courier, monospace]mysql> SELECT * FROM (SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp WITH rollup) t ORDER BY cnt DESC LIMIT 11;[/font]
- [FONT='Courier New', Courier, monospace]+------+---------+[/font]
- [FONT='Courier New', Courier, monospace]| grp | cnt |[/font]
- [FONT='Courier New', Courier, monospace]+------+---------+[/font]
- [FONT='Courier New', Courier, monospace]| NULL | 2352996 | [/font]
- [FONT='Courier New', Courier, monospace]| 9879 | 300 | [/font]
- [FONT='Courier New', Courier, monospace]| 3888 | 298 | [/font]
- [FONT='Courier New', Courier, monospace]| 1793 | 297 | [/font]
- [FONT='Courier New', Courier, monospace]| 2082 | 294 | [/font]
- [FONT='Courier New', Courier, monospace]| 9729 | 293 | [/font]
- [FONT='Courier New', Courier, monospace]| 3760 | 292 | [/font]
- [FONT='Courier New', Courier, monospace]| 7918 | 290 | [/font]
- [FONT='Courier New', Courier, monospace]| 4055 | 290 | [/font]
- [FONT='Courier New', Courier, monospace]| 3749 | 290 | [/font]
- [FONT='Courier New', Courier, monospace]| 6950 | 290 | [/font]
- [FONT='Courier New', Courier, monospace]+------+---------+[/font]
- [FONT='Courier New', Courier, monospace]11 rows IN SET (29.68 sec) [/font]
Use of extra temporary table for buffering helps us to get result set we're looking for and workaround this limitation, though it of course comes with performance penalty.
This approach is however still faster than running 2 queries completing in 30 seconds rather than 40. Though fetching all result set in this case is still significantly faster. So GROUP BY WITH Rollup is not the killer solution for this problem, while it well could be if well implemented inside MySQL.
Why am I looking on reporting performance optimization ? It is for ClickAider project which is growing rapidly and use of MySQL for reporting quite expectedly starts to become the bottleneck. We use a lot of other black magic to keep things as optimal as possible though performance is still far from our goal of real time reporting on 10.000.000+ recorded events. UPDATE:
Looking a bit further into it I found why GROUP BY WITH ROLLUP is slower and why ORDER BY does not work with it. The thing is - it is using filesort as group by execution method, not temporary table as ordinary GROUP BY: PLAIN TEXT
SQL: - [FONT='Courier New', Courier, monospace]mysql> EXPLAIN SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp \G[/font]
- [FONT='Courier New', Courier, monospace]*************************** 1. row ***************************[/font]
- [FONT='Courier New', Courier, monospace] id: 1[/font]
- [FONT='Courier New', Courier, monospace] select_type: SIMPLE[/font]
- [FONT='Courier New', Courier, monospace] TABLE: dt[/font]
- [FONT='Courier New', Courier, monospace] type: ALL[/font]
- [FONT='Courier New', Courier, monospace]possible_keys: NULL[/font]
- [FONT='Courier New', Courier, monospace] KEY: NULL[/font]
- [FONT='Courier New', Courier, monospace] key_len: NULL[/font]
- [FONT='Courier New', Courier, monospace] ref: NULL[/font]
- [FONT='Courier New', Courier, monospace] rows: 37748736[/font]
- [FONT='Courier New', Courier, monospace] Extra: USING WHERE; USING TEMPORARY; USING filesort[/font]
- [FONT='Courier New', Courier, monospace]1 row IN SET (0.01 sec)[/font]
- [FONT='Courier New', Courier, monospace] [/font]
- [FONT='Courier New', Courier, monospace]mysql> EXPLAIN SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp WITH rollup \G[/font]
- [FONT='Courier New', Courier, monospace]*************************** 1. row ***************************[/font]
- [FONT='Courier New', Courier, monospace] id: 1[/font]
- [FONT='Courier New', Courier, monospace] select_type: SIMPLE[/font]
- [FONT='Courier New', Courier, monospace] TABLE: dt[/font]
- [FONT='Courier New', Courier, monospace] type: ALL[/font]
- [FONT='Courier New', Courier, monospace]possible_keys: NULL[/font]
- [FONT='Courier New', Courier, monospace] KEY: NULL[/font]
- [FONT='Courier New', Courier, monospace] key_len: NULL[/font]
- [FONT='Courier New', Courier, monospace] ref: NULL[/font]
- [FONT='Courier New', Courier, monospace] rows: 37748736[/font]
- [FONT='Courier New', Courier, monospace] Extra: USING WHERE; USING filesort[/font]
- [FONT='Courier New', Courier, monospace]1 row IN SET (0.00 sec) [/font]
As I forced FileSort execution method for GROUP BY by using SQL_BIG_RESULT hint I can see GROUP BY executing about same time as GROUP BY WITH ROLLUP. Using GROUP BY WITH ROLLUP for Reporting Performance Optimization - Read More... |