Revenue Source

Welcome to the Revenue Source affiliate marketing forums.

You are viewing our internet marketing and SEO forums as a guest which gives you limited access to most of our discussions.  By joining our free community, you will have access to post affiliate marketing topics, communicate privately with other members (PM), exchange SEO strategies, and access many other special features.  Registration is fast, simple and absolutely free so please, join our community today!

If you have any problems, please don't hesitate to contact us.

Go Back   Revenue Source > Site Design & Development > Databases
Reload this Page Using GROUP BY WITH ROLLUP for Reporting Performance Optimization
Tags: , , , , ,

Reply
 
LinkBack Thread Tools Search this Thread
Old
  (#1 (permalink))
Affiliate Blogs is Offline
Revenue Source Veteran
Affiliate Blogs has a brilliant future here!
 
Affiliate Blogs's Avatar
 
Join Date: Oct 2005
Posts: 9,077
Jack of All Trades
CyberSpace United States
   
Using GROUP BY WITH ROLLUP for Reporting Performance Optimization - 09-17-2007

Quite typical query for reporting applications is to find top X values. If you analyze Web Site logs you would look at most popular web pages or search engine keywords which bring you most of the traffic. If you're looking at ecommerce reporting you may be interested in best selling product or top sales people. This information may often need simple select query, however what if you would like to show percents not just absolute value ?
For illustration purposes I've created a syntetic table filled with some 30mil rows evenly spread in 10.000 groups.
PLAIN TEXT
SQL:
  1. [FONT='Courier New', Courier, monospace]CREATE TABLE `dt` ([/font]
  2. [FONT='Courier New', Courier, monospace] `grp` int(10) UNSIGNED NOT NULL,[/font]
  3. [FONT='Courier New', Courier, monospace] `slack` varchar(50) DEFAULT NULL[/font]
  4. [FONT='Courier New', Courier, monospace]) ENGINE=MyISAM DEFAULT CHARSET=latin1 [/font]



And I'm using some silly like query to illustrate some search is required, so count(*) can't be optimized away for MyISAM Tables.
To show percents for the values we need to know total number of rows which matches our where clause.
Most Simple Way Number One is to simply run two queries:
PLAIN TEXT
SQL:
  1. [FONT='Courier New', Courier, monospace]mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp ORDER BY cnt DESC LIMIT 10;[/font]
  2. [FONT='Courier New', Courier, monospace]+------+-----+[/font]
  3. [FONT='Courier New', Courier, monospace]| grp | cnt |[/font]
  4. [FONT='Courier New', Courier, monospace]+------+-----+[/font]
  5. [FONT='Courier New', Courier, monospace]| 9879 | 300 | [/font]
  6. [FONT='Courier New', Courier, monospace]| 3888 | 298 | [/font]
  7. [FONT='Courier New', Courier, monospace]| 1793 | 297 | [/font]
  8. [FONT='Courier New', Courier, monospace]| 2082 | 294 | [/font]
  9. [FONT='Courier New', Courier, monospace]| 9729 | 293 | [/font]
  10. [FONT='Courier New', Courier, monospace]| 3760 | 292 | [/font]
  11. [FONT='Courier New', Courier, monospace]| 2588 | 290 | [/font]
  12. [FONT='Courier New', Courier, monospace]| 7918 | 290 | [/font]
  13. [FONT='Courier New', Courier, monospace]| 4055 | 290 | [/font]
  14. [FONT='Courier New', Courier, monospace]| 6950 | 290 | [/font]
  15. [FONT='Courier New', Courier, monospace]+------+-----+[/font]
  16. [FONT='Courier New', Courier, monospace]10 rows IN SET (21.12 sec)[/font]
  17. [FONT='Courier New', Courier, monospace] [/font]
  18. [FONT='Courier New', Courier, monospace]mysql> SELECT count(*) cnt FROM dt WHERE slack LIKE "a%";[/font]
  19. [FONT='Courier New', Courier, monospace]+---------+[/font]
  20. [FONT='Courier New', Courier, monospace]| cnt |[/font]
  21. [FONT='Courier New', Courier, monospace]+---------+[/font]
  22. [FONT='Courier New', Courier, monospace]| 2352996 | [/font]
  23. [FONT='Courier New', Courier, monospace]+---------+[/font]
  24. [FONT='Courier New', Courier, monospace]1 row IN SET (18.71 sec) [/font]



This allows us to get results in about 40 seconds and as we can see is pretty inefficient - almost half of the time is spent counting the rows which are already traversed and counted for group by operation.
The obvious optimization is to get rid of LIMIT 10 and just fetch all groups and sum values on the application side. Which is the second way.
PLAIN TEXT
SQL:
  1. [FONT='Courier New', Courier, monospace]mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp ORDER BY cnt DESC; [/font]
  2. [FONT='Courier New', Courier, monospace]+-------+-----+[/font]
  3. [FONT='Courier New', Courier, monospace]| grp | cnt |[/font]
  4. [FONT='Courier New', Courier, monospace]+-------+-----+[/font]
  5. [FONT='Courier New', Courier, monospace]| 9879 | 300 | [/font]
  6. [FONT='Courier New', Courier, monospace]| 3888 | 298 | [/font]
  7. [FONT='Courier New', Courier, monospace]| 1793 | 297 | [/font]
  8. [FONT='Courier New', Courier, monospace]...[/font]
  9. [FONT='Courier New', Courier, monospace]| 7195 | 180 | [/font]
  10. [FONT='Courier New', Courier, monospace]| 6703 | 178 | [/font]
  11. [FONT='Courier New', Courier, monospace]| 10000 | 107 | [/font]
  12. [FONT='Courier New', Courier, monospace]| 0 | 105 | [/font]
  13. [FONT='Courier New', Courier, monospace]+-------+-----+[/font]
  14. [FONT='Courier New', Courier, monospace]10001 rows IN SET (21.30 sec) [/font]



We even can got read of sorting and add ORDER BY NULL so MySQL does not bother to sort results, though this has a little difference in this case as number of groups is relatively small:
PLAIN TEXT
SQL:
  1. [FONT='Courier New', Courier, monospace]mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp ORDER BY NULL; [/font]
  2. [FONT='Courier New', Courier, monospace]+-------+-----+[/font]
  3. [FONT='Courier New', Courier, monospace]| grp | cnt |[/font]
  4. [FONT='Courier New', Courier, monospace]+-------+-----+[/font]
  5. [FONT='Courier New', Courier, monospace]| 2257 | 251 | [/font]
  6. [FONT='Courier New', Courier, monospace]| 2418 | 266 | [/font]
  7. [FONT='Courier New', Courier, monospace]| 5842 | 258 | [/font]
  8. [FONT='Courier New', Courier, monospace]| 90 | 276 | [/font]
  9. [FONT='Courier New', Courier, monospace]...[/font]
  10. [FONT='Courier New', Courier, monospace]| 2595 | 243 | [/font]
  11. [FONT='Courier New', Courier, monospace]| 4446 | 239 | [/font]
  12. [FONT='Courier New', Courier, monospace]| 9802 | 233 | [/font]
  13. [FONT='Courier New', Courier, monospace]| 1861 | 227 | [/font]
  14. [FONT='Courier New', Courier, monospace]+-------+-----+[/font]
  15. [FONT='Courier New', Courier, monospace]10001 rows IN SET (21.39 sec) [/font]



This method indeed works great if you have relatively small number of groups which you can fetch on the application side (you can store result in temporary table and run sum() and sort query on that table instead if amount of groups is much larger).
The other idea came to my mind is using GROUP BY WITH ROLLUP so I can get grouped result set together with total value for
the groups:
PLAIN TEXT
SQL:
  1. [FONT='Courier New', Courier, monospace]mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp WITH rollup ORDER BY cnt DESC LIMIT 11;[/font]
  2. [FONT='Courier New', Courier, monospace]ERROR 1221 (HY000): Incorrect usage of CUBE/ROLLUP AND ORDER BY [/font]



Oops. Bad luck - for some reason you can't use order by together with ROLLUP. This does not make much sense to me and I find it quite inconvenient whenever it is MySQL limitation or SQL Standard. This would be extremely useful feature and I do not see good technical reaso not allowing it either.
OK Lets see how fast it is without order by (and limit):
PLAIN TEXT
SQL:
  1. [FONT='Courier New', Courier, monospace]mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp WITH rollup; [/font]
  2. [FONT='Courier New', Courier, monospace]+-------+---------+[/font]
  3. [FONT='Courier New', Courier, monospace]| grp | cnt |[/font]
  4. [FONT='Courier New', Courier, monospace]+-------+---------+[/font]
  5. [FONT='Courier New', Courier, monospace]| 0 | 105 | [/font]
  6. [FONT='Courier New', Courier, monospace]| 1 | 259 | [/font]
  7. [FONT='Courier New', Courier, monospace]| 2 | 200 | [/font]
  8. [FONT='Courier New', Courier, monospace]...[/font]
  9. [FONT='Courier New', Courier, monospace]| 9999 | 235 | [/font]
  10. [FONT='Courier New', Courier, monospace]| 10000 | 107 | [/font]
  11. [FONT='Courier New', Courier, monospace]| | 2352996 | [/font]
  12. [FONT='Courier New', Courier, monospace]+-------+---------+[/font]
  13. [FONT='Courier New', Courier, monospace]10002 rows IN SET (28.79 sec) [/font]



As you can see there is considerable penalty associalted with GROUP BY WITH ROLLUP in MySQL - it is significantly slower than standard group by even though the only thing it needs to do is maintain an extra count.
So if MySQL does not allow us to use ORDER BY together with GROUP BY WITH ROLLUP can we still do something to get the result set we want from single query ? Sure. Here is Way Number Three:
PLAIN TEXT
SQL:
  1. [FONT='Courier New', Courier, monospace]mysql> SELECT * FROM (SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp WITH rollup) t ORDER BY cnt DESC LIMIT 11;[/font]
  2. [FONT='Courier New', Courier, monospace]+------+---------+[/font]
  3. [FONT='Courier New', Courier, monospace]| grp | cnt |[/font]
  4. [FONT='Courier New', Courier, monospace]+------+---------+[/font]
  5. [FONT='Courier New', Courier, monospace]| NULL | 2352996 | [/font]
  6. [FONT='Courier New', Courier, monospace]| 9879 | 300 | [/font]
  7. [FONT='Courier New', Courier, monospace]| 3888 | 298 | [/font]
  8. [FONT='Courier New', Courier, monospace]| 1793 | 297 | [/font]
  9. [FONT='Courier New', Courier, monospace]| 2082 | 294 | [/font]
  10. [FONT='Courier New', Courier, monospace]| 9729 | 293 | [/font]
  11. [FONT='Courier New', Courier, monospace]| 3760 | 292 | [/font]
  12. [FONT='Courier New', Courier, monospace]| 7918 | 290 | [/font]
  13. [FONT='Courier New', Courier, monospace]| 4055 | 290 | [/font]
  14. [FONT='Courier New', Courier, monospace]| 3749 | 290 | [/font]
  15. [FONT='Courier New', Courier, monospace]| 6950 | 290 | [/font]
  16. [FONT='Courier New', Courier, monospace]+------+---------+[/font]
  17. [FONT='Courier New', Courier, monospace]11 rows IN SET (29.68 sec) [/font]



Use of extra temporary table for buffering helps us to get result set we're looking for and workaround this limitation, though it of course comes with performance penalty.
This approach is however still faster than running 2 queries completing in 30 seconds rather than 40. Though fetching all result set in this case is still significantly faster. So GROUP BY WITH Rollup is not the killer solution for this problem, while it well could be if well implemented inside MySQL.
Why am I looking on reporting performance optimization ? It is for ClickAider project which is growing rapidly and use of MySQL for reporting quite expectedly starts to become the bottleneck. We use a lot of other black magic to keep things as optimal as possible though performance is still far from our goal of real time reporting on 10.000.000+ recorded events.
UPDATE:
Looking a bit further into it I found why GROUP BY WITH ROLLUP is slower and why ORDER BY does not work with it. The thing is - it is using filesort as group by execution method, not temporary table as ordinary GROUP BY:
PLAIN TEXT
SQL:
  1. [FONT='Courier New', Courier, monospace]mysql> EXPLAIN SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp \G[/font]
  2. [FONT='Courier New', Courier, monospace]*************************** 1. row ***************************[/font]
  3. [FONT='Courier New', Courier, monospace] id: 1[/font]
  4. [FONT='Courier New', Courier, monospace] select_type: SIMPLE[/font]
  5. [FONT='Courier New', Courier, monospace] TABLE: dt[/font]
  6. [FONT='Courier New', Courier, monospace] type: ALL[/font]
  7. [FONT='Courier New', Courier, monospace]possible_keys: NULL[/font]
  8. [FONT='Courier New', Courier, monospace] KEY: NULL[/font]
  9. [FONT='Courier New', Courier, monospace] key_len: NULL[/font]
  10. [FONT='Courier New', Courier, monospace] ref: NULL[/font]
  11. [FONT='Courier New', Courier, monospace] rows: 37748736[/font]
  12. [FONT='Courier New', Courier, monospace] Extra: USING WHERE; USING TEMPORARY; USING filesort[/font]
  13. [FONT='Courier New', Courier, monospace]1 row IN SET (0.01 sec)[/font]
  14. [FONT='Courier New', Courier, monospace] [/font]
  15. [FONT='Courier New', Courier, monospace]mysql> EXPLAIN SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp WITH rollup \G[/font]
  16. [FONT='Courier New', Courier, monospace]*************************** 1. row ***************************[/font]
  17. [FONT='Courier New', Courier, monospace] id: 1[/font]
  18. [FONT='Courier New', Courier, monospace] select_type: SIMPLE[/font]
  19. [FONT='Courier New', Courier, monospace] TABLE: dt[/font]
  20. [FONT='Courier New', Courier, monospace] type: ALL[/font]
  21. [FONT='Courier New', Courier, monospace]possible_keys: NULL[/font]
  22. [FONT='Courier New', Courier, monospace] KEY: NULL[/font]
  23. [FONT='Courier New', Courier, monospace] key_len: NULL[/font]
  24. [FONT='Courier New', Courier, monospace] ref: NULL[/font]
  25. [FONT='Courier New', Courier, monospace] rows: 37748736[/font]
  26. [FONT='Courier New', Courier, monospace] Extra: USING WHERE; USING filesort[/font]
  27. [FONT='Courier New', Courier, monospace]1 row IN SET (0.00 sec) [/font]



As I forced FileSort execution method for GROUP BY by using SQL_BIG_RESULT hint I can see GROUP BY executing about same time as GROUP BY WITH ROLLUP.


Using GROUP BY WITH ROLLUP for Reporting Performance Optimization - Read More...
  
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On

Similar Threads for: Using GROUP BY WITH ROLLUP for Reporting Performance Optimization
Thread Thread Starter Forum Replies Last Post
Andrew Johnstone's Blog: MySQL & PHP Performance Optimization Tips Affiliate Blogs Programming Help 0 07-25-2007 12:58 PM
System size and Performance Optimization Affiliate Blogs Databases 0 02-27-2007 04:12 PM
Using LoadAvg for Performance Optimization Affiliate Blogs Databases 0 12-04-2006 01:05 PM
The Reporting Initiative: Phase Two, Performance Reports Affiliate Marketing News Commission Junction 0 11-02-2006 04:43 PM
The Reporting Initiative: Phase Two, Performance Reports Affiliate Marketing News Commission Junction 0 10-17-2006 07:48 PM



© 2004-6 RevenueSource.com.  All rights reserved.  Do not duplicate or redistribute in any form.
This website and its logos/design are property of RevenueSource.com.  All rights reserved. vBSEO 3.2.0 RC7


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34