SQL GROUP BY: Aggregating Data Effectively

GROUP BY groups rows by column values and applies aggregate functions:

SELECT user_id, COUNT(*) as order_count
FROM orders
GROUP BY user_id;

This counts orders per user, collapsing multiple order rows into one summary row per user.

Basic Aggregation

Common aggregate functions:

SELECT 
    category,
    COUNT(*) as product_count,
    AVG(price) as avg_price,
    MIN(price) as min_price,
    MAX(price) as max_price,
    SUM(price) as total_value
FROM products
GROUP BY category;

How GROUP BY Works

Given this data:

orders:
| user_id | amount |
|---------|--------|
| 1       | 50     |
| 1       | 30     |
| 2       | 20     |
| 2       | 40     |
| 3       | 15     |

Grouped by user_id:

SELECT user_id, SUM(amount) as total
FROM orders
GROUP BY user_id;

| user_id | total |
|---------|-------|
| 1       | 80    |
| 2       | 60    |
| 3       | 15    |

Multiple Columns

Group by multiple columns:

SELECT 
    category,
    status,
    COUNT(*) as count
FROM products
GROUP BY category, status;

| category    | status | count |
|-------------|--------|-------|
| Electronics | active | 50    |
| Electronics | sold   | 30    |
| Clothing    | active | 40    |
| Clothing    | sold   | 25    |

HAVING Clause

Filter aggregated results with HAVING (not WHERE):

-- Users who placed more than 5 orders
SELECT user_id, COUNT(*) as order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5;

WHERE filters before grouping. HAVING filters after:

SELECT user_id, COUNT(*) as order_count
FROM orders
WHERE amount > 10          -- Filter individual rows first
GROUP BY user_id
HAVING COUNT(*) > 2;       -- Then filter groups

Non-Aggregated Columns

Every non-aggregated column must be in GROUP BY:

-- Wrong - name not in GROUP BY
SELECT user_id, name, COUNT(*)
FROM users
JOIN orders ON users.id = orders.user_id
GROUP BY user_id;

-- Correct
SELECT user_id, name, COUNT(*)
FROM users
JOIN orders ON users.id = orders.user_id
GROUP BY user_id, name;

Common Aggregate Functions

COUNT(*) - count all rows:

SELECT category, COUNT(*) FROM products GROUP BY category;

COUNT(column) - count non-NULL values:

SELECT COUNT(phone) FROM users;  -- Counts users with phone numbers

SUM - total:

SELECT SUM(amount) as revenue FROM orders;

AVG - average:

SELECT AVG(price) as avg_price FROM products;

MIN/MAX - extremes:

SELECT MIN(price), MAX(price) FROM products;

GROUP BY with JOIN

SELECT 
    users.name,
    COUNT(orders.id) as order_count,
    SUM(orders.amount) as total_spent
FROM users
LEFT JOIN orders ON users.id = orders.user_id
GROUP BY users.id, users.name;

Date Grouping

Group by date parts:

-- PostgreSQL
SELECT 
    DATE_TRUNC('month', created_at) as month,
    COUNT(*) as order_count
FROM orders
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month;

-- MySQL
SELECT 
    DATE_FORMAT(created_at, '%Y-%m') as month,
    COUNT(*) as order_count
FROM orders
GROUP BY DATE_FORMAT(created_at, '%Y-%m')
ORDER BY month;

DISTINCT vs GROUP BY

For unique values without aggregation:

-- Using DISTINCT
SELECT DISTINCT category FROM products;

-- Using GROUP BY
SELECT category FROM products GROUP BY category;

DISTINCT is simpler. Use GROUP BY when you need counts or other aggregates.

Ordering Results

SELECT category, COUNT(*) as count
FROM products
GROUP BY category
ORDER BY count DESC;

Order by aggregate values or grouped columns.

Conditional Aggregation

SELECT 
    category,
    COUNT(*) as total,
    COUNT(CASE WHEN price > 100 THEN 1 END) as expensive_count,
    COUNT(CASE WHEN price <= 100 THEN 1 END) as cheap_count
FROM products
GROUP BY category;

Common Mistakes

Forgetting to include non-aggregated columns:

-- Error
SELECT user_id, name, COUNT(*)
FROM orders
GROUP BY user_id;  -- Missing 'name'

Using WHERE instead of HAVING:

-- Wrong - WHERE can't use aggregates
SELECT user_id, COUNT(*) as count
FROM orders
WHERE COUNT(*) > 5  -- Error
GROUP BY user_id;

-- Correct
SELECT user_id, COUNT(*) as count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5;

Performance Tips

Index columns used in GROUP BY
Filter with WHERE before GROUP BY
Limit groups when possible
Consider covering indexes

Real-World Example

SELECT 
    DATE_TRUNC('day', created_at) as date,
    status,
    COUNT(*) as order_count,
    SUM(amount) as revenue,
    AVG(amount) as avg_order_value,
    COUNT(DISTINCT user_id) as unique_customers
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE_TRUNC('day', created_at), status
HAVING COUNT(*) >= 10
ORDER BY date DESC, revenue DESC;