What performs better, SQL FILTER or CASE?
I recently found an interesting question on Twitter: is there any performance impact of using FILTER in SQL (PostgreSQL, specifically), or is it just syntactic sugar for a CASE expression in an aggregate function?
As a quick reminder, FILTER is an awesome standard SQL extension that filters out values before they are aggregated. This is very useful when aggregating multiple things in a single query.
These two are the same:
SELECT
  fa.actor_id,

  -- These:
  SUM(length) FILTER (WHERE rating = 'R'),
  SUM(length) FILTER (WHERE rating = 'PG'),

  -- Are the same as these:
  SUM(CASE WHEN rating = 'R' THEN length END),
  SUM(CASE WHEN rating = 'PG' THEN length END)
FROM film_actor AS fa
LEFT JOIN film AS f
  ON f.film_id = fa.film_id
GROUP BY fa.actor_id
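FILTER isn't limited to SUM, either: it works with any aggregate function. A quick sketch against the Sakila film table (the column aliases are made up for illustration):

SELECT
  rating,
  COUNT(*) FILTER (WHERE length > 120) AS long_films,
  AVG(rental_rate) FILTER (WHERE length > 120) AS avg_rate_of_long_films
FROM film
GROUP BY rating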
As of jOOQ 3.17, these SQL dialects are known to support FILTER natively:
- CockroachDB
- Firebird
- H2
- HSQLDB
- PostgreSQL
- SQLite
- YugabyteDB
Should it matter?
But back to the question: does it really matter in terms of performance? Should it? Obviously, it shouldn't matter. The two types of aggregate function expressions can be proven to mean exactly the same thing, and in fact, that's what jOOQ does if you're using FILTER on any other SQL dialect. Put the above query into our SQL translation tool and translate it to Oracle, for example, and you'll get:
SELECT
  fa.actor_id,
  sum(CASE WHEN rating = 'R' THEN length END),
  sum(CASE WHEN rating = 'PG' THEN length END),
  sum(CASE WHEN rating = 'R' THEN length END),
  sum(CASE WHEN rating = 'PG' THEN length END)
FROM film_actor fa
LEFT JOIN film f
  ON f.film_id = fa.film_id
GROUP BY fa.actor_id
The reverse rewrite should be possible in an optimiser as well.
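In other words, an optimiser could, in principle, recognise the AGG(CASE ..) pattern and normalise it to the FILTER form before execution. A sketch of what such a rewrite would do (not something PostgreSQL is known to do, as we'll see):

-- An optimiser could treat this expression ...
SUM(CASE WHEN rating = 'R' THEN length END)

-- ... exactly as if it had been written like this:
SUM(length) FILTER (WHERE rating = 'R')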
Does it matter?
But is this rewrite actually being done? Let's compare the following two queries on PostgreSQL, against the Sakila database:
Query 1:
SELECT
  fa.actor_id,
  SUM(length) FILTER (WHERE rating = 'R'),
  SUM(length) FILTER (WHERE rating = 'PG')
FROM film_actor AS fa
LEFT JOIN film AS f
  ON f.film_id = fa.film_id
GROUP BY fa.actor_id
Query 2:
SELECT
  fa.actor_id,
  SUM(CASE WHEN rating = 'R' THEN length END),
  SUM(CASE WHEN rating = 'PG' THEN length END)
FROM film_actor AS fa
LEFT JOIN film AS f
  ON f.film_id = fa.film_id
GROUP BY fa.actor_id
I will be using this benchmark technique, and I will post the benchmark code at the end of this blog post. The results of running each query 500x are clear (lower is better):
Run 1, Statement 1: 00:00:00.786621
Run 1, Statement 2: 00:00:00.839966
Run 2, Statement 1: 00:00:00.775477
Run 2, Statement 2: 00:00:00.829746
Run 3, Statement 1: 00:00:00.774942
Run 3, Statement 2: 00:00:00.834745
Run 4, Statement 1: 00:00:00.776973
Run 4, Statement 2: 00:00:00.836655
Run 5, Statement 1: 00:00:00.775871
Run 5, Statement 2: 00:00:00.845209
There's a consistent 8% performance penalty for using the CASE syntax compared to the FILTER syntax on my machine, running PostgreSQL 15 in Docker. The actual difference in a non-benchmark query may be smaller or larger, depending on hardware and data sets, but clearly, one syntax performs a bit better than the other in this case.
Since these types of syntaxes are typically used in a reporting context, the difference can definitely matter.
Adding an auxiliary predicate
You might think there's additional optimisation potential if we make the predicates on the RATING column redundant, like this:
Query 1:
SELECT
  fa.actor_id,
  SUM(length) FILTER (WHERE rating = 'R'),
  SUM(length) FILTER (WHERE rating = 'PG')
FROM film_actor AS fa
LEFT JOIN film AS f
  ON f.film_id = fa.film_id
  AND rating IN ('R', 'PG') -- Redundant predicate here
GROUP BY fa.actor_id
Query 2:
SELECT
  fa.actor_id,
  SUM(CASE WHEN rating = 'R' THEN length END),
  SUM(CASE WHEN rating = 'PG' THEN length END)
FROM film_actor AS fa
LEFT JOIN film AS f
  ON f.film_id = fa.film_id
  AND rating IN ('R', 'PG')
GROUP BY fa.actor_id
Note that the predicate has to be placed in the LEFT JOIN's ON clause in order not to tamper with the results. It can't be placed in the query's WHERE clause, where it would effectively turn the LEFT JOIN into an INNER JOIN and remove actors who have no R or PG rated films at all. An explanation for this difference is here.
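For illustration only (this variant is not part of the benchmark), here's what the incorrect WHERE version would look like:

SELECT
  fa.actor_id,
  SUM(length) FILTER (WHERE rating = 'R'),
  SUM(length) FILTER (WHERE rating = 'PG')
FROM film_actor AS fa
LEFT JOIN film AS f
  ON f.film_id = fa.film_id
-- Wrong: removes actors who have no R or PG rated films at all
WHERE rating IN ('R', 'PG')
GROUP BY fa.actor_id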
What will the benchmark yield now?
Run 1, Statement 1: 00:00:00.701943
Run 1, Statement 2: 00:00:00.747103
Run 2, Statement 1: 00:00:00.69377
Run 2, Statement 2: 00:00:00.746252
Run 3, Statement 1: 00:00:00.684777
Run 3, Statement 2: 00:00:00.745419
Run 4, Statement 1: 00:00:00.688584
Run 4, Statement 2: 00:00:00.740979
Run 5, Statement 1: 00:00:00.688878
Run 5, Statement 2: 00:00:00.742864
So, indeed, the redundant predicate improved things. In a perfect world, it shouldn't have made a difference, but here we are: the optimiser doesn't optimise this as well as it could. Still, the FILTER syntax outperforms the CASE syntax.
Conclusion
In a perfect world, two provably equivalent SQL syntaxes also perform the same way. But this isn’t always the case in the real world, where optimisers make tradeoffs between:
- Time spent optimising rare syntaxes
- Time spent executing queries
In a previous blog post (which is probably outdated by now), I've shown a lot of such cases, where the optimisation doesn't depend on any cost model or data set and should, ideally, always be applied. Such optimisations tend to be favoured by RDBMS that have an execution plan cache (e.g. Db2, Oracle, SQL Server), where the optimisation needs to be done only once per cached plan, which can then be reused. In RDBMS that don't have such a cache, optimisation time adds to the cost of every query execution, so less can be expected.
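For what it's worth, PostgreSQL does cache plans within a session for prepared statements, which is where such a one-time rewrite would amortise best. A sketch (the statement name is made up):

-- Parsed and rewritten once:
PREPARE actor_lengths AS
SELECT
  fa.actor_id,
  SUM(length) FILTER (WHERE rating = 'R'),
  SUM(length) FILTER (WHERE rating = 'PG')
FROM film_actor AS fa
LEFT JOIN film AS f
  ON f.film_id = fa.film_id
GROUP BY fa.actor_id;

-- Subsequent executions can reuse the cached plan:
EXECUTE actor_lengths;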
I think this is a case where it's worth looking into simple patterns of expressions in aggregate functions. AGG(CASE ..) is such a popular idiom, and 8% such a significant improvement, that I think PostgreSQL should fix this. We'll see. In any case, since FILTER is already:
- Better performing
- Better looking
… you can safely switch to this nice standard SQL syntax today.
Benchmarking
While in this case the improvement is worth it irrespective of actual measurements (performance can hardly get worse, and readability actually improves), always be careful with such benchmark results. Do measure things yourself, and if you cannot reproduce a performance problem, don't go touching perfectly sound logic just because a blog post told you so.
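For example, a simple way to check things on your own data set is to compare the plans and timings of both variants directly:

EXPLAIN (ANALYZE, BUFFERS)
SELECT
  fa.actor_id,
  SUM(length) FILTER (WHERE rating = 'R'),
  SUM(length) FILTER (WHERE rating = 'PG')
FROM film_actor AS fa
LEFT JOIN film AS f
  ON f.film_id = fa.film_id
GROUP BY fa.actor_id;

-- ... and then the same with the CASE version of the query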
Benchmark code
As promised, this was the benchmark code used for this blog post:
DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 500;
  rec RECORD;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT
          fa.actor_id,
          SUM(length) FILTER (WHERE rating = 'R'),
          SUM(length) FILTER (WHERE rating = 'PG')
        FROM film_actor AS fa
        LEFT JOIN film AS f
          ON f.film_id = fa.film_id
          AND rating IN ('R', 'PG')
        GROUP BY fa.actor_id
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, (clock_timestamp() - v_ts);
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT
          fa.actor_id,
          SUM(CASE WHEN rating = 'R' THEN length END),
          SUM(CASE WHEN rating = 'PG' THEN length END)
        FROM film_actor AS fa
        LEFT JOIN film AS f
          ON f.film_id = fa.film_id
          AND rating IN ('R', 'PG')
        GROUP BY fa.actor_id
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, (clock_timestamp() - v_ts);
    RAISE INFO '';
  END LOOP;
END$$;
The benchmark technique is described here.