PROGRAMMING LANGUAGES

What performs better, SQL FILTER or CASE?

I’ve found an interesting question on Twitter, recently. Is there any performance impact of using FILTER in SQL (PostgreSQL, specifically), or is it just syntax sugar for a CASE expression in an aggregate function?

As a quick reminder, FILTER is an awesome standard SQL extension to filter out values before aggregating them in SQL. This is very useful when aggregating multiple things in a single query.

These two are the same:

SELECT 
  fa.actor_id,

  -- These:
  SUM(length) FILTER (WHERE rating = 'R'),
  SUM(length) FILTER (WHERE rating = 'PG'),

  -- Are the same as these:
  SUM(CASE WHEN rating = 'R' THEN length END),
  SUM(CASE WHEN rating = 'PG' THEN length END)
FROM film_actor AS fa 
LEFT JOIN film AS f 
  ON f.film_id = fa.film_id
GROUP BY fa.actor_id

As of jOOQ 3.17, these SQL dialects are known to support FILTER natively:

  • CockroachDB
  • Firebird
  • H2
  • HSQLDB
  • PostgreSQL
  • SQLite
  • YugabyteDB

Should it matter?

But back to the question. Does it really matter in terms of performance? Should it? Obviously, it shouldn’t matter. The two types of aggregate function expressions can be proven to mean exactly the same thing. And in fact, that’s what jOOQ does if you’re using FILTER on any other SQL dialect. Put the above query in our SQL translation tool, translate to Oracle, for example, and you’ll be getting:

SELECT
  fa.actor_id,
  sum(CASE WHEN rating = 'R' THEN length END),
  sum(CASE WHEN rating = 'PG' THEN length END),
  sum(CASE WHEN rating = 'R' THEN length END),
  sum(CASE WHEN rating = 'PG' THEN length END)
FROM film_actor fa
  LEFT JOIN film f
    ON f.film_id = fa.film_id
GROUP BY fa.actor_id

The other way should be possible as well in an optimiser.

Does it matter?

But is this being done? Let’s try comparing the following 2 queries on PostgreSQL, against the sakila database:

Query 1:

SELECT 
  fa.actor_id,
  SUM(length) FILTER (WHERE rating = 'R'),
  SUM(length) FILTER (WHERE rating = 'PG')
FROM film_actor AS fa 
LEFT JOIN film AS f 
  ON f.film_id = fa.film_id
GROUP BY fa.actor_id

Query 2:

SELECT 
  fa.actor_id,
  SUM(CASE WHEN rating = 'R' THEN length END),
  SUM(CASE WHEN rating = 'PG' THEN length END)
FROM film_actor AS fa 
LEFT JOIN film AS f 
  ON f.film_id = fa.film_id
GROUP BY fa.actor_id

I will be using this benchmark technique, and will post the benchmark code at the end of this blog post. The results of running each query 500x are clear (less time is better):

Run 1, Statement 1: 00:00:00.786621
Run 1, Statement 2: 00:00:00.839966

Run 2, Statement 1: 00:00:00.775477
Run 2, Statement 2: 00:00:00.829746

Run 3, Statement 1: 00:00:00.774942
Run 3, Statement 2: 00:00:00.834745

Run 4, Statement 1: 00:00:00.776973
Run 4, Statement 2: 00:00:00.836655

Run 5, Statement 1: 00:00:00.775871
Run 5, Statement 2: 00:00:00.845209

There’s a consistent 8% performance penalty for using the CASE syntax, compared to the FILTER syntax on my machine, running PostgreSQL 15 in docker. The actual difference in a non-benchmark query may not be as impressive, or more impressive, depending on hardware and data sets. But clearly, one thing seems to be a bit better in this case than the other.

Since these types of syntaxes are typically used in a reporting context, the differences can definitely matter.

Adding an auxiliary predicate

You might think there’s additional optimisation potential, if we make the predicates on the RATING column redundant, like this:

Query 1:

SELECT 
  fa.actor_id,
  SUM(length) FILTER (WHERE rating = 'R'),
  SUM(length) FILTER (WHERE rating = 'PG')
FROM film_actor AS fa 
LEFT JOIN film AS f 
  ON f.film_id = fa.film_id
  AND rating IN ('R', 'PG') -- Redundant predicate here
GROUP BY fa.actor_id

Query 2:

SELECT 
  fa.actor_id,
  SUM(CASE WHEN rating = 'R' THEN length END),
  SUM(CASE WHEN rating = 'PG' THEN length END)
FROM film_actor AS fa 
LEFT JOIN film AS f 
  ON f.film_id = fa.film_id
  AND rating IN ('R', 'PG')
GROUP BY fa.actor_id

Note it has to be placed in the LEFT JOIN‘s ON clause, in order not to tamper with the results. It can’t be placed in the query’s WHERE clause. An explanation for this difference is here.

What will the benchmark yield now?

Run 1, Statement 1: 00:00:00.701943
Run 1, Statement 2: 00:00:00.747103

Run 2, Statement 1: 00:00:00.69377
Run 2, Statement 2: 00:00:00.746252

Run 3, Statement 1: 00:00:00.684777
Run 3, Statement 2: 00:00:00.745419

Run 4, Statement 1: 00:00:00.688584
Run 4, Statement 2: 00:00:00.740979

Run 5, Statement 1: 00:00:00.688878
Run 5, Statement 2: 00:00:00.742864

So, indeed, the redundant predicate improved things (in a perfect world, it shouldn’t, but here we are. The optimiser doesn’t optimise this as well as it could). But still, the FILTER clause outperforms CASE clause usage.

Conclusion

In a perfect world, two provably equivalent SQL syntaxes also perform the same way. But this isn’t always the case in the real world, where optimisers make tradeoffs between:

  • Time spent optimising rare syntaxes
  • Time spent executing queries

In a previous blog post (which is probably outdated by now), I’ve shown a lot of these cases, where the optimisation decision doesn’t depend on any cost model and data sets and should always be done, ideally. There was a tendency of such optimisations being favoured by RDBMS that have an execution plan cache (e.g. Db2, Oracle, SQL Server), in case of which the optimisation needs to be done only once per cached plan, and then the plan can be reused. In RDBMS that don’t have such a cache, optimisation time is more costly per query, so less can be expected.

I think this is a case where it’s worth looking into simple patterns of expressions in aggregate functions. AGG(CASE ..) is such a popular idiom, and 8% is quite the significant improvement, that I think PostgreSQL should fix this. We’ll see. In any case, since FILTER is already:

  • Better performing
  • Better looking

You can safely switch to this nice standard SQL syntax already now.

Benchmarking

While in this case, the improvement is worth it irrespective of actual measurements (because performance can hardly be worse, and readability actually improves), always be careful with such benchmark results. Do measure things yourself, and if you cannot reproduce a performance problem, then don’t necessarily touch perfectly sound logic just because a blog post told you so.

Benchmark code

As promised, this was the benchmark code used for this blog post:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 500;
  rec RECORD;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT 
          fa.actor_id,
          SUM(length) FILTER (WHERE rating = 'R'),
          SUM(length) FILTER (WHERE rating = 'PG')
        FROM film_actor AS fa 
        LEFT JOIN film AS f 
          ON f.film_id = fa.film_id
          AND rating IN ('R', 'PG')
        GROUP BY fa.actor_id
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, (clock_timestamp() - v_ts);
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT 
          fa.actor_id,
          SUM(CASE WHEN rating = 'R' THEN length END),
          SUM(CASE WHEN rating = 'PG' THEN length END)
        FROM film_actor AS fa 
        LEFT JOIN film AS f 
          ON f.film_id = fa.film_id
          AND rating IN ('R', 'PG')
        GROUP BY fa.actor_id
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, (clock_timestamp() - v_ts);
    RAISE INFO '';
  END LOOP;
END$$;

The benchmark technique is described here.



Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button