Use MULTISET Predicates to Compare Data Sets – Java, SQL and jOOQ.
Questions that might be a bit more difficult to solve using ordinary SQL are questions of the kind:
What films have the same actors as a given film X?
As always, we’re using the sakila database for this example. What would be a possible way to solve this with SQL (for example, PostgreSQL, to be specific)? The following query gives an overview of actors per film:
SELECT
film_id,
array_agg(actor_id ORDER BY actor_id) actors
FROM film_actor
GROUP BY film_id
It produces something like this:
|film_id|actors | |-------|----------------------------------| |1 |1,10,20,30,40,53,108,162,188,198| |2 |19,85,90,160 | |3 |2,19,24,64,123 | |4 |41,81,88,147,162 | |5 |51,59,103,181,200 | |6 |21,23,62,108,137,169,197 | |... |... |
Note that in SQL, arrays behave like lists, i.e. they maintain their ordering, so ordering the array explicitly is important to be able to compare the actors with each other. Now, we want to find all films that share the same actor set, from the above:
WITH t AS (
-- Previous query
SELECT
film_id,
array_agg(actor_id ORDER BY actor_id) actors
FROM film_actor
GROUP BY film_id
)
SELECT
array_agg(film_id ORDER BY film_id) AS films,
actors
FROM t
GROUP BY actors
ORDER BY count(*) DESC, films
The result is now:
|films |actors | |--------|----------------------------------| |97,556|65 | |1 |1,10,20,30,40,53,108,162,188,198| |2 |19,85,90,160 | |3 |2,19,24,64,123 | |4 |41,81,88,147,162 | |5 |51,59,103,181,200 | |6 |21,23,62,108,137,169,197 | |... |... |
So, as we can see, there are only 2 films which share the same set of actors, and those films are FILM_ID IN (97, 556)
. (The Sakila database is a bit boring as the data sets are generated).
Using MULTISET comparisons
While the above is already quite cool, in this article, I’d like to showcase a lesser known feature of the jOOQ 3.15 MULTISET
support, namely the fact that they can be compared with one another.
And as is the nature of SQL standard MULTISET
, ordering is irrelevant, so we don’t have to add any explicit ORDER BY
clause for such a comparison. In fact, it’s not 100% irrelevant. You can order a MULTISET
for projection purposes, so the ordering will be maintained by jOOQ. But when you use them in predicates, jOOQ will override your ORDER BY
clause.
Using jOOQ, we can write:
ctx.select(FILM.FILM_ID, FILM.TITLE)
.from(FILM)
.where(
multiset(
select(FILM_ACTOR.ACTOR_ID)
.from(FILM_ACTOR)
.where(FILM_ACTOR.FILM_ID.eq(FILM.FILM_ID))
).eq(multiset(
select(FILM_ACTOR.ACTOR_ID)
.from(FILM_ACTOR)
.where(FILM_ACTOR.FILM_ID.eq(97L))
))
)
.orderBy(FILM_ID)
.fetch();
It’s a bit less efficient than a query of the previous form as it accesses the FILM_ACTOR
table from two subqueries, though only one of them is correlated. Using the default JSONB
emulation, the following query is generated:
SELECT film.film_id, film.title
FROM film
WHERE (
SELECT coalesce(
jsonb_agg(jsonb_build_array(v0) ORDER BY t.v0),
jsonb_build_array()
)
FROM (
SELECT film_actor.actor_id AS v0
FROM film_actor
WHERE film_actor.film_id = film.film_id
) AS t
) = (
SELECT coalesce(
jsonb_agg(jsonb_build_array(v0) ORDER BY t.v0),
jsonb_build_array()
)
FROM (
SELECT film_actor.actor_id AS v0
FROM film_actor
WHERE film_actor.film_id = 97
) AS t
)
ORDER BY film.film_id
I promised that no ORDER BY
clause was needed for MULTISET
, and this is still true for the jOOQ code. However, behind the scenes, jOOQ has to order the JSON arrays by their contents to make sure that two MULTISET
values are the same, irrespective of their order.
The result is the same two IDs as the previous result showed:
+-------+--------------+ |film_id|title | +-------+--------------+ | 97|BRIDE INTRIGUE| | 556|MALTESE HOPE | +-------+--------------+
Comparing MULTISET_AGG, instead
If you prefer using joins and GROUP BY
to generate the film’s actor MULTISET
, you can do that as well, with jOOQ. This time, we’re using:
- Implicit joins to simplify access to the
FILM.TITLE
fromFILM_ACTOR
- A
MULTISET
predicate in theHAVING
clause, usingMULTISET_AGG
Here’s the jOOQ version:
ctx.select(FILM_ACTOR.FILM_ID, FILM_ACTOR.film().TITLE)
.from(FILM_ACTOR)
.groupBy(FILM_ACTOR.FILM_ID, FILM_ACTOR.film().TITLE)
.having(multisetAgg(FILM_ACTOR.ACTOR_ID).eq(multiset(
select(FILM_ACTOR.ACTOR_ID)
.from(FILM_ACTOR)
.where(FILM_ACTOR.FILM_ID.eq(97L))
)))
.orderBy(FILM_ACTOR.FILM_ID)
.fetch();
The backing, generated SQL looks like this:
SELECT film_actor.film_id, alias_75379701.title
FROM film_actor
JOIN film AS alias_75379701
ON film_actor.film_id = alias_75379701.film_id
GROUP BY film_actor.film_id, alias_75379701.title
HAVING jsonb_agg(
jsonb_build_array(film_actor.actor_id) ORDER BY film_actor.actor_id
) = (
SELECT coalesce(
jsonb_agg(jsonb_build_array(v0) ORDER BY t.v0),
jsonb_build_array()
)
FROM (
SELECT film_actor.actor_id AS v0
FROM film_actor
WHERE film_actor.film_id = 97
) AS t
)
ORDER BY film_actor.film_id
Notice how the implicit join is expanded automatically, whereas the HAVING
predicate again uses the usual JSONB
emulation for MULTISET
and MULTISET_AGG
.
Alternatives
In the above examples, we’ve compared MULTISET
expressions that project single columns, in other words, Result<Record1<Long>>
nested collection types. Nothing keeps you from adding more columns to the equation. jOOQ will always ensure that your query type checks and that the generated SQL is correct.
An alternative to using MULTISET
would be using ARRAY_AGG
and ARRAY
(now you have to ORDER BY
explicitly, again). With jOOQ:
ctx.select(FILM_ACTOR.FILM_ID, FILM_ACTOR.film().TITLE)
.from(FILM_ACTOR)
.groupBy(FILM_ACTOR.FILM_ID, FILM_ACTOR.film().TITLE)
.having(arrayAgg(FILM_ACTOR.ACTOR_ID)
.orderBy(FILM_ACTOR.ACTOR_ID).eq(array(
select(FILM_ACTOR.ACTOR_ID)
.from(FILM_ACTOR)
.where(FILM_ACTOR.FILM_ID.eq(97L))
.orderBy(FILM_ACTOR.ACTOR_ID)
)))
.orderBy(FILM_ACTOR.FILM_ID)
.fetch();
With SQL:
SELECT film_actor.film_id, film.title
FROM film_actor
JOIN film
ON film_actor.film_id = film.film_id
GROUP BY film_actor.film_id, film.title
HAVING array_agg(film_actor.actor_id ORDER BY film_actor.actor_id) =
ARRAY (
SELECT film_actor.actor_id
FROM film_actor
WHERE film_actor.film_id = 97
ORDER BY film_actor.actor_id
)
ORDER BY film_actor.film_id