How to Filter a SQL Nested Collection by a Value – Java, SQL and jOOQ.
I stumbled upon a very interesting question on Stack Overflow about how to use jOOQ’s MULTISET
operator to nest a collection, and then filter the result by whether that nested collection contains a value.
The question is jOOQ specific, but imagine, you have a query that nests collections using JSON in PostgreSQL. Assuming, as always, the Sakila database. Now, PostgreSQL doesn’t support the SQL standard MULTISET
operator, but we can use ARRAY
, which works almost the same way.
SELECT
f.title,
ARRAY(
SELECT ROW(
a.actor_id,
a.first_name,
a.last_name
)
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
WHERE fa.film_id = f.film_id
ORDER BY a.actor_id
)
FROM film AS f
ORDER BY f.title
This produces all films and their actors as follows (I’ve truncated the arrays for readability purposes. You get the point):
title |array ---------------------------+-------------------------------------------------------------------------------------- ACADEMY DINOSAUR |{"(1,PENELOPE,GUINESS)","(10,CHRISTIAN,GABLE)","(20,LUCILLE,TRACY)","(30,SANDRA,PECK)" ACE GOLDFINGER |"(19,BOB,FAWCETT)","(85,MINNIE,ZELLWEGER)","(90,SEAN,GUINESS)","(160,CHRIS,DEPP)" ADAPTATION HOLES |{"(2,NICK,WAHLBERG)","(19,BOB,FAWCETT)","(24,CAMERON,STREEP)","(64,RAY,JOHANSSON)","(1 AFFAIR PREJUDICE |{"(41,JODIE,DEGENERES)","(81,SCARLETT,DAMON)","(88,KENNETH,PESCI)","(147,FAY,WINSLET)" AFRICAN EGG |{"(51,GARY,PHOENIX)","(59,DUSTIN,TAUTOU)","(103,MATTHEW,LEIGH)","(181,MATTHEW,CARREY)" AGENT TRUMAN |{"(21,KIRSTEN,PALTROW)","(23,SANDRA,KILMER)","(62,JAYNE,NEESON)","(108,WARREN,NOLTE)", AIRPLANE SIERRA |{"(99,JIM,MOSTEL)","(133,RICHARD,PENN)","(162,OPRAH,KILMER)","(170,MENA,HOPPER)","(185 AIRPORT POLLOCK |"(55,FAY,KILMER)","(96,GENE,WILLIS)","(110,SUSAN,DAVIS)","(138,LUCILLE,DEE)" ALABAMA DEVIL |{"(10,CHRISTIAN,GABLE)","(22,ELVIS,MARX)","(26,RIP,CRAWFORD)","(53,MENA,TEMPLE)","(68,
Now, the question on Stack Overflow was, how to filter this result by whether the ARRAY
(or MULTISET
) contains a specific value.
Filtering the ARRAY
We can’t just add a WHERE
clause to the query. Because of the logical order of operations in SQL, the WHERE
clause “happens before” the SELECT
clause, so the ARRAY
is not yet available to WHERE
. We could, however, wrap everything in a derived table and do this, instead:
SELECT *
FROM (
SELECT
f.title,
ARRAY(
SELECT ROW(
a.actor_id,
a.first_name,
a.last_name
)
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
WHERE fa.film_id = f.film_id
ORDER BY a.actor_id
) AS actors
FROM film AS f
) AS f
WHERE actors @> ARRAY[(
SELECT ROW(a.actor_id, a.first_name, a.last_name)
FROM actor AS a
WHERE a.actor_id = 1
)]
ORDER BY f.title
Excuse the unwieldy ARRAY @> ARRAY
operator. I’m not aware of a better approach here, because it’s hard to unnest a structurally typed RECORD[]
array in PostgreSQL, if we don’t use a nominal type (CREATE TYPE ...
). If you know a better way to filter, please let me know in the comments section. Here’s a better version:
SELECT *
FROM (
SELECT
f.title,
ARRAY(
SELECT ROW(
a.actor_id,
a.first_name,
a.last_name
)
FROM actor AS a
JOIN film_actor AS fa USING (actor_id)
WHERE fa.film_id = f.film_id
ORDER BY a.actor_id
) AS actors
FROM film AS f
) AS f
WHERE EXISTS (
SELECT 1
FROM unnest(actors) AS t (a bigint, b text, c text)
WHERE a = 1
)
ORDER BY f.title
Anyway, this produces the desired result:
title |actors ---------------------+------------------------------------------------------------------------------------------------- ACADEMY DINOSAUR |{"(1,PENELOPE,GUINESS)","(10,CHRISTIAN,GABLE)","(20,LUCILLE,TRACY)","(30,SANDRA,PECK)","(40,JOHNN ANACONDA CONFESSIONS |{"(1,PENELOPE,GUINESS)","(4,JENNIFER,DAVIS)","(22,ELVIS,MARX)","(150,JAYNE,NOLTE)","(164,HUMPHREY ANGELS LIFE |{"(1,PENELOPE,GUINESS)","(4,JENNIFER,DAVIS)","(7,GRACE,MOSTEL)","(47,JULIA,BARRYMORE)","(91,CHRIS BULWORTH COMMANDMENTS|"(1,PENELOPE,GUINESS)","(65,ANGELA,HUDSON)","(124,SCARLETT,BENING)","(173,ALAN,DREYFUSS)" CHEAPER CLYDE |"(1,PENELOPE,GUINESS)","(20,LUCILLE,TRACY)" COLOR PHILADELPHIA |{"(1,PENELOPE,GUINESS)","(106,GROUCHO,DUNST)","(122,SALMA,NOLTE)","(129,DARYL,CRAWFORD)","(163,CH ELEPHANT TROJAN |{"(1,PENELOPE,GUINESS)","(24,CAMERON,STREEP)","(37,VAL,BOLGER)","(107,GINA,DEGENERES)","(115,HARR GLEAMING JAWBREAKER |{"(1,PENELOPE,GUINESS)","(66,MARY,TANDY)","(125,ALBERT,NOLTE)","(143,RIVER,DEAN)","(155,IAN,TANDY
Now, all the results are guaranteed to be films in which 'PENELOPE GUINESS'
was an ACTOR
. But is there a better solution?
Using ARRAY_AGG instead
However, in native PostgreSQL, it would be better (in this case) to use ARRAY_AGG
, I think:
SELECT
f.title,
ARRAY_AGG(ROW(
a.actor_id,
a.first_name,
a.last_name
) ORDER BY a.actor_id) AS actors
FROM film AS f
JOIN film_actor AS fa USING (film_id)
JOIN actor AS a USING (actor_id)
GROUP BY f.title
HAVING bool_or(true) FILTER (WHERE a.actor_id = 1)
ORDER BY f.title
This produces the exact same result:
title |actors ---------------------+------------------------------------------------------------------------------------------------ ACADEMY DINOSAUR |{"(1,PENELOPE,GUINESS)","(10,CHRISTIAN,GABLE)","(20,LUCILLE,TRACY)","(30,SANDRA,PECK)","(40,JOHN ANACONDA CONFESSIONS |{"(1,PENELOPE,GUINESS)","(4,JENNIFER,DAVIS)","(22,ELVIS,MARX)","(150,JAYNE,NOLTE)","(164,HUMPHRE ANGELS LIFE |{"(1,PENELOPE,GUINESS)","(4,JENNIFER,DAVIS)","(7,GRACE,MOSTEL)","(47,JULIA,BARRYMORE)","(91,CHRI BULWORTH COMMANDMENTS|"(1,PENELOPE,GUINESS)","(65,ANGELA,HUDSON)","(124,SCARLETT,BENING)","(173,ALAN,DREYFUSS)" CHEAPER CLYDE |"(1,PENELOPE,GUINESS)","(20,LUCILLE,TRACY)" COLOR PHILADELPHIA |{"(1,PENELOPE,GUINESS)","(106,GROUCHO,DUNST)","(122,SALMA,NOLTE)","(129,DARYL,CRAWFORD)","(163,C ELEPHANT TROJAN |{"(1,PENELOPE,GUINESS)","(24,CAMERON,STREEP)","(37,VAL,BOLGER)","(107,GINA,DEGENERES)","(115,HAR GLEAMING JAWBREAKER |{"(1,PENELOPE,GUINESS)","(66,MARY,TANDY)","(125,ALBERT,NOLTE)","(143,RIVER,DEAN)","(155,IAN,TAND
How does it work?
- We’re grouping by
FILM
and aggregate the contents per film into a nested collection. - We can now use
HAVING
to filter on groups. BOOL_OR(TRUE)
isTRUE
as soon as theGROUP
is non-emptyFILTER (WHERE a.actor_id = 1)
was that filter criteria, which we place in the group
So, the HAVING
predicate is TRUE
if there is at least one ACTOR_ID = 1
, or NULL
otherwise, which has the same effect as FALSE
. If you’re a purist, wrap the predicate in COALESCE(BOOL_OR(...), FALSE)
Clever or neat, or a bit of both?
Doing this with jOOQ
Here’s the jOOQ version, that works on any RDBMS that supports MULTISET_AGG
(ARRAY_AGG
emulation is still pending):
ctx.select(
FILM_ACTOR.film().TITLE,
multisetAgg(
FILM_ACTOR.actor().ACTOR_ID,
FILM_ACTOR.actor().FIRST_NAME,
FILM_ACTOR.actor().LAST_NAME))
.from(FILM_ACTOR)
.groupBy(FILM_ACTOR.film().TITLE)
.having(boolOr(trueCondition())
.filterWhere(FILM_ACTOR.actor().ACTOR_ID.eq(1)))
.orderBy(FILM_ACTOR.film().TITLE)
.fetch();
While the powerful MULTISET
value constructor gets most of the fame with jOOQ users, let’s not forget there is also a slightly less powerful, but occasionally really useful MULTISET_AGG
aggregate function, which can be used for aggregations or as a window function!