Using local JSON variables to emulate window functions in MySQL 5.7
One of MySQL 8’s biggest improvements is the support of window functions. As I always said in conferences, there’s SQL before window functions and SQL after window functions. Once you start using them, you’ll use them everywhere.
Some of you poor souls are unfortunate enough to be stuck on MySQL 5.7, either of your own choosing, or because you’re using a clone / fork that is still 5.7 compatible. While for most people, this blog post is just for your amusement, or nostalgia, for some of you this post will be quite useful.
Using local variables
A lot of Stack Overflow questions or blog posts out there show the same old trick using local variables. In a procedural context, local variables make perfect sense. For example, this statement batch.
SET @c = (SELECT COUNT(*) FROM information_schema.tables);
-- More processing
-- Return the result:
SELECT @c;
A bit hairier is the fact that these local variables can be declared within a query, and incremented procedurally within a query:
SELECT
a,
-- Use and increment your variable in SELECT
@rn := @rn + 1
FROM
(
SELECT 3 AS a UNION ALL
SELECT 4 AS a
) AS t,
-- Declare your variable in FROM
(SELECT @rn := 0) r
ORDER BY a;
And boom, you have a ROW_NUMBER() OVER (ORDER BY a)
window function! The result being:
|a |@rn := @rn + 1|
|---|--------------|
|3 |1 |
|4 |2 |
This works quite incidentally, because the expression incrementing the row number “happens to” be evaluated in the desired order, row by row, because of the query’s ORDER BY a
clause. Revert it:
SELECT
a, @rn := @rn + 1
FROM (
SELECT 3 AS a UNION ALL
SELECT 4 AS a
) AS t, (SELECT @rn := 0) r
ORDER BY a DESC;
And you still get the desired result:
|a |@rn := @rn + 1|
|---|--------------|
|4 |1 |
|3 |2 |
This is really hairy, because it violates the idea of SQL’s logical order of operations, which most RDBMS agree upon. It assumes ORDER BY
“happens before” SELECT
, just because the optimiser chooses to do things this way. You can tamper with the optimiser and break the “feature” easily, e.g. by adding DISTINCT
:
SELECT DISTINCT
a, @rn := @rn + 1
FROM (
SELECT 3 AS a UNION ALL
SELECT 4 AS a
) AS t, (SELECT @rn := 0) r
ORDER BY a DESC;
Now the result is no longer what we wanted (how could it possibly be?):
|a |@rn := @rn + 1|
|---|--------------|
|4 |2 |
|3 |1 |
The reason is that DISTINCT
is typically implemented using a sort or a hashmap, both will not preserve any ordering, and according to the aforementioned logical order of operations, this is perfectly fine, because ORDER BY
is supposed to “happen after” SELECT
and after DISTINCT
, at least logically.
But if you’re careful, and cover everything with enough tests, you could still use this trick. After all, being stuck with MySQL 5.7 is already painful enough, so why not treat yourself to an “almost window function”.
Note: Just to indicate how much of a bad idea depending on this incidental feature is, MySQL 8.x now issues a deprecation warning:
Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: ‘SET variable=expression, …’, or ‘SELECT expression(s) INTO variables(s)’.
The main reason I’ve seen this syntax being used on Stack Overflow so far is to emulate
ROW_NUMBER
, so, I’d say, good riddance (now that MySQL 8 has window function support)
PARTITION BY using ORDER BY
What I haven’t seen much on Stack Overflow or in blogs, is PARTITION BY
support. Most solutions I’ve seen use ORDER BY
to implement partitioning, which is fine. For example:
SELECT
a, b,
ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
IF (
@prev = a,
@rn := @rn + 1,
CASE WHEN (@prev := a) IS NOT NULL OR TRUE THEN @rn := 1 END
) AS rn2
FROM (
SELECT 1 AS a, 3 AS b UNION ALL
SELECT 2 AS a, 4 AS b UNION ALL
SELECT 1 AS a, 5 AS b UNION ALL
SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := 0, @prev := NULL) r
ORDER BY a, b DESC;
Producing:
|a |b |rn1|rn2|
|---|---|---|---|
|1 |5 |1 |1 |
|1 |3 |2 |2 |
|2 |6 |1 |1 |
|2 |4 |2 |2 |
A few notes:
- The desired
PARTITION BY
andORDER BY
clauses both have to be reflected in the top level query. If you only wanted toORDER BY b DESC
, notORDER BY a
as well, tough luck. (If you want to play around with this, try removing theROW_NUMBER()
function, which also orders stuff bya
, implicitly) - I’ve tried to put all the variable assignment logic into a single expression in order to avoid any extra columns being generated. This makes the expression a bit more ugly than it needed to be.
PARTITION BY using JSON
A more robust, but perhaps slower approach to emulating PARTITION BY
would be to maintain a JSON object that keeps track of each partition key’s ROW_NUMBER()
, because why not?
Behold this beauty:
SELECT
a, b,
ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
json_extract(
@rn := json_set(
@rn, @path := concat('$."', a, '"'),
(coalesce(json_extract(@rn, @path), 0) + 1)
),
@path
) AS rn2,
@rn AS debug -- Added for debugging purposes only
FROM (
SELECT 1 AS a, 3 AS b UNION ALL
SELECT 2 AS a, 4 AS b UNION ALL
SELECT 1 AS a, 5 AS b UNION ALL
SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := '') r
ORDER BY b DESC;
Check out the results:
|a |b |rn1|rn2|debug |
|---|---|---|---|--------------------|
|2 |6 |1 |1.0|"1": 2.0, "2": 1.0|
|1 |5 |1 |1.0|"1": 1.0 |
|2 |4 |2 |2.0|"1": 2.0, "2": 2.0|
|1 |3 |2 |2.0|"1": 2.0 |
You can try this on MySQL 5.7 (removing the ROW_NUMBER()
, of course), and you’ll see this works perfectly fine! How does it work?
- We start with an empty object
in the
FROM
clause. - On every row that is incidentally ordered by the
ORDER BY b DESC
clause, we’ll extract the row number value for the partition keyPARTITION BY a
. This is done with a dynamically created JSON path expressionconcat('$."', a, '"')
. For example:$."1"
or$."2"
. - At first, this value is
NULL
, of course, so we turn it to zero withCOALESCE(<expr>, 0)
. - We add
1
to it - Then we
JSON_SET
the value back into the object, assigning the result back to@rn
. - Then, we re-extract the value we’ve just calculated
This could be simplified a bit if it wasn’t just a single expression, but since I’m thinking about implementing this emulation in jOOQ (see #14529), I wanted to do the exercise of keeping the projection unchanged (imagine, the jOOQ user writes ROW_NUMBER()
with jOOQ, and wants this to “just work”).
Caveats:
- If the
PARTITION BY
clause has multiple expressions, then the composite value would have to be used as a key, e.g. using some “impossible” concatenation token (a token that can’t appear in the data set), or a hash value (risking collisions, of course), or an additional lookup, making things quite complicated. - The
concat('$."', a, '"')
expression doesn’t properly quotea
yet, in case it contains double quotes. - If multiple distinct window function calculations with distinct
ORDER BY
clauses are required, then this approach won’t work as easily. It might be possible to calculate things with one derived table nest level per window function (?). However, multiple distinctPARTITION BY
clauses are fine. Just generate a separate@rn
variable per distinctPARTITION BY
clause. - The JSON document might lose data type information. For example, in JSON, numbers may be represented as floats, so if you require decimal precision, perhaps you should work with JSON strings instead, and cast things back and forth, always valuing correctness over performance.
Do you think you’ve seen everything? Let’s do something even more hairy:
DENSE_RANK with PARTITION BY
We won’t stop here, because once we’ve chosen this crazy path, we might as well see it to the end. Let’s emulate DENSE_RANK()
, which is a bit harder, making the SQL more “beautiful”:
SELECT
a, b,
DENSE_RANK() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
json_extract(
@rn := json_set(@rn,
@rnpath := concat('$."rn-', a, '"'),
(coalesce(json_extract(@rn, @rnpath), 0) + IF (
json_extract(@rn, @prepath := concat('$."pre-v-', a, '"')) = b,
0, 1
)),
@prepath,
b
),
@rnpath
) AS rn2,
@rn AS debug
FROM (
SELECT 1 AS a, 3 AS b UNION ALL
SELECT 1 AS a, 5 AS b UNION ALL
SELECT 1 AS a, 5 AS b UNION ALL
SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := '') r
ORDER BY b DESC;
Here’s the result:
|a |b |rn1|rn2|debug |
|---|---|---|---|------------------------------------------------------|
|2 |6 |1 |1.0|"rn-1": 2.0, "rn-2": 1.0, "pre-v-1": 3, "pre-v-2": 6|
|1 |5 |1 |1.0|"rn-1": 1.0, "pre-v-1": 5 |
|1 |5 |1 |1.0|"rn-1": 1.0, "pre-v-1": 5 |
|1 |3 |2 |2.0|"rn-1": 2.0, "pre-v-1": 3 |
How does it differ?
- We now have to remember not just the previous row number value per partition (
"rn-1"
,"rn-2"
), but also the previous value ofb
(theORDER BY
criteria) per partition ("pre-v-1"
,"pre-v-2"
). - Then, we increment the row number per partition only if the previous value is different from the current value
Caveats:
- There can still be multiple
PARTITION BY
expressions, as well as path escaping problems, see caveats ofROW_NUMBER
for details. - If there are multiple
ORDER BY
columns, the"pre-v-n"
values would have to remember their composite value, e.g. by nesting a JSON object. This is a bit simpler to take into account than multiplePARTITION BY
expressions
Hairy enough? Let’s go deeper
RANK with PARTITION BY
Who would have thought that RANK
is harder than DENSE_RANK
(see this article for a direct comparison of the functions). Now, in addition to remembering the previous ordering value per partition, we also need to remember the previous rank per partition (all the while continuing to count up the row number).
Note, you can refactor this to something more readable if you remove the jOOQ imposed single expression restriction, but where’s the challenge in that, right? Here it is, bow before it in awe (or terror):
SELECT
a, b,
RANK() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
coalesce(
json_extract(
@rn := json_set(@rn,
@rnpath := concat('$."rn-', a, '"'),
@currn := coalesce(json_extract(@rn, @rnpath), 0) + 1,
@prevpath := concat('$."pre-v-', a, '"'),
b,
@prernpath := concat('$."pre-rn-', a, '"'),
IF (json_extract(@rn, @prevpath) = b,
coalesce(json_extract(@rn, @prernpath), @currn) div 1,
@currn
)
),
@prernpath
),
@currn
) AS rn2,
@rn AS debug
FROM (
SELECT 1 AS a, 3 AS b UNION ALL
SELECT 1 AS a, 5 AS b UNION ALL
SELECT 1 AS a, 5 AS b UNION ALL
SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := '') r
ORDER BY b DESC;
It produces:
|a |b |rn1|rn2|debug | |---|---|---|---|----------------------------------------------------------------------------------------| |2 |6 |1 |1.0|"rn-1": 3.0, "rn-2": 1.0, "pre-v-1": 3, "pre-v-2": 6, "pre-rn-1": 3.0, "pre-rn-2": 1.0| |1 |5 |1 |1.0|"rn-1": 1.0, "pre-v-1": 5, "pre-rn-1": 1.0 | |1 |5 |1 |1.0|"rn-1": 2.0, "pre-v-1": 5, "pre-rn-1": 1.0 | |1 |3 |3 |3.0|"rn-1": 3.0, "pre-v-1": 3, "pre-rn-1": 3.0 |
How does it work? “Simply”:
Caveats:
PERCENT_RANK and CUME_DIST
I’m not convinced that these can be emulated with the local variable based approach. In principle:
PERCENT_RANK() OVER w
is just(RANK() OVER w - 1) / (COUNT(*) OVER () - 1)
CUME_DIST() OVER w
is just(RANK() OVER w) / (COUNT(*) OVER ())
But as we’ll see below, it’s not possible (I think?) to emulate COUNT(*) OVER ()
using this local variable based approach. You could, maybe, do another round of calculations when wrapping things in a derived table, though.
LEAD, LAG, etc.
Some of these can also be emulated with the above technique, in particular the ones that are “backward looking”.
LAG
: For example, withLAG
, we just have to remember again the"pre-v-x"
for each partition, and produce it again on the current row. Depending on theLAG
‘sOFFSET
, we might need to keep around a JSON array of values, always appending the current value to the array, and removing the first value, like in a FIFO queue.LEAD
: The forward looking functions just have to reverse theORDER BY
clause. For example, allLEAD
functions can be implemented withLAG
as well.FIRST_VALUE
: This is a bit simpler thanLAG
, as we don’t have to keep an entire JSON array of values. We just remember the first one, and then keep reproducing this.LAST_VALUE
is again just the inverse ofFIRST_VALUE
with reversedORDER BY
clause.NTH_VALUE
needs a counter per partition, to be sure we catch the Nth value. Alternatively, we can again store everything in a JSON array until it reaches size N.IGNORE NULLS
can be implemented by skipping all theNULL
values from being entered into the aforementioned FIFO queue- Things get a bit trickier when there is a
RANGE
orROWS
clause, in case of which the JSON array / FIFO queue has to be shifted. This affectsFIRST_VALUE
more thanLEAD
, I’d say.
The actual implementation is left as an exercise to the user. (Probably about time to consider upgrading to MySQL 8, by now!)
Aggregate functions
All SQL aggregate functions can be turned into window functions by appending OVER ()
. For example:
SUM(x)
is an aggregate function, aggregating data per group generated by theGROUP BY
clause, shared by the entire query.SUM(x) OVER ()
is the corresponding window function, aggregating data per partition generated by thePARTITION BY
clause per window function (or rather, per explicit or implicit window specification)
Since these previously discussed local variable based approaches are row-by-row based calculations, I don’t think it’s possible to emulate partition wide aggregate window functions, because these require being able to look at the entire partition, including rows that haven’t yet been projected.
However (never give up!), some window frames can be emulated also for aggregate functions, especially the backward looking ones. For simplicity, I’ll just try emulating this:
SUM(b) OVER (
PARTITION BY a
ORDER BY b DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
Note: without explicit window frame,
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
is implicit, and that means that in order to include tied rows in the sum, we’d have to again be forward looking, which I don’t think is possible with the local variable row-by-row based approach.However, it might be possible to emulate alternative backward looking
ROWS
frames. That exercise is again left to the reader.
So, let’s do this:
SELECT
a, b,
SUM(b) OVER w AS sum1,
json_extract(
@w := json_set(@w,
@spath := concat('$."s-', a, '"'),
(coalesce(json_extract(@w, @spath), 0) + b),
@cpath := concat('$."c-', a, '"'),
(coalesce(json_extract(@w, @cpath), 0) + 1)
),
@spath
) AS sum2,
COUNT(*) OVER w AS cnt1,
json_extract(@w, @cpath) AS cnt2,
AVG(b) OVER w AS avg1,
json_extract(@w, @spath) / json_extract(@w, @cpath) AS avg2,
@w AS debug
FROM (
SELECT 1 AS a, 3 AS b UNION ALL
SELECT 1 AS a, 5 AS b UNION ALL
SELECT 1 AS a, 5 AS b UNION ALL
SELECT 2 AS a, 6 AS b
) AS t, (SELECT @w := '') r
WINDOW w AS (
PARTITION BY a
ORDER BY b DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
ORDER BY b DESC;
The output seems correct:
|a |b |sum1|sum2|cnt1|cnt2|avg1 |avg2 |debug | |---|---|----|----|----|----|------|------------|-------------------------------------------------| |2 |6 |6 |6.0 |1 |1.0 |6 |6 |"c-1": 3.0, "c-2": 1.0, "s-1": 13.0, "s-2": 6.0| |1 |5 |5 |5.0 |1 |1.0 |5 |5 |"c-1": 1.0, "s-1": 5.0 | |1 |5 |10 |10.0|2 |2.0 |5 |5 |"c-1": 2.0, "s-1": 10.0 | |1 |3 |13 |13.0|3 |3.0 |4.3333|4.3333333333|"c-1": 3.0, "s-1": 13.0 |
Notes:
- I’ve stored all the window calculations in the same JSON object, assuming they all share the same window specification, so they can reuse their values (e.g.
AVG(x) = SUM(x) / COUNT(x)
, and thusAVG(x) OVER w = SUM(x) OVER w / COUNT(x) OVER w
) - Other than that, things work pretty much just like for
ROW_NUMBER()
Conclusion
This has been a fun blog post to write. I hope it was helpful to you either as an exercise to think about what window functions really do, or in the worst case, to help you poor soul actually implement things this way on MySQL 5.7.
There have been a lot of caveats. This emulation approach doesn’t always work and makes (heavy) assumptions about your query. For example:
- You can’t use
DISTINCT
- You can’t use arbitrary
ORDER BY
clauses that don’t match the window function’s - You can’t use multiple window functions with different window specifications
- You can’t use forward looking window frames (including frameless aggregate window functions that aggregate the entire partition)
There are probably more caveats that haven’t been discussed here. If you’re diligent, and test things heavily, however, you might be able to pull off using these approaches. Good luck (and don’t blame me 😅)