Skip to content

Latest commit

 

History

History
424 lines (387 loc) · 24 KB

array-functions-as-window.md

File metadata and controls

424 lines (387 loc) · 24 KB
title linkTitle weight description
Using array functions to mimic window-functions alike behavior
Using array functions to mimic window-functions alike behavior
100
Using array functions to mimic window-functions alike behavior.

There are cases where you may need to mimic window functions using arrays in ClickHouse. This could be for optimization purposes, to better manage memory, or to enable on-disk spilling, especially if you’re working with an older version of ClickHouse that doesn't natively support window functions.

Here’s an example demonstrating how to mimic a window function like runningDifference() using arrays:

Step 1: Create Sample Data

We’ll start by creating a test table with some sample data:

DROP TABLE IS EXISTS test_running_difference

CREATE TABLE test_running_difference
ENGINE = Log AS
SELECT 
    number % 20 AS id, 
    toDateTime('2010-01-01 00:00:00') + (intDiv(number, 20) * 15) AS ts, 
    (number * round(xxHash32(number % 20) / 1000000)) - round(rand() / 1000000) AS val
FROM numbers(100)


SELECT * FROM test_running_difference;

┌─id─┬──────────────────ts─┬────val─┐
│  02010-01-01 00:00:00-1209 │
│  12010-01-01 00:00:0043 │
│  22010-01-01 00:00:004322 │
│  32010-01-01 00:00:00-25 │
│  42010-01-01 00:00:0013720 │
│  52010-01-01 00:00:00903 │
│  62010-01-01 00:00:0018062 │
│  72010-01-01 00:00:00-2873 │
│  82010-01-01 00:00:006286 │
│  92010-01-01 00:00:0013399 │
│ 102010-01-01 00:00:0018320 │
│ 112010-01-01 00:00:0011731 │
│ 122010-01-01 00:00:00857 │
│ 132010-01-01 00:00:008752 │
│ 142010-01-01 00:00:0023060 │
│ 152010-01-01 00:00:0041902 │
│ 162010-01-01 00:00:0039406 │
│ 172010-01-01 00:00:0050010 │
│ 182010-01-01 00:00:0057673 │
│ 192010-01-01 00:00:0051389 │
│  02010-01-01 00:00:1566839 │
│  12010-01-01 00:00:1519440 │
│  22010-01-01 00:00:1574513 │
│  32010-01-01 00:00:1510542 │
│  42010-01-01 00:00:1594245 │
│  52010-01-01 00:00:158230 │
│  62010-01-01 00:00:1587823 │
│  72010-01-01 00:00:15-128 │
│  82010-01-01 00:00:1530101 │
│  92010-01-01 00:00:1554321 │
│ 102010-01-01 00:00:1564078 │
│ 112010-01-01 00:00:1531886 │
│ 122010-01-01 00:00:158749 │
│ 132010-01-01 00:00:1528982 │
│ 142010-01-01 00:00:1561299 │
│ 152010-01-01 00:00:1595867 │
│ 162010-01-01 00:00:1593667 │
│ 172010-01-01 00:00:15114072 │
│ 182010-01-01 00:00:15124279 │
│ 192010-01-01 00:00:15109605 │
│  02010-01-01 00:00:30135082 │
│  12010-01-01 00:00:3037345 │
│  22010-01-01 00:00:30148744 │
│  32010-01-01 00:00:3021607 │
│  42010-01-01 00:00:30171744 │
│  52010-01-01 00:00:3014736 │
│  62010-01-01 00:00:30155349 │
│  72010-01-01 00:00:30-3901 │
│  82010-01-01 00:00:3054303 │
│  92010-01-01 00:00:3089629 │
│ 102010-01-01 00:00:30106595 │
│ 112010-01-01 00:00:3054545 │
│ 122010-01-01 00:00:3018903 │
│ 132010-01-01 00:00:3048023 │
│ 142010-01-01 00:00:3097930 │
│ 152010-01-01 00:00:30152165 │
│ 162010-01-01 00:00:30146130 │
│ 172010-01-01 00:00:30174854 │
│ 182010-01-01 00:00:30189194 │
│ 192010-01-01 00:00:30170134 │
│  02010-01-01 00:00:45207471 │
│  12010-01-01 00:00:4554323 │
│  22010-01-01 00:00:45217984 │
│  32010-01-01 00:00:4531835 │
│  42010-01-01 00:00:45252709 │
│  52010-01-01 00:00:4521493 │
│  62010-01-01 00:00:45221271 │
│  72010-01-01 00:00:45-488 │
│  82010-01-01 00:00:4576827 │
│  92010-01-01 00:00:45131066 │
│ 102010-01-01 00:00:45149087 │
│ 112010-01-01 00:00:4571934 │
│ 122010-01-01 00:00:4525125 │
│ 132010-01-01 00:00:4565274 │
│ 142010-01-01 00:00:45135980 │
│ 152010-01-01 00:00:45210910 │
│ 162010-01-01 00:00:45200007 │
│ 172010-01-01 00:00:45235872 │
│ 182010-01-01 00:00:45256112 │
│ 192010-01-01 00:00:45229371 │
│  02010-01-01 00:01:00275331 │
│  12010-01-01 00:01:0072668 │
│  22010-01-01 00:01:00290366 │
│  32010-01-01 00:01:0046074 │
│  42010-01-01 00:01:00329207 │
│  52010-01-01 00:01:0026770 │
│  62010-01-01 00:01:00287619 │
│  72010-01-01 00:01:00-2207 │
│  82010-01-01 00:01:00100456 │
│  92010-01-01 00:01:00165688 │
│ 102010-01-01 00:01:00194136 │
│ 112010-01-01 00:01:0094113 │
│ 122010-01-01 00:01:0035810 │
│ 132010-01-01 00:01:0085081 │
│ 142010-01-01 00:01:00170256 │
│ 152010-01-01 00:01:00265445 │
│ 162010-01-01 00:01:00254828 │
│ 172010-01-01 00:01:00297238 │
│ 182010-01-01 00:01:00323494 │
│ 192010-01-01 00:01:00286252 │
└────┴─────────────────────┴────────┘

100 rows in set. Elapsed: 0.003 sec. 

This table contains IDs, timestamps (ts), and values (val), where each id appears multiple times with different timestamps.

Step 2: Running Difference Example

If you try using runningDifference directly, it works block by block, which can be problematic when the data needs to be ordered or when group changes occur.

select id, val, runningDifference(val) from (select * from test_running_difference order by id, ts);

┌─id─┬────val─┬─runningDifference(val)─┐
│  0-12090 │
│  06683968048 │
│  013508268243 │
│  020747172389 │
│  027533167860 │
│  143-275288 │
│  11944019397 │
│  13734517905 │
│  15432316978 │
│  17266818345 │
│  24322-68346 │
│  27451370191 │
│  214874474231 │
│  221798469240 │
│  229036672382 │
│  3-25-290391 │
│  31054210567 │
│  32160711065 │
│  33183510228 │
│  34607414239 │
│  413720-32354 │
│  49424580525 │
│  417174477499 │
│  425270980965 │
│  432920776498 │
│  5903-328304 │
│  582307327 │
│  5147366506 │
│  5214936757 │
│  5267705277 │
│  618062-8708 │
│  68782369761 │
│  615534967526 │
│  622127165922 │
│  628761966348 │
│  7-2873-290492 │
│  7-1282745 │
│  7-3901-3773 │
│  7-4883413 │
│  7-2207-1719 │
│  862868493 │
│  83010123815 │
│  85430324202 │
│  87682722524 │
│  810045623629 │
│  913399-87057 │
│  95432140922 │
│  98962935308 │
│  913106641437 │
│  916568834622 │
│ 1018320-147368 │
│ 106407845758 │
│ 1010659542517 │
│ 1014908742492 │
│ 1019413645049 │
│ 1111731-182405 │
│ 113188620155 │
│ 115454522659 │
│ 117193417389 │
│ 119411322179 │
│ 12857-93256 │
│ 1287497892 │
│ 121890310154 │
│ 12251256222 │
│ 123581010685 │
│ 138752-27058 │
│ 132898220230 │
│ 134802319041 │
│ 136527417251 │
│ 138508119807 │
│ 1423060-62021 │
│ 146129938239 │
│ 149793036631 │
│ 1413598038050 │
│ 1417025634276 │
│ 1541902-128354 │
│ 159586753965 │
│ 1515216556298 │
│ 1521091058745 │
│ 1526544554535 │
│ 1639406-226039 │
│ 169366754261 │
│ 1614613052463 │
│ 1620000753877 │
│ 1625482854821 │
│ 1750010-204818 │
│ 1711407264062 │
│ 1717485460782 │
│ 1723587261018 │
│ 1729723861366 │
│ 1857673-239565 │
│ 1812427966606 │
│ 1818919464915 │
│ 1825611266918 │
│ 1832349467382 │
│ 1951389-272105 │
│ 1910960558216 │
│ 1917013460529 │
│ 1922937159237 │
│ 1928625256881 │
└────┴────────┴────────────────────────┘

100 rows in set. Elapsed: 0.005 sec. 

The output may look inconsistent because runningDifference requires ordered data within blocks.

Step 3: Using Arrays for Grouping and Calculation

Instead of using runningDifference, we can utilize arrays to group data, sort it, and apply similar logic more efficiently.

Grouping Data into Arrays - You can group multiple columns into arrays by using the groupArray function. For example, to collect several columns as arrays of tuples, you can use the following query:

SELECT 
    id, 
    groupArray(tuple(ts, val))
FROM test_running_difference
GROUP BY id

┌─id─┬─groupArray(tuple(ts, val))──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  0 │ [('2010-01-01 00:00:00',-1209),('2010-01-01 00:00:15',66839),('2010-01-01 00:00:30',135082),('2010-01-01 00:00:45',207471),('2010-01-01 00:01:00',275331)]  │
│  1 │ [('2010-01-01 00:00:00',43),('2010-01-01 00:00:15',19440),('2010-01-01 00:00:30',37345),('2010-01-01 00:00:45',54323),('2010-01-01 00:01:00',72668)]        │
│  2 │ [('2010-01-01 00:00:00',4322),('2010-01-01 00:00:15',74513),('2010-01-01 00:00:30',148744),('2010-01-01 00:00:45',217984),('2010-01-01 00:01:00',290366)]   │
│  3 │ [('2010-01-01 00:00:00',-25),('2010-01-01 00:00:15',10542),('2010-01-01 00:00:30',21607),('2010-01-01 00:00:45',31835),('2010-01-01 00:01:00',46074)]       │
│  4 │ [('2010-01-01 00:00:00',13720),('2010-01-01 00:00:15',94245),('2010-01-01 00:00:30',171744),('2010-01-01 00:00:45',252709),('2010-01-01 00:01:00',329207)]  │
│  5 │ [('2010-01-01 00:00:00',903),('2010-01-01 00:00:15',8230),('2010-01-01 00:00:30',14736),('2010-01-01 00:00:45',21493),('2010-01-01 00:01:00',26770)]        │
│  6 │ [('2010-01-01 00:00:00',18062),('2010-01-01 00:00:15',87823),('2010-01-01 00:00:30',155349),('2010-01-01 00:00:45',221271),('2010-01-01 00:01:00',287619)]  │
│  7 │ [('2010-01-01 00:00:00',-2873),('2010-01-01 00:00:15',-128),('2010-01-01 00:00:30',-3901),('2010-01-01 00:00:45',-488),('2010-01-01 00:01:00',-2207)]       │
│  8 │ [('2010-01-01 00:00:00',6286),('2010-01-01 00:00:15',30101),('2010-01-01 00:00:30',54303),('2010-01-01 00:00:45',76827),('2010-01-01 00:01:00',100456)]     │
│  9 │ [('2010-01-01 00:00:00',13399),('2010-01-01 00:00:15',54321),('2010-01-01 00:00:30',89629),('2010-01-01 00:00:45',131066),('2010-01-01 00:01:00',165688)]   │
│ 10 │ [('2010-01-01 00:00:00',18320),('2010-01-01 00:00:15',64078),('2010-01-01 00:00:30',106595),('2010-01-01 00:00:45',149087),('2010-01-01 00:01:00',194136)]  │
│ 11 │ [('2010-01-01 00:00:00',11731),('2010-01-01 00:00:15',31886),('2010-01-01 00:00:30',54545),('2010-01-01 00:00:45',71934),('2010-01-01 00:01:00',94113)]     │
│ 12 │ [('2010-01-01 00:00:00',857),('2010-01-01 00:00:15',8749),('2010-01-01 00:00:30',18903),('2010-01-01 00:00:45',25125),('2010-01-01 00:01:00',35810)]        │
│ 13 │ [('2010-01-01 00:00:00',8752),('2010-01-01 00:00:15',28982),('2010-01-01 00:00:30',48023),('2010-01-01 00:00:45',65274),('2010-01-01 00:01:00',85081)]      │
│ 14 │ [('2010-01-01 00:00:00',23060),('2010-01-01 00:00:15',61299),('2010-01-01 00:00:30',97930),('2010-01-01 00:00:45',135980),('2010-01-01 00:01:00',170256)]   │
│ 15 │ [('2010-01-01 00:00:00',41902),('2010-01-01 00:00:15',95867),('2010-01-01 00:00:30',152165),('2010-01-01 00:00:45',210910),('2010-01-01 00:01:00',265445)]  │
│ 16 │ [('2010-01-01 00:00:00',39406),('2010-01-01 00:00:15',93667),('2010-01-01 00:00:30',146130),('2010-01-01 00:00:45',200007),('2010-01-01 00:01:00',254828)]  │
│ 17 │ [('2010-01-01 00:00:00',50010),('2010-01-01 00:00:15',114072),('2010-01-01 00:00:30',174854),('2010-01-01 00:00:45',235872),('2010-01-01 00:01:00',297238)] │
│ 18 │ [('2010-01-01 00:00:00',57673),('2010-01-01 00:00:15',124279),('2010-01-01 00:00:30',189194),('2010-01-01 00:00:45',256112),('2010-01-01 00:01:00',323494)] │
│ 19 │ [('2010-01-01 00:00:00',51389),('2010-01-01 00:00:15',109605),('2010-01-01 00:00:30',170134),('2010-01-01 00:00:45',229371),('2010-01-01 00:01:00',286252)] │
└────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Sorting Arrays - To sort the arrays by a specific element, for example, by the second element of the tuple, you can use the arraySort function:

SELECT 
    id, 
    arraySort(x -> (x.2), groupArray((ts, val)))
FROM test_running_difference
GROUP BY id

┌─id─┬─arraySort(lambda(tuple(x), tupleElement(x, 2)), groupArray(tuple(ts, val)))─────────────────────────────────────────────────────────────────────────────────┐
│  0 │ [('2010-01-01 00:00:00',-1209),('2010-01-01 00:00:15',66839),('2010-01-01 00:00:30',135082),('2010-01-01 00:00:45',207471),('2010-01-01 00:01:00',275331)]  │
│  1 │ [('2010-01-01 00:00:00',43),('2010-01-01 00:00:15',19440),('2010-01-01 00:00:30',37345),('2010-01-01 00:00:45',54323),('2010-01-01 00:01:00',72668)]        │
│  2 │ [('2010-01-01 00:00:00',4322),('2010-01-01 00:00:15',74513),('2010-01-01 00:00:30',148744),('2010-01-01 00:00:45',217984),('2010-01-01 00:01:00',290366)]   │
│  3 │ [('2010-01-01 00:00:00',-25),('2010-01-01 00:00:15',10542),('2010-01-01 00:00:30',21607),('2010-01-01 00:00:45',31835),('2010-01-01 00:01:00',46074)]       │
│  4 │ [('2010-01-01 00:00:00',13720),('2010-01-01 00:00:15',94245),('2010-01-01 00:00:30',171744),('2010-01-01 00:00:45',252709),('2010-01-01 00:01:00',329207)]  │
│  5 │ [('2010-01-01 00:00:00',903),('2010-01-01 00:00:15',8230),('2010-01-01 00:00:30',14736),('2010-01-01 00:00:45',21493),('2010-01-01 00:01:00',26770)]        │
│  6 │ [('2010-01-01 00:00:00',18062),('2010-01-01 00:00:15',87823),('2010-01-01 00:00:30',155349),('2010-01-01 00:00:45',221271),('2010-01-01 00:01:00',287619)]  │
│  7 │ [('2010-01-01 00:00:30',-3901),('2010-01-01 00:00:00',-2873),('2010-01-01 00:01:00',-2207),('2010-01-01 00:00:45',-488),('2010-01-01 00:00:15',-128)]       │
│  8 │ [('2010-01-01 00:00:00',6286),('2010-01-01 00:00:15',30101),('2010-01-01 00:00:30',54303),('2010-01-01 00:00:45',76827),('2010-01-01 00:01:00',100456)]     │
│  9 │ [('2010-01-01 00:00:00',13399),('2010-01-01 00:00:15',54321),('2010-01-01 00:00:30',89629),('2010-01-01 00:00:45',131066),('2010-01-01 00:01:00',165688)]   │
│ 10 │ [('2010-01-01 00:00:00',18320),('2010-01-01 00:00:15',64078),('2010-01-01 00:00:30',106595),('2010-01-01 00:00:45',149087),('2010-01-01 00:01:00',194136)]  │
│ 11 │ [('2010-01-01 00:00:00',11731),('2010-01-01 00:00:15',31886),('2010-01-01 00:00:30',54545),('2010-01-01 00:00:45',71934),('2010-01-01 00:01:00',94113)]     │
│ 12 │ [('2010-01-01 00:00:00',857),('2010-01-01 00:00:15',8749),('2010-01-01 00:00:30',18903),('2010-01-01 00:00:45',25125),('2010-01-01 00:01:00',35810)]        │
│ 13 │ [('2010-01-01 00:00:00',8752),('2010-01-01 00:00:15',28982),('2010-01-01 00:00:30',48023),('2010-01-01 00:00:45',65274),('2010-01-01 00:01:00',85081)]      │
│ 14 │ [('2010-01-01 00:00:00',23060),('2010-01-01 00:00:15',61299),('2010-01-01 00:00:30',97930),('2010-01-01 00:00:45',135980),('2010-01-01 00:01:00',170256)]   │
│ 15 │ [('2010-01-01 00:00:00',41902),('2010-01-01 00:00:15',95867),('2010-01-01 00:00:30',152165),('2010-01-01 00:00:45',210910),('2010-01-01 00:01:00',265445)]  │
│ 16 │ [('2010-01-01 00:00:00',39406),('2010-01-01 00:00:15',93667),('2010-01-01 00:00:30',146130),('2010-01-01 00:00:45',200007),('2010-01-01 00:01:00',254828)]  │
│ 17 │ [('2010-01-01 00:00:00',50010),('2010-01-01 00:00:15',114072),('2010-01-01 00:00:30',174854),('2010-01-01 00:00:45',235872),('2010-01-01 00:01:00',297238)] │
│ 18 │ [('2010-01-01 00:00:00',57673),('2010-01-01 00:00:15',124279),('2010-01-01 00:00:30',189194),('2010-01-01 00:00:45',256112),('2010-01-01 00:01:00',323494)] │
│ 19 │ [('2010-01-01 00:00:00',51389),('2010-01-01 00:00:15',109605),('2010-01-01 00:00:30',170134),('2010-01-01 00:00:45',229371),('2010-01-01 00:01:00',286252)] │
└────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

20 rows in set. Elapsed: 0.004 sec. 

This sorts each array by the val (second element of the tuple) for each id.

Simplified Sorting Example - We can rewrite the query in a more concise way using WITH clauses for better readability:

WITH 
    groupArray(tuple(ts, val)) as window_rows,
    arraySort(x -> x.1, window_rows) as sorted_window_rows
SELECT 
    id, 
    sorted_window_rows
FROM test_running_difference
GROUP BY id

Applying Calculations with Arrays - Once the data is sorted, you can apply array functions like arrayMap and arrayDifference to calculate differences between values in the arrays:

WITH 
    groupArray(tuple(ts, val)) as window_rows,
    arraySort(x -> x.1, window_rows) as sorted_window_rows,
    arrayMap(x -> x.2, sorted_window_rows) as sorted_window_rows_val_column,
	arrayDifference(sorted_window_rows_val_column) as sorted_window_rows_val_column_diff
SELECT 
    id, 
    sorted_window_rows_val_column_diff
FROM test_running_difference
GROUP BY id

┌─id─┬─sorted_window_rows_val_column_diff─┐
│  0 │ [0,68048,68243,72389,67860]        │
│  1 │ [0,19397,17905,16978,18345]        │
│  2 │ [0,70191,74231,69240,72382]        │
│  3 │ [0,10567,11065,10228,14239]        │
│  4 │ [0,80525,77499,80965,76498]        │
│  5 │ [0,7327,6506,6757,5277]            │
│  6 │ [0,69761,67526,65922,66348]        │
│  7 │ [0,2745,-3773,3413,-1719]          │
│  8 │ [0,23815,24202,22524,23629]        │
│  9 │ [0,40922,35308,41437,34622]        │
│ 10 │ [0,45758,42517,42492,45049]        │
│ 11 │ [0,20155,22659,17389,22179]        │
│ 12 │ [0,7892,10154,6222,10685]          │
│ 13 │ [0,20230,19041,17251,19807]        │
│ 14 │ [0,38239,36631,38050,34276]        │
│ 15 │ [0,53965,56298,58745,54535]        │
│ 16 │ [0,54261,52463,53877,54821]        │
│ 17 │ [0,64062,60782,61018,61366]        │
│ 18 │ [0,66606,64915,66918,67382]        │
│ 19 │ [0,58216,60529,59237,56881]        │
└────┴────────────────────────────────────┘

20 rows in set. Elapsed: 0.005 sec. 

You can do also a lot of magic with arrayEnumerate and accessing different values by their ids.

Reverting Arrays Back to Rows - You can convert the arrays back into rows using arrayJoin:

WITH 
    groupArray(tuple(ts, val)) as window_rows,
    arraySort(x -> x.1, window_rows) as sorted_window_rows,
    arrayMap(x -> x.2, sorted_window_rows) as sorted_window_rows_val_column,
	arrayDifference(sorted_window_rows_val_column) as sorted_window_rows_val_column_diff,
	arrayJoin(sorted_window_rows_val_column_diff) as diff
SELECT 
    id, 
    diff
FROM test_running_difference
GROUP BY id

Or use ARRAY JOIN to join the arrays back to the original structure:

SELECT 
  id,
  diff,
  ts
FROM 
(
WITH 
    groupArray(tuple(ts, val)) as window_rows,
    arraySort(x -> x.1, window_rows) as sorted_window_rows,
    arrayMap(x -> x.2, sorted_window_rows) as sorted_window_rows_val_column
SELECT 
    id, 
	arrayDifference(sorted_window_rows_val_column) as sorted_window_rows_val_column_diff,
    arrayMap(x -> x.1, sorted_window_rows) as sorted_window_rows_ts_column
FROM test_running_difference
GROUP BY id
) as t1
ARRAY JOIN sorted_window_rows_val_column_diff as diff, sorted_window_rows_ts_column as ts

This allows you to manipulate and analyze data within arrays effectively, using powerful functions such as arrayMap, arrayDifference, and arrayEnumerate.