by Nuša Žnuderl
This article was updated in 2019 to reflect updated kdb+ terminology. The title and some text have been changed to reflect that the term “iterator” replaces the term “adverb,” with additions to the article written by KX Librarian Stephen Taylor. Please refer to the Kdb+ Developer’s site, code.staging.kx.com for further explanation and examples.
Iteration in programming languages is not exactly a problem begging for a solution. As a Java developer once said, “I can write for-loops in my sleep.”
The ghost of Turing Award winner Kenneth Iverson challenges programmers like that Java developer to wake up! As a young mathematician working on economic input-output matrices with Leontieff at Harvard in the 1950’s, Iverson developed a notation to describe computational processes. A mathematical notation, not a programming language. Like any mathematician, Iverson wanted abstractions that would help solve interesting problems.
Kdb+ and the q programming language are direct descendants of Iverson’s notation. A key feature of all the Iversonian languages is that wherever possible iteration is implicit, you don’t specify it, it just happens. If A
and B
are lists of the same length, then A+B
is a list of the sums of their corresponding items. If one, or both, are nested lists, the addition recurses through the nesting all the way down.
Q operators and keywords that do this are known as atomic, others require explicit iteration. Characteristically, q does this with minimal fuss. The iterators make code shorter, clearer, and almost always, more efficient than the alternative loopy modus operandi. These are qualities that differentiate code written by proficient q users from the rest.
Iterators were once known as adverbs, a term inherited from another Iversonian language, J. Their theory is covered in traditional q manuals, and in Jeff Borror’s book Q for Mortals, chapters 1 and 6, yet an average q user often struggles to use them in their q code. The examples below attempt to bridge the knowledge gap by demonstrating the application of iterators in solutions to a handful of problems q users regularly come across in real life.
Case 1
We have a table with column id
of type string. We can retrieve records with IDs starting with "abc"
like this: select from tab where id like "abc*"
, but how can we retrieve data for IDs starting with "abc"
or "def"
?
We construct our Where clause to use the aggregate any on the keyword like that we apply to the list of our id
stems using iterator Each Right /:
.
q)t:([]id:("abc123";"geh789";"def678";"abc321";"xyz1010"); p:5?10)
q)?[t;enlist(any;(like/:;`id;enlist ("abc*";"def*")));0b;()]
id p
----------
"abc123" 8
"def678" 9
"abc321" 5
Case 2
We are investigating data in a table and would like to see only the rows exhibiting changes in a set of fields that we are interested in.
We can use uniform keyword differ
(which uses the iterator Each Prior ':
to check the match between each consecutive items in a list) to each column of interest using the iterator Each Right /:
.
If we are interested in rows where all columns of interest changed simultaneously, we apply to the result the aggregate all. If instead we are interested in retrieving rows where any columns of interest changed, we replace all
with any
.
First, let’s apply this to a small, in-memory table.
q)t:([]
col1:1 1 3 3 3 7 7 1;
col2:1 4 4 4 6 7 6 4;
col3:2 1 2 1 2 2 2 2;
col4:`a`b`c`d`d`d`d`d;
col5:2 2 3 6 1 7 7 8 )
q)// retrieve rows where col1 and col4 changed
q)?[t;enlist(all;(enlist,differ,/:`col1`col4));0b;()]
col1 col2 col3 col4 col5
------------------------
1 1 2 a 2
3 4 2 c 3
q)// retrieve rows where col1 or col4 changed
q)?[t;enlist(any;(enlist,differ,/:`col1`col4));0b;()]
col1 col2 col3 col4 col5
------------------------
1 1 2 a 2
1 4 1 b 2
3 4 2 c 3
3 4 1 d 6
7 7 2 d 7
1 4 2 d 8
However, what if instead of an in-memory table we are working with data stored in a date-partitioned HDB, and we are trying to examine a daily time series of changes in a set of fields from this HDB that is too large to read into memory for a single date, never mind when looking across a date range? Such an examination can inform an efficient schema design where we separate fields which change together into separate tables.
We can approach this problem by reasoning as follows:
- Since we do not have enough memory to select all columns of interest, even for a single date, we will compare a subset of columns of interest across 2 consecutive days at a time.
To do this we write a function getDiffs
which takes as arguments:
- the HDB name
tab
(e.g.`ref
) - a symbol list of field names
fns
we can read from the HDB in one go (e.g.`col1`col2
- the dataset key
fk
(e.g.`id
) for which we want to track changes across days - date
d
for which we wish to run the comparison (e.g. 2017.02.10) - an in-memory table
seed
consisting of fields`date`id`col1`col2
for the date prior to the comparison dated
The function selects fields date
,fk
,fns
from the HDB for date d
to a local table t
. It then union-joins this table t
with the seed, and orders it in ascending order of fk
, such that if the same key exists on both dates, the records are adjacent. It then performs the any-differ-Each-Right algorithm we saw above and returns the result.
getDiffs:{[tab;fns;fk;seed;d]
coll:`date,fk,fns;
t:?[tab;enlist(=;`date;d);0b;{x!x}coll];
?[fk xasc t uj seed;enlist(any;enlist,differ,/:fns);0b;coll] }
We now need to apply this function recursively across the date range of interest. We can do this by applying the Over iterator /
to getDiffs
, and in addition to constant arguments tab
, fns
and fk
, pass to it the table seed
, which is the starting point for our comparison, and a list of dates.
This application of the iterator overachieves recursion:
- on the first iteration function
getDiffs
is applied withseed
and the first date in our list - on the second iteration
getDiffs
runs using thetable
output from the first iteration as its first argument, and the second date in our list as the second argument - and so on until it is applied to all dates in the list.
We wrap this logic in a function getPartDiffs
:
getPartDiffs:{[tab;fns;fk;seed;dates]
seed getDiffs[tab;fns;fk;;;]/ dates }
In the next step we apply getPartDiffs
to all fields of interest. We could split the full list of fields into manageable chunks we can read into memory, and apply every chunk to getPartDiffs
using the each
keyword, which is defined using the iterator Each as {x'y}
. This would get differences in the first cluster of fields across an entire date range, then it would proceed to the next cluster, and so on in sequence until it runs through all clusters. However, we can optimize on execution time if we run our session with multiple slaves (-s
command line option) and use the keyword peach instead of each
, which is defined using the Each Parallel iterator as {x':y}
. This will simultaneously kick off the execution of as many field clusters as there are slaves available. We express this logic in function getFullDiffs
where fnr
is the number of fields per cluster.
getFullDiffs:{[tab;fns;fk;seed;dates;fnr]
res:getPartDiffs[tab;;fk;seed;dates] peach fnr cut fns;
uj/[(`date,fk) xkey/:res] }
The output of getPartDiffs
invoked with peach
is a list of tables, all of which we key on date
and fk
, and union-join into a single table which represents cumulative differences in a set of fields across a date range.
For completeness, we may want to join to the output of getFullDiffs
any IDs which were present on our initial date but did not exhibit any differences during the time period we examined, and are therefore not present in the output of getFullDiffs
. Also, we may want to wrap the procedure into a single main function like this:
main:{[tab;sd;ed;fns;fk;fnr]
k:`date,fk;
dates:sd+1+til (ed-sd);
seed:?[tab;enlist (=;`date;sd);0b;x!x:k,fns];
res:getFullDiffs[tab;fns;fk;seed;dates;fnr];
unchangedKeys:?[res;();();(distinct;fk)];
(k xkey ?[seed;((=;`date;sd);(not;(in;fk;enlist unchangedKeys)));0b;()]) uj res }
Sample invocation of main
is:
q)t:main[`ref;2017.02.01;2017.02.28;`col1`col2`col3`col4`col5`col6`;`id;3]
We ran this procedure across 28 days for 9 fields in a heavily repetitive HDB (stored on DAS) that contains approximately 2.5 million records a day. The session ran with 6 slaves. We chunked the 9 fields of interest into 3 clusters of 3 fields each. Execution completed in approximately 11 seconds, with peak memory consumption of 1.5GB.
Case 3
Our data is fragmented across multiple tables. We frequently need to access data across tables, but we do not always need all fields from every table. We know the fields that link different tables, but we may not remember which field resides in which table. We would like to create a generic function which looks for the fields of interest in the correct tables and returns the result in one table.
This logic is wrapped into function getData. It uses several iterators to achieve minimum amount of data operations while remaining dynamic with respect to input fields.
each
is used in multiple places to modify a unary keyword (cols
,where
,count
) for application to every item of a list rather than the list as a whole- iterator Each Right
/:
is used to modify the binary keywordin
such that it searches for the list of fieldsf
in every list of columns in the dictionaryd
- iterator Each
'
is used to modify the select lambda by applying it pairwise to the corresponding lists of tables and fields - iterator Over
/
modifies the binary keyworduj
such that it applies cumulatively to the list of tables
getData:{[f;k]
tabs:tables[];
f:f,k;
d:tabs! colseach tabs;
d:f where each f in/: d;
d:(where 1<count each d)#d;
uj/[{[t;f;k] k xkey ?[t;();0b;f!f]}[;;k]'[key d;value d]] }
Sample call:
q)getData[`col1`col2;`date`sym]
Case 4
We have a daily time series of index constituents, their volumes and prices. The index constituency may change day over day. We calculate the daily volume weighted price of the index using the built-in function wavg
. The time series is naturally jumpy, but we also notice a few outliers.
Upon inspection we see that the outliers are caused by heavy index constituents which appear on a given day, and then completely disappear from the index. These are bad data points which we would like to exclude from the volume-weighted average calculation. In particular, we define a rule that a constituent will only contribute to the wavg
calculation if it is present in the index on two consecutive days.
To demonstrate the approach to solving this problem, we will use the following example, where wavg
on 2016.09.15 is an outlier we want to investigate further.
tab:([]
date:2016.09.01+-1+12 13 14 15 16 19 where 3 4 4 5 4 4;
sym:`m`c`e`m`b`c`e`m`b`c`e`m`b`f`c`e`m`b`c`e`m`b`c`e;
v:401.048 200.761 10.326 414.167 4.14167 207.146 10.283 366.41 3.6641 190.268
10.336 337.372 3.37372 400 195.094 9.955 501.436 5.01436 301.386 27.851
316.3 3.163 179.112 7.809;
p:21.5904 52.1189 183.2507 21.2702 12.1232 51.5525 180.6675 21.2577 12.1159
51.7377 180.3477 21.4726 12.2702 10 52.4969 182.1248 21.3916 12.2478 52.4457
181.238 21.3912 12.3253 52.3503 181.2017 )
q)show res1:select wav:v wavg p by date from tab
date | wav
----------| --------
2016.09.12| 34.32981
2016.09.13| 33.6559
2016.09.14| 34.24265
2016.09.15| 24.67822
2016.09.16| 37.86343
2016.09.19| 34.7495
We use xgroup
to collapse the table to one row per date. This table contains three nested columns: for each date, the column sym
contains a list of constituents, v
contains a list of volumes and p
contains a list of prices. Sym f
on 2016.09.15 is the outlier we want to remove.
q)select sym,v,p by date from tab
date | sym v p
----------| --------------------------------------------------------------------------------
2016.09.12| `m`c`e 401.048 200.761 10.326 21.5904 52.1189 183.2507
2016.09.13| `m`b`c`e 414.167 4.14167 207.146 10.283 21.2702 12.1232 51.5525 180.6675
2016.09.14| `m`b`c`e 366.41 3.6641 190.268 10.336 21.2577 12.1159 51.7377 180.3477
2016.09.15| `m`b`f`c`e 337.372 3.37372 400 195.094 9.955 21.4726 12.2702 10 52.4969 182.1248
2016.09.16| `m`b`c`e 501.436 5.01436 301.386 27.851 21.3916 12.2478 52.4457 181.238
2016.09.19| `m`b`c`e 316.3 3.163 179.112 7.809 21.3912 12.3253 52.3503 181.2017
In this collapsed table we can compare the list of constituents with that of the previous day and retain only common constituents using inter
with the Each Prior iterator ':
on the nested column sym
.
q)select date, sym, validSym:(inter':) sym from `date xgroup tab
date sym validSym
--------------------------------
2016.09.12 `m`c`e `symbol$()
2016.09.13 `m`b`c`e `m`c`e
2016.09.14 `m`b`c`e `m`b`c`e
2016.09.15 `m`b`f`c`e `m`b`c`e
2016.09.16 `m`b`c`e `m`b`c`e
2016.09.19 `m`b`c`e `m`b`c`e
We observe that we successfully removed the outlier `f
on 2016.09.15 – but it is worth noting that the price of using this method is that we lost some valid data points. For example, we now cannot calculate the average for the first day as there is no in-sample data to compare it to even though all data points may be valid. Also, sym `b
enters the dataset on 2016.09.13 and is not an outlier, but we will only begin counting it as a valid sym on 2016.09.14.
We extend this approach one step further to retrieve the position of validSym
in the original list of syms. For this we use the Each iterator with the Find operator ?
.
q)select date, sym, validSymPos:sym?'(inter':) sym from `date xgroup tab
date sym validSymPos
---------------------------------
2016.09.12 `m`c`e `long$()
2016.09.13 `m`b`c`e 0 2 3
2016.09.14 `m`b`c`e 0 1 2 3
2016.09.15 `m`b`f`c`e 0 1 3 4
2016.09.16 `m`b`c`e 0 1 2 3
2016.09.19 `m`b`c`e 0 1 2 3
Now that we know the position of data points we want to include in the calculation of wavg
, we retrieve the lists of volumes and prices using Apply At @
with the iterator Each '
. We then use Each one more time in the application of wavg
.
q)res2:1!select date,adjWav:{wavg'[x@'z;y@'z]}[v;p;sym?(inter':) sym] from `date xgroup tab
q)res2
date | adjWav
----------| -------
2016.09.12|
2016.09.13| 21.2702
2016.09.14| 12.1159
2016.09.15| 12.2702
2016.09.16| 12.2478
2016.09.19| 12.3253
Finally, in order to demonstrate another useful application of Each, let’s display side-by-side the wavg
calculation before (res1
) and after (res2
) the removal of the outlier.
q)res1,'res2
date | wav adjWav
----------| ----------------
2016.09.12| 34.32981
2016.09.13| 33.6559 21.2702
2016.09.14| 34.24265 12.1159
2016.09.15| 24.67822 12.2702
2016.09.16| 37.86343 12.2478
2016.09.19| 34.7495 12.3253
Case 5
Continuing from the previous example, we would like to further filter out the noise of price fluctuations and observe monthly trends in our price series, i.e. we want to examine the average, minimum, maximum, and standard deviation for a rolling window of 21 business days.
The application of these functions to a rolling window is so frequent that they have their own built-in functions in q: mavg
, mmin
, mmax
and mdev
. They take as their first argument the length of the moving window, and a numeric list to apply the function to as their second argument.
q)-10#select date,ma:21 mavg adjWav,minim:21 mmin adjWav,maxim:21 mmax adjWav,devi:21 mdev adjWav from 1_t
date ma minim maxim devi
----------------------------------------------
2017.02.17 44.39342 42.59418 46.12253 1.053702
2017.02.21 44.37788 42.59418 46.12253 1.032939
2017.02.22 44.42241 42.59418 46.12253 1.044719
2017.02.23 44.39069 42.59418 46.12253 1.077618
2017.02.24 44.32905 42.59418 46.12253 1.125346
2017.02.27 44.27849 42.59418 46.12253 1.147834
2017.02.28 44.21612 42.59418 46.12253 1.113893
2017.03.01 44.33964 42.59418 46.12253 1.106236
2017.03.02 44.41428 42.59418 46.12253 1.078212
We can generalize the concept of applying a function to a moving window for an arbitrary function f
. One approach is to use the Scan iterator \
to build a moving list by adding one item each time while dropping the oldest value. We can then apply an arbitrary function f
to every list using the each
keyword.
q)mwin:{[f;w;l] f each { 1_x,y }\[w#0n;`float$l]}
q)mwin[::;3;100+til 5]
100
100 101
100 101 102
101 102 103
102 103 104
We could also pass mwin
a custom function f
. For example, we may want to attach a higher weight to more recent data points in each moving window as defined in function wa
.
q)mwin[avg;3;100+til 10] ~ (3 mavg 100+til 10)
1b
q)wa:{w:1%1+reverse til count x;w wavg x}
q)-10#select date,wma:mwin[wa;5;adjWav] from t
date wma
------------------
2017.02.17 45.597
2017.02.21 45.655
2017.02.22 45.4277
2017.02.23 44.3347
2017.02.24 43.7596
2017.02.27 43.5462
2017.02.28 43.7518
2017.03.01 44.3833
2017.03.02 44.5088
2017.03.03 44.0903
To sum up, you can really improve your results by using iterators instead of looping constructs. Long term the benefits are vastly improved performance because you are doing things in the q way, so the interpreter can help you.
To see a video presentation of this paper by Nuša Žnuderl please visit the KX Youtube channel.
Nuša Žnuderl is an expert kdb+/q programmer and KX consultant currently based in New York City. Code snippets were created by Jack Kiernan, a KX data scientist, also currently based in New York. Updating of the article was done by Stephen Taylor, KX Librarian, currently based near London.