Machine Learning Toolkit Update: Multi-Parameter FRESH and Updated Utilities

25 April 2019 | 7 minutes

By Diane O’Donoghue

The KX machine learning team has an ongoing project of periodically releasing useful machine learning libraries and notebooks for kdb+. These libraries and notebooks act as a foundation for our users, allowing them to use the ideas presented and the code provided to access the exciting world of machine learning with KX.

This release, the first in a series planned for 2019, updates the functionality of the FRESH (Feature Extraction based on Scalable Hypothesis tests) algorithm and adds a number of accuracy metrics, preprocessing functions and utilities. Alongside the code changes, the namespace structure of the toolkit has been modified to streamline the code and improve the user experience.

The toolkit is available in its entirety on the KX GitHub here, with supporting documentation on code.staging.kx.com.

As with all the libraries released from the KX machine learning team, the ML-Toolkit and its constituent sections are available as open source, Apache 2 software, and are supported for our clients.

Background

The Machine Learning Toolkit (ML-Toolkit) contains general use functions for preprocessing data and scoring the results from machine learning algorithms. These can be used alongside the FRESH algorithm to allow users to easily perform machine learning tasks on structured time-series data.

Since the initial release of the ML-Toolkit, numerous functions have been added or updated in order to improve performance, add functionality and allow machine learning tasks to be performed on a broader range of datasets.

The changes that have been made are outlined briefly below. Full documentation of the expected behavior of the functions is available at https://code.staging.kx.com/v2/ml/toolkit/.

Technical Description

Utilities

The utilities section of the toolkit has now been split into three distinct sections – preprocessing, metrics and utils. This structure allows for future expansion of the toolkit into a wider variety of sections and for the individual loading of specific sections of the utilities.

The major change within this section is the removal of the `.ml.util` namespace. All functions within utilities are now contained in the `.ml` namespace to remove ambiguity which arose between true utility functions and remaining toolkit functionality.

The primary additions to the toolkit have been made within the preprocessing and metrics sections. As only aesthetic changes to outputs were made within the utils script, such modifications are not outlined here.

Preprocessing

Additional functions have been added to the toolkit to preprocess data and deal with the inability of machine learning models to handle specific data types, namely categorical and date/time types.

The forms of encoding created to handle such behavior are as follows:

Frequency encoding
Lexicographical encoding
Time-split encoding

Preprocessing categorical features in this manner, via frequency or lexicographical encoding, can produce a marked improvement in performance over one-hot encoding methods.
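To make the two categorical encodings concrete, the following is a minimal Python sketch of the general ideas rather than the toolkit's own implementation. The function names `frequency_encode` and `lexicographic_encode` are hypothetical, and it assumes that frequency encoding means the relative proportion of rows holding each category and that lexicographical encoding means the rank of each category in sorted order.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each category with the fraction of rows in which it occurs.
    (Hypothetical helper; relative proportion is an assumption.)"""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]

def lexicographic_encode(values):
    """Replace each category with its rank among the sorted distinct categories.
    (Hypothetical helper; zero-based ranks are an assumption.)"""
    order = {v: i for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]

colours = ["red", "blue", "red", "green", "red", "blue"]
freq = frequency_encode(colours)      # one float per row
lexi = lexicographic_encode(colours)  # one integer per row
```

Both encodings produce a single numeric column per categorical column, whereas one-hot encoding produces one column per distinct category.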

Time-series data can play an important role in the outcome of certain models. By splitting time-series columns into their constituent parts (such as the day of the week, month, season, etc.) during the preprocessing stages, additional information can be extracted from which a machine learning model can learn patterns within the data. For example, it may be possible to find that peak demand for a product is always at the weekend. The following shows how the new function `.ml.timesplit` is used to separate time and datetime columns into their constituent parts.

q)2#timetab:([]`timestamp$2000.01.01+til 5;5?0u;5?10;5?10)
x                             x1    x2 x3
-----------------------------------------
2000.01.01D00:00:00.000000000 21:51 7  6
2000.01.02D00:00:00.000000000 02:55 5  7
q)2#.ml.timesplit[timetab;::] /default behaviour: encode all time/date cols
x2 x3 x_dow x_year x_mm x_dd x_qtr x_wd x_hh x_uu x_ss x1_hh x1_uu
------------------------------------------------------------------
7  6  0     2000   1    1    1     0    0    0    0    21    51
5  7  1     2000   1    2    1     0    0    0    0    2     55
q)2#.ml.timesplit[timetab;`x1]
x                             x2 x3 x1_hh x1_uu
-----------------------------------------------
2000.01.01D00:00:00.000000000 7  6  21    51   
2000.01.02D00:00:00.000000000 5  7  2     55   
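For readers who want to see the idea outside q, here is a rough Python sketch of splitting a timestamp into constituent parts. It is not the toolkit's code: `split_timestamp` is a hypothetical name, the keys only mirror the column suffixes shown above, and Python's `weekday()` numbers Monday as 0, which differs from the `x_dow` numbering in the output above (where Saturday 2000.01.01 maps to 0).

```python
from datetime import datetime

def split_timestamp(ts):
    """Break a datetime into numeric parts a model can consume.
    (Hypothetical helper; key names mirror the column suffixes above.)"""
    return {
        "dow": ts.weekday(),             # Python convention: Monday = 0
        "year": ts.year,
        "mm": ts.month,
        "dd": ts.day,
        "qtr": (ts.month - 1) // 3 + 1,  # quarter of the year, 1 to 4
        "wd": int(ts.weekday() < 5),     # 1 on a weekday, 0 at the weekend
        "hh": ts.hour,
        "uu": ts.minute,
        "ss": ts.second,
    }

parts = split_timestamp(datetime(2000, 1, 1, 21, 51, 0))
```

A flag such as `wd` is what would let a model pick up the weekend-demand pattern mentioned earlier.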

Metrics

Given the variety of scenarios which may arise, an extensive set of scoring metrics for testing the results of regression and classification models has been supplied within the toolkit. In addition to those available within the initial toolkit release, functions have been added for the computation of F1 score, R2 score, Matthews correlation coefficient and root mean squared error, among others.

Given that a variety of new classification metrics are now present, a number of these have been wrapped together to create a table known as a classification report to display the performance of a model in predicting the correct class.

q)xr:1000?2 /vector of predicted labels
q)yr:1000?2 /vector of true labels
q).ml.classreport[xr;yr]
class    | precision recall    f1_score  support
---------| -------------------------------------
0        | 0.5171717 0.4885496 0.5024534 524    
1        | 0.4693069 0.4978992 0.4831804 476    
avg/total| 0.4932393 0.4932244 0.4928169 1000
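As a rough illustration of what each column in such a report represents, the sketch below computes per-class precision, recall, F1 score and support in plain Python. It is not the toolkit's implementation: `class_report` is a hypothetical name, and it assumes that support is the number of true instances of each class.

```python
def class_report(pred, true):
    """Per-class precision, recall, F1 score and support.
    (Hypothetical helper; support is assumed to count true labels.)"""
    report = {}
    for c in sorted(set(true) | set(pred)):
        tp = sum(p == c and t == c for p, t in zip(pred, true))
        fp = sum(p == c and t != c for p, t in zip(pred, true))
        fn = sum(p != c and t == c for p, t in zip(pred, true))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = {"precision": precision, "recall": recall,
                     "f1_score": f1, "support": tp + fn}
    return report

pred = [1, 0, 1, 1, 0, 0, 1, 0]  # predicted labels
true = [1, 0, 0, 1, 0, 1, 1, 0]  # true labels
report = class_report(pred, true)
```

Precision is the share of predictions for a class that were correct, recall is the share of true instances of a class that were found, and F1 is their harmonic mean.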

FRESH

For a detailed explanation of how the FRESH algorithm operates, both in regard to feature extraction and selection, please read the relevant blog here. The following shows how the feature extraction and selection procedures have been updated since the last release.

Feature Extraction

The function to complete the extraction of features is as follows:

.ml.fresh.createfeatures[tab;aggs;cnames;ptab]

The inputs to the first three parameters are the same as those in the initial release. The major modification to the function is in the fourth parameter.

Previously, this parameter took in a dictionary of the functions to be applied to the dataset, with support only provided for functions that took the data from individual IDs within columns as input. In the new release, `.ml.fresh.createfeatures` allows both single and multi-parameter functions to be applied during feature extraction. This can be done by passing the function a table (defined as default by `.ml.fresh.params`) as the fourth argument.

The below outlines the structure of this table.

q)show ptab:.ml.fresh.params /example of the hyperparameter table
f              | pnum pnames         pvals                  valid
---------------|------------------------------------------------------
absenergy      | 0    ()             ()                     1
abssumchange   | 0    ()             ()                     1
count          | 0    ()             ()                     1
autocorr       | 1    ,`lag          ,0 1 2 3 4 5 6 7 8 9   1
binnedentropy  | 1    ,`lag          ,2 5 10                1
c3             | 1    ,`lag          ,1 2 3                 1
..

Functions to be applied to the data are determined by the ‘valid’ column. As such, the use of table updates can limit the functions that are to be applied or the hyperparameters for multi-input functions.

Feature Significance

Once feature extraction has been performed on the data, feature significance can be used to select a subset of features that are deemed to be statistically significant. In the previous release, feature significance testing could only be performed using the Benjamini-Hochberg-Yekutieli procedure. A reformatting of this function has allowed further methods to be introduced, the options for which are as follows:

Benjamini-Hochberg-Yekutieli (BHY) procedure – Determines, from its p-value, whether the feature in question meets a False Discovery Rate (FDR) level set by the user as a float.
K-significant features – Returns a list of the k best features, those with the lowest p-values.
Percentile significant features – Returns the features whose p-values fall within the top p percentile.

Below is an example of how these methods are applied:

/set the target vector of values to be predicted
q)targets:value exec avg col2+.001*col2 by date from tab
q)t:value cfeats

/return features that meet an FDR level of 0.05
q)show benj:.ml.fresh.significantfeatures[t;targets;.ml.fresh.benjhoch .05]
`col2_mean`col2_sumval`col2_fftcoeff_maxcoeff_10_coeff_0_real`col2_fftcoeff_m
..
q)count benj
31

/return the 30 most significant features
q)show ksig:.ml.fresh.significantfeatures[t;targets;.ml.fresh.ksigfeat 30]
`col2_mean`col2_sumval`col2_fftcoeff_maxcoeff_10_coeff_0_real`col2_fftcoeff_m
..
q)count ksig
30

/return the features within the top 0.45 percentile
q)show perc:.ml.fresh.significantfeatures[t;targets;.ml.fresh.percentile .45]
`col1_absenergy`col1_abssumchange`col1_countabovemean`col1_firstmax`col1_firs
..
q)count perc
193
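The logic behind the three selection methods can be sketched in a few lines of Python, working directly on a dictionary of p-values. This is an illustration under stated assumptions, not the toolkit's code: the function names are hypothetical, the first function uses the standard Benjamini-Yekutieli step-up rule, and "top p percentile" is interpreted here as keeping the fraction p of features with the smallest p-values.

```python
import math

def bhy_select(pvals, alpha):
    """Features passing the Benjamini-Yekutieli step-up rule at FDR level alpha.
    (Hypothetical helper; pvals maps feature name to p-value.)"""
    m = len(pvals)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction term
    ranked = sorted(pvals.items(), key=lambda kv: kv[1])
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank * alpha / (m * c_m):
            cutoff = rank  # keep the largest rank that meets its threshold
    return [name for name, _ in ranked[:cutoff]]

def k_best(pvals, k):
    """The k features with the lowest p-values."""
    return sorted(pvals, key=pvals.get)[:k]

def top_fraction(pvals, frac):
    """The fraction `frac` of features with the lowest p-values
    (one assumed reading of 'top p percentile')."""
    return k_best(pvals, math.ceil(frac * len(pvals)))

pvals = {"f1": 0.001, "f2": 0.02, "f3": 0.04, "f4": 0.5}
by_fdr = bhy_select(pvals, 0.05)
by_k = k_best(pvals, 2)
by_frac = top_fraction(pvals, 0.5)
```

The FDR-based rule lets the data decide how many features survive, whereas the k-best and percentile rules fix the number or share of features in advance.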

If you would like to further investigate the use of any of the functions contained in the ML-Toolkit, check out the files on our GitHub here and visit https://code.staging.kx.com/v2/ml/toolkit/ to find documentation and the complete list of the functions available within the ML-Toolkit. Example implementations of a wide range of functionality are also available here.

For steps regarding the set up of your machine learning environment, see the installation guide available at https://code.staging.kx.com/v2/ml/setup/ which details the installation of kdb+/q, embedPy and JupyterQ.

Please do not hesitate to contact ai@devweb.kx.com if you have any suggestions or queries.