Feature Engineering in kdb+

By Fionnuala Carr

As part of KX25, the international kdb+ user conference held May 18th in New York City, a series of seven JupyterQ notebooks was released and is now available on https://code.kx.com/q/ml/. Each notebook demonstrates how to implement a different machine learning technique in kdb+, primarily using embedPy, covering tasks from feature extraction to fitting and testing a model. These notebooks act as a foundation for our users, allowing them to manipulate the code and get access to the exciting world of machine learning within KX. (For more about the KX machine learning team, please watch Andrew Wilson’s presentation at KX25 on the KX YouTube channel.)

Background

Feature engineering is an essential part of the machine learning pipeline. It can be considered the process of creating or combining data variables to give us new and valuable insights into our datasets. These features can help us to learn about the structure of the problem. Feature selection is a related problem, and can be considered the process of finding the most important variables in our dataset. Having the correct features can improve the accuracy of the model, and engineering them is often the most valuable task that can be completed to improve a model's performance. Better features also allow us to use less complex models, which are faster and easier to understand and can often provide the same or even higher accuracy.

Manual feature engineering requires both knowledge of the chosen algorithm and domain knowledge of the dataset. In this notebook, we investigate four different scaling algorithms and check whether they have an impact on the results of a k-Nearest Neighbors classifier. The four scaling algorithms are:

The Standard Scaler scales the data via a linear transformation that transforms the mean value of each feature to 0 and the standard deviation of each feature to 1.

The MinMax Scaler scales the data via a linear transformation that maps the minimum and maximum values of each feature to 0 and 1 respectively, so that every value lies in the range [0,1].

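The Robust Scaler scales the data via a linear transformation that centers each feature on its median and divides by its interquartile range, so that the scaled interquartile range has width 1. Because it relies on the median and quartiles rather than the mean and extreme values, it is less sensitive to outliers.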

The Quantile Transformer converts the distribution of each feature to a standard normal distribution. This is not a linear transformation, but a transformation based on the distribution of the inputs.
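The four transformations above can be sketched for a single feature in plain Python. This is only an illustration under stated assumptions, not the notebook's code (which uses the scikit-learn scalers through embedPy): the function names and the sample values are invented here, the robust scaler follows the median and interquartile-range convention, and the quantile mapping is a simplified rank-based version rather than scikit-learn's interpolated one.

```python
from statistics import NormalDist, mean, median, pstdev, quantiles

def standard_scale(xs):
    """Shift to mean 0 and scale to (population) standard deviation 1."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def minmax_scale(xs):
    """Map the minimum to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def robust_scale(xs):
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]

def quantile_normal(xs):
    """Simplified rank-based map to a standard normal (ties not treated specially)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        # Mid-rank probability keeps the argument strictly inside (0, 1).
        out[i] = NormalDist().inv_cdf((rank + 0.5) / n)
    return out

# Hypothetical feature with one large outlier.
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
standardized = standard_scale(feature)
minmaxed = minmax_scale(feature)
robust = robust_scale(feature)
normalized = quantile_normal(feature)
```

With the outlier present, the min-max version squeezes the first four values close to 0, whereas the rank-based version depends only on the ordering of the values, so the outlier has no more influence than any other largest value would.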

Many machine learning algorithms cannot operate directly on categorical (label) data. They require all input and output variables to be in numerical form. This means that if a dataset contains categorical data, it needs to be converted to numeric values. There are two options for this conversion: integer encoding and one-hot encoding. Integer encoding assigns an integer value to each unique category. This method is suitable for categorical variables that are ordered. For variables that have no natural order, we apply one-hot encoding instead. We assign an integer index to each distinct category, and represent each value with a vector of 0s and 1s. The vector has length equal to the number of distinct categories, with a 0 in every position except the relevant category index, which is given the value 1.
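As a small illustration of the two encodings, here is a sketch in plain Python. The helper names and the example categories are hypothetical and are not the onehot function from the notebook's func.q.

```python
def integer_encode(values):
    """Assign each distinct category an integer index (sorted order)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[v] for v in values], categories

def one_hot(values):
    """Represent each value as a 0/1 vector with a single 1 at its category index."""
    codes, categories = integer_encode(values)
    width = len(categories)
    return [[1 if j == code else 0 for j in range(width)] for code in codes]

# Hypothetical unordered categorical variable.
colours = ["red", "green", "blue", "green"]
codes, categories = integer_encode(colours)
vectors = one_hot(colours)
```

The integer codes imply an ordering (blue < green < red) that the categories do not actually have, which is the reason one-hot vectors are preferred for unordered variables.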

We will investigate the impact of using one-hot encoding (versus a basic enumeration) using a neural network. Neural networks are a set of methods, modeled loosely on the human brain, that can recognize patterns. They consist of input and output layers, as well as hidden layers that transform the data into something the output layer can use. For further explanation and discussion of neural network models, please check out ML01 Neural Networks and its partner blog released recently (https://github.com/KxSystems/mlnotebooks/tree/master/notebooks/ and https://devweb.kx.com/blog/neural-networks-in-kdb-2/ ).

Technical description

Normalization

In the first part of the notebook, the four scaling algorithms discussed previously are investigated and their impact on a k-Nearest Neighbors classifier is quantified. The dataset used in this task is the UCI Breast Cancer Wisconsin (Diagnostic) dataset. It contains records for 569 patients, each with 30 features and a classification as either malignant or benign. The features describe the characteristics of cell nuclei present in the breast mass. Ten features are computed for each cell nucleus, including the radius, texture and perimeter. The dataset is included in the Python scikit-learn module, from which we can import it using embedPy and store it as q data.

Because the dataset is loaded as q data, we can explore and manipulate it quickly. We can determine the shape of the dataset using lambdas defined in q, and find out whether the classes are balanced using simple qSQL queries. This gives us insight into the dataset and tells us which steps must be taken before we pass the data to the k-Nearest Neighbors classifier.

The dataset is split into a training set and a test set using the testtrainsplit function that is defined in func.q. The k-Nearest Neighbors classifier requires the calculation of a distance between points in the feature space. The features must, therefore, all be on the same scale. Otherwise, the feature with the largest scale will dominate the distance and so receive the highest weighting. This process of scaling features is often referred to as normalization. A number of scaling algorithms are available within scikit-learn, and these can be imported using embedPy.
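To see why unscaled features distort a distance-based classifier, consider the following minimal sketch in plain Python. The six training points, the query point and the helper names are all invented for illustration; the notebook itself applies scikit-learn's classifier and scalers through embedPy.

```python
from collections import Counter
from math import dist

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to the query (Euclidean)."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def fit_minmax(rows):
    """Per-feature minimum and maximum, learned from the training set only."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_minmax(row, lows, highs):
    return tuple((x - lo) / (hi - lo) for x, lo, hi in zip(row, lows, highs))

# Hypothetical data: feature 1 (range 0..1) separates the classes,
# feature 2 (range 100..900) is uninformative but numerically much larger.
train_x = [(0.10, 500), (0.20, 100), (0.15, 900),
           (0.90, 520), (0.80, 120), (0.85, 880)]
train_y = ["benign"] * 3 + ["malignant"] * 3
query = (0.88, 105)

raw_prediction = knn_predict(train_x, train_y, query, k=3)

lows, highs = fit_minmax(train_x)
scaled_train = [apply_minmax(row, lows, highs) for row in train_x]
scaled_query = apply_minmax(query, lows, highs)
scaled_prediction = knn_predict(scaled_train, train_y, scaled_query, k=3)
```

On the raw data the second feature dominates every distance, so the query is assigned to the class of the points that happen to share its second-feature value. After scaling, both features contribute comparably and the informative first feature decides the vote.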

These scalers are applied to the raw dataset and a k-Nearest Neighbors classifier is applied to each of the scaled datasets. The accuracies are plotted as a function of the number of neighbors via matplotlib to discern the optimal number of neighbors (k). From the plot, it can be seen that the different scalers perform comparably for approximately k<40. However, the Quantile Transformer proves to be considerably more robust as k increases. This suggests that, at least for a subset of the feature variables, ordering is more important than the actual feature value.

One-hot encoding

In the second part of the notebook, we examine the impact of using one-hot encoding (versus a basic enumeration) by applying a neural network to the MNIST dataset. This is a large collection of handwritten digits which has been made available to the public by the US National Institute of Standards and Technology. The dataset is included in the keras module in Python, and it can be imported into q via embedPy and stored as q data.

The dataset consists of four separate byte-type datasets. The training and test images (defined in our case as xtrain and xtest) contain the handwritten digits as matrices in which each of the 28×28 pixels is represented by one value. The training and test labels (which we define as ytrain and ytest) contain the actual digit associated with each image, a value between 0 and 9, which allows us to determine whether our model is accurately identifying the relevant digits. We prepare the dataset by normalizing the pixel values and casting the label values to floats. As in the first section, we get the shape of the dataset and determine whether its classes are balanced using qSQL queries.
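The preparation steps described above can be sketched in plain Python as follows. This is an illustration only: the tiny 2×3 "image" and the label list are stand-ins rather than MNIST data, the divisor of 255 assumes byte-valued pixels in the range 0 to 255, and the notebook performs the equivalent steps in q.

```python
from collections import Counter

def normalize_pixels(image, max_value=255):
    """Scale byte-valued pixels into the range [0, 1] (assumes values 0..max_value)."""
    return [[pixel / max_value for pixel in row] for row in image]

# Stand-in data, not real MNIST images or labels.
raw_image = [[0, 128, 255],
             [64, 255, 0]]
labels = [3, 7, 3, 1, 7, 3]

scaled_image = normalize_pixels(raw_image)
float_labels = [float(y) for y in labels]   # labels cast to floats
class_counts = Counter(labels)              # quick check of class balance
```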

We define another copy of the handwritten digits dataset by applying one-hot encoding to the labels using the onehot function that is defined in func.q. Two convolutional neural network models are constructed, one to accommodate continuous labels and one for one-hot labels. Using embedPy, we import from keras the different layers that we will use to build the neural network model. We apply two convolutional layers, each of which applies a 3×3 filter, or kernel, across the image. These layers are used to detect edges and details in the image which help the model to determine the label of the image. They are followed by a max-pooling layer and a flattening layer. A dense layer is then used to combine the information from the different convolutional filters.
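To make the role of a 3×3 filter more concrete, the sketch below slides a hand-written vertical-edge kernel over a small image, then applies max pooling and flattening. It is a plain-Python illustration under stated assumptions: the 6×6 image and the kernel values are invented, no padding is used, and in the real model the kernel weights are learned by keras rather than fixed.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image with no padding (cross-correlation,
    the convention used by deep learning libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

def flatten(feature_map):
    return [value for row in feature_map for value in row]

# Hypothetical 6x6 image: dark left half, bright right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Hand-written kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d_valid(image, kernel)   # 4x4 map, large values at the edge
pooled = max_pool(feature_map)              # 2x2 map
flat = flatten(pooled)                      # vector fed to a dense layer
```

Under the same no-padding assumption, a 3×3 filter applied to a 28×28 image produces a 26×26 feature map.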

For the first model, we want the network to output one continuous value, which we train against the continuous value of the labels. We achieve this by defining a dense layer with one output node and a relu activation function. For the one-hot encoded labels, we instead apply a softmax layer generating 10 class probabilities, together with a cross-entropy loss function. Essentially, the output layer is the only layer that differs between the two models. We train each model for 20 epochs, which means that the network sees each image in the training set 20 times. The batch size is 128, which is the number of training examples processed in one forward/backward pass.
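The softmax output and cross-entropy loss used with the one-hot labels, and the arithmetic behind epochs and batch size, can be sketched in plain Python. This is a minimal illustration, not the keras implementation: the function names and the example logits are invented, and the figure of 60,000 training images is the size of the standard MNIST training set rather than a number stated in this article.

```python
from math import ceil, exp, log

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                      # subtract the max for numerical stability
    exps = [exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot_label):
    """Negative log-probability assigned to the true class."""
    return -sum(t * log(p) for t, p in zip(one_hot_label, probs) if t)

# An untrained network that scores all 10 digits equally.
uniform_probs = softmax([0.0] * 10)
label_for_3 = [1 if i == 3 else 0 for i in range(10)]
uniform_loss = cross_entropy(uniform_probs, label_for_3)   # equals log(10)

# A network that strongly favours the digit 3 (hypothetical logits).
confident_probs = softmax([5.0 if i == 3 else 0.0 for i in range(10)])
confident_loss = cross_entropy(confident_probs, label_for_3)

# Weight updates implied by 20 epochs at batch size 128,
# assuming the standard 60,000-image MNIST training set.
batches_per_epoch = ceil(60000 / 128)
total_updates = 20 * batches_per_epoch
```

The loss falls as the probability assigned to the correct digit rises, and each of the 10 outputs receives its own gradient signal during training.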

We plot the performance of both models using matplotlib. Comparing the two, it can be seen that one-hot encoding provides a smoother prediction accuracy. It also achieves a higher accuracy much faster than the continuous model. One-hot encoding allows the backpropagation algorithm to provide a separate training signal for each class. From these results, training with one-hot labels seems considerably faster and, ultimately, more successful.

If you would like to further investigate the uses of embedPy and machine learning algorithms, check out the ML 04 Feature Engineering notebook on GitHub (github.com/kxsystems/mlnotebooks), where several functions commonly employed in machine learning problems are also provided, together with some functions to create several interesting graphics. You can use Anaconda to set up your machine learning environment alongside your Python installation, or you can build your own by downloading kdb+, embedPy and JupyterQ. You can find the installation steps on code.kx.com/q/ml/setup/.

Don’t hesitate to contact ai@devweb.kx.com if you have any suggestions or queries.

Other articles in this JupyterQ series of blogs by the KX Machine Learning Team:

Natural Language Processing in kdb+ by Fionnuala Carr

Neural Networks in kdb+ by Esperanza López Aguilera

Dimensionality Reduction in kdb+ by Conor McCarthy

Classification Using k-Nearest Neighbors in kdb+ by Fionnuala Carr

Decision Trees in kdb+ by Conor McCarthy

Random Forests in kdb+ by Esperanza López Aguilera

