Advanced Pattern Matching and Text Manipulation in kdb+ with embedPy

25 November 2020 | 9 minutes

By Jun Bing Neo

Introduction

A regular expression (shortened as regex or regexp) is a sequence of characters that define a search pattern. Such patterns are frequently used by string searching algorithms for “find” or “find and replace” operations on strings, or for input validation.

Regular expressions provide a flexible and concise means to match strings of text. For example, a regular expression could be used to search through large volumes of text and change all occurrences of “cat” to “dog”. Such manipulation can play an important role in the ingestion, transformation and querying of data, and for that reason, kdb+ provides both in-built regex functionality and the ability to import functionality from external libraries that provide similar utilities. One such example is Python. In this blog, we describe how Python’s regex can be easily integrated with q using the re module to fully utilize the mutual power of both languages together.

Importing Python Functions

The first step in running run Python functions in q is to make use of the embedPy functionalities to import Python modules and functions. More information on the installation of embedPy can be found on Github.

Assuming that we have already installed embedPy into our version of q, the syntax required to import a function is as follows:

sub:.p.import[`re;`:sub ;<];

The function .p.import takes in the module name, followed by the methods or functions within this module. In this case,”re” is the module name, and “sub” is the function name within this module. Lastly, “<” means that we want the function to return a q object from a Python object.

With this method of importing Python functions, it means we can run a Python function and return a q object. In the next section, we will look at more examples of running the actual function itself.

Basic functions

The first example to showcase the capabilities of Python regexes. The first base case shows that the functionalities of Python regex substitution with sub is comparable to q’s ssr.

//Load embedPy
q)\l p.q
//Same base case to substitute
q)sub:.p.import[`re;`:sub ;<];
//Required syntax
//sub[<pattern>;<replacement>;<string>]    
q)sub["te.t";"do";"testing"]
"doing"
q)ssr["testing";"te?t";"do"]
"doing"

Next, we would like to have a more complicated example and substitute the digits in a string.

//Substitute the 8 digits
q)str:"testing20181016.csv"
q)ssr[str;raze 8#enlist "[0-9]";"NewNumber"]
"testingNewNumber.csv"
q)sub["\\d{8}";"NewNumber";"testing20181016"]
"testingNewNumber.csv"

From this example, we can see that it is still rather simple to replace a continuous segment of 8 digits to another string.

So what happens if we complicate the example and try to replace segments of digits that are either 6 or 8 continuous digits? In a normal date format, it is common that the date can either be of 2 formats “20181213” or “181213”. Hence, we need a way to determine if the segment is 6 or 8 digits and substitute it if needed.

//Find the value which matches 6 or 8 digits which are continuous
q)str:"testing20181016.csv"
q)valToUse:max val where not null raze {str ss raze x#enlist "[0-9]"} each val:6 8
//The string is an 8 digit date format, we expect the result to be 8
q)valToUse
8
q)ssr[str;raze valToUse#enlist "[0-9]";"NewNumber"]
"testingNewNumber.csv"
//Using Python regex functionality
q)sub["\\d{6,8}";"NewNumber";"testing20181016"]
"testingNewNumber.csv"

Let us look at another example. We would like to substitute a continuous 6 or 8 digits in a string. Also note that the string has multiple segments which are 6 or 8 digits.

//New string example
str: "testing20181016.csv.170292"
sub["\\d{6,8}";"NewNumber";str]
"testingNewNumber.csv.NewNumber"
//No easy way to do it using q functions

These examples illustrate how incorporating specialized external functionality can sometimes simplify the implementation.

Advanced Features

These are simple Python functions loaded using embedPy into q. However, we can do so much more with it by creating our own functions in Python and then importing to q using embedPy

In these functions, I have used the compile method for the regexes which you will see in the script below.

The purpose of the compile method is to compile the regex pattern which will be used for matching later. It’s advisable to compile regex when it’ll be used several times in your program. Resaving the resulting regular expression object for reuse, which re.compile does, is more efficient.

A custom regex script regUtils.py has been created which consists of the functions regMatch, regSearch and regFormat.

Match

def regMatch(string,pattern,flags=0):
    assert type(pattern) is str, 'pattern must be a string type'
    if type(string) is list:
        assert all([type(i) is str for i in string]), 'string input contains elements which are not of string type'
        prog = re.compile(pattern,flags)
        return [bool(prog.fullmatch(i)) for i in string]
    else:
        assert type(string) is str, 'string input is not of string type'
        return bool(re.fullmatch(pattern,string,flags))

The below examples aim to match a string to a pattern.

q).reg.regMatch:.p.import[`regUtils;`:regMatch;<];
q)like["hello test hello";"*te*hello"]
nyi
q).reg.regMatch["hello test hello";".*te.*hello"]
1b

As the imported Python function supports some additional “*” search patterns it may be more appropriate depending on the implementation.

Another use case is to match a group of pattern for a repeated number of times

q).reg.regMatch["many many examples";"(many ){2}examples"]
1b

The regex can then be employed during querying and filtering a table as below:

//Create a table
q)t:([] idx:1 2 3 4 5;   str:("testhello1";"testhello2";"hellono1pattern"; "hellono2test";"mytest"));
q)t
idx str
---------------------
1   "testhello1"
2   "testhello2"
3   "hellono1pattern"
4   "hellono2test"
5   "mytest"

//Selecting with a pattern
q)select from t where str like "*no*test*"
nyi
q)select from t where str like "*no*", str like "*test*"
idx str       
-------------
4   "no2test"
q)select from t where .reg.regMatch[a;".*no.*test.*"]
idx str       
-------------
4   "no2test"

The example above illustrates the benefit of a regex match expression over using multiple where clauses that may in some instances result in false positives. Here is a simple example:

//False positive case
q)t2:([] idx:1 2 3 4 5 6;   str:("testhello1";"testhello2";"hellono1pattern"; "hellono2test";"mytest";"test2no"));
q)select from t where str like "*no*", str like "*test*"
idx str          
------------------
4   "hellono2test"
6   "test2no"

This generates 2 results and 1 of them is a false positive which is not ideal.

Here are some simple examples of the regMatch function

String	Pattern	Result
test	t.st	1b
many many examples	(many ){2}examples	1b
hello test hello	.te.hello	1b
test	t*st	0b
test	t(b\|a)st	0b

Search

def regSearch(string,pattern,flags=0):
    assert type(pattern) is str, 'pattern must be a string type'
    assert type(string) is str, 'string input is not of string type'
    res = re.search(pattern,string,flags)
    if res:
        return res.group(0)
    else:
        return ''

The below examples aim to search for a pattern in a string. The string below is an example of a field which contains an ISIN number which are 2 alphabets followed by 9 digits and we aim to extract the ISIN number from this string.

q)str:"|2033|FR103789727|AT.X|20000|"
q).reg.regSearch:.p.import[`regUtils;`:regSearch;<];
q).reg.regSearch[str;"\\w{2}\\d{9}"]
"FR103789727"

//Alternative implementation in q
q)alphabetCnt:2;
q)digitCnt:9;
q)totalCnt:alphabetCnt+digitCnt
q)str (til totalCnt) +first str ss (raze alphabetCnt#enlist"[A-Z]"),(raze digitCnt#enlist "[0-9]")
"FR103789727"

Here are some simple examples of the regSearch function

String	Pattern	Result
An analysis of the anacondas antlers	an\\w{1}	ana
\|2033\|FR103789727\|AT.X\|20000\|	\\w{2}\\d{9}	FR103789727
these tests are for testing	test\\w+	tests
testing done for these test	test\\w+	testing

Formatting

The regFormat function would not be considered a regex functionality. However, it is an interesting function to be able to format the string into a form which can be used in conjunction with the regex functionalities.

def regFormat(string,pattern):
    if not type(pattern) is list: pattern=[pattern]
    return string.format(*pattern)

The below examples aim to add thousands separators to a number.

q).reg.regFormat:.p.import[`regUtils;`:regFormat;<];
q)number:1000.12;
q).reg.regFormat["{:,}";number]
"1,000.12"

Cheatsheet

A table containing useful regexes and their underlying meaning is available here. The full documentation for the re module is available on the Python website.

The best way to effectively use these regexes is to practice more. The only way to get familiar with it is by repeatedly experimenting and using these functions ourselves!

Conclusion

This shows the augmented capabilities that Python can provide and synergistically interact with q to fully expand the capabilities of both languages when used together.

There are many use cases for regexes in many fields. An application of regex is in feature engineering and data processing for NLP. It could be used to search through a large dataset of emails and extract the sender, recipient and content of each one matching certain words and phrases. Specific applications of it are for detecting insider trading, collusion in market surveillance systems where data from emails are collected and scanned through to pick up any suspicious activities.

The custom Python script used in the above examples can be found on Github under the embedpy-regex repository.

Further Reads:

A comparison of Python and q for data analysis:

http://dev2.kxsystems.net/blog/a-comparison-of-python-and-q-for-data-analysis/

embedPy Github repo:

https://github.com/kxsystems/embedpy

Kx NLP Libraries:

Alternative Regex Libraries: