How to Build and Manage Databases Using PyKX

Conor McCarthyTechnical Lead for PyKX

24 January 2024 | 7 minutes

Introduction

For many years, KX and its partners have looked to provide solutions that simplify database management in kdb+, this has seen the launch of several products and frameworks, including kdb Insights Enterprise, kdb Insights Database, and Data Intellects, TorQ.

With the introduction of PyKX nearly 2 years ago, we delivered an initial new wave of capability, increasing access to our powerful analytic tooling, via a familiar python interface and opening kdb+ data estates to a new wave of developers. Over the years this has continued to evolve and harden capabilities.

Earlier this week, we released the next phase in this journey as part of PyKX 2.3. Specifically, through the beta release of a new feature; centred around “database management”, providing greater flexibility in the provisioning and management of kdb+ databases via Python.

With the initial release, developers will be able to:

Create a database and tables from any historical data.
Add new partitions and tables to a database.
Add and rename tables and columns.
Create copies of table columns.
Apply functions to columns.
Update data types of columns
Add & fix missing partitions from tables.
Interrogate the content/structure of your database.

Click here for a full breakdown of the API.

Walkthrough

Let’s explore some of these features further by walking through example management tasks.

Warning:
Operations on persisted databases can result in changes being made, which you may find difficult to revert. For example, the application of functions, which change the values of rows within a column may result in an updated value from which you cannot get the original data. Prior to using this functionality for complex use-cases you should be clear of the intention of your changes and ideally be able to mitigate issues with access to a backup of your data.

To begin, you will need to enable PyKX beta features:

import os
os.environ['PYKX_BETA_FEATURES'] = 'True'
import pykx as kx

Once initialised, you can validate what features are enabled by running the following command.

print(kx.config.beta_features)
print(kx.beta_features)

True
[‘Database Management’]

Next, create a new directory to host your database and initialise the DB class.

os.makedirs('/tmp/my_database')
db = kx.DB(path='/tmp/my_database')

From here, you can generate a new dataset which will be saved to the database.

from datetime import date 
N = 100000 
dataset = kx.Table(data={ 
 'date': kx.random.random(N, [date(2020, 1, 1), date(2020, 1, 2)]), 
 'sym': kx.random.random(N, ['AAPL', 'GOOG', 'MSFT']), 
 'price': kx.random.random(N, 10.0)  
 }) 
dataset


99999	2020.01.02	MSFT	7.364259
100,000 rows × 3 columns
	date	sym	price

0	2020.01.01	MSFT	4.802218
1	2020.01.01	GOOG	0.5058034
2	2020.01.01	AAPL	7.055037
3	2020.01.02	AAPL	2.924899
4	2020.01.01	MSFT	1.084145
5	2020.01.02	AAPL	1.1094
6	2020.01.01	GOOG	0.8396528
7	2020.01.02	MSFT	7.527356
8	2020.01.01	MSFT	7.716519
9	2020.01.02	MSFT	0.7148808
10	2020.01.01	MSFT	2.539819
11	2020.01.02	GOOG	9.568963
12	2020.01.02	GOOG	5.546168
13	2020.01.01	GOOG	9.614568
14	2020.01.01	AAPL	3.907669
15	2020.01.02	GOOG	6.349116
16	2020.01.01	MSFT	9.47742
17	2020.01.02	MSFT	3.082102
18	2020.01.01	GOOG	2.703125
19	2020.01.01	MSFT	2.438608
20	2020.01.02	AAPL	0.6931198
21	2020.01.02	MSFT	2.896335
22	2020.01.01	GOOG	2.624367
…	…	…	…

With the dataset created, save it to disk by partitioning(splitting) the data into sub-directories based on date.

db.create(dataset, table_name='db_table', partition='date')

Congratulations, you have now created your first kdb+ database. You can access this table as a property of the database.

db.db_table

	date	sym	price

0	2020.01.01	MSFT	4.802218
1	2020.01.01	GOOG	0.5058034
2	2020.01.01	AAPL	7.055037
3	2020.01.01	MSFT	1.084145
4	2020.01.01	GOOG	0.8396528
5	2020.01.01	MSFT	7.716519
6	2020.01.01	MSFT	2.539819
7	2020.01.01	GOOG	9.614568
8	2020.01.01	AAPL	3.907669
9	2020.01.01	MSFT	9.47742
10	2020.01.01	GOOG	2.703125
11	2020.01.01	MSFT	2.438608
12	2020.01.01	GOOG	2.624367
13	2020.01.01	AAPL	2.20948
14	2020.01.01	AAPL	7.839242
15	2020.01.01	AAPL	0.8549648
16	2020.01.01	AAPL	5.564782
17	2020.01.01	AAPL	3.42925
18	2020.01.01	MSFT	6.195387
19	2020.01.01	AAPL	8.137941
20	2020.01.01	MSFT	5.73365
21	2020.01.01	AAPL	7.278849
22	2020.01.01	AAPL	1.252762
…	…	…	…
99999	2020.01.02	MSFT	7.364259

100,000 rows × 3 columns

Now, let’s explore how to add data into an existing database table by using the db.create function.

new_data = kx.Table(data={ 
 'sym': kx.random.random(3333, ['AAPL', 'GOOG', 'MSFT']), 
 'price': kx.random.random(3333, 10.0)  
 }) 

db.create(new_data, 'db_table', date(2020, 1, 3))

Writing Database Partition 2020-01-03 to table db_table

Next, you can rename your table to ‘finance_data’ and the column ‘sym’ to a more descriptive name.

db.rename_table('db_table', 'finance_data')
db.rename_column('finance_data', 'sym', 'ticker')

2024.01.22 17:53:27 renaming :/tmp/my_database/2020.01.01/db_table to :/tmp/my_database/2020.01.01/finance_data
2024.01.22 17:53:27 renaming :/tmp/my_database/2020.01.02/db_table to :/tmp/my_database/2020.01.02/finance_data
2024.01.22 17:53:27 renaming :/tmp/my_database/2020.01.03/db_table to :/tmp/my_database/2020.01.03/finance_data
2024.01.22 17:53:27 renaming sym to ticker in `:/tmp/my_database/2020.01.01/finance_data
2024.01.22 17:53:27 renaming sym to ticker in `:/tmp/my_database/2020.01.02/finance_data
2024.01.22 17:53:27 renaming sym to ticker in `:/tmp/my_database/2020.01.03/finance_data

With more significant editing tasks, you may want to perform the operation on a practice column and protect your original dataset. To demonstrate, create a copy of the ‘price’ column and then perform a function to double the content values. Finally, modify the data type of your new column to be less memory intensive.

db.copy_column('finance_data', 'price', 'price_copy') 
db.apply_function('finance_data', 'price_copy', lambda x:2*x) 
db.set_column_type('finance_data', 'price_copy', kx.RealAtom)

We can now look at the content of finance_data to see how these changes have modified the database.

db.finance_data

	date	ticker	price	price_copy

0	2020.01.01	MSFT	4.802218	9.604437e
1	2020.01.01	GOOG	0.5058034	1.011607e
2	2020.01.01	AAPL	7.055037	14.11007e
3	2020.01.01	MSFT	1.084145	2.16829e
4	2020.01.01	GOOG	0.8396528	1.679306e
5	2020.01.01	MSFT	7.716519	15.43304e
6	2020.01.01	MSFT	2.539819	5.079639e
7	2020.01.01	GOOG	9.614568	19.22914e
8	2020.01.01	AAPL	3.907669	7.815337e
9	2020.01.01	MSFT	9.47742	18.95484e
10	2020.01.01	GOOG	2.703125	5.40625e
11	2020.01.01	MSFT	2.438608	4.877215e
12	2020.01.01	GOOG	2.624367	5.248734e
13	2020.01.01	AAPL	2.20948	4.41896e
14	2020.01.01	AAPL	7.839242	15.67848e
15	2020.01.01	AAPL	0.8549648	1.70993e
16	2020.01.01	AAPL	5.564782	11.12956e
17	2020.01.01	AAPL	3.42925	6.8585e
18	2020.01.01	MSFT	6.195387	12.39077e
19	2020.01.01	AAPL	8.137941	16.27588e
20	2020.01.01	MSFT	5.73365	11.4673e
21	2020.01.01	AAPL	7.278849	14.5577e
22	2020.01.01	AAPL	1.252762	2.505525e
103332	2020.01.03	MSFT	4.152568	8.305137e
…	…	…	…	…

103,333 rows × 3 columns

Let’s also check the number of data rows, per partition within the database.

db.partition_count()

date	finance_data

2020.01.01	49859
2020.01.02	50141
2020.01.03	3333

Finally, using the create function that you used earlier, and the fill_database function, you can add the missing partitions to a newly onboarded table.

orders = kx.Table(data = { 
    'id': kx.random.random(100, kx.GUIDAtom.null), 
    'type': kx.random.random(100, ['Buy', 'Sell'])}) 

db.create(orders, 'orders', date(2020, 1, 3)) 
db.fill_database()

Writing Database Partition 2020-01-03 to table orders
Successfully filled missing tables to partition: :/tmp/my_database/2020.01.02
Successfully filled missing tables to partition: :/tmp/my_database/2020.01.01

You can see that the orders table has now been onboarded and is accessible.

print(db.tables)
db.orders

[‘finance_data’, ‘orders’]

	date	id	type

0	2020.01.03	29f1e3e9-dba5-12da-acf8-51154deb3dc9	Buy
1	2020.01.03	5cda0be6-60a5-07fc-f929-a3d03c6106d1	Buy
2	2020.01.03	47d62791-e47a-e8ee-1afa-09cfe0657581	Buy
3	2020.01.03	b15ac75d-4bd9-a6af-5882-fc9f2e0771c1	Sell
4	2020.01.03	2df69841-083e-77d4-1dd6-cf3919e8c19d	Sell
5	2020.01.03	9e98c564-1f25-ca7d-9b3d-b270a2acf79f	Sell
6	2020.01.03	8b4fe1cb-da0c-8589-f9b5-8faa57f73ab2	Sell
7	2020.01.03	2a47001a-0ffd-8a0e-3b42-45a5f11980a1	Buy
8	2020.01.03	8113ba9a-7065-6471-5b58-4c194c9a2360	Sell
9	2020.01.03	0418a02c-a1c3-db83-e82e-ff5d3b4ef847	Buy
10	2020.01.03	52a1be39-1ceb-87b7-0e4d-f156f32860a0	Buy
11	2020.01.03	1a536bf8-5f1f-5a6e-83de-0c6f8d632c8a	Buy
12	2020.01.03	2de7b0aa-67a7-2150-aeb9-d2cd5ac855b8	Buy
13	2020.01.03	1f6bf7de-b72a-4acb-9c80-207f7be4cd96	Buy
14	2020.01.03	c22bcb71-61f7-5431-54e0-a5bb315680e2	Sell
15	2020.01.03	ed925950-bccf-4684-1a7b-09ce2ecd4edc	Sell
16	2020.01.03	2fd437b0-0a61-d5b2-6749-c48cd7344561	Sell
17	2020.01.03	5961b504-53d0-1cf5-ace6-234bf7569f0b	Buy
18	2020.01.03	8bf69a5e-5225-b873-5759-d52b70bfa40b	Buy
19	2020.01.03	898bfe1e-6445-bf9c-932d-d2cbff828d11	Buy
20	2020.01.03	34b6cff5-3a4c-cd4b-2be0-bef15fe4b031	Sell
21	2020.01.03	41958ee9-09a9-45c7-ce71-0c0fcd12555c	Sell
22	2020.01.03	520d5ab7-ca9a-ff07-c89e-051f0379908d	Buy
…	…	…	…
99	2020.01.03	1672522c-87d2-ad8f-85d3-f55662b189cb	Buy

100 rows × 3 columns

Conclusions

Creating and managing databases at scale is a tough challenge and one where mistakes in operations can have a meaningful impact on business performance and use-case efficiency. While the Database Management feature does not solve all problems in this area, it is intended to provide you with an initial grounding. Allowing you, in Python, to create and perform basic operations on kdb+ databases to make this task easier.

As mentioned throughout the blog, this is presently a beta feature and subject to updates prior to a general availability release later this year. If you run into difficulties with this feature, or have suggestions for additional functionality, please open an issue on our public GitHub or alternatively open a pull request.

How to Build and Manage Databases Using PyKX

Introduction

Walkthrough

Conclusions

Demo kdb, the fastest time-series data analytics engine in the cloud