Secondary in-memory index representations in Python


I am searching for an efficient way to build a secondary in-memory index in Python using a high-level optimised numerical package such as numpy or arrow. I am excluding pandas for performance reasons.

Definition

"A secondary index contains an entry for each existing value of the attribute to be indexed. This entry can be seen as a key/value pair with the attribute value as key and as value a list of pointers to all records in the base table that have this value." - JV. D'Silva et al. (2017)

Let's take a simple example, we can scale this later on to produce some benchmarks:

import numpy as np
import pyarrow as pa

pk = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='uint32')
# NumPy converts None to NaN when the dtype is a float type
val = np.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1], dtype='float32')

Interestingly, the pyarrow.Array.dictionary_encode method can transform the value array (once wrapped with pa.array) into a dictionary-encoded representation that is close to a secondary index.

pa.array(val).dictionary_encode()
Out[55]: 
<pyarrow.lib.DictionaryArray object at 0x7ff430d8b4d0>
-- dictionary:
  [
    15.5,
    3.75,
    142.88,
    nan,
    7.2,
    2.1
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    3,
    3,
    3,
    4,
    5
  ]

I have opened an issue here.

So, the question is how fast you can build a secondary index in memory using Python data structures that hold values and indices efficiently. But that is only half the story: the index is useful only if it also serves filtering queries (point, range) and transformations, i.e. reconstruction of a row, a column, or an association (a.k.a. hyperedge in TRIADB). And even this quick description does not cover how easy it will be to update this kind of index.

For many reasons, I have started investigating a possible PyArrow open-source solution. A sorted dictionary-encoded representation should generally meet the requirements of the problem, combining a smaller memory footprint with fast, flexible, zero-copy I/O processing.
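
For illustration, here is a minimal sketch of what a sorted dictionary encoding could look like in plain NumPy, using np.unique on the example array above (this assumes NumPy >= 1.21, which collapses NaNs into a single dictionary entry; older versions repeat them):

import numpy as np

val = np.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1], dtype='float32')

# np.unique returns the distinct values in sorted order (the dictionary)
# and, for each original position, the index of its value in that
# dictionary -- i.e. a sorted dictionary encoding in two arrays.
dictionary, codes = np.unique(val, return_inverse=True)

print(dictionary)  # [2.1 3.75 7.2 15.5 142.88 nan]  (NaN sorts last)
print(codes)       # [3 1 4 4 5 5 5 2 0]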

Answer

You're on the right track with using PyArrow for dictionary encoding to build a secondary index in memory. The combination of numpy and PyArrow can offer excellent performance for both memory footprint and speed, especially for use cases like yours where you need to filter, transform, and update the index efficiently.

Let's break down the solution and approach to building a secondary index in memory that serves well for both filtering queries and transformations.

Key Ideas

  • Dictionary Encoding: This is a key component in your secondary index. By encoding values as dictionary indices, you can create compact representations that facilitate faster access and more efficient memory usage.

  • In-Memory Index Construction: You can build an in-memory index using numpy arrays for fast numerical operations and pyarrow for efficient memory management, storage, and I/O.

  • Efficient Querying: By using the dictionary-encoded values and indices, you can efficiently query data via indexed lookups and transformations.

General Approach

  1. Dictionary-Encoding with PyArrow:

    • pyarrow.Array.dictionary_encode helps reduce memory by replacing repeated values with integer indices. This will be particularly useful when you have a large dataset with many repeated values.
  2. Secondary Index Construction:

    • A secondary index is essentially a map from values to pointers (or indices) where those values appear. In this case, a dictionary-encoded array will have a mapping from the original values to compact indices, which can then be used to identify where those values occur.
  3. Handling Missing/Null Values:

    • You must ensure that missing or None values are handled correctly in the indexing structure. PyArrow's dictionary-encoded arrays support nulls natively, which can make this easier; see the sketch after this list.
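
A quick way to see the null-handling difference: by default pa.array keeps NaN as an ordinary float value, while passing from_pandas=True converts NaN to a proper Arrow null, which dictionary encoding then represents as a null index rather than a dictionary entry.

import numpy as np
import pyarrow as pa

x = np.array([1.5, np.nan, 1.5], dtype='float32')

# Default: NaN is just another float value and gets its own dictionary entry
print(pa.array(x).dictionary_encode())

# With from_pandas=True: NaN becomes an Arrow null, so the dictionary holds
# only real values and the NaN position becomes a null index
print(pa.array(x, from_pandas=True).dictionary_encode())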

1. Constructing a Secondary Index

Let's start with your example and work through how to build a dictionary-encoded secondary index.

import numpy as np
import pyarrow as pa

# Your example data
pk = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='uint32')
val = np.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1], dtype='float32')

# Convert the value array to a PyArrow array.
# Note: NaN stays an ordinary float32 value here; pass from_pandas=True
# if you want NaN converted to an Arrow null instead.
val_pa = pa.array(val, type=pa.float32())

# Dictionary encoding the 'val' array
dict_encoded_val = val_pa.dictionary_encode()

print("Dictionary Encoded Array:")
print(dict_encoded_val)

Explanation:

  • pa.array(val, type=pa.float32()) converts the NumPy array val into a PyArrow array.
  • dictionary_encode() creates a dictionary-encoded version of the array, reducing the memory footprint.

Output:

<pyarrow.lib.DictionaryArray object at 0x7ff430d8b4d0>
-- dictionary:
  [
    15.5,
    3.75,
    142.88,
    nan,
    7.2,
    2.1
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    3,
    3,
    3,
    4,
    5
  ]

2. Building the Secondary Index

Now that you have the dictionary-encoded array, you can start building a secondary index. The secondary index maps the dictionary values to the positions (indices) where they appear in the original array.

import math
from collections import defaultdict

# Extract the distinct values and the per-row dictionary codes
dict_values = dict_encoded_val.dictionary
indices = dict_encoded_val.indices.to_numpy()

secondary_index = defaultdict(list)

# For each row, look up its dictionary value and append the primary key.
# NaN is normalised to None: NaN != NaN in Python, so raw NaN keys would
# each end up in a separate bucket.
for pos, code in enumerate(indices):
    value = dict_values[code].as_py()
    if value is not None and math.isnan(value):
        value = None
    secondary_index[value].append(int(pk[pos]))

# Print the secondary index
for value, records in secondary_index.items():
    print(f"Value {value} found in records: {records}")

Explanation:

  • dict_values holds the distinct values (the dictionary), and indices holds, for each position of the original val array, the code of its value in that dictionary.
  • We use a defaultdict(list) to collect the primary keys for each distinct value, normalising NaN to None so that all missing values share a single key.
  • Note that as_py() widens float32 values to Python doubles, so e.g. 7.2 comes back as 7.199999809265137.

Example Output:

Value 15.5 found in records: [1]
Value 3.75 found in records: [2]
Value 142.8800048828125 found in records: [3, 4]
Value None found in records: [5, 6, 7]
Value 7.199999809265137 found in records: [8]
Value 2.0999999046325684 found in records: [9]

3. Querying the Secondary Index

With this secondary index, you can perform efficient lookups to find records for specific values. For example, to get all records where val is 142.88:

# Querying for a specific value in the secondary index.
# Cast the query constant through float32 first, because the index keys
# are float32 values widened to Python doubles.
target = 142.88
records = secondary_index.get(float(np.float32(target)), [])

print(f"Records with value {target}: {records}")

Output:

Records with value 142.88: [3, 4]

4. Efficient Range Queries

If you want to perform range queries, it becomes a little more involved, but you can use the dictionary encoding to identify the range of values you want and efficiently extract records. For instance, to find all records with values greater than 3.0:

# Range query: Find all records with value > 3.0
range_records = []
for value, records in secondary_index.items():
    if value is not None and value > 3.0:
        range_records.extend(records)

print(f"Records with value > 3.0: {range_records}")

5. Updating the Index

Updating the index when new data is added can be tricky. You may need to dictionary-encode the new val batch and merge the result into the secondary index, as sketched below. Alternatively, you could manipulate the dictionary encoding directly, but that becomes more complicated on large datasets.
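
A minimal sketch of such an incremental update, assuming new rows arrive as a new_pk/new_val pair (hypothetical names) and secondary_index is the defaultdict built above:

import math
import numpy as np
import pyarrow as pa

# Hypothetical new batch of rows
new_pk = np.array([10, 11], dtype='uint32')
new_val = np.array([142.88, 8.0], dtype='float32')

# Dictionary-encode only the new batch and merge it into the existing index
new_encoded = pa.array(new_val, type=pa.float32()).dictionary_encode()
new_dict = new_encoded.dictionary
for pos, code in enumerate(new_encoded.indices.to_numpy()):
    value = new_dict[code].as_py()
    if value is not None and math.isnan(value):
        value = None  # same NaN normalisation as in the initial build
    secondary_index[value].append(int(new_pk[pos]))

print(secondary_index[float(np.float32(142.88))])  # [3, 4, 10]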

Performance Considerations

  • Memory Efficiency: PyArrow's dictionary encoding saves memory when you have many repeated values, most markedly for wide types such as strings; for float32 values with the default int32 indices the footprint stays roughly the same, so measure on your own data (see the sketch below).
  • Query Speed: Indexing with dictionary-encoded arrays allows for fast lookups, since you are working with small integer codes rather than raw floating-point numbers or strings.
  • Flexibility: The Python-dict layer of the index is easy to extend with new records (see the update sketch above); removing records or re-encoding the dictionary itself requires more care.
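
To check the memory and speed claims on your own data, here is a minimal benchmark sketch (the array size and value pool are arbitrary assumptions; timings and byte counts will vary):

import timeit

import numpy as np
import pyarrow as pa

# Build a large float32 column drawn from a small pool of repeated values
rng = np.random.default_rng(0)
pool = np.array([15.5, 3.75, 142.88, 7.2, 2.1], dtype='float32')
big_val = rng.choice(pool, size=1_000_000)

arr = pa.array(big_val, type=pa.float32())
encoded = arr.dictionary_encode()

print("encode time:", timeit.timeit(arr.dictionary_encode, number=10) / 10, "s")
print("raw bytes:    ", arr.nbytes)       # ~4 MB of float32 values
print("encoded bytes:", encoded.nbytes)   # ~4 MB of int32 codes + tiny dictionary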

Conclusion

By using PyArrow and NumPy together, you can build a memory-efficient secondary index that supports fast lookups, range queries, and incremental updates. PyArrow's dictionary encoding is particularly useful here, as it compacts repeated values and speeds up queries by letting you work with small integer codes.