Matthew Borger

Computer Engineer

matthew@borgernet.com

Substring Searching on Encrypted Data

How do you provide substring searching on encrypted values stored in a database?

One solution that maintains security is to store a Bloom filter with every record in the database which can be used to test substring set membership.

UPDATE: It seems likely that the encrypted values could be guessed from the Bloom filter using some statistical analysis. When I have time I’ll put together an exploit to prove or disprove this.

What is a Bloom filter?

This explanation from Wikipedia sums it up.

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate. In other words, a query returns either “possibly in set” or “definitely not in set”. Elements can be added to the set, but not removed (though this can be addressed with a “counting” filter). The more elements that are added to the set, the larger the probability of false positives.

Getting the example code

Grab the example code from GitHub, as parts of it are referenced below.

Architecting the Bloom filter

The Bloom filter is stored as a set of bits. Some number of hash functions determine which bits are set to 1. The bit length, the number of hash functions and the expected number of inserted items together determine the false positive probability. Wikipedia provides equations for choosing these values. As part of the example code, these equations are provided in bloomEquations.py to experiment with.

#!/usr/bin/env python3

"""
k represents the number of hash functions
m is the length of the bit string
n is the expected number of elements to be inserted
p is the false positive probability
"""

from math import ceil, exp, log as ln

def optK(m, n):
    # Optimal number of hash functions: k = (m / n) * ln(2),
    # rounded to two decimal places for readability
    return round((m / n) * ln(2), 2)

def optM(n, p):
    # Required bit length: m = -(n * ln(p)) / (ln(2) ^ 2), rounded up to a whole bit
    return ceil(-(n * ln(p)) / (ln(2) ** 2))

def optP(m, n):
    # False positive probability for m bits and n elements, assuming the
    # optimal number of hash functions: p = e^(-(m / n) * (ln(2) ^ 2))
    return round(exp(-(m / n) * ln(2) ** 2), 3)

This example stores a first and last name as part of a user record. The goal is to provide substring search on either the user's first or last name. To do that, not only does the name itself need to be inserted into the Bloom filter, but every possible substring of the name needs to be inserted as well. Some research suggests that the average length of a person's first or last name is about 10 characters. For the name “Joe” all the items to be inserted in the filter would be:

  1. j
  2. jo
  3. joe
  4. o
  5. oe
  6. e
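
A quick Python sketch (separate from the example code, which does the equivalent work in listSubStrings in util.cpp) can enumerate these substrings:

def list_substrings(word):
    # Every contiguous substring of word, e.g. "joe" -> j, jo, joe, o, oe, e
    return [word[i:j] for i in range(len(word)) for j in range(i + 1, len(word) + 1)]

print(list_substrings("joe"))       # ['j', 'jo', 'joe', 'o', 'oe', 'e']
print(len(list_substrings("joe")))  # 6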

The number of entries in this case is 6, or 3+2+1. The pattern turns out to be a summation series based on the length of the name, which can be calculated by (n^2 + n) / 2. So the number of substrings for a name of 10 characters is 55. This is the expected number of entries, or n. The rest of the parameters can be calculated using the provided equations in python3.

$> python3
>>> from bloomEquations import *

To see how many bits are needed to have a 1% false positive rate, use the optM function.

>>> optM(55, 0.01)
528

A 10% false positive rate.

>>> optM(55, 0.10)
264

A 50% false positive rate.

>>> optM(55, 0.50)
80

Using 256 bits yields around a 10% false positive rate and only adds 32 bytes to an encrypted field that is probably 64 bytes, depending on the strength of encryption used. To figure out the exact false positive rate that a 256-bit Bloom filter will have, use the optP function.

>>> optP(256, 55)
0.107

The result is about 11%. Since the application is substring searching, it is acceptable to have some amount of false positives, which the user will simply ignore in the search results.

The last piece is deciding how many hash functions to use. This can be calculated by using the optK function.

>>> optK(256, 55)
3.23

Rounded to a whole number, this Bloom filter should have 3 hash functions to minimize the false positive rate.

To sum up the Bloom filter design:

  • k, the number of hash functions, will be 3
  • m, the size of the Bloom filter, will be 256
  • n, the expected number of insertions, is 55
  • p, the false positive rate, will be roughly 11%
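
As a quick cross-check on these choices, the standard approximation p = (1 - e^(-k*n/m))^k can be evaluated for the final integer parameters. This is just a sketch, separate from bloomEquations.py:

from math import exp

# Cross-check the design with the textbook false positive approximation
k, m, n = 3, 256, 55
p = (1 - exp(-k * n / m)) ** k
print(round(p, 3))  # 0.107, i.e. roughly 11%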

Database schema

PostgreSQL is used for this example because of its bit string datatype and its ability to perform bitwise operations within SQL.

To demonstrate how the searching scheme works, presume there are user records defined with the following fields:

  • name varchar(10)
  • bloom bit(4)

This table contains the following data:

name | bloom 
-----+-------
Foo  | 0100
Bar  | 0010
Baz  | 0011

The name field contains the stored string; in a real application this column would be encrypted. The bloom field contains the bit string produced by running name through the Bloom filter.

To perform the substring search, the query word is put through the Bloom filter and the resulting bit string is compared to each bloom value in the table using a bitwise AND operation. The goal is to check that every bit set by the query word is also set in the stored bit string for each record.

If the query word is ba and the Bloom filter bit string is 0010, the SQL would be:

SELECT * FROM users WHERE bloom & b'0010' = b'0010';

Which results in:

name | bloom 
-----+-------
Bar  | 0010
Baz  | 0011

This bit string reduction operation allows substring searching without knowing what the original stored string is. Any database that supports bitwise AND operations will work, but there may be additional work in finding an efficient way to store the Bloom filter bit string.

For the provided example code, make sure you have a database named after your user account and set up the table as follows:

CREATE TABLE users(
	first       VARCHAR(50),
	first_bloom BIT(256),
	last        VARCHAR(50),
	last_bloom  BIT(256)
	);
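
The example repository talks to this table from C++, but the queries themselves are simple. Here is a rough sketch of the same interaction in Python, assuming the psycopg2 driver, a database named after your user account, and placeholder bit strings where the real code would supply the Bloom filter output:

import psycopg2

conn = psycopg2.connect(dbname="your_username")  # placeholder database name
cur = conn.cursor()

# first_bits and last_bits stand in for the 256-character '0'/'1' strings
# produced by running each name through the Bloom filter.
first_bits = "0" * 256
last_bits = "0" * 256
cur.execute(
    "INSERT INTO users VALUES (%s, %s::bit(256), %s, %s::bit(256))",
    ("Joe", first_bits, "Smith", last_bits))
conn.commit()

# query_bits stands in for the bit string of the search term. A record matches
# when every bit set in query_bits is also set in the stored Bloom filter.
query_bits = "0" * 256
cur.execute(
    "SELECT first, last FROM users"
    " WHERE first_bloom & %s::bit(256) = %s::bit(256)"
    " OR last_bloom & %s::bit(256) = %s::bit(256)",
    (query_bits, query_bits, query_bits, query_bits))
print(cur.fetchall())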

Walking through the code

The code compiles into three programs, insert, searchBloom and searchName. insert.cpp takes two arguments per line, a first and last name, generates their Bloom filters and inserts them into the database. searchBloom.cpp takes one argument, a substring, generates its Bloom filter and queries against the Bloom filters stored for both the first and last name. searchName.cpp takes one argument, a substring, and performs a LIKE search against the stored first and last names.

The hash folder contains the sources for the three hash functions, Murmur3, SpookyV2 and FNV. This code was taken directly from their upstream sources.

The interesting work happens in util.cpp, which contains three functions: a helper for calculating a summation series, a helper for generating all possible substrings, and the generateBloom function described below.

The generateBloom function takes in a string, transforms it to lower case and inserts all substrings into a bitset which represents the Bloom filter.

#define BLOOM_SIZE 256

std::bitset<BLOOM_SIZE>
generateBloom(const std::string str)
{
	std::bitset<BLOOM_SIZE> bloom;

	// Arbitrarily chosen seed value
	const int seed {42};

	// Transform the string to lower case
	std::string lowerStr;
	lowerStr.resize(str.length());
	std::transform(str.begin(), str.end(), lowerStr.begin(), ::tolower);

	for(const auto& subStr : listSubStrings(lowerStr)) {
		unsigned int murmurHash;
		MurmurHash3_x86_32(subStr.c_str(), subStr.length(), seed, &murmurHash);
		bloom.set(murmurHash % BLOOM_SIZE);

		unsigned int spookyHash = SpookyHash::Hash32(subStr.c_str(), subStr.length(), seed);
		bloom.set(spookyHash % BLOOM_SIZE);

		// Copy the substring into a temporary buffer for the FNV hash
		char* buf = (char*) calloc(subStr.length(), sizeof(char));
		std::strncpy(buf, subStr.c_str(), subStr.length());
		unsigned long long fnvHash = fnv64a(reinterpret_cast<unsigned char*>(buf), (uint64_t)subStr.length());
		bloom.set(fnvHash % BLOOM_SIZE);
		free(buf); // buf came from calloc, so it must be released with free, not delete
	}

	return bloom;
}

All substrings are always inserted when generating a Bloom filter, even when the filter is being generated for a search term. This yields more accurate results: if a stored name contains the whole search term, it must also contain every substring of the search term, so requiring all of those bits to be set filters out more non-matches without introducing false negatives.
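
To make the whole scheme concrete, here is a small self-contained Python sketch of the same idea. It is not the repository's implementation: salted SHA-256 stands in for Murmur3, SpookyV2 and FNV (trading speed for brevity), and a plain integer holds the 256-bit filter.

import hashlib

BLOOM_SIZE = 256
SEEDS = (b"h1", b"h2", b"h3")  # three "hash functions" derived from salted SHA-256

def substrings(word):
    word = word.lower()
    return [word[i:j] for i in range(len(word)) for j in range(i + 1, len(word) + 1)]

def bit_index(seed, text):
    digest = hashlib.sha256(seed + text.encode()).digest()
    return int.from_bytes(digest, "big") % BLOOM_SIZE

def generate_bloom(word):
    bloom = 0
    for sub in substrings(word):
        for seed in SEEDS:
            bloom |= 1 << bit_index(seed, sub)
    return bloom

stored = generate_bloom("Matthew")

# A record is a candidate match when every bit set in the query's filter
# is also set in the stored filter.
query = generate_bloom("att")
print(stored & query == query)  # True: "att" is a substring of "matthew"

query = generate_bloom("xyz")
print(stored & query == query)  # almost certainly False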

Performance metrics

These metrics were calculated using the provided 100,000 name dataset on a Core i7-3517U with 8 GB of RAM running Ubuntu 14.04. All results from time are averaged over 5 runs.

Results of searching for ‘matt’ using a traditional LIKE query and the Bloom filter:

$> time ./searchName matt | wc -l
174

real	0m0.181s

$> time ./searchBloom matt | wc -l
174

real	0m0.141s

Both methods yield the same result set and the Bloom filter method is slightly faster.

Results of searching for ‘ly’:

$> time ./searchName ly | wc -l
5745

real	0m0.217s

$> time ./searchBloom ly | wc -l
6069

real	0m0.183s

The false positive property of the Bloom filter is shown with this example. The Bloom filter search yields 324 records that do not actually contain the string ‘ly’. The search is still consistently faster than a LIKE query.

Additional Considerations

Sorting & Pagination

Bloom filters allow a set reduction to be performed by the database, but they do not provide any means of sorting the result set (e.g. alphabetical order), since the database never sees the plaintext. Sorting has to be punted to the client or whatever step has access to the unencrypted data. In order to support pagination, there needs to be an ordering guarantee between page fetches. This can be achieved by adding an ORDER BY clause on any of the fields. Just be aware that the ordering is only stable, not meaningful: fetching page 1 does not guarantee the result records sort earliest (e.g. for alphabetical sorting, page 1 will not necessarily contain the records earliest in the alphabet).
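
A stable page fetch could look like the following sketch, again assuming psycopg2 and the users table from earlier. The ORDER BY column only has to be deterministic, not meaningful:

import psycopg2

conn = psycopg2.connect(dbname="your_username")  # placeholder database name
cur = conn.cursor()

query_bits = "0" * 256  # bit string for the search term, as before
page, page_size = 2, 50

# Ordering by the stored (possibly encrypted) name column is stable across
# page fetches but says nothing about the alphabetical order of the plaintext.
cur.execute(
    "SELECT first, last FROM users"
    " WHERE first_bloom & %s::bit(256) = %s::bit(256)"
    " ORDER BY first"
    " LIMIT %s OFFSET %s",
    (query_bits, query_bits, page_size, (page - 1) * page_size))
rows = cur.fetchall()  # decrypt and sort these client-side before display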

Platform independent hash functions

While hash functions are deterministic (the same input always produces the same output), they are not guaranteed to produce the same output across platforms. For example, Google's FarmHash functions explicitly state that they are platform dependent, so using them across, say, an Intel Core i5 and an ARM Cortex A8 will result in different outputs given the same inputs. For a distributed system where the hardware may be mixed, platform independent functions have to be used to reliably generate equivalent Bloom filters. Cryptographic hash functions guarantee platform independence but they are orders of magnitude slower than the ones chosen for this example. It is also a good idea to verify that the chosen hash functions produce identical output across compilers, since some rely on assembly level tricks or specific SSE instructions for distribution and speed.