**How do you provide substring searching on encrypted values stored
in a database?**

One solution that maintains security is to store, alongside every record in the database, a Bloom filter that can be used to test substring membership.

**UPDATE:** It seems likely that the encrypted values could be guessed
from the Bloom filter using some statistical analysis. When I have time
I’ll put together an exploit to prove or disprove this.

## What is a Bloom filter?

This explanation from Wikipedia sums it up:

> A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate. In other words, a query returns either “possibly in set” or “definitely not in set”. Elements can be added to the set, but not removed (though this can be addressed with a “counting” filter). The more elements that are added to the set, the larger the probability of false positives.
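
To make the definition concrete, here is a toy Bloom filter sketch in Python. It is not the example code; deriving the bit positions from slices of a single SHA-256 digest is an assumption made purely for illustration.

```python
import hashlib

class TinyBloom:
    """A toy Bloom filter: m bits, k bit positions per item."""

    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.bits = 0  # the filter, stored as an integer bit string

    def _indices(self, item):
        # Illustrative assumption: carve k 4-byte chunks out of one
        # SHA-256 digest and reduce each modulo m to get a bit position.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.k):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits |= 1 << idx

    def query(self, item):
        # True means "possibly in set"; False means "definitely not in set"
        return all(self.bits & (1 << idx) for idx in self._indices(item))
```

Adding an item sets its k bits; querying checks that all k bits are set, which is why false negatives are impossible but false positives are not.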

## Getting the example code

Grab the example code from GitHub, as parts of it will be referenced below.

## Architecting the Bloom filter

The Bloom filter is stored as a set of bits. Some number of hash functions determine which bits are *set* to 1. The bit length, the number of hash functions and the expected number of inserted items together determine the false positive probability. Wikipedia provides equations for determining what values to use. As part of the example code these equations are provided in `bloomEquations.py` to experiment with.

```
#!/usr/bin/env python3
"""
k represents the number of hash functions
m is the length of the bit string
n is the expected number of elements to be inserted
p is the false positive probability
"""
from math import log as ln, exp as e

def optK(m, n):
    return (m / n) * ln(2)

def optM(n, p):
    return -((n * ln(p)) / (ln(2) ** 2))

def optP(m, n):
    return e(-(m / n) * ln(2) ** 2)
```

This example stores a first and last name as part of a user record. The goal is to provide substring search on either the user’s first or last name. To do that, not only the name itself but every possible substring of the name must be inserted into the Bloom filter. Some research suggests the average length of a person’s first or last name is 10 characters. For the name “Joe” all the items to be inserted in the filter would be:

- j
- jo
- joe
- o
- oe
- e

The number of entries in this case is 6, or 3 + 2 + 1. The pattern turns out to be a summation series based on the length of the name, which can be calculated by (n^2 + n) / 2. So the number of substrings for a name of 10 characters is 55. This is the expected number of entries, or `n`. The rest of the parameters can be calculated using the provided equations in `python3`.
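
The substring expansion and the count formula can be checked with a short sketch. `list_substrings` here is an illustrative Python stand-in for the example code’s C++ substring helper.

```python
def list_substrings(s):
    """Return every contiguous substring of s."""
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]

subs = list_substrings("joe")
print(subs)       # ['j', 'jo', 'joe', 'o', 'oe', 'e']
print(len(subs))  # 3 + 2 + 1 = 6

# The count follows the summation series (n^2 + n) / 2
n = 10
print((n * n + n) // 2)  # 55
```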

```
$> python3
>>> from bloomEquations import *
```

To see how many bits are needed to have a 1% false positive rate, use the `optM` function.

```
>>> round(optM(55, 0.01))
527
```

A 10% false positive rate.

```
>>> round(optM(55, 0.10))
264
```

A 50% false positive rate.

```
>>> round(optM(55, 0.50))
79
```

Using 256 bits yields around a 10% false positive rate and adds only 32 bytes to an encrypted field that is probably 64 bytes, depending on what strength of encryption is used. To figure out the exact false positive rate that a 256 bit Bloom filter will have, use the `optP` function.

```
>>> round(optP(256, 55), 4)
0.1069
```

The result is about 11%. Since the application is substring searching, some amount of false positives is acceptable; the user will simply ignore them in the search results.

The last piece is deciding how many hash functions to use, which can be calculated with the `optK` function.

```
>>> round(optK(256, 55), 2)
3.23
```

This Bloom filter should have 3 hash functions to minimize the false positive rate.

To sum up the Bloom filter design:

- k, the number of hash functions, will be 3
- m, the size of the Bloom filter, will be 256
- n, the expected number of insertions, is 55
- p, the false positive rate, will be about 11%

## Database schema

PostgreSQL is used for this example because of its bit string datatype and ability to perform binary operations within SQL.

To demonstrate how the searching scheme works, presume there are user records defined with the following fields:

- name varchar(10)
- bloom bit(4)

This table contains the following data:

```
name | bloom
-----+-------
Foo | 0100
Bar | 0010
Baz | 0011
```

The `name` field contains the stored string; this column could be encrypted. The `bloom` field contains the resulting bit string from putting `name` through the Bloom filter.

To perform the substring search, the query word is put through the Bloom filter and the resulting bit string is compared to each `bloom` record in the table using a bitwise AND operation. The goal is to see if *at least* the same bits from the query word are set in the stored bit string for each record.

If the query word is `ba` and its Bloom filter bit string is `0010`, the SQL would be:

```
SELECT * FROM users WHERE bloom & b'0010' = b'0010';
```

Which results in:

```
name | bloom
-----+-------
Bar | 0010
Baz | 0011
```

This bit string reduction operation allows substring searching without knowing what the original stored string is. Any database that supports bitwise AND operations will work, though finding an efficient way to store the Bloom filter bit string may take additional effort.
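
The same containment test can be sketched outside SQL. This illustrative Python snippet mirrors the `bloom & query = query` check over the toy table above.

```python
# Bit strings from the example table, as integers
rows = {
    "Foo": 0b0100,
    "Bar": 0b0010,
    "Baz": 0b0011,
}

query = 0b0010  # Bloom filter bits for the query word "ba"

# A row matches when every bit set in the query is also set in the
# stored bit string, i.e. bloom & query == query
matches = [name for name, bloom in rows.items() if bloom & query == query]
print(matches)  # ['Bar', 'Baz']
```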

For the provided example code, make sure you have a database with the name of your user account and set up the table as follows:

```
CREATE TABLE users(
    first VARCHAR(50),
    first_bloom BIT(256),
    last VARCHAR(50),
    last_bloom BIT(256)
);
```

## Walking through the code

The code compiles into three programs: `insert`, `searchBloom` and `searchName`. `insert.cpp` takes two arguments per line, a first and last name, generates their Bloom filters and inserts them into the database. `searchBloom.cpp` takes one argument, a substring, generates its Bloom filter and queries against the Bloom filters stored for both the first and last name. `searchName.cpp` takes one argument, a substring, and performs a LIKE search against the stored first and last names.

The `hash` folder contains the sources for the three hash functions: Murmur3, SpookyV2 and FNV. This code was taken directly from their upstream sources.

The interesting work happens in `util.cpp`, which contains three functions: a helper that calculates a summation series, a helper that generates all possible substrings, and `generateBloom`.

The `generateBloom` function takes in a string, transforms it to lower case and inserts every substring into a bitset that represents the Bloom filter.

```
#define BLOOM_SIZE 256
std::bitset<BLOOM_SIZE>
generateBloom(const std::string& str)
{
    std::bitset<BLOOM_SIZE> bloom;
    // Arbitrarily chosen seed value
    const int seed {42};
    // Transform the string to lower case
    std::string lowerStr;
    lowerStr.resize(str.length());
    std::transform(str.begin(), str.end(), lowerStr.begin(), ::tolower);
    for(const auto& subStr : listSubStrings(lowerStr)) {
        unsigned int murmurHash;
        MurmurHash3_x86_32(subStr.c_str(), subStr.length(), seed, &murmurHash);
        bloom.set(murmurHash % BLOOM_SIZE);
        unsigned int spookyHash = SpookyHash::Hash32(subStr.c_str(), subStr.length(), seed);
        bloom.set(spookyHash % BLOOM_SIZE);
        // fnv64a wants a mutable buffer, so copy the substring into one
        char* buf = (char*) calloc(subStr.length(), sizeof(char));
        std::strncpy(buf, subStr.c_str(), subStr.length());
        unsigned long long fnvHash = fnv64a(reinterpret_cast<unsigned char*>(buf), (uint64_t)subStr.length());
        bloom.set(fnvHash % BLOOM_SIZE);
        // Memory from calloc must be released with free, not delete
        free(buf);
    }
    return bloom;
}
```

All substrings are always inserted when generating a Bloom filter, even for a search query. This yields more accurate results: if the stored string contains the whole query string, it must also contain every substring of the query, so requiring all of those bits to match filters out more non-matches without excluding any true matches.

## Performance metrics

These metrics were calculated using the provided 100,000 name dataset on a Core i7-3517U with 8 GB of RAM running Ubuntu 14.04. All results from `time` are averaged over 5 runs.

Results of searching for ‘matt’ using a traditional LIKE query and the Bloom filter:

```
$> time ./searchName matt | wc -l
174
real 0m0.181s
$> time ./searchBloom matt | wc -l
174
real 0m0.141s
```

Both methods yield the same result set and the Bloom filter method is slightly faster.

Results of searching for ‘ly’:

```
$> time ./searchName ly | wc -l
5745
real 0m0.217s
$> time ./searchBloom ly | wc -l
6069
real 0m0.183s
```

The false positive property of the Bloom filter is shown with this example. The Bloom filter search yields 324 records that do not actually contain the string ‘ly’. The search is still consistently faster than a LIKE query.

## Additional Considerations

### Sorting & Pagination

Bloom filters allow a set reduction to be performed by the database, but they do not provide any means of sorting the result set (eg alphabetical order). Sorting has to be punted to the client, or to whatever step has access to the unencrypted data. To support pagination, there needs to be an ordering guarantee between page fetches. This can be achieved by adding an `ORDER BY` clause on any of the fields. Just be aware that ordering on an encrypted field orders by ciphertext, not plaintext, so fetching page 1 does not guarantee that its records sort earliest in plaintext terms (eg for alphabetical sorting, page 1 will not necessarily contain the records earliest in the alphabet).

### Platform independent hash functions

While hash functions are guaranteed to be deterministic (ie provide the same output given the same input), they are *not* guaranteed to be deterministic across platforms. For example, Google’s FarmHash functions explicitly state that they are platform dependent, so using them across, say, an Intel Core i5 and an ARM Cortex A8 will result in different outputs given the same inputs. For a distributed system where the hardware may be mixed, platform independent functions have to be used to reliably generate equivalent Bloom filters. Cryptographic hash functions guarantee platform independence, but they are orders of magnitude slower than the ones chosen for this example. It is also a good idea to verify that the chosen hash functions produce identical output across compilers, since some rely on assembly level tricks or specific SSE instructions for distribution and speed.
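
As a sketch of the portable route, here is one way (not what the example code does) to derive all three bit positions from a single SHA-256 digest. The double-hashing construction `h1 + i*h2`, due to Kirsch and Mitzenmacher, is an assumption chosen for illustration; `hashlib` produces identical digests on every platform.

```python
import hashlib

M, K = 256, 3  # the bit length and hash count chosen earlier

def bloom_indices(item: str):
    """Derive K bit positions for item from one SHA-256 digest.

    SHA-256 is defined byte-for-byte identically on every platform,
    so these indices match across architectures and compilers.
    """
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    # Force h2 odd so the stride is coprime with M (a power of two)
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % M for i in range(K)]
```

The trade-off noted above still applies: computing a SHA-256 digest per substring is considerably slower than Murmur3, SpookyV2 or FNV.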