Running Set Operations on Large Data Sets

Running Set Operations on Large Data Sets Efficiently

Francois Dang Ngoc —

Published: 6/23/2020 | Updated: 3/28/2022

Set operations like unions and intersections on a collection of data points are useful for a variety of scenarios. We here at Simulmedia also rely on them heavily. However using them on a very large collection presents a significant challenge.

What are set operations?

Sets are implemented in most programming languages but are not natively available widely in popular data stores such as relational databases. Redis for large datasets has been a popular option. However because it is an in-memory datastore, it cannot manage a large number of sets at a reasonable cost. Other solutions such as Pilosa have recently been implemented to solve that need but they require a long learning curve.

On the other hand, Postgres which has been a popular relational database that can accomodate billions of rows does not provide support for sets by default.

In this blog post we explore how to use set operation efficiently in a Postgres. There are multiple ways to do it:

Using intarray
Using roaring bitmap (extension that needs to be built)

In order to illustrate these two different approaches, we will use the Instacart dataset that was open-sourced in 2017. It contains over 3 millions of anonymized orders placed by over 200,000 users on around 50,000 products.

The data preparation is described in this page: https://gist.github.com/darthbear/332318706ba7ac56e2a19bdf5f4daf1e#file-simulmedia-roaringbitmap-post-md

Using intarray

Intarray allows a compact storage of integer arrays in Postgres and to compute operations on them. It is supported in most Postgres distributions (AWS, Google Cloud and Microsoft Azure).

In Postgres, let's use the extension intarray:

CREATE extension intarray

And create the purchases table using intarray:

CREATE TABLE purchases_intarray (
	product_id INT NOT NULL PRIMARY KEY,
	user_ids INT[]
);

Now, let's create sets of users for each products:

INSERT INTO purchases_intarray
	SELECT
		product_id,
		SORT(ARRAY_AGG(DISTINCT user_id)::INT[])
FROM order_products
JOIN orders USING (order_id)
GROUP BY product_id;

You can check the sets:

SELECT
product_id,
		product_name,
		ICOUNT(user_ids) as num_users,
		user_ids
FROM purchases_intarray
JOIN products USING (product_id)
ORDER BY num_users DESC
LIMIT 10;

+--------------+------------------------+-------------+-------------------------
| product_id   | product_name           | num_users   | user_ids
|--------------+------------------------+-------------+-------------------------
| 24852        | Banana                 | 76125       | [2, 10, 16, 21, 27, 28,
| 13176        | Bag of Organic Bananas | 65655       | [1, 2, 7, 11, 12, 22, 27
| 21137        | Organic Strawberries   | 61129       | [3, 7, 16, 18, 21, 27, 2
| 21903        | Organic Baby Spinach   | 56766       | [3, 6, 8, 16, 22, 24, 28
| 47626        | Large Lemon            | 48614       | [10, 28, 39, 51, 53, 54,
| 26209        | Limes                  | 46658       | [7, 10, 11, 14, 18, 21,
| 16797        | Strawberries           | 44857       | [2, 3, 10, 17, 24, 27, 3
| 47209        | Organic Hass Avocado   | 44704       | [2, 27, 28, 38, 50, 59,
| 47766        | Organic Avocado        | 43954       | [2, 3, 8, 21, 28, 35, 37
| 39275        | Organic Blueberries    | 38720       | [7, 27, 32, 35, 42, 52,
+--------------+------------------------+-------------+-------------------------

We can see that the most popular products are bananas, strawberries, spinach and lemons. In our example, let's try to see how many people are buying bananas (organic and non-organic) and strawberries (organic and non-organic):

(24852 OR 13176) AND (21137 OR 16797)

Which translates as a set intersection query as follows:

SELECT
    (
	(SELECT user_ids FROM purchases_intarray WHERE product_id=24852) |
(SELECT user_ids FROM purchases_intarray WHERE product_id=13176)
) & (
(SELECT user_ids FROM purchases_intarray WHERE product_id=21137) |
(SELECT user_ids FROM purchases_intarray WHERE product_id=16797)
    ) AS user_ids;

+-------------------------------------------------------------------------------
| user_ids
|-------------------------------------------------------------------------------
| [2, 7, 10, 16, 21, 27, 28, 32, 34, 35, 37, 39, 43, 46, 50, 54, 56, 59, 60, 62,
+-------------------------------------------------------------------------------
Time: 0.052s

This query ran almost instantly. If we had to run it in a normal query with Postgres, we would have needed to use INTERSECT which would have taken much more time.

Using a table that contains all the products per user:

CREATE TABLE user_products AS
SELECT DISTINCT product_id, user_id
FROM order_products
JOIN orders USING (order_id);
ALTER TABLE user_products ADD PRIMARY KEY (product_id, user_id);

We can select all the user_ids that match the requirement as follows:

SELECT user_id FROM user_products WHERE product_id IN (24852, 13176)
INTERSECT
SELECT user_id FROM user_products WHERE product_id IN (21137, 16797)

+-----------+
| user_id   |
|-----------|
| 194150    |
| 167276    |
| 95503     |

Time: 0.330s

This takes 0.330 seconds which is about 6 times as slow. And the difference is getting bigger as the arrays grow larger.

Using Roaring Bitmap

Storing integers as intarray is very efficient when the sets are small. However for larger sets, it takes more space than storing the set as a bitmap array. For example, if you consider the set (0, 2, 4, 5, 6, 7) to be stored as an integer array, It will take much more space than storing it as the bit array “10101111”.

Another way to compress even further is to use Run Length Encoding (RLE) which stores groups of 1s. For instance, 00011110000011000 will be stored as (3, 4), (12, 2) that denotes a group of 4 1s at position 3 and a group of 2 1s at position 12.

Roaring Bitmap takes the best of the 3 worlds by splitting a set in blocks of 65,536 numbers. In this case, in each block, numbers will be from 0 to 65,635 and can be encoded using 16 bits.

Therefore the following rules can apply:

If there are less than 4,096 numbers in the set, then encode the set as an integer array. It will take at most 4,096 x 16 bits = 65,536 bits
If not, consider a bitmap of 65,536 bits to store the members of the set.
If the set is dense and/or have numbers packed together, RLE might be a better choice.

Installing the roaringbitmap extension

In order to install the roaringbitmap extension, first build the extension and install it as follows:

git clone https://github.com/ChenHuajun/pg_roaringbitmap
cd pg_roaringbitmap
make install
make installcheck

Then in Postgres:

CREATE extension roaringbitmap;
-- This shows the roaring bitmap as an array instead of a byte representation
SET roaringbitmap.output_format='array';

Let's create a new table purchases_roaring:

CREATE TABLE purchases_roaring (
	product_id INT NOT NULL PRIMARY KEY,
	user_ids roaringbitmap
);

Creating the sets of users in the purchase table is similar to how we built the intarray previously:

INSERT INTO purchases_roaring
	SELECT
		product_id,
		RB_BUILD_AGG(DISTINCT user_id)
FROM order_products
JOIN orders USING (order_id)
GROUP BY product_id;

Again, let's check a few rows:

SELECT
product_id,
		product_name,
		RB_CARDINALITY(user_ids) as num_users,
		user_ids
FROM purchases_roaring
JOIN products USING (product_id)
ORDER BY num_users DESC
LIMIT 10;

+--------------+------------------------+-------------+-------------------------
| product_id   | product_name           | num_users   | user_ids
|--------------+------------------------+-------------+-------------------------
| 24852        | Banana                 | 76125       | {2,10,16,21,27,28,32,34,
| 13176        | Bag of Organic Bananas | 65655       | {1,2,7,11,12,22,27,40,41
| 21137        | Organic Strawberries   | 61129       | {3,7,16,18,21,27,28,32,3
| 21903        | Organic Baby Spinach   | 56766       | {3,6,8,16,22,24,28,35,37
| 47626        | Large Lemon            | 48614       | {10,28,39,51,53,54,56,62
| 26209        | Limes                  | 46658       | {7,10,11,14,18,21,28,32,
| 16797        | Strawberries           | 44857       | {2,3,10,17,24,27,35,39,4
| 47209        | Organic Hass Avocado   | 44704       | {2,27,28,38,50,59,62,63,
| 47766        | Organic Avocado        | 43954       | {2,3,8,21,28,35,37,41,54
| 39275        | Organic Blueberries    | 38720       | {7,27,32,35,42,52,65,68,
+--------------+------------------------+-------------+-------------------------

Again we can select all the users that bought bananas and strawberries. The query is the same as the one with intarrays.

SELECT
    (
	(SELECT user_ids FROM purchases_roaring WHERE product_id=24852) |
(SELECT user_ids FROM purchases_roaring WHERE product_id=13176)
) & (
(SELECT user_ids FROM purchases_roaring WHERE product_id=21137) |
(SELECT user_ids FROM purchases_roaring WHERE product_id=16797)
    ) AS user_ids;

+-------------------------------------------------------------------------------
| user_ids
|-------------------------------------------------------------------------------
| {2,7,10,16,21,27,28,32,34,35,37,39,43,46,50,54,56,59,60,62,63,65,70,73,77,79,8
+-------------------------------------------------------------------------------
Time: 0.022s

Compared to the intarray solution which took 0.052 seconds, roaring bitmap performed twice as fast which is about 15 times faster than using a normal table structure and doing set intersection in normal SQL (with INTERSECT). In the experiments section we will show that they perform even better than Redis for large datasets.

Experiments

We will compare the performance of these two approaches with Redis which is a common datastore to compute set operations.

We will use 4 synthetic datasets:

Medium size sets: 100,000 users with a total of 10,000 products:
- 100K_10K_SPARSE: sparse with 10% of popular products purchased by between 25% and 50% users
- 100_K10K_DENSE: dense with 50% of popular products purchased by between 25% and 50% users
Large size sets: 1,000,000 users with a total of 1,000 products:
- 1M_1K_SPARSE: sparse with 10% of popular products purchased by between 25% and 50% users
- 1M_1K_DENSE: dense with 50% of popular products purchased by between 25% and 50% users

For non popular products, they will be purchased by under 10% of the users.

For our tests, we use a MacBook Pro with 2.9 GHz 6-Core Intel Core i9 and 32GB of RAM.

Using Redis

We use Redis 6.0.1 (To install and start Redis on MacOS you can use the following commands: brew install redis then brew services start redis)

In order to load the data into Redis we can use a simple script that will send a lot of SADD commands. For example: SADD 1 100 101 102 103

It will create a set with the key 1 containing the elements 100, 101, 102 and 103.

Using Postgres

We use Postgres 12.1 with the extensions intarray 1.2 and roaringbitmap 0.5.

When we generate the data, instead of generating a row for each (product_id, user_id) pair, we generate the data in a format that can be ingested directly into intarrays or roaring bitmap. Each row will look as follows:

10,{1, 10, 70} 12,{10, 70, 80}

Where the first column is the product_id and {1, 10, 70} is the array of user_ids.

Postgres intarray

We create the tables

purchases_intarray
CREATE TABLE purchases_intarray (
	product_id INT PRIMARY KEY,
	user_ids INT[]
);

To import the data:

\COPY purchases_roaring FROM 'small_data.csv' WITH CSV;

Postgres roaringbitmap

The process is similar and the data imported use the same format as for intarray.

We create the table purchases_intarray as follows:

CREATE TABLE purchases_roaring (
	product_id INT PRIMARY KEY,
	user_ids ROARINGBITMAP
);

To import the data:

\COPY purchases_roaring FROM 'small_data.csv' WITH CSV;

Data footprint

Table view of above graph showing redis, intarray, roaring comparison

We can see that Postgres stored the data in intarray in a compact way being less than 7% the size needed in Redis. Using roaringbitmap, the footprint decreased even more to 1-2% of what Redis uses.

RAM required

We show for each the RAM required to run 1 light query / heavy query. Running 10 at the same time will take 10 times as much RAM.

Bar graphs showing comparison in RAM usage for light queries vs. heavy queries

Table showing Redis, Postgres Intarray, and Postgres roaring comparisons

Using the Redis open source version, all the data is loaded in memory so the RAM required is basically the same as the footprint.

In order to determine the RAM required by Postgres we run postgres in docker using different memory sizes. We can see that storing the data in Postgres takes significantly less space than Redis. Also Roaring Bitmap really shines when we consider large sets.

Note that Redis loads the data in memory which quickly becomes a limitation when it comes to big dataset. Postgres, on the other hand, uses considerably less disk space and RAM. As queries are becoming more and more complex Postgres intarray consumes more and more RAM but using Roaring Bitmap does not increase RAM consumption much.

Response time

For our experiments we consider 2 kinds of workload:

Light queries that consist of 10 intersections of 10 unions each
Heavy queries that consist of 10 intersections of 100 unions each

For each of the experiments, we measure the time taken to run each query using different concurrency levels (between 1 and 25).

Medium size sets (100,000 users and 10,000 items) We represent here the response time that was obtained for 10,000 medium size sets that are sparse for the 2 different workloads.

Bar graph showing response time obtained for 10,000 medium size sets

We can see that Postgres is significantly faster than Redis for a light or heavy workload. Because Redis is single-threaded, concurrent operations are serialized which makes it not suitable for applications that have to compute a lot of set operations concurrently. On the other hand Postgres can run the queries in parallel (equal to the number of cores) which makes it more efficient than Redis as the concurrency increases.

Postgres roaring performs much better than intarray with very low response time that makes it suitable for applications that require low response time.

We can see that for a relatively small dataset with lighter query (10 intersections of 10 unions), Redis takes about 0.5 seconds to compute the query and returns the response. Running queries in parallel does not help since Redis is single threaded. Postgres in this case performs about 10 times better and can process queries in parallel up to a certain point.

We represent below the response time that was obtained for 10,000 medium size sets that are dense for the 2 different workloads. Again roaring bitmap thanks to its compression performs much better than the other methods.

Bar graph showing response time for 10,000 medium sized sets using roaring bitmap

Large size sets (1,000,000 users and 1,000 items) We represent below the response time that was obtained for 1,000 large size sets that are sparse for the 2 different workloads.

Response time obtained for 1,000 large size sets

We can see that the difference between Redis and Postgres are even more exacerbated and so is the difference between Postgres intarray and Postgres roaringbitmap. So the larger the sets are the more benefit we get from using Postgres, and even more by using Postgres roaringbitmap.

We represent below the response time that was obtained for 1,000 large size sets that are dense for the 2 different workloads.

Bar graph exhibiting the cases where large and dense sets are detrimental to Redis

This chart clearly exhibits the cases where large and dense sets are detrimental to Redis which makes Postgres clearly a winner, especially Postgres using roaringbitmap.

Conclusion

In this blog post, we showed how to use Postgres to store sets and run set operations on them. We also show that it performs better than Redis using less memory. Using roaring bitmap is much faster than using intarray. However it is not readily available on Amazon RDS so in order to use it, one has to deploy his/her own Postgres in EC2 and cannot benefit from features such as backup and upgrades that come out of the box with RDS.