Document search system design
Design a search system that returns the set of all document ids containing all the terms in a search string (limited to a known dictionary), similar to putting double quotes around a phrase in a Google search. Let's assume the inverted index is already given.
Schema of inverted index
{
<string>: string[]
}
Following is an example:
{
w1: [d1, d2, ..., dn]
w2: [d1, d2, ..., dn]
w3: [d1, d2, ..., dn]
...
wk: [d1, d2, ..., dn]
}
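For concreteness, here is a minimal sketch of this schema as a Python dictionary (the terms and document ids are made-up placeholders):

```python
# Inverted index: each term maps to a sorted list of document ids.
# The terms and ids here are illustrative placeholders.
inverted_index: dict[str, list[int]] = {
    "w1": [1, 2, 5, 9],
    "w2": [2, 5, 7],
    "w3": [5, 9, 11],
}
```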
Simple Search Algorithm
We can take the following approach to search the given phrase:
- For each term, get its sorted list of document ids.
- Use a min priority queue (a binary min-heap) whose size equals the number of words in the phrase, holding the current head of each list.
- If the heap is unival (every entry holds the same document id), that document is common across all words.
- Once a unival state is found, record that document id in the result and advance the priority queue past those entries.
Let's take the following example to understand this approach.
Interview: [1, 3, 5, 7, 9]
Kickstart: [2, 4, 5, 8]
Rock: [3, 5, 7]
Let's build a priority queue of size 3 to implement a k-way merge. The first unival state we reach is [5, 5, 5], so document 5 is our answer.
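Here is a minimal Python sketch of this heap-based intersection (the function name `search_all_terms` and its structure are my own; it assumes a sorted in-memory posting list per term, like the index above):

```python
import heapq

def search_all_terms(index, terms):
    """Return ids of documents that contain every term, using a k-way merge."""
    postings = [index.get(term, []) for term in terms]
    if not postings or any(len(p) == 0 for p in postings):
        return []  # some term matches nothing, so the intersection is empty

    # Heap entry: (current doc id, posting-list index, position in that list).
    heap = [(p[0], i, 0) for i, p in enumerate(postings)]
    heapq.heapify(heap)
    result = []

    while True:
        smallest = heap[0][0]
        # Unival check, O(k): all heads equal means a document common to all terms.
        if all(doc_id == smallest for doc_id, _, _ in heap):
            result.append(smallest)
        # Advance the smallest head; stop once any list is exhausted.
        doc_id, i, pos = heapq.heappop(heap)
        if pos + 1 >= len(postings[i]):
            return result
        heapq.heappush(heap, (postings[i][pos + 1], i, pos + 1))
```

On the three lists above, `search_all_terms(index, ["Interview", "Kickstart", "Rock"])` returns `[5]`, matching the walk-through.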
Time complexity: O(nk(log k + k))
Space complexity: O(k)
Here k is the number of terms and n is the length of a document ids list.
Note: in total n*k document ids pass through the priority queue, each push costing O(log k), and every unival check takes O(k); therefore the overall time complexity is O(nk(log k + k)).
Since the document list size n can be in the trillions, k is negligible compared to n, so we can say the time complexity is effectively O(n). That is, the worst-case time complexity is O(trillion), which is not acceptable. Therefore, before we try to scale the solution, we need to optimize the latency.
Key observations on this approach
- API response time (latency) has to come down from a worst case of O(trillion) to something that maps to about 100 ms, i.e. O(10,000) ~ 100 ms (an assumption for commodity servers: run a for loop 10,000 times and measure the time taken).
- Irrespective of the document count, the user would like a search to execute with a worst case of O(10,000).
- In our case, O(trillion) means O(1,000,000,000,000) ~ 100,000,000 * 100 ms ~ 2,778 hours (about 116 days), which is not acceptable.
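As a quick sanity check on that arithmetic (using the assumed 10,000 operations per 100 ms budget):

```python
ops_per_100_ms = 10_000            # assumed commodity-server budget
total_ops = 1_000_000_000_000      # one trillion work items

batches = total_ops / ops_per_100_ms   # 1e8 batches of 100 ms each
total_seconds = batches * 0.1          # 1e7 seconds
print(total_seconds / 3600)            # ~2777.8 hours, about 116 days
```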
Optimize search algorithm
Since our goal is to reduce n, we need to apply sharding to the input data.
Sharding
There are two ways to shard the data:
- Horizontal sharding, i.e. divide the data horizontally and place each shard on a separate server (for example Spanner, MongoDB, Cassandra, etc.)
- Vertical sharding, i.e. divide the data vertically and place each shard on a separate server.
Scatter and Gather
Let's try to understand the Scatter and Gather design pattern to optimize the search algorithm.
Without Scatter and Gather, we generally design the system with a single provider serving every request. This works well when the provider is small, but doesn't work when the provider holds TBs or PBs of data.
With the Scatter and Gather design, we split the provider into small chunks. Every request to a chunk should follow the same specification so that the same algorithm can be executed on each, and we also have to make sure it is easy for the Gather step to merge the results.
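A minimal sketch of the pattern in Python, where `query_chunk` and `merge` are hypothetical stand-ins for the per-chunk search call and the result-combining step:

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(chunk_servers, request, query_chunk, merge):
    """Fan the same request out to every chunk, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(chunk_servers)) as pool:
        # Scatter: every chunk receives the identical request specification.
        partials = list(pool.map(lambda server: query_chunk(server, request),
                                 chunk_servers))
    # Gather: combine the per-chunk answers into one response.
    return merge(partials)
```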
Horizontal sharding to optimize search
We can shard the data horizontally by key range. Following is an example:
- [aa to ap] = shard0, goes on server0
- [aq to az] = shard1, goes on server1
- ...
- [zq to zz] = shardn, goes on servern
- The load balancer has to do the merge.
- For example, if we search for "hello world":
- "hello" will return up to 1 trillion documents from its shard
- "world" will return up to 1 trillion documents from its shard
- Now the load balancer needs to merge these two results.
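For illustration, a hypothetical router that maps a term to its shard from the first two letters, splitting the aa..zz prefix space evenly (names and ranges are assumptions mirroring the table above):

```python
def shard_for_term(term: str, num_shards: int) -> int:
    """Range-based routing: map a term's two-letter prefix to a shard index."""
    prefix = (term.lower() + "aa")[:2]  # pad very short terms
    value = (ord(prefix[0]) - ord("a")) * 26 + (ord(prefix[1]) - ord("a"))
    return value * num_shards // (26 * 26)

# "hello" and "world" start with different prefixes, so they live on
# different shards and the load balancer must merge two huge lists.
print(shard_for_term("hello", 52), shard_for_term("world", 52))
```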
Horizontal sharding defeats our goal: the per-search work is still too high, i.e. 1 trillion items, while we want a worst case of O(10,000) work items, i.e. 100 ms.
Shard data vertically
We can shard the data vertically. It means every shard keeps all the keys, but each key maps to a smaller slice of its document ids list.
Following is an example range for shard0:
w0 - [1 to 10,000]
w1 - [1 to 10,000]
...
wk - [1 to 10,000]
Following is an example range for shard1:
w0 - [10,001 to 20,000]
w1 - [10,001 to 20,000]
...
wk - [10,001 to 20,000]
We can execute the same algorithm on each shard, and then Gather will take care of merging the results.
Let's take the following example to understand this approach.
"hello": [1, 5, 7, 15000]
"world": [1, 5, 17000]
This will be vertically sharded as below.
On shard0:
"hello": [1, 5, 7]
"world": [1, 5]
On shard1:
"hello": [15000]
"world": [17000]
Now each machine works only on a small subset of about 10k document ids. This design is based on the MapReduce paradigm.
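Reusing `search_all_terms` from the earlier sketch, the scatter and gather steps for this example might look as follows; because the shards own disjoint document id ranges, the Gather merge is just a concatenate-and-sort:

```python
shards = [
    {"hello": [1, 5, 7], "world": [1, 5]},    # shard0: ids 1 to 10,000
    {"hello": [15000], "world": [17000]},     # shard1: ids 10,001 to 20,000
]

def scatter_gather_search(shards, terms):
    # Scatter: run the same intersection algorithm on every shard.
    partials = [search_all_terms(shard, terms) for shard in shards]
    # Gather: shards cover disjoint id ranges, so merging is concat + sort.
    return sorted(doc_id for partial in partials for doc_id in partial)

print(scatter_gather_search(shards, ["hello", "world"]))  # [1, 5]
```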
We have improved the latency. Let’s see if we need to improve throughput as well.
Throughput
- Let's say we get 10 searches per second.
- How many searches does each shard receive?
- 10, because every request is sent to every shard.
- Each shard takes about 100 ms to search its ~10k document ids; assume a single shard can serve about 300 searches per second by handling requests concurrently.
- If all shards start receiving 300 searches per second, how much will our throughput be?
- 300.
- So we are bottlenecked by the capacity of a single shard.
Let's say we want 40k searches per second; then we will have to replicate the whole architecture for parallel processing.
Q. How many replicas do we need to handle 40k searches per second?
Ans. Replicas of each shard = 40,000 / 300 ≈ 134 (rounding up)
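The same capacity math in code (300 searches per second per shard is the assumed figure from above):

```python
import math

target_qps = 40_000   # desired searches per second
shard_qps = 300       # assumed capacity of a single shard replica

# Round up: 133 replicas would fall just short of the target.
print(math.ceil(target_qps / shard_qps))  # 134
```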
Google-scale search is a throughput-hungry problem.
This post is based on the system design taught in Interview Kickstart. Feel free to join the Interview Kickstart course, as they are the best :)
Happy system designing :-)