Document search system design
Design a search system that returns the set of all document ids containing all the terms in a search string (limited to a known dictionary), similar to putting double quotes around a phrase in a Google search. Let's assume the inverted index is already given.
Schema of inverted index
{
<string>: string[]
}
Following is an example:
{
w1: [d1, d2, ..., dn]
w2: [d1, d2, ..., dn]
w3: [d1, d2, ..., dn]
...
wk: [d1, d2, ..., dn]
}
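For concreteness, here is a minimal sketch of this schema as a Python dictionary (the terms and document ids are made-up placeholders):

```python
# Inverted index: each term maps to a sorted list of document ids.
# The terms and ids here are illustrative placeholders.
inverted_index: dict[str, list[int]] = {
    "w1": [1, 2, 5, 9],
    "w2": [2, 5, 7],
    "w3": [5, 9, 11],
}
```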
Simple Search Algorithm
We can take the following approach to search the given phrase:
- For each term, get its sorted list of document ids.
- Use a min priority queue (a binary min-heap) whose size equals the number of words in the phrase, holding the current head of each list.
- If the heap is unival (every entry holds the same document id), that document is common across all words.
- Once a unival state is found, record that document id in the result and advance the priority queue past those entries.
Let's take the following example to understand this approach.
Interview: [1, 3, 5, 7, 9]
Kickstart: [2, 4, 5, 8]
Rock: [3, 5, 7]
Let's build a priority queue of size 3 to implement a k-way merge. The first unival state we reach is [5, 5, 5], so document 5 is our answer.
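Here is a minimal Python sketch of this heap-based intersection (the function name `search_all_terms` and its structure are my own; it assumes a sorted in-memory posting list per term, like the index above):

```python
import heapq

def search_all_terms(index, terms):
    """Return ids of documents that contain every term, using a k-way merge."""
    postings = [index.get(term, []) for term in terms]
    if not postings or any(len(p) == 0 for p in postings):
        return []  # some term matches nothing, so the intersection is empty

    # Heap entry: (current doc id, posting-list index, position in that list).
    heap = [(p[0], i, 0) for i, p in enumerate(postings)]
    heapq.heapify(heap)
    result = []

    while True:
        smallest = heap[0][0]
        # Unival check, O(k): all heads equal means a document common to all terms.
        if all(doc_id == smallest for doc_id, _, _ in heap):
            result.append(smallest)
        # Advance the smallest head; stop once any list is exhausted.
        doc_id, i, pos = heapq.heappop(heap)
        if pos + 1 >= len(postings[i]):
            return result
        heapq.heappush(heap, (postings[i][pos + 1], i, pos + 1))
```

On the three lists above, `search_all_terms(index, ["Interview", "Kickstart", "Rock"])` returns `[5]`, matching the walk-through.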
Time complexity: O(nk(log k + k))
Space complexity: O(k)
Here k is the number of terms and n is the length of a document ids list.
Note: in total n*k document ids pass through the priority queue, each push costing O(log k), and every unival check takes O(k); therefore the overall time complexity is O(nk(log k + k)).
Since the document list size n can be in the trillions, k is negligible compared to n, so we can say the time complexity is effectively O(n). That is, the worst-case time complexity is O(trillion), which is not acceptable. Therefore, before we try to scale the solution, we need to optimize the latency.
Key observations on this approach
- API response time (latency) has to come down from a worst case of O(trillion) to something that maps to about 100 ms, i.e. O(10,000) ~ 100 ms (an assumption for commodity servers: run a for loop 10,000 times and measure the time taken).
- Irrespective of the document count, the user would like a search to execute with a worst case of O(10,000).
- In our case, O(trillion) means O(1,000,000,000,000) ~ 100,000,000 * 100 ms ~ 2,778 hours (about 116 days), which is not acceptable.
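As a quick sanity check on that arithmetic (using the assumed 10,000 operations per 100 ms budget):

```python
ops_per_100_ms = 10_000            # assumed commodity-server budget
total_ops = 1_000_000_000_000      # one trillion work items

batches = total_ops / ops_per_100_ms   # 1e8 batches of 100 ms each
total_seconds = batches * 0.1          # 1e7 seconds
print(total_seconds / 3600)            # ~2777.8 hours, about 116 days
```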
Optimize search algorithm
Since our goal is to reduce n, we need to apply sharding to the input data.
Sharding
There are two ways to shard the data:
- Horizontal sharding, i.e. divide the data horizontally and place each shard on a separate server (for example Spanner, MongoDB, Cassandra, etc.)
- Vertical sharding, i.e. divide the data vertically and place each shard on a separate server.
Scatter and Gather
Let's try to understand the Scatter and Gather design pattern to optimize the search algorithm.
Without Scatter and Gather, we generally design the system with a single provider serving every request. This works well when the provider is small, but doesn't work when the provider holds TBs or PBs of data.
With the Scatter and Gather design, we split the provider into small chunks. Every request to a chunk should follow the same specification so that the same algorithm can be executed on each, and we also have to make sure it is easy for the Gather step to merge the results.
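A minimal sketch of the pattern in Python, where `query_chunk` and `merge` are hypothetical stand-ins for the per-chunk search call and the result-combining step:

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(chunk_servers, request, query_chunk, merge):
    """Fan the same request out to every chunk, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(chunk_servers)) as pool:
        # Scatter: every chunk receives the identical request specification.
        partials = list(pool.map(lambda server: query_chunk(server, request),
                                 chunk_servers))
    # Gather: combine the per-chunk answers into one response.
    return merge(partials)
```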
Horizontal sharding to optimize search
We can shard the data horizontally by key range. Following is an example:
- [aa to ap] = shard0, goes on server0
- [aq to az] = shard1, goes on server1
- ...
- [zq to zz] = shardn, goes on servern
- The load balancer has to do the merge.
- For example, if we search for "hello world":
- "hello" will return up to 1 trillion documents from its shard
- "world" will return up to 1 trillion documents from its shard
- Now the load balancer needs to merge these two results.
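For illustration, a hypothetical router that maps a term to its shard from the first two letters, splitting the aa..zz prefix space evenly (names and ranges are assumptions mirroring the table above):

```python
def shard_for_term(term: str, num_shards: int) -> int:
    """Range-based routing: map a term's two-letter prefix to a shard index."""
    prefix = (term.lower() + "aa")[:2]  # pad very short terms
    value = (ord(prefix[0]) - ord("a")) * 26 + (ord(prefix[1]) - ord("a"))
    return value * num_shards // (26 * 26)

# "hello" and "world" start with different prefixes, so they live on
# different shards and the load balancer must merge two huge lists.
print(shard_for_term("hello", 52), shard_for_term("world", 52))
```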
Horizontal sharding defeats our goal: the per-search work is still too high, i.e. 1 trillion items, while we want a worst case of O(10,000) work items, i.e. 100 ms.
Shard data vertically
We can shard the data vertically. It means every shard keeps all the keys, but each key maps to a smaller slice of its document ids list.
Following is an example range for shard0:
w0 - [1 to 10,000]
w1 - [1 to 10,000]
...
wk - [1 to 10,000]
Following is an example range for shard1:
w0 - [10,001 to 20,000]
w1 - [10,001 to 20,000]
...
wk - [10,001 to 20,000]
We can execute the same algorithm on each shard, and then Gather will take care of merging the results.
Let's take the following example to understand this approach.
"hello": [1, 5, 7, 15000]
"world": [1, 5, 17000]
This will be vertically sharded as below.
On shard0:
"hello": [1, 5, 7]
"world": [1, 5]
On shard1:
"hello": [15000]
"world": [17000]
Now each machine works only on a small subset of about 10k document ids. This design is based on the MapReduce paradigm.
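Reusing `search_all_terms` from the earlier sketch, the scatter and gather steps for this example might look as follows; because the shards own disjoint document id ranges, the Gather merge is just a concatenate-and-sort:

```python
shards = [
    {"hello": [1, 5, 7], "world": [1, 5]},    # shard0: ids 1 to 10,000
    {"hello": [15000], "world": [17000]},     # shard1: ids 10,001 to 20,000
]

def scatter_gather_search(shards, terms):
    # Scatter: run the same intersection algorithm on every shard.
    partials = [search_all_terms(shard, terms) for shard in shards]
    # Gather: shards cover disjoint id ranges, so merging is concat + sort.
    return sorted(doc_id for partial in partials for doc_id in partial)

print(scatter_gather_search(shards, ["hello", "world"]))  # [1, 5]
```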
We have improved the latency. Let’s see if we need to improve throughput as well.
Throughput
- Let's say we get 10 searches per second.
- How many searches does each shard receive?
- 10, because every request is sent to every shard.
- Each shard takes about 100 ms to search its ~10k document ids; assume a single shard can serve about 300 searches per second by handling requests concurrently.
- If all shards start receiving 300 searches per second, how much will our throughput be?
- 300.
- So we are bottlenecked by the capacity of a single shard.
Let's say we want 40k searches per second; then we will have to replicate the whole architecture for parallel processing.
Q. How many replicas do we need to handle 40k searches per second?
Ans. Replicas of each shard = 40,000 / 300 ≈ 134 (rounding up)
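The same capacity math in code (300 searches per second per shard is the assumed figure from above):

```python
import math

target_qps = 40_000   # desired searches per second
shard_qps = 300       # assumed capacity of a single shard replica

# Round up: 133 replicas would fall just short of the target.
print(math.ceil(target_qps / shard_qps))  # 134
```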
Google-scale search is a throughput-hungry problem.
This post is based on the system design taught in Interview Kickstart. Feel free to join the Interview Kickstart course, as they are the best :)
Happy system designing :-)