Google Drive system design
Design a Google Drive-scale application with the following requirements:
- Users can upload and download their files from any device.
- Users can share files or folders with other users.
- Changes synchronize automatically between devices, i.e., after a file is updated on one device, the update propagates to all of the user's other devices.
- The system should support storing large files, up to X GB.
High level system design
At a high level, we can use the following services to implement the Google Drive application.
We have the following problems to handle:
- Retrying the entire upload after an error is a poor user experience, because repeating the whole upload multiplies the time needed to upload the file.
- According to research, more than 60% of the content on the internet is duplicated, so the design should optimize storage for duplicate content.
- When a new or modified file is uploaded, its content must be synchronized to the rest of that user's clients.
Optimize upload service
Instead of uploading a large file in a single request, we can split it into small chunks of 10 MB and upload each chunk separately. If an upload fails, only the affected chunk has to be retried.
Client side design
At a high level, the client can be designed as follows.
Watcher
The watcher monitors the local workspace folders and notifies the chunker of any action performed by the user, e.g. when the user creates, deletes, or updates files or folders.
Chunker
The chunker splits large files into small chunks on the client. It also produces metadata that is used by the storage component.
It is also responsible for reconstructing a file from its chunks.
It detects the parts of a file that the user has modified and selects only those parts for upload to cloud storage, which saves bandwidth and synchronization time.
It also writes events into the Upload Chunks Queue to notify the uploader.
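The change detection and queueing described above could look roughly like the sketch below. It assumes fixed-size chunks compared by position, and the names `split`, `enqueue_changed_chunks`, and the event fields are hypothetical; Python's standard `queue.Queue` stands in for the Upload Chunks Queue.

```python
import hashlib
import queue

CHUNK_SIZE = 4  # tiny for illustration; the design above uses 10 MB chunks


def split(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into consecutive chunks of `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def enqueue_changed_chunks(file_id, old, new, upload_queue, size=CHUNK_SIZE):
    """Enqueue an upload event for every chunk of `new` that differs from `old`.

    Chunks are compared by position using their SHA-256 digest. Returns the
    list of chunk positions that were enqueued.
    """
    old_hashes = [hashlib.sha256(c).hexdigest() for c in split(old, size)]
    enqueued = []
    for order, chunk in enumerate(split(new, size)):
        digest = hashlib.sha256(chunk).hexdigest()
        if order >= len(old_hashes) or old_hashes[order] != digest:
            # Hypothetical event shape; the article does not define one.
            upload_queue.put({"file_id": file_id, "chunk_order": order, "hash": digest})
            enqueued.append(order)
    return enqueued


upload_chunks_queue = queue.Queue()
old_version = b"aaaabbbbccccdddd"
new_version = b"aaaabbXbccccdddd"  # one byte overwritten in the second chunk
print(enqueue_changed_chunks(1, old_version, new_version, upload_chunks_queue))  # [1]
```

Only the second chunk (position 1) is queued, so the other three chunks never leave the device.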
Chunking algorithm
The chunker can use the following chunking algorithms to split files.
Fixed-size file chunking
We choose a fixed chunk size and simply split the file at that size. We then compute a hash for each chunk, which is later used to compare two chunks and decide whether they are duplicates.
Fixed-size chunking works well as long as files are not edited. If even a single character is inserted or deleted near the beginning of a file, every chunk boundary after it shifts, so two otherwise identical files produce different chunks and have to be treated as separate files.
Even with this limitation, this approach is still used in practice because it is simple and uses little CPU.
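To make the boundary-shift limitation concrete, here is a small sketch. The 4-byte chunk size and the helper names are chosen only for illustration, and SHA-256 is one possible choice of hash.

```python
import hashlib


def fixed_size_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def hashes(chunks: list[bytes]) -> set[str]:
    """Return the set of SHA-256 hex digests of the given chunks."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}


original = b"abcdefghij"
edited = b"X" + original  # one byte inserted at the beginning

print(fixed_size_chunks(original, 4))  # [b'abcd', b'efgh', b'ij']
print(fixed_size_chunks(edited, 4))    # [b'Xabc', b'defg', b'hij']

shared = hashes(fixed_size_chunks(original, 4)) & hashes(fixed_size_chunks(edited, 4))
print(len(shared))  # 0
```

A single inserted byte leaves the two files with no chunk hashes in common, so nothing can be deduplicated.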
Content-defined file chunking
In this approach, instead of a chunk size, we choose a separator and split the file wherever that separator occurs.
Because chunk boundaries depend on the content, an edit changes only the hash of the chunk that contains it. For example, if the second file is a copy of the first with an edit at the beginning, all chunk hashes except the first remain the same.
This approach is CPU-intensive, and choosing the separator adds further complexity.
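Here is a deliberately simplified sketch of the separator idea. It assumes a literal newline byte as the separator, which is purely a choice for this example; real deduplication systems often pick boundaries with a rolling hash over a sliding window rather than a fixed byte sequence.

```python
import hashlib

SEPARATOR = b"\n"  # hypothetical separator chosen for this example


def separator_chunks(data: bytes, sep: bytes = SEPARATOR) -> list[bytes]:
    """Cut a chunk after every occurrence of `sep`; leftover bytes form the last chunk."""
    chunks = []
    start = 0
    while True:
        idx = data.find(sep, start)
        if idx == -1:
            break
        end = idx + len(sep)
        chunks.append(data[start:end])
        start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def digests(data: bytes) -> list[str]:
    """Return the SHA-256 hex digest of each separator-delimited chunk, in order."""
    return [hashlib.sha256(c).hexdigest() for c in separator_chunks(data)]


original = b"line one\nline two\nline three\n"
edited = b"X" + original  # one byte inserted at the beginning

print(separator_chunks(edited))
# [b'Xline one\n', b'line two\n', b'line three\n']
print([a == b for a, b in zip(digests(original), digests(edited))])
# [False, True, True]
```

Only the first chunk's hash changes; the remaining chunks are recognized as duplicates of chunks that are already stored.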
Chunks Database
The client needs to maintain its own storage for chunk metadata and upload status. The following is the schema for the Chunks table.
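| ChunkId | FileId | ChunkOrder | ChunkSize | UploadStatus | ModifiedTimestamp |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 10 MB | UPLOADED | xxx |
| 2 | 1 | 2 | 10 MB | INPROGRESS | xxx |
| 3 | 1 | 3 | 10 MB | NOT_UPLOADED | xxx |
| 4 | 1 | 4 | 10 MB | NOT_UPLOADED | xxx |
| 5 | 1 | 5 | 10 MB | DOWNLOADED | xxx |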
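One way to hold this table on the client is an embedded SQLite database. The sketch below is an assumption rather than part of the original design: the column types, the CHECK constraint, and the uniqueness rule on (FileId, ChunkOrder) are illustrative choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real client would use a file on disk
conn.execute(
    """
    CREATE TABLE Chunks (
        ChunkId           INTEGER PRIMARY KEY,
        FileId            INTEGER NOT NULL,
        ChunkOrder        INTEGER NOT NULL,
        ChunkSize         INTEGER NOT NULL,  -- size in bytes
        UploadStatus      TEXT NOT NULL CHECK (
            UploadStatus IN ('NOT_UPLOADED', 'INPROGRESS', 'UPLOADED', 'DOWNLOADED')
        ),
        ModifiedTimestamp TEXT NOT NULL,
        UNIQUE (FileId, ChunkOrder)
    )
    """
)

TEN_MB = 10 * 1024 * 1024
rows = [
    (1, 1, 1, TEN_MB, "UPLOADED"),
    (2, 1, 2, TEN_MB, "INPROGRESS"),
    (3, 1, 3, TEN_MB, "NOT_UPLOADED"),
]
conn.executemany(
    "INSERT INTO Chunks VALUES (?, ?, ?, ?, ?, datetime('now'))", rows
)

# Chunks of file 1 that still need to be (re)sent, in file order.
pending = conn.execute(
    "SELECT ChunkId FROM Chunks "
    "WHERE FileId = ? AND UploadStatus != 'UPLOADED' ORDER BY ChunkOrder",
    (1,),
).fetchall()
print(pending)  # [(2,), (3,)]
```

After a crash or restart, a query like this tells the client which chunks to resume with, instead of starting the whole file again.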
Uploader
The uploader processes the pending messages in the Upload Chunks Queue. It calls the server's Upload Service to upload the file content together with its metadata.
Once a chunk has been uploaded successfully, it updates the Chunks table to reflect the chunk's upload status.
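A minimal sketch of the uploader loop follows, with retries applied per chunk rather than per file. The `upload_chunk` callable is a hypothetical client for the Upload Service, the retry limit of three is an assumption, and a plain dictionary stands in for the UploadStatus column of the Chunks table.

```python
import queue


def run_uploader(upload_queue, upload_chunk, statuses, max_attempts=3):
    """Drain the queue, uploading each chunk with per-chunk retries.

    `upload_chunk(chunk_id)` is assumed to raise ConnectionError on failure.
    """
    while True:
        try:
            chunk_id = upload_queue.get_nowait()
        except queue.Empty:
            return
        statuses[chunk_id] = "INPROGRESS"
        for attempt in range(1, max_attempts + 1):
            try:
                upload_chunk(chunk_id)
                statuses[chunk_id] = "UPLOADED"
                break
            except ConnectionError:
                if attempt == max_attempts:
                    statuses[chunk_id] = "NOT_UPLOADED"  # left for a later run


calls = []


def flaky_upload(chunk_id):
    """Simulated Upload Service call that fails the first time it sees chunk 2."""
    calls.append(chunk_id)
    if chunk_id == 2 and calls.count(2) == 1:
        raise ConnectionError("simulated network failure")


upload_chunks_queue = queue.Queue()
for cid in (1, 2, 3):
    upload_chunks_queue.put(cid)

statuses = {}
run_uploader(upload_chunks_queue, flaky_upload, statuses)
print(calls)     # [1, 2, 2, 3]
print(statuses)  # {1: 'UPLOADED', 2: 'UPLOADED', 3: 'UPLOADED'}
```

The failure costs one extra request for chunk 2 only; chunks 1 and 3 are each sent exactly once.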
Synchronizer
The server notifies this component when chunks are uploaded by another client. It simply enqueues a message into the Download Chunks Queue for processing.
Downloader
The downloader processes messages from the Download Chunks Queue and calls the server's Download Service to download each chunk and update the local file system.
Server side design
The following is the high-level server-side design.
Upload Service
The Upload Service uploads file chunks to object storage and updates the Metadata table. It also publishes a message to the Chunk Uploaded Queue for synchronization.
Sync Service
The Sync Service processes messages from the Chunk Uploaded Queue. It looks up the UserClients table to find the list of connected clients for the given user, then fans out by calling the Session Server once per client to notify it.
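The fanout step might be sketched as follows. The in-memory dictionary stands in for the UserClients table, `dispatch_event` is a hypothetical stand-in for the Session Server's DispatchEvent RPC, and skipping the client that performed the upload is an assumption that the article does not state.

```python
from collections import defaultdict

# Stand-in for the UserClients table: user id -> ids of connected clients.
user_clients = defaultdict(set)
user_clients["alice"] = {"laptop", "phone", "tablet"}


def dispatch_event(client_id, event, outbox):
    """Hypothetical stand-in for the Session Server's DispatchEvent RPC."""
    outbox.append((client_id, event["chunk_id"]))


def fan_out(event, outbox):
    """Notify every connected client of the user except the uploading one.

    Returns the number of clients notified.
    """
    targets = user_clients.get(event["user_id"], set()) - {event["source_client"]}
    for client_id in sorted(targets):  # sorted only to make the output deterministic
        dispatch_event(client_id, event, outbox)
    return len(targets)


# Hypothetical message taken from the Chunk Uploaded Queue.
event = {"user_id": "alice", "source_client": "laptop", "chunk_id": 42}
outbox = []
print(fan_out(event, outbox))  # 2
print(outbox)                  # [('phone', 42), ('tablet', 42)]
```

Each notified client's synchronizer would then enqueue the chunk into its own Download Chunks Queue.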
Session Server
Every client first makes a Registration RPC call to the Session Server to obtain its target session server. If one has not already been established, this RPC allocates a new session server. The client then establishes a bidirectional connection to that session server.
The Session Server also exposes a DispatchEvent RPC, which the Sync Service invokes to perform the fanout.
To learn more about the Session Server concept, please go through the following design.
File Sharing service
Please go through the following design to understand how privacy is implemented as a shared service.
Happy learning :-)