Chat system design

Dilip Kumar
5 min read · Jul 15, 2024


Design a chat system that supports one-to-one chat and group chat. It should also show each user's online status to other users.

Approach 1: Single server system design

We can design the chat application using a single server, as below.

  1. User A sends a message to user B through the Chat server.
  2. Chat server writes messages to the database and then acknowledges user A.
  3. User B polls the chat server at a fixed interval and reads the messages addressed to user B.
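
The three steps above can be sketched in a few lines of Python. This is a minimal in-memory sketch, not a real server: the class name `SingleChatServer`, the `cursor` argument, and the `"ACK"` return value are assumptions made for illustration, and a plain dictionary stands in for the database.

```python
from collections import defaultdict


class SingleChatServer:
    """In-memory stand-in for the chat server plus its database (hypothetical)."""

    def __init__(self):
        self._inbox = defaultdict(list)  # recipient -> stored (sender, text) pairs

    def send(self, sender, recipient, text):
        # Step 2: persist the message first, then acknowledge the sender.
        self._inbox[recipient].append((sender, text))
        return "ACK"

    def poll(self, user, cursor):
        # Step 3: return the messages the user has not read yet and the new cursor.
        messages = self._inbox[user][cursor:]
        return messages, cursor + len(messages)


server = SingleChatServer()
server.send("A", "B", "hello")
new_messages, cursor = server.poll("B", 0)
print(new_messages, cursor)
```

Every call to `poll` hits the server even when nothing is new, which is the cost the later approaches try to avoid.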

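This design has the following drawbacks.

  1. Single point of failure: if the chat server goes down, nobody can send or read messages.
  2. Not scalable: one server handles every write and every polling request, and polling also delays delivery by up to one polling interval.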

Approach 2: Use server side push mechanism

We can use a server-side push mechanism to send messages to connected clients. This splits the write path from the publish path. The modified design is as follows.

  1. On load, every client establishes a bidirectional connection with the Persistent Server.
  2. User A uses a separate Write Message microservice to send a message to user B.
  3. The Write Message microservice persists the message to the Messages database table, publishes it to the Events queue, and then acknowledges user A.
  4. The Events callback reads a batch of messages from the queue and makes an RPC call to the Persistent Server to deliver each message to the connected target user.
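
The split between the write path and the publish path could look roughly like the sketch below. It is only an illustration under stated assumptions: all class and function names are hypothetical, Python's standard `queue.Queue` stands in for the Events queue, and a plain list stands in for each open client connection.

```python
import queue


class WriteMessageService:
    """Write path: persist, publish, acknowledge (hypothetical names)."""

    def __init__(self, messages_table, events_queue):
        self.messages_table = messages_table
        self.events_queue = events_queue

    def write(self, sender, recipient, text):
        message = {"from": sender, "to": recipient, "text": text}
        self.messages_table.append(message)  # persist to the Messages table
        self.events_queue.put(message)       # publish for async delivery
        return "OK"                          # acknowledge the sender


class PersistentServer:
    """Holds one open connection per user; a list stands in for the socket."""

    def __init__(self):
        self.connections = {}

    def connect(self, user):
        self.connections[user] = []
        return self.connections[user]

    def deliver(self, message):
        connection = self.connections.get(message["to"])
        if connection is None:
            return False  # recipient is not connected
        connection.append(message)
        return True


def events_callback(events_queue, server, batch_size=10):
    """Publish path: drain up to batch_size messages and push each one."""
    delivered = 0
    for _ in range(batch_size):
        try:
            message = events_queue.get_nowait()
        except queue.Empty:
            break
        if server.deliver(message):
            delivered += 1
    return delivered
```

User A gets the `"OK"` as soon as the message is stored and queued. Delivery to user B happens later, when `events_callback` runs.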

This design is better than the earlier one but still cannot scale well, because a single Persistent Server has to hold every client connection.

Approach 3: Scalable real time messaging system

To scale, we need many instances of the Persistent Server. This means the Events Callback first has to discover which Persistent Server holds the connection for a given target user.

We can store the mapping between users and Persistent Servers in a separate database table. The first time a user connects, the next available server is assigned to that user for the bidirectional connection, and the mapping is written to the table.

The overall design is as follows.

Client establishes a session with the Persistent Server

First-time registration

  1. The client makes an RPC call to RegisterClient.
  2. RegisterClient makes an RPC call to the Discovery service.
  3. The Discovery service queries the Placements table and finds no entry.
  4. It returns the load balancer group as a response to RegisterClient.
  5. RegisterClient redirects the request to the load balancer to establish a session with a Persistent Server.
  6. The Persistent Server writes an entry into the Placements table to store the user and its placement on that Persistent Server.
  7. RegisterClient returns the address of the allocated Persistent Server to the client.
  8. The client establishes a bidirectional WebSocket connection to the designated Persistent Server.

Registration from the second time onwards

  1. The client makes an RPC call to RegisterClient.
  2. RegisterClient makes an RPC call to the Discovery service.
  3. The Discovery service queries the Placements table and finds an existing entry.
  4. RegisterClient returns the address of the existing Persistent Server to the client.
  5. The client establishes a bidirectional WebSocket connection to the designated Persistent Server.
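
Both registration flows reduce to "look up the placement, and allocate one only if it is missing". Here is a minimal sketch of that logic. The names are hypothetical, a dictionary stands in for the Placements table, and round-robin allocation is an assumption because the design above does not say how the load balancer picks a server. For brevity the sketch writes the placement inside `register_client`, whereas in the flow above the Persistent Server performs that write.

```python
import itertools


class DiscoveryService:
    def __init__(self, placements):
        self.placements = placements  # Placements table: user_id -> server address

    def lookup(self, user_id):
        return self.placements.get(user_id)


class LoadBalancer:
    """Round-robin over Persistent Server addresses (an assumed policy)."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def next_server(self):
        return next(self._cycle)


def register_client(user_id, discovery, load_balancer):
    """Return the Persistent Server address the client should connect to."""
    address = discovery.lookup(user_id)
    if address is None:
        # First-time registration: allocate a server and record the placement.
        address = load_balancer.next_server()
        discovery.placements[user_id] = address
    # From the second time onwards the existing placement is returned unchanged.
    return address
```

The client would then open its WebSocket connection to the returned address.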

Sending real-time events in a 1:1 chat

  1. User A wants to send a message to target user B.
  2. User A makes a call to the WriteMessage service.
  3. WriteMessage writes the message data to the Messages table to persist it.
  4. WriteMessage also writes into the Events queue for async message delivery and returns an OK response to A.
  5. EventsCallback receives the message from the Events queue.
  6. EventsCallback calls the Discovery service to find the placement for user B.
  7. The Discovery service returns the current Persistent Server allocated to user B.
  8. EventsCallback calls that Persistent Server to deliver the message to user B.
  9. The Persistent Server uses its push mechanism to deliver the message to user B.
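
Steps 6 to 9 are the routing part of the flow. A minimal sketch, assuming hypothetical names and in-memory dictionaries in place of the Discovery service and the Persistent Servers:

```python
def deliver_direct(message, placements, servers):
    """Route one queued message to the recipient's Persistent Server.

    placements: user_id -> server address (stands in for the Discovery service)
    servers:    server address -> {user_id: [pushed texts]} (stands in for
                the Persistent Servers and their open connections)
    Returns True if the message was pushed, False if the user has no placement.
    """
    recipient = message["to"]
    address = placements.get(recipient)  # steps 6-7: discover the placement
    if address is None:
        # No placement yet. The message is already in the Messages table,
        # so nothing is lost by skipping the live push.
        return False
    # Steps 8-9: hand the message to that server, which pushes it to the user.
    servers[address].setdefault(recipient, []).append(message["text"])
    return True
```

Returning `False` instead of raising is a design choice made for this sketch: a missing placement is treated as a normal case, not as an error.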

Sending real-time events to all users in a group

  1. User A wants to post a message to all users in a group.
  2. User A makes a call to the WriteMessage service, addressed to group G1.
  3. WriteMessage writes the message data to the Messages table to persist it.
  4. WriteMessage also writes into the Events queue for async message delivery and returns an OK response to A.
  5. EventsCallback receives the message from the Events queue.
  6. EventsCallback reads the Groups table to find the target users for message delivery.
  7. EventsCallback then fans out to deliver the message to all the recipient users.

Each fanout call that delivers the message to a recipient performs the following steps.

  1. It calls the Discovery service to find the placement for user X.
  2. The Discovery service returns the current Persistent Server allocated to user X.
  3. Fanout calls that Persistent Server to deliver the message to user X.
  4. The Persistent Server uses its push mechanism to deliver the message to user X.
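
The group fanout can be sketched in the same style. All names here are hypothetical and the tables are plain dictionaries. Skipping the sender is an assumption of this sketch, because the steps above do not say whether user A should receive its own message back.

```python
def fan_out(message, groups, placements, servers):
    """Deliver one group message to every member with a live placement.

    groups:     group_id -> list of member user_ids (stands in for the Groups table)
    placements: user_id -> server address (stands in for the Discovery service)
    servers:    server address -> {user_id: [pushed texts]}
    Returns (delivered, missed) lists of user_ids.
    """
    delivered, missed = [], []
    for user in groups[message["group"]]:
        if user == message["from"]:
            continue  # assumption: the sender is not a recipient
        address = placements.get(user)  # one Discovery lookup per recipient
        if address is None:
            missed.append(user)  # this user has to catch up later
            continue
        servers[address].setdefault(user, []).append(message["text"])
        delivered.append(user)
    return delivered, missed
```

The `missed` list makes it visible that some members did not get the live push and will depend on a later sync.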

Design to detect user presence

Now let’s update the chat design to detect online/offline user presence.

Approach 1: Client polls

Write path

  1. The client sends a heartbeat to the Persistent Server every 15 seconds.
  2. The Chat Server batches 30 seconds of heartbeat data and writes it to the database in one batch.

Read path

  1. Every minute, the client makes a read request to the Status Server to get other users’ online status.
  2. In the request payload, the client sends the list of userIds whose online status it wants to know.
  3. The Status Server reads the online/offline status from the UserStatus database and returns the result to the client.
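
A minimal sketch of the polling approach follows. The class and method names are hypothetical, and time is passed in as a plain number of seconds so the example stays deterministic. The 60-second `ONLINE_WINDOW` is an assumption: the design above does not say how old a heartbeat may be before the user counts as offline.

```python
ONLINE_WINDOW = 60  # seconds; assumed threshold, not specified in the design


class PresenceStore:
    def __init__(self):
        self.last_seen = {}  # stands in for the UserStatus table
        self._pending = {}   # heartbeats buffered since the last batch write

    def heartbeat(self, user_id, now):
        # Write path, step 1: buffer the latest heartbeat per user.
        self._pending[user_id] = now

    def flush(self):
        # Write path, step 2: one batch write instead of one write per heartbeat.
        self.last_seen.update(self._pending)
        self._pending.clear()

    def status(self, user_ids, now):
        # Read path: answer for the list of userIds sent by the client.
        result = {}
        for user_id in user_ids:
            seen = self.last_seen.get(user_id)
            is_online = seen is not None and now - seen <= ONLINE_WINDOW
            result[user_id] = "online" if is_online else "offline"
        return result
```

Because of batching, a user only appears online after the next `flush`, so the status a reader sees can lag behind reality by up to the batch interval plus the polling interval.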

Approach 2: Server push

Write path

Heartbeat

  1. The client sends a heartbeat to the Chat Server every 15 seconds.
  2. The Chat Server batches 15 seconds of heartbeat data and writes it to both the database and the queue in one batch.

Subscription

  1. The client subscribes to a list of users whose presence updates it wants to receive.
  2. The mapping from each subscriber to its list of users of interest is stored in the Subscriptions table with a TTL.

Server push path

Queue subscribers retrieve the list of subscribers registered for a user’s online status and push the presence update to those subscribers.
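
The subscription table and the push step could be sketched as below. The names are hypothetical, a dictionary stands in for the Subscriptions table, the TTL value is left to the caller, and `push` is any callable that delivers an event to one subscriber.

```python
class Subscriptions:
    """Stands in for the Subscriptions table; each row expires after ttl seconds."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._rows = {}  # (subscriber, target) -> expiry time

    def subscribe(self, subscriber, targets, now):
        for target in targets:
            self._rows[(subscriber, target)] = now + self.ttl

    def subscribers_of(self, target, now):
        # Only rows whose TTL has not expired count as live subscriptions.
        return sorted(
            subscriber
            for (subscriber, row_target), expires_at in self._rows.items()
            if row_target == target and expires_at > now
        )


def push_presence(event, subscriptions, push, now):
    """Queue subscriber: push one presence event to every live subscriber."""
    for subscriber in subscriptions.subscribers_of(event["user"], now):
        push(subscriber, event)
```

With a TTL, a client that disappears without unsubscribing stops receiving pushes once its rows expire, and an active client has to renew its subscription before that happens.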

Sync protocol to catch up missing events

High-level design

  1. Each message carries a revision timestamp, equal to the time when the message was posted.
  2. For each direct-message conversation or group, the timestamp of the last posted message serves as that conversation's revision timestamp.
  3. EventsCallback reads the Messages table to find the prev_rev_timestamp for every message, which lets the messages be chained into sorted order.

Unordered message delivery

  1. The client uses prev_rev_timestamp to rebuild the sort order of messages.
  2. If messages are missing, it makes a catch-up call to the server to fetch the missing events.
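
One way a client could use prev_rev_timestamp is to chain messages starting from the last revision it has already applied. The sketch below assumes each message is a dictionary with hypothetical fields `rev` (its revision timestamp) and `prev_rev` (the revision timestamp of the message before it in the same conversation). The function name is also an assumption.

```python
def order_messages(last_applied_rev, received):
    """Chain out-of-order messages and detect a gap.

    Returns (ordered, gap_after). gap_after is None when every received
    message could be chained; otherwise it is the revision after which
    something is missing and a catch-up call is needed.
    """
    by_prev = {message["prev_rev"]: message for message in received}
    ordered = []
    cursor = last_applied_rev
    while cursor in by_prev:
        message = by_prev.pop(cursor)  # the message that follows `cursor`
        ordered.append(message)
        cursor = message["rev"]
    # Anything left over could not be linked to the chain.
    gap_after = cursor if by_prev else None
    return ordered, gap_after


arrived = [{"rev": 3, "prev_rev": 2}, {"rev": 2, "prev_rev": 1}]
ordered, gap_after = order_messages(1, arrived)
print([m["rev"] for m in ordered], gap_after)
```

When `gap_after` is not `None`, the client would ask the server for the events after that revision and then retry the chaining.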

Sync the missing messages

  1. A user may go offline and miss the live delivery of messages.
  2. Once the user comes back online, the client makes an RPC call to the server to get the latest revision_timestamp for each direct-message conversation or group.
  3. The client then makes a catch-up call to the server for each direct-message conversation or group to fetch the missing events.
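
The reconnect flow can be sketched as two server calls plus a small client loop. The function names and the `conversation`, `rev` fields are hypothetical, and a list of dictionaries stands in for the Messages table.

```python
def latest_revisions(messages_table):
    """Server side: latest revision timestamp per conversation (step 2)."""
    latest = {}
    for message in messages_table:
        conversation = message["conversation"]
        latest[conversation] = max(latest.get(conversation, 0), message["rev"])
    return latest


def catch_up(messages_table, conversation, after_rev):
    """Server side: events in one conversation newer than after_rev, oldest first."""
    newer = [
        message
        for message in messages_table
        if message["conversation"] == conversation and message["rev"] > after_rev
    ]
    return sorted(newer, key=lambda message: message["rev"])


def sync_after_reconnect(messages_table, local_revs):
    """Client side: fetch missing events for every conversation that moved on.

    local_revs maps conversation -> last revision timestamp the client holds.
    """
    missing = {}
    for conversation, server_rev in latest_revisions(messages_table).items():
        local_rev = local_revs.get(conversation, 0)
        if server_rev > local_rev:
            missing[conversation] = catch_up(messages_table, conversation, local_rev)  # step 3
    return missing
```

Comparing revision timestamps first means the client only makes catch-up calls for conversations that actually changed while it was offline.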

Happy system designing :-)


Written by Dilip Kumar

With 18+ years of experience as a software engineer. Enjoy teaching, writing, leading team. Last 4+ years, working at Google as a backend Software Engineer.
