System Design

TL;DR: use clusters; use caching; use sharding and replication; use redundancy; use parallelisation (multi-threading, load balancing, etc.); use async workflows; avoid points of contention and synchronisation issues.


Drive the discussion. Part of the signal the interviewer hopes to gather is whether you’ve learned how to build large systems through hard experience. Your ability to anticipate and work around typical problems is part of that signal. Make it a conversation: Be sure to ask clarifying questions. But make sure you drive towards a good solution.


Part 1: Understand the system requirements.

Do not make assumptions. Understand the constraints involved. Find the exact scope of the problem you are solving. Your interviewer is interested in understanding your thought process.

Ascertain whether it is really a system design question.

How can we repeatedly find a string in a large dataset?

An eager candidate might immediately jump to the conclusion that having a large dataset searched repeatedly means it must be sorted and then binary-searched in log(n), and will start writing the code for mergesort on the whiteboard. All while the interviewer is thinking of a large document corpus and trying to get them to design a search engine.

Unless you ask some clarifying questions, like: “How big is the data set?”, “What are its characteristics?”, “How do we obtain it?”, you might not realize what you’re up against.

Generate all the possible use cases

A use case is a description of a sequence of events that, taken together, lead to the system doing something useful. Identify who is going to use the system and how they are going to use it. The system may be very simple or very complicated.

Design a url shortening service

Shortening: createTiny(longURL) -> tinyURL
Redirection: getLong(tinyURL) -> longURL
Custom URL definition, if available
Analytics on URL usage
Automatic link expiry
Manual removal of a URL you shortened

Twitter

Who can post a tweet? (answer: any user)
Who can read the tweet? (answer: any user — as all tweets are public)
Will a tweet contain photos or videos (answer: for now, just photos)
Can a user follow another user? (answer: yes).
Can a user ‘like’ a tweet? (answer: yes).
Can a user search for tweets (answer: yes).
What gets included in the user feed? (answer: tweets from everyone you are following)
Is the feed a list of tweets in chronological order? (answer: for now, yes)

Instagram

Users can upload/download and view photos
Users can follow other users
Users can see a feed of top photos from the users they follow (the system should be able to generate and display a user’s News Feed consisting of top photos from all the people the user follows)
Users can search for photos based on titles
Optional: Adding tags, searching based on tags, commenting, suggestions, tagging users to photos etc.

a messaging system

Is it reasonable to assume that the messages can be sent by all users?
Does the user need to send messages to one or multiple persons?
Does the user need read receipts?
Is it reasonable to assume that the recipients need to be notified?
Does the user need to see their message history?

an email server.

How do you store mail, especially as the system gets large enough that it won’t fit on one machine?
How do you handle mailing lists with large numbers of recipients?
How do you handle people abusing the system for spam?

a client-server API to build a rich document editor.

How does the client request data on the document from the server, especially as the document gets large enough that we wouldn’t want to download it in a single request?
How do we represent the rich document aspects like bold and italics in our API response?
How do we design the system so that new features can be added on the server without breaking older clients?


Define System Interface or APIs

This not only establishes the exact contract expected from the system but also confirms that you haven’t gotten any requirements wrong.

TinyUrl.

CreateTiny(longURL) -> tinyURL
Redirection: getLong(tinyURL) -> longURL
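As a sanity check on the contract, here is a minimal sketch of what such an interface might look like in code; the class, method, and parameter names are illustrative assumptions, not a prescribed API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShortenResult:
    tiny_url: str
    expires_at: Optional[float] = None  # epoch seconds, if automatic expiry is enabled


class UrlShortener:
    """Illustrative contract for the shortening service (names are assumptions)."""

    def create_tiny(self, long_url: str,
                    custom_alias: Optional[str] = None,
                    expire_after_seconds: Optional[int] = None) -> ShortenResult:
        """Shorten a long URL, optionally honouring a custom alias and expiry."""
        raise NotImplementedError

    def get_long(self, tiny_url: str) -> str:
        """Resolve a short URL back to the original long URL."""
        raise NotImplementedError
```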

Twitter.

postTweet(user_id, tweet_text, image_url, user_location, timestamp, …)
generateTimeline(user_id, current_time)
searchTweet(string)
recordUserTweetLike(user_id, tweet_id, timestamp, …)

Messaging

sendMessage(sender, recipient, message)
getAllConversations(user)
getConversation(user, recipient)
searchMessages(searchString)


Estimate the scale of the system with a back-of-the-envelope calc

List the scale of the system such as requests per second, requests types, data written per second, data read per second.

This is important because it helps you narrow down the list of possible solutions to only the ones that are feasible. Then you have only a few prototypes or micro-benchmarks to write. This would also help later when you’ll be focusing on scaling, partitioning, load balancing and caching.

TinyURL

could be meant to serve just a few thousand users, but each could be sharing millions of URLs. It could be meant to handle millions of clicks on the shortened URLs, or dozens. The service may have to provide extensive statistics about each shortened URL (which will increase your data size), or statistics may not be a requirement at all. How many new URLs are created per month? Per second?

“Say you are one of the top url shortening websites outside the top 3”

Find out the total number of URLs shortened per month

Number of tweets per month: 15 billion. Assume between 1/20 and 1/10 of new tweets contain a shortened URL, so 750 million to 1.5 billion new URLs are shortened per month from Twitter. Take the upper limit: 1.5 billion URLs shortened per month. Twitter is one of the main drivers of URL shortening, so we can take this as the total.

Find out the url shortening serviced by us.

Use the 80/20 rule for URLs shortened outside the top 3: 20% of 1.5 billion = 300 million. Say we service one third of these requests: 100 million URLs shortened per month.

Find out the total number of requests per month. Use a 90/10 ratio for redirects vs. shortenings, which gives 900 million redirects + 100 million shortenings = 1 billion requests/month.

Find out the total number of new URLs in 5 years: 100 million/month * 60 months = 6 billion

How many characters do we need for our tinyURL?

Find the number of hash characters required for 6 billion URLs: 6 billion in base 62 needs 6 characters (62^6 ≈ 57 billion), so 6-7 characters is enough.
With a 62-character set and 7 characters, we can have 62^7 ≈ 3.5 trillion combinations (note: any number between 0 and 3.5 trillion fits in 42 bits, since 62^7 ≈ 2^42). If our service generates 1,000 tinyURLs/sec, it will take about 110 years to exhaust the key space; at 1 million/sec, it would be exhausted in about 40 days.
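To make the character math concrete, here is a small sketch of base-62 encoding a numeric counter into a key; the alphabet ordering is an arbitrary assumption.

```python
import string

# 62-character alphabet: 0-9, a-z, A-Z (ordering is an arbitrary choice)
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(n: int) -> str:
    """Encode a non-negative counter as a base-62 string, e.g. for a tinyURL key."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


# 62**6 ~= 57 billion, so 6 billion URLs fit comfortably in 6 characters.
assert len(encode_base62(6_000_000_000)) == 6
```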

Find out total storage needs
Assume 500 bytes (characters) per long URL
Assume 6 bytes per short URL
Total storage = 6 billion * (500 + 6) bytes ≈ 3 TB for long URLs + 36 GB for short URLs

Bandwidth needed: writes per second = 100 million / (30 * 24 * 3600) ≈ 40 shortening writes/sec
Data written per second: 40 * (500 + 6) bytes ≈ 20 KB/sec
Redirects per second: 900 million / (30 * 24 * 3600) ≈ 360 reads/sec, so data read per second: 360 * (500 + 6) bytes ≈ 180 KB/sec
If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80/20 rule, meaning 20% of URLs generate 80% of the traffic, we would like to cache those 20% hot URLs.
360 reads/sec * 3600 * 24 ≈ 30 million URLs read per day, of which 20% = 6 million URLs to be cached. 6 million * 500 bytes = 3 * 10^9 bytes = 3 GB of cache memory.
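The same back-of-the-envelope arithmetic as a quick script, using the assumptions above (all figures approximate):

```python
# TinyURL back-of-the-envelope estimates (numbers are the rough assumptions above).
new_urls_per_month = 100_000_000        # shortenings we service
seconds_per_month = 30 * 24 * 3600
bytes_per_entry = 500 + 6               # long URL + short key

writes_per_sec = new_urls_per_month / seconds_per_month      # ~40 shortenings/sec
reads_per_sec = 9 * writes_per_sec                           # 90/10 redirect:shorten, ~360/sec

write_bandwidth = writes_per_sec * bytes_per_entry           # ~20 KB/sec
read_bandwidth = reads_per_sec * bytes_per_entry             # ~180 KB/sec

urls_read_per_day = reads_per_sec * 24 * 3600                # ~30 million/day
cache_memory = 0.2 * urls_read_per_day * 500                 # 80/20 rule -> ~3 GB

storage_5_years = new_urls_per_month * 60 * bytes_per_entry  # ~3 TB

print(f"{writes_per_sec:.0f} writes/s, {reads_per_sec:.0f} reads/s, "
      f"{cache_memory / 1e9:.1f} GB cache, {storage_5_years / 1e12:.1f} TB storage")
```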

Twitter

How many total users are there? (answer: we expect to reach 200 million users in the first year)
How many daily active users are there? (answer: 100 million users sign in every day)
What scale is expected from the system? (e.g. number of new tweets, number of tweet views, number of timeline generations per second, etc.)
How much storage will we need? This depends on whether users can upload photos and videos in their tweets.
What network bandwidth usage are we expecting? This will be crucial in deciding how we manage traffic and balance load between servers.
Calculate how much total storage is needed for the next 5 years.

Instagram

Let’s assume we have 500M total users, with 1M daily active users.
2M new photos every day, 23 new photos every second.
Average photo file size => 200KB
Total space required for 1 day of photos
2M * 200KB => 400 GB
Total space required for 10 years: 400GB * 365 (days a year) * 10 (years) ~= 1425TB
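The photo-storage estimate as a two-line check; figures are the assumptions above, and the 10-year number lands near the ~1425 TB quoted depending on whether you use decimal or binary units.

```python
# Instagram photo storage, from the assumptions above.
photos_per_day = 2_000_000
avg_photo_bytes = 200 * 1024                      # 200 KB

per_day = photos_per_day * avg_photo_bytes        # ~400 GB/day
ten_years = per_day * 365 * 10                    # ~1.4-1.5 PB over 10 years
print(f"{per_day / 1e9:.0f} GB/day, {ten_years / 1e12:.0f} TB over 10 years")
```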

Messaging System:

How many users are we talking about here? How many messages sent? How many messages read?

Dropbox:

What operations does this data store need to support? What operations is it optimized for?

For example, you might need to determine how long it will take to generate 100 image thumbnails from disk or how much memory a data structure will take. The Powers of two table and Latency numbers every programmer should know are handy references.

| Power | Exact Value | Approx Value | Bytes |
|-------|-------------------|--------------|-------|
| 7 | 128 | | |
| 8 | 256 | | |
| 10 | 1,024 | 1 thousand | 1 KB |
| 16 | 65,536 | | 64 KB |
| 20 | 1,048,576 | 1 million | 1 MB |
| 30 | 1,073,741,824 | 1 billion | 1 GB |
| 32 | 4,294,967,296 | | 4 GB |
| 40 | 1,099,511,627,776 | 1 trillion | 1 TB |

Latency Comparison Numbers

| Operation | Time (ns) | Time | Notes |
|-----------|-----------|------|-------|
| L1 cache reference | 0.5 ns | | |
| Branch mispredict | 5 ns | | |
| L2 cache reference | 7 ns | | 14x L1 cache |
| Mutex lock/unlock | 100 ns | | |
| Main memory reference | 100 ns | | 20x L2 cache, 200x L1 cache |
| Compress 1 KB with Zippy | 10,000 ns | 10 us | |
| Send 1 KB over 1 Gbps network | 10,000 ns | 10 us | |
| Read 4 KB randomly from SSD* | 150,000 ns | 150 us | ~1 GB/sec SSD |
| Read 1 MB sequentially from memory | 250,000 ns | 250 us | |
| Round trip within same datacenter | 500,000 ns | 500 us | |
| Read 1 MB sequentially from SSD* | 1,000,000 ns | 1 ms | ~1 GB/sec SSD, 4x memory |
| Disk seek | 10,000,000 ns | 10 ms | 20x datacenter round trip |
| Read 1 MB sequentially from 1 Gbps network | 10,000,000 ns | 10 ms | 40x memory, 10x SSD |
| Read 1 MB sequentially from disk | 30,000,000 ns | 30 ms | 120x memory, 30x SSD |
| Send packet CA->Netherlands->CA | 150,000,000 ns | 150 ms | |

Notes

Handy metrics based on the numbers above:

Read sequentially from disk at ~30 MB/s
Read sequentially from a 1 Gbps network at ~100 MB/s
Read sequentially from SSD at ~1 GB/s
Read sequentially from memory at ~4 GB/s
~6-7 world-wide round trips per second
~2,000 round trips per second within a datacenter
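Those throughput figures fall straight out of the table; a tiny check, assuming decimal units:

```python
# Deriving the handy metrics from the latency table (decimal units).
MB = 10**6
print(MB / 30e-3)    # sequential disk read:      ~33 MB/s   (1 MB in 30 ms)
print(MB / 10e-3)    # 1 Gbps network read:       ~100 MB/s  (1 MB in 10 ms)
print(MB / 1e-3)     # sequential SSD read:       ~1 GB/s    (1 MB in 1 ms)
print(MB / 250e-6)   # sequential memory read:    ~4 GB/s    (1 MB in 250 us)
print(1 / 150e-3)    # world-wide round trips:    ~6-7 per second
print(1 / 500e-6)    # in-datacenter round trips: ~2,000 per second
```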


Find out the desired properties of the system

Messaging

What are the latency requirements for sender -> receiver message delivery?
How are you going to store messages?
How do you push new messages to clients? Do you push at all, or rely on a pull-based model?

Instagram:

Highly available
Reliable: No photo or video uploaded should ever be lost.
Consistency can take a hit in the interest of availability: it is fine if some users don’t see a photo for a while
The system would be read-heavy.
Focus on building a system that can retrieve photos quickly. Low latency on reads is expected while viewing photos. Acceptable latency is 200 ms for News Feed generation.


Part II: High-Level Architecture

Draw a block diagram with 5–6 boxes representing core components of your system.

Identify enough components to solve the actual problem end-to-end. While drawing, comment on each component, the need for it, and the choices you can make at that level. Visualize an abstract design based on the constraints clarified in the previous step; this gives you a chance to consolidate all the information and reaffirm your design choices. Focus on the basic components of the system and the relationships between them. You should be able to identify the various entities of the system, how they will interact with each other, and the different aspects of data management such as storage, transfer, and encryption.

Twitter

At a high level, we would need multiple application servers to serve all the read/write requests, with load balancers in front of them for traffic distribution. If we’re assuming we’ll have a lot more read traffic than write traffic, we can decide to have separate servers for handling reads vs. writes. On the backend, we need an efficient database that can store all the tweets and support a huge number of reads. We will also need distributed file storage for photos (and videos), and a search index and infrastructure to enable searching of tweets.

TinyUrl

An Application Layer and a Storage Layer

Instagram:

Use an object storage service to store photos (Spotify uses object storage for songs, Facebook for photos, Dropbox for files). We can use distributed file storage like HDFS, or S3.


Part III: Discuss detailed design for selected components

The interviewer’s feedback should guide you towards which parts of the system they want you to explain further. Sometimes I felt the interviewer was not interested in a particular aspect of the system being designed, like API design or estimations; but sometimes they are. Keep this in mind and don’t spend a lot of time on an aspect the interviewer is not interested in, but try to touch on as many aspects as possible. At each level, ask your interviewer for specifications (should you suggest a simple starting point, or talk about what a mature system might look like?) and talk about several options (applying the ideas from your reading). Discussing tradeoffs in your design is key. Your interviewer cares less about whether your design is good in itself, and more about whether you are able to talk about the trade-offs (positives and negatives) of your decisions. Practice this.


Low-level design of the data model (schema or class diagram)

Write the specific APIs for each component. You may need to finish the detailed OOD design for a particular function. You may also need to design the database schema. Although it’s not required early on, this will clarify how data flows among the different components of the system and will later guide you towards data partitioning.

Twitter:

User: UserID, Name, Email, DoB, CreationDate, LastLogin
Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp
UserFollows: UserID1, UserID2
FavoriteTweets: UserID, TweetID, TimeStamp
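One way to sketch this data model in code; the field types (and the author field on Tweet) are assumptions for illustration:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class User:
    user_id: int
    name: str
    email: str
    dob: datetime
    creation_date: datetime
    last_login: datetime


@dataclass
class Tweet:
    tweet_id: int
    user_id: int            # author: assumed field, implied by "who can post a tweet"
    content: str
    tweet_location: str
    number_of_likes: int
    timestamp: datetime


@dataclass
class UserFollows:
    user_id1: int           # follower
    user_id2: int           # followee


@dataclass
class FavoriteTweet:
    user_id: int
    tweet_id: int
    timestamp: datetime
```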


Example: Application Layer for TinyURL


Example: Storage Layer for Instagram

Schema

Storage Size Estimation

User: Assuming each “int” and “dateTime” is four bytes, each row in the User’s table will be of 68 bytes: UserID (4 bytes) + Name (20 bytes) + Email (32 bytes) + DateOfBirth (4 bytes) + CreationDate (4 bytes) + LastLogin (4 bytes) = 68 bytes.

If we have 500 million users, we will need 32GB of total storage: 500 million * 68 bytes ~= 32GB

Photo: Each row in Photo’s table will be of 284 bytes: PhotoID (4 bytes) + UserID (4 bytes) + PhotoPath (256 bytes) + PhotoLatitude (4 bytes) + PhotoLongitude (4 bytes) + UserLatitude (4 bytes) + UserLongitude (4 bytes) + CreationDate (4 bytes) = 284 bytes

If 2M new photos get uploaded every day, we will need 0.5GB of storage for one day: 2M * 284 bytes ~= 0.5GB per day. For 10 years we will need 1.88TB of storage.

UserFollow: Each row in the UserFollow table will be of 8 bytes.

If we have 500 million users and on average each user follows 500 users, we will need 1.82TB of storage for the UserFollow table: 500 million users * 500 followers * 8 bytes ~= 1.82TB.

Total space required for all tables for 10 years will be 3.7TB: 32GB + 1.88TB + 1.82TB ~= 3.7TB
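The same metadata-storage arithmetic as a script (decimal units; the text above quotes slightly smaller totals because it rounds and uses binary TB):

```python
# Instagram metadata storage estimate, using the row sizes above.
users = 500_000_000
user_row_bytes = 68
photos_per_day = 2_000_000
photo_row_bytes = 284
follows_per_user = 500
follow_row_bytes = 8

user_storage = users * user_row_bytes                          # ~34 GB
photo_storage = photos_per_day * photo_row_bytes * 365 * 10    # ~2.1 TB over 10 years
follow_storage = users * follows_per_user * follow_row_bytes   # ~2.0 TB

total = user_storage + photo_storage + follow_storage          # ~4 TB (quoted above as ~3.7 TB)
print(f"{total / 1e12:.1f} TB")
```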

Sharding

To handle such a large dataset, we can shard the data based on UserID, so that all photos of a user live on the same shard. If one DB shard is 1TB, we will need four shards to store 3.7TB of data; let’s assume we keep 10 shards for better performance and scalability. We find the shard number by UserID % 10 and store the data there. To uniquely identify any photo in our system, we can append the shard number to each PhotoID. How can we generate PhotoIDs? Each DB shard can have its own auto-increment sequence for PhotoIDs, and since we append the ShardID to each PhotoID, it will be unique throughout the system.
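A sketch of that sharding scheme; the helper names and the ID format are assumptions:

```python
# UserID-based sharding, as described above (helper names are illustrative).
NUM_SHARDS = 10


def shard_for_user(user_id: int) -> int:
    """All of a user's photos live on the shard selected by their UserID."""
    return user_id % NUM_SHARDS


def make_global_photo_id(shard_id: int, local_photo_id: int) -> str:
    """Append the shard number to the shard-local auto-increment PhotoID
    so the combined ID is unique across the whole system."""
    return f"{local_photo_id}-{shard_id}"
```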

Cons of this scheme:
* How would we handle hot users? Many people follow such hot users, and any photo they upload is seen by a lot of other people.
* Some users will have a lot of photos compared to others, thus making a non-uniform distribution of storage. Sharding your data based on user IDs will always result in overloaded partitions for celebrity users.
* What if we cannot store all pictures of a user on one shard? If we distribute photos of a user onto multiple shards, will it cause higher latencies?
* Storing all photos of a user on one shard can cause issues like unavailability of all of the user’s data if that shard is down, or higher latency if it is serving a high load, etc.


Example: Storage Layer for TinyURL

Purging or DB Cleanup:

Storage Estimation


Part IV: Identify and resolve bottlenecks

Provide different approaches, their pros and cons, and explain why you would choose one. A good solution compares and contrasts different approaches. It explores the tradeoffs present in any complex engineering problem and makes intelligent, reasoned decisions about those tradeoffs. Try to discuss as many bottlenecks as possible and different approaches to mitigate them.

Example (Twitter): How would we handle high-traffic users, e.g. celebrities who have millions of followers?


Scale out or up


High Load on any component? Do load balancing

What is load balancing (NGINX, HAProxy)?


Limited Data Storage? Shard


Limited Read/Write Throughput? Replicate or Shard

How to deal with increasing loads on database?

Instagram: Uploading users can consume all the available connections, as uploading is a slow process. This means that reads cannot be served if the system gets busy with write requests. Web servers have a connection limit, so we should keep this in mind when designing the system. If we assume a web server can hold a maximum of 500 connections at any time, it can’t have more than 500 concurrent uploads or reads. To handle this bottleneck we can split reads and writes into separate services: dedicated servers for reads and different servers for writes, so that uploads don’t hog the system. Separating photo read and write requests also allows us to scale and optimize each of these operations independently.


Bad Latency? Cache

Caching helps more than it hurts, even when unsophisticated caching solutions are used.

Centralized caching vs distributed caching (each node has its own cache)

We can cache URLs that are frequently accessed. We can use some off-the-shelf solution like Memcache, which can store full URLs with their respective hashes. The application servers, before hitting backend storage, can quickly check if the cache has the desired URL.

How much cache should we have?
We can start with 20% of daily traffic and, based on clients’ usage patterns, adjust how many cache servers we need. As estimated above, we need about 3 GB of memory to cache 20% of the daily traffic. Since a modern-day server can have 256 GB of memory, we can easily fit all the cache into one machine. Alternatively, we can use a couple of smaller servers to store all these hot URLs.

Which cache eviction policy would best fit our needs?
When the cache is full, and we want to replace a link with a newer/hotter URL, how would we choose? Least Recently Used (LRU) can be a reasonable policy for our system. Under this policy, we discard the least recently used URL first. We can use a Linked Hash Map or a similar data structure to store our URLs and Hashes, which will also keep track of which URLs are accessed recently. To further increase the efficiency, we can replicate our caching servers to distribute load between them.
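A minimal in-process sketch of that LRU policy; in practice you would rely on Memcache/Redis eviction, this just illustrates the bookkeeping:

```python
from collections import OrderedDict
from typing import Optional


# Toy LRU cache for hot tinyURL -> longURL mappings.
class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data: "OrderedDict[str, str]" = OrderedDict()

    def get(self, tiny_url: str) -> Optional[str]:
        if tiny_url not in self._data:
            return None
        self._data.move_to_end(tiny_url)       # mark as most recently used
        return self._data[tiny_url]

    def put(self, tiny_url: str, long_url: str) -> None:
        if tiny_url in self._data:
            self._data.move_to_end(tiny_url)
        self._data[tiny_url] = long_url
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)     # evict the least recently used entry
```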

How can each cache replica be updated?
Whenever there is a cache miss, our servers would be hitting a backend database. Whenever this happens, we can update the cache and pass the new entry to all the cache replicas. Each replica can update their cache by adding the new entry. If a replica already has that entry, it can simply ignore it.

Since a user’s timeline will contain the most recent (and most relevant) tweets, should we store our data in a way that is optimized for scanning the latest tweets?

Our service would need a massive-scale photo delivery system to serve the globally distributed users. Our service should push its content closer to the user using a large number of geographically distributed photo cache servers and use CDNs (for details see Caching).

We can introduce a cache for the metadata servers to cache hot database rows. We can use Memcache to cache the data; application servers, before hitting the database, can quickly check whether the cache has the desired rows. Least Recently Used (LRU) can be a reasonable cache eviction policy for our system. Under this policy, we discard the least recently viewed row first.


Single Points of Failure? Replicate


Bad Latency? Do Asynchronous Workflows

We could pre-generate tinyURL keys and store them ahead of time: a standalone Key Generation Service (KGS) generates random six-letter strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want to shorten a URL, we just take one of the already-generated keys and use it. This approach makes things quite simple and fast: not only do we avoid encoding the URL, we also don’t have to worry about duplications or collisions. KGS makes sure all the keys inserted into key-DB are unique.

Can concurrency cause problems? As soon as a key is used, it should be marked in the database to ensure it doesn’t get used again. If multiple servers are reading keys concurrently, two or more servers might try to read the same key from the database. How can we solve this concurrency problem?
Servers can use KGS to read/mark keys in the database. KGS can use two tables to store keys: one for keys that are not used yet, and one for all the used keys. As soon as KGS gives keys to one of the servers, it can move them to the used keys table. KGS can always keep some keys in memory so that it can quickly provide them whenever a server needs them.
For simplicity, as soon as KGS loads some keys in memory, it can move them to the used keys table. This ensures each server gets unique keys. If KGS dies before assigning all the loaded keys to some server, we will be wasting those keys, which is acceptable given the huge number of keys we have.
KGS also has to make sure not to give the same key to multiple servers. For that, it must synchronize (or take a lock on) the data structure holding the keys before removing keys from it and giving them to a server.
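A toy sketch of that hand-out logic with in-memory stand-ins for the unused/used key tables; real storage, replication, and failure handling are omitted:

```python
import threading


class KeyGenerationService:
    """Hands out pre-generated keys, never the same key twice (toy version)."""

    def __init__(self, pregenerated_keys: list):
        self._unused = list(pregenerated_keys)   # stand-in for the "unused keys" table
        self._used = set()                       # stand-in for the "used keys" table
        self._lock = threading.Lock()            # serialises concurrent app servers

    def take_keys(self, n: int) -> list:
        """Give a batch of keys to an app server and mark them used immediately,
        so no other server can ever receive the same key."""
        with self._lock:
            batch = self._unused[:n]
            del self._unused[:n]
            self._used.update(batch)
        return batch
```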

Can each app server cache some keys from key-DB?
Yes, this can surely speed things up. Although in this case, if the application server dies before consuming all the keys, we will end up losing those keys. This could be acceptable since we have roughly 57 billion unique six-letter keys (62^6).

How would we perform a key lookup?
We can look up the key in our database or key-value store to get the full URL. If it’s present, issue an “HTTP 302 Redirect” status back to the browser, passing the stored URL in the “Location” field of the response. If the key is not present in our system, issue an “HTTP 404 Not Found” status, or redirect the user back to the homepage.
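The lookup itself is tiny; a framework-free sketch where url_store stands in for whatever cache-then-database lookup we settle on:

```python
from http import HTTPStatus


def handle_redirect(tiny_key: str, url_store: dict):
    """Resolve a tiny key; return (status, headers) for the HTTP response."""
    long_url = url_store.get(tiny_key)
    if long_url is None:
        return HTTPStatus.NOT_FOUND, {}                  # 404, or send the user to the homepage
    return HTTPStatus.FOUND, {"Location": long_url}      # 302 redirect with the Location header
```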

Should we impose size limits on custom aliases?
Our service supports custom aliases. Users can pick any ‘key’ they like, but providing a custom alias is not mandatory. However, it is reasonable (and often desirable) to impose a size limit on a custom alias to ensure we have a consistent URL database. Let’s assume users can specify a maximum of 16 characters per custom key (as reflected in the above database schema).

Content delivery networks (CDNs) to reduce latency

for YouTube, Netflix, Instagram, Twitter


Practice

The best way to practice for system design interviews is to actually sit down and design a system, i.e. your day-to-day work. Instead of doing the minimal work, go deeper into the tools, frameworks, and libraries you use. For example, if you use HBase, rather than simply using the client to run some DDL and do some fetches, try to understand its overall architecture, such as the read/write flow, how HBase ensures strong consistency, what minor/major compactions do, and where LRU cache and Bloom Filter are used in the system. You can even compare HBase with Cassandra and see the similarities and differences in their design. Then when you are asked to design a distributed key-value store, you won’t feel ambushed.


References

27 October 2018