TLDR: use clusters; use caching; use sharding and replication; use redundancy; use parallelisation (multi-threading, load balancing, etc.); use async workflows; avoid points of contention and synchronisation issues.
Drive the discussion. Part of the signal the interviewer hopes to gather is whether you’ve learned how to build large systems through hard experience. Your ability to anticipate and work around typical problems is part of that signal. Make it a conversation: Be sure to ask clarifying questions. But make sure you drive towards a good solution.
Do not make any assumptions. Understand the constraints involved. Find the exact scope of the problem you are solving. Your interviewer is interested in understanding your thought processes.
How can we repeatedly find a string in a large dataset?
An eager candidate might immediately jump to the conclusion that having a large dataset searched repeatedly means it must be sorted and then binary-searched in log(n), and will start writing the code for mergesort on the whiteboard. All while the interviewer is thinking of a large document corpus and trying to get them to design a search engine.
Unless you ask some clarifying questions, like: “How big is the data set?”, “What are its characteristics?”, “How do we obtain it?”, you might not realize what you’re up against.
A use case is a description of a sequence of events that, taken together, lead to the system doing something useful. It captures who is going to use the system and how they are going to use it. The system may be very simple or very complicated.
Design a url shortening service
Shortening: createTiny(longURL) -> tinyURL
Redirection: getLong(tinyURL) -> longURL
Custom URL definition, if available
Analytics on url usage
Automatic link expiry
Manual removal of a URL you shortened
Who can post a tweet? (answer: any user)
Who can read the tweet? (answer: any user — as all tweets are public)
Will a tweet contain photos or videos? (answer: for now, just photos)
Can a user follow another user? (answer: yes).
Can a user ‘like’ a tweet? (answer: yes).
Can a user search for tweets (answer: yes).
What gets included in the user feed? (answer: tweets from everyone whom you are following)
Is the feed a list of tweets in chronological order? (answer: for now, yes).
Users can upload/download and view photos
Users can follow other users
Users can see a feed of top photos from other users they follow (the system should be able to generate and display a user’s News Feed consisting of top photos from all the people the user follows).
Users can search for photos based on titles.
Optional: Adding tags, searching based on tags, commenting, suggestions, tagging users to photos etc.
a messaging system
Is it reasonable to assume that the messages can be sent by all users?
Does the user need to send messages to one or multiple persons?
Does the user need read receipts?
Is it reasonable to assume that the recipients need to be notified?
Does the user need to see their message history?
an email server.
How do you store mail, especially as the system gets large enough that it won’t fit on one machine?
How do you handle mailing lists with large numbers of recipients?
How do you handle people abusing the system for spam?
a client-server API to build a rich document editor.
How does the client request data on the document from the server, especially as the document gets large enough that we wouldn’t want to download it in a single request?
How do we represent the rich document aspects like bold and italics in our API response?
How do we design the system so that new features can be added on the server without breaking older clients?
This would not only establish the exact contract expected from the system but would also confirm that you haven’t gotten any requirements wrong.
TinyUrl.
CreateTiny(longURL) -> tinyURL
Redirection: getLong(tinyURL) -> longURL
Twitter.
postTweet(user_id, tweet_text, image_url, user_location, timestamp, …)
generateTimeline(user_id, current_time)
searchTweet(string)
recordUserTweetLike(user_id, tweet_id, timestamp, …)
Messaging
sendMessage(sender, recipient, message)
getAllConversations(user)
getConversation(user, recipient)
searchMessages(searchString)
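To make the contract concrete, here is a rough Python sketch of the messaging APIs above; the method names mirror the list, while the `Message` dataclass and the parameter types are illustrative assumptions, not a fixed design.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    """Illustrative message record; real systems would add ids, media, read state, etc."""
    sender: str
    recipient: str
    text: str
    timestamp: float


class MessagingService:
    def send_message(self, sender: str, recipient: str, text: str) -> Message:
        """Persist the message and notify the recipient."""
        raise NotImplementedError

    def get_all_conversations(self, user: str) -> List[str]:
        """Return the ids of every conversation the user participates in."""
        raise NotImplementedError

    def get_conversation(self, user: str, recipient: str) -> List[Message]:
        """Return the message history between two users, oldest first."""
        raise NotImplementedError

    def search_messages(self, user: str, query: str) -> List[Message]:
        """Full-text search over the user's messages."""
        raise NotImplementedError
```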
List the scale of the system such as requests per second, requests types, data written per second, data read per second.
This is important because it helps you narrow down the list of possible solutions to only the ones that are feasible. Then you have only a few prototypes or micro-benchmarks to write. This would also help later when you’ll be focusing on scaling, partitioning, load balancing and caching.
TinyURL
could be meant to serve just a few thousand users, but each could be sharing millions of URLs. It could be meant to handle millions of clicks on the shortened URLs, or dozens. The service may have to provide extensive statistics about each shortened URL (which will increase your data size), or statistics may not be a requirement at all. How many millions of new URLs are created per month (or per second)?
“Say you are one of the top url shortening websites outside the top 3”
Find out the total number of URLs shortened per month.
Number of tweets/month: 15 billion. Assume between 1 in 20 and 1 in 10 new tweets carry a shortened URL, so 750 million to 1.5 billion new URLs are shortened per month from Twitter. Take the upper limit: 1.5 billion URLs shortened per month. Twitter is one of the main drivers of URL shortening, so we can take this as the total.
Find out the number of URL shortenings serviced by us.
Use the 80/20 rule for URLs shortened outside the top 3: 300 million. Say we service one third of these requests: 100 million.
Find out the total number of requests per month. Use a 90/10 rule for the redirect/shortening split, which gives about 1 billion requests/month.
Find out the total number of new URLs in 5 years: 100 million/month × 60 months = 6 billion.
How many characters do we need for our tinyURL?
Find out the number of hash characters required for 6 billion URLs: 6 billion expressed in base 62 needs 6–7 characters.
With a 62-character set and 7 characters, we can have 62^7 ≈ 3.5 trillion combinations. (Note: any number in the range 0–3.5 trillion can be represented using about 42 bits, since 62^7 ≈ 2^42.) If our service is generating 1,000 tinyURLs/sec, it will take about 110 years to exhaust this space. If our service is generating 1 million/sec, it will be exhausted in about 40 days. (Do the math.)
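A minimal base-62 encoding sketch backing the capacity math above; the alphabet ordering is an arbitrary choice.

```python
import string

# 62-character alphabet: 0-9, a-z, A-Z (the exact ordering is an arbitrary choice).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(n: int) -> str:
    """Encode a non-negative integer as a base-62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


# Capacity check from the text: 7 characters give 62**7 ~ 3.5 trillion keys,
# and an id space of 6 billion fits in at most 6 characters.
assert 62 ** 7 > 3_500_000_000_000
print(len(encode_base62(6_000_000_000 - 1)))  # -> 6
```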
Find out total storage needs
Assume 500 bytes for a long URL.
Assume 6 bytes for a short URL.
Total storage = 6 billion × (500 + 6) bytes ≈ 3 TB + 36 GB.
Bandwidth needed: find out writes per second = 100 million / (30 × 24 × 3600) ≈ 40 shortening writes/sec.
Data written per sec: 40 × (500 + 6) bytes ≈ 20 KB.
Data read per sec: 360 × (500 + 6) bytes ≈ 180 KB.
If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80-20 rule, meaning 20% of URLs generate 80% of traffic, we would like to cache these 20% hot URLs. 360 × 3600 × 24 ≈ 30 million URLs read per day, so about 6 million URLs to be cached. 6 million × 500 bytes = 3 × 10^9 bytes = 3 GB.
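The same back-of-the-envelope arithmetic, spelled out in a few lines of Python (all inputs are the assumptions listed above: 100M new URLs/month, a 90/10 read/write split, 500-byte long URLs, 6-byte keys, 80/20 caching rule).

```python
# Back-of-the-envelope estimation for the TinyURL numbers above.
SECONDS_PER_MONTH = 30 * 24 * 3600

new_urls_per_month = 100_000_000
writes_per_sec = new_urls_per_month / SECONDS_PER_MONTH   # ~40
reads_per_sec = 9 * writes_per_sec                        # ~360 (90/10 split)
bytes_per_url = 500 + 6

write_bw = writes_per_sec * bytes_per_url                 # ~20 KB/s
read_bw = reads_per_sec * bytes_per_url                   # ~180 KB/s

urls_read_per_day = reads_per_sec * 24 * 3600             # ~30 million
cache_size = 0.2 * urls_read_per_day * 500                # ~3 GB (80/20 rule)

print(f"{writes_per_sec:.0f} writes/s, {write_bw / 1e3:.0f} KB/s written")
print(f"{reads_per_sec:.0f} reads/s, {read_bw / 1e3:.0f} KB/s read")
print(f"cache ~{cache_size / 1e9:.1f} GB")
```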
How many total users are there? (answer: we expect to reach 200 million users in the first year)
How many daily active users are there? (100 million users sign in every day)
What scale is expected from the system? (e.g., number of new tweets, number of tweet views, how many timeline generations per second, etc.)
How much storage would we need? This will depend on whether users can upload photos and videos in their tweets.
What network bandwidth usage are we expecting? This will be crucial in deciding how we manage traffic and balance load between servers.
Calculate how much total storage is needed for the next 5 years.
Let’s assume we have 500M total users, with 1M daily active users.
2M new photos every day, 23 new photos every second.
Average photo file size => 200KB
Total space required for 1 day of photos
2M * 200KB => 400 GB
Total space required for 10 years: 400GB * 365 (days a year) * 10 (years) ~= 1425TB
Messaging System:
How many users are we talking about here? How many messages sent? How many messages read?
Dropbox:
What operations does this data store need to support? What operations is it optimized for?
For example, you might need to determine how long it will take to generate 100 image thumbnails from disk or how much memory a data structure will take. The Powers of two table and Latency numbers every programmer should know are handy references.
Power | Exact Value | Approx Value | Bytes
---|---|---|---
7 | 128 | |
8 | 256 | |
10 | 1,024 | 1 thousand | 1 KB
16 | 65,536 | | 64 KB
20 | 1,048,576 | 1 million | 1 MB
30 | 1,073,741,824 | 1 billion | 1 GB
32 | 4,294,967,296 | | 4 GB
40 | 1,099,511,627,776 | 1 trillion | 1 TB
Latency Comparison Numbers
Operation | Latency (ns) | Latency (µs) | Latency (ms) | Notes
---|---|---|---|---
L1 cache reference | 0.5 | | |
Branch mispredict | 5 | | |
L2 cache reference | 7 | | | 14x L1 cache
Mutex lock/unlock | 100 | | |
Main memory reference | 100 | | | 20x L2 cache, 200x L1 cache
Compress 1 KB with Zippy | 10,000 | 10 | |
Send 1 KB over 1 Gbps network | 10,000 | 10 | |
Read 4 KB randomly from SSD* | 150,000 | 150 | | ~1 GB/sec SSD
Read 1 MB sequentially from memory | 250,000 | 250 | |
Round trip within same datacenter | 500,000 | 500 | |
Read 1 MB sequentially from SSD* | 1,000,000 | 1,000 | 1 | ~1 GB/sec SSD, 4x memory
Disk seek | 10,000,000 | 10,000 | 10 | 20x datacenter roundtrip
Read 1 MB sequentially from 1 Gbps network | 10,000,000 | 10,000 | 10 | 40x memory, 10x SSD
Read 1 MB sequentially from disk | 30,000,000 | 30,000 | 30 | 120x memory, 30x SSD
Send packet CA->Netherlands->CA | 150,000,000 | 150,000 | 150 |
Handy metrics based on the numbers above:
Read sequentially from disk at ~30 MB/s
Read sequentially from a 1 Gbps network at ~100 MB/s
Read sequentially from SSD at ~1 GB/s
Read sequentially from main memory at ~4 GB/s
~6–7 worldwide round trips per second
~2,000 round trips per second within a datacenter
Messaging
What are the latency requirements for sender -> receiver message delivery?
How are you going to store messages?
How do you push new messages to clients? Do you push at all, or rely on a pull based model?
Instagram:
Highly available
Reliable: No photo or video uploaded should ever be lost.
Consistency can take a hit in the interest of availability: it is fine if some users don’t see the photo for a while.
The system would be read-heavy.
Focus on building a system that can retrieve photos quickly. Low latency on reads is expected while viewing photos. Acceptable latency is 200ms for News Feed Generation
Draw a block diagram with 5–6 boxes representing core components of your system.
You should identify enough components to solve the actual problem end-to-end. While drawing, comment on each component, the need for it, and the choices you can make at that level. Visualize an abstract design: this should be based on the constraints clarified in the previous step. It gives you a chance to consolidate all the information and reaffirm your design choices. Be sure to focus on the basic components of the system and the relationships between them. The candidate should be able to identify the various entities of the system, how they will interact with each other, and the different aspects of data management such as storage, transfer, and encryption.
At a high level, we would need multiple application servers to serve all the read/write requests, with load balancers in front of them for traffic distribution. If we’re assuming that we’ll have a lot more read traffic (as compared to write), we can decide to have separate servers for handling reads vs. writes. On the backend, we need an efficient database that can store all the tweets and can support a huge number of reads. We would also need a distributed file storage system for storing photos (and videos) and a search index and infrastructure to enable searching of tweets.
TinyUrl
An Application Layer and a Storage Layer
Instagram:
Use an object storage service to store photos (Spotify uses object storage for songs, Facebook for photos, Dropbox for files). We can use distributed file storage like HDFS or S3.
The interviewer’s feedback should always guide you towards which parts of the system they want you to explain further. Sometimes the interviewer is not interested in a particular aspect of the system being designed, like API design or estimations; sometimes they are. Keep this in mind and don’t spend a lot of time on an aspect the interviewer is not interested in, but still try to discuss as many aspects as possible. At each level, ask your interviewer for specifications (should you suggest a simple starting point, or talk about what a mature system might look like?) and talk about several options (applying the ideas from your reading). Discussing trade-offs in your design is key. Your interviewer cares less about whether your design is good in itself, and more about whether you are able to talk about the trade-offs (positives and negatives) of your decisions. Practice this.
For each component, write the specific APIs for each component. You may need to finish the detailed OOD design for a particular function. You may also need to design the database schema for the database. Although it’s not required early on, this will clarify how data will flow among different components of the system and later will also guide you towards data partitioning.
Twitter:
User: UserID, Name, Email, DoB, CreationDate, LastLogin
Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp
UserFollows: UserID1, UserID2
FavoriteTweets: UserID, TweetID, TimeStamp
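One way to pin down the field names and types is to write the schema above as plain dataclasses; this is only a sketch, and the choice of storage engine (SQL vs. NoSQL) is a separate decision.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class User:
    user_id: int
    name: str
    email: str
    dob: date
    creation_date: datetime
    last_login: datetime


@dataclass
class Tweet:
    tweet_id: int
    content: str
    tweet_location: str
    number_of_likes: int
    timestamp: datetime


@dataclass
class UserFollows:
    user_id1: int
    user_id2: int


@dataclass
class FavoriteTweet:
    user_id: int
    tweet_id: int
    timestamp: datetime
```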
Maintain an internal counter for each host. Assign a unique 6 bits to each host. Concatenate host + timestamp + random/incremental value.
There is a high probability of collision if the number of requests is 1,000/sec: every worker handles about 20 req/sec for which the timestamp and worker address are the same, leaving only 32 slots for the random/incremental part. Either use more than 43 bits to store the tinyURL, or use only part of the timestamp and widen the randomness/counter portion.
The service does a get request against the database for the long URL and, if it is not available, does a write.
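A sketch of the host + timestamp + counter idea; the bit widths are illustrative (the counter is widened beyond 5 bits, per the mitigation above), and the resulting integer would then be base-62 encoded into the short key.

```python
import itertools
import time

HOST_BITS = 6        # up to 64 workers, as in the scheme above
TIMESTAMP_BITS = 32  # current second since the epoch
COUNTER_BITS = 10    # widened beyond 5 bits so ~20 req/sec per worker do not collide

_counter = itertools.count()


def generate_key(host_id: int) -> int:
    """Pack host id, current second, and a per-host counter into one integer key."""
    ts = int(time.time()) & ((1 << TIMESTAMP_BITS) - 1)
    ctr = next(_counter) & ((1 << COUNTER_BITS) - 1)
    return (host_id << (TIMESTAMP_BITS + COUNTER_BITS)) | (ts << COUNTER_BITS) | ctr


# The resulting 48-bit integer base62-encodes to an 8-9 character short key,
# trading a slightly longer key for a much lower collision risk.
print(generate_key(host_id=5))
```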
We can store photos in a distributed file storage like HDFS or S3.
We can store the above schema in a distributed key-value store to enjoy the benefits offered by NoSQL. All the metadata related to photos can go to a table, where the ‘key’ would be the ‘PhotoID’ and the ‘value’ would be an object containing PhotoLocation, UserLocation, CreationTimestamp, etc.
We need to store relationships between users and photos, to know who owns which photo. We also need to store the list of people a user follows. For both of these tables, we can use a wide-column datastore like Cassandra. For the ‘UserPhoto’ table, the ‘key’ would be ‘UserID’ and the ‘value’ would be the list of ‘PhotoIDs’ the user owns, stored in different columns. We will have a similar scheme for the ‘UserFollow’ table. Cassandra, like key-value stores in general, always maintains a certain number of replicas to offer reliability. Also, in such data stores, deletes don’t get applied instantly; data is retained for a certain number of days (to support undeleting) before being removed from the system permanently.
User: Assuming each “int” and “dateTime” is four bytes, each row in the User’s table will be of 68 bytes: UserID (4 bytes) + Name (20 bytes) + Email (32 bytes) + DateOfBirth (4 bytes) + CreationDate (4 bytes) + LastLogin (4 bytes) = 68 bytes.
If we have 500 million users, we will need 32GB of total storage 500 million * 68 ~= 32GB
Photo: Each row in the Photo table will be of 284 bytes: PhotoID (4 bytes) + UserID (4 bytes) + PhotoPath (256 bytes) + PhotoLatitude (4 bytes) + PhotoLongitude (4 bytes) + UserLatitude (4 bytes) + UserLongitude (4 bytes) + CreationDate (4 bytes) = 284 bytes.
If 2M new photos get uploaded every day, we will need 0.5GB of storage for one day: 2M * 284 bytes ~= 0.5GB per day. For 10 years we will need 1.88TB of storage.
UserFollow: Each row in the UserFollow table will be of 8 bytes.
If we have 500 million users and on average each user follows 500 users, we would need 1.82TB of storage for the UserFollow table: 500 million users * 500 followers * 8 bytes ~= 1.82TB.
Total space required for all tables for 10 years will be 3.7TB: 32GB + 1.88TB + 1.82TB ~= 3.7TB
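The same storage arithmetic, spelled out with the row sizes assumed above:

```python
# Row sizes from the assumptions above.
user_row = 4 + 20 + 32 + 4 + 4 + 4            # 68 bytes
photo_row = 4 + 4 + 256 + 4 + 4 + 4 + 4 + 4   # 284 bytes
follow_row = 8                                 # two 4-byte user ids

users = 500_000_000
photos_per_day = 2_000_000
days_10_years = 365 * 10
follows_per_user = 500

user_table = users * user_row                          # ~32 GB
photo_table = photos_per_day * days_10_years * photo_row  # ~1.88 TB over 10 years
follow_table = users * follows_per_user * follow_row   # ~1.82 TB

total = user_table + photo_table + follow_table        # ~3.7 TB
print(f"{total / 2**40:.2f} TiB total")
```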
To handle such large data, we can shard based on user ID: let’s assume we shard on ‘UserID’ so that we can keep all photos of a user on the same shard. If one DB shard is 1TB, we will need four shards to store 3.7TB of data. Let’s assume that for better performance and scalability we keep 10 shards. So we’ll find the shard number by UserID % 10 and then store the data there. To uniquely identify any photo in our system, we can append the shard number to each PhotoID. How can we generate PhotoIDs? Each DB shard can have its own auto-increment sequence for PhotoIDs, and since we append the ShardID to each PhotoID, it will be unique throughout our system.
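A sketch of this sharding scheme; the exact way the shard id is combined with the PhotoID (suffix and separator here) is an arbitrary choice.

```python
NUM_SHARDS = 10  # from the assumption above


def shard_for_user(user_id: int) -> int:
    """All of a user's photos live on the shard chosen by UserID % 10."""
    return user_id % NUM_SHARDS


def make_photo_id(shard_id: int, local_seq: int) -> str:
    """Combine the shard's auto-increment sequence with the shard id so the
    PhotoID is unique across the whole system."""
    return f"{local_seq}-{shard_id}"


def shard_for_photo(photo_id: str) -> int:
    """Recover the owning shard from a PhotoID without any lookup."""
    return int(photo_id.rsplit("-", 1)[1])


shard = shard_for_user(123456)          # -> 6
photo_id = make_photo_id(shard, 98765)  # -> "98765-6"
assert shard_for_photo(photo_id) == shard
```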
Cons of this scheme:
* How would we handle hot users? Many people follow such hot users, and a lot of other people see any photo they upload.
* Some users will have a lot of photos compared to others, thus making a non-uniform distribution of storage. Sharding your data based on user IDs will always result in overloaded partitions for celebrity users.
* What if we cannot store all pictures of a user on one shard? If we distribute photos of a user onto multiple shards, will it cause higher latencies?
* Storing all photos of a user on one shard can cause issues like unavailability of all of the user’s data if that shard is down, or higher latency if it is serving a high load, etc.
Provide different approaches, their pros and cons, and explain why you would choose one. A good solution compares and contrasts different approaches. It explores the trade-offs present in any complex engineering problem and makes intelligent, reasoned decisions about those trade-offs. Try to discuss as many bottlenecks as possible and different approaches to mitigate them.
Example: Twitter. How would we handle high-traffic users, e.g. celebrities who have millions of followers?
What is load balancing (NGINX, HAProxy)?
Routing strategy:
How to deal with increasing load on the database?
Segregate writes from reads: identify write-intensive data that could be moved out of the relational database altogether. If it’s a structured dataset which is only ever stored and retrieved by key, rather than being searched by value, then it could go in a simpler, cheaper, more write-scalable store, like Riak. (This is true of quite a lot of user-centric data in many applications: user preferences, profile, history, etc.) Writes to tables which record user activity (data needed for analytics and user support) don’t really need to be in the main OLTP-ish database; they could go to a flat file or a NoSQL store. You could remove a lot of pointless write load by putting them in a separate store.
Instagram: Uploading users can consume all the available connections, as uploading is a slow process. This means that ‘reads’ cannot be served if the system gets busy with all the write requests. Web servers have a connection limit, so we should keep this in mind when designing our system. If we assume that a web server can have a maximum of 500 connections at any time, it can’t have more than 500 concurrent uploads or reads. To handle this bottleneck we can split reads and writes into separate services: dedicated servers for reads and different servers for writes, to ensure that uploads don’t hog the system. Separating photo read and write requests will also allow us to scale and optimize each of these operations independently.
Is the database too slow and does it need some in-memory caching? How much and at which layer should we introduce cache to speed things up? Use caching for faster query response time. Avoid writing to disk until you must - cache like crazy. Putting a cache in front of a database can help absorb uneven loads and spikes in traffic.
Caching generally helps more than it hurts, even when unsophisticated caching solutions are used.
Strategies:
Centralized caching vs distributed caching (each node has its own cache)
We can cache URLs that are frequently accessed. We can use some off-the-shelf solution like Memcache, which can store full URLs with their respective hashes. The application servers, before hitting backend storage, can quickly check if the cache has the desired URL.
How much cache should we have?
We can start with 20% of daily traffic and, based on clients’ usage patterns, adjust how many cache servers we need. As estimated above, we need about 3 GB of memory to cache 20% of daily traffic. Since a modern-day server can have 256GB of memory, we can easily fit all the cache into one machine. Alternatively, we can use a couple of smaller servers to store all these hot URLs.
Which cache eviction policy would best fit our needs?
When the cache is full, and we want to replace a link with a newer/hotter URL, how would we choose? Least Recently Used (LRU) can be a reasonable policy for our system. Under this policy, we discard the least recently used URL first. We can use a Linked Hash Map or a similar data structure to store our URLs and Hashes, which will also keep track of which URLs are accessed recently. To further increase the efficiency, we can replicate our caching servers to distribute load between them.
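A minimal LRU cache sketch; Python's `OrderedDict` plays the role of the Linked Hash Map mentioned above.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key: str):
        if key not in self._data:
            return None
        self._data.move_to_end(key)         # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.put("abc123", "https://example.com/a-very-long-url")
cache.put("xyz789", "https://example.com/another-long-url")
cache.get("abc123")                                   # touch abc123
cache.put("new456", "https://example.com/third-url")  # evicts xyz789
assert cache.get("xyz789") is None
```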
How can each cache replica be updated?
Whenever there is a cache miss, our servers would be hitting the backend database. Whenever this happens, we can update the cache and pass the new entry to all the cache replicas. Each replica can update its cache by adding the new entry. If a replica already has that entry, it can simply ignore it.
Since a user’s timeline will contain the most recent (and relevant) tweets, should we try to store our data in a way that is optimized for scanning the latest tweets?
We can introduce a cache for the metadata servers to cache hot database rows. We can use Memcache to cache the data, and application servers, before hitting the database, can quickly check whether the cache has the desired rows. Least Recently Used (LRU) can be a reasonable cache eviction policy for our system. Under this policy, we discard the least recently viewed row first.
Is there any single point of failure in our system? What are we doing to mitigate it? Do we have enough replicas of the data so that if we lose a few servers, we can still serve our users? Similarly, do we have enough copies of different services running, such that a few failures will not cause total system shutdown?
Losing files is not an option for our service. Therefore, we will store multiple copies of each file, so that if one storage server dies, we can retrieve the photo from the copy on a different storage server. The same principle applies to other components of the system. If we want high availability, we need multiple replicas of services running in the system, so that if a few services die, the system remains available and serving. Redundancy removes single points of failure in the system. If only one instance of a service is required to be running at any point, we can run a redundant secondary copy of the service that is not serving any traffic but can take control after failover whenever the primary has a problem. Creating redundancy in a system can remove single points of failure and provide backup or spare functionality if needed in a crisis. For example, if there are two instances of the same service running in production, and one fails or degrades, the system can fail over to the healthy copy. Failover can happen automatically or require manual intervention.
Isn’t the KGS (the Key Generation Service described below) a single point of failure? Yes, it is. To solve this, we can have a standby replica of KGS. Whenever the primary server dies, the standby server can take over to generate and provide keys.
ZooKeeper is a centralised configuration-management and coordination tool, used for consensus tasks like leader election and distributed locking. It scales very well for reads but not very well for writes, and it keeps all data in memory, so you cannot store large amounts of data. It is highly available and suited for read-heavy workloads.
Kafka is a fault-tolerant, highly available queue used for pub/sub or streaming applications. It keeps all messages ordered within a partition.
We could pre-generate tinyURLs and store them ahead of time: we can have a standalone Key Generation Service (KGS) that generates random six-letter strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want to shorten a URL, we just take one of the already-generated keys and use it. This approach makes things quite simple and fast. Not only are we not encoding the URL, but we also won’t have to worry about duplications or collisions. KGS will make sure all the keys inserted into key-DB are unique.
Can concurrency cause problems? As soon as a key is used, it should be marked in the database to ensure it doesn’t get used again. If there are multiple servers reading keys concurrently, we might get a scenario where two or more servers try to read the same key from the database. How can we solve this concurrency problem?
Servers can use KGS to read/mark keys in the database. KGS can use two tables to store keys: one for keys that are not used yet, and one for all the used keys. As soon as KGS gives keys to one of the servers, it can move them to the used keys table. KGS can always keep some keys in memory so that it can quickly provide them whenever a server needs them.
For simplicity, as soon as KGS loads some keys in memory, it can move them to the used keys table. This ensures each server gets unique keys. If KGS dies before assigning all the loaded keys to some server, we will be wasting those keys–which is acceptable, given the huge number of keys we have.
KGS also has to make sure not to give the same key to multiple servers. For that, it must synchronize (or get a lock on) the data structure holding the keys before removing keys from it and giving them to a server.
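A sketch of the KGS hand-out path; in a real deployment the two tables live in key-DB, but in-memory structures stand in for them here, and the lock marks the synchronization point described above.

```python
import threading


class KeyGenerationService:
    """Hands out pre-generated keys; a lock ensures no key goes to two servers."""

    def __init__(self, unused_keys):
        self._unused = list(unused_keys)   # stands in for the "unused keys" table
        self._used = set()                 # stands in for the "used keys" table
        self._lock = threading.Lock()

    def get_keys(self, n: int):
        """Give n keys to an app server, moving them to the used set atomically."""
        with self._lock:
            batch, self._unused = self._unused[:n], self._unused[n:]
            # Moved immediately; keys are lost if KGS crashes now, which is acceptable.
            self._used.update(batch)
            return batch


kgs = KeyGenerationService(unused_keys=["aB3dE9", "Zx81qL", "Q0o2Pn"])
print(kgs.get_keys(2))  # two unique keys, never handed out again
```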
Can each app server cache some keys from key-DB?
Yes, this can surely speed things up. Although in this case, if the application server dies before consuming all the keys, we will end up losing those keys. This could be acceptable since we have roughly 57 billion unique six-letter keys (62^6).
How would we perform a key lookup?
We can look up the key in our database or key-value store to get the full URL. If it’s present, issue an “HTTP 302 Redirect” status back to the browser, passing the stored URL in the “Location” field of the response. If that key is not present in our system, issue an “HTTP 404 Not Found” status, or redirect the user back to the homepage.
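A minimal Flask-style sketch of this redirect path; `lookup_long_url` and the in-memory `URL_STORE` are hypothetical stand-ins for the key-value store lookup described above.

```python
from flask import Flask, abort, redirect

app = Flask(__name__)

# Hypothetical stand-in for the backend key-value store.
URL_STORE = {"aB3dE9": "https://example.com/some/very/long/path"}


def lookup_long_url(key: str):
    """Return the stored long URL for a key, or None if unknown."""
    return URL_STORE.get(key)


@app.route("/<key>")
def follow(key: str):
    long_url = lookup_long_url(key)
    if long_url is None:
        abort(404)                       # unknown key: HTTP 404 Not Found
    return redirect(long_url, code=302)  # known key: HTTP 302 with Location header
```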
Should we impose size limits on custom aliases?
Our service supports custom aliases. Users can pick any ‘key’ they like, but providing a custom alias is not mandatory. However, it is reasonable (and often desirable) to impose a size limit on a custom alias to ensure we have a consistent URL database. Let’s assume users can specify a maximum of 16 characters per custom key (as reflected in the above database schema).
For YouTube, Netflix, Instagram, Twitter.
A content delivery network (CDN) is a system of distributed servers (network) that deliver webpages and other Web content to a user based on the geographic locations of the user, the origin of the webpage and a content delivery server. CDNs (processing done close to you, e.g. Netflix) and Edge (dedicated network from the processing point close to you all the way to the data centre - not the general internet)
Our service would need a massive-scale photo delivery system to serve the globally distributed users. Our service should push its content closer to the user using a large number of geographically distributed photo cache servers and use CDNs (for details see Caching).
The best way to practice for system design interviews is to actually sit down and design a system, i.e. your day-to-day work. Instead of doing the minimal work, go deeper into the tools, frameworks, and libraries you use. For example, if you use HBase, rather than simply using the client to run some DDL and do some fetches, try to understand its overall architecture, such as the read/write flow, how HBase ensures strong consistency, what minor/major compactions do, and where LRU cache and Bloom Filter are used in the system. You can even compare HBase with Cassandra and see the similarities and differences in their design. Then when you are asked to design a distributed key-value store, you won’t feel ambushed.
References