TLDR: use clusters; use caching; use sharding and replication; use redundancy; use parallelisation (multi-threading, load balancing, etc.); use async workflows; avoid points of contention and synchronisation issues.
Drive the discussion. Part of the signal the interviewer hopes to gather is whether you’ve learned how to build large systems through hard experience. Your ability to anticipate and work around typical problems is part of that signal. Make it a conversation: Be sure to ask clarifying questions. But make sure you drive towards a good solution.
Do not make any assumptions. Understand the constraints involved. Find the exact scope of the problem you are solving. Your interviewer is interested in understanding your thought processes.
How can we repeatedly find a string in a large dataset?
An eager candidate might immediately jump to the conclusion that having a large dataset searched repeatedly means it must be sorted and then binary-searched in log(n), and will start writing the code for mergesort on the whiteboard. All while the interviewer is thinking of a large document corpus and trying to get them to design a search engine.
Unless you ask some clarifying questions, like: “How big is the data set?”, “What are its characteristics?”, “How do we obtain it?”, you might not realize what you’re up against.
A use case is a description of a sequence of events that, taken together, lead to the system doing something useful. It captures who is going to use the system and how they are going to use it. The system may be very simple or very complicated.
Design a url shortening service
Shortening: createTiny(longURL) -> tinyURL
Redirection: getLong(tinyURL) -> longURL
Custom URL definition, if available
Analytics on url usage
Automatic link expiry
Manual removal of a URL you shortened
Who can post a tweet? (answer: any user)
Who can read the tweet? (answer: any user — as all tweets are public)
Will a tweet contain photos or videos? (answer: for now, just photos)
Can a user follow another user? (answer: yes).
Can a user ‘like’ a tweet? (answer: yes).
Can a user search for tweets (answer: yes).
What gets included in the user feed? (answer: tweets from everyone whom you are following)
Is the feed a list of tweets in chronological order? (answer: for now, yes).
Users can upload/download and view photos
Users can follow other users
Users can see a feed of top photos from other users they follow (the system should be able to generate and display a user’s News Feed consisting of top photos from all the people the user follows).
Users can search for photos based on titles.
Optional: Adding tags, searching based on tags, commenting, suggestions, tagging users to photos etc.
a messaging system
Is it reasonable to assume that the messages can be sent by all users?
Does the user need to send messages to one or multiple persons?
Does the user need read receipts?
Is it reasonable to assume that the recipients need to be notified?
Does the user need to see their message history?
an email server.
How do you store mail, especially as the system gets large enough that it won’t fit on one machine?
How do you handle mailing lists with large numbers of recipients?
How do you handle people abusing the system for spam?
a client-server API to build a rich document editor.
How does the client request data on the document from the server, especially as the document gets large enough that we wouldn’t want to download it in a single request?
How do we represent the rich document aspects like bold and italics in our API response?
How do we design the system so that new features can be added on the server without breaking older clients?
This would not only establish the exact contract expected from the system but would also confirm that you haven’t gotten any requirements wrong.
TinyUrl.
CreateTiny(longURL) -> tinyURL
Redirection: getLong(tinyURL) -> longURL
Twitter.
postTweet(user_id, tweet_text, image_url, user_location, timestamp, …)
generateTimeline(user_id, current_time)
searchTweet(string)
recordUserTweetLike(user_id, tweet_id, timestamp, …)
Messaging
sendMessage(sender, recipient, message)
getAllConversations(user)
getConversation(user, recipient)
searchMessages(searchString)
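To make the contract concrete, here is a rough Python sketch of the messaging APIs above; the method names mirror the list, while the `Message` dataclass and the parameter types are illustrative assumptions, not a fixed design.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    """Illustrative message record; real systems would add ids, media, read state, etc."""
    sender: str
    recipient: str
    text: str
    timestamp: float


class MessagingService:
    def send_message(self, sender: str, recipient: str, text: str) -> Message:
        """Persist the message and notify the recipient."""
        raise NotImplementedError

    def get_all_conversations(self, user: str) -> List[str]:
        """Return the ids of every conversation the user participates in."""
        raise NotImplementedError

    def get_conversation(self, user: str, recipient: str) -> List[Message]:
        """Return the message history between two users, oldest first."""
        raise NotImplementedError

    def search_messages(self, user: str, query: str) -> List[Message]:
        """Full-text search over the user's messages."""
        raise NotImplementedError
```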
List the scale of the system such as requests per second, requests types, data written per second, data read per second.
This is important because it helps you narrow down the list of possible solutions to only the ones that are feasible. Then you have only a few prototypes or micro-benchmarks to write. This would also help later when you’ll be focusing on scaling, partitioning, load balancing and caching.
TinyURL
could be meant to serve just a few thousand users, but each could be sharing millions of URLs. It could be meant to handle millions of clicks on the shortened URLs, or dozens. The service may have to provide extensive statistics about each shortened URL (which will increase your data size), or statistics may not be a requirement at all. How many millions of new URLs are created per month (or per second)?
“Say you are one of the top url shortening websites outside the top 3”
Find out the total number of URLs shortened per month.
Number of tweets/month: 15 billion. Assume between 1 in 20 and 1 in 10 new tweets carry a shortened URL, so 750 million to 1.5 billion new URLs are shortened per month from Twitter. Take the upper limit: 1.5 billion URLs shortened per month. Twitter is one of the main drivers of URL shortening, so we can take this as the total.
Find out the number of URL shortenings serviced by us.
Use the 80/20 rule for URLs shortened outside the top 3: 300 million. Say we service one third of these requests: 100 million.
Find out the total number of requests per month. Use a 90/10 rule for the redirect/shortening split, which gives about 1 billion requests/month.
Find out the total number of new URLs in 5 years: 100 million/month × 60 months = 6 billion.
How many characters do we need for our tinyURL?
Find out the number of hash characters required for 6 billion URLs: 6 billion expressed in base 62 needs 6–7 characters.
With a 62-character set and 7 characters, we can have 62^7 ≈ 3.5 trillion combinations. (Note: any number in the range 0–3.5 trillion can be represented using about 42 bits, since 62^7 ≈ 2^42.) If our service is generating 1,000 tinyURLs/sec, it will take about 110 years to exhaust this space. If our service is generating 1 million/sec, it will be exhausted in about 40 days. (Do the math.)
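A minimal base-62 encoding sketch backing the capacity math above; the alphabet ordering is an arbitrary choice.

```python
import string

# 62-character alphabet: 0-9, a-z, A-Z (the exact ordering is an arbitrary choice).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(n: int) -> str:
    """Encode a non-negative integer as a base-62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


# Capacity check from the text: 7 characters give 62**7 ~ 3.5 trillion keys,
# and an id space of 6 billion fits in at most 6 characters.
assert 62 ** 7 > 3_500_000_000_000
print(len(encode_base62(6_000_000_000 - 1)))  # -> 6
```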
Find out total storage needs
Assume 500 bytes for a long URL.
Assume 6 bytes for a short URL.
Total storage = 6 billion × (500 + 6) bytes ≈ 3 TB + 36 GB.
Bandwidth needed: find out writes per second = 100 million / (30 × 24 × 3600) ≈ 40 shortening writes/sec.
Data written per sec: 40 × (500 + 6) bytes ≈ 20 KB.
Data read per sec: 360 × (500 + 6) bytes ≈ 180 KB.
If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80-20 rule, meaning 20% of URLs generate 80% of traffic, we would like to cache these 20% hot URLs. 360 × 3600 × 24 ≈ 30 million URLs read per day, so about 6 million URLs to be cached. 6 million × 500 bytes = 3 × 10^9 bytes = 3 GB.
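The same back-of-the-envelope arithmetic, spelled out in a few lines of Python (all inputs are the assumptions listed above: 100M new URLs/month, a 90/10 read/write split, 500-byte long URLs, 6-byte keys, 80/20 caching rule).

```python
# Back-of-the-envelope estimation for the TinyURL numbers above.
SECONDS_PER_MONTH = 30 * 24 * 3600

new_urls_per_month = 100_000_000
writes_per_sec = new_urls_per_month / SECONDS_PER_MONTH   # ~40
reads_per_sec = 9 * writes_per_sec                        # ~360 (90/10 split)
bytes_per_url = 500 + 6

write_bw = writes_per_sec * bytes_per_url                 # ~20 KB/s
read_bw = reads_per_sec * bytes_per_url                   # ~180 KB/s

urls_read_per_day = reads_per_sec * 24 * 3600             # ~30 million
cache_size = 0.2 * urls_read_per_day * 500                # ~3 GB (80/20 rule)

print(f"{writes_per_sec:.0f} writes/s, {write_bw / 1e3:.0f} KB/s written")
print(f"{reads_per_sec:.0f} reads/s, {read_bw / 1e3:.0f} KB/s read")
print(f"cache ~{cache_size / 1e9:.1f} GB")
```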
How many total users are there? (answer: we expect to reach 200 million users in the first year)
How many daily active users are there? (100 million users sign in every day)
What scale is expected from the system? (e.g., number of new tweets, number of tweet views, how many timeline generations per second, etc.)
How much storage would we need? This will depend on whether users can upload photos and videos in their tweets.
What network bandwidth usage are we expecting? This will be crucial in deciding how we manage traffic and balance load between servers.
Calculate how much total storage is needed for the next 5 years.
Let’s assume we have 500M total users, with 1M daily active users.
2M new photos every day, 23 new photos every second.
Average photo file size => 200KB
Total space required for 1 day of photos
2M * 200KB => 400 GB
Total space required for 10 years: 400GB * 365 (days a year) * 10 (years) ~= 1425TB
Messaging System:
How many users are we talking about here? How many messages sent? How many messages read?
Dropbox:
What operations does this data store need to support? What operations is it optimized for?
For example, you might need to determine how long it will take to generate 100 image thumbnails from disk or how much memory a data structure will take. The Powers of two table and Latency numbers every programmer should know are handy references.
Power | Exact Value | Approx Value | Bytes
---|---|---|---
7 | 128 | |
8 | 256 | |
10 | 1,024 | 1 thousand | 1 KB
16 | 65,536 | | 64 KB
20 | 1,048,576 | 1 million | 1 MB
30 | 1,073,741,824 | 1 billion | 1 GB
32 | 4,294,967,296 | | 4 GB
40 | 1,099,511,627,776 | 1 trillion | 1 TB
Latency Comparison Numbers
Operation | Latency (ns) | Latency (µs) | Latency (ms) | Notes
---|---|---|---|---
L1 cache reference | 0.5 | | |
Branch mispredict | 5 | | |
L2 cache reference | 7 | | | 14x L1 cache
Mutex lock/unlock | 100 | | |
Main memory reference | 100 | | | 20x L2 cache, 200x L1 cache
Compress 1 KB with Zippy | 10,000 | 10 | |
Send 1 KB over 1 Gbps network | 10,000 | 10 | |
Read 4 KB randomly from SSD* | 150,000 | 150 | | ~1 GB/sec SSD
Read 1 MB sequentially from memory | 250,000 | 250 | |
Round trip within same datacenter | 500,000 | 500 | |
Read 1 MB sequentially from SSD* | 1,000,000 | 1,000 | 1 | ~1 GB/sec SSD, 4x memory
Disk seek | 10,000,000 | 10,000 | 10 | 20x datacenter roundtrip
Read 1 MB sequentially from 1 Gbps network | 10,000,000 | 10,000 | 10 | 40x memory, 10x SSD
Read 1 MB sequentially from disk | 30,000,000 | 30,000 | 30 | 120x memory, 30x SSD
Send packet CA->Netherlands->CA | 150,000,000 | 150,000 | 150 |
Handy metrics based on the numbers above:
Read sequentially from disk at ~30 MB/s
Read sequentially from a 1 Gbps network at ~100 MB/s
Read sequentially from SSD at ~1 GB/s
Read sequentially from main memory at ~4 GB/s
~6–7 worldwide round trips per second
~2,000 round trips per second within a datacenter
Messaging
What are the latency requirements for sender -> receiver message delivery?
How are you going to store messages?
How do you push new messages to clients? Do you push at all, or rely on a pull based model?
Instagram:
Highly available
Reliable: No photo or video uploaded should ever be lost.
Consistency can take a hit in the interest of availability: it is fine if some users don’t see the photo for a while.
The system would be read-heavy.
Focus on building a system that can retrieve photos quickly. Low latency on reads is expected while viewing photos. Acceptable latency is 200ms for News Feed Generation
Draw a block diagram with 5–6 boxes representing core components of your system.
You should identify enough components to solve the actual problem end-to-end. While drawing, comment on each component, the need for it, and the choices you can make at that level. Visualize an abstract design: this should be based on the constraints clarified in the previous step. It gives you a chance to consolidate all the information and reaffirm your design choices. Be sure to focus on the basic components of the system and the relationships between them. The candidate should be able to identify the various entities of the system, how they will interact with each other, and the different aspects of data management such as storage, transfer, and encryption.
At a high level, we would need multiple application servers to serve all the read/write requests, with load balancers in front of them for traffic distribution. If we’re assuming that we’ll have a lot more read traffic (as compared to write), we can decide to have separate servers for handling reads vs. writes. On the backend, we need an efficient database that can store all the tweets and can support a huge number of reads. We would also need a distributed file storage system for storing photos (and videos) and a search index and infrastructure to enable searching of tweets.
TinyUrl
An Application Layer and a Storage Layer
Instagram:
Use an object storage service to store photos (Spotify uses object storage for songs, Facebook for photos, Dropbox for files). We can use distributed file storage like HDFS or S3.
The interviewer’s feedback should always guide you towards which parts of the system they want you to explain further. Sometimes the interviewer is not interested in a particular aspect of the system being designed, like API design or estimations; sometimes they are. Keep this in mind and don’t spend a lot of time on an aspect the interviewer is not interested in, but still try to discuss as many aspects as possible. At each level, ask your interviewer for specifications (should you suggest a simple starting point, or talk about what a mature system might look like?) and talk about several options (applying the ideas from your reading). Discussing trade-offs in your design is key. Your interviewer cares less about whether your design is good in itself, and more about whether you are able to talk about the trade-offs (positives and negatives) of your decisions. Practice this.
For each component, write the specific APIs for each component. You may need to finish the detailed OOD design for a particular function. You may also need to design the database schema for the database. Although it’s not required early on, this will clarify how data will flow among different components of the system and later will also guide you towards data partitioning.
Twitter:
User: UserID, Name, Email, DoB, CreationDate, LastLogin
Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp
UserFollows: UserID1, UserID2
FavoriteTweets: UserID, TweetID, TimeStamp
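One way to pin down the field names and types is to write the schema above as plain dataclasses; this is only a sketch, and the choice of storage engine (SQL vs. NoSQL) is a separate decision.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class User:
    user_id: int
    name: str
    email: str
    dob: date
    creation_date: datetime
    last_login: datetime


@dataclass
class Tweet:
    tweet_id: int
    content: str
    tweet_location: str
    number_of_likes: int
    timestamp: datetime


@dataclass
class UserFollows:
    user_id1: int
    user_id2: int


@dataclass
class FavoriteTweet:
    user_id: int
    tweet_id: int
    timestamp: datetime
```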
Maintain an internal counter for each host. Assign a unique 6 bits to each host. Concatenate host + timestamp + random/incremental value.
There is a high probability of collision if the number of requests is 1,000/sec: every worker handles about 20 req/sec for which the timestamp and worker address are the same, leaving only 32 slots for the random/incremental part. Either use more than 43 bits to store the tinyURL, or use only part of the timestamp and widen the randomness/counter portion.
The service does a get request against the database for the long URL and, if it is not available, does a write.
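A sketch of the host + timestamp + counter idea; the bit widths are illustrative (the counter is widened beyond 5 bits, per the mitigation above), and the resulting integer would then be base-62 encoded into the short key.

```python
import itertools
import time

HOST_BITS = 6        # up to 64 workers, as in the scheme above
TIMESTAMP_BITS = 32  # current second since the epoch
COUNTER_BITS = 10    # widened beyond 5 bits so ~20 req/sec per worker do not collide

_counter = itertools.count()


def generate_key(host_id: int) -> int:
    """Pack host id, current second, and a per-host counter into one integer key."""
    ts = int(time.time()) & ((1 << TIMESTAMP_BITS) - 1)
    ctr = next(_counter) & ((1 << COUNTER_BITS) - 1)
    return (host_id << (TIMESTAMP_BITS + COUNTER_BITS)) | (ts << COUNTER_BITS) | ctr


# The resulting 48-bit integer base62-encodes to an 8-9 character short key,
# trading a slightly longer key for a much lower collision risk.
print(generate_key(host_id=5))
```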
We can store photos in a distributed file storage like HDFS or S3.
We can store the above schema in a distributed key-value store to enjoy the benefits offered by NoSQL. All the metadata related to photos can go to a table, where the ‘key’ would be the ‘PhotoID’ and the ‘value’ would be an object containing PhotoLocation, UserLocation, CreationTimestamp, etc.
We need to store relationships between users and photos, to know who owns which photo. We also need to store the list of people a user follows. For both of these tables, we can use a wide-column datastore like Cassandra. For the ‘UserPhoto’ table, the ‘key’ would be ‘UserID’ and the ‘value’ would be the list of ‘PhotoIDs’ the user owns, stored in different columns. We will have a similar scheme for the ‘UserFollow’ table. Cassandra, like key-value stores in general, always maintains a certain number of replicas to offer reliability. Also, in such data stores, deletes don’t get applied instantly; data is retained for a certain number of days (to support undeleting) before being removed from the system permanently.
User: Assuming each “int” and “dateTime” is four bytes, each row in the User’s table will be of 68 bytes: UserID (4 bytes) + Name (20 bytes) + Email (32 bytes) + DateOfBirth (4 bytes) + CreationDate (4 bytes) + LastLogin (4 bytes) = 68 bytes.
If we have 500 million users, we will need 32GB of total storage 500 million * 68 ~= 32GB
Photo: Each row in the Photo table will be of 284 bytes: PhotoID (4 bytes) + UserID (4 bytes) + PhotoPath (256 bytes) + PhotoLatitude (4 bytes) + PhotoLongitude (4 bytes) + UserLatitude (4 bytes) + UserLongitude (4 bytes) + CreationDate (4 bytes) = 284 bytes.
If 2M new photos get uploaded every day, we will need 0.5GB of storage for one day: 2M * 284 bytes ~= 0.5GB per day. For 10 years we will need 1.88TB of storage.
UserFollow: Each row in the UserFollow table will be of 8 bytes.
If we have 500 million users and on average each user follows 500 users, we would need 1.82TB of storage for the UserFollow table: 500 million users * 500 followers * 8 bytes ~= 1.82TB.
Total space required for all tables for 10 years will be 3.7TB: 32GB + 1.88TB + 1.82TB ~= 3.7TB
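The same storage arithmetic, spelled out with the row sizes assumed above:

```python
# Row sizes from the assumptions above.
user_row = 4 + 20 + 32 + 4 + 4 + 4            # 68 bytes
photo_row = 4 + 4 + 256 + 4 + 4 + 4 + 4 + 4   # 284 bytes
follow_row = 8                                 # two 4-byte user ids

users = 500_000_000
photos_per_day = 2_000_000
days_10_years = 365 * 10
follows_per_user = 500

user_table = users * user_row                          # ~32 GB
photo_table = photos_per_day * days_10_years * photo_row  # ~1.88 TB over 10 years
follow_table = users * follows_per_user * follow_row   # ~1.82 TB

total = user_table + photo_table + follow_table        # ~3.7 TB
print(f"{total / 2**40:.2f} TiB total")
```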
To handle such large data, we can shard based on user ID: let’s assume we shard on ‘UserID’ so that we can keep all photos of a user on the same shard. If one DB shard is 1TB, we will need four shards to store 3.7TB of data. Let’s assume that for better performance and scalability we keep 10 shards. So we’ll find the shard number by UserID % 10 and then store the data there. To uniquely identify any photo in our system, we can append the shard number to each PhotoID. How can we generate PhotoIDs? Each DB shard can have its own auto-increment sequence for PhotoIDs, and since we append the ShardID to each PhotoID, it will be unique throughout our system.
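A sketch of this sharding scheme; the exact way the shard id is combined with the PhotoID (suffix and separator here) is an arbitrary choice.

```python
NUM_SHARDS = 10  # from the assumption above


def shard_for_user(user_id: int) -> int:
    """All of a user's photos live on the shard chosen by UserID % 10."""
    return user_id % NUM_SHARDS


def make_photo_id(shard_id: int, local_seq: int) -> str:
    """Combine the shard's auto-increment sequence with the shard id so the
    PhotoID is unique across the whole system."""
    return f"{local_seq}-{shard_id}"


def shard_for_photo(photo_id: str) -> int:
    """Recover the owning shard from a PhotoID without any lookup."""
    return int(photo_id.rsplit("-", 1)[1])


shard = shard_for_user(123456)          # -> 6
photo_id = make_photo_id(shard, 98765)  # -> "98765-6"
assert shard_for_photo(photo_id) == shard
```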
Cons of this scheme:
* How would we handle hot users? Many people follow such hot users, and a lot of other people see any photo they upload.
* Some users will have a lot of photos compared to others, thus making a non-uniform distribution of storage. Sharding your data based on user IDs will always result in overloaded partitions for celebrity users.
* What if we cannot store all pictures of a user on one shard? If we distribute photos of a user onto multiple shards, will it cause higher latencies?
* Storing all photos of a user on one shard can cause issues like unavailability of all of the user’s data if that shard is down, or higher latency if it is serving a high load, etc.
Provide different approaches, their pros and cons, and explain why you would choose one. A good solution compares and contrasts different approaches. It explores the trade-offs present in any complex engineering problem and makes intelligent, reasoned decisions about those trade-offs. Try to discuss as many bottlenecks as possible and different approaches to mitigate them.
Example: Twitter. How would we handle high-traffic users, e.g. celebrities who have millions of followers?
What is load balancing (NGINX, HAProxy)?
Routing strategy:
How to deal with increasing load on the database?
Segregate writes from reads: identify write-intensive data that could be moved out of the relational database altogether. If it’s a structured dataset which is only ever stored and retrieved by key, rather than being searched by value, then it could go in a simpler, cheaper, more write-scalable store, like Riak. (This is true of quite a lot of user-centric data in many applications: user preferences, profile, history, etc.) Writes to tables which record user activity (data needed for analytics and user support) don’t really need to be in the main OLTP-ish database; they could go to a flat file or a NoSQL store. You could remove a lot of pointless write load by putting them in a separate store.
Instagram: Uploading users can consume all the available connections, as uploading is a slow process. This means that ‘reads’ cannot be served if the system gets busy with all the write requests. Web servers have a connection limit, so we should keep this in mind when designing our system. If we assume that a web server can have a maximum of 500 connections at any time, it can’t have more than 500 concurrent uploads or reads. To handle this bottleneck we can split reads and writes into separate services: dedicated servers for reads and different servers for writes, to ensure that uploads don’t hog the system. Separating photo read and write requests will also allow us to scale and optimize each of these operations independently.
Is the database too slow and does it need some in-memory caching? How much and at which layer should we introduce cache to speed things up? Use caching for faster query response time. Avoid writing to disk until you must - cache like crazy. Putting a cache in front of a database can help absorb uneven loads and spikes in traffic.
Caching generally helps more than it hurts, even when unsophisticated caching solutions are used.
Strategies:
Centralized caching vs distributed caching (each node has its own cache)
We can cache URLs that are frequently accessed. We can use some off-the-shelf solution like Memcache, which can store full URLs with their respective hashes. The application servers, before hitting backend storage, can quickly check if the cache has the desired URL.
How much cache should we have?
We can start with 20% of daily traffic and, based on clients’ usage patterns, adjust how many cache servers we need. As estimated above, we need about 3 GB of memory to cache 20% of daily traffic. Since a modern-day server can have 256GB of memory, we can easily fit all the cache into one machine. Alternatively, we can use a couple of smaller servers to store all these hot URLs.
Which cache eviction policy would best fit our needs?
When the cache is full, and we want to replace a link with a newer/hotter URL, how would we choose? Least Recently Used (LRU) can be a reasonable policy for our system. Under this policy, we discard the least recently used URL first. We can use a Linked Hash Map or a similar data structure to store our URLs and Hashes, which will also keep track of which URLs are accessed recently. To further increase the efficiency, we can replicate our caching servers to distribute load between them.
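A minimal LRU cache sketch; Python's `OrderedDict` plays the role of the Linked Hash Map mentioned above.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key: str):
        if key not in self._data:
            return None
        self._data.move_to_end(key)         # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.put("abc123", "https://example.com/a-very-long-url")
cache.put("xyz789", "https://example.com/another-long-url")
cache.get("abc123")                                   # touch abc123
cache.put("new456", "https://example.com/third-url")  # evicts xyz789
assert cache.get("xyz789") is None
```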
How can each cache replica be updated?
Whenever there is a cache miss, our servers would be hitting the backend database. Whenever this happens, we can update the cache and pass the new entry to all the cache replicas. Each replica can update its cache by adding the new entry. If a replica already has that entry, it can simply ignore it.
Since a user’s timeline will contain the most recent (and relevant) tweets, should we try to store our data in a way that is optimized for scanning the latest tweets?
We can introduce a cache for the metadata servers to cache hot database rows. We can use Memcache to cache the data, and application servers, before hitting the database, can quickly check whether the cache has the desired rows. Least Recently Used (LRU) can be a reasonable cache eviction policy for our system. Under this policy, we discard the least recently viewed row first.
Is there any single point of failure in our system? What are we doing to mitigate it? Do we have enough replicas of the data so that if we lose a few servers, we can still serve our users? Similarly, do we have enough copies of different services running, such that a few failures will not cause total system shutdown?
Losing files is not an option for our service. Therefore, we will store multiple copies of each file, so that if one storage server dies, we can retrieve the photo from the copy on a different storage server. The same principle applies to other components of the system. If we want high availability, we need multiple replicas of services running in the system, so that if a few services die, the system remains available and serving. Redundancy removes single points of failure in the system. If only one instance of a service is required to be running at any point, we can run a redundant secondary copy of the service that is not serving any traffic but can take control after failover whenever the primary has a problem. Creating redundancy in a system can remove single points of failure and provide backup or spare functionality if needed in a crisis. For example, if there are two instances of the same service running in production, and one fails or degrades, the system can fail over to the healthy copy. Failover can happen automatically or require manual intervention.
Isn’t the KGS (the Key Generation Service described below) a single point of failure? Yes, it is. To solve this, we can have a standby replica of KGS. Whenever the primary server dies, the standby server can take over to generate and provide keys.
ZooKeeper is a centralised configuration-management and coordination tool, used for consensus tasks like leader election and distributed locking. It scales very well for reads but not very well for writes, and it keeps all data in memory, so you cannot store large amounts of data. It is highly available and suited for read-heavy workloads.
Kafka is a fault-tolerant, highly available queue used for pub/sub or streaming applications. It keeps all messages ordered within a partition.
We could pre-generate tinyURLs and store them ahead of time: we can have a standalone Key Generation Service (KGS) that generates random six-letter strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want to shorten a URL, we just take one of the already-generated keys and use it. This approach makes things quite simple and fast. Not only are we not encoding the URL, but we also won’t have to worry about duplications or collisions. KGS will make sure all the keys inserted into key-DB are unique.
Can concurrency cause problems? As soon as a key is used, it should be marked in the database to ensure it doesn’t get used again. If there are multiple servers reading keys concurrently, we might get a scenario where two or more servers try to read the same key from the database. How can we solve this concurrency problem?
Servers can use KGS to read/mark keys in the database. KGS can use two tables to store keys: one for keys that are not used yet, and one for all the used keys. As soon as KGS gives keys to one of the servers, it can move them to the used keys table. KGS can always keep some keys in memory so that it can quickly provide them whenever a server needs them.
For simplicity, as soon as KGS loads some keys in memory, it can move them to the used keys table. This ensures each server gets unique keys. If KGS dies before assigning all the loaded keys to some server, we will be wasting those keys–which is acceptable, given the huge number of keys we have.
KGS also has to make sure not to give the same key to multiple servers. For that, it must synchronize (or get a lock on) the data structure holding the keys before removing keys from it and giving them to a server.
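A sketch of the KGS hand-out path; in a real deployment the two tables live in key-DB, but in-memory structures stand in for them here, and the lock marks the synchronization point described above.

```python
import threading


class KeyGenerationService:
    """Hands out pre-generated keys; a lock ensures no key goes to two servers."""

    def __init__(self, unused_keys):
        self._unused = list(unused_keys)   # stands in for the "unused keys" table
        self._used = set()                 # stands in for the "used keys" table
        self._lock = threading.Lock()

    def get_keys(self, n: int):
        """Give n keys to an app server, moving them to the used set atomically."""
        with self._lock:
            batch, self._unused = self._unused[:n], self._unused[n:]
            # Moved immediately; keys are lost if KGS crashes now, which is acceptable.
            self._used.update(batch)
            return batch


kgs = KeyGenerationService(unused_keys=["aB3dE9", "Zx81qL", "Q0o2Pn"])
print(kgs.get_keys(2))  # two unique keys, never handed out again
```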
Can each app server cache some keys from key-DB?
Yes, this can surely speed things up. Although in this case, if the application server dies before consuming all the keys, we will end up losing those keys. This could be acceptable since we have roughly 57 billion unique six-letter keys (62^6).
How would we perform a key lookup?
We can look up the key in our database or key-value store to get the full URL. If it’s present, issue an “HTTP 302 Redirect” status back to the browser, passing the stored URL in the “Location” field of the response. If that key is not present in our system, issue an “HTTP 404 Not Found” status, or redirect the user back to the homepage.
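A minimal Flask-style sketch of this redirect path; `lookup_long_url` and the in-memory `URL_STORE` are hypothetical stand-ins for the key-value store lookup described above.

```python
from flask import Flask, abort, redirect

app = Flask(__name__)

# Hypothetical stand-in for the backend key-value store.
URL_STORE = {"aB3dE9": "https://example.com/some/very/long/path"}


def lookup_long_url(key: str):
    """Return the stored long URL for a key, or None if unknown."""
    return URL_STORE.get(key)


@app.route("/<key>")
def follow(key: str):
    long_url = lookup_long_url(key)
    if long_url is None:
        abort(404)                       # unknown key: HTTP 404 Not Found
    return redirect(long_url, code=302)  # known key: HTTP 302 with Location header
```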
Should we impose size limits on custom aliases?
Our service supports custom aliases. Users can pick any ‘key’ they like, but providing a custom alias is not mandatory. However, it is reasonable (and often desirable) to impose a size limit on a custom alias to ensure we have a consistent URL database. Let’s assume users can specify a maximum of 16 characters per custom key (as reflected in the above database schema).
For YouTube, Netflix, Instagram, Twitter.
A content delivery network (CDN) is a system of distributed servers (network) that deliver webpages and other Web content to a user based on the geographic locations of the user, the origin of the webpage and a content delivery server. CDNs (processing done close to you, e.g. Netflix) and Edge (dedicated network from the processing point close to you all the way to the data centre - not the general internet)
Our service would need a massive-scale photo delivery system to serve the globally distributed users. Our service should push its content closer to the user using a large number of geographically distributed photo cache servers and use CDNs (for details see Caching).
The best way to practice for system design interviews is to actually sit down and design a system, i.e. your day-to-day work. Instead of doing the minimal work, go deeper into the tools, frameworks, and libraries you use. For example, if you use HBase, rather than simply using the client to run some DDL and do some fetches, try to understand its overall architecture, such as the read/write flow, how HBase ensures strong consistency, what minor/major compactions do, and where LRU cache and Bloom Filter are used in the system. You can even compare HBase with Cassandra and see the similarities and differences in their design. Then when you are asked to design a distributed key-value store, you won’t feel ambushed.
References