Backend Design

Properties of Systems

Real-World Performance. You should be familiar with the speed of everything your computer can do, including the relative performance of RAM, disk, SSD, and your network.

Concurrency. Do you understand threads, deadlock, and starvation? Do you know how to parallelize algorithms? Do you understand consistency and coherence?

Networking. Do you roughly understand IPC and TCP/IP? Do you know the difference between throughput and latency, and when each is the relevant factor? (Fallacy of distributed computing #2: people assume latency is zero. Ignorance of network latency, and of the packet loss it can cause, induces application- and transport-layer developers to allow unbounded traffic, greatly increasing dropped packets and wasting bandwidth.) (Fallacy of distributed computing #3: people assume bandwidth is infinite. Ignorance of bandwidth limits on the part of traffic senders can result in bottlenecks.)

Availability and Reliability. (Fallacy of distributed computing #1: people assume the network is reliable.) Are you thinking about how things can fail, especially in a distributed environment? Do you know how to design a system to cope with network failures? Do you understand durability? (e.g. software applications are often written with little error handling for networking errors. During a network outage, such applications may stall or wait indefinitely for an answer packet, permanently consuming memory or other resources. When the failed network becomes available again, those applications may also fail to retry any stalled operations, or may require a manual restart.)

Abstraction. You should understand the systems you’re building upon. Do you know roughly how an OS, file system, and database work? Do you know about the various levels of caching in a modern OS?

The interviewer is usually interested in: the choice of resources such as databases, storage, and logging; the use of multithreading, message queues, caches, and efficient algorithms for the required functionality; identifying all the data and classes that would be important to store in case anything goes wrong in the system; and which functionality is exposed to the end user and which is not, so that users cannot break the whole system.

MySQL query cache / Memcache: save the results of a query in RAM. Use an LRU mechanism for eviction/garbage collection.
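A minimal sketch of the LRU eviction policy mentioned above, using Python's OrderedDict; the capacity and the query-string key are illustrative, not a description of MySQL's or Memcache's internals:

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least-recently-used entry once capacity is exceeded."""
    def __init__(self, capacity=128):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)          # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # drop the least recently used entry

# e.g. cache the result of an expensive query keyed by its SQL text
cache = LRUCache(capacity=2)
cache.put("SELECT * FROM users WHERE id = 1", {"id": 1, "name": "Ada"})
print(cache.get("SELECT * FROM users WHERE id = 1"))
```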

Redundancy.

Security. (Fallacy of distributed computing #4: people assume the network is secure.) Complacency regarding network security results in being blindsided by malicious users and programs that continually adapt to security measures.

Bloom filter: a probabilistic, space-efficient data structure used to tell whether an element is a member of a set. It can have false positives but no false negatives.

Count-min sketch: a data structure used to estimate the frequency of events. If you have a million events and want to track the top k events with some error tolerance, this is a space-efficient way to find them (instead of keeping all the events).

Blob store (e.g. Amazon S3): used for storing very large files in the cloud.
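A toy Bloom filter to make the "false positives but no false negatives" property concrete; the bit-array size, number of hashes, and use of SHA-256 are arbitrary illustrative choices:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        # derive k bit positions from salted SHA-256 digests
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice@example.com")
assert bf.might_contain("alice@example.com")   # added items are never missed
print(bf.might_contain("bob@example.com"))     # usually False, but could be a false positive
```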

MapReduce is parallel processing of large amounts of data. Map filters and sorts, while reduce aggregates or summarizes. Implementation: Hadoop.
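A single-process sketch of the map/reduce split using the canonical word-count example; a real framework such as Hadoop would shuffle the mapped pairs by key and run both phases across many machines:

```python
from collections import defaultdict

def map_phase(document):
    # map: transform/filter each record, emitting (key, value) pairs
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    # reduce: aggregate all values that share a key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))   # {'the': 2, 'quick': 1, 'brown': 1, ...}
```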

Virtual machines are a way of giving you an OS on top of shared hardware so that you feel you are the owner of the hardware. Containers are a way of running an application and its dependencies in an isolated environment.


Sessions and cookies

Cookies are a way to handle sessions. Sessions are needed because the REST design of APIs is stateless: the server is not supposed to differentiate one request from another. The server stores whatever information it cares to maintain between requests (like the logged-in user's username and/or ID, for example) and assigns that information an ID (called a "session ID"). It then tells the browser that session ID, in such a way that the browser can hand the ID back when it's time to make another request. https://stackoverflow.com/questions/3521311/how-does-iis-recognize-different-sessions-in-net/3521393#3521393 https://stackoverflow.com/questions/13206870/what-does-rests-principle-of-statelessness-actually-means
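A bare-bones sketch of that session mechanism: the server keeps per-session state keyed by a random session ID and hands the ID to the browser in a cookie. The store, function names, and cookie attributes here are illustrative, not any particular framework's API:

```python
import secrets

sessions = {}          # server-side store: session_id -> per-user state

def login(username):
    session_id = secrets.token_urlsafe(32)           # unguessable session ID
    sessions[session_id] = {"username": username}
    # sent to the browser as: Set-Cookie: session_id=<id>; HttpOnly; Secure
    return session_id

def handle_request(cookie_session_id):
    # each later request carries the cookie; the server looks up the stored state
    state = sessions.get(cookie_session_id)
    return state["username"] if state else None

sid = login("alice")
print(handle_request(sid))      # 'alice'
print(handle_request("bogus"))  # None: unknown session, treat as logged out
```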

Hardware:

A data centre has racks, and racks have hosts. What are the latencies involved? What happens if a rack or a data centre goes down?

Limitations and their effect on latency and throughput: CPU, hard drive, network resources.

Random (try to avoid on disks) vs sequential (good for disks) reads and writes.


Properties of a storage system



Other dbs:


Property-by-property comparison: relational vs document databases.

TL;DR. Relational: better support for the joins needed to resolve many-to-one and many-to-many relationships. Document: schema flexibility, better locality performance (but not really), and less object-relational impedance mismatch in one-to-many relationships.

Data model. Relational: relations (tables in SQL), each relation an unordered collection of tuples (rows in SQL); foreign keys from early on (Oracle); support for query-able and indexable multivalued data and JSON in a column came later (PostgreSQL, IBM DB2, MS SQL). Document: documents (more dynamic and expressive than the restricted relational model?); non-queryable structured datatypes like an XML or JSON document in a column, letting the application interpret its content.

Schema evolution. Example: change from a column fullName to separate columns userFirstName and userSecondName. Relational: ALTER TABLE users ADD COLUMN firstName text; UPDATE users SET firstName = split_part(fullName, ' ', 1). Document: fix up on read, e.g. if (user && user.fullName) { user.firstName = user.fullName.split(" ")[0] }.
Relational: schema changes have a bad reputation for being slow and requiring downtime; that is not really true, except in MySQL, which copies the entire table. Running UPDATE on a large table is likely to be slow on any database: if that is not acceptable, leave the new field as null and fill it only on read, just as we do in NoSQL. A schema is useful for documenting and enforcing structure. [1] Document: schema-on-read, as the app assumes some schema when reading documents. Advantageous when heterogeneous objects are to be stored and it is not practical to have a separate table for them, or when the structure of objects is determined by external systems which may change.

Locality. Relational: if you want to fetch a profile, you either have to query multiple tables on user_id or do a multi-table join, i.e. multiple index lookups and disk seeks. Locality is provided by features in some databases: Google's Spanner offers the same locality in a relational model by allowing the schema to declare that a table's rows are to be interleaved within a parent table; Oracle has a multi-table index cluster tables feature; the column-family concept in BigTable (used in Cassandra and HBase) is similar. Document: documents are stored as a single continuous string encoded as JSON, XML, or a binary variant. All the information is in one place, so one query is sufficient. This is an advantage only if you need large parts of the document at the same time, because the database needs to load the whole document even if only a small part is needed. On updates, the entire document needs to be rewritten, so it is recommended to keep documents small and avoid writes that increase their size. These requirements significantly reduce the set of situations in which document stores are useful.

Consistency. Relational: strongly consistent (reads will see the latest write). Document: eventually consistent (reads will see some write, which might not be the latest write).

Application-level complexity. Relational: shredding a document into multiple tables can lead to cumbersome schemas and unnecessarily complicated application code, [2] but it is good for highly interconnected data which is not self-contained. Document: if the data has a document-like structure (a tree of one-to-many relationships, where typically the entire tree is loaded) and many-to-many relationships are not needed, then not very complex, because we don't need joins, so there is no need to emulate them in app code. [3] But deeply nested documents can cause difficulties ("the second item in the list of positions for user x" vs directly referring to a position ID). For many-to-many there are two options: denormalize, with additional work to keep the data consistent, or normalize and emulate joins in app code; both are complex, and the latter is slower because of multiple round-trip queries to the db.

Relationships between entities. Relational: good for many-to-one (one LinkedIn user has worked at multiple organizations) or many-to-many, e.g. a LinkedIn résumé where one user can belong to multiple organizations and one organization can be associated with multiple users; if we want to query the latter, we need to join with the user table. As applications grow, data tends to become interconnected, which needs support for joins (e.g. a recommendation feature on a user profile including the recommender's photo should update when the photo changes). Document: good for one-to-many relationships and entities which are naturally grouped together (like experiences, education, etc. in a résumé document on LinkedIn, or an event log), but it is difficult to refer to deeply nested structures.

Fault tolerance. ….

Concurrency. ….

Applications. Relational: business data processing, i.e. transactions and batch processing, but generalizes well: social networks, publishing, e-commerce, games, SaaS apps. Document: higher scalability than relational; larger data and high-throughput apps. Key-value store: applications that require key-value access but not a timeline-consistent change-capture stream and other features provided by an RDBMS. Document store: if your data has secondary structure, like JSON or XML, and is self-contained rather than subdivided (like a résumé). Wide-column store: one row can hold many different types of data, with a flexible number of columns per row.

Examples. Relational: mostly commercial. Convergence: PostgreSQL and MySQL support JSON and XML and querying inside them. Document: most are free and open source; key-value stores like Voldemort; document stores such as MongoDB, CouchDB, RethinkDB, Espresso. There is some convergence: MongoDB supports client-side joins to reduce complexity, but still needs to make network round trips; RethinkDB also supports joins in its query language.
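To make the locality and one-to-many points concrete, here is a rough sketch (in Python, with made-up names and values) of a résumé in the document model, contrasted with how a relational design would split it up:

```python
# Document model: the whole résumé is one self-contained record,
# so a single lookup fetches everything (good locality for one-to-many data).
resume = {
    "user_id": 251,
    "first_name": "Ada",
    "positions": [
        {"job_title": "Engineer",  "organization": "Acme Corp"},
        {"job_title": "Co-founder", "organization": "Example Labs"},
    ],
    "education": [
        {"school": "Some University", "start": 2008, "end": 2012},
    ],
}
print(resume["positions"][0]["organization"])

# Relational model: the same data would be shredded into users, positions,
# and education tables linked by user_id, and fetching a full profile means
# either several indexed lookups or a multi-table join.
```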

Polyglot Persistence

Keep some of your data in Mongo, and other data that shows relations between objects in a graph database. What this essentially means is: there's no one DB that gives you the best results for all of the features you want to build.
For example: if you want a search box to find your real-life friends on the app, you might want to use a search index DB like Solr, Lucene or Elasticsearch.
If you want to find friends of friends who live or work in San Francisco, you might want to use a graph DB like Neo4j for its capability to search based on required links (called "edges") to other things like work location, hometown, or friend relations.
If you want to have a big collection of posted activities by users (schema-less because a post could have many types and attributes), you might want to use a document store like CouchDB or MongoDB.
What you should take away is: use one DB as your primary datastore (which one doesn't matter much, but document stores and relational databases are pretty popular) and use a selection of other DBs to which you can replicate data, for better query-ability depending on the feature. It's common for these companies to use entirely different stacks to implement different graph features on the same graph data.
For example, Facebook's Graph Search system is entirely decoupled from Facebook's newsfeed system, even though they operate on the same (at least logically) graph data. The reasons for using different stacks are: (a) different latency requirements, (b) the business impact of the features, (c) modular design for ease of maintenance, and so on. So, if we look carefully, all these social networks are using graph database concepts, but the graph database abstraction is often hidden behind their highly optimized implementations. The graph tends to be an overlay in these cases, decoupled from fit-for-purpose KV, column, and document stores. KV stores with rich value fields can capture what's in a graph by loading whatever data is associated with a key into the value column. Structure can inhibit flexibility and speed, hence the use of KV and column stores.

Transaction processing (stock-keeping in warehouses) vs batch processing (reporting, payroll, invoicing). All use cases:


OLTP vs OLAP

Storage Engine for analytics:

Benefits

Column-oriented storage, compression with bitmap encoding, and sorting on frequently queried columns can optimize read times.
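A tiny illustration of why column orientation plus bitmap encoding speeds up analytic reads: each distinct value in a column gets its own bit array, so a filter touches compact bitmaps instead of whole rows. The column contents and names below are invented:

```python
# One column from a fact table, stored on its own (column-oriented layout).
country_column = ["DE", "US", "DE", "FR", "US", "DE"]

# Bitmap encoding: one bit array per distinct value in the column.
bitmaps = {}
for row_index, value in enumerate(country_column):
    bitmaps.setdefault(value, [0] * len(country_column))[row_index] = 1

print(bitmaps["DE"])   # [1, 0, 1, 0, 0, 1]

# A filter like WHERE country = 'DE' scans (and can bitwise-AND) these small
# arrays instead of reading every full row; the matching row numbers fall out:
matching_rows = [i for i, bit in enumerate(bitmaps["DE"]) if bit]
print(matching_rows)   # [0, 2, 5]
```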


Encoding

How to choose an encoding for your data?

Encoding is converting a data structure into a byte representation, to use in dataflows.

Why encode?
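A concrete instance of that "data structure to bytes" conversion, using JSON as one possible format (any textual or binary encoding could stand in for it); the record contents are made up:

```python
import json

record = {"user_id": 42, "interests": ["databases", "hiking"]}

# in-memory structure -> bytes, suitable for the network or disk
encoded = json.dumps(record).encode("utf-8")
print(encoded)

# bytes -> in-memory structure on the receiving side
decoded = json.loads(encoded.decode("utf-8"))
assert decoded == record
```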

Three types of encoding:

Three types of dataflow:

https://stackoverflow.com/questions/5449034/what-is-the-difference-between-rest-and-http-protocols http://www.looah.com/source/view/2284 https://news.ycombinator.com/item?id=18702495 https://stackoverflow.com/questions/41141577/graphql-or-rest


Asynchronous message flows sit somewhere in between network (service-call) dataflows and database dataflows.

Benefits

Concurrency in a single process:

Distributed Actor frameworks: message broker + actor programming model. How to ensure compatibility with rolling upgrades, when messages are encoded?

Database Design


Distributed Databases

Replication

User-facing problems with replication lag (temporary inconsistency with asynchronous replication aka eventual consistency)
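One standard mitigation for the read-after-write problem caused by replication lag is to route a user's reads to the leader for a short window after that user writes. A minimal sketch of that routing decision; the store, window length, and function names are invented for illustration:

```python
import time

READ_YOUR_WRITES_WINDOW = 60        # seconds; tune to the expected replication lag
last_write_at = {}                  # user_id -> timestamp of that user's latest write

def record_write(user_id):
    last_write_at[user_id] = time.time()

def choose_replica(user_id, leader, followers):
    # Recent writers read from the leader so they see their own writes;
    # everyone else can be served by a (possibly lagging) follower.
    if time.time() - last_write_at.get(user_id, 0) < READ_YOUR_WRITES_WINDOW:
        return leader
    return followers[0]

record_write("alice")
print(choose_replica("alice", "leader-1", ["follower-1"]))   # leader-1
print(choose_replica("bob",   "leader-1", ["follower-1"]))   # follower-1
```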

Multiple leader replication

Leaderless (Dynamo, Riak, Cassandra, Voldemort): all replicas accept writes.

Designed to operate with tolerance for conflicting writes (consistency), latency spikes (availability), and network interruptions (partitions).

How about when a node is down?

How many nodes can be down? What are the conditions under which a read is guaranteed to return the latest value? (Consistency)

In general, in a leaderless architecture one does not get consistency guarantees similar to a leader-based architecture (read-after-write consistency, monotonic reads, consistent prefix reads). The closest one gets is quorum consistency, which ensures that at least one node overlaps between the set you read from and the set you write to. But quorums still have edge cases in which stale values can be returned.
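To make quorum consistency concrete: with n replicas, a write acknowledged by w nodes and a read sent to r nodes are guaranteed to overlap in at least one node when w + r > n. A small check, using typical Dynamo-style numbers chosen for illustration:

```python
def quorum_overlaps(n, w, r):
    # At least one node is in both the write set and the read set iff w + r > n.
    return w + r > n

print(quorum_overlaps(n=3, w=2, r=2))   # True:  every read touches an up-to-date copy
print(quorum_overlaps(n=3, w=1, r=1))   # False: a read may hit only stale replicas
```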

Unlike in leader-based replication (where replication logs can be used to figure out how far behind a replica is), monitoring health in terms of how stale the data is is difficult: writes are not applied in any fixed order. If only read repair is used, there is no limit on how stale the data can get.

Multiple-data centre operation

Concurrency is defined by a happens-before relationship: two operations are concurrent if neither happens before the other.


Partitioning

Shard: MongoDB, ElasticSearch, SolrCloud
Region: HBase
Tablet: BigTable
VNode: Cassandra, Riak
vBucket: Couchbase
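Whatever each system calls its partitions, the basic key-to-partition assignment can be sketched the same way; a minimal hash-partitioning example, where the partition count and the choice of MD5 are arbitrary illustrative choices:

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    # Hash the key and map the digest onto one of the partitions.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

print(partition_for("user:42"))   # always the same partition for the same key
print(partition_for("user:43"))   # keys spread (roughly) evenly across partitions
```

Hashing the key spreads load evenly but loses the ability to do efficient range scans, which is the usual trade-off against key-range partitioning.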

Pros:

Cons:


  1. There’s not many good solutions for schema flexibility in the RDBMS world. It usually revolves around storing xml/json or twisted database design. It works, but it’s really awkard and it can eventually wreck your whole application (maintenance or performance wise). This is especially true when you have a lot of data. Schema evolution in RDBMS’s is accomplished by expensive alter-table statements and is extremely painful. Altering tables requires manual DBA intervention and blocks application teams from iterating quickly. Fast product iteration is essential for an internet company like LinkedIn. 

  2. Relational databases have many, many faults, but they make a lot of common tasks simple while hiding both the cost and complexity. If you want to know how many black Prius cars are in inventory, for example, then that's pretty easy to do. The hard part is the impedance mismatch: a translation layer between objects and tables, placated somewhat by ORM frameworks like Hibernate and ActiveRecord.

  3. Easy: no impedance mismatch, but complexity shifts to the application layer when joins are needed (though some databases provide support for special query operations). You would have to program a counter of black Prius cars yourself, up front, in code. There are no aggregate operators. You must maintain secondary indexes. There's no searching. There are no distributed queries across partitions. There's no GROUP BY or ORDER BY. There are no cursors for easy paging through result sets. Returning even 100 large records at a time may time out. There may be quotas that are very restrictive because they must limit the amount of IO for any one operation. Query languages may lack expressive power. The biggest problem of all is that transactions cannot span arbitrary boundaries: there are no ACID guarantees beyond a single record or small entity group. Once you wrap your head around what this means for the programmer, it's not a pleasant prospect at all. References must be manually maintained. Relationships must be manually maintained. There are no cascading deletes that act correctly during a failure. Every copy of denormalized data must be manually tracked and updated, taking into account the possibility of partial failures and externally visible inconsistency. All this functionality must be written manually, by you, in your code. While the flexibility to write your own code is great in an OLAP/map-reduce situation, declarative approaches still cover a lot of ground and make for much less brittle code. What you gain is the ability to write huge quantities of data. What you lose is complacency. The programmer must be aware at all times that they are dealing with a system where it costs a lot to perform distributed operations and failure can occur at any time. All this may be the price of building a truly scalable and distributed system, but is this really the price you want to pay?

21 February 2018