Backend Design
Properties of Systems
Real-World Performance. You should be familiar with the speed of everything your computer can do, including the relative performance of RAM, disk, SSD and your network.
Concurrency. Do you understand threads, deadlock, and starvation? Do you know how to parallelize algorithms? Do you understand consistency and coherence?
Networking. Do you roughly understand IPC and TCP/IP? Do you know the difference between throughput and latency, and when each is the relevant factor? (Fallacy of distributed computing #2: People assume latency is zero. Ignorance of network latency, and of the packet loss it can cause, induces application- and transport-layer developers to allow unbounded traffic, greatly increasing dropped packets and wasting bandwidth.)
(Fallacy of distributed computing #4: People assume the bandwidth is infinite.) Ignorance of bandwidth limits on the part of traffic senders can result in bottlenecks.
Availability and Reliability. (Fallacy of distributed computing #1: People assume the network is reliable.) Are you thinking about how things can fail, especially in a distributed environment? Do you know how to design a system to cope with network failures? Do you understand durability?
[e.g. Software applications are often written with little error handling for networking errors. During a network outage, such applications may stall or wait indefinitely for an answer packet, permanently consuming memory or other resources. When the failed network becomes available again, those applications may also fail to retry any stalled operations or may require a (manual) restart.]
Abstraction. You should understand the systems you’re building upon. Do you know roughly how an OS, file system, and database work? Do you know about the various levels of caching in a modern OS?
The interviewer is usually interested in: the choice of resources like databases, storage, and logging; the use of multithreading, message queues, caches, and efficient algorithms for the required functionality; identifying all the data and classes that would be important to store if anything goes wrong in the system; and which functionality is exposed to the end user and which is not, so that users cannot break the whole system.
MySQL query cache
Memcached: save the results of a query in RAM. Use an LRU mechanism for eviction/garbage collection (a minimal sketch follows).
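A minimal LRU cache sketch in Python (illustrative only, not Memcached's actual API; the class name and capacity are made up): recently used keys move to the end of an ordered dict, and the least recently used key is evicted when the cache is full.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self.data: OrderedDict[str, str] = OrderedDict()

    def get(self, query: str):
        if query not in self.data:
            return None                    # cache miss -> caller hits the DB
        self.data.move_to_end(query)       # mark as recently used
        return self.data[query]

    def put(self, query: str, result: str) -> None:
        self.data[query] = result
        self.data.move_to_end(query)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=2)
cache.put("SELECT * FROM users WHERE id=1", "alice")
cache.put("SELECT * FROM users WHERE id=2", "bob")
cache.get("SELECT * FROM users WHERE id=1")            # refresh id=1
cache.put("SELECT * FROM users WHERE id=3", "carol")   # evicts id=2
print(cache.get("SELECT * FROM users WHERE id=2"))     # None
```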
Redundancy
Complacency regarding network security results in being blindsided by malicious users and programs that continually adapt to security measures.
Bloom filters: a probabilistic, space-efficient data structure used to tell whether an element is a member of a set. It can have false positives but no false negatives.
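A minimal Bloom filter sketch (the size and the SHA-256-based hashing are arbitrary choices for illustration): k hash functions set k bits per added element, so membership tests can return false positives but never false negatives.

```python
import hashlib

class BloomFilter:
    def __init__(self, size: int = 1024, hashes: int = 3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item: str):
        # Derive k "hash functions" by salting one hash with an index.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # True (always: no false negatives)
print(bf.might_contain("bob"))    # usually False; True would be a false positive
```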
Count-min sketch: a data structure used to estimate the frequency of events. If you have a million events and you want to track the top k events with some error tolerance, this is a space-efficient way to do it (instead of keeping all the events).
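A minimal count-min sketch (width and depth values are arbitrary): each event increments one counter per row, and the estimate is the minimum across rows, which is an overestimate with bounded error rather than an exact count.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 256, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item: str) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += 1

    def estimate(self, item: str) -> int:
        # Minimum across rows: collisions only inflate counts, never deflate.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch()
for event in ["click"] * 100 + ["scroll"] * 3:
    cms.add(event)
print(cms.estimate("click"), cms.estimate("scroll"))  # ~100, ~3
```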
Blob store (e.g. Amazon S3): used for storing very large files in the cloud.
MapReduce: parallel processing of large datasets. Map filters and sorts, while reduce aggregates or summarizes. Implementation: Hadoop.
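A toy in-process word-count sketch of the MapReduce idea (not Hadoop's API): map emits key-value pairs, a shuffle groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

def map_phase(doc: str):
    # Map: emit (word, 1) for every word in the document.
    for word in doc.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: aggregate the values for one key.
    return key, sum(values)

mapped = [pair for doc in documents for pair in map_phase(doc)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(result)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```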
Virtual machines are a way of giving you an OS on top of shared hardware so that you feel you own the hardware. Containers are a way of running an application and its dependencies in an isolated environment.
Sessions and cookies
Cookies are a way to handle sessions. Sessions are needed because the REST design of APIs is stateless: the server is not supposed to distinguish one request from another. The server stores whatever info it cares to maintain between requests (like the logged-in user’s username and/or ID, for example) and assigns that information an ID (called a “session ID”). It then tells the browser that session ID, in such a way that the browser can hand the ID back when it’s time to make another request.
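A minimal sketch of server-side sessions (helper names are hypothetical; a real app would use a framework's session support): the server keeps per-user state under a random session ID and the browser echoes the ID back as a cookie on every subsequent request.

```python
import secrets

sessions: dict[str, dict] = {}   # session_id -> server-side state

def login(username: str) -> str:
    """Returns the value to send back in a Set-Cookie: session_id=... header."""
    session_id = secrets.token_urlsafe(32)
    sessions[session_id] = {"username": username}
    return session_id

def handle_request(cookie_session_id: str | None) -> str:
    """A 'stateless' handler recognizes the user only via the session store."""
    state = sessions.get(cookie_session_id or "")
    return f"hello {state['username']}" if state else "please log in"

sid = login("alice")
print(handle_request(sid))     # hello alice
print(handle_request(None))    # please log in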
https://stackoverflow.com/questions/3521311/how-does-iis-recognize-different-sessions-in-net/3521393#3521393
https://stackoverflow.com/questions/13206870/what-does-rests-principle-of-statelessness-actually-means
Hardware:
A data centre has racks, and racks have hosts. Latencies are involved at each level.
What happens if a rack/data centre goes down?
Limitations and their effect on latency and throughput:
CPU
Hard drive
Network resources
Random (try to avoid on disks) vs. sequential (good for disks) reads and writes.
- Most applications are built by layering one data model on top of another (services and their APIs -> data structures and their APIs -> JSON/XML -> bytes in memory -> electrical pulses, etc.). Every data model embodies assumptions about how it is going to be used.
Properties of a storage system
- Performance: response times, usually expressed in terms of percentiles (not averages) together with a definition of the workload (e.g. 95% of simple reads are served under 1 second and 99% of simple writes are served under 2 seconds). You can usually improve performance on both reads and writes by buffering writes and caching reads. Writes can be buffered to avoid overwhelming your DB at peak times and to shield users from long write-latency spikes. Just like caching, this introduces dependencies on additional complex systems, e.g. your write buffers need to be fault-tolerant to avoid losing writes.
- Scalability/Elasticity: how many concurrent users can you handle? Sometimes this is expressed as requests per second. For session-heavy databases, you may well hit a connection or session limit before you hit the RPS limit. A good DB supports read scaling (e.g. 10 read slaves for a write master) for read-heavy loads, write scaling (e.g. sharded or horizontally partitioned data sets), and the use of lock-free data structures and algorithms. Elasticity: the ability to scale clusters horizontally by simply adding additional nodes.
- Security
- Fault Tolerance: the system should continue to operate in the presence of individual machine failures with no operator intervention required. What types of failures can this system withstand (e.g. ceil(n/2 - 1) crash failures)?
- Ease of debugging: your system might be a little more difficult to debug unless you collect logs in one place and it’s possible to follow a single user across multiple components.
- Monitoring: it’s possible for your system to fail in ways that weren’t possible before. For instance, there might be a network issue between two specific nodes, and it’s hard to figure that out unless you have proper monitoring in place.
- Payment-related services had hourly or daily batch jobs. Anything with big resource spikes probably shouldn’t share a machine with latency-sensitive code (like online payment processing or just web handlers).
- One service needed a large in-process cache in order to deliver good performance; it would have consumed too much memory on each instance of the monolith.
- Some services used large ML models, which also would have consumed too much memory on monolith instances.
- CPU vs. memory is one dimension of contention, but so are being blocked on IO and network saturation.
Modularity. This makes the application easier to understand, develop, and test, and more resilient to architecture erosion.[1] It parallelizes development by enabling small autonomous teams to develop, deploy, and scale their respective services independently.[2] It also allows the architecture of an individual service to emerge through continuous refactoring.[3] Microservices-based architectures enable continuous delivery and deployment.
It can be seductive to over-fragment a system, and suddenly have hard-to-debug microservices for all those little pieces, when really all that’s needed in a case like this is a front-end app that does login, sessions, billing, and everything except the rendering work, which goes off to a queue-based worker service.
- Availability/yield = % of requests served, or the probability of a request getting a response. Traditional relational databases chose consistency over availability.
- Consistency: ACID consistency, read-after-write consistency, as well as eventually consistent reads.
- Harvest = % of the complete data reflected in each request/response.
- To an end user, a series of operations are performed as if they were performed on a single machine.
- Guaranteed secondary index consistency with the base data. (Secondary indexing is needed to enable text and attribute search, etc.)
- Transactional updates within a single partition
- Partition tolerance: messages are dropped between one part of the system and another. In practice, partitions cannot be avoided, so partition tolerance is not really optional.
- Distributed: The ability to distribute a single database across multiple nodes in a cluster. The number of partitions as well as the replication factor should be configurable per database
- ACID compliance vs. BASE compliance
- Schema Evolution: Schema evolution should be supported with zero downtime and no coordination with DBA teams
- Change Capture Stream: Changes to the primary data store should be forwarded to downstream consumers via a change capture stream.
- Bulk Ingest: It should be possible to produce datasets offline (e.g. on HDFS via Hadoop) and import the data for online serving.
- Optimistic vs pessimistic locking. https://stackoverflow.com/questions/129329/optimistic-vs-pessimistic-locking
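A minimal sketch of optimistic locking with a version column (table and column names are made up; sqlite3 is used only to make the SQL concrete): read without taking a lock, then update only if the version hasn't changed, and retry or report failure otherwise.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INT, version INT)")
db.execute("INSERT INTO accounts VALUES (1, 100, 0)")

def withdraw(amount: int) -> bool:
    # Read without a lock, remembering the version we saw.
    balance, version = db.execute(
        "SELECT balance, version FROM accounts WHERE id = 1").fetchone()
    # Write only if nobody changed the row in the meantime.
    cur = db.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 "
        "WHERE id = 1 AND version = ?",
        (balance - amount, version))
    return cur.rowcount == 1   # False -> someone else won the race; retry

print(withdraw(30))   # True
print(withdraw(30))   # True (each attempt re-reads the current version)
```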
Other dbs:
- Hierarchical
- Data as a tree of records nested within records (JSON-like). Didn’t support joins, which made many-to-many relationships difficult.
- IBM’s IMS was used for stock keeping for the Apollo space program.
- Network-model and XML-based databases.
Relational vs. document databases, property by property:
- TL;DR
- Relational: better support for the joins needed to resolve many-to-one and many-to-many relationships.
- Document: schema flexibility, better locality performance (but not really), and less object-relational impedance mismatch in one-to-many relationships.
- Data Model
- Relational: relations (tables in SQL), each relation an unordered collection of tuples (rows in SQL). Foreign keys early on (Oracle); support for queryable and indexable multivalued data and JSON in a column came later (PostgreSQL, IBM DB2, MS SQL).
- Document: documents (more dynamic and expressive than the restricted relational model?). Non-queryable structured datatypes like XML and JSON documents in a column; the application interprets the content.
- Schema Evolution (example: splitting a fullName column into separate first-name and last-name columns)
- Relational: ALTER TABLE users ADD COLUMN firstName text; UPDATE users SET firstName = split_part(fullName, ' ', 1); UPDATE users SET lastName = split_part(fullName, ' ', 2); Schema changes have a bad reputation of being slow and requiring downtime. That is not true, except in MySQL, which copies the entire table. Running UPDATE on a large table is likely to be slow on any DB; if that is not acceptable, leave the new field as null and fill it only on read, just as we do in NoSQL. Schemas are useful for documenting and enforcing structure.
- Document: if (user && user.fullName) { user.firstName = user.fullName.split(" ")[0] } Schema-on-read, as the app assumes some schema when reading documents. Advantageous when heterogeneous objects are to be stored and it is not practical to have a separate table for each, or when the structure of objects is determined by external systems which may change.
- Locality
- Relational: if you want to fetch a profile, you either have to query multiple tables on user_id or do a multi-table join, i.e. multiple index lookups and disk seeks. Locality is provided by features in some DBs: Google’s Spanner offers the same locality in a relational model by allowing the schema to declare that a table’s rows are to be interleaved within a parent table; Oracle has a multi-table index cluster tables feature; the column-family concept in Bigtable (used in Cassandra and HBase).
- Document: documents are stored as a single continuous string encoded as JSON, XML, or a binary variant. All the info is in one place; one query is sufficient. This is an advantage only if you need large parts of the document at the same time, because the DB needs to load the whole document even if only a small part is needed. On updates, the entire document needs to be rewritten, so it is recommended to keep documents small and avoid writes that increase their size. These requirements significantly reduce the set of situations in which document stores are useful.
- Consistency
- Relational: strongly consistent (reads will see the latest write).
- Document: eventually consistent (reads will see some write, which might not be the latest write).
- Application-Level Complexity
- Relational: shredding a document into multiple tables can lead to cumbersome schemas and unnecessarily complicated application code, but it is good for highly interconnected data which is not self-contained.
- Document: if the data is document-like (a tree of one-to-many relationships, where typically the entire tree is loaded) and many-to-many relationships are not needed, then it is not very complex, because we don’t need joins and so don’t need to emulate them in app code. But deeply nested documents can cause difficulties (“the second item in the list of positions for user X” vs. directly referring to a position ID). For many-to-many there are two options: denormalize, with additional work to keep the data consistent, or normalize and emulate joins in app code. Both are complex, and the latter is slower because of multiple round-trip queries to the DB.
- Relationships Between Entities
- Relational: good for many-to-one (one LinkedIn user has worked at multiple organizations) or many-to-many relationships (e.g. a LinkedIn resume where one user can belong to multiple organizations and one organization can be associated with multiple users; if we want to query the latter, we need to join with the user table). As applications grow, data tends to become interconnected, which needs support for joins (e.g. a recommendation feature on a user profile that includes the recommender’s photo should update when the photo changes).
- Document: good for one-to-many, entities which are naturally grouped together (like experiences, education, etc. in a resume document on LinkedIn, or an event log). But difficult to refer to deeply nested structures.
- Fault Tolerance: ….
- Concurrency: ….
- Applications
- Relational: business data processing: transactions and batch processing. But it generalizes well: social networks, publishing, e-commerce, games, SaaS apps.
- Document: higher scalability than relational, larger data and higher-throughput apps. Key-value store: applications that require key-value access but not a timeline-consistent change-capture stream and the other features provided by an RDBMS. NoSQL document store: if your data has secondary structure, like JSON or XML, and is self-contained rather than subdivided. Wide-column store: one row can have many different types of data, with a flexible number of columns per row. Self-contained data like a resume.
- Examples
- Relational: commercial databases. Convergence: PostgreSQL and MySQL support JSON and XML and querying inside them.
- Document: mostly free and open source. Key-value stores like Voldemort. There’s some convergence: MongoDB supports client-side joins to reduce complexity, but still needs to make network round trips; RethinkDB also supports joins in its query language. MongoDB, CouchDB, RethinkDB, Espresso.
Polyglot Persistence
Keep some of your data in Mongo, and other data that shows relations between objects in a graph database.
What this essentially means is: there’s no one DB that gives you the best results for all of the features you want to build.
For example: if you want a search box to find your real-life friends on the app, you might want to use a search index DB like Solr, Lucene or Elasticsearch.
If you want to find friends of friends who live or work in San Francisco, you might want to use a graph DB like Neo4j for its capability to search based on required links (called “edges”) with other things like work location, hometown, and friend relations.
If you want to have a big collection of posted activities by users (schema-less because a post could have many types and attributes), you might want to use a document store like CouchDB or MongoDB.
What you should take away is: use one DB as your primary datastore (doesn’t matter, but document stores and relational databases are pretty popular) and use a selection of other DBs which you can replicate data towards, for better query-ability depending on the feature.
It’s common for these companies to use entirely different stacks to implement different graph features on the same graph data.
For example, Facebook Graph Search system is entirely decoupled from Facebook’s newsfeed system, even though they are operating on the same (at least logically) graph data. The reasons for using different stacks are:
(a) Difference in latency criteria,
(b) Business impact of the features,
(c) Modular design for ease of maintenance, and so on.
So, if we look carefully, all these social networks are using graph database concepts, but the graph database abstraction is often hidden behind their highly optimized implementations. The graph tends to be an overlay in these cases, decoupled from fit-for-purpose KV, column, and document stores. KV stores with rich value fields can capture what’s in a graph by loading whatever data is associated with a key into the value column. Structure can inhibit flexibility and speed, hence the use of KV and column stores.
Transaction processing (stock keeping in warehouses) vs. batch processing (reporting, payroll, invoicing)
All Use Cases
OLTP VS OLAP
Storage Engine for analytics:
Benefits
- Optimized for analytics access patterns (read heavy, indexing algorithms that allow aggregation)
- Doesn’t interfere with OLTP systems
Column-oriented storage, compression with bitmaps, and sorting on frequently queried columns can optimize read times.
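A minimal sketch of bitmap encoding plus run-length compression for a column (toy data, not any particular engine's format): one bitmap per distinct value, so predicates become cheap bitwise operations and long runs compress well.

```python
from itertools import groupby

column = ["DE", "US", "US", "DE", "FR", "US", "US", "US"]

# One bitmap per distinct value: bit i is 1 if row i has that value.
bitmaps = {
    value: [1 if v == value else 0 for v in column]
    for value in set(column)
}

def run_length_encode(bits):
    """Store (bit, run length) pairs instead of the raw bitmap."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

for value, bits in sorted(bitmaps.items()):
    print(value, bits, run_length_encode(bits))
# A query like `country IN ('DE', 'FR')` becomes a bitwise OR of two bitmaps.
```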
Encoding
How to choose an encoding for your data?
Converting a data structure into a byte representation, to use in dataflows
Why encode?
- Pointers, as they refer to actual memory addresses, don’t make sense once data leaves a particular environment.
- Encoded formats can be more compact and faster to process for particular use cases.
- The choice of encoding determines whether our architecture is evolvable or not:
- it allows rolling upgrades (different nodes running different versions of the application). It is important that the encoding provides backward and forward compatibility.
- it matters for the use of microservices.
Three types of encoding:
- Programming-language built-in libraries (java.io.Serializable): very convenient because they allow in-memory data structures to be saved and restored with minimal additional code, but a bad idea for anything other than transient purposes:
- interoperability problems with other languages: you get locked in.
- security problems: the decoding process needs the power to instantiate arbitrary classes (because it needs to restore in-memory objects), so an attacker who can feed the application an arbitrary byte sequence can instantiate arbitrary classes, which often enables remote code execution.
- versioning and efficiency are an afterthought; Java’s Serializable performs badly.
- they fail to provide forward and backward compatibility.
- Standardized textual encodings can be written and read by many languages, but take up more space and are less efficient than binary.
- JSON: built-in support in web browsers. Powerful schema languages exist for detailed validation rules (e.g. “the string value of this field matches this regular expression” or “the integer value of this field is between 0 and 100”).
- cannot differentiate between integers and floating point, and doesn’t specify precision. E.g. integers greater than 2^53 cannot be exactly represented in IEEE 754 double-precision floating point, so they lose precision when parsed by languages like JavaScript (which uses that format for all numbers). Twitter uses a 64-bit number to represent a tweet ID, and its JSON API responses include the ID twice: once as a JSON number (which cannot represent it precisely) and once as a string (see the sketch after this JSON item).
- only supports Unicode character strings, not binary strings. People encode binary data as Base64 characters to work around this; this hack increases the data size.
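A small sketch of the precision problem above, simulating a parser that, like JavaScript, maps every JSON number to an IEEE 754 double (Python's own json module keeps integers exact, so the double behaviour is forced via parse_int):

```python
import json

tweet_id = 2**53 + 1          # 9007199254740993, just above 2^53

as_json = json.dumps({"id": tweet_id, "id_str": str(tweet_id)})

# Force every JSON integer through float, i.e. an IEEE 754 double.
parsed = json.loads(as_json, parse_int=float)

print(int(parsed["id"]))      # 9007199254740992  <- precision lost
print(parsed["id_str"])       # "9007199254740993" <- the string survives intact
```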
- XML: powerful schema languages for detailed validation rules (e.g. “the string value of this field matches this regular expression” or “the integer value of this field is between 0 and 100”).
- too verbose, unnecessarily complicated.
- cannot differentiate between numbers and strings.
- only supports Unicode character strings, not binary strings. People encode binary data as Base64 characters to work around this; this hack increases the data size.
- CSV: less powerful than JSON and XML. Somewhat vague (what happens when an entry contains a comma or a newline character?). Cannot differentiate between numbers and strings. Doesn’t support schemas: it’s up to the application to decide the meaning of each row and column, and changes in schema interpretation have to be handled manually.
- Binary: more efficient, i.e. faster to parse and more compact. For terabytes of data, the gains can be significant. Good for internal use in an org, where there is less pressure to use a standard format.
- Schemaless binary encodings still need to store object field names, so they are only a little smaller than the textual form (66-byte example). It is not clear such a small space reduction is worth the loss of human readability.
- JSON based: BSON, BJSON, BISON, MessagePack
- XML based: WBXML, Fast Infoset
- Schema-based encodings with schema languages. Restriction (just like SQL databases): we need to support schema evolution. But benefits: (1) more compact, since they can omit field names from the encoding; (2) self-documenting; (3) if we keep a database of schemas, we can check backward and forward compatibility before making changes; (4) code generation from schemas helps in statically typed languages: it allows efficient in-memory data structures for decoded data, type checking at compile time, and autocompletion in IDEs when writing programs that access these data structures; (5) with schema languages simpler than those of JSON and XML, they are simpler to implement and simpler to use.
- ASN.1: a schema definition language standardized in 1984. Its binary encoding is used in SSL certificates. Badly documented, very complex.
- Proprietary encodings, e.g. relational databases have network protocols for sending queries and getting responses, decoded by vendor-provided drivers.
- Protocol Buffers. only one type of encoding (33 bytes).
- no list or array datatypes, but a repeated marker for fields. Good: flexibility, you can convert a single-valued field into a multi-valued one.
- Thrift (field names replaced by field tags):
- List with a subtype parameter. Good: allows nested lists.
- BinaryProtocol (59 bytes)
- CompactProtocol (34 bytes): packs the field tag and type into the same byte, and uses variable-length integers (the top bit of each byte says whether another byte follows, so numbers between -64 and 63 fit in one byte and numbers between -8192 and 8191 fit in two bytes; see the sketch below).
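A minimal sketch of zig-zag plus base-128 varint encoding (not the actual Thrift implementation), which is where the one-byte range of -64..63 comes from:

```python
def zigzag(n: int) -> int:
    """Map signed ints to unsigned so small magnitudes stay small:
    0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... (assuming 64-bit semantics)."""
    return (n << 1) ^ (n >> 63)

def encode_varint(n: int) -> bytes:
    """7 data bits per byte; the top bit says 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit set
        else:
            out.append(byte)          # last byte
            return bytes(out)

for value in (0, -1, 63, -64, 64, 8191, -8192):
    print(value, len(encode_varint(zigzag(value))), "byte(s)")
# -64..63 -> 1 byte, -8192..8191 -> 2 bytes
```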
- How do these two encodings maintain compatibility with schema changes?
- Protocol and Thrift:
- Adding a field: do not add a field tag that is required (otherwise new reader schemas reading old records would fail; not forward compatible). Do not change field tags when adding. When an old reader schema decodes new records containing a new field tag, it uses the datatype annotation to determine how many bytes to skip.
- Removing a field: do not remove required field tags (otherwise old reader schemas reading new records would fail; not backward compatible). Do not change field tags when removing fields (do not reuse a removed field tag).
- Renaming a field: do not change field tags when renaming.
- Changing the datatype: possible, but with a risk of losing precision (an old 32-bit integer schema trying to decode a new 64-bit integer truncates it).
- Dynamic schema generation must be done manually, e.g. database column names have to be mapped to field tags carefully whenever the database schema changes.
- Avro (32 bytes): requires the encoder and decoder BOTH to know the schema. How? It depends on the context: (1) with large records sent in a file, the top of the file can contain the schema; (2) in a database context, a schema version number column can be attached to each record, plus a table mapping version number (an integer or a hash of the schema) to schema; (3) for records sent over a network, the schema version is negotiated at connection setup and persists for the lifetime of the connection.
- Avro has union types (used for null) and defaults instead of required/optional, and code generation is optional.
- How to maintain compatibility with schema changes? Writer and reader schemas don’t have to be the same, only compatible. Forward compatibility means having the old version of the schema as reader and the new one as writer; backward means having the old one as writer and the new one as reader.
- Adding a field: You can only add a field which has a default value: this is used to fill in when a field is missing.
- Removing a field: you can only remove a field which has a default value.
- Renaming a field: tricky. The reader schema can have aliases (so when we add a new name we can add the old name as an alias). Backward compatible but not forward compatible.
- Changing the datatype: possible, provided Avro can convert the type.
- Adding a branch to a union type: backward compatible but not forward compatible.
- No tag numbers, only field names. This allows dynamic schema generation, e.g. even if the database schema changes, we just interpret the column names as field names.
Three types of dataflow: via databases, via service calls (REST/RPC; see the links below), and via asynchronous message passing.
https://stackoverflow.com/questions/5449034/what-is-the-difference-between-rest-and-http-protocols
http://www.looah.com/source/view/2284
https://news.ycombinator.com/item?id=18702495
https://stackoverflow.com/questions/41141577/graphql-or-rest
- Asynchronous Message Queues
Asynchronous message flows fall somewhere between network dataflows and database dataflows:
- client’s request (message) goes to another process with low latency, just like in network flows
- client’s request goes through an intermediary which stores messages temporarily (just like db)
Benefits
- improve reliability: they act as a buffer if the recipient is unavailable or overloaded, and prevent messages from getting lost by redelivering messages to a process that has crashed.
- allow the sender to not know the IP address and port number of the recipient (useful in cloud deployments where IPs and port numbers are not fixed).
- allow one message to be sent to several recipients.
- decouple senders and recipients. They don’t need to know about each other (pub-sub pattern).
- Message queues receive, hold, and deliver messages. They don’t enforce any data model: a message is just a sequence of bytes with some metadata, so any encoding format can be used.
- If an operation is too slow to perform inline, you can use a message queue with the following workflow (a minimal sketch follows at the end of this section):
- An application publishes a job to the queue, then notifies the user of job status. The user is not blocked and the job is processed in the background. During this time, the client might optionally do a small amount of processing to make it seem like the task has completed.
- For example, if posting a tweet, the tweet could be instantly posted to your timeline, but it could take some time before your tweet is actually delivered to all of your followers.
- A worker picks up the job from the queue, processes it, then signals the job is complete
- A consumer of one topic can publish messages to another topic. But be careful about preserving unknown fields.
- A reply queue can be consumed by the sender of the original message (allowing a request/response dataflow).
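A minimal sketch of the publish-then-process-in-background workflow above, using Python's in-process queue as a stand-in for a real broker (RabbitMQ, SQS, Kafka, ...); the job format and function names are illustrative.

```python
import queue
import threading

jobs: queue.Queue = queue.Queue()

def publish(job: dict) -> None:
    """The web handler enqueues the job and returns immediately."""
    jobs.put(job)
    print(f"accepted job {job['id']} (user is not blocked)")

def worker() -> None:
    """A background worker picks up jobs, processes them, and marks them done."""
    while True:
        job = jobs.get()
        # ... do the slow work here (fan out a tweet, send email, ...) ...
        print(f"processed job {job['id']}")
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
publish({"id": 1, "type": "fan_out_tweet"})
jobs.join()   # wait for the background work to finish (for the demo only)
```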
Concurrency in a single process:
- threads (comes with associated problems of race conditions, locking, deadlock)
- actor programming model (actors with some local state that send and receive messages).
- assumes that message delivery is not guaranteed, so there is less of a fundamental mismatch between calls within the same program and remote calls.
- so location transparency works better.
- only one message processed at a time by each actor
- scheduled by the framework
Distributed Actor frameworks: message broker + actor programming model.
How to ensure compatibility with rolling upgrades, when messages are encoded?
- Akka uses Java’s serialization by default, but you can use Protocol Buffers.
- Orleans uses a custom data encoding format and does not support rolling upgrades; you need to set up and migrate to a new cluster.
- Erlang OTP: possible, but needs planning.
Database Design
- How to recover in case of crashes?
- How to make reads fast?
- Simplest database: append every write to the end of a file.
- log-structured
- Compaction and garbage collection takes place as a background process
- B-Trees
Distributed Databases
Replication
- Why replication?
- Availability: if some parts of the system fail.
- Latency: To keep data geographically close to users
- Scalability/Increase throughput on read: scaling out number of machines when load is high on read queries
- Disconnected Operation (Partition Tolerance?)
- Replication vs Sharding
- Sharding is partitioning; replication is copying.
- It’s all about how data changes with time. Replication is easy when data doesn’t change with time. Strategies for replicating changes:
- Single leader
- Multi leader
- Leaderless
- Synchronous vs Asynchronous
- Why synchronous? Guarantees consistent followers. Outages in the leader? No problem.
- Why not? Blocking. If a synchronous follower doesn’t respond, the write cannot be processed: the leader blocks and waits. That’s a problem, especially if followers are many or geographically distributed. Worse, the more nodes you have, the more likely and the worse the problem. (Common on the web, where reads far outnumber writes.)
- Option 1: make one of the followers synchronous and the others asynchronous.
- Option 2: make all of them asynchronous? Outages in the leader? Data loss. Weakens durability.
- How to ensure that new followers have an accurate copy of the leader’s data?
- Just copying data might not make sense, as the file-copy mechanism will see different data at different points in time (a record might be deleted or updated mid-copy, making the copy inconsistent with the leader).
- Locking the database (until the copy is complete) for consistency goes against high availability.
- Best way: take a consistent snapshot of the DB and associate it with an exact position in the leader’s replication log (called the log sequence number or binlog coordinates), as is done when taking backups. Copy the snapshot. After that, the follower requests the data that has changed since the snapshot.
- What happens when a node (leader or follower) fails?
- Follower failure: each follower keeps a log of the changes it has received from the leader, so it can recover from failures quite easily from the log: it knows the last transaction processed before the fault, and can connect to the leader and request all the data changes that occurred since the time it was disconnected.
- Leader failure: one of the followers needs to take over (failover).
- Manually handle failover
- Automate:
- How to determine that the leader is down? Timeouts, because there are lots of possible reasons for an outage (power failure, network failure, crashes, etc.).
- How to elect a new leader? An election, or let a previously chosen controller node appoint one, etc. Best candidate? The node with the latest data. How to get nodes to agree? A consensus problem.
- Clients have to be reconfigured to reroute their writes. If old leader comes back, the clients might end up writing to that instead. System needs to ensure old leader steps down
- Problems with automation:
- Data loss due to inconsistency between the old leader (when it comes back) and the new leader, especially when replication is asynchronous. Violates clients’ durability expectations.
- Dangerous if other storage systems coordinate with the database. E.g. a GitHub incident: an out-of-date follower was promoted, which used an autoincrementing counter to assign primary keys and reused some primary keys already assigned by the old leader. Those primary keys were also used in a Redis store, so the reuse resulted in inconsistency between Redis and MySQL and private data being disclosed to the wrong users.
- Split brain between the old leader and the new leader: if both accept writes, then conflicts, data loss, and corruption.
- Safety catch: Automatically detect and shut down one of them. Without care, both might end up shutting down.
- If timeout is due to a temporary load spike, then declaring dead can make things worse. Deciding an appropriate timeout is a problem.
- How is single leader replication implemented?
- Whenever the leader writes new data to its storage, it also sends the data to all the followers as part of a replication log or change stream. Each follower takes the log and applies the updates in the same order.
- Statement-based (earlier MySQL versions): sends every write statement (e.g. INSERT, UPDATE, or DELETE), and each follower parses and executes the statement as if the client had sent it directly. Common problems:
- if the statement contains a non-deterministic call like NOW()/getTime or RAND()
- if the statement has side effects local to the database (stored procedures)
- if the statement depends on other entries in the DB (e.g. autoincrement), the statements must be executed in the same order. This can be limiting when there are multiple concurrently executing transactions.
- Workaround: the leader replaces non-deterministic function calls with their fixed return values when the statement is logged. VoltDB works by requiring transactions to be deterministic.
- Write-Ahead Log Based (PostgreSQL, Oracle):
- DBs usually store every write in an append-only sequence of bytes (this is the main place of storage in log-structured databases, and a crash-recovery mechanism for B-tree databases). We can send this log to followers for replication.
- Problem: couples the follower to the leader’s database version, because the WAL contains data in a low-level format which might change between versions. Operational impact: zero-downtime upgrades normally work by upgrading the followers first and then doing a failover; if the WAL format is version-coupled, that isn’t possible and upgrades require downtime.
- Logical log based replication (MySQL’s binlog): maintain a separate log in a standardized format, decoupled from storage internals
- for a relational db, this is usually at the granularity level of a row (as opposed to a byte)
- For Inserts, all column values are stored
- For Delete, whatever info is necessary to uniquely identify the row suffices (primary key, or if absent, the old values of all columns)
- For updates, whatever info is necessary to identify the row and the updated values (e.g. primary key of the row plus all/new column values)
- Benefit: Can be used by external applications (Change data capture)
- data warehouse for offline analysis
- building custom indexes and caches
- Trigger-based (moves replication up into the application layer). Benefits: flexibility (replicate only a subset of data, replicate to another kind of DB, or include conflict-resolution logic). Problems: greater overhead, more prone to bugs, and other limitations. Two types:
- read the DB log and make data changes available to applications (e.g. Oracle GoldenGate).
- use stored procedures which are auto-executed when a data change (write transaction) occurs. This is an opportunity to write the change to a changelog table and make it available to an external process (which can do whatever it wants: apply app logic, replicate). Used in Databus for Oracle and Bucardo for PostgreSQL.
User-facing problems with replication lag (temporary inconsistency with asynchronous replication aka eventual consistency)
- Reading your own writes (read-after-write consistency).
- Check what is being read. Is it something the user could have written? (e.g. personal profile info). Then send to leader
- if geographically distributed must be routed to the data centre that contains the leader
- If most things user reads are what they edit (e.g. note taking app), use other criteria. e.g. check last updated time of the stuff being read, if it is < 1 min, read from leader.
- cross-device: might need to first route requests from all devices to the same data centre
- The client remembers the timestamp (system clock or log sequence number) of its last write. Any query to a follower is checked against this; if the follower is behind, the query is rerouted to the leader or delayed until the follower has caught up (see the sketch below).
- cross-device read-after-write consistency would need to centralize this metadata (timestamp of last write)
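A minimal sketch of the timestamp-based routing above (helper names and the lag budget are assumptions, not any particular DB's API): reads go to the leader only if the user wrote recently.

```python
import time

REPLICATION_LAG_BUDGET = 1.0   # seconds; an assumed bound on replica lag

last_write_at: dict[str, float] = {}   # user_id -> time of their last write

def record_write(user_id: str) -> None:
    last_write_at[user_id] = time.time()

def choose_node_for_read(user_id: str) -> str:
    """Send the user to the leader if they wrote recently, else any follower."""
    if time.time() - last_write_at.get(user_id, 0.0) < REPLICATION_LAG_BUDGET:
        return "leader"
    return "follower"

record_write("alice")
print(choose_node_for_read("alice"))   # leader (her write may not have replicated yet)
print(choose_node_for_read("bob"))     # follower
```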
- User moves backward in time on successive reads (monotonic reads guarantee), e.g. disappearing comments. A stronger guarantee than eventual consistency, weaker than strong consistency.
- Send the same user to the same replica, e.g. chosen based on a hash of the user ID (sketch below). If the replica fails, reroute the query.
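A minimal sketch of pinning a user to a replica by hashing the user ID, so the same user keeps reading from the same follower (replica names are made up):

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]

def replica_for(user_id: str, replicas=REPLICAS) -> str:
    """Deterministically map a user to one replica."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

print(replica_for("alice"))   # always the same node for alice
# If that replica fails, reroute by recomputing over the surviving nodes:
survivors = [r for r in REPLICAS if r != replica_for("alice")]
print(replica_for("alice", survivors))
```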
- Violation of causality (consistent prefix reads): if a sequence of writes happens in a certain order, then anyone reading those writes is guaranteed to see the same order (e.g. seeing the correct order of a chat conversation).
- A problem only in sharded DBs: there is no global ordering of writes. Ensure that writes causally related to each other are written to the same partition; some algorithms keep track of dependencies (the happens-before relationship).
Multiple leader replication
- Benefits
- Better performance on writes than single leader.
- Better tolerance of network failure between data centres
- Better handling of outages: each data centre can operate independently by doing local failovers.
- Offline devices (WhatsApp or a calendar app). Multiple devices with local databases are equivalent to a multi-leader architecture with each device as a data centre with an extremely unreliable network connection. Replication lag might be days or months. (CouchDB is designed for this mode of operation.)
- Collaborative Editing (Google Doc).
- the unit of change is very small (one keystroke) and locking is avoided
- Problems
- Write conflicts. In theory you could do synchronous conflict detection: accept writes only after checking with the other leaders. But then you would lose the main advantage of a multi-leader architecture: allowing each leader to accept writes independently. Might as well use a single leader (lock the DB before another user can edit it). Solutions (at the level of a row, not a transaction; a transaction may make several writes):
- Avoid them. Make sure the requests from one user are only sent to one leader in one data centre (just like the resolutions for reading your own writes, you avoid conflicts by writing to only one DB).
- Converge:
- Give each write a unique ID (timestamp, UUID, or hash of key and value) and pick the write with the highest ID (last write wins).
- Or give each replica a unique ID and pick the write that originated at the replica with the highest ID.
- Merge records and present both.
- Record conflicts in an explicit data structure that preserves all the info, and write application code that resolves the conflict (e.g. like Git?)
- on write: the code is executed as soon as the conflict is detected; it typically cannot prompt the user.
- on read (CouchDB): prompt the user or automatically resolve the conflict.
- Automatic resolution
- Conflict-free replicated datatypes (CRDTs): a family of data structures designed for concurrent modification by multiple users that automatically resolve conflicts in a sensible way. Implemented in Riak. Two-way merge function.
- Mergeable persistent data structures: three-way merge function.
- Operational Transformations: designed for concurrent modification of an ordered list of items.e.g. Google Docs
- Because conflict resolution is retrofitted in many DBs, multi-leader replication is considered dangerous (surprising interactions with DB features like autoincrementing keys, triggers, and integrity constraints).
- How to prevent replication loops in multi-leader topologies (star, circular) where a write travels through multiple nodes before it reaches all replicas? Tag replication log entries with the unique IDs of the nodes they have passed through, and check them before processing.
- Node outages in star and circular topologies: use denser topologies without single point of failure
- A denser topology like all-to-all has the problem that some paths might be faster than others. Leader 1 gets an insert; Leader 3 gets the insert and then an update to that insert; but Leader 2 gets the update before it gets the insert. Problem! PostgreSQL BDR does not provide causal ordering of writes, and Tungsten Replicator for MySQL doesn’t try to detect conflicts.
- Can putting timestamps on the changelog entries solve the problem? When a leader receives correct timestamps it can order the changes correctly, but clocks cannot be trusted to be in sync.
- Version vectors for concurrent writes.
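A minimal version-vector sketch (replica names are made up): one version happened before another iff it is less than or equal in every component; otherwise the two writes are concurrent and become siblings that need conflict resolution.

```python
VersionVector = dict[str, int]   # replica id -> counter

def bump(vv: VersionVector, replica: str) -> VersionVector:
    """A replica increments its own counter when it processes a write."""
    new = dict(vv)
    new[replica] = new.get(replica, 0) + 1
    return new

def happened_before(a: VersionVector, b: VersionVector) -> bool:
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

def concurrent(a: VersionVector, b: VersionVector) -> bool:
    return not happened_before(a, b) and not happened_before(b, a)

v1 = bump({}, "leader1")        # {"leader1": 1}
v2 = bump(v1, "leader2")        # causally after v1
v3 = bump(v1, "leader3")        # also based on v1 -> concurrent with v2
print(happened_before(v1, v2))  # True
print(concurrent(v2, v3))       # True -> siblings, needs conflict resolution
```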
Leaderless (Dynamo, Riak, Cassandra, Voldemort): all replicas accept writes.
Designed to operate with tolerance for conflicting writes (consistency), latency spikes (availability), and network interruptions (partitions).
- Approach 1: clients directly send writes to several replicas.
- Approach 2: A designated coordinator node replicates writes everywhere for clients.
How about when a node is down?
- Stale responses to clients on reads. Solution?
- Send read requests in parallel and choose the latest version of the record.
- How does node catch up?
- Read repair: clients write back the newer value on detecting a stale response.
- Anti-entropy: a background process which copies data between replicas. Different from copying based on replication logs: (1) the order of writes is not important and (2) there may be a significant delay before data is copied. Voldemort does not have anti-entropy; without it, values which are rarely read may be missing from some replicas (reduced durability).
How many nodes can be down? Under what conditions is a read guaranteed to return the latest value? (Consistency)
In general, in a leaderless architecture one does not get consistency guarantees similar to leader-based architectures (read-after-write consistency, monotonic reads, consistent prefix reads). The closest one gets is quorum consistency, which ensures that at least one node overlaps between the set you read from and the set you wrote to. But even quorums have limitations (see below).
- Quorum consistency: read and write with values that obey r + w > n (called quorum reads and writes: the minimum numbers of read and write responses needed for an operation to be valid, where n is the number of nodes on which any given value is stored, which might be fewer than the total number of nodes, as is the case with sharding). A sketch follows below.
- Common configuration: n odd, with w = r = (n + 1) / 2 (a majority).
- If writes are rare and reads are frequent, you can set w = n, r = 1. This gives low latency on reads and higher latency on writes, and writes fail if any single node is unavailable.
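A minimal sketch of the quorum condition r + w > n, with a brute-force check that every possible read set overlaps every possible write set:

```python
from itertools import combinations

def quorum_ok(n: int, w: int, r: int) -> bool:
    return r + w > n

def overlap_always(n: int, w: int, r: int) -> bool:
    """Check exhaustively that any w-node write set intersects any r-node read set."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

for n, w, r in [(3, 2, 2), (5, 3, 3), (3, 1, 1)]:
    print(n, w, r, quorum_ok(n, w, r), overlap_always(n, w, r))
# (3,2,2) and (5,3,3) satisfy the condition; (3,1,1) does not, so a read
# may hit a node that never saw the write (a stale read).
```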
- Consistency limitations:
- Sloppy quorums, while assuring durability, can lead to reading stale values. Enabled by default in Riak; disabled by default in Cassandra and Voldemort.
- Conflicting concurrent writes: merge the two. (Other strategies result in data loss; last write wins loses data due to clock skew.)
- Write happens concurrently with a read. Undetermined whether read returns old value or new value.
- No rollback on writes that succeed on fewer than w nodes. If a write fails, subsequent reads may or may not return the value from that write.
- If a node carrying a new value fails and is restored from a replica carrying an old value, the number of nodes storing the new value may fall below w, breaking the quorum condition.
- Unlucky timing problems
- if r + w <= n, then we trade quorum consistency for lower latency and higher availability. The smaller r and w are, the greater the likelihood of stale reads.
Unlike in leader-based replication (where replication logs can be used to figure out how far behind a replica is), monitoring how stale the data is is difficult, because no fixed order of writes is applied. If only read repair is used (no anti-entropy), there is no limit on how stale a value can be.
Multiple-data centre operation
- n refers to all nodes. Each write is sent to all replicas, regardless of data centre, but the quorum of acknowledgements needed to mark success is only required from the local data centre. Cross-data-centre writes happen asynchronously (Cassandra, Voldemort).
- n refers to the local data centre only. Cross-data-centre replication happens in the background (Riak).
Concurrency is defined by a happens-before relationship.
- Write Conflicts
- Similar to multi-leader, plus conflicts during read repair or hinted handoff. Writes might reach different replicas in a different order.
- Simplest is to just pick one value. Last write wins? Cassandra; optional in Riak. Achieves eventual consistency at the cost of durability.
- Sometimes data loss even in the case of non-concurrent writes. Sometimes acceptable (especially in a caching situation).
- The only way to ensure data safety is to make records immutable after they are written, e.g. assign UUID keys in Cassandra (so writes which would previously have conflicted now have different UUIDs).
- To ensure no data loss, capture the happens-before relationship with versioning of writes per replica (version vectors), and make clients/the app do extra work to clean up siblings (concurrent values), e.g. take the union in the shopping-cart example (contrast with last write wins). To accommodate removals, allow tombstone markers on the server.
- Data structures which handle merging automatically in a sensible manner: Riak’s CRDTs.
Partitioning
Shard: MongoDb, ElasticSearch, SolrCloud
Region: HBase
Tablet: BigTable
VNode: Cassandra, Riak
vBucket: CouchBase
- Benefits
- Scalability. if you can make it work (unskewed), then it does offer a good shot at a linear speedup. (10 nodes can take up 10 times the data and 10 times the read and write queries)
- Strategies:
- Key range, with boundaries chosen manually or automatically: BigTable, HBase (BigTable’s open-source equivalent), RethinkDB, and MongoDB 2.4 and earlier. As data may not be evenly distributed, the ranges may not be evenly spaced.
- Good: Range queries
- Can lead to hotspots.
- Range of a hash of the key. Cassandra and MongoDB use MD5; Voldemort uses Fowler–Noll–Vo.
- Good: Fair distribution
- Bad: Lose range queries
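A minimal sketch contrasting key-range and hash partitioning (the boundaries and partition count are made up): range partitioning keeps sort order but can hotspot, while hashing spreads keys evenly but loses cheap range scans.

```python
import hashlib

NUM_PARTITIONS = 4
RANGE_BOUNDARIES = ["g", "n", "t"]   # a-f | g-m | n-s | t-z (chosen manually)

def range_partition(key: str) -> int:
    """Keeps sort order, so range scans touch few partitions, but can hotspot."""
    for i, boundary in enumerate(RANGE_BOUNDARIES):
        if key < boundary:
            return i
    return len(RANGE_BOUNDARIES)

def hash_partition(key: str) -> int:
    """Spreads keys evenly, but adjacent keys scatter: no cheap range queries."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for key in ["alice", "bob", "carol", "dave"]:
    print(key, range_partition(key), hash_partition(key))
# All four keys land in range partition 0 (a hotspot), while hashing spreads them.
```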
Pros:
- High Availability/ More resilient: even if one partition goes down, others can continue.
- Supports parallel reads and writes
- With no master database serializing writes, you can write in parallel, which increases your write throughput. Writing is a major bottleneck for many websites.
- Parallel backend means you can do more work simultaneously. You can handle higher user loads, especially when writing data, because there are parallel paths through your system. You can load balance web servers, which access shards over different network paths, which are processed by separate CPUs, which use separate caches of RAM and separate disk IO paths to process work. Very few bottlenecks limit your work.
- You won’t have the problems with replication overhead and lag, because you are writing to an appropriately sized shard rather than a single master that must replicate to all its slaves (assuming you have a lot of slaves). You will still replicate within the shard, but that is more of a fixed, reasonable cost because the number of slaves will be small. Google replicates content 3 times, so even in that case it’s more of a fixed overhead than chaining a lot of slaves together.
- A smaller index means faster searches, e.g. smaller amounts of data in each user group mean faster querying.
- Smaller individual cache sizes, more cache hits.
Cons:
- Added complexity/expense. Stretch your existing model to see if you can get some more performance out of it first. I suspect that for an application of any complexity, the cost of the programmer and sysadmin time to introduce sharding may add up to more than the cost of simply buying some bigger iron.
- Implementing shards is not well supported. Sharding is currently mostly a roll-your-own approach. LiveJournal makes their toolchain available. Hibernate has a library under development. MySQL has added support for partitioning. But in general it’s still something you must implement yourself.
- Less leverage. People have experience with traditional RDBMS tools so there is a lot of help out there. You have books, experts, tool chains, and discussion forums when something goes wrong or you are wondering how to implement a new feature. Eclipse won’t have a shard view and you won’t find any automated backup and restore programs for your shard. With sharding you are on your own.
- Application side code needed to decide which partition to route an operation to
- You will inevitably have to change your application code; you might even have to change it a lot. For example, if you split user profile data across multiple shards, you flat out lose the ability to do queries like “find me all the users with a Z in their name”, because that crosses multiple shards. You have to do either some sort of map-reduce across the shards, or have a separate index for that field off to one side. You should be able to encapsulate this in your DAOs; where you might not is if some kinds of queries become so slow that you just can’t afford to do them any more, and have to find different ways to solve the problems they solve. If your ORM has support for sharding (apparently Hibernate does), then that might help a lot, but it won’t fundamentally change the problem.
- Expensive joins if the data across partitions need to be aggregated
- I suspect that the workloads that allow linear sharding are also the ones that would fit a NoSQL store well, because of that defining characteristic of not depending on cross-shard queries, so if you’re going to do the work to do sharding, you might as well go the whole hog and put everything in whatever the hot NoSQL store today is.
- Joining data from multiple shards.
- For a social network application, where users are sharing content with others in a random meshwork, you will end up needing to do huge amounts of cross-shard work. To create a complex friends page, or a user profile page, or a thread discussion page, you usually must pull together lots of different data from many different sources. With sharding you can’t just issue a query and get back all the data. You have to make individual requests to your data sources, get all the responses, and the build the page. Thankfully, because of caching and fast networks this process is usually fast enough that your page load times can be excellent.
- Rebalancing data. What happens when a shard outgrows your storage and needs to be split? Let’s say some user has a particularly large friends list that blows your storage capacity for the shard. You need to move the user to a different shard. On some platforms I’ve worked on this is a killer problem. You had to build out the data center correctly from the start because moving data from shard to shard required a lot of downtime. Rebalancing has to be built in from the start. Google’s shards automatically rebalance. For this to work, data references must go through some sort of naming service so they can be relocated. This is what Flickr does. And your references must be invalidatable so the underlying data can be moved while you are using it.
21 February 2018