Schema Design

Relationship Types

There are three types of relationships between entities:

1 to N
- Status and Comments on that status (Facebook)
- Question and Answers on that question (Quora)
- User and Answers (Quora)
N to 1
- Status and Author (Facebook)
- Employee and Organization
- Question and User (Quora)
or reverse of 1 to N
- Comments and Status that it belongs to (Facebook)
- Organizations and Resume that refer it (LinkedIn)
- Answer and Question that it belongs to (Quora)
- Answer and User (Quora)
N to N
- Status and Tags (Instagram)
- Users and Organizations (Linkedin)

Normalization

The normalization rules are designed to enhances the integrity of the data at some possible performance cost for certain retrieval applications (data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records in the normalized form). It is biased toward the assumption that all non-key fields will be updated frequently. It prevents update anomalies (easy to change entries without going through all records) and data inconsistencies (less mistakes) and minimizes redundancy (efficient use of space). Removing duplication is the key to normalisation . There is no obligation to fully normalize all records when actual performance requirements are taken into account: so it depends what your data access patterns are going to be. It also makes sharding difficult.

A NoSQL record does not adhere to first normal form (variable number of fields), but we can still ask whether to extract individual fields into their own separate collection or not. This is called “embedding” vs “referencing”.

How to find if it can be normalized?

You can “smell” duplication, scope for inconsistencies and update mistakes by asking: Is non-key field a fact about a subset of a key or another non-key field?

Every non-key must provide a fact about the key, the whole key, and nothing but the key

The table can be normalized further when these properties are present in your (tables/collection) for some dependent non-key field:

Duplication A dependent non-key field is repeated in the record of every key which has the primary non-key.
Write Overhead If the primary non-key changes, every such record must be updated (if department address changes, all entries must be changed). We would need to do extra work in ensuring consistency on every write (e.g. we would need to ensure that the added question/status has the same author name as the name in the User field and as other questions by them)
Scope for Inconsistency Because of the redundancy, the data might become inconsistent, with different records showing different dependent non-key for the primary non-key (different address for the same deparment for different employees). There would be no single source of truth (e.g the name of the author might be present in both a Question table and the User table )
No Orphaning and Evolvability If at some point in time there are no primary non-keys, there may be no record in which to keep the dependent non-key (orphaned tags, lost department address etc.)

Which entities or fields to normalize?

Normalize until it hurts, denormalize until it works. (I.e. prefer normal forms but denormal will have better performance.)

You keep a dependent key denormalized if the relationship with the primary key will remain 1-to-N, the field/key does not support a unique representation of its values (such as free text input), the access patterns are same across the entire record, the field is not needed in isolation, and is not likely to evolve into a more complex entity. Examples include a Status and Comments, User and Location. This is true even if it is a Table or a Collection.

You normalize if the relationship is N-1 (2 table) or N-N (3 tables), a single source of truth or unique representation is desirable, the access patterns are different, the field is queried in isolation and is likely to evolve into an indepndent entity with its own attributes.

What is the Relationship? 1-N, N-1, N-N 1-N relationships are well represented as embedded. N-1 and N-N relationships, if embedded would result in denormalization and all its warts. The 1-to-N relationshop is a tree-like structure. Typically the entire tree is loaded as part of the queries. If all the data you need for one key is all in one “place”, if you know that that will always be your primary access pattern, then you don’t need joins and nosql systems may be faster for you. If you need joins, no sql systems not suitable as it would be slower (Multiple roundtrip queries to the db) and lead to more complex code (Emulate the joins in application) Example Queries for 1-to-N:

Get all comments corresponding to a given status (“get all status corresponding to a particular comment” doesn’t even make sense)

Get all answers corresponding to a question (“get all questions corresponding to an answer” doesn’t make sense either)
Is the data likely to become more interconnected? The data always tends to become more interconnected (move towards N-to-N) as application complexity increases. For example, if Linkedin originally did not have recognize an entity for Organization which different users could map to, then it would be 1-to-N, but over time, it might decide to include that, in which case the relationship would become N-to-N. We can still store the relationship in a denormalized form (e.g. store author name in every status, or array of tags within status), as in the case of a 1-to-N relationship. This would avoid joins. (However, this causes duplication and will require extra work at the time of writing to ensure that data is consistent as noted)
User interface: free text input or options? Free text suggests keeping it embedded/denormalized. Functional dependencies only exist when the things involved have unique and singular identifiers (representations) (not having them lead to data maintenance problems of their own.) functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers. If two representations of address: 123 Main St., New York or 123 Main Street, NYC, then address does not have a functional dependency.
Static vs Dynamic Updates: Are some entries higher write frequency?
Important as a standalone entity which can evolve? (e.g. orphaned tags, warehouse addresses)?
Do we need to analyze the data of the fields in isolation of its main document? then we should make it in a separate entity. 

“get all statuses posted today” (for feed): so extract Status from User

“get all comments posted today”: doesn’t make sense so keep comments as part of Status

2 Table with Foreign Keys vs 3 Table Collection with Mapping Table

Normalize with a second table

Option 1: store the foreign key (e.g store author name) in one of the tables (one knows about another). Many to one is well represented.

Option 2: Foreign keys in both table (both know about each other).

However, arrays of foreign id in the case of many-to-many makes it difficult to query or join, so not desirable. Array of foreign keys also breaks the first normal form (normalization would require a separate record or document for each id.) You also have to worry about how you’re going to keep the bidirectional references in sync, (risking anomalies where a PC thinks it uses a Part, but that respective Part has no reference to the PC)

No Reference but a separate mapping table (Both don’t know about each other). To populate queries, “give all tags of a particular status” or “give all statuses of a particular tag”, we would need to emulate two joins.

Case Study: Implementing Tags

A tag is a keyword or label that categorizes your question with other, similar questions. Using the right tags makes it easier for others to find and answer your question.

There are three major database queries which you will do:

Output all Tags for one Item
Draw a Tag-Cloud
Select all items for one Tag Title.

Hierarchal “tags” or categories in the TagTable: This is commonly needed on sites that have categories and subcategories but need the flexibility of tagging. For example, recipe sites, auto parts sites, business directories, etc. These types of data don’t usually fit into only one single category so tagging is the answer but you need to use something like the Nested Set Model or the Adjacency List Model in your Tag table.

Possible Rules triggered by the tagging API:

“use the formal company name that appears on their annual SEC filings”
“search the tags database before creating a new tag”

1-Table Implementation

PostTable

Id
Content
Tags: [tag1, tag2, tag2]

Example Query: Select all posts with a particular union of tags

SELECT - 
FROM `PostTable` 
WHERE tags LIKE "%search%" 
OR tags LIKE "%webservice%" 
OR tags LIKE "%semweb%"

Drawbacks

Partial Text search in EVERY document harder?
Cannot store many tags as limit on Tags field
Cannot store Tag metadata, description, hierarchy
The document size will become very large.
- If I load this user object via REST service in my mobile app it could take a while.
- Furthermore in worst case the user document could exceed the MongoDb 16MB document limit.

2-Table Implementation: PostTable and TagTable. Relationship Embedded as Foreign Keys.

PostTable

PostId
Content
Tags: [tagId1, tagId2, tagId3]

TagTable

TagId
TagName

The Good

More normalised, as repeated tags have one document Can Store tag metadata, hierarchy etc.

The Bad

Cannot store relationship metadata
Search?: queries of type show all posts with tagName = xxxx require iterating through the array and joining?

2-Table Implementation: PostTable + Mapping Table

PostTable

PostId
Content

PostTagTable

PostId
TagName

The Bad

Not Normalised (Duplicate Tag Names), so takes more Space in DB
Slower queries: the query for the tag title is much faster in a small table which contains each tag only once and this certainly is true.
- But taking in regard the savings for not having to join and the fact that you can build a good index on them could easily compensate for this.
Not normalised, so extra work needed to remain consistent
- The inconsistency argument is a little moot Tags are free text fields and there is no expected operation like ‘rename all tags “foo” to “bar”
You cannot create “unused”/orphan tags so an “add tag” feature has to be performed on an Item. (On the other implementations, the “add tag” feature can be performed independently

Example Query: Select all posts with a particular tag (Only 2-table join if we have tagName)

SELECT *
FROM PostTable
JOIN PostTagTable ON PostId
WHERE TagName = 

or¹

SELECT *
FROM PostTable, PostTagTable
WHERE PostTable.Id =  PostTagTable.postID
and tagName = :tag

Example Query: Select all posts with a particular union of tags

SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId

3-Table Implementation: Post Table + Mapping Table

PostTable

PostId
Content

TagTable

TagId
TagName

PostTagTable

PostId
TagId

Example Query: Items for one tag(3 table joing if we have tag name)

SELECT Item.* 
  FROM Item
  JOIN ItemTag ON Item.ItemID = ItemTag.ItemID
  JOIN Tag ON ItemTag.TagID = Tag.TagID
 WHERE Tag.Title = :title

Example Query: Tag-Cloud (2 table join)

SELECT Tag.Title, count(*)
  FROM Tag
  JOIN ItemTag ON Tag.TagID = ItemTag.TagID
 GROUP BY Tag.Title

Example Query: All Tags for one item (2 table join if we have id):

Select tagName
FROM Tag
JOIN ItemTag on Tag.tagID = ItemTag.TagID
Where ItemTag.ItemID = :id

The Bad

Complex Queries. In the three-table solution for insert and update of item you have to lookup the Tag table to see if there are already entries. Then you have to join them with new ones. This is no trivial task.

1-Table Implementation

If you are using a database that supports map-reduce, like couchdb, storing tags in a plain text field or list field is indeed the best way. Example:

tagcloud: {
  map: function(doc){ 
    for(tag in doc.tags){ 
      emit(doc.tags[tag],1) 
    }
  }
  reduce: function(keys,values){
    return values.length
  }
}

https://stackoverflow.com/questions/1810356/how-to-implement-tag-system

Case Study: Stack Oveflow

Post

- Text
- Tags: [Tag]
- Post by: User
- Upvotes: [User_Ids]
- Total Count of upvotes?
- Type: Question/Answer/Comment
- Views
- If Type Question, Answers: [], If Type Answer, Question: [], if Type Comment, Question/Answer[]
- Edits:

Users

- id
- name
- Photo
- screen_name or username
- AuthCredentials: password?
- Reputation/Badgets etc.

TagTable (29488 rows)

tagId
Tagname

TagSynonymTable (1916 rows)

A synonym tag can only be a synonym for one master tag. A master tag can have multiple synonym tags:

sourceTag: tag1
masterTag: tag3

sourceTag: tag2
masterTag: tag3

sourceTag: tag4
masterTag: tag3 

A master tag also be a synonym tag.

sourceTag: tag3
masterTag: tag5

Q: Why no foreign key on master tag?

A: Foreign keys help by quickly searching based on id, but here we only search or make joins based on tag names. Only if you have an information requirement that can’t be achieved due to the current technical implementation there would be reason to change or add foreign-keys and possible indexes.

Hierarchy (Directional Graph without circular dependencies. technical processes in place to prevent circular references from happening).

Big Master Tag

Master Tag … Master tag 2

synonym1 …synonym 2.

Normalize vs Denormalization (Redudancy) in NOSql

“Whatever gets the job done. When the application is not really dependent on weird queries (e.g. just a blog), screw the normal forms and design your schema to use the least number of queries for a certain task. Nobody understands three lines of code of queries with left and right joins. On the other hand, if your bookkeeping application uses a database try to keep things as tidy as possible.””

Thumb rules:

Relational: pro normalization better because support for joins (otherwise performance and code complexity costs).

Mapping table for N-N relations
Foreign key reference in 1-N relationsps. Child stores reference to parent (because parent to child would require array of ids and that breaks first normal form)

No Sql: denormalized better because no support for joins. Having the “option” to embed “should” be the reason you have chosen MongoDB, but it will actually be how your application “uses the data” that makes the decision to which method suits which part of your data modelling (as it is not “all or nothing”) the best. The core design principle is “embedded” means “already there” as opposed to “fetching from somewhere else”. Essentially the difference between “in your pocket” and “on the shelf”, and in I/O terms usually more like “on the shelf in the library downtown”, and notably further away for network based requests.

RDBMS are great because they let you model unique structured entities (mutable or not) and their relationships with one another. This means it’s very easy to work at the entity level, updating their properties, inserting another one, deleting one, etc. But it’s also great for aggregating them dynamically, a dog with its owner, a dog with the homes it’s resided in, etc. The RDBMS gives you tools to facilitate all this. It’ll join for you, it’ll handle atomic changes across entities for you, etc.

NoSQL databases are great because they let you model semi/unstructured aggregates and dynamic entities. This means it’s very easy to model ever changing entities, entities that don’t all share the same attributes and hierarchical aggregates. * Populate vs Object Nesting

NoSQL Model

Philosophy of Denormalization: We forgo integrity guarantees (avoiding data update anomalies) by having duplication in parts of the application which are read heavy but not write-heavy (so the cost of forgoing integrity guarantees is not heavy). This allows us to increase READ performance (no joins needed to fetch data, either via native support or via multiple roundtrips) and decrease code complexity.

Schema design in NoSQL: It is generally not determined by the data you want to store, but by the operations you want to perform. Don’t design the collections and documents to model the relationships between your application entities. Design the collections and documents to make your common operations easy and fast. “I have found it much easier to model document-repository solutions using NoSQL data stores by optimizing for the read use cases, while being considerate of the atomic write operations that need to be supported by the write use cases”

For instance, the uses of a “Users in Roles” domain follow:

Role - Create, Read, Update, Delete, List Users, Add User, Remove User, Clear All Users, Index of User or similar to support “Is User In Role” (operations like a container + its own metadata).
User - Create, Read, Update, Delete (CRUD operations like a free-standing entity) This can be modeled as the following document templates: User: { _id: UniqueId, name: string, roles: string[] } Indexes: unique: [ name ] Role: { _id: UniqueId, name: string, users: string[] } Indexes: unique: [ name ]

Theory https://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/

Examples

Followers

https://stackoverflow.com/questions/30353202/save-followers-count-in-a-field-or-query-each-time-if-needed
https://stackoverflow.com/questions/14459194/mongodb-how-to-design-twitter-style-followers-following-relation-model-in-mongo

Modelling N-N: Store array of references in both, junction table, store array of references in one. https://stackoverflow.com/questions/28421505/followers-mongodb-database-design Likes https://stackoverflow.com/questions/28006521/how-to-model-a-likes-voting-system-with-mongodb Tags: https://stackoverflow.com/questions/13882516/content-tagging-with-mongodb https://stackoverflow.com/questions/8455685/how-to-implement-post-tags-in-mongo Hierarchy (e.g Categories) : https://docs.mongodb.com/drivers/use-cases/category-hierarchy

To model for NoSql, you need to think in terms of hierarchy and aggregates instead of entities and relations. Best practices for NoSQL database design

Designing for Document Databases
6 Rules of Thumb for Schema Design
Total document size with embedded data will typically not exceed 16MB of storage (the BSON limit) or otherwise ( as a guideline ) have arrays that contain 500 or more entries.
Data that is embedded does generally not require frequent changes. So you could live with “duplication” that comes from the de-normalization not resulting in the need to update those “duplicates” with the same information across many parent documents just to invoke a change.
Related data is frequently used in association with the parent. Which means that if your “read/write” cases are pretty much always needing to “read/write” to both parent and child then it makes sense to embed the data for atomic operations.
One: favor embedding unless there is a compelling reason not to
Two: needing to access an object on its own is a compelling reason not to embed it
Three: Arrays should not grow without bound. If there are more than a couple of hundred documents on the “many” side, don’t embed them; if there are more than a few thousand documents on the “many” side, don’t use an array of ObjectID references. High-cardinality arrays are a compelling reason not to embed.
Four: Don’t be afraid of application-level joins: if you index correctly and use the projection specifier (as shown in part 2) then application-level joins are barely more expensive than server-side joins in a relational database.
Five: Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
Six: As always with MongoDB, how you model your data depends – entirely – on your particular application’s data access patterns. You want to structure your data to match the ways that your application queries and updates it.

When you use an Id, the info that is meaningful to humans is stord in only one place. Because it has no meaning to humans, it never needs to change, even if the information that it identifies changes. (otherwise all the redudant copies need to be changed: write overheads, inconsistencies).

The developer has to decide whether to denormalize or manually resolve references from one record to another (in the application code)

Advantages of normalizing (converting to many-to-one relationship)

Consistent style and spelling across docs
avoiding ambiguity if there are serveral values with the same name
easy of updating: Stored in only one place
internationalization support: change to local language at will
better search: a separate tag table can encode synonyms, misspellings, and the tagId an be linked

Look at this script and the costs of normalization in nosql databases. Here we are storing an array of tagsid and skillsid in the superpower entity. To ensure consistency, we have to write to both the tables on every write to the superpower table. To obtain the ids of new tags/skills, we have to emulate a join in application code (look at the code which populates and then filters id). All of this increases the code complecity.

Database design in a non-relational database like MongoDB depends on the queries you need to support, not the relationships between data entities. - NoSQL normalizing help

Instead, had we kept tags and skills in denormalized form within the supowerpower entity, we do not need to write or emulate joins. The problem is that we would not be able to keep skills as an independent evolvable entity and link them to the superpowers.

let updateSuperpower = (req, res) => {
  (async () => {
    let response = {};
    let tags =  req.body.tags || [];
    let skills = req.body.skills || [];
    // Returns an array of tag documents. 

    await Promise.all([powerHelper.saveTags({tags: tags}), skillHelper.saveSkills({skills: skills})]).then((results) => {
        tags = results[0];
        skills = results[1];
      }).catch((error) => {console.log("Error: " + error)})
    let tagIds = [];
    tags.filter((tag) => {
      tagIds.push(tag._id);
    });
        
    let skillIds = [];
    skills.filter((skill) => {
      skillIds.push(skill._id);
    });

    await powerHelper.updateSuperpower({superpowerId: req.body.superpowerId, description:req.body.description, tagIds:tagIds, skillIds: skillIds});
    return response;
  })().then(function (response){
        utils.sendSuccess(res, "Success", response);
    }).catch(function (error){
        utils.sendError(res, 400, "Failed");
    });
}

There is no difference between the two. For larger queries the first way is more readable. If you have a mix left joins and some inner joins, then it is far more consistent to have the joins condition on the join clause, rather than some in the joins and some in the where clause. Second representation makes query more readable and makes it look very clear as to which join corresponds to which condition. ↩

13 January 2020

Notes by Piyush

Schema Design

Normalization

Case Study: Implementing Tags

Case Study: Stack Oveflow