Schema Design

Relationship Types

There are three types of relationships between entities:


Normalization

The normalization rules are designed to enhances the integrity of the data at some possible performance cost for certain retrieval applications (data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records in the normalized form). It is biased toward the assumption that all non-key fields will be updated frequently. It prevents update anomalies (easy to change entries without going through all records) and data inconsistencies (less mistakes) and minimizes redundancy (efficient use of space). Removing duplication is the key to normalisation
. There is no obligation to fully normalize all records when actual performance requirements are taken into account: so it depends what your data access patterns are going to be. It also makes sharding difficult.

A NoSQL record does not adhere to first normal form (variable number of fields), but we can still ask whether to extract individual fields into their own separate collection or not. This is called “embedding” vs “referencing”.


How to find if it can be normalized?

You can “smell” duplication, scope for inconsistencies and update mistakes by asking: Is non-key field a fact about a subset of a key or another non-key field?

Every non-key must provide a fact about the key, the whole key, and nothing but the key

The table can be normalized further when these properties are present in your (tables/collection) for some dependent non-key field:


Which entities or fields to normalize?

Normalize until it hurts, denormalize until it works. (I.e. prefer normal forms but denormal will have better performance.)

You keep a dependent key denormalized if the relationship with the primary key will remain 1-to-N, the field/key does not support a unique representation of its values (such as free text input), the access patterns are same across the entire record, the field is not needed in isolation, and is not likely to evolve into a more complex entity. Examples include a Status and Comments, User and Location. This is true even if it is a Table or a Collection.

You normalize if the relationship is N-1 (2 table) or N-N (3 tables), a single source of truth or unique representation is desirable, the access patterns are different, the field is queried in isolation and is likely to evolve into an indepndent entity with its own attributes.

  1. What is the Relationship? 1-N, N-1, N-N 1-N relationships are well represented as embedded. N-1 and N-N relationships, if embedded would result in denormalization and all its warts. The 1-to-N relationshop is a tree-like structure. Typically the entire tree is loaded as part of the queries. If all the data you need for one key is all in one “place”, if you know that that will always be your primary access pattern, then you don’t need joins and nosql systems may be faster for you. If you need joins, no sql systems not suitable as it would be slower (Multiple roundtrip queries to the db) and lead to more complex code (Emulate the joins in application) Example Queries for 1-to-N:

    Get all comments corresponding to a given status (“get all status corresponding to a particular comment” doesn’t even make sense)

    Get all answers corresponding to a question (“get all questions corresponding to an answer” doesn’t make sense either)

  2. Is the data likely to become more interconnected? The data always tends to become more interconnected (move towards N-to-N) as application complexity increases. For example, if Linkedin originally did not have recognize an entity for Organization which different users could map to, then it would be 1-to-N, but over time, it might decide to include that, in which case the relationship would become N-to-N. We can still store the relationship in a denormalized form (e.g. store author name in every status, or array of tags within status), as in the case of a 1-to-N relationship. This would avoid joins. (However, this causes duplication and will require extra work at the time of writing to ensure that data is consistent as noted)

  3. User interface: free text input or options? Free text suggests keeping it embedded/denormalized. Functional dependencies only exist when the things involved have unique and singular identifiers (representations) (not having them lead to data maintenance problems of their own.) functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers. If two representations of address: 123 Main St., New York or 123 Main Street, NYC, then address does not have a functional dependency.

  4. Static vs Dynamic Updates: Are some entries higher write frequency?
  5. Important as a standalone entity which can evolve? (e.g. orphaned tags, warehouse addresses)?
  6. Do we need to analyze the data of the fields in isolation of its main document? then we should make it in a separate entity.


    “get all statuses posted today” (for feed): so extract Status from User

“get all comments posted today”: doesn’t make sense so keep comments as part of Status


2 Table with Foreign Keys vs 3 Table Collection with Mapping Table

Option 1: store the foreign key (e.g store author name) in one of the tables (one knows about another). Many to one is well represented.

Option 2: Foreign keys in both table (both know about each other).

However, arrays of foreign id in the case of many-to-many makes it difficult to query or join, so not desirable. Array of foreign keys also breaks the first normal form (normalization would require a separate record or document for each id.) You also have to worry about how you’re going to keep the bidirectional references in sync, (risking anomalies where a PC thinks it uses a Part, but that respective Part has no reference to the PC)


Case Study: Implementing Tags

A tag is a keyword or label that categorizes your question with other, similar questions. Using the right tags makes it easier for others to find and answer your question.

There are three major database queries which you will do:

Hierarchal “tags” or categories in the TagTable: This is commonly needed on sites that have categories and subcategories but need the flexibility of tagging. For example, recipe sites, auto parts sites, business directories, etc. These types of data don’t usually fit into only one single category so tagging is the answer but you need to use something like the Nested Set Model or the Adjacency List Model in your Tag table.

Possible Rules triggered by the tagging API:


1-Table Implementation

PostTable

Id
Content
Tags: [tag1, tag2, tag2]

Example Query: Select all posts with a particular union of tags

SELECT - 
FROM `PostTable` 
WHERE tags LIKE "%search%" 
OR tags LIKE "%webservice%" 
OR tags LIKE "%semweb%"

Drawbacks


2-Table Implementation: PostTable and TagTable. Relationship Embedded as Foreign Keys.

PostTable

PostId
Content
Tags: [tagId1, tagId2, tagId3]

TagTable

TagId
TagName

The Good

More normalised, as repeated tags have one document Can Store tag metadata, hierarchy etc.

The Bad


2-Table Implementation: PostTable + Mapping Table

PostTable

PostId
Content

PostTagTable

PostId
TagName

The Bad

Example Query: Select all posts with a particular tag (Only 2-table join if we have tagName)

SELECT *
FROM PostTable
JOIN PostTagTable ON PostId
WHERE TagName = 

or1

SELECT *
FROM PostTable, PostTagTable
WHERE PostTable.Id =  PostTagTable.postID
and tagName = :tag

Example Query: Select all posts with a particular union of tags

SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId

3-Table Implementation: Post Table + Mapping Table

PostTable

PostId
Content

TagTable

TagId
TagName

PostTagTable

PostId
TagId

Example Query: Items for one tag(3 table joing if we have tag name)

SELECT Item.* 
  FROM Item
  JOIN ItemTag ON Item.ItemID = ItemTag.ItemID
  JOIN Tag ON ItemTag.TagID = Tag.TagID
 WHERE Tag.Title = :title

Example Query: Tag-Cloud (2 table join)

SELECT Tag.Title, count(*)
  FROM Tag
  JOIN ItemTag ON Tag.TagID = ItemTag.TagID
 GROUP BY Tag.Title

Example Query: All Tags for one item (2 table join if we have id):

Select tagName
FROM Tag
JOIN ItemTag on Tag.tagID = ItemTag.TagID
Where ItemTag.ItemID = :id

The Bad


1-Table Implementation

If you are using a database that supports map-reduce, like couchdb, storing tags in a plain text field or list field is indeed the best way. Example:

tagcloud: {
  map: function(doc){ 
    for(tag in doc.tags){ 
      emit(doc.tags[tag],1) 
    }
  }
  reduce: function(keys,values){
    return values.length
  }
}

https://stackoverflow.com/questions/1810356/how-to-implement-tag-system


Case Study: Stack Oveflow

Post

- Text
- Tags: [Tag]
- Post by: User
- Upvotes: [User_Ids]
- Total Count of upvotes?
- Type: Question/Answer/Comment
- Views
- If Type Question, Answers: [], If Type Answer, Question: [], if Type Comment, Question/Answer[]
- Edits:

Users

- id
- name
- Photo
- screen_name or username
- AuthCredentials: password?
- Reputation/Badgets etc.

TagTable (29488 rows)

tagId
Tagname

TagSynonymTable (1916 rows)

A synonym tag can only be a synonym for one master tag. A master tag can have multiple synonym tags:

sourceTag: tag1
masterTag: tag3

sourceTag: tag2
masterTag: tag3

sourceTag: tag4
masterTag: tag3 

A master tag also be a synonym tag.

sourceTag: tag3
masterTag: tag5

Q: Why no foreign key on master tag?

A: Foreign keys help by quickly searching based on id, but here we only search or make joins based on tag names. Only if you have an information requirement that can’t be achieved due to the current technical implementation there would be reason to change or add foreign-keys and possible indexes.

Hierarchy (Directional Graph without circular dependencies.  technical processes in place to prevent circular references from happening).

Big Master Tag

Master Tag … Master tag 2

synonym1 …synonym 2.


Normalize vs Denormalization (Redudancy) in NOSql

“Whatever gets the job done. When the application is not really dependent on weird queries (e.g. just a blog), screw the normal forms and design your schema to use the least number of queries for a certain task. Nobody understands three lines of code of queries with left and right joins. On the other hand, if your bookkeeping application uses a database try to keep things as tidy as possible.””

Thumb rules:

Relational: pro normalization better because support for joins (otherwise performance and code complexity costs).

No Sql: denormalized better because no support for joins. Having the “option” to embed “should” be the reason you have chosen MongoDB, but it will actually be how your application “uses the data” that makes the decision to which method suits which part of your data modelling (as it is not “all or nothing”) the best. The core design principle is “embedded” means “already there” as opposed to “fetching from somewhere else”. Essentially the difference between “in your pocket” and “on the shelf”, and in I/O terms usually more like “on the shelf in the library downtown”, and notably further away for network based requests.

RDBMS are great because they let you model unique structured entities (mutable or not) and their relationships with one another. This means it’s very easy to work at the entity level, updating their properties, inserting another one, deleting one, etc. But it’s also great for aggregating them dynamically, a dog with its owner, a dog with the homes it’s resided in, etc. The RDBMS gives you tools to facilitate all this. It’ll join for you, it’ll handle atomic changes across entities for you, etc.

NoSQL databases are great because they let you model semi/unstructured aggregates and dynamic entities. This means it’s very easy to model ever changing entities, entities that don’t all share the same attributes and hierarchical aggregates. * Populate vs Object Nesting

NoSQL Model

Philosophy of Denormalization: We forgo integrity guarantees (avoiding data update anomalies) by having duplication in parts of the application which are read heavy but not write-heavy (so the cost of forgoing integrity guarantees is not heavy). This allows us to increase READ performance (no joins needed to fetch data, either via native support or via multiple roundtrips) and decrease code complexity.

Schema design in NoSQL: It is generally not determined by the data you want to store, but by the operations you want to perform. Don’t design the collections and documents to model the relationships between your application entities. Design the collections and documents to make your common operations easy and fast. “I have found it much easier to model document-repository solutions using NoSQL data stores by optimizing for the read use cases, while being considerate of the atomic write operations that need to be supported by the write use cases”

For instance, the uses of a “Users in Roles” domain follow:

  1. Role - Create, Read, Update, Delete, List Users, Add User, Remove User, Clear All Users, Index of User or similar to support “Is User In Role” (operations like a container + its own metadata).
  2. User - Create, Read, Update, Delete (CRUD operations like a free-standing entity) This can be modeled as the following document templates: User: { _id: UniqueId, name: string, roles: string[] } Indexes: unique: [ name ] Role: { _id: UniqueId, name: string, users: string[] } Indexes: unique: [ name ]

Theory https://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/

Examples

Followers

Modelling N-N: Store array of references in both, junction table, store array of references in one. https://stackoverflow.com/questions/28421505/followers-mongodb-database-design Likes https://stackoverflow.com/questions/28006521/how-to-model-a-likes-voting-system-with-mongodb Tags: https://stackoverflow.com/questions/13882516/content-tagging-with-mongodb https://stackoverflow.com/questions/8455685/how-to-implement-post-tags-in-mongo Hierarchy (e.g Categories) : https://docs.mongodb.com/drivers/use-cases/category-hierarchy

To model for NoSql, you need to think in terms of hierarchy and aggregates instead of entities and relations. Best practices for NoSQL database design

When you use an Id, the info that is meaningful to humans is stord in only one place. Because it has no meaning to humans, it never needs to change, even if the information that it identifies changes. (otherwise all the redudant copies need to be changed: write overheads, inconsistencies).

The developer has to decide whether to denormalize or manually resolve references from one record to another (in the application code)

Advantages of normalizing (converting to many-to-one relationship)

Look at this script and the costs of normalization in nosql databases. Here we are storing an array of tagsid and skillsid in the superpower entity. To ensure consistency, we have to write to both the tables on every write to the superpower table. To obtain the ids of new tags/skills, we have to emulate a join in application code (look at the code which populates and then filters id). All of this increases the code complecity.

Database design in a non-relational database like MongoDB depends on the queries you need to support, not the relationships between data entities. - NoSQL normalizing help

Instead, had we kept tags and skills in denormalized form within the supowerpower entity, we do not need to write or emulate joins. The problem is that we would not be able to keep skills as an independent evolvable entity and link them to the superpowers.


let updateSuperpower = (req, res) => {
  (async () => {
    let response = {};
    let tags =  req.body.tags || [];
    let skills = req.body.skills || [];
    // Returns an array of tag documents. 

    await Promise.all([powerHelper.saveTags({tags: tags}), skillHelper.saveSkills({skills: skills})]).then((results) => {
        tags = results[0];
        skills = results[1];
      }).catch((error) => {console.log("Error: " + error)})
    let tagIds = [];
    tags.filter((tag) => {
      tagIds.push(tag._id);
    });
        
    let skillIds = [];
    skills.filter((skill) => {
      skillIds.push(skill._id);
    });

    await powerHelper.updateSuperpower({superpowerId: req.body.superpowerId, description:req.body.description, tagIds:tagIds, skillIds: skillIds});
    return response;
  })().then(function (response){
        utils.sendSuccess(res, "Success", response);
    }).catch(function (error){
        utils.sendError(res, 400, "Failed");
    });
}

  1. There is no difference between the two. For larger queries the first way is more readable. If you have a mix left joins and some inner joins, then it is far more consistent to have the joins condition on the join clause, rather than some in the joins and some in the where clause.  Second representation makes query more readable and makes it look very clear as to which join corresponds to which condition. 

13 January 2020