Relationship Types
There are three types of relationships between entities:
Status
and Comments
on that status (Facebook)Question
and Answers
on that question (Quora)User
and Answers
(Quora)Status
and Author
(Facebook)Employee
and Organization
Question
and User
(Quora)or reverse of 1 to N
Comments
and Status
that it belongs to (Facebook)Organizations
and Resume
that refer it (LinkedIn)Answer
and Question
that it belongs to (Quora)Answer
and User
(Quora)Status
and Tags
(Instagram)Users
and Organizations
(Linkedin)The normalization rules are designed to enhances the integrity of the data at some possible performance cost for certain retrieval applications (data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records in the normalized form). It is biased toward the assumption that all non-key fields will be updated frequently. It prevents update anomalies (easy to change entries without going through all records) and data inconsistencies (less mistakes) and minimizes redundancy (efficient use of space). Removing duplication is the key to normalisation . There is no obligation to fully normalize all records when actual performance requirements are taken into account: so it depends what your data access patterns are going to be. It also makes sharding difficult.
A NoSQL record does not adhere to first normal form (variable number of fields), but we can still ask whether to extract individual fields into their own separate collection or not. This is called “embedding” vs “referencing”.
How to find if it can be normalized?
You can “smell” duplication, scope for inconsistencies and update mistakes by asking: Is non-key field a fact about a subset of a key or another non-key field?
Every non-key must provide a fact about the key, the whole key, and nothing but the key
The table can be normalized further when these properties are present in your (tables/collection) for some dependent non-key field:
User
field and as other questions by them)Question
table and the User
table )Which entities or fields to normalize?
Normalize until it hurts, denormalize until it works. (I.e. prefer normal forms but denormal will have better performance.)
You keep a dependent key denormalized if the relationship with the primary key will remain 1-to-N, the field/key does not support a unique representation of its values (such as free text input), the access patterns are same across the entire record, the field is not needed in isolation, and is not likely to evolve into a more complex entity. Examples include a Status
and Comments
, User
and Location
. This is true even if it is a Table or a Collection.
You normalize if the relationship is N-1 (2 table) or N-N (3 tables), a single source of truth or unique representation is desirable, the access patterns are different, the field is queried in isolation and is likely to evolve into an indepndent entity with its own attributes.
Get all comments corresponding to a given status
(“get all status corresponding to a particular comment” doesn’t even make sense)
Get all answers corresponding to a question
(“get all questions corresponding to an answer” doesn’t make sense either)
Is the data likely to become more interconnected? The data always tends to become more interconnected (move towards N-to-N) as application complexity increases. For example, if Linkedin originally did not have recognize an entity for Organization
which different users could map to, then it would be 1-to-N, but over time, it might decide to include that, in which case the relationship would become N-to-N. We can still store the relationship in a denormalized form (e.g. store author name in every status, or array of tags within status), as in the case of a 1-to-N relationship. This would avoid joins. (However, this causes duplication and will require extra work at the time of writing to ensure that data is consistent as noted)
User interface: free text input or options? Free text suggests keeping it embedded/denormalized. Functional dependencies only exist when the things involved have unique and singular identifiers (representations) (not having them lead to data maintenance problems of their own.) functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers. If two representations of address: 123 Main St., New York or 123 Main Street, NYC, then address does not have a functional dependency.
“get all statuses posted today” (for feed): so extract Status from User
“get all comments posted today”: doesn’t make sense so keep comments as part of Status
2 Table with Foreign Keys vs 3 Table Collection with Mapping Table
Option 1: store the foreign key (e.g store author name) in one of the tables (one knows about another). Many to one is well represented.
Option 2: Foreign keys in both table (both know about each other).
However, arrays of foreign id in the case of many-to-many makes it difficult to query or join, so not desirable. Array of foreign keys also breaks the first normal form (normalization would require a separate record or document for each id.) You also have to worry about how you’re going to keep the bidirectional references in sync, (risking anomalies where a PC thinks it uses a Part, but that respective Part has no reference to the PC)
A tag is a keyword or label that categorizes your question with other, similar questions. Using the right tags makes it easier for others to find and answer your question.
There are three major database queries which you will do:
Hierarchal “tags” or categories in the TagTable: This is commonly needed on sites that have categories and subcategories but need the flexibility of tagging. For example, recipe sites, auto parts sites, business directories, etc. These types of data don’t usually fit into only one single category so tagging is the answer but you need to use something like the Nested Set Model or the Adjacency List Model in your Tag table.
Possible Rules triggered by the tagging API:
1-Table Implementation
PostTable
Id
Content
Tags: [tag1, tag2, tag2]
Example Query: Select all posts with a particular union of tags
SELECT -
FROM `PostTable`
WHERE tags LIKE "%search%"
OR tags LIKE "%webservice%"
OR tags LIKE "%semweb%"
Drawbacks
2-Table Implementation: PostTable and TagTable. Relationship Embedded as Foreign Keys.
PostTable
PostId
Content
Tags: [tagId1, tagId2, tagId3]
TagTable
TagId
TagName
The Good
More normalised, as repeated tags have one document Can Store tag metadata, hierarchy etc.
The Bad
2-Table Implementation: PostTable + Mapping Table
PostTable
PostId
Content
PostTagTable
PostId
TagName
The Bad
Example Query: Select all posts with a particular tag (Only 2-table join if we have tagName)
SELECT *
FROM PostTable
JOIN PostTagTable ON PostId
WHERE TagName =
or1
SELECT *
FROM PostTable, PostTagTable
WHERE PostTable.Id = PostTagTable.postID
and tagName = :tag
Example Query: Select all posts with a particular union of tags
SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId
3-Table Implementation: Post Table + Mapping Table
PostTable
PostId
Content
TagTable
TagId
TagName
PostTagTable
PostId
TagId
Example Query: Items for one tag(3 table joing if we have tag name)
SELECT Item.*
FROM Item
JOIN ItemTag ON Item.ItemID = ItemTag.ItemID
JOIN Tag ON ItemTag.TagID = Tag.TagID
WHERE Tag.Title = :title
Example Query: Tag-Cloud (2 table join)
SELECT Tag.Title, count(*)
FROM Tag
JOIN ItemTag ON Tag.TagID = ItemTag.TagID
GROUP BY Tag.Title
Example Query: All Tags for one item (2 table join if we have id):
Select tagName
FROM Tag
JOIN ItemTag on Tag.tagID = ItemTag.TagID
Where ItemTag.ItemID = :id
The Bad
1-Table Implementation
If you are using a database that supports map-reduce, like couchdb, storing tags in a plain text field or list field is indeed the best way. Example:
tagcloud: {
map: function(doc){
for(tag in doc.tags){
emit(doc.tags[tag],1)
}
}
reduce: function(keys,values){
return values.length
}
}
https://stackoverflow.com/questions/1810356/how-to-implement-tag-system
Post
- Text
- Tags: [Tag]
- Post by: User
- Upvotes: [User_Ids]
- Total Count of upvotes?
- Type: Question/Answer/Comment
- Views
- If Type Question, Answers: [], If Type Answer, Question: [], if Type Comment, Question/Answer[]
- Edits:
Users
- id
- name
- Photo
- screen_name or username
- AuthCredentials: password?
- Reputation/Badgets etc.
TagTable (29488 rows)
tagId
Tagname
TagSynonymTable (1916 rows)
A synonym tag can only be a synonym for one master tag. A master tag can have multiple synonym tags:
sourceTag: tag1
masterTag: tag3
sourceTag: tag2
masterTag: tag3
sourceTag: tag4
masterTag: tag3
A master tag also be a synonym tag.
sourceTag: tag3
masterTag: tag5
Q: Why no foreign key on master tag?
A: Foreign keys help by quickly searching based on id, but here we only search or make joins based on tag names. Only if you have an information requirement that can’t be achieved due to the current technical implementation there would be reason to change or add foreign-keys and possible indexes.
Hierarchy (Directional Graph without circular dependencies. technical processes in place to prevent circular references from happening).
Big Master Tag
Master Tag … Master tag 2
synonym1 …synonym 2.
Normalize vs Denormalization (Redudancy) in NOSql
“Whatever gets the job done. When the application is not really dependent on weird queries (e.g. just a blog), screw the normal forms and design your schema to use the least number of queries for a certain task. Nobody understands three lines of code of queries with left and right joins. On the other hand, if your bookkeeping application uses a database try to keep things as tidy as possible.””
Thumb rules:
Relational: pro normalization better because support for joins (otherwise performance and code complexity costs).
No Sql: denormalized better because no support for joins. Having the “option” to embed “should” be the reason you have chosen MongoDB, but it will actually be how your application “uses the data” that makes the decision to which method suits which part of your data modelling (as it is not “all or nothing”) the best. The core design principle is “embedded” means “already there” as opposed to “fetching from somewhere else”. Essentially the difference between “in your pocket” and “on the shelf”, and in I/O terms usually more like “on the shelf in the library downtown”, and notably further away for network based requests.
RDBMS are great because they let you model unique structured entities (mutable or not) and their relationships with one another. This means it’s very easy to work at the entity level, updating their properties, inserting another one, deleting one, etc. But it’s also great for aggregating them dynamically, a dog with its owner, a dog with the homes it’s resided in, etc. The RDBMS gives you tools to facilitate all this. It’ll join for you, it’ll handle atomic changes across entities for you, etc.
NoSQL databases are great because they let you model semi/unstructured aggregates and dynamic entities. This means it’s very easy to model ever changing entities, entities that don’t all share the same attributes and hierarchical aggregates. * Populate vs Object Nesting
NoSQL Model
Philosophy of Denormalization: We forgo integrity guarantees (avoiding data update anomalies) by having duplication in parts of the application which are read heavy but not write-heavy (so the cost of forgoing integrity guarantees is not heavy). This allows us to increase READ performance (no joins needed to fetch data, either via native support or via multiple roundtrips) and decrease code complexity.
Schema design in NoSQL: It is generally not determined by the data you want to store, but by the operations you want to perform. Don’t design the collections and documents to model the relationships between your application entities. Design the collections and documents to make your common operations easy and fast. “I have found it much easier to model document-repository solutions using NoSQL data stores by optimizing for the read use cases, while being considerate of the atomic write operations that need to be supported by the write use cases”
For instance, the uses of a “Users in Roles” domain follow:
Theory https://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
Examples
Followers
Modelling N-N: Store array of references in both, junction table, store array of references in one. https://stackoverflow.com/questions/28421505/followers-mongodb-database-design Likes https://stackoverflow.com/questions/28006521/how-to-model-a-likes-voting-system-with-mongodb Tags: https://stackoverflow.com/questions/13882516/content-tagging-with-mongodb https://stackoverflow.com/questions/8455685/how-to-implement-post-tags-in-mongo Hierarchy (e.g Categories) : https://docs.mongodb.com/drivers/use-cases/category-hierarchy
To model for NoSql, you need to think in terms of hierarchy and aggregates instead of entities and relations. Best practices for NoSQL database design
Total document size with embedded data will typically not exceed 16MB of storage (the BSON limit) or otherwise ( as a guideline ) have arrays that contain 500 or more entries.
Data that is embedded does generally not require frequent changes. So you could live with “duplication” that comes from the de-normalization not resulting in the need to update those “duplicates” with the same information across many parent documents just to invoke a change.
Related data is frequently used in association with the parent. Which means that if your “read/write” cases are pretty much always needing to “read/write” to both parent and child then it makes sense to embed the data for atomic operations.
When you use an Id, the info that is meaningful to humans is stord in only one place. Because it has no meaning to humans, it never needs to change, even if the information that it identifies changes. (otherwise all the redudant copies need to be changed: write overheads, inconsistencies).
The developer has to decide whether to denormalize or manually resolve references from one record to another (in the application code)
Advantages of normalizing (converting to many-to-one relationship)
Look at this script and the costs of normalization in nosql databases. Here we are storing an array of tagsid and skillsid in the superpower entity. To ensure consistency, we have to write to both the tables on every write to the superpower table. To obtain the ids of new tags/skills, we have to emulate a join in application code (look at the code which populates and then filters id). All of this increases the code complecity.
Database design in a non-relational database like MongoDB depends on the queries you need to support, not the relationships between data entities. - NoSQL normalizing help
Instead, had we kept tags and skills in denormalized form within the supowerpower entity, we do not need to write or emulate joins. The problem is that we would not be able to keep skills as an independent evolvable entity and link them to the superpowers.
let updateSuperpower = (req, res) => {
(async () => {
let response = {};
let tags = req.body.tags || [];
let skills = req.body.skills || [];
// Returns an array of tag documents.
await Promise.all([powerHelper.saveTags({tags: tags}), skillHelper.saveSkills({skills: skills})]).then((results) => {
tags = results[0];
skills = results[1];
}).catch((error) => {console.log("Error: " + error)})
let tagIds = [];
tags.filter((tag) => {
tagIds.push(tag._id);
});
let skillIds = [];
skills.filter((skill) => {
skillIds.push(skill._id);
});
await powerHelper.updateSuperpower({superpowerId: req.body.superpowerId, description:req.body.description, tagIds:tagIds, skillIds: skillIds});
return response;
})().then(function (response){
utils.sendSuccess(res, "Success", response);
}).catch(function (error){
utils.sendError(res, 400, "Failed");
});
}
There is no difference between the two. For larger queries the first way is more readable. If you have a mix left joins and some inner joins, then it is far more consistent to have the joins condition on the join clause, rather than some in the joins and some in the where clause. Second representation makes query more readable and makes it look very clear as to which join corresponds to which condition. ↩