Big Data Engineering Notes: NoSQL
Every abstraction hides the complexity of the layer below it. Our modeling needs to go down to the lowest layer possible to find the best optimizations.
The disconnect between models of two layers.
Fewer seeks/joins and compression options, but more memory used (disk and main).
Just how you describe the shape of your data.
A table is just a translation layer over files. So the best way to optimize a table is to optimize how the files are stored.
Schemas do change. Make sure to follow good schema evolution rules.
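The notes do not list the evolution rules themselves. One widely used rule is to add new fields only with defaults, so records written under the old schema stay readable. A minimal sketch of that rule, where the field names and the DEFAULTS table are hypothetical:

```python
import json

# Hypothetical defaults for fields added after the first schema version.
DEFAULTS = {"email": None, "loyalty_tier": "none"}


def read_customer(raw: str) -> dict:
    """Parse a record written under any schema version."""
    record = json.loads(raw)
    # Fill in fields the writer did not know about; values present in the
    # record always win over the defaults.
    return {**DEFAULTS, **record}


# A record written before "email" and "loyalty_tier" existed.
old = read_customer('{"customer_id": 7, "name": "Ada"}')
# A record written after "email" was added.
new = read_customer('{"customer_id": 8, "name": "Lin", "email": "lin@example.com"}')
```

Both records come back with the same set of keys, so downstream code does not need to know which schema version wrote them.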
#1 Thing: Avoid storage hotspots (everything hashing to the same values)
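A rough sketch of how a hotspot shows up, assuming a store that picks a partition by hashing the key. The partition count, key format, and partition_for helper are hypothetical:

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical cluster size


def partition_for(key: str) -> int:
    """Map a key to a partition using a stable hash."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


# 1,000 events that all belong to the same day.
events = [("2024-01-15", f"event-{i}") for i in range(1000)]

# Keyed by date alone: every row hashes to the same partition (a hotspot).
by_date = Counter(partition_for(day) for day, _ in events)

# Keyed by date plus event id: the rows spread across partitions.
by_date_and_id = Counter(partition_for(f"{day}#{eid}") for day, eid in events)

print("date only:     ", dict(by_date))
print("date + event id:", dict(by_date_and_id))
```

The composite key trades some locality (a day's events are no longer together) for an even write load, which is the usual cost of avoiding a hotspot.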
Even with a high rate of false positives, searching much less of the data can be powerful
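The notes do not name a structure here. A Bloom filter is one that fits the description (false positives are possible, false negatives are not), so the sketch below assumes that is what is meant. The class, the sizes, and the file names are hypothetical:

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may say "maybe present" wrongly, never "absent" wrongly."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# One filter per data file, built from the keys stored in that file.
files = {"part-0": ["alice", "bob"], "part-1": ["carol", "dave"]}
filters = {}
for name, keys in files.items():
    bloom = BloomFilter()
    for key in keys:
        bloom.add(key)
    filters[name] = bloom

# Only files whose filter answers "maybe" need to be opened and searched.
candidates = [name for name, bloom in filters.items() if bloom.might_contain("carol")]
```

A file that is skipped is guaranteed not to hold the key; a file that is kept may turn out not to hold it, which costs one wasted read rather than a wrong answer.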
First effective optimization
- Row-oriented: When you often query most/all the columns
- Column-oriented: When you often query a subset of the columns (analytics)
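The two layouts can be sketched with plain Python structures. This only illustrates the access patterns, not how any particular engine stores bytes, and the sample records are made up:

```python
rows = [
    {"id": 1, "name": "Ada", "city": "London", "spend": 120.0},
    {"id": 2, "name": "Lin", "city": "Taipei", "spend": 80.0},
    {"id": 3, "name": "Sam", "city": "Austin", "spend": 45.5},
]

# Row-oriented: each record's values sit together, so reading a whole
# record touches one place.
row_store = rows
full_record = row_store[1]

# Column-oriented: each column's values sit together, so an aggregate over
# one column never reads the other columns.
column_store = {col: [r[col] for r in rows] for col in rows[0]}
total_spend = sum(column_store["spend"])

print(full_record)
print(total_spend)
```

Rebuilding a full record from column_store means visiting every column, which is why the column layout suits analytics better than record lookups.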
The story is conveyed in the main body. The appendix lets readers dig deeper.
Make it look good, pictures are nice.
Appendix not required, but often used for information overflow.
Data Model Design
Purpose: Describe the data to other people, because they need to know how to use it.
The final project includes a data model. Lists/Pictures >>> Paragraphs.
Delimit attributes by entity types (SSN vs customer id)
With JSON, this is where you want to denormalize as much as possible.
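A small sketch of what denormalizing into JSON can look like: the customer record is embedded in each order document so a reader needs no join. The entities, field names, and denormalize helper are hypothetical:

```python
import json

# Normalized source data: orders point at customers by id.
customers = {7: {"name": "Ada", "city": "London"}}
orders = [
    {"order_id": 100, "customer_id": 7, "items": ["lamp", "desk"]},
    {"order_id": 101, "customer_id": 7, "items": ["chair"]},
]


def denormalize(order: dict, customers: dict) -> dict:
    """Embed the customer inside the order document."""
    doc = {k: v for k, v in order.items() if k != "customer_id"}
    customer_id = order["customer_id"]
    doc["customer"] = {"customer_id": customer_id, **customers[customer_id]}
    return doc


documents = [denormalize(order, customers) for order in orders]
print(json.dumps(documents[0], indent=2))
```

The customer's name and city are now stored once per order, so the cost is duplicated data and extra work if a customer's details change.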
Each individual insert makes its own network connection. Document databases can do bulk inserts instead.
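The difference in round trips between single and bulk inserts can be sketched with an in-memory stand-in. FakeCollection is hypothetical and only counts calls; the method names mirror the insert_one / insert_many style that some document database drivers use, but nothing here talks to a real database:

```python
class FakeCollection:
    """Hypothetical in-memory stand-in for a document collection."""

    def __init__(self):
        self.docs = []
        self.round_trips = 0  # each call models one network round trip

    def insert_one(self, doc: dict) -> None:
        self.round_trips += 1
        self.docs.append(doc)

    def insert_many(self, docs: list) -> None:
        self.round_trips += 1
        self.docs.extend(docs)


docs = [{"n": i} for i in range(500)]

# One call per document: 500 round trips.
one_by_one = FakeCollection()
for doc in docs:
    one_by_one.insert_one(doc)

# One call for the whole batch: a single round trip.
bulk = FakeCollection()
bulk.insert_many(docs)

print(one_by_one.round_trips, bulk.round_trips)
```

Both collections end up with the same documents; only the number of trips differs, and with a real network each trip adds latency.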
For next week
Do some data modeling. There are videos on it this week, but very little reading. Please ask questions about the videos in the general channel. There will be data modeling on the exam.