Big Data Engineering Notes: NoSQL

Storage and Serialization

Data Modeling

Important section.
Every abstraction hides the complexity of the layer below it. To find the best optimization, our modeling needs to go down to the lowest layer possible.

Impedance Mismatch

The disconnect between the data models of two layers (for example, objects in application code vs. rows in relational tables).

Locality

Keeping related data together means fewer seeks/joins and more compression options, at the cost of more memory used (disk and main).

The answer to any specific question, like what is the right level of denormalization, is “depends on the business question.”

Schema

Just how you describe the shape of your data.

A table is just a translation layer over files. So the best way to optimize a table is to optimize how the files are stored.

Schemas do change. Make sure to follow good schema evolution rules.
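One widely used schema evolution rule is to add new fields with a default value, so records written under the old schema stay readable. The snippet below is a minimal sketch of that idea using plain dicts; the field names and the `read_record` helper are hypothetical and not tied to any particular serialization format.

```python
# Hypothetical current schema: "email" was added later, with a default of None.
SCHEMA_DEFAULTS = {"id": None, "name": "", "email": None}


def read_record(raw: dict) -> dict:
    """Read a record written under an older or newer schema version.

    Missing fields receive their default; unknown fields are kept so that
    data from newer writers is not silently dropped.
    """
    record = dict(SCHEMA_DEFAULTS)
    record.update(raw)
    return record


old_record = {"id": 1, "name": "Ada"}  # written before "email" existed
new_record = {"id": 2, "name": "Bob", "email": "bob@example.com"}

print(read_record(old_record))
print(read_record(new_record))
```

Renaming or removing a field, or changing its type, is where evolution usually gets harder, since old readers and old data then disagree with the new shape.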

Hash Index

#1 Thing: Avoid storage hotspots (everything hashing to the same values)
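A rough way to see a hotspot is to count how many records land in each bucket for a candidate key. The sketch below assumes 8 buckets and made-up event data; `bucket_for` and `bucket_counts` are illustrative helpers, not part of any storage engine.

```python
import hashlib
from collections import Counter


def bucket_for(key: str, num_buckets: int) -> int:
    """Map a key to a bucket with a stable hash (md5 is used for spreading, not security)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def bucket_counts(keys, num_buckets: int) -> list:
    """Return how many keys fall into each bucket."""
    counts = Counter(bucket_for(k, num_buckets) for k in keys)
    return [counts.get(b, 0) for b in range(num_buckets)]


# Made-up events: 90% of them come from one country.
events = [(f"user-{i}", "US" if i % 10 else "CA") for i in range(10_000)]

# Low-cardinality key: almost everything hashes to the same bucket.
by_country = bucket_counts((country for _, country in events), 8)
# High-cardinality key: records spread across all buckets.
by_user = bucket_counts((user for user, _ in events), 8)

print("by country:", by_country)
print("by user:   ", by_user)
```

The hash function itself is fine in both cases; the hotspot comes from choosing a key with only a few distinct values.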

Bloom Filters

Even with a high rate of false positives, searching much less of the data can be powerful
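To make this concrete, here is a minimal Bloom filter sketch, with one filter per storage block so a lookup only searches the blocks that might hold the key. The sizes (`num_bits`, `num_hashes`), the block names, and the helper `blocks_to_search` are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Derive several bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical storage blocks, each with its own filter.
blocks = {"block-0": ["apple", "banana"], "block-1": ["cherry"], "block-2": ["date"]}
filters = {}
for name, keys in blocks.items():
    filters[name] = BloomFilter()
    for key in keys:
        filters[name].add(key)


def blocks_to_search(key: str) -> list:
    """Return only the blocks whose filter says the key might be present."""
    return [name for name, f in filters.items() if f.might_contain(key)]


print(blocks_to_search("cherry"))
```

A "no" from the filter is always correct, so any block it rules out can be skipped safely; a "maybe" still has to be checked against the real data.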

Storage Orientation

First effective optimization

  • Row-oriented: When you often query most/all the columns
  • Column-oriented: When you often query a subset of columns (analytics)
Data Analytics Lifecycle: Hi-Fi data is often stored row-oriented, and the derived Low-Fi analytic data is then written column-oriented.
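The two orientations can be sketched with plain Python structures. This is only an illustration of the layout, assuming every row has the same fields; the sample records and the `to_columns` helper are made up.

```python
# Row-oriented: each record is stored together.
rows = [
    {"id": 1, "name": "Ada", "city": "London", "spend": 120.0},
    {"id": 2, "name": "Bob", "city": "Paris", "spend": 80.5},
    {"id": 3, "name": "Cy", "city": "London", "spend": 42.0},
]


def to_columns(rows: list) -> dict:
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


# Column-oriented: each field is stored together.
columns = to_columns(rows)

# Fetching a whole record is natural in the row layout.
record = rows[1]

# An analytic query touches only the one column it needs.
total_spend = sum(columns["spend"])

print(record)
print(total_spend)
```

Values within one column also tend to be similar to each other (note the repeated city), which is part of why column layouts often compress well.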

Deliverable

The story is conveyed in the main body. The appendix lets readers dig deeper.
Make it look good; pictures are nice.
An appendix is not required, but it is often used for information overflow.

NoSQL Data Modeling

Data Model Design

Purpose: Describe the model to other people, because they need to know how to use it.
The final project should include a data model. Lists/pictures >>> paragraphs.

Keys

Delimit attributes by entity types (SSN vs customer id)

Document Databases

Recommended book: NoSQL for Mere Mortals

With JSON, this is where you want to denormalize as much as possible.
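As an illustration, the sketch below folds three normalized collections into a single denormalized customer document. The data, field names, and the `build_customer_document` helper are all hypothetical.

```python
import json

# Normalized, relational-style data: three collections linked by ids.
customers = {1: {"name": "Ada"}}
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
]
items = [
    {"order_id": 10, "sku": "A1", "qty": 2},
    {"order_id": 10, "sku": "B7", "qty": 1},
    {"order_id": 11, "sku": "C3", "qty": 5},
]


def build_customer_document(customer_id: int) -> dict:
    """Embed a customer's orders and their items in one document."""
    doc = {
        "customer_id": customer_id,
        "name": customers[customer_id]["name"],
        "orders": [],
    }
    for order in orders:
        if order["customer_id"] != customer_id:
            continue
        doc["orders"].append({
            "order_id": order["order_id"],
            "items": [
                {"sku": it["sku"], "qty": it["qty"]}
                for it in items
                if it["order_id"] == order["order_id"]
            ],
        })
    return doc


doc = build_customer_document(1)
print(json.dumps(doc, indent=2))
```

Reading a customer with all of their orders is now a single document fetch with no joins. The trade-off is that anything duplicated across documents has to be updated in every copy.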

Individual Inserts

Each individual insert makes its own network connection. Document databases can do bulk inserts instead.
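The difference can be sketched by counting round trips. `FakeDocumentStore` below is a hypothetical in-memory stand-in, not a real client; its method names mirror `insert_one` and `insert_many` from MongoDB's Python driver, and the batch size of 250 is an arbitrary assumption.

```python
class FakeDocumentStore:
    """Hypothetical stand-in for a document database client.

    Every method call counts as one network round trip.
    """

    def __init__(self):
        self.documents = []
        self.round_trips = 0

    def insert_one(self, doc: dict) -> None:
        self.round_trips += 1
        self.documents.append(doc)

    def insert_many(self, docs: list) -> None:
        self.round_trips += 1
        self.documents.extend(docs)


def batched(items: list, size: int):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


docs = [{"n": i} for i in range(1000)]

# One round trip per document.
one_by_one = FakeDocumentStore()
for d in docs:
    one_by_one.insert_one(d)

# One round trip per batch.
bulk = FakeDocumentStore()
for batch in batched(docs, 250):
    bulk.insert_many(batch)

print(one_by_one.round_trips, bulk.round_trips)
```

Both stores end up with the same 1000 documents; only the number of trips differs. With a real database the best batch size depends on document size and server limits, so it is worth measuring.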

Indices

Try indexing little and indexing everything, checking the performance to find the best amount. Figure out where it breaks.
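One way to run this kind of experiment is to time the same queries before and after adding an index. The sketch below uses Python's built-in `sqlite3` module purely as a convenient stand-in (it is relational, not a document database); the table, the index name, and the row counts are assumptions.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    ((i % 5000, "x" * 50) for i in range(50_000)),
)


def time_lookups(conn, n: int = 50) -> float:
    """Time n lookups by user_id, in seconds."""
    start = time.perf_counter()
    for user_id in range(n):
        conn.execute("SELECT COUNT(*) FROM events WHERE user_id = ?", (user_id,)).fetchone()
    return time.perf_counter() - start


def query_plan(conn) -> str:
    """Return SQLite's description of how it will run the lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM events WHERE user_id = 1"
    ).fetchall()
    return " ".join(row[-1] for row in rows)


before = time_lookups(conn)
plan_before = query_plan(conn)

conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

after = time_lookups(conn)
plan_after = query_plan(conn)

print(f"no index:   {before:.4f}s  ({plan_before})")
print(f"with index: {after:.4f}s  ({plan_after})")
```

To look for the breaking point, the same harness can time inserts as more indices are added, since each index also has to be maintained on every write.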

For next week

Do some data modeling. There are videos on it this week, but very little reading. Please ask questions about the videos in the general channel. Data modeling will be on the exam.

Cody Perakslis
Student

My interests are in data science, distributed computing, and philosophy.