Storage engines

Storage engines are characterised mainly by three properties

Buffering
Mutability
Ordering

Often LSM are considered non-mutable and Btree is mutable. LSM doesn't modify your records in place, Btrees do.

Although some variants exist that don't comply with this.

This blog covers some important points from chapter 2 Btree Basics of the Database Internals - Alex Petrov book.

BST - Binary Search Trees

The book begins by discussing BST. A BST has the following :

A single starting node
Higher elements to the right
Lower elements to the left
Each node can only have 2 children
In the worst case search takes O(n) (When all the nodes are left children of parent node) time and in best of O(log n) (Balanced tree)
A tree is balanced when the height of the tree is log2(N) (This is the MOST optimised height for a BST).
Fanout (max allowed childer per node) is 2 here and is considered low. In a Disk-based structure, fanout is preferred to be high and height low. This way number of pointer traversals is minimised.

It is important to note that fan out and height are inversely proportional to each other. Also, Just because a structure is optimised for space and time, it cannot be used as a disk-based structure, it needs to be adapted to the mediums limitations.

Medium

Most of these structures and algorithms were built for spinning disks. These days we are seeing an uptick in structures that optimise for byte addressable non-volative memory

Hard drives

These have physically spinning disks, hence have the following :

seeks are costly
reading or writing sequentially is optimised
sector sizes are usually from 512 bytes to 4 kb. Contiguous operations in this range are optimised.

Solid-state drives

These are some important behaviours to consider in SSDs :

made of pages -> blocks -> plane -> dies. In each level, you can have multiple of the same (a block can have many pages).
Hence, the read and write granularity is still pages.
Flash translation layer maps page IDs to their physical location (hence being good at random reads!)
hence when you read a single word, the whole block is pulled into memory and is handled by block device managers. This raises the need for buffering to reduce I/O.
writes only happen to unwritten pages, hence all the "orphaned" pages need to be cleaned using a garbage collecting process which could have a negative cost.

B-trees

From here on the book covers complexity and algorithm (insertion and deletion) for trees. I have skipped more of it except for some important points. I have considered knowledge on b-trees basic.

Insertion (splits)

Splits happen from the leaf, if full we split and place the element in the corresponding half. We equally add the new pointer in the parent, if a parent is full it is also split.

Most splits don't propagate for more than 1 level. BTrees are good this way (Happens due to high fanout).

Deletion (merges)

In any given level, if the sum of elements in two adjacent nodes is less than the number of pointers a single node can hold, it is merged.
Just like splits, merges can also propagate to the root but are rare.

Conclusion

In this chapter, we simply understand how btree works as a data structure and algorithm, on how to use (implement) them in disks is another chapter. Looking forward to it!

Home | Index | Posts | Contact-me | LinkStash | PaperStash

Basics of Btrees - Database internals 01