
ZooKeeper - Part 1 | Introduction and API

Written on : 2025-03-16
Status : Completed

Est. Reading time : 3 min

This series of posts is based on the ZooKeeper paper by Yahoo! Research titled "ZooKeeper: Wait-free coordination for Internet-scale systems". Annotated paper here.

What is ZooKeeper?

Is it a warden of the zoo? The strict caretaker who maintains the zoo and keeps the animals in check? Jokes aside, at first glance ZooKeeper looks like a high-level distributed file system.

What? A File system? Why is (old) Kafka using it? Wdym??

Turns out ZooKeeper's simple API can be used to implement many of the coordination requirements of distributed systems.

To put it in simpler terms, ZooKeeper is a toolset to help implement distributed systems coordination primitives.

Some primitives we see distributed systems needing are :

  • Leader election
  • Group membership
  • Configuration management
  • Locks
  • Barriers

We will see how each of these primitives is implemented using ZooKeeper in this series of posts.

Some of ZooKeeper's (from here on ZK) enticing features are:

  • Wait-free : Most of its API calls are wait-free.
  • Simple API : Provides a seemingly simple API that looks like that of a file system.
  • Read optimised : ZK is optimised for read-heavy workloads.
  • Built for true internet scale.

ZK moved away from implementing these primitives on the server side (as other systems often do, e.g. Chubby) and instead exposes an API that allows us, the developers, to implement the primitives ourselves. Hence, what ZK exposes as its service is often referred to as a coordination kernel.

ZK finds that providing just two guarantees is enough for applications to implement their own primitives:

  • FIFO client order : All requests from a given ZK client are executed in the order in which the client sent them.
  • Linearizable writes : All requests that update the ZK state are serialized into a single order.

To implement linearizability, ZK uses a leader-based atomic broadcast protocol called ZAB. Reads, however, are served locally by the server the client is connected to, so they do not have to be queued up in ZAB.
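To make the consequence of local reads concrete, here is a toy, in-memory sketch. It is not ZAB and not the real ZK code; `Leader`, `Follower` and `catch_up` are made-up names for this illustration. The leader puts every write into one ordered log, the follower applies that log with some lag, and a read served from the follower can therefore be stale until it catches up (which is what the sync call in the API section is for).

```python
class Leader:
    """Orders every write into a single log (a stand-in for the total order ZAB provides)."""

    def __init__(self):
        self.log = []  # list of (path, value) in commit order

    def write(self, path, value):
        self.log.append((path, value))


class Follower:
    """Applies the leader's log in order, possibly lagging behind it."""

    def __init__(self, leader):
        self.leader = leader
        self.applied = 0  # number of log entries applied locally
        self.state = {}

    def catch_up(self):
        # Apply, in log order, every entry this replica has not seen yet.
        for path, value in self.leader.log[self.applied:]:
            self.state[path] = value
        self.applied = len(self.leader.log)

    def read(self, path):
        # Served from local state: no round trip to the leader.
        return self.state.get(path)

    def sync(self):
        # Toy version of sync: apply everything pending right now.
        self.catch_up()


leader = Leader()
replica = Follower(leader)

leader.write("/config", "v1")
replica.catch_up()
leader.write("/config", "v2")      # the replica has not applied this yet

stale = replica.read("/config")    # local read, may lag behind the leader
replica.sync()
fresh = replica.read("/config")    # after sync the replica has caught up
```

Writes stay totally ordered because they all pass through the single log; only reads are allowed to lag.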

This series of posts introduces the following :

  • Coordination kernel : A wait-free coordination service with relaxed consistency guarantees for use in distributed systems.
  • Coordination recipes : How ZK can be used to build higher level coordination primitives.

The ZK service :

Actors of ZK :

  • Client : The user of the ZK service
  • Server : Process providing the ZK service
  • Ensemble : The group of servers
  • Znode : An in-memory data node in the ZK data. Znodes are organised in a hierarchical namespace referred to as the data tree

Hierarchical namespace is just a fancy term for how file systems are usually organised

  • Updates or writes : Operations that modify the state of the data tree
  • Session : Clients establish a new session when they connect to a server.
  • Session : Clients establish a new session when they connect to a server.

Service overview :

ZK provides its clients with an abstraction of a set of znodes (data nodes). These znodes are what clients manipulate through the ZK service. A znode is referred to using the typical UNIX notation for file paths. As an example : /app1/p_1 denotes the path to the znode p_1
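As a rough mental model (and only that), the data tree can be pictured as a map from absolute paths to data. The sketch below is a toy of my own, not how ZK stores things; `DataTree`, `parent` and `children` are hypothetical names. It only shows the path notation and the rule that a znode lives under an existing parent.

```python
class DataTree:
    """Toy data tree: a flat dict keyed by absolute UNIX-style paths."""

    def __init__(self):
        self.nodes = {"/": b""}  # path -> data; the root always exists

    @staticmethod
    def parent(path):
        # "/app1/p_1" -> "/app1", "/app1" -> "/"
        head = path.rsplit("/", 1)[0]
        return head or "/"

    def create(self, path, data=b""):
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        if self.parent(path) not in self.nodes:
            raise KeyError(f"parent of {path} does not exist")
        self.nodes[path] = data
        return path

    def children(self, path):
        # Names (not full paths) of the znodes directly under `path`.
        return sorted(
            p.rsplit("/", 1)[1]
            for p in self.nodes
            if p != "/" and self.parent(p) == path
        )


tree = DataTree()
tree.create("/app1")
tree.create("/app1/p_1", b"some data")
tree.create("/app1/p_2")
```

Here `tree.children("/app1")` gives the names p_1 and p_2, while creating /app2/x fails because /app2 was never created.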

Image showing znode structures

All znodes can store data, and all znodes except ephemeral ones can have children. Now is a good point to list the types of znodes.

  • Regular : Can store data and has to be deleted explicitly.
  • Ephemeral : Short-lived znodes. They can either be deleted explicitly, or the system removes them automatically when the session that created them terminates. They can hold data but cannot have children.

Additionally, when creating a znode (of either type), a client can set a sequential flag. Doing so appends a monotonically increasing number to the end of the znode name. If n is the new sequential znode and its parent is p, the sequence value attached to the end of n is never smaller than the value in the name of any other sequential znode already created under p. We can see an example of this in the diagram above.
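A small sketch of the naming rule, assuming a simple counter kept per parent (`SequentialNamer` and `create_sequential` are made-up names, and the counter bookkeeping is a simplification, not how the server tracks it). The 10-digit zero padding mirrors what real ZooKeeper appends, which also makes the names sort in creation order.

```python
from collections import defaultdict


class SequentialNamer:
    """Toy model of the sequential flag: one counter per parent znode."""

    def __init__(self):
        self.next_seq = defaultdict(int)  # parent path -> next sequence value

    def create_sequential(self, parent, prefix):
        seq = self.next_seq[parent]
        self.next_seq[parent] += 1
        # Zero padding keeps lexicographic order equal to creation order.
        return f"{parent}/{prefix}{seq:010d}"


namer = SequentialNamer()
first = namer.create_sequential("/app1", "lock-")
second = namer.create_sequential("/app1", "lock-")
third = namer.create_sequential("/app1", "other-")   # same parent, same counter
elsewhere = namer.create_sequential("/app2", "lock-")  # different parent, own counter
```

Note that the counter belongs to the parent, not to the name prefix, so the third znode continues the sequence even though its prefix differs.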

ZK also provides watches. Watches are a way for clients to get notified of a change without polling. A watch is created by setting a flag on a read operation. A watch only indicates that a change has happened; it does not return the new value itself.
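Here is a minimal in-memory sketch of that behaviour. `WatchedStore` is a hypothetical class of mine, and passing a callback is an assumption of the sketch (in the API below the watch is just a flag). In ZooKeeper a watch is a one-time trigger, so the sketch clears a watch once it has fired.

```python
class WatchedStore:
    """Toy store where a read can leave a one-time watch behind."""

    def __init__(self):
        self.data = {}
        self.watches = {}  # path -> callbacks waiting for the next change

    def get_data(self, path, watch=None):
        if path not in self.data:
            raise KeyError(path)  # no znode, so no watch is registered either
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.data[path]

    def set_data(self, path, value):
        self.data[path] = value
        # Fire and forget: pop() removes the watches, so each fires at most once.
        for callback in self.watches.pop(path, []):
            callback(path)  # only the path is passed, never the new value


events = []
store = WatchedStore()
store.set_data("/config", "v1")

store.get_data("/config", watch=events.append)  # read and set a watch
store.set_data("/config", "v2")                 # triggers the watch
store.set_data("/config", "v3")                 # no watch left, nothing fires
```

After the notification the client has to read again (and set a new watch if it wants one) to learn the current value.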

Sessions

Sessions have an associated timeout. ZK considers a client faulty if it does not receive anything from its session for longer than that timeout. A session ends when the client explicitly closes it or when ZK detects that the client is faulty.
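The timeout rule, together with the cleanup of ephemeral znodes described earlier, can be sketched as below. This is a toy under my own assumptions: `SessionTracker` and its methods are hypothetical, and time is passed in explicitly as a number instead of being read from a clock.

```python
class SessionTracker:
    """Toy session bookkeeping: timeouts plus ephemeral znode cleanup."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}   # session id -> time of the last message
        self.ephemerals = {}  # session id -> ephemeral znode paths it created

    def heartbeat(self, session_id, now):
        # Any message from the client counts; the first one opens the session.
        self.last_seen[session_id] = now
        self.ephemerals.setdefault(session_id, set())

    def create_ephemeral(self, session_id, path):
        self.ephemerals[session_id].add(path)

    def close(self, session_id):
        # Ending a session removes the ephemeral znodes it created.
        self.last_seen.pop(session_id, None)
        return self.ephemerals.pop(session_id, set())

    def expire(self, now):
        # Sessions silent for longer than the timeout are treated as faulty.
        dead = [s for s, t in self.last_seen.items() if now - t > self.timeout]
        removed = set()
        for session_id in dead:
            removed |= self.close(session_id)
        return removed


tracker = SessionTracker(timeout=10)
tracker.heartbeat("a", now=0)
tracker.heartbeat("b", now=0)
tracker.create_ephemeral("a", "/workers/a")
tracker.create_ephemeral("b", "/workers/b")

tracker.heartbeat("b", now=8)      # b is still talking, a has gone quiet
removed = tracker.expire(now=11)   # a has been silent for 11 > 10
```

Only the znode of the silent session is removed, which is the property that group membership and leader election recipes rely on.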

The API

We can now define the API more formally :

  • create(path, data, flags): Creates a znode at path, stores data in it and returns the name of the new znode. The flags select the type of znode (regular or ephemeral) and additionally set the sequential flag.
  • delete(path, version): Deletes the znode at path if the znode is at the given version.
  • exists(path, watch): Returns true if the znode is present and false otherwise. When the watch flag is set, a notification is sent to the session when the znode comes into existence or goes out of existence.
  • getData(path, watch): Returns the data and metadata (such as the version) of the znode. It sets a watch only if the znode exists; unlike exists, it does not set the watch when the znode is missing.
  • setData(path, data, version): Writes data to the znode at path if the current version number of the znode is version.
  • getChildren(path, watch): Returns the set of names of the children of the znode.
  • sync(path): Waits for all updates pending at the start of this call to propagate to the server the client is connected to. It does not return any data itself.

For any call that takes a version argument, passing -1 skips the version check.
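The version argument is what turns setData and delete into conditional updates (think compare-and-set). Below is a toy sketch of that check only, not the real client library; `VersionedStore` and `BadVersion` are hypothetical names, and the sketch assumes a fresh znode starts at version 0 and every successful write bumps it by one.

```python
class BadVersion(Exception):
    """Raised when the caller's expected version does not match the znode."""


class VersionedStore:
    """Toy model of conditional updates keyed on a znode's version."""

    def __init__(self):
        self.nodes = {}  # path -> (data, version)

    def create(self, path, data):
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = (data, 0)  # a fresh znode starts at version 0

    def get_data(self, path):
        return self.nodes[path]  # (data, version)

    def _check(self, path, version):
        current = self.nodes[path][1]
        if version != -1 and version != current:
            raise BadVersion(f"expected {version}, znode is at {current}")
        return current

    def set_data(self, path, data, version):
        current = self._check(path, version)
        self.nodes[path] = (data, current + 1)  # each write bumps the version
        return current + 1

    def delete(self, path, version):
        self._check(path, version)
        del self.nodes[path]


store = VersionedStore()
store.create("/counter", 0)

value, version = store.get_data("/counter")
store.set_data("/counter", value + 1, version)  # version matches, write applied

try:
    store.set_data("/counter", 99, version)     # same version again, now stale
    conflict = False
except BadVersion:
    conflict = True

store.set_data("/counter", 5, -1)               # -1 skips the version check
```

A client that hits the conflict would normally re-read the znode to get the latest version and retry, which is how read-modify-write loops stay safe without locks.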

With this, we have a good grip on what ZooKeeper is. Moving forward, we will see how this API is used to implement coordination primitives and how the API itself is implemented in the first place.

fin.