Kafka Confluent Schema Registry

This blog post covers some of the concepts related to Apache Avro and Confluent Schema Registry.

Why Schema Registry?

Kafka transfers raw bytes and does not perform any data verification, which keeps it very efficient on CPU
Kafka does not deal with changes in the data format
As a side effect, when the data format changes, consumers may no longer be able to process the data
To avoid such inconsistencies without impacting the performance and scaling capabilities of Kafka, Confluent created the Schema Registry component.

What is Schema Registry?

Schema Registry is a component from Confluent
It stores schemas, supports schema evolution and is designed to be lightweight
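
As a rough illustration of why the registry can stay lightweight: with Confluent's serializers, each Kafka message carries a small header that points at the schema instead of the schema itself. The sketch below packs and unpacks that header (a magic byte of 0 followed by a 4-byte big-endian schema ID). The function names, the schema ID 42 and the sample payload are made up for the example.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0  # first byte of the Confluent wire format


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro-encoded payload with the magic byte and schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes) -> Tuple[int, bytes]:
    """Split a framed message into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a Schema Registry framed message")
    return schema_id, message[5:]


# Hypothetical schema ID and payload, for illustration only.
message = frame(42, b"\x06abc")
print(unframe(message))
```

A consumer would use the extracted ID to fetch the schema from the registry (typically caching it) before decoding the payload.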

Why Avro?

JSON supports multiple data types and is supported by many languages
JSON is less verbose than XML, but can still be quite verbose because keys are repeated for every item in a collection
One further limitation of JSON is that there is no schema enforcement
Avro is a data format that addresses these limitations of JSON
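
To make the verbosity and schema points concrete, the following sketch serializes a small collection with Python's json module and counts how many characters go to key names. The records are invented sample data.

```python
import json

# Invented sample data: three customers sharing the same two keys.
customers = [
    {"first_name": "Ann", "last_name": "Lee"},
    {"first_name": "Bob", "last_name": "Ray"},
    {"first_name": "Cy", "last_name": "Poe"},
]

encoded = json.dumps(customers)

# Every object repeats its key names, so this cost grows with the collection.
key_chars = sum(len(json.dumps(key)) for row in customers for key in row)

print(len(encoded), "characters in total,", key_chars, "spent on key names")

# Nothing stops a producer from sending a different type under the same key:
# this parses without complaint because JSON has no schema enforcement.
surprise = json.loads('{"first_name": 123}')
print(type(surprise["first_name"]).__name__)
```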

What is Avro?

Avro in simple terms is JSON + Schema
It is a binary data format
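Data is compact, since field names are not repeated in every record; block compression is optional in Avro data files
The schema travels with the data: Avro data files embed it, and with Schema Registry each message references it by ID
Schemas can be documented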
Data can be processed in multiple languages
The most important feature supported by Avro is schema evolution
Since the data is serialized in binary form, it cannot be read without an Avro deserializer
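
The size difference is easiest to see by encoding a record by hand. The sketch below is a hand-rolled encoder for three primitive types only, not the Avro library: int and long values use zig-zag variable-length encoding, strings are a length followed by UTF-8 bytes, and a record is just its field values written in schema order. The sample record and its `age` field are assumptions made for the example.

```python
import json


def encode_long(n: int) -> bytes:
    """Avro int/long: zig-zag the value, then write 7 bits per byte."""
    z = (n << 1) ^ (n >> 63)  # zig-zag maps small negatives to small positives
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_boolean(b: bool) -> bytes:
    """Avro boolean: a single byte, 1 or 0."""
    return b"\x01" if b else b"\x00"


# Invented sample record; field order is fixed by the (assumed) schema.
record = {"first_name": "Ann", "age": 30, "automated_email": True}

avro_bytes = (
    encode_string(record["first_name"])
    + encode_long(record["age"])
    + encode_boolean(record["automated_email"])
)
json_bytes = json.dumps(record).encode("utf-8")

print(len(avro_bytes), "bytes as Avro vs", len(json_bytes), "bytes as JSON")
```

The Avro bytes contain no field names or type markers, which is also why a reader needs the schema to make sense of them.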

Avro Data Types

null - no value
string - Unicode character sequence
boolean - true or false
int - 32-bit signed integer
long - 64-bit signed integer
float - single-precision (32-bit) floating point number
double - double-precision (64-bit) floating point number
bytes - sequence of 8-bit unsigned bytes
Enums
- For modelling fields whose values come from a fixed, enumerated set
- Once an enum is defined, its values should not be changed, since that breaks compatibility
Arrays
- To represent a list of items of undefined size
- Any data type can be used for the items
Maps
- List of key-value pairs; keys are always strings
- Define the type for the values
Unions
- Unions allow fields to take different types e.g. string, int, etc.
- If a default is defined, it must match the type of the first item in the union
- Typical use case is to allow optional values
- {"name": "salutation", "type": ["null", "string"], "default": null}
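
For reference, here is how the four complex types might look together in one schema, written as a Python dict so it can be dumped to JSON. The record name, field names and enum symbols are hypothetical. The small helper illustrates the union rule above for primitive first branches only; it is a sketch, not a full validator.

```python
import json

# Hypothetical record combining enum, array, map and union fields.
schema = {
    "type": "record",
    "name": "CustomerProfile",
    "fields": [
        {"name": "status",
         "type": {"type": "enum", "name": "Status",
                  "symbols": ["BRONZE", "SILVER", "GOLD"]}},
        {"name": "emails", "type": {"type": "array", "items": "string"}},
        {"name": "preferences", "type": {"type": "map", "values": "string"}},
        {"name": "salutation", "type": ["null", "string"], "default": None},
    ],
}

# Python type expected for a default of each primitive Avro type (subset).
PYTHON_TYPE_OF = {
    "null": type(None), "string": str, "boolean": bool, "int": int, "long": int,
}


def union_default_ok(field: dict) -> bool:
    """True if the default matches the first branch of a primitive union."""
    first_branch = field["type"][0]
    return type(field["default"]) is PYTHON_TYPE_OF[first_branch]


print(json.dumps(schema, indent=2))
print(union_default_ok(schema["fields"][3]))
```

Swapping the union to `["string", "null"]` while keeping `"default": null` would make the helper return False, which mirrors the rule that the default belongs to the first listed type.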

Avro Record Schemas

{
     "type": "record",
     "namespace": "io.github.aparnachaudhary",
     "name": "Customer",
     "fields": [
       { "name": "first_name", "type": "string", "doc": "First Name of Customer" },
       { "name": "last_name", "type": "string", "doc": "Last Name of Customer" },
       { "name": "height", "type": "float", "doc": "Height at the time of registration in cm" },
       { "name": "weight", "type": "float", "doc": "Weight at the time of registration in kg" },
       { "name": "automated_email", "type": "boolean", "default": true, "doc": "Set to true if the user is enrolled in marketing emails" }
     ]
}
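
Because the schema is plain JSON, it is easy to inspect programmatically. The sketch below loads the Customer schema above (doc strings omitted for brevity), derives its full name from namespace and name, and checks a candidate record for missing required fields and wrong types. The `problems` helper is an illustrative assumption that only covers the three types used in this schema; real Avro libraries do this as part of serialization.

```python
import json

SCHEMA_JSON = """
{
  "type": "record",
  "namespace": "io.github.aparnachaudhary",
  "name": "Customer",
  "fields": [
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"},
    {"name": "height", "type": "float"},
    {"name": "weight", "type": "float"},
    {"name": "automated_email", "type": "boolean", "default": true}
  ]
}
"""

schema = json.loads(SCHEMA_JSON)
full_name = schema["namespace"] + "." + schema["name"]

CHECKS = {
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "float": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def problems(record: dict) -> list:
    """List what stops a dict from being written with the Customer schema."""
    found = []
    for field in schema["fields"]:
        name = field["name"]
        if name not in record:
            if "default" not in field:  # fields with a default may be omitted
                found.append("missing required field: " + name)
        elif not CHECKS[field["type"]](record[name]):
            found.append("wrong type for field: " + name)
    return found


print(full_name)
print(problems({"first_name": "Ann", "last_name": "Lee",
                "height": 170, "weight": 60.5}))
```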

Avro Logical types

decimal - arbitrary-precision decimal number, stored as bytes
date - number of days since the Unix epoch
time-millis - number of milliseconds after midnight
timestamp-millis - number of milliseconds since the Unix epoch
{ "name": "registered_on", "type": { "type": "long", "logicalType": "timestamp-millis" }, "doc": "Time of registration in milliseconds since epoch time" }
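
Logical types only annotate a primitive, so the application (or its Avro library) converts to and from the underlying number. A minimal sketch of those conversions using Python's standard library follows; the helper names are made up for the example.

```python
from datetime import date, datetime, time, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp_millis(moment: datetime) -> int:
    """timestamp-millis: a long. Expects a timezone-aware datetime."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


def to_date(day: date) -> int:
    """date: an int counting days since 1970-01-01."""
    return (day - date(1970, 1, 1)).days


def to_time_millis(t: time) -> int:
    """time-millis: an int counting milliseconds after midnight."""
    seconds = (t.hour * 60 + t.minute) * 60 + t.second
    return seconds * 1000 + t.microsecond // 1000


registered_on = datetime(2020, 1, 1, tzinfo=timezone.utc)
print(to_timestamp_millis(registered_on))
print(to_date(registered_on.date()))
print(to_time_millis(time(0, 0, 1, 500000)))
```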

Schema Evolution types and guidelines

Schema Evolution Types:

Backward - the new schema can read data written with the old schema
Forward - the old schema can read data written with the new schema
Full - both backward and forward compatible
None/Breaking - no compatibility
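
The difference between backward and forward is easier to see with two schema versions and a toy reader. The sketch below works on already-decoded records (plain dicts) and mimics the core of schema resolution: fields the reader does not know are dropped, and fields missing from the data are filled from the reader's defaults. `V1`, `V2` and `read_with` are illustrative names, and type promotion and aliases are ignored.

```python
V1 = {
    "type": "record", "name": "Customer",
    "fields": [{"name": "first_name", "type": "string"}],
}
V2 = {
    "type": "record", "name": "Customer",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "automated_email", "type": "boolean", "default": True},
    ],
}


def read_with(reader_schema: dict, written_record: dict) -> dict:
    """Project a decoded record onto the reader's schema."""
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written_record:
            result[name] = written_record[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError("no value and no default for " + name)
    return result


old_data = {"first_name": "Ann"}                             # written with V1
new_data = {"first_name": "Bob", "automated_email": False}   # written with V2

backward = read_with(V2, old_data)  # new schema reads old data
forward = read_with(V1, new_data)   # old schema reads new data
print(backward)
print(forward)
```

Both directions succeed here, so moving from V1 to V2 would count as a fully compatible change.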

General Guidelines:

Make the primary key a required field
Use default values for fields that could be removed in the future
Never change enum values
Never rename fields; use aliases instead
When adding new fields, always give them default values
Never delete a required field
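
Several of these guidelines follow from one rule: a field that exists on only one side of a schema change needs a default. The sketch below classifies a change from that rule alone. It is a simplified assumption-laden check (field presence and defaults only, no type changes, aliases or enum symbols), not the check Schema Registry actually runs, and the schemas are invented.

```python
def fields_by_name(schema: dict) -> dict:
    return {f["name"]: f for f in schema["fields"]}


def compatibility(old: dict, new: dict) -> str:
    """Classify a schema change by looking at added and removed fields."""
    old_fields, new_fields = fields_by_name(old), fields_by_name(new)
    added = [f for n, f in new_fields.items() if n not in old_fields]
    removed = [f for n, f in old_fields.items() if n not in new_fields]
    # New schema reading old data must fill every added field from a default.
    backward = all("default" in f for f in added)
    # Old schema reading new data must fill every removed field from a default.
    forward = all("default" in f for f in removed)
    if backward and forward:
        return "FULL"
    if backward:
        return "BACKWARD"
    if forward:
        return "FORWARD"
    return "NONE"


def record(*fields: dict) -> dict:
    return {"type": "record", "name": "Customer", "fields": list(fields)}


ID = {"name": "id", "type": "string"}
FIRST = {"name": "first_name", "type": "string"}
GIVEN = {"name": "given_name", "type": "string"}
EMAIL = {"name": "email", "type": "string", "default": ""}
PHONE = {"name": "phone", "type": "string"}

v1 = record(ID, FIRST)
print(compatibility(v1, record(ID, FIRST, EMAIL)))  # added with a default
print(compatibility(v1, record(ID, FIRST, PHONE)))  # added without a default
print(compatibility(v1, record(ID)))                # required field deleted
print(compatibility(v1, record(ID, GIVEN)))         # field renamed, no alias
```

Under this simplified rule, the rename case comes out as incompatible in both directions, which is the reason for preferring aliases over renames.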