This blog post covers some of the core concepts behind Apache Avro and the Confluent Schema Registry.

Why Schema Registry?

  • Kafka transfers raw bytes and performs no data verification, which is why it is so efficient on CPU
  • Kafka itself does not deal with changes in the data format
  • As a side effect, when the data format changes, consumers can no longer process the data
  • To avoid such inconsistency issues without impacting the performance and scaling capabilities of Kafka, Confluent created the Schema Registry component.

What is Schema Registry?

  • Schema Registry is a component from Confluent that stores schemas and serves them over a REST API
  • It supports schema storage, retrieval, and schema evolution, and is designed to be lightweight: serializers cache schemas locally, so only a small schema ID travels with each message
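
As a quick illustration of the REST API, a new schema version can be registered under a subject with a single HTTP call. This is a minimal sketch that assumes a registry running locally on the default port 8081 and a subject named customer-value:

  curl -X POST http://localhost:8081/subjects/customer-value/versions \
    -H "Content-Type: application/vnd.schemaregistry.v1+json" \
    --data '{"schema": "{\"type\": \"string\"}"}'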

Why Avro?

  • JSON can represent rich data structures and is supported by virtually every language
  • JSON is less verbose than XML, but can still be quite verbose because keys are repeated for every element in a collection
  • One major limitation of JSON, however, is that there is no schema enforcement
  • Avro is a data format that addresses these limitations of JSON

What is Avro?

  • Avro, in simple terms, is JSON + schema
  • It is a binary data format
  • The binary encoding is compact, and data can additionally be compressed
  • The schema is attached to the data
  • Schemas can be documented
  • Data can be processed in multiple languages
  • The most important feature of Avro is support for schema evolution
  • Since the data is serialized in binary form, it cannot be read without an Avro deserializer

Avro Data Types

  • null - no value
  • string - sequence of unicode characters
  • boolean - binary value
  • int - 32 bit signed integer
  • long - 64 bit signed integer
  • float - single precision floating point number
  • double - double precision floating point number
  • bytes - sequence of 8-bit unsigned bytes
  • Enums
    • For modelling fields whose values come from a fixed, enumerated set (enums, arrays, and maps are shown together in the sketch after this list)
    • Once an enum is defined, changing its values is forbidden
  • Arrays
    • To represent a list of items of undefined size
    • Any data type can be used for the items
  • Maps
    • A collection of key-value pairs in which the keys are always strings
    • Only the type of the values is declared
  • Unions
    • Unions allow a field to take one of several types, e.g. string, int, etc.
    • If a default is defined, it must be of the type of the first item in the union
    • The typical use case is to model optional values
    • { "name": "salutation", "type": ["null", "string"], "default": null }
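
To make the complex types concrete, here is a minimal sketch of field definitions; the field names are made up for illustration:

  { "name": "eye_color", "type": { "type": "enum", "name": "EyeColor", "symbols": ["BROWN", "BLUE", "GREEN"] } },
  { "name": "nicknames", "type": { "type": "array", "items": "string" } },
  { "name": "exam_scores", "type": { "type": "map", "values": "int" } }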

Avro Record Schemas

{
     "type": "record",
     "namespace": "io.github.aparnachaudhary",
     "name": "Customer",
     "fields": [
       { "name": "first_name", "type": "string", "doc": "First Name of Customer" },
       { "name": "last_name", "type": "string", "doc": "Last Name of Customer" },
       { "name": "height", "type": "float", "doc": "Height at the time of registration in cm" },
       { "name": "weight", "type": "float", "doc": "Weight at the time of registration in kg" },
       { "name": "automated_email", "type": "boolean", "default": true, "doc": "Set to true if the user is enrolled in marketing emails" }
     ]
}

Avro Logical types

  • decimal - arbitrary-precision signed decimal number (annotates bytes or fixed)
  • date - number of days since the unix epoch (annotates int)
  • time-millis - number of milliseconds after midnight (annotates int)
  • timestamp-millis - number of milliseconds since the unix epoch (annotates long)
  • Note that the logicalType attribute goes inside the type definition, not at the field level:
  • { "name": "registered_on", "type": { "type": "long", "logicalType": "timestamp-millis" }, "doc": "Time of registration in milliseconds since epoch time" }
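
For instance, a monetary amount could be modelled with the decimal logical type; a minimal sketch, with illustrative precision and scale:

  { "name": "account_balance", "type": { "type": "bytes", "logicalType": "decimal", "precision": 9, "scale": 2 }, "doc": "Balance with two decimal places" }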

Schema Evolution types and guidelines

Schema Evolution Types:

  • Backward - read old data with the new schema
  • Forward - read new data with the old schema
  • Full - both backward and forward compatible
  • None/Breaking - no compatibility guarantees
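
As a sketch of a backward-compatible change, adding an optional field with a default lets the new schema still read records produced with the old one; the phone_number field below is made up for illustration:

  { "name": "phone_number", "type": ["null", "string"], "default": null, "doc": "Optional phone number, added in v2" }

When the new schema reads an old record that lacks phone_number, the default value null is used.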

General Guidelines:

  • Make the primary key a required field
  • Use default values for fields that could be removed in the future
  • Never, ever change enum values
  • Never rename fields; use aliases instead (see the sketch after this list)
  • When adding new fields, always use default values
  • Never delete a required field
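
For instance, if first_name from the Customer schema ever had to be renamed, an alias keeps data written with the old name readable; a minimal sketch with a hypothetical new name:

  { "name": "given_name", "aliases": ["first_name"], "type": "string", "doc": "First Name of Customer" }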