How FerretDB stores BSON in JSONB
Chi Fujii
November 8, 2022

Image credit: Christian Wiediger / Unsplash

Image credit: Christian Wiediger / Unsplash

At FerretDB, we are converting MongoDB wire protocol queries into SQL, to store information from BSON to PostgreSQL. 

To achieve this, we created our own mapping called PJSON which translates MongoDB’s BSON format into PostgreSQL’s JSONB. 

Using BSON with JSONB offers us significantly faster operation, greater range of data types compared to regular JSON, and more flexibility in modeling data of any structure. 

In this article, we’ll show you how FerretDB converts and stores MongoDB data in PostgreSQL.

What is BSON?

BSON holds a collection of field name/value pairs which is called a document. It contains length information allowing serializer/deserializer to utilize it for the performance benefit. Additionally, it preserves the order of the fields. BSON also supports additional data types such as DateTime and binary.

Let’s look at an example of how a JSON is encoded in BSON. Here the hexadecimal `\x00` notation represents `0000 0000` in bits and similarly `\x12` represents `0001 0010`. In BSON, `\x00` is a byte used as a terminator to indicate the end of a document. In BSON, field names use `cstring` which are UTF-8 characters followed by `\x00`.

{“foo”: “bar”}

\x12\x00\x00\x00 Document length in little-endian int32 (18 bytes document)
\x02 string field type `\x02`, see the BSON spec
foo\x00 Cstring field name
\x04\x00\x00\x00bar\x00 String length in little-endian int32 `\x04\x00\x00\x00` (4 bytes string) followed by string value and trailing `\x00`
\x00 Document terminator


You notice that BSON is binary serialized so it is not human readable. It contains information about the length, and an explicitly defined field type.

At FerretDB, we use PostgreSQL as a database engine and we store JSON data in JSONB data type. In light of this, we need to store BSON equivalent information in JSON format without losing type information. 

Introducing PJSON 

To embed necessary information from BSON, we introduced special keys prefixed with `$`. PJSON contains `$` followed by a character to embed information about types and the order of the fields. PJSON is designed to be serialized to JSONB. Let’s look at a simple BSON with an `ObjectId` field and see how it is represented in PJSON.

{"_id”: ObjectId("635202c8f75e487c16adc141")}
\x16\x00\x00\x00\x07_id\x00\x63\x52\x02\xc8\xf7\x5e\x48\x7c\x16\xad\xc1\x41\x00

In PJSON, we store the order of fields in the `$k` field, and `ObjectId` in the `$o` field. The duplicate fields are not allowed in PJSON.

{
  "$k": [
    "_id"
  ],
  "_id": {
    "$o": "635202c8f75e487c16adc141"
  }
}

The `$` prefixes apply to types that are not native to JSON. The current mappings to PJSON are defined below.

Document {“$k”: [“<key 1>”, “<key 2>”, …], “<key 1>”: <value 1>, “<key 2>”: <value 2>, …}
Array JSON array
64-bit binary floating point {“$f”: JSON number}
UTF-8 string JSON string
Binary data

{“$b”: “<base 64 string>”, “s”: <subtype number>}

// s is binary subtype, the detail is found in the BSON spec

ObjectId {“$o”: “<ObjectID as 24 character hex string”}
Boolean JSON true / false values
UTC datetime {“$d”: milliseconds since epoch as JSON number}
Null JSON null
Regular expression {“$r”: “<string without terminating 0x0>”, “o”: “<string without terminating 0x0>”}
Timestamp {“$t”: “<number as string>”}
64-bit integer {“$l”: “<number as string>”}

In addition to `$` prefixes, binary data have an additional field `s`  to indicate the type of binary. Also, regular expressions have an additional field `o` to specify options such as case sensitivity. 

PJSON Example: Translating and storing BSON in JSONB

Let’s look at an example of inserting BSON and storing it as PJSON using the following document.

db.groceries.insert({
_id:		ObjectId("635202c8f75e487c16adc141"),
name:		“milk”,
quantity:	3
})

1. Deserialize from BSON

In our BSON deserializer implementation, we use the byte tag according to BSON spec to extract field name value pairs.

\x33\x00\x00\x00					Document length
\x07							ObjectId field type
_id\x00							
\x63\x52\x02\xc8\xf7\x5e\x48\x7c\x16\xad\xc1\x41		
\x02							String field type
name\x00					
\x05\x00\x00\x00milk\x00
\x10							int32 field type
quantity\x00
\x03\x00\x00\x00
\x00
quantity\x00
\x03\x00\x00\x00
\x00

A BSON document contains the document length in `int32` at the very first entry of the document. BSON uses a little-endian format for `int32` type, the little-endian orders bytes from the least significant byte to the most significant byte. The document length `\x33\x00\x00\x00` is in bytes so this document is 51 bytes in length in decimal representation.

The first field type `\x07` is `ObjectId`. The field name `_id\x00` is obtained by reading until the first `\x00`. The BSON field name `_id\x00` is `cstring` type which is a UTF-8 `string` followed by `\x00`. We convert `cstring` to `string` by dropping the `\x00`, and we have the field name `_id`. The `ObjectId` field value is specified in BSON spec as 12 bytes, we read subsequent 12 bytes as `ObjectId` which are `\x63\x52\x02\xc8\xf7\x5e\x48\x7c\x16\xad\xc1\x41`.

The second field type `\x02` is `string`, the field name `name\x00`  is obtained by reading the `cstring` field name until the first `\x00. The string field value `\x05\x00\x00\x00milk\x00` contains the string length followed by the value. The first part `\x05\x00\x00\x00` is string length in little-endian int32 format which says the string is 5 bytes length. We read the subsequent 5 bytes `milk\x00` and we drop the `\x00` to get the field value of `milk`.

The last field `\x10` is `int32` type, encoded in little-endian with 8 bytes length. The field name `quantity\x00` is obtained by reading until the first `\x00`, and the field value of `int32` is obtained by reading the subsequent 8 bytes `\x03\x00\x00\x00`. The field value is little-endian format `int32` which is 3.

The terminator `\x00` indicates it’s the end of the BSON document.

2. Serialize to PJSON

We use the array `$k` to preserve the order of the fields. In PJSON, the format and types, such as `string` and `int32`, are the same as JSON types. On the other hand, `ObjectId` is represented by the `$o` key which is specific to BSON.

{
  "$k": [
    "_id",
    "name",
    "quantity"
  ],
  "_id": {
    "$o": "635202c8f75e487c16adc141"
  },
  "name": "milk",
  "quantity": 3
}

3. Store PJSON to JSONB

We use a table `_ferretdb_settings` to keep track of existing collections. For each collection, we create a table with a JSONB column and it stores the documents of the collection.

When we insert a document to the `groceries` collection, an entry related to a new collection `groceries` was added to `_ferretdb_settings`. This happened because it was the first document inserted into the `groceries` collection. And a new document was inserted to a new generated table `groceries_6a5f9564`.

ferretdb=# \d test._ferretdb_settings 
          Table "test._ferretdb_settings"
  Column  | Type  | Collation | Nullable | Default 
----------+-------+-----------+----------+---------
 settings | jsonb |           |          | 

ferretdb=# SELECT settings FROM test._ferretdb_settings;
                                             settings                                             
--------------------------------------------------------------------------------------------------
 {"$k": ["collections"], "collections": {"$k": ["groceries"], "groceries": "groceries_6a5f9564"}}
(1 row)
ferretdb=# \d test.groceries_6a5f9564 
         Table "test.groceries_6a5f9564"
 Column | Type  | Collation | Nullable | Default 
--------+-------+-----------+----------+---------
 _jsonb | jsonb |           |          | 
ferretdb=# SELECT _jsonb FROM test.groceries_6a5f9564;
                                                    _jsonb                                                     
---------------------------------------------------------------------------------------------------------------
 {"$k": ["_id", "name", "quantity"], "_id": {"$o": "635202c8f75e487c16adc141"}, "name": "milk", "quantity": 3}
(1 row)

This is how we store BSON information at FerretDB in PostgreSQL JSONB.

Roundup

So far, we’ve shown how we use PJSON mappings to store MongoDB BSON data in JSONB while preserving type information and field orders. This process includes deserializing BSON and mapping it to PJSON, and finally storing them in JSONB.

For our community contributors and users, understanding how we convert BSON data in MongoDB to JSONB of PostgreSQL helps them gain more insight into how FerretDB works. 

To start contributing to FerretDB, read our contribution guidelines.

2 Comments

  1. Alexey Palazhchenko

    Having many copies of field names and types is the biggest one. Implementing indexes will also be tricky, but we have a few ideas we could try to make it easier. And, of course, we are considering extracting common fields into separate columns.

    Reply
  2. Tim Callaghan

    Any downsides to this approach other than the document size bloating by the storing another copy of all the field names?

    Reply

Submit a Comment

Your email address will not be published. Required fields are marked *