Data layout and schema agreement on Hypercore protocol
Lately I’ve been exploring data models on the Hypercore Protocol. In this post, I’m going to talk about approaches to schemas and semantics in a decentralized setting.
If you’re not familiar with Hypercore protocol already, it’s a peer-to-peer data network that’s similar to BitTorrent. It supports many data structures, but the one I’m focusing on currently is the Hyperbee, which is a distributed B-tree with a LevelDB-like API.
The database model I’ll be exploring is a document-store similar to Mongo or CouchDB (the NoSQL model). Let me explain why relative to some other possible data layouts on top of Hyperbee’s B-Tree:
- Row-oriented (RDBMS model): Row-oriented databases use tables of rows with predefined columns. “Complex objects,” i.e. an object with sub-objects or sub-arrays, are spread across multiple tables and joined at query-time. This is a flexible model but it means that many queries require multiple reads from the database. In Hyper, if a value is not present in the local cache it will be automatically fetched from the network — specifically, the Internet. That means that reads from partially-synced databases can incur a 100–1000ms latency, so we want to reduce reads as much as possible.
- Triple-store (Graph model): Triple-stores encode graph relations according to the
<source, edge, destination>definition. They are intended to efficiently query according to any given set of the three values, and so the database will typically encode multiple variations of the triple:
<destination, edge>, and so on. This is extremely flexible but is arguably akin to a row-oriented database where every attribute is indexed, meaning the database size bloats. This model suffers more significantly than the row-oriented model from read latency, which makes it a non-starter, though a “property graph” model could help with that problem.
A document-store, which I’m exploring now, is like a row-oriented database that’s very denormalized by default. This minimizes the number of reads required to satisfy a query, which optimizes for our network conditions.
There are mitigating factors to my reasoning. If you always pre-sync Hyper databases prior to executing queries, the variable read-latency problem is negated. Even a document-store will want to apply pre-sync strategies so this is not an infeasible solution. If, however, you will need fast “first reads” on new databases, you really want to minimize the amount of records to transfer before satisfying a query.
In a decentralized network, interoperability is a key requirement. No device can dictate the behaviors of other devices, and so a shared language is mandatory. What good is sharing Hyperbees if our applications don’t understand the documents inside of them? To solve this, we need a global schema model.
The most well-known global schema model is RDF, the Resource Description Framework. RDF is a graph-oriented language for global IDs on data. Every addressable entity or property possesses a URI; for instance, a JS object which looks like this:
bio: ‘A cool hacker’,
birthplace: ‘New York’
Would be encoded in RDF somewhat like this:
‘http://example.com/bio': ‘A cool hacker’,
‘http://example.com/birthplace': ‘New York’
Which is, in fact, a kind of rendering of an underlying graph. A more accurate rendering of RDF would be:
#object http://example.com/name ‘Bob’
#object http://example.com/bio ‘A cool hacker’
#object http://example.com/birthplace ‘New York’
#object is a kind of local placeholder. Your application could give the entity being described its own URL, and now you have fully global RDF data:
http://myapp.com/users/bob http://example.com/name ‘Bob’
http://myapp.com/users/bob http://example.com/bio ‘A cool hacker’
http://myapp.com/users/bob http://example.com/birthplace ‘New York’
JSON-LD is a standard way of representing RDF information as JSON. It uses various tricks to shorten the actual keys in the object so that properties don’t have to be referenced by URLs, but URLs are always the keys.
There’s a reasonable argument for using RDF but I’m not a fan of the approach. RDF optimizes for flexibility: the graph model conceptually encapsulates all other data layouts, and it enables arbitrary composition of multiple different schemas. This is nice but comes with complexity. The JSON-LD format is noisy and tedious to program against.
To my thinking, schemas are similar to types in a programming language. They’re a tool to ensure correctness. In a decentralized network, correctness means contractual interoperability: if your application supports a semantic identifier, it should accept and produce a well-known record definition.
To provide interop-correctness, we need strict constraints on the shapes of objects, their attribute types, and their value formats. Schemas need to provide a guarantee; if a record mismatches the schema, it violates the guarantee and should be ignored. This is the same mentality we apply to incoming messages to servers, which is equivalent to our reading databases from untrusted origins.
Machine-readable, globally-identified record schemas
The approach I’m considering is a machine-readable schema definition which uses URLs to identify each schema. For instance, a hypothetical Twitter clone called Fritter would publish a schema identified by
fritter.com/post for its posts. This URL would resolve to the schema document. Schema documents would include a strict definition, most likely JSON-Schema, to validate records.
This model means we are trading flexibility for specificity; whoever controls the schema asserts the shape of its records globally. Put another way, a “Fritter Post” must match the published
fritter.com/post schema to be considered valid. Whoever publishes the schema exerts that control, which is a logical form of centralization.
While some may balk at the idea of schema ownership, I’ve found that haphazard schema composition is painful for users and developers alike. For users, you get bugs and surprise incompatibilites. For developers, you get something akin to decision paralysis; without any clarity on how other applications will behave, you struggle to implement features which require schema changes. Schema ownership creates a clear model for consensus, as the holder of the schema’s identifier is effectively charged with the schemas’ maintanence. (This is no different than the model used for modules in package managers.)
The significant advantage to this approach is that reading from a table gives predictable results. You will never read a
fritter.com/post table and get
critter.com/post records with only a few Fritter Post attributes mixed into it. If an application does need some such extension, it would define an extension-schema such as
critter.com/post-extension which would reference the
fritter.com/post records they extend.
Ultimately we’re looking for a simple way to indicate support for a semantics — a kind of contracts system. Since applications often span multiple record types, it may be necessary to create identifiers for higher-level contracts, such as a claim that “All fritter.com schemas and related behaviors are supported by this database.” Initially a per-record-schema contract seems sufficient, but time will tell.
Well-defined, machine-readable schemas also create the opportunity for behavior definitions such as indexed attributes and computed values such as aggregations.
The interplay between schemas and developers
When your network presents itself as a database, the schema definitions are your API. The creator of a schema is defining the API for the network, which (pro) ensures shared expectations but (con) restricts local decision-making.
A common issue, for instance, might be the need for indexes on attributes which are not specified to be indexed by the schema-author, but this would be a problem whether owned schemas were used or not. Databases which are produced by other devices are not writable; if the index isn’t there, it isn’t there. This means that most applications will be maintaining local indexes which span multiple Hyperbees (a perfectly suitable solution).
One way to view the difference is as “publisher” versus “consumer” indexes. An index which is written to a Hyperbee along with the primary records, eg an index of Fritter Posts by their ctime, is written by the posts’ authors — the publishers. An additional FTS index of the Fritter Posts’s text fields which is not shipped in the source DBs must be generated locally by the consumers. The only difference is that consumer indexes must be produced locally for each node interested in having them, while publisher indexes can be generated once and then shared over the network.
When extensions to a schema are needed, a developer can create new schemas for records which extend their originals. This requires some careful consideration: if a feature depends on the new schema, then software which supports the original schema but not the extension may produce unexpected behaviors. Extensions should almost always be optional; otherwise, an entirely separate schema should be used.
Publishing schemas at nice URLs may be a burdensome requirement, especially for cases in which interoperability is not a concern. The latter could be solved by a kind of “database local” schema definition which lacks a URL identifier; the former could be solved by a kind of “schema package manager” service.
This mostly concludes my current thinking about data layout and schemas on Hypercore. I’m open to considering alternatives to everything stated here, and probably some details will change or refine with anything I develop in the future.
There’s more to be considered such as a query language (perhaps Mango), multi-writer and conflict management, atomic transactions, automatic local index generation to optimize common queries, and techniques to compress record references (which are lengthy Hyper URLs). I’ll continue sharing as I think about these kinds of tasks.