Transactions: MVCC or a query language?
As I have been thinking about a database design on Hypercore, I’ve been drawing from my experience on the now-closed CTZN project.
In that app, you had a simple model: a Node.js server speaking to the Hyperspace daemon, and it could assume it was the only process speaking to Hyperspace.
This meant if I had to write a transactional update such as:
let record = await db.get('foo')
record.value.counter++
await db.put('foo', record.value)
I could make this transactional by applying my own lock regions:
let release = await lock('db')
try {
  let record = await db.get('foo')
  record.value.counter++
  await db.put('foo', record.value)
} finally {
  release()
}
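As an aside, the lock() above isn’t a Hyperbee or Hyperspace API; it’s just an in-process mutex. A minimal sketch of that kind of helper, using a promise chain per lock name, might look like this:
// hypothetical in-process mutex; each named lock is a promise chain
// callers await lock(name) to receive a release() function; waiters queue behind the current holder
const tails = new Map()

function lock (name) {
  const prev = tails.get(name) || Promise.resolve()
  let release
  const current = new Promise(resolve => { release = resolve })
  tails.set(name, current)
  // hand back release() once the previous holder has released
  return prev.then(() => release)
}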
This lock region addresses a different concern than batched writes getting applied atomically, which Hyperbee supports with its batch() API. What the lock ensures is that the record doesn’t get updated by anyone else during the read/update/write cycle, which would clobber whatever was written in the interim.
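For contrast, a Hyperbee batch (roughly, going from memory of its API) groups writes so they land together, but it doesn’t help with the read-then-write race:
// roughly what Hyperbee batch usage looks like (hedging: from memory)
const b = db.batch()
await b.put('a', {counter: 1})
await b.put('b', {counter: 2})
await b.flush() // both writes are applied together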
Well! Now I’m considering how to write a database program that would co-exist with Hyperspace — wrapping it, so to speak.
This would mean you have a Web server speaking to this DB (which speaks to Hyperspace). Of course, there could be two Web servers speaking to our DB, so the DB needs to provide transactions for read/update/write. Locked regions inside the servers will not help us here.
There are some simple kinds of APIs you can make to help with this, such as:
await conn.update(dbId, 'foo', {$inc: '.counter'})
But what you’re doing is specifying a query language, and if you expect it to handle a lot of common use-cases, this is a weak footing. MongoDB has a language like this, and it seems fine for some limited use-cases, but you’re going to hit the ceiling of the language eventually.
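To make that concrete, here is a rough sketch of what the DB side of an operator API like that turns into: a small interpreter that needs a new case for every operator you want to support. Everything here (applyUpdate, the lock, the path handling) is hypothetical:
// hypothetical server-side handler for {$inc: '.counter'}-style updates
async function applyUpdate (db, key, ops) {
  const release = await lock(key) // same in-process mutex idea as before, now inside the DB
  try {
    const record = await db.get(key)
    for (const [op, path] of Object.entries(ops)) {
      const field = path.slice(1) // '.counter' -> 'counter'; nested paths omitted from the sketch
      if (op === '$inc') record.value[field] = (record.value[field] || 0) + 1
      else throw new Error('unsupported operator: ' + op) // every new use-case lands here first
    }
    await db.put(key, record.value)
  } finally {
    release()
  }
}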
Option 1: MVCC
CouchDB handles this problem using the _rev (“revision”) property. When you write a record, you send the _rev along with your write. The database checks the sent value against the currently-stored _rev, and if the sent value is behind, the write is rejected.
This approach is known as Multiversion Concurrency Control (MVCC). It provides transactional updates by detecting whether the record changed during your read/update/write flow.
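On the DB side, the check itself is just a compare before the write. A sketch, with hypothetical names and a simplified rev scheme (CouchDB’s real _rev values also include a hash):
// sketch of the server-side rev check; names and the integer rev are simplifications
async function putWithRev (db, key, sentRev, value) {
  const release = await lock(key) // serialize the check-and-write
  try {
    const current = await db.get(key)
    if (current && current._rev !== sentRev) {
      const err = new Error('stale _rev')
      err.code = 'CONFLICT' // client code can key off this and retry
      throw err
    }
    const nextRev = (current ? current._rev : 0) + 1
    await db.put(key, {_rev: nextRev, value})
  } finally {
    release()
  }
}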
The downside to this model is that it pushes conflict-handling to the application. Here is how our update would need to be re-written for MVCC:
while (1) {
  try {
    let record = await conn.get(dbId, 'foo')
    record.value.counter++
    await conn.put(dbId, 'foo', record._rev, record.value)
    break // success, exit loop
  } catch (e) {
    if (e.code !== 'CONFLICT') {
      throw e // unexpected error, escalate
    }
    // conflict, try again
  }
}
Now, this isn’t all bad. The pattern is so consistent that a helper-function wrapper could remove most of the boilerplate:
await doUpdate(async () => {
  let record = await conn.get(dbId, 'foo')
  record.value.counter++
  await conn.put(dbId, 'foo', record._rev, record.value)
})
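That doUpdate() helper doesn’t exist yet; a minimal sketch, assuming conflicts surface with the CONFLICT error code used above:
// hypothetical retry wrapper for the MVCC read/update/write pattern
async function doUpdate (fn, maxAttempts = 10) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn() // success: hand back whatever the update function returns
    } catch (e) {
      if (e.code !== 'CONFLICT') throw e // unexpected error, escalate
      // conflict: someone else wrote first, so loop to re-read and retry
    }
  }
  throw new Error('update failed after ' + maxAttempts + ' conflicts')
}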
The remaining downside is that the read/update/write loop happens over your socket to the DB, so each attempt costs extra round trips of latency.
Option 2: A query language
We’re building a DB from scratch, so we could solve this by designing a query language for these kinds of updates.
SQL makes a column-increment relatively simple:
UPDATE myTable SET counter = counter + 1 WHERE id = 'foo'
You can write relatively sophisticated updates using conditions, functions, sub-queries, and so on.
I’m not inclined to write a SQL executor. I would also prefer something close to a drop-in solution if we’re going to use a query language; I don’t really want to embark on designing a whole new language. It’s even better if developers are already familiar with the language.
Would it be absolutely insane to send Javascript to the server to act as our language?
await conn.exec(`
  let release = await db.lock()
  try {
    let record = await db.get(params.key)
    record.value.counter++
    await db.put(params.key, record.value)
  } finally {
    release()
  }
`, {key: 'foo'})
Maybe the answer is “That’s a terrible idea, don’t do it!” but it’s not exactly unprecedented. Redis uses Lua this way, and even Postgres has its own (weird) PL/pgSQL language. It’s also exactly what CouchDB does to define its Map/Reduce views.
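For a sense of what the receiving end might look like, here is a deliberately naive sketch that just wraps the submitted script in an async function. To be clear, new Function() provides no isolation at all; a real implementation would need an actual sandbox with resource limits, which is exactly the anxiety discussed further down:
// naive sketch only: new Function() gives the script full access to the process
async function exec (db, script, params) {
  const body = 'return (async () => {\n' + script + '\n})()'
  const fn = new Function('db', 'params', body)
  return fn(db, params) // resolves to whatever the script returns
}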
We could also use JS to run complex “SELECT” queries with “JOINs”:
const results = await conn.exec(`
  let users = await db.table('users').list({limit: params.limit})
  for (let user of users) {
    user.lastPosts = await db.table('posts').list({
      where: {author: user.id},
      limit: params.postsLimit
    })
  }
  return users
`, {limit: 10, postsLimit: 5})
This would be noticeably faster than sending each of these queries over the socket to the DB one at a time. It’s far from the most concise query language I’ve ever seen, but it’s very clear what’s happening.
The main concern I have is whether this will trigger everybody’s “security anxiety,” and it’s not exactly wrong to worry about sending Turing-complete scripts to the database. Getting pwned may not mean someone sending a DROP TABLE to your database; it might mean a full RCE.
Thoughts?
Given how straightforward the MVCC pattern is, I’m hesitant to dive into this JS-as-query-language idea. Even an indexer, which in our model is a script that gets triggered on record changes, could be handled by watching a changes stream from the application and writing with simple CRUD methods.
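To illustrate that indexer point, here is a rough sketch of an application-side indexer driven by a changes stream; conn.changes() and the index-key layout are hypothetical:
// hypothetical: consume a changes stream and maintain a by-author index with plain CRUD writes
for await (const change of conn.changes(dbId)) {
  if (change.table !== 'posts') continue
  const key = 'idx:posts-by-author:' + change.value.author + ':' + change.key
  // passing null as the _rev for a brand-new index record (an assumption about the API)
  await conn.put(dbId, key, null, {postKey: change.key})
}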
This leaves performance as the remaining concern, which a query language would help solve. For complex read queries we could use GraphQL, which is designed for this kind of thing. Or, you know, we could use Javascript.
Is there an option I’m not considering? Let me know.