RustFest Barcelona – David Bach & Malte Sandstede: Right inside the database

MALTE: Hi. I want you to imagine a world where obtaining backend data is as simple as writing a query, where your apps are always up to date, where offline-first is only a caching decision about which part of your backend you replicate in your frontend, and where object-relational mappers, manual GraphQL resolvers, or mappers in general don't even exist. We want you to imagine a world where you are right inside the database — and with that, welcome. This is David, my name is Malte. We work at our own small software consultancy, where we consult people on the stuff we are talking about today. A lot of this work originated from the ETH Zurich systems group and from Frank McSherry, who is the creator of some of the systems we are going to introduce to you today.
So what better way to start a Rust talk than to show some JavaScript? If we start at the frontend and want to make this utopia of writing apps in a declarative manner come true, we can just imagine what it should look like. In the bottom half of the slide you can see the frontend function — it's just a component, and we simply describe what we want to display: the senders of some messages. At the top we see a declarative query. It is written in Datalog, and it asks for the names of the senders of the messages that the current user has received: you can see the message entity there, the current user is the recipient, and the query matches all the sender IDs and gets the names from them. You could also write SQL if you are more inclined to do so.

This gives us a very declarative way of specifying our data needs in the frontend, and since we register this query, as you can see down below, we also don't have to worry about at what time this data arrives in our system. We just assume that data will flow into the system and that changes will propagate through it as soon as they arrive; in the frontend we don't have to care about this at all.
So declarative and reactive frontends are something you can kind of have today if you look at libraries such as React or think about component queries. But if we look at the whole full-stack side of things, things look very different: we are not really reactive any more, since we mostly resort to polling — we ask our backend time and again, "is there something new? is there something new?", without knowing whether there is anything new. There is also a translation step between the application server and the source of truth, since, for example, if we use an object-relational mapper or resolvers, we always have to talk to our source of truth and translate the queries we formulate in the frontend so that our applications and the backend can actually understand them.

What we would rather like to see is something that looks like this — slightly stolen from a great blog post, "The source of truth of tomorrow". The first step is to specify, using a declarative query, what data should be accessible; this lives on the server. We also want to formulate which part of the data is currently relevant — this is the query you just saw in the frontend JavaScript code, where we specified what data we are actually interested in. And lastly, we want to push this data into the frontend instead of asking for it and pulling it, to make things reactive.
So if we want to re-evaluate how we build this full-stack architecture, the first thing we have to look for is a push-based programming model, to avoid polling, or rather to become reactive. If we look around we might stumble upon the dataflow programming model, and this is a really nice way to model these kinds of things, because in the dataflow programming model you express computations as graphs. The nodes are operators and they do some work — for example, a map operator takes some data, maps over it and produces a result based on it. The edges specify from which operator to which operator data can flow — for example, a join operator probably has two input edges and produces a single output edge.

This offers some nice characteristics. First, there is no central coordination needed: the data itself coordinates the execution of the dataflow. Secondly — and this plays nicely with Rust, if you want to implement this in Rust — we don't really need to fight the borrow checker, since every operator owns its own part of the data, so there are never two separate entities owning the same piece of data and fighting over it. Lastly, it's quite easy to distribute, since we can just clone operators and shard the data we feed into our system, give one operator clone one half of the data and another clone the other half, and in this way parallelise with ease.
An implementation in Rust based on this dataflow model is called Timely Dataflow, and it offers dataflows that can even be cyclic. A program can look something like this — a Rust program using Timely Dataflow. The first thing to notice is that you always write it from the perspective of a single worker, so while constructing my dataflow I don't have to worry about distribution or parallelisation; I can do all of that later on. Then comes the first stage of dataflow programming, the creation of the actual dataflow: this is how we construct the dataflow from small operators that are chained together — to the right you can see a visualisation of that — so we are ingesting some data, exchanging it to possibly other workers and then inspecting it. Then, as the second stage of dataflow programming, we actually run it: now that we have created the structure, we feed input into the dataflow to compute something.
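A minimal sketch of such a two-stage program, following Timely Dataflow's standard introductory example rather than the exact code on the slide — construct the dataflow (ingest, exchange, inspect), then feed input round by round:

```rust
use timely::dataflow::operators::{Input, Exchange, Inspect, Probe};
use timely::dataflow::InputHandle;

fn main() {
    // Each worker runs this same closure; distribution is configured at launch.
    timely::execute_from_args(std::env::args(), |worker| {
        let mut input = InputHandle::new();

        // Stage 1: construct the dataflow (ingest -> exchange -> inspect).
        let probe = worker.dataflow(|scope| {
            scope.input_from(&mut input)
                 .exchange(|x: &u64| *x) // route records to workers by value
                 .inspect(|x| println!("seen: {:?}", x))
                 .probe()
        });

        // Stage 2: feed input and step the worker to drive the computation.
        for round in 0..10u64 {
            input.send(round);
            input.advance_to(round + 1);
            while probe.less_than(input.time()) {
                worker.step();
            }
        }
    }).unwrap();
}
```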
So now that we know what dataflows are all about, we can re-examine our introductory problem of how to build a full-stack architecture using them, and if we squint a little it could look something like this: the ingesting operators in our dataflow are the source of truth; then we have the application-server level, which is the actual dataflow computation, narrowing our data down to the relevant data and applying access policies; and lastly the exit nodes are our frontend. Of course we won't model the frontend as a dataflow operator — it is more of a conceptual dataflow sink — but we can model all of the backend as a real dataflow, in this case a timely dataflow.

So what does this look like? The first thing we have to worry about is how we model our data, because we are no longer in a static mindset where we have a database sitting somewhere and we query it; we are talking about dataflows, and as the name suggests, data is streamed through the system.
What we propose is a fully normalised, attribute-oriented data model — something like sixth normal form — and in that data model the fundamental unit is the fact, a triple of entity, attribute and value. For example, entity 1 might have an attribute name, and the value of that attribute is Peter. This allows us to freely compose very small units of data into higher-level concepts. For example, if we want to model a person and her residence, she might have a name, she might have an age and a residence, and the residence then links to the entity ID of the residence facts we define below; this composes, quite straightforwardly, into a higher-level concept such as a person or a residence. If we model this in a database setting, we would have a separate table for every attribute, since we are completely normalised. So, for example, the pers/name table has an entity column and a value column.
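As an illustration, a small sketch (with made-up Rust types, not 3DF's actual ones) of how these entity/attribute/value facts compose into the higher-level "person with a residence" concept:

```rust
// Illustrative types only: a fact is an (entity, attribute, value) triple,
// and every attribute conceptually lives in its own two-column table.
type Entity = u64;

#[derive(Debug, Clone)]
enum Value {
    Str(&'static str),
    Num(i64),
    Ref(Entity), // a reference to another entity, e.g. a residence
}

type Fact = (Entity, &'static str, Value);

fn main() {
    // Small units of data ...
    let facts: Vec<Fact> = vec![
        (1, "pers/name", Value::Str("Peter")),
        (1, "pers/age", Value::Num(30)),
        (1, "pers/residence", Value::Ref(2)),
        (2, "res/city", Value::Str("Barcelona")),
    ];

    // ... compose into a higher-level concept: "the person with entity ID 1".
    let person: Vec<&Fact> = facts.iter().filter(|(e, _, _)| *e == 1).collect();
    println!("{:?}", person);
}
```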
If we want to transfer this to a dataflow system, where we think in terms of streams, we can think about entity and value streams. We introduce the notion of time, because in a streaming world we have to define at what point a piece of data arrived — the name Alice, for example, was introduced at t0 — and we also want to make multiplicities, or differences, explicit, because data can change in this stream. Alice's name as of t0 is Alice, but later on she changed her name to Bob, so in this case, using these multiplicities and additions and retractions, we can say that at t2 we retract the fact that the first name of entity 1 is Alice, and we add the fact Bob for entity 1. This also allows us to become very reactive, since now we can observe, or make explicit, every single data change in our system.
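A sketch of the same idea as a difference stream — each fact tagged with the logical time at which it changed and a multiplicity of +1 (addition) or -1 (retraction). The tuple shape (data, time, diff) is exactly what differential dataflow collections use; the concrete types here are again illustrative:

```rust
// (fact, time, diff): an update either adds (+1) or retracts (-1) a fact at a
// logical time. Entity 1's name is "Alice" as of t0; at t2 that fact is
// retracted and the fact "Bob" is added.
type Entity = u64;
type Fact = (Entity, &'static str, &'static str);
type Update = (Fact, u64, isize);

fn main() {
    let name_stream: Vec<Update> = vec![
        ((1, "pers/name", "Alice"), 0, 1),
        ((1, "pers/name", "Alice"), 2, -1),
        ((1, "pers/name", "Bob"), 2, 1),
    ];
    for (fact, t, diff) in &name_stream {
        let op = if *diff > 0 { "add" } else { "retract" };
        println!("t{}: {} {:?}", t, op, fact);
    }
}
```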
If we now want to arrive at this higher-level concept of a whole person, what we do is consolidate the streams: we add together all these differences, subtracting the negative ones, so that we obtain a consistent view as of a common t*. For example, if we have a pers/name stream, an age stream — which will probably update more frequently — and a residence stream, we just sum up all these facts to obtain the person table as of some consistent time, t*.
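A minimal sketch of that consolidation step — summing multiplicities up to a common time t* and keeping the facts whose net count is positive. Differential dataflow's consolidation does essentially this incrementally; the function below is just the idea in plain Rust:

```rust
use std::collections::HashMap;

type Fact = (u64, &'static str, &'static str);
type Update = (Fact, u64, isize);

/// Sum all differences up to and including `t_star`; facts with a positive
/// net multiplicity form the consistent view "as of t*".
fn view_as_of(updates: &[Update], t_star: u64) -> Vec<Fact> {
    let mut counts: HashMap<Fact, isize> = HashMap::new();
    for &(fact, time, diff) in updates {
        if time <= t_star {
            *counts.entry(fact).or_insert(0) += diff;
        }
    }
    counts.into_iter().filter(|&(_, c)| c > 0).map(|(f, _)| f).collect()
}

fn main() {
    let updates = vec![
        ((1, "pers/name", "Alice"), 0, 1),
        ((1, "pers/age", "29"), 0, 1),
        ((1, "pers/name", "Alice"), 2, -1),
        ((1, "pers/name", "Bob"), 2, 1),
        ((1, "pers/age", "29"), 3, -1),
        ((1, "pers/age", "30"), 3, 1),
    ];
    println!("{:?}", view_as_of(&updates, 1)); // name Alice, age 29
    println!("{:?}", view_as_of(&updates, 3)); // name Bob, age 30
}
```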
Cool. Now we have a data model that can express everything we want from our full-stack application in a dataflow setting, but what is still missing is making things declarative and formulating queries on top of it. The first thing we want to do is register a query, again so that every time something changes, updates are pushed to our frontend, for example via a websocket connection; and we also want to make sure that we can formulate even very complex queries, such as access policies, using that model.

So here, this is a reachability query: for whatever reason we might be interested in finding all of Alice's friends and friends of friends. Again it's Datalog, but basically what it does is match the first clause — these would be the direct friends — or match the second clause, which is a direct friend of one of the friends or friends-of-friends we have already discovered. Even if you don't read Datalog all day, what you can see just from the syntax is that we are still using the facts we defined before, even now that we are in the query world. What is not so straightforward is that to match these clauses we have to introduce additional operators: matching on the hop, for example, is nothing more than a relational join, so we need some means to express that in a dataflow setting, and iteratively applying the reachability rule means we need some kind of iterate operator.
Especially if you know stream processing and dataflows, you know this is often very hard to do, and fortunately there is already a system, again written in Rust, that solves this. It's called Differential Dataflow, and it provides these complex operators. To the left you can see a differential dataflow with these iterate and join operators, and it looks like any other Rust program. The cool thing is that it also incrementalises these operators, so if we change a fact at some later point in time, we don't have to recompute all of the friends of friends from scratch, but can just do work proportional to the change.
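For illustration, a sketch in the spirit of differential dataflow's standard reachability example — an iterate operator wrapped around a (semi)join, computing everyone reachable from "Alice" (node 0) over friend edges, and doing so incrementally when edges change. The exact code on the slide will differ:

```rust
use differential_dataflow::input::Input;
use differential_dataflow::operators::{Iterate, Join, Threshold};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut edges = worker.dataflow::<u32, _, _>(|scope| {
            let (edge_handle, edges) = scope.new_collection::<(u32, u32), isize>();

            // "Alice" is node 0; compute everyone reachable over friend edges.
            let roots = edges.map(|(src, _dst)| src).filter(|&src| src == 0);

            // Repeatedly join the reached set with the edge relation until
            // nothing changes any more (a fixed point).
            let reached = roots.iterate(|reach| {
                let edges = edges.enter(&reach.scope());
                edges.semijoin(reach)
                     .map(|(_src, dst)| dst)
                     .concat(reach)
                     .distinct()
            });

            reached.inspect(|x| println!("reachable: {:?}", x));
            edge_handle
        });

        // Feed friendship edges; a later change to an edge only triggers
        // incremental work, not a recomputation from scratch.
        edges.insert((0u32, 1u32));
        edges.insert((1, 2));
        edges.insert((2, 3));
    }).unwrap();
}
```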
So now that we've covered how we can model these complex query-matching clauses in our dataflow world, what is still missing is kind of obvious: how do we get from the query representation to the Rust representation? This is what David is going to talk about.

DAVID: Okay, cool. All right, yes. So Malte just explained how we end up with these timely programs written in Rust, and we know how Rust works: we write the program, we compile it and we run it. The idea of these dataflow programs is that you start running them and they run indefinitely, because we never know if we are done with the input — there could be potentially unbounded new input coming in, and we always want results given these new inputs. But as a client talking to some sort of database, we always have this interaction pattern where a client comes along at some arbitrary point in time, asks the database to compute some query, the database computes the query, and the client takes the result and minds its own business. So there is this mismatch between a statically compiled Rust program on the one side, doing this dataflow business, and a dynamic query from a client on the other — and what we built is something called 3DF, which stands for Declarative Dataflow.
And what you can see here is an excerpt of the plan extraction. We have this Rust data structure; the idea is that it provides the server with a plan of how to implement these operators. So the idea is that we enumerate all possible operations that clients could ask us for — for example the join operation — and when we give the server, or someone implementing this library, this Plan struct, it will just walk through it and implement all these plans. What the server does when it implements such a plan is basically take all the operators we saw earlier from Malte, stick them together, and feed, for example, a websocket with a stream of the resulting data.
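An illustrative sketch of that idea — a serialisable plan type the server can walk to stitch operators together. The variant and field names here are assumptions for illustration, not 3DF's actual definitions:

```rust
#![allow(dead_code)]

type Var = u32;
type Attribute = String;

/// The enumerated operations a client may request; the server walks this
/// tree and builds the corresponding differential dataflow.
enum Plan {
    /// Match all (entity, value) pairs of a single attribute.
    MatchAttribute { e: Var, attribute: Attribute, v: Var },
    /// Relational join of two sub-plans on shared variables.
    Join { variables: Vec<Var>, left: Box<Plan>, right: Box<Plan> },
    /// Project the result down to the listed variables.
    Project { variables: Vec<Var>, plan: Box<Plan> },
    /// Union of sub-plans, e.g. for recursive rules.
    Union { variables: Vec<Var>, plans: Vec<Plan> },
}

fn main() {
    // Roughly: "names of senders of messages the current user received".
    let message_senders = Plan::Project {
        variables: vec![0, 3],
        plan: Box::new(Plan::Join {
            variables: vec![2],
            left: Box::new(Plan::Join {
                variables: vec![0],
                left: Box::new(Plan::MatchAttribute { e: 0, attribute: "msg/recipient".into(), v: 1 }),
                right: Box::new(Plan::MatchAttribute { e: 0, attribute: "msg/sender".into(), v: 2 }),
            }),
            right: Box::new(Plan::MatchAttribute { e: 2, attribute: "person/name".into(), v: 3 }),
        }),
    };
    let _ = message_senders; // the server would walk this and build a dataflow
}
```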
Perhaps time for a recap. The first thing was basically having a runtime to run these dataflows in. Then we talked about how we model the data so that it makes sense in a streaming setting, with all these normalised streams coming into our database, or into our system. And I just told you how we can make this all dynamic by using a layer that, in a running Rust program, constructs these dataflows and hands us back the results. Using this declarative layer we can now query the system, and the system will produce these dataflows and keep us up to date with all the data needs we might have as a client. But of course, we now need to think about how this fits into the frontend story, so we are going to do that now.
So what we are trying to do with this talk is give you the idea that a client is actually also just a peer of the database; in another sense, the client basically has the illusion that it sits right next to the database, or inside the database, so it has all the query power that a normal relational database has, and the whole rest of the stack is abstracted away.

When we start thinking of it this way — just as a quick reminder — relational databases are actually also good at modelling state for the frontend itself. If you have some frontend application, some web page with different parts, you can think of the state of this web page as being a database: if different parts need different data, they just query the database and ask for it. If the page wants to do some transition — for example, someone types data into a text field — it transacts this into the database. So there is already a good fit for using a relational database as a means of modelling frontend state.

When we talk about databases, we usually say we have a query, the query will be evaluated, and we get a result set. But the whole story we are telling here is that we want to replicate the database for the client, so the client feels it is right inside the database. What we want is a function f that takes the database and gives us back a selective database that only contains the data that is accessible, relevant and important for the application itself. So how do we go from queries to these replication queries — what is this function f that makes it possible for the database to be replicated in this sense?

The first thing to notice is that using the same data model makes this easier, so we use the same thing Malte told us about earlier: our application database is also a database of facts, making up our local database. There is an implementation of this in JavaScript, called DataScript, which gives us a Datalog interface over this local application database.
Our job now is: how do we populate this database with all the facts that this client is interested in seeing? The idea is to just query for all the attributes that the client is interested in. These attributes are in themselves binary relations, so when we query them we get back triples of entity, attribute and value that the client can put into its local database.

To make this a bit clearer, I have come up with a simple example. On the right-hand side we again see the query struct — the query in Datalog — where we have some message-sender query, and the query body, starting with the find clause and having all these clauses. If we look at what the find says, it says we want the message ID, which corresponds to the entity we want to find, we have a value, which is the value of the thing, and of course we give it a name — and this is essentially a fact. So for the right-hand-side query, the message-sender query, we receive a stream of resolved entity-attribute-value facts that we can put into our local database, thus replicating the backend in our local database.
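Conceptually, maintaining the client-side replica from such a stream amounts to applying additions and retractions to a local index. The sketch below shows the idea in plain Rust with made-up types; in the talk's setup this role is played by DataScript on the JavaScript side:

```rust
use std::collections::HashMap;

type Entity = u64;

#[derive(Default)]
struct LocalDb {
    // attribute -> entity -> value
    index: HashMap<String, HashMap<Entity, String>>,
}

impl LocalDb {
    /// Apply one (entity, attribute, value, diff) update from the server.
    /// A real implementation would match on the value and count multiplicities;
    /// here a retraction simply removes the entity's value for that attribute.
    fn apply(&mut self, e: Entity, attribute: &str, value: &str, diff: isize) {
        let attr = self.index.entry(attribute.to_string()).or_default();
        if diff > 0 {
            attr.insert(e, value.to_string());
        } else {
            attr.remove(&e);
        }
    }

    fn values_of(&self, attribute: &str) -> Vec<&String> {
        self.index.get(attribute).map(|m| m.values().collect()).unwrap_or_default()
    }
}

fn main() {
    let mut db = LocalDb::default();
    // Updates as they arrive over the websocket for the registered query:
    db.apply(17, "msg/sender-name", "Alice", 1);
    db.apply(18, "msg/sender-name", "Bob", 1);
    db.apply(17, "msg/sender-name", "Alice", -1); // retraction pushed later
    println!("{:?}", db.values_of("msg/sender-name")); // ["Bob"]
}
```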
The cool thing is that we don't necessarily only replicate raw data; we can potentially use the full power of our Datalog or SQL query language to do some aggregations, as Malte showed us earlier. We just have to describe all the data needs that we would like to have replicated in our local database. Some of this may seem a bit effortful, because we need to list all these different attributes, maybe corresponding to the same entity, but there is something called pull syntax, where we can say: I want all the attributes corresponding to this one entity. It just makes a few things easier.
All right, now let's step back again and have a look at what we've covered so far. We started with this system that allows us to write these dataflow programs — distributed dataflow programs — and gives us this push-based access model. We showed you 3DF, which allows us to write these queries; the queries are our means of describing the data interests we have, and they also describe all the access policies that are potentially necessary to enforce privacy. Then we covered how, when these entity-attribute-value streams come in, we put them into our local database, and the rest of the frontend can talk to this replicated database and render its messages, for example, from this local database, without having to care about the rest of the stack being kept in sync, so to say. The cool thing about this is that all those arrows we were showing earlier boil down to just one declarative query that describes this one attribute, message sender, and the rest is basically taken care of: one declarative query to pull them all, so to say.
we would include a bit of a – what sort of open things are there currently occurring? The first is we are using this data script
thing for our frontend application, but 3DF is written in Rust and we can use literally
the same code as is used in the big end machines and this is something we have been playing
in a bit and there is a point if you are interested in this. Of course, we love Datalog but other people
might not, so this Plan struct I showed earlier is an easy target for SQL or GraphQL, other
languages, so we could simply write a GraphQL path which then gives it to our database or
to the system and it will work just fine. All right, let’s see what we did here. So we started off with the way it used to
All right, let's see what we did here. We started off with the way it used to be: arrows pointing up, because people have to pull down all the data they are interested in and transform it, potentially in different languages. The first thing we did was introduce a dataflow, push-based model, where all these arrows are push-based, but we still had query languages up at the top that needed to be translated into the database's languages. What I hope we've conveyed now is that we pushed relational queries all the way down to the frontend itself, so it's all written as a single Datalog query and the frontend is also just a relational query — in a sense, we put the frontend right inside the database. Thank you. [Applause]
>>Thank you for that great talk. Are there any questions in the room? Please raise your hand.

>>Hi. A great talk. An association I immediately have when I see attribute-value is RDF, where you have subject, predicate, object, and I was wondering: the semantic web, which is also based on this, would benefit tremendously from having a client-side database. Projects are already working on that, but do you think this might be an alternative to work, for example, in the [inaudible]?

DAVID: I would say, if you are interested in incrementalising these updates — so if your database is really interested in getting push-based access to all the rest of the data that is potentially needed — then of course. The stuff I showed you is simply super early, so if there were some funny cursor movements and colours... I wouldn't at all suggest you use this right now. But definitely, in the future, yes, I would think so, yes.

>>Next question.

>>Thank you very much. Great talk. I am totally on board with the idea of moving relational queries down to the client. However, in many contexts where you have this vision, the clients are — it's running on someone else's machine, and now that you're providing them with a lot of additional power, they might decide to run a query that is just going to make your backend slow down to a crawl. Have you thought about how to deal with this?

DAVID: Yes, we did. We had a slide, but we thought maybe that would be too much. Totally fair. Usually you wouldn't trust any client to just execute any query, of course. There are two things to this story, basically. One is worst-case optimal joins, a research topic where you can do all these join operations in such a way that, even if clients ask for weird join queries, we won't blow up the whole server, because we are smart in these worst cases. And if we restrict the aggregations to only count — as long as we don't allow user-defined aggregations — it can't blow up too much, in a sense. The last part is basically access policy: the access policy must be enforced not by the client itself but, of course, by our system. But yes, great question.
>>The last question? Anyone?

>>Thank you, great talk. Some time ago I wrote an application with Kafka Streams using interactive queries. I didn't do it properly, as they suggested, and I was moving all the state to all the nodes at the front, which gave us a lot of problems, especially when a node goes down and needs to recover — there is a lot of overhead, gigabytes of data being moved. How do you deal with bootstrapping and fault recovery?

MALTE: Yes, also a very good question, and I think there are multiple projects that are trying to replicate basically the whole database on the frontend side. One thing that tackles this is that, because we can write these specific queries, we already narrow the data down to the data that is actually relevant to the query. The other thing is, for example, if you want to recover from a node breaking down, you don't have to send all of this data over as differences; you can, of course, think about creating snapshots, similar to what we showed with the table as of t*, so that you can send over an entity just as a single serialised entity and not have to replay all the work you've done before. Yes. But of course it's a real concern, especially on the frontend side. Yes.

>>Thank you.

>>Thanks. Give a great hand to David and Malte. [Applause]
