RustFest Barcelona – Tyler Mandry: Life of an async fn


TYLER: Well, in case you haven’t been paying
any attention, async/await is now stable in Rust. [Applause] This is a huge milestone for the
language and I think systems development in general, so very exciting. Again, my name is Tyler, or tmandry. I work on Fuchsia, which is an experimental
operating system at Google. We use Rust for a number of things in the
operating system. I work on the toolchain team supporting Rust,
which means that sometimes I get to spend time fiddling around in the compiler and I’m
also a Rust release team member and compiler team contributor. So I just want to briefly give kind of a primer
on futures before I dive into some of the async stuff. Florian covered some of this, so just to recap,
a future is any computation that progresses asynchronously, so that could be like a request
to some external database server, it could be like reading and writing on disk, or doing
some expensive data processing on another thread, something like that. All of these things can be represented with
a future. A future eventually produces a result, which is called
its output, and we represent futures in Rust with the Future trait. There is
an associated type called Output, which is the output of your computation, and then
there is a method called poll, which has kind of a scary signature. The job of
the poll method is to make progress on the future, as much progress as possible. Poll returns an enum called Poll, and the result
of the poll is either Pending, which means “I made as much progress as I can and
now I’m waiting for some external event”, or Ready, which means “I’m completely done
and here is the result of my computation”. An important thing to remember is that in
Rust we have a lazy futures model, so futures don’t do anything until they are told to, and
it’s the job of the executor to do that.
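For reference, this is essentially what that trait and its scary signature look like in the standard library (slightly simplified here):

    use std::pin::Pin;
    use std::task::Context;

    // The Future trait, essentially as it appears in std::future:
    pub trait Future {
        // The result of the computation.
        type Output;
        // Make as much progress as possible.
        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
    }

    // The result of one call to poll, from std::task:
    pub enum Poll<T> {
        Ready(T), // completely done; here is the result
        Pending,  // made as much progress as I can; waiting on an external event
    }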
Okay, so I want to talk about async functions in this talk, and I will start with a really simple example: an async function named handle_request. We take some arguments. The first thing we do is issue
a request to an external database server, to get the row from our database corresponding
to the ID we were handed. We await the result of that,
which basically suspends execution of this function until we get the row back. Once we get it back, we encode
it into JSON, and then finally we write the encoded JSON out onto a TCP stream.
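Here is a sketch of that function; all the types and helpers here (Database, TcpStream, RowId, Row, get_row, encode_json, write_all) are hypothetical stand-ins, not any particular library’s API:

    // Hypothetical async fn matching the example on the slides.
    async fn handle_request(db: Database, stream: TcpStream, id: RowId) {
        // Suspend until the database hands back the row for this id.
        let row = get_row(&db, id).await;
        // Synchronous step: no await here.
        let encoded: Vec<u8> = encode_json(&row);
        // Suspend until the whole buffer is written to the socket.
        write_all(&stream, &encoded).await;
    }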
Now, if you were coding this synchronously, your function would block its thread at two potential points: one is when you are awaiting the
database server, and the other is when you are writing out to the TCP stream, because TCP
sockets many times have a buffer that can fill up. Blocking your entire thread means
at minimum you pay a context switch, and your thread could stay blocked for a really
long time.
The way this connects to futures, as was mentioned earlier, is that the compiler rewrites the signature of your async function: it could just be a normal function that returns
impl Future. That means we are returning some type – we are not going to tell you what
type it is – that happens to implement the Future trait. In this case our async function did not have
a return type, so the output is just the empty tuple. Then, inside the function
body, we just wrap everything in an async block.
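So the async fn above behaves like this ordinary function (same hypothetical helpers as before):

    use std::future::Future;

    // Same behaviour as the async fn, written out as a normal function
    // returning an anonymous type that implements Future.
    fn handle_request(db: Database, stream: TcpStream, id: RowId) -> impl Future<Output = ()> {
        async move {
            let row = get_row(&db, id).await;
            let encoded = encode_json(&row);
            write_all(&stream, &encoded).await;
        }
    }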
So let’s talk about what the compiler actually does when it compiles this async function. To do that, I want to walk through how you
would implement this by hand in the pre-async/await world. The way you do that is to
create a data structure that represents the state of your request – we’ll
call it RequestHandler – and start by moving the arguments of the
function into that struct. Then we implement Future for
RequestHandler; again, our async function doesn’t return anything, so the Output associated type
is just the empty tuple. Then we implement the poll method, and the way you implement poll by hand is by writing a state machine, which is what we are going to do.
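Sketched out, with the same hypothetical types as before:

    use std::future::Future;
    use std::pin::Pin;
    use std::task::{Context, Poll};

    // Everything the request needs across suspension points: the
    // function's arguments move in here.
    struct RequestHandler {
        db: Database,
        stream: TcpStream,
        id: RowId,
        // ...plus the state machine fields we are about to add.
    }

    impl Future for RequestHandler {
        type Output = (); // the async fn returns nothing

        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
            todo!("the state machine we build below")
        }
    }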
We are going to add a state field to our struct,
which keeps track of what state our request is in, and then, inside the poll method, every
time we get polled, we match on that state and behave accordingly. The very first state we can be in
is “we’ve never been polled before”, which corresponds to the very top of the async function. The
next one is “getting the row”, awaiting the result from the database. The third
is “writing to our TCP stream”. Notice I didn’t create a state for encoding
the JSON, because that’s a synchronous operation – there’s no await. Then finally we are completely done and ready to return our result.
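As a sketch:

    // Which point of handle_request we have reached.
    enum State {
        Unpolled,   // never been polled: the top of the function
        GettingRow, // awaiting the database row
        Writing,    // awaiting the TCP write
        Ready,      // completely done
    }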
Let’s go back. In the case where we’ve never been polled
before, we basically execute the very first line of our function: we
call get_row, which returns a future, and we store that future in a field called
row_fut; then we update our state to GettingRow. Because we have to initialise all the fields up front, row_fut is an Option: we can say None, “I don’t have
it yet”, and then when we do have it, initialise it to Some with the value of the
future.
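Sketching that first arm of the match (`this` is `self.get_mut()` – the full poll body appears below – and GetRowFuture is a hypothetical, Unpin subfuture type that owns everything it needs, so this hand-rolled struct stays ordinary, non-self-referential Rust):

    // row_fut: Option<GetRowFuture> is a field on RequestHandler.
    State::Unpolled => {
        // Execute the first line of the async fn: start the DB request.
        this.row_fut = Some(get_row(&this.db, this.id));
        this.state = State::GettingRow;
        // No return: the loop will fall through and poll row_fut next.
    }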
Then the next time we are polled, we are in the GettingRow state. We are awaiting, so we need to propagate
the poll down to the future that we are awaiting – we know there’s a future inside row_fut – so we do that. If it returns Pending, that means we can’t
make any more progress, so we simply pass Pending back up the stack. If we get Poll::Ready, that means the row is
ready and we can make progress, so the very first thing we do is get rid of row_fut –
we don’t need it anymore – and then we do the next thing in the function,
which is to encode our row into JSON. Then we do the next thing, which is to create
the write future: we call write_all, store the resulting future in
our state machine, and update our state to Writing.
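A sketch of that arm, with the caveat the talk is about to raise: to stay expressible in ordinary Rust, this version assumes a write_all that takes the buffer by value, and an encoded: Option<Vec<u8>> field:

    State::GettingRow => {
        // Propagate the poll down to the future we are awaiting.
        match Pin::new(this.row_fut.as_mut().unwrap()).poll(cx) {
            // No progress possible: pass Pending back up the stack.
            Poll::Pending => return Poll::Pending,
            Poll::Ready(row) => {
                this.row_fut = None; // done with that subfuture
                // Synchronous step, so it has no state of its own.
                this.encoded = Some(encode_json(&row));
                // The slide version hands write_all a *reference* to
                // this.encoded; here we clone to dodge the self-borrow.
                this.write_fut =
                    Some(write_all(&this.stream, this.encoded.clone().unwrap()));
                this.state = State::Writing;
            }
        }
    }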
All right, so the next state is Writing. This one is really simple – basically the same
thing that we saw before, except now we await the write_fut, so we propagate the
poll down to the write_fut. Again, if it’s pending, we return Poll::Pending. If it’s ready, there’s no output, so we immediately
update our state to Ready, and that’s it. The Ready state is really simple: you simply return Poll::Ready.
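And the last two arms, sketched:

    State::Writing => {
        // Same shape as before: forward the poll to the write future.
        match Pin::new(this.write_fut.as_mut().unwrap()).poll(cx) {
            Poll::Pending => return Poll::Pending,
            Poll::Ready(()) => {
                // No output to keep; just advance. The loop will then
                // hit the Ready arm and return.
                this.write_fut = None;
                this.state = State::Ready;
            }
        }
    }
    // Completely done: report the (empty) result.
    State::Ready => return Poll::Ready(()),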
So there’s one last thing that we need to do, which is to add a loop around the inside of the poll body. That means that every time poll
gets called, we make as much progress as we possibly can, until we either hit a Poll::Pending
on a subfuture or we are completely done – we basically keep looping until we
hit one of the three return statements in the body.
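Putting it together, the whole poll body is the match from above wrapped in a loop (get_mut is fine here because, in this dodged version, nothing in the struct borrows from itself):

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let this = self.get_mut();
        loop {
            match this.state {
                State::Unpolled => { /* ...start get_row, as above... */ }
                State::GettingRow => { /* ...may return Poll::Pending... */ }
                State::Writing => { /* ...may return Poll::Pending... */ }
                State::Ready => return Poll::Ready(()),
            }
        }
    }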
You’ll notice there’s quite a bit more code on the right side than the left, and it’s not as readable. That’s one major benefit of async/await, of
course. And in fact we cheated and made
the hand-rolled code a little bit more readable than it would have to be in real life, because we
are borrowing one of our struct fields – the encoded bytes – from inside another field: when we call write_all we are handing it a reference
to the encoded buffer, and you can’t actually write a struct that borrows from itself in normal Rust. You would have to introduce a layer of indirection
with a smart pointer or something, and it ends up being really not fun. This is an issue that comes up a lot in async. Thankfully, with async/await all of that is taken
care of for you, and it’s more efficient: the compiler just accepts that you are borrowing from
yourself, and thanks to some abstractions (this is what Pin is for) it can guarantee that’s safe.
Another problem with our implementation that you might notice is that we are wasting some space here. Any time we are in the middle of getting the
row, we have no need for write_fut or the encoded bytes; and in the next stage,
when we are writing to the TCP stream, we don’t need row_fut anymore. So it would be really
nice if we didn’t allocate all of that space and could just re-use the bytes. If we look at the way it’s laid out in
memory right now, at offset zero of our struct you have all of the fields,
one after another. What we would really like to be able to do
is re-use those same bytes – row_fut’s bytes – in the next stage, when we are awaiting the write_fut.
As recently as a few months ago, the compiler didn’t
do this, and this problem compounds really quickly. If we assume for simplicity’s sake that the
size of our future is basically dominated by the size of all the subfutures, then by
not re-using those bytes we are roughly twice as big as we need to be. And that’s a problem, but it actually gets
worse, because row_fut and write_fut are themselves written as async functions, and each
has the same problem with its own subfutures, so each of those is
twice as big as it needs to be. So really the future we are writing isn’t
twice as big as it needs to be, it’s four times as big, and if
it gets awaited by some other future, maybe that’s eight times as big as it needs
to be. This really did happen in practice: in Fuchsia we had futures that took
up half a megabyte on the stack, which is really big, and we had a lot of futures that
actually blew the stack. You can do that in Rust – it’s not undefined behaviour, but you can.
What the compiler wants to do is ask the question: okay, which of these fields can I overlap? The way it does that is it looks at your program
– I will zoom in here so you can read some things – so this is a graphical representation
of what’s called MIR, the mid-level IR inside the Rust compiler. We represent the entire program as a control-flow
graph made of what are called basic blocks. On the left
side, in the middle, you will see basic block number 14, and inside that block there’s
a bunch of statements. These are very simple statements that
can only do things like: allocate storage on the stack for a local variable,
access one field of a variable, reference something, or assign something. There’s no control flow inside the basic
blocks; all the control flow happens between the basic
blocks, along these edges here. So at the end of block 14 we drop something,
and drop is an operation which can panic, so there are two options: either we return and go down the happy path,
which is normally what you think about, down to basic block 23, or we panic and we unwind
and go into basic block 7 to do all the clean-up. So what we are going to do is crawl this entire
control-flow graph and work out which variables are storage-live at the same time. “Storage live” is how we represent that we actually
need storage for a variable. We are not actually going to do that here, because
we don’t have three hours for this talk, so we are just going to squint at the source
and see if we can figure anything out. You can do this at home; it works pretty well.
So the first question we are going to ask is: can I overlap row_fut with encoded? The rule of thumb here is: if the last time
you ever use field A happens before the
first time you ever use field B, then you can definitely overlap them. In this case, you will notice the last
time we use row_fut is when we set it equal to None, and the following line is the first
time we ever use encoded. You can reason about this by looking at the source
and saying: okay, these are definitely never needed at the same time. So yes, we can overlap these in memory – and when I say overlap, I just mean re-use the
bytes.
bytes, right? Okay, so what about row_fut with write_fut? Again, in this case row_fut, the last time
we ever use it is setting it equal to none and then later on we use write_fut, so we
can definitely overlap these two. Then the last two fields that we care about
are write_fut and encoded. So in this case we assign to encoded and then
we assign to write_fut and we are actually handing out a reference to the encoded bytes,
which means that we definitely need these. Write_fut contains a reference to encoded
so we can’t overlap these in memory. So this allows us to overlap these bytes. What this ends up looking like is not so much
What this ends up looking like is not so much
a struct but more like an enum, where you have one variant for every state your future could
be in, and inside each enum variant you have all the fields that you need for that
particular state. So in general, in this table we have all of the enum variants – we could
be in the unresumed (“never polled”) state, Suspend0, Suspend1, Suspend2, etc. – and then all of our local fields on the left
here. The only difference between a regular Rust
enum and this layout is that one field can appear in multiple variants: in this example
we need local a in the Suspend0 variant but also in the Suspend1 variant,
so you can have one local that persists across multiple await points. That comes up quite a lot.
So I want to talk about – this is all very cool, but what are the performance implications? Let’s look at some patterns that you might
not realise have subtle performance issues. In this example we have a function called
do_stuff. It takes a context, which is just a smart pointer to some context somewhere,
and what we are going to do first is log a message using some data from that context.
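A sketch of the shape I mean – AppContext (with a name field), make_request, and finish_task are hypothetical stand-ins:

    use std::sync::Arc;

    async fn do_stuff(context: Arc<AppContext>) {
        // `context` is only needed here, at the very top...
        println!("starting: {}", context.name);
        make_request().await;
        finish_task();
        // ...but it is implicitly dropped here, at the end of the function,
        // so it lives in the state machine across the await above.
    }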
There is a subtle performance issue with this code. Can anyone spot it? Nod if you think you see it. Not seeing any nodding, okay. So here’s what can happen: the end of a
scope is where we drop our local variables, right? The compiler is going to insert an implicit
drop at the very end of our function, but we don’t actually need context all the way to the end
of the function; we only use it for the log at the very beginning. So we are hanging on to this context
all the way through awaiting that future and then finishing the task, which is kind of
inefficient. If we have the only reference to the context, then we are keeping that memory alive when we don’t need to be.
One way to mitigate this is to move that context into an inner scope in the function, and the effect
of this is that the drop moves to the end of that inner scope, before
the await. This is quite a bit better: we are no longer hanging on to the context while awaiting some other future – a future that could take a year to complete, for all we know. But it’s still not perfect.
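In code (same hypothetical names):

    async fn do_stuff(context: Arc<AppContext>) {
        {
            // Move `context` into an inner scope: now it is dropped here,
            // before the await, not at the end of the function.
            let context = context;
            println!("starting: {}", context.name);
        }
        make_request().await;
        finish_task();
    }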
The reason is: remember, futures do nothing
until they are polled, and in the case of an async function that includes all of the lines
at the top of the function. None of those lines run until the first time poll
is called, which means we are still hanging on to the context while the future sits in that unresumed state.
So the “perfect” way of dealing with this is actually to desugar our async function into a regular function which returns impl
Future, like we saw earlier – the compiler does this for you, but you can do it yourself. At the very top of that function we do our log, and then we write an async block,
which drops us back into an async context where we can await anything we want; that
async block evaluates to the future we return.
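The desugared version, sketched:

    use std::future::Future;
    use std::sync::Arc;

    fn do_stuff(context: Arc<AppContext>) -> impl Future<Output = ()> {
        // Runs immediately when do_stuff is called, before any poll.
        println!("starting: {}", context.name);
        // `context` is never referenced below, so the returned state
        // machine doesn't store it at all; it is dropped right here,
        // when do_stuff returns.
        async move {
            make_request().await;
            finish_task();
        }
    }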
Because we don’t reference the context inside of that async block at all, we never have
to save it inside of our state machine: we
aren’t allocating any bytes to save the smart pointer, and we never hang on to
that reference inside of the future. So once do_stuff is called, we immediately
log, and then we return a future that can make progress
later.
later. Another pattern to look out for is when you
are awaiting expressions that have really big, complicated temporaries. So in this example I have a struct Big which
has an array in it that is just kilobyte and link, a pretty big type. We are going to implement drop for Big and
do something silly like print a message when it is called. Then I have a function through from usize
to usize. It doesn’t really matter what it does. Inside of my async function bar, we are going
to construct a Big simply to get the length of the array out and then call foo with that
link. So this is semantically equivalent to calling
foo with 1024. Let’s say we don’t want to repeat ourselves
and we forgot the thing in Rust so we are going to construct it on the stack and get
the link up. Normally the compiler could optimise this
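Sketched out (foo is an async fn here, so there is an await to hold the temporary across):

    struct Big {
        a: [u8; 1024], // a kilobyte of payload
    }

    impl Drop for Big {
        fn drop(&mut self) {
            println!("dropping Big"); // something silly, to make Drop observable
        }
    }

    async fn foo(n: usize) -> usize {
        n // what it does doesn't really matter
    }

    async fn bar() {
        // The Big temporary lives until the end of the whole statement,
        // i.e. until after the await, so it gets stored in bar's state machine.
        foo(Big { a: [0; 1024] }.a.len()).await;
    }

    fn main() {
        // Well over a kilobyte: the state machine embeds a copy of Big.
        println!("{}", std::mem::size_of_val(&bar()));
    }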
But what happens when we try to print the
size of the state machine that bar returns? We are going to get something really big back: our state machine is over a kilobyte in size. That’s because it is embedding a copy of Big
inside of the state machine. The reason for this is that Rust has well-defined
semantics for when destructors are run: any time you create a temporary inside of
a statement, the destructor for that temporary is called at the very end of the statement.
Basically, if you have a temporary expression
inside of a larger statement, fast-forward to the next semicolon you see and that’s when
the destructor will be run for that temporary – which in this case is not good, because that semicolon is
after our await, right? So this is something to look out for, and it’s
pretty easy to mitigate.
In this case we could simply get the length
inside of one statement – there’s then a semicolon between that statement and the next await,
so Big is dropped before we suspend – and that would be fine. We can also write it another way:
construct the entire foo future inside of one statement and then await it on the next line. It doesn’t really matter which.
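Both mitigations, sketched:

    async fn bar_v1() {
        // Variant 1: end the statement (note the semicolon) before the await;
        // the Big temporary is dropped right here.
        let len = Big { a: [0; 1024] }.a.len();
        foo(len).await;
    }

    async fn bar_v2() {
        // Variant 2: build the entire foo future in one statement – Big is
        // dropped at this semicolon – then await on the next line.
        let fut = foo(Big { a: [0; 1024] }.a.len());
        fut.await;
    }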
In either case, when we print the size of our state machine, it’s going to be much, much smaller: we are not embedding a copy of Big. Now, you might say: if the temporary doesn’t have a destructor – in other words, it doesn’t implement Drop –
we should be able to optimise it away. We don’t need to hang on to this copy of Big
in between, because we never need to call a destructor on it. And you would be right, except that
unfortunately we don’t do that optimisation today. That’s something we would
want to do in the future. So for now, just always look out for big, complicated
temporaries. This only comes up occasionally in practice, but it’s something to be aware of.
So, should you use async/await? Yes, definitely use it. It’s now stable, but it’s a “Minimum Viable
Product”, and that means we are taking advantage of the Rust release process – where we release
a new compiler every six weeks – to keep iterating on the language and the implementation. All of the state machines
that get produced should be as good as or better than hand-rolled ones in most cases, with a few
notable exceptions, and that’s going to improve. I got to talk to some of the people writing
the C++ coroutines proposal for C++20, which was really interesting; coroutines are
C++’s version of async/await, and there are a couple of differences between theirs and ours.
One major difference between Rust and the
current proposal for C++ is that in Rust, any heap allocation
is always made explicit. If you await another future, all of the state for that future always gets
rolled up into your state machine, your
data structure, and we don’t allocate it on the heap unless you create a Box and put the
future in there yourself. In the C++ proposal, there’s actually an implicit
heap allocation every time you await a future, and in many cases the compiler can optimise
that away, but there are no guarantees. There’s a possibility to move to a new compiler version,
have that optimisation fail, and now you have a new heap allocation in
your C++ program.
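In Rust, by contrast, the allocation only appears if you write it yourself – a sketch, re-using the bar from the earlier example:

    use std::future::Future;
    use std::pin::Pin;

    fn main() {
        // No allocation: bar()'s whole state machine is rolled up inline.
        let inline = bar();

        // A heap allocation, but only because we explicitly wrote Box::pin.
        let boxed: Pin<Box<dyn Future<Output = ()>>> = Box::pin(bar());

        drop((inline, boxed));
    }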
Rust also has fewer language-level extension points. One consequence of releasing a new standard
only once every three years is that you shove everything in at once. This works for C++, but in the case of Rust
we keep kind of a minimal surface area. The benefit of that is that there is less machinery to keep performant and less legacy to carry around. Async/await is not just for web servers.
There are a lot of things you could use it for, and I think there are really a lot of
interesting applications out there. One is high-performance compute. Another is filesystems: any time
you are writing to or reading from disk, that’s a whole lot slower than your CPU, so
I think there are some interesting applications
there. Device drivers might be an interesting thing – I’m working on an operating system, so that’s
what I’m thinking about. Build systems: any time you are writing a
build system, you are basically starting compile steps that can take a very
long time, so there might be some interesting applications there for expressing build systems
and build rules. And UI is another one: you can imagine showing a pop-up to a user and then
awaiting their response, things like that. It’s really just a paradigm, a way of writing code that allows you to express what you want in a really high-level way.
If any of this interested you – whether
the compiler or just getting involved in the Rust project in general – I highly encourage
you to contribute. There are lots of ways you can contribute. Probably the simplest is: if you hit an
issue, whether it’s a compiler bug or a confusing error message, it’s a great idea to file an
issue for that. There are people who do nothing except look
for confusing messages in the compiler and fix them, and they are awesome. So definitely report things. If you feel like picking up an issue to tackle,
there are a lot of E-easy and E-mentor labels in the Rust repo, so you can look for those. And take
advantage of the community: one of the things I love about Rust
is that it has a really strong community, with people who are always happy to support you. There are a few different places you
can go to post and ask questions; there’s me – you can find me on Twitter, GitHub, and Gmail –
and there are also working groups that you can join if you want more of an ongoing contribution.
Here is a shortlist of some working groups in the compiler team, but there are a lot of other working groups and other teams – you can find a list online. And that’s everything. Do we have time for questions? [Applause]
>>You mentioned that Fuchsia’s use of the generators, prior to those optimisations being implemented, was about, what, half a megabyte on the stack or more? After the optimisations, can you share how much it shrank by?
TYLER: Yes, that’s a good question. I do not remember that particular one. I think we shrunk it by over 80%. Some of the futures we had didn’t shrink at all, but some shrank by 80 or 90% in size, so it just depended how deep the stack of futures being awaited was.
>>The desugaring of the loop showed that there’s always a loop and a match, so that even if you go from one basic block to the next, it would jump to the top of the loop and re-match.
TYLER: Yes.
>>Is that how it actually compiles, or does it optimise it so that if you don’t suspend execution, it just falls through to the next basic block without jumping up to the top, matching, and jumping back down?
TYLER: So, right, yes, it does actually have a loop in there – in the real desugaring, an await is where the loop is introduced. Any time you await some subfuture, we basically poll that future in a loop. What I showed is semantically equivalent, but I put the loop around the body of the entire poll function instead of embedding a loop at every single await point, because then it wouldn’t fit on the slide.
>>So async/await has been stabilising for four years, right, or something like that?
TYLER: It has been a while.
>>Now it’s stable – are you seeing anything that might show us that was a mistake? Should we have waited a little bit more? No pun intended.
TYLER: Ooh. Uh oh. So did we make any mistakes? I think time will tell. I’m really happy that we took the time, even though a lot of people were understandably upset by the wait. I don’t think there are any red flags or things missing. Some of the performance things I covered – like the fact that we hang on to a temporary until the end of a statement, even if there’s an await – that was a change introduced a couple of months before stabilisation. I think that makes sense for the consistency of the language, because that’s how the language works outside of an async context, but it’s possible we could have defined the language differently and silently changed the semantics so that you don’t hold on to temporaries across an await. Time will tell.
>>Hi, thank you for the talk, it was really interesting.
TYLER: Thank you.
>>I’m curious about the state machine. Can you ask the compiler to emit not the Rust source but the MIR for the state machine it generates?
TYLER: Yes, you can. It’s a nightly-only flag, so you have to download the nightly compiler, and it’s not super user friendly. What am I doing? I am going to show you. So yes, this diagram that I showed in the slides is actually generated by that: it emits a Graphviz dot file and you can render it. There are a couple of different modes – there’s one to emit the MIR, and then a -Z flag I can’t remember, but message me and I will send you the flag. There’s really nice documentation called the rustc guide, for people working on the compiler; you can go and find it there, even if you just want to see your state machine.
>>Thank you. Give a big hand to Tyler. [Applause]
