There are currently two competing designs for async iteration traits for Rust. The first is poll_next. The second is async fn next. I see strengths to each design. The poll_next design seems stronger on technical concerns, such as performance and ease of implementation. The async fn next design seems better from an ergonomics and consistency perspective. Unfortunately, the process of resolve this debate has been slow going.

One thing I've realized in the debate is that at this point the two designs are more similar than they may seem at first glance. In this post I'd like to show how a handful of minor tweaks to each design results in something that is the same modulo names.

A Concrete Performance Difference🔗

But first, we're going to go on a bit of a diversion about the performance of the two designs.

One of the early arguments in favor of poll_next was that it was more efficient. In most cases, we expect the compiler to be able to optimize away any of the overhead that async fn next might introduce, but it seems better, all else being equal, to pick the design that doesn't require as much work from the compiler. That said, we discovered one case where the compiler cannot optimize away the overhead of async fn next, and that has to do with the using a dyn AsyncIterator object.

To see the difference, let's consider pseudo-code for a for await loop for the two designs. We'll also desugar the .await calls so we can see where the calls to poll happen.

Let's consider a simple function that receives a dyn AsyncIterator and iterates over it:

async fn sum(it: &Box<dyn AsyncIterator<Item = i32>>) -> i32 {
    let mut total = 0;
    for await i in it {
        total += i;
    }
    total
}

Using the poll_next design, this desugars to something like follows. I've left out the context parameter to poll_next for simplicity.

async fn sum(it: Box<dyn AsyncIterator<Item = i32>>) -> i32 {
    let mut total = 0;
    let it = Box::into_pin(it);
    loop {
        match it.poll_next() {  // poll_next call is indirect
            Poll::Ready(Some(i)) => total += i,
            Poll::Ready(None) => break,
            Poll::Pending => yield Poll::Pending,
        }
    }
    total
}

Using the async fn next design, this desugars to something like below. Note that the in this version AsyncIterator is not object safe, so we're assuming we have support using something like dyn*.

async fn sum(it: Box<dyn AsyncIterator<Item = i32>>) -> i32 {
    let mut total = 0;
    loop {
        let f: Pin<dyn* Future<Output = Option<i32>>> = pin!(it.next()); // next call is indirect
        let next = loop {
            match f.poll() { // poll call is indirect
                Poll::Ready(next) => break next,
                Poll::Pending => yield Poll::Pending,
            }
        };
        match next {
            Some(i) => total += i,
            None => break,
        }
    }
    total
}

The async fn next version has two indirect calls in its desugaring. First need to make an indirect call to next to get the future to poll to get the actual next item. The way async in trait objects works means that the resulting future will itself be a trait object, which means calling poll on that future will be an indirect call too.

It's possible these calls don't matter. Indirect calls are normally only relevant in CPU-bound workloads, while async is most often used for IO-bound workloads. In wg-async, we've talked a lot about the this extra indirections but I think it's mostly because we can objectively count how many indirect calls there are, while we can't objectively things like design elegance or ergonomics.

There will be cases where this overhead does matter though, so it is desirable to not bake these calls in at the language level. Can we tweak the design of async fn next to remove most of this overhead?

Reducing the `async fn next` overhead🔗

The key idea here is to change the semantics of the future returned by async fn next so that it's able to be polled again after completion. When polled after completion, it essentially calls next again but reuses the future. This is, in fact, how the Next future from the futures crate works. It wraps poll_next, so the future has the same semantics as poll_next.

If we make this semantic change, then we can desugar the async fn next version to something like below.

async fn sum(it: Box<dyn AsyncIterator<Item = i32>>) -> i32 {
    let mut total = 0;
    let mut f: Pin<dyn* Future<Output = Option<i32>>> = pin!(it.next()); // next call is indirect
    loop {
        let next = loop {
            match f.poll() { // poll call is indirect
                Poll::Ready(next) => break next,
                Poll::Pending => yield Poll::Pending,
            }
        };
        match next {
            Some(i) => total += i,
            None => break,
        }
        // the next time around the loop we'll use `f` again
    }
    total
}

We still have two distinct indirect calls, but the first one is only called once at the start of the loop rather than inside the loop for each iteration.

This code is now more convoluted than it needs to be. We can instead refactor it to only have one loop and one match expression:

async fn sum(it: Box<dyn AsyncIterator<Item = i32>>) -> i32 {
    let mut total = 0;
    let mut f: Pin<dyn* Future<Output = Option<i32>>> = pin!(it.next()); // next call is indirect
    loop {
        match f.poll() {  // poll call is indirect
            Poll::Ready(Some(i)) => total += i,
            Poll::Ready(None) => break,
            Poll::Pending => yield Poll::Pending,
        }
    }
    total
}

This version looks suspiciously like the poll_next desugaring.¹

Aside: This Doesn't Actually Work🔗

This optimization sounds good, but it doesn't actually work. I want to briefly touch on why but then let's just pretend we didn't notice this because I'd like to consider the rest of the argument.

The problem is that async fns only complete once. We might redefine the semantics of Future to allow multi-completion futures, but unless we also change async fn, the futures produced by async functions will still be single-completion. The key benefit of the async fn next is that users can hand-roll their own async iterators by writing async fn next and that this should be simpler than writing two state machines at once with a hand-written poll_next. We could add a modified version of async fn that lets you write a multi-completion future, but if we do this I'm pretty sure we'll end up with async gen fn with some different names. Or we could add a way to create a multi-completion future from an async closure, but if we make users go through all of this, we've given up the simplicity benefits of being able to just write async fn next.

Anyway, let's ignore this for now and continue on.

Iterator Setup🔗

So with the performance optimization, things are looking very similar. Both versions start with a little bit of setup code. The poll_next version needs to pin the iterator since poll_next takes a pinned argument. The async fn next version needs to call next once to get the future to poll, and then pins that as well. Then both versions go into a loop calling either poll_next, or a poll function that we've redefined to have the same semantics as poll_next.

The poll_next version does not have an analog to the next call in the async fn next version. I think this does exist, it's just not shown here. At some point, you need to set up your iterator state. You can't just call poll_next on a Vec, partly because the Vec would have to be pinned and that would be inconvenient, and also because it's not ideal to keep your iteration state attached to the Vec.

So let's imagine we have some operation which sets up the state needed to run an async iterator. In our poll_next examples, we've been assuming the iterator is set up before the call to sum. In other words, it's called as sum(my_vec.async_iter()) and not sum(my_vec).

What if, instead, we created some generic way to make an async iterator for some collection and moved that into the sum function? We could call this trait IntoAsyncIterator², and it would look something like:

trait IntoAsyncIterator {
    type Item;
    type AsyncIter: AsyncIterator<Item = Self::Item>;
    fn into_async_iter(self) -> Self::AsyncIter;
}

Then we could write the poll_next version of sum as:

async fn sum(it: Box<dyn IntoAsyncIterator<Item = i32>>) -> i32 {
    let mut total = 0;
    let mut f: Pin<dyn* AsyncIterator<Item = i32>> = pin!(it.into_async_iter()); // into_async_iter call is indirect
    loop {
        match f.poll_next() {  // poll_next call is indirect
            Poll::Ready(Some(i)) => total += i,
            Poll::Ready(None) => break,
            Poll::Pending => yield Poll::Pending,
        }
    }
    total
}

Now we basically have the same code as the async fn next version. There's a subtle difference, since async fn next takes &mut self while into_async_iter takes self, but we could design the "initialize async iterator state" operation to take &mut self if we wanted. Also, I should point out that self methods are not allowed in object-safe traits, so as designed, we would not be able to have a dyn IntoAsyncIterator.

What's in a name?🔗

If we make the change to async fn next to allow the future to be polled again after completion and we add some kind of IntoAsyncIterator, then the two designs are basically equivalent. In both of them, the for await loop desugars into a setup phase (similar to have synchronous for loops call IntoIterator), and then there is a loop can calls a method to advance the state of the iterator.

If we make these two changes, then IntoAsyncIterator in the poll_next design is essentially what AsyncIterator is in the async fn next design. They are both the way to initialize iterator state. Then the AsyncIterator trait in the poll_next design is equivalent to the modified Future semantics in the async fn next design.³

So to me it seems that if we apply the multi-completion optimization to async fn next then the two designs have the same shape and most of the debate is around the boundaries and names of various subcomponents.

Conclusion🔗

I wrote the first draft of this post about three months ago and then hesitated to publish it because in writing it I discovered some fatal flaws in the argument. However, I've shared the draft privately with a few people and we've been referring to it often in discussions so I felt like it was worth putting it out in public. Others have come up with similar arguments to this post so I consider this post to be my particular take on the argument rather than the definitive statement.

At the beginning I had hoped to say that with the optimizations we've added to the async fn next design, the two designs are essentially the same. The implication then would be that we should stop arguing about names, pick something, and stabilize it.

Having written the post, I think we instead see an irreconcilable difference in the designs. While the multi-completion future optimization does lead to a design that's equivalent to poll_next, it gives up the key characteristic of the async fn next design, which is the ability to hand-write iterators with async fn next. It's not "here's a small tweak and now the designs are the same," it's "here's a small tweak and now async fn next is a completely different design."

So in conclusion, the two designs are not secretly the same and we are going to have to actually make a decision on which one we want to have.

In fact, I copied the poll_next desugaring and then made a few changes to get this version.↩

IntoAsyncIterator may not be the best name, because it implies it consumes its source. We might want a non-destructive way to iterate over something asynchronously.↩

If we go this route, I'd suggest having AsyncIterator::next return a subtrait of Future called something like IterableFuture or MultiCompletionFuture to highlight multi-completion semantics. This is similar to how FusedFuture adds additional semantics to the underlying Future trait.↩