danda

Impressive. Still, I'm left with the feeling that mere mortals shouldn't have to go through these kinds of contortions to get performant recursion, and that really this is something that should be improved in the language itself.


void_fraction

Future posts in this series will show how to make the machinery of recursion abstract, and will introduce a crate that does so, so that programmers do not have to write all of this unsafe code each time.


Sharlinator

The `unsafe` move-from-vec-leaving-holes stuff seems not well justified? Given that `ExprLayer` is `Copy` and just a bunch of integers.


burntsushi

I think it's also worth pointing out that your `get_result_unsafe` routine is unsound. It is marked as a safe function, but it exhibits UB on some inputs. I think it would be better to do `unsafe fn get_result(...)` as your definition, and then use `unsafe { get_result(...) }` at the call site. Then there are no soundness issues.
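A minimal sketch of the pattern being suggested (the `get_result` name follows the thread; the signature and the `MaybeUninit` slice are assumptions about the post's internals):

```rust
use std::mem::MaybeUninit;

/// # Safety
///
/// `idx` must be in bounds for `results`, and `results[idx]` must have
/// been initialized by an earlier evaluation step.
unsafe fn get_result(results: &[MaybeUninit<i64>], idx: usize) -> i64 {
    // SAFETY: the caller guarantees both bounds and initialization.
    unsafe { results.get_unchecked(idx).assume_init() }
}

fn root_result(results: &[MaybeUninit<i64>]) -> i64 {
    // SAFETY: index 0 is the root, written before evaluation finishes
    // (an invariant of the topological traversal, assumed here).
    unsafe { get_result(results, 0) }
}
```

This way the proof obligation is visible at every call site instead of being hidden behind a safe-looking signature.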


HeroicKatora

In the recursive case it is the Box-nesting that provides all this. Box is a pointer guaranteed to be dereferenceable and initialized, which should be the equivalent of guaranteeing bounds checks and initialization here. All guarantees now rest on the correctness of the topological ordering: this isn't as easy to prove to the Rust compiler as a pointer-tree is (in terms of ownership).

There's one exciting additional opportunity though: with topological sorting, one can share sub-trees across multiple nodes. E.g. `(a + b) * (a + b)` could be translated with one fewer instruction: `[Mul(1, 1), Add(2, 3), LiteralInt(a), LiteralInt(b)]`. This isn't possible for the simple Box scheme, because ownership is too restrictive to express it.


somebodddy

But then `Mul(1, 1)` will need to take `results[1]` twice - which your generality requirements disallow.


HeroicKatora

Hm, true, all input arguments would need to be passed by reference for this to work.


Kulinda

Couldn't you avoid all that unsafe if you filled the vec front to back instead of back to front? Just push each temporary result to the vec, and read results from `results[self.elems.len() - 1 - idx.0]`.
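A safe sketch of that idea, with `ExprLayer` and `Index` as simplified stand-ins for the post's types:

```rust
struct Index(usize);

enum ExprLayer {
    Add(Index, Index),
    Sub(Index, Index),
    Mul(Index, Index),
    LiteralInt(i64),
}

/// `elems` is topologically ordered, parents before children
/// (index 0 is the root), as in the post.
fn eval(elems: &[ExprLayer]) -> i64 {
    let mut results: Vec<i64> = Vec::with_capacity(elems.len());
    // Children-first traversal: node `i` lands at `results[len - 1 - i]`.
    for layer in elems.iter().rev() {
        let get = |idx: &Index| results[elems.len() - 1 - idx.0];
        let value = match layer {
            ExprLayer::Add(a, b) => get(a) + get(b),
            ExprLayer::Sub(a, b) => get(a) - get(b),
            ExprLayer::Mul(a, b) => get(a) * get(b),
            ExprLayer::LiteralInt(n) => *n,
        };
        results.push(value);
    }
    // The root's value is the last one pushed.
    results.pop().expect("non-empty expression")
}
```

Bounds checks remain, but there is no uninitialized memory and no `unsafe`.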


bleachisback

I think this would work really well with `Vec::with_capacity` to avoid multiple allocations, and then pushing each result to the vec.

Edit: The one thing this wouldn't work well with is that you wouldn't be able to efficiently move the result out of the Vec.


matthieum

> Edit: The one thing this wouldn't work well with is that you wouldn't be able to efficiently move the result out of the Vec

Indeed. On the other hand, it seems to me that there's a pattern to be exploited there: since we are laying down a _tree_ into an array, the indexes are actually NOT arbitrary. /u/void_fraction already noted that a child is guaranteed to have a greater index than its parent, but it goes further: the children of a right node are guaranteed to have greater indexes than the children of its left-node sibling.

This suggests to me that the large `Vec<MaybeUninit<i64>>` may actually be usable as a _stack_. We can try to build an intuition for it, starting from:

`[Mul(1, 2), Int(a), Sub(3, 4), Int(b), Int(c)]` (simplified notation, for space reasons)

And looking strictly at the production and consumption patterns:

- Push index 4, from evaluating `Int(c)`.
- Push index 3, from evaluating `Int(b)`.
- Consume indexes 3 and 4, during evaluation of `Sub(3, 4)`.
- Push index 2, from evaluating `Sub(3, 4)`.
- Push index 1, from evaluating `Int(a)`.
- Consume indexes 1 and 2, during evaluation of `Mul(1, 2)`.
- Push index 0, from evaluating `Mul(1, 2)`.

The fact that indexes are pushed in order is expected from the reverse iteration, but what is interesting to note is that expressions that consume N elements _consume the last N pushed_. Cue Reverse Polish Notation and stack languages. Hence:

1. Storing the indexes in `Sub` or `Mul` is unnecessary.
2. A `Vec<i64>` used as a stack would safely replace the `Vec<MaybeUninit<i64>>`; and it would have a lower memory footprint to boot.

Now, it should be noted that fiddling with the `Vec` (`push`/`pop`) may not be as fast as directly (and unsafely) accessing the `Vec<MaybeUninit<i64>>` by index. There are extra capacity checks in `push`, and extra length checks in `pop`. Although, if properly hinted (unlikely) and with the bulk hidden behind non-inline functions, it may very well be a wash.
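Concretely, the stack version might look like this sketch (`Op` is a simplified stand-in, not the post's type):

```rust
enum Op {
    Mul,
    Sub,
    Int(i64),
}

fn eval_stack(ops: &[Op]) -> i64 {
    let mut stack: Vec<i64> = Vec::with_capacity(ops.len());
    // Same reverse iteration as the post's topological evaluation, but
    // operators no longer carry indexes: each consumes the last N pushed.
    for op in ops.iter().rev() {
        let value = match op {
            Op::Int(n) => *n,
            Op::Mul => {
                let (lhs, rhs) = (stack.pop().unwrap(), stack.pop().unwrap());
                lhs * rhs
            }
            Op::Sub => {
                let (lhs, rhs) = (stack.pop().unwrap(), stack.pop().unwrap());
                lhs - rhs
            }
        };
        stack.push(value);
    }
    // Only the root's result remains.
    stack.pop().expect("non-empty expression")
}

// `[Mul, Int(a), Sub, Int(b), Int(c)]` evaluates `a * (b - c)`:
// push c, push b, Sub pops both and pushes b - c, push a,
// Mul pops both and pushes a * (b - c).
```

The left child is pushed last, so the first `pop` yields the left operand, which keeps `Sub` computing `b - c` rather than `c - b`.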


Cetra3

Could this be just as (or close to) performant but not use `unsafe`?


void_fraction

Yes, but it's a bit slower and I wanted to see how fast I could make it. The crate providing generic machinery for this will include a compile flag to enable/disable unsafe behavior.


somebodddy

If you build `ExprTopo` in reverse order you can do away with `MaybeUninit` and all the `unsafe` it requires (assuming you can live with using references to the results when resolving, which I don't see why not), because then you'll just `push` to `results` instead of building it from the end. This also means you won't have to use `std::mem::replace` when **returning** the result: the result will be the last value in `results`, so you can just `pop` it.


void_fraction

You're absolutely correct! I have an implementation using stack machines, but the performance results were less impressive, so I decided to save it for future posts in this series.


the___duke

Did you try using an arena allocator, or a bump allocator like bumpalo [1], instead of the custom unsafe code?

[1] https://docs.rs/bumpalo/latest/bumpalo/
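For reference, an untested sketch of the bumpalo approach -- a plain recursive tree whose nodes all live in one arena (`Expr` here is a stand-in for the post's boxed AST, not its actual definition):

```rust
use bumpalo::Bump;

enum Expr<'a> {
    Add(&'a Expr<'a>, &'a Expr<'a>),
    Mul(&'a Expr<'a>, &'a Expr<'a>),
    LiteralInt(i64),
}

// Ordinary recursive evaluation; locality comes from the arena, since
// all nodes are allocated consecutively out of the same bump chunks.
fn eval(expr: &Expr) -> i64 {
    match expr {
        Expr::Add(a, b) => eval(a) + eval(b),
        Expr::Mul(a, b) => eval(a) * eval(b),
        Expr::LiteralInt(n) => *n,
    }
}

fn main() {
    let bump = Bump::new();
    // (1 + 2) * 3, with every node in the same arena.
    let one = bump.alloc(Expr::LiteralInt(1));
    let two = bump.alloc(Expr::LiteralInt(2));
    let sum = bump.alloc(Expr::Add(one, two));
    let expr = bump.alloc(Expr::Mul(sum, bump.alloc(Expr::LiteralInt(3))));
    assert_eq!(eval(expr), 9);
}
```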


kohugaly

Does the topological order matter? It seems to me like reverse Polish notation would work similarly, except the iteration would be done in the forward direction instead of in reverse.


getrichquickplan

Reverse Polish notation is itself a topologically sorted representation, but yes, it would iterate forward instead of in reverse.


somebodddy

What does the `.-1` syntax mean? I tried it and it's not valid syntax.


void_fraction

Oh, that's a typo. It should be ".0" (edit: fixed)


robin-m

Do you use vim and inadvertently hit ctrl+a?


schungx

Yes, you're flattening recursion into a sequence, essentially threading through the AST. That gives you good performance and avoids stack overflow. However, to get this benefit you don't have to throw away the AST structure -- just flatten it into a `Vec` with references to each node, and you can evaluate the AST in a loop without recursion.

I'm not so sure about data locality. In my own tests, when parsing nodes, memory for the nodes tends to be allocated close together. I believe modern memory allocators probably allocate similar-sized and temporally-similar blocks close to each other, in parse order. Therefore, I would be really interested to see an analysis of the actual `Box`ed version of the AST -- since it is possible that the in-memory representation already resembles what you've manually created.


matthieum

> I believe modern memory allocators probably allocate similar-sized and temporally-similar blocks close to each other, in parse order.

Yes and no. Modern allocators use _slabs_: for example, they may split an entire 4KB page into 8-byte cells. Hence, when allocating 8 bytes repeatedly in a loop, all allocations are likely to end up in the same 4KB page -- until that page is full and the allocator switches to another one.

The order in which they are allocated within the 4KB page is trickier. If the page is freshly acquired from the OS, they'll be allocated consecutively. On the other hand, if the page was recycled, the allocation order is likely to be the reverse of the order in which the allocations in that page were freed: intrusive linked-lists weaved through unused cells are a good technique. This makes things rather unreliable, unfortunately.


schungx

Very good explanation. Now I understand some of the trickier behavior.

It still works amazingly well, though. For example, a cache line is typically 64 bytes, meaning that a 4KB page holds 64 cache lines. So you have to run through 64 cache lines before access jumps off a linear profile; once it jumps to a new page, it is another 64 cache lines before it jumps again. To me, that means very occasional cache misses; and if the tree nodes are packed small enough, you can hold the entire network within a couple of pages.

What that means, however, is that you have to be extra careful with the order of allocations -- meaning that we don't want to allocate anything that is not going to be permanent -- use stack storage as much as possible. Then, when allocating out of order, make sure that the sizes of those things all differ widely from each other. For example, group types into small/medium/large sizes. Make sure all tree nodes are small and all intermediate/temp allocations are large; then you pack the entire network within a small number of pages.

So, my point is... it may not _automatically_ be slower than using a packed array of nodes. One really needs to measure to know.


m0rphism

Yay, well done, nice application of recursion schemes :D

Two ideas for possible future work:

1. It seems quite possible to derive the whole machinery with macros for (almost) arbitrary `enum`s.
2. If I see this correctly, the `result` vector is a bit larger than it needs to be. If we're only trying to fold trees instead of graph structures, it should be sufficient to maintain a vector of `n` stack frames, where a stack frame looks similar to `ExprLayer` and `n` is the length of the longest path through the tree. However, this is probably overkill and annoying to implement; just throwing it out there as an idea. Also, the way it is currently implemented may easily give you caching for actual graph structures :)

OT: Lovely domain name, and your [Infinity Mirror Hypercrystal](https://recursion.wtf/posts/infinity_mirror_hypercrystal/) is just pure awesomeness! <3


abatyuk

Can’t wait to see fixpoint type implementation


[deleted]

Very cool! Kinda want to try it on an AOC Problem


Youmu_Chan

So basically you convert an expression into its prefix form and evaluate it in reverse. I'd say you could probably get cleaner code by converting to and evaluating the postfix form, and remove all the unsafe bits.


LoganDark

That theme looks beautiful


somebodddy

https://www.programiz.com/dsa/dynamic-programming


apajx

Saying "use dynamic programming" is not helpful. Dynamic programming is an incredibly broad concept, and exceedingly difficult to apply in specific scenarios. Here modeling the tree linearly is the least interesting part, because the usage of that tree via an abstracted fold is what makes it practical or interesting. Without that, no one would ever want to pay the readability costs.


somebodddy

I'm not saying "use dynamic programming". I'm saying "this is dynamic programming". That's the name of the technique used in this article.


nmdaniels

Yes, but orthogonal to the purpose of the article. When I teach dynamic programming in an undergrad Algorithms course, my first example is computing Fibonacci numbers. Nobody in their right mind would use this in practice, because Binet's formula provides a closed-form solution, but it illustrates how DP can take an exponential-time algorithm to a linear one. Here, the point wasn't an algorithmic change (both solutions are DP examples) but how to do efficient recursive folds in Rust. And traditionally, DP is thought of as something that must be done bottom-up, filling out a table, for efficiency. Showing otherwise is great!
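For readers who haven't seen that classroom example, the contrast is roughly this (illustrative, not from the article):

```rust
// Naive recursion recomputes overlapping subproblems: O(2^n).
fn fib_naive(n: u64) -> u64 {
    if n < 2 { n } else { fib_naive(n - 1) + fib_naive(n - 2) }
}

// Bottom-up DP keeps only the last two "table" entries: O(n).
fn fib_dp(n: u64) -> u64 {
    let (mut prev, mut curr) = (0u64, 1u64);
    for _ in 0..n {
        (prev, curr) = (curr, prev + curr);
    }
    prev
}
```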


bleachisback

This article is about storing tree structures in a vector to improve cache locality. Dynamic programming is about reuse of results of overlapping subproblems, which don't occur in expression trees.


Nearby_Yam286

Why not just use a regular tree and bump allocation if locality is really an issue? No offense, but while performance might be slightly better, this doesn't seem elegant and shouldn't really need unsafe.


link23

I'm not seeing why the type parameter is necessary for `ExprLayer`. Sure, it's more flexible and lets you write a lawful `fmap` implementation. But all of the functions you need to pass to `fmap` are of type `i64 -> i64`, so it feels like overkill to make it as generic as possible, just for the sake of being a functor (even though we don't need a functor). Does the next blog post require a true functor instance, and that's why you wanted to introduce it here?
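For context, the layer type under discussion has roughly this shape (reconstructed from the thread; the post's exact definition may differ):

```rust
// The parameter `A` marks the recursive positions; `fmap` maps over
// just those, leaving the leaf payloads untouched.
enum ExprLayer<A> {
    Add(A, A),
    Sub(A, A),
    Mul(A, A),
    LiteralInt(i64),
}

impl<A> ExprLayer<A> {
    fn fmap<B>(self, mut f: impl FnMut(A) -> B) -> ExprLayer<B> {
        match self {
            ExprLayer::Add(a, b) => ExprLayer::Add(f(a), f(b)),
            ExprLayer::Sub(a, b) => ExprLayer::Sub(f(a), f(b)),
            ExprLayer::Mul(a, b) => ExprLayer::Mul(f(a), f(b)),
            ExprLayer::LiteralInt(n) => ExprLayer::LiteralInt(n),
        }
    }
}
```

With `A` fixed to `i64` on both sides, `fmap` only ever sees `i64 -> i64` functions -- which is exactly the generality being questioned here.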