+++
date = 2026-04-20
title = "Scotty, I need warp speed in three minutes"

description = """
My journey into optimising the Elixir codebase of Ultravisor (my fork of Supabase's
Supavisor).

This story is not about a goal, but ~~friends~~ optimisations we met along the
way.
"""

[taxonomies]
tags = [
  "beam",
  "performance"
]
+++

In my last larger gig I worked on a fascinating project - a [Postgres connection
pooler written in Elixir][supavisor]. Unfortunately, due to various
circumstances, this project burned me out to the ground. However, what doesn't
kill you ~~is crap not a weapon~~ can become a great learning experience.

[supavisor]: https://github.com/supabase/supavisor

Most of my achievements in this project were related to performance. The
project contains a **very** tight loop in the form of a query handler that needs to
run hundreds of thousands of times per second per user connection. That means
these functions are *very* sensitive to even the slightest performance changes. And
that was my task - to find potential improvements that would make this
codebase much faster.

After departing from Supabase I liked the project so much (mostly as a learning
ground) that I created my own fork where, unrestrained by the business
side of the project, I could focus purely on squeezing out as much performance as
I can. This project now lives as [Ultravisor][] - it is still nowhere near being
done in a way that I like, but I still go back to work on it from time to time
to find potential performance improvements.

This is a story of the things that I have done and learned during that journey.

> **Beware**: This is a retrospection, so in some places my memory may not be the
> best.

First I need to provide some explanation of how Ultravisor works with
database connections. It provides 2 modes of operation:

- `session` - each connection from the user to Ultravisor checks out one
  connection from Ultravisor to the database. It checks out once, at the start
  of the connection, and then holds that connection until the end;
- `transaction` - nothing is done on connection. The client connects to
  Ultravisor and can keep that connection indefinitely without ever
  bothering the database. A database connection is checked out *only* when there
  is some request from the user, and it is returned to the pool as soon as the
  result of that query is returned and the DB is ready for the next one.
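
The `transaction` flow can be sketched roughly like this (this is pseudocode with made-up names, not Ultravisor's actual code):

```elixir
# Hypothetical sketch of transaction-mode handling; every module and
# function name here is invented for illustration.
def handle_client_query(client_socket, query, pool) do
  # the database is touched only now, not at client connect time
  {conn, db_socket} = Pool.checkout(pool)
  :ok = :gen_tcp.send(db_socket, query)
  # read responses until the DB reports it is idle again (ReadyForQuery)
  reply = read_until_ready(db_socket)
  :ok = :gen_tcp.send(client_socket, reply)
  # return the connection immediately so other clients can use it
  Pool.checkin(pool, conn)
end
```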

While `session` mode is quite on par with other implementations of connection
pooling for Postgres, `transaction` mode is where performance is lacking and
where the main focus is put. Throughout this article (unless mentioned
otherwise) I will speak about the `transaction` mode of Ultravisor.

## Lesson: Flame graphs and call tracing are essential

A pretty obvious thing, but still a valuable lesson for any performance
optimisation endeavour. Here great thanks go to [Trevor Brown][Stratus3D]
and his awesome project [eFlambè][eflambe]. It helped a lot in tracing hot
spots in the running code.

Unfortunately this project seems to be less active recently and has some missing
features, like [listening for a given duration instead of a function call
count](https://github.com/Stratus3D/eflambe/issues/48). This can be partially
worked around by simply capturing a given number of calls to the
`handle_event/4` function several times and then running `cat *.bggg` to
concatenate all the files into a larger trace. That has disadvantages, but at
least it was workable within [Speedoscope][], which I also highly recommend to
anyone who needs to work on such optimisations.
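
The concatenation step is nothing fancy - the folded-stack format eFlambè emits is line-oriented, so the per-capture files can simply be joined (file names below are made up for the demo):

```shell
# suppose two separate captures wrote their folded-stack samples:
printf 'Elixir.Handler.handle_event/4;erlang:send/2 10\n' > capture1.bggg
printf 'Elixir.Handler.handle_event/4;inet:getstat/2 4\n' > capture2.bggg

# each line is an independent stack sample, so joining the files
# yields one bigger trace that speedscope-style viewers can open
cat *.bggg > combined.bggg
```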

While flame graphs are awesome, there is a cost to gathering them with eFlambè -
it greatly affects performance. Fortunately Erlang has some built-in tools with
a smaller performance impact, and the "most modern" of these is
[`tprof`][tprof]. This tool is pretty easy to use, but is less detailed than
eFlambè. Even with that limitation it provides superb insight into the things
that have the greatest impact on performance. It also makes it easier to work on
long-running processes, as it works asynchronously, so you can "manually" decide
how long you want to trace your process.

[Stratus3D]: https://github.com/Stratus3D
[eflambe]: https://github.com/Stratus3D/eflambe
[tprof]: https://erlang.org/doc/man/tprof.html
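
As a flavour of the API, here is the ad-hoc mode (OTP 27+, per the `tprof` docs; for long-running processes the server-aided `:tprof.start/1` / `:tprof.collect/0` flow fits better):

```elixir
# Ad-hoc call-time profiling of one function call (requires OTP 27+).
# Prints a per-function breakdown of call counts and time spent.
:tprof.profile(:lists, :seq, [1, 1000], %{type: :call_time})
```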

**Summary:** Knowing where your bottlenecks are is essential for performance
optimisations.

## Lesson: Doing less can improve performance

An obvious thing that needs to be stated - doing nothing is faster than doing
something. Extracting the amount of data sent over a given socket using the
`:inet.getstat/2` call is fast, but not free. It involves some waiting for a
response from either the port or the process handling the connection, which
introduces a slowdown. There are two possible solutions:

1. Do not gather that metric at all - sensible, but not feasible, especially when
   you use that metric to charge your users.
1. Gather that data less often.

The approach I have taken is obviously the second one, and the solution is dead
simple - a debouncer.

Debouncing is an interesting technique, often used in user interfaces, where you
accept some event and then for some period you ignore repeated events. The
reason is that our interfaces may have flaws that send repeated events
one after another.

In this case Ultravisor tries to store the amount of sent data after each query, but
that can get expensive for many short queries. Instead I have implemented a simple
per-process debouncer:

```elixir
defmodule Ultravisor.Debouncer do
  def debounce(key, time \\ 100, func) do
    current = System.monotonic_time(:millisecond)
    key = {:debounce, key}

    case Process.get(key, nil) do
      {prev, ret} when prev + time > current ->
        ret

      _ ->
        ret = func.()
        Process.put(key, {current, ret})
        ret
    end
  end
end
```

This stores the returned data in the process dictionary (a per-process mutable
space with quick access), and if there was no call within the given time period,
then we process the data again. This is a safe thing to do, as `:inet.getstat/2`
always returns the amount of data that the socket has processed since it started,
so the data between calls will still be accounted for.
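
To see the effect in isolation, here is a small standalone demo (the module is copied from above, renamed for the demo; the counter is only bookkeeping to show how often the wrapped function actually runs):

```elixir
defmodule Debouncer do
  # same implementation as Ultravisor.Debouncer above
  def debounce(key, time \\ 100, func) do
    current = System.monotonic_time(:millisecond)
    key = {:debounce, key}

    case Process.get(key, nil) do
      {prev, ret} when prev + time > current ->
        ret

      _ ->
        ret = func.()
        Process.put(key, {current, ret})
        ret
    end
  end
end

counter = :counters.new(1, [])

expensive = fn ->
  :counters.add(counter, 1, 1)
  :counters.get(counter, 1)
end

# 1000 calls inside a 60 s debounce window - the wrapped function
# body runs only on the very first call, every later call gets the
# cached return value from the process dictionary
for _ <- 1..1000, do: Debouncer.debounce(:stats, 60_000, expensive)
IO.inspect(:counters.get(counter, 1))
# => 1
```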

Before:
```
tps = 79401.392762 (without initial connection time)
```

After (10ms of debouncing):
```
tps = 80069.646510 (without initial connection time)
```

After (100ms of debouncing):
```
tps = 80568.825937 (without initial connection time)
```

**Summary**: Doing nothing is more performant than doing something. Sometimes
doing nothing can be quite easy.

## Lesson: Telemetry is not free

When working on most projects, especially Phoenix-based ones, one can slap
`:telemetry.execute/3` calls everywhere and notice no performance degradation[^telemetry].
Unfortunately, when you make hundreds of thousands of calls a second - that is
not the case.

[^telemetry]: For unaware readers - [Telemetry][] is an Erlang event dispatching
  system for observability events.

[Telemetry]: https://github.com/beam-telemetry/telemetry
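
For context, a typical emit site looks like the snippet below. Treat it as a sketch: it requires the `:telemetry` package, and the event name and measurements are made up for illustration.

```elixir
# Illustrative only - event name, measurements, and metadata are invented.
:telemetry.execute(
  [:ultravisor, :client, :query, :stop],
  %{duration: 125, bytes_sent: 4096},
  %{mode: :transaction}
)
```

Each such call synchronously invokes every attached handler, which is exactly why the handler's cost shows up in a loop this tight.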

In this project the metrics are exposed in Prometheus/OpenMetrics format, which
means that there needs to be a collection system within the application. In BEAM
applications the standard way to implement that is to use ETS tables to store the
recorded values. Fortunately there are libraries that handle that for you, and for
the longest time the "gold standard" was the `telemetry_metrics_prometheus_core`
library created by the Telemetry core team.

While for most projects that library is performant enough (because metrics
aren't recorded in such tight loops), in the case of this project it was not.
Metrics gathering is still one of the hottest spots in the codebase,
even with all the improvements that have been done.

Excerpt from a `tprof` profile:

| Function | Call count | Per call (μs) | Percentage |
| - | -: | -: | -: |
| `Peep.EventHandler.store_metrics/5` | 1421911 | 0.13 | 4.21% |
| `Peep.Storage.Striped.insert_metric/5` | 904852 | 0.26 | 5.11% |

This is with the awesome library [Peep][] by [Richard Kallos][rkallos]. When using
`telemetry_metrics_prometheus_core` it was simply the most expensive thing in the
whole loop. Just replacing the metrics gathering library with Peep gave us about
a 2x bump in TPS.

[Peep]: https://github.com/rkallos/peep
[rkallos]: https://github.com/rkallos

**Summary:** The telemetry handler can matter in tight loops. Fast metrics
gathering isn't easy.

## Lesson: Records instead of maps or structs

Elixir uses structs for structured data. That gives a lot of nice features wrt.
hot code reloading, compilation graph dependencies, and others. However, because
structs are maps, there is a cost. Maps have O(log n) field access time due to
how they are constructed in memory (as a [HAMT][]). While smaller maps have
slightly different (better in most cases) characteristics, there is a strict
requirement that you keep your structure under 31 fields[^fields], and it still
has a slight memory overhead. The alternative is to use [records][]. These have
better performance characteristics (always constant time) irrespective of the
number of fields, at the cost of being slightly more rigid (records are tuple
based) and less convenient to use (experience may vary). An additional advantage,
in my opinion, is that it is harder to add an incorrect field by using the `Map`
module.

[^fields]: The current (OTP 28) limit for a small map is 32 keys, but Elixir uses
  one key for the struct name, hence 31 fields is the limit.

Before you run off and change all the structs in your system to records, just
remember - most of the time the difference doesn't matter - just use structs.

[HAMT]: https://en.wikipedia.org/wiki/Hash_array_mapped_trie
[records]: https://hexdocs.pm/elixir/Record.html
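
A minimal sketch of what that looks like (the record and its fields are hypothetical, not Ultravisor's actual state):

```elixir
defmodule DbConn do
  import Record

  # Hypothetical connection record; field access compiles down to
  # element/2 on a tuple, so it is constant time for any field count.
  defrecord :db_conn, socket: nil, user: nil, sent_bytes: 0

  def bump(db_conn(sent_bytes: n) = conn, bytes) do
    db_conn(conn, sent_bytes: n + bytes)
  end
end

require DbConn

conn = DbConn.db_conn(user: "app")
conn = DbConn.bump(conn, 128)
IO.inspect(DbConn.db_conn(conn, :sent_bytes))
# => 128
```

Note that misspelling a field in any `db_conn(...)` macro call is a compile-time error, unlike `Map.put/3` on a struct, which happily inserts the wrong key.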

Before:
```
tps = 81765.266264 (without initial connection time)
```

After:
```
tps = 82147.855889 (without initial connection time)
```

**Summary:** `Record`s are super handy when you need to squeeze out every bit of
performance. The gain isn't much, but these things add up.

## Lesson: ETS tables are super fast, but not always

[ETS][] is Erlang's built-in module for storing key-value data in a mutable way -
like a built-in Redis. This structure allows sharing data in a way that
is easy to access from different parts of the system. One example of a system
that uses ETS to store its information is Telemetry (mentioned above).

While for 99% of use cases Telemetry will be fast enough, it has some
problems with tight loops. The main problem is that it will always copy data
from the table to the caller process. That means it can put high memory pressure
on the process that tries to retrieve the data.

Fortunately Erlang supports another mechanism for storing globally accessible
data - `persistent_term`. Of course, there is no such thing as a "free lunch",
so it has a substantial disadvantage - it works poorly[^pt] with data that
changes often, as removing or changing the data under a key forces a walk
through all processes, and any process still using the old term must copy it
into its own heap. However - Telemetry handlers should not change a lot; you
should just set them up once as your system starts, and then ideally they will
never change again.

[^pt]: There is a slight optimisation that makes it fast in some cases (single
  word values, like atoms), but that is not the case here, so we can ignore
  it.
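
The API itself is tiny; a sketch of the pattern (the key and value are made up):

```elixir
# Store rarely-changing data once, at startup. Updates are expensive
# (a global scan of all processes), so this is only for
# effectively-constant data.
:persistent_term.put({:ultravisor, :metrics_config}, %{reporter: :peep})

# Reads from the hot path are cheap: get/1 returns the term without
# copying it into the calling process's heap, unlike an ETS lookup.
IO.inspect(:persistent_term.get({:ultravisor, :metrics_config}))
# => %{reporter: :peep}
```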

Before[^lower-tps]:
```
tps = 76914.004685 (without initial connection time)
```

After:
```
tps = 78479.006634 (without initial connection time)
```

[^lower-tps]: If you wonder why these results are lower than in the previous
  section, it is because test conditions are identical only per section, not
  across sections. In this particular case I ran the benchmark while
  collecting metrics (to show the difference from the `persistent_term` change)
  while the others were run without metrics so as not to pollute the results.

**Summary:** `persistent_term` is awesome and super fast, so if you know that
you have some data that will probably never change and will be requested
*constantly*, then it may be a good place to store that data.

[ETS]: https://www.erlang.org/doc/apps/stdlib/ets.html

## Lesson: Calling your `GenServer`s is fast, but not 90k times per second fast

One of the interesting things I have spotted is that with longer-running
queries, ones that send more data over the network than just a simple short
response, the difference between Ultravisor and "state of the art" tools like
[PgBouncer][] or [PgDog][] (which are written in non-managed languages like C
and Rust) is much smaller (obviously it is still there, but it is on par, not
substantially off).

[PgBouncer]: https://www.pgbouncer.org
[PgDog]: https://pgdog.dev

I needed to dig into what could be the cause of such strange behaviour. The
reason was found in the place where I least expected it - checking out the
database connection to be used.

The flame graph showed that almost a third of the time was spent on checking out
database connections, and most of that time was spent in 2 function calls, both
of them internally `gen_statem` calls, and in both most of the time was spent
sleeping (aka waiting for a reply).

Now, this one is a hard thing to optimise, as in Elixir there is no mutability
(almost - we will get there). This means that if I want some form of shared
queue of processes, then I need a separate process to keep the state of the
queue for us, and then do `GenServer` calls to fetch that state. What did I do
in such a situation? What any unreasonable Elixir developer obsessed with
performance would do - a NIF[^ets].

[^ets]: I wanted to use ETS there, but for that to work it lacks a function like
  `ets:take/2` that would return only one element from tables of type
  `bag` or `duplicate_bag`. Or any other form of taking out any (possibly
  random) element from an ETS table in an atomic way.

The implementation is a rather basic wrapper over [`VecDeque`][rust::VecDeque]
that allows popping a single element from the queue without any message passing.
The implementation is very crude and nowhere near production ready. It doesn't
provide any form of worker restarts or anything, but it works quite well as a
PoC of what is possible.

The new queue also provides a way to store additional "metadata" alongside the
worker PID. This allows me to store the DB connection socket next to the
connection process, which removes the need for an additional call to extract
that data and lets me pass requests directly to the DB without copying data
between processes.

[rust::VecDeque]: https://doc.rust-lang.org/1.94.0/std/collections/struct.VecDeque.html
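
The Elixir-facing surface can be imagined roughly like this (every name here is hypothetical - the point is only that checkout becomes a NIF call instead of a `GenServer` round-trip):

```elixir
# Hypothetical API sketch, not the actual Ultravisor code.
{:ok, queue} = Ultravisor.NifQueue.new()

# checkin: store the worker PID together with its socket as metadata
:ok = Ultravisor.NifQueue.push(queue, {conn_pid, db_socket})

# checkout: pops atomically inside the NIF - no message passing,
# no sleeping while waiting for a gen_statem reply
{:ok, {conn_pid, db_socket}} = Ultravisor.NifQueue.pop(queue)
```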

Before:
```
tps = 83619.640673 (without initial connection time)
```

After:
```
tps = 94191.475386 (without initial connection time)
```

**Summary:** Sometimes one needs to get creative to get around platform
limitations. This may require some pesky NIFs though.

## Conclusions

Optimising this project was enormous fun, and I think that in its current state
there is nothing extra that can be done to optimise it further without
optimising the JIT-generated native code or the Erlang scheduler itself.

There are some flags that affect performance, but as it is currently unclear
why they work at all (probably it is related more to the OS scheduler than to
Erlang performance), I have left them out of this article for now.

## Post Scriptum: Good tooling helps a lot

Just after starting this optimisation project after leaving Supabase, I started
using [Jujutsu][jj] for version control. That one thing helped me **a lot**
with being able to have separate branches/PRs for each of the changes, while at
the same time being able to work with a [mega-merge][jj-mm] of them all.

That allowed me to profile the code with all the other noise removed, while
still exposing the changes as separate reviewable units. Without that support I
would need to decipher what had already been changed and/or removed from the
profile.

An additional feature that I used heavily is "anonymous branching". As when
working with JJ I do not need to create a new name for each branch that I want
to try, it was way easier to implement one idea, then just do `jj new @-`
(which branches off at the parent of the current commit) and implement the
alternative idea. I used that constantly to compare ideas and reject failed
concepts.

[jj]: https://docs.jj-vcs.dev/latest/
[jj-mm]: https://steveklabnik.github.io/jujutsu-tutorial/advanced/simultaneous-edits.html