+++
date = 2026-02-27
title = "Scotty, I need warp speed in three minutes"

[taxonomies]
tags = [
  "beam",
  "performance"
]
+++

In my last larger gig I was working on a fascinating project - a [Postgres
connection pooler written in Elixir][supavisor]. Unfortunately, due to various
circumstances, this project burned me out to the ground. However, what doesn't
kill you ~~is crap, not a weapon~~ can become a great learning experience.

Most of my achievements in this project were related to performance. This
project contains a **very** tight loop in the form of the query handler, which
needs to run hundreds of thousands of times per second per user connection.
That means these functions are *very* sensitive to even the slightest
performance changes. And that was my task - to find potential improvements that
would make this codebase much faster.

After departing from Supabase, I liked the project so much (mostly as a
learning ground) that I created my own fork where, unrestrained by the business
side of the project, I could focus purely on squeezing out as much performance
as I can. This project now lives on as [Ultravisor][] - it is still nowhere
near done in a way that I like, but I still go back to it from time to time to
find potential performance improvements.

This is a story of things that I have done and learned during that journey.

> **Beware**: It is a retrospection, so in some places my memory may not be the
> best.

Here I first need to provide some explanation of how Ultravisor works with
database connections. There are 2 modes of operation:

- `session` - where each connection from the user to Ultravisor checks out one
  connection from Ultravisor to the database. The checkout happens once, at the
  start of the connection, and the connection is then held until the end;
- `transaction` - where nothing happens on connection. The client connects to
  Ultravisor and can keep that connection indefinitely without ever bothering
  the database. A database connection is checked out *only* when there is some
  request from the user, and it is returned to the pool as soon as the result
  of that query is returned and the DB is ready for the next one.

While `session` mode is quite on par with other implementations of connection
pooling for Postgres, `transaction` mode is where performance is lacking and
where the main focus is put. In the whole article (unless mentioned otherwise)
I will speak about the `transaction` mode of Ultravisor.

## Lesson: Flame graphs and call tracing are essential

A pretty obvious thing, but still a valuable lesson for any performance
optimisation endeavour. For that, great thanks go to [Trevor Brown][Stratus3D]
and his awesome project [eFlambè][eflambe]. It helped a lot in tracing hot
spots in the running code.

Unfortunately, this project seems to be less active recently and has some
missing features, like [listening for a given duration instead of a function
call count](https://github.com/Stratus3D/eflambe/issues/48). Fortunately, this
could be partially worked around by simply capturing the calls to the
`handle_event/4` function a given number of times and then running `cat *.bggg`
to concatenate all files into a larger trace. That has its disadvantages, but
at least it was workable within [Speedoscope][], which I also highly recommend
to anyone who needs to work on such optimisation.

While flame graphs are awesome, there is a cost to gathering them with eFlambè
- it greatly affects performance. Fortunately, Erlang has some built-in tools
with a smaller performance impact, and the "most modern" of these is
[`tprof`][tprof]. This tool is pretty easy to use, but is less detailed than
eFlambè. Even with that limitation, it provides superb insight into the things
that have the greatest impact on performance, and it makes it easier to work on
long-running processes, as it works asynchronously, so you can "manually"
decide how long you want to trace your process.
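
For a quick, ad-hoc measurement something along these lines is enough
(`MyApp.handle_query/1` is just a hypothetical stand-in for the hot code you
want to look at):

```elixir
# Run the function once and aggregate how much time was spent in each traced
# call; the default type is :call_count, and :call_memory is also available.
:tprof.profile(fn -> MyApp.handle_query("SELECT 1") end, %{type: :call_time})
```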

[Stratus3D]: https://github.com/Stratus3D
[eflambe]: https://github.com/Stratus3D/eflambe
[tprof]: https://erlang.org/doc/man/tprof.html

**Summary:** Knowing where your bottlenecks are is essential for performance
optimisations.

## Lesson: Doing less can improve performance

An obvious thing that needs to be stated - doing nothing is faster than doing
something. Extracting the amount of data sent over a given socket using the
`:inet.getstat/2` call is fast, but not free. It involves some waiting for a
response from either the port or the process handling the connection, which
introduces a slowdown. There are 2 possible solutions there:

1. Do not gather that metric at all - sensible, but not feasible, especially
   when you use that metric to charge your users.
2. Gather that data less often.

The approach I have taken there is obviously 2., and the solution is dumb
simple - a debouncer.

Debouncing is an interesting technique often used in user interfaces, where you
accept some event and then ignore repeated events for some period. The reason
for that is that our interfaces may have flaws that send repeated events one
after another.

In this case Ultravisor tries to store the amount of sent data after each
query, but that can get expensive with many short queries. Instead, I have
implemented a simple per-process debouncer:

```elixir
defmodule Ultravisor.Debouncer do
  def debounce(key, time \\ 100, func) do
    current = System.monotonic_time(:millisecond)
    key = {:debounce, key}

    case Process.get(key, nil) do
      # A recent enough result is already cached in the process dictionary.
      {prev, ret} when prev + time > current ->
        ret

      # Otherwise compute the value again and remember when we did so.
      _ ->
        ret = func.()
        Process.put(key, {current, ret})
        ret
    end
  end
end
```

This stores the returned data in the process dictionary (a per-process mutable
space with quick access), and if there was no call within the given time
period, then we process the data again. This is a safe way to do so, as
`:inet.getstat/2` always returns the amount of data that the socket has
processed since it started, so data between calls will still be accounted for.
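
A call site can then look roughly like this (the `socket` variable and the
chosen statistics are just for illustration):

```elixir
# Refresh socket statistics at most once per 100 ms in this process;
# between refreshes the previously returned counters are reused.
stats =
  Ultravisor.Debouncer.debounce({:getstat, socket}, 100, fn ->
    :inet.getstat(socket, [:recv_oct, :send_oct])
  end)
```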

## Lesson: Telemetry is not free

When working on most projects, especially Phoenix-based ones, one can slap
`:telemetry.execute/3` calls everywhere and notice no performance
degradation[^telemetry]. Unfortunately, when you do hundreds of thousands of
calls a second - that is not the case.

[^telemetry]: For unaware readers - [Telemetry][] is an Erlang event
    dispatching system for observability events.

[Telemetry]: https://github.com/beam-telemetry/telemetry
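
For context, an emit site looks roughly like this (the event name and
measurements below are illustrative, not the actual ones used in Ultravisor):

```elixir
# Every handler attached to this event runs synchronously in the calling
# process, so in a tight loop this call is not free.
:telemetry.execute(
  [:ultravisor, :query, :stop],
  %{duration: duration},
  %{tenant: tenant}
)
```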

In this project the metrics are exposed in Prometheus/OpenMetrics format, which
means that there needs to be a collection system within the application. In
BEAM applications the standard way to implement that is to use ETS tables to
store the recorded values. Fortunately, there are libraries that handle that
for you, and for the longest time the "gold standard" was the
`telemetry_metrics_prometheus_core` library created by the Telemetry core team.

While for most projects that library is performant enough (because metrics
aren't recorded in such tight loops), in the case of this project that was not
true. Here, metrics gathering is still one of the hottest spots in the
codebase, even with all the improvements that have been made.

Excerpt from a `tprof` profile:

| Function                               | Call count | Per call (μs) | Percentage |
| -------------------------------------- | ---------: | ------------: | ---------: |
| `Peep.EventHandler.store_metrics/5`    |    1421911 |          0.13 |      4.21% |
| `Peep.Storage.Striped.insert_metric/5` |     904852 |          0.26 |      5.11% |

This is with the awesome library [Peep][] by [Richard Kallos][rkallos]. When
using `telemetry_metrics_prometheus_core`, metrics gathering was simply the
most expensive thing in the whole loop. Just replacing the metrics gathering
library with Peep gave us about a 2x bump in TPS.

[Peep]: https://github.com/rkallos/peep
[rkallos]: https://github.com/rkallos

**Summary:** The Telemetry handler can matter in tight loops. Fast metrics
gathering isn't easy.

## Records instead of maps or structs

Elixir uses structs for structured data. That gives a lot of nice features
w.r.t. hot code reloads, compilation graph dependencies, and more. However,
because structs are maps, there is a cost. Maps have O(log n) access time to
their fields, as larger maps are built as a [HAMT][] in memory. While smaller
maps have slightly different (in most cases better) characteristics, that
requires you to keep your structure under 31 fields[^fields], and it still has
a slight memory overhead. The alternative is to use [records][]. These have a
better performance characteristic (always constant) regardless of the number of
fields, at the cost of being slightly more rigid (records are tuple based) and
less convenient to use (experience may vary). An additional advantage, in my
opinion, is that it is harder to accidentally add an incorrect field by using
the `Map` module.

[^fields]: The current (OTP 28) limit for a small map is 32 keys, but Elixir
    uses one key for the struct name, hence 31 fields is the limit.
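
A sketch of what using a record looks like (the module and fields here are made
up for illustration):

```elixir
defmodule Ultravisor.ClientInfo do
  require Record

  # A record is just a tagged tuple with named positions.
  Record.defrecord(:client_info, user: nil, db: nil, mode: :transaction)

  # Field access compiles down to element/2 on the tuple, so it stays O(1)
  # no matter how many fields the record has.
  def mode(client_info(mode: mode)), do: mode
end
```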

But before you run off and change all the structs in your system to records,
just remember - most of the time the difference doesn't matter - just use
structs.

[HAMT]: https://en.wikipedia.org/wiki/Hash_array_mapped_trie
[records]: https://hexdocs.pm/elixir/Record.html

Before:
```
tps = 81765.266264 (without initial connection time)
```

After:
```
tps = 82147.855889 (without initial connection time)
```

**Summary:** `Record`s are super handy when you need to squeeze every last bit
of performance. They don't provide much on their own, but these gains add up.

## ETS tables are super fast, but not always

[ETS][] is Erlang's built-in module for storing key-value data in a mutable way
- like a built-in Redis. This structure allows sharing data in a way that is
easy to access from different parts of the system. One example of a system that
uses ETS to store its information is Telemetry (mentioned above).

While for 99% of use cases Telemetry will be fast enough, it has some problems
in very tight loops. The main problem is that it will always copy data from the
table to the caller process. That means that it can put high memory pressure on
the process that tries to retrieve the data.

Fortunately, Erlang supports another mechanism for storing globally accessible
data - `persistent_term`. Of course, there is no such thing as a "free lunch",
so it has a substantial disadvantage - it works poorly[^pt] with data that
changes often, as removing or changing a key requires a walk through all
processes to copy the data into the memory of every process that may use it.
However - Telemetry handlers should not change a lot; you should set them once
as soon as your system starts, and then ideally they will never change again.

[^pt]: There is a slight optimisation that makes it fast in some cases (single
    word values, like atoms), but that is not the case here, so we can simply
    ignore it.
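
A minimal sketch of that pattern (the module and key are made up here; only the
`:persistent_term` calls are the real API):

```elixir
# Hypothetical cache for data that is read constantly but almost never changes.
defmodule Ultravisor.HandlerCache do
  @key {__MODULE__, :handlers}

  # Updating a key is expensive (it can trigger a scan of all processes),
  # so do it once at startup.
  def install(handlers), do: :persistent_term.put(@key, handlers)

  # Reads do not copy the term onto the caller's heap, so they are very cheap.
  def handlers, do: :persistent_term.get(@key, [])
end
```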

Before[^lower-tps]:
```
tps = 76914.004685 (without initial connection time)
```

After:
```
tps = 78479.006634 (without initial connection time)
```

[^lower-tps]: If you wonder why these results are lower than in the previous
    section, it is because the test conditions are identical only within a
    section, not across sections. In this particular case I ran the benchmark
    while collecting metrics (to show the difference the `persistent_term`
    change makes), while the others were run without metrics so as not to
    pollute the results.

**Summary:** `persistent_term` is awesome and super fast, so if you know that
you have some data that will probably never change and will be requested
*constantly*, then it may be a good place to store that data.

[ETS]: https://www.erlang.org/doc/apps/stdlib/ets.html

## Calling your `GenServer`s is fast, but not 90k times per second fast

One of the interesting observations that I have made is that with
longer-running queries, ones that send more data over the network than just a
simple short response, the difference between Ultravisor and "state of the art"
tools like [PgBouncer][] or [PgDog][] (which are written in non-managed
languages like C and Rust) is much smaller (obviously it is still there, but it
is on par, not substantially off).

I needed to dig into what could be the cause of such strange behaviour. The
reason was found in the place where I least expected it - checking out the
database connection to be used.

The flame graph showed that almost a third of the time is spent on checking out
database connections, and most of that time is spent in 2 function calls. Both
of them are internally `GenServer` calls, and in both most of the time is spent
sleeping (aka waiting for the reply).

<!-- TODO: Add image -->

Now, this one is a hard thing to optimise, as in Elixir there is no mutability
(almost - we will get there). This means that if I want some form of shared
queue of processes, then I need a separate process to keep the state of the
queue for us, and then do `GenServer` calls to fetch that state. So what did I
do in such a situation? What any unreasonable Elixir developer obsessed with
performance would do - a NIF[^ets].

[^ets]: I wanted to use ETS there, but for that to work it lacks a function
    like `ets:take/2` that would return only one element from tables of type
    `bag` or `duplicate_bag` - or any other form of taking out any (possibly
    random) element from an ETS table in an atomic way.

The implementation is a rather basic wrapper over a few
[`VecDeque`s][rust::VecDeque] that allows popping a single element from the
queue without any message passing. The implementation is very crude and nowhere
near production ready. It doesn't provide any form of worker restarts or
anything, but it works quite well as a PoC of what is possible.

The new queue also provides a way to store additional "metadata" alongside the
worker PID. This allows me to store the DB connection socket next to the
connection process, which removes the need for an additional call to extract
that data and lets requests be passed directly to the DB without copying data
between processes.
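
The Elixir-facing side is roughly shaped like this (the module and function
names are illustrative, not the actual Ultravisor API; NIF loading is omitted):

```elixir
# Hypothetical shape of the NIF-backed queue; the real work happens in Rust
# on top of VecDeque, and these stubs are swapped out when the NIF is loaded.
defmodule Ultravisor.NifQueue do
  # Enqueue a worker PID together with its metadata (e.g. the DB socket).
  def push(_queue, _pid, _meta), do: :erlang.nif_error(:nif_not_loaded)

  # Atomically pop one {pid, meta} entry without any message passing,
  # or return :empty when no worker is available.
  def pop(_queue), do: :erlang.nif_error(:nif_not_loaded)
end
```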

[rust::VecDeque]: https://doc.rust-lang.org/1.94.0/std/collections/struct.VecDeque.html

Before:
```
tps = 83619.640673 (without initial connection time)
```

After:
```
tps = 94191.475386 (without initial connection time)
```

**Summary:** Sometimes one needs to get creative to get around platform
limitations. This may require some pesky NIFs though.