+++
date = 2026-04-20
title = "Scotty, I need warp speed in three minutes"

description = """
My journey into optimising the Elixir codebase of Ultravisor (my fork of Supabase's
Supavisor).

This story is not about a goal, but ~~friends~~ optimisations we met along the
way.
"""

[taxonomies]
tags = [
  "beam",
  "performance"
]
+++

In my last larger gig I worked on a fascinating project - a [Postgres connection
pooler written in Elixir][supavisor]. Unfortunately, due to various
circumstances, this project burned me out to the ground. However, what doesn't
kill you ~~is crap not a weapon~~ can become a great learning experience.

[supavisor]: https://github.com/supabase/supavisor

Most of my achievements in this project were related to performance. The
project contains a **very** tight loop in the form of a query handler that needs to
run hundreds of thousands of times per second per user connection. That means
these functions are *very* sensitive to even the slightest performance changes. And
that was my task - to find potential improvements that would make this
codebase much faster.

After departing from Supabase I liked the project so much (mostly as a learning
ground) that I created my own fork where, unrestrained by the business
side of the project, I could focus purely on squeezing out as much performance as
I can. This project now lives as [Ultravisor][] - it is still nowhere near being
done in a way that I like, but I still go back to work on it from time to time
to find potential performance improvements.

This is a story of the things that I have done and learned during that journey.

> **Beware**: This is a retrospection, so in some places my memory may not be the
> best.

First I need to provide some explanation of how Ultravisor works with
database connections. It provides 2 modes of operation:

- `session` - each connection from the user to Ultravisor checks out one
  connection from Ultravisor to the database. It checks out once, at the start
  of the connection, and then holds that connection until the end;
- `transaction` - nothing is done on connection. The client connects to
  Ultravisor and can keep that connection indefinitely without ever
  bothering the database. A database connection is checked out *only* when there
  is some request from the user, and it is returned to the pool as soon as the
  result of that query is returned and the DB is ready for the next one.
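
The `transaction` flow can be sketched roughly like this (this is pseudocode with made-up names, not Ultravisor's actual code):

```elixir
# Hypothetical sketch of transaction-mode handling; every module and
# function name here is invented for illustration.
def handle_client_query(client_socket, query, pool) do
  # the database is touched only now, not at client connect time
  {conn, db_socket} = Pool.checkout(pool)
  :ok = :gen_tcp.send(db_socket, query)
  # read responses until the DB reports it is idle again (ReadyForQuery)
  reply = read_until_ready(db_socket)
  :ok = :gen_tcp.send(client_socket, reply)
  # return the connection immediately so other clients can use it
  Pool.checkin(pool, conn)
end
```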

While `session` mode is quite on par with other implementations of connection
pooling for Postgres, `transaction` mode is where performance is lacking and
where the main focus is put. Throughout this article (unless mentioned
otherwise) I will speak about the `transaction` mode of Ultravisor.

## Lesson: Flame graphs and call tracing are essential

A pretty obvious thing, but still a valuable lesson for any performance
optimisation endeavour. Here great thanks go to [Trevor Brown][Stratus3D]
and his awesome project [eFlambè][eflambe]. It helped a lot in tracing hot
spots in the running code.

Unfortunately this project seems to be less active recently and has some missing
features, like [listening for a given duration instead of a function call
count](https://github.com/Stratus3D/eflambe/issues/48). This can be partially
worked around by simply capturing a given number of calls to the
`handle_event/4` function several times and then running `cat *.bggg` to
concatenate all the files into a larger trace. That has disadvantages, but at
least it was workable within [Speedoscope][], which I also highly recommend to
anyone who needs to work on such optimisations.
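
The concatenation step is nothing fancy - the folded-stack format eFlambè emits is line-oriented, so the per-capture files can simply be joined (file names below are made up for the demo):

```shell
# suppose two separate captures wrote their folded-stack samples:
printf 'Elixir.Handler.handle_event/4;erlang:send/2 10\n' > capture1.bggg
printf 'Elixir.Handler.handle_event/4;inet:getstat/2 4\n' > capture2.bggg

# each line is an independent stack sample, so joining the files
# yields one bigger trace that speedscope-style viewers can open
cat *.bggg > combined.bggg
```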

While flame graphs are awesome, there is a cost to gathering them with eFlambè -
it greatly affects performance. Fortunately Erlang has some built-in tools with
a smaller performance impact, and the "most modern" of these is
[`tprof`][tprof]. This tool is pretty easy to use, but is less detailed than
eFlambè. Even with that limitation it provides superb insight into the things
that have the greatest impact on performance. It also makes it easier to work on
long-running processes, as it works asynchronously, so you can "manually" decide
how long you want to trace your process.

[Stratus3D]: https://github.com/Stratus3D
[eflambe]: https://github.com/Stratus3D/eflambe
[tprof]: https://erlang.org/doc/man/tprof.html
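
As a flavour of the API, here is the ad-hoc mode (OTP 27+, per the `tprof` docs; for long-running processes the server-aided `:tprof.start/1` / `:tprof.collect/0` flow fits better):

```elixir
# Ad-hoc call-time profiling of one function call (requires OTP 27+).
# Prints a per-function breakdown of call counts and time spent.
:tprof.profile(:lists, :seq, [1, 1000], %{type: :call_time})
```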

**Summary:** Knowing where your bottlenecks are is essential for performance
optimisations.

## Lesson: Doing less can improve performance

An obvious thing that needs to be stated - doing nothing is faster than doing
something. Extracting the amount of data sent over a given socket using the
`:inet.getstat/2` call is fast, but not free. It involves some waiting for a
response from either the port or the process handling the connection, which
introduces a slowdown. There are two possible solutions:

1. Do not gather that metric at all - sensible, but not feasible, especially when
   you use that metric to charge your users.
1. Gather that data less often.

The approach I have taken is obviously the second one, and the solution is dead
simple - a debouncer.

Debouncing is an interesting technique, often used in user interfaces, where you
accept some event and then for some period you ignore repeated events. The
reason is that our interfaces may have flaws that send repeated events
one after another.

In this case Ultravisor tries to store the amount of sent data after each query, but
that can get expensive for many short queries. Instead I have implemented a simple
per-process debouncer:

```elixir
defmodule Ultravisor.Debouncer do
  def debounce(key, time \\ 100, func) do
    current = System.monotonic_time(:millisecond)
    key = {:debounce, key}

    case Process.get(key, nil) do
      {prev, ret} when prev + time > current ->
        ret

      _ ->
        ret = func.()
        Process.put(key, {current, ret})
        ret
    end
  end
end
```

This stores the returned data in the process dictionary (a per-process mutable
space with quick access), and if there was no call within the given time period,
then we process the data again. This is a safe thing to do, as `:inet.getstat/2`
always returns the amount of data that the socket has processed since it started,
so the data between calls will still be accounted for.
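
To see the effect in isolation, here is a small standalone demo (the module is copied from above, renamed for the demo; the counter is only bookkeeping to show how often the wrapped function actually runs):

```elixir
defmodule Debouncer do
  # same implementation as Ultravisor.Debouncer above
  def debounce(key, time \\ 100, func) do
    current = System.monotonic_time(:millisecond)
    key = {:debounce, key}

    case Process.get(key, nil) do
      {prev, ret} when prev + time > current ->
        ret

      _ ->
        ret = func.()
        Process.put(key, {current, ret})
        ret
    end
  end
end

counter = :counters.new(1, [])

expensive = fn ->
  :counters.add(counter, 1, 1)
  :counters.get(counter, 1)
end

# 1000 calls inside a 60 s debounce window - the wrapped function
# body runs only on the very first call, every later call gets the
# cached return value from the process dictionary
for _ <- 1..1000, do: Debouncer.debounce(:stats, 60_000, expensive)
IO.inspect(:counters.get(counter, 1))
# => 1
```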

Before:
```
tps = 79401.392762 (without initial connection time)
```

After (10ms of debouncing):
```
tps = 80069.646510 (without initial connection time)
```

After (100ms of debouncing):
```
tps = 80568.825937 (without initial connection time)
```

**Summary**: Doing nothing is more performant than doing something. Sometimes
doing nothing can be quite easy.

## Lesson: Telemetry is not free

When working on most projects, especially Phoenix-based ones, one can slap
`:telemetry.execute/3` calls everywhere and notice no performance degradation[^telemetry].
Unfortunately, when you make hundreds of thousands of calls a second - that is
not the case.

[^telemetry]: For unaware readers - [Telemetry][] is an Erlang event dispatching
  system for observability events.

[Telemetry]: https://github.com/beam-telemetry/telemetry
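
For context, a typical emit site looks like the snippet below. Treat it as a sketch: it requires the `:telemetry` package, and the event name and measurements are made up for illustration.

```elixir
# Illustrative only - event name, measurements, and metadata are invented.
:telemetry.execute(
  [:ultravisor, :client, :query, :stop],
  %{duration: 125, bytes_sent: 4096},
  %{mode: :transaction}
)
```

Each such call synchronously invokes every attached handler, which is exactly why the handler's cost shows up in a loop this tight.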

In this project the metrics are exposed in Prometheus/OpenMetrics format, which
means that there needs to be a collection system within the application. In BEAM
applications the standard way to implement that is to use ETS tables to store the
recorded values. Fortunately there are libraries that handle that for you, and for
the longest time the "gold standard" was the `telemetry_metrics_prometheus_core`
library created by the Telemetry core team.

While for most projects that library is performant enough (because metrics
aren't recorded in such tight loops), in the case of this project it was not.
Metrics gathering is still one of the hottest spots in the codebase,
even with all the improvements that have been done.

Excerpt from a `tprof` profile:

| Function | Call count | Per call (μs) | Percentage |
| - | -: | -: | -: |
| `Peep.EventHandler.store_metrics/5` | 1421911 | 0.13 | 4.21% |
| `Peep.Storage.Striped.insert_metric/5` | 904852 | 0.26 | 5.11% |

This is with the awesome library [Peep][] by [Richard Kallos][rkallos]. When using
`telemetry_metrics_prometheus_core` it was simply the most expensive thing in the
whole loop. Just replacing the metrics gathering library with Peep gave us about
a 2x bump in TPS.

[Peep]: https://github.com/rkallos/peep
[rkallos]: https://github.com/rkallos

**Summary:** The telemetry handler can matter in tight loops. Fast metrics
gathering isn't easy.

## Lesson: Records instead of maps or structs

Elixir uses structs for structured data. That gives a lot of nice features wrt.
hot code reloading, compilation graph dependencies, and others. However, because
structs are maps, there is a cost. Maps have O(log n) field access time due to
how they are constructed in memory (as a [HAMT][]). While smaller maps have
slightly different (better in most cases) characteristics, there is a strict
requirement that you keep your structure under 31 fields[^fields], and it still
has a slight memory overhead. The alternative is to use [records][]. These have
better performance characteristics (always constant time) irrespective of the
number of fields, at the cost of being slightly more rigid (records are tuple
based) and less convenient to use (experience may vary). An additional advantage,
in my opinion, is that it is harder to add an incorrect field by using the `Map`
module.

[^fields]: The current (OTP 28) limit for a small map is 32 keys, but Elixir uses
  one key for the struct name, hence 31 fields is the limit.

Before you run off and change all the structs in your system to records, just
remember - most of the time the difference doesn't matter - just use structs.

[HAMT]: https://en.wikipedia.org/wiki/Hash_array_mapped_trie
[records]: https://hexdocs.pm/elixir/Record.html
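
A minimal sketch of what that looks like (the record and its fields are hypothetical, not Ultravisor's actual state):

```elixir
defmodule DbConn do
  import Record

  # Hypothetical connection record; field access compiles down to
  # element/2 on a tuple, so it is constant time for any field count.
  defrecord :db_conn, socket: nil, user: nil, sent_bytes: 0

  def bump(db_conn(sent_bytes: n) = conn, bytes) do
    db_conn(conn, sent_bytes: n + bytes)
  end
end

require DbConn

conn = DbConn.db_conn(user: "app")
conn = DbConn.bump(conn, 128)
IO.inspect(DbConn.db_conn(conn, :sent_bytes))
# => 128
```

Note that misspelling a field in any `db_conn(...)` macro call is a compile-time error, unlike `Map.put/3` on a struct, which happily inserts the wrong key.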

Before:
```
tps = 81765.266264 (without initial connection time)
```

After:
```
tps = 82147.855889 (without initial connection time)
```

**Summary:** `Record`s are super handy when you need to squeeze out every bit of
performance. The gain isn't much, but these things add up.

## Lesson: ETS tables are super fast, but not always

[ETS][] is Erlang's built-in module for storing key-value data in a mutable way -
like a built-in Redis. This structure allows sharing data in a way that
is easy to access from different parts of the system. One example of a system
that uses ETS to store its information is Telemetry (mentioned above).

While for 99% of use cases Telemetry will be fast enough, it has some
problems with tight loops. The main problem is that it will always copy data
from the table to the caller process. That means it can put high memory pressure
on the process that tries to retrieve the data.

Fortunately Erlang supports another mechanism for storing globally accessible
data - `persistent_term`. Of course, there is no such thing as a "free lunch",
so it has a substantial disadvantage - it works poorly[^pt] with data that
changes often, as removing or changing the data under a key forces a walk
through all processes, and any process still using the old term must copy it
into its own heap. However - Telemetry handlers should not change a lot; you
should just set them up once as your system starts, and then ideally they will
never change again.

[^pt]: There is a slight optimisation that makes it fast in some cases (single
  word values, like atoms), but that is not the case here, so we can ignore
  it.
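
The API itself is tiny; a sketch of the pattern (the key and value are made up):

```elixir
# Store rarely-changing data once, at startup. Updates are expensive
# (a global scan of all processes), so this is only for
# effectively-constant data.
:persistent_term.put({:ultravisor, :metrics_config}, %{reporter: :peep})

# Reads from the hot path are cheap: get/1 returns the term without
# copying it into the calling process's heap, unlike an ETS lookup.
IO.inspect(:persistent_term.get({:ultravisor, :metrics_config}))
# => %{reporter: :peep}
```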

Before[^lower-tps]:
```
tps = 76914.004685 (without initial connection time)
```

After:
```
tps = 78479.006634 (without initial connection time)
```

[^lower-tps]: If you wonder why these results are lower than in the previous
  section, it is because test conditions are identical only per section, not
  across sections. In this particular case I ran the benchmark while
  collecting metrics (to show the difference from the `persistent_term` change)
  while the others were run without metrics so as not to pollute the results.

**Summary:** `persistent_term` is awesome and super fast, so if you know that
you have some data that will probably never change and will be requested
*constantly*, then it may be a good place to store that data.

[ETS]: https://www.erlang.org/doc/apps/stdlib/ets.html

## Lesson: Calling your `GenServer`s is fast, but not 90k times per second fast

One of the interesting things I have spotted is that with longer-running
queries, ones that send more data over the network than just a simple short
response, the difference between Ultravisor and "state of the art" tools like
[PgBouncer][] or [PgDog][] (which are written in non-managed languages like C
and Rust) is much smaller (obviously it is still there, but it is on par, not
substantially off).

[PgBouncer]: https://www.pgbouncer.org
[PgDog]: https://pgdog.dev

I needed to dig into what could be the cause of such strange behaviour. The
reason was found in the place where I least expected it - checking out the
database connection to be used.

The flame graph showed that almost a third of the time was spent on checking out
database connections, and most of that time was spent in 2 function calls, both
of them internally `gen_statem` calls, and in both most of the time was spent
sleeping (aka waiting for a reply).

Now, this one is a hard thing to optimise, as in Elixir there is no mutability
(almost - we will get there). This means that if I want some form of shared
queue of processes, then I need a separate process to keep the state of the
queue for us, and then do `GenServer` calls to fetch that state. What did I do
in such a situation? What any unreasonable Elixir developer obsessed with
performance would do - a NIF[^ets].

[^ets]: I wanted to use ETS there, but for that to work it lacks a function like
  `ets:take/2` that would return only one element from tables of type
  `bag` or `duplicate_bag`. Or any other form of taking out any (possibly
  random) element from an ETS table in an atomic way.

The implementation is a rather basic wrapper over [`VecDeque`][rust::VecDeque]
that allows popping a single element from the queue without any message passing.
The implementation is very crude and nowhere near production ready. It doesn't
provide any form of worker restarts or anything, but it works quite well as a
PoC of what is possible.

The new queue also provides a way to store additional "metadata" alongside the
worker PID. This allows me to store the DB connection socket next to the
connection process, which removes the need for an additional call to extract
that data and lets me pass requests directly to the DB without copying data
between processes.

[rust::VecDeque]: https://doc.rust-lang.org/1.94.0/std/collections/struct.VecDeque.html
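
The Elixir-facing surface can be imagined roughly like this (every name here is hypothetical - the point is only that checkout becomes a NIF call instead of a `GenServer` round-trip):

```elixir
# Hypothetical API sketch, not the actual Ultravisor code.
{:ok, queue} = Ultravisor.NifQueue.new()

# checkin: store the worker PID together with its socket as metadata
:ok = Ultravisor.NifQueue.push(queue, {conn_pid, db_socket})

# checkout: pops atomically inside the NIF - no message passing,
# no sleeping while waiting for a gen_statem reply
{:ok, {conn_pid, db_socket}} = Ultravisor.NifQueue.pop(queue)
```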

Before:
```
tps = 83619.640673 (without initial connection time)
```

After:
```
tps = 94191.475386 (without initial connection time)
```

**Summary:** Sometimes one needs to get creative to get around platform
limitations. This may require some pesky NIFs though.

## Conclusions

Optimising this project was enormous fun, and I think that in its current state
there is nothing extra that can be done to optimise it further without
optimising the JIT-generated native code or the Erlang scheduler itself.

There are some flags that affect performance, but as it is currently unclear
why they work at all (probably it is related more to the OS scheduler than to
Erlang performance), I have left them out of this article for now.

## Post Scriptum: Good tooling helps a lot

Just after starting this optimisation project after leaving Supabase, I started
using [Jujutsu][jj] for version control. That one thing helped me **a lot**
with being able to have separate branches/PRs for each of the changes, while at
the same time being able to work with a [mega-merge][jj-mm] of them all.

That allowed me to profile the code with all the other noise removed, while
still exposing the changes as separate reviewable units. Without that support I
would need to decipher what had already been changed and/or removed from the
profile.

An additional feature that I used heavily is "anonymous branching". As when
working with JJ I do not need to create a new name for each branch that I want
to try, it was way easier to implement one idea, then just do `jj new @-`
(which branches off at the parent of the current commit) and implement the
alternative idea. I used that constantly to compare ideas and reject failed
concepts.

[jj]: https://docs.jj-vcs.dev/latest/
[jj-mm]: https://steveklabnik.github.io/jujutsu-tutorial/advanced/simultaneous-edits.html