+++
date = 2026-02-27
title = "Scotty, I need warp speed in three minutes"

[taxonomies]
tags = [
    "beam",
    "performance"
]
+++

In my last larger gig I was working on a fascinating project - a [Postgres
connection pooler written in Elixir][supavisor]. Unfortunately, due to various
circumstances, that project burned me out to the ground. However, what doesn't
kill you ~~is crap, not a weapon~~ can become a great learning experience.

[supavisor]: https://github.com/supabase/supavisor

Most of my achievements in this project were related to performance. The
project contains a **very** tight loop in the form of the query handler, which
needs to run hundreds of thousands of times per second per user connection.
That means these functions are *very* sensitive to even the slightest
performance changes. And that was my task - to find potential improvements
that would make this codebase much faster.

After departing from Supabase I liked the project so much (mostly as a
learning ground) that I created my own fork where, unrestrained by the
business side of the project, I could focus purely on squeezing out as much
performance as I can. This project now lives on as [Ultravisor][] - it is
still nowhere near done in a way that I like, but I still go back to it from
time to time to find potential performance improvements.

This is a story of the things that I did and learned during that journey.

> **Beware**: this is a retrospective, so in some places my memory may not be
> the best.

First I need to explain how Ultravisor works with database connections. There
are 2 modes of operation:

- `session` - each connection from the user to Ultravisor checks out one
  connection from Ultravisor to the database. The checkout happens once, at
  the start of the connection, and the connection is held until the end;
- `transaction` - nothing is done on connection. The client connects to
  Ultravisor and can keep that connection indefinitely without ever bothering
  the database. A database connection is checked out *only* when there is a
  request from the user, and it is returned to the pool as soon as the result
  of that query is returned and the DB is ready for the next one.

While `session` mode is quite on par with other implementations of connection
pooling for Postgres, `transaction` mode is where performance is lacking and
where the main focus is put. Throughout this article (unless mentioned
otherwise) I will be talking about the `transaction` mode of Ultravisor.

## Lesson: Flame graphs and call tracing are essential

A pretty obvious thing, but still a valuable lesson for any performance
optimisation endeavour. Here great thanks go to [Trevor Brown][Stratus3D] and
his awesome project [eFlambè][eflambe], which helped a lot in tracing hot
spots in the running code.

Unfortunately, this project seems to be less active recently and has some
missing features, like [listening for a given duration instead of a function
call count](https://github.com/Stratus3D/eflambe/issues/48). Fortunately,
this can be partially worked around by tracing a given number of calls to the
`handle_event/4` function several times and then running `cat *.bggg` to
concatenate all the files into one larger trace. That has its disadvantages,
but at least it was workable with [speedscope][], which I also highly
recommend to anyone who needs to work on such optimisations.

While flame graphs are awesome, gathering them with eFlambè comes at a cost -
it greatly affects performance. Fortunately, Erlang has some built-in tools
with a smaller performance impact, and the "most modern" of these is
[`tprof`][tprof]. This tool is pretty easy to use, but it is less detailed
than eFlambè. Even with that limitation, it provides superb insight into the
things that have the greatest impact on performance, and it makes it easier
to work with long-running processes, as it works asynchronously, so you can
"manually" decide how long you want to trace your process.
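For illustration, an ad-hoc `tprof` run looks roughly like this (a minimal
sketch - `tprof` ships with OTP 27+, and the profiled function here is just a
stand-in):

```elixir
# Run a function once and print a per-function breakdown when it returns.
# The default measurement is :call_count; for hot-spot hunting,
# :call_time (or :call_memory) is usually more interesting.
:tprof.profile(fn -> Enum.sort(Enum.shuffle(1..100_000)) end, %{type: :call_time})
```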
[Stratus3D]: https://github.com/Stratus3D
[eflambe]: https://github.com/Stratus3D/eflambe
[tprof]: https://www.erlang.org/doc/apps/tools/tprof.html
[speedscope]: https://www.speedscope.app/

**Summary:** Knowing where your bottlenecks are is essential for performance
optimisation.

## Lesson: Doing less can improve performance

An obvious thing that needs to be stated - doing nothing is faster than doing
something. Extracting the amount of data sent over a given socket using an
`:inet.getstat/2` call is fast, but not free. It involves waiting for a
response from either the port or the process handling the connection, which
introduces a slowdown. There are 2 possible solutions:

1. Do not gather that metric at all - sensible, but not feasible, especially
   when you use that metric to charge your users.
2. Gather that data less often.

The approach I took is obviously 2., and the solution is dumb simple - a
debouncer.

Debouncing is an interesting technique, often used in user interfaces, where
you accept an event and then ignore repeated events for some period. The
reason is that our interfaces may have flaws that send repeated events one
after another.

In this case Ultravisor tries to store the amount of sent data after each
query, but that can get expensive with many short queries. Instead I
implemented a simple per-process debouncer:

```elixir
defmodule Ultravisor.Debouncer do
  @doc """
  Call `func` at most once per `time` milliseconds for a given `key`,
  returning the cached result within the debounce window.
  """
  def debounce(key, time \\ 100, func) do
    current = System.monotonic_time(:millisecond)
    key = {:debounce, key}

    case Process.get(key, nil) do
      # Still within the debounce window - reuse the cached result
      {prev, ret} when prev + time > current ->
        ret

      # No cached entry, or the window has passed - recompute and cache
      _ ->
        ret = func.()
        Process.put(key, {current, ret})
        ret
    end
  end
end
```

This stores the returned data in the process dictionary (a per-process
mutable space with quick access), and if there was no call within the given
time window, we process the data again. This is a safe thing to do here, as
`:inet.getstat/2` always returns the amount of data the socket has processed
since it was opened, so the data between calls will still be accounted for.
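A call site then looks roughly like this (a hypothetical example - the
`socket` variable and the choice of counters are illustrative):

```elixir
# At most one :inet.getstat/2 round-trip per socket every 100 ms;
# within the window we reuse the previously fetched counters.
# stats is a keyword list like [recv_oct: ..., send_oct: ...]
{:ok, stats} =
  Ultravisor.Debouncer.debounce({:getstat, socket}, 100, fn ->
    :inet.getstat(socket, [:recv_oct, :send_oct])
  end)
```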
## Lesson: Telemetry is not free

When working on most projects, especially Phoenix-based ones, one can slap
`:telemetry.execute/3` calls everywhere and notice no performance
degradation[^telemetry]. Unfortunately, when you do hundreds of thousands of
calls a second, that is not the case.

[^telemetry]: For unaware readers - [Telemetry][] is an Erlang event
    dispatching system for observability events.

[Telemetry]: https://github.com/beam-telemetry/telemetry

In this project the metrics are exposed in Prometheus/OpenMetrics format,
which means there needs to be a collection system within the application. In
BEAM applications the standard way to implement that is to use ETS tables to
store the recorded values. Fortunately, there are libraries that handle that
for you, and for the longest time the "gold standard" was the
`telemetry_metrics_prometheus_core` library created by the Telemetry core
team.

While for most projects that library is performant enough (because metrics
aren't recorded in quite such tight loops), in this project it was not.
Metrics gathering is still one of the hottest spots in the codebase, even
with all the improvements that have been made.

An excerpt from a `tprof` profile:

| Function                               | Call count | Per call (μs) | Percentage |
| -------------------------------------- | ---------: | ------------: | ---------: |
| `Peep.EventHandler.store_metrics/5`    |    1421911 |          0.13 |      4.21% |
| `Peep.Storage.Striped.insert_metric/5` |     904852 |          0.26 |      5.11% |

This is with the awesome library [Peep][] by [Richard Kallos][rkallos]. When
using `telemetry_metrics_prometheus_core`, metrics gathering was simply the
most expensive thing in the whole loop. Just replacing the metrics library
with Peep gave us about a 2x bump in TPS.

[Peep]: https://github.com/rkallos/peep
[rkallos]: https://github.com/rkallos

**Summary:** The Telemetry handler can matter in tight loops. Fast metrics
gathering isn't easy.

## Lesson: Records instead of maps or structs

Elixir uses structs for structured data. That gives a lot of nice features
with regard to hot code reloading, compilation graph dependencies, and more.
However, because structs are maps, there is a cost. Maps have O(log n) access
time to their fields - a consequence of how [large maps are constructed in
memory][HAMT]. While smaller maps have slightly different (in most cases
better) characteristics, there is a strict requirement that you keep your
structure at or under 31 fields[^fields], and it still carries a slight
memory overhead. The alternative is to use [records][]. These have better
performance characteristics (always constant time, irrespective of the number
of fields) at the cost of being slightly more rigid (records are tuple-based)
and less convenient to use (experience may vary). An additional advantage, in
my opinion, is that it is harder to accidentally add an incorrect field by
using the `Map` module.

[^fields]: The current (OTP 28) limit for a small map is 32 keys, but Elixir
    uses one key for the struct name, hence the limit of 31 fields.

But before you run off and change all structs in your system to records, just
remember - most of the time the difference doesn't matter; just use structs.

[HAMT]: https://en.wikipedia.org/wiki/Hash_array_mapped_trie
[records]: https://hexdocs.pm/elixir/Record.html
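A minimal sketch of what a record-backed state looks like (the module and
fields here are made up for illustration, not Ultravisor's actual state):

```elixir
defmodule Ultravisor.ClientState do
  require Record

  # Underneath this is a plain tuple: {:client_state, socket, db_pid, mode},
  # so field access compiles down to constant-time element/2 calls.
  Record.defrecord(:client_state, socket: nil, db_pid: nil, mode: :transaction)

  # Records can be matched directly in function heads...
  def db_pid(client_state(db_pid: pid)), do: pid

  # ...and "updated" with the same macro (building a new tuple).
  def checkout(state, pid), do: client_state(state, db_pid: pid)
end
```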
Before:

```
tps = 81765.266264 (without initial connection time)
```

After:

```
tps = 82147.855889 (without initial connection time)
```

**Summary:** `Record`s are super handy when you need to squeeze out every bit
of performance. A single change doesn't provide much, but these things add up.

## Lesson: ETS tables are super fast, but not always

[ETS][] is Erlang's built-in mechanism for storing key-value data in a
mutable way - like a built-in Redis. It allows sharing data in a way that is
easy to access from different parts of the system. One example of a system
that uses ETS for storing its information is Telemetry (mentioned above).

While for 99% of use cases Telemetry will be fast enough, it has some
problems in very tight loops. The main problem is that ETS will always copy
data from the table to the caller process. That means it can put high memory
pressure on the process that retrieves the data.

Fortunately, Erlang supports another mechanism for storing globally
accessible data - `persistent_term`. Of course, there is no such thing as a
"free lunch", so it has a substantial disadvantage - it works poorly[^pt]
with data that changes often, as removing or changing a key requires walking
through all processes and copying the data into the memory of every process
that may use it. However, Telemetry handlers should not change a lot - you
should set them once as soon as your system starts, and then ideally they
will never change again.

[^pt]: There is a slight optimisation that makes it fast in some cases
    (single-word values, like atoms), but that is not the case here, so we
    can simply ignore it.
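The usage pattern is dead simple (a sketch - the key and the `handlers` value
are illustrative):

```elixir
# At boot: store data that will (almost) never change again.
:persistent_term.put({Ultravisor, :telemetry_handlers}, handlers)

# In the hot path: the lookup is fast and, unlike ETS, does not copy
# the term into the calling process's heap.
handlers = :persistent_term.get({Ultravisor, :telemetry_handlers})
```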
Before[^lower-tps]:

```
tps = 76914.004685 (without initial connection time)
```

After:

```
tps = 78479.006634 (without initial connection time)
```

[^lower-tps]: If you wonder why these results are lower than in the previous
    section - the test conditions are identical only within a section, not
    across sections. In this particular case I ran the benchmark while
    collecting metrics (to show the difference from the `persistent_term`
    change), while the others ran without metrics so as not to pollute the
    results.

**Summary:** `persistent_term` is awesome and super fast, so if you know that
you have some data that will probably never change and will be requested
*constantly*, then it may be a good place to store it.

[ETS]: https://www.erlang.org/doc/apps/stdlib/ets.html

## Lesson: Calling your `GenServer`s is fast, but not 90k times per second fast

One of the interesting things I spotted is that with longer-running queries -
ones that send more data over the network than just a simple short response -
the difference between Ultravisor and "state of the art" tools like
[PgBouncer][] or [PgDog][] (which are written in non-managed languages like C
and Rust) is much smaller (obviously it is still there, but it is on par
rather than substantially off).

I needed to dig into what could be the cause of such strange behaviour. The
reason was found in the place where I least expected it - checking out the
database connection to be used.

The flame graph showed that almost a third of the time was spent on checking
out database connections, and most of that time was spent in 2 function
calls. Both of them are internally `GenServer` calls, and in both most of the
time is spent sleeping (aka waiting for the reply).

<!-- TODO: Add image -->

Now, this one is a hard thing to optimise, as in Elixir there is no
mutability (almost - we will get there). This means that if I want some form
of shared queue of processes, then I need a separate process to keep the
queue state, and then `GenServer` calls to fetch that state. So what did I do
in this situation? What any unreasonable Elixir developer obsessed with
performance would do - a NIF[^ets].

[^ets]: I wanted to use ETS for this, but it lacks a function like
    `ets:take/2` that would return only one element from a table of type
    `bag` or `duplicate_bag` - or any other way of atomically taking out an
    arbitrary (possibly random) element from an ETS table.

The implementation is a rather basic wrapper over a few
[`VecDeque`s][rust::VecDeque] that allows popping a single element from the
queue without any message passing. The implementation is very crude and
nowhere near production ready - it doesn't provide any form of worker
restarts or anything - but it works quite well as a PoC of what is possible
(a rough sketch of the Elixir-side API closes this post).

The new queue also provides a way to store additional "metadata" alongside
the worker PID. This allows me to store the DB connection socket next to the
connection process, which removes the need for an additional call to extract
that data and lets requests be passed directly to the DB without copying data
between processes.

[rust::VecDeque]: https://doc.rust-lang.org/1.94.0/std/collections/struct.VecDeque.html
[PgBouncer]: https://www.pgbouncer.org/

Before:

```
tps = 83619.640673 (without initial connection time)
```

After:

```
tps = 94191.475386 (without initial connection time)
```

**Summary:** Sometimes one needs to get creative to get around platform
limitations. This may require some pesky NIFs though.
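For a flavour of what that looks like from the Elixir side, here is a
hypothetical sketch of such a NIF stub module (assuming
[Rustler](https://github.com/rusterlium/rustler); all names are illustrative
and this is not Ultravisor's actual API):

```elixir
defmodule Ultravisor.NifQueue do
  # Rustler swaps these stubs for the native implementation at load time;
  # the fallback bodies only run if the NIF failed to load.
  use Rustler, otp_app: :ultravisor, crate: "nif_queue"

  @doc "Create a new shared queue, held in a NIF resource."
  def new(), do: :erlang.nif_error(:nif_not_loaded)

  @doc "Enqueue a worker PID together with its metadata (e.g. the DB socket)."
  def push(_queue, _pid, _meta), do: :erlang.nif_error(:nif_not_loaded)

  @doc "Atomically pop one worker - no GenServer round-trip involved."
  def pop(_queue), do: :erlang.nif_error(:nif_not_loaded)
end
```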