blog/2024: add nomadic compute linkpost · xeiaso.net/site@2f5122a

lume/src/blog/2024/tigris-nomadic-compute.mdx

---
title: "Nomadic Infrastructure Design for AI Workloads"
date: 2024-11-12
redirect_to: "https://tigrisdata.com/blog/nomadic-compute/"
hero:
  ai: "Flux [dev] by Black Forest Labs"
  file: "_yj_eBqjMOIe0Bv-oQxoy"
  prompt: "A nomadic server hunts for GPUs, powered by Taco Bell"
---

Taco Bell is a miracle of food preparation. They manage to have a menu of dozens
of items that all boil down to permutations of the same few basic items: meat,
cheese, beans, vegetables, bread, and sauces. Those basic fundamentals are
combined in new and interesting ways to give you the Crunchwrap, the Chalupa,
the Doritos Locos Tacos, and more. Just add hot water and they’re ready to eat.

Even though the results are exciting, the ingredients for them are not. They’re
all really simple things. The best designed production systems I’ve ever used
take the same basic idea: build exciting things out of boring components that
are well understood across all facets of the industry (e.g. S3, Postgres, HTTP,
JSON, YAML, etc.). This adds up to your pitch deck aiming at disrupting the
industry-disrupting industry.

A bunch of companies want to sell you inference time for your AI workloads or
the results of them running inference on AI workloads for you, but nobody
really tells you how to make this yourself. That’s the special Mexican Pizza
sauce that you can’t replicate at home no matter how much you want to be able
to.

Today, we’ll cover how you, a random nerd that likes reading architectural
articles, should design a production-ready AI system so that you can maximize
effectiveness per dollar, reduce dependency lock-in, and separate concerns down
to their cores. Buckle up, it’s gonna be a ride.

<Conv name="Mara" mood="hacker">
  The industry uses like a billion different terms for “unit of compute that has
  access to a network connection and the ability to store things for some amount
  of time” that all conflict in mutually incompatible ways. When you read
  “workload”, you should think about some program that has network access to
  some network and some amount of storage through some means running somewhere,
  probably in a container.
</Conv>

## The fundamentals of any workload

At the core, any workload (computer games, iPadOS apps, REST APIs, Kubernetes,
$5 Hetzner VPSen, etc.) is a combination of three basic factors:

- Compute, or the part that executes code and does math
- Network, or the part that lets you dial and accept sockets
- Storage, or the part that remembers things for next time

In reality, these things will overlap a little (compute has storage in the form
of RAM, some network cards run their own Linux kernel, and storage is frequently
accessed over the network), but that still very cleanly maps to the basic things
that you’re billed for in the cloud:

- Gigabyte-core-seconds of compute
- Gigabytes egressed over the network
- Gigabytes stored in persistent storage

And of course, there’s a huge money premium for any of this being involved in AI
anything because people will pay. However, let’s take a look at that second
basic thing you’re billed for a bit closer:

> - Gigabytes egressed over the network

Note that it’s _egress_ out of your compute, not _ingress_ to your compute.
Providers generally want to make it easy for you to put your data into their
platform and harder to get the data back out. This is usually combined with your
storage layer, which can make it annoying and expensive to deal with data that
is bigger than your local disk.
Your local disk is frequently way too small to
store everything, so you have to make compromises.

What if your storage layer didn’t charge you per gigabyte of data you fetched
out of it? What classes of problems would that allow you to solve that were
previously too expensive to execute on?

If you put your storage in a service that is low-latency, close to your servers,
and has no egress fees, then it can actually be cheaper to pull things from
object storage just-in-time to use them than it is to store them persistently.

### Storage that is left idle is more expensive than compute time

In serverless (Lambda) scenarios, most of the time your application is turned
off. This is good. This is what you want. You want it to turn on when it’s
needed, and turn back off when it’s not. When you do a setup like this, you also
usually assume that the time it takes to do a cold start of the service is fast
enough that the user doesn’t mind.

Let’s say that your AI app requires 16 gigabytes of local disk space for your
Docker image with the inference engine and the downloaded model weights. In some
clouds (such as Vast.ai), this can cost you upwards of $4-10 per month to have
the data sitting there doing nothing, even if the actual compute time is as low
as $0.99 per hour. If you’re using Flux [dev] (12 billion parameters, 25 GB of
weight bytes) and those weights take 5 minutes to download, this means that you
are only spending $0.12 waiting for things to download. If you’re only doing
inference in bulk scenarios where latency doesn’t matter as much, then it can be
much, much cheaper to dynamically mint new instances, download the model weights
from object storage, do all of the inference you need, and then slay those
instances off when you’re done.
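
The arithmetic behind that tradeoff is easy to check yourself. Here is a minimal sketch using the figures above ($0.99 per hour of compute, a 5 minute download, $4 per month for an idle disk); the function names `downloadWaitCost` and `breakEvenColdStarts` are made up for illustration. At exactly $0.99 per hour the wait works out to about $0.08, so the $0.12 quoted above would correspond to a somewhat pricier instance.

```go
package main

import "fmt"

// downloadWaitCost returns the dollars spent renting an instance while it
// does nothing but wait for model weights to download.
func downloadWaitCost(dollarsPerHour, downloadMinutes float64) float64 {
	return dollarsPerHour * downloadMinutes / 60
}

// breakEvenColdStarts returns how many cold starts per month you can pay for
// before re-downloading costs more than keeping the disk allocated.
func breakEvenColdStarts(idleStoragePerMonth, waitCostPerStart float64) float64 {
	return idleStoragePerMonth / waitCostPerStart
}

func main() {
	wait := downloadWaitCost(0.99, 5)
	fmt.Printf("cost per cold start: $%.4f\n", wait)
	fmt.Printf("break-even cold starts per month at $4 idle: %.1f\n",
		breakEvenColdStarts(4, wait))
}
```

If your workload cold-starts fewer times per month than the break-even number, pulling weights on demand is the cheaper option under these assumptions.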

Most of the time, any production workload’s request rate is going to follow a
sinusoidal curve where there’s peak usage for about 8 hours in the middle of the
day and things will fall off overnight as everyone goes to bed. If you spin up
AI inference servers on demand following this curve, this means that the first
person of the day to use an AI feature could have it take a bit longer for the
server to get its coffee, but it’ll be hot’n’ready for the next user when they
use that feature.

You can even cheat further with optional features such that the first user
doesn’t actually see them, but it triggers the AI inference backend to wake up
for the next request.

### It may not be your money, but the amounts add up

When you set up cloud compute, it’s really easy to fall prey to the siren song
of the seemingly bottomless budget of the corporate card. At a certain point, we
all need to build a sustainable business as the AI hype wears off and the free
tier ends. However, thanks to the idea of Taco Bell infrastructure design, you
can reduce the risk of lock-in and increase flexibility between providers so you
can lower your burn rate.

In many platforms, data ingress is free. Data _egress_ is where they get you.
It’s such a problem for businesses that the
[EU has had to step in and tell providers that people need an easy way out](https://commission.europa.eu/news/data-act-enters-force-what-it-means-you-2024-01-11_en).
Every gigabyte of data you put into those platforms is another $0.05 that it’ll
cost to move away should you need to.

This doesn’t sound like an issue, because the CTO negotiating dream is that
they’ll be able to play the “we’re gonna move our stuff elsewhere” card and
instantly win a discount and get a fantastic deal that will enable future growth
or whatever.

This is a nice dream.

In reality, the sales representative has a number in big red letters in front of
them. This number is the amount of money it would cost for you to move your 3
petabytes of data off of their cloud. You both know you’re stuck with each
other, and you’ll happily take an additional measly 5% discount on top of the
10% discount you negotiated last year. We all know that the actual cost of
running the service is 15% of even that cost; but the capitalism machine has to
eat somehow, right?

## On the nature of dependencies

Let’s be real, dependencies aren’t fundamentally bad things to have. All of us
have a hard dependency on the Internet, amd64 CPUs, water, and storage.
Everything’s a tradeoff. The potentially harmful part comes in when your
dependency locks you in so you can’t switch away easily.

This is normally pretty bad with traditional compute setups, but can be extra
insidious with AI workloads. AI workloads make cloud companies staggering
amounts of money, so they want to make sure that you keep your AI workloads on
their servers as much as possible so they can extract as much revenue out of you
as possible. Combine this with the big red number disadvantage in negotiations,
and you can find yourself backed into a corner.

### Strategic dependency choice

This is why picking your dependencies is such a huge thing to consider. There’s
a lot to be said about choosing dependencies to minimize vendor lock-in, and
that’s where the Taco Bell infrastructure philosophy comes in:

- Trigger compute with HTTP requests that use well-defined schemata.
- Find your target using DNS.
- Store things you want to keep in Postgres or object storage.
- Fetch things out of storage when you need them.
- Mint new workers when there is work to be done.
- Slay those workers off when they’re not needed anymore.

If you follow these rules, you can easily make your compute nomadic between
services. Capitalize on things like Kubernetes (the universal API for cloud
compute, as much as I hate that it won), and you make the underlying clouds an
implementation detail that can be swapped out as you find better strategic
partnerships that can offer you more than a measly 5% discount.

Just add water.

### How AI models become dependencies

There's an extra evil way that AI models can become production-critical
dependencies. Most of the time when you implement an application that uses an AI
model, you end up encoding "workarounds" for the model into the prompts you use.
This happens because AI models are fundamentally unpredictable and unreliable
tools that sometimes give you the output you want. As a result though, changing
out models _sounds_ like it's something that should be easy. You _just_ change
out the model and then you can take advantage of better accuracy, new features
like tool use, or JSON schema prompting, right?

In many cases, changing out a model will result in a service that superficially
looks and functions the same. You give it a meeting transcript, it tells you
what the action items are. The problem comes in with the subtle nuances of the
je ne sais quoi of the experience. Even subtle differences like
[the current date being in the month of December](https://arstechnica.com/information-technology/2023/12/is-chatgpt-becoming-lazier-because-its-december-people-run-tests-to-find-out/)
can _drastically_ change the quality of output. A
[recent paper from Apple](https://arxiv.org/pdf/2410.05229) concluded that
adding superficial details that wouldn't throw off a human can severely impact
the performance of large language models.
Heck, they even struggle or fall prey
to fairly trivial questions that humans find easy, such as:

- How many r's are in the word "strawberry"?
- What's heavier: 2 pounds of bricks, one pound of heavy strawberries, or three
  pounds of air?

If changing the placement of a comma in a prompt can cause such huge impacts to
the user experience, what would changing the model do? What about being forced
to change the model because the provider is deprecating it so they can run newer
models that don't do the job as well as the model you currently use? This is a
really evil kind of dependency that you can only get when you rely on
cloud-hosted models. By controlling the weights and inference setups for your
machines, you have a better chance of being able to dictate the future of your
product and control all parts of the stack as much as possible.

## How it’s made prod-ready

Like I said earlier, the three basic needs of any workload are compute, network,
and storage. Production architectures usually have three basic planes to support
them:

- The compute plane, which is almost certainly going to be either Docker or
  Kubernetes somehow.
- The network plane, which will be a Virtual Private Cloud (VPC) or overlay
  network that knits clusters together.
- The storage plane, which is usually the annoying exercise left to the reader,
  leading you to make yet another case for either using NFS or sparkly NFS like
  Longhorn.

Storage is the sticky bit; it’s not really changed since the beginning. You
either use a POSIX-compatible key-value store or an S3 compatible key-value
store. Both are used in practically the same ways that the framers intended in
the late 80s and 2006 respectively. You chuck bytes into the system with a
name, and you get the bytes back when you give the name.

Storage is the really important part of your workloads. Your phone would not be
as useful if it didn’t remember your list of text messages when you rebooted it.
Many applications also (reasonably) assume that storage always works, is fast
enough that it’s not an issue, and is durable enough that they don’t have to
manually make backups.

What about latency? Human reaction time is about 250 milliseconds on average. It
takes about 250 milliseconds for a TCP session to be established between Berlin
and us-east-1. If you move your compute between providers, is your storage plane
also going to move data around to compensate?

If your storage plane doesn’t have egress costs and stores your data close to
where it’s used, this eliminates a lot of local storage complexity, at the cost
of additional compute time spent waiting to pull things and the network
throughput for them to arrive. Somehow compute is cheaper than storage in anno
Domini two-thousand twenty-four. No, I don’t get how that happened either.

### Pass-by-reference semantics for the cloud

Part of the secret for how people make these production platforms is that they
cheat: they don’t pass around values as much as possible. They pass a reference
to that value in the storage plane. When you upload an image to the ChatGPT API
to see if it’s a picture of a horse, you do a file upload call and then an
inference call with the ID of that upload. This makes it easier to sling bytes
around and overall makes things a lot more efficient at the design level. This
is a lot like pass-by-reference semantics in programming languages like Java or
a pointer to a value in Go.

### The big queue

The other big secret is that there’s a layer on top of all of the compute: an
orchestrator with a queue.

This is the rest of the owl that nobody talks about. Just having compute,
network, and storage is not good enough; there needs to be a layer on top that
spreads the load between workers, intelligently minting and slaying them off as
reality demands.

## Okay but where’s the code?

Yeah, yeah, I get it, you want to see this live and in action. I don’t have an
example totally ready yet, but in lieu of drawing the owl right now, I can tell
you what you’d need in order to make it a reality on the cheap.

Let’s imagine that this is all done in one app, let’s call it orodayagzou (cf.
[Ôrödyagzou](https://www.youtube.com/watch?v=uuYmkZ-Aomo), Ithkuil for
“synesthesia”). This app is both an HTTP API and an orchestrator. It manages a
pool of worker nodes that do the actual AI inferencing.

So let’s say a user submits a request asking for a picture of a horse. That’ll
come in to the right HTTP route and it has logic like this:

```go
import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync"
	"time"
)

type ScaleToZeroProxy struct {
	cfg         Config
	ready       bool
	endpointURL string
	instanceID  int
	lock        sync.RWMutex
	lastUsed    time.Time
}

func (s *ScaleToZeroProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	s.lock.RLock()
	ready := s.ready
	s.lock.RUnlock()

	if !ready {
		// TODO: implement instance creation
	}

	// Copy the endpoint under the read lock and release it before proxying.
	// Holding the read lock until return would deadlock with Lock() below.
	s.lock.RLock()
	endpointURL := s.endpointURL
	s.lock.RUnlock()

	u, err := url.Parse(endpointURL)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// The reverse proxy appends the request's path and query to the target
	// by itself, so they must not be copied onto u here.
	next := httputil.NewSingleHostReverseProxy(u)

	next.ServeHTTP(w, r)

	s.lock.Lock()
	s.lastUsed = time.Now()
	s.lock.Unlock()
}
```

This is a simple little HTTP proxy in Go. It has an endpoint URL and an instance
ID in memory, some logic to check if the
instance is “ready”, and if it’s not
then to create it. Let’s mint an instance using the [Vast.ai](http://Vast.ai)
CLI. First, some configuration:

```go
const (
	diskNeeded       = 36
	dockerImage      = "reg.xeiaso.net/runner/sdxl-tigris:latest"
	httpPort         = 5000
	modelBucketName  = "ciphanubakfu" // lojban: test-number-bag
	modelPath        = "glides/ponyxl"
	onStartCommand   = "python -m cog.server.http"
	publicBucketName = "xe-flux"

	searchCaveats = `verified=False cuda_max_good>=12.1 gpu_ram>=12 num_gpus=1 inet_down>=450`

	// assume awsAccessKeyID, awsSecretAccessKey, awsRegion, and awsEndpointURLS3 exist
)

type Config struct {
	diskNeeded     int // gigabytes
	dockerImage    string
	environment    map[string]string
	httpPort       int
	onStartCommand string
}
```

Then we can search for potential machines with some terrible wrappers to the
CLI:

```go
func runJSON[T any](ctx context.Context, args ...any) (T, error) {
	return trivial.andThusAnExerciseForTheReader[T](ctx, args)
}

func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	s.lock.Lock()
	defer s.lock.Unlock()
	candidates, err := runJSON[[]vastai.SearchResponse](
		ctx,
		"vastai", "search", "offers",
		searchCaveats,
		"-o", "dph+", // sort by price (dollars per hour) increasing, cheapest option is first
		"--raw", // output JSON
	)
	if err != nil {
		return fmt.Errorf("can't search for instances: %w", err)
	}
	if len(candidates) == 0 {
		// indexing an empty result below would panic
		return fmt.Errorf("no instances match %q", searchCaveats)
	}

	// grab the cheapest option
	candidate := candidates[0]

	contractID := candidate.AskContractID
	slog.Info("found candidate instance",
		"contractID", contractID,
		"gpuName", candidate.GPUName,
		"cost", candidate.Search.TotalHour,
	)
	// ...
}
```

Then you can try to create it:

```go
func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	// ...
	instanceData, err := runJSON[vastai.NewInstance](
		ctx,
		"vastai", "create", "instance",
		contractID,
		"--image", s.cfg.dockerImage,
		// dump ports and envvars into format vast.ai wants
		"--env", s.cfg.FormatEnvString(),
		"--disk", s.cfg.diskNeeded,
		"--onstart-cmd", s.cfg.onStartCommand,
		"--raw",
	)
	if err != nil {
		return fmt.Errorf("can't create new instance: %w", err)
	}

	slog.Info("created new instance", "instanceID", instanceData.NewContract)
	s.instanceID = instanceData.NewContract
	// ...
```

Then collect the endpoint URL:

```go
func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	// ...
	instance, err := runJSON[vastai.Instance](
		ctx,
		"vastai", "show", "instance",
		instanceData.NewContract,
		"--raw",
	)
	if err != nil {
		return fmt.Errorf("can't show instance %d: %w", instanceData.NewContract, err)
	}

	// the struct field is unexported, so this is endpointURL, not EndpointURL
	s.endpointURL = fmt.Sprintf(
		"http://%s:%d",
		instance.PublicIPAddr,
		instance.Ports[fmt.Sprintf("%d/tcp", s.cfg.httpPort)][0].HostPort,
	)

	return nil
}
```

And then finally wire it up and have it test if the instance is ready somehow:

```go
func (s *ScaleToZeroProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// ...

	if !ready {
		if err := s.mintInstance(r.Context()); err != nil {
			slog.Error("can't mint new instance", "err", err)
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		// poll every 5 seconds until the instance reports ready
		t := time.NewTicker(5 * time.Second)
		defer t.Stop()
		for range t.C {
			if ok := s.testReady(r.Context()); ok {
				break
			}
		}
	}

	// ...
```

Then the rest of the logic will run through, the request will be passed to the
GPU instance and then a response will be fired. All that’s left is to slay the
instances off when they’re unused for about 5 minutes:

```go
func (s *ScaleToZeroProxy) maybeSlayLoop(ctx context.Context) {
	t := time.NewTicker(5 * time.Minute)
	defer t.Stop()

	for {
		select {
		case <-t.C:
			s.lock.RLock()
			lastUsed := s.lastUsed
			s.lock.RUnlock()

			// time.Now must be called; passing the function itself won't compile
			if lastUsed.Add(5 * time.Minute).Before(time.Now()) {
				if err := s.slay(ctx); err != nil {
					slog.Error("can't slay instance", "err", err)
				}
			}
		case <-ctx.Done():
			return
		}
	}
}
```

Et voilà! Run `maybeSlayLoop` in the background and implement the `slay()`
method to use the `vastai destroy instance` command, then you have yourself
nomadic compute that makes and destroys itself on demand to the lowest bidder.

Of course, any production-ready implementation would have limits like “don’t
have more than 20 workers” and segment things into multiple work queues. This is
all really hypothetical right now. I wish I had a thing that you could
`kubectl apply` and use right now, but I don’t.

I’m going to be working on this on my Friday streams
[on Twitch](https://twitch.tv/princessxen) until it’s done.
I’m going to
implement it from an empty folder and then work on making it a Kubernetes
operator to run any task you want. It’s going to involve generative AI, API
reverse engineering, eternal torment, and hopefully not getting banned from the
providers I’m going to be using. It should be a blast!

## Conclusion

Every workload involves compute, network, and storage on top of production’s
compute plane, network plane, and storage plane. Design your production clusters
to take advantage of very well-understood fundamentals like HTTP, queues, and
object storage so that you can reduce your dependencies to the bare minimum.
Make your app an orchestrator of vast amounts of cheap compute so you don’t need
to pay for compute or storage that nobody is using while everyone is asleep.

This basic pattern is applicable to just about anything on any platform, not
just AI and not just with Tigris. We hope that by publishing this architectural
design, you’ll take it to heart when building your production workloads of the
future so that we can all use the cloud responsibly. Certain parts of the
economics of this pattern work best when you have free (or basically free)
egress costs though.

We’re excited about building the best possible storage layer based on the
lessons learned building the storage layer Uber uses to service millions of
rides per month. If you try us and disagree, that’s fine, we won’t nickel and
dime you on the way out because we don’t charge egress costs.

When all of these concerns are made easier, all that’s left for you is to draw
the rest of the owl and get out there disrupting industries.
