---
title: "Nomadic Infrastructure Design for AI Workloads"
date: 2024-11-12
redirect_to: "https://tigrisdata.com/blog/nomadic-compute/"
hero:
  ai: "Flux [dev] by Black Forest Labs"
  file: "_yj_eBqjMOIe0Bv-oQxoy"
  prompt: "A nomadic server hunts for GPUs, powered by Taco Bell"
---

Taco Bell is a miracle of food preparation. They manage to have a menu of dozens
of items that all boil down to permutations of six basic items: meat, cheese,
beans, vegetables, bread, and sauces. Those basic fundamentals are combined in
new and interesting ways to give you the Crunchwrap, the Chalupa, the Doritos
Locos Tacos, and more. Just add hot water and they’re ready to eat.

Even though the results are exciting, the ingredients for them are not. They’re
all really simple things. The best designed production systems I’ve ever used
take the same basic idea: build exciting things out of boring components that
are well understood across all facets of the industry (e.g. S3, Postgres, HTTP,
JSON, YAML, etc.). This adds up to your pitch deck aiming at disrupting the
industry-disrupting industry.

A bunch of companies want to sell you inference time for your AI workloads, or
the results of running AI workloads for you, but nobody really tells you how to
make this yourself. That’s the special Mexican Pizza sauce that you can’t
replicate at home no matter how much you want to be able to.

Today, we’ll cover how you, a random nerd who likes reading architectural
articles, should design a production-ready AI system so that you can maximize
effectiveness per dollar, reduce dependency lock-in, and separate concerns down
to their cores. Buckle up, it’s gonna be a ride.

<Conv name="Mara" mood="hacker">
  The industry uses like a billion different terms for “unit of compute that has
  access to a network connection and the ability to store things for some amount
  of time” that all conflict in mutually incompatible ways. When you read
  “workload”, you should think about some program that has network access to
  some network and some amount of storage through some means running somewhere,
  probably in a container.
</Conv>

## The fundamentals of any workload

At the core, any workload (computer games, iPadOS apps, REST APIs, Kubernetes,
$5 Hetzner VPSen, etc.) is a combination of three basic factors:

- Compute, or the part that executes code and does math
- Network, or the part that lets you dial and accept sockets
- Storage, or the part that remembers things for next time

In reality, these things will overlap a little (compute has storage in the form
of RAM, some network cards run their own Linux kernel, and storage is frequently
accessed over the network), but that still very cleanly maps to the basic things
that you’re billed for in the cloud:

- Gigabyte-core-seconds of compute
- Gigabytes egressed over the network
- Gigabytes stored in persistent storage

And of course, there’s a huge money premium for any of this being involved in AI
anything because people will pay. However, let’s take a closer look at that
second basic thing you’re billed for:

> - Gigabytes egressed over the network

Note that it’s _egress_ out of your compute, not _ingress_ to your compute.
Providers generally want to make it easy for you to put your data into their
platform and harder to get the data back out. This is usually combined with your
storage layer, which can make it annoying and expensive to deal with data that
is bigger than your local disk. Your local disk is frequently way too small to
store everything, so you have to make compromises.

What if your storage layer didn’t charge you per gigabyte of data you fetched
out of it? What classes of problems would that allow you to solve that were
previously too expensive to execute on?

If you put your storage in a service that is low-latency, close to your servers,
and has no egress fees, then it can actually be cheaper to pull things from
object storage just-in-time to use them than it is to store them persistently.

### Storage that is left idle is more expensive than compute time

In serverless (Lambda) scenarios, most of the time your application is turned
off. This is good. This is what you want. You want it to turn on when it’s
needed, and turn back off when it’s not. When you do a setup like this, you also
usually assume that the time it takes to do a cold start of the service is fast
enough that the user doesn’t mind.

Let’s say that your AI app requires 16 gigabytes of local disk space for your
Docker image with the inference engine and the downloaded model weights. In some
clouds (such as Vast.ai), this can cost you upwards of $4-10 per month to have
the data sitting there doing nothing, even if the actual compute time is as low
as $0.99 per hour. If you’re using Flux [dev] (12 billion parameters, 25 GB of
weight bytes) and those weights take 5 minutes to download, this means that you
are only spending about $0.08 waiting for things to download. If you’re only
doing inference in bulk scenarios where latency doesn’t matter as much, then it
can be much, much cheaper to dynamically mint new instances, download the model
weights from object storage, do all of the inference you need, and then slay
those instances off when you’re done.

Most of the time, any production workload’s request rate is going to follow a
sinusoidal curve where there’s peak usage for about 8 hours in the middle of the
day and things will fall off overnight as everyone goes to bed. If you spin up
AI inference servers on demand following this curve, this means that the first
person of the day to use an AI feature could have it take a bit longer for the
server to get its coffee, but it’ll be hot’n’ready for the next user when they
use that feature.

You can even cheat further with optional features such that the first user
doesn’t actually see them, but it triggers the AI inference backend to wake up
for the next request.

### It may not be your money, but the amounts add up

When you set up cloud compute, it’s really easy to fall prey to the siren song
of the seemingly bottomless budget of the corporate card. At a certain point, we
all need to build a sustainable business as the AI hype wears off and the free
tier ends. However, thanks to the idea of Taco Bell infrastructure design, you
can reduce the risk of lock-in and increase flexibility between providers so you
can lower your burn rate.

In many platforms, data ingress is free. Data _egress_ is where they get you.
It’s such a problem for businesses that the
[EU has had to step in and tell providers that people need an easy way out](https://commission.europa.eu/news/data-act-enters-force-what-it-means-you-2024-01-11_en).
Every gigabyte of data you put into those platforms is another $0.05 that it’ll
cost to move away should you need to.

This doesn’t sound like an issue, because the CTO negotiating dream is that
they’ll be able to play the “we’re gonna move our stuff elsewhere” card and
instantly win a discount and get a fantastic deal that will enable future growth
or whatever.

This is a nice dream.

In reality, the sales representative has a number in big red letters in front of
them. This number is the amount of money it would cost for you to move your 3
petabytes of data off of their cloud. You both know you’re stuck with each
other, and you’ll happily take an additional measly 5% discount on top of the
10% discount you negotiated last year. We all know that the actual cost of
running the service is 15% of even that cost; but the capitalism machine has to
eat somehow, right?
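
The big red number is easy to estimate yourself. Here is a back-of-the-envelope sketch, assuming the $0.05 per gigabyte figure above and decimal units (1 PB = 1,000,000 GB); `egressCost` is a hypothetical helper name.

```go
package main

import "fmt"

// egressCost returns the cost in dollars of moving data out of a provider,
// given the amount stored in gigabytes and the per-gigabyte egress price.
func egressCost(gigabytes, dollarsPerGB float64) float64 {
	return gigabytes * dollarsPerGB
}

func main() {
	const petabyte = 1_000_000.0 // gigabytes, using decimal units

	// Assumed figures from the text: 3 PB stored, $0.05 per GB egressed.
	fmt.Printf("moving 3 PB out costs $%.0f\n", egressCost(3*petabyte, 0.05))
}
```

That works out to about $150,000 just to leave.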

## On the nature of dependencies

Let’s be real, dependencies aren’t fundamentally bad things to have. All of us
have a hard dependency on the Internet, amd64 CPUs, water, and storage.
Everything’s a tradeoff. The potentially harmful part comes in when your
dependency locks you in so you can’t switch away easily.

This is normally pretty bad with traditional compute setups, but can be extra
insidious with AI workloads. AI workloads make cloud companies staggering
amounts of money, so they want to make sure that you keep your AI workloads on
their servers as much as possible so they can extract as much revenue out of you
as possible. Combine this with the big red number disadvantage in negotiations,
and you can find yourself backed into a corner.

### Strategic dependency choice

This is why picking your dependencies is such a huge thing to consider. There’s
a lot to be said about choosing dependencies to minimize vendor lock-in, and
that’s where the Taco Bell infrastructure philosophy comes in:

- Trigger compute with HTTP requests that use well-defined schemata.
- Find your target using DNS.
- Store things you want to keep in Postgres or object storage.
- Fetch things out of storage when you need them.
- Mint new workers when there is work to be done.
- Slay those workers off when they’re not needed anymore.

If you follow these rules, you can easily make your compute nomadic between
services. Capitalize on things like Kubernetes (the universal API for cloud
compute, as much as I hate that it won), and you make the underlying clouds an
implementation detail that can be swapped out as you find better strategic
partnerships that can offer you more than a measly 5% discount.

Just add water.

### How AI models become dependencies

There's an extra evil way that AI models can become production-critical
dependencies. Most of the time when you implement an application that uses an AI
model, you end up encoding "workarounds" for the model into the prompts you use.
This happens because AI models are fundamentally unpredictable and unreliable
tools that sometimes give you the output you want. Even so, changing out models
_sounds_ like it's something that should be easy. You _just_ change out the
model and then you can take advantage of better accuracy, new features like tool
use, or JSON schema prompting, right?

In many cases, changing out a model will result in a service that superficially
looks and functions the same. You give it a meeting transcript, it tells you
what the action items are. The problem comes in with the subtle nuances of the
je ne sais quoi of the experience. Even subtle differences like
[the current date being in the month of December](https://arstechnica.com/information-technology/2023/12/is-chatgpt-becoming-lazier-because-its-december-people-run-tests-to-find-out/)
can _drastically_ change the quality of output. A
[recent paper from Apple](https://arxiv.org/pdf/2410.05229) concluded that
adding superficial details that wouldn't throw off a human can severely impact
the performance of large language models. Heck, they even struggle or fall prey
to fairly trivial questions that humans find easy, such as:

- How many r's are in the word "strawberry"?
- What's heavier: 2 pounds of bricks, one pound of heavy strawberries, or three
  pounds of air?

If changing the placement of a comma in a prompt can cause such huge impacts to
the user experience, what would changing the model do? What about being forced
to change the model because the provider is deprecating it so they can run newer
models that don't do the job as well as the model you currently use? This is a
really evil kind of dependency that you can only get when you rely on
cloud-hosted models. By controlling the weights and inference setups for your
machines, you have a better chance of being able to dictate the future of your
product and control all parts of the stack as much as possible.

## How it’s made prod-ready

Like I said earlier, the three basic needs of any workload are compute, network,
and storage. Production architectures usually have three basic planes to support
them:

- The compute plane, which is almost certainly going to be either Docker or
  Kubernetes somehow.
- The network plane, which will be a Virtual Private Cloud (VPC) or overlay
  network that knits clusters together.
- The storage plane, which is usually the annoying exercise left to the reader,
  leading you to make yet another case for either using NFS or sparkly NFS like
  Longhorn.

Storage is the sticky bit; it’s not really changed since the beginning. You
either use a POSIX-compatible key-value store or an S3-compatible key-value
store. Both are used in practically the same ways that the framers intended in
the late ’80s and 2006 respectively. You chuck bytes into the system with a
name, and you get the bytes back when you give the name.

Storage is the really important part of your workloads. Your phone would not be
as useful if it didn’t remember your list of text messages when you rebooted it.
Many applications also (reasonably) assume that storage always works, is fast
enough that it’s not an issue, and is durable enough that they don’t have to
manually make backups.

What about latency? Human reaction time is about 250 milliseconds on average. It
takes about 250 milliseconds for a TCP session to be established between Berlin
and us-east-1. If you move your compute between providers, is your storage plane
also going to move data around to compensate?

If your storage plane doesn’t have egress costs and stores your data close to
where it’s used, this eliminates a lot of local storage complexity, at the cost
of additional compute time spent waiting to pull things and the network
throughput for them to arrive. Somehow compute is cheaper than storage in anno
Domini two-thousand twenty-four. No, I don’t get how that happened either.

### Pass-by-reference semantics for the cloud

Part of the secret for how people make these production platforms is that they
cheat: they don’t pass around values as much as possible. They pass a reference
to that value in the storage plane. When you upload an image to the ChatGPT API
to see if it’s a picture of a horse, you do a file upload call and then an
inference call with the ID of that upload. This makes it easier to sling bytes
around and overall makes things a lot more efficient at the design level. This
is a lot like pass-by-reference semantics in programming languages like Java or
a pointer to a value in Go.
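
As a rough illustration of the idea, here is a minimal sketch in Go. An in-memory map stands in for the storage plane, and the request carries only a key. Every name here (`objectStore`, `inferenceRequest`, `handle`, the `image_key` field) is hypothetical and not any real provider's API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// objectStore stands in for the storage plane (any S3-style bucket).
type objectStore map[string][]byte

// put stores data and returns a reference to it: here, its content hash.
func (o objectStore) put(data []byte) string {
	sum := sha256.Sum256(data)
	key := hex.EncodeToString(sum[:])
	o[key] = data
	return key
}

// inferenceRequest carries a reference to the input, not the input itself.
type inferenceRequest struct {
	Prompt   string `json:"prompt"`
	ImageKey string `json:"image_key"`
}

// handle is what a worker would do: resolve the reference only at the point
// where it needs the bytes. It returns the number of bytes it fetched.
func handle(store objectStore, req inferenceRequest) (int, error) {
	data, ok := store[req.ImageKey]
	if !ok {
		return 0, fmt.Errorf("no object with key %q", req.ImageKey)
	}
	return len(data), nil
}

func main() {
	store := objectStore{}
	key := store.put([]byte("pretend this is a JPEG of a horse"))

	n, err := handle(store, inferenceRequest{Prompt: "is this a horse?", ImageKey: key})
	fmt.Println(n, err)
}
```

The request stays a few dozen bytes no matter how big the upload was, which is what lets the orchestrator hand it to whichever worker happens to be free.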

### The big queue

The other big secret is that there’s a layer on top of all of the compute: an
orchestrator with a queue.

This is the rest of the owl that nobody talks about. Just having compute,
network, and storage is not good enough; there needs to be a layer on top that
spreads the load between workers, intelligently minting and slaying them off as
reality demands.
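
The core decision that layer makes can be sketched in a few lines. This is only one possible policy, assuming you know the queue depth and roughly how many queued jobs one worker can absorb; `desiredWorkers` and its parameters are hypothetical names.

```go
package main

import "fmt"

// desiredWorkers returns how many workers to run for the current queue depth.
// jobsPerWorker is how many queued jobs one worker is expected to absorb, and
// maxWorkers caps spending. An empty queue means zero workers (scale to zero).
func desiredWorkers(queueDepth, jobsPerWorker, maxWorkers int) int {
	if queueDepth <= 0 || jobsPerWorker <= 0 || maxWorkers <= 0 {
		return 0
	}
	n := (queueDepth + jobsPerWorker - 1) / jobsPerWorker // round up
	if n > maxWorkers {
		return maxWorkers
	}
	return n
}

func main() {
	for _, depth := range []int{0, 1, 25, 1000} {
		fmt.Printf("queue depth %4d -> %d workers\n", depth, desiredWorkers(depth, 10, 20))
	}
}
```

The orchestrator would compare this number with the workers it currently has, mint the difference when it is short, and slay the surplus when it is over.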

## Okay but where’s the code?

Yeah, yeah, I get it, you want to see this live and in action. I don’t have an
example totally ready yet, but in lieu of drawing the owl right now, I can tell
you what you’d need in order to make it a reality on the cheap.

Let’s imagine that this is all done in one app, let’s call it orodayagzou (cf.
[Ôrödyagzou](https://www.youtube.com/watch?v=uuYmkZ-Aomo), Ithkuil for
“synesthesia”). This app is both an HTTP API and an orchestrator. It manages a
pool of worker nodes that do the actual AI inferencing.

So let’s say a user submits a request asking for a picture of a horse. That’ll
come in to the right HTTP route and it has logic like this:

```go
import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync"
	"time"
)

type ScaleToZeroProxy struct {
	cfg         Config
	ready       bool
	endpointURL string
	instanceID  int
	lock        sync.RWMutex
	lastUsed    time.Time
}

func (s *ScaleToZeroProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	s.lock.RLock()
	ready := s.ready
	s.lock.RUnlock()

	if !ready {
		// TODO: implement instance creation
	}

	// Copy the endpoint under the read lock and release it right away. Holding
	// the read lock until return would deadlock with the write lock below.
	s.lock.RLock()
	endpointURL := s.endpointURL
	s.lock.RUnlock()

	u, err := url.Parse(endpointURL)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// NewSingleHostReverseProxy already joins the incoming path and query onto
	// the target, so copying them onto u as well would double them up.
	next := httputil.NewSingleHostReverseProxy(u)

	next.ServeHTTP(w, r)

	s.lock.Lock()
	s.lastUsed = time.Now()
	s.lock.Unlock()
}
```

This is a simple little HTTP proxy in Go. It has an endpoint URL and an instance
ID in memory, some logic to check if the instance is “ready”, and a spot to
create it if it’s not. Let’s mint an instance using the
[Vast.ai](http://Vast.ai) CLI. First, some configuration:

```go
const (
	diskNeeded       = 36
	dockerImage      = "reg.xeiaso.net/runner/sdxl-tigris:latest"
	httpPort         = 5000
	modelBucketName  = "ciphanubakfu" // lojban: test-number-bag
	modelPath        = "glides/ponyxl"
	onStartCommand   = "python -m cog.server.http"
	publicBucketName = "xe-flux"

	searchCaveats = `verified=False cuda_max_good>=12.1 gpu_ram>=12 num_gpus=1 inet_down>=450`

	// assume awsAccessKeyID, awsSecretAccessKey, awsRegion, and awsEndpointURLS3 exist
)

type Config struct {
	diskNeeded     int // gigabytes
	dockerImage    string
	environment    map[string]string
	httpPort       int
	onStartCommand string
}
```

Then we can search for potential machines with some terrible wrappers to the
CLI:

```go
func runJSON[T any](ctx context.Context, args ...any) (T, error) {
	return trivial.andThusAnExerciseForTheReader[T](ctx, args)
}

func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	s.lock.Lock()
	defer s.lock.Unlock()
	candidates, err := runJSON[[]vastai.SearchResponse](
		ctx,
		"vastai", "search", "offers",
		searchCaveats,
		"-o", "dph+", // sort by price (dollars per hour) increasing, cheapest option is first
		"--raw", // output JSON
	)
	if err != nil {
		return fmt.Errorf("can't search for instances: %w", err)
	}

	// an empty result would make candidates[0] panic
	if len(candidates) == 0 {
		return fmt.Errorf("no offers match %q", searchCaveats)
	}

	// grab the cheapest option
	candidate := candidates[0]

	contractID := candidate.AskContractID
	slog.Info("found candidate instance",
		"contractID", contractID,
		"gpuName", candidate.GPUName,
		"cost", candidate.Search.TotalHour,
	)
	// ...
}
```

Then you can try to create it:

```go
func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	// ...
	instanceData, err := runJSON[vastai.NewInstance](
		ctx,
		"vastai", "create", "instance",
		contractID,
		"--image", s.cfg.dockerImage,
		// dump ports and envvars into format vast.ai wants
		"--env", s.cfg.FormatEnvString(),
		"--disk", s.cfg.diskNeeded,
		"--onstart-cmd", s.cfg.onStartCommand,
		"--raw",
	)
	if err != nil {
		return fmt.Errorf("can't create new instance: %w", err)
	}

	slog.Info("created new instance", "instanceID", instanceData.NewContract)
	s.instanceID = instanceData.NewContract
	// ...
```
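
`FormatEnvString` isn't shown above. Here is a sketch of what it might do, assuming the CLI wants Docker-style `-e KEY=value` and `-p host:container` flags packed into a single string. Treat that format as an assumption and check it against the CLI's help output before relying on it. The `Config` here is a cut-down copy, and the environment values are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Config is a cut-down copy of the Config above with only the fields used here.
type Config struct {
	environment map[string]string
	httpPort    int
}

// FormatEnvString renders the environment and the exposed port as Docker-style
// flags, such as "-e KEY=value -p 5000:5000". Keys are sorted so the output is
// stable. Values containing spaces would need quoting, which this sketch skips.
func (c Config) FormatEnvString() string {
	keys := make([]string, 0, len(c.environment))
	for k := range c.environment {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var sb strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&sb, "-e %s=%s ", k, c.environment[k])
	}
	fmt.Fprintf(&sb, "-p %d:%d", c.httpPort, c.httpPort)
	return sb.String()
}

func main() {
	cfg := Config{
		// hypothetical variable names and values, for illustration only
		environment: map[string]string{"MODEL_PATH": "glides/ponyxl", "AWS_REGION": "auto"},
		httpPort:    5000,
	}
	fmt.Println(cfg.FormatEnvString())
}
```

Sorting the keys matters more than it looks: Go randomizes map iteration order, so without it the same config would produce a different command line on every run.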

Then collect the endpoint URL:

```go
func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	// ...
	instance, err := runJSON[vastai.Instance](
		ctx,
		"vastai", "show", "instance",
		instanceData.NewContract,
		"--raw",
	)
	if err != nil {
		return fmt.Errorf("can't show instance %d: %w", instanceData.NewContract, err)
	}

	// the struct field is endpointURL (unexported), matching ScaleToZeroProxy
	s.endpointURL = fmt.Sprintf(
		"http://%s:%d",
		instance.PublicIPAddr,
		instance.Ports[fmt.Sprintf("%d/tcp", s.cfg.httpPort)][0].HostPort,
	)

	return nil
}
```

And then finally wire it up and have it test if the instance is ready somehow:

```go
func (s *ScaleToZeroProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// ...

	if !ready {
		if err := s.mintInstance(r.Context()); err != nil {
			slog.Error("can't mint new instance", "err", err)
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		t := time.NewTicker(5 * time.Second)
		defer t.Stop()

	waitForReady:
		for {
			select {
			case <-r.Context().Done():
				// the client gave up; stop polling instead of looping forever
				return
			case <-t.C:
				if s.testReady(r.Context()) {
					break waitForReady
				}
			}
		}
	}

	// ...
```

Then the rest of the logic will run through, the request will be passed to the
GPU instance and then a response will be fired. All that’s left is to slay the
instances off when they’re unused for about 5 minutes:

```go
func (s *ScaleToZeroProxy) maybeSlayLoop(ctx context.Context) {
	t := time.NewTicker(5 * time.Minute)
	defer t.Stop()

	for {
		select {
		case <-t.C:
			s.lock.RLock()
			lastUsed := s.lastUsed
			s.lock.RUnlock()

			// time.Now must be called; passing the function itself won't compile
			if lastUsed.Add(5 * time.Minute).Before(time.Now()) {
				if err := s.slay(ctx); err != nil {
					slog.Error("can't slay instance", "err", err)
				}
			}
		case <-ctx.Done():
			return
		}
	}
}
```

Et voilà! Run `maybeSlayLoop` in the background and implement the `slay()`
method to use the `vastai destroy instance` command, then you have yourself
nomadic compute that makes and destroys itself on demand to the lowest bidder.
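
A sketch of what `slay()` could wrap, assuming the instance ID is passed as the only positional argument to `vastai destroy instance`. `destroyArgs` and `slayInstance` are hypothetical helper names, and `main` only prints the command so that running the example doesn't destroy anything.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strconv"
)

// destroyArgs builds the argv for tearing down one instance.
func destroyArgs(instanceID int) []string {
	return []string{"vastai", "destroy", "instance", strconv.Itoa(instanceID)}
}

// slayInstance runs the destroy command. On failure the CLI's combined output
// is included in the error so the reason shows up in the logs.
func slayInstance(ctx context.Context, instanceID int) error {
	argv := destroyArgs(instanceID)
	out, err := exec.CommandContext(ctx, argv[0], argv[1:]...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("can't destroy instance %d: %w (output: %s)", instanceID, err, out)
	}
	return nil
}

func main() {
	// Only print the command here; call slayInstance for the real thing.
	fmt.Println(destroyArgs(1234))
}
```

The real `slay()` method would also need to take the write lock and reset `ready`, `endpointURL`, and `instanceID`, so the next request mints a fresh instance instead of proxying to a dead one.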

Of course, any production-ready implementation would have limits like “don’t
have more than 20 workers” and segment things into multiple work queues. This is
all really hypothetical right now. I wish I had a thing you could
`kubectl apply` and use right now, but I don’t.

I’m going to be working on this on my Friday streams
[on Twitch](https://twitch.tv/princessxen) until it’s done. I’m going to
implement it from an empty folder and then work on making it a Kubernetes
operator to run any task you want. It’s going to involve generative AI, API
reverse engineering, eternal torment, and hopefully not getting banned from the
providers I’m going to be using. It should be a blast!

## Conclusion

Every workload involves compute, network, and storage on top of production’s
compute plane, network plane, and storage plane. Design your production clusters
to take advantage of very well-understood fundamentals like HTTP, queues, and
object storage so that you can reduce your dependencies to the bare minimum.
Make your app an orchestrator of vast amounts of cheap compute so you don’t need
to pay for compute or storage that nobody is using while everyone is asleep.

This basic pattern is applicable to just about anything on any platform, not
just AI and not just with Tigris. We hope that by publishing this architectural
design, you’ll take it to heart when building your production workloads of the
future so that we can all use the cloud responsibly. Certain parts of the
economics of this pattern work best when you have free (or basically free)
egress though.

We’re excited about building the best possible storage layer based on the
lessons learned building the storage layer Uber uses to service millions of
rides per month. If you try us and disagree, that’s fine, we won’t nickel and
dime you on the way out because we don’t charge egress costs.

When all of these concerns are made easier, all that’s left for you is to draw
the rest of the owl and get out there disrupting industries.