Stitch any CI into Tangled
69
fork

Configure Feed

Select the types of activity you want to include in your feed.

buildkite: fall back to meta_data on webhook cache miss #9

open opened by mitchellh.com targeting main from push-wvullswvqnms

Buildkite webhooks can land before Spawn's goroutine has persisted the build UUID to (knot, pipeline_rkey, workflow) mapping. The window is small but real: CreateBuild returns, Buildkite fires build.scheduled, the webhook handler runs LookupBuildkiteBuildByUUID and gets nothing, and the event is dropped on the floor forever.

HandleWebhook now reconstructs the ref from the build's meta_data when the lookup misses. We already attach tack:knot, tack:pipeline_rkey, and tack:workflow at CreateBuild time, so a Buildkite-originated webhook for one of our builds always carries enough information to recover the tuple. Org and pipeline slug come from the payload's top-level organization and embedded build.pipeline objects.

The reconstructed ref is opportunistically inserted via the existing ON CONFLICT DO UPDATE, so subsequent webhooks and any /logs request hit the cache instead of redoing the meta_data dance. If the authoritative Spawn-side insert lands afterwards it just refreshes the row.

Builds without our tack:* meta_data still no-op, preserving the 'foreign build sharing this webhook URL' behavior. WebhookPayload gains an Organization field so the org slug is available without poking at Build.Pipeline.

Labels

None yet.

assignee

None yet.

Participants 1
AT URI
at://did:plc:onu3oqfahfubgbetlr4giknc/sh.tangled.repo.pull/3mktvlttdqa22
+190 -8
Diff #0
+17 -3
internal/buildkite/buildkite.go
··· 255 255 // actually decode. Build events all wrap a "build" object; job 256 256 // events wrap a "job" — keeping both around lets callers add 257 257 // job-level mapping later without changing the decoder. 258 + // 259 + // Organization is the top-level organization object Buildkite mirrors 260 + // onto every webhook. We decode just its slug so the provider can 261 + // reconstruct (org, pipeline_slug, build_number) from a webhook alone 262 + // when the local UUID→tuple mapping hasn't been persisted yet (the 263 + // race between CreateBuild returning and InsertBuildkiteBuild landing). 258 264 type WebhookPayload struct { 259 - Event string `json:"event"` 260 - Build Build `json:"build"` 261 - Job Job `json:"job"` 265 + Event string `json:"event"` 266 + Build Build `json:"build"` 267 + Job Job `json:"job"` 268 + Organization Organization `json:"organization"` 269 + } 270 + 271 + // Organization is the small slice of Buildkite's organization object 272 + // we care about on webhook payloads: just the slug, used as the org 273 + // component of REST URLs in subsequent calls. 274 + type Organization struct { 275 + Slug string `json:"slug"` 262 276 } 263 277 264 278 // WebhookMode selects how an inbound webhook request is
+79 -5
provider_buildkite.go
··· 544 544 return fmt.Errorf("lookup build by uuid: %w", err) 545 545 } 546 546 if ref == nil { 547 - // Most likely: this build was triggered outside tack and 548 - // just happens to share our webhook URL. Nothing to do. 549 - p.log.Debug("webhook for unknown build; ignoring", 547 + // Cache miss. Two plausible causes: 548 + // 549 + // 1. Genuinely-foreign build: a Buildkite job triggered 550 + // outside tack that just happens to share this 551 + // webhook URL. Nothing to do. 552 + // 2. Race: Spawn's goroutine fired CreateBuild but 553 + // hasn't yet written the UUID→tuple row. A fast 554 + // build.scheduled webhook can land in that window 555 + // and would otherwise be dropped forever. 556 + // 557 + // We disambiguate using the Buildkite meta_data we set at 558 + // CreateBuild time. If the tack:* keys are present the 559 + // build is ours; we reconstruct the ref and opportunistically 560 + // persist it so subsequent webhooks (and any Logs call) 561 + // hit the cache rather than re-doing this work. 562 + ref = refFromWebhook(payload) 563 + if ref == nil { 564 + p.log.Debug("webhook for unknown build; ignoring", 565 + "event", payload.Event, 566 + "build_uuid", payload.Build.ID, 567 + ) 568 + return nil 569 + } 570 + // Opportunistic cache fill. Failure here is non-fatal: 571 + // Spawn's authoritative insert will land shortly (or has 572 + // already, in which case our INSERT … ON CONFLICT just 573 + // refreshes the row). We continue with the reconstructed 574 + // ref either way so a status publish isn't lost. 575 + if err := p.st.InsertBuildkiteBuild(ctx, *ref); err != nil { 576 + p.log.Warn("opportunistic persist of buildkite build mapping", 577 + "err", err, "build_uuid", ref.BuildUUID, 578 + ) 579 + } 580 + p.log.Info("buildkite webhook reconstructed from meta_data", 550 581 "event", payload.Event, 551 - "build_uuid", payload.Build.ID, 582 + "build_uuid", ref.BuildUUID, 583 + "workflow", ref.Workflow, 552 584 ) 553 - return nil 554 585 } 555 586 556 587 status, ok := mapBuildkiteState(payload.Build.State) ··· 580 611 return nil 581 612 } 582 613 614 + // refFromWebhook reconstructs a BuildkiteBuildRef directly from a 615 + // webhook payload, using the tack:* meta_data we attach at Spawn 616 + // time as the source of truth for (knot, pipeline_rkey, workflow). 617 + // Org and pipeline slug come from the payload's organization and 618 + // embedded pipeline objects, both of which Buildkite populates on 619 + // every build.* event. 620 + // 621 + // Returns nil when the payload doesn't carry our meta_data: that's 622 + // the signal the build was triggered outside tack and we should 623 + // keep ignoring it. A partial set (one or two of the three keys) 624 + // also returns nil; we don't want to half-reconstruct a row. 625 + // 626 + // This exists so HandleWebhook can recover the tuple when the 627 + // CreateBuild→InsertBuildkiteBuild race drops a webhook on the 628 + // floor. The caller is expected to opportunistically persist the 629 + // returned ref so subsequent lookups hit the cache. 630 + func refFromWebhook(payload buildkite.WebhookPayload) *BuildkiteBuildRef { 631 + md := payload.Build.MetaData 632 + knot := md[bkMetaKnot] 633 + rkey := md[bkMetaPipelineRkey] 634 + wf := md[bkMetaWorkflow] 635 + if knot == "" || rkey == "" || wf == "" { 636 + return nil 637 + } 638 + 639 + // Pipeline slug lives on the build's embedded pipeline object. 640 + // Decoding it as map[string]interface{} keeps the buildkite 641 + // package's Build struct from sprouting fields we only ever 642 + // touch on this fallback path. 643 + pipelineSlug, _ := payload.Build.Pipeline["slug"].(string) 644 + 645 + return &BuildkiteBuildRef{ 646 + BuildUUID: payload.Build.ID, 647 + BuildNumber: payload.Build.Number, 648 + PipelineSlug: pipelineSlug, 649 + Org: payload.Organization.Slug, 650 + Knot: knot, 651 + PipelineRkey: rkey, 652 + Workflow: wf, 653 + PipelineURI: pipelineATURI(knot, rkey), 654 + } 655 + } 656 + 583 657 // mapBuildkiteState translates Buildkite's build state strings into 584 658 // the Tangled spindle StatusKind enum. The mapping aligns with the 585 659 // upstream constants (StatusKindRunning/Failed/Cancelled/Success);
+94
provider_buildkite_test.go
··· 463 463 } 464 464 } 465 465 466 + // TestBuildkiteHandleWebhookMetaDataFallback covers the race where 467 + // a webhook arrives before Spawn has persisted the UUID→tuple row. 468 + // The handler must reconstruct the ref from the build's tack:* 469 + // meta_data, publish a status event, and opportunistically persist 470 + // the row so subsequent lookups hit the cache. 471 + func TestBuildkiteHandleWebhookMetaDataFallback(t *testing.T) { 472 + p, st, _, _ := newBuildkiteTestProvider(t, buildkite.WebhookModeToken, "secret", 473 + func(w http.ResponseWriter, r *http.Request) { t.Fatal("buildkite shouldn't be called") }) 474 + 475 + // No InsertBuildkiteBuild: simulate the race window where 476 + // CreateBuild has returned to Buildkite (so a webhook is in 477 + // flight) but the provider hasn't yet written the mapping row. 478 + err := p.HandleWebhook(context.Background(), buildkite.WebhookPayload{ 479 + Event: "build.scheduled", 480 + Build: buildkite.Build{ 481 + ID: "uuid-race", 482 + Number: 42, 483 + State: "scheduled", 484 + MetaData: map[string]string{ 485 + bkMetaKnot: "knot.example.com", 486 + bkMetaPipelineRkey: "rkey-race", 487 + bkMetaWorkflow: "ci.yml", 488 + }, 489 + Pipeline: map[string]any{"slug": "mypipe"}, 490 + }, 491 + Organization: buildkite.Organization{Slug: "myorg"}, 492 + }) 493 + if err != nil { 494 + t.Fatalf("HandleWebhook: %v", err) 495 + } 496 + 497 + // Status was published despite the missing cache row. 498 + rows, err := st.EventsAfter(context.Background(), 0) 499 + if err != nil { 500 + t.Fatalf("EventsAfter: %v", err) 501 + } 502 + if len(rows) != 1 { 503 + t.Fatalf("got %d events, want 1", len(rows)) 504 + } 505 + var rec tangled.PipelineStatus 506 + if err := json.Unmarshal(rows[0].EventJSON, &rec); err != nil { 507 + t.Fatalf("decode: %v", err) 508 + } 509 + if rec.Status != "pending" || rec.Workflow != "ci.yml" { 510 + t.Fatalf("bad status: %+v", rec) 511 + } 512 + 513 + // Opportunistic insert means the row is now cached for any 514 + // subsequent webhook (or Logs call), including the Org and 515 + // PipelineSlug we recovered from the payload. 516 + ref, err := st.LookupBuildkiteBuildByUUID(context.Background(), "uuid-race") 517 + if err != nil { 518 + t.Fatalf("lookup: %v", err) 519 + } 520 + if ref == nil { 521 + t.Fatal("expected opportunistic InsertBuildkiteBuild, got nil row") 522 + } 523 + if ref.Workflow != "ci.yml" || ref.Knot != "knot.example.com" || 524 + ref.PipelineRkey != "rkey-race" || ref.BuildNumber != 42 || 525 + ref.PipelineSlug != "mypipe" || ref.Org != "myorg" { 526 + t.Fatalf("ref mismatch: %+v", ref) 527 + } 528 + } 529 + 530 + // TestBuildkiteHandleWebhookForeignBuild covers the genuinely-foreign 531 + // case: a webhook for a build that doesn't carry our tack:* meta_data 532 + // must remain a silent no-op even though the new fallback path now 533 + // inspects meta_data. 534 + func TestBuildkiteHandleWebhookForeignBuild(t *testing.T) { 535 + p, st, _, _ := newBuildkiteTestProvider(t, buildkite.WebhookModeToken, "secret", 536 + func(w http.ResponseWriter, r *http.Request) {}) 537 + 538 + if err := p.HandleWebhook(context.Background(), buildkite.WebhookPayload{ 539 + Event: "build.finished", 540 + Build: buildkite.Build{ 541 + ID: "uuid-foreign", 542 + State: "passed", 543 + // No tack:* keys: build was triggered outside tack. 544 + MetaData: map[string]string{"someone-elses-key": "value"}, 545 + }, 546 + }); err != nil { 547 + t.Fatalf("HandleWebhook: %v", err) 548 + } 549 + 550 + rows, _ := st.EventsAfter(context.Background(), 0) 551 + if len(rows) != 0 { 552 + t.Fatalf("got %d events, want 0", len(rows)) 553 + } 554 + ref, _ := st.LookupBuildkiteBuildByUUID(context.Background(), "uuid-foreign") 555 + if ref != nil { 556 + t.Fatalf("foreign build was opportunistically persisted: %+v", ref) 557 + } 558 + } 559 + 466 560 // TestBuildkiteWebhookHandlerHTTP exercises the full HTTP path 467 561 // including auth: a request signed with the wrong secret must be 468 562 // rejected, and a correctly-signed one must reach the provider.

History

1 round 0 comments
sign up or login to add to the discussion
mitchellh.com submitted #0
1 commit
expand
buildkite: fall back to meta_data on webhook cache miss
merge conflicts detected
expand
  • internal/buildkite/buildkite.go:255
  • provider_buildkite.go:544
expand 0 comments