providers/anthropic: enforce max_tokens > thinking_budget and auto-stream over threshold
Make run_generate and run_agenerate robust to two latent Anthropic SDK
constraints that surfaced during the L3 sense pilot. Both apply only to
the generate path; cogitate is untouched.
Fix A: when thinking_budget is positive and max_output_tokens falls at
or below thinking_budget + 1000, lift max_tokens rather than clamping
the caller's thinking budget. With thinking active, the caller's
declared max is treated as a floor on output, not a ceiling; clamping
thinking instead would silently shrink a deliberate reasoning budget.
A logger.info call records the before/after values on each lift. The
BadRequestError retry inherits the adjustment via a dict copy.
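The lift reduces to a small guard. A minimal sketch, assuming the
1000-token headroom described above; the helper name lift_max_tokens
and the HEADROOM constant are illustrative, not the actual symbols:

```python
import logging

logger = logging.getLogger(__name__)

HEADROOM = 1000  # assumed minimum non-thinking output reserve (see Fix A)

def lift_max_tokens(max_tokens: int, thinking_budget: int) -> int:
    """Lift max_tokens instead of clamping the thinking budget (Fix A sketch)."""
    if thinking_budget > 0 and max_tokens <= thinking_budget + HEADROOM:
        lifted = thinking_budget + HEADROOM + 1
        logger.info("lifting max_tokens %d -> %d", max_tokens, lifted)
        return lifted
    return max_tokens
```

With equal budgets, lift_max_tokens(24576, 24576) yields 25577; a max
already above the headroom passes through unchanged.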
Fix B: route requests through client.messages.stream(...) when
max_tokens trips the SDK's non-streaming guard, either by exceeding
MODEL_NONSTREAMING_TOKENS[model] or by exceeding the time-formula
threshold (60 * 60 * max_tokens / 128_000 > 600, i.e. max_tokens above
roughly 21,333). Downstream extraction is unchanged, since
ParsedMessage subclasses Message.
MODEL_NONSTREAMING_TOKENS is imported from anthropic._constants — the
SDK itself imports it via the public messages path at
anthropic/resources/messages/messages.py, so the symbol is stable.
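The dispatch check can be sketched as follows; the caps parameter
stands in for the imported MODEL_NONSTREAMING_TOKENS table, and the
helper name must_stream is illustrative:

```python
def must_stream(model: str, max_tokens: int, caps: dict) -> bool:
    """True when the SDK's non-streaming guard would reject create() (Fix B sketch)."""
    cap = caps.get(model)
    if cap is not None and max_tokens > cap:
        return True
    # Time heuristic quoted above: expected seconds at ~128k tokens/hour.
    return 60 * 60 * max_tokens / 128_000 > 600
```

The time formula flips at 21,334 tokens; a per-model cap can force
streaming well below that.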
Both fixes compose: with thinking_budget=24576 and
max_output_tokens=24576, max_tokens is lifted to 25577 and the request
is then routed through streaming. The retry path re-evaluates the
dispatch decision, so a primary create() call that raises
BadRequestError with a post-lift max_tokens above the threshold will
route its retry through streaming.
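A self-contained sketch of the composed decision, using the constants
stated in the fix descriptions; plan_request and the headroom value
are illustrative, not the production names:

```python
def plan_request(max_tokens: int, thinking_budget: int):
    """Apply Fix A's lift, then Fix B's time-formula dispatch, in order."""
    headroom = 1000  # assumed text-output reserve from Fix A
    if thinking_budget > 0 and max_tokens <= thinking_budget + headroom:
        max_tokens = thinking_budget + headroom + 1
    stream = 60 * 60 * max_tokens / 128_000 > 600  # Fix B time guard
    return max_tokens, stream
```

plan_request(24576, 24576) returns (25577, True): the lift pushes
max_tokens past the ~21,333-token streaming threshold, so the worked
example above ends up on the streaming path.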
Live validation with production-scale budgets deferred:
ANTHROPIC_API_KEY is not available in this worktree. Unit tests cover
Fix A (adjust / no-adjust / async), Fix B (create vs stream per model,
sync + async), the Fix A → Fix B interaction, and the tool-use fallback
routing under streaming.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>