commits
root cause: every call to verify() was doing
`AffinePoint.fromStdlib(public_key.affineCoordinates())`, and stdlib's
Secp256k1.affineCoordinates() unconditionally inverts Z, even when the
point was created from SEC1 (where Z is always 1). that field inversion
cost ~12 µs per call, which Tracy instrumentation showed was 19% of the
verify budget. the work was redundant: the affine coordinates never
change for a given PublicKey, so recomputing them on every call buys
nothing.
fix: PublicKey now caches an AffinePoint at fromSec1 time. one-time
field inversion cost at key construction, zero cost per verify. adds
~80 bytes per PublicKey (2 × Fe = 10 × u64), negligible.
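a rough sketch of the caching pattern, written in Python rather than the repo's Zig to keep it short. the names here (`PublicKey`, `to_affine`, `verify_stub`) are hypothetical stand-ins, not the library's API; the only point is that the field inversion runs once at construction and verify-time reads are free.

```python
# Illustrative only: names are hypothetical, not the library's real API.
P = 2**256 - 2**32 - 977  # secp256k1 field prime
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8


def to_affine(X, Y, Z):
    """Convert Jacobian (X, Y, Z) to affine (x, y); costs one field inversion."""
    z_inv = pow(Z, -1, P)
    z_inv2 = z_inv * z_inv % P
    return (X * z_inv2 % P, Y * z_inv2 * z_inv % P)


class PublicKey:
    def __init__(self, X, Y, Z=1):
        # Pay for the inversion once, when the key is constructed.
        self.affine = to_affine(X, Y, Z)


def verify_stub(key):
    # A real verify would run the ECDSA math; this stub only reads the cache,
    # so no inversion happens on the per-call path.
    return key.affine
```

the trade is memory for time: two cached field elements per key in exchange for skipping the inversion on every verify.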
signature change: verify_mod.verify() now takes AffinePoint instead of
Secp256k1. soft-breaking only for direct callers of the low-level
verify function — the public high-level APIs (Signature.verifyMsg,
Signature.verifyPrehashed, PublicKey.fromSec1) are unchanged.
measured (ReleaseFast, M1, 200k iterations × 8 warm runs):
before: mean 18,630 v/s, ~53.7 µs/op
after: mean 23,299 v/s, ~42.9 µs/op
delta: +25% mean, +11.6% worst-case (worst-after vs best-before)
added correctness safety net:
- tests/verify_test.zig: 2 new stress tests that run under standard
  `zig build test`:
  - "stress: 2000 random verify cases match stdlib exactly":
    randomized (keypair, msg, signature) triples, bit-exact agreement
    with stdlib verify required
  - "stress: 500 corrupted signature cases match stdlib exactly":
    random bit-flip corruptions, rejection parity with stdlib
these catch regressions in scalar reduction, field arithmetic, table
indexing, sign handling, and edge cases before a benchmark would.
all 42 tests pass pre- and post-change.
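the shape of such a differential stress test can be sketched as follows. this is Python, not the Zig test file, and HMAC stands in for ECDSA purely to keep the example self-contained; `reference_verify`, `fast_verify`, and `run_stress` are hypothetical names.

```python
import hashlib
import hmac
import random


def reference_verify(key, msg, tag):
    """Trusted baseline (plays the role of stdlib verify)."""
    return hmac.compare_digest(hmac.new(key, msg, hashlib.sha256).digest(), tag)


def fast_verify(key, msg, tag):
    """Implementation under test (plays the role of the optimized verify)."""
    return hmac.digest(key, msg, "sha256") == tag


def flip_random_bit(data, rng):
    """Return a copy of `data` with one randomly chosen bit inverted."""
    out = bytearray(data)
    pos = rng.randrange(len(out) * 8)
    out[pos // 8] ^= 1 << (pos % 8)
    return bytes(out)


def run_stress(valid_cases=2000, corrupted_cases=500, seed=0):
    """Count the cases where the two verifiers disagree."""
    rng = random.Random(seed)
    mismatches = 0
    for i in range(valid_cases + corrupted_cases):
        key = rng.randbytes(32)
        msg = rng.randbytes(rng.randrange(1, 64))
        tag = hmac.new(key, msg, hashlib.sha256).digest()
        if i >= valid_cases:
            tag = flip_random_bit(tag, rng)  # corrupted case: both must reject
        if fast_verify(key, msg, tag) != reference_verify(key, msg, tag):
            mismatches += 1
    return mismatches
```

a fixed seed keeps failures reproducible, which matters more here than covering a different random sample on each run.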
added scripts/bench_verify.zig + `zig build bench-verify` target for
reproducible throughput measurement on future optimization work.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rewrite field element from 10×26-bit schoolbook to 5×52-bit
unsaturated limbs (ported from libsecp256k1 field_5x52_int128_impl.h).
25 products per mul vs 100 — roughly 2x field speedup on arm64.
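the 25-vs-100 product count follows directly from the limb counts (5×5 vs 10×10). a small Python sketch that counts limb products makes this concrete. it reduces mod p only at the end, whereas a real limb implementation interleaves reduction with the products, and the helper names are made up for illustration.

```python
P = 2**256 - 2**32 - 977  # secp256k1 field prime


def to_limbs(x, bits, count):
    """Split x into `count` little-endian limbs of `bits` bits each."""
    mask = (1 << bits) - 1
    return [(x >> (bits * i)) & mask for i in range(count)]


def schoolbook_mul(a, b, bits, count):
    """Multiply limb by limb; return (a*b mod P, number of limb products)."""
    la, lb = to_limbs(a, bits, count), to_limbs(b, bits, count)
    acc, products = 0, 0
    for i in range(count):
        for j in range(count):
            # Each (i, j) pair is one hardware multiply in a real implementation.
            acc += (la[i] * lb[j]) << (bits * (i + j))
            products += 1
    return acc % P, products
```

calling `schoolbook_mul(a, b, 26, 10)` and `schoolbook_mul(a, b, 52, 5)` gives the same field result with 100 and 25 limb products respectively. the 5×52 products are wider (they need a 128-bit accumulator), which is one reason the measured speedup is closer to 2x than 4x.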
add Fermat scalar inversion s^(n-2) via addition chain (253 sq + 40 mul),
replacing stdlib divstep (769 iterations). ported from secp256k1-voi.
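for reference, Fermat inversion relies on s^(n-1) ≡ 1 (mod n) for prime n, so s^(n-2) is the inverse of s. the Python sketch below uses plain square-and-multiply rather than the tuned addition chain from the commit; the chain computes the same value with fewer multiplies.

```python
# secp256k1 group order (prime)
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141


def fermat_inverse(s):
    """Return s^(N-2) mod N using left-to-right square-and-multiply."""
    if s % N == 0:
        raise ZeroDivisionError("zero has no inverse mod N")
    result = 1
    for bit in bin(N - 2)[2:]:
        result = result * result % N  # square once per exponent bit
        if bit == "1":
            result = result * s % N   # multiply when the bit is set
    return result
```

the exponent is a fixed public constant, so the sequence of squarings and multiplies does not depend on s.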
lazy runtime initialization for 32×256 base point table (comptime
interpreter can't handle u128 arithmetic at this scale).
rename Fe26→Fe, AffinePoint26→AffinePoint — names now match the
implementation. remove redundant P1/P2/P3 constants.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- extend G_TABLE from [16][256] to [32][256] for full 256-bit scalar coverage
- u1*G: direct byte-at-a-time lookup, no GLV split, zero doublings
- u2*Q: Jacobian precompute tables, no batchToAffine field inversion
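the byte-at-a-time lookup for u1*G can be sketched like this, assuming the table layout is table[i][b] = b · 256^i · G (the commit gives the shape [32][256] but not the exact layout). it is Python with slow affine arithmetic, purely to show why no doublings are needed; the function names are hypothetical.

```python
P = 2**256 - 2**32 - 977  # secp256k1 field prime
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
G = (GX, GY)


def add(p1, p2):
    """Affine addition on y^2 = x^3 + 7; None is the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        if (y1 + y2) % P == 0:
            return None  # p1 == -p2
        lam = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P  # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P  # chord slope
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)


def build_table(windows=32):
    """table[i][b] = b * 256^i * G for byte position i and byte value b."""
    table, base = [], G
    for _ in range(windows):
        row, acc = [None], None
        for _ in range(255):
            acc = add(acc, base)
            row.append(acc)
        table.append(row)
        base = add(row[255], base)  # 256 * base feeds the next window
    return table


def mul_base(k, table):
    """k*G as the sum of one table entry per scalar byte; no doublings."""
    acc = None
    for i, byte in enumerate(k.to_bytes(32, "little")):
        acc = add(acc, table[i][byte])
    return acc


def mul_double_and_add(k, point=G):
    """Reference scalar multiply, used only to cross-check mul_base."""
    acc = None
    for bit in bin(k)[2:]:
        acc = add(acc, acc)
        if bit == "1":
            acc = add(acc, point)
    return acc
```

all the doublings are folded into the table at build time, so the per-scalar cost is at most 32 additions.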
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace stdlib 4×64-bit Montgomery field with direct 10×26-bit
representation for secp256k1. All point arithmetic, batch affine
conversion, endomorphism, and verification now operate in Fe26.
~9% faster than v0.0.1 baseline with safe normalize-on-output strategy.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Separate base-point and public-key multiply paths:
- u1*G via 16x256 comptime byte table (~32 mixed adds, zero doublings)
- u2*Q via 2-way Jacobian Shamir (a=0 dbl 2M+5S, mixed add 7M+4S)
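The operation counts quoted above match two well-known Jacobian formulas for a = 0 curves from the Explicit-Formulas Database: dbl-2009-l (2M+5S) and madd-2007-bl (7M+4S). Whether the code uses exactly these is an assumption. A Python sketch of both, with each multiply (M) and squaring (S) marked:

```python
P = 2**256 - 2**32 - 977  # secp256k1 field prime
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8


def jac_double(X1, Y1, Z1):
    """Jacobian doubling for a = 0 (dbl-2009-l): 2M + 5S."""
    A = X1 * X1 % P                        # S
    B = Y1 * Y1 % P                        # S
    C = B * B % P                          # S
    D = 2 * ((X1 + B) ** 2 - A - C) % P    # S
    E = 3 * A % P
    F = E * E % P                          # S
    X3 = (F - 2 * D) % P
    Y3 = (E * (D - X3) - 8 * C) % P        # M
    Z3 = 2 * Y1 * Z1 % P                   # M
    return X3, Y3, Z3


def jac_add_mixed(X1, Y1, Z1, x2, y2):
    """Jacobian + affine addition (madd-2007-bl): 7M + 4S.

    Does not handle equal, opposite, or infinite inputs (H == 0).
    """
    Z1Z1 = Z1 * Z1 % P                     # S
    U2 = x2 * Z1Z1 % P                     # M
    S2 = y2 * Z1 % P * Z1Z1 % P            # 2M
    H = (U2 - X1) % P
    HH = H * H % P                         # S
    I = 4 * HH % P
    J = H * I % P                          # M
    r = 2 * (S2 - Y1) % P
    V = X1 * I % P                         # M
    X3 = (r * r - J - 2 * V) % P           # S
    Y3 = (r * (V - X3) - 2 * Y1 * J) % P   # 2M
    Z3 = ((Z1 + H) ** 2 - Z1Z1 - HH) % P   # S
    return X3, Y3, Z3


def to_affine(X, Y, Z):
    """Convert Jacobian to affine; the single inversion lives here."""
    z_inv = pow(Z, -1, P)
    z_inv2 = z_inv * z_inv % P
    return (X * z_inv2 % P, Y * z_inv2 * z_inv % P)
```

Both routines avoid field inversions entirely; only the final conversion back to affine needs one.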
Also set version to 0.0.1 for patch-level releases.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

3 algorithmic optimizations over zig stdlib, no assembly:
1. endomorphism via 1 field multiply (not ~65 doublings)
2. single 4-way Shamir loop (128 doublings, not 256)
3. projective-space comparison (no field inversion)
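item 3 can be illustrated with a short Python sketch. ECDSA verify needs to check that the affine x of the result point, reduced mod n, equals r. with the point in Jacobian form (X, Y, Z), x = X/Z², so the check can be done as X ≡ r·Z² (mod p) with no inversion. this is a common formulation of the trick; the commit does not spell out its exact comparison, and `x_matches_r` is a hypothetical name.

```python
P = 2**256 - 2**32 - 977  # secp256k1 field prime
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141  # group order


def x_matches_r(X, Z, r):
    """True if the affine x of Jacobian (X, _, Z), reduced mod N, equals r."""
    if Z % P == 0:
        return False  # the point at infinity never verifies
    zz = Z * Z % P
    if (r * zz - X) % P == 0:
        return True
    # x may lie in [N, P); in that case x equals r + N as a field element.
    return r + N < P and ((r + N) * zz - X) % P == 0
```

the second branch covers the rare case where x is at least n, so that reducing it mod n subtracts n once.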
3.3x faster than stdlib on 3072-entry atproto corpus.
drop-in API compatible with std.crypto.sign.ecdsa.EcdsaSecp256k1Sha256.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>