Add CRC benchmarks and software-only API; update README

Add bench/ with a throughput comparison of the three implementation tiers:
- Byte-at-a-time (reference baseline): ~290 MB/s
- Slicing-by-8 (software fast path): ~730 MB/s
- Hardware ARM CRC intrinsics: ~12-16 GB/s
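
For orientation, the byte-at-a-time tier corresponds roughly to the classic table-driven loop sketched below. This is an illustrative sketch, not the code added in this commit: the name crc32_bytewise is hypothetical, and it assumes the standard reflected CRC-32 polynomial 0xEDB88320.

```c
#include <stddef.h>
#include <stdint.h>

/* Lookup table for the reflected CRC-32 polynomial 0xEDB88320,
 * built lazily on first use (not thread-safe; sketch only). */
static uint32_t crc_table[256];
static int crc_table_ready = 0;

static void crc_table_init(void) {
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1u) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_table[i] = c;
    }
    crc_table_ready = 1;
}

/* Hypothetical name: processes one input byte per table lookup. */
uint32_t crc32_bytewise(const unsigned char *buf, size_t len) {
    if (!crc_table_ready)
        crc_table_init();
    uint32_t crc = 0xFFFFFFFFu;               /* standard initial value */
    for (size_t i = 0; i < len; i++)
        crc = crc_table[(crc ^ buf[i]) & 0xFFu] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;                 /* final XOR */
}
```

Slicing-by-8 extends the same idea with eight tables so that eight input bytes are folded in per iteration.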

Expose crc32_software and crc32c_software so the software path can be
benchmarked and tested independently of hardware dispatch.

Update README to document the performance tiers, hardware detection, and
the portability story (jsoo/wasm).