
perf: improve tokenizer scanning and add benchmark suite #51

Open
smith558 wants to merge 3 commits into fent:master from smith558:rescans-refactor

Conversation

@smith558

Summary

The tokenizer now parses the regex source directly instead of doing an upfront escaped-string rewrite and slice-based parsing for character classes and {m,n} repetitions.

What Changed

  • removed the up-front strToChars() pass from tokenization
  • replaced slice-based class parsing with an indexed class scanner
  • replaced regex-on-slice repetition parsing with an indexed {m,n} parser
  • consolidated escape and number parsing helpers to keep the hot path smaller
  • added light comments around the non-obvious tokenizer branches
  • added a repo benchmark suite covering:
    • tokenizer
    • reconstruct
    • roundtrip
  • documented benchmark usage in the README
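
The indexed class scanner replaces the old slice-and-rescan approach: instead of cutting the class body out of the source and parsing the copy, it walks the original string with a cursor and reports where the class ends. A minimal sketch of the idea (function name, return shape, and the simplified member handling are illustrative, not the actual implementation):

```javascript
// Scan a character class in place. `start` points at the character
// just after '['. Returns the parsed members and the index one past
// the closing ']' so the caller can resume scanning without a slice.
function scanClass(source, start) {
  let i = start;
  const negated = source[i] === "^";
  if (negated) i++;
  const members = [];
  while (i < source.length && source[i] !== "]") {
    let ch = source[i];
    if (ch === "\\") {
      // keep escapes as two-character members for illustration
      ch += source[i + 1];
      i += 2;
    } else {
      i += 1;
    }
    members.push(ch);
  }
  if (i >= source.length) throw new SyntaxError("Unterminated character class");
  return { negated, members, end: i + 1 }; // end = index after ']'
}
```

Because the scanner returns `end`, the caller advances its own cursor directly; no substring of the class body is ever allocated.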

Why

The main goal is to reduce per-parse allocations and avoid unnecessary rescans of the input string, especially on patterns with character classes and custom repetitions.
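
To make the rescan point concrete, here is a sketch of an indexed {m,n} parser in the spirit of this change: digits are read in place instead of matching a regex against a slice of the remaining input, so the hot path allocates no intermediate strings. Names and the exact return shape are hypothetical:

```javascript
// Parse a {m}, {m,} or {m,n} repetition in place. `start` points at the
// character just after '{'. Returns null when the braces are not a valid
// repetition (so the caller can treat '{' as a literal).
function parseRepetition(source, start) {
  let i = start;
  const readInt = () => {
    let n = -1;
    while (i < source.length && source[i] >= "0" && source[i] <= "9") {
      n = (n === -1 ? 0 : n * 10) + (source.charCodeAt(i) - 48);
      i++;
    }
    return n; // -1 means "no digits here"
  };
  const min = readInt();
  if (min === -1) return null; // e.g. '{x}' is a literal brace
  let max = min;
  if (source[i] === ",") {
    i++;
    max = readInt();
    if (max === -1) max = Infinity; // open-ended {m,}
  }
  if (source[i] !== "}") return null;
  return { min, max, end: i + 1 }; // end = index after '}'
}
```

A regex-on-slice version must first build `source.slice(start)` and then run a pattern over it; the indexed version touches only the characters it consumes.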

The benchmark suite is included so future tokenizer changes can be measured with representative workloads instead of one-off local scripts.
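
The core of a dependency-free harness like this is a timing loop that runs each case until a minimum time budget is spent, then reports a per-operation cost. A sketch of that loop (names are illustrative, not the suite's real API; assumes Node's `process.hrtime.bigint()`):

```javascript
// Run `fn` repeatedly for at least `minMs` milliseconds and report the
// mean cost per call in nanoseconds.
function benchCase(name, fn, minMs = 200) {
  // warm up so the JIT settles before timing starts
  for (let i = 0; i < 1000; i++) fn();
  let iterations = 0;
  let elapsedNs = 0n;
  const budgetNs = BigInt(minMs) * 1000000n;
  const start = process.hrtime.bigint();
  while (elapsedNs < budgetNs) {
    // batch calls so the clock is read once per 1000 iterations
    for (let i = 0; i < 1000; i++) fn();
    iterations += 1000;
    elapsedNs = process.hrtime.bigint() - start;
  }
  return { name, iterations, nsPerOp: Number(elapsedNs) / iterations };
}
```

Raising the budget (as `--min-ms` does below) trades runtime for lower variance between runs.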

Benchmark

I compared the current branch against HEAD~2 locally using the same benchmark case set.

Tokenizer results:

  • geometric mean speedup: 2.753x
  • arithmetic mean speedup: 2.975x

Selected tokenizer cases:

  • email-like: 4.922x
  • dense-sets: 4.027x
  • path-like: 3.447x
  • literal: 3.147x
  • class-heavy: 3.144x

Roundtrip results:

  • geometric mean speedup: 1.826x

reconstruct also benchmarked faster in this harness, but the tokenizer is the primary target of this change.
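
For reference, this is how the summary figures above are conventionally computed: each case's speedup is baseline time over branch time, the geometric mean is the n-th root of their product (taken via logs for numerical stability), and the arithmetic mean is the plain average. The sample numbers below are illustrative, not the benchmark data:

```javascript
// Geometric mean of per-case speedup ratios, computed in log space.
function geometricMean(speedups) {
  const logSum = speedups.reduce((acc, s) => acc + Math.log(s), 0);
  return Math.exp(logSum / speedups.length);
}

// Plain average of the same ratios.
function arithmeticMean(speedups) {
  return speedups.reduce((acc, s) => acc + s, 0) / speedups.length;
}
```

The geometric mean is the usual headline number for speedup ratios because one outlier case cannot dominate it the way it dominates an arithmetic mean.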

Testing

  • npm run build
  • npm test
  • npm run bench -- --min-ms 200

Notes

The benchmark suite is intentionally dependency-free and runs against the built dist output:

npm run bench
npm run bench -- --suite tokenizer --min-ms 750
