Add missing metrics from leanMetrics spec #192
Conversation
Add 13 new metrics and update 3 existing ones to align with the upstream leanMetrics specification.

New metrics:
- PQ signature counters: attestation_signatures total/valid/invalid
- PQ signature histograms: signing time, verification time, aggregated signatures building time, aggregated signatures verification time
- Fork choice gauges: gossip_signatures, new/known aggregated payloads
- Fork choice histogram: reorg_depth
- Committee signatures aggregation time histogram
- Validator gauge: is_aggregator

Updated metrics:
- lean_attestations_valid/invalid_total: removed source label
- lean_fork_choice_block_processing_time_seconds: extended buckets
- lean_connected_peers: updated client label set in docs

Gossip signatures count uses an in-memory AtomicUsize counter (incremented on insert, decremented on delete/prune) to avoid a full RocksDB table scan on every attestation.
🤖 Kimi Code Review

Review Summary

This PR adds comprehensive metrics collection to ethlambda, focusing on signature operations, attestation processing, and fork-choice behavior. The implementation is generally well-structured and follows Rust best practices.

Issues Found

1. Atomic Counter Synchronization Issue (
🤖 Codex Code Review
Overall, the PR is solid from a consensus/security perspective: no obvious fork-choice, attestation validation, justification/finalization, state transition, XMSS verification, or SSZ correctness regressions in this diff. I did not run tests in this review pass.

Automated review by OpenAI Codex · custom prompt
🤖 Claude Code Review

Code Review: PR 192 — Add missing metrics from leanMetrics spec

Overall this is a well-executed metrics instrumentation PR. The design decisions are sound, the

Bug:
Greptile Summary

This PR brings ethlambda's metric coverage from 21 to 34 metrics by instrumenting PQ signature operations, fork-choice store state gauges, committee aggregation timing, and the

Key findings:
Confidence Score: 3/5
| Filename | Overview |
|---|---|
| crates/blockchain/src/store.rs | Adds timing guards, gauge updates, and refactors is_reorg → reorg_depth; two logic issues found: lean_gossip_signatures gauge not pushed after pruning, and MAX_REORG_DEPTH cutoff causes false-positive reorg events and counter inflation. |
| crates/storage/src/store.rs | Adds AtomicUsize gossip_signatures_count with DB-seeded startup value, upsert-safe point-read guard in insert_gossip_signature, and saturating_sub in delete/prune paths — the atomic counter logic is correct, but the corresponding Prometheus gauge push is missing from the pruning path (tracked in blockchain/store.rs comment). |
| crates/blockchain/src/metrics.rs | Adds 13 new metric functions and updates 2 existing counters (label removal) and 1 histogram (extended buckets); all registrations are lazy-static and consistent with the spec, no issues found. |
| crates/blockchain/src/key_manager.rs | Wraps secret_key.sign() with time_pq_sig_attestation_signing() guard and increments inc_pq_sig_attestation_signatures() only on success; instrumentation is correct. |
| crates/blockchain/src/lib.rs | Single-line addition of metrics::set_is_aggregator(is_aggregator) at node startup; straightforward and correct. |
| docs/metrics.md | Documentation synced to leanMetrics spec commit 2719baad; all 13 new metrics marked ✅, updated bucket sets and removed source label match the implementation. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant GossipHandler
    participant BlockchainStore
    participant StorageStore
    participant AtomicCounter
    participant PrometheusGauge
    Note over GossipHandler,PrometheusGauge: Gossip attestation path
    GossipHandler->>BlockchainStore: on_gossip_attestation()
    BlockchainStore->>BlockchainStore: time_pq_sig_attestation_verification()
    BlockchainStore->>BlockchainStore: inc_pq_sig_attestation_signatures_valid/invalid()
    BlockchainStore->>StorageStore: insert_gossip_signature()
    StorageStore->>StorageStore: read check (is_new?)
    StorageStore->>AtomicCounter: fetch_add(1) if is_new
    BlockchainStore->>PrometheusGauge: update_gossip_signatures(count)
    Note over GossipHandler,PrometheusGauge: Aggregation cycle
    BlockchainStore->>BlockchainStore: aggregate_committee_signatures()
    BlockchainStore->>BlockchainStore: time_pq_sig_aggregated_signatures_building()
    BlockchainStore->>StorageStore: delete_gossip_signatures(keys)
    StorageStore->>AtomicCounter: saturating_sub(keys.len())
    BlockchainStore->>PrometheusGauge: update_gossip_signatures(count)
    Note over GossipHandler,PrometheusGauge: Finalization / pruning path (gauge NOT updated)
    BlockchainStore->>StorageStore: update_checkpoints()
    StorageStore->>StorageStore: prune_gossip_signatures(slot)
    StorageStore->>AtomicCounter: saturating_sub(pruned)
    Note right of PrometheusGauge: ❌ No gauge push here — stale value until next insert/delete
```
Last reviewed commit: 17f10ea
crates/storage/src/store.rs
Outdated
```rust
backend,
new_payloads: Arc::new(Mutex::new(PayloadBuffer::new(NEW_PAYLOAD_CAP))),
known_payloads: Arc::new(Mutex::new(PayloadBuffer::new(AGGREGATED_PAYLOAD_CAP))),
gossip_signatures_count: Arc::new(AtomicUsize::new(0)),
```
`gossip_signatures_count` is initialized to `0`, but the `GossipSignatures` RocksDB table may already contain entries from a previous run. After a restart:

1. Counter starts at `0`.
2. Finalization at startup fires `prune_gossip_signatures(finalized_slot)` (line 466), which calls `fetch_sub(pruned, ...)` at line 584.
3. If `pruned > 0` for pre-existing entries, `fetch_sub` wraps around to `usize::MAX - pruned + 1`.
4. The metric then reports a large negative gauge value (e.g., `-5` when cast to `i64`), corrupting observability data.

`AtomicUsize::fetch_sub` does not panic on underflow—it silently wraps—so this is a silent data corruption of the metric.

**Fix:** Initialize the counter from the existing DB count at startup:

```rust
// In Store::init_store() or the constructor
let initial_count = self.count_gossip_signatures(); // scan table once
gossip_signatures_count: Arc::new(AtomicUsize::new(initial_count))
```

Alternatively, if an O(N) startup scan is undesirable, persist the count in a separate metadata key and restore it on initialization.
crates/storage/src/store.rs
Outdated
```rust
.put_batch(Table::GossipSignatures, entries)
.expect("put signature");
batch.commit().expect("commit");
self.gossip_signatures_count.fetch_add(1, Ordering::Relaxed);
```
`put_batch` is an upsert — if a signature for the same `(validator_id, data_root)` key already exists, it updates the existing entry without adding a new row. However, the counter is always incremented unconditionally at line 1069.

If `insert_gossip_signature` is called twice for the same key (e.g., duplicate gossip message before network-layer deduplication), the database will have 1 entry but the counter will be 2. When the entry is later deleted via `delete_gossip_signatures`, only one decrement happens, leaving the counter permanently inflated by 1. This causes permanent metric drift.

**Fix:** Check whether the key is new before incrementing:

```rust
let is_new = !self.backend.begin_read()
    .expect("read")
    .get(Table::GossipSignatures, &encode_signature_key(&key))
    .expect("get")
    .is_some();
// ... existing write logic ...
if is_new {
    self.gossip_signatures_count.fetch_add(1, Ordering::Relaxed);
}
```

Alternatively, add a guard in callers of `insert_gossip_signature` to ensure it is never called twice for the same key.
MegaRedHand
left a comment
LGTM. Left some comments
Address review comments:
- Cap reorg_depth chain walk at 128 steps to prevent unbounded traversal in pathological cases.
- Pin leanMetrics specification link to commit 2719baad instead of main branch.
Seed the in-memory AtomicUsize counter from a one-time DB scan at startup so it reflects pre-existing GossipSignatures entries after a node restart. Previously the counter started at 0, causing underflow on the first finalization-triggered prune.

Also check for existing keys before incrementing on insert_gossip_signature to prevent counter drift from duplicate upserts (put_batch overwrites silently but the counter always incremented).

Switch fetch_sub to saturating_sub in prune and delete paths as a defensive measure against future double-decrement bugs.
@greptile-apps review the PR again please
crates/blockchain/src/store.rs

```rust
    metrics::observe_fork_choice_reorg_depth(depth);
    info!(%old_head, %new_head, depth, "Fork choice reorg detected");
}
store.update_checkpoints(ForkCheckpoints::head_only(new_head));
```
**`lean_gossip_signatures` gauge not updated after pruning**

`update_checkpoints` (called on line 56) internally calls `prune_gossip_signatures`, which correctly decrements the `gossip_signatures_count` atomic in `crates/storage/src/store.rs`. However, the Prometheus gauge `lean_gossip_signatures` is only pushed via `metrics::update_gossip_signatures()` from two call sites:

1. After `insert_gossip_signature` in `on_gossip_attestation`
2. After `delete_gossip_signatures` in `aggregate_committee_signatures`

There is no push after the finalization-driven pruning path. This means every time `update_checkpoints` prunes stale gossip signatures (e.g., on finality advancement), the gauge will report an inflated count until the next individual attestation insert or aggregation-cycle delete. On a quiet node that received no new gossip between two finalization events, the gauge could stay inflated indefinitely.

**Fix:** add a metric push right after `update_checkpoints` returns:

```rust
store.update_checkpoints(ForkCheckpoints::head_only(new_head));
metrics::update_gossip_signatures(store.gossip_signatures_count());
```
```
How can I resolve this? If you propose a fix, please make it concise.| if depth >= MAX_REORG_DEPTH { | ||
| warn!(depth, "Reorg depth exceeded maximum, stopping walk"); | ||
| return Some(depth); | ||
| } |
**`MAX_REORG_DEPTH` cutoff may produce false reorg events**

When the walk reaches 128 parent hops without finding the target slot, the function unconditionally returns `Some(128)`. The caller (`update_head`) then calls both `metrics::inc_fork_choice_reorgs()` and `metrics::observe_fork_choice_reorg_depth(128)`, counting this as a confirmed reorg.

The original `is_reorg` had no depth limit — it continued walking and could correctly conclude "no reorg" even for long chains. With the new limit, any scenario where `old_head` and `new_head` are more than 128 slots apart (e.g., a node catching up from behind, or a one-time large jump in head) will be mis-classified as a reorg of depth 128. The `lean_fork_choice_reorgs_total` counter will be permanently inflated on nodes that experience long-range head updates.
A safer approach is to return `None` at the depth limit (treating the outcome as "unknown / not a verified reorg"), consistent with the existing behaviour at the bottom of the loop when chain headers are missing:

```rust
if depth >= MAX_REORG_DEPTH {
    warn!(depth, "Reorg depth exceeded maximum, stopping walk");
    return None; // Cannot confirm reorg; avoid false-positive count
}
```
Closes #76
Summary
Implements all remaining metrics from the leanMetrics specification that were missing in ethlambda. This brings our metric coverage from 21 to 34 metrics, and also updates 3 existing metrics to match the latest upstream spec.
New Metrics (13)
PQ Signature — Individual Attestations (6)
- `lean_pq_sig_attestation_signatures_total` — `secret_key.sign()` in `key_manager.rs`
- `lean_pq_sig_attestation_signatures_valid_total` — `signature.is_valid()` returns true in `on_gossip_attestation()`
- `lean_pq_sig_attestation_signatures_invalid_total` — `signature.is_valid()` returns false in `on_gossip_attestation()`
- `lean_pq_sig_attestation_signing_time_seconds` — `secret_key.sign()` call
- `lean_pq_sig_attestation_verification_time_seconds` — `signature.is_valid()` call
- `lean_pq_sig_aggregated_signatures_verification_time_seconds` — `verify_aggregated_signature()` call sites (block verification + gossip aggregated attestation)

PQ Signature — Aggregation (1)
- `lean_pq_sig_aggregated_signatures_building_time_seconds` — `aggregate_signatures()` call inside `aggregate_committee_signatures()`

Fork Choice — Store State (5)
- `lean_gossip_signatures` — `insert_gossip_signature()`, `delete_gossip_signatures()`
- `lean_latest_new_aggregated_payloads` — `insert_new_aggregated_payloads_batch()`, `promote_new_aggregated_payloads()`
- `lean_latest_known_aggregated_payloads` — `promote_new_aggregated_payloads()`
- `lean_committee_signatures_aggregation_time_seconds` — `aggregate_committee_signatures()` function (after the early return for empty sigs)
- `lean_fork_choice_reorg_depth` — `update_head()`

Validator (1)
- `lean_is_aggregator` — `BlockChain::spawn()` (1 = true, 0 = false)

Updated Existing Metrics (3)
- `lean_attestations_valid_total` / `lean_attestations_invalid_total` — changed from `IntCounterVec` with a `source` label to plain `IntCounter` (no labels); the `source=block,gossip` label was removed
- `lean_fork_choice_block_processing_time_seconds` — buckets extended from `[0.005, 0.01, 0.025, 0.05, 0.1, 1.0]` to `[..., 1.0, 1.25, 1.5, 2.0, 4.0]`
- `lean_connected_peers` — (docs only) `client` label set updated: `ethlambda, grandine, lighthouse`

Design Decisions
1. Gossip signatures count: AtomicUsize instead of table scan
The `lean_gossip_signatures` gauge needs the count of entries in the RocksDB `GossipSignatures` table. A naive implementation would call `prefix_iterator().count()` on every gossip attestation — a full table scan with O(N) cost per attestation, leading to O(N²) per slot on aggregator nodes.

Instead, we maintain an `AtomicUsize` counter on the `Store` struct that is:

- Seeded at startup from a one-time DB scan (`count_gossip_signatures()`) to correctly reflect pre-existing entries after a node restart
- Incremented in `insert_gossip_signature()` (only if the key is new — a point read guards against counter inflation from duplicate upserts)
- Decremented by `keys.len()` on `delete_gossip_signatures()` (using `saturating_sub`)
- Decremented by `pruned` on `prune_gossip_signatures()` (using `saturating_sub`)

This makes the count O(1). We use `Ordering::Relaxed` since the counter is only used for metrics (no correctness dependency on exact ordering). The `saturating_sub` prevents silent wraparound if any future code path introduces a double-decrement.

2. Timing guard placement in `aggregate_committee_signatures`

The `time_committee_signatures_aggregation()` guard is created after the `gossip_sigs.is_empty()` early-return check, not before. This avoids polluting the histogram with near-zero durations when there are no signatures to aggregate (the common case for non-aggregator nodes or between aggregation events).

3. `is_reorg()` → `reorg_depth()` refactor

Refactored `is_reorg(old_head, new_head, store) -> bool` to `reorg_depth(old_head, new_head, store) -> Option<u64>`. The function already walked the chain ancestry — now it counts steps while walking. Returns `Some(depth)` for reorgs and `None` for no-reorg or missing headers. The caller uses `if let Some(depth)` to emit both the existing `lean_fork_choice_reorgs_total` counter and the new `lean_fork_choice_reorg_depth` histogram. The walk is bounded to `MAX_REORG_DEPTH = 128` to prevent unbounded traversal in pathological cases.

4. Skipped metrics

Two metrics from the upstream spec were intentionally skipped:

- `lean_attestation_committee_subnet` — ethlambda has no committee subdivision (all validators attest to the same data)
- `lean_attestation_committee_count` — same reason; there are no attestation committees

These can be added as static values (`subnet=0`, `count=1`) if the spec requires them in the future.
5. Histogram buckets match upstream exactly
All histogram buckets were updated to match the upstream leanMetrics spec, including metrics that already existed but had different bucket boundaries (e.g., `fork_choice_block_processing` gained 4 upper buckets).

Files Changed

- `crates/blockchain/src/metrics.rs` — new metric definitions and registrations
- `crates/blockchain/src/store.rs` — `reorg_depth` refactor with bounded walk
- `crates/blockchain/src/key_manager.rs` — signing instrumentation
- `crates/blockchain/src/lib.rs` — `set_is_aggregator()` call at startup
- `crates/storage/src/store.rs` — `AtomicUsize` gossip sig counter (DB-seeded, upsert-safe, saturating sub), `PayloadBuffer::len()`, count accessors
- `docs/metrics.md` — synced to the spec (pinned to `2719baad`), all new metrics marked ✅

How to Test

- `cargo test --workspace --release` — all pass, no new test failures
- `cargo clippy --workspace -- -D warnings` — clean
- Run a node with the `--is-aggregator` flag, then scrape `/metrics` and verify: `lean_is_aggregator` is 1 for aggregator nodes, 0 for non-aggregators
cargo test --workspace --release— all pass, no new test failurescargo clippy --workspace -- -D warnings— clean--is-aggregatorflag, then scrape/metricsand verify:lean_is_aggregatoris 1 for aggregator nodes, 0 for non-aggregators