Skip to content

perf(saexec): Remove TrieDB memory leak#236

Open
alarso16 wants to merge 29 commits intomainfrom
alarso16/triedb-dereference
Open

perf(saexec): Remove TrieDB memory leak#236
alarso16 wants to merge 29 commits intomainfrom
alarso16/triedb-dereference

Conversation

@alarso16
Copy link
Contributor

@alarso16 alarso16 commented Feb 25, 2026

Currently, SAE follows this pattern:
If we call triedb.Commit() every N blocks:

State 0 will be on disk
States [1, N] will be generated in memory
State N will be committed, moving all dirty nodes at that root from memory onto disk
All outdated nodes in [1, N) will remain in the dirty cache

Although not a correctness bug, keeping these tries is generally unnecessary and leads to a memory leak.

The expected functionality should be:

  • For validators: keep the minimum states around to guarantee that all consensus-necessary states are accessible
  • For API nodes: keep all states on disk.

Now, all nodes will default to NOT storing all states. API nodes must set the necessary config

@alarso16 alarso16 marked this pull request as ready for review February 25, 2026 16:56
saedb/saedb.go Outdated

// Record tracks the root and may commit the trie associated with the root
// to the database if the height is on an multiple of [CommitTrieDBEvery].
func (e *StateRecorder) Record(root common.Hash, height uint64) error {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same logic as coreth's core/state_manager.go logic, but simpler. There is a "we're not ready to commit yet, but we have a huge amount of state in memory, let's offload some into the database" function, but I don't think it's necessary until proven so

saedb/saedb.go Outdated
}

// Close commits the most recent state to the database for shutdown.
func (e *StateRecorder) Close() (errs error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better pattern for this? I could avoid the defer, but we should report all these errors, right?

Copy link
Collaborator

@ARR4N ARR4N left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for digging deeply enough to figure out that this would be a problem, and for fixing it!

// We expect to not find blocks older than [saedb.StateHistory]
for _, b := range chain.AllBlocks() {
sdb, err := e.StateDB(b.PostExecutionStateRoot())
inMemory := b.NumberU64()+saedb.StateHistory > uint64(numBlocks) //nolint:gosec // positive plus positive
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you define numBlocks as a uint64 then the linter won't complain:

numBlocks := uint64(saedb.StateHistory) + 10

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to just make StateHistory a uint64? the buffer takes an int for some reason...

final := chain.Last()
require.NoErrorf(t, final.WaitUntilExecuted(ctx), "%T.WaitUntilExecuted() on last-enqueued block", final)

t.Run("remove in memory state", func(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests that share logic and only differ in the declaration of parameters are easier to reason about. It's also unnecessary to check that a non-nil StateDB is returned when there's a nil error because that's implied by all other usage.

	t.Run("access state", func(t *testing.T) {
		for _, b := range chain.AllBlocks() {
			var want testerr.Want

			switch num := b.NumberU64(); {
			case num > numBlocks-saedb.StateHistory:
				// Still referenced
			case saedb.ShouldCommitTrieDB(num):
				// On disk
			default:
				want = testerr.As(func(got *trie.MissingNodeError) string {
					if r := b.PostExecutionStateRoot(); got.NodeHash != r {
						return fmt.Sprintf("%T for hash %#x", got, r)
					}
					return ""
				})
			}

			_, err := e.StateDB(b.PostExecutionStateRoot())
			if diff := testerr.Diff(err, want); diff != "" {
				t.Errorf("%T.StateDB([post-execution root of block %d]) %s", e, b.NumberU64(), diff)
			}
		}
	})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does look cleaner. I also tried to share with the recover portion of the test, since it's all the same logic besides one line

Copy link
Collaborator

@ARR4N ARR4N left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Primarily readability and structure but the approach LGTM in general.

Comment on lines +94 to +97
// WorstCaseState returns a [worstcase.State] at the starting at the provided settled block.
func (s *stateRecorder) WorstCaseState(hooks hook.Points, config *params.ChainConfig, settled *blocks.Block) (*worstcase.State, error) {
return worstcase.NewState(hooks, config, s.cache, settled, s.snaps)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't make sense for the saexec.Executor to construct a worstcase.State. Although the plumbing works, the abstraction is strange and is being governed by the former.

This refactoring allows the Executor to be injected into the worstcase.State constructor instead of calling it:

package worstcase

type StateDBOpener interface {
  StateDB(root common.Hash) (*state.StateDB, error)
}

func New(hook.Points, *params.CacheConfig, *types.Block, StateDBOpener)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I didn't love the abstraction, but didn't see a better way. This does seem to be an improvement, but still feels a little unnatural (specifically for testing). I added it, and let me know if you agree and we can refactor this constructor some more

@alarso16 alarso16 requested a review from ARR4N March 3, 2026 16:10
Copy link
Collaborator

@ARR4N ARR4N left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much there. The move of the interface and test helper is trivial and I just have one open question. Please DM to take another look and we can almost certainly merge this today.

}

// If we have new state, commit changes to database for easier startup.
if err := s.cache.TrieDB().Commit(root, true /* log */); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why log here when we're not doing it in the regular path?

Copy link
Contributor Author

@alarso16 alarso16 Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that logging the final state root known to the database is helpful, specifically if there's issues restarting the VM after shutdown.

FWIW setting log = false still does log, but at a debug instead of info level

@alarso16 alarso16 changed the title perf(saexec): Remove stale tries from memory perf(saexec): Remove TrieDB memory leak Mar 6, 2026
}
}

func TestRegressionLoseStateBeforeSettlement(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this regression test is still necessary....

// cancel the returned context, which is useful when waiting for blocks that
// can never finish execution because of an error.
func newSUT(tb testing.TB, hooks *saehookstest.Stub) (context.Context, SUT) {
func newSUT(tb testing.TB, opts ...sutOption) (context.Context, *SUT) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I now had two optional fields, seemed better to just provide the infrastructure.

@alarso16 alarso16 marked this pull request as ready for review March 6, 2026 20:27
@alarso16 alarso16 force-pushed the alarso16/triedb-dereference branch from 16a0619 to 1db3ce7 Compare March 10, 2026 13:31
Copy link
Collaborator

@ARR4N ARR4N left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't finished the review but I've noticed something that requires a bit of work (as a test) so thought I'd share it early.

continue
}
vm.blocks.Delete(s.Hash())
vm.exec.Untrack(s.PostExecutionStateRoot())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the last-settled block has no transactions (allowed by SAE) and len(settles) > 1 then b.LastSettled().ParentBlock() will have the same post-execution state root and we'll un-track it early.

	keep := b.LastSettled()
	for _, s := range settles {
		if s.Hash() != keep.Hash() {
			vm.blocks.Delete(s.Hash())
		}
		// If `s` is the parent of `keep` and the latter has no transactions
		// then we MUST NOT dereference the state root too early.
		if r := s.PostExecutionStateRoot(); r != keep.PostExecutionStateRoot() {
			vm.exec.Untrack(r)
		}
	}
	if h := parentLastSettled.Hash(); h != keep.Hash() { // i.e. `parentLastSettled` was the last block's `keep`
		vm.blocks.Delete(h)
		vm.exec.Untrack(parentLastSettled.PostExecutionStateRoot())
	}

I presume this would be catastrophic as block-building couldn't then open the settled state, so such a scenario requires a test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the second half of my block is incorrect and should follow the same pattern as in the loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each time a state root is seen by the Tracker, a reference is added. Thus, it won't removed until Untrack is called as many times as Track was called. Since each block will call Track, even if the state root is not unique, everything works fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment at the location of use (here) to explain that we don't need to check for duplicate roots

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants