feat(ethexe/node-loader): support multiple validator nodes#5208

Open
playX18 wants to merge 5 commits into master from ap-multi-validator-loader

@playX18 playX18 commented Mar 10, 2026

Added support for multiple validator nodes in node-loader. It randomly switches between them and reconnects if an API call fails. After too many failures a connection is removed, and once all connections are dead the program terminates.
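
For readers skimming the thread, the selection and eviction behaviour described above can be sketched roughly as follows. This is a std-only illustration, not the code in this PR: `EndpointPool`, `MAX_FAILURES`, the reset-on-success rule, and the xorshift generator are all assumptions made for the example (the real loader manages async RPC clients, and its failure threshold is not stated here).

```rust
// Hypothetical sketch: names and thresholds are illustrative, not taken from the PR.

/// Assumed number of failures after which an endpoint is dropped from the pool.
const MAX_FAILURES: u32 = 3;

#[derive(Debug)]
struct Endpoint {
    url: String,
    failures: u32,
}

#[derive(Debug)]
struct EndpointPool {
    endpoints: Vec<Endpoint>,
    // Tiny xorshift state so the example needs no external crates.
    rng_state: u64,
}

impl EndpointPool {
    fn new(urls: &[&str], seed: u64) -> Self {
        Self {
            endpoints: urls
                .iter()
                .map(|url| Endpoint { url: url.to_string(), failures: 0 })
                .collect(),
            rng_state: seed.max(1), // xorshift must not start at zero
        }
    }

    fn len(&self) -> usize {
        self.endpoints.len()
    }

    fn url(&self, idx: usize) -> &str {
        &self.endpoints[idx].url
    }

    fn next_random(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        x
    }

    /// Picks a random live endpoint, or `None` once every endpoint has been removed.
    fn pick(&mut self) -> Option<usize> {
        if self.endpoints.is_empty() {
            return None;
        }
        let len = self.endpoints.len() as u64;
        Some((self.next_random() % len) as usize)
    }

    /// Records a failed call. Returns `true` if the endpoint reached the
    /// failure limit and was removed from the pool.
    fn report_failure(&mut self, idx: usize) -> bool {
        let endpoint = &mut self.endpoints[idx];
        endpoint.failures += 1;
        if endpoint.failures >= MAX_FAILURES {
            self.endpoints.remove(idx);
            true
        } else {
            false
        }
    }

    /// Resets the failure counter after a successful call (an assumed policy).
    fn report_success(&mut self, idx: usize) {
        self.endpoints[idx].failures = 0;
    }
}
```

A caller would loop on `pick()`, report the outcome of each call, and exit once `pick()` returns `None`, which corresponds to the "all connections dead" case.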

Also added a start-local-network.sh script that lets users start a proper Vara.eth network with multiple validators, all connected to a single Anvil instance. The script can also start node-loader automatically.

@playX18 playX18 requested a review from grishasobol March 10, 2026 05:35
@playX18 playX18 self-assigned this Mar 10, 2026
@playX18 playX18 added the A0-pleasereview (PR is ready to be reviewed by the team) and D8-ethexe (ethexe-related PR) labels Mar 10, 2026

semanticdiff-com bot commented Mar 10, 2026

Review changes with SemanticDiff

Changed Files

File                                    Status
ethexe/node-loader/src/args.rs          71% smaller
ethexe/node-loader/src/main.rs          46% smaller
ethexe/node-loader/src/batch.rs         10% smaller
ethexe/scripts/start-local-network.sh   Unsupported file format

@gemini-code-assist

Summary of Changes

This pull request introduces support for multiple validator nodes in the node-loader, improving its robustness and ability to handle API failures. It also includes a script to easily start a local Vara.eth network with multiple validators, streamlining the development and testing process.

Highlights

  • Multiple Validator Support: The node-loader now supports connecting to multiple validator nodes, enhancing resilience and load distribution.
  • Randomized Node Switching: The node-loader randomly switches between validator nodes and reconnects upon API failures.
  • Local Network Script: A new start-local-network.sh script simplifies setting up a Vara.eth network with multiple validators connected to a single Anvil instance.

Changelog

  • ethexe/node-loader/src/args.rs
    • The ethexe_node argument was replaced with ethexe_nodes to accept a list of validator node endpoints.
  • ethexe/node-loader/src/batch.rs
    • Introduced EthexeRpcPool to manage multiple RPC endpoints for validator nodes.
    • Implemented retry logic with reconnection for RPC requests to handle potential failures.
    • Modified BatchPool::new to accept a list of ethexe RPC URLs and initialize the EthexeRpcPool.
  • ethexe/node-loader/src/main.rs
    • Updated BatchPool::new call to pass the list of ethexe nodes.
  • ethexe/scripts/start-local-network.sh
    • Added a new script to automate the setup of a local Vara.eth network with multiple validators.

Activity

  • Added support for multiple validator nodes.
  • Implemented random switching and reconnection logic.
  • Introduced start-local-network.sh script for easy local network setup.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for multiple validator nodes in the node-loader. It adds an EthexeRpcPool to manage connections to multiple ethexe-node endpoints, with logic for random selection, reconnection, and retries on failure. A new script, start-local-network.sh, is also included to facilitate setting up a local test network with multiple validators.

My review focuses on the new connection management and retry logic. I've identified a potential race condition in connection handling that could lead to creating unnecessary connections, and significant code duplication in the retry logic for RPC calls. I've provided suggestions to improve both of these aspects for better performance and maintainability.

Comment on lines +133 to +177
    async fn reconnect_client(
        &self,
        endpoint_idx: usize,
        api: &Ethereum,
    ) -> Result<Arc<VaraEthApi>> {
        let endpoint = self
            .endpoints
            .get(endpoint_idx)
            .ok_or_else(|| anyhow!("invalid endpoint index: {endpoint_idx}"))?;

        tracing::warn!(
            endpoint_idx,
            endpoint = %endpoint.url,
            "Connecting ethexe RPC client"
        );

        let client = Arc::new(VaraEthApi::new(&endpoint.url, api.clone()).await?);
        let mut lock = endpoint.client.write().await;
        *lock = Some(client.clone());

        tracing::info!(
            endpoint_idx,
            endpoint = %endpoint.url,
            "Connected ethexe RPC client"
        );

        Ok(client)
    }

    async fn get_or_connect_client(
        &self,
        endpoint_idx: usize,
        api: &Ethereum,
    ) -> Result<Arc<VaraEthApi>> {
        let endpoint = self
            .endpoints
            .get(endpoint_idx)
            .ok_or_else(|| anyhow!("invalid endpoint index: {endpoint_idx}"))?;

        if let Some(client) = endpoint.client.read().await.clone() {
            return Ok(client);
        }

        self.reconnect_client(endpoint_idx, api).await
    }

medium

There's a potential race condition in get_or_connect_client that can lead to creating unnecessary connections. If multiple threads call this function for an unconnected endpoint, they might all see that the client is None and proceed to call reconnect_client, resulting in multiple connections being established where only one is needed. The last one to acquire the write lock will win, and the other connections will be dropped.

To fix this, you can use a double-checked locking pattern. This involves checking for the client with a read lock, then acquiring a write lock and checking again before creating a new connection. This ensures that only one thread will create the connection.

I suggest refactoring get_or_connect_client and inlining reconnect_client to implement this pattern.

    async fn get_or_connect_client(
        &self,
        endpoint_idx: usize,
        api: &Ethereum,
    ) -> Result<Arc<VaraEthApi>> {
        let endpoint = self
            .endpoints
            .get(endpoint_idx)
            .ok_or_else(|| anyhow!("invalid endpoint index: {endpoint_idx}"))?;

        // First, check with a read lock for efficiency.
        if let Some(client) = endpoint.client.read().await.clone() {
            return Ok(client);
        }

        // If no client, acquire a write lock to create one.
        let mut client_guard = endpoint.client.write().await;

        // Double-check in case another thread created the client while we were waiting for the lock.
        if let Some(client) = client_guard.clone() {
            return Ok(client);
        }

        // Still no client, so we are the one to create it.
        tracing::warn!(
            endpoint_idx,
            endpoint = %endpoint.url,
            "Connecting ethexe RPC client"
        );

        let client = Arc::new(VaraEthApi::new(&endpoint.url, api.clone()).await?);
        *client_guard = Some(client.clone());

        tracing::info!(
            endpoint_idx,
            endpoint = %endpoint.url,
            "Connected ethexe RPC client"
        );

        Ok(client)
    }
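
To make the double-checked pattern easier to verify in isolation, here is a std-only sketch that uses OS threads and `std::sync::RwLock` instead of the async lock used in the PR. `Client`, `Slot`, and `connects_after_concurrent_access` are hypothetical stand-ins; the counter exists only to show that a single connection is created even under concurrent access.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};
use std::thread;

/// Stand-in for an expensive-to-create client (hypothetical, not `VaraEthApi`).
struct Client {
    id: usize,
}

struct Slot {
    client: RwLock<Option<Arc<Client>>>,
    /// Counts how many times a client was actually created.
    connects: AtomicUsize,
}

impl Slot {
    fn new() -> Self {
        Self {
            client: RwLock::new(None),
            connects: AtomicUsize::new(0),
        }
    }

    fn get_or_connect(&self) -> Arc<Client> {
        // Fast path: a shared read lock is enough when the client already exists.
        if let Some(client) = self.client.read().unwrap().clone() {
            return client;
        }

        // Slow path: take the exclusive lock, then check again, because another
        // thread may have created the client while we were waiting.
        let mut guard = self.client.write().unwrap();
        if let Some(client) = guard.clone() {
            return client;
        }

        let id = self.connects.fetch_add(1, Ordering::SeqCst);
        let client = Arc::new(Client { id });
        *guard = Some(client.clone());
        client
    }
}

/// Spawns `threads` workers that all request the client at once and returns
/// how many clients were created in total.
fn connects_after_concurrent_access(threads: usize) -> usize {
    let slot = Arc::new(Slot::new());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let slot = Arc::clone(&slot);
            thread::spawn(move || slot.get_or_connect().id)
        })
        .collect();

    for handle in handles {
        // Every worker must observe the same client (the first one created).
        assert_eq!(handle.join().unwrap(), 0);
    }
    slot.connects.load(Ordering::SeqCst)
}
```

One trade-off to keep in mind for the async version: the write guard is held across the connect call, so other tasks asking for the same endpoint wait until the connection attempt finishes or fails. That is what prevents duplicate connections, but it also serializes callers on a slow endpoint.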

Comment on lines +186 to +370
    async fn request_code_validation(
        &self,
        endpoint_idx: usize,
        api: &Ethereum,
        code: &[u8],
    ) -> Result<CodeId> {
        for attempt in 1..=RPC_MAX_ATTEMPTS {
            let client = match self.get_or_connect_client(endpoint_idx, api).await {
                Ok(client) => client,
                Err(err) if attempt < RPC_MAX_ATTEMPTS && is_retryable_rpc_error(&err) => {
                    tracing::warn!(
                        endpoint_idx,
                        attempt,
                        max_attempts = RPC_MAX_ATTEMPTS,
                        error = %err,
                        "failed to acquire ethexe RPC client; reconnecting and retrying"
                    );
                    self.invalidate_client(endpoint_idx).await;
                    continue;
                }
                Err(err) => {
                    tracing::error!(
                        endpoint_idx,
                        attempt,
                        max_attempts = RPC_MAX_ATTEMPTS,
                        error = %err,
                        "failed to acquire ethexe RPC client"
                    );
                    return Err(err);
                }
            };

            match client.router().request_code_validation(code).await {
                Ok((_, code_id)) => return Ok(code_id),
                Err(err) if attempt < RPC_MAX_ATTEMPTS && is_retryable_rpc_error(&err) => {
                    tracing::warn!(
                        endpoint_idx,
                        attempt,
                        max_attempts = RPC_MAX_ATTEMPTS,
                        error = %err,
                        "request_code_validation failed; reconnecting and retrying"
                    );
                    self.invalidate_client(endpoint_idx).await;
                }
                Err(err) => {
                    tracing::error!(
                        endpoint_idx,
                        attempt,
                        max_attempts = RPC_MAX_ATTEMPTS,
                        error = %err,
                        "request_code_validation failed"
                    );
                    return Err(err.into());
                }
            }
        }

        Err(anyhow!("request_code_validation exhausted retries"))
    }

    async fn wait_for_code_validation(
        &self,
        endpoint_idx: usize,
        api: &Ethereum,
        code_id: CodeId,
    ) -> Result<()> {
        for attempt in 1..=RPC_MAX_ATTEMPTS {
            let client = match self.get_or_connect_client(endpoint_idx, api).await {
                Ok(client) => client,
                Err(err) if attempt < RPC_MAX_ATTEMPTS && is_retryable_rpc_error(&err) => {
                    tracing::warn!(
                        endpoint_idx,
                        attempt,
                        max_attempts = RPC_MAX_ATTEMPTS,
                        error = %err,
                        "failed to acquire ethexe RPC client; reconnecting and retrying"
                    );
                    self.invalidate_client(endpoint_idx).await;
                    continue;
                }
                Err(err) => {
                    tracing::error!(
                        endpoint_idx,
                        attempt,
                        max_attempts = RPC_MAX_ATTEMPTS,
                        error = %err,
                        "failed to acquire ethexe RPC client"
                    );
                    return Err(err);
                }
            };

            match client.router().wait_for_code_validation(code_id).await {
                Ok(_) => return Ok(()),
                Err(err) if attempt < RPC_MAX_ATTEMPTS && is_retryable_rpc_error(&err) => {
                    tracing::warn!(
                        endpoint_idx,
                        attempt,
                        max_attempts = RPC_MAX_ATTEMPTS,
                        error = %err,
                        "wait_for_code_validation failed; reconnecting and retrying"
                    );
                    self.invalidate_client(endpoint_idx).await;
                }
                Err(err) => {
                    tracing::error!(
                        endpoint_idx,
                        attempt,
                        max_attempts = RPC_MAX_ATTEMPTS,
                        error = %err,
                        "wait_for_code_validation failed"
                    );
                    return Err(err.into());
                }
            }
        }

        Err(anyhow!("wait_for_code_validation exhausted retries"))
    }

    async fn send_message_injected(
        &self,
        endpoint_idx: usize,
        api: &Ethereum,
        actor: ActorId,
        payload: &[u8],
        value: u128,
    ) -> Result<MessageId> {
        for attempt in 1..=RPC_MAX_ATTEMPTS {
            let client = match self.get_or_connect_client(endpoint_idx, api).await {
                Ok(client) => client,
                Err(err) if attempt < RPC_MAX_ATTEMPTS && is_retryable_rpc_error(&err) => {
                    tracing::warn!(
                        endpoint_idx,
                        attempt,
                        max_attempts = RPC_MAX_ATTEMPTS,
                        error = %err,
                        "failed to acquire ethexe RPC client; reconnecting and retrying"
                    );
                    self.invalidate_client(endpoint_idx).await;
                    continue;
                }
                Err(err) => {
                    tracing::error!(
                        endpoint_idx,
                        attempt,
                        max_attempts = RPC_MAX_ATTEMPTS,
                        error = %err,
                        "failed to acquire ethexe RPC client"
                    );
                    return Err(err);
                }
            };

            match client
                .mirror(actor)
                .send_message_injected(payload, value)
                .await
            {
                Ok(mid) => return Ok(mid),
                Err(err) if attempt < RPC_MAX_ATTEMPTS && is_retryable_rpc_error(&err) => {
                    tracing::warn!(
                        endpoint_idx,
                        attempt,
                        max_attempts = RPC_MAX_ATTEMPTS,
                        error = %err,
                        "send_message_injected failed; reconnecting and retrying"
                    );
                    self.invalidate_client(endpoint_idx).await;
                }
                Err(err) => {
                    tracing::error!(
                        endpoint_idx,
                        attempt,
                        max_attempts = RPC_MAX_ATTEMPTS,
                        error = %err,
                        "send_message_injected failed"
                    );
                    return Err(err.into());
                }
            }
        }

        Err(anyhow!("send_message_injected exhausted retries"))
    }

medium

The retry logic is duplicated across request_code_validation, wait_for_code_validation, and send_message_injected. This makes the code harder to maintain and prone to errors if one of them is updated and the others are not.

You can extract this logic into a generic helper function that takes a closure for the specific RPC call. This would reduce code duplication and improve maintainability.

Here's an example of how such a generic helper function could look:

async fn with_retry<T, F, Fut>(&self, endpoint_idx: usize, api: &Ethereum, call_name: &str, f: F) -> Result<T>
where
    F: Fn(Arc<VaraEthApi>) -> Fut,
    Fut: std::future::Future<Output = Result<T>>,
{
    for attempt in 1..=RPC_MAX_ATTEMPTS {
        let client = match self.get_or_connect_client(endpoint_idx, api).await {
            Ok(client) => client,
            Err(err) if attempt < RPC_MAX_ATTEMPTS && is_retryable_rpc_error(&err) => {
                tracing::warn!(
                    endpoint_idx,
                    attempt,
                    max_attempts = RPC_MAX_ATTEMPTS,
                    error = %err,
                    "failed to acquire ethexe RPC client for {}; reconnecting and retrying",
                    call_name
                );
                self.invalidate_client(endpoint_idx).await;
                continue;
            }
            Err(err) => {
                tracing::error!(
                    endpoint_idx,
                    attempt,
                    max_attempts = RPC_MAX_ATTEMPTS,
                    error = %err,
                    "failed to acquire ethexe RPC client for {}",
                    call_name
                );
                return Err(err);
            }
        };

        match f(client).await {
            Ok(result) => return Ok(result),
            Err(err) if attempt < RPC_MAX_ATTEMPTS && is_retryable_rpc_error(&err) => {
                tracing::warn!(
                    endpoint_idx,
                    attempt,
                    max_attempts = RPC_MAX_ATTEMPTS,
                    error = %err,
                    "{} failed; reconnecting and retrying",
                    call_name
                );
                self.invalidate_client(endpoint_idx).await;
            }
            Err(err) => {
                tracing::error!(
                    endpoint_idx,
                    attempt,
                    max_attempts = RPC_MAX_ATTEMPTS,
                    error = %err,
                    "{} failed",
                    call_name
                );
                return Err(err.into());
            }
        }
    }

    Err(anyhow!("{} exhausted retries", call_name))
}

You could then refactor request_code_validation like this:

async fn request_code_validation(
    &self,
    endpoint_idx: usize,
    api: &Ethereum,
    code: &[u8],
) -> Result<CodeId> {
    self.with_retry(endpoint_idx, api, "request_code_validation", |client| async move {
        client.router().request_code_validation(code).await.map(|(_, code_id)| code_id)
    }).await
}
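
The control flow of such a helper can be checked without any async runtime. The following synchronous, std-only sketch keeps only the retry skeleton: `CallError`, `MAX_ATTEMPTS`, and the `on_retry` hook (standing in for `invalidate_client`) are assumptions for illustration, and logging is omitted.

```rust
/// Assumed attempt limit; the PR's `RPC_MAX_ATTEMPTS` value is not shown in this thread.
const MAX_ATTEMPTS: u32 = 3;

/// Hypothetical error type with an explicit retryable/non-retryable split.
#[derive(Debug, PartialEq)]
enum CallError {
    Transient(String),
    Fatal(String),
}

fn is_retryable(err: &CallError) -> bool {
    matches!(err, CallError::Transient(_))
}

/// Runs `call` up to `MAX_ATTEMPTS` times. `on_retry` is invoked before each
/// retry, which is where a cached client would be invalidated.
fn with_retry<T>(
    mut call: impl FnMut(u32) -> Result<T, CallError>,
    mut on_retry: impl FnMut(),
) -> Result<T, CallError> {
    let mut attempt = 1;
    loop {
        match call(attempt) {
            Ok(value) => return Ok(value),
            // Retry only while attempts remain and the error is transient.
            Err(err) if attempt < MAX_ATTEMPTS && is_retryable(&err) => {
                on_retry();
                attempt += 1;
            }
            // Either a fatal error or the final attempt: surface it as-is.
            Err(err) => return Err(err),
        }
    }
}
```

Because the last attempt always returns from inside the `match`, this shape needs no trailing "exhausted retries" error after the loop. Passing the attempt number to the closure is only a convenience for the example.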
