Description
Bug Type
others (please edit later)
Before submitting
- I have confirmed and searched that there are no similar problems in the existing issues and FAQ
Environment
- Server Version: 1.7.0
- Backend: hstore (3 PD + 3 Store + 3 Server)
- OS: macOS Apple M4
- Related PR: fix(docker): migrate single-node compose from host to bridge networking #2952
- Network: Docker bridge mode (static IPs via ipam)
- Docker Desktop: latest
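For context, the bridge-mode setup described above looks roughly like the following compose fragment. This is a sketch only: the network name, subnet, addresses, and image name are illustrative placeholders, not copied from the actual compose file in the related PR.

```yaml
# Illustrative sketch of bridge networking with static IPs via ipam.
# All names, the subnet, and the image are placeholders.
networks:
  pd-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.0.0/24

services:
  pd0:
    image: hugegraph/pd   # placeholder image name
    networks:
      pd-net:
        ipv4_address: 172.30.0.10
```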
Summary
In a 3-node PD cluster running in Docker bridge network mode, the cluster
only works correctly when pd0 wins the raft leader election. If pd1 or pd2
becomes leader, store registration fails, partitions are never distributed,
and HugeGraph servers cannot initialize.
Root Cause
In RaftEngine.java, getLeaderGrpcAddress() makes a live bolt RPC call to
discover the leader's gRPC address when the current node is a follower:
```java
return raftRpcClient.getGrpcAddress(
    raftNode.getLeaderId().getEndpoint().toString()
).get().getGrpcAddress(); // NPE in bridge mode, see below
```
This call fails in Docker bridge mode: the TCP connection is established, but the bolt RPC response never arrives properly, so a null response is dereferenced and CompletableFuture.get() surfaces the resulting NullPointerException wrapped in an ExecutionException (see the stack traces under Evidence).
This causes:
- redirectToLeader() fails with an NPE
- Store registration requests that land on follower PDs are never forwarded
- Stores register, but partitions are never distributed (partitionCount:0)
- HugeGraph servers are stuck indefinitely in a DEADLINE_EXCEEDED retry loop
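A defensive variant of getLeaderGrpcAddress() would bound the wait and fall back instead of dereferencing a missing response. The sketch below is not a proposed patch: rpcGetGrpcAddress() is a hypothetical stand-in for the real raftRpcClient call, and the fallback behavior is an assumption.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class LeaderAddressSketch {

    // Hypothetical stand-in for raftRpcClient.getGrpcAddress(...): in bridge
    // mode the response never arrives, modeled here as a null payload.
    static CompletableFuture<String> rpcGetGrpcAddress(boolean bridgeMode) {
        return CompletableFuture.completedFuture(bridgeMode ? null : "172.30.0.10:8686");
    }

    // Defensive variant: bounded wait, explicit null check, and a fallback
    // instead of an unguarded .get().getGrpcAddress() chain.
    static String getLeaderGrpcAddress(boolean bridgeMode, String fallback) {
        try {
            String addr = rpcGetGrpcAddress(bridgeMode).get(3, TimeUnit.SECONDS);
            return (addr != null) ? addr : fallback;
        } catch (Exception e) {
            return fallback; // timeout, interruption, or RPC failure
        }
    }

    public static void main(String[] args) {
        System.out.println(getLeaderGrpcAddress(false, "unknown")); // 172.30.0.10:8686
        System.out.println(getLeaderGrpcAddress(true, "unknown"));  // unknown
    }
}
```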
Why It Only Affects Bridge Mode
In host network mode, all PD nodes communicate over 127.0.0.1, so the bolt RPC
call succeeds instantly over loopback. In bridge mode, the call traverses
Docker's virtual network and the response is never delivered.
Why It's Nondeterministic
JRaft leader election is timing-based. If pd0 wins, isLeader() returns true
and the broken code path is never reached. If pd1 or pd2 wins, pd0 becomes
a follower and hits the NPE on every redirect attempt.
Reproduction
- Run the 3-node cluster in Docker bridge mode
- Check which PD won the leader election:

```shell
docker exec hg-pd0 grep "becomes leader" /hugegraph-pd/logs/hugegraph-pd-stdout.log
docker exec hg-pd1 grep "becomes leader" /hugegraph-pd/logs/hugegraph-pd-stdout.log
docker exec hg-pd2 grep "becomes leader" /hugegraph-pd/logs/hugegraph-pd-stdout.log
```

- If pd1 or pd2 is leader, check store partitions:

```shell
curl -u store:admin http://localhost:8620/v1/stores | grep partitionCount
```

→ store1 and store2 will show partitionCount:0

- Check pd0 logs for the NPE:

```shell
docker exec hg-pd0 grep "getLeaderGrpcAddress" /hugegraph-pd/logs/hugegraph-pd-stdout.log
```
Evidence
The NPE occurs in two separate call paths:
**Path 1: fires immediately when the node becomes a follower during leader election:**

```
java.util.concurrent.ExecutionException: java.lang.NullPointerException
	at java.util.concurrent.CompletableFuture.reportGet(Unknown Source)
	at java.util.concurrent.CompletableFuture.get(Unknown Source)
	at org.apache.hugegraph.pd.raft.RaftEngine.getLeaderGrpcAddress(RaftEngine.java:242)
	at org.apache.hugegraph.pd.service.PDService.onRaftLeaderChanged(PDService.java:1345)
	at org.apache.hugegraph.pd.raft.RaftStateMachine.lambda$onStartFollowing$1(RaftStateMachine.java:141)
```

**Path 2: fires on every store registration redirect attempt:**

```
java.util.concurrent.ExecutionException: java.lang.NullPointerException
	at java.util.concurrent.CompletableFuture.reportGet(Unknown Source)
	at java.util.concurrent.CompletableFuture.get(Unknown Source)
	at org.apache.hugegraph.pd.raft.RaftEngine.getLeaderGrpcAddress(RaftEngine.java:242)
	at org.apache.hugegraph.pd.service.PDService.redirectToLeader(PDService.java:1275)
```
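Both traces show the same mechanism: the NPE fires inside the asynchronous RPC stage, and CompletableFuture.get() rethrows it wrapped in an ExecutionException, which is why reportGet appears at the top of the trace. A minimal self-contained illustration (independent of the PD codebase):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class WrapDemo {

    // Returns the class name of the failure that get() reports when an
    // async stage dereferences a null response, mirroring the traces above.
    static String causeOfGetFailure() throws InterruptedException {
        CompletableFuture<String> future = CompletableFuture
                .supplyAsync(() -> (String) null)       // RPC "response" is null
                .thenApply(resp -> resp.toUpperCase()); // NPE inside the stage
        try {
            future.get();
            return "no exception";
        } catch (ExecutionException e) {
            // get() wraps the in-stage NPE, exactly as in the stack traces
            return e.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(causeOfGetFailure()); // NullPointerException
    }
}
```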
Proof: restarting the non-pd0 leader, thereby forcing pd0 to win the election,
immediately resolved the issue: partitionCount:12 on all 3 stores and all 9
containers healthy. Reproducible 100% of the time.