[Bug] 3-node PD cluster fails when pd0 is not raft leader — getLeaderGrpcAddress() NPE in bridge network mode #2959

@bitflicker64

Bug Type

others (please edit later)

Before submit

  • I have searched the existing issues and documentation and confirmed that no similar problem has been reported

Environment

Summary

In a 3-node PD cluster running in Docker bridge network mode, the cluster
only works correctly when pd0 wins the raft leader election. If pd1 or pd2
becomes leader, store registration fails, partitions are never distributed,
and HugeGraph servers cannot initialize.

Root Cause

In RaftEngine.java, getLeaderGrpcAddress() makes a live bolt RPC call to
discover the leader's gRPC address when the current node is a follower:

return raftRpcClient.getGrpcAddress(
    raftNode.getLeaderId().getEndpoint().toString()
).get().getGrpcAddress();  // fails in bridge mode: get() throws ExecutionException caused by NPE

This call fails in Docker bridge mode: the TCP connection is established
successfully, but the bolt RPC does not return a usable response. As the stack
traces under Evidence show, CompletableFuture.get() then throws an
ExecutionException wrapping a NullPointerException, and the call has no timeout
or null check to fall back on.

This causes:

  1. redirectToLeader() fails with NPE
  2. Store registration requests landing on follower PDs are never forwarded
  3. Stores register but partitions are never distributed (partitionCount:0)
  4. HugeGraph servers stuck in DEADLINE_EXCEEDED loop indefinitely
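
One way the lookup could be hardened is sketched below. This is not the project's code or a proposed patch: `LeaderAddressLookup`, `GrpcAddressResponse`, `resolve`, and the sample address `pd1:8686` are hypothetical names and values chosen for illustration, and the only assumption taken from the snippet above is that the RPC client hands back a `CompletableFuture` whose result exposes `getGrpcAddress()`. The idea is to bound the wait and turn every failure mode into an empty result that the caller can handle, instead of letting an exception escape from `redirectToLeader()`.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class LeaderAddressLookup {

    /** Hypothetical stand-in for the RPC response type; not the real PD class. */
    static final class GrpcAddressResponse {
        private final String grpcAddress;

        GrpcAddressResponse(String grpcAddress) {
            this.grpcAddress = grpcAddress;
        }

        String getGrpcAddress() {
            return grpcAddress;
        }
    }

    /**
     * Waits at most timeoutMs for the response. Returns Optional.empty() when the
     * future fails, times out, completes with null, or carries no address.
     */
    static Optional<String> resolve(CompletableFuture<GrpcAddressResponse> future,
                                    long timeoutMs) {
        try {
            GrpcAddressResponse response = future.get(timeoutMs, TimeUnit.MILLISECONDS);
            if (response == null) {
                return Optional.empty();
            }
            return Optional.ofNullable(response.getGrpcAddress());
        } catch (InterruptedException e) {
            // Preserve the interrupt flag for the calling thread.
            Thread.currentThread().interrupt();
            return Optional.empty();
        } catch (ExecutionException | TimeoutException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Made-up address, used only to exercise the happy path.
        CompletableFuture<GrpcAddressResponse> ok =
                CompletableFuture.completedFuture(new GrpcAddressResponse("pd1:8686"));

        // Simulates the failure shape reported in this issue.
        CompletableFuture<GrpcAddressResponse> failed = new CompletableFuture<>();
        failed.completeExceptionally(new NullPointerException("simulated"));

        System.out.println(resolve(ok, 100));
        System.out.println(resolve(failed, 100));
    }
}
```

With a result like this, a follower that cannot discover the leader's gRPC address could retry or fall back to another source of the address rather than failing the redirect outright. Which fallback is appropriate for PD is left open here.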

Why It Only Affects Bridge Mode

In host network mode all PD nodes communicate via 127.0.0.1, so the bolt RPC
call succeeds immediately over loopback. In bridge mode the call crosses
Docker's virtual network between containers, and the follower never receives
a usable response.

Why It's Nondeterministic

JRaft leader election is timing-based. If pd0 wins, isLeader() returns true
and the broken code path is never reached. If pd1 or pd2 wins, pd0 becomes
a follower and hits the NPE on every redirect attempt.

Reproduction

  1. Run the 3-node cluster in Docker bridge mode
  2. Check which PD won leader election:
    docker exec hg-pd0 grep "becomes leader" /hugegraph-pd/logs/hugegraph-pd-stdout.log
    docker exec hg-pd1 grep "becomes leader" /hugegraph-pd/logs/hugegraph-pd-stdout.log
    docker exec hg-pd2 grep "becomes leader" /hugegraph-pd/logs/hugegraph-pd-stdout.log
  3. If pd1 or pd2 is leader, check store partitions:
    curl -u store:admin http://localhost:8620/v1/stores | grep partitionCount
    → store1 and store2 will show partitionCount:0
  4. Check pd0 logs for the NPE:
    docker exec hg-pd0 grep "getLeaderGrpcAddress" /hugegraph-pd/logs/hugegraph-pd-stdout.log
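
The partition check in step 3 can also be automated. The sketch below is a hypothetical helper, not part of HugeGraph: `PartitionCountCheck` and its methods are invented names, the exact layout of the `/v1/stores` response is not shown in this report, and the sample string in `main` is made up. It only assumes that the response text contains `partitionCount` followed by a colon and a number, as the grep in step 3 implies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartitionCountCheck {

    // Accepts both "partitionCount":0 (JSON style) and partitionCount:0
    // (the form quoted in this report).
    private static final Pattern COUNT =
            Pattern.compile("\"?partitionCount\"?\\s*:\\s*(\\d+)");

    /** Extracts every partitionCount value found in the text, in order. */
    static List<Integer> partitionCounts(String text) {
        List<Integer> counts = new ArrayList<>();
        Matcher matcher = COUNT.matcher(text);
        while (matcher.find()) {
            counts.add(Integer.parseInt(matcher.group(1)));
        }
        return counts;
    }

    /** True when at least one store reports zero partitions. */
    static boolean hasUndistributedStore(String text) {
        return partitionCounts(text).contains(0);
    }

    public static void main(String[] args) {
        // Made-up sample text; feed the real curl output here instead.
        String sample = "partitionCount:12 partitionCount:0 partitionCount:0";
        System.out.println(partitionCounts(sample));
        System.out.println(hasUndistributedStore(sample));
    }
}
```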

Evidence

The NPE occurs in two separate call paths:

Path 1: fires immediately when the node becomes a follower during leader election:

java.util.concurrent.ExecutionException: java.lang.NullPointerException
    at java.util.concurrent.CompletableFuture.reportGet(Unknown Source)
    at java.util.concurrent.CompletableFuture.get(Unknown Source)
    at org.apache.hugegraph.pd.raft.RaftEngine.getLeaderGrpcAddress(RaftEngine.java:242)
    at org.apache.hugegraph.pd.service.PDService.onRaftLeaderChanged(PDService.java:1345)
    at org.apache.hugegraph.pd.raft.RaftStateMachine.lambda$onStartFollowing$1(RaftStateMachine.java:141)

Path 2: fires on every store registration redirect attempt:

java.util.concurrent.ExecutionException: java.lang.NullPointerException
    at java.util.concurrent.CompletableFuture.reportGet(Unknown Source)
    at java.util.concurrent.CompletableFuture.get(Unknown Source)
    at org.apache.hugegraph.pd.raft.RaftEngine.getLeaderGrpcAddress(RaftEngine.java:242)
    at org.apache.hugegraph.pd.service.PDService.redirectToLeader(PDService.java:1275)
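
To help read the two traces above, the following standalone sketch contrasts two ways a `CompletableFuture` can lead to a NullPointerException. It uses only the JDK; `FutureFailureShapes` and its method names are invented for this illustration and do not come from the PD code. A future that completes normally with null yields a plain NPE at the point where the result is dereferenced, whereas a future that completes exceptionally makes `get()` throw an `ExecutionException` whose cause is the NPE, which is the shape both traces show.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

final class FutureFailureShapes {

    /** Case A: get() returns null and the caller dereferences it. */
    static String describeNullResult() {
        CompletableFuture<String> future = CompletableFuture.completedFuture(null);
        try {
            return "length=" + future.get().length();
        } catch (NullPointerException e) {
            return "plain NullPointerException";
        } catch (InterruptedException | ExecutionException e) {
            return "unexpected: " + e;
        }
    }

    /** Case B: the future itself completes exceptionally with an NPE. */
    static String describeExceptionalCompletion() {
        CompletableFuture<String> future = new CompletableFuture<>();
        future.completeExceptionally(new NullPointerException("simulated"));
        try {
            return future.get();
        } catch (ExecutionException e) {
            return "ExecutionException caused by "
                    + e.getCause().getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(describeNullResult());
        System.out.println(describeExceptionalCompletion());
    }
}
```

Case B suggests that the NPE is raised while the RPC future is being completed, not by calling `getGrpcAddress()` on a null result. Where exactly that happens inside the RPC client is not visible from the traces.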

Proof: restarting the non-pd0 leader, which forced pd0 to win the election,
immediately resolved the issue: partitionCount:12 on all 3 stores and all 9
containers healthy. Reproducible 100% of the time.

Labels: bug (Something isn't working)
