fix(docker): migrate single-node compose from host to bridge networking#2952

Open
bitflicker64 wants to merge 9 commits into apache:master from bitflicker64:docker-fix-bridge-network

Conversation

bitflicker64 (Contributor) commented Feb 15, 2026

Purpose of the PR

Fix the single-node Docker deployment failing on macOS and Windows due to Linux-only host networking.

close #2951

Main Changes

  • Remove `network_mode: host` from the single-node docker-compose setup

  • Use default bridge networking

  • Add explicit port mappings

    • 8080: Server HTTP
    • 8520: Store HTTP
    • 8620: PD HTTP
  • Add configuration volume mounts

    • docker/pd-conf
    • docker/store-conf
    • docker/server-conf
  • Replace localhost and non-routable addresses with container hostnames

    • PD gRPC host set to `pd`
    • Store gRPC host set to `store`
    • Server PD peers set to `pd:8686`
  • Update healthcheck endpoints

Problem

The original single-node Docker configuration uses `network_mode: host`.

This only works on native Linux. Docker Desktop on macOS and Windows does not implement host networking the same way: containers start, but HugeGraph services advertise incorrect addresses such as 127.0.0.1 or 0.0.0.0.

Resulting failures:

  • Server stuck in a loop waiting for the storage backend
  • PD client `UNAVAILABLE: io exception` errors
  • Store reports zero partitions
  • Cluster never becomes usable even though containers are running

The issue is not process failure but invalid service discovery and advertised endpoints.

Root Cause

  • `network_mode: host` is Linux-specific
  • Docker Desktop falls back to bridge networking
  • HugeGraph components still advertise localhost-style addresses
  • Other containers cannot route to those addresses

Solution

Switch to bridge networking and advertise container resolvable hostnames.

Docker DNS resolves service names automatically. Services bind normally while exposing correct internal endpoints.
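As an illustrative sketch of this layout (image names and most keys are assumptions; the ports, hostnames, and the `HG_SERVER_PD_PEERS` variable follow the PR description, not the exact file):

```yaml
# Hypothetical sketch of the bridge-networking compose layout.
services:
  pd:
    image: hugegraph/pd
    hostname: pd          # other containers reach PD as "pd" via Docker DNS
    ports:
      - "8620:8620"       # PD HTTP
  store:
    image: hugegraph/store
    hostname: store
    ports:
      - "8520:8520"       # Store HTTP
  server:
    image: hugegraph/hugegraph
    hostname: server
    ports:
      - "8080:8080"       # Server HTTP
    environment:
      # Server discovers PD by container hostname instead of localhost
      - HG_SERVER_PD_PEERS=pd:8686

networks:
  default:
    driver: bridge
```

Because every service binds inside its own network namespace and advertises its container hostname, the same file behaves identically on Linux, macOS, and Windows.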

Verification

Observed behavior after changes on Docker Desktop macOS:

Container state

docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

server   Up healthy   0.0.0.0:8080->8080
store    Up healthy   0.0.0.0:8520->8520
pd       Up healthy   0.0.0.0:8620->8620

Server startup sequence

Hugegraph server are waiting for storage backend
Initializing HugeGraph Store
Starting HugeGraphServer ... OK
Started

Endpoints

Server:

curl http://localhost:8080

Returns service metadata.

Store:

curl http://localhost:8520

Returns a non-zero leader and partition count:

{"leaderCount":12,"partitionCount":12}

PD:

curl http://localhost:8620

Returns expected auth response, confirming service availability.

Cluster becomes operational after initialization delay.

Why This Works

  • Bridge networking is cross-platform
  • Container names become stable service addresses
  • No platform-dependent networking behavior
  • Services advertise routable endpoints

Does this PR potentially affect the following parts

  • Modify configurations
  • Dependencies
  • The public API
  • Other effects
  • Nope

Documentation Status

  • Doc No Need

Related fixes discovered during this work:

  • getLeaderGrpcAddress() NPE in bridge mode (follower PDs crash when redirecting to the leader): issue #2959, fix #2961
  • IpAuthHandler hostname vs IP mismatch (cross-node raft connections silently blocked): issue #2960, fix #2962

Changes Checklist

  • Updated Docker networking configuration
  • Added / verified required port mappings
  • Adjusted service communication to use container hostnames
  • Validated environment-based configuration
  • Verified PD, Store, and Server containers start correctly
  • Confirmed single-node cluster reaches healthy state
  • Confirmed partition assignment and leader election
  • Validated multi-node (3-node) cluster deployment
  • Tested on macOS (Apple M4)
  • Fixed server startup timeout to allow partition assignment to complete
  • Verified PD, Store and Server healthchecks pass on both single and 3-node
  • Test on Linux
  • Fix IpAuthHandler hostname vs IP resolution in RaftEngine.java (currently worked around via static IPs in docker-compose)
  • Fix getLeaderGrpcAddress() NPE
  • Remove static IP workaround once IpAuthHandler is fixed upstream
  • Update Docker images to reflect entrypoint changes
  • Update documentation

Replace network_mode: host with explicit port mappings and add configuration
volumes for PD, Store, and Server services to support macOS/Windows Docker.

- Remove host network mode from all services
- Add explicit port mappings (8620, 8520, 8080)
- Add configuration directories with volume mounts
- Update healthcheck endpoints
- Add PD peers environment variable

Enables HugeGraph cluster to run on all Docker platforms.
bitflicker64 (Contributor, Author) commented:

Bridge networking changes have been validated successfully across environments:

  • macOS (Docker Desktop)
  • Ubuntu 24.04.4 LTS

Observed behavior:

  • PD container starts and becomes healthy
  • Store container starts, registers, and receives partitions
  • Partitions are assigned and Raft leaders are elected
  • Server container initializes without errors
  • REST endpoints respond as expected

No regressions were observed in the single-node deployment. Service discovery and inter-container communication function correctly under bridge networking.


ARM64 Compatibility Fix — wait-storage.sh

Problem

The original wait-storage.sh relied on gremlin-console.sh for storage readiness detection. On ARM64 (Apple Silicon), this fails due to a Jansi native-library crash.


Root Cause

  • gremlin-console.sh depends on Jansi, which is unstable on ARM64
  • The detection logic is triggered only when hugegraph.* environment variables are used
  • Volume-mounted configurations bypass this code path, masking the failure

Fix

Replaced Gremlin Console detection with a lightweight PD REST health check.
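A minimal polling sketch of this approach (the URL, variable names, and defaults are illustrative, not the script's exact contents; only the 300s timeout comes from the PR):

```shell
#!/bin/sh
# Hypothetical sketch of the PD REST polling that replaces the Gremlin
# Console probe; endpoint and env names are illustrative.
PD_HEALTH_URL="${PD_HEALTH_URL:-http://pd:8620/}"
TIMEOUT="${WAIT_STORAGE_TIMEOUT_S:-300}"   # the PR raises the timeout to 300s
INTERVAL=5

wait_for_pd() {
    elapsed=0
    while [ "$elapsed" -lt "$TIMEOUT" ]; do
        # -f turns HTTP errors into a non-zero exit; -s silences progress output
        if curl -sf "$PD_HEALTH_URL" > /dev/null 2>&1; then
            echo "PD is ready after ${elapsed}s"
            return 0
        fi
        sleep "$INTERVAL"
        elapsed=$((elapsed + INTERVAL))
    done
    echo "Timed out waiting for PD after ${TIMEOUT}s" >&2
    return 1
}

# In the real script a call like this would gate server startup:
# wait_for_pd || exit 1
```

Unlike the Gremlin Console path, this needs only curl, so it has no JVM or native-library (Jansi) dependency and works the same on ARM64.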

Cleanup

detect-storage.groovy is no longer required by the updated startup flow and can be removed.

Copilot AI (Contributor) left a comment:

Pull request overview

This PR migrates the single-node Docker Compose configuration from Linux-specific host networking to cross-platform bridge networking. The change addresses a critical issue where Docker Desktop on macOS and Windows doesn't support host networking properly, causing services to advertise unreachable addresses and preventing cluster initialization.

Changes:

  • Replaced host networking with bridge networking and explicit port mappings
  • Added comprehensive environment-based configuration for PD, Store, and Server through new entrypoint scripts
  • Implemented health-aware startup with PD REST endpoint polling in wait-storage.sh
  • Added volume mounts for persistent data and deprecated variable migration guards

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:

  • docker/docker-compose.yml: migrated from host to bridge networking, added environment variables, updated healthchecks, exposed required ports
  • hugegraph-pd/hg-pd-dist/docker/docker-entrypoint.sh: new comprehensive entrypoint with SPRING_APPLICATION_JSON configuration, deprecation guards, and required-variable validation
  • hugegraph-store/hg-store-dist/docker/docker-entrypoint.sh: new comprehensive entrypoint with SPRING_APPLICATION_JSON configuration, deprecation guards, and required-variable validation
  • hugegraph-server/hugegraph-dist/docker/docker-entrypoint.sh: refactored to use environment variables for backend and PD configuration with deprecation guards
  • hugegraph-server/hugegraph-dist/src/assembly/static/bin/wait-storage.sh: replaced Gremlin-based storage detection with PD REST health-endpoint polling, increased timeout to 300s


bitflicker64 (Contributor, Author) commented:

Thank you for the review. I'll take care of the suggested adjustments and will proceed with testing the 3-node cluster configuration next.

networks:
  hg-net:
    driver: bridge

imbajin (Member) commented Feb 22, 2026:

⚠️ build: in quickstart compose can unexpectedly trigger local source builds — prefer pull-only defaults

Using build: together with image: does not always force a local build. In Compose, the usual behavior (depending on pull_policy) is to try pulling first, then fall back to building if pull is unavailable.

That said, this is still risky for a quickstart file:

  1. If the image/tag cannot be pulled, users unexpectedly need the full source tree and Docker build context.
  2. Startup becomes much slower because services may be built locally.
  3. Locally built images can differ from official release artifacts, reducing reproducibility.

Since docker/docker-compose.yml is intended for quickstart usage, I recommend keeping it pull-only by default and moving build: blocks to a dev-specific override (for example, docker-compose.dev.yml).

Suggested change
pd:
  image: hugegraph/pd:${HUGEGRAPH_VERSION:-1.7.0}
  container_name: hg-pd
  hostname: pd

If local builds are needed for development, users can combine files explicitly, e.g.:
docker compose -f docker/docker-compose.yml -f docker/docker-compose.dev.yml up

bitflicker64 (Contributor, Author) replied:

I’ve changed the compose setup so it’s pull-only and no longer builds locally. Right now, I’m bind-mounting the updated entrypoint scripts because the current published images don’t include these changes.
Once this PR is done and new images are available, I can remove the mounts. I’ll follow up with a small cleanup PR to switch completely to the official images.

Replace weak PD-only probe with hard pre-start validation.
Startup now blocks until PD health is OK and PD reports at least one Up store.
Timeouts fail fast, preventing server launch in broken cluster states.
Partition stabilization remains a post-start concern and does not gate boot.
bitflicker64 (Contributor, Author) commented:

PD readiness checks were failing even when the cluster was working correctly. In secured setups, the PD service returns Unauthorized if no credentials are provided, and the wait logic treated this response as a failure rather than normal security behavior. This created false failures while PD and Store were actually healthy, so the readiness loop simply waited until it timed out. The fix was to add authentication to the curl commands; after that, the checks worked correctly and no longer timed out.
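One way to sketch this logic (env-variable names and the shape of the probe are illustrative, not the PR's exact code): treat a 401 as proof the listener is up, and attach credentials when they are configured.

```shell
# Hypothetical readiness probe for a possibly-secured PD endpoint.

# Classify an HTTP status: 2xx is healthy; 401 still proves the
# service is up and merely requires auth.
status_ok() {
    case "$1" in
        2??|401) return 0 ;;
        *)       return 1 ;;
    esac
}

probe_pd() {
    url="$1"
    if [ -n "${HG_AUTH_USER:-}" ]; then
        # Secured setup: send basic-auth credentials with the probe
        code=$(curl -s -o /dev/null -w '%{http_code}' \
                    -u "${HG_AUTH_USER}:${HG_AUTH_PASS:-}" "$url")
    else
        code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    fi
    status_ok "$code"
}
```

With this split, the wait loop can call `probe_pd` and never mistakes an auth challenge for a down service.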

bitflicker64 (Contributor, Author) commented:

@imbajin Update on bridge network 3-node cluster investigation:

Hi, I've been looking into why the 3-node PD cluster fails to form raft quorum in Docker bridge network mode. To rule out any config loading issues, I tried both the env var based entrypoint approach and a fully hardcoded config, both produced the same error, so I don't think it's config related.

Both attempts show this in the logs:

IpAuthHandler - Blocked connection from 172.18.0.x

Tracing it back, I noticed this in RaftEngine.java around line 159:

IpAuthHandler.getInstance(
    peers.stream()
        .map(PeerId::getIp)
        .collect(Collectors.toSet())
);

I think PeerId::getIp might be returning the raw hostname string (e.g. pd1) rather than the resolved IP. In bridge mode, incoming connections arrive with their actual container IP (172.18.0.x), so the allowlist check would always fail and block cross-node raft connections. This wouldn't surface in network_mode: host since all connections appear as 127.0.0.1.
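If the diagnosis holds, one possible direction (purely illustrative, not a reviewed fix for RaftEngine.java) would be to resolve each configured peer host before building the allowlist, so hostnames and literal IPs both end up as routable addresses:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ResolvePeers {

    // Resolve each configured peer host (which may be a container hostname
    // like "pd1") to its IP, so the allowlist matches the source addresses
    // of incoming bridge-network connections.
    static Set<String> resolveAll(Stream<String> hosts) {
        return hosts.map(h -> {
            try {
                return InetAddress.getByName(h).getHostAddress();
            } catch (UnknownHostException e) {
                return h; // keep the raw value if resolution fails
            }
        }).collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // A literal IP resolves to itself; a hostname to its address
        System.out.println(resolveAll(Stream.of("127.0.0.1")));
    }
}
```

A caveat worth noting: Docker's embedded DNS can reassign container IPs across restarts, so resolving once at startup may still go stale; that trade-off would need discussion in the actual fix.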

Could you take a look and let me know if this diagnosis makes sense? I might be missing something, as I'm still getting familiar with the codebase. If this does seem right, I'd be happy to try putting together a fix.

Thanks!

bitflicker64 (Contributor, Author) commented:

As a temporary workaround, assigning static IPs to PD containers via docker-compose ipam and using those IPs directly in HG_PD_RAFT_PEERS_LIST instead of hostnames fixes the issue. Since PeerId::getIp then returns actual IPs, the allowlist check passes and cross-node raft connections are no longer blocked.
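A sketch of what that workaround looks like in compose (the subnet, addresses, and raft port are illustrative assumptions, not the committed file):

```yaml
# Hypothetical static-IP workaround; to be removed once IpAuthHandler is
# fixed upstream.
networks:
  hg-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.0.0/24

services:
  pd1:
    networks:
      hg-net:
        ipv4_address: 172.30.0.11
    environment:
      # Peers listed by IP so PeerId::getIp returns routable addresses
      - HG_PD_RAFT_PEERS_LIST=172.30.0.11:8610,172.30.0.12:8610,172.30.0.13:8610
```

The cost is that the peer list is now coupled to a hand-managed subnet, which is why the checklist tracks removing this once the upstream fix lands.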

bitflicker64 (Contributor, Author) commented:

@imbajin Happy Lantern Festival!🏮

Copilot AI (Contributor) left a comment:

Pull request overview

Copilot reviewed 33 out of 37 changed files in this pull request and generated 13 comments.



Comment on lines +39 to +42
pull_policy: always
restart: unless-stopped
entrypoint: ["/hugegraph-pd/docker-entrypoint.sh"]
healthcheck:
Copilot AI commented Mar 3, 2026:

Same concern as the single-node compose: setting entrypoint: overrides the Dockerfile ENTRYPOINT (dumb-init) and can degrade shutdown/reaping behavior for these Java processes. Prefer keeping the image entrypoint (remove entrypoint:) or explicitly invoke dumb-init in the compose entrypoint.

store0: { condition: service_healthy }
store1: { condition: service_healthy }
store2: { condition: service_healthy }
entrypoint: ["/hugegraph-server/docker-entrypoint.sh"]
Copilot AI commented Mar 3, 2026:

x-server-common sets entrypoint: which overrides the Dockerfile ENTRYPOINT (dumb-init). For the server JVM this can cause slow/unclean stops and poor signal propagation. Prefer relying on the image entrypoint (remove entrypoint:) or call dumb-init explicitly in the compose entrypoint.

Suggested change
entrypoint: ["/hugegraph-server/docker-entrypoint.sh"]

Comment on lines +109 to +110
entrypoint: ["/hugegraph-server/docker-entrypoint.sh"]

Copilot AI commented Mar 3, 2026:

entrypoint: here overrides the image ENTRYPOINT (dumb-init), which can prevent proper signal forwarding and leave the HugeGraph server JVM running after docker compose stop. Prefer relying on the image entrypoint (remove this) or use /usr/bin/dumb-init -- /hugegraph-server/docker-entrypoint.sh.

pd0: { condition: service_healthy }
pd1: { condition: service_healthy }
pd2: { condition: service_healthy }
entrypoint: ["/hugegraph-store/docker-entrypoint.sh"]
Copilot AI commented Mar 3, 2026:

x-store-common sets entrypoint: which overrides the image’s ENTRYPOINT (dumb-init). That can interfere with signal handling and leave orphaned JVMs/zombies. Prefer removing the compose entrypoint: (let the image’s entrypoint run) or explicitly wrap the script with dumb-init.

Suggested change
entrypoint: ["/hugegraph-store/docker-entrypoint.sh"]

Comment on lines +72 to +73
entrypoint: ["/hugegraph-store/docker-entrypoint.sh"]

Copilot AI commented Mar 3, 2026:

entrypoint: overrides the image’s Dockerfile ENTRYPOINT (dumb-init), which can hurt signal handling/shutdown behavior for the Store JVM. Prefer removing this entrypoint: (the image already runs ./docker-entrypoint.sh) or wrap it with dumb-init explicitly.

Suggested change
entrypoint: ["/hugegraph-store/docker-entrypoint.sh"]

Comment on lines +32 to +37
esc_key=$(printf '%s' "$key" | sed -e 's/[][(){}.^$*+?|\\/]/\\&/g')
esc_val=$(printf '%s' "$val" | sed -e 's/[&|\\]/\\&/g')

if grep -qE "^[[:space:]]*${esc_key}[[:space:]]*=" "${file}"; then
    sed -ri "s|^([[:space:]]*${esc_key}[[:space:]]*=).*|\\1${esc_val}|" "${file}"
else
Copilot AI commented Mar 3, 2026:

The set_prop helper interpolates the untrusted val argument (for example from HG_SERVER_BACKEND or HG_SERVER_PD_PEERS) directly into a double-quoted sed -ri expression. If an attacker can control one of these environment variables, they can inject $(command) or backticks so that arbitrary shell commands run when sed is invoked. Refactor this to ensure the replacement value is treated as literal data: validate or whitelist accepted formats and avoid embedding it directly into a sed program that the shell parses.
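One possible hardening direction, sketched under the assumption that accepted values are simple tokens or host:port lists (the function name and whitelist are illustrative, not the PR's code):

```shell
# Hypothetical: whitelist-validate the value before it ever reaches sed,
# so shell/sed metacharacters (including $(...) and backticks) are rejected.
set_prop_safe() {
    key="$1"; val="$2"; file="$3"
    # Allow only letters, digits, and separators seen in backend names
    # and host:port lists; reject empty values and anything else.
    case "$val" in
        *[!A-Za-z0-9._:,/-]*|"")
            echo "rejecting suspicious value for ${key}" >&2
            return 1 ;;
    esac
    if grep -q "^${key}=" "$file"; then
        # key is assumed here to be a plain identifier with no regex chars
        sed -i "s|^${key}=.*|${key}=${val}|" "$file"
    else
        printf '%s=%s\n' "$key" "$val" >> "$file"
    fi
}
```

Validation like this keeps the sed program free of attacker-controlled syntax; the real fix would also need to escape or validate the key, as the original esc_key logic does.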


Labels

pd PD module size:L This PR changes 100-499 lines, ignoring generated files. store Store module

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

[Bug] Single node Docker setup does not work on macOS and Windows because of host networking

3 participants