Multi-architecture container and CLI for managing ephemeral SRE investigations on AWS Fargate with OIDC-authenticated access control.
- Go CLI: `rosa-boundary` — authenticate, start, join, list, and stop investigations
- AWS CLI: Both Fedora RPM and official AWS CLI v2 with alternatives support
- OpenShift CLI: Versions 4.14 through 4.20 from stable channels
- Claude Code: AI-powered CLI assistant with Amazon Bedrock integration
- Dynamic Version Selection: Switch tool versions via environment variables at runtime
- ECS Exec Ready: Designed for AWS Fargate with ECS Exec support
- Multi-architecture: Supports both x86_64 (amd64) and ARM64 (aarch64)
- OIDC Authentication: Keycloak integration with Lambda-based authorization
- Tag-Based Isolation: Shared SRE role with task-level ABAC access control
- Go 1.23+ (to build the CLI from source)
- Terraform (infrastructure deployment)
- Keycloak with OIDC configured (see OIDC Identity Requirements)
- `session-manager-plugin` — required for `join-task` and `start-task --connect`
The session-manager-plugin is an AWS-provided binary that handles the WebSocket session protocol used by ECS Exec. The rosa-boundary CLI calls the ECS ExecuteCommand API to obtain session credentials, then hands off to this plugin to establish the interactive session. It must be installed separately on each machine running the CLI.
macOS:
```shell
brew install --cask session-manager-plugin
```

Linux (x86_64):

```shell
curl "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm" -o /tmp/session-manager-plugin.rpm
sudo yum install -y /tmp/session-manager-plugin.rpm
```

Verify:

```shell
session-manager-plugin --version
```

See the AWS documentation for other platforms and package managers.
1. Copy the example environment file and fill in values:

   ```shell
   cp .env.example .env
   ```

2. Set the required Terraform variables (no defaults):

   | Variable | Description |
   |---|---|
   | `container_image` | Container image URI |
   | `vpc_id` | VPC for Fargate tasks |
   | `subnet_ids` | 2+ subnets in the same VPC |
   | `keycloak_issuer_url` | OIDC issuer URL (e.g., `https://keycloak.example.com/realms/sre-ops`) |
   | `keycloak_thumbprint` | SHA1 thumbprint of the Keycloak TLS certificate |

3. Deploy:

   ```shell
   cd deploy/regional && terraform init && terraform apply
   ```
See deploy/regional/README.md for the complete deployment guide.
Keycloak must issue tokens with these claims:
| Claim | Purpose |
|---|---|
| `sub` | Stored as `oidc_sub` tag (audit trail) |
| `preferred_username` | Used as `username` tag (ABAC key) |
| `email` | Logged |
| `groups` | Must contain `sre-team` |
| `aud` | Must match `aws-sre-access` |
| `https://aws.amazon.com/tags` | Session tags with `principal_tags.username` for ABAC |
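For reference, a decoded token payload satisfying these requirements might look like the following. All values are illustrative placeholders; the `https://aws.amazon.com/tags` claim uses the array-valued format AWS STS expects for OIDC session tags:

```json
{
  "sub": "f3a2c1d0-1234-5678-9abc-def012345678",
  "preferred_username": "jdoe",
  "email": "jdoe@example.com",
  "groups": ["sre-team"],
  "aud": "aws-sre-access",
  "https://aws.amazon.com/tags": {
    "principal_tags": {
      "username": ["jdoe"]
    }
  }
}
```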
Required Keycloak mappers:
- Groups (flat names), email, audience (`aws-sre-access`)
- AWS session tags: map `preferred_username` → `principal_tags.username`
Client settings: public client, standard flow + PKCE, redirect URI http://localhost:8400/callback.
See docs/configuration/keycloak-realm-setup.md for step-by-step setup.
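As background on the PKCE part of the login flow: the challenge the CLI sends to Keycloak is derived from a random verifier per RFC 7636 (S256 method). This is a generic sketch, not project code; it assumes `openssl` and `base64` are available:

```shell
# Generate a random code_verifier (43 base64url characters, no padding)
VERIFIER=$(head -c 32 /dev/urandom | base64 | tr '+/' '-_' | tr -d '=\n')

# code_challenge = base64url(SHA256(code_verifier)), sent with method=S256
CHALLENGE=$(printf '%s' "$VERIFIER" | openssl dgst -sha256 -binary | base64 | tr '+/' '-_' | tr -d '=\n')

echo "verifier:  $VERIFIER"
echo "challenge: $CHALLENGE"
```

Keycloak recomputes the same hash from the verifier presented at the token endpoint, which is what lets a public client authenticate without a secret.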
```shell
make build-cli && make install-cli
```

Create `~/.rosa-boundary/config.yaml` with the values specific to your deployment:

```yaml
keycloak_url: https://keycloak.example.com
lambda_function_name: rosa-boundary-dev-create-investigation
invoker_role_arn: arn:aws:iam::123456789012:role/rosa-boundary-dev-lambda-invoker
```

Core workflow:

```shell
# Start an investigation (authenticates, creates task, waits for RUNNING)
rosa-boundary start-task --cluster-id my-cluster --connect

# List running tasks
rosa-boundary list-tasks

# Connect to an existing task
rosa-boundary join-task <task-id>

# Stop a task (triggers S3 sync)
rosa-boundary stop-task <task-id>
```

Project layout:

```
rosa-boundary/
├── .env.example                 # Environment configuration template (copy to .env)
├── Containerfile                # Multi-arch container build
├── entrypoint.sh                # Runtime initialization and signal handling
├── skel/sre/.claude/            # Skeleton Claude Code config for container users
├── cmd/rosa-boundary/           # CLI entrypoint
├── internal/
│   ├── auth/                    # PKCE/OIDC authentication
│   ├── aws/                     # ECS and STS clients
│   ├── cmd/                     # Cobra subcommands
│   ├── config/                  # Viper-based configuration
│   ├── lambda/                  # Lambda invocation client
│   └── output/                  # Text/JSON output helpers
├── deploy/
│   ├── regional/                # Terraform: ECS, EFS, S3, Lambda, OIDC
│   │   ├── *.tf                 # Infrastructure definitions
│   │   ├── examples/            # Manual lifecycle scripts
│   │   └── README.md            # Deployment guide
│   └── keycloak/                # Kustomize: Keycloak realm and clients
├── lambda/
│   ├── create-investigation/    # OIDC-authenticated investigation creation
│   │   ├── handler.py           # Group auth, role creation, task tagging
│   │   └── Makefile             # Build Lambda package
│   └── reap-tasks/              # Periodic task timeout enforcement
│       └── handler.py           # Deadline-based task termination
├── tests/
│   └── localstack/              # LocalStack integration tests
│       ├── compose.yml          # LocalStack Pro + mock OIDC
│       └── integration/         # AWS service tests
├── docs/                        # Architecture and implementation docs
└── .github/workflows/           # CI/CD automation
```
| Command | Description |
|---|---|
| `login` | Authenticate with Keycloak and cache the OIDC token |
| `start-task` | Create an investigation and start an ECS task |
| `join-task <task-id>` | Connect to a running ECS task via ECS Exec |
| `list-tasks` | List ECS tasks in the cluster |
| `stop-task <task-id>` | Stop a running ECS task |
| `version` | Print the rosa-boundary version |
`start-task`:

- `--cluster-id` — cluster ID (defaults to `cluster_name` from config)
- `--investigation-id` — auto-generated if omitted (e.g., `swift-dance-party`)
- `--oc-version` — OpenShift CLI version (default: `4.20`)
- `--task-timeout` — seconds before the reaper kills the task (default: `3600`)
- `--connect` — automatically join the task after it reaches RUNNING
- `--no-wait` — return immediately without waiting for RUNNING
- `--force-login` — force fresh OIDC authentication
- `--output text|json`
`join-task`: `--container` (default: `rosa-boundary`), `--command` (default: `runuser -u sre -- sh -c 'cd ~ && exec bash --login'`), `--no-wait`

`list-tasks`: `--status RUNNING|STOPPED|all` (default: `RUNNING`), `--output text|json`

`stop-task`: `--reason`, `--wait`

`login`: `--force`
Global flags:

```
--verbose, -v            Enable verbose/debug output
--keycloak-url           Keycloak base URL
--realm                  Keycloak realm (default: sre-ops)
--client-id              OIDC client ID (default: aws-sre-access)
--region                 AWS region (default: us-east-2)
--cluster                ECS cluster name (default: rosa-boundary-dev)
--lambda-function-name   Lambda function name or ARN
--invoker-role-arn       Lambda invoker role ARN
--role-arn               SRE role ARN (overrides Lambda response)
--lambda-url             Lambda function URL (HTTP mode)
```
Configuration precedence: flags > environment variables (`ROSA_BOUNDARY_*`) > `~/.rosa-boundary/config.yaml` > defaults.

Environment variable examples: `ROSA_BOUNDARY_KEYCLOAK_URL`, `ROSA_BOUNDARY_LAMBDA_FUNCTION_NAME`, `ROSA_BOUNDARY_INVOKER_ROLE_ARN`.
```shell
# Build both architectures and create manifest
make all

# Build single architecture
make build-amd64
make build-arm64

# Create manifest list from existing builds
make manifest

# Remove all images and manifests
make clean
```

```shell
# Build the rosa-boundary binary to ./bin/
make build-cli

# Install to $GOBIN
make install-cli

# Run Go unit tests
make test-cli
```

The easiest way to select tool versions is via environment variables at container startup:
| Variable | Values | Default | Description |
|---|---|---|---|
| `OC_VERSION` | `4.14`, `4.15`, `4.16`, `4.17`, `4.18`, `4.19`, `4.20` | `4.20` | OpenShift CLI version |
| `AWS_CLI` | `fedora`, `official` | `official` | AWS CLI source |
| `S3_AUDIT_ESCROW` | S3 URI (e.g., `s3://bucket/path/`) | (none) | S3 destination for `/home/sre` sync on exit |
| `CLAUDE_CODE_USE_BEDROCK` | `0`, `1` | `1` | Enable Claude Code via Amazon Bedrock |
| `AWS_REGION` | AWS region code | (auto-detect) | AWS region for Bedrock. Auto-detected from ECS metadata; falls back to `us-east-1` |
| `ANTHROPIC_MODEL` | Bedrock model ID | (default) | Override Claude model (e.g., `global.anthropic.claude-sonnet-4-5-20250929-v1:0`) |
Examples:
```shell
# Use OpenShift CLI 4.18
podman run -e OC_VERSION=4.18 rosa-boundary:latest

# Use Fedora's AWS CLI
podman run -e AWS_CLI=fedora rosa-boundary:latest

# Use both together
podman run -e OC_VERSION=4.17 -e AWS_CLI=fedora rosa-boundary:latest

# With a custom command
podman run -e OC_VERSION=4.19 rosa-boundary:latest /bin/bash
```

The container includes a non-root `sre` user (uid=1000) designed for SSM/ECS Exec connections. The `/home/sre` directory is intended to be mounted as EFS via the Fargate task definition.
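For illustration, the EFS mount could be declared in the task definition along these lines. This is a sketch only: the volume name and filesystem ID are placeholders, and access-point and IAM authorization settings are omitted:

```json
{
  "volumes": [
    {
      "name": "sre-home",
      "efsVolumeConfiguration": {
        "fileSystemId": "fs-0123456789abcdef0",
        "transitEncryption": "ENABLED"
      }
    }
  ],
  "containerDefinitions": [
    {
      "name": "rosa-boundary",
      "mountPoints": [
        { "sourceVolume": "sre-home", "containerPath": "/home/sre" }
      ]
    }
  ]
}
```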
When the container receives termination signals (SIGTERM, SIGINT, SIGHUP) or exits normally, the entrypoint automatically syncs /home/sre to S3 if S3_AUDIT_ESCROW is set:
```shell
# Container will sync /home/sre to S3 on exit
podman run -e S3_AUDIT_ESCROW=s3://my-bucket/investigation-123/ rosa-boundary:latest
```

Features:
- Automatic sync on container exit or termination signals
- Graceful failure - warns but doesn't block exit if sync fails
- Only syncs if
S3_AUDIT_ESCROWis defined (no sync if unset) - Useful for preserving investigation artifacts after ephemeral container use
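The exit hook in entrypoint.sh can be approximated by the following sketch (the function name is illustrative and the real script may differ; the behavior matches the features listed above):

```shell
# Sync /home/sre to S3 on exit; warn but never block shutdown on failure
sync_home() {
  if [ -n "${S3_AUDIT_ESCROW:-}" ]; then
    aws s3 sync /home/sre "${S3_AUDIT_ESCROW}" \
      || echo "WARN: audit sync to ${S3_AUDIT_ESCROW} failed" >&2
  fi
}

# Run the sync on normal exit and on termination signals
trap sync_home EXIT TERM INT HUP
```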
The container supports two methods for switching tool versions:
- Environment Variables (recommended): Set `OC_VERSION` or `AWS_CLI` at container startup (see above)
- Alternatives Commands (advanced): Manually switch versions inside a running container
The container includes two AWS CLI versions managed with alternatives:
- fedora (priority 10): Fedora RPM package
- aws-official (priority 20): Official AWS CLI v2 (default)
```shell
# View current AWS CLI configuration
alternatives --display aws

# Switch to Fedora version
alternatives --set aws /usr/bin/aws

# Switch to official version
alternatives --set aws /opt/aws-cli-official/v2/current/bin/aws
```

Seven OpenShift CLI versions are available (4.14-4.20), with 4.20 as the default:
```shell
# View available oc versions
alternatives --display oc

# Switch to a specific version
alternatives --set oc /opt/openshift/4.17/oc
alternatives --set oc /opt/openshift/4.19/oc
```

The container includes the Claude Code CLI with Amazon Bedrock integration for AI-assisted troubleshooting and automation.
Location: /home/sre/.claude/
Default configuration files are automatically initialized on first run:
- `settings.json` - Bedrock authentication and auto-update settings
- `CLAUDE.md` - SRE workflow guidance and available tools documentation
Authentication: Uses IAM via Amazon Bedrock (no API keys required)
Claude Code automatically detects the AWS region from ECS task metadata:
- Checks if the `AWS_REGION` environment variable is set (explicit override)
- Queries the ECS metadata endpoint to extract the region from the task ARN
- Falls back to `us-east-1` if detection fails
This ensures Claude Code uses Bedrock in the same region as the running container.
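The metadata-based lookup amounts to parsing the region out of the task ARN. Assuming `curl` and `jq` are available in the container, it could look like this sketch (the ARN below is a made-up example; at runtime it would come from `${ECS_CONTAINER_METADATA_URI_V4}/task`):

```shell
# At runtime: TASK_ARN=$(curl -s "${ECS_CONTAINER_METADATA_URI_V4}/task" | jq -r '.TaskARN')
TASK_ARN="arn:aws:ecs:us-east-2:123456789012:task/rosa-boundary-dev/0123456789abcdef"

# The region is the fourth colon-separated field of the ARN
AWS_REGION=$(echo "$TASK_ARN" | cut -d: -f4)
AWS_REGION="${AWS_REGION:-us-east-1}"   # fallback if detection fails

echo "$AWS_REGION"   # us-east-2
```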
The ECS task role needs Bedrock permissions:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream",
        "bedrock:ListInferenceProfiles"
      ],
      "Resource": [
        "arn:aws:bedrock:*:*:inference-profile/*",
        "arn:aws:bedrock:*:*:foundation-model/*"
      ]
    }
  ]
}
```

```shell
# Start Claude Code session
claude

# Get help with a command
claude "How do I check the status of cluster operators?"

# Run interactive investigation
claude "Investigate pods in crashloop in default namespace"

# Disable Claude Code via environment variable
podman run -e CLAUDE_CODE_USE_BEDROCK=0 rosa-boundary:latest
```

Configuration files in `/home/sre/.claude/` are preserved across container restarts when using EFS:
- First run: Skeleton files copied from `/etc/skel-sre/.claude/`
- Subsequent runs: Existing configuration preserved (no overwrite)
- Customize: Edit `/home/sre/.claude/CLAUDE.md` to add cluster-specific context
```shell
# Run with default versions (OC 4.20, official AWS CLI)
podman run -it rosa-boundary:latest /bin/bash

# Run with specific versions
podman run -it -e OC_VERSION=4.18 -e AWS_CLI=fedora rosa-boundary:latest /bin/bash

# Check tool versions
podman run --rm rosa-boundary:latest sh -c "aws --version && oc version --client"
```

- Base: Fedora 43
- AWS CLI: v2.32.16+ (official), v2.27.0 (Fedora RPM)
- OpenShift CLI: 4.14.x, 4.15.x, 4.16.x, 4.17.x, 4.18.x, 4.19.x, 4.20.x
- Claude Code: 2.0.69 (native installer), auto-updates disabled
- Additional tools: util-linux (includes su for user switching)
The manifest list automatically selects the appropriate image for your platform:
- `linux/amd64` - x86_64 architecture
- `linux/arm64` - ARM64/aarch64 architecture (Graviton)
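Which image the manifest resolves to on a given host follows `uname -m`; a quick sketch of the mapping:

```shell
# Map the kernel architecture to the manifest platform names above
case "$(uname -m)" in
  x86_64)         PLATFORM=linux/amd64 ;;
  aarch64|arm64)  PLATFORM=linux/arm64 ;;
  *)              PLATFORM=unsupported ;;
esac
echo "$PLATFORM"
```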
Test AWS functionality locally before production deployment:
```shell
# Start LocalStack (requires LocalStack Pro token)
make localstack-up

# Run fast tests (~2-3 min)
make test-localstack-fast

# Run full test suite (~5-7 min)
make test-localstack

# Stop LocalStack
make localstack-down
```

See tests/localstack/README.md for complete documentation.
```shell
cd lambda/create-investigation/
make test
```

GitHub Actions workflow runs on PRs and pushes to main:
- LocalStack Integration Tests - AWS service validation
- Lambda Unit Tests - Handler function validation with moto
Required GitHub Secret: LOCALSTACK_AUTH_TOKEN (LocalStack Pro license)