This directory contains documentation for BERDL, a purpose-built data lakehouse system.
All source code repositories are located in the BERDataLakehouse GitHub Organization.
Note: This documentation provides a brief introduction to each core component of the BERDL system. For detailed development and service information, please refer to each repository's README file.
All BERDL services require KBase authentication using a KBase Token. Users must have the BERDL_USER role assigned to their KBase account to access the platform. Admin operations additionally require the CDM_JUPYTERHUB_ADMIN role.
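As a hedged illustration of the token requirement above, the sketch below shows how a client might attach a KBase token when calling a BERDL service. The endpoint URL and the `Authorization` header convention are assumptions for illustration, not taken from any BERDL repository; check each repository's README for the exact scheme a given service expects.

```python
# Minimal sketch of attaching a KBase token to a BERDL API call.
# The service URL and header name are hypothetical.
import urllib.request


def authed_request(url: str, kbase_token: str) -> urllib.request.Request:
    """Build a request that carries the KBase token in a header."""
    return urllib.request.Request(url, headers={"Authorization": kbase_token})


req = authed_request("https://berdl.example.org/api/v1/status", "MY_KBASE_TOKEN")
# urllib.request.urlopen(req) would then perform the authenticated call.
```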
BERDL uses a microservices architecture to provide a secure, scalable, and interactive data-analysis environment. Core components include dynamic notebook spawning, secure credential management, and a Model Context Protocol (MCP) server for AI-assisted data operations.
```mermaid
graph LR
    subgraph Users ["User Layer"]
        direction TB
        User([User])
        Remote([BERDL Remote CLI])
        SPXClient([Spark Connect Client])
    end
    subgraph Entry ["Platform Entry"]
        direction TB
        JH[BERDL JupyterHub]
        SPX[Spark Connect Proxy]
    end
    subgraph Workspaces ["User Environments"]
        direction TB
        NB[Spark Notebook]
        DYNC[Dynamic Spark Cluster]
    end
    subgraph Core ["Core Services"]
        direction TB
        MMS[MinIO Manager Service]
        SCM[Spark Cluster Manager]
        MCP[Datalake MCP Server]
        TAS[Tenant Access Service]
    end
    subgraph Compute ["Shared Compute"]
        direction TB
        SM[Shared Static Cluster]
    end
    subgraph Data ["Data & Metadata"]
        direction TB
        HM[Hive Metastore]
        S3[MinIO Storage]
    end
    subgraph Infra ["Infrastructure & External"]
        direction TB
        PG[(PostgreSQL)]
        Disk[(Persistent Disk)]
        Slack([Slack])
    end

    %% User Entry Flow
    User -->|"Browser/API"| JH
    User -->|"Direct API"| MCP
    Remote -->|"API/Kernels"| JH
    SPXClient -->|"gRPC"| SPX

    %% Entry Routing
    JH -->|"Proxies UI"| NB
    JH -->|"Init Policy"| MMS
    JH -->|"Trigger Create"| SCM
    SPX -->|"Tunnels to kernel"| NB

    %% Workspace Interactions
    NB -->|"Uses"| DYNC
    NB -->|"Auth"| MMS
    NB -->|"Query"| MCP
    NB -->|"Request Access"| TAS

    %% Core Services Logic
    SCM -->|"Spawns"| DYNC
    TAS -->|"Notify"| Slack
    TAS -->|"Add to Group"| MMS
    MCP -->|"Direct/Fallback"| SM
    MCP -->|"Via Hub"| DYNC
    MMS -->|"Manage Policies"| S3

    %% Data Access
    NB -->|"S3"| S3
    NB -->|"Metadata"| HM
    MCP -->|"S3"| S3
    MCP -->|"Metadata"| HM
    DYNC -->|"Process"| S3
    SM -->|"Process"| S3

    %% Infrastructure Backends
    HM -->|"Store"| PG
    S3 -.->|"Disk"| Disk

    %% Styling
    classDef service fill:#f9f,stroke:#333,stroke-width:2px;
    classDef storage fill:#ff9,stroke:#333,stroke-width:2px;
    classDef compute fill:#cce6ff,stroke:#333,stroke-width:2px;
    classDef external fill:#e8e8e8,stroke:#333,stroke-width:1px;
    class JH,NB,MMS,SCM,MCP,TAS,SPX service;
    class S3,HM,PG,Disk storage;
    class DYNC,SM compute;
    class Slack,Remote,SPXClient external;
```
The following diagram illustrates the build hierarchy and base image dependencies for the BERDL services.
```mermaid
graph TD
    %% Base Images
    JQ[quay.io/jupyter/pyspark-notebook]
    PUB_JH[jupyterhub/jupyterhub]
    PY313[python:3.13-slim]
    PY311[python:3.11-slim]
    PY310[python:3.10-slim]

    %% Internal Base
    subgraph Foundation
        id1(spark_notebook_base)
    end

    %% Services
    subgraph Services
        NB[spark_notebook]
        MCP[datalake-mcp-server]
        MMS[minio_manager_service]
        SCM[spark_cluster_manager]
        JH[BERDL_JupyterHub]
        TAS[tenant_access_request_service]
        BRM[berdl_remote]
        SPX[spark_connect_proxy]
    end

    %% Dynamic Compute
    subgraph DynamicCompute ["Dynamic Compute"]
        DYNC["Dynamic Spark Cluster (kube_spark_manager_image)"]
    end

    %% Relations
    JQ -->|FROM| id1
    id1 -->|FROM| NB
    id1 -->|FROM| MCP
    NB -->|FROM| DYNC
    PUB_JH -->|FROM| JH
    PY313 -->|FROM| MMS
    PY313 -->|FROM| TAS
    PY313 -->|FROM| SPX
    PY311 -->|FROM| SCM
    PY310 -->|FROM| BRM

    %% Styling
    classDef external fill:#eee,stroke:#333,stroke-dasharray: 5 5;
    classDef internal fill:#cce6ff,stroke:#333,stroke-width:2px;
    classDef service fill:#f9f,stroke:#333,stroke-width:2px;
    classDef compute fill:#ffcc00,stroke:#333,stroke-width:2px;
    class JQ,PUB_JH,PY313,PY311,PY310 external;
    class id1 internal;
    class NB,MCP,MMS,SCM,JH,TAS,BRM,SPX service;
    class DYNC compute;
```
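The FROM chain above implies a build order: `spark_notebook_base` must be built before `spark_notebook`, which in turn precedes the dynamic cluster image. As a rough sketch (image names are taken from the diagram; the external base images are treated as already-available nodes), Python's `graphlib` can make that ordering explicit:

```python
# Build-order sketch for the images in the diagram above.
# Keys are images; values are the images they are built FROM.
from graphlib import TopologicalSorter

from_edges = {
    "spark_notebook_base": ["quay.io/jupyter/pyspark-notebook"],
    "spark_notebook": ["spark_notebook_base"],
    "datalake-mcp-server": ["spark_notebook_base"],
    "kube_spark_manager_image": ["spark_notebook"],
    "BERDL_JupyterHub": ["jupyterhub/jupyterhub"],
    "minio_manager_service": ["python:3.13-slim"],
    "tenant_access_request_service": ["python:3.13-slim"],
    "spark_connect_proxy": ["python:3.13-slim"],
    "spark_cluster_manager": ["python:3.11-slim"],
    "berdl_remote": ["python:3.10-slim"],
}

# static_order() yields each base image before anything built from it,
# i.e. a valid sequence in which to run the image builds.
build_order = list(TopologicalSorter(from_edges).static_order())
```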
The following diagram illustrates the internal Python package dependencies.
```mermaid
graph TD
    %% Clients
    SMC[cdm-spark-manager-client]
    MMSC[minio-manager-service-client]
    MCPC[datalake-mcp-server-client]

    %% Base Package
    subgraph Base ["spark_notebook_base"]
        PNB[berdl-notebook-python-base]
    end

    %% Service Implementations
    subgraph NotebookUtils ["spark_notebook"]
        NU[berdl_notebook_utils]
    end
    subgraph Extensions ["JupyterLab Extensions"]
        ARE[berdl_access_request_extension]
        TDB[tenant-data-browser]
        CBA[cdm-jupyter-ai-cborg]
    end
    subgraph MCPServer ["datalake-mcp-server"]
        MCP[datalake-mcp-server]
    end
    subgraph JupyterHub ["BERDL_JupyterHub"]
        JH[berdl-jupyterhub]
    end

    %% Dependencies
    PNB -->|Dep| SMC
    PNB -->|Dep| MMSC
    PNB -->|Dep| MCPC
    NU -->|Dep| PNB
    MCP -->|Dep| NU
    JH -->|Dep| SMC

    %% JupyterLab extensions depend on user environment
    ARE -.->|Env| NU
    TDB -.->|Env| NU

    %% Styling
    classDef client fill:#ffedea,stroke:#cc0000,stroke-width:1px;
    classDef pkg fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
    classDef ext fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;
    class SMC,MMSC,MCPC client;
    class PNB,NU,MCP,JH pkg;
    class ARE,TDB,CBA ext;
```
| Service | Description | Documentation | Repository |
|---|---|---|---|
| JupyterHub | Manages user sessions and spawns individual notebook servers. | BERDL JupyterHub | Repo |
| Spark Notebook | User's personal workspace with Spark pre-configured. | Spark Notebook | Repo |
| Spark Notebook Base | Foundational Docker image with PySpark and common dependencies. | Spark Notebook Base | Repo |
| Datalake MCP Server | FastAPI Data API with MCP layer for AI interactions and direct queries. | Datalake MCP Service | Repo |
| Datalake MCP Server Client | Auto-generated Python client for the MCP server. | Datalake MCP Server Client | Repo |
| MinIO Manager Service | Handles dynamic credentials and IAM policies for secure data access. | MinIO Manager Service | Repo |
| MinIO Manager Service Client | Auto-generated Python client for the MinIO Manager Service. | MinIO Manager Service Client | Repo |
| Spark Cluster Manager | API for managing dynamic, personal Spark clusters on Kubernetes (the primary entry point for users). | Spark Cluster Manager | Repo |
| Hive Metastore | Stores metadata for Delta Lake tables. | Hive Metastore | Repo |
| Spark Cluster | Spark master/worker image for static and dynamic clusters. | Spark Cluster | Repo |
| BERDL Access Request Extension | JupyterLab extension providing UI for tenant access requests. | Access Request Extension | Repo |
| Tenant Data Browser | JupyterLab extension for navigating MinIO object storage visually. | Tenant Data Browser | Repo |
| CDM Jupyter AI CBorg | Integration module between Jupyter AI and the CBorg LLM API provider. | Jupyter AI CBorg Setup | Repo |
| Tenant Access Request Service | Slack workflow for users to request access to tenant groups. | Tenant Access Request Service | Repo |
| Spark Connect Proxy | Multi-user authenticating layer for Spark Connect requests. | Spark Connect Proxy | Repo |
| Spark Connect Remote Client | Python library that interfaces with Spark Connect Proxy. | Spark Connect Remote Client | Repo |
| BERDL Remote CLI | Local development toolkit for connecting to BERDL securely. | BERDL Remote CLI | Repo |