// INFRASTRUCTURE DOCUMENTATION
DevOps Proof
This page documents a production-grade Kubernetes cluster built from scratch on bare metal hardware — from flashing a raw OS image to a fully operational GitOps-driven infrastructure stack. Every design decision is explained, every tool is justified, and the architecture is diagrammed in full. This is the same stack I would discuss in any senior DevOps or platform engineering interview.
// 01 · ARCHITECTURE
Full Stack Overview
The cluster runs on an old laptop with an Intel Core i7-7700 and GTX 1050 Ti, repurposed as a dedicated Kubernetes node and completely isolated from the MS-01 production Unraid server. The OS is Talos Linux: an immutable, API-only OS with no shell, no SSH daemon, and no package manager. It boots directly into a minimal Linux system running only containerd and the Talos API services. Every configuration change is an authenticated API call (Talos secures its API with mutual TLS), which enforces infrastructure-as-code discipline at the OS level.
On top of Talos, a single-node Kubernetes v1.35.2 cluster runs the full application stack. Flannel handles pod networking, MetalLB provides LoadBalancer IP assignment from the LAN range, and Ingress-NGINX routes external traffic by hostname. All workloads are deployed and reconciled continuously by ArgoCD, which watches the GitHub repository as the single source of truth. Nothing is ever applied manually.
// 02 · DESIGN DECISIONS
Why Each Tool Was Chosen
Every tool in the stack was chosen for a specific production reason, not to pad a skills list. The table below explains the decision behind each component.
| TOOL | PROBLEM IT SOLVES | WHY NOT THE ALTERNATIVE |
|---|---|---|
| Talos OS | Need an OS that enforces IaC discipline at the node level. No shell access means no configuration drift, no manual fixes that never get documented. | Ubuntu with Ansible: works, but SSH access is a footgun. A developer can SSH in, fix something manually, and the playbook no longer reflects reality. Talos makes that impossible. |
| ArgoCD | Need a continuous reconciliation loop that keeps cluster state aligned with Git. Manual kubectl applies break down as soon as more than one person or automation touches the cluster. | FluxCD is equivalent — chose ArgoCD for the UI and the explicit Application CRD model which makes the desired state auditable per-application rather than just per-directory. |
| MetalLB | Kubernetes LoadBalancer services rely on a cloud provider to allocate an external IP. On bare metal there is no cloud to do that, so a LoadBalancer service stays pending and workloads can only be reached from outside the cluster via NodePort. | Pure NodePort works but requires hardcoded high ports everywhere. MetalLB makes LoadBalancer services behave the way they would on EKS or GKE: a real IP on standard ports 80/443. |
| Ingress-NGINX | Without a reverse proxy, each service needs its own IP and port. Ingress-NGINX allows routing many services through one IP using hostname-based virtual hosting. | Traefik is a popular alternative with similar capability. Ingress-NGINX was chosen because it mirrors how most production clusters (including large Dutch enterprises) have historically handled ingress, making it more interview-relevant. |
| Cert-Manager | TLS certificates expire and managing them manually is a production incident waiting to happen. Cert-Manager integrates with Let's Encrypt and renews automatically. | Manual certificate management is not acceptable in production. Cert-Manager is the industry standard for automated TLS in Kubernetes and handles DNS-01 and HTTP-01 challenge types. |
| Sealed Secrets | Raw Kubernetes Secrets are base64-encoded, not encrypted. Committing them to Git means anyone with repo access can decode them instantly. | SOPS with age keys is equally valid (already used for Terraform secrets in the IaC repo). Sealed Secrets was added to demonstrate the Kubernetes-native pattern used at many enterprises. |
| kube-prometheus-stack | Need cluster-wide metrics for CPU, memory, pod health, and deployment status. Flying blind without observability is not a production posture. | The full stack (Prometheus Operator + Grafana + kube-state-metrics + Alertmanager) ships as a single Helm chart, pre-wired with all the dashboards and scrape configs needed. Installing components separately would take significantly more configuration time. |
| Loki | Logs from every pod need to be aggregatable without SSHing into nodes. Loki aggregates logs cluster-wide and makes them queryable in Grafana alongside metrics. | Elasticsearch (ELK stack) handles this but is significantly heavier on resources for a homelab node. Loki's compressed log storage and tight Grafana integration make it the right fit here. |
| Cloudflare Tunnel (planned) | DU Telecom in Dubai uses CGNAT — there is no public inbound IP on the home connection. Standard port forwarding is impossible. The cluster cannot be publicly accessible without a solution that works outbound-only. | ngrok and similar tools work but are not production-grade for a persistent homelab. Cloudflare Tunnel is already proven on the MS-01 (running onetwork.cc for Plex and other services), so the pattern is known and trusted. |
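The Sealed Secrets row rests on the point that base64 is an encoding, not encryption. A minimal sketch of that point, using made-up credential values (the keys and values below are hypothetical and not taken from the repository):

```python
import base64

# Hypothetical Secret payload, shaped like the `data:` field of a Kubernetes Secret.
secret_data = {
    "username": base64.b64encode(b"admin").decode("ascii"),
    "password": base64.b64encode(b"example-password").decode("ascii"),
}


def decode_secret(data: dict[str, str]) -> dict[str, str]:
    """Recover the plaintext values. No key or passphrase is involved."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in data.items()
    }


if __name__ == "__main__":
    print(secret_data)                  # what would sit in Git
    print(decode_secret(secret_data))   # what any reader of the repo can recover
```

Anyone who can read the manifest can run the same decode step, which is why the encrypted form (a SealedSecret, or a SOPS-encrypted file) is the only one that belongs in Git.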
// 03 · GITOPS PIPELINE
From Git Push to Running Pods
The GitOps pipeline enforces a single invariant: Git is the only write path to the cluster.
No human runs kubectl apply directly against production. Every change — whether a new application,
a configuration update, or an infrastructure addition — is made by editing YAML in the repository and merging
a pull request. ArgoCD polls the repository every three minutes and automatically corrects any drift it detects.
For application deployments, GitHub Actions builds the container image, pushes it to GHCR, and then commits the updated image tag back into the manifest in Git. ArgoCD detects this commit and triggers a sync. The prune flag removes resources that are no longer in Git. The selfHeal flag reverts any manual change made directly to the cluster, so Git always wins. This is the same reconciliation loop that GitOps adopters, reportedly including teams at ING and Booking.com, run in production.
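The effect of the prune and selfHeal flags can be illustrated with a deliberately simplified model of one sync pass. This is a sketch, not ArgoCD's implementation: resources are plain dictionaries, the function name `reconcile` is hypothetical, and the model does not distinguish a new commit from a manual edit.

```python
def reconcile(desired: dict, live: dict, prune: bool = True, self_heal: bool = True) -> dict:
    """Return the cluster state after one simplified sync pass.

    desired: resources rendered from Git, keyed by resource name
    live:    resources currently in the cluster, keyed by resource name
    """
    result = dict(live)

    for name, spec in desired.items():
        if name not in live:
            # Resource exists in Git but not in the cluster: create it.
            result[name] = spec
        elif live[name] != spec and self_heal:
            # Live state differs from Git: overwrite it so Git wins.
            result[name] = spec

    if prune:
        # Remove anything in the cluster that Git no longer declares.
        for name in live:
            if name not in desired:
                del result[name]

    return result


if __name__ == "__main__":
    desired = {"web": {"image": "app:v2"}, "api": {"image": "api:v1"}}
    live = {"web": {"image": "app:v1"}, "old-job": {"image": "job:v1"}}
    print(reconcile(desired, live))
```

With both flags enabled, the result always equals the desired state. Turning either flag off leaves the corresponding difference in place, which is the gap the two flags exist to close.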
Application definitions live in k8s/apps/ in the repo.
Every file in that directory is an ArgoCD Application or ApplicationSet pointing at a Helm chart.
Adding a new service is one YAML file and a git push. ArgoCD bootstraps everything else.
infrastructure.yaml is an ApplicationSet using a list generator. One file, six deployed Helm releases.
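A list generator works by stamping one template once per list element. The sketch below mimics that expansion in plain Python. The six element names are an assumption drawn from the tool table, the namespaces are hypothetical, and the template is reduced to two fields. The contents of the real infrastructure.yaml are not shown on this page.

```python
# Assumed element list: six components named in the tool table.
# Namespaces are illustrative guesses, not taken from the repository.
ELEMENTS = [
    {"name": "metallb", "namespace": "metallb-system"},
    {"name": "ingress-nginx", "namespace": "ingress-nginx"},
    {"name": "cert-manager", "namespace": "cert-manager"},
    {"name": "sealed-secrets", "namespace": "kube-system"},
    {"name": "kube-prometheus-stack", "namespace": "monitoring"},
    {"name": "loki", "namespace": "monitoring"},
]

# Reduced Application template using {{param}} placeholders.
TEMPLATE = {
    "kind": "Application",
    "metadata": {"name": "{{name}}"},
    "spec": {"destination": {"namespace": "{{namespace}}"}},
}


def render(node, params: dict):
    """Recursively substitute {{key}} placeholders in strings inside node."""
    if isinstance(node, dict):
        return {key: render(value, params) for key, value in node.items()}
    if isinstance(node, list):
        return [render(item, params) for item in node]
    if isinstance(node, str):
        for key, value in params.items():
            node = node.replace("{{" + key + "}}", value)
    return node


def expand(template: dict, elements: list) -> list:
    """Produce one rendered Application per element, leaving the template untouched."""
    return [render(template, element) for element in elements]


if __name__ == "__main__":
    for app in expand(TEMPLATE, ELEMENTS):
        print(app["metadata"]["name"], "->", app["spec"]["destination"]["namespace"])
```

Adding a seventh release in this model means appending one element to the list, which matches the one-file workflow described above.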
The Terraform cloudflare provider handles everything that lives outside the Kubernetes cluster.
// 04 · TRAFFIC EXPOSURE
How Services Are Accessed: LAN Today, Public Tomorrow
Currently, all cluster services are accessible from devices on the local network only.
Ingress-NGINX is assigned 192.168.0.200 via MetalLB L2 advertisement.
Hosts file entries on the operator PC map each *.homelab hostname to the node IP
(hosts files do not support wildcards, so every hostname needs its own entry).
Requests hit the NodePort on :31837, which forwards to Ingress-NGINX,
which in turn routes by hostname to the appropriate pod.
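Hostname-based routing comes down to a lookup on the HTTP Host header. The sketch below models that lookup and generates the matching hosts file lines. The hostnames, backend service names, and ports are hypothetical examples, and the node IP is a parameter because it is not stated on this page.

```python
# Hypothetical hostname -> (service name, service port) table.
ROUTES = {
    "argocd.homelab": ("argocd-server", 80),
    "grafana.homelab": ("grafana", 3000),
}


def route(host_header: str):
    """Pick a backend from the Host header, ignoring case and any :port suffix."""
    host = host_header.split(":", 1)[0].strip().lower()
    # None models "no matching rule", which an ingress controller answers with a 404.
    return ROUTES.get(host)


def hosts_file_lines(node_ip: str) -> list:
    """One hosts file line per hostname, since hosts files have no wildcard syntax."""
    return [f"{node_ip} {host}" for host in sorted(ROUTES)]


if __name__ == "__main__":
    print(route("grafana.homelab:31837"))
    print("\n".join(hosts_file_lines("192.0.2.10")))  # placeholder address
```

Every hostname resolves to the same address and port. Only the Host header differs, and that is what lets many services share one entry point.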
The production upgrade path is Cloudflare Tunnel — the same pattern already running
on the MS-01 homelab server behind onetwork.cc. A single cloudflared pod deployed
by ArgoCD establishes an outbound-only encrypted tunnel to Cloudflare. Cloudflare Access sits
in front of sensitive services (ArgoCD, Grafana, Kubernetes Dashboard) and requires identity
verification before any traffic reaches the cluster. No ports are opened on the home router.
This is the correct architecture for a residential connection behind DU Telecom CGNAT.
// 05 · VERIFIABLE EVIDENCE
What Can Be Verified
The following items are directly verifiable by anyone reviewing this profile. This is not a theoretical understanding — the infrastructure is running.
The repository contains Terraform modules (Cloudflare DNS, Zero Trust, Tunnel), Ansible roles, Docker Compose stacks, and the full
k8s/ directory with ArgoCD Applications,
ApplicationSets, Helm values, and ingress manifests. Every tool on this page is represented
in the repository as code that can be reviewed, diffed, and applied.