// INFRASTRUCTURE DOCUMENTATION

DevOps Proof

This page documents a production-grade Kubernetes cluster built from scratch on bare metal hardware, from flashing a raw OS image to a fully operational GitOps-driven infrastructure stack. Every design decision is explained, every tool is justified, and the architecture is diagrammed in full. This is the same stack I would discuss in any senior DevOps or platform engineering interview.

Talos OS · Bare Metal | Kubernetes v1.35 | ArgoCD GitOps | Helm · Terraform | Prometheus · Grafana · Loki | Ingress-NGINX · MetalLB · Cert-Manager | Sealed Secrets | Cloudflare Tunnel (planned)

// 01 · ARCHITECTURE

Full Stack Overview

The cluster runs across three repurposed machines (two Omen gaming laptops, one with a GTX 1050 Ti and one with an RTX 2070 Super, plus a Beelink mini PC), forming a dedicated multi-node Kubernetes cluster that is completely isolated from the MS-01 production Unraid server. The OS is Talos Linux: an immutable, API-only OS with no shell, no SSH daemon, and no package manager. It boots directly into a minimal Linux kernel running only containerd and the Talos API daemon. Every configuration change is an authenticated API call over mutual TLS. This enforces infrastructure-as-code discipline at the OS level.

On top of Talos, a 3-node Kubernetes v1.35.2 cluster runs the full application stack. Flannel handles pod networking, MetalLB provides LoadBalancer IP assignment from the LAN range, and Ingress-NGINX routes external traffic by hostname. All workloads are deployed and reconciled continuously by ArgoCD, which watches the GitHub repository as the single source of truth. Nothing is ever applied manually.

Diagram 1 of 3: Seven-layer stack from physical hardware to GitOps applications. Green nodes are Kubernetes-native components. Blue nodes are networking and GitOps tooling. Amber nodes are planned or in-progress additions.
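
The MetalLB role in the overview above (handing out LoadBalancer IPs from a LAN range) can be illustrated with a small Python model. This is a sketch under stated assumptions, not MetalLB's actual allocator: the pool range, the service names, and the "first free address" policy are all hypothetical.

```python
import ipaddress


def pool_addresses(first: str, last: str) -> list:
    """Expand an inclusive first-last range into individual IPv4 addresses."""
    start, end = ipaddress.IPv4Address(first), ipaddress.IPv4Address(last)
    return [ipaddress.IPv4Address(n) for n in range(int(start), int(end) + 1)]


def assign(pool: list, allocated: dict, service: str):
    """Return the service's address, taking the first free one if it has none yet."""
    if service in allocated:
        return allocated[service]
    used = set(allocated.values())
    for address in pool:
        if address not in used:
            allocated[service] = address
            return address
    return None  # pool exhausted: the Service would keep waiting for an IP


# Assumed range for illustration; only 192.168.0.200 is stated on this page.
pool = pool_addresses("192.168.0.200", "192.168.0.210")
allocated = {}
print(assign(pool, allocated, "ingress-nginx"))  # 192.168.0.200
print(assign(pool, allocated, "grafana"))        # 192.168.0.201
```

In practice almost everything sits behind the single Ingress-NGINX address, so the pool only needs to be a handful of addresses kept outside the router's DHCP range.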

// 02 · DESIGN DECISIONS

Why Each Tool Was Chosen

Every tool in the stack was chosen for a specific production reason, not to pad a skills list. The table below explains the decision behind each component.

TOOL	PROBLEM IT SOLVES	WHY NOT THE ALTERNATIVE
Talos OS	Need an OS that enforces IaC discipline at the node level. No shell access means no configuration drift, no manual fixes that never get documented.	Ubuntu with Ansible: works, but SSH access is a footgun. A developer can SSH in, fix something manually, and the playbook no longer reflects reality. Talos makes that impossible.
ArgoCD	Need a continuous reconciliation loop that keeps cluster state aligned with Git. Manual kubectl applies break down as soon as more than one person or automation touches the cluster.	FluxCD is equivalent. Chose ArgoCD for the UI and the explicit Application CRD model, which makes the desired state auditable per-application rather than just per-directory.
MetalLB	Kubernetes Services of type LoadBalancer rely on a cloud provider to allocate the external IP. On bare metal there is no cloud to do that, so the external IP stays pending and services are reachable from outside the cluster only via NodePort.	Pure NodePort works but requires hardcoded high ports everywhere. MetalLB makes LoadBalancer services behave the same way they would on EKS or GKE: a real IP on standard ports 80/443.
Ingress-NGINX	Without a reverse proxy, each service needs its own IP and port. Ingress-NGINX allows routing many services through one IP using hostname-based virtual hosting.	Traefik is a popular alternative with similar capability. Ingress-NGINX was chosen because it mirrors how most production clusters (including large Dutch enterprises) have historically handled ingress, making it more interview-relevant.
Cert-Manager	TLS certificates expire and managing them manually is a production incident waiting to happen. Cert-Manager integrates with Let's Encrypt and renews automatically.	Manual certificate management is not acceptable in production. Cert-Manager is the industry standard for automated TLS in Kubernetes and handles DNS-01 and HTTP-01 challenge types.
Sealed Secrets	Raw Kubernetes Secrets are only base64-encoded, not encrypted. Committing them to Git means anyone with repo access can decode them instantly.	SOPS with age keys is equally valid (already used for Terraform secrets in the IaC repo). Sealed Secrets was added to demonstrate the Kubernetes-native pattern used at many enterprises.
kube-prometheus-stack	Need cluster-wide metrics for CPU, memory, pod health, and deployment status. Flying blind without observability is not a production posture.	The full stack (Prometheus Operator + Grafana + kube-state-metrics + Alertmanager) ships as a single Helm chart, pre-wired with all the dashboards and scrape configs needed. Installing components separately would take significantly more configuration time.
Loki	Logs from every pod need to be aggregated without SSHing into nodes. Loki aggregates logs cluster-wide and makes them queryable in Grafana alongside metrics.	Elasticsearch (ELK stack) handles this but is significantly heavier on resources for a homelab node. Loki's compressed log storage and tight Grafana integration make it the right fit here.
Cloudflare Tunnel (planned)	DU Telecom in Dubai uses CGNAT, so there is no public inbound IP on the home connection. Standard port forwarding is impossible. The cluster cannot be publicly accessible without a solution that works outbound-only.	ngrok and similar tools work but are not production-grade for a persistent homelab. Cloudflare Tunnel is already proven on the MS-01 (running onetwork.cc for Plex and other services), so the pattern is known and trusted.
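
The Sealed Secrets row rests on one fact: base64 is an encoding, not encryption. A minimal Python sketch makes the point; the secret value is made up for illustration and does not come from the repository.

```python
import base64

# Hypothetical value, encoded the way it would appear under `data:` in a Secret.
plaintext = "s3cr3t-db-password"
encoded = base64.b64encode(plaintext.encode("utf-8")).decode("ascii")

# Decoding needs no key or passphrase: base64 is reversible by design.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded == plaintext)  # True
```

Anyone who can read the manifest can run the same two lines, which is why only the encrypted SealedSecret form belongs in Git.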

// 03 · GITOPS PIPELINE

From Git Push to Running Pods

The GitOps pipeline enforces a single invariant: Git is the only write path to the cluster. No human runs kubectl apply directly against production. Every change, whether a new application, a configuration update, or an infrastructure addition, is made by editing YAML in the repository and merging a pull request. ArgoCD polls the repository every three minutes and automatically syncs any difference it detects between Git and the cluster.

For application deployments, GitHub Actions builds the container image, pushes it to GHCR, and then commits the updated image tag back into the manifest in Git. ArgoCD detects this commit and triggers a sync. The prune flag removes resources that are no longer in Git. The selfHeal flag reverts any manual change made directly to the cluster, enforcing that Git always wins. This is the same reconciliation loop used by teams at ING, Booking.com, and most other Dutch technology companies that have adopted GitOps.
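
The "commit the updated image tag back into the manifest" step can be sketched in Python. This is an illustrative stand-in for whatever the workflow actually runs: the image name, the tags, and the bump_image_tag helper are hypothetical.

```python
import re


def bump_image_tag(manifest: str, image: str, new_tag: str) -> str:
    """Replace the tag of `image` in a YAML manifest, leaving other lines untouched."""
    pattern = re.compile(r"(image:\s*" + re.escape(image) + r"):\S+")
    # A function replacement avoids any backslash handling in the new tag.
    return pattern.sub(lambda match: match.group(1) + ":" + new_tag, manifest)


# Hypothetical manifest fragment and image name.
manifest = (
    "containers:\n"
    "  - name: homepage\n"
    "    image: ghcr.io/example/homepage:abc1234\n"
)
updated = bump_image_tag(manifest, "ghcr.io/example/homepage", "def5678")
print(updated.splitlines()[-1].strip())  # image: ghcr.io/example/homepage:def5678
```

Because the only thing that changes is a line in Git, the deployment history is the commit history, and a rollback is a revert.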

Diagram 2 of 3: End-to-end GitOps pipeline from developer commit to pod running on cluster. The bottom-left box highlights the key principle: Git is the only write path. Self-heal means manual cluster modifications are automatically reverted.
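
The effect of the prune and selfHeal flags can be modelled in a few lines of Python. This is a toy model under stated assumptions, not ArgoCD's implementation: resources are reduced to a name-to-spec dictionary, and the sync and reconcile functions are hypothetical names.

```python
def sync(desired: dict, live: dict, prune: bool) -> dict:
    """Apply every resource from Git; with prune, drop live resources absent from Git."""
    result = dict(live)
    result.update(desired)
    if prune:
        result = {name: spec for name, spec in result.items() if name in desired}
    return result


def reconcile(desired: dict, live: dict, new_commit: bool,
              prune: bool = True, self_heal: bool = True) -> dict:
    """One polling pass: sync on a new commit, or on live drift when self_heal is on."""
    drifted = live != desired
    if new_commit or (self_heal and drifted):
        return sync(desired, live, prune)
    return live


git = {"deployment/homepage": {"replicas": 2}}
cluster = {
    "deployment/homepage": {"replicas": 5},  # someone scaled it by hand
    "configmap/old": {"key": "value"},       # no longer present in Git
}
print(reconcile(git, cluster, new_commit=False))
# {'deployment/homepage': {'replicas': 2}}
```

With self_heal switched off the same call returns the cluster unchanged until the next commit arrives, which is the gap the selfHeal flag closes.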

App of Apps pattern

A single root ArgoCD Application watches k8s/apps/ in the repo. Every file in that directory is an ArgoCD Application or ApplicationSet pointing at a Helm chart. Adding a new service is one YAML file and a git push. ArgoCD bootstraps everything else.

ApplicationSet for infrastructure

All six infrastructure tools (MetalLB, Ingress-NGINX, Cert-Manager, Sealed Secrets, kube-prometheus-stack, Loki) are declared in a single infrastructure.yaml ApplicationSet using a list generator. One file, six deployed Helm releases.
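
What a list generator does can be shown with a short Python sketch: one template, one rendered Application per list element. The namespaces and the template fields below are assumptions chosen for illustration, and the real infrastructure.yaml also carries chart sources and pinned versions that are omitted here.

```python
def expand(template: dict, elements: list) -> list:
    """Render one copy of `template` per element, substituting {{key}} placeholders."""
    def render(value, params):
        if isinstance(value, dict):
            return {key: render(inner, params) for key, inner in value.items()}
        if isinstance(value, str):
            for key, replacement in params.items():
                value = value.replace("{{" + key + "}}", replacement)
        return value

    return [render(template, params) for params in elements]


# Tool names come from this page; the namespaces are hypothetical.
elements = [
    {"name": "metallb", "namespace": "metallb-system"},
    {"name": "ingress-nginx", "namespace": "ingress-nginx"},
    {"name": "cert-manager", "namespace": "cert-manager"},
    {"name": "sealed-secrets", "namespace": "kube-system"},
    {"name": "kube-prometheus-stack", "namespace": "monitoring"},
    {"name": "loki-stack", "namespace": "monitoring"},
]
template = {
    "kind": "Application",
    "metadata": {"name": "{{name}}"},
    "spec": {"destination": {"namespace": "{{namespace}}"}},
}

apps = expand(template, elements)
print(len(apps))                    # 6
print(apps[0]["metadata"]["name"])  # metallb
```

Adding a seventh tool means adding one more element to the list; the template stays untouched.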

Helm via ArgoCD (no local Helm needed)

ArgoCD has Helm built in. The operator machine (Windows PC) does not need Helm installed. Chart versions are pinned in Git. Upgrades are a version bump commit, not a command to remember to run.

Terraform for Cloudflare layer

DNS records, tunnel routes, Zero Trust access policies, and Worker deployments for this CV page are all managed by Terraform with state stored in Cloudflare R2. The cloudflare provider handles everything that lives outside the Kubernetes cluster.

// 04 · TRAFFIC EXPOSURE

How Services Are Accessed: LAN Today, Public Tomorrow

Currently, all cluster services are accessible from devices on the local network only. Ingress-NGINX is assigned 192.168.0.200 via MetalLB L2 advertisement. Hosts file entries on the operator PC map each .homelab hostname to the node IP (hosts files do not support wildcards such as *.homelab, so every name needs its own entry). Requests hit the NodePort on :31837, which forwards to Ingress-NGINX, which in turn routes by hostname to the appropriate pod.
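
The last hop of that path, routing by hostname, comes down to a lookup on the HTTP Host header. The Python sketch below is a simplified model, not Ingress-NGINX itself; the hostnames and backend service names are hypothetical.

```python
from typing import Optional

# Hypothetical hostnames and backend services, standing in for Ingress rules.
ROUTES = {
    "grafana.homelab": "kube-prometheus-stack-grafana:80",
    "argocd.homelab": "argocd-server:80",
}


def route(host_header: str) -> Optional[str]:
    """Pick a backend from the Host header, ignoring any :port suffix and letter case."""
    host = host_header.split(":", 1)[0].lower()
    return ROUTES.get(host)


# The browser sends the NodePort as part of the Host header.
print(route("grafana.homelab:31837"))  # kube-prometheus-stack-grafana:80
print(route("unknown.homelab:31837"))  # None
```

Every hostname reaches the same IP and port; only the Host header decides which service answers, which is why one entry point can serve the whole cluster.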

The production upgrade path is Cloudflare Tunnel: the same pattern already running on the MS-01 homelab server behind onetwork.cc. A single cloudflared pod deployed by ArgoCD establishes an outbound-only encrypted tunnel to Cloudflare. Cloudflare Access sits in front of sensitive services (ArgoCD, Grafana, Kubernetes Dashboard) and requires identity verification before any traffic reaches the cluster. No ports are opened on the home router. This is the correct architecture for a residential connection behind DU Telecom CGNAT.

Diagram 3 of 3: Current LAN access path (top) vs planned Cloudflare Tunnel path (bottom). The tunnel approach requires zero open inbound ports and adds identity-gated access via Cloudflare Access, meaning even if someone knows the URL, they cannot reach the service without authenticated identity.

// 05 · VERIFIABLE EVIDENCE

What Can Be Verified

The following items are directly verifiable by anyone reviewing this profile. This is not a theoretical understanding. The infrastructure is running.

IaC repository, public on GitHub

github.com/omaratabany/Home-Lab-Infra-as-code
Contains Terraform modules (Cloudflare DNS, Zero Trust, Tunnel), Ansible roles, Docker Compose stacks, and the full k8s/ directory with ArgoCD Applications, ApplicationSets, Helm values, and ingress manifests. Every tool on this page is represented in the repository as code that can be reviewed, diffed, and applied.

Live Grafana panels on the CV site

The CV at atabany.net embeds live Prometheus/Grafana panels from the MS-01 homelab server (scroll to Live observability on the homepage). The panels show CPU, memory, disk I/O, and network metrics as real time series. The data path (Prometheus scraping Node Exporter, Grafana querying Prometheus internally, Cloudflare Tunnel exposing only Grafana) is the same pattern being extended to the Kubernetes cluster.

Talos bare metal installation, documented process

Three nodes, each with Talos installed directly on the hardware with no underlying OS. The installation involved diagnosing and resolving I/O errors during initial disk partitioning, identifying disk assignment conflicts between the NVMe and a USB drive, and resolving PodSecurity namespace policy violations that blocked the MetalLB speaker pods. These are the real problems that occur in production bare metal deployments.

ArgoCD managing seven active applications

cert-manager, ingress-nginx, sealed-secrets, metallb, kube-prometheus-stack, loki-stack, and homepage are all deployed and reconciled by ArgoCD from the GitHub repository. The ApplicationSet pattern deploys all infrastructure tools from a single YAML file. Adding any new tool to the cluster is a single file addition and a git push.

Prometheus stack with custom unified dashboard

kube-prometheus-stack deploys Prometheus Operator, Grafana, kube-state-metrics, and Alertmanager. A custom unified operations dashboard (exported as importable JSON) consolidates critical signals from across the default dashboards into a single view: node status, CPU/memory usage, failing pods, pod restarts by namespace, ArgoCD sync health, and firing alerts.

Zero-trust remote access on production MS-01 (already proven)

The MS-01 homelab server has been running Cloudflare Tunnel since 2022 under onetwork.cc. Plex, Overseerr, Jellyseerr, and Grafana are all exposed publicly via Cloudflare Tunnel with Cloudflare Access policies, working correctly behind DU Telecom CGNAT with no open inbound ports. This is the proven pattern that will be replicated for the Kubernetes cluster.