$ cat infrastructure.md
Infrastructure.
4 Proxmox nodes, 53 LXC containers, fully self-hosted. Not a single paid cloud service.
∷ live topology · 62 nodes · exported from Homelable
Overview
The infrastructure runs on 4 heterogeneous Proxmox VE nodes — each with a specific role. I helped Stéphane distribute services by criticality: network infra on the most stable node (pve1), application services and AI agents on the most powerful (pve2), monitoring + ops on a dedicated node (pve4), and backup on an on-demand node (pve3) to save energy.
Building blocks & technical choices
Every building block was chosen for a specific reason. No trendy stacks — tools that solve concrete problems. Here are the core technologies, why they are here, and what they replaced.
Proxmox VE
Why: Open source hypervisor with native LXC — containers start in 2 seconds and consume 50 MB of RAM. Integrated PBS for backups. Full API.
Rejected: ESXi (free edition discontinued in 2024), Hyper-V (Windows only), XCP-ng (smaller community)
Result: 4 heterogeneous nodes, 53 CTs, incremental backups via PBS
Traefik
Why: Dynamic YAML config hot-reloaded — I add an HTTPS service by dropping a file in conf.d/, no restart needed. Native ACME with step-ca.
Rejected: Nginx Proxy Manager (UI-only, not IaC), Caddy (fewer reverse proxy integrations)
Result: 39 HTTPS services, auto-renewed certificates, zero manual intervention
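To make the drop-a-file workflow concrete, here is a minimal sketch that renders a Traefik file-provider fragment for a single service. The service name `wiki`, the backend address `192.168.1.50:8080`, the `websecure` entry point, and the `stepca` resolver name are all hypothetical and would need to match the real static configuration.

```python
def render_traefik_fragment(name: str, host: str, backend_url: str,
                            entrypoint: str = "websecure",
                            resolver: str = "stepca") -> str:
    """Return a Traefik dynamic-config fragment (YAML) for one HTTPS service.

    entrypoint and resolver are assumed names; they must exist in the
    static Traefik configuration for the fragment to take effect.
    """
    lines = [
        "http:",
        "  routers:",
        f"    {name}:",
        f'      rule: "Host(`{host}`)"',
        "      entryPoints:",
        f"        - {entrypoint}",
        f"      service: {name}",
        "      tls:",
        f"        certResolver: {resolver}",
        "  services:",
        f"    {name}:",
        "      loadBalancer:",
        "        servers:",
        f'          - url: "{backend_url}"',
    ]
    return "\n".join(lines) + "\n"


# Hypothetical service: name, hostname, and backend are illustrative only.
fragment = render_traefik_fragment(
    "wiki", "wiki.pixelium.internal", "http://192.168.1.50:8080"
)
print(fragment)
```

Saving the returned text as a file under conf.d/ is the whole deployment step, provided the file provider watches that directory.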
TechnitiumDNS
Why: Native DNS-over-TLS, built-in blocklists (OISD + Hagezi), full API for automation. HA via AXFR primary/secondary.
Rejected: Pi-hole (no native DoT, limited API), AdGuard Home (less flexible zone management)
Result: HA DNS with 2 instances, ~650k blocked domains, strict DoT on all clients
step-ca
Why: Private ACME CA — Traefik requests certificates via the standard ACME protocol, exactly like Let's Encrypt, but locally. 90-day certs, automatic renewal.
Rejected: mkcert (no ACME, manual renewal), HashiCorp Vault PKI (overkill for a homelab)
Result: Full internal PKI, zero browser warnings, zero expired certificates
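The renewal logic behind "zero expired certificates" can be sketched as a simple window check. The 90-day lifetime comes from the text above; the 30-day renewal window and the function name `needs_renewal` are assumptions made for illustration, not the configured values.

```python
from datetime import datetime, timedelta, timezone

CERT_LIFETIME = timedelta(days=90)  # from the text: 90-day certificates
RENEW_BEFORE = timedelta(days=30)   # assumption: renew 30 days before expiry


def needs_renewal(not_after: datetime, now: datetime,
                  renew_before: timedelta = RENEW_BEFORE) -> bool:
    """Return True once the certificate has entered its renewal window."""
    return now >= not_after - renew_before


issued = datetime(2025, 1, 1, tzinfo=timezone.utc)  # arbitrary example date
expires = issued + CERT_LIFETIME

day_59 = needs_renewal(expires, issued + timedelta(days=59))
day_60 = needs_renewal(expires, issued + timedelta(days=60))
print(day_59, day_60)  # False True
```

With these assumed numbers a certificate is replaced two thirds of the way through its life, which leaves a month of slack if the CA is briefly unavailable.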
Authentik
Why: Universal OAuth2/OIDC — each service gets its own provider. Forward-auth proxy for services without native SSO. WebAuthn (YubiKey) for MFA.
Rejected: Keycloak (heavy Java, 1 GB+ RAM), Authelia (less flexible on custom flows)
Result: SSO across 6 heterogeneous services, single login for the entire homelab
Ansible + Semaphore
Why: Agentless — SSH is enough, no daemon to install on 30+ CTs. Idempotent — I rerun a playbook without risk. Semaphore adds a web UI for one-click launches.
Rejected: Puppet/Chef (agents on every host), Terraform (provisioning, not config management)
Result: 32 operational playbooks, Wazuh/Beszel agent deployment in 1 command
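Idempotence is what makes rerunning a playbook safe: a task describes a desired state and only acts when the host differs from it. The sketch below illustrates that idea in plain Python (it is an analogy, not Ansible code); `ensure_line` and the `sshd_config` example are hypothetical.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Append `line` to the file unless it is already present.

    Returns True if the file was changed, False if it was already
    in the desired state.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already reached: do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sshd_config"  # stand-in for a real config file
    first = ensure_line(target, "PermitRootLogin no")
    second = ensure_line(target, "PermitRootLogin no")
    content = target.read_text()

print(first, second)  # True False
```

The second call reports no change, which is the same "changed / ok" distinction a playbook run shows per task.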
Wazuh
Why: Full open source SIEM — FIM (file integrity monitoring), CIS compliance, intrusion detection, all in a single product.
Rejected: ELK alone (not native SIEM, just log aggregation), Splunk (commercial, volume-priced)
Result: Intrusion detection + CIS compliance across the entire homelab
CrowdSec
Why: Community-driven IPS — blocklists are shared across all CrowdSec users. An IP that attacks a homelab in France gets blocked worldwide.
Rejected: Fail2ban (local only, no community dimension, fragile regexes)
Result: 57 detection scenarios, collective protection, iptables bouncer on Traefik
AI Agents
Why: AI is not a gadget; it is an operational partner. The AIops v2 trio works as a chain: OpenFang (headless sentinel, 8 Guardian crons) → MQTT → Hermes (24/7 Telegram triage, 3 night crons) → SSH spawn of Claude CT 196 (ephemeral remediation). Alongside it run PentAGI (autonomous pentest, on-demand on pve3) and RAPTOR (source code audit, in distrobox). Inference goes through MiniMax M2.7 via LiteLLM (4-provider fallback), with an RTX 3090 for local inference. All agents communicate over the MQTT bus.
Stack: OpenFang (Rust), Hermes (Python), PentAGI (Docker/Kali), RAPTOR (distrobox Semgrep/CodeQL/AFL++), MQTT (Mosquitto), 31 cybersec skills
Result: 11 automated crons (8 Guardian + 3 Hermes), daily backups, autonomous monitoring + security digest + doc reconciliation, Claude CT 196 spawnable for critical remediation — ~€11/month total (LiteLLM routed)
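The multi-provider fallback mentioned above follows a simple pattern: try each provider in order and return the first answer. The sketch below shows that pattern in plain Python; it does not use the LiteLLM API, and the provider names and the `complete_with_fallback` helper are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def complete_with_fallback(prompt: str,
                           providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, answer) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def down(prompt: str) -> str:
    """Stand-in for an unreachable remote provider."""
    raise TimeoutError("provider unreachable")


def local_gpu(prompt: str) -> str:
    """Stand-in for local inference."""
    return f"echo: {prompt}"


used, answer = complete_with_fallback(
    "disk usage on pve2?",
    [("primary", down), ("secondary", down), ("local", local_gpu)],
)
print(used, answer)
```

Ordering the list from cheapest or fastest to last resort is the only policy decision; the loop itself stays the same whatever the number of providers.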
VictoriaMetrics
Why: Prometheus-compatible (PromQL, remote write), but shipped as a single binary: no Alertmanager, no Thanos, no stack of 15 components to run. Superior compression, less RAM.
Rejected: Prometheus (heavier on RAM, less efficient storage), InfluxDB (commercial license)
Result: Long-term TSDB metrics, scraping 20+ targets, queryable by the OpenFang agent
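Because the query API is Prometheus-compatible, an agent can query it with a plain HTTP GET on `/api/v1/query`. The sketch below only builds the URL; the hostname is hypothetical, port 8428 assumes a default single-node install, and `up == 0` is just an example expression for "targets that are down".

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus-style instant query URL for the given PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical host; adjust to the real VictoriaMetrics address.
url = instant_query_url("http://victoriametrics.pixelium.internal:8428", "up == 0")
print(url)
```

Fetching that URL returns JSON in the standard Prometheus response shape, so any existing Prometheus client code should work unchanged.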
Beszel
Why: Lightweight system monitoring — 10 MB Go agents, elegant web dashboard, one-command install. No need to configure Grafana + node_exporter + JSON dashboards.
Rejected: Grafana + node_exporter (powerful but complex to maintain for basic monitoring)
Result: 30 deployed agents, instant CPU/RAM/disk overview across the entire homelab
Patchmon
Why: Patch compliance across the entire homelab — centralized dashboard showing which CTs have pending updates. Automatic enrollment of Proxmox nodes.
Rejected: Manual apt list --upgradable scripts (no overview, no history)
Result: Instant visibility on pending patches, compliance across 30+ CTs
Network & TLS
The network is the foundation of everything. Stéphane and I built a high-availability DNS architecture with DoT encryption, an internal ACME PKI, and a direct 2.5 Gbps link between the two main nodes.
LAN
192.168.1.0/24 — Freebox Delta gateway (.254). SFP+ 10G to the workstation, Ethernet to Proxmox nodes.
HA DNS
Primary CT 100 (pve1) + secondary CT 101 (pve2). Automatic AXFR synchronization. DoT port 853. OISD + Hagezi blocklists (~650k domains).
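On the wire, DoT is ordinary DNS carried inside a TLS session on port 853, with each message preceded by a 2-byte length as in DNS over TCP. The sketch below builds and frames one query without sending it; the queried name and the fixed query ID are illustrative assumptions.

```python
import struct


def build_dns_query(name: str, query_id: int = 0x1234, qtype: int = 1) -> bytes:
    """Encode a single-question DNS query (QTYPE 1 = A, QCLASS 1 = IN)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Each label is written as a length byte followed by its ASCII bytes.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question


def frame_for_dot(message: bytes) -> bytes:
    """Prefix the message with its 2-byte big-endian length (TCP/TLS framing)."""
    return struct.pack("!H", len(message)) + message


query = build_dns_query("traefik.pixelium.internal")  # hypothetical name
framed = frame_for_dot(query)
print(len(query), len(framed))
```

Sending `framed` would additionally require a TLS-wrapped TCP connection to one of the two DNS instances on port 853, which is left out here to keep the example offline.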
TLS pipeline
step-ca (CT 102) issues certificates via the ACME tlsChallenge (TLS-ALPN-01). Traefik (CT 110) requests and renews them automatically. Certificate lifetime: 90 days.
DNS pattern
All *.pixelium.internal points to 192.168.1.110 (Traefik). Traefik routes to the correct backend based on the Host header.
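Since every name resolves to the same address, the only thing that distinguishes services is the Host header. A minimal sketch of that lookup, with a hypothetical routing table (the hostnames and backend addresses below are invented for illustration):

```python
from typing import Dict, Optional

# Hypothetical routing table: hostnames and backends are illustrative only.
ROUTES: Dict[str, str] = {
    "auth.pixelium.internal": "http://192.168.1.120:9000",
    "beszel.pixelium.internal": "http://192.168.1.121:8090",
}


def pick_backend(host_header: str, routes: Dict[str, str]) -> Optional[str]:
    """Return the backend URL for a Host header, or None if no route matches."""
    # Drop an optional ":port" suffix and compare case-insensitively.
    host = host_header.split(":", 1)[0].strip().lower()
    return routes.get(host)


print(pick_backend("Auth.Pixelium.Internal:443", ROUTES))
print(pick_backend("unknown.pixelium.internal", ROUTES))
```

An unmatched host returns None, which corresponds to the proxy answering with a 404 rather than forwarding the request.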
Direct 2.5G link
pve1 and pve2 are connected point-to-point via RTL8125B — 10.10.10.1/30 to 10.10.10.2/30. Inter-node transfers bypass the switch.
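A /30 is the natural size for a point-to-point link: it contains exactly two usable host addresses. This can be checked with Python's standard `ipaddress` module, using the addresses given above.

```python
import ipaddress

link = ipaddress.ip_network("10.10.10.0/30")
usable = [str(host) for host in link.hosts()]  # excludes network and broadcast

pve1 = ipaddress.ip_interface("10.10.10.1/30")
pve2 = ipaddress.ip_interface("10.10.10.2/30")

print(usable)                                 # ['10.10.10.1', '10.10.10.2']
print(link.netmask)                           # 255.255.255.252
print(pve1.network == pve2.network == link)   # True
```

Both node addresses fall in the same four-address block, so traffic between them never needs a gateway.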
VPN mesh
Headscale (CT 106) — self-hosted Tailscale coordination server. Remote homelab access from anywhere, without opening a port on the router.
step-ca (private ACME CA) → tlsChallenge (standard protocol) → Traefik (HTTPS reverse proxy) → auto-renewal (zero intervention)
Observability
I helped Stéphane build an observability stack with 5 complementary tools. Each does one thing well — no monolithic platform. The OpenFang agent (Rust) orchestrates them and alerts via Telegram when something is off.
VictoriaMetrics
Prometheus-compatible TSDB. Scrapes 20+ targets, long-term retention, PromQL queries.
Loki + Promtail
Centralized log aggregation. Promtail on each host pushes to Loki (CT 240). LogQL queries.
Beszel
30 agents. CPU/RAM/disk/network dashboard. Threshold alerts. Instant overview.
Patchmon
Tracks pending updates and CVEs across all CTs and VMs.
Wazuh
Intrusion detection, FIM, CIS compliance. Security event correlation.