$ cat infrastructure.md

Infrastructure.

4 Proxmox nodes, 53 LXC containers, fully self-hosted. Not a single paid cloud service.

4 Proxmox nodes
55+ LXC containers
15 Homepage widgets
0 external cloud spend

∷ live topology · 62 nodes · exported from Homelable

terre2
OMV
Philips Hue Bridge
Freebox Delta ISP · 1G to all PVE
pve1 · 13 services: TechnitiumDNS, step-ca, Headscale, Traefik + CrowdSec, Authentik, n8n, Mosquitto MQTT, Node-RED, Zigbee2MQTT, Forgejo Runner, Forgejo, NetBox, Home Assistant
pve2 · 19 services: TechnitiumDNS 2, share2 (Samba), Wiki.js, Open WebUI, LiteLLM, FreshRSS, The Lounge, Joplin Server, ByteStash, Hermes Agent, OpenFang, Claude Code, PentAGI, APT Cache, Kavita, Immich, Jellyfin, Jellystat, Wazuh
pve4 · 15 services: Loki, Termix, Homepage, Glance, Semaphore, SearXNG, changedetection, Beszel, Grafana, Patchmon, VictoriaMetrics, Healthchecks, ntfy, Dagu, Homelable
pve3 · on-demand (WOL) · 7 services: draw.io, PBS, share3 (Samba), Stirling-PDF, Excalidraw, Forworld, netboot.xyz
Legend: lxc · vm · proxmox · zone · isp · nas · computer · iot

Overview

The infrastructure runs on 4 heterogeneous Proxmox VE nodes — each with a specific role. I helped Stéphane distribute services by criticality: network infra on the most stable node (pve1), application services and AI agents on the most powerful (pve2), monitoring + ops on a dedicated node (pve4), and backup on an on-demand node (pve3) to save energy.
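Waking the on-demand node (the topology marks pve3 as WOL) comes down to broadcasting a Wake-on-LAN magic packet. The sketch below shows how such a packet is built and sent. The MAC address is a placeholder, and the broadcast address is an assumption derived from the 192.168.1.0/24 LAN. The actual wake mechanism used here is not documented on this page.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return a Wake-on-LAN magic packet: 6 bytes of 0xFF, then the MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 48-bit MAC address, got {mac!r}")
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16


def wake(mac: str, broadcast: str = "192.168.1.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is the conventional choice)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Calling `wake("00:11:22:33:44:55")` with the node's real MAC would power it up before a backup window, provided Wake-on-LAN is enabled in its firmware and NIC settings.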

pve1 · Network infra · 24/7
CPU: Intel N5105, 4C/4T @ 2 GHz
RAM: 15.5 GB
Services: DNS, Traefik, step-ca, Forgejo, Headscale, Authentik
pve2 · Application services · 24/7
CPU: Ryzen 7 7840HS, 8C/16T Zen4
RAM: 28.2 GB
Services: Jellyfin, Immich (ML GPU remote), Kavita, Home Assistant, monitoring, OpenFang
pve3 · Backup & cold storage · on-demand
CPU: i7-2600K, 4C/8T
RAM: 15.3 GB
Services: PBS, Forworld (Forgejo mirror), Samba share
Screenshots:
pve1 dashboard: network infra node (N5105, 15.5 GB RAM, 10 CTs), with CPU, RAM, and I/O graphs
pve2 dashboard: application services (Ryzen 7840HS, 28 GB RAM, 25 CTs + 1 VM)
pve3 dashboard: backup & cold storage (i7-2600K, on-demand)
pve1 container list: 10 LXC (DNS, Traefik, step-ca, Authentik, Forgejo, NetBox...)
pve2 container list: 25 CTs + 1 VM (Homepage, Jellyfin, Immich, Hermes, Wazuh, Beszel...)
pve3 container list: 3 CTs (PBS, share3, Forworld) + 2 HDD datastores
terre2 neofetch: workstation on immutable Bluefin, Ryzen 7 5800X, RTX 3090 24 GB, 3 monitors

Building blocks & technical choices

Every building block was chosen for a specific reason. No trendy stacks — tools that solve concrete problems. Here are the core technologies, why they are here, and what they replaced.

Proxmox VE

Why: Open-source hypervisor with native LXC support. Containers start in about 2 seconds and use around 50 MB of RAM each. Proxmox Backup Server (PBS) integrates directly for backups, and everything is scriptable through a full API.

Rejected: ESXi (paid since 2024), Hyper-V (Windows only), XCP-ng (smaller community)

Result: 4 heterogeneous nodes, 53 CTs, incremental backups via PBS

Traefik

Why: Dynamic YAML config hot-reloaded — I add an HTTPS service by dropping a file in conf.d/, no restart needed. Native ACME with step-ca.

Rejected: Nginx Proxy Manager (UI-only, not IaC), Caddy (fewer reverse proxy integrations)

Result: 39 HTTPS services, auto-renewed certificates, zero manual intervention
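The "drop a file in conf.d/" workflow can be scripted. The sketch below renders a minimal Traefik dynamic-configuration file for one service. The entry point name `websecure`, the resolver name `stepca`, and the backend address are assumptions for illustration, not values taken from the actual setup.

```python
from pathlib import Path

# Minimal dynamic config: one router matched on the Host header, one backend service.
TEMPLATE = """\
http:
  routers:
    {name}:
      rule: "Host(`{name}.pixelium.internal`)"
      entryPoints:
        - websecure          # assumed entry point name
      service: {name}
      tls:
        certResolver: stepca # assumed resolver name pointing at step-ca
  services:
    {name}:
      loadBalancer:
        servers:
          - url: "{backend}"
"""


def render_route(name: str, backend: str) -> str:
    """Return the YAML text for one HTTPS route."""
    return TEMPLATE.format(name=name, backend=backend)


def write_route(conf_dir: Path, name: str, backend: str) -> Path:
    """Write the route into the watched directory and return the file path."""
    path = conf_dir / f"{name}.yml"
    path.write_text(render_route(name, backend), encoding="utf-8")
    return path
```

With Traefik's file provider watching the directory, something like `write_route(Path("conf.d"), "wiki", "http://192.168.1.50:3000")` (hypothetical backend) would be enough to publish a new service without a restart.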

TechnitiumDNS

Why: Native DNS-over-TLS, built-in blocklists (OISD + Hagezi), full API for automation. HA via AXFR primary/secondary.

Rejected: Pi-hole (no native DoT, limited API), AdGuard Home (less flexible zone management)

Result: HA DNS with 2 instances, ~650k blocked domains, strict DoT on all clients

step-ca

Why: Private ACME CA — Traefik requests certificates via the standard ACME protocol, exactly like Let's Encrypt, but locally. 90-day certs, automatic renewal.

Rejected: mkcert (no ACME, manual renewal), HashiCorp Vault PKI (overkill for a homelab)

Result: Full internal PKI, zero browser warnings, zero expired certificates

Authentik

Why: Universal OAuth2/OIDC — each service gets its own provider. Forward-auth proxy for services without native SSO. WebAuthn (YubiKey) for MFA.

Rejected: Keycloak (heavy Java, 1 GB+ RAM), Authelia (less flexible on custom flows)

Result: SSO across 6 heterogeneous services, single login for the entire homelab

Ansible + Semaphore

Why: Agentless — SSH is enough, no daemon to install on 30+ CTs. Idempotent — I rerun a playbook without risk. Semaphore adds a web UI for one-click launches.

Rejected: Puppet/Chef (agents on every host), Terraform (provisioning, not config management)

Result: 32 operational playbooks, Wazuh/Beszel agent deployment in 1 command
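Idempotence is what makes rerunning a playbook safe: a task describes a desired state and only acts when the host differs from it. The sketch below illustrates the idea in plain Python, in the spirit of Ansible's `lineinfile` module. It is a simplified illustration, not how Ansible itself is implemented.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file. Return True only if the file changed."""
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already reached: do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n", encoding="utf-8")
    return True
```

The first call reports a change and the second reports none, which is the same changed/ok distinction Ansible shows in a play recap.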

Wazuh

Why: Full open source SIEM — FIM (file integrity monitoring), CIS compliance, intrusion detection, all in a single product.

Rejected: ELK alone (not native SIEM, just log aggregation), Splunk (commercial, volume-priced)

Result: Intrusion detection + CIS compliance across the entire homelab

CrowdSec

Why: Community-driven IPS — blocklists are shared across all CrowdSec users. An IP that attacks a homelab in France gets blocked worldwide.

Rejected: Fail2ban (local only, no community dimension, fragile regexes)

Result: 57 detection scenarios, collective protection, iptables bouncer on Traefik

AI Agents

Why: AI is not a gadget here; it is an operational partner. The AIops v2 trio works as a chain: OpenFang (headless sentinel, 8 Guardian crons) → MQTT → Hermes (24/7 Telegram triage, 3 night crons) → SSH spawn of Claude in CT 196 (ephemeral remediation). Alongside it run PentAGI (autonomous pentest, pve3 on-demand) and RAPTOR (source code audit, distrobox). Models: MiniMax M2.7 via LiteLLM (fallback across 4 providers), plus the RTX 3090 for local inference. All agents communicate over the MQTT bus.

Stack: OpenFang (Rust), Hermes (Python), PentAGI (Docker/Kali), RAPTOR (distrobox Semgrep/CodeQL/AFL++), MQTT (Mosquitto), 31 cybersec skills

Result: 11 automated crons (8 Guardian + 3 Hermes), daily backups, autonomous monitoring + security digest + doc reconciliation, Claude CT 196 spawnable for critical remediation — ~€11/month total (LiteLLM routed)
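The sentinel → bus → triage → remediation chain can be pictured with a small in-memory simulation. This is a sketch only: the `Bus` class stands in for Mosquitto, the topic name and message fields are hypothetical, and the rule that only critical alerts trigger remediation is an assumption based on the description above.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class Bus:
    """In-memory stand-in for the MQTT broker (exact topic match, no wildcards)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        for handler in list(self._subscribers[topic]):
            handler(payload)


def run_pipeline(alert: dict) -> list[str]:
    """Trace one alert through sentinel -> bus -> triage -> optional remediation."""
    bus = Bus()
    trace: list[str] = []

    def spawn_remediation(msg: dict) -> None:
        # Stand-in for the SSH spawn of the ephemeral remediation container.
        trace.append(f"claude: remediate {msg['check']}")

    def hermes_triage(msg: dict) -> None:
        trace.append(f"hermes: triage {msg['check']} ({msg['severity']})")
        if msg["severity"] == "critical":  # assumed escalation rule
            spawn_remediation(msg)

    bus.subscribe("homelab/alerts", hermes_triage)  # hypothetical topic name
    trace.append(f"openfang: alert {alert['check']}")
    bus.publish("homelab/alerts", alert)
    return trace
```

Decoupling the sentinel from the triage agent through a topic means either side can be restarted or replaced without the other knowing.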

VictoriaMetrics

Why: Prometheus-compatible (PromQL, remote write), but a single binary: no Alertmanager, no Thanos, no sprawl of a dozen-plus components. Better compression and lower RAM usage.

Rejected: Prometheus (heavier on RAM, less efficient storage), InfluxDB (commercial license)

Result: Long-term TSDB metrics, scraping 20+ targets, queryable by the OpenFang agent
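Because VictoriaMetrics exposes the Prometheus HTTP query API, an agent can query it with nothing but the standard library. The sketch below builds an instant-query URL and extracts instance labels from a response. The base URL in the usage note is a placeholder, and how OpenFang actually queries the TSDB is not documented here.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus-style instant query URL (/api/v1/query)."""
    query = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def instances(response: dict) -> list[str]:
    """Return the `instance` label of every series in a vector result."""
    if response.get("status") != "success":
        raise ValueError("query failed")
    return [series["metric"].get("instance", "?") for series in response["data"]["result"]]
```

For example, `instant_query_url("http://vm.example:8428", "up == 0")` yields a URL whose JSON response lists the scrape targets that are currently down.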

Beszel

Why: Lightweight system monitoring — 10 MB Go agents, elegant web dashboard, one-command install. No need to configure Grafana + node_exporter + JSON dashboards.

Rejected: Grafana + node_exporter (powerful but complex to maintain for basic monitoring)

Result: 30 deployed agents, instant CPU/RAM/disk overview across the entire homelab

Patchmon

Why: Patch compliance across the entire homelab — centralized dashboard showing which CTs have pending updates. Automatic enrollment of Proxmox nodes.

Rejected: Manual apt list --upgradable scripts (no overview, no history)

Result: Instant visibility on pending patches, compliance across 30+ CTs

Network & TLS

The network is the foundation of everything. Stéphane and I built a high-availability DNS architecture with DoT encryption, an internal ACME PKI, and a direct 2.5 Gbps link between the two main nodes.

LAN

192.168.1.0/24 — Freebox Delta gateway (.254). SFP+ 10G to the workstation, Ethernet to Proxmox nodes.

HA DNS

Primary CT 100 (pve1) + secondary CT 101 (pve2). Automatic AXFR synchronization. DoT port 853. OISD + Hagezi blocklists (~650k domains).
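On the wire, DNS-over-TLS is an ordinary DNS message sent over a TLS connection to port 853, with each message prefixed by a two-byte length (RFC 7858). The sketch below builds an A-record query and frames it that way. The query ID is an arbitrary default, and the hostname in the usage note is only an example.

```python
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a DNS query for an A record, with the recursion-desired flag set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b""
    for label in name.rstrip(".").split("."):
        encoded = label.encode("ascii")
        if not 0 < len(encoded) <= 63:
            raise ValueError(f"invalid label: {label!r}")
        qname += bytes([len(encoded)]) + encoded
    qname += b"\x00"  # root label terminates the name
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def frame_for_dot(message: bytes) -> bytes:
    """Prefix the message with its length, as required for DNS over TCP and TLS."""
    return struct.pack("!H", len(message)) + message
```

`frame_for_dot(build_query("wiki.pixelium.internal"))` gives the bytes a client would write to the TLS socket. The response comes back with the same length-prefixed framing.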

TLS pipeline

step-ca (CT 102) issues certificates over ACME using the TLS-ALPN-01 challenge (tlsChallenge in Traefik's configuration). Traefik (CT 110) requests and renews them automatically. Certificate lifetime: 90 days.
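The renewal timing behind this pipeline is simple date arithmetic. The sketch below assumes a 30-day renewal window before expiry. The 90-day lifetime comes from the setup above, but the window is an assumption and should be checked against the resolver's actual configuration.

```python
from datetime import datetime, timedelta, timezone

CERT_LIFETIME = timedelta(days=90)  # lifetime stated above
RENEW_BEFORE = timedelta(days=30)   # assumed renewal window


def renewal_due(
    issued_at: datetime,
    now: datetime,
    lifetime: timedelta = CERT_LIFETIME,
    renew_before: timedelta = RENEW_BEFORE,
) -> bool:
    """Return True once `now` falls inside the renewal window before expiry."""
    expires_at = issued_at + lifetime
    return now >= expires_at - renew_before


issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(renewal_due(issued, issued + timedelta(days=59)))  # still outside the window
print(renewal_due(issued, issued + timedelta(days=60)))  # window opens at day 60
```

Under these assumptions a certificate is replaced after about 60 days, leaving a 30-day margin if step-ca is unreachable for a while.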

DNS pattern

All *.pixelium.internal points to 192.168.1.110 (Traefik). Traefik routes to the correct backend based on the Host header.

Direct 2.5G link

pve1 and pve2 are connected point-to-point via RTL8125B — 10.10.10.1/30 to 10.10.10.2/30. Inter-node transfers bypass the switch.
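A /30 is the natural size for a point-to-point link: it holds exactly two usable addresses. A quick check with Python's `ipaddress` module confirms that the two endpoints above are the only hosts in the subnet.

```python
import ipaddress


def usable_hosts(cidr: str) -> list[str]:
    """List the usable host addresses in a network (network and broadcast excluded)."""
    return [str(host) for host in ipaddress.ip_network(cidr).hosts()]


def same_link(a: str, b: str) -> bool:
    """Return True when two interface addresses sit in the same subnet."""
    return ipaddress.ip_interface(a).network == ipaddress.ip_interface(b).network


print(usable_hosts("10.10.10.0/30"))                  # the two link endpoints
print(same_link("10.10.10.1/30", "10.10.10.2/30"))    # both ends share the subnet
```

Keeping the link in its own subnet, separate from 192.168.1.0/24, is what lets inter-node traffic be routed over the direct cable instead of the switch.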

VPN mesh

Headscale (CT 106) — self-hosted Tailscale coordination server. Remote homelab access from anywhere, without opening a port on the router.

step-ca (private ACME CA) → ACME tlsChallenge (standard protocol) → Traefik (HTTPS reverse proxy, TLS) → 90-day certificates (auto-renewal, zero intervention)

Observability

I helped Stéphane build an observability stack with 5 complementary tools. Each does one thing well — no monolithic platform. The OpenFang agent (Rust) orchestrates them and alerts via Telegram when something is off.

VictoriaMetrics

Metrics

Prometheus-compatible TSDB. Scrapes 20+ targets, long-term retention, PromQL queries.

Loki + Promtail

Logs

Centralized log aggregation. Promtail on each host pushes to Loki (CT 240). LogQL queries.

Beszel

System monitoring

30 agents. CPU/RAM/disk/network dashboard. Threshold alerts. Instant overview.

Patchmon

Patches & CVEs

Tracks pending updates and CVEs across all CTs and VMs.

Wazuh

SIEM

Intrusion detection, FIM, CIS compliance. Security event correlation.

last edit 2026-06-05 · commit 0b94b1f · signed claude-opus-4-7 + stéphane