Incident: OPNsense Router Unresponsive
Date: 2025-12-28 (discovered), 2025-12-20 (started)
Severity: P1
Duration: ~8 days
Status: Resolved (manual reboot)
Summary
The OPNsense router (10.89.97.1) became unresponsive on approximately 2025-12-20. The web UI, SSH, and Unbound DNS stopped responding. WireGuard VPN kept working (it runs at kernel level), which allowed limited remote access. A physical power cycle was required to restore service.
Timeline
| Time (UTC) | Event |
|---|---|
| 2025-12-20 03:01 | Last successful dmesg log rotation |
| 2025-12-20 ~03:00 | Last known good state based on log timestamps |
| 2025-12-20 onwards | ARP flapping detected in dmesg for IP 10.89.97.180 |
| 2025-12-21 - 12-27 | Router unresponsive (web UI, SSH, DNS) |
| 2025-12-28 02:00 | OPNsense backup script fails with "Connection timed out" |
| 2025-12-28 ~17:30 | Physical reboot performed |
| 2025-12-28 19:34 | Router boots successfully, all services restored |
Impact
- Network access: All LAN services requiring DNS or new connections affected
- Remote access: WireGuard VPN continued working (kernel module)
- Web UI/SSH: Completely unresponsive
- DNS (Unbound): Down - no internal name resolution
- DHCP: Existing leases worked, new leases likely failed
- Backups: OPNsense config backup failed Dec 28 (router unreachable)
- Data loss: None
Root Cause
Suspected: ARP table exhaustion or related network subsystem hang
Evidence from /var/log/dmesg.today (Dec 20) shows severe ARP flapping:
arp: 10.89.97.180 moved from e6:91:49:76:d9:0a to bc:24:11:5a:ff:d3 on igc1
arp: 10.89.97.180 moved from bc:24:11:5a:ff:d3 to e6:91:49:76:d9:0a on igc1
[repeated 50+ times]
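The flap pattern in the excerpt above can be counted rather than read by eye. The following is a minimal sketch, assuming the kernel messages keep the exact `arp: <ip> moved from <mac> to <mac> on <iface>` shape shown in the log. The function names `summarize_arp_moves` and `conflicting_ips` are illustrative and not part of any OPNsense tooling.

```python
import re
from collections import defaultdict

# Matches kernel messages of the form seen in /var/log/dmesg.today:
#   arp: <ip> moved from <old_mac> to <new_mac> on <interface>
ARP_MOVE = re.compile(
    r"arp: (?P<ip>\d+\.\d+\.\d+\.\d+) moved from "
    r"(?P<old>[0-9a-f:]{17}) to (?P<new>[0-9a-f:]{17}) on (?P<iface>\S+)"
)


def summarize_arp_moves(lines):
    """Return {ip: {"moves": int, "macs": set}} for every 'moved' message."""
    summary = defaultdict(lambda: {"moves": 0, "macs": set()})
    for line in lines:
        match = ARP_MOVE.search(line)
        if match is None:
            continue  # not an ARP move message
        entry = summary[match["ip"]]
        entry["moves"] += 1
        entry["macs"].update((match["old"], match["new"]))
    return dict(summary)


def conflicting_ips(summary, min_moves=2):
    """IPs claimed by more than one MAC that moved at least min_moves times."""
    return sorted(
        ip
        for ip, entry in summary.items()
        if len(entry["macs"]) > 1 and entry["moves"] >= min_moves
    )


# The two sample lines are copied from the log excerpt in this report.
sample = [
    "arp: 10.89.97.180 moved from e6:91:49:76:d9:0a to bc:24:11:5a:ff:d3 on igc1",
    "arp: 10.89.97.180 moved from bc:24:11:5a:ff:d3 to e6:91:49:76:d9:0a on igc1",
]
summary = summarize_arp_moves(sample)
print(conflicting_ips(summary))  # prints ['10.89.97.180']
```

Fed the full log file instead of the two sample lines, the same functions would report the move count per IP, which gives a concrete number to compare against the "50+" estimate.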
Two devices were claiming IP 10.89.97.180:
- bc:24:11:5a:ff:d3 - LXC 180 (surfsense container) - configured MAC
- e6:91:49:76:d9:0a - Unknown device (locally-administered MAC)
LXC 180 was stopped at the time of investigation, but may have been running before the outage. The unknown MAC (e6:91:49:76:d9:0a) does not appear in any Proxmox configuration and remains unidentified.
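The "locally-administered" label on the unknown MAC comes from the second-least-significant bit of the first octet (the U/L bit in IEEE 802 addressing). A short sketch of that check, with `is_locally_administered` as an illustrative helper name:

```python
def is_locally_administered(mac: str) -> bool:
    """True if the U/L bit (0x02 in the first octet) is set."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)


# MAC addresses taken from the ARP log in this report.
print(is_locally_administered("e6:91:49:76:d9:0a"))  # prints True
print(is_locally_administered("bc:24:11:5a:ff:d3"))  # prints False
```

A set bit means the address was assigned by software rather than taken from a vendor-registered block. That is consistent with a virtual interface, a container bridge, or a device using MAC randomization, but it does not identify the source by itself.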
Why WireGuard still worked: WireGuard runs as a kernel module and handles packets directly in the kernel network stack. The hang likely affected userspace daemons (lighttpd, sshd, unbound) while kernel networking continued.
Unconfirmed factors:
- Memory exhaustion from ARP storm
- FreeBSD kernel deadlock
- Disk/filesystem issue (USB storage errors in boot logs)
Resolution
- Physical power cycle: the only available option, since SSH and the web UI were unresponsive
- Verified services restored:
Lessons Learned
- No crash dumps enabled: OPNsense crash reporter was disabled, preventing post-mortem analysis.
- No alerting for router reachability: Would have caught this within minutes instead of days.
- IP conflict went undetected: The ARP flapping indicates an IP conflict that should have been caught earlier.
- Remote power control needed: Being away from home with no IPMI/smart-plug on router meant 8-day outage.
- Backup script didn't alert on failure: OPNsense backup failures logged but no notification sent.
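Because WireGuard kept forwarding traffic while the userspace daemons were down, a reachability check probably needs to probe the services themselves rather than rely on the router answering at the network layer. The sketch below is one possible shape for such a probe. The port numbers in `SERVICES` are assumptions (default HTTPS, SSH, and DNS over TCP), and the function names are illustrative.

```python
import socket

# Assumed service ports on the router; adjust to the actual configuration.
SERVICES = {"web_ui": 443, "ssh": 22, "dns_tcp": 53}


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, unreachable hosts
        return False


def check_services(host, services=SERVICES, probe=tcp_port_open):
    """Probe each named service and return {name: reachable}."""
    return {name: probe(host, port) for name, port in services.items()}


def failed_services(results):
    """Names of services whose probe failed, sorted for stable output."""
    return sorted(name for name, ok in results.items() if not ok)
```

A scheduled job on a host other than the router could call `failed_services(check_services("10.89.97.1"))` and notify when the list is non-empty. One caveat: a TCP connect is completed by the kernel, so a hung daemon with a listening socket may still accept connections for a while. A stricter probe would issue a real HTTPS request or DNS query and check the response. The alert path should also not depend on the router's own DNS.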
Action Items
| Item | Owner | Status |
|---|---|---|
| Enable OPNsense crash dumps (System > Settings > Miscellaneous) | User | In Progress |
| Add uptime/reachability monitoring for OPNsense | - | Pending |
| Identify and remove rogue MAC e6:91:49:76:d9:0a source | - | Pending |
| Consider smart plug for remote router power cycle | - | Pending |
| Add alerting for OPNsense backup failures | - | Pending |
| Remove or reassign IP from stopped LXC 180 (surfsense) | - | Pending |
| Review ARP table limits in OPNsense | - | Pending |
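For the backup alerting item, one low-effort approach is a wrapper that runs the existing script and sends a notification whenever it fails or hangs. This is a sketch only: it assumes the backup script exits non-zero on failure (not confirmed in this report), and `notify` is a hypothetical callable standing in for whatever channel is chosen (email, push notification, webhook).

```python
import subprocess
import sys


def run_with_alert(command, notify, timeout=600):
    """Run command; call notify(message) on failure. Return True on success."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        notify(f"{command[0]} timed out after {timeout}s")
        return False
    except OSError as exc:  # e.g. script missing or not executable
        notify(f"{command[0]} could not be started: {exc}")
        return False
    if result.returncode != 0:
        # Include the last few stderr lines so the alert is actionable.
        tail = " | ".join(result.stderr.strip().splitlines()[-5:])
        notify(f"{command[0]} failed with exit code {result.returncode}: {tail}")
        return False
    return True
```

With the script path from the references, a call might look like `run_with_alert(["/root/scripts/backup-opnsense.sh"], notify=send_alert)`, where `send_alert` is whatever notification function is available. The timeout matters here because the Dec 28 failure was a connection timeout, and a hung backup should alert just like a failed one.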
References
- OPNsense Crash Reporter Docs
- Disaster Recovery Procedures
- Backup script: /root/scripts/backup-opnsense.sh
- LXC 180 config: /etc/pve/lxc/180.conf