Incident: OPNsense Router Unresponsive

Date: 2025-12-28 (discovered), 2025-12-20 (started)
Severity: P1
Duration: ~8 days
Status: Resolved (manual reboot)


Summary

The OPNsense router (10.89.97.1) became unresponsive on or around 2025-12-20. The web UI, SSH, and Unbound DNS stopped responding. WireGuard VPN continued to function (kernel-level), allowing limited remote access. A physical power cycle was required to restore service.


Timeline

Time (UTC)           | Event
2025-12-20 03:01     | Last successful dmesg log rotation
2025-12-20 ~03:00    | Last known good state based on log timestamps
2025-12-20 onwards   | ARP flapping detected in dmesg for IP 10.89.97.180
2025-12-21 to 12-27  | Router unresponsive (web UI, SSH, DNS)
2025-12-28 02:00     | OPNsense backup script fails with "Connection timed out"
2025-12-28 ~17:30    | Physical reboot performed
2025-12-28 19:34     | Router boots successfully, all services restored

Impact

  • Network access: All LAN services requiring DNS or new connections affected
  • Remote access: WireGuard VPN continued working (kernel module)
  • Web UI/SSH: Completely unresponsive
  • DNS (Unbound): Down - no internal name resolution
  • DHCP: Existing leases worked, new leases likely failed
  • Backups: OPNsense config backup failed Dec 28 (router unreachable)
  • Data loss: None

Root Cause

Suspected: ARP table exhaustion or related network subsystem hang

Evidence from /var/log/dmesg.today (Dec 20) shows severe ARP flapping:

arp: 10.89.97.180 moved from e6:91:49:76:d9:0a to bc:24:11:5a:ff:d3 on igc1
arp: 10.89.97.180 moved from bc:24:11:5a:ff:d3 to e6:91:49:76:d9:0a on igc1
[repeated 50+ times]
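
Counting these messages per IP is one way to quantify the flapping. The following is a minimal sketch, assuming the log lines follow exactly the format quoted above; the function names `count_arp_moves` and `macs_per_ip` are illustrative and not part of any OPNsense tooling.

```python
import re
from collections import Counter, defaultdict

# Assumed message format, taken from the dmesg excerpt above.
ARP_MOVED = re.compile(
    r"arp: (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) moved from "
    r"(?P<old>[0-9a-f]{2}(?::[0-9a-f]{2}){5}) to "
    r"(?P<new>[0-9a-f]{2}(?::[0-9a-f]{2}){5}) on (?P<iface>\S+)"
)


def count_arp_moves(lines):
    """Count 'moved from' messages per IP address."""
    moves = Counter()
    for line in lines:
        match = ARP_MOVED.search(line)
        if match:
            moves[match.group("ip")] += 1
    return moves


def macs_per_ip(lines):
    """Collect every MAC address seen claiming each IP."""
    macs = defaultdict(set)
    for line in lines:
        match = ARP_MOVED.search(line)
        if match:
            macs[match.group("ip")].update(
                (match.group("old"), match.group("new"))
            )
    return dict(macs)


# The two lines quoted in this report, used as sample input.
sample = [
    "arp: 10.89.97.180 moved from e6:91:49:76:d9:0a to bc:24:11:5a:ff:d3 on igc1",
    "arp: 10.89.97.180 moved from bc:24:11:5a:ff:d3 to e6:91:49:76:d9:0a on igc1",
]
for ip, n in count_arp_moves(sample).items():
    print(ip, n, sorted(macs_per_ip(sample)[ip]))
```

Both functions accept any iterable of lines, so an open file handle for the saved dmesg log can be passed in place of `sample`. An IP that maps to more than one MAC is a candidate conflict worth checking by hand.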

Two devices were claiming IP 10.89.97.180:

  • bc:24:11:5a:ff:d3 - LXC 180 (surfsense container), configured MAC
  • e6:91:49:76:d9:0a - Unknown device (locally-administered MAC)

LXC 180 was stopped at the time of investigation but may have been running before the outage. The unknown MAC (e6:91:49:76:d9:0a) does not appear in any Proxmox configuration and remains unidentified.
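
The "locally-administered" classification comes from the second-least-significant bit of the first octet of the MAC address. A short sketch of that check follows; the helper name `is_locally_administered` is made up for illustration. A set bit suggests a software-assigned address (for example a virtual interface or a randomized client MAC), which may narrow the search but does not identify the device.

```python
def is_locally_administered(mac: str) -> bool:
    """Return True if the MAC has the locally-administered (U/L) bit set.

    The U/L bit is bit 1 (value 0x02) of the first octet.
    """
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)


# The two MAC addresses from this incident.
for mac in ("bc:24:11:5a:ff:d3", "e6:91:49:76:d9:0a"):
    print(mac, is_locally_administered(mac))
```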

Why WireGuard still worked: WireGuard runs as a kernel module and handles packets directly in the kernel network stack. The hang likely affected userspace daemons (lighttpd, sshd, unbound) while kernel networking continued.

Unconfirmed factors:

  • Memory exhaustion from ARP storm
  • FreeBSD kernel deadlock
  • Disk/filesystem issue (USB storage errors in boot logs)


Resolution

  1. Physical power cycle - Only available option as SSH/web UI unresponsive
  2. Verified services restored:
    ssh root@10.89.97.1 "uptime; tail -20 /var/log/boot.log"
    # Confirmed clean boot at 19:34 UTC
    

Lessons Learned

  1. No crash dumps enabled: OPNsense crash reporter was disabled, preventing post-mortem analysis.

  2. No alerting for router reachability: Would have caught this within minutes instead of days.

  3. IP conflict went undetected: The ARP flapping indicates an IP conflict that should have been caught earlier.

  4. Remote power control needed: Being away from home with no IPMI or smart plug on the router meant an 8-day outage.

  5. Backup script didn't alert on failure: OPNsense backup failures were logged, but no notification was sent.
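
For the reachability gap in lesson 2, a service-level probe is probably more useful than a ping: kernel networking apparently kept working during this outage, so an ICMP check alone might have reported the router as healthy. Below is a minimal sketch, assuming the default ports (SSH on 22, web UI on 443, Unbound on 53 over TCP); the port mapping and function names are assumptions, and how an alert is delivered is left open.

```python
import socket

# Assumed default ports; adjust to the actual OPNsense configuration.
SERVICES = {"ssh": 22, "web-ui": 443, "dns": 53}


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def failed_services(host, services=SERVICES, timeout=3.0):
    """Return the sorted names of services whose port did not accept a connection."""
    return sorted(
        name
        for name, port in services.items()
        if not tcp_port_open(host, port, timeout)
    )
```

Calling `failed_services("10.89.97.1")` from a scheduled job on another host and sending a notification whenever the result is non-empty would cover the failure mode seen here. The probe should run from a machine that does not depend on the router for its own alert delivery.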


Action Items

Item                                                             | Owner | Status
Enable OPNsense crash dumps (System > Settings > Miscellaneous)  | User  | In Progress
Add uptime/reachability monitoring for OPNsense                  | -     | Pending
Identify and remove rogue MAC e6:91:49:76:d9:0a source           | -     | Pending
Consider smart plug for remote router power cycle                | -     | Pending
Add alerting for OPNsense backup failures                        | -     | Pending
Remove or reassign IP from stopped LXC 180 (surfsense)           | -     | Pending
Review ARP table limits in OPNsense                              | -     | Pending
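
For the backup alerting item, one option is to wrap the existing backup command so that a non-zero exit, a timeout, or a failure to start triggers a notification. This is only a sketch: the backup command itself is not shown in this report, so `cmd` is a placeholder, and `notify` stands for whatever delivery mechanism is eventually chosen.

```python
import subprocess


def run_and_report(cmd, notify, timeout=300):
    """Run cmd; on failure or timeout call notify(message) and return False."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        notify(f"backup timed out after {timeout}s: {cmd}")
        return False
    except OSError as exc:
        # For example, the command does not exist or is not executable.
        notify(f"backup could not start: {exc}")
        return False
    if result.returncode != 0:
        notify(
            f"backup failed (exit {result.returncode}): {result.stderr.strip()}"
        )
        return False
    return True
```

With `notify=print` the wrapper only logs, which matches the current behavior; swapping in a function that sends a message is what would close the gap.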

References