rXg Knowledge Base

RADIUS Authentication Storm — Cluster Controller CPU Saturation After Power Outage

April 08, 2026

Overview

After a power outage, all wireless clients and infrastructure devices re-authenticate simultaneously, creating a RADIUS authentication storm on the cluster controller. FreeRADIUS threads saturate CPU and flood PostgreSQL with concurrent queries, producing a feedback loop that prevents new authentications and blocks customer traffic. Gateway nodes remain healthy; the bottleneck is entirely on the controller. Restarting rxgd alone does not resolve the problem because radiusd runs as a separate process.

Problem / Question

  • No customer traffic passing after power is restored
  • All CPU cores pegged at 100% on the cluster controller
  • Restarting rxgd has no effect
  • Gateway nodes appear healthy with low CPU
  • Health notices: "principal backend daemon was hung", "cycle taking too long"
  • Session count may exceed license limit due to stale pre-outage sessions

Root Cause

When power is restored to a site, every AP, switch, and client device attempts RADIUS re-authentication simultaneously. On a cluster deployment, the controller (cc) handles all RADIUS processing. FreeRADIUS spawns threads per-request, each opening a PostgreSQL connection for account/realm/VLAN lookups via the rlm_perl hook (/space/rxg/rxgd/bin/freeradius_hook). Simultaneously, rxgd's own task cycle generates concurrent queries against the same tables (commonly triggers, cluster_nodes, health_notices).

The cascade:

  1. Thousands of devices authenticate concurrently
  2. FreeRADIUS threads consume 500-900%+ CPU
  3. Each thread queries PostgreSQL — dozens of active connections pile up
  4. rxgd task queries (e.g., SELECT DISTINCT ON ("name") ... FROM triggers) compete for the same DB resources
  5. PostgreSQL becomes the bottleneck — query latency increases
  6. Longer query latency means RADIUS threads hold resources longer, spawning more threads
  7. rxgd cycle slows to the point where PF rules (especially BiNAT anchors) never fully regenerate
  8. Load average exceeds core count (e.g., 68 on a 48-core system)

Why restarting rxgd doesn't help: radiusd is forked by rxgd at startup but runs as an independent process. service rxgd onestop sends SIGTERM to rxgd but does NOT kill radiusd. The new rxgd instance starts into an already-overloaded environment with radiusd still consuming 900%+ CPU.
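A quick way to see this in practice is to check for a surviving radiusd process after an rxgd restart. This is a sketch using the standard pgrep utility; the message text is illustrative, not rXg output.

```shell
# If radiusd survived "service rxgd onestop", it must be stopped
# explicitly before restarting rxgd will do any good.
if pgrep -x radiusd > /dev/null; then
  echo "radiusd still running -- stop it with: service radiusd stop"
else
  echo "no radiusd process found"
fi
```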

This differs from the retransmit storm pattern (see "RADIUS Throughput Collapse" KB article) — that scenario is driven by memory exhaustion and swap thrashing. This scenario is pure CPU/DB saturation with no swap involvement.
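Swap usage is the fastest way to tell the two patterns apart. A sketch assuming the FreeBSD swapinfo utility, whose third column is the amount of swap in use:

```shell
# Zero swap in use with pegged CPU = this article's pattern.
# Significant swap in use = see "RADIUS Throughput Collapse" instead.
swapused=$(swapinfo -m | awk 'END { print $3 }')
echo "swap in use: ${swapused} MB"
```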

Solution / Answer

Step 1: Confirm the Pattern

SSH to the cluster controller and verify the symptoms.

# Check load average and uptime
uptime

# Identify top CPU consumers — radiusd will be at 500-900%+
top -b -d 1 -o cpu | head -15

# Check PostgreSQL query pileup
psql -U rails -d config -c "SELECT left(query, 100), count(*) FROM pg_stat_activity WHERE state = 'active' GROUP BY 1 ORDER BY 2 DESC LIMIT 10"

# Verify gateway nodes are healthy (low load)
cd /space/rxg/console && /usr/local/bin/bundle exec rails runner 'ClusterNode.all.each { |n| puts "#{n.name} | mode=#{n.node_mode} | hb=#{n.heartbeat_at}" }'

Confirms this pattern if:

  • radiusd is the top CPU consumer at 500%+
  • Multiple identical PostgreSQL queries are piled up
  • Gateway nodes have recent heartbeats and low load
  • Load average exceeds CPU core count

Step 2: Stop RADIUS to Break the Storm

Stop radiusd FIRST — this is the critical step that breaks the CPU feedback loop.

service radiusd stop

Step 3: Wait for PostgreSQL to Drain

Allow 60 seconds for active queries to complete and connections to close.

# Verify DB load is dropping
psql -U rails -d config -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active'"

Active count should drop to under 5 within 60 seconds.
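Rather than re-running the query by hand, the drain can be polled in a loop. A sketch reusing the psql invocation above; the 12 × 5 s bound and the threshold of 5 mirror this article's guidance, not product defaults.

```shell
# Poll for up to 60 seconds until active queries fall below 5.
for i in $(seq 1 12); do
  active=$(psql -U rails -d config -At \
    -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active'")
  echo "active queries: ${active}"
  if [ "${active}" -lt 5 ]; then
    echo "drained -- safe to proceed to Step 4"
    break
  fi
  sleep 5
done
```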

Step 4: Restart rxgd

service rxgd onestop && sleep 5 && service rxgd onestart

rxgd will start a fresh cycle, regenerate all PF rules (including BiNAT anchors), and spawn a new radiusd process. Devices will re-authenticate in a staggered fashion rather than all at once.
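To confirm rxgd actually forked a fresh radiusd rather than leaving the old instance behind, compare PIDs before and after the restart. A sketch; record the pre-restart PID yourself and substitute it for the placeholder 0.

```shell
# old_pid: the radiusd PID noted before Step 4 (0 if not recorded).
old_pid=0
new_pid=$(pgrep -x radiusd | head -1)
if [ -n "${new_pid}" ] && [ "${new_pid}" != "${old_pid}" ]; then
  echo "fresh radiusd running as PID ${new_pid}"
else
  echo "radiusd missing or unchanged -- investigate"
fi
```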

Step 5: Monitor Recovery

# Load should trend down within 2-3 minutes
uptime

# Verify PF rules are fully loaded (check BiNAT specifically)
pfctl -a bBNT -s nat | wc -l

# Verify RADIUS is running
service radiusd status

# Check memory — should have no swap usage
vmstat 1 3
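The recovery checks above can be wrapped in a simple watch loop. A sketch using FreeBSD sysctl names (hw.ncpu for core count, vm.loadavg for the 1-minute load average); the 30-second interval is an arbitrary choice.

```shell
# Print load vs. core count and BiNAT rule count every 30 seconds.
cores=$(sysctl -n hw.ncpu)
for i in $(seq 1 10); do
  load=$(sysctl -n vm.loadavg | awk '{ print $2 }')
  binat=$(pfctl -a bBNT -s nat 2>/dev/null | wc -l)
  echo "load=${load} cores=${cores} binat_rules=${binat}"
  sleep 30
done
```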

CLI Verification

# Load average should be below core count within 5 minutes
uptime

# BiNAT anchor should have rules (>0)
pfctl -a bBNT -s nat | wc -l

# Filter and NAT anchors should be populated
pfctl -a fRXG -s rules | wc -l
pfctl -a nLNT -s nat | wc -l

# Online sessions should be accumulating as clients re-auth
cd /space/rxg/console && /usr/local/bin/bundle exec rails runner 'puts "Online sessions: #{LoginSession.where(online: true).count}"'

# Health notices should start self-curing
cd /space/rxg/console && /usr/local/bin/bundle exec rails runner 'HealthNotice.where(cured_at: nil).where("created_at > ?", 1.hour.ago).each { |n| puts "[#{n.severity}] #{n.short_message}" }'
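The anchor and service checks above can be rolled into a single pass/fail gate. A sketch; the anchor names (bBNT, fRXG, nLNT) are the ones used throughout this article, and the FAIL/PASS messages are illustrative.

```shell
# Exit summary: PASS only if all PF anchors are populated and radiusd runs.
ok=1
[ "$(pfctl -a bBNT -s nat 2>/dev/null | wc -l)" -gt 0 ] || { echo "FAIL: BiNAT anchor empty"; ok=0; }
[ "$(pfctl -a fRXG -s rules 2>/dev/null | wc -l)" -gt 0 ] || { echo "FAIL: filter anchor empty"; ok=0; }
[ "$(pfctl -a nLNT -s nat 2>/dev/null | wc -l)" -gt 0 ] || { echo "FAIL: NAT anchor empty"; ok=0; }
service radiusd status > /dev/null 2>&1 || { echo "FAIL: radiusd not running"; ok=0; }
[ "${ok}" -eq 1 ] && echo "PASS: recovery checks clean"
```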
