ZeroSSL Certificate Renewal Failure -- Stale DNS and MPIC Validation
April 08, 2026
Overview
ZeroSSL certificate renewal can fail on data plane (DP) nodes when stale DNS A records are present. Since September 2025, ZeroSSL requires Multi-Perspective Issuance Corroboration (MPIC), meaning 2-6 validation servers from different geographic locations must all successfully reach the node on port 80. Duplicate DNS A records (one correct, one stale) cause DNS round-robin to send ~50% of validators to a dead IP, resulting in 100% cert issuance failure.
Problem / Question
- Certificate renewal succeeds on some nodes (e.g., CC) but times out on others (e.g., DP2)
- Health notice: "All authorizations were not finalized by the CA"
- Certbot reaches the challenge phase and the authenticator hook runs successfully, but validation times out after 1800 seconds (Delayed::Worker.max_run_time)
- EAB credentials obtained successfully from ZeroSSL API
- ACL disabled during challenge with no effect
Root Cause
Stale DNS A Records
Nodes with duplicate A records -- one correct WAN IP and one stale IP that routes nowhere -- fail MPIC validation 100% of the time. DNS round-robin distributes MPIC validators across both IPs. Validators that reach the stale IP time out, and since MPIC requires ALL validators to succeed, the certificate is never issued.
Example:
| Node | DNS A Records | Status |
|------|--------------|--------|
| cc01.example.com | 69.80.164.5 (correct, single) | Renews OK |
| dp02-1.example.com | 69.80.164.39 (correct) + 69.80.164.71 (stale) | Fails |
The CC node has a single correct A record and renews without issue. DP nodes with duplicate records fail every time.
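The condition shown in the table can be checked mechanically. The following is a minimal sketch, assuming `dig` is installed on the machine running the check; the function names `count_a_records` and `check_hosts` are hypothetical helpers introduced here for illustration and are not part of the product.

```bash
#!/usr/bin/env bash
# Count the IPv4 addresses in `dig +short A <host>` output read from stdin.
# CNAME targets and blank lines are ignored. grep exits non-zero when it
# finds no match, so `|| true` keeps the function usable under `set -e`.
count_a_records() {
  grep -Ec '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' || true
}

# Print one status line per host name given as an argument.
check_hosts() {
  local host count
  for host in "$@"; do
    count=$(dig +short A "$host" | count_a_records)
    if [ "$count" -gt 1 ]; then
      echo "$host: $count A records (needs cleanup)"
    else
      echo "$host: $count A record(s)"
    fi
  done
}
```

After sourcing the file, a call such as `check_hosts cc01.example.com dp02-1.example.com` prints one line per node. The result reflects only the resolver used by the local machine, so a cached answer may differ from what the MPIC validators see.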
Separate Issue: zerossl-bot Path Bug (16.246 and earlier)
On versions before 16.381, the zerossl-bot executable path was resolved via `which zerossl-bot`, which failed when `PATH` did not include `$RXGD_DIR/bin`. This produced the error "not supported on this platform." The bug was fixed in 16.381 (MR 4440 / Issue #3512) by hardcoding the path. On earlier versions, a fleet manager patch can be applied as a temporary fix.
Resolution
Fix stale DNS records
Identify stale records:
```bash
nslookup dp02-1.example.com
nslookup dp02-2.example.com
nslookup cc01.example.com
```
Any node with multiple A records needs cleanup.
Remove the stale A records from the DNS zone. Only the correct WAN IP should remain.
Wait for DNS TTL to expire:
```bash
dig +nocmd dp02-1.example.com +noall +answer
```
Verify DNS cleanup:
```bash
nslookup dp02-1.example.com
# Should return only one IP
```
Test global reachability before retrying:
```bash
curl -m 10 -s -o /dev/null -w "HTTP %{http_code} from %{remote_ip}\n" http://dp02-1.example.com/
# Or use check-host.net API for multi-perspective test
curl -s -H "Accept: application/json" "https://check-host.net/check-http?host=http://dp02-1.example.com/"
```
Retry cert renewal:
```bash
cd /space/rxg/console && bundle exec rails runner 'SslKeyChain.find(ID).request_certificate!'
```
Clean up stuck delayed jobs:
```bash
cd /space/rxg/console && bundle exec rails runner 'DelayedJob.where("handler LIKE ?", "%request_certificate%").where.not(last_error: nil).destroy_all'
```
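For the TTL wait in the steps above, the TTL column of the `dig` answer can be read programmatically instead of by eye. This is a rough sketch, assuming the standard `name TTL class type rdata` answer layout; `max_a_ttl` and `remaining_ttl` are hypothetical helper names. When the query goes through a caching resolver, the value is the time left in that resolver's cache, and other resolvers may hold the old answer for longer or shorter.

```bash
#!/usr/bin/env bash
# Read `dig +noall +answer` output on stdin and print the largest TTL
# (in seconds) among the A records. Prints 0 when no A record is present.
max_a_ttl() {
  awk '$4 == "A" && ($2 + 0) > max { max = $2 + 0 } END { print max + 0 }'
}

# Hypothetical wrapper: query one host and report the largest A-record TTL.
remaining_ttl() {
  dig +nocmd "$1" +noall +answer | max_a_ttl
}
```

For example, `remaining_ttl dp02-1.example.com` prints a number of seconds. Waiting at least that long before retrying the renewal is a conservative choice, not a guarantee that every validator's resolver has dropped the stale record.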
Diagnostic commands
```bash
# Check current SSL cert status
cd /space/rxg/console && bundle exec rails runner 'SslKeyChain.all.each { |skc| puts "#{skc.id}: #{skc.name} - expires: #{skc.not_after} - active: #{skc.active}" }'
# Check for cert-related health notices
cd /space/rxg/console && bundle exec rails runner 'HealthNotice.where(cured_at: nil).where("message LIKE ?", "%cert%").each { |h| puts h.message }'
# Check for stuck cert renewal jobs
cd /space/rxg/console && bundle exec rails runner 'DelayedJob.where("handler LIKE ?", "%request_certificate%").each { |j| puts "ID: #{j.id}, attempts: #{j.attempts}, last_error: #{j.last_error&.truncate(200)}" }'
# Test WAN reachability between cluster nodes
curl -m 10 -s -o /dev/null -w "HTTP %{http_code} from %{remote_ip}\n" http://NODE_FQDN/
```
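A single `curl` against the FQDN tests only whichever address the client happens to pick, so a stale record can go unnoticed. One way to probe every A record individually is curl's `--resolve` option, which pins a host and port to a chosen address. The sketch below assumes `dig` and `curl` are available; `resolve_arg` and `probe_each_ip` are hypothetical helper names, and the check still runs from a single vantage point rather than from multiple perspectives.

```bash
#!/usr/bin/env bash
# Build the value for curl's --resolve option (format: host:port:address),
# pinning HOST on port 80 to one specific IP.
resolve_arg() {
  printf '%s:80:%s' "$1" "$2"
}

# Request http://HOST/ once per A record so a dead address cannot hide
# behind a live one. Any HTTP response counts as reachable, because curl
# without -f exits 0 regardless of the HTTP status code.
probe_each_ip() {
  local host=$1 ip
  for ip in $(dig +short A "$host" | grep -E '^[0-9.]+$'); do
    if curl -m 10 -s -o /dev/null --resolve "$(resolve_arg "$host" "$ip")" "http://$host/"; then
      echo "$ip reachable"
    else
      echo "$ip UNREACHABLE (candidate stale record)"
    fi
  done
}
```

Calling `probe_each_ip dp02-1.example.com` prints one line per address. An address reported as unreachable is a candidate for removal from the DNS zone, subject to confirming which IP is the node's correct WAN address.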