Why Cron Jobs Fail Silently: Heartbeat Monitoring for Scheduled Tasks
Silent failures in cron jobs don’t throw errors; the job just stops doing the work. Learn how Heartbeat Monitoring (Healthchecks) catches overdue jobs with a Dead Man’s Switch approach.
What you’ll learn
- Why scheduled tasks fail without obvious errors
- Why logs aren’t monitoring (Dead Man’s Switch)
- How to add Heartbeat Monitoring / Healthchecks with a one-liner
- How to use Overdue, Grace Period, and Payload Inspection to catch real-world failures
Why cron jobs fail silently (even in “stable” systems)
Cron jobs are supposed to be boring. But the most expensive failures are Silent Failures: nothing crashes, nobody gets paged, and you only notice when data is missing or customers complain. Typical causes:
- A host reboots and cron doesn’t restart correctly
- Disk fills up and your script exits early (or produces a broken output)
- Transient DNS/network issues cause early returns
- Credentials expire (S3 keys, DB passwords, OAuth tokens)
- The job runs, but produces the wrong result (e.g. files_processed = 0)
Logs are not monitoring (Dead Man’s Switch)
Logs are an internal signal. They often fail together with the system that’s supposed to produce them. A heartbeat monitor flips the dependency:
- If the job runs, it pings.
- If it doesn’t, the ping never arrives.
That absence is the alert. That’s a Dead Man’s Switch.
Heartbeat Monitoring / Healthchecks: the reliable baseline
Heartbeat Monitoring (Healthchecks) means your job sends a success ping when it finishes. If the ping doesn’t arrive in time, the monitor becomes Overdue.
- Interval: expected run cadence (e.g. 24h)
- Grace Period: buffer for retries, queue delays, cold starts
- Overdue: no ping has arrived for longer than interval + Grace Period
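The relationship between these three terms can be sketched as a small check. This is only an illustration of the arithmetic, assuming all times are Unix timestamps or durations in seconds; the function name `is_overdue` is hypothetical and not part of any API.

```shell
# Hypothetical helper: succeeds (exit 0) when a monitor is overdue.
# Arguments, all in seconds:
#   $1 last_ping  Unix timestamp of the last success ping
#   $2 interval   expected cadence, e.g. 86400 for 24h
#   $3 grace      grace period, e.g. 1800 for 30 minutes
#   $4 now        current Unix timestamp
is_overdue() {
  deadline=$(( $1 + $2 + $3 ))
  [ "$4" -gt "$deadline" ]
}

# Daily job with a 30-minute grace period, last ping at t=0, checked at t=90000.
if is_overdue 0 86400 1800 90000; then
  echo "overdue"   # prints "overdue": 90000 is past the 88200 deadline
else
  echo "ok"
fi
```

A run that pings exactly at the deadline is still on time; only a ping that is later than interval plus grace flips the monitor.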
Full API documentation: /api/heartbeat/.
Quick setup: one-line healthcheck ping
If you can run curl, you can monitor a cron job:
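A minimal sketch of that pattern follows. The ping URL is a placeholder (an assumption, not a real endpoint); substitute the URL shown for your own monitor. The function name `run_and_ping` is likewise only illustrative.

```shell
# Placeholder URL (assumption): replace with your monitor's ping URL.
PING_URL="https://example.com/api/heartbeat/YOUR-TOKEN"

# Run the given command and ping only if it exits 0.
#   -f        treat HTTP error responses as failures
#   -sS       no progress output, but still print errors
#   -m 10     give up after 10 seconds
#   --retry 3 retry transient errors up to 3 times
run_and_ping() {
  "$@" && curl -fsS -m 10 --retry 3 -o /dev/null "$PING_URL"
}

# Usage: run_and_ping /path/to/your-job.sh
# Inline equivalent:
#   /path/to/your-job.sh && curl -fsS -m 10 --retry 3 -o /dev/null "$PING_URL"
```

Because of `&&`, a job that exits non-zero sends nothing, so the monitor eventually goes Overdue.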
Example crontab entry:
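The entry below is a sketch for a job that runs daily at 03:00. Both the script path and the ping URL are placeholders chosen for illustration, not values from the API documentation.

```crontab
# m h dom mon dow  command
0 3 * * * /usr/local/bin/nightly-backup.sh && curl -fsS -m 10 --retry 3 -o /dev/null https://example.com/api/heartbeat/YOUR-TOKEN
```

With this schedule, the matching monitor would use a 24h interval.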
Add failure signaling (so you know why)
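A success-only ping detects missed runs, but only once the monitor goes Overdue, and it can’t tell you why. Add an explicit fail ping when you can, so a crashed run is reported right away.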
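One way to wire this up is a small wrapper around the job. This is a sketch under an assumption: that a failure is reported by appending `/fail` to the ping URL. That suffix, the placeholder URL, and the function names are not confirmed by this article; check /api/heartbeat/ for the actual endpoint.

```shell
# Placeholder URL (assumption): replace with your monitor's ping URL.
PING_URL="https://example.com/api/heartbeat/YOUR-TOKEN"

# Send a ping; $1 is an optional URL suffix such as "/fail" (assumed convention).
send_ping() {
  curl -fsS -m 10 --retry 3 -o /dev/null "$PING_URL$1"
}

# Run the given command, then report success or failure.
# The job's own exit code is preserved so cron still sees it.
run_with_fail_ping() {
  if "$@"; then
    send_ping ""
  else
    job_rc=$?
    # A failed fail-ping must not mask the job's exit code.
    send_ping "/fail" || true
    return "$job_rc"
  fi
}

# Usage: run_with_fail_ping /path/to/your-job.sh
```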
Catch “ran, but wrong” with Payload Inspection
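Some Silent Failures are bad outcomes rather than missed runs: the job finishes and pings, but the result is wrong. Send metrics with the ping and alert on suspicious values, such as files_processed = 0.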
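A sketch of sending metrics with the ping is shown below. It assumes the heartbeat endpoint accepts a JSON body via POST, which this article does not confirm; the field names, the placeholder URL, and the function names are illustrative only.

```shell
# Placeholder URL (assumption): replace with your monitor's ping URL.
PING_URL="https://example.com/api/heartbeat/YOUR-TOKEN"

# Build a small JSON payload from two integer metrics.
#   $1 number of files processed, $2 job duration in seconds
build_payload() {
  printf '{"files_processed": %d, "duration_seconds": %d}' "$1" "$2"
}

# Assumption: the endpoint accepts a JSON body; --data makes curl send a POST.
send_payload_ping() {
  curl -fsS -m 10 --retry 3 -o /dev/null \
    -H "Content-Type: application/json" \
    --data "$(build_payload "$1" "$2")" \
    "$PING_URL"
}

# Usage: send_payload_ping "$files_processed" "$duration_seconds"
```

On the monitoring side, a rule on `files_processed` equal to 0 would then flag a run that finished but did no work.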
Start / success / fail for duration + better alerts
For longer jobs, send a start ping too. This improves Workflow Observability and makes duration regressions visible.
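The three signals can be combined in one wrapper. As before, this is a sketch resting on assumptions: the `/start` and `/fail` URL suffixes, the placeholder URL, and the function names are hypothetical, so confirm the real endpoints in /api/heartbeat/.

```shell
# Placeholder URL (assumption): replace with your monitor's ping URL.
PING_URL="https://example.com/api/heartbeat/YOUR-TOKEN"

# Send a ping with an optional suffix. Ping errors are swallowed so that
# a monitoring outage never blocks or fails the job itself.
send_ping() {
  curl -fsS -m 10 --retry 3 -o /dev/null "$PING_URL$1" || true
}

# Signal start, run the command, then signal success or failure.
# The time between the start ping and the final ping is the job duration.
run_monitored() {
  send_ping "/start"
  if "$@"; then
    send_ping ""
  else
    job_rc=$?
    send_ping "/fail"
    return "$job_rc"
  fi
}

# Usage: run_monitored /path/to/long-running-job.sh
```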
n8n and Make.com: workflow observability beyond logs
No-code workflows have the same failure mode as cron: if triggers stall or a worker hangs, you get Silent Failures. External Heartbeat Monitoring is the clean baseline.
watchflow’s native n8n and Make integrations help you emit heartbeats from critical workflows without building custom webhook glue.
Recommended defaults
- Daily job: interval 24h
- Grace Period: 30–60 minutes
- Send a small payload and alert on suspicious values (Payload Inspection)
Conclusion
Silent failures are unavoidable. Missing detection is optional.
Start with the examples in /api/heartbeat/ and set up your first heartbeat monitor.