Heartbeat Monitoring
Your crons are failing.
Silently.
A heartbeat check is a dead-man's switch for your scheduled jobs. Your job pings a URL when it finishes. If the ping never arrives — process crashed, server rebooted, job hung — SteadyCron wakes you up.
Free tier — no credit card required.
Without a heartbeat check
# /etc/cron.d/backup
0 2 * * * root /usr/bin/backup.sh
# Output: none.
# Exit code: who knows.
# Last successful run: unclear.
# Will anyone notice? No.
With a heartbeat check
# /etc/cron.d/backup
0 2 * * * root /usr/bin/backup.sh && curl -fsS https://ping.steadycron.com/abc123
✓ 02:04 — ping received, job healthy
✗ 03:02 — no ping — alerting #ops-slack
The problem
Fire-and-forget is not a monitoring strategy
Cron has no built-in concept of success or failure. A job runs, produces output no one reads, and exits — and the scheduler considers its job done regardless. By the time you notice something went wrong, the logs have rotated, the window has closed, and the damage is done.
- The nightly backup silently failed. The data is gone.
- The invoice job hasn't run in three days. Finance noticed first.
- A script hung and blocked the next run — for six hours.
- The process was OOM-killed. Logs already rotated.
A typical cron failure timeline
12 hours between failure and discovery.
How it works
Silence is the signal
You don't send us anything by hand: the job itself checks in. If it stops, we know.
Your job pings a URL
After your script runs, it calls a unique ping URL — nothing more than a single HTTP GET. Pass /start before, /success after, and optionally /fail on error.
curl -fsS https://ping.steadycron.com/{id}
We open a grace window
SteadyCron knows your schedule and allows a configurable grace period to absorb normal jitter — midnight maintenance windows, server load spikes, and schedule drift.
grace: 10m # no alert if ping arrives within 10 min
Silence fires an alert
If the ping doesn't arrive before the grace period expires, an alert is sent to your configured channel — Slack, email, Telegram, Discord, or webhook.
→ #ops-slack nightly-backup missed (14 h overdue)
What we detect
Three ways a job can go wrong
Each failure type is distinct, detectable, and alerts independently.
Missed run
The job never started. The process was killed, the server rebooted, or the scheduler skipped it. No ping arrived — not even a /fail.
Detected when: no /start or /success ping arrives within the grace window of the expected schedule.
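The missed-run rule above can be sketched in a few lines of Python. This illustrates the described logic only and is not SteadyCron's implementation; the function name `is_missed` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_missed(expected_at: datetime, grace: timedelta,
              last_ping_at: Optional[datetime], now: datetime) -> bool:
    """Return True when the grace window has closed without a ping for this run.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    if now <= expected_at + grace:
        return False  # still inside the grace window, too early to alert
    # A ping from before the scheduled time belongs to an earlier run.
    return last_ping_at is None or last_ping_at < expected_at
```

A scheduler loop would call this once per check with the current time and raise a "missed" alert the first time it returns True.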
Job failed
The job ran but exited non-zero, or your code caught an exception and explicitly called the /fail endpoint before returning.
...
$ python invoice.py
ConnectionError: DB unreachable
→ /fail ping sent
Alerted: #billing-alerts
Detected when: your code calls the /fail endpoint, or the job exits without a /success ping.
Hung job
The job started — /start was received — but never sent /success. It's still running, or it deadlocked silently. A max-duration threshold catches it.
Detected when: a /start was received but /success or /fail never arrives within the configured max duration.
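As a rough sketch, the hung-job rule compares the time since /start with the configured max duration. The helper below is hypothetical (`is_hung` and its parameter names are assumptions, not part of any SteadyCron API) and only mirrors the rule as stated above.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_hung(started_at: datetime, finished_at: Optional[datetime],
            max_duration: timedelta, now: datetime) -> bool:
    """Return True when /start was seen but no outcome followed within max_duration.

    `finished_at` is the time of the latest /success or /fail ping, if any.
    """
    if finished_at is not None and finished_at >= started_at:
        return False  # this run already reported an outcome
    return now - started_at > max_duration
```

An outcome ping older than the latest /start is ignored, so a run that finished yesterday cannot mask one that is stuck today.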
Integration
One line. Any language.
No SDK to install, no agent to deploy. An HTTP GET is the integration. Works from a shell script, a Python lambda, a Go binary, or a Kubernetes CronJob.
# Minimal — one line at the end of your crontab command
0 2 * * * /usr/bin/backup.sh && curl -fsS https://ping.steadycron.com/abc123
# With start/fail signals (recommended)
# One line only (cron has no backslash continuation); ';' after /start lets the job run even if that ping fails
0 2 * * * curl -fsS https://ping.steadycron.com/abc123/start; /usr/bin/backup.sh && curl -fsS https://ping.steadycron.com/abc123 || curl -fsS https://ping.steadycron.com/abc123/fail
The ping URL is generated when you create the check; copy it into your script. Prefer to define monitors as code? Declare them in YAML or Terraform.
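For jobs written in Python, the same start, success, and fail sequence can be sketched with the standard library alone. This is a minimal sketch under assumptions: `run_with_heartbeat` and `http_get` are hypothetical helper names, the URL is the placeholder from the crontab examples, and, as in those examples, the bare check URL is assumed to signal success.

```python
import urllib.request
from typing import Callable

# Placeholder check URL from the crontab examples; replace it with your own.
PING_URL = "https://ping.steadycron.com/abc123"


def http_get(url: str) -> None:
    """Send one GET request; ignore network errors so a failed ping never breaks the job."""
    try:
        with urllib.request.urlopen(url, timeout=10):
            pass
    except OSError:  # includes URLError, HTTPError, and socket timeouts
        pass


def run_with_heartbeat(job: Callable[[], None], base_url: str = PING_URL,
                       ping: Callable[[str], None] = http_get) -> None:
    """Ping /start, run the job, then ping the base URL on success or /fail on error."""
    ping(base_url + "/start")
    try:
        job()
    except Exception:
        ping(base_url + "/fail")
        raise  # keep the original failure visible to cron and the logs
    ping(base_url)
```

Passing the `ping` function as a parameter keeps the wrapper testable without network access; in production the default `http_get` is used.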
Smart alerting · Grace periods
Not every delay is a failure
Schedules are aspirational. Server load, midnight maintenance windows, and clock adjustments can all push jobs seconds or minutes past their scheduled time. A grace period absorbs this normal jitter so you're only paged when something is genuinely wrong.
Set it per check — a quick 5-minute health ping might use a 2-minute grace; a heavy nightly ETL that sometimes takes 20 minutes might use 30.
Grace period in action
within 10 min grace — no alert
grace expired — alerting
grace: 600 # seconds (10 minutes)
Smart alerting · Noise reduction
Built to reduce noise, not add to it
Alert fatigue is real. SteadyCron has multiple mechanisms to keep your pager quiet unless something genuinely needs your attention.
Consecutive failure threshold
A single blip, such as a momentary network hiccup or a one-off OOM kill, should not page your team at 3 AM. Set a threshold of N consecutive failures before an alert fires. With a threshold of 3, the first two failures are only logged; the third fires the alert.
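The threshold amounts to counting the current streak of failures. The sketch below is illustrative only; `should_alert` is a hypothetical name, and its `alert_after` parameter simply mirrors the setting of the same name.

```python
from typing import Iterable


def should_alert(results: Iterable[bool], alert_after: int = 3) -> bool:
    """Return True once the most recent `alert_after` results are all failures.

    `results` is ordered oldest to newest; True means the run succeeded.
    """
    streak = 0
    for ok in results:
        streak = 0 if ok else streak + 1  # any success resets the streak
    return streak >= alert_after
```

Because a success resets the streak, two failures followed by a success and another failure stay below a threshold of 3.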
alert_after: 3 # only page after 3 consecutive misses
Flapping guard
A job that rapidly alternates between healthy and failing generates a storm of alerts with no useful signal. The flapping guard detects this oscillation and suppresses repeat notifications until the job stabilises — one alert, not twenty.
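One common way to detect this kind of oscillation is to count state changes over a short window of recent results. The sketch below assumes that approach; the function name, the window size, and the flip threshold are all hypothetical and say nothing about how SteadyCron tunes its guard.

```python
from typing import Sequence


def is_flapping(states: Sequence[bool], window: int = 5, flip_threshold: int = 3) -> bool:
    """Return True when the last `window` results changed state at least `flip_threshold` times.

    `states` is ordered oldest to newest; True means healthy, False means failing.
    """
    recent = states[-window:]
    flips = sum(1 for prev, cur in zip(recent, recent[1:]) if prev != cur)
    return flips >= flip_threshold
```

While this returns True, a notifier would hold repeat alerts and send a single one once the state stops changing.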
# fail → ok → fail → ok → fail
⚠ flapping detected — alert held
Quiet hours
Mark a time window per check where non-critical failures are suppressed — the alert is logged but not delivered. When your team is back online, everything that happened is in the delivery log with its suppression reason.
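A quiet-hours window that crosses midnight, such as 22:00 to 07:00, needs a wrap-around comparison. The helper below is a hypothetical sketch: it assumes the caller has already converted the current time to the check's time zone, and the name `in_quiet_hours` is not taken from any SteadyCron API.

```python
from datetime import time


def in_quiet_hours(local_time: time, start: time = time(22, 0),
                   end: time = time(7, 0)) -> bool:
    """Return True when `local_time` is inside [start, end), including windows past midnight."""
    if start <= end:
        return start <= local_time < end  # same-day window, e.g. 09:00 to 17:00
    return local_time >= start or local_time < end  # window wraps past midnight
```

A delivery step would consult this before sending a non-critical alert and record the suppression reason in the log when it returns True.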
quiet_hours: "22:00–07:00 Europe/Berlin"
Alert lifecycle
nightly-backup missed
02:14 — alert sent to #ops-slack
Alert suppressed (quiet hours)
03:00 — subsequent miss logged, not sent
nightly-backup recovered
02:07 next day — auto-resolved
#ops-slack notified: resolved
No manual acknowledgement needed
Smart alerting · Auto-resolve
Alerts resolve themselves
When a job recovers — the next ping arrives successfully — SteadyCron automatically resolves the open alert and sends a resolution notification to the same channel. You never need to manually close an incident that has already fixed itself.
Every alert event is logged with its trigger, delivery channel, status, and suppression reason if applicable — so you always have a complete record of what fired, what was held, and why.
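The open and resolve cycle can be pictured as a two-state machine per check. The class below is a toy model under assumptions: `Check`, `on_miss`, and `on_success` are hypothetical names, and the log strings are invented for illustration rather than copied from real SteadyCron output.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Check:
    name: str
    channel: str
    open_alert: bool = False
    log: List[str] = field(default_factory=list)

    def on_miss(self) -> None:
        """Open an alert on the first miss; later misses do not notify again."""
        if not self.open_alert:
            self.open_alert = True
            self.log.append(f"{self.channel}: {self.name} missed")

    def on_success(self) -> None:
        """Resolve an open alert and notify the same channel; otherwise do nothing."""
        if self.open_alert:
            self.open_alert = False
            self.log.append(f"{self.channel}: {self.name} recovered")
```

A success ping with no open alert leaves the log untouched, which is why healthy jobs generate no notifications at all.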
One platform, three jobs
HTTP Execution
Run your endpoints on schedule — retries, timeouts, and full run logs.
Heartbeat Monitoring
Watch jobs you run anywhere; get alerted the moment one goes silent.
Cron as Code
Define every job, monitor, and alert in YAML or Terraform.
Your jobs should tell you they ran.
Add a heartbeat check in two minutes. No agent, no SDK, no infrastructure change — just a URL at the end of your script.
- Free tier — no credit card
- EU-hosted, GDPR-native
- Any language, any platform