[Reliability] Ideas to ensure guaranteed execution of scheduled jobs

Background

We have configured a few background jobs using the standard `scheduler_events` in `hooks.py`.

For example, the following custom job calculates the next day's sales projections every night:

    "daily": [
        ".....create_projections",
    ],

Similarly, we have some auto email reports configured to be sent at a specific hour.
[PR to be submitted]

Hourly reports are sent using hourly scheduled events in `hooks.py`.
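A minimal sketch of how the daily and hourly entries might sit together in `hooks.py`. The dotted paths under `my_app` are hypothetical placeholders, not our real module names.

```python
# hooks.py (sketch): the dotted paths below are hypothetical placeholders.
scheduler_events = {
    # Functions the scheduler calls once a day.
    "daily": [
        "my_app.projections.create_projections",
    ],
    # Functions the scheduler calls once an hour.
    "hourly": [
        "my_app.reports.send_hourly_reports",
    ],
}
```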

The Issue

We are in active development, and our production deployments happen quite often. If our servers are down at the moment a scheduled event is supposed to be triggered, the event is missed.

Once an event is missed, it is never replayed. E.g. if auto emails are due to be sent at 4 PM and the server is down at that time, those emails are never sent.

How do we build a retry mechanism to ensure guaranteed event execution?

Any inputs?

This isn’t a comment on retrying, but we end up dealing with this manually using a monitor. Currently, we use https://healthchecks.io/.

After a job is run, it pings the URL. If the job isn’t run, the URL isn’t pinged and we get a notification. Once we receive the notification, we act on it depending on the source of the issue.
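A rough sketch of that ping-after-success pattern, using only the Python standard library. The helper name `run_and_ping` and its parameters are made up for illustration; `ping_url` stands for whatever ping URL the monitoring service gives you for the check.

```python
import urllib.request


def run_and_ping(job, ping_url, opener=urllib.request.urlopen, timeout=10):
    """Run job(); ping the monitor only if the job finished without raising.

    `opener` is injectable so the helper can be exercised without network access.
    """
    # If job() raises, we never reach the ping, so the monitor
    # notices the missing ping and sends a notification.
    result = job()
    try:
        opener(ping_url, timeout=timeout)
    except OSError:
        # A failed ping should not fail the job itself; the missing
        # ping simply shows up as an alert on the monitoring side.
        pass
    return result
```

A scheduled function could then wrap its real work, e.g. `run_and_ping(do_the_work, PING_URL)`, where `do_the_work` and `PING_URL` are placeholders.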

This works for our jobs because we design them so that they can fail: if a job is supposed to run at 4 PM and an issue prevents it, it will still do what it is supposed to do the next time it runs. That, combined with the monitoring above, has worked for us, although YMMV.
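One way to get that property is to have each run work out which scheduled slots have passed since the last successful one and handle all of them. The sketch below assumes hourly slots and keeps the state in a plain dict; the names `pending_slots`, `catch_up` and `last_done` are hypothetical, and a real deployment would have to persist `last_done` somewhere durable.

```python
from datetime import datetime, timedelta


def pending_slots(last_done, now, step=timedelta(hours=1)):
    """Return every scheduled slot after last_done, up to and including now."""
    slots = []
    slot = last_done + step
    while slot <= now:
        slots.append(slot)
        slot += step
    return slots


def catch_up(state, now, handle, step=timedelta(hours=1)):
    """Process all missed slots in order, recording progress after each one."""
    for slot in pending_slots(state["last_done"], now, step):
        handle(slot)               # e.g. send the report for this slot
        state["last_done"] = slot  # assumption: persisted in a real setup
    return state


# Example: the 3 PM and 4 PM runs were missed during a deployment.
state = {"last_done": datetime(2024, 1, 1, 14, 0)}
handled = []
catch_up(state, datetime(2024, 1, 1, 17, 5), handled.append)
# handled now holds the 3 PM, 4 PM and 5 PM slots, in that order.
```

Because progress is recorded after each slot, running `catch_up` again with nothing new to do is a no-op.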