How are server issues (faults, etc.) monitored?
An automated heartbeat monitor checks the health of the servers and of a number of background tasks. In addition to checking the list of known tasks, it performs database read and write actions and checks the available disk space. The heartbeat monitor runs every 15 minutes. If a run of the monitor itself takes more than 5 minutes, an alert is sent.
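The checks described above can be sketched roughly as follows. This is only an illustration, not the actual monitor: the function names, the SQLite connection, the `heartbeat` table, the free-space threshold and the `send_alert` callback are all assumptions made for the example. Only the 5-minute and 15-minute figures come from the description above. The check of known background tasks is left out.

```python
import shutil
import sqlite3
import time

MAX_RUN_SECONDS = 5 * 60        # from the text: alert if one run exceeds 5 minutes
RUN_INTERVAL_SECONDS = 15 * 60  # from the text: a scheduler would start a run this often
MIN_FREE_BYTES = 1024 ** 3      # assumed threshold (1 GiB); not stated in the text


def check_database(conn):
    """Write a row and read it back; return True if the round trip works."""
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS heartbeat (ts REAL)")
        now = time.time()
        cur = conn.execute("INSERT INTO heartbeat (ts) VALUES (?)", (now,))
        conn.commit()
        row = conn.execute(
            "SELECT ts FROM heartbeat WHERE rowid = ?", (cur.lastrowid,)
        ).fetchone()
        return row is not None and row[0] == now
    except sqlite3.Error:
        return False


def check_disk(path, min_free_bytes):
    """Return True if the volume holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= min_free_bytes


def run_heartbeat(conn, path, send_alert, clock=time.monotonic,
                  min_free_bytes=MIN_FREE_BYTES):
    """Run one heartbeat pass and report each problem through send_alert."""
    start = clock()
    problems = []
    if not check_database(conn):
        problems.append("database read/write failed")
    if not check_disk(path, min_free_bytes):
        problems.append("low disk space")
    # The monitor also watches itself: a slow run is a problem in its own right.
    if clock() - start > MAX_RUN_SECONDS:
        problems.append("heartbeat run took too long")
    for problem in problems:
        send_alert(problem)
    return problems
```

In this sketch a scheduler (for example cron) would call `run_heartbeat` every `RUN_INTERVAL_SECONDS`, and `send_alert` could be any callable, such as one that sends an email.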
The list of tasks that the system monitors is defined in the class DBTask.
A task is defined by:
- code which uniquely identifies this task