Downtime and data loss are unavoidable facts of life for systems administrators. What we can control is how often such failures happen and how severe they are.

Often, these failures are due to embarrassingly simple errors, omissions, or oversights on the part of the sysadmin. This isn't because sysadmins are stupid; rather, there's usually so much to do that some tasks and checkups fall through the cracks.

This document is intended as a preventative checklist for people administering server machines. Admins can use it to do a system checkup once every 1-6 months to reduce both the likelihood and the impact of serious system failures. It is even more valuable when used for a peer audit by another trusted, experienced sysadmin. Another person can provide a fresh perspective, and they won't get distracted fixing one potential problem and forget to do the other checks.

Note that some of the questions may seem particularly stupid. Wise sysadmins don't assume that they would never make such simple and ridiculous mistakes, usually because they've learned from experience that they can.

Time

Disk

Filesystem backup

Database

Network