Tips for being a better system administrator

Here are a few tips that I have discovered or implemented during my career and that can help anyone get better results:

  • Follow the 3-2-1 backup strategy or better
    • 3 copies of the data
    • 2 different media
    • 1 copy offsite
  • Test your backups regularly. Ideally, automate your recovery tests.
  • Use Whatever-as-code as much as you can. IaC is the first that comes to my mind, but using code to define objects and their properties has several advantages:
    • Auto-documentation of changes
    • Ability to easily roll back (using source code management like Git)
    • Makes standardization and compliance easier
    • Ansible, Chef, Puppet, Salt, Terraform and Pulumi are good examples of tools for this
  • Maintain a changelog of all the major changes in your infrastructure (that isn't already "documented" because you're using IaC).
  • Build a spreadsheet with all the components that you manage and make sure that everyone on the sysadmin team is either primary, secondary or tertiary on each of them.
    • The Primary sysadmin owns that technology. They decide when to upgrade to the next version, or when to replace old hardware. They define configuration standards and documentation.
    • The Secondary nerd is responsible for understanding what the Primary decided, where everything is, and how to support it.
    • Tertiary nerds are always responsible for having enough knowledge to triage the technology, determine whether it really is broken, and know where to find the documentation on how to address it. They need to try before they escalate a ticket to the Primary.
  • Take the spreadsheet built in the previous point and calculate how busy your team is with just keeping components up to date.
    • The first step is to find the number of hours required for each component. For example, if your team manages 20 hardware servers, you must calculate how much time it takes to choose new hardware, order it, rack and configure it, and install the operating system on it, then divide that by the number of years you keep your servers on average, then add the time spent, per year, on firmware updates, changing disks or other failed components, etc.
    • The second step is to find the number of productive hours your team can work per year. Productive hours: number of hours/week x (52 - number of vacation weeks/year), minus holidays, sick days, etc., minus a percentage (25-35%) for communication, meetings, etc.
    • Once you have the total number of hours needed to maintain your systems, divide it by the number of productive hours to get the occupation rate caused by simply running the business. This doesn't include MACDs (Moves/Adds/Changes/Deletes), projects, or tech support.
  • Create and maintain a list of risks. Ideally, keep normal risks separate from risks caused by a lack of resources. Share the list with your managers.
  • Create and maintain a list of EOL dates of all the components under your team's responsibility. Hardware and software.
  • Find a way to get notified of every new version of software you're using. 
  • If you ever make a mistake (you will, everyone does), let others know and start looking for a solution.  Don't hide anything: mistakes are part of life, e
  • Always learn. There are many ways to learn without breaking the bank.
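
To make the occupation-rate calculation from the list above more concrete, here is a minimal Python sketch. The class and function names, the sample figures, and the 5-day work week are all assumptions made for illustration, not measured values; replace them with the numbers from your own spreadsheet.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One row of the component spreadsheet (hypothetical structure)."""
    name: str
    count: int                    # number of units the team manages
    replacement_hours: float      # hours per unit to choose, order, rack, configure, install
    lifespan_years: float         # average number of years a unit stays in service
    upkeep_hours_per_year: float  # hours per unit per year (firmware, failed disks, etc.)

    def hours_per_year(self) -> float:
        # Replacement effort is spread over the lifespan, then yearly upkeep is added.
        per_unit = self.replacement_hours / self.lifespan_years + self.upkeep_hours_per_year
        return self.count * per_unit


def productive_hours_per_person(hours_per_week: float,
                                vacation_weeks: float,
                                holiday_and_sick_days: float,
                                overhead_ratio: float) -> float:
    """Productive hours per person per year.

    overhead_ratio is the share lost to communication, meetings, etc. (0.25 to 0.35).
    Assumes a 5-day work week to convert days off into hours.
    """
    hours_per_day = hours_per_week / 5
    gross = hours_per_week * (52 - vacation_weeks) - holiday_and_sick_days * hours_per_day
    return gross * (1 - overhead_ratio)


def occupation_rate(components: list[Component],
                    team_size: int,
                    productive_hours: float) -> float:
    """Share of the team's productive time spent just keeping components up to date."""
    hours_needed = sum(c.hours_per_year() for c in components)
    return hours_needed / (team_size * productive_hours)


if __name__ == "__main__":
    # Sample figures only: 20 servers, 16 h to replace one, kept 5 years, 6 h/year upkeep.
    inventory = [Component("hardware servers", 20, 16, 5, 6)]
    per_person = productive_hours_per_person(40, 4, 15, 0.30)
    rate = occupation_rate(inventory, team_size=3, productive_hours=per_person)
    print(f"Occupation rate: {rate:.1%}")
```

Add one Component per line of your spreadsheet. The result only covers running the business, so MACDs, projects and tech support come on top of it.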
