The Art of Troubleshooting

And yes, it really is an art. Learning how to figure out problems is a challenge for some, yet in the IT industry it’s a requirement, almost as vital as having oxygen to breathe. Because troubleshooting rests on logic and deduction, it can be hard to teach every aspect of it. Still, some things can be taught, and they’re worth knowing when you have a computer, network or security issue at hand.

First, before you even get to the stage of having to troubleshoot, get to know a system when it’s well. This means checking which processes are common at startup, how much memory is used when various applications run and how the overall system does things. One of the things that helped me was this book for information on XP (I detest Vista with a passion) and this one for hardware. They allowed me to investigate and learn what’s new.
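As a rough sketch of what recording a healthy baseline might look like, the Python below saves a few measurements to a JSON file and later reports which startup processes are new or missing. The metric names, the file layout and the process lists are all made up for illustration; you would gather the real values by hand with whatever tool your OS provides.

```python
import json
from pathlib import Path


def save_baseline(path, boot_seconds, idle_memory_mb, startup_processes):
    """Write a snapshot of the healthy system to a JSON file.

    The three metrics are illustrative; record whatever you can measure.
    """
    snapshot = {
        "boot_seconds": boot_seconds,
        "idle_memory_mb": idle_memory_mb,
        "startup_processes": sorted(startup_processes),
    }
    Path(path).write_text(json.dumps(snapshot, indent=2))
    return snapshot


def load_baseline(path):
    """Read a snapshot written by save_baseline."""
    return json.loads(Path(path).read_text())


def diff_startup(baseline, current_processes):
    """Return (new, missing) process names relative to the baseline."""
    known = set(baseline["startup_processes"])
    current = set(current_processes)
    return sorted(current - known), sorted(known - current)
```

Save a snapshot while the machine behaves, then call diff_startup with today’s process list when something feels off. A name in the “new” list isn’t proof of a problem, but it’s a good first thing to look up.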

Second, have a good resource to read and learn from. Books are great (especially for a traditionalist like myself who likes the feel of paper) but often get quickly outdated. Something like Maximum PC Magazine is a great regular magazine to have for those who like to read while on the subway, on a plane or just relaxing at home. Even better is to have a regular forum to ask questions in. Tom’s Hardware is one of the foremost sites when it comes to hardware news, updates, reviews and even forums. It’s amongst the best out there just for that alone. Linux can answer pretty much any question on Linux issues. And when it comes to Windows questions, I go to Antionline. While it’s primarily a security site, there are enough knowledgeable people there who can help eliminate security as the issue. These can act as resources for learning, so you know what things look like when they work, as well as for troubleshooting. The reality is that to address any issue we need to learn about the “thing” that we’re focused on fixing. Ideally, we want to do this before a problem occurs, which also helps prevent problems from occurring.

Also, ensure that you have valid, working backups. The ultimate solution to most problems is a re-format and re-install of the OS (often needed if a hard drive fails). Even on home systems, some form of backup of important data should be done regularly to avoid issues. And don’t back up the whole system, just the critical data that you need (e.g., mailboxes, bookmark links, documents, pictures, iTunes songs, logs, etc.). Installation files for applications and such should already be saved on CD for when re-installation is necessary.
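For the “just the critical data” approach, here is a minimal Python sketch of one way to do it, not a full backup tool. It copies a list of folders into a new timestamped folder and checks that the number of files matches. The function names and the folder paths you would pass in are placeholders, and a file count is only a weak check that the backup is valid; opening a few of the copied files yourself is still worthwhile.

```python
import shutil
from datetime import datetime
from pathlib import Path


def count_files(folder):
    """Count regular files under a folder, including subfolders."""
    return sum(1 for p in Path(folder).rglob("*") if p.is_file())


def back_up(sources, destination_root, stamp=None):
    """Copy each source folder into a new timestamped backup folder.

    Returns the backup folder and the total number of files copied.
    Source folders that share a name would collide; this sketch does
    not handle that case.
    """
    stamp = stamp or datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(destination_root) / f"backup-{stamp}"
    target.mkdir(parents=True, exist_ok=False)
    copied = 0
    for source in map(Path, sources):
        if not source.is_dir():
            print(f"skipping {source}: not a folder")
            continue
        copy = target / source.name
        shutil.copytree(source, copy)
        # Weak sanity check: the copy should hold as many files as the source.
        if count_files(copy) != count_files(source):
            raise RuntimeError(f"file count mismatch for {source}")
        copied += count_files(copy)
    return target, copied
```

You might call it as back_up(["Documents", "Pictures"], "E:/backups"), with the folder names and drive letter swapped for your own.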

The next step, when there is an issue, is identifying the source. Unless you know there is an issue, you’ll never be able to address it. It’s important to get all the facts you can about the issue and to be as detailed as possible. This may include taking screenshots, writing error messages down and noting anything that has recently been added, changed or removed. Poor software design and/or coding can cause serious issues at times, so it’s important to keep track of changes. Sometimes even updates to the operating system can cause it to behave in a manner similar to spyware. Also, use numerical values where possible when describing something. “It’s slow!” doesn’t mean a heck of a lot compared to “It used to take only 28 seconds to boot up; now it takes about a minute”.
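To show how little effort the numbers take, here is a small Python sketch that turns two measurements into a sentence you could paste into a forum post. The function name is made up, and I’m treating “about a minute” as 60 seconds.

```python
def describe_change(label, before, after, unit="seconds"):
    """Turn two measurements into a sentence with the percentage change."""
    if before <= 0:
        raise ValueError("the 'before' measurement must be positive")
    percent = (after - before) / before * 100
    direction = "up" if percent >= 0 else "down"
    return (f"{label}: {before} {unit} before, {after} {unit} now "
            f"({direction} {abs(percent):.0f}%)")


print(describe_change("Boot time", 28, 60))
# prints: Boot time: 28 seconds before, 60 seconds now (up 114%)
```

“Up 114%” tells whoever is helping you that the boot time has more than doubled, which is a lot more useful than “it’s slow”.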

The kernel is the core of an OS: it loads the drivers and manages the hardware on behalf of everything else that runs. Both Linux and Windows have modes where the kernel loads only a bare minimum of drivers and services. Booting into what is referred to as “Safe Mode” or “Single User” mode minimizes what’s loaded (usually it’s only basic drivers) and can help determine if it’s a driver or startup application that may be causing the issue. This can also be a pretty good indicator as to whether the issue is outside of the OS or not. If the issue is still there, then it’s possible it’s at the hardware or BIOS level. Gathering all this info means you are now armed to find out the cause. This is where Google and/or the sites listed above can be handy.

Put any error codes or messages into Google and/or describe the issue, with as much detail as possible, to the forum. If it’s an error code and you Google it, you’ll find a KB page, other forums that detail the issue, or both. The only exception is if you’re the first to hit upon it, but given the speed of the internet, news of an issue often spreads rapidly around the world. Read up on the details from others who have experienced the same issue and see how much of it matches your own. Sometimes problems are like Shrek’s proverbial onion: layer after layer after layer until you get to the root of the issue. As you research your problem more, you’ll find your solution and be able to address it.

Once the issue is addressed, you start the process over again. As you go through problems, you’ll learn to recognize what is a problem and what isn’t. You’ll also pick up habits that let you fix them faster each time. Troubleshooting is an art to a degree, but it’s an art that’s learned through trial and error. Unless you experience it, you won’t know. The one thing to keep in mind is that it isn’t magic but rather simple legwork to find the problem at hand.