Troubleshooting is one of the most important skills for an Operations professional.
Troubleshooting is the process of:

- identifying an issue and reproducing it
- researching the potential causes of the issue
- isolating the root cause by eliminating potential causes
- researching and implementing a fix
- verifying that the fix resolved the issue
- communicating the issue and its resolution to others
As you can see from this list, research is a foundational aspect of troubleshooting. After an issue has been discovered it is usually up to the Operations team to identify the issue, research the issue, and ultimately fix the issue. Troubleshooting skills are improved through experience. At the start of your career you won't have much knowledge of what can go wrong or how to fix a broken deployment. However, as you continue to research and fix issues your troubleshooting skills will grow with your knowledge.
In this article we will discuss issues you may have already experienced in this class, a methodology to assist you in researching issues, and some of the basic tools you can use while troubleshooting.
Note
This article is in no way exhaustive. As you continue throughout your career you will learn about new techniques, tools, and solutions to issues.
Although development troubleshooting will be mentioned, this class will primarily focus on Operations troubleshooting.
To start let’s build some troubleshooting knowledge by examining some issues we may have seen already when deploying the Coding Events API.
Throughout the Azure portion of this class we have focused on operations, and we will continue to explore operations issues. However, identifying, communicating, and resolving development issues is often a responsibility of the DevOps professional.
We will look at some of the common operations and development issues.
Note
The issues throughout this article will be framed as issues in the Coding Events API. Many of the issues discussed may be similar, or identical, to issues seen across many different deployments.
Operations issues are issues that don't involve the source code of the deployed application. These could be issues relating to:

- the network (such as NSG rules)
- the virtual machine
- the web server
- external services (such as Key Vault or AADB2C)
- external configuration (such as appsettings.json or a Key Vault secret)

The first example will walk you through all of the troubleshooting steps to illustrate the process. The remaining examples will simply discuss or show the issue and talk about the potential root causes.
Note
The following group walkthrough will require you to perform the troubleshooting steps together. If you pay attention to the potential root causes, you may figure out how to solve some of the issues with a relatively small amount of research.
The troubleshooting process is kicked off by an issue brought to our attention. In this case someone sends us a screenshot of their browser encountering a Connection Timeout when attempting to access our API in the browser:
Our first step is to identify this issue by reproducing it on our system. This will rule out the possibility of end user error.
First up let’s reproduce the issue in the exact way the end user did with a request from the browser:
Looks like we are getting the same issue. Let's reproduce this error with PowerShell using Invoke-RestMethod from our terminal:
Since this is a learning environment let’s reproduce the issue again this time from Bash using curl:
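A minimal sketch of what that curl reproduction might look like, assuming the public IP used later in this article (40.114.86.145); your deployment's address will differ, and the flags shown are just one reasonable choice:

```bash
# Attempt to reach the API over HTTPS.
# -k skips verification of the self-signed certificate used by this deployment,
# and --max-time caps how long curl waits before reporting a timeout.
curl -k --max-time 30 https://40.114.86.145/

# While the issue is present, curl exits with a timeout error (exit code 28)
# instead of returning a response from the API.
```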
We can definitively state that a Connection Timeout is happening when users attempt to access the Coding Events API on port 443 from the browser, Invoke-RestMethod, and curl!
The next step is to research the potential causes of the issue. Typically you would rely on your experience and research skills to come up with a list of potential causes, but to save time we have provided them for you:

- the request is being made to an incorrect URL
- the VM is not currently running
- the VM does not have an NSG inbound security rule allowing traffic on port 443
The next step is to isolate the root cause of the issue by systematically eliminating potential causes until we have found the root cause, or have exhausted our known options.
In this case we would need to check that the initial request was going to the correct URL, that the VM is currently running, and that the VM has the appropriate NSG inbound security rule for port 443. At this point in time in the class you should know how to do these things through the Azure Web Portal or the AZ CLI.
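A rough sketch of those checks with the AZ CLI is shown below; the resource group, VM, and NSG names are placeholders for whatever names you used in your own deployment:

```bash
# Is the VM currently running? (-d / --show-details includes the powerState field)
az vm show -d \
  --resource-group my-rg \
  --name coding-events-vm \
  --query powerState \
  --output tsv

# Does the NSG have an inbound rule allowing traffic on port 443?
az network nsg rule list \
  --resource-group my-rg \
  --nsg-name coding-events-nsg \
  --output table
```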
Just to continue the example, let's say the root cause was that the VM lacks an NSG rule for port 443, and we discovered this by checking all three of the potential causes and finding that the NSG rules were the only thing misconfigured.
Our next step would be to research a solution to the issue. In this case we simply need to create a new NSG inbound rule for port 443.
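One way to create that rule with the AZ CLI is sketched below; the resource group, NSG name, rule name, and priority are placeholder values you would adjust for your deployment:

```bash
# Create an inbound NSG rule that allows HTTPS (port 443) traffic to reach the VM
az network nsg rule create \
  --resource-group my-rg \
  --nsg-name coding-events-nsg \
  --name allow-https \
  --priority 1000 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --destination-port-ranges 443
```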
After fixing the issue our final step is to repeat our reproduction steps to ensure the issue has been resolved!
Browser:
Our screen advanced and now we are getting the message about accepting the risk associated with a self-signed certificate. That's what we expect! Let's check out PowerShell and Bash:
PowerShell:
Bash:
Uh oh!
We are getting a new error.
The good news is we resolved our connection timeout issue by opening port 443 with an NSG inbound rule. Our fix resolved the issue; we are no longer experiencing a Connection Timeout error. We have solved this error and need to move on to the next one, which according to our web requests is a 502 Bad Gateway.
Note
An issue is not always solved with one change. In some instances a combination of steps is necessary to solve one issue.
In this case solving one issue revealed a new issue. Revealing a new issue is great progress in troubleshooting assuming you have checked that your fix resolved the initial issue, which we have done!
The final step is being able to communicate this issue with others:
The Coding Events API located at https://40.114.86.145/ was not responding to HTTP requests in the browser, Invoke-RestMethod, or curl. Users were experiencing a Connection Timeout error. We researched potential causes for this issue and determined that the Virtual Machine did not have an NSG inbound rule for port 443. We opened this port to all traffic and the issue was fixed. Connection Timeout errors have not been experienced across Invoke-RestMethod, curl, or the browser after making the change.
A user reports from the browser:
We replicate the issue from PowerShell:
We replicate the issue from Bash:
We research potential causes:

- no application is listening on port 443 (for example, the web server is not running)
- an application is listening, but on a different port than expected
We isolate the root cause of the issue by eliminating potential causes. It is determined that the VM does not have a running application that is listening on port 443.
We research fixes for the problem and find a tool called service that allows you to check the status of services and to start them.
We implement the fix for the issue by starting NGINX using the service tool.
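A minimal sketch of that fix from inside the VM using the service tool:

```bash
# Check whether NGINX is currently running
sudo service nginx status

# If it is stopped, start it
sudo service nginx start

# Confirm it is running before re-testing from outside the VM
sudo service nginx status
```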
We check that NGINX is successfully running, this time using the service tool. Then we verify that our fix resolved the problem by accessing the application in the browser, from PowerShell, and from Bash.
Users were reporting a Connection Refused error when making HTTP requests to the Coding Events API. The issue was confirmed by using the browser, Invoke-RestMethod, and curl. It was determined that the NGINX web server was not running. We started the NGINX web server and the issue was resolved. We verified the issue was resolved by using a web browser, Invoke-RestMethod, and curl.
From the browser:
Research the error code to determine potential causes:
A bad gateway is an issue between servers. In the case of our Coding Events API we have two web servers: NGINX, which proxies requests to the Coding Events API server.
Research potential causes:

- the coding-events-api service was never started
- the coding-events-api is not configured to start itself on a reboot

Isolate the root cause by systematically checking the potential causes. We determine that the VM was restarted and the coding-events-api was not configured to restart itself after a VM reboot.
To fix the issue we will need to start the coding-events-api, which we can do with the service tool we previously learned about. However, to keep this issue from happening in the future we need to figure out how to make the coding-events-api start itself back up if the VM reboots. Our research turns up systemctl enable, which configures the service to start automatically every time the VM boots.
We implement the fix by using systemctl enable and service to start the service.
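Roughly, from inside the VM, that fix might look like the sketch below, assuming the API was installed as a unit named coding-events-api (the same unit name used with journalctl later in this article):

```bash
# Configure the coding-events-api unit to start automatically at boot
sudo systemctl enable coding-events-api

# Start it now and confirm that it is running
sudo service coding-events-api start
sudo service coding-events-api status
```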
We check that the coding-events-api is running by using service again and by making a request to the API in the browser, from PowerShell, and from Bash!
Users were reporting a 502 Bad Gateway. Reports were confirmed in the browser and by using Invoke-RestMethod and curl. It was determined that the coding-events-api was not running after a recent VM reboot. The API was started with the service tool and the service was enabled so it will automatically start the next time the VM reboots.
Development issues relate to the source code of a deployed application. Ideally these issues are discovered before reaching the live production environment by automated tests and Quality Assurance testers. However, sometimes these issues are discovered by end users who usually report that the application is not behaving correctly.
The deployment isn't necessarily broken; however, the application is not behaving properly.
A user sends a report that they received an HTTP response of 500 Internal Server Error when sending a GET request for a specific coding event.

A 500 Internal Server Error is almost always the result of a runtime error within the source code of the application.
We first reproduce the issue by requesting the specific coding event, and then we continue attempting to reproduce the issue with other specific coding events. We are trying to determine if it is something special about this one coding event, or if it is a behavior seen across all coding events. In this case it’s just this specific coding event that is experiencing this issue.
In researching potential causes across the internet and talking to some of the developers on the team we come up with one potential reason:

- the record for this coding event contains data that the application cannot serialize

It's a short list, but at least we can check something.
We fire up MySQL and query for the specific coding event's record. We notice this coding event has some special characters in it: â€. We put in a breakpoint to pause the application before it pulls the data out of the database and step through. Alas, as our API tries to serialize the special characters the ORM throws an error and our API returns a 500 Internal Server Error.
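As a rough illustration, inspecting the suspect record directly in MySQL might look something like the sketch below; the database, table, column names, and record id are hypothetical and depend on how the API's schema is actually defined:

```bash
# Inspect the suspect record directly (database, table, and column names are hypothetical)
mysql -u coding_events_user -p coding_events_db \
  -e "SELECT Id, Title, Description FROM CodingEvents WHERE Id = 42;"
```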
Next we research solving this error and find a couple of solutions:

- modify the underlying data to remove the special characters
- implement a third party serialization library that can handle the special characters
- write our own serialization library
It is never a good idea to change the underlying data that is owned by end users so the first option is out! The remaining two options have obvious pros and cons. It would be faster to implement the third party library, however we would need to research the library to make sure it doesn’t contain insecure code and that it won’t break any of our existing functionality. Writing our own library would give us full control and the ability to make it as secure as we need, but would take development time.
Note
The decision between implementing a third party library and writing an in house solution is one that is typically made by management and senior level engineers. This is a situation in which effectively communicating the issue is extremely important!
Being a junior dev we decide this issue needs to be elevated to our superior as we don’t feel comfortable reviewing the security of a third party library.
We explain the issue, the solutions we found, and pass the information to our senior, who thanks us for not only finding the issue but also researching potential fixes. The senior engineers will research the third party library and management will decide on the proper course of action!
An HTTP 500 Internal Server Error was encountered when a database record contained various special characters. Upon debugging the application it was discovered that the current ORM serialization libraries were incapable of working with these special characters. The issue was elevated to senior developers who are determining how to resolve the issue.
Note
The Coding Events API does not behave this way! This was simply an example of how a 500 Internal Server Error could occur and how you might identify, isolate, and research it, and then pass it along to a more senior developer.
A user reports a bug in the API. It isn’t throwing any errors, but the application is not behaving correctly. When the user deletes a coding event they are the owner of they can still view and edit the coding event.
An API bug is almost always the result of a logic error within the source code of the application.
We first reproduce the issue with a copy of the exact event, for which we also observe the incorrect DELETE behavior. We also notice that any coding event we create cannot be deleted despite a proper DELETE request coming through.
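A sketch of reproducing the bug from the command line; the endpoint path, event id, and bearer token are hypothetical placeholders rather than the API's documented interface:

```bash
# Delete a coding event we own (path, id, and token are placeholders)
curl -k -X DELETE -H "Authorization: Bearer $ACCESS_TOKEN" \
  https://40.114.86.145/api/events/42
# The API responds with 204 No Content...

# ...but the same event can still be retrieved afterwards
curl -k -H "Authorization: Bearer $ACCESS_TOKEN" \
  https://40.114.86.145/api/events/42
```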
We research the issue; luckily this is easy because we know how a RESTful API works and feel confident looking at the source code. Upon looking at the source code we can see that the line that sends the resource deletion to the ORM is commented out, so the handler skips straight to sending back a 204 No Content! Our research indicates that uncommenting this line should restore the expected delete behavior.
We build the project locally on our machine and make the change. It seems to work; however, since this is not a project we are a developer for, we will just communicate this issue and resolution to the dev team responsible for this project. After all, the dev team may have their reasons for commenting out that specific line.
Luckily we are very capable of explaining the issue, our research, and our proposed solution to the problem. After communicating it to them the dev team will be responsible for making the change and running it through the automated tests to make sure the change doesn’t result in any unexpected behaviors.
Users reported that after deleting an event the event was still accessible. We reproduced the issue and found that the reported behavior was consistent across all events. Upon investigating the issue it was determined that the RESTful API event DELETE method was not implemented correctly. The dev team needs to re-examine this method to determine why the RESTful API is not deleting resources correctly.
Note
The Coding Events API does not behave this way! This was an example to illustrate a logic error in a deployed application.
As you may have realized, troubleshooting follows a very specific pattern. The pattern is pretty simple to follow once you have come up with a list of potential causes. You will learn many potential causes throughout your career, but when you are first starting out you won't know many of them yet.
A highly beneficial tool for determining potential causes is a strong mental model of the deployment. If you can recognize the individual components and are aware of how they can fail or be misconfigured, you are well on your way to performing a root cause analysis.
To perform the root cause analysis you need potential causes, which you can come up with by categorizing similar issues together. Once you have created a list of possible issues in each category you will have a place to start isolating potential causes to find the root cause. The levels are completely arbitrary and differ between deployments. These categories are simply a tool to help you determine potential causes until you've gained more experience.
Let's briefly define the different levels at which we could encounter an issue in our Coding Events API:

Network level: the networking of our system. The Coding Events API doesn't contain much networking and only consists of the Network Security Group rules.
However, for a more complex deployment you may also need to consider additional networking components such as virtual networks, subnets, DNS, and load balancers.
Service level: our Coding Events API only works with two external services: Azure Key Vault and Azure Active Directory B2C (AADB2C).
Not only must these services exist and be accessible to the deployed application, they must be configured properly as well. In the case of our API, our Key Vault must have a secret and must grant the VM get access to that secret. Our AADB2C tenant must be configured to issue identity tokens and access tokens. Our AADB2C tenant must also have exposed the registered Coding Events API, and the appropriate scopes must be granted for the registered front end application, Postman.
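For example, granting the VM's managed identity get access to Key Vault secrets with the AZ CLI might look like the sketch below; the vault, resource group, and VM names are placeholders for your own deployment:

```bash
# Look up the VM's managed identity, then grant it "get" access to secrets
principal_id=$(az vm show \
  --resource-group my-rg \
  --name coding-events-vm \
  --query identity.principalId \
  --output tsv)

az keyvault set-policy \
  --name coding-events-kv \
  --object-id "$principal_id" \
  --secret-permissions get
```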
Host level: our Coding Events API has a lot going on at the Host level. Inside the VM we must have:

- a running NGINX web server that proxies requests to the API
- a running coding-events-api service
- the embedded MySQL database
- a properly configured appsettings.json file
Note
In this class we have been working with a VM embedded database. In many real-world deployments this database would be a service that is external to the VM. For our deployment we consider any database issues to be at the Host level.
As a reminder, troubleshooting is the process of:

- identifying an issue and reproducing it
- researching the potential causes of the issue
- isolating the root cause by eliminating potential causes
- researching and implementing a fix
- verifying that the fix resolved the issue
- communicating the issue and its resolution to others
When you are first starting it might be easiest for you to check each individual aspect of the deployment. With a simple deployment like our Coding Events API this wouldn't be difficult. You would simply need to understand all the components of the deployment and then check their configurations one by one until you found the issue. This can be an effective way to troubleshoot a deployment, but it is very time consuming.
A better approach is to have a mental model of the deployment and then ask questions that lead you to the root cause of the issue.
Note
It is this question and answer approach that makes experience extremely valuable when troubleshooting. If you have seen the exact problem before and found a solution it will be easier for you to resolve that issue again because you are now aware of more potential questions and answers.
In the Connection Timeout section above you were presented with three potential causes of the Connection Timeout issue within the Coding Events API.
Let's review them again:

- the request is being made to an incorrect URL
- the VM is not currently running
- the VM does not have an NSG inbound security rule allowing traffic on port 443
When we make a request from the browser to the Coding Events API (https://<coding-events-api-public-ip>) and a Connection Timeout issue is noticed, we would need to answer three simple questions to find the root cause of our issue:

- Is the request going to the correct URL?
- Is the VM currently running?
- Does the VM's NSG have an inbound security rule allowing traffic on port 443?
If the answer to any of these questions is no, we have found a potential cause of the issue.

To resolve the issue we will need to fix the item, or items, that we answered no to. After ensuring that all three of these things are correct we make a new request to the Coding Events API to see if the issue was resolved.
Note
When you are starting out it is a good idea to address each item you answered no to one at a time and re-try the request after each change. This will help you isolate the issue, so upon solving it you know definitively what caused it.
Understanding these potential causes comes from understanding the components of the deployment, research, and experience. When you are starting out with troubleshooting you don't have much experience, so you will have to lean on your research skills to figure out the potential causes of a problem.
Note
Research looks a little different for everyone, as we all learn in different ways:

- reading official documentation
- searching the internet for others who have encountered the same issue
- talking through the problem with teammates

Usually it comes down to a combination of these research methods to find the root cause of an issue.
After building a mental model of the deployment you can build a troubleshooting script of questions to ask when diagnosing issues for a specific deployment.
An example troubleshooting script for the Coding Events API is provided below:

- Network: Is the request going to the correct URL? Is the VM running? Does the NSG have an inbound rule allowing traffic on port 443?
- Services: Does the Key Vault contain the expected secret, and has the VM been granted get access to it? Is the AADB2C tenant configured to issue tokens and expose the registered API with the appropriate scopes?
- Host: Is the NGINX web server running? Is the coding-events-api service running? Is the database available? Is appsettings.json configured correctly?
Note
If you don't know the category, research it by talking with teammates or searching the internet for individuals who have had similar experiences.
You can then create a script of possible solutions based on the questions you answered above. For example: have you checked the API service logs for errors (journalctl -u coding-events-api)?
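Checking those logs might look something like the sketch below, assuming the API runs as the coding-events-api unit named in this article:

```bash
# Show the most recent log entries for the coding-events-api unit
sudo journalctl -u coding-events-api --no-pager -n 50

# Or follow the logs live while reproducing the issue
sudo journalctl -u coding-events-api -f
```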
Using a troubleshooting question script in combination with the steps of troubleshooting and some persistence on your part can provide you with the information necessary to solve a problem.
Remember that resolving one issue can bring a new issue to the surface. Seeing a change in error message or behavior in the deployment is a great hint that you are making progress towards fixing the deployment!
The most effective way to build your troubleshooting skills is by practicing troubleshooting. Each time you solve a new issue you will learn a new solution and increase your ability to research issues. A very beneficial thing to do is to build your own troubleshooting script. The questions above give a good introduction to a troubleshooting script; as you continue to learn more about Operations, keep adding to the script with your new experiences.
Identifying an issue is sometimes the most difficult part of troubleshooting. As we've mentioned multiple times, as you gain more experience it will become easier to identify issues.
For now, knowing some of the most common issues and being able to ask questions about your deployment will be your two biggest tools for identifying an issue.
Warning
When you are still in the process of identifying an issue it is crucial to not make any changes!
Every change you make needs to be accounted for because you may need to undo the change to put the system back in its original state. Changes are necessary to resolve the issue, but while you are still identifying and researching you want the system to remain in its initial state.
Let’s take a look at some of the most common issues seen in deployments (this list is not exhaustive):
| Error Message | Description | Common cause |
|---|---|---|
| Connection Refused | The server received the request, but refuses to handle it | No application listening on the given port |
| Connection Timeout | The server did not respond to the client request within a specific time period | Missing NSG inbound rule |
| HTTP Status Code: 502 Bad Gateway | A server received an incorrect response from another server | The web server is running, but the application is not |
| HTTP Status Code: 401 Unauthorized | The request did not include credentials, indicating a user needs to authenticate | Credentials were not included |
| HTTP Status Code: 403 Forbidden | The request included credentials, but the authenticated user does not have the proper level of authorization | Credentials are not correct, or have not been configured properly |
| HTTP Status Code: 500 Internal Server Error | The request was received, but the server encountered an issue it doesn't know how to resolve | Runtime error in the source code |
As you may have noticed, many of the most common issues are HTTP status codes. These status codes are a standard across HTTP, so learning the various categories and individual status codes will be invaluable when troubleshooting a web deployment.
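A quick sketch of how you might surface these status codes from the command line while troubleshooting, reusing the placeholder IP and self-signed-certificate flag from earlier in this article:

```bash
# Print only the HTTP status code returned by the deployment
curl -k -s -o /dev/null -w "%{http_code}\n" https://40.114.86.145/

# Or include the full response headers for more context
curl -k -i https://40.114.86.145/
```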
Communicating the issue is as simple as defining each part of the troubleshooting process you have worked through so far:
- State how the problem was identified.
- State how the problem was proven through reproduction.
- State the potential causes that were discovered.
- State the solution to the problem.
- State how the solution was verified.