11.5. Group Walkthrough: Troubleshooting a Broken Linux Deployment

This walkthrough is different from the walkthroughs you have done so far. Troubleshooting skills can only be developed through experience. The methodologies and tools can be taught, but the experience can only be gained from solving real-world problems. You will be working as a group, like you would be on a professional team. Together your group will be troubleshooting a broken deployment. You will need to work together to engage in a troubleshooting discussion to reach a resolution to the issues presented in the deployment.

Each student will be given read-only access to the Azure resources used in the deployment. Your TA will be the only member of your group with administrative access. For each issue the group encounters, your TA will facilitate discussion and take any actions the group agrees upon.

Warning

You will be collaborating with your group members and TA. Make sure you do not change anything in the machine. Your role is purely observational. The TA will perform any mutating actions to ensure a manageable process for everyone in the group.

After setting up access of the group members, you will have one hour to reach a functional deployment. After the time is up, your instructor will walkthrough each issue and its resolution.

Note

Take your time on each issue. The purpose of this exercise is to engage as a team in identifying, isolating, and resolving issues with a live deployment.

11.5.1. Troubleshooting Tools

For this troubleshooting exercise we will need troubleshooting tools to work with our broken deployment. Luckily, some of these tools will look familiar, as we have worked with them throughout this course.

Note

The tools we use in this exercise are not exhaustive. Due to the variable nature of deployments, many different tools have been created to assist in troubleshooting. Throughout your career you will use many of these different tools as a response to different systems, tech stack, or team preferences.

11.5.1.1. Our Troubleshooting Tools

In large-scale systems you would rely on bulk remote-management tools, but in our case we only need to manage one machine. For our Linux machine we will use ssh to securely connect to the shell of the provisioned VM.

We are bound to the tools that come included in the selected Ubuntu image. This includes the file system navigation tools you learned in the Bash chapter, and the following:

  • cat or less: to inspect configuration files
  • service: to view the status of the services
  • journalctl: to view log outputs
  • curl: to make network requests from inside the Ubuntu machine

Outside of the host machine, we will use the following tools for external troubleshooting:

  • az CLI: for information about each resource component (or the Azure web portal)
  • Browser dev tools: to inspect response behavior in the browser
  • Invoke-RestMethod: to make network requests from your Windows development machine

11.5.1.1.1. SSH Into the Machine

Within the Setup section of this walkthrough your TA will invite you to your Reader role for accessing the Azure resources related to this broken deployment. After you have accepted the invitation and configured your default resource group and VM in the az CLI you can request the public IP address of the VM.

Using the az CLI you can store the public IP in a variable so it can be used as the host component of the ssh command:

Windows/PowerShell
> $VmPublicIp = az vm list-ip-addresses --query '[0].virtualMachine.network.publicIpAddresses[0].ipAddress'

The first time you connect to a machine over SSH you will be prompted to trust or reject the remote host. When prompted enter yes to trust the host and connect to it:

Windows/PowerShell
> ssh student@$VmPublicIp
# trust the remote host
# enter password: LaunchCode-@zure1

Warning

Using knowledge-based authentication (username and password) is much less secure than using something owned like a private (digital) key. The topic of using RSA keys with SSH falls outside the scope of this course, but you should know it is more secure and the preferred mechanism.

We will authenticate using credentials to avoid distracting from the learning goals of this troubleshooting exercise.

Once you have connected to the machine, you will be able to navigate and use Bash and the tools listed above to gather information and plan a solution with your group.

Note

If you are not on a Windows machine, remember that you will need to output in TSV format using the -o tsv option:

Linux/BASH
$ vm_public_ip=$(az vm list-ip-addresses -o tsv --query '[0].virtualMachine.network.publicIpAddresses[0].ipAddress')
$ ssh student@"$vm_public_ip"
# trust the remote host
# enter password: LaunchCode-@zure1

11.5.1.1.2. Using service

The service program is a wrapper that simplifies how several of the init systems on a Linux machine can be managed through a single tool. Init systems are used to initialize and manage background processes running on Linux systems.

On Ubuntu machines the systemd init system and its client program systemctl (system control manager) are used by default to manage service units. In the configuration script of our final deployments we created a systemd unit file to define how our Coding Events API would be operated as a background service on the Ubuntu VM. The script also used the service tool (rather than the underlying systemctl it wraps) to make our script portable across supporting Linux distributions.

In addition to controlling services, the service tool can be used to view the status of any registered service units like our coding-events-api, nginx, and mysql:

Warning

Be mindful of your group’s effort in troubleshooting the deployment. Only use the service tool for observation with the status command.

After reaching a group consensus your TA can issue the service commands that mutate service state.

Linux/BASH
 service <service-name> status

For example, if you were to check the status of a functioning API service you would receive the following output:

Linux/BASH
$ service coding-events-api status

● coding-events-api.service - Coding Events API
  Loaded: loaded (/etc/systemd/system/coding-events-api.service; disabled; vendor preset: enabled)
  Active: active (running) since Tue 2020-10-31 19:04:51 UTC; 1 day 4h ago
Main PID: 18196 (dotnet)
    Tasks: 16 (limit: 4648)
  CGroup: /system.slice/coding-events-api.service
          └─18196 /usr/bin/dotnet /opt/coding-events-api/CodingEventsAPI.dll

11.5.1.1.3. Using journalctl

The journalctl tool can be used to view the logs written by systemd services. You can use it to view the logs of a particular service unit using the -u (unit name) option:

Linux/BASH
$ journalctl -u <service-name>

The systemd journal can store thousands of logs and lines within them. Often it is useful to view just the most recent logs. The -f option will follow the logs starting from the last 10 lines and continuously display new lines as they are written:

Linux/BASH
$ journalctl -f -u <service-name>

# shorthand (-u comes after to pair with the service name argument)
$ journalctl -fu <service-name>

Note

Like other foreground CLI programs that attach to your Terminal, you can use ctrl+c to exit journalctl.

11.5.1.1.4. Working with Self-Signed Certificates

A self-signed certificate means that the signature used to sign the certificate is not recognized by an external certificate authority (CA). These certificates can still be used for TLS encryption but are not inherently trusted like traditional SSL certificates due to their unknown signing authority. By default, HTTP client applications like browsers or CLI tools will automatically reject self-signed certificates as a security measure.

In our configure-ssl.sh deployment script our VM internally generated the signing key used to sign the SSL certificate with the openssl tool. Recall that when you first connected to the Swagger documentation of your API in the browser, you had to bypass the warning and accept (explicitly trust) the self-signed certificate. CLI tools can be configured similarly to also accept self-signed certificates.

When working with Invoke-RestMethod cmdlet the default certificate validation behavior for self-signed certificates results in the following error for servers using self-signed certificates:

Windows/PowerShell
 Invoke-RestMethod: The remote certificate is invalid according to the validation procedure.

We can override the default validation procedure by using the -SkipCertificateCheck option:

Windows/PowerShell
> Invoke-RestMethod -Uri https://<PUBLIC IP> -SkipCertificateCheck

Similarly, when working inside the Ubuntu VM with curl the validation can be skipped using the -k option:

Linux/Bash
$ curl https://localhost -k

When troubleshooting within a VM, you can use curl to isolate networking-related issues. If you are able to connect successfully from inside the machine, but receive a timeout when connecting externally, it indicates that an internal firewall or external network security rule is the issue.

Note

In Ubuntu the default ufw tool is used for managing internal firewall rules.

11.5.2. Discussion: Components of a Functioning System

When troubleshooting, your first step is to form a mental model of the system you are working on. Due to the introductory nature of this course, you and your group will begin by discussing what you know about a fully functioning system. Consider all of the deployment components you have learned throughout this course. Pay particular attention to the components that have given you trouble in your previous studios.

For each component, you should define what expectations need to be met for it to operate in a healthy state and which misconfigurations could lead to a failed state. By thinking about the system holistically you will be able to keep track of which expectations are not met and collectively decide on the actions needed to reach a resolution.

These assumptions will be the starting points for troubleshooting once your group gains access to Azure. Any expectations that are not met in the broken deployment will offer a clue as to what needs to be fixed.

Your TA will lead your group in this discussion you will have 20 minutes to discuss the system. Consider the network, service, host, and application levels with regards to the Coding Events API:

  • What components are in this level?
  • How does each component need to be configured to function properly?
  • What common misconfigurations have you encountered with each component?
  • What was the behavior that led to discovering an issue with a component?

11.5.2.1. Example

You do not need to be exhaustive, but every expectation you define will help guide you when you are troubleshooting. For example, if you were to describe the components in the service level:

  • AADB2C
  • Key vault

You could then proceed to list some of the expectations of an operational AADB2C component:

  • A tenant directory linked to an active subscription
  • At least two registered applications—the Coding Events API and Postman client
  • A SUSI flow that uses the local email account provider
  • An exposed user_impersonation scope for restricting access to the API that has been granted to the Postman client
  • Postman is configured to use the implicit flow and the hosted postman redirect URI

Note

After you gain experience with troubleshooting, you will be able to hone in on one component or level at a time. However, when you are just starting out it is beneficial to think about the system as a whole.

11.5.3. Setup

Before the troubleshooting timer begins, you will need to work with your TA to set up your access to the Azure resources and VM. For this walkthrough your TA will grant you Reader access to their directory and lab subscription. Once you have registered with their directory and assumed the Reader role, you will be able to access the public IP address of the VM and ssh into the machine.

11.5.3.1. Access Troubleshooting Subscription

For this exercise, an Azure subscription will be setup for your group. Your TA will be the administrator of this group and each student will have read-only access. You will be able to view the deployment components, but will need to work together with your team to diagnose the issue and tell your TA how to resolve it.

Even though you already have an account with Microsoft, it is only associated with your subscription. In order to access your TA’s subscription (and the resources of the broken deployment) you will need to register an account in their directory through the following steps:

  1. Accept the email for the directory invitation
  2. Create a new account in your TA’s directory
  3. Setup your AZ CLI to use the TA’s subscription

11.5.3.1.1. Accept Email

The first step is accessing the email that was sent from Microsoft on your TAs behalf. The email will include a link that will allow you to associate your email address with a new account under the directory and subscription the TA administers.

Upon clicking the link you will be taken to a Microsoft web page that will prompt you to create an account in your TA’s tenant directory.

11.5.3.1.2. Create Account In the TA Tenant Directory

The form will come pre-populated with your email address (since you navigated to the webpage from your email client) and you will need to enter a password.

This account, and subscription, will be temporary, so we will use the same password to make things consistent. Copy the password below, then paste it in the form to avoid spelling mistakes:

  • Password: LaunchCode-@zure
  • Display name: your name (should default)
Put in password and display name to create account in TA directory

An email will be sent to you containing a security code. Copy the code and paste it into the verification form:

Verify email security code to create account in TA directory

You will then be prompted to accept the invitation permissions (select Accept):

Verify email security code to create account in TA directory

Note

It may take some time for the account to be created.

At the next prompt, you can select the Skip for now link, as this is only temporary for this final exercise:

Select skip for now for temporary access

Then select Yes to stay signed in:

Select stay signed in

11.5.3.1.3. Confirm Resources Access

You now have access to the resources created under the TA troubleshooting subscription. Select All Resources from the home dashboard to confirm that the broken deployment resources are available for you to view:

View all resources

11.5.3.1.4. Setup AZ CLI

First we need to clear the AZ CLI cache:

> az account clear

Now we need to login again which will present us with the form to authenticate:

> az login

Because you selected Stay signed in, from the previous step, it will default to your account within the TA tenant directory. All you need to do is select your name from the list:

Log into the TA directory from az CLI

Back in your PowerShell Terminal you will see your account information output:

Windows/PowerShell
> az login
# output example
[
   {
      ...trimmed...
      "id": "095dea07-a8e5-4bd1-ba75-54d61d581524",
      "name": "Troubleshooting - TA <Name>",
      "user": {
         "name": "[email protected]",
         "type": "user"
      }
      ...trimmed...
   }
]

Warning

Before continuing, confirm that the name of the subscription matches your TA name: Troubleshooting - TA <Name>.

After configuring the AZ CLI to use the new subscription, let’s set up our AZ CLI defaults for the correct resource group and virtual machine:

> az configure -d group=linux-ts-rg vm=broken-linux-vm

You can verify everything worked by looking at the default VM. It should be identical to your group-mates and TA:

> az vm show

Note

You only have read-access to the resources in your TA’s Azure subscription. Feel free to look around all you want, however any Azure commands will need to be run by your TA.

11.5.3.2. Configure Postman

For this walkthrough, you will use a Postman collection that has the AADB2C details pre-configured as variables.

11.5.3.2.1. Import the Final Postman Collection

You can import this collection by selecting the Import button and then selecting the Link tab in the import window. Paste in the following link then select Continue:

Postman import collection from gist URL

11.5.3.2.2. Update Access Token Settings

After importing, you will need to update your access token settings to use the following pre-defined variables (from top to bottom of the access token form). As a reminder, you can access this by selecting the three dots next to the collection name, selecting Edit then from the Authorization tab select Get New Access Token:

  • Token Name: {{TokenName}}
  • Redirect URL: {{RedirectURL}}
  • Auth URL: {{AuthURL}}
  • Client ID: {{ClientID}}
  • Scopes: {{Scopes}}
  • State: {{State}}

Note

You can copy and paste each {{Variable}} value into the settings form. If you misspell any variable it will turn red.

If you would like to preserve your existing settings, you can copy them to another document before pasting in the variable references.

After updating the form your settings should match the image below:

Postman configure access token variables

You can now request an access token and create a new account in this shared AADB2C tenant. After receiving your access token, leave the edit collection window open.

11.5.3.2.3. Update the baseURL Variable

From the Authorization tab select the Variables tab. Then (as seen in a previous studio) in the Current Value entry on the right side replace the current value, https://localhost:5001, with the public IP address of your group’s VM:

Postman update the baseUrl variable with the public IP address of the API

Note

You will not currently be able to access the API due to the broken state of the deployment. However, once you and your group have resolved enough issues to begin making requests you will not need to configure anything else in Postman.

11.5.4. Deployment Issues

Warning

While troubleshooting all changes made to the state of a component needs to be accounted for. Defer to your TA for taking any mutating actions—do not make changes on your own.

As your TA makes changes, consider the outcome and adjust your mental model accordingly.

Once everyone in your group has configured access to Azure you can begin troubleshooting! You can start by using external tools for diagnosis (like the browser, az CLI or Invoke-RestMethod). Then for each issue you discover you can use the following prompts to discuss and progress towards resolving it:

  • What clues have been discovered so far?
  • What level do you think the issue related to?
  • What components do you think are involved?
  • What tools will you need to use to identify the issue?
  • What action do you suggest should be taken and why?
  • What clues are presented after your TA attempted to fix the issue?

11.5.5. Final Mission

If you and your group are able to fix the deployment, you will be able to load the Swagger documentation at the public IP of the host machine. At this point, the API will be fully functional and you can complete your final mission using Postman:

  • Create an account in the AADB2C tenant to get an access token
  • Create a coding event and read its description

11.5.6. Resetting the AZ CLI

Note

You do not need to reset your AZ CLI to complete this walkthrough. However, if you would like to continue working with your resources for the remaining time in the course, the following instructions can be used.

First re-issue the account clear and login commands:

Windows/PowerShell
> az account clear
> az login

This time rather than selecting the default account (the account registered in your TA’s directory) you will select the User another account option:

AZ CLI login select use another account option

Enter your email address for Azure to look up all related accounts.

Note

Although your email is the same this process allows you to differentiate between different accounts associated with the same email address login.

From the select account view you will need to select the Personal Account option:

AZ CLI login select your personal account

Back in your PowerShell Terminal you will now see the subscriptions associated with your personal account.

Warning

If you can still view an account output that has your TA’s subscription name (Troubleshooting - TA <Name>) you have selected the wrong account. Restart the process and make sure you select Personal Account.