Troubleshooting Executors
This page compiles common troubleshooting steps encountered during the development and administration of executors.
Checking for issues with an executor instance
To debug problems you might face with an executor instance, you can apply the following steps.
First, prepare the instance:
- `ssh` into the host VM (see Connecting to cloud provider executor instances)
- `sudo su` to become the `root` user
- `systemctl stop executor` to stop the `executor` service
- `export $(cat /etc/systemd/system/executor.env | xargs)` to load the executor environment into your shell
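Run as a sequence, the preparation looks like this (the host name is a placeholder; the environment file path matches the systemd setup used above):

```shell
# On your workstation: connect to the executor host VM
# (see "Connecting to cloud provider executor instances" below)
ssh <executor-host>

# On the host VM:
sudo su                     # become the root user
systemctl stop executor     # stop the executor service
export $(cat /etc/systemd/system/executor.env | xargs)   # load the executor environment
```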
Validating the executor configuration
You can now run `executor validate`, which will inform you about any configuration issues. Fix any reported issues before proceeding.
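A minimal invocation, run on the host with the environment loaded as shown above:

```shell
# Reports any problems found in the executor configuration.
executor validate
```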
Creating a debug Firecracker VM
The next step is to create a temporary Firecracker VM for debugging purposes.
NOTE: If the host VM is provisioned with the Sourcegraph Terraform modules, the VMs may be configured to stop automatically. Refer to Disabling the auto-deletion of Executor VMs for information on how to prevent this.
Run one of the following `executor test-vm` commands to generate a test Firecracker VM:

```shell
# Test if a Firecracker VM can be started
executor test-vm

# Test if a Firecracker VM can be started and if a repository can be cloned into the VM's workspace
executor test-vm --repo=github.com/sourcegraph/sourcegraph --revision=main
```
The command will output a line like:

```shell
Success! Connect to the VM using $ ignite attach executor-test-vm-0160f53f-e765-4481-a81e-aa3c704d07bd
```
Execute the generated `ignite attach <vm>` command to gain a shell to the Firecracker VM.
Disabling the auto-deletion of Executor VMs
NOTE: These instructions apply to VMs deployed via the Sourcegraph Terraform modules.
The Executor host VMs are configured to automatically tear themselves down once all jobs in the queue are completed. While this is desired behaviour under regular circumstances, it complicates debugging issues in the executor configuration or connections. To prevent the VMs from automatically stopping:
- `ssh` into the VM
- `sudo su` to become the `root` user
- Remove (or rename) the `/shutdown_executor.sh` file
The VM should now persist after all jobs are satisfied.
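For example, renaming the file keeps it around so auto-deletion can be re-enabled later (run as `root` on the host VM):

```shell
# Rename the shutdown script so it no longer runs; move it back to re-enable auto-deletion.
mv /shutdown_executor.sh /shutdown_executor.sh.disabled
```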
Recreating a Firecracker VM
If a server-side batch change fails unexpectedly, it's possible to recreate the generated Firecracker VM from the batch change execution.
NOTE: If the host VM is provisioned with the Sourcegraph Terraform modules, the VMs may be configured to stop automatically. Refer to Disabling the auto-deletion of Executor VMs for information on how to prevent this.
- Navigate to the failed execution page of the Batch Change
- Select a failed Workspace on the left and click the `Diagnostics` link on the right pane
- In the modal, expand the `Setup` step by clicking the text or the expansion arrow on the right
- Copy the command from the final step of `Setup`, starting with `ignite run`
- `ssh` into the host VM
- `sudo su` to become the `root` user
- `systemctl stop executor` to stop the `executor` service
- `export $(cat /etc/systemd/system/executor.env | xargs)` to load the executor environment into your shell
- Paste in the command copied from the batch change. You may need to remove the `--copy-files` and `--volumes` directives, as those volumes and files may no longer exist on the VM. Also surround the `--kernel-args` arguments in quotes (see the sketch after this list)
- Execute the command and wait for the VM to start
- Run `ignite ps` to list all currently running VMs
- Run `ignite attach <vm id>` to get a shell to the running VM
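The edited command will look roughly like the following sketch. The image and kernel arguments shown here are placeholders; the real values come from the copied `Setup` step:

```shell
# Hypothetical shape of the pasted command after editing:
#   - the --copy-files / --volumes flags have been removed
#   - the --kernel-args value is wrapped in quotes
ignite run <image-from-setup-step> \
  --name debug-vm \
  --kernel-args "<kernel args copied from the Setup step>"
```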
List preferred Linux distros
An AMD64 (x86_64) Linux distro must be used due to the machine type of the VM. You may list available AMD64 distros with one of the following commands, depending on your cloud provider:
GCP
```shell
gcloud compute images list --filter='(family~amd)'
```
AWS
```shell
aws ec2 describe-images --filters "Name=architecture,Values=x86_64"
```
Configure the log level of executors
The log level of executors is set using the environment variable `SRC_LOG_LEVEL`. The following values are allowed:
- `dbug`
- `info`
- `warn` (default)
- `error`
- `crit`
Update or set this value in the shell profile or environment file of the instance, then run `executor run` to restart the instance.
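For example, on a host where the executor runs as the systemd service described above (the environment file path matches the earlier steps; edit the existing line instead if `SRC_LOG_LEVEL` is already set):

```shell
# Append the new log level to the executor environment file,
# then restart the service so the value is picked up.
echo 'SRC_LOG_LEVEL=dbug' | sudo tee -a /etc/systemd/system/executor.env
sudo systemctl restart executor
```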
Problems with the Docker mirror instance
Verify that the Docker mirror instance is functioning properly by testing the following:
Mirror is reachable from the executor instance
Run the following command on the executor instance to determine whether it responds properly:
```shell
# If EXECUTOR_DOCKER_REGISTRY_MIRROR_URL is set to a custom URL, replace the base endpoint with its value
curl http://localhost:5000/v2/_catalog
```
Registry is mounted in the file system
Verify that the registry is mounted under the expected path in the file system by running:
```shell
# This directory should always be mounted
ls /mnt/registry

# If jobs have been processed, the following path should exist
ls /mnt/registry/docker/registry/v2/repositories/<public repository name>
```
Connecting to cloud provider executor instances
The following commands allow you to SSH into an executor instance, depending on your cloud platform of choice.
GCP
Find the name of an executor instance with
```shell
# optionally provide the --project flag
gcloud compute instances list --filter="name~executor" --format="get(name)"
```
Then, using the name of an instance, run
```shell
# optionally provide the --project flag
# use an identity-aware proxy tunnel with --tunnel-through-iap
gcloud compute ssh ${INSTANCE_NAME}
```
Alternatively, you may navigate to the compute instance in the GCP web console, where you will be able to connect with SSH in-browser.
AWS
In order to connect to an EC2 instance using SSH, you must have specified a key pair when the instance was launched. If you have not done so, you can connect to your instance through the web console instead.
Assuming you have specified the key pair, first run
```shell
chmod 400 path/to/key.pem
```
Find the public DNS value of your instance either through the web console or by using `aws ec2 describe-instances`, then run

```shell
ssh -i "path/to/key.pem" root@${INSTANCE_PUBLIC_DNS}
```
Misconfigured environment variables
This section lists some common mistakes with environment variables. Some of these will be exposed by running `executor validate` on the executor instance.
Env var | Common mistakes |
---|---|
`EXECUTOR_FRONTEND_URL` | No protocol included (e.g. `https://`) |
`EXECUTOR_FRONTEND_PASSWORD` | Not set in `executor.accessToken` in the site config |
`EXECUTOR_QUEUE_NAME` | Value doesn't match one of [`codeintel`, `batches`], or neither of `EXECUTOR_QUEUE_NAME` and `EXECUTOR_QUEUE_NAMES` is set |
`EXECUTOR_QUEUE_NAMES` | Value doesn't match one of [`codeintel`, `batches`] |
| Value format can't be parsed by `time.ParseDuration` |
| Value format not recognized by virtual machine or Docker |
`EXECUTOR_FIRECRACKER_DISK_SPACE` | Value format not recognized by virtual machine |
`EXECUTOR_DOCKER_REGISTRY_MIRROR_URL` | Wrong IP or port specified |
`EXECUTOR_DOCKER_HOST_MOUNT_PATH` | Workspace does not exist at provided mount path |
`EXECUTOR_VM_STARTUP_SCRIPT_PATH` | Script does not exist at provided file path |
| Image does not exist for provided repository, name, or tag |
| `/metrics` path is included or wrong IP or port specified |
`SRC_LOG_LEVEL` | Not set to one of [`dbug`, `info`, `warn`, `error`, `crit`] |
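To review the values currently in effect on a systemd-managed host, you can inspect the environment file from the earlier steps (path as used above):

```shell
# Show the executor-related variables configured on the host.
grep -E '^(EXECUTOR_|SRC_)' /etc/systemd/system/executor.env
```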
Verify Firecracker support
The VM instance must support KVM. In effect, this means the instance must meet certain requirements depending on the Cloud provider in use.
GCP
Nested virtualization must be enabled on the machine.
- SSH into the executor instance (see Connecting to cloud provider executor instances)
- Run the following command. If it outputs anything other than `0`, nested virtualization is enabled:

  ```shell
  grep -cw vmx /proc/cpuinfo
  ```
AWS
Verify that the machine type in use is a `.metal` type (e.g. `m5.metal`).
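Regardless of provider, a quick sanity check on the instance itself is to confirm that the KVM device is exposed (Firecracker requires access to `/dev/kvm`):

```shell
# Firecracker needs the KVM device on the host.
test -e /dev/kvm && echo "KVM device present" || echo "KVM device missing"
```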
Why use iptables
`iptables` provides network isolation, security, and regulated access for Firecracker VMs. It implements NAT of private IP addresses for each VM and allows forwarding only specific ports to VMs. It also blocks all other traffic and prevents IP spoofing.
Allowed traffic
Description | Purpose | Relevant rules |
---|---|---|
DNS traffic | DNS resolution | `iptables -A CNI-ADMIN -p udp --dport 53 -j ACCEPT` |
Host to guest, established connections guest to host | SSH access | `iptables -A INPUT -d 10.61.0.0/16 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT` |
From guest to gateway | Outbound internet access | |
Blocked traffic
Description | Purpose | Relevant rules |
---|---|---|
Guest to host | Block outbound traffic (e.g. other executors or the Docker registry) | `iptables -A INPUT -s 10.61.0.0/16 -j DROP` |
Guest to guest | Block outbound traffic to other Firecracker VMs | `iptables -A INPUT -s 10.61.0.0/16 -d 10.61.0.0/16 -j DROP` |
Guest to link-local | Block Cloud provider resources such as instance metadata | `iptables -A INPUT -s 10.61.0.0/16 -d 169.254.0.0/16 -j DROP` |
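To confirm these rules are in place on an executor host, list the current rules (the `CNI-ADMIN` chain referenced above may not exist until Firecracker networking has been set up):

```shell
# Dump the chains referenced in the tables above.
sudo iptables -L INPUT -n -v
sudo iptables -L CNI-ADMIN -n -v
```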
Kubernetes Job Scheduling
The Job Pods need to be scheduled on the same node as the Executor Pod (in order to mount the Persistent Volume Claim). The following environment variables can be used to determine which node the Job Pods will be scheduled on.
Name | Default Value | Description |
---|---|---|
`EXECUTOR_KUBERNETES_NODE_NAME` | N/A | The name of the Kubernetes Node to create Jobs in. If not specified, the Pods are created in the first available node. |
`EXECUTOR_KUBERNETES_NODE_SELECTOR` | N/A | A comma-separated list of values to use as a node selector for Kubernetes Jobs. e.g. `foo=bar,app=my-app` |
`EXECUTOR_KUBERNETES_NODE_REQUIRED_AFFINITY_MATCH_EXPRESSIONS` | N/A | The JSON encoded required affinity match expressions for Kubernetes Jobs. e.g. `[{"key": "foo", "operator": "In", "values": ["bar"]}]` |
`EXECUTOR_KUBERNETES_NODE_REQUIRED_AFFINITY_MATCH_FIELDS` | N/A | The JSON encoded required affinity match fields for Kubernetes Jobs. e.g. `[{"key": "foo", "operator": "In", "values": ["bar"]}]` |
`EXECUTOR_KUBERNETES_POD_AFFINITY` | N/A | The JSON encoded pod affinity for Kubernetes Jobs. e.g. `[{"labelSelector": {"matchExpressions": [{"key": "foo", "operator": "In", "values": ["bar"]}]}, "topologyKey": "kubernetes.io/hostname"}]` |
`EXECUTOR_KUBERNETES_POD_ANTI_AFFINITY` | N/A | The JSON encoded pod anti-affinity for Kubernetes Jobs. e.g. `[{"labelSelector": {"matchExpressions": [{"key": "foo", "operator": "In", "values": ["bar"]}]}, "topologyKey": "kubernetes.io/hostname"}]` |
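For example, to pin Job Pods to a specific node by label, the node selector can be set on the executor's deployment. The deployment name and node name below are placeholders for illustration:

```shell
# Hypothetical example: schedule Job Pods onto the node carrying this hostname label.
kubectl set env deployment/executor \
  EXECUTOR_KUBERNETES_NODE_SELECTOR="kubernetes.io/hostname=<node-name>"
```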
Scheduling Errors
If you encounter the following errors,

```text
deleted by scheduler: pod could not be scheduled
```

or

```text
unexpected end of watch
```
Add or update the environment variable `SRC_LOG_LEVEL` to `dbug` to start receiving debug logs. The specific debug log that may help troubleshoot these errors is `Watching pod`.

The `Watching pod` debug logs contain `conditions` that may describe why a Job Pod is not being scheduled correctly.
For example,
```json
{
  "conditions": {
    "condition[0]": {
      "type": "PodScheduled",
      "status": "False",
      "reason": "Unschedulable",
      "message": "0/1 nodes are available: 1 node(s) didn't match pod affinity rules. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling."
    }
  }
}
```
This tells us that the Pod cannot be scheduled because the configured pod affinity rules (`EXECUTOR_KUBERNETES_POD_AFFINITY`) do not match any nodes. In this case, `EXECUTOR_KUBERNETES_POD_AFFINITY` needs to be modified to correctly target the node.
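As a hypothetical fix, the affinity could be pointed at a label that actually exists on the executor's Pods. The deployment name, label key, and value here are placeholders:

```shell
# Hypothetical example: co-locate Job Pods with pods labeled app=executor on the same node.
kubectl set env deployment/executor \
  EXECUTOR_KUBERNETES_POD_AFFINITY='[{"labelSelector": {"matchExpressions": [{"key": "app", "operator": "In", "values": ["executor"]}]}, "topologyKey": "kubernetes.io/hostname"}]'
```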