Troubleshooting for Endpoint Agent

This document is intended for people in charge of deploying or administering the Devo Endpoint Agent (EA). It describes ways of troubleshooting Devo EA and common error cases that have been identified in the field.

Deployment

This section describes typical problems encountered during deployment and guidelines for troubleshooting them.

Controlled error messages during deployment

Some tasks execute commands or operations that can end in an error by design. These tasks check certain services or configurations and then either apply customized changes or skip them. The text ...ignoring is displayed right after the error message and helps to identify these controlled errors.

For example, the task that checks whether the firewalld service is running:


This error appears when the firewalld service is not present or is stopped. It can be safely ignored and does not affect the deployment. As a rule, any message tagged with ...ignoring can be safely ignored and has no effect on the deployment sequence.

Timeout when waiting for 127.0.0.1:8080

The deployment process cannot connect to the interface that Fleet starts on port 8080, or the connection took longer than 60 seconds. Typically, there has been a problem in the deployment sequence and the Fleet instance could not boot up. Check the Endpoint Agent Manager section and see if the logs give any hint as to what the problem may be.

The most common root cause of this error is that the certificates used to send data from the EA Manager to Devo have not been properly configured. Make sure that all the steps in the guide have been followed. In a default installation, the domain-certs folder should look like the following screenshot:


Make sure that the certificate files have the same names as in the screenshot above.

Also, make sure that the collector configured in the entrypoint is correct, as explained in the deployment guide.

Proxy is needed to access Internet

Configure the following settings:

Set the http_proxy and https_proxy environment variables correctly.

Enable the proxy for the Docker environment as described here, by editing the file ~/.docker/config.json.
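The two settings above can be sketched as follows. This is a minimal illustration, not the procedure from the deployment guide: PROXY_URL and proxy.example.com are placeholders, and docker_proxy_json is a hypothetical helper that only prints the "proxies" section the Docker client reads from ~/.docker/config.json.

```shell
#!/bin/sh
# Placeholder proxy address (assumption): replace with your own proxy.
PROXY_URL="${PROXY_URL:-http://proxy.example.com:3128}"

# Environment variables read by curl, pip, ansible-galaxy and similar tools.
export http_proxy="$PROXY_URL"
export https_proxy="$PROXY_URL"

# docker_proxy_json URL: print a "proxies" section for ~/.docker/config.json.
# Hypothetical helper; it writes to stdout only and never touches the file.
docker_proxy_json() {
    cat <<EOF
{
  "proxies": {
    "default": {
      "httpProxy": "$1",
      "httpsProxy": "$1"
    }
  }
}
EOF
}
```

Running docker_proxy_json "$PROXY_URL" prints the fragment for review. If ~/.docker/config.json already exists, merge the "proxies" key into it by hand rather than overwriting the file.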

Shared connection closed

If you see this error during the deployment process (it usually happens at the beginning and prevents the process from continuing):

TASK [duam-internal-services : Set hostname with name in inventory] ************ fatal: [devo-ua-manager]: FAILED! => {"changed": false, "module_stderr": "Shared connection to 10.239.74.38 closed.\r\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}


This happens because Ansible cannot use SSH properly to perform the deployment. Ansible connects via SSH to every server included in the inventory to perform the installation. The solution is to fix the environment so that Ansible can use SSH.

If the deployment is being done on a single server, the following workaround has been proven to work:

  1. Edit your inventory file.

  2. Add ansible_connection: local in the all->vars section.

  3. Save your inventory file.

  4. Run the playbook again.

Be aware that this workaround tells Ansible not to use SSH, so it is only valid for local deployments.
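As an illustration of step 2, the relevant part of the inventory might look like the fragment below. This is only a sketch: it assumes a YAML inventory with a top-level all group, and any other variables already present under vars stay as they are.

```yaml
# Hypothetical inventory excerpt: only the ansible_connection line
# comes from the workaround described above.
all:
  vars:
    ansible_connection: local
```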

Docker-compose: error while loading shared libraries: libz.so.1: failed to map segment from shared object

If you see this error during the deployment process (only in the case that you have configured dockerized versions of MySQL and Redis in your inventory file):

TASK [duam-internal-services : Deploy and configure int-srv for DEA manager (RedHat / CentOS)] ************ fatal: [devo-ua-manager]: FAILED! => {"changed": true, "cmd": "cd /srv/duam-internal-services\n/usr/local/bin/docker-compose up -d\n", "delta": "0:00:00.141889", "end": "2021-07-13 16:11:48.867585", "msg": "non-zero return code", "rc": 127, "start": "2021-07-13 16:11:48.725696", "stderr": "/usr/local/bin/docker-compose: error while loading shared libraries: libz.so.1: failed to map segment from shared object: Operation not permitted", "stderr_lines": ["/usr/local/bin/docker-compose: error while loading shared libraries: libz.so.1: failed to map segment from shared object: Operation not permitted"], "stdout": "", "stdout_lines": []}

The usual cause is a problem with /tmp permissions. Permissions are enforced both by the filesystem and by the mount options. If /tmp is a mount point with the noexec option set, this error appears even though the filesystem permissions are correct (read, write and exec: 777).

The following command checks for the noexec mount option:

$ mount | grep "/tmp" | fgrep noexec

The output is empty if the noexec option is not set or /tmp is not a mount point. Otherwise, the output is similar to:

To temporarily fix the noexec issue, remount /tmp with the exec option:

Then run the ansible-playbook ... command again.

Be aware that this fix is temporary: the exec option only lasts until the VM is restarted. To make the change persistent, follow the appropriate procedure for your Linux distribution (editing /etc/fstab, the tmp.mount service, ...).
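The check and the temporary fix described above can be sketched as follows. The helper name has_noexec is hypothetical; it parses mount-style lines from standard input and, unlike a plain grep for "/tmp", matches the mount point exactly, so a noexec /var/tmp does not produce a false positive.

```shell
#!/bin/sh
# has_noexec MOUNT_POINT: read `mount` output on stdin and succeed only if
# MOUNT_POINT itself is mounted with the noexec option.
# Expected line format: "tmpfs on /tmp type tmpfs (rw,nosuid,noexec)".
has_noexec() {
    awk -v mp="$1" '
        $3 == mp && $6 ~ /[(,]noexec[,)]/ { found = 1 }
        END { exit !found }
    '
}

# Usage on the target host (not executed here):
#   mount | has_noexec /tmp && sudo mount -o remount,exec /tmp
```

The remount shown in the usage comment lasts only until the next reboot, as noted above.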

Unable to find any of pip3/pip to use.  pip needs to be installed (can appear with both pip or pip3)

The steps below assume that the issue is related to pip3, but the procedure is the same for pip.

If, after installing the requirements by running:

ansible-galaxy install -r requirements.txt

you still get the following error when launching the devo-endpoint-agent playbook:


Since Ansible uses sudo to execute the tasks, check whether pip3 is in the root user's $PATH using which:

This is an example; paths may differ depending on the host.


Since /usr/bin is not included in $PATH, it must be added:


And then launch the playbook again:


If the error persists, it may be due to the sudoers secure_path setting, which limits the value of the root user's $PATH environment variable.

To check it, use the following:


If nothing is returned, it must be added to the sudoers file.

First, make a backup copy of the sudoers file:


Modify the original and check that it is valid:


And then launch the playbook again:
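A minimal sketch of the PATH check described in this section. The helper path_contains is hypothetical, and the commands in the comments are common ways to perform each step, not necessarily the exact commands from the original guide; /usr/bin is the example directory used above.

```shell
#!/bin/sh
# path_contains DIR PATH_VALUE: succeed if DIR is one of the
# colon-separated entries of PATH_VALUE.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage on the target host (not executed here):
#   root_path=$(sudo sh -c 'echo "$PATH"')    # PATH as seen through sudo
#   path_contains /usr/bin "$root_path" || echo "/usr/bin is missing"
#   sudo grep secure_path /etc/sudoers        # show the sudoers restriction
#   sudo cp /etc/sudoers /etc/sudoers.bak     # backup before editing
#   sudo visudo -c                            # syntax check after editing
```

Comparing whole entries avoids false matches such as /usr/bin inside /usr/bin/extra.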

Unsupported nginxinc dependency roles

If you see one or both of the following errors, or similar ones:

or


These errors are usually related to the version of the nginxinc dependency roles.

Follow these steps to fix it:

  • Edit the playbooks/roles/requirements.txt file (with vi, for example) and pin the version of the nginxinc.nginx and/or nginxinc.nginx_config roles. The nginxinc.nginx version must be 0.21.0 and the nginxinc.nginx_config version must be 0.3.3:

  • Uninstall the current versions of these roles using ansible-galaxy:

  • Reinstall the role dependencies using ansible-galaxy with the -f option:

  • Continue from the step that calls the ansible-playbook command in the deployment/configuration process.

Cannot start NGINX (address already in use)

If the Ansible playbook fails because NGINX cannot start, with the following errors:


These errors occur because another service is listening on port 80. The EA Manager is intended to run by itself and not to share resources with other products. Although NGINX only listens on port 8081 (by default), during the installation process it uses port 80 until port 8081 is configured.

To solve the issue, stop the service that is listening on port 80 and run the playbook again.

NGINX deployment management

If you are experiencing issues when deploying NGINX with Ansible, EA Deployer gives you the option to skip the installation and do it manually. NGINX is decoupled from the EA Manager's normal service and is only used to provide a repository from which you can download the generated agents. This option is only available from version 1.2.1 onward.

Include one of the following variables in your inventory to disable NGINX deployment. Using this variable requires you to deploy the NGINX server manually:


dea_ap_deploy_nginx_software_base: completely disables the NGINX software deployment. However, the NGINX HTTP server configuration (Ansible role: nginxinc.nginx_config) will still run and configure your service appropriately.

Agent repository (NGINX) returns 403

This applies if agent-repository returns a 403 status code and you are sure that the user and password provided for basic authentication are correct. For example:

The problem is usually related to the owner and permissions of the paths from which NGINX loads the files to serve. The following commands (prefixed with $; the other lines are the expected output) help to check the owner and permissions.

If one or more of them do not match the output displayed above, the following commands should fix it (repeat on all nodes where agent-repository was deployed).

At this point you will be able to download the agent deployment archives from agent-repository.

Deployment pre-check of https://pkg.osquery.io/ URL fails

(Only in 1.2.1) You may see the following error during the pre-checks : Checking url access with curl Ansible task while the devo-ea-deployer.yaml playbook is executing:

The failing URL in the error is 'url': 'https://pkg.osquery.io/'.

The reason is that the pkg.osquery.io service was migrated and now redirects (HTTP 302) instead of replying with an HTTP 200 as before. The new configuration allows redirections when performing the pre-check.

You can update the check parameters with the following command (assuming that the working directory is the path where devo-ea-deployer-1.2.1.tgz was extracted):

If no error is returned after running these commands, you will be able to run the devo-ea-manager.yaml playbook with ansible-playbook again.

Deployment pre-check of Devo certificates

If you see an error like the following one during the pre-checks: Checking Devo certificates with tls connection Ansible step:

It is probably caused by deprecated Devo certificates (chain, .crt and .key). For security reasons, the platform now offers certificates with stronger encryption and is deprecating the previous ones. To solve it, create and download new certificates at Devo.

Certifi dependency library incompatible with python 2 up to devo-ea-deployer 1.2.x

If you see the following error, or a similar one, while the ansible-playbook command runs and Python 2 is used to run Ansible commands in the target hosts:

It is probably caused by an unsupported version of the certifi Python library installed in the target hosts during the deployment process.

This issue will be fixed in future devo-ea-deployer versions but affects versions prior to 1.2.X. As a workaround, run the following commands in the target host (or target hosts), replacing 2.7 with the Python 2.X version in your host.

Then run the ansible-playbook command again to continue with the deployment process.

Docker pull rate limit reached

If you see an error similar to the following while the ansible-playbook command runs:

The likely cause is that the Docker client, in guest mode or authenticated as a free-tier user, has reached the maximum number of pulls for a certain period of time.

The solution is to manually download the required Docker images and load them locally. These images can be downloaded directly from the public HTTPS repository, the same one that serves the devo-ea-deployer.-XXXX.tgz package.

The commands listed below must be executed on the host defined in the all->children->deamintsrvs->hosts section of the inventory file, that is, the host where the MySQL and Redis services will be deployed using Docker containers.

Follow these steps to solve the problem.

  1. Download the backup archive for each required image, replacing <DEA-REPO-USER> and <DEA-REPO-PASSWD> with the username and password used to authenticate in the repository:

     

  2. OPTIONAL: Check the MD5 hash of the downloaded files, replacing <DEA-REPO-USER> and <DEA-REPO-PASSWD> with the username and password used to authenticate in the repository:

     

  3. Load the Docker images:

     

  4. Remove the temporary files:

     

Then run the ansible-playbook command again to continue with the deployment process.
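The optional hash check in step 2 and the load in step 3 can be sketched as follows. The helper verify_md5 and the file names in the usage comment are hypothetical; the real archive names and URLs are those provided with the repository.

```shell
#!/bin/sh
# verify_md5 FILE EXPECTED_HASH: succeed only if the MD5 digest of FILE
# equals EXPECTED_HASH (hexadecimal, lowercase).
verify_md5() {
    actual=$(md5sum "$1" | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}

# Usage on the internal-services host (not executed here; the file
# names are placeholders):
#   verify_md5 image-backup.tar.gz "$(awk '{ print $1 }' image-backup.tar.gz.md5)" \
#       && sudo docker load -i image-backup.tar.gz
```

Loading an image only after its hash matches avoids importing a truncated download.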

Authentication error during dea-migrations playbook in upgrade to 1.3 procedure

Affected versions: 1.3.0.1

If you see an authentication error while the following command, listed in Endpoint Agent 1.3 upgrade procedure 1.2, is executing:

The cause is a missing variable in the inventory file. The variable should have been added by tools/miginvt.py in a previous step, but the tool did not create the new inventory properly.

To solve this issue, add the deam_admin_email variable with the value no-reply@localhost.local to the inventory file generated by the tools/miginvt.py tool. See the following example:

Save the file and continue with the procedure from the step that executes the dea-migrations.yaml playbook (included):

yum/dnf update fails on Devo UA Manager server with runc and containers-common in RHEL 8

If you see an error similar to the one below during an OS package upgrade (yum / dnf) on RHEL 8:

It is probably caused by an incompatibility between the docker-ce dependencies and the official RHEL 8 container engine packages, as described here. The solution is to disable the yum module that updates the RHEL containerization engine.

This prevents the system from trying to install the conflicting packages, which are the Red Hat-specific container packages that conflict with the Docker installation. It does not remove any packages; it simply tells the system “Don’t try to install these packages.”

Then continue with the upgrade process.

'lib' has no attribute 'X509_V_FLAG_CB_ISSUER_CHECK' error deploying on RHEL 8

If you see an error similar to:

while you are deploying on RHEL 8, it is usually because the version of pyOpenSSL installed from an OS package is too old. To solve it, uninstall the OS package and run the following command on all manager nodes:

Then continue the installation of Devo EA 1.3.1 or higher.

update ca-certificates package in Ubuntu error: 'lib' has no attribute 'X509_V_FLAG_CB_ISSUER_CHECK'

If you see an error similar to the following while you are deploying on Ubuntu 18.04 or 20.04:

The likely cause is an incompatible version of pyOpenSSL installed by the python3-openssl package.

To solve this, run the following command on all manager nodes:

Then launch the Ansible playbook that returned the error, and continue with the procedure.

python setup.py egg_info error displayed by getansible-venv.sh on Amazon Linux distro

If you see an error similar to the following while you are installing Ansible with getansible-venv.sh on an Amazon Linux distribution (version 2 or earlier):

The likely cause is that Python 3 was previously installed on your host. Ansible 2.9, installed in a virtual environment by getansible-venv.sh, is incompatible with Python 3 on Amazon Linux hosts. The solution is to remove Python 3 from the host.

To solve this, run the following command on all manager nodes:

Then launch the command to download and run getansible-venv.sh and continue with the procedure.

Endpoint Agent client


This section describes typical trouble scenarios and troubleshooting guidelines for the client (OSQuery + extensions).

Effective configuration applied in a running and connected agent.

You can get the effective value of the configuration options in each agent with the following SQL query:


A convenient way to run this query is the devo-ea-manager web UI, since in this scenario the agent is connected to DEAM.

Packs loaded in a running and connected agent.

You can get the packs loaded in each agent with the following SQL query:


A convenient way to run this query is the devo-ea-manager web UI, since in this scenario the agent is connected to DEAM.

Scheduled queries in a running and connected agent.

You can get the scheduled queries currently loaded in each agent with the following SQL query:


A convenient way to run this query is the devo-ea-manager web UI, since in this scenario the agent is connected to DEAM.
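A minimal sketch of queries for the three checks above. It assumes the standard osquery tables osquery_flags, osquery_packs and osquery_schedule; the exact queries used in the original guide may differ. The helper dea_query is hypothetical and only prints the SQL text so that it can be pasted into the web UI.

```shell
#!/bin/sh
# dea_query TOPIC: print the SQL for one check.
# TOPIC is one of: flags, packs, schedule.
dea_query() {
    case "$1" in
        flags)    echo "SELECT name, value FROM osquery_flags;" ;;
        packs)    echo "SELECT * FROM osquery_packs;" ;;
        schedule) echo "SELECT name, query, interval FROM osquery_schedule;" ;;
        *)        echo "usage: dea_query flags|packs|schedule" >&2
                  return 1 ;;
    esac
}
```

For example, dea_query flags prints the query that lists each configuration flag with its current value.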

Temporarily set log-level to debug in agent.

You can temporarily set the agent logger level to debug and display the messages on stdout by following the steps below.

Windows platform

All commands detailed below must be run as administrator in a PowerShell console.

  • Ensure that the osqueryd service is stopped:

  • Run osqueryd with the debug log level (assuming the devo-ea agent was installed with the default installation):

  • Alternatively, you can save stdout to a file by running the following commands instead of the previous ones:

  • When you finish your tests, press Ctrl + C in the PowerShell console to stop the debug process.

  • Start the service:


Linux platform

The following commands are based on the systemd init system. Adapt the service start/stop commands for other init systems.

  • Ensure that the osqueryd service is stopped:

  • Run osqueryd with the debug log level (assuming the devo-ea agent was installed with the default installation):

  • Alternatively, you can save stdout to a file by running the following commands instead of the previous ones:

  • When you finish your tests, press Ctrl + C in the previous console to stop the debug process.

  • Set the right owner if you saved the output to a file:

  • Start the service:


macOS platform

  • Ensure that the osqueryd service is stopped:

  • Run osqueryd with the debug log level (assuming the devo-ea agent was installed with the default installation):

  • Alternatively, you can save stdout to a file by running the following commands instead of the previous ones:

  • When you finish your tests, press Ctrl + C in the previous console to stop the debug process.

  • Set the right owner if you saved the output to a file:

  • Start the service:

My agent host is not showing up in the Manager Interface.

  1. Make sure you can reach the DEAM manager from your client:
    a. Linux: sudo telnet devo-ea-manager 8080
    b. Windows: Open a PowerShell console in admin mode and run: Test-NetConnection -ComputerName devo-ea-manager -InformationLevel Detailed -Port 8080

  2. Make sure there is no firewall or antivirus interfering with the connection.
    a. Windows: We have seen issues with some antivirus products. It may be necessary to create an outbound rule in both the antivirus and the Windows firewall to enable communication with the manager. After the rule is created, test again as described in the previous step. To check the firewall status: netsh firewall show state
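On Linux hosts where telnet is not installed, the reachability test in step 1 can be sketched with bash alone. The helper check_port is hypothetical; it relies on the /dev/tcp pseudo-device provided by bash and on the timeout utility from GNU coreutils, both assumed to be present on the agent host.

```shell
#!/bin/sh
# check_port HOST PORT: succeed if a TCP connection to HOST:PORT can be
# opened within 5 seconds. Uses bash's /dev/tcp redirection, so bash
# must be installed even if the calling shell is different.
check_port() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage on the agent host (not executed here):
#   check_port devo-ea-manager 8080 && echo reachable || echo unreachable
```

A failure here means the problem is in name resolution, routing or a firewall, not in the agent configuration.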

Will not autoload extension with unsafe directory permissions

Osquery will refuse to load an extension from the filesystem if the file’s permissions allow it to be written or modified by accounts that lack the required privileges. The EA installation script should take care of this, but in case of error, make sure that the extension files are owned by the root account.

On Windows, because of permission inheritance, just changing the owner of a file is not sufficient. You must also change the owner of the parent directory, remove all inherited DACLs, and disable inheritance. The EA installation script should take care of the permissions, but in case of issues, the following commands will set permissions that satisfy osquery:

Make sure that your organization is not globally adding extra permissions to the files in "C:\Program Files\osquery\osqueryd". Write/modify permissions should only be given to privileged accounts like “Administrators” or “System”.

macOS: extension cannot be opened because the developer cannot be verified

To run on macOS, software needs to be signed and notarized; otherwise Gatekeeper will block it, as you might have noticed.

Osquery itself is signed by the osquery developers, and we are working to be able to sign our extensions (which run on top of osquery).

Until Devo is able to sign its own software for macOS, the only option is for the user to manually allow the extensions to run even though they are not signed. Follow the instructions in this link to do so.

Endpoint Agent Manager

This section describes typical trouble scenarios and troubleshooting guidelines for the manager (Fleet).

How to check DEAM logs

  • systemctl status devo-ea-manager to check status of the process.

  • journalctl -u devo-ea-manager to check manager logs.

  • journalctl -fu devo-ea-manager to check real time logs.

DEAM certificates were not properly generated or uploaded

If you see error messages similar to the following in the DEAM logs (journalctl -u devo-ea-manager):




If you provided your own custom certificates, ensure that they are placed in the provided-deam-certs folder under the devo-ea-deployer path.

Then run the ansible command again (follow steps described in the deployment guide).

If you are delegating creation of self-signed certificates to devo-ea-deployer, run ansible command again and pay attention to messages marked with the service-certificates tag. This can help you to identify the root cause.

Domain certificates were not properly configured


If you see error messages similar to the following in the DEAM logs (journalctl -u devo-ea-manager):














The most likely cause is that the domain certificates were not properly configured. Ensure that you deploy them following instructions described in the deployment guide.

One point to check is that the certificates are owned by the Ansible user configured in the inventory, and not by root.

Then run the ansible command again (follow steps described in the deployment guide).

MySQL issues


Error 500 after login / "mysql":"could not connect to db: dial tcp [::1]:3306: connect: connection refused

If you get a 500 error right after logging in, and a screen similar to the following, you are likely running into MySQL issues. Verify that you have connectivity with MySQL.


The EA Manager may also show some of the following errors:






If you are using the dockerized version of the internal services (MySQL and Redis), check that the containers are up and running on the EA Manager server with: sudo docker ps -a.

Execute the following to restart the internal services:


REDIS issues

REDIS is not in the critical path unless the labeling feature in the EA Manager is in use. REDIS issues will surface with the following symptoms:

  • Cannot run Live Queries.

  • Labeling does not work.

  • Errors in EA Manager logs.

If you see an error similar to the following when accessing the Queries menu in the Web UI, it's likely your REDIS instance is not available:

 

Send events through a Relay

Assuming that the in-house relay IP is 192.168.43.147, configure deam_relay_entrypoint: tcp://192.168.43.147:13000
Example snippet of an inventory file based on that configuration:
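A minimal sketch of such a fragment. Placing the variable under all->vars is an assumption about the inventory layout; only the variable name and value come from the text above.

```yaml
# Hypothetical inventory excerpt; the position under all -> vars is assumed.
all:
  vars:
    deam_relay_entrypoint: tcp://192.168.43.147:13000
```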


Modify listen port of Devo EA package repository

Ensure that port 8081 is available and is not in use by another service in the client infrastructure. To override the port, add the following parameter to the inventory file:


Events are not ingested in right Devo domain

Ensure that the right Devo domain certificates were configured. You can inspect the domain.crt certificate with the following command:


Then look for a line similar to:


The CN value should be the domain name.

In the same output, you can look for a line similar to:


CN = userAWSxxxxx indicates which site created the certificates: US in this example, which corresponds to https://us.devo.com
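The CN lookup described above can be sketched as follows. The helper extract_cn is hypothetical; it reads a subject or issuer line as printed by openssl x509 and prints the CN value. It is assumed here that openssl is installed and that domain.crt is in the current directory.

```shell
#!/bin/sh
# extract_cn: read a subject/issuer line on stdin and print its CN value.
# Handles both "CN = name" (newer openssl) and "/CN=name" (older openssl).
extract_cn() {
    sed -n 's|.*CN *= *\([^,/]*\).*|\1|p'
}

# Usage (not executed here):
#   openssl x509 -in domain.crt -noout -subject | extract_cn   # domain name
#   openssl x509 -in domain.crt -noout -issuer  | extract_cn   # issuing site
```

The subject CN should be the Devo domain name, and the issuer CN identifies the site that created the certificates.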

Missing --devo_relay/KOLIDE_DEVO_RELAY setting in deam-fleet configuration 

When the Devo relay property is not set, you will see traces similar to the following in the DEAM logs and the process does not start.


The devo-ea-deployer installation procedure fills this value from the deam_relay_entrypoint variable (set to tcp://us.elb.relay.logtrust.net:443 by default). By default, the value is assigned to the KOLIDE_DEVO_RELAY environment variable configured in the /etc/devo-ea-manager/devo-ea-manager file.

Sending Windows Events for testing, using the command line

Windows Events are generated by the system according to internal conditions that system administrators do not control directly. For testing purposes, it is sometimes useful to send certain events on demand, for instance to check that events of a given type are being received. This is possible with the Windows utility eventcreate from the command line. For example:

eventcreate /t ERROR /id 100 /l application /d "test event"

A complete description of the tool and more examples can be found here.

Repeated errors when selecting a connection from the pool to send data to Devo

There are several reasons why the EA Manager may not be able to open a connection to Devo. Sometimes the number of events being sent is too large for the default number of sockets that the EA Manager opens toward Devo.

If this happens, you should see errors in the EA Manager with the component Devo-logging and errors in the Obtaining new connection from pool step.

The Ansible playbooks give you the option of increasing the number of sockets that the EA Manager opens toward Devo and the number of retries to perform when attempting to secure a socket for sending data. By default, the EA Manager opens five sockets to send results data and one socket to send status data.

The following variables can be added to your current inventory to change that configuration:

  • deam_devo_result_max_pool_size. Maximum connections in the pool used to send result events to Devo. Default value 5.

  • deam_devo_result_min_pool_size. Minimum connections in the pool used to send result events to Devo. Default value 1.

  • deam_devo_status_max_pool_size. Maximum connections in the pool used to send status events to Devo. Default value 1.

  • deam_devo_status_min_pool_size. Minimum connections in the pool used to send status events to Devo. Default value 1.

  • deam_devo_client_pool_retries. Number of retries before discarding the connection and creating a new one in the Devo pools, when the current pool connection returns any kind of error. 0 means that the connection is discarded on every error and a new one is created. Default value 0.

Remember to run the playbook after making the changes in order for them to take effect. Changes will be reflected in /etc/devo-ea-manager/devo-ea-manager of every EA Manager node.
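As an illustration, the variables listed above could be set in the inventory as in the fragment below. The values are arbitrary examples, not recommendations, and placing them under all->vars is an assumption about the inventory layout.

```yaml
# Hypothetical inventory excerpt; values are illustrative only.
all:
  vars:
    deam_devo_result_max_pool_size: 10
    deam_devo_result_min_pool_size: 2
    deam_devo_status_max_pool_size: 2
    deam_devo_status_min_pool_size: 1
    deam_devo_client_pool_retries: 3
```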

cookiecutinv.py does not work in old distributions.

This issue happens in old Linux distributions. We identified it in RedHat 7, but it can happen in other old Linux versions too. cookiecutinv.py displays the following error message as soon as it starts to create the inventory file:

The main cause is a very old version of the libraries required by cookiecutinv.py. The tools share their dependency libraries with Ansible, and the versions of those libraries installed in these old distributions are very old.

To fix the problem, install a more recent Ansible version, with the mentioned dependencies updated, using the getansible.sh script. This script installs Ansible 2.9 only for the user that executes it.

Follow the next steps to complete the installation:

  • Remove the Ansible package installed with the package manager. Remember that the ansible command will no longer be available to other users of the host after this step (yum is the package manager used in this example; use the appropriate tool for your case):

  • Run getansible.sh, downloading it directly from the public Devo agent repository.
    curl -sSu '<<DEA_AGENT_REPO_USER>>:<<DEA_AGENT_REPO_PASSWD>>' https://d2ur64jmn3k7yt.cloudfront.net/gtls/getansible.sh | /bin/bash
    NOTE: sudo access is required in order to install package dependencies.
    Where:

    • <<DEA_AGENT_REPO_USER>> is the username required to authenticate in the Devo agent repository

    • <<DEA_AGENT_REPO_PASSWD>> is the password required to authenticate in the Devo agent repository

  • Be aware that you might need to close your session and open a new one for the ansible command to become available to the user that ran the script.

  • Run cookiecutinv.py to create the inventory and continue with the installation procedure.