Troubleshooting for Endpoint Agent
This document is intended for people in charge of deploying or administering the Devo Endpoint Agent (EA). It describes ways of troubleshooting Devo EA and common error cases that have been identified in the field.
Deployment
This section describes typical trouble scenarios and troubleshooting guidelines for deployment scenarios.
Controlled error messages during deployment
Some deployment tasks run commands or operations that can end in an error by design. These tasks check services or configurations, and then either apply customized changes or skip those configurations. The text ...ignoring is displayed right after the error message and helps identify these controlled errors.
For example, a task that checks whether the firewalld service is running:
This error appears when the firewalld service is not present or is stopped. It can be safely ignored and does not affect the deployment. As a rule, any message tagged ...ignoring can be safely ignored and has no effect on the deployment sequence.
Timeout when waiting for 127.0.0.1:8080
The deployment process cannot connect to the interface that Fleet starts on port 8080, or the connection took longer than 60 seconds. Typically, there has been a problem in the deployment sequence and the Fleet instance could not boot up. Check the Endpoint Agent Manager section and see if the logs give any hint as to what the problem may be.
The most common root cause for this error message is that the certificates used to send data from the EA Manager to Devo have not been properly configured. Make sure that all the steps in the guide have been followed. In a default installation, the domain-certs folder should look like the following screenshot:
Make sure that the certificate names match those in the screenshot above.
Also, make sure that the collector configured in the entrypoint is correct as explained in the deployment guide.
Proxy is needed to access Internet
Set up the following settings:
Enable and correctly set the http_proxy and https_proxy environment variables.
Enable the proxy for the docker environment as described here, by editing the file ~/.docker/config.json.
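As a minimal sketch of both settings, assuming a hypothetical proxy at http://proxy.example.local:3128 (replace it with your own), the following exports the variables for the current shell and prints a Docker client configuration that uses the same proxy. The helper name docker_proxy_config is illustrative and is not part of the deployer.

```shell
# Hypothetical proxy address; replace it with the proxy of your environment.
PROXY_URL="http://proxy.example.local:3128"

# Environment variables read by curl, pip, ansible-galaxy and similar tools.
export http_proxy="$PROXY_URL"
export https_proxy="$PROXY_URL"

# Print a Docker client configuration that uses the given proxy.
# Review the output before saving it as ~/.docker/config.json.
docker_proxy_config() {
  cat <<EOF
{
  "proxies": {
    "default": {
      "httpProxy": "$1",
      "httpsProxy": "$1"
    }
  }
}
EOF
}

docker_proxy_config "$PROXY_URL"
```

If ~/.docker/config.json already exists, merge the proxies key into it rather than overwriting the file.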
Shared connection closed
If you see this error during the deployment process (it usually happens at the beginning and prevents the process from continuing):
TASK [duam-internal-services : Set hostname with name in inventory] ************
fatal: [devo-ua-manager]: FAILED! => {"changed": false, "module_stderr": "Shared connection to 10.239.74.38 closed.\r\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}
The reason for this issue is that Ansible is not able to make use of SSH to properly perform the deployment. Ansible will connect via SSH to every server included in the inventory to perform the installation. The solution is to fix the environment so Ansible can make proper use of SSH.
If the deployment is being done on a single server, the following workaround has proven to work:
Edit your inventory file.
Add ansible_connection: local in the all->vars section.
Save your inventory file.
Run the playbook again.
Be aware that this solution tells Ansible not to use SSH, so it is only valid for local deployments.
Docker-compose: error while loading shared libraries: libz.so.1: failed to map segment from shared object
If you see this error during the deployment process (only in the case that you have configured dockerized versions of MySQL and Redis in your inventory file):
TASK [duam-internal-services : Deploy and configure int-srv for DEA manager (RedHat / CentOS)] ************
fatal: [devo-ua-manager]: FAILED! => {"changed": true, "cmd": "cd /srv/duam-internal-services\n/usr/local/bin/docker-compose up -d\n", "delta": "0:00:00.141889", "end": "2021-07-13 16:11:48.867585", "msg": "non-zero return code", "rc": 127, "start": "2021-07-13 16:11:48.725696", "stderr": "/usr/local/bin/docker-compose: error while loading shared libraries: libz.so.1: failed to map segment from shared object: Operation not permitted", "stderr_lines": ["/usr/local/bin/docker-compose: error while loading shared libraries: libz.so.1: failed to map segment from shared object: Operation not permitted"], "stdout": "", "stdout_lines": []}
The reason for this issue is usually a problem with /tmp permissions. Permissions are managed both by the filesystem and by the mount options. If /tmp is a mount point with the noexec option set, this error appears even though the filesystem permissions are properly granted (read, write and exec: 777).
The following command checks for the noexec mount option:
$ mount | grep "/tmp" | fgrep noexec
Empty output is expected if the noexec option is not set or /tmp is not a mount point. Otherwise, the output will be similar to:
To temporarily fix the noexec permission issue, remount /tmp with the exec option:
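A minimal sketch of the check and the remount, assuming a Linux host where mount prints lines in the form "source on target type fstype (options)". The function names are illustrative; mount -o remount,exec is the standard way to change the option on a mounted filesystem and requires root.

```shell
# Succeeds when the mount table read from stdin lists /tmp with noexec.
# Expects lines in the Linux `mount` format:
#   <source> on <target> type <fstype> (<options>)
tmp_is_noexec() {
  awk '$3 == "/tmp" && $6 ~ /[(,]noexec[,)]/ { found = 1 } END { exit !found }'
}

# Remount /tmp with exec until the next reboot (requires root).
remount_tmp_exec() {
  sudo mount -o remount,exec /tmp
}

# Usage: remount only when the check reports noexec.
#   mount | tmp_is_noexec && remount_tmp_exec
```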
Then run the ansible-playbook ... command again.
Be aware that this fix is temporary: the exec permission lasts only until the VM is restarted. To persist the change, follow the appropriate procedure for your Linux distribution (editing /etc/fstab, the tmp.mount service, ...).
Unable to find any of pip3/pip to use. pip needs to be installed (can appear with both pip or pip3)
If, once the requirements are installed by running:
ansible-galaxy install -r requirements.txt
... you still get the following error when launching the devo-endpoint-agent playbook...
From now on it is assumed that the issue is related to pip3, but the procedure works the same way with pip.
Since Ansible uses sudo to execute the tasks, check whether pip3 is in the root user's $PATH using which:
This is an example; paths may differ depending on the host.
Since /usr/bin is not included in $PATH, it must be added:
And then launch the playbook again:
If the error persists, it may be due to the sudoers secure_path setting, which limits the value of the root user's $PATH environment variable.
To check it, use the following:
If nothing is returned, secure_path must be added to the sudoers file.
First, make a backup copy of the sudoers file:
Modify the original and check that it is valid:
And then launch the playbook again:
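The checks in this section can be sketched as follows. This is an illustration under assumptions, not the exact commands of the original procedure: the function names and the backup file name /etc/sudoers.bak are hypothetical, and the sudoers file is assumed to be /etc/sudoers.

```shell
# Succeeds when directory $1 is an entry of the colon-separated list $2.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the secure_path line of a sudoers file (default: /etc/sudoers).
# The real file is only readable by root, so run this as root or on a copy.
sudoers_secure_path() {
  grep -E '^[[:space:]]*Defaults[[:space:]]+secure_path' "${1:-/etc/sudoers}"
}

# Usage sketch (run manually, as a user with sudo access):
#   sudo which pip3                          # is pip3 visible to root?
#   sudo cp /etc/sudoers /etc/sudoers.bak    # backup before editing
#   sudo visudo                              # edit secure_path safely
#   sudo visudo -c                           # validate the sudoers syntax
```

Editing through visudo rather than a plain editor is a deliberate choice: it refuses to save a file with syntax errors, which avoids locking yourself out of sudo.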
Unsupported nginxinc dependency roles
If you see one or both of the following errors, or similar ones:
or
These errors are usually related to the version of the nginxinc dependency roles.
Follow these steps to fix it:
Edit the playbooks/roles/requirements.txt file (with vi, for example) and pin the versions of the nginxinc.nginx and/or nginxinc.nginx_config roles. The nginxinc.nginx version must be 0.21.0 and the nginxinc.nginx_config version must be 0.3.3:
Uninstall the current versions of these roles using ansible-galaxy:
Reinstall the role dependencies using ansible-galaxy with the -f option:
Continue from the step that calls the ansible-playbook command in the deployment/configuration process.
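The uninstall and reinstall steps above can be sketched as follows, assuming the commands are run from the folder that contains playbooks/roles/requirements.txt and that the versions in that file have already been pinned. The RUN variable and the function names are illustrative: by default the commands are only printed, so the sequence can be reviewed before it is executed.

```shell
# Dry run by default: each command is printed instead of executed.
# Set RUN to an empty string (RUN= ) to really execute them.
RUN="${RUN-echo}"

ROLES_FILE="playbooks/roles/requirements.txt"

# Uninstall the currently installed nginxinc roles.
remove_nginx_roles() {
  $RUN ansible-galaxy remove nginxinc.nginx nginxinc.nginx_config
}

# Reinstall every role listed in the requirements file, forcing overwrite.
reinstall_roles() {
  $RUN ansible-galaxy install -r "$ROLES_FILE" -f
}

remove_nginx_roles
reinstall_roles
```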
Cannot start NGINX (address already in use)
If the Ansible playbook fails because NGINX cannot start, with the following errors:
These errors occur because another service is running on port 80. The EA Manager is intended to run by itself and not to share resources with other products. While NGINX will only listen on port 8081 (by default), during the installation process it uses port 80 until port 8081 is configured.
To solve the issue, stop the service listening on port 80 and run the playbook again.
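To find out which service holds port 80, one option is to filter the listening sockets reported by ss. The sketch below assumes the column layout of ss -ltnp on a typical Linux host (local address and port in the fourth column); the function name is illustrative.

```shell
# Print the lines of `ss -ltnp` output (read from stdin) whose local
# address ends in the given port. Column 4 is "Local Address:Port".
listeners_on_port() {
  awk -v port="$1" '$1 == "LISTEN" && $4 ~ (":" port "$")'
}

# Usage (root is needed to see the owning process of every socket):
#   sudo ss -ltnp | listeners_on_port 80
```

The last column of each printed line names the process that owns the socket, which tells you which service to stop before running the playbook again.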
NGINX deployment management
If you are experiencing issues when deploying NGINX using Ansible, EA Deployer gives you the option to ignore the installation and do it manually. NGINX works decoupled from the EA Manager's normal service, and it is only used to provide a repository where you will be able to download the generated agents. This is only available from version 1.2.1 on.
Include one of the following variables in your inventory to disable NGINX deployment. Using this variable requires you to deploy the NGINX server manually:
dea_ap_deploy_nginx_software_base: completely disables the NGINX software deployment. However, the NGINX http server configuration (Ansible role: nginxinc.nginx_config) will still run and configure your service appropriately.
Agent repository (NGINX) returns 403
This issue applies when agent-repository returns a 403 status code even though you are sure that the user and password provided for basic authentication are correct. For example:
Usually the problem is related to the owner and permissions of the paths from which nginx loads the files it serves. The following commands (prefixed with $; the other lines are the expected output of the commands) help check the owner and permissions.
If one or more of them do not match the output shown above, the following commands should fix it (repeat on all nodes where agent-repository was deployed).
At this point you should be able to download the agent deployment archives from agent-repository.
Deployment pre-check of https://pkg.osquery.io/ URL fails
(Only in 1.2.1) You may see the following error during the execution of the pre-checks : Checking url access with curl Ansible task while the devo-ea-deployer.yaml playbook is running:
The failing URL reported in the error is 'url': 'https://pkg.osquery.io/'.
The reason is that the pkg.osquery.io service was migrated and now redirects (HTTP 302) instead of replying with an HTTP 200 as before. The new configuration allows redirections when performing the pre-check.
You can update the check parameters with the following command (assuming that the working directory is the path where devo-ea-deployer-1.2.1.tgz was extracted):
If no error is returned after running these commands, you will be able to run the devo-ea-manager.yaml playbook again.
Deployment pre-check of Devo certificates
If you see an error like the following during the pre-checks: Checking Devo certificates with tls connection Ansible step:
It is probably caused by deprecated Devo certificates (chain, .crt and .key). For security reasons, the platform now offers certificates with stronger encryption and is deprecating the previous ones. To solve it, create and download new certificates at Devo.
Certifi dependency library incompatible with python 2 up to devo-ea-deployer 1.2.x
If you see the following error, or a similar one, while the ansible-playbook command runs, when using python 2 to run Ansible commands on the target hosts:
It is probably caused by an unsupported version of the certifi python library installed on the target hosts during the deployment process.
This issue will be fixed in future devo-ea-deployer versions but affects versions prior to 1.2.X. Run the following commands on the target host (or target hosts) as a workaround (replacing 2.7 with the python 2.X version on your host).
Then run the ansible-playbook command again to continue with the deployment process.
Docker pull rate limit reached
If you see an error similar to the following while the ansible-playbook command runs:
The likely cause is that the docker client, in guest mode or authenticated with a free tier user, has reached the maximum number of pulls allowed for a given period of time.
To solve the issue, manually download the required docker images and load them locally. These images can be downloaded directly from the public HTTPS repository, which is the same repository that serves the devo-ea-deployer-XXXX.tgz package.
The commands listed below must be executed on the host defined in the all → children → deamintsrvs → hosts section of the inventory file, that is, the host where the MySQL and Redis services will be deployed using docker containers.
Follow the next steps to solve the problem.
Download the backup archive for each required image, replacing <DEA-REPO-USER> and <DEA-REPO-PASSWD> with the username and password to authenticate in the repository:
OPTIONAL: Check the md5 hash of the downloaded files, replacing <DEA-REPO-USER> and <DEA-REPO-PASSWD> with the username and password to authenticate in the repository:
Load the docker images:
Remove the temporary files:
Then run the ansible-playbook command again to continue with the deployment process.
Authentication error during dea-migrations playbook in upgrade to 1.3 procedure
Affected versions: 1.3.0.1
If you see an error similar to the one shown above while the following command, listed in Endpoint Agent 1.3 upgrade procedure 1.2, is executing:
The cause is a missing variable in the inventory file. The tools/miginvt.py tool should have added the variable in a previous step, but it did not create the new inventory properly.
To solve this issue, add the deam_admin_email variable with the value no-reply@localhost.local to the inventory file generated by the tools/miginvt.py tool. See the following example:
Save the file and continue with the procedure from the step that runs the dea-migrations.yaml playbook (included):
yum/dnf update fails on Devo UA Manager server with runc and containers-common in RHEL 8
If you see an error similar to the one below during the RHEL 8 OS package upgrade (yum / dnf):
It is probably caused by an incompatibility between docker-ce dependencies and the official RHEL 8 docker engine packages, as described here. The solution is to disable the yum module whose mission is to update the RHEL containerization engine.
This prevents the system from trying to install the Red Hat-specific container packages that conflict with the Docker installation. It does not remove any packages; it simply tells the system “Don’t try to install these packages.”
Then continue with the upgrade process.
'lib' has no attribute 'X509_V_FLAG_CB_ISSUER_CHECK' error deploying on RHEL 8
If you see an error similar to:
while you are deploying on RHEL 8, it is usually because the version of pyOpenSSL installed from a package is too old. To solve it, uninstall the OS package and run the following command on all manager nodes:
Then continue the installation of Devo EA 1.3.1 or higher.
update ca-certificates package in Ubuntu error: 'lib' has no attribute 'X509_V_FLAG_CB_ISSUER_CHECK'
If you see an error similar to the following while you are deploying on Ubuntu 18.04 or 20.04:
The likely cause is an incompatible version of pyOpenSSL installed by the python3-openssl package.
To solve this, run the following command on all manager nodes:
Then launch the Ansible playbook that returned the error, and continue with the procedure.
python setup.py egg_info error displayed by getansible-venv.sh on Amazon Linux distro
If you see an error similar to the following while you are installing Ansible with getansible-venv.sh on an Amazon Linux distribution (version 2 or earlier):
The likely cause is that python3 was previously installed on your host. Ansible 2.9, installed in a virtual environment by getansible-venv.sh, is incompatible with python 3 on Amazon Linux hosts. The solution is to remove python 3 from the host.
To solve this, run the following command on all manager nodes:
Then launch the command to download and run getansible-venv.sh and continue with the procedure.
Endpoint Agent client
This section describes typical trouble scenarios and troubleshooting guidelines for the client (OSQuery + extensions).
Effective configuration applied in a running and connected agent.
You can get the effective values of the configuration options set in each agent with the following SQL query:
A convenient way to run this query is the devo-ea-manager web UI, since in this scenario the agent is connected to DEAM.
Packages loaded in a running and connected agent.
You can get the packs loaded for each agent with the following SQL query:
A convenient way to run this query is the devo-ea-manager web UI, since in this scenario the agent is connected to DEAM.
Scheduled queries in a running and connected agent.
You can get the scheduled queries currently loaded for each agent with the following SQL query:
A convenient way to run this query is the devo-ea-manager web UI, since in this scenario the agent is connected to DEAM.
Temporarily set log-level to debug in agent.
You can temporarily set the agent logger level to debug and display the messages on stdout by following the next steps.
Windows platform
All commands detailed below must be run as administrator in a PowerShell console.
Ensure the osqueryd service is stopped:
Run osqueryd with Debug log-level (assuming the devo-ea agent was installed following the default installation):
Alternatively, you can save stdout to a file by running the following commands instead of the previous ones:
Press Ctrl + C in the PowerShell console to stop the current debug process when you finish your tests.
Start the service:
Linux platform
The following commands are based on the systemd init system. Adapt the service start/stop commands for other init systems.
Ensure the osqueryd service is stopped:
Run osqueryd with Debug log-level (assuming the devo-ea agent was installed following the default installation):
Alternatively, you can save stdout to a file by running the following commands instead of the previous ones:
Press Ctrl + C in the previous console to stop the current debug process when you finish your tests.
Set the right owner if you saved the output to a file:
Start the service:
macOS platform
Ensure the osqueryd service is stopped:
Run osqueryd with Debug log-level (assuming the devo-ea agent was installed following the default installation):
Alternatively, you can save stdout to a file by running the following commands instead of the previous ones:
Press Ctrl + C in the previous console to stop the current debug process when you finish your tests.
Set the right owner if you saved the output to a file:
Start the service:
My agent host is not showing up in the Manager Interface.
Make sure you can reach the DEAM manager from your client:
a. Linux: sudo telnet devo-ea-manager 8080
b. Windows: Open a PowerShell console in admin mode: Test-NetConnection -ComputerName devo-ea-manager -InformationLevel Detailed -Port 8080
Make sure there is no firewall or antivirus interfering with the connection.
a. Windows: We have seen issues with some installed antivirus products. It may be necessary to create an Outbound rule in both the antivirus and the Windows firewall to enable communication with the manager. After the rule is created, test again as described in the previous step. To check the firewall status: netsh firewall show state
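On Linux hosts where telnet is not installed, the same reachability test can be sketched with bash alone. This assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are available; the function name can_reach is illustrative.

```shell
# Succeeds when a TCP connection to host $1, port $2 opens within 5 seconds.
# Relies on bash's /dev/tcp redirection and on the coreutils `timeout` command.
can_reach() {
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage, with the manager host name and port used in this section:
#   can_reach devo-ea-manager 8080 && echo reachable || echo unreachable
```

A failure here only shows that the TCP connection could not be opened from this client; it does not tell whether the cause is name resolution, routing, or a firewall.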
Will not autoload extension with unsafe directory permissions
Osquery will refuse to load an extension from the filesystem if the file’s permissions allow it to be written or modified by accounts that lack the required privileges. The EA installation script should take care of this, but in case of error, make sure that the extension files are owned by the root account.
On Windows, because of permission inheritance, just changing the owner of a file is not sufficient. You must also change the owner of the parent directory, remove all inherited DACLs, and disable inheritance. The EA installation script should take care of the permissions, but in case of issues, the following commands will set permissions that satisfy osquery:
Make sure that your organization is not globally adding extra permissions to the files in "C:\Program Files\osquery\osqueryd". Write/modify permissions should only be given to privileged accounts like “Administrators” or “System”.
macOS: extension cannot be opened because the developer cannot be verified
To run on macOS, software needs to be signed and notarized; otherwise Gatekeeper will block it, as you might have noticed.
Osquery itself is signed by the osquery developers, and we are working to be able to sign our extensions (which run on top of osquery).
Until Devo is able to sign its own software for macOS, the only option is to manually allow the extensions to run as a user even though they are not signed. Follow the instructions in this link to do so.
Endpoint Agent Manager
This section describes typical trouble scenarios and troubleshooting guidelines for the manager (Fleet).
How to check DEAM logs
systemctl status devo-ea-manager to check the status of the process.
journalctl -u devo-ea-manager to check the manager logs.
journalctl -fu devo-ea-manager to check the logs in real time.
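When the full log is too long to read, journalctl can narrow it by time and priority. The sketch below uses standard journalctl options (-p, --since, --no-pager); the function name and the RUN dry-run switch are illustrative, and the one-hour default window is an arbitrary choice.

```shell
# Dry run by default: the command is printed instead of executed.
# Set RUN to an empty string (RUN= ) to really execute it.
RUN="${RUN-echo}"

# Show manager log entries of priority "err" or worse since the given time
# (default: the last hour), without paging.
deam_recent_errors() {
  $RUN journalctl -u devo-ea-manager -p err --since "${1:-1 hour ago}" --no-pager
}

deam_recent_errors
```

If the manager writes all of its messages at the same journal priority, the -p err filter may return nothing; in that case drop it and filter the output with grep instead.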
DEAM certificates were not properly generated or uploaded
If you see error messages similar to the following in DEAM logs (journalctl -u devo-ea-manager):
If you provided your own certificates, ensure that they are placed in the provided-deam-certs folder under the devo-ea-deployer path.
Then run the ansible command again (follow the steps described in the deployment guide).
If you are delegating the creation of self-signed certificates to devo-ea-deployer, run the ansible command again and pay attention to the messages marked with the service-certificates tag. This can help you identify the root cause.
Domain certificates were not properly configured
If you see error messages similar to the following in DEAM logs (journalctl -u devo-ea-manager):
The most likely cause is that the domain certificates were not properly configured. Ensure that you deploy them following the instructions described in the deployment guide.
One point to check is that the certificates are owned by the Ansible user configured in the inventory, and not by root.
Then run the ansible command again (follow the steps described in the deployment guide).
MySQL issues
Error 500 after login / "mysql":"could not connect to db: dial tcp [::1]:3306: connect: connection refused
If you get a 500 error right after logging in, and a screen similar to the following, you are likely running into MySQL issues. Verify that you have connectivity with MySQL.
The EA Manager may also show some of the following errors:
If you are using the dockerized version of the internal services (MySQL and Redis), check that the containers are up and running on the EA Manager server with: sudo docker ps -a.
Execute the following to restart the internal services:
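A minimal sketch of such a restart, assuming the internal services were deployed with docker-compose from /srv/duam-internal-services (the path and binary that appear in the deployment error shown earlier in this guide); adjust both if your installation differs. The function name and the RUN dry-run switch are illustrative.

```shell
# Dry run by default: each command is printed instead of executed.
# Set RUN to an empty string (RUN= ) to really execute them.
RUN="${RUN-echo}"

# Assumed location of the docker-compose project of the internal services.
INT_SRV_DIR="/srv/duam-internal-services"

# Restart the MySQL and Redis containers defined in the compose project.
# The subshell keeps the `cd` from changing the caller's working directory.
restart_internal_services() {
  (
    $RUN cd "$INT_SRV_DIR" &&
      $RUN /usr/local/bin/docker-compose restart
  )
}

restart_internal_services
```

After a real run, sudo docker ps -a should list both containers as up again.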
REDIS issues
REDIS is not in the critical path unless the labeling feature in the EA Manager is in use. REDIS issues will surface with the following symptoms:
Cannot run Live Queries.
Labeling does not work.
Errors in EA Manager logs.
If you see a similar error when accessing the Queries menu in the Web UI, it's likely your REDIS instance is not available:
Send events through a Relay
Assuming that the in-house relay IP is 192.168.43.147, you should configure deam_relay_entrypoint: tcp://192.168.43.147:13000.
Example snippet of an inventory file based on that configuration:
Modify listen port of Devo EA package repository
Ensure that port 8081 is available and not in use by another service in the client infrastructure. To override the port, add the following parameter to the inventory file:
Events are not ingested in right Devo domain
Ensure that the right devo-domain certificates were configured. You can inspect the domain.crt certificate with the following command:
Then look for a line similar to:
The CN value should be the domain name.
In the same output, you can look for a line similar to:
CN = userAWSxxxxx indicates which site created the certificates: US in the example, which matches https://us.devo.com.
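One way to print only the two fields discussed above is the openssl x509 subcommand. This is a sketch that assumes openssl is installed and that domain.crt is a PEM file; the function name is illustrative, and the original guide may use a different invocation.

```shell
# Print the subject and issuer of a PEM certificate (default: domain.crt).
cert_subject_issuer() {
  openssl x509 -in "${1:-domain.crt}" -noout -subject -issuer
}

# Usage, from the folder that holds the domain certificates:
#   cert_subject_issuer domain.crt
```

The subject line carries the CN with the domain name, and the issuer line carries the CN that identifies the site that created the certificate.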
Missing --devo_relay/KOLIDE_DEVO_RELAY setting in deam-fleet configuration
When the Devo relay property is not set, the process does not start and you will see traces similar to the following in DEAM logs.
The devo-ea-deployer installation procedure fills this value from the deam_relay_entrypoint variable (set to tcp://us.elb.relay.logtrust.net:443 by default). The value is assigned to the KOLIDE_DEVO_RELAY environment variable, which is configured in the /etc/devo-ea-manager/devo-ea-manager file by default.
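To confirm whether the variable is present, the environment file can be searched directly. The sketch below assumes the file holds one KEY=value assignment per line, optionally prefixed with export; the function name is illustrative.

```shell
# Print the KOLIDE_DEVO_RELAY line of the manager environment file.
# Returns a non-zero status when the variable is not defined in the file.
deam_relay_setting() {
  grep -E '^[[:space:]]*(export[[:space:]]+)?KOLIDE_DEVO_RELAY=' \
    "${1:-/etc/devo-ea-manager/devo-ea-manager}"
}

# Usage on an EA Manager node (the file may require root to read):
#   deam_relay_setting || echo "KOLIDE_DEVO_RELAY is not set"
```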
Sending Windows Events for testing, using the command line
Windows Events are generated by the system according to internal conditions that system administrators do not control directly. For testing purposes, it is sometimes useful to send certain events on demand, for instance to check that events of a given type are being received. This is possible from the command line using the Windows utility eventcreate. For example:
eventcreate /t ERROR /id 100 /l application /d "test event"
A complete description of the tool and more examples can be found here.
Repeated errors when selecting a connection from the pool to send data to Devo
There are several reasons why the EA Manager may not be able to open a connection to Devo. On some occasions the number of events being sent is too high, and the default number of sockets that the EA Manager opens toward Devo cannot handle that amount of data.
If this happens, you should see errors in the EA Manager with the component Devo-logging and errors in the Obtaining new connection from pool step.
The Ansible playbooks give you the option of increasing the number of sockets that the EA Manager opens toward Devo and the number of retries to perform when attempting to secure a socket for sending data. By default, the EA Manager opens five sockets to send results data and one socket to send status data.
The following variables can be added to your current inventory to change that configuration:
deam_devo_result_max_pool_size. Maximum connections in the pool used to send result events to Devo. Default value 5.
deam_devo_result_min_pool_size. Minimum connections in the pool used to send result events to Devo. Default value 1.
deam_devo_status_max_pool_size. Maximum connections in the pool used to send status events to Devo. Default value 1.
deam_devo_status_min_pool_size. Minimum connections in the pool used to send status events to Devo. Default value 1.
deam_devo_client_pool_retries. Number of retries before discarding the connection and creating a new one in the Devo pools, when the current pool connection returns any kind of error. 0 means the connection is discarded on every error and a new one is created. Default value 0.
Remember to run the playbook after making the changes in order for them to take effect. Changes will be reflected in /etc/devo-ea-manager/devo-ea-manager on every EA Manager node.
cookiecutinv.py does not work in old distributions.
This issue happens in old Linux distributions. We identified it in RedHat 7, but it can happen in other old Linux versions too. cookiecutinv.py displays the following error message as soon as it starts to create the inventory file:
The main cause is a very old version of the libraries required by cookiecutinv.py. Remember that the tools use the same dependency libraries as Ansible, but in this case the a and b versions installed in those old distributions are very old.
To fix the problem, you can install a more recent Ansible version, with the mentioned dependencies updated, using the getansible.sh script. This script installs Ansible 2.9 only for the user that executes the script.
Follow the next steps to complete the installation:
Remove the Ansible tool installed using the package manager. Remember that the ansible command will not be available to other host users after running this step (yum is the package manager used in this example; use the appropriate tool for your case):
Run getansible.sh, downloading it directly from the public Devo agent repository.
curl -sSu '<<DEA_AGENT_REPO_USER>>:<<DEA_AGENT_REPO_PASSWD>>' https://d2ur64jmn3k7yt.cloudfront.net/gtls/getansible.sh | /bin/bash
NOTE: sudo access is required in order to install package dependencies.
Where:
<<DEA_AGENT_REPO_USER>> is the username required to authenticate in the Devo agent repository
<<DEA_AGENT_REPO_PASSWD>> is the password required to authenticate in the Devo agent repository
Be aware that you might need to close and open a new session for the ansible command to become available to the user that ran the script.
Run cookiecutinv.py to create the inventory and continue with the installation procedure.