HashiCorp Nomad: job fails to place with "missing drivers"

29 Nov 2018, 00:00

hashicorp / nomad / docker

The Problem:

Running a job on Nomad fails to place its allocation, and the scheduler reports a "missing drivers" constraint:

[root@theargo ~]# nomad job run example.nomad
==> Monitoring evaluation "3ed6ecae"
    Evaluation triggered by job "example"
    Evaluation within deployment: "f2416c97"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "3ed6ecae" finished with status "complete" but failed to place all allocations:
    Task Group "cache" (failed to place 1 allocation):
      * Constraint "missing drivers" filtered 1 nodes
    Evaluation "d88ac9d7" waiting for additional capacity to place remainder

The Solution:

Verify that Docker is installed and running on the Nomad client node, and that the user running the Nomad agent has permission to use Docker.
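
The checks above can be sketched as a small script. This is only an illustration, not part of Nomad: the function name docker_problems is hypothetical, and the socket path is taken from the agent log further down. Run it as the same user that runs the Nomad agent.

```python
import os
import shutil


def docker_problems(socket_path="/var/run/docker.sock", binary="docker"):
    """Return a list of reasons Docker may be unusable for the current user.

    Hypothetical helper for illustration; an empty list means no problem
    was found by these basic checks.
    """
    problems = []
    # Is the Docker CLI on PATH for this user?
    if shutil.which(binary) is None:
        problems.append(f"{binary!r} not found on PATH")
    # With the default configuration the daemon creates this socket on start,
    # so a missing socket usually means the daemon is not running.
    if not os.path.exists(socket_path):
        problems.append(f"{socket_path} does not exist (daemon not running?)")
    # Talking to the daemon requires read and write access to the socket.
    elif not os.access(socket_path, os.R_OK | os.W_OK):
        problems.append(f"current user cannot read/write {socket_path}")
    return problems
```

Calling docker_problems() with no arguments and printing the result gives a quick list of what to fix before retrying the job.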

Backstory:

When the Nomad client starts, it fingerprints the node to detect the available drivers and machine specs. If a fingerprint is not detected, Nomad re-runs it on a schedule (every 15 seconds for the Docker driver in the log below). This covers the case where Nomad starts before Docker: a later fingerprint run registers the Docker driver once the daemon is up. Once Docker is installed and running, Nomad detects the driver, and the previously failed job can be placed and started successfully using the Docker driver. In the agent log below, the Docker driver cannot reach /var/run/docker.sock, so only exec and raw_exec are detected:

==> Nomad agent started! Log data will stream in below:

    2018/11/30 05:09:13 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.46.0.6:4647 Address:10.46.0.6:4647}]
    2018/11/30 05:09:13 [INFO] serf: EventMemberJoin: packer-hashistack-test-1.global 10.46.0.6
    2018/11/30 05:09:13.765791 [INFO] nomad: starting 1 scheduling worker(s) for [service batch system _core]
    2018/11/30 05:09:13.766198 [INFO] client: using state directory /opt/nomad/data/client
    2018/11/30 05:09:13.768161 [INFO] client: using alloc directory /opt/nomad/data/alloc
    2018/11/30 05:09:13.770668 [DEBUG] client.fingerprint_manager: built-in fingerprints: [arch cgroup consul cpu host memory network nomad signal storage vault env_aws env_gce]
    2018/11/30 05:09:13.770903 [INFO] fingerprint.cgroups: cgroups are available
<snip>
    2018/11/30 05:09:13.773956 [DEBUG] fingerprint.cpu: frequency: 2199 MHz
    2018/11/30 05:09:13.773960 [DEBUG] fingerprint.cpu: core count: 1
    2018/11/30 05:09:13.776533 [DEBUG] client.fingerprint_manager: fingerprinting consul every 15s
    2018/11/30 05:09:13.776574 [WARN] fingerprint.network: Unable to parse Speed in output of '/usr/sbin/ethtool lo'
    2018/11/30 05:09:13.776654 [DEBUG] fingerprint.network: Unable to read link speed from /sys/class/net/lo/speed
    2018/11/30 05:09:13.776658 [DEBUG] fingerprint.network: link speed could not be detected and no speed specified by user. Defaulting to 1000
    2018/11/30 05:09:13.776721 [DEBUG] fingerprint.network: Detected interface lo with IP: 127.0.0.1
    2018/11/30 05:09:13.776723 [DEBUG] fingerprint.network: Detected interface lo with IP: ::1
    2018/11/30 05:09:13.779256 [DEBUG] client.fingerprint_manager: fingerprinting vault every 15s
    2018/11/30 05:09:13.785514 [DEBUG] fingerprint.env_gce: Could not read value for attribute "machine-type"
    2018/11/30 05:09:13.790288 [DEBUG] client.fingerprint_manager: detected fingerprints [arch cgroup cpu host network nomad signal storage]
    2018/11/30 05:09:13.790343 [DEBUG] driver.docker: using client connection initialized from environment
    2018/11/30 05:09:13.790395 [DEBUG] driver.docker: using client connection initialized from environment
    2018/11/30 05:09:13.790544 [DEBUG] driver.docker: could not connect to docker daemon at unix:///var/run/docker.sock: Get http://unix.sock/version: dial unix /var/run/docker.sock: connect: no such file or directory
    2018/11/30 05:09:13.790609 [DEBUG] driver.exec: exec driver is enabled
    2018/11/30 05:09:13.790632 [WARN] driver.raw_exec: raw exec is enabled. Only enable if needed
    2018/11/30 05:09:13.790776 [DEBUG] client.fingerprint_manager: detected drivers [exec raw_exec]
    2018/11/30 05:09:13.790971 [DEBUG] client.fingerprint_manager: fingerprinting driver docker every 15s
    2018/11/30 05:09:13.790979 [DEBUG] client.fingerprint_manager: health checking driver docker every 1m0s
    2018/11/30 05:09:13.790986 [DEBUG] client.fingerprint_manager: fingerprinting driver exec every 15s
    2018/11/30 05:09:13.791008 [DEBUG] client.fingerprint_manager: fingerprinting driver rkt every 15s
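
The periodic fingerprinting seen in the log above (the Docker driver is re-checked every 15s) can be sketched roughly as follows. This is a simplified illustration and not Nomad's actual implementation, which is written in Go; the names docker_socket_present and wait_for_driver are hypothetical, and the socket check merely stands in for the real driver fingerprint.

```python
import os
import time


def docker_socket_present(socket_path="/var/run/docker.sock"):
    """Stand-in for a driver fingerprint: does the Docker socket exist?"""
    return os.path.exists(socket_path)


def wait_for_driver(fingerprint, interval=15.0, max_attempts=4, sleep=time.sleep):
    """Re-run `fingerprint` every `interval` seconds until it succeeds.

    Returns the 1-based attempt on which the driver was detected, or None
    if it was not detected within `max_attempts` attempts.
    """
    for attempt in range(1, max_attempts + 1):
        if fingerprint():
            return attempt
        if attempt < max_attempts:
            sleep(interval)  # wait before the next scheduled run
    return None
```

For example, wait_for_driver(docker_socket_present) returns 1 immediately if the socket already exists, and otherwise keeps re-checking at the given interval, which mirrors how the driver shows up on a later fingerprint run once Docker has been started.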

After starting Docker, planning and running the job succeeds:

[root@theargo ~]# systemctl start docker
[root@theargo ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2018-11-30 05:46:50 UTC; 5s ago
     Docs: http://docs.docker.com
 Main PID: 1636 (dockerd-current)
   CGroup: /system.slice/docker.service

[root@theargo ~]# nomad job plan example.nomad
+ Job: "example"
+ Task Group: "cache" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 example.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
[root@theargo ~]# nomad job run -check-index 0 example.nomad
==> Monitoring evaluation "7e38484b"
    Evaluation triggered by job "example"
    Evaluation within deployment: "f2ceacc1"
    Allocation "fdbfd58c" created: node "0cc8c9cd", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7e38484b" finished with status "complete"