Monday, 14 April 2014

Self-registering Jenkins hosts with Docker and Ansible

At work (Sysart), we have had a lot of problems with Jenkins builds interfering with each other when running on the same slave. Problems ranged from port conflicts to builds trying to use the same Firefox instance during Selenium tests. The easiest way to solve this is to run only one build at a time per slave. But we have some decent hardware with Intel i7 processors (4 cores, HT enabled), so running one job at a time per slave is kind of wasteful.

Previously, we used oVirt to create virtual machines and then added them manually to Jenkins as slaves. But as we wanted 10+ slaves, this would've been tedious. Running a VM also has overhead, which starts to hurt pretty quickly.

So enter Docker, Ansible and the Swarm plugin.

The basic idea is to have a Docker image that connects to Jenkins immediately at start. The image contains everything needed for running our tests, including stuff required for Selenium tests like Firefox. Building images and creating containers is handled with Ansible's docker and docker_image modules; the actual starting and stopping of containers is done with systemd, mainly because I wanted to learn how to use that too :). Systemd also has systemd-journal, which is pretty amazing.
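
The container-runner.service.j2 template is not reproduced in this post. As a minimal sketch of what such a unit could look like, assuming it simply wraps "docker start -a" and "docker stop" for an already created container, something like the following would do. The helper name render_container_unit and every line of the unit are assumptions made for illustration, not the template actually used.

```shell
#!/bin/bash
# Hypothetical helper: print a systemd unit for one named container.
# Assumes the container already exists and that docker is in /usr/bin.
render_container_unit() {
    local name="$1"
    cat <<EOF
[Unit]
Description=Jenkins build container ${name}
Requires=docker.service
After=docker.service

[Service]
# -a attaches to the container, so the service lives as long as the container
ExecStart=/usr/bin/docker start -a ${name}
ExecStop=/usr/bin/docker stop ${name}
Restart=always

[Install]
WantedBy=multi-user.target
EOF
}

# Print the unit for a container called builder1
render_container_unit builder1
```

With one unit per container, each builder can be stopped, started and inspected with systemctl and journalctl like any other service.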

The image is built on the host that runs the containers for now, as that was just easier. I'm definitely going to look into using a Docker repository in the near future.

Volumes are used for the workspace, mainly to persist Maven repositories between restarts. I had some problems with write permissions on the first try, but resolved them with some bash scripting.
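
The permission fix boils down to checking who owns the mounted volume when the container starts and changing the owner only when needed. A minimal sketch of that check, assuming GNU stat and a process running as root inside the container (the function name ensure_owner is made up for this example):

```shell
#!/bin/bash
# Hypothetical helper: chown DIR to USER only when the current owner differs.
ensure_owner() {
    local dir="$1" user="$2" owner
    owner=$(stat -c %U "$dir")   # GNU stat: print the owner's user name
    if [ "$owner" != "$user" ]; then
        chown -R "$user:$user" "$dir"
    fi
}

# Example usage inside the container, as root:
#   ensure_owner /workspace jenkins
```

Skipping the chown when the owner is already right keeps restarts fast once the workspace has filled up with Maven artifacts.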

Started containers can have labels, which are added in the playbook through the docker module's "command" parameter. There's some funny quoting needed to get the parameters right; see "start.sh" for details.
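
The start.sh template further down builds the label argument with Jinja2 by joining the list with commas and leaving the flag out entirely when there are no labels. The same logic can be sketched in plain bash; the function name labels_flag and the example labels are assumptions for illustration only.

```shell
#!/bin/bash
# Hypothetical helper: turn a list of labels into a "-labels a,b,c" argument,
# or print nothing when no labels are given.
labels_flag() {
    if [ "$#" -eq 0 ]; then
        return 0
    fi
    local IFS=,            # "$*" joins arguments with the first character of IFS
    echo "-labels $*"
}

labels_flag selenium firefox   # prints: -labels selenium,firefox
```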

The main files are included below, and an example playbook with the module can be found on GitHub.

Of course, there were some problems doing this.

Ansible
  • The docker_image module reports changes every time. This effectively prevents using handlers to restart containers only when the image has really changed.
  • I couldn't get the uri module to accept multiple HTTP status codes as the return code ("Can also be comma separated list of status codes."). Most likely I just misunderstood the documentation.
  • The service module failed when starting/stopping container services, complaining that it couldn't parse JSON. Docker start and stop print the id of the container to stdout, so this might be the reason.

Docker
  • docker -d starts all the containers by default. This can be prevented by adding -r as a parameter, but that doesn't seem to have any effect when the service is restarted. If docker -d starts the containers, systemd then tries to start the container as well, which fails and causes a restart.
  • I couldn't get volumes to be chowned to the jenkins user. We need a non-root user for our tests, as we do some filesystem permission tests.
Jenkins
  • Slave removal is slow, which can easily cause problems as containers are stopped and restarted quickly. Luckily, removal can be checked via the REST API.
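
The REST check for slave removal is a polling loop: ask Jenkins for the node's api/json page until it answers 404. A minimal bash sketch of that loop, assuming curl is available; the names http_status, wait_until_removed and STATUS_CMD are made up for this example, and the defaults of 10 retries with a 5 second delay match the handler shown later.

```shell
#!/bin/bash
# Command used to fetch the status code; overridable so the loop can be
# exercised without a Jenkins server.
STATUS_CMD=${STATUS_CMD:-http_status}

# Print only the HTTP status code of a GET request to URL.
http_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Hypothetical helper: return 0 once URL answers 404, 1 if retries run out.
wait_until_removed() {
    local url="$1" retries="${2:-10}" delay="${3:-5}" i
    for ((i = 0; i < retries; i++)); do
        if [ "$("$STATUS_CMD" "$url")" = "404" ]; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# Example usage (host1-builder1 is a made-up node name):
#   wait_until_removed "http://jenkins.example.com/computer/host1-builder1/api/json"
```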

There are still a few things I'd like to add here:
  • Enable committing and downloading the container used for a given test run. This would be helpful in situations where tests were successful in a developer's environment but not on Jenkins. But then, developers should use the same base image as the test environment :)
  • Have a production image, which would be extended by the test image.

And a pro tip for image development: have two different images, "jenkins-slave" and "jenkins-slave-test". The "jenkins-slave-test" image inherits from "jenkins-slave", but has ENTRYPOINT overridden to "/bin/bash" so you can explore the image.
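
As a rough sketch of that tip, assuming a locally built "jenkins-slave" image, the test image needs only a two-line Dockerfile. The helper name write_test_dockerfile and the /tmp path are made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: write the Dockerfile of the exploration image into DIR.
write_test_dockerfile() {
    local dir="$1"
    mkdir -p "$dir"
    cat > "$dir/Dockerfile" <<'EOF'
FROM jenkins-slave
ENTRYPOINT ["/bin/bash"]
EOF
}

# Example usage (needs a local "jenkins-slave" image and a running docker daemon):
#   write_test_dockerfile /tmp/jenkins-slave-test
#   docker build -t jenkins-slave-test /tmp/jenkins-slave-test
#   docker run -i -t jenkins-slave-test
```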

So, here are the main parts of how this was done. I'm sure there are a lot of better ways to do things, so please, tell me :).

The jenkins_slaves.yml playbook is something like this:
- hosts: jenkins-slaves
  vars:
    - jenkins_master: "http://jenkins.example.com"
    - container_names: [ builder1, builder2, builder3, builder4, builder5, builder6 ]
  roles:
    - { role: docker-host, image_name: jenkins_builder }

The template for the Dockerfile is the following:
FROM fedora
MAINTAINER jyrki.puttonen@sysart.fi
RUN yum install -y java-1.7.0-openjdk-devel blackbox firefox tigervnc-server dejavu-sans-fonts dejavu-serif-fonts ImageMagick unzip ansible puppet git tigervnc
RUN useradd jenkins
ADD vncpasswd /home/jenkins/.vnc/passwd
RUN chown -R jenkins:jenkins /home/jenkins/.vnc
# Run as jenkins user. Biggest reason for this is that in our tests, we want
# to check some filesystem rights, and those tests will fail if the user is root.
#ADD http://maven.jenkins-ci.org/content/repositories/releases/org/jenkins-ci/plugins/swarm-client/1.15/swarm-client-1.15-jar-with-dependencies.jar /home/jenkins/
ADD swarm-client-1.15-jar-with-dependencies.jar /home/jenkins/
# Without this, maven has problems with umlauts in tests
ENV JAVA_TOOL_OPTIONS -Dfile.encoding=UTF8
#so vncserver etc use right directory
ENV HOME /home/jenkins
WORKDIR /home/jenkins/
ADD start.sh /home/jenkins/
RUN chmod 755 /home/jenkins/start.sh
ENTRYPOINT ["/home/jenkins/start.sh"]

start.sh starts the Jenkins swarm client:
#!/bin/bash
OWNER=$(stat -c %U /workspace)
if [ "$OWNER" != "jenkins" ]
then
    chown -R jenkins:jenkins /workspace
fi
# Use swarm client to connect to jenkins. Broadcast didn't work due to container networking,
# so easiest thing to do was just to set right address.
{% set labelscsv = labels|join(",") -%}
{% set labelsflag = '-labels ' + labelscsv -%}
su -c "/usr/bin/java -jar swarm-client-1.15-jar-with-dependencies.jar -master {{jenkins_master}} -executors 1 -mode {{mode}} {{ labelsflag if labels else '' }} -fsroot /workspace $@" - jenkins


vars/main.yml has the following variables defined:
docker_directory: "docker"
image_name: "igor-builder"
docker_file: "Dockerfile.j2"
docker_data_directory: "/data/docker"
image_build_directory: "{{docker_data_directory}}/{{image_name}}"

And tasks/main.yml is like this. There are a lot of comments inside, so I decided to include it here as is.

# As I want to control individual containers with systemd, install new unit
# file that adds "-r" to options so docker -d doesn't start containers.
# Without this, containers started by systemd would fail to start, and would be
# started again
- name: install unit file for docker
  copy: src=docker.service dest=/etc/systemd/system/docker.service
  notify:
    - reload systemd

# Install docker from updates-testing, as 0.9.1 is available there and it handles deleting containers better
- name: install docker
  yum: name=docker-io state=present enablerepo=updates-testing

- name: start docker service
  service: name=docker enabled=yes state=started

- name: install virtualenv
  yum: name=python-virtualenv state=absent

- name: install pip
  yum: name=python-pip state=present

# The docker module requires a docker-py version > 0.3, which is not in the Fedora repos, so install it with pip
- name: install docker-py
  pip: name=docker-py state=present

- name: create working directory {{image_build_directory}} for docker
  file: path={{image_build_directory}} state=directory

- name: install unit file for systemd {{container_names}}
  template: src=container-runner.service.j2 dest=/etc/systemd/system/{{item}}.service
  with_items: container_names
  notify:
    - enable services for {{container_names}}
    - reload systemd

# Setup files needed for building docker image for Jenkins usage
- name: Download swarm client
  get_url: url="http://maven.jenkins-ci.org/content/repositories/releases/org/jenkins-ci/plugins/swarm-client/1.15/swarm-client-1.15-jar-with-dependencies.jar" dest={{image_build_directory}}

- name: copy vnc password file
  copy: src=vncpasswd dest={{image_build_directory}}

- name: copy additional files
  copy: src={{item}} dest={{image_build_directory}}
  with_items: additional_files

- name: create start.sh
  template: src=start.sh.j2 dest={{image_build_directory}}/start.sh validate="bash -n %s"

- name: copy {{docker_file}} to host
  template: src="{{docker_file}}" dest="{{image_build_directory}}/Dockerfile"

# This is something I would like to do, but the docker module can't set volumes as rw:
# volumes="/data/builders/{{item}}:/home/jenkins/work:rw"
# Also I couldn't get the user set to "jenkins" for volumes
- name: create volume directories for containers
  file: path="/data/builders/{{item}}" state=directory
  with_items: container_names

#
# For some reason, this will always return changed
- name: build docker image {{ image_name }}
  docker_image: path="{{image_build_directory}}" name="{{image_name}}" state=present
  notify:
    - stop {{container_names}}
    - wait for containers to be removed on Jenkins side
    - remove {{container_names}}
    - create containers {{container_names}} with image {{image_name}}
    - wait for containers to be started
    - start {{container_names}}


And handlers/main.yml:
- name: reload systemd
  command: systemctl daemon-reload

# Can't use service here, Ansible fails to parse output
- name: enable services for {{container_names}}
  command: /usr/bin/systemctl enable {{ item }}
  with_items: container_names

# service cannot be used here, Ansible fails to parse output.
- name: stop {{container_names}}
  command: /usr/bin/systemctl stop {{ item }}
  # service: name={{ item }} state=stopped
  with_items: container_names

# Jenkins takes a while to remove slaves. If containers are started immediately, they will have names
# containing the IP address of the host in them. Ugly :(
- name: wait for containers to be removed on Jenkins side
  command: curl -s -w %{http_code} {{ jenkins_master }}/computer/{{ansible_hostname}}-{{item}}/api/json -o /dev/null
  register: result
  tags: check
  until: result.stdout.find("404") != -1
  retries: 10
  delay: 5
  with_items: container_names

- name: remove {{container_names}}
  docker: name="{{item}}" state=absent image="{{image_name}}"
  with_items: container_names

- name: create containers {{container_names}} with image {{image_name}}
  docker: image="{{image_name}}" name="{{item}}" hostname="{{item}}" memory_limit=2048MB state=present command="\"-name {{ansible_hostname}}-{{item}}\"" volumes="/data/builders/{{item}}:/workspace"
  with_items: container_names

- name: wait for containers to be started
  pause: seconds=10

- name: start {{container_names}}
  command: /usr/bin/systemctl start {{ item }}
  with_items: container_names