The Solr project uses a Jenkins instance provided by the Apache Software Foundation ("ASF") for running tests, validation, etc.
This file aims to document our [ASF Jenkins](https://ci-builds.apache.org/job/Solr/) usage and administration, to prevent it from becoming "tribal knowledge" understood by just a few.
We run a number of jobs on Jenkins, each validating an overlapping set of concerns:
- `Solr-Artifacts-*` - daily jobs that run `./gradlew assemble` to ensure that build artifacts (except docker images) can be created successfully
- `Solr-Lint-*` - daily jobs that run static analysis (i.e. `precommit` and `check -x test`) on a branch
- `Solr-Test-*` - "hourly" jobs that run all (non-integration) tests (i.e. `./gradlew test`)
- `Solr-TestIntegration-*` - daily jobs that run project integration tests (i.e. `./gradlew integrationTests`)
- `Solr-Docker-Nightly-*` - daily jobs that run `./gradlew testDocker dockerPush` to validate docker image packaging. Snapshot images are pushed to hub.docker.com
- `Solr-reference-guide-*` - daily jobs that build the Solr reference guide via `./gradlew checkSite` and push the resulting artifact to the staging/preview site `nightlies.apache.org`
- `Solr-Smoketest-*` - daily jobs that produce a snapshot release (via the `assembleRelease` task) and run the release smoketester
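For reference, the Gradle tasks named above can also be run locally from a Solr checkout to reproduce roughly what each job validates; a minimal sketch (task names taken from the job descriptions above):

```sh
# Roughly the per-job validation, run from the root of a Solr checkout
./gradlew assemble          # Solr-Artifacts-*: build distribution artifacts
./gradlew precommit         # Solr-Lint-*: static analysis
./gradlew check -x test     # Solr-Lint-*: all checks except the test suites
./gradlew test              # Solr-Test-*: unit tests
./gradlew integrationTests  # Solr-TestIntegration-*: integration tests
./gradlew checkSite         # Solr-reference-guide-*: build the reference guide
```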
Most jobs that validate particular build artifacts are run "daily", which is sufficient to prevent any large breaks from creeping into the build. On the other hand, jobs that run tests are triggered "hourly" in order to squeeze as many test runs as possible out of our Jenkins hardware. This is a necessary consequence of Solr’s heavy use of randomization in its test-suite. "Hourly" scheduling ensures that a test run is either currently running or in the build queue at all times, and enables us to get the maximum data points from our hardware.
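Because each run picks a fresh random seed, every additional run adds coverage. When a Jenkins run does fail, the reported seed can usually be replayed locally; a minimal sketch, assuming the randomizedtesting seed property understood by the Solr Gradle build (the test failure output shows the exact reproduce line to use):

```sh
# Hypothetical example: re-run a single failing test class with the seed Jenkins reported
./gradlew test --tests "org.apache.solr.SomeFailingTest" -Ptests.seed=DEADBEEF
```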
All Solr jobs run on Jenkins agents marked with the 'solr' label. Currently, this maps to two Jenkins agents:
- `lucene-solr-1` - available at lucene1-us-west.apache.org
- `lucene-solr-2` - available (confusingly) at lucene-us-west.apache.org
These agents are "project-specific" VMs shared by the Lucene and Solr projects. That is: they are VMs requested by a project for their exclusive use. (INFRA policy appears to be that each Apache project may request 1 dedicated VM; it’s unclear how Solr ended up with 2.)
Maintenance of these agent VMs falls into a bit of a gray area. INFRA will still intervene when asked: to reboot nodes, to deploy OS upgrades, etc. But some of the burden also falls on the Lucene and Solr project teams to monitor the VMs and keep them healthy.
With a few steps, Solr committers can access our project’s Jenkins agent VMs via SSH to troubleshoot and resolve issues.
- Ensure your account on id.apache.org has an SSH key associated with it.
- Ask INFRA to give your Apache ID SSH access to these boxes. (See [this JIRA ticket](https://issues.apache.org/jira/browse/INFRA-3682) for an example.)
- SSH into the desired box with `ssh <apache-id>@$HOSTNAME` (where `$HOSTNAME` is either `lucene1-us-west.apache.org` or `lucene-us-west.apache.org`), as in the example below.
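For example, a minimal sketch of the login step (the Apache ID `jdoe` is a placeholder; substitute your own):

```sh
# Connect to the first agent VM using the SSH key registered on id.apache.org
ssh jdoe@lucene1-us-west.apache.org
```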
Often, SSH access on the boxes is not sufficient, and administrators require "root" access to diagnose and solve problems. Sudo/su privileges are obtained via a one-time password ("OTP") challenge, managed by the "Orthrus PAM" module. Users in need of root access can perform the following steps:
1. Open the ASF's [OTP Generator Tool](https://selfserve.apache.org/otp-calculator.html) in your browser of choice.
2. Run `ortpasswd` on the machine. This will print out an OTP "challenge" (e.g. `otp-md5 497 lu6126`) and provide a password prompt. This prompt should be given an OTP password, generated in steps 3-5 below.
3. Copy the "challenge" from the previous step into the relevant field on the "OTP Generator Tool" form.
4. Choose a password to use for OTP challenges (or recall one you've used in the past), and type it into the relevant field on the "OTP Generator Tool" form.
5. Click "Compute", and copy the first line from the "Response" box into your SSH session's password prompt. You're now established in the "Orthrus PAM" system.
6. Run a command requesting `su` escalation (e.g. `sudo su -`). This should print another "challenge" and password prompt; repeat steps 3-5 to answer it.
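Put together, the commands on the agent VM look roughly like the sketch below (both commands come from the steps above; the challenges and prompts they print will differ from machine to machine):

```sh
# On the agent VM: set up your Orthrus OTP entry. This prints a challenge and
# prompts for an OTP password, generated via the OTP Generator Tool (steps 3-5).
ortpasswd

# Then request escalation; this prints another challenge, answered the same way.
sudo su -
```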
If this fails at any point, open a ticket with INFRA. You may need to be added to the 'sudoers' file for the VM(s) in question.
One recurring problem with the Jenkins agents is that they periodically run out of disk space. Usually this happens when enough job "workspaces" are orphaned or left behind that they consume all of the agent's disk space.
Solr Jenkins jobs are currently configured to clean up the previous workspace at the start of the subsequent run. This avoids orphans in the common case but leaves workspaces behind any time a job is renamed or deleted (as happens during the Solr release process).
Luckily, this has an easy fix: SSH into the agent VM and delete any workspaces that are no longer needed in `/home/jenkins/jenkins-slave/workspace/Solr`.
Any workspace that doesn’t correspond to a [currently existing job](https://ci-builds.apache.org/job/Solr/) can be safely deleted.
(It may also be worth comparing the Lucene workspaces in `/home/jenkins/jenkins-slave/workspace/Lucene` to [that project's list of jobs](https://ci-builds.apache.org/job/Lucene/).)
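A minimal sketch of that cleanup, using the workspace path above (confirm each directory against the current job list before deleting anything):

```sh
# Check how much space is left on the agent
df -h /home

# List Solr workspaces, largest first, to spot likely orphans
du -sh /home/jenkins/jenkins-slave/workspace/Solr/* | sort -rh | head -n 20

# Once a workspace is confirmed to have no corresponding job, remove it, e.g.:
# rm -rf "/home/jenkins/jenkins-slave/workspace/Solr/<orphaned-job-name>"
```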