Issue 870926

Issue metadata

Status: Fixed
Owner:
Closed: Aug 15
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



DevserverProvisionTask fails simple staging call sanity check at the end

Reported by jrbarnette@chromium.org, Aug 3

Issue description

At the end of `bin/run_server_task DevserverProvisionTask`, the
command performs two staging calls to the devserver service on
the target system (presumably as a sanity check).  Sometimes, that
check fails.  Retrying the command tends to work.

The failure looks like this:
[chromeos15-infra-devserver12.cros.corp.google.com] sudo: service devserver status
[chromeos15-infra-devserver12.cros.corp.google.com] out: sudo password:
[chromeos15-infra-devserver12.cros.corp.google.com] out: devserver start/running, process 1092
[chromeos15-infra-devserver12.cros.corp.google.com] out: 

[chromeos15-infra-devserver12.cros.corp.google.com] run: gsutil ls gs://chromeos-image-archive/daisy-release/ | tail -2 | head -1
[chromeos15-infra-devserver12.cros.corp.google.com] out: gs://chromeos-image-archive/daisy-release/R70-10934.0.0/
[chromeos15-infra-devserver12.cros.corp.google.com] out: 

[chromeos15-infra-devserver12.cros.corp.google.com] run: curl "http://localhost:8082/stage?artifacts=full_payload,stateful,autotest_packages&files=&archive_url=gs://chromeos-image-archive/daisy-release/R70-10934.0.0/"
[chromeos15-infra-devserver12.cros.corp.google.com] out: curl: (7) Failed to connect to localhost port 8082: Connection refused

At first blush, this looks like it might be a race with devserver
startup (the command is run immediately after a reboot).  Moreover, at
one point I went to a server that had failed this way, and manually ran
the same 'curl' command; it passed without incident.

Reinforcing the "race condition" hypothesis:  When I saw this fail, it
was during a sequence of deployments on three different servers:
  * The first server failed, and succeeded on retry.
  * The second server failed twice, and succeeded on the third try.
  * The third server never saw this failure.

Attached is the log from the server with the single failure.

 
deploy-12.log (37.5 KB)

Ugh, we are restarting nginx on every run:

  exec { 'disable_nginx_init_script':
    command => '/usr/sbin/update-rc.d nginx disable',
    user    => 'root',
  }

So it's hit or miss whether any particular run will pass.


There's still a race, though, if Puppet actually needs to restart nginx.  Fixing the bug in #1 should mean it never takes more than two runs, and adding a well-placed sleep should make even that rare enough.
Owner: ayatane@chromium.org
Status: Started (was: Available)
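
For illustration only: an alternative to a fixed sleep is to poll the port until it accepts connections, then issue the staging call. The sketch below is not the change that landed for this bug. `wait_for_port` and its parameters are hypothetical names, and port 8082 is simply the one from the failing `curl` call in the log above.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll until a TCP connect to (host, port) succeeds or timeout elapses.

    Hypothetical helper. Returns True if the port accepted a connection
    before the deadline, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the service is not ready yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A deploy script could call `wait_for_port("localhost", 8082)` before the staging `curl`, and fail with a clear message if it returns False. Unlike a fixed sleep, this returns as soon as the service is up, and it bounds the wait when the service never comes back.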

Comment 4 by bugdroid1@chromium.org, Aug 8

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/085e11b7cafb195ba285b36603d49a06fc23c3b7

commit 085e11b7cafb195ba285b36603d49a06fc23c3b7
Author: Allen Li <ayatane@chromium.org>
Date: Wed Aug 08 18:00:29 2018


Comment 5 by bugdroid1@chromium.org, Aug 15

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/c4fe5897a767efca54c38bfff206cc124f41a1ee

commit c4fe5897a767efca54c38bfff206cc124f41a1ee
Author: Allen Li <ayatane@chromium.org>
Date: Wed Aug 15 18:43:44 2018

Status: Fixed (was: Started)
