New issue
Advanced search Search tips
Starred by 5 users
Status: Fixed
Owner:
Closed: Nov 15
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 0
Type: Bug

Blocking:
issue 781538



Sign in to add a comment
Provision failure spike in the lab
Project Member Reported by shuqianz@chromium.org, Nov 13 Back to list
https://luci-milo.appspot.com/buildbot/chromeos/kevin-paladin/3012

3:37:20: INFO: Refreshing due to a 401 (attempt 1/2)
13:37:20: INFO: Refreshing access_token
13:40:00: WARNING: Exception is not retriable return code: 3; command: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpu2cqhD/tmpG8Ca5q/temp_summary.json --raw-cmd --task-name kevin-paladin/R64-10119.0.0-rc2-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:bvt-inline' '--tags=build:kevin-paladin/R64-10119.0.0-rc2' '--tags=task_name:kevin-paladin/R64-10119.0.0-rc2-bvt-inline' '--tags=board:kevin' -- /usr/local/autotest/site_utils/run_suite.py --build kevin-paladin/R64-10119.0.0-rc2 --board kevin --suite_name suite_attr_wrapper --pool cq --num 6 --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --offload_failures_only False --suite_args "{'attr_filter': u'(suite:bvt-inline) and (subsystem:default)'}" --job_keyvals "{'cidb_build_stage_id': 62057792L, 'cidb_build_id': 2036003, 'datastore_parent_key': ('Build', 2036003, 'BuildStage', 62057792L)}" -m 155705242
Priority was reset to 100
Triggered task: kevin-paladin/R64-10119.0.0-rc2-bvt-inline
Waiting for results from the following shards: 0
Waiting for results from the following shards: 0
chromeos-golo-server2-165: 39c5b79ceeefa510 3
  Autotest instance created: cautotest-prod
  11-11-2017 [13:04:39] Created suite job: http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=155705242
  @@@STEP_LINK@Link to suite@http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=155705242@@@
  The suite job has another 0:59:54.657513 till timeout.
  11-11-2017 [13:37:27] Suite job is finished.
  11-11-2017 [13:37:27] Start collecting test results and dump them to json.
  Suite job                               [ PASSED ]
  platform_DMVerityBitCorruption.first    [ PASSED ]
  platform_DMVerityBitCorruption.middle   [ PASSED ]
  platform_DMVerityBitCorruption.last     [ PASSED ]
  platform_DMVerityCorruption             [ PASSED ]
  provision                               [ FAILED ]
  provision                                 FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos6-row2-rack24-host20: 0) RootfsUpdateError: Failed to perform rootfs update: DevServerStartupError('Devserver did not start',), 1) RootfsUpdateError: Failed to perform rootfs update: DevServerStartupError('Devserver did not start',), 
  provision_AutoUpdate.double             [ PASSED ]
  graphics_GLMark2                        [ PASSED ]
  security_Minijail0                      [ PASSED ]
  security_ProfilePermissions.login       [ PASSED ]
  security_ProfilePermissions.guest       [ PASSED ]
  security_RootCA                         [ PASSED ]
  security_SandboxLinuxUnittests          [ PASSED ]
  security_ModuleLocking                  [ PASSED ]
 
Labels: -Pri-2 Pri-0
https://viceroy.corp.google.com/chromeos/provision?duration=8d provision failure is extremely high.

Checking & upgrade the priority.
Components: Infra>Client>ChromeOS
Cc: vapier@chromium.org davidri...@chromium.org
I log into one of the failed DUT and manually run devserver.py:

localhost ~ # devserver.py --logfile=/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.D57zLsEoJU/dev_server.log --pidfile /tmp/devserver_wrapper.pid --port=0 --critical_update --portfile=/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.D57zLsEoJU/dev_server.port --static_dir=/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.D57zLsEoJU/static

>>>
[13/Nov/2017:11:08:53] DEVSERVER Using cache directory /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.D57zLsEoJU/static/cache
[13/Nov/2017:11:08:53] DEVSERVER Serving from /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.D57zLsEoJU/static
[13/Nov/2017:11:08:53] XBUDDY Using shadow config file stored at /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.D57zLsEoJU/src/shadow_xbuddy_config.ini
[13/Nov/2017:11:08:53] ENGINE Listening for SIGHUP.
[13/Nov/2017:11:08:53] ENGINE Listening for SIGTERM.
[13/Nov/2017:11:08:53] ENGINE Listening for SIGUSR1.
[13/Nov/2017:11:08:53] ENGINE Bus STARTING
[13/Nov/2017:11:08:53] ENGINE PID 23500 written to '/tmp/devserver_wrapper.pid'.
[13/Nov/2017:11:08:53] ENGINE Started monitor thread '_TimeoutMonitor'.
[13/Nov/2017:11:08:53] ENGINE (wait_for_free_port) No cached port to wait for.
[13/Nov/2017:11:08:53] ENGINE (wait_for_occupied_port) Waiting for actual port 33039.
[13/Nov/2017:11:08:53] ENGINE Port 33039 written to './dev_server.port'.
[13/Nov/2017:11:08:53] ENGINE Serving on :::0
[13/Nov/2017:11:08:53] ENGINE Bus STARTED
::ffff:127.0.0.1 - - [13/Nov/2017:11:13:43] "GET /check_health HTTP/1.1" 200 442 "" "curl/7.55.1"

localhost ~ # wget http://127.0.0.1:33039/check_health -q -O                                                                                  
-bash: wget: command not found

@david mentioned vapier some recent changes about this, we may need to change 'wget' to other cmds
the log shows wget is installed in the dev image:
  https://logs.chromium.org/v/?s=chromeos%2Fbb%2Fchromeos%2Fkevin-paladin%2F3012%2F%2B%2Frecipes%2Fsteps%2FBuildImage%2F0%2Fstdout

[binary  N     ] net-misc/wget-1.19.1-r1 to /mnt/host/source/src/build/images/kevin/R64-10119.0.0-rc2/rootfs/usr/local/ USE="ipv6 ssl zlib -debug -gnutls -idn -libressl -nls -ntlm -pcre -static {-test} -uuid" 0 KiB

but if the stateful partition is wiped, then wget won't exist because it's no longer a dev image.  anything that runs w/only a rootfs should only use curl.
i posted a PSA about this a week ago:
  https://groups.google.com/a/chromium.org/d/topic/chromium-os-dev/1elza8uHUr4/discussion

if need be, we could add wget to the chromeos-base/chromeos-test-root *temporarily* until dev server can be fixed, but that would only be a temporary measure and would be expected to dropped again soonish
Owner: xixuan@chromium.org
xixuan@, can you work on this?
Re #6, thanks! I missed the relationship between this PSA and the potential provision failure.

I will change rootfs to use 'curl' instead of 'wget'.


Project Member Comment 9 by bugdroid1@chromium.org, Nov 13
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/9359091463af661da99ecad605727f85c6228219

commit 9359091463af661da99ecad605727f85c6228219
Author: Xixuan Wu <xixuan@chromium.org>
Date: Mon Nov 13 22:34:47 2017

cros-update: use curl instead of wget to check devserver health.

BUG= chromium:784462 
TEST=None

Change-Id: I5375daaca169976358ead00335b8a4e7a6f16ee9
Reviewed-on: https://chromium-review.googlesource.com/767293
Tested-by: Xixuan Wu <xixuan@chromium.org>
Trybot-Ready: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/9359091463af661da99ecad605727f85c6228219/lib/dev_server_wrapper.py

Summary: Provision failure spike in the lab (was: HWTest failed due to devserver did not start)
Cc: benchan@chromium.org shuqianz@chromium.org nsanders@chromium.org hiroh@chromium.org
 Issue 784222  has been merged into this issue.
Cc: rohi...@chromium.org pmanavalan@chromium.org kinaba@chromium.org kkan...@chromium.org
Issue 784549 has been merged into this issue.
The provision failure rate comes back to normal now
CTS India team, please also verify Issue 784549 as well.
NM, I will unlock the DUT to verify the fix.
Blocking: 781538
Labels: -Restrict-View-Google
Status: Fixed
can i assume we're all set here ?  hopefully there's no other latent wget breakage :/.
Sign in to add a comment