Provision failure spike in the lab |
|||||||||||
Issue descriptionhttps://luci-milo.appspot.com/buildbot/chromeos/kevin-paladin/3012 3:37:20: INFO: Refreshing due to a 401 (attempt 1/2) 13:37:20: INFO: Refreshing access_token [1;33m13:40:00: WARNING: Exception is not retriable return code: 3; command: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpu2cqhD/tmpG8Ca5q/temp_summary.json --raw-cmd --task-name kevin-paladin/R64-10119.0.0-rc2-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:bvt-inline' '--tags=build:kevin-paladin/R64-10119.0.0-rc2' '--tags=task_name:kevin-paladin/R64-10119.0.0-rc2-bvt-inline' '--tags=board:kevin' -- /usr/local/autotest/site_utils/run_suite.py --build kevin-paladin/R64-10119.0.0-rc2 --board kevin --suite_name suite_attr_wrapper --pool cq --num 6 --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --offload_failures_only False --suite_args "{'attr_filter': u'(suite:bvt-inline) and (subsystem:default)'}" --job_keyvals "{'cidb_build_stage_id': 62057792L, 'cidb_build_id': 2036003, 'datastore_parent_key': ('Build', 2036003, 'BuildStage', 62057792L)}" -m 155705242 Priority was reset to 100 Triggered task: kevin-paladin/R64-10119.0.0-rc2-bvt-inline Waiting for results from the following shards: 0 Waiting for results from the following shards: 0 chromeos-golo-server2-165: 39c5b79ceeefa510 3 Autotest instance created: cautotest-prod 11-11-2017 [13:04:39] Created suite job: http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=155705242 @@@STEP_LINK@Link to suite@http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=155705242@@@ The suite job has another 0:59:54.657513 till timeout. 11-11-2017 [13:37:27] Suite job is finished. 11-11-2017 [13:37:27] Start collecting test results and dump them to json. Suite job [ PASSED ] platform_DMVerityBitCorruption.first [ PASSED ] platform_DMVerityBitCorruption.middle [ PASSED ] platform_DMVerityBitCorruption.last [ PASSED ] platform_DMVerityCorruption [ PASSED ] provision [ FAILED ] provision FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos6-row2-rack24-host20: 0) RootfsUpdateError: Failed to perform rootfs update: DevServerStartupError('Devserver did not start',), 1) RootfsUpdateError: Failed to perform rootfs update: DevServerStartupError('Devserver did not start',), provision_AutoUpdate.double [ PASSED ] graphics_GLMark2 [ PASSED ] security_Minijail0 [ PASSED ] security_ProfilePermissions.login [ PASSED ] security_ProfilePermissions.guest [ PASSED ] security_RootCA [ PASSED ] security_SandboxLinuxUnittests [ PASSED ] security_ModuleLocking [ PASSED ]
,
Nov 13 2017
https://viceroy.corp.google.com/chromeos/provision?duration=8d provision failure is extremely high. Checking & upgrade the priority.
,
Nov 13 2017
,
Nov 13 2017
I log into one of the failed DUT and manually run devserver.py: localhost ~ # devserver.py --logfile=/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.D57zLsEoJU/dev_server.log --pidfile /tmp/devserver_wrapper.pid --port=0 --critical_update --portfile=/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.D57zLsEoJU/dev_server.port --static_dir=/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.D57zLsEoJU/static >>> [13/Nov/2017:11:08:53] DEVSERVER Using cache directory /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.D57zLsEoJU/static/cache [13/Nov/2017:11:08:53] DEVSERVER Serving from /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.D57zLsEoJU/static [13/Nov/2017:11:08:53] XBUDDY Using shadow config file stored at /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.D57zLsEoJU/src/shadow_xbuddy_config.ini [13/Nov/2017:11:08:53] ENGINE Listening for SIGHUP. [13/Nov/2017:11:08:53] ENGINE Listening for SIGTERM. [13/Nov/2017:11:08:53] ENGINE Listening for SIGUSR1. [13/Nov/2017:11:08:53] ENGINE Bus STARTING [13/Nov/2017:11:08:53] ENGINE PID 23500 written to '/tmp/devserver_wrapper.pid'. [13/Nov/2017:11:08:53] ENGINE Started monitor thread '_TimeoutMonitor'. [13/Nov/2017:11:08:53] ENGINE (wait_for_free_port) No cached port to wait for. [13/Nov/2017:11:08:53] ENGINE (wait_for_occupied_port) Waiting for actual port 33039. [13/Nov/2017:11:08:53] ENGINE Port 33039 written to './dev_server.port'. [13/Nov/2017:11:08:53] ENGINE Serving on :::0 [13/Nov/2017:11:08:53] ENGINE Bus STARTED ::ffff:127.0.0.1 - - [13/Nov/2017:11:13:43] "GET /check_health HTTP/1.1" 200 442 "" "curl/7.55.1" localhost ~ # wget http://127.0.0.1:33039/check_health -q -O -bash: wget: command not found @david mentioned vapier some recent changes about this, we may need to change 'wget' to other cmds
,
Nov 13 2017
the log shows wget is installed in the dev image: https://logs.chromium.org/v/?s=chromeos%2Fbb%2Fchromeos%2Fkevin-paladin%2F3012%2F%2B%2Frecipes%2Fsteps%2FBuildImage%2F0%2Fstdout [binary N ] net-misc/wget-1.19.1-r1 to /mnt/host/source/src/build/images/kevin/R64-10119.0.0-rc2/rootfs/usr/local/ USE="ipv6 ssl zlib -debug -gnutls -idn -libressl -nls -ntlm -pcre -static {-test} -uuid" 0 KiB but if the stateful partition is wiped, then wget won't exist because it's no longer a dev image. anything that runs w/only a rootfs should only use curl.
,
Nov 13 2017
i posted a PSA about this a week ago: https://groups.google.com/a/chromium.org/d/topic/chromium-os-dev/1elza8uHUr4/discussion if need be, we could add wget to the chromeos-base/chromeos-test-root *temporarily* until dev server can be fixed, but that would only be a temporary measure and would be expected to dropped again soonish
,
Nov 13 2017
xixuan@, can you work on this?
,
Nov 13 2017
Re #6, thanks! I missed the relationship between this PSA and the potential provision failure. I will change rootfs to use 'curl' instead of 'wget'.
,
Nov 13 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/9359091463af661da99ecad605727f85c6228219 commit 9359091463af661da99ecad605727f85c6228219 Author: Xixuan Wu <xixuan@chromium.org> Date: Mon Nov 13 22:34:47 2017 cros-update: use curl instead of wget to check devserver health. BUG= chromium:784462 TEST=None Change-Id: I5375daaca169976358ead00335b8a4e7a6f16ee9 Reviewed-on: https://chromium-review.googlesource.com/767293 Tested-by: Xixuan Wu <xixuan@chromium.org> Trybot-Ready: Xixuan Wu <xixuan@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> Commit-Queue: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/9359091463af661da99ecad605727f85c6228219/lib/dev_server_wrapper.py
,
Nov 13 2017
,
Nov 13 2017
Issue 784222 has been merged into this issue.
,
Nov 14 2017
Issue 784549 has been merged into this issue.
,
Nov 14 2017
The provision failure rate comes back to normal now
,
Nov 14 2017
CTS India team, please also verify Issue 784549 as well.
,
Nov 14 2017
NM, I will unlock the DUT to verify the fix.
,
Nov 15 2017
can i assume we're all set here ? hopefully there's no other latent wget breakage :/.
,
Dec 21 2017
This bug hit me on my moblab: I am still using R62, trying to update DUT to R65. Given moblab users for qual may still be on R62, we need to backport CL/767293 or force a moblab update.
,
Jan 2 2018
|
|||||||||||
►
Sign in to add a comment |
|||||||||||
Comment 1 by shuqianz@chromium.org
, Nov 13 2017