Falco-chrome-pfq failed because of Provision failure: Unhandled DevServerException
Issue description

These provision failures on Falco-chrome-pfq started with https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq/builds/5264 and have lasted for a few days (the latest failed build with the same failure: https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq/builds/5269).

Selected error message:

provision [ FAILED ]
provision FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row3-rack4-host3: SSHConnectionError: ssh: connect to host chromeos2-row3-rack4-host3 port 22: Connection timed out
@@@STEP_FAILURE@@@
00:44:12: ERROR: ** HWTest did not complete due to infrastructure issues (code 3) **
00:44:12: ERROR: BaseException in _RunParallelStages <class 'chromite.lib.failures_lib.StepFailure'>:
Traceback (most recent call last):
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 441, in _Run
    self._task(*self._task_args, **self._task_kwargs)
  File "/b/cbuild/internal_master/chromite/cbuildbot/stages/generic_stages.py", line 625, in Run
    raise failures_lib.StepFailure()
StepFailure

Traceback (most recent call last):
  File "/b/cbuild/internal_master/chromite/cbuildbot/builders/generic_builders.py", line 118, in _RunParallelStages
    parallel.RunParallelSteps(steps)
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 678, in RunParallelSteps
    return [queue.get_nowait() for queue in queues]
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 675, in RunParallelSteps
    pass
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 561, in ParallelTasks
    raise BackgroundFailure(exc_infos=errors)
BackgroundFailure: <class 'chromite.lib.failures_lib.StepFailure'>:
Traceback (most recent call last):
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 441, in _Run
    self._task(*self._task_args, **self._task_kwargs)
  File "/b/cbuild/internal_master/chromite/cbuildbot/stages/generic_stages.py", line 625, in Run
    raise failures_lib.StepFailure()
StepFailure

00:44:12: INFO: Running cidb query on pid 1338, repr(query) starts with <sqlalchemy.sql.expression.Select at 0x6f70a90; Select object>
00:44:17: INFO: cidb query succeeded after 1 retries
00:44:17: INFO: Running cidb query on pid 1338, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x6f63410>
00:44:17: INFO: Running cidb query on pid 1338, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x6f63550>
00:44:17: INFO: Running cidb query on pid 1338, repr(query) starts with <sqlalchemy.sql.expression.Select at 0x6f638d0; Select object>
01:02:38: INFO: Running cidb query on pid 31932, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x47ef5d0>
01:02:39: WARNING: HttpsMonitor.send received status 429: { "error": { "code": 429, "message": "Resource has been exhausted (e.g. check quota).", "status": "RESOURCE_EXHAUSTED" } }
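(For anyone who wants to check DUT reachability by hand, here is a minimal sketch; it is not the provisioning code itself. The host name is the one from this report, and the ssh options mirror the autoserv command quoted later in this thread.)

#!/usr/bin/env python
# Minimal sketch: probe a DUT over ssh the way the log above suggests
# auto-update does, to see whether port 22 answers at all.
import subprocess

DUT = 'chromeos2-row3-rack4-host3'   # host from this report

def ssh_reachable(host, timeout=30):
    """Return True if `ssh root@host true` succeeds within `timeout` seconds."""
    cmd = [
        '/usr/bin/ssh', '-o', 'BatchMode=yes',
        '-o', 'StrictHostKeyChecking=no',
        '-o', 'UserKnownHostsFile=/dev/null',
        '-o', 'ConnectTimeout=%d' % timeout,
        '-l', 'root', '-p', '22', host, 'true',
    ]
    return subprocess.call(cmd) == 0

if __name__ == '__main__':
    print('%s reachable over ssh: %s' % (DUT, ssh_reachable(DUT)))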
Jan 31 2017
I think it could be a flaky network connection, or the network card, or the DUT itself?
Jan 31 2017
+current infra trooper
Jan 31 2017
Looking at the debug info of the autotest suite (https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/99076270-chromeos-test/chromeos2-row3-rack3-host9/debug/), I found the following error messages:

01/31 12:24:42.250 ERROR| dev_server:0488| Devserver check_health failed. Free disk space is low. Only 19GB is available.
01/31 12:24:44.557 ERROR| base_utils:0280| [stderr] [0131/191459:INFO:update_engine_client.cc(493)] Querying Update Engine status...
01/31 12:24:54.193 ERROR| base_utils:0280| [stderr] 2017-01-31 12:24:53 URL:http://100.115.245.198:8082/static/falco-release/R57-9202.11.0/autotest/packages/client-autotest.tar.bz2 [11496355/11496355] -> "/tmp/sysinfo/autoserv-eDvJ5H/packages/client-autotest.tar.bz2" [1]
01/31 12:24:55.657 ERROR| base_utils:0280| [stderr] 2017-01-31 12:24:55 URL:http://100.115.245.198:8082/static/falco-release/R57-9202.11.0/autotest/packages/packages.checksum [34489/34489] -> "/tmp/sysinfo/autoserv-eDvJ5H/packages.checksum" [1]
01/31 12:24:56.818 ERROR| base_utils:0280| [stderr] cat: /tmp/sysinfo/autoserv-eDvJ5H/.checksum: No such file or directory
01/31 12:58:50.854 ERROR| base_utils:0280| [stderr] ssh: connect to host chromeos2-row3-rack3-host9 port 22: Connection timed out
01/31 12:58:50.857 ERROR| test:0240| Error cleaning up the sysinfo autotest/host objects, ignoring it
Traceback (most recent call last):
  File "/usr/local/autotest/server/test.py", line 233, in cleanup
    self.host.close()
  File "/usr/local/autotest/server/hosts/cros_host.py", line 1128, in close
    super(CrosHost, self).close()
  File "/usr/local/autotest/server/hosts/abstract_ssh.py", line 732, in close
    super(AbstractSSHHost, self).close()
  File "/usr/local/autotest/server/hosts/remote.py", line 52, in close
    self.run('rm -rf "%s"' % (utils.sh_escape(dir)))
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 283, in run
    ssh_failure_retry_ok)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 232, in _run
    raise error.AutoservSSHTimeout("ssh timed out", result)
AutoservSSHTimeout: ('ssh timed out',
* Command:
    /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_uOpgfpssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row3-rack3-host9 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::close|close|close] -> ssh_run(rm -rf \\\"/tmp/sysinfo/autoserv-eDvJ5H\\\")\";fi; rm -rf \"/tmp/sysinfo/autoserv-eDvJ5H\""
Exit status: 255
Duration: 63.1774010658

This seems related to Issue 687237 (several devservers are failing their health check with low-disk messages). I'll wait to see whether the cleanup work in that issue also fixes this one.
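(Side note: if you want to poke the devserver health check yourself, here is a quick sketch. The /check_health endpoint is what dev_server.py is complaining about above; the JSON field names used below are assumptions, so inspect the raw response on your devserver first.)

import json
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

DEVSERVER = 'http://100.115.245.198:8082'   # devserver seen in the log above

def check_health(base_url, timeout=10):
    """Fetch and pretty-print the devserver's check_health JSON."""
    resp = urlopen('%s/check_health' % base_url, timeout=timeout)
    health = json.loads(resp.read().decode('utf-8'))
    print(json.dumps(health, indent=2, sort_keys=True))
    # Assumed field name: 'free_disk' (GB). Verify against the real response.
    free_gb = health.get('free_disk')
    if free_gb is not None and free_gb < 20:
        print('WARNING: only %sGB free, staging may fail' % free_gb)
    return health

if __name__ == '__main__':
    check_health(DEVSERVER)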
Feb 1 2017
Several informational builders are starting to be affected by this failure as well, e.g.:

tricky-tot-chrome-pfq-informational: https://uberchromegw.corp.google.com/i/chromeos.chrome/builders/tricky-tot-chrome-pfq-informational/builds/3327
peach_pit-tot-chrome-pfq-informational: https://uberchromegw.corp.google.com/i/chromeos.chrome/builders/peach_pit-tot-chrome-pfq-informational/builds/4674

Any thoughts?
Feb 1 2017
The ssh connection timeout might be https://bugs.chromium.org/p/chromium/issues/detail?id=687402?
Feb 1 2017
Chrome Infra does not handle issues related to ChromeOS.
Feb 1 2017
I noticed that we had a green cycle last night, so it seemed the cleanup work in Issue 687237 did fix the problem. But this morning the same failure came back again:

02-01-2017 [03:50:00] Output below this line is for buildbot consumption:
@@@STEP_LINK@[Auto-Bug]: provision: FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row3-rack4-host3: SSHConnectionError: ssh: connect to host chromeos2-row3-rack4-host3 port 22: Connection timed out, 41 reports@https://code.google.com/p/chromium/issues/detail?id=666781@@@
@@@STEP_LINK@[Test-Logs]: provision: FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row3-rack4-host3: SSHConnectionError: ssh: connect to host chromeos2-row3-rack4-host3 port 22: Connection timed out@http://cautotest/tko/retrieve_logs.cgi?job=/results/99166108-chromeos-test/@@@
Will return from run_suite with status: INFRA_FAILURE
Feb 1 2017
+cc semenzato@ uekawa@
Feb 2 2017
This issue persists on Falco-chrome-pfq and has been blocking the Chrome PFQ for 3 days. (Note: there was one green Falco-chrome-pfq run yesterday, but the following runs failed again with the same failure. The green cycle might be because the cleanup work in Issue 687237 landed or the DHCP config failure in Issue 687402 was resolved.) The latest failed run is https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq/builds/5277. akeshet@, could you take a look or find a better owner for this issue, please? It looks like an infra error.
Feb 2 2017
Looking at the past 200 runs, falco-chrome-pfq has been more red than green since around Jan 17: https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq?numbuilds=200
Feb 2 2017
For the last two runs, chromeos2-row3-rack4-host3 did not respond after the auto-update. Whether that means a faulty device or a faulty change, I don't know, but it's hard to believe something is specific to falco.
Feb 2 2017
An update on this thread: we had another green cycle early this morning. However, looking at the history, it's hard to tell whether this issue has been fixed or it was just happenstance. I'll keep monitoring the new runs.
Feb 2 2017
The new Falco-chrome-pfq run (see https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq/builds/5280) just failed again with the same SSHConnectionError timeout error along with a new HWTest video_ChromeHWDecodeUsed error (see Issue 688118 ).
Feb 2 2017
The last run also had an ssh timeout for chromeos2-row3-rack4-host3 (https://bugs.chromium.org/p/chromium/issues/detail?id=666781).
Feb 3 2017
All of the recent failures were attempts to provision chromeos2-row3-rack4-host3. We should file an issue for Infra to remove that from the pool.
Feb 3 2017
Re#19: Thanks for the observation! Filed Issue 688187 to track the removal of chromeos2-row3-rack4-host3.
Feb 3 2017
ok, let's see if locking the host improves the situation.
Feb 3 2017
Now it failed with chromeos2-row3-rack3-host9, a new and different host.
Feb 3 2017
In the last run, chromeos2-row3-rack4-host1 failed, so it's probably not the hosts.

provision [ FAILED ]
provision FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row3-rack4-host1: DeviceNotPingableError: Device chromeos2-row3-rack4-host1 is not pingable.
Feb 3 2017
From the recent builds:

5280 was a provision failure on chromeos2-row3-rack4-host3
5281 was a provision failure on chromeos2-row3-rack4-host3
5282 was a slightly different provision failure on chromeos2-row3-rack3-host9
5283 was a provision failure on chromeos2-row3-rack4-host1
5284 succeeded

So I am pretty certain there was a real problem with chromeos2-row3-rack4-host3. It is less clear for 3-3-9 and 3-4-1, but looking further back in the history, those two hosts failed fairly often (though I'm not sure how many DUTs are actually in the pool, so that may or may not be indicative of anything). It would be helpful to be able to see the history for a particular DUT, but I am unaware of how to do so (there is an open YAQS question, https://yaqs.googleplex.com/eng/q/5545813244116992). akeshet@ - any suggestions?
Feb 3 2017
Re#24: DeviceNotPingableError is a different error; it's the same issue as Issue 683304. It has happened intermittently for the past week, and semenzato@ has CLs to fix it. :) The latest build passed (https://uberchromegw.corp.google.com/i/chromeos/builders/falco-chrome-pfq/builds/5284). Maybe the lock of chromeos2-row3-rack4-host3 in Issue 688187 did work, or maybe it's just happenstance. I'll keep monitoring it.
Feb 3 2017
I suspect that chromeos2-row3-rack4-host3 was extremely flaky (or just plain bad), and that the others are only semi-flaky. We'll keep an eye on things and may need to investigate further next week.
Feb 3 2017
$ dut-status -f chromeos2-row3-rack4-host3
chromeos2-row3-rack4-host3
2017-02-02 17:39:35 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99410498-chromeos-test/
2017-02-02 17:39:11 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512914-reset/
2017-02-02 16:40:46 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99396164-chromeos-test/
2017-02-02 16:40:20 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512705-reset/
2017-02-02 16:39:35 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99395491-chromeos-test/
2017-02-02 16:31:01 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512673-provision/
2017-02-02 16:21:17 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512648-repair/
2017-02-02 16:07:22 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512590-provision/
2017-02-02 16:06:46 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512586-repair/
2017-02-02 16:06:29 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512584-reset/
2017-02-02 15:54:49 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99395660-chromeos-test/
2017-02-02 15:54:27 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512541-reset/
2017-02-02 15:53:39 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99392328-chromeos-test/
2017-02-02 15:53:15 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512537-reset/
2017-02-02 15:52:35 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99392322-chromeos-test/
2017-02-02 15:52:10 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512531-reset/
2017-02-02 15:51:07 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99392301-chromeos-test/
2017-02-02 15:50:47 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512525-reset/
2017-02-02 15:49:45 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99392291-chromeos-test/
2017-02-02 15:49:20 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512516-reset/
2017-02-02 15:48:09 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99392285-chromeos-test/
2017-02-02 15:39:53 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512461-provision/
2017-02-02 11:30:38 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99380649-chromeos-test/
2017-02-02 11:22:26 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512175-provision/
2017-02-02 11:13:09 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512126-repair/
2017-02-02 11:02:48 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512110-provision/
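(If it helps triage, here is a rough sketch for summarizing a dut-status listing like the one above: it counts OK vs failed entries per task type, inferring the task from the log URL. The parsing is based only on the format shown here, so adjust if the real output differs; pipe the listing into it on stdin.)

import re
import sys
from collections import Counter

LINE_RE = re.compile(
    r'^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(OK|--)\s+(\S+)')

def task_name(url):
    # Host tasks embed their type in the URL (e.g. ...-provision/); anything
    # else is a test-results link.
    m = re.search(r'-(provision|repair|reset|cleanup)/?$', url)
    return m.group(1) if m else 'test'

def summarize(lines):
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        _, status, url = m.groups()
        counts[(task_name(url), status)] += 1
    return counts

if __name__ == '__main__':
    for (task, status), n in sorted(summarize(sys.stdin).items()):
        print('%-10s %-2s %d' % (task, status, n))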
Feb 3 2017
It does seem somewhat provision-failure-prone. You might get some insight from examining those repair logs.
Feb 3 2017
I was looking at some arbitrary var/log/messages files:

chromeos-autotest-results/hosts/chromeos2-row3-rack4-host3/1512126-repair/20170202111305/crashinfo.chromeos2-row3-rack4-host3/var/log
/chromeos-autotest-results/hosts/chromeos2-row3-rack4-host3/1512586-repair/20170202160644/chromeos2-row3-rack4-host3/before_repair/log

I was a little surprised that the timezone is mixed: on one boot it is GMT and on another it became PST, so the times don't look right.

2017-02-02T17:00:57.486227+00:00 ERR wpa_supplicant[641]: rmdir[ctrl_interface=/var/run/wpa_supplicant]: Permission denied
2017-02-02T09:01:03.714591-08:00 INFO kernel: [ 0.000000] Initializing cgroup subsys cpu

This is for:

2017-02-02 11:13:09 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512126-repair/

What's this timestamp? Is this a run at around 16:00 PST? I'm trying to wrap my head around which run is the failing run and which one is the repair.
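(Regarding the mixed GMT/PST timestamps: the syslog lines carry their own UTC offset, so they can be normalized mechanically. A small sketch, Python 3.7+, using the two timestamps quoted above; once normalized, both land around 09:01 PST, i.e. they describe roughly the same wall-clock moment.)

from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))   # fixed offset; ignores DST for brevity

samples = [
    '2017-02-02T17:00:57.486227+00:00',   # logged in GMT
    '2017-02-02T09:01:03.714591-08:00',   # logged in PST
]

for stamp in samples:
    dt = datetime.fromisoformat(stamp)
    print('%s  ->  %s PST' % (stamp, dt.astimezone(PST).strftime('%Y-%m-%d %H:%M:%S')))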
Feb 3 2017
The GMT/PST problem is known and probably not fixable. Usually it's not too difficult to tell from context whether you need to subtract 8 hours from the syslog timestamps.

I've been looking at failures to reconnect after reboot; see issue 677572, particularly comment 32. In that failure, eth0 doesn't come up after a reboot. check_ethernet.hook tries to bring it up, attempting several reboots in the process. Finally autoserv does an EC reset (or power cycle), which, unlike the reboot, cuts power to the USB dongle. That makes the dongle work again. I wonder if this is the same situation.
Feb 6 2017
chromeos2-row3-rack4-host3 was manually reflashed and seems to provision fine now. LMK if there is more to be done here.
Feb 6 2017
issue 666781 shows that this happens to a few other boards, but falco is the most frequent by far. I suggest keeping this bug open until we can identify why the other falco DUTs are so flaky.
Feb 6 2017
There was definitely a problem with chromeos2-row3-rack4-host3, fixed in issue 688187. Starting with build #5284 (when chromeos2-row3-rack4-host3 was pulled), we have had 3 failures out of 17 runs, all due to failures on chromeos2-row3-rack4-host1. This isn't *super* high, but it's higher than we would like, so I would file a new issue to have chromeos2-row3-rack4-host1 reflashed to see if that helps reduce falco flakiness. My notes show that chromeos2-row3-rack4-host1 has been more flaky than average for a while, but it has largely been masked by the consistent failures on chromeos2-row3-rack4-host3.
Feb 8 2017
Issue 689794 is a similar problem for chromeos2-row3-rack9-host11, which runs samus-android-pfq, so it might suggest something is wrong with the whole chromeos2-row3. Does anyone know who on the infra team we should route this to?
Feb 8 2017
List of all similar issues: issue 689794, issue 666781, issue 687883, issue 687432. It looks like the problem actually happens on all boards and all DUTs.
Feb 9 2017

Feb 11 2017
Discussed with xixuan@. She has a retry around these SSHConnectionErrors in the pipeline that should hopefully resolve this (or at least reduce it) soon.
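(For context, the general shape of such a retry is sketched below. This is illustrative only and not CL 441389; SSHConnectionError here is a stand-in exception, since the real auto-update code defines its own error types.)

import time

class SSHConnectionError(Exception):
    """Stand-in for the transient ssh-connect failure seen in these logs."""

def retry_on_ssh_error(func, attempts=3, delay_secs=10, backoff=2):
    """Call func(); on SSHConnectionError, retry with exponential backoff."""
    for i in range(attempts):
        try:
            return func()
        except SSHConnectionError as e:
            if i == attempts - 1:
                raise
            wait = delay_secs * (backoff ** i)
            print('ssh connect failed (%s); retrying in %ds' % (e, wait))
            time.sleep(wait)

# Usage sketch:
#   retry_on_ssh_error(lambda: auto_update(host, build))
# where auto_update is whatever call raises SSHConnectionError on timeout.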
Feb 14 2017
When should we expect this to cycle green? The latest run is still failing.
Feb 14 2017
After CL https://chromium-review.googlesource.com/c/441389/ is pushed in; ETA is the end of this week. We don't want the retry to add to the devserver's load and as a result break the CQ, so we will first add a metric to monitor such errors, then push CL 441389 in to see whether it makes any difference.
Feb 14 2017
xixuan@, is CL https://chromium-review.googlesource.com/c/441389/ already in? Is issue 692247 caused by the same root cause you are trying to fix by adding more retries?
Feb 14 2017
No, I haven't submitted the CL yet. I updated that bug (692247); it's not a network issue for that host.
Feb 15 2017

Feb 21 2017

Mar 2 2017
OK. Going back to first principles, here's the history around the original failure on chromeos2-row3-rack4-host3:
chromeos2-row3-rack4-host3
2017-01-31 01:48:53 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1503494-reset/
2017-01-31 01:47:52 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/98984451-chromeos-test/
2017-01-31 01:36:51 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1503470-provision/
2017-01-31 00:19:15 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1503299-repair/
2017-01-31 00:03:00 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1503245-provision/
2017-01-30 20:24:12 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502943-repair/
2017-01-30 20:19:37 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502918-provision/
2017-01-30 20:18:59 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502914-repair/
2017-01-30 20:14:34 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502883-provision/
2017-01-30 20:13:58 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502878-repair/
2017-01-30 19:54:50 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502803-provision/
2017-01-30 19:28:46 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/98934521-chromeos-test/
2017-01-30 19:28:22 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502743-reset/
2017-01-30 19:27:48 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502736-cleanup/
2017-01-30 18:24:11 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/98928486-chromeos-test/
2017-01-30 18:19:41 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502422-reset/
So, the provision failure was part of a period of general failures that lasted for over 7 hours. Looking at the central sequence of provision/repair events, there's a common theme:
* Provision failed after the devserver was unable to talk to the DUT over ssh.
* The DUT was accessible over ssh less than a minute later when the shard started repair.
This is consistent with some more recently observed problems attributed to devserver load.
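(A rough sketch of how that gap can be measured mechanically from a pasted dut-status listing like the ones in this bug: pair each failed provision with the next successful event on the host and print the delta. The parsing assumes only the listing format shown here; pipe the listing into it on stdin.)

import re
import sys
from datetime import datetime

LINE_RE = re.compile(
    r'^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(OK|--)\s+(\S+)')

def parse(lines):
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            stamp, status, url = m.groups()
            events.append((datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S'),
                           status, url))
    # dut-status prints newest first; work in chronological order.
    return sorted(events)

def report(events):
    for i, (when, status, url) in enumerate(events):
        if status == '--' and '-provision/' in url:
            nxt = next(((w, u) for w, s, u in events[i + 1:] if s == 'OK'), None)
            if nxt:
                print('failed provision at %s; next OK event %s later (%s)'
                      % (when, nxt[0] - when, nxt[1]))

if __name__ == '__main__':
    report(parse(sys.stdin))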
Mar 2 2017