server test failures (CTS/GTS and others) appear to be infra-related
Issue description

The following builds failed for what appear to be infra-related reasons:

https://uberchromegw.corp.google.com/i/chromeos/builders/cave-paladin/builds/257
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1521768

https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/1099
https://viceroy.corp.google.com/chromeos/build_details?build_id=1521764

https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-paladin/builds/2561
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1521750

From veyron_minnie-paladin:
https://storage.cloud.google.com/chromeos-autotest-results/117862566-chromeos-test/chromeos4-row9-rack11-host12/ssp_logs/debug/autoserv.ERROR
Failed to setup container for test: retry exception (function="download_extract()"), timeout = 900s.

Tests were scheduled but not run for both cave and kevin.
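For readers unfamiliar with the container-setup error above: "retry exception (function="download_extract()"), timeout = 900s" means the container setup kept retrying the download and gave up once an overall 900-second budget was exhausted. A minimal sketch of that retry-with-deadline pattern (names are illustrative only, not the actual autotest retry implementation):

import time

def retry_with_deadline(func, timeout_s=900, delay_s=10):
    # Keep calling func() until it succeeds or the overall deadline passes;
    # then surface the error as a "retry exception" naming the wrapped function.
    deadline = time.time() + timeout_s
    last_error = None
    while time.time() < deadline:
        try:
            return func()
        except Exception as e:  # real code would catch specific exceptions
            last_error = e
            time.sleep(delay_s)
    raise RuntimeError('retry exception (function="%s()"), timeout = %ds: %s'
                       % (func.__name__, timeout_s, last_error))

# Usage (download_extract stands in for whatever fetches and unpacks the package):
# retry_with_deadline(download_extract, timeout_s=900)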
,
May 16 2017
dshi@, is this related to the recent container update?
,
May 16 2017
I've been seeing vmlog file upload errors in several of the failed builds. Is that related, or a separate bug?

scp: /var/log/vmlog/vmlog.1.LATEST: No such file or directory
scp: /var/log/vmlog/vmlog.1.PREVIOUS: No such file or directory
,
May 16 2017
The scp failure is not the cause of the test timeout. The status.log shows:

START cheets_CTS_N.CtsOpenGLTestCases cheets_CTS_N.CtsOpenGLTestCases timestamp=1494945131 localtime=May 16 07:32:11
INFO ---- ---- Job aborted by autotest_system on 2017-05-16 07:32:59

The suite job started at 05/16 06:01:40, so the suite timed out after its 90-minute budget. We need to find out whether that was a DUT shortage or whether some tests ran for too long.
,
May 16 2017
c#5: Which failure are you talking about?
,
May 16 2017
Job 117848005 timed out while downloading autotest_server_package.tar.bz2:

05/16 07:21:47.189 DEBUG| utils:0202| Running 'ssh 172.24.184.160 'curl "http://172.24.184.160:8082/static/cave-paladin/R60-9557.0.0-rc2/autotest_server_package.tar.bz2"''
05/16 07:26:47.683 WARNI| utils:0929| run process timeout (300) fired on: ssh 172.24.184.160 'curl "http://172.24.184.160:8082/static/cave-paladin/R60-9557.0.0-rc2/autotest_server_package.tar.bz2"'

The chosen devserver is 172.24.184.160, which is a ganeti devserver (server87). It looks like we have a performance issue with the ganeti devservers.
,
May 16 2017
For kevin-paladin:1099, which is reported as having timed out:
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/1099

kevin-paladin: The HWTest [bvt-inline] stage failed: ** Suite timed out before completion **

I don't think it's actually a timeout:

** Start Stage HWTest [bvt-inline] - Tue, 16 May 2017 05:54:54 -0700 (PDT)
07:12:46: INFO: Translating result ** Suite timed out before completion ** to fail.

And that appears to be a 9000-second (2.5h) timeout. The suite details show two tests not getting run and a provision failure, after which repair showed the DUT as fine.
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1521764
,
May 16 2017
We should consider disabling the ssh ... curl ... path when downloading from a ganeti devserver; a direct wget over http should work. I suspect the timeout is coming from ssh. Also, curl is flakier than wget.
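Roughly what that would look like (a minimal sketch only, assuming a flag tells us whether the devserver is reachable directly; this is not the actual dev_server.py code and the names are illustrative):

import subprocess
import urllib.request

def fetch_from_devserver(devserver_ip, url, reachable_directly, timeout_s=300):
    # Preferred path: plain HTTP GET straight to the devserver, no ssh hop.
    if reachable_directly:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    # Fallback for devservers that are only reachable from inside a restricted
    # subnet: run curl on the devserver itself over ssh (the flaky path above).
    result = subprocess.run(['ssh', devserver_ip, 'curl', '-sf', url],
                            capture_output=True, timeout=timeout_s, check=True)
    return result.stdout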
,
May 16 2017
What should we do in the short term? Are we deciding that an unresponsive devserver is the cause (or ssh/curl)?
,
May 16 2017
Test job 117853646 (https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/117853646-chromeos-test/chromeos4-row9-rack11-host11/debug/) timed out for a different reason: it failed to get job_repo_url from the host attributes. I thought the host attributes were already bundled into the control file, so the test shouldn't need to make RPC calls. On the other hand, a test running inside the container should still be able to make RPCs, unless the shard AFE server is hung? +phobbs on the RPC issue.
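To spell out the distinction being discussed (purely illustrative; these names are not the actual autotest API): if job_repo_url is serialized into the control file, the lookup is a local dict read, and only the fallback path needs a live AFE RPC that could hang on an unresponsive shard.

def get_job_repo_url(hostname, bundled_attributes, afe_get_host_attribute):
    # Fast path: attribute was baked into the control file at scheduling time,
    # so no network call is needed.
    if 'job_repo_url' in bundled_attributes:
        return bundled_attributes['job_repo_url']
    # Fallback: ask the (shard) AFE over RPC. If that server is hung, this is
    # where the test would stall until it times out.
    return afe_get_host_attribute(hostname, 'job_repo_url')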
,
May 16 2017
For the devserver issue, it seems that server87 was not that heavily loaded during the test timeout:
https://viceroy.corp.google.com/chromeos/machines?hostname=chromeos-server87&build_id=1521764&duration=1d&refresh=-1
I'll see if I can put together a quick fix to get rid of ssh first.
,
May 16 2017
https://uberchromegw.corp.google.com/i/chromeos/builders/cave-paladin/builds/258

provision_AutoUpdate.double failed: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/117872662-chromeos-test/

status.log:
FAIL ---- ---- timestamp=1494952345 localtime=May 16 09:32:25 Failed to setup container for test: retry exception (function="download_extract()"), timeout = 900s. Check logs in ssp_logs folder for more details.

platform_OSLimits had the following exception aborting the job: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/117872747-chromeos-test/

debug/autoserv.ERROR:
05/16 10:02:51.303 ERROR| server_job:0809| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/usr/local/autotest/server/server_job.py", line 801, in run
    self._execute_code(server_control_file, namespace)
  File "/usr/local/autotest/server/server_job.py", line 1301, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/usr/local/autotest/results/117872747-chromeos-test/chromeos6-row2-rack18-host2/control.srv", line 10, in <module>
    job.parallel_simple(run_client, machines)
  File "/usr/local/autotest/server/server_job.py", line 625, in parallel_simple
    return_results=return_results)
  File "/usr/local/autotest/server/subcommand.py", line 93, in parallel_simple
    function(arg)
  File "/usr/local/autotest/results/117872747-chromeos-test/chromeos6-row2-rack18-host2/control.srv", line 7, in run_client
    at.run(control, host=host, use_packaging=use_packaging)
  File "/usr/local/autotest/server/autotest.py", line 381, in run
    client_disconnect_timeout, use_packaging=use_packaging)
  File "/usr/local/autotest/server/autotest.py", line 464, in _do_run
    client_disconnect_timeout=client_disconnect_timeout)
  File "/usr/local/autotest/server/autotest.py", line 943, in execute_control
    'client %s: %s', (self.host.name, last))
AttributeError: 'chromeos6-row2-rack18-host2_host' object has no attribute 'name'
05/16 10:02:51.308 ERROR| logging_manager:0626| tko parser: {'builds': "{'cros-version': 'cave-paladin/R60-9557.0.0-rc3'}", 'job_started': 1494953819, 'offload_failures_only': 'True', 'hostname': 'chromeos6-row2-rack18-host2', 'status_version': 1, 'label': 'cave-paladin/R60-9557.0.0-rc3/bvt-inline/platform_OSLimits', 'parent_job_id': 117872595, 'drone': 'chromeos-server108.mtv.corp.google.com', 'user': 'chromeos-test', 'suite': 'bvt-inline', 'job_queued': 1494950201, 'experimental': 'False', 'build': 'cave-paladin/R60-9557.0.0-rc3'}

There were also numerous tests which weren't scheduled.
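Note that the AttributeError in that traceback comes from the error-reporting line in autotest.py itself, which masks whatever the underlying client failure was. A hedged sketch of the shape of that secondary bug (assuming the host object exposes hostname rather than name, which the AttributeError suggests; this is not the actual autotest.py source):

class FakeHost(object):
    # Stand-in for the real host object named in the traceback.
    def __init__(self, hostname):
        self.hostname = hostname

def describe_client_failure(host, last_error):
    # Buggy shape, per the traceback:
    #   'client %s: %s', (self.host.name, last)   -> AttributeError on .name
    # Using the attribute the object actually has avoids raising while reporting:
    return 'client %s: %s' % (host.hostname, last_error)

print(describe_client_failure(FakeHost('chromeos6-row2-rack18-host2'),
                              'unhandled error in client control file'))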
,
May 16 2017
c#13: s/weren't scheduled/weren't run/ https://viceroy.corp.google.com/chromeos/suite_details?build_id=1521952
,
May 16 2017
Well, chromeos-server87 is in HOT. Can we not run CTS from there? Aviv, can you move cave to one of the MTV servers?
,
May 16 2017
atest shard list | grep cave
152  chromeos-server108.mtv.corp.google.com  board:celes, board:lars, board:cave
,
May 16 2017
chromeos-test@chromeos-server87:~$ ps aux | grep dhclient | wc
777 8560 57680
We're leaking dhclient processes?
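(The 777 in the wc output is the line count, i.e. roughly 777 matching dhclient processes.) A quick, hypothetical way to confirm a leak rather than normal churn is to look at process ages as well as the count, e.g.:

import subprocess

def dhclient_summary():
    # etimes = elapsed seconds since the process started (Linux procps).
    out = subprocess.run(['ps', '-eo', 'etimes,comm'],
                         capture_output=True, text=True, check=True).stdout
    ages = []
    for line in out.splitlines()[1:]:
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[1].strip() == 'dhclient':
            ages.append(int(fields[0]))
    return len(ages), (max(ages) if ages else 0)

count, oldest = dhclient_summary()
print('%d dhclient processes, oldest alive for %d seconds' % (count, oldest))

A steadily growing count with very old processes points at leaked clients; a stable count with young processes would just be normal DHCP activity.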
,
May 16 2017
Will open a separate bug.
,
May 16 2017
Filed as Issue 722982.
,
May 17 2017
After rebooting chromeos-server87, the newest cave-paladin run is progressing fine so far. All server tests are passing:
http://cros-autotest-shard1.cbf.corp.google.com/results/117977748-chromeos-test/hostless/debug/
,
May 17 2017
I think it was still having problems (it passed, but was running slowly), and we've had a subsequent failure this morning:
https://uberchromegw.corp.google.com/i/chromeos/builders/cave-paladin/builds/265
,
May 18 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/bf59e756d583ad90a487ab53539c41040000ae40

commit bf59e756d583ad90a487ab53539c41040000ae40
Author: Dan Shi <dshi@google.com>
Date: Thu May 18 06:36:50 2017

[autotest] Force not to use ssh devserver call if it's not in restricted subnet

This helps to reduce flakes caused by ssh, and also improve the performance as ssh to the devserver and run curl tends to be less efficient and flaky comparing to direct wget call.

BUG=chromium:720219, chromium:722894
TEST=unittest, local run ssp test

Change-Id: I6566bc5b0b08f771330512e66beee876fb84da48
Reviewed-on: https://chromium-review.googlesource.com/506664
Tested-by: Dan Shi <dshi@google.com>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>
Commit-Queue: Dan Shi <dshi@google.com>

[modify] https://crrev.com/bf59e756d583ad90a487ab53539c41040000ae40/client/common_lib/cros/dev_server.py
[modify] https://crrev.com/bf59e756d583ad90a487ab53539c41040000ae40/client/common_lib/cros/dev_server_unittest.py
[modify] https://crrev.com/bf59e756d583ad90a487ab53539c41040000ae40/global_config.ini
,
May 18 2017
Still seeing CTS errors: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14619
,
May 18 2017
If you want to call it that (I would call it a slow server and servo timeouts):

05/18 03:37:07.704 DEBUG| suite:1210| Scheduled 16 tests, writing the total to keyval.
05/18 03:37:07.705 DEBUG| dynamic_suite:0606| Waiting on suite.
05/18 03:52:45.137 INFO | server_job:0184| START 118265287-chromeos-test/chromeos4-row9-rack11-host11/provision_AutoUpdate provision timestamp=1495103863 localtime=May 18 03:37:43
05/18 03:52:45.138 INFO | server_job:0184| FAIL 118265287-chromeos-test/chromeos4-row9-rack11-host11/provision_AutoUpdate provision timestamp=1495104440 localtime=May 18 03:47:20 Servo initialize timed out., Unhandled TimeoutError: Timeout occurred- waited 300.0 seconds.
05/18 03:52:45.138 INFO | server_job:0184| END FAIL 118265287-chromeos-test/chromeos4-row9-rack11-host11/provision_AutoUpdate provision timestamp=1495104440 localtime=May 18 03:47:20
05/18 03:52:45.139 DEBUG| suite:1474| Adding job keyval for provision=118265287-chromeos-test
05/18 03:52:45.139 DEBUG| suite:1129| Scheduling cheets_GTS.4.1_r2.GtsPlacementTestCases, to retry afe job 118265287
05/18 03:52:46.015 DEBUG| suite:1170| Job 118266494 created to retry job 118265287. Have retried for 1 time(s)
05/18 03:52:46.016 DEBUG| suite:1474| Adding job keyval for cheets_GTS.4.1_r2.GtsPlacementTestCases=118266494-chromeos-test
05/18 04:03:33.467 ERROR| db:0024| 04:03:33 05/18/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet
05/18 04:04:25.824 INFO | server_job:0184| START 118265285-chromeos-test/chromeos4-row9-rack10-host18/provision_AutoUpdate provision timestamp=1495103866 localtime=May 18 03:37:46
05/18 04:04:25.825 INFO | server_job:0184| FAIL 118265285-chromeos-test/chromeos4-row9-rack10-host18/provision_AutoUpdate provision timestamp=1495105208 localtime=May 18 04:00:08 Servo initialize timed out., Unhandled TimeoutError: Timeout occurred- waited 300.0 seconds.
05/18 04:04:25.826 INFO | server_job:0184| END FAIL 118265285-chromeos-test/chromeos4-row9-rack10-host18/provision_AutoUpdate provision timestamp=1495105208 localtime=May 18 04:00:08
05/18 04:04:25.826 DEBUG| suite:1474| Adding job keyval for provision=118265285-chromeos-test
05/18 04:04:25.827 DEBUG| suite:1129| Scheduling cheets_GTS.4.1_r2.GtsAdminTestCases, to retry afe job 118265285
05/18 04:04:26.463 DEBUG| suite:1170| Job 118266783 created to retry job 118265285. Have retried for 1 time(s)
,
May 24 2017
This looks to be fixed. Please reopen if it's still an issue.