
Issue 722894

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Last visit > 30 days ago
Closed: May 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




server test failures (CTS/GTS and others) appear to be infra related

Project Member Reported by davidri...@chromium.org, May 16 2017

Issue description

There was a container update last week, maybe related.
Cc: chingcodes@chromium.org
Issue 722895 has been merged into this issue.
dshi@, is this related to the recent container update?
I've been seeing vmlog file upload errors in several of the failed builds; is this related, or a separate bug?

scp: /var/log/vmlog/vmlog.1.LATEST: No such file or directory
scp: /var/log/vmlog/vmlog.1.PREVIOUS: No such file or directory

Comment 5 by dshi@chromium.org, May 16 2017

The scp failure is not the cause of the test timing out. The status.log shows:
START	cheets_CTS_N.CtsOpenGLTestCases	cheets_CTS_N.CtsOpenGLTestCases	timestamp=1494945131	localtime=May 16 07:32:11	
INFO	----	----	Job aborted by autotest_system on 2017-05-16 07:32:59

The suite job started at 05/16 06:01:40, so the suite timed out after 90 minutes.

We need to find out whether that's a DUT shortage or whether some tests ran for too long.
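
For reference, the 90-minute window checks out from the timestamps alone. A minimal Python sketch, using the times quoted in the log excerpt and the comment above:

from datetime import datetime, timedelta

# Times quoted above (lab local time, May 16 2017).
suite_start = datetime(2017, 5, 16, 6, 1, 40)   # suite job started
test_start = datetime(2017, 5, 16, 7, 32, 11)   # cheets_CTS_N START
aborted = datetime(2017, 5, 16, 7, 32, 59)      # aborted by autotest_system

print(test_start - suite_start)                       # 1:30:31
print(aborted - suite_start > timedelta(minutes=90))  # True: past the window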
c#5: Which failure are you talking about?

Comment 7 by dshi@chromium.org, May 16 2017

job 117848005 timed out in downloading autotest_server_package.tar.bz2

05/16 07:21:47.189 DEBUG|             utils:0202| Running 'ssh 172.24.184.160 'curl "http://172.24.184.160:8082/static/cave-paladin/R60-9557.0.0-rc2/autotest_server_package.tar.bz2"''
05/16 07:26:47.683 WARNI|             utils:0929| run process timeout (300) fired on: ssh 172.24.184.160 'curl "http://172.24.184.160:8082/static/cave-paladin/R60-9557.0.0-rc2/autotest_server_package.tar.bz2"'


The chosen devserver is 172.24.184.160, which is a ganeti devserver (server87). It seems we have some performance issues with the ganeti devservers.
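
For what it's worth, the two download paths look roughly like this. A minimal sketch, assuming Python 3.5+; the host and URL are the ones from the log, the wget flags are standard, but the exact autotest invocation may differ:

import subprocess

DEVSERVER = "172.24.184.160"
URL = ("http://%s:8082/static/cave-paladin/R60-9557.0.0-rc2/"
       "autotest_server_package.tar.bz2" % DEVSERVER)

# Current path: ssh to the devserver and run curl there, under a hard
# 300-second process timeout. Any slowness (ssh handshake or transfer)
# trips the timeout even while bytes are still flowing.
try:
    subprocess.run(["ssh", DEVSERVER, 'curl "%s"' % URL],
                   timeout=300, check=True, stdout=subprocess.DEVNULL)
except subprocess.TimeoutExpired:
    print("run process timeout (300) fired")  # the failure mode in the log

# Alternative: fetch directly over http; wget applies its own read timeout
# (-T) and retries (-t), so only a genuinely stalled transfer fails.
subprocess.run(["wget", "-q", "-T", "300", "-t", "3", URL], check=True)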
For kevin-paladin:1099, which is reported as having timed out:
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/1099
kevin-paladin: The HWTest [bvt-inline] stage failed: ** Suite timed out before completion ** 

I don't think it's actually a timeout:
** Start Stage HWTest [bvt-inline] - Tue, 16 May 2017 05:54:54 -0700 (PDT)
07:12:46: INFO: Translating result ** Suite timed out before completion ** to fail.

And that appears to be a 9000-second (2.5h) timeout, while only about 78 minutes elapsed between the stage start and the failure.

Suite details show two tests not getting run and a provision failure, which the subsequent repair showed to be fine:
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1521764

Comment 9 by dshi@chromium.org, May 16 2017

We should consider disabling the ssh ... curl ... path when downloading from a ganeti devserver; a direct wget over http should work.
I suspect the timeout comes from ssh. Also, curl is flakier than wget.
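
A hypothetical sketch of where this points (and what the fix in comment 23 eventually did): keep the ssh+curl path only for hosts in restricted subnets that cannot fetch over http directly. The subnet below is an illustrative placeholder, not the lab's real config:

import ipaddress

RESTRICTED_SUBNETS = [ipaddress.ip_network("100.115.0.0/16")]  # placeholder

def needs_ssh_devserver(host_ip):
    # Only hosts inside a restricted subnet need the download proxied
    # over ssh; everyone else can hit http directly.
    addr = ipaddress.ip_address(host_ip)
    return any(addr in net for net in RESTRICTED_SUBNETS)

def download_cmd(host_ip, url):
    if needs_ssh_devserver(host_ip):
        return ["ssh", host_ip, 'curl "%s"' % url]  # flaky path, last resort
    return ["wget", "-q", url]                      # direct, preferred

print(download_cmd("172.24.184.160", "http://devserver:8082/static/pkg.tar.bz2"))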
What should we do in the short term? Have we decided that the unresponsive devserver is the cause (or ssh/curl)?

Comment 11 by dshi@chromium.org, May 16 2017

test job 117853646 (https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/117853646-chromeos-test/chromeos4-row9-rack11-host11/debug/) timed out for a different reason, it failed to get job_repo_url from host attribute.

I thought the host attributes were already bundled in the control file, so the test shouldn't need to make RPC calls. On the other hand, the test running inside the container should be able to make RPCs unless the shard AFE server is hung?

+phobbs on the rpc issue
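
To make the question concrete, the lookup order being debated is roughly the following. A hypothetical sketch with illustrative names, not autotest's actual API:

# Prefer the attributes bundled with the job (no RPC needed inside the
# container); fall back to an AFE RPC only when they are missing -- the
# fallback is what hangs if the shard AFE server is unresponsive.
def get_job_repo_url(bundled_attributes, rpc_lookup):
    url = bundled_attributes.get("job_repo_url")
    if url:
        return url  # came with the control file; no network call needed
    return rpc_lookup("job_repo_url")  # blocks if the shard AFE is hung

# Usage: the attribute is bundled, so no RPC is attempted.
attrs = {"job_repo_url": "http://devserver:8082/static/cave-paladin/R60-9557.0.0-rc3"}
print(get_job_repo_url(attrs, rpc_lookup=lambda key: None))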

Comment 12 by dshi@chromium.org, May 16 2017

For the devserver issue, it seems that server87 was not heavily loaded during the test timeout:
https://viceroy.corp.google.com/chromeos/machines?hostname=chromeos-server87&build_id=1521764&duration=1d&refresh=-1

I'll see if I can put together a quick fix to get rid of ssh first.
https://uberchromegw.corp.google.com/i/chromeos/builders/cave-paladin/builds/258

provision_AutoUpdate.double failed:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/117872662-chromeos-test/
status.log:
FAIL	----	----	timestamp=1494952345	localtime=May 16 09:32:25	Failed to setup container for test: retry exception (function="download_extract()"), timeout = 900s. Check logs in ssp_logs folder for more details.
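
That failure text comes from a retry-with-deadline pattern. A minimal sketch of the idea (an illustration, not autotest's actual retry module):

import time

def retry_until(func, timeout_s=900, delay_s=5):
    # Keep retrying a flaky call until it succeeds or the overall deadline
    # passes, then surface the last error -- the shape of the 900s failure
    # reported above for download_extract().
    deadline = time.time() + timeout_s
    last_exc = None
    while time.time() < deadline:
        try:
            return func()
        except Exception as e:  # deliberately broad for the sketch
            last_exc = e
            time.sleep(delay_s)
    raise RuntimeError('retry exception (function="%s()"), timeout = %ds: %s'
                       % (func.__name__, timeout_s, last_exc))

def download_extract():
    raise IOError("devserver did not respond")  # stand-in for the real work

# retry_until(download_extract)  # would raise after ~900s of failed retries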

platform_OSLimits had the following exception aborting the job:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/117872747-chromeos-test/
debug/autoserv.ERROR
05/16 10:02:51.303 ERROR|        server_job:0809| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/usr/local/autotest/server/server_job.py", line 801, in run
    self._execute_code(server_control_file, namespace)
  File "/usr/local/autotest/server/server_job.py", line 1301, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/usr/local/autotest/results/117872747-chromeos-test/chromeos6-row2-rack18-host2/control.srv", line 10, in <module>
    job.parallel_simple(run_client, machines)
  File "/usr/local/autotest/server/server_job.py", line 625, in parallel_simple
    return_results=return_results)
  File "/usr/local/autotest/server/subcommand.py", line 93, in parallel_simple
    function(arg)
  File "/usr/local/autotest/results/117872747-chromeos-test/chromeos6-row2-rack18-host2/control.srv", line 7, in run_client
    at.run(control, host=host, use_packaging=use_packaging)
  File "/usr/local/autotest/server/autotest.py", line 381, in run
    client_disconnect_timeout, use_packaging=use_packaging)
  File "/usr/local/autotest/server/autotest.py", line 464, in _do_run
    client_disconnect_timeout=client_disconnect_timeout)
  File "/usr/local/autotest/server/autotest.py", line 943, in execute_control
    'client %s: %s', (self.host.name, last))
AttributeError: 'chromeos6-row2-rack18-host2_host' object has no attribute 'name'
05/16 10:02:51.308 ERROR|   logging_manager:0626| tko parser: {'builds': "{'cros-version': 'cave-paladin/R60-9557.0.0-rc3'}", 'job_started': 1494953819, 'offload_failures_only': 'True', 'hostname': 'chromeos6-row2-rack18-host2', 'status_version': 1, 'label': 'cave-paladin/R60-9557.0.0-rc3/bvt-inline/platform_OSLimits', 'parent_job_id': 117872595, 'drone': 'chromeos-server108.mtv.corp.google.com', 'user': 'chromeos-test', 'suite': 'bvt-inline', 'job_queued': 1494950201, 'experimental': 'False', 'build': 'cave-paladin/R60-9557.0.0-rc3'}
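
Note the traceback is itself a secondary bug: the error-reporting path dereferences self.host.name, but autotest host objects carry hostname, so the report crashes and masks the original client failure. A hedged sketch of a defensive fix, with illustrative names and message (not the actual patch):

import logging

class Host(object):
    # Stand-in for an autotest host object: it exposes hostname, not name.
    def __init__(self, hostname):
        self.hostname = hostname

def report_missing_status(host, last_line):
    # Using hostname (guarded with getattr) keeps the real failure visible
    # instead of raising a masking AttributeError.
    name = getattr(host, "hostname", "<unknown host>")
    logging.error("Missing final status message from client %s: %s",
                  name, last_line)

report_missing_status(Host("chromeos6-row2-rack18-host2"), "END FAIL ...")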

There were also numerous tests which weren't scheduled.

c#13:
s/weren't scheduled/weren't run/

https://viceroy.corp.google.com/chromeos/suite_details?build_id=1521952

Comment 15 by ihf@chromium.org, May 16 2017

Cc: ihf@chromium.org
Well, chromeos-server87 is in HOT. Can we not run CTS from there?

Aviv, can you move cave to one of the MTV servers?
atest shard list | grep cave
152  chromeos-server108.mtv.corp.google.com  board:celes, board:lars, board:cave
chromeos-test@chromeos-server87:~$ ps aux | grep dhclient | wc
    777    8560   57680

We're leaking dhclient processes?
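
Roughly what that one-liner measures, minus the classic pitfall that "ps aux | grep dhclient" also counts the grep itself. A small sketch using pgrep, assuming Python 3.7+; the threshold is illustrative:

import subprocess

out = subprocess.run(["pgrep", "-c", "dhclient"],
                     capture_output=True, text=True)
count = int(out.stdout or 0)
print("dhclient processes:", count)
if count > 50:  # illustrative threshold; 777 clearly indicates a leak
    print("likely leaking dhclient processes")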
Cc: ayatane@chromium.org
Will open a separate bug.
Filed at Issue 722982

Comment 21 by ihf@chromium.org, May 17 2017

Summary: server test failures (CTS/GTS and others) appear to be infra related (was: CTS/GTS failures appear to be infra related)
After rebooting chromeos-server87, the newest cave paladin run is progressing fine so far. All server tests are passing...
http://cros-autotest-shard1.cbf.corp.google.com/results/117977748-chromeos-test/hostless/debug/
I think it was still having problems (it passed, but was running slowly), and we've had a subsequent failure this morning:
https://uberchromegw.corp.google.com/i/chromeos/builders/cave-paladin/builds/265

Project Member Comment 23 by bugdroid1@chromium.org, May 18 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/bf59e756d583ad90a487ab53539c41040000ae40

commit bf59e756d583ad90a487ab53539c41040000ae40
Author: Dan Shi <dshi@google.com>
Date: Thu May 18 06:36:50 2017

[autotest] Force not to use ssh devserver call if it's not in restricted subnet

This helps to reduce flakes caused by ssh, and also improve the performance as
ssh to the devserver and run curl tends to be less efficient and flaky comparing
to direct wget call.

BUG=chromium:720219, chromium:722894 
TEST=unittest, local run ssp test

Change-Id: I6566bc5b0b08f771330512e66beee876fb84da48
Reviewed-on: https://chromium-review.googlesource.com/506664
Tested-by: Dan Shi <dshi@google.com>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>
Commit-Queue: Dan Shi <dshi@google.com>

[modify] https://crrev.com/bf59e756d583ad90a487ab53539c41040000ae40/client/common_lib/cros/dev_server.py
[modify] https://crrev.com/bf59e756d583ad90a487ab53539c41040000ae40/client/common_lib/cros/dev_server_unittest.py
[modify] https://crrev.com/bf59e756d583ad90a487ab53539c41040000ae40/global_config.ini

Comment 26 by ihf@chromium.org, May 18 2017

If you want to call it that (I would call it a slow server and servo timeouts):

05/18 03:37:07.704 DEBUG|             suite:1210| Scheduled 16 tests, writing the total to keyval.
05/18 03:37:07.705 DEBUG|     dynamic_suite:0606| Waiting on suite.
05/18 03:52:45.137 INFO |        server_job:0184| START	118265287-chromeos-test/chromeos4-row9-rack11-host11/provision_AutoUpdate	provision	timestamp=1495103863	localtime=May 18 03:37:43	
05/18 03:52:45.138 INFO |        server_job:0184| 	FAIL	118265287-chromeos-test/chromeos4-row9-rack11-host11/provision_AutoUpdate	provision	timestamp=1495104440	localtime=May 18 03:47:20	Servo initialize timed out., Unhandled TimeoutError: Timeout occurred- waited 300.0 seconds.
05/18 03:52:45.138 INFO |        server_job:0184| END FAIL	118265287-chromeos-test/chromeos4-row9-rack11-host11/provision_AutoUpdate	provision	timestamp=1495104440	localtime=May 18 03:47:20	
05/18 03:52:45.139 DEBUG|             suite:1474| Adding job keyval for provision=118265287-chromeos-test
05/18 03:52:45.139 DEBUG|             suite:1129| Scheduling cheets_GTS.4.1_r2.GtsPlacementTestCases, to retry afe job 118265287
05/18 03:52:46.015 DEBUG|             suite:1170| Job 118266494 created to retry job 118265287. Have retried for 1 time(s)
05/18 03:52:46.016 DEBUG|             suite:1474| Adding job keyval for cheets_GTS.4.1_r2.GtsPlacementTestCases=118266494-chromeos-test
05/18 04:03:33.467 ERROR|                db:0024| 04:03:33 05/18/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet
05/18 04:04:25.824 INFO |        server_job:0184| START	118265285-chromeos-test/chromeos4-row9-rack10-host18/provision_AutoUpdate	provision	timestamp=1495103866	localtime=May 18 03:37:46	
05/18 04:04:25.825 INFO |        server_job:0184| 	FAIL	118265285-chromeos-test/chromeos4-row9-rack10-host18/provision_AutoUpdate	provision	timestamp=1495105208	localtime=May 18 04:00:08	Servo initialize timed out., Unhandled TimeoutError: Timeout occurred- waited 300.0 seconds.
05/18 04:04:25.826 INFO |        server_job:0184| END FAIL	118265285-chromeos-test/chromeos4-row9-rack10-host18/provision_AutoUpdate	provision	timestamp=1495105208	localtime=May 18 04:00:08	
05/18 04:04:25.826 DEBUG|             suite:1474| Adding job keyval for provision=118265285-chromeos-test
05/18 04:04:25.827 DEBUG|             suite:1129| Scheduling cheets_GTS.4.1_r2.GtsAdminTestCases, to retry afe job 118265285
05/18 04:04:26.463 DEBUG|             suite:1170| Job 118266783 created to retry job 118265285. Have retried for 1 time(s)
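
Those servo failures have the standard bounded-wait shape: poll for readiness until a 300-second deadline, then raise. A minimal sketch of the pattern (not servo's actual code; the readiness check is hypothetical):

import time

def poll_for_condition(condition, timeout_s=300.0, sleep_s=2.0):
    # Poll until condition() is true or the deadline passes, then raise --
    # matching the "waited 300.0 seconds" TimeoutError in the log above.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if condition():
            return
        time.sleep(sleep_s)
    raise RuntimeError("Timeout occurred- waited %.1f seconds." % timeout_s)

# poll_for_condition(lambda: servo_is_ready())  # hypothetical readiness check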
Status: Fixed (was: Available)
This looks to be fixed. Please reopen if it's still an issue.
Labels: VerifyIn-61

Comment 29 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
