New issue
Advanced search Search tips

Issue 844094 link

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Closed: Jul 20
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

Autotest took 27 minutes to start running login_LoginSuccess on eve-tot-chrome-pfq-informational

Project Member Reported by derat@chromium.org, May 17 2018

Issue description

http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8946288729579521152 reports that HWTest [bvt-cq] timed out on eve-tot-chrome-pfq-informational.

When I look at the suite timeline at http://cros-goldeneye/chromeos/healthmonitoring/suiteDetails?cidbBuildId=2579613, it reports that login_LoginSuccess took almost 30 minutes to run, from ~06:40 until ~07:08.

When I look at the client info log at https://storage.cloud.google.com/chromeos-autotest-results/200887109-chromeos-test/chromeos6-row4-rack17-host4/debug/client.0.INFO?_ga=2.33838753.-1412206605.1526576906, I can see that the test code itself logged messages between 07:07:08.989 and 07:07:39.735, i.e. just ~30 seconds.

When I look at the autoserv info log at https://storage.cloud.google.com/chromeos-autotest-results/200887109-chromeos-test/chromeos6-row4-rack17-host4/debug/autoserv.INFO?_ga=2.268300814.-1412206605.1526576906, I see a 27-minute pause between "Got job repo url from host attributes" and the execution of the test's control file:

05/17 06:39:57.446 INFO |          autotest:0245| Got job repo url from host attributes: http://100.115.185.226:8082/static/eve-tot-chrome-pfq-informational/R68-10686.0.0-b2579613/autotest/packages
05/17 06:40:00.655 INFO |          packages:0190| Fetching packages.checksum from http://100.115.185.226:8082/static/eve-tot-chrome-pfq-informational/R68-10686.0.0-b2579613/autotest/packages to /usr/local/autotest/packages.checksum
05/17 06:40:01.570 INFO |          packages:0207| Successfully fetched packages.checksum from http://100.115.185.226:8082/static/eve-tot-chrome-pfq-informational/R68-10686.0.0-b2579613/autotest/packages/packages.checksum
05/17 06:40:08.579 ERROR|             utils:0282| [stderr] cat: /usr/local/autotest/.checksum: No such file or directory
05/17 06:40:12.091 INFO |          autotest:0352| Installation of autotest completed using the packaging system.
05/17 06:40:14.576 INFO |          autotest:0245| Got job repo url from host attributes: http://100.115.185.226:8082/static/eve-tot-chrome-pfq-informational/R68-10686.0.0-b2579613/autotest/packages
05/17 07:07:07.262 INFO |          autotest:0974| Executing /usr/local/autotest/bin/autotest /usr/local/autotest/control phase 0
05/17 07:07:09.807 INFO |        server_job:0218| START	----	----	timestamp=1526566029	localtime=May 17 07:07:09	
05/17 07:07:09.807 INFO |        server_job:0218| 	START	login_LoginSuccess	login_LoginSuccess	timestamp=1526566029	localtime=May 17 07:07:09	
05/17 07:07:09.808 INFO |          autotest:1340| Entered autotestd_monitor.
05/17 07:07:09.808 INFO |          autotest:1340| Finished launching tail subprocesses.
05/17 07:07:09.808 INFO |          autotest:1340| Finished waiting on autotestd to start.
05/17 07:07:32.134 INFO |        server_job:0218| 		GOOD	login_LoginSuccess	login_LoginSuccess	timestamp=1526566051	localtime=May 17 07:07:31	completed successfully
05/17 07:07:32.139 INFO |        server_job:0218| 	END GOOD	login_LoginSuccess	login_LoginSuccess	timestamp=1526566051	localtime=May 17 07:07:31	
05/17 07:07:40.291 INFO |        server_job:0218| END GOOD	----	----	timestamp=1526566059	localtime=May 17 07:07:39
 
Labels: Hotlist-Deputy
Status: Archived (was: Assigned)
Looking at the sysmon metrics for that shard, the load average had jumped to 100% for a short time exactly matching the symptom here, and we'd lost other metrics (likely due to the whole server being locked up).

This incident is too old to get more information at this point, and we haven't seen it recur.

Sign in to add a comment