Issue 824813

Starred by 1 user

Issue metadata

Status: Fixed
Owner: ----
Closed: Mar 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Suite job hung inexplicably.

Reported by jrbarnette@chromium.org, Mar 22 2018

Issue description

This CQ slave run failed:
    https://luci-milo.appspot.com/buildbot/chromeos/caroline-paladin/2963

The failure is a suite abort for this suite job:
    http://cautotest-prod/afe/#tab_id=view_job&object_id=185678000

The suite reported that this job aborted:
    http://cautotest-prod/afe/#tab_id=view_job&object_id=185678341

The job is reported as aborted, but in fact the logs show that the
test ran, completed, and passed:
    https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/185678341-chromeos-test/chromeos6-row1-rack23-host9/

The suite logs clearly show job 185678341 being created. However, they stop
abruptly about 1h20m before the timeout, with every job except the failed one
apparently reported as good.  Here are the final lines:
====
03/22 06:57:40.894 INFO |        server_job:0218| START	185678419-chromeos-test/chromeos6-row2-rack21-host13/cheets_WindowManagerTest	cheets_WindowManagerTest	timestamp=1521726590	localtime=Mar 22 06:49:50	
03/22 06:57:40.894 INFO |        server_job:0218| 	GOOD	185678419-chromeos-test/chromeos6-row2-rack21-host13/cheets_WindowManagerTest	cheets_WindowManagerTest	timestamp=1521726985	localtime=Mar 22 06:56:25	completed successfully
03/22 06:57:40.895 INFO |        server_job:0218| END GOOD	185678419-chromeos-test/chromeos6-row2-rack21-host13/cheets_WindowManagerTest	cheets_WindowManagerTest	timestamp=1521726985	localtime=Mar 22 06:56:25	
03/22 06:57:40.895 DEBUG|             suite:1450| Adding job keyval for cheets_WindowManagerTest=185678419-chromeos-test
====

By contrast, here's what a successful conclusion should look like:
====
03/22 04:51:17.275 INFO |        server_job:0218| START	185663407-chromeos-test/chromeos6-row1-rack23-host1/cheets_StartAndroid.stress	cheets_StartAndroid.stress	timestamp=1521719054	localtime=Mar 22 04:44:14	
03/22 04:51:17.277 INFO |        server_job:0218| 	GOOD	185663407-chromeos-test/chromeos6-row1-rack23-host1/cheets_StartAndroid.stress	cheets_StartAndroid.stress	timestamp=1521719335	localtime=Mar 22 04:48:55	completed successfully
03/22 04:51:17.277 INFO |        server_job:0218| END GOOD	185663407-chromeos-test/chromeos6-row1-rack23-host1/cheets_StartAndroid.stress	cheets_StartAndroid.stress	timestamp=1521719335	localtime=Mar 22 04:48:55	
03/22 04:51:17.278 DEBUG|             suite:1450| Adding job keyval for cheets_StartAndroid.stress=185663407-chromeos-test
03/22 04:51:40.888 DEBUG|     dynamic_suite:0614| Finished waiting on suite. Returning from _perform_reimage_and_run.
03/22 04:51:40.889 DEBUG|     dynamic_suite:0519| Returning from dynamic_suite.reimage_and_run.
03/22 04:51:40.889 INFO |        server_job:0885| Finished processing control file
03/22 04:51:40.891 WARNI|        subcommand:0085| parallel_simple was called with an empty arglist, did you forget to pass in a list of machines?
03/22 04:51:40.901 INFO |            client:0570| Attempting refresh to obtain initial access_token
03/22 04:51:40.961 INFO |            client:0872| Refreshing access_token
03/22 04:51:41.537 DEBUG|   logging_manager:0627| Logging subprocess finished
03/22 04:51:41.540 DEBUG|   logging_manager:0627| Logging subprocess finished
====
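
For triage, the difference between the two excerpts above can be checked mechanically. The following is only a rough sketch, not part of autotest: the function name `summarize_suite_log` is made up for illustration, and it assumes that the "Finished waiting on suite" line (copied from the successful log above) is a reliable sign that the suite ran to completion.

```python
import re

# Assumption: this marker, taken from the successful log excerpt, appears
# only when the suite finished waiting on its child jobs.
COMPLETION_MARKER = "Finished waiting on suite"

# Matches the leading "MM/DD HH:MM:SS.mmm" timestamp used in the excerpts.
TIMESTAMP_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def summarize_suite_log(lines):
    """Return (completed, last_timestamp) for an iterable of log lines.

    Hypothetical helper: `completed` is True if the completion marker was
    seen; `last_timestamp` is the timestamp of the last line that had one.
    """
    completed = False
    last_timestamp = None
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match:
            last_timestamp = match.group(1)
        if COMPLETION_MARKER in line:
            completed = True
    return completed, last_timestamp
```

A log that reports `completed` as False with a `last_timestamp` well before the suite timeout would match the symptom described here.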

 
The shard serving caroline went down earlier today.  In summary,
database corruption caused the shard-client process to go into a
crash loop.  Specifically, multiple aspects of host
chromeos3-row1-rack2-host8 were wrong:
  * The host had two (identical) HWID attribute settings.
  * The host was labeled with both board:stumpy _and_ board:whirlwind.

That problem has been temporarily(?) resolved by editing the database
to delete the problem host entry.
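
The two kinds of corruption listed above could be caught by a simple consistency check. This is only a sketch under stated assumptions: it does not query the real AFE database, the function name and input shapes (a list of label strings and a list of attribute name/value pairs) are hypothetical, and the HWID value in the usage example is a placeholder, since the actual value is not recorded in this bug.

```python
from collections import Counter


def find_host_inconsistencies(labels, attributes):
    """Return a list of human-readable problems for one host.

    Hypothetical inputs (not the real AFE schema):
      labels:     list of label strings, e.g. "board:stumpy"
      attributes: list of (name, value) pairs
    """
    problems = []

    # A host should carry at most one distinct board label.
    boards = sorted({label for label in labels if label.startswith("board:")})
    if len(boards) > 1:
        problems.append("multiple board labels: " + ", ".join(boards))

    # Each attribute name should be set at most once, even if the
    # duplicate values are identical.
    counts = Counter(name for name, _ in attributes)
    for name, count in sorted(counts.items()):
        if count > 1:
            problems.append("attribute %s set %d times" % (name, count))

    return problems


# Usage with data shaped like the corruption described above; the HWID
# value is a placeholder, not the real one.
print(find_host_inconsistencies(
    ["board:stumpy", "board:whirlwind"],
    [("HWID", "PLACEHOLDER"), ("HWID", "PLACEHOLDER")],
))
```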

Status: Fixed (was: Untriaged)
