
Issue 852821

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Aug 3
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Make aborted suite not show error message as 'suite timed out'

Project Member Reported by steve...@chromium.org, Jun 14 2018

Issue description

Currently eve-tot-chrome-pfq-informational is failing consistently:  issue 851758 .

The builds are failing because one or more HWTest stages are timing out.

We pass an explicit timeout of 4 hours (14400 seconds) to each HWTest suite, and this timeout is exceeded (because we only have 2 DUTs):

02:52:50: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/cbuildbot-tmpvtNd1h/tmptC7vrf/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name eve-tot-chrome-pfq-informational/R69-10781.0.0-b2663993-bvt-arc --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:PFQ' '--tags=suite:bvt-arc' '--tags=build:eve-tot-chrome-pfq-informational/R69-10781.0.0-b2663993' '--tags=task_name:eve-tot-chrome-pfq-informational/R69-10781.0.0-b2663993-bvt-arc' '--tags=board:eve' -- /usr/local/autotest/site_utils/run_suite.py --build eve-tot-chrome-pfq-informational/R69-10781.0.0-b2663993 --board eve --suite_name bvt-arc --pool continuous --file_bugs True --priority PFQ --timeout_mins 180 --retry False --minimum_duts 1 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 82810451L, 'cidb_build_id': 2663993, 'datastore_parent_key': ('Build', 2663993, 'BuildStage', 82810451L)}" -c

...

  Suite job                                 [ FAILED ]
  Suite job                                   ABORT: 
  cheets_GTS.GtsPlacementTestCases          [ PASSED ]
...
  cheets_CleanShutDown_SERVER_JOB           [ FAILED ]
  cheets_CleanShutDown_SERVER_JOB             ABORT: 
  cheets_CleanShutDown                      [ PASSED ]
  cheets_DownloadsFilesystem                [ FAILED ]
  cheets_DownloadsFilesystem                  ABORT: Job aborted unexpectedly
  cheets_CameraOrientation                  [ PASSED ]
...
  Reason: Tests failed.
   06-14-2018 [05:55:05] Output below this line is for buildbot consumption:
  @@@STEP_LINK@[Test-Logs]: Suite job: ABORT@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/208421908-chromeos-test/@@@
  @@@STEP_LINK@[Test-History]: Suite job@https://stainless.corp.google.com/search?test=^Suite\ job$&first_date=2018-05-17&last_date=2018-06-14&row=model&col=build&view=matrix@@@
  @@@STEP_LINK@[Test-Logs]: cheets_CleanShutDown_SERVER_JOB: ABORT@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/208422016-chromeos-test/@@@
  @@@STEP_LINK@[Test-History]: cheets_CleanShutDown_SERVER_JOB@https://stainless.corp.google.com/search?test=^cheets\_CleanShutDown\_SERVER\_JOB$&first_date=2018-05-17&last_date=2018-06-14&row=model&col=build&view=matrix@@@
  @@@STEP_LINK@[Test-Logs]: cheets_DownloadsFilesystem: ABORT: Job aborted unexpectedly@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/208422019-chromeos-test/@@@
  @@@STEP_LINK@[Test-History]: cheets_DownloadsFilesystem@https://stainless.corp.google.com/search?test=^cheets\_DownloadsFilesystem$&first_date=2018-05-17&last_date=2018-06-14&row=model&col=build&view=matrix@@@
  Will return from run_suite with status: ERROR

The above is difficult to parse, and somewhat misleading ('Reason: Tests failed.').

It would be super helpful if we could recognize that exceeding the 4-hour time limit was the cause of these failures and show a more useful summary message, e.g. "Suite exceeded time limit: 4 hours".
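A minimal sketch of the kind of check being asked for (hypothetical names, not the actual run_suite.py reporting code): if the run ended with aborted tests and the elapsed time is at or over the configured limit, report the timeout instead of the generic 'Tests failed.':

# Hypothetical sketch only -- not the real run_suite.py code paths.
import datetime

def summarize_suite_result(start_time, timeout_mins, aborted_tests, failed_tests):
    """Pick the summary 'Reason' string for a finished suite run."""
    elapsed = datetime.datetime.now() - start_time
    limit = datetime.timedelta(minutes=timeout_mins)
    if aborted_tests and elapsed >= limit:
        # Surface the real cause instead of the generic 'Tests failed.'
        return 'Suite exceeded time limit: %d minutes' % timeout_mins
    if aborted_tests:
        return 'Suite aborted before completion (not a timeout)'
    if failed_tests:
        return 'Tests failed.'
    return 'All tests passed.'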

 
Components: -Infra Infra>Client>ChromeOS>Test
I see "Reason: Tests were aborted before running; suite must have timed out." in the most recent one: https://luci-logdog.appspot.com/v/?s=chromeos/buildbucket/cr-buildbucket.appspot.com/8943669615559812208/+/steps/HWTest__bvt-arc_/0/stdout

From: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943669615559812208

Was there something different about the run you were looking at?

Nothing that I am aware of. I looked at half a dozen failures which all appeared to have timed out and the output varied from one to the next, presumably depending on what state the DUTs were in when the test was aborted.

The example in comment #1 shows:
  cheets_CameraOrientation                    ABORT: Timed out, did not run.

Whereas my initial example shows:
  cheets_DownloadsFilesystem                  ABORT: Job aborted unexpectedly

That may explain the difference in the 'Reason' output.


Comment 3 by minch@chromium.org, Jun 25 2018

Cc: minch@chromium.org x...@chromium.org
Can still see this problem on eve- and tricky-tot-chrome-pfq-informational:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=2675468

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=2676622

Comment 4 by x...@chromium.org, Jun 29 2018

Owner: xixuan@chromium.org
This failure is still seen occasionally on eve and tricky chrome informational builders. See https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8942376870866941344 as an example.

It looks like it's a TestLabFailure. Assign to current deputy for triage.

Comment 5 by xixuan@chromium.org, Jun 29 2018

Components: -Infra>Client>ChromeOS>Test Infra>Client>ChromeOS>CI
Owner: ----
The HWTest stage takes about 15 minutes (09:16:48 ~ 09:31:27) and is about to finish, but it gets aborted because the build runs out of time.

I don't think the HWTest stage is long enough to cause this build to be aborted. The problem is that BuildPackages takes 1 hour. Not sure whether that's normal. Passing to the CI team to decide.
It's Chrome that's taking all of that build time: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/emergePackageDetails?cidbBuildId=2707560

Is this a Chrome build regression or should we up the PFQ timeouts?
Looking at the graph, runs for that builder usually take around 4 hours (~1 hour of which is BuildPackages, mostly building Chrome). That is not a behavior change.

The examples linked in comments #4 and #6 aborted after 129 and 120 minutes for some reason, so maybe an individual test timed out, or something else?

(Note: This issue was filed to discuss the messaging, not so much the individual failures, although these examples do illustrate the potential confusion of the messaging, since in the original example the total time limit was reached, and in these examples it was an individual test timeout or something else).

Components: Infra>Client>ChromeOS>Test
Owner: xixuan@chromium.org
Status: Assigned (was: Untriaged)
Yea, I agree that doesn't make sense: it looks like Autotest is aborting too early. Can we get some clarity on what is happening on the Autotest side? I dug through logs but found no explanation for the abort.

Comment 9 by x...@chromium.org, Jun 29 2018

Yes, it also doesn't seem to me that the build runs out of time. For example, this build https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=2706836 only takes 115 mins until the suite job is aborted.
See log of this build here https://00e9e64bac15685248161fd20ad79251323691ad17d0c7eadd-apidata.googleusercontent.com/download/storage/v1/b/chromeos-autotest-results/o/212793129-chromeos-test%2Fhostless%2Fdebug%2Fautoserv.DEBUG?qk=AD5uMEssfp6D9fehnMAVEb0JtfXg6G9X_z5LOpe9GwZnew-DtjbdnvZDZeHU90ChKZmEwIi0xkrgolMdrtYqPaZfTx_z72VfqJAPPtEbyM224uXxKTtfx8XtUnTBQPjF0HBC80HnblKYnCbopBTgOaYxGBqnDActug2jQdKHbiIFuhk8oJGhUdW8r7CrYjE9o6hPTR3ncYXsc5ZZp2ZaVezlbCLD1CJWPH79vWsYmf2l8aaM058nhuwp8YiMMnj7ZhyutbOi2mW9K7m_ziEcR4i5no-rnooWeAX4myUgxBntGhXEb5C418SrF30-aNuc-7kCQqPKCPXfYEA48cp7MNncc_Gm7Jzh5o7qL0MVbpNYvwzWJ_oJ3auZE5iH54B0lovqwF24n5KGW5I5e-QzhqNnpStXEnqteqZWJfaeshlkwRkikOz8L8JNAPAMveoRqOdFSEsZrzgycpvhUlJIUkBYlasF9DO39_vKk3thSUTLSJwrlZr3fH3U-blZNup4pw66x9iNg_usvfztUBXmPI0KGkp0MVO12a-A9tBzOt9f2w2xlLBvrH4voXieqd8b70sVRcWdIpcGo0Pv9TxKXAwGmHc4XMBnr0_w3iv1XNG2GdD0tu02DSOgnD3lvbyTHbwuujAVFtp99F5DiVXqmulyYolXXIrEwP_NhV-BFsZT_9tY_UshGv6mkL9b6IqlpB-54uc4wy82iwUXBZmz5Am8uob8EdRS6qQtHrAUGZjPTZrU-tSRZGN8pM7DabomekQNRM80xyLeD6Qk7Q0cIr9xuaMiK25aOjcTiARu1sDbLa-XTHL27UE 


Looking more closely at the log, I doubt this is the original issue that was filed in comment #0. "Suite job: ABORT" in the latest failed build doesn't seem to be caused by a timeout, but by some other issue. See selected error messages:


06/29 00:50:12.777 INFO |        server_job:0216| 	ABORT	----	caroline-tot-chrome-pfq-informational/R69-10827.0.0-b2706836/bvt-inline/build_RootFilesystemSize	timestamp=1530258612	localtime=Jun 29 00:50:12	
06/29 00:50:12.777 INFO |        server_job:0216| END ABORT	----	caroline-tot-chrome-pfq-informational/R69-10827.0.0-b2706836/bvt-inline/build_RootFilesystemSize	timestamp=1530258612	localtime=Jun 29 00:50:12	
06/29 00:50:13.024 ERROR|   logging_manager:0626| Current thread 0x00007fbd00b3c740:
06/29 00:50:13.025 ERROR|   logging_manager:0626|   File "/usr/local/autotest/tko/db.py", line 196 in _commit
06/29 00:50:13.025 ERROR|   logging_manager:0626|   File "/usr/local/autotest/tko/db.py", line 164 in run_with_retry
06/29 00:50:13.025 ERROR|   logging_manager:0626|   File "/usr/local/autotest/tko/db.py", line 202 in commit
06/29 00:50:13.025 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/frontend.py", line 156 in get_job_test_statuses_from_db
06/29 00:50:13.026 ERROR|   logging_manager:0626|   File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 493 in wrapper
06/29 00:50:13.026 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/job_status.py", line 170 in _yield_job_results
06/29 00:50:13.026 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/job_status.py", line 133 in wait_for_results
06/29 00:50:13.027 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 1171 in wait
06/29 00:50:13.027 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 613 in _run_suite
06/29 00:50:13.027 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 577 in _run_suite_with_spec
06/29 00:50:13.027 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 563 in _perform_reimage_and_run
06/29 00:50:13.027 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 517 in reimage_and_run
06/29 00:50:13.028 ERROR|   logging_manager:0626|   File "/usr/local/autotest/results/212793129-chromeos-test/hostless/control.srv", line 92 in <module>
06/29 00:50:13.028 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/server_job.py", line 1326 in _execute_code
06/29 00:50:13.028 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/server_job.py", line 803 in run
06/29 00:50:13.028 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/autoserv", line 603 in run_autoserv
06/29 00:50:13.029 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/autoserv", line 805 in main
06/29 00:50:13.029 ERROR|   logging_manager:0626|   File "/usr/local/autotest/server/autoserv", line 822 in <module>
06/29 00:50:13.044 DEBUG|          autoserv:0396| Received SIGTERM
06/29 00:50:13.044 DEBUG|          autoserv:0399| Finished writing to pid_file. Killing process.
Cc: xixuan@chromium.org
Labels: -Pri-2 Pri-0
Owner: gu...@chromium.org
Summary: Autotest killing jobs for timeout when actually the timeout hasn't been reached (was: HWTest timeout failure message is unclear: "Suite job: ABORT")
This killed at least one CQ run over the weekend: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/18999

The timeout is set to 90 minutes but only 12 minutes elapse before Autotest kills everything for a timeout on two of the builders.

Over to current Deputy and setting to P0.

Note: The failures in comment #10 do not appear to be due to the master timeout (unless it is much shorter in the CQ?). The master-paladin only ran for 1 hr 56 mins.

Labels: -Pri-0 Pri-1
The reasons for 'suite timed out' in #4 & #10 are different.

1) For #10, master-paladin timed out:

I checked the aborted suites on https://luci-milo.appspot.com/buildbot/chromeos/cyan-paladin/6444 & https://luci-milo.appspot.com/buildbot/chromeos/edgar-paladin/3345;

it's not actually 'suite aborted'; it's caused by TKO flakiness.

"lucifer/job_reporter_output.log" in some "aborted" jobs, like
https://stainless.corp.google.com/browse/chromeos-autotest-results/213105375-chromeos-test/
https://stainless.corp.google.com/browse/chromeos-autotest-results/213105047-chromeos-test/

shows that there's a MySQL error:

ifer_run_job: 2018/06/30 04:00:05 Sending event parsing
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 209, in <module>
    sys.exit(main(sys.argv))
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 46, in main
    ret = _main(args)
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 93, in _main
    return _run_autotest_job(args)
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 106, in _run_autotest_job
    ret = _run_lucifer_job(handler, args, job)
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 148, in _run_lucifer_job
    event_handler=event_handler, args=command_args)
  File "/usr/local/autotest/venv/lucifer/eventlib.py", line 93, in run_event_command
    _handle_subprocess_events(event_handler, proc)
  File "/usr/local/autotest/venv/lucifer/eventlib.py", line 112, in _handle_subprocess_events
    _handle_output_line(event_handler, line)
  File "/usr/local/autotest/venv/lucifer/eventlib.py", line 127, in _handle_output_line
    event_handler(event, message)
  File "/usr/local/autotest/venv/lucifer/handlers.py", line 57, in __call__
    handler(msg)
  File "/usr/local/autotest/venv/lucifer/handlers.py", line 72, in _handle_gathering
    status=models.HostQueueEntry.Status.GATHERING)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-a234a83456f26d726445fc6c8e6ce271/local/lib/python2.7/site-packages/django/db/models/query.py", line 567, in update
    rows = query.get_compiler(self.db).execute_sql(None)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-a234a83456f26d726445fc6c8e6ce271/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 1014, in execute_sql
    cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-a234a83456f26d726445fc6c8e6ce271/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 839, in execute_sql
    cursor = self.connection.cursor()
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-a234a83456f26d726445fc6c8e6ce271/local/lib/python2.7/site-packages/django/db/backends/__init__.py", line 326, in cursor
    cursor = util.CursorWrapper(self._cursor(), self)
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-a234a83456f26d726445fc6c8e6ce271/local/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 405, in _cursor
    self.connection = Database.connect(**kwargs)
  File "/usr/local/autotest/venv/autotest_lib/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/local/autotest/venv/autotest_lib/site-packages/MySQLdb/connections.py", line 187, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
_mysql_exceptions.OperationalError: (2002, "Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)")

This means the TKO database doesn't have records for these jobs, which makes dynamic_suite think they're aborted.

We may retry the TKO connection if possible. I filed  Issue 859578  to track this bug.
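A minimal sketch of the kind of retry meant here (hypothetical helper, not the actual tko/db.py implementation):

# Hypothetical retry helper -- not the actual tko/db.py code.
import time

import MySQLdb

def connect_with_retry(retries=3, delay_secs=5, **connect_kwargs):
    """Retry MySQL connections that fail with transient errors (e.g. 2002)."""
    last_error = None
    for attempt in range(retries):
        try:
            return MySQLdb.connect(**connect_kwargs)
        except MySQLdb.OperationalError as e:
            last_error = e
            time.sleep(delay_secs * (attempt + 1))  # simple linear backoff
    raise last_error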


2) For 'suite timed out' in #4 & #9, debug/autoserv.DEBUG shows that the suite script received an abort signal:

06/29 09:28:31.148 DEBUG|          autoserv:0396| Received SIGTERM

The logs in #1 also show that the test received SIGTERM, so it stopped running. This is possibly caused by a build timeout, a suite timeout, or the abort_suite.py script. Investigating.

Owner: ----
OK, I found the source of the SIGTERM.

For every build, we abort all suites kicked off by the previous build in the "BuildReexecutionFinished" stage, then kick off new suites in the "HWTest" stage, to make sure we have enough DUTs to test.

However, two consecutive tricky-tot-chrome-pfq-informational builds overlapped:

The first build https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8942376870866941344 is still running; its HWTest is kicked off at 09:16:48.

The next build https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8942369242728658960 is created at 2018-06-29 09:19. In particular, its "BuildReexecutionFinished" stage is kicked off at 09:27, and the bvt-cq suite of the previous build is aborted at:

9:28:12: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/t/cbuildbot-tmpaWcZnX/tmplkhFj1/temp_summary.json --print-status-updates --raw-cmd --task-name abort-tricky-tot-chrome-pfq-informational/R69-10828.0.0-b2707815-bvt-cq --dimension os Ubuntu-14.04 --dimension pool default --expiration 1200 -- /usr/local/autotest/site_utils/abort_suite.py -i tricky-tot-chrome-pfq-informational/R69-10828.0.0-b2707815 -s bvt-cq
Triggered task: abort-tricky-tot-chrome-pfq-informational/R69-10828.0.0-b2707815-bvt-cq
chromeos-golo-server2-286: 3e6531273cea7410 0
  2018-06-29 09:28:19,193 INFO | The following suites have been aborted:
  [{'Job': 212926466, 'Job name': 'tricky-tot-chrome-pfq-informational/R69-10828.0.0-b2707815-test_suites/control.bvt-cq', 'HostQueueEntry': 213527531}]

which almost matches the SIGTERM received by the first build (in debug/autoserv.DEBUG of https://stainless.corp.google.com/browse/chromeos-autotest-results/212926466-chromeos-test/):

06/29 09:28:53.776 DEBUG|          autoserv:0396| Received SIGTERM


Proposed fixes: 1) don't create the next tricky-tot-chrome-pfq-informational build before the previous tricky-tot-chrome-pfq-informational build has finished, or 2) don't abort the previous build's suites, though that could queue up more and more jobs as more builds are scheduled.
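A minimal sketch of what option 2) could look like as a guard rather than removing the abort entirely (hypothetical stage logic, not actual cbuildbot code): only abort the previous build's suites once that build is no longer running, so an in-flight HWTest stage doesn't get SIGTERMed.

# Hypothetical sketch -- not the actual BuildReexecutionFinished stage code.
def maybe_abort_previous_suites(previous_build, abort_fn):
    """Abort the previous build's suites only if that build has finished.

    previous_build: object with a 'status' attribute ('running', 'passed', ...).
    abort_fn: callable that triggers abort_suite.py for the previous build.
    """
    if previous_build is None:
        return
    if previous_build.status == 'running':
        # The previous build may still be in its HWTest stage; aborting now
        # would SIGTERM its suite jobs, which is the failure seen in this bug.
        return
    abort_fn(previous_build)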

Passing back to the CI team to find an owner for this. If 2) is the solution, please assign it back and we can remove the abort_suite.py run.


Why would we have ever thought that it was a good idea to cancel the last run's hardware tests well in advance of actually needing to run the current build's hardware tests? That's what I don't understand. We run BuildReexecutionFinished at 09:27 but we don't run HWTest until 10:59.

So the bots would have time to finish repairing before we tried to use them again.
Doesn't the current build's image need to be repaired onto the DUTs? I don't think I understand why we want to repair an additional time ahead of that.

Not repaired, provisioned.

The repair process does a bunch of things to ensure the DUT is in a good state for the next test, including making sure it's healthy enough for a provision to succeed.
Owner: xixuan@chromium.org
I've opened  https://crbug.com/859683  to track increasing the time between PFQ runs.

Xixuan, please use this bug to track changing the 'suite timed out' error messaging around this to something more informative.
Labels: -Pri-1 Pri-2
Summary: Make aborted suite not show error message as 'suite timed out' (was: Autotest killing jobs for timeout when actually the timeout hasn't been reached)
In Skylab, if a suite job is aborted, all its child tasks will be canceled, e.g.

https://chrome-swarming.appspot.com/task?id=3e76ad910e59cf10&refresh=10

If a suite job finds that a child task has been canceled, it will report itself as 'Canceled' (in the logs) and return INFRA_FAILURE, e.g.

https://chrome-swarming.appspot.com/task?id=3e76acc27c062911&refresh=10&show_raw=1

So this won't be a problem in Skylab. Downgrading to P2 as we aim to migrate the CQ to Skylab soon.
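For illustration only, the behavior described above boils down to a mapping like this (hypothetical function, not actual Skylab code):

# Illustrative sketch of the described Skylab behavior; not real Skylab code.
def suite_verdict(child_task_states):
    """Map Swarming child task states to a suite-level verdict."""
    if any(state == 'CANCELED' for state in child_task_states):
        # A canceled child means the suite was aborted; report it as an
        # infrastructure failure rather than 'suite timed out'.
        return 'INFRA_FAILURE'
    if all(state == 'COMPLETED' for state in child_task_states):
        return 'SUCCESS'
    return 'FAILURE'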
Just closing the bug sounds fine too if it's more likely that Skylab will be here before we get to this.
Status: WontFix (was: Assigned)
