New issue
Advanced search Search tips

Issue 796348 link

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

guado_moblab-paladin failed to report failure

Reported by jrbarnette@chromium.org, Dec 19 2017

Issue description

This paladin run failed:
    https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8168

The failure was because of this CL:
    https://chromium-review.googlesource.com/#/c/chromiumos/third_party/autotest/+/816074/

Because the failure was pegged as an infrastructure failure, the
CL was retried immediately here:
    https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8169

That paladin reported itself completely green.  Yet, the
"HWTest [moblab_quick]" stage log shows failures:
  moblab_DummyServerSuite: http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=163672756
  12-19-2017 [04:31:43] Suite job is finished.
  Suite timed out. Started on 12-19-2017 [02:53:48], timed out on 12-19-2017 [04:31:43]
  12-19-2017 [04:31:43] Start collecting test results and dump them to json.
  Suite job         [ PASSED ]
  moblab_RunSuite   [ FAILED ]
  moblab_RunSuite     ABORT: None
[ ... ]
  Reason: Suite job timed out.
[ ... ]
 12-19-2017 [04:31:58] Output below this line is for buildbot consumption:
Will return from run_suite with status: SUITE_TIMEOUT

Given that the run was an utter failure, why did the paladin
report success?

 
Cc: nxia@chromium.org pprabhu@chromium.org dgarr...@chromium.org
Problem is in the result reported by run_suite

#JSON_START#{"autotest_instance": "cautotest-prod", "return_code": 0, "suite_job_id": 163672730, "suite_timings": {"artifact_download_end": "2017-12-19 02:54:14", "download_start": "2017-12-19 02:53:37", "payload_download_end": "2017-12-19 02:53:45", "suite_start": "2017-12-19 02:54:08", "tests_end": "9999-12-31 23:59:59.999999", "tests_start": "2017-12-19 03:07:46"}, "tests": {"Suite job": {"attributes": [], "link_to_logs": "http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/163672730-chromeos-test/", "reason": "", "retry_count": 0, "sponge_url": "http://tests/4d78c3f5-7e6d-43c0-97b7-a22121c669f6", "status": "GOOD"}, "moblab_RunSuite": {"attributes": ["suite:moblab_quick", "subsystem:default"], "link_to_logs": "http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/163672756-chromeos-test/", "reason": "None", "retry_count": 0, "sponge_url": "http://tests/4d78c3f5-7e6d-43c0-97b7-a22121c669f6", "status": "ABORT"}}}#JSON_END#


return_code was 0 which will be interpreted as "nothing went wrong" here: https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/commands.py?l=900

The suite should have shown up as failed, but it suceeded, even though the test inside it had been aborted.
It is possible to re-run just the return result computation by:

pprabhu@pprabhu:files$ /usr/local/autotest/site_utils/run_suite.py --build guado_moblab-paladin/R65-10228.0.0-rc4 --board guado_moblab --suite_name moblab_quick -m 163672730 -w cautotest
Autotest instance created: cautotest
12-19-2017 [02:53:48] Created suite job: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=163672730
@@@STEP_LINK@Link to suite@http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=163672730@@@
12-19-2017 [15:26:21] Suite job is finished.
12-19-2017 [15:26:21] Start collecting test results and dump them to json.
Suite job         [ PASSED ]
moblab_RunSuite   [ FAILED ]
moblab_RunSuite     ABORT: None

Suite timings:
Downloads started at 2017-12-19 02:53:37
Payload downloads ended at 2017-12-19 02:53:45
Suite started at 2017-12-19 02:54:08
Artifact downloads ended (at latest) at 2017-12-19 02:54:14
Testing started at 2017-12-19 03:07:46
Testing ended at 9999-12-31 23:59:59.999999


Links to test logs:
Suite job http://cautotest/tko/retrieve_logs.cgi?job=/results/163672730-pprabhu/
moblab_RunSuite http://cautotest/tko/retrieve_logs.cgi?job=/results/163672756-pprabhu/



 12-19-2017 [15:26:43] Output below this line is for buildbot consumption:
Will return from run_suite with status: OK

Cc: akes...@chromium.org
poking around with pdb, I see that the test was forgiven because we determined that it was aborted, but it had not reached its own timeout:

(Pdb) (end_t - start_t).total_seconds()/60.0
79.8
(Pdb) test_view.afe_job.max_runtime_mins
1440

... and looking at the test logs themselves:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/163672756-chromeos-test/chromeos2-row1-rack8-host1/

status.log says abort (some 1.5 hours after start time),
parse.log says abort

but debug/autoserv.DEBUG finished just fine.

This is the case of parsing choking on something in the logs and then marking the test as aborted.
That + our run_suite logic treats such aborts as PASS => we pass a test that actually may have passed / failed, but that got marked as aborted because parsing is stupid.

I've seen this before. Lemme look up that bug.

https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/site_utils/run_suite.py?l=1170


Cc: -akes...@chromium.org ayatane@chromium.org
- akeshet +aytane

I think we should start failing in run_suite where a test is aborted for an unknown reason. Marking that run as passed is too naive overall.
Why did this problem only occur once out of some half dozen
opportunities?

This happens when parsing dies in a weird way. It used to be that parsing would die every time it straddled a scheduler restart but I fixed that a few months ago.
I don't know what was special about this one.
Project Member

Comment 11 by bugdroid1@chromium.org, Jan 6 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/316180c95a3ff2f44db8b75804cab5760c571762

commit 316180c95a3ff2f44db8b75804cab5760c571762
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Sat Jan 06 05:14:13 2018

autotest: Tweak run_suite return_code inference to fail aborted tests.

BUG=chromium:796348
TEST=None

Change-Id: I44fde5fdb5b61527e84b1fdd68c4c18097094712
Reviewed-on: https://chromium-review.googlesource.com/835363
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/316180c95a3ff2f44db8b75804cab5760c571762/site_utils/run_suite.py
[modify] https://crrev.com/316180c95a3ff2f44db8b75804cab5760c571762/site_utils/run_suite_unittest.py

Components: Infra>Client>ChromeOS>CI
Components: -Infra>Client>ChromeOS

Comment 14 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org

Sign in to add a comment