guado_moblab-paladin failed to report failure
Reported by
jrbarnette@chromium.org,
Dec 19 2017
|
||||||
Issue description
This paladin run failed:
https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8168
The failure was because of this CL:
https://chromium-review.googlesource.com/#/c/chromiumos/third_party/autotest/+/816074/
Because the failure was pegged as an infrastructure failure, the
CL was retried immediately here:
https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8169
That paladin reported itself completely green. Yet, the
"HWTest [moblab_quick]" stage log shows failures:
moblab_DummyServerSuite: http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=163672756
12-19-2017 [04:31:43] Suite job is finished.
Suite timed out. Started on 12-19-2017 [02:53:48], timed out on 12-19-2017 [04:31:43]
12-19-2017 [04:31:43] Start collecting test results and dump them to json.
Suite job [ PASSED ]
moblab_RunSuite [ FAILED ]
moblab_RunSuite ABORT: None
[ ... ]
Reason: Suite job timed out.
[ ... ]
12-19-2017 [04:31:58] Output below this line is for buildbot consumption:
Will return from run_suite with status: SUITE_TIMEOUT
Given that the run was an utter failure, why did the paladin
report success?
,
Dec 19 2017
The paladin failures were reported and debugged in bug 796210 . Here's the list of CQ runs that included the bad code (both before and after the bad commit): https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8168 https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8169 https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8170 https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8171 https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8172 https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8173 https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/8174 By and large, this problem only seems to have affected one run. That's still one run too many.
,
Dec 19 2017
Problem is in the result reported by run_suite
#JSON_START#{"autotest_instance": "cautotest-prod", "return_code": 0, "suite_job_id": 163672730, "suite_timings": {"artifact_download_end": "2017-12-19 02:54:14", "download_start": "2017-12-19 02:53:37", "payload_download_end": "2017-12-19 02:53:45", "suite_start": "2017-12-19 02:54:08", "tests_end": "9999-12-31 23:59:59.999999", "tests_start": "2017-12-19 03:07:46"}, "tests": {"Suite job": {"attributes": [], "link_to_logs": "http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/163672730-chromeos-test/", "reason": "", "retry_count": 0, "sponge_url": "http://tests/4d78c3f5-7e6d-43c0-97b7-a22121c669f6", "status": "GOOD"}, "moblab_RunSuite": {"attributes": ["suite:moblab_quick", "subsystem:default"], "link_to_logs": "http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/163672756-chromeos-test/", "reason": "None", "retry_count": 0, "sponge_url": "http://tests/4d78c3f5-7e6d-43c0-97b7-a22121c669f6", "status": "ABORT"}}}#JSON_END#
return_code was 0 which will be interpreted as "nothing went wrong" here: https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/commands.py?l=900
The suite should have shown up as failed, but it suceeded, even though the test inside it had been aborted.
,
Dec 19 2017
status.log entry for the suite job seems to have the correct (ABORT) status: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/163672730-chromeos-test/hostless I think something went wrong in run_suites return_code computation: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/site_utils/run_suite.py?q=run_suite.py&sq=package:chromeos+file:src/third_party/autotest/files&dr=C&l=1151
,
Dec 19 2017
It is possible to re-run just the return result computation by: pprabhu@pprabhu:files$ /usr/local/autotest/site_utils/run_suite.py --build guado_moblab-paladin/R65-10228.0.0-rc4 --board guado_moblab --suite_name moblab_quick -m 163672730 -w cautotest Autotest instance created: cautotest 12-19-2017 [02:53:48] Created suite job: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=163672730 @@@STEP_LINK@Link to suite@http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=163672730@@@ 12-19-2017 [15:26:21] Suite job is finished. 12-19-2017 [15:26:21] Start collecting test results and dump them to json. Suite job [ PASSED ] moblab_RunSuite [ FAILED ] moblab_RunSuite ABORT: None Suite timings: Downloads started at 2017-12-19 02:53:37 Payload downloads ended at 2017-12-19 02:53:45 Suite started at 2017-12-19 02:54:08 Artifact downloads ended (at latest) at 2017-12-19 02:54:14 Testing started at 2017-12-19 03:07:46 Testing ended at 9999-12-31 23:59:59.999999 Links to test logs: Suite job http://cautotest/tko/retrieve_logs.cgi?job=/results/163672730-pprabhu/ moblab_RunSuite http://cautotest/tko/retrieve_logs.cgi?job=/results/163672756-pprabhu/ 12-19-2017 [15:26:43] Output below this line is for buildbot consumption: Will return from run_suite with status: OK
,
Dec 19 2017
poking around with pdb, I see that the test was forgiven because we determined that it was aborted, but it had not reached its own timeout: (Pdb) (end_t - start_t).total_seconds()/60.0 79.8 (Pdb) test_view.afe_job.max_runtime_mins 1440 ... and looking at the test logs themselves: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/163672756-chromeos-test/chromeos2-row1-rack8-host1/ status.log says abort (some 1.5 hours after start time), parse.log says abort but debug/autoserv.DEBUG finished just fine. This is the case of parsing choking on something in the logs and then marking the test as aborted. That + our run_suite logic treats such aborts as PASS => we pass a test that actually may have passed / failed, but that got marked as aborted because parsing is stupid. I've seen this before. Lemme look up that bug. https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/site_utils/run_suite.py?l=1170
,
Dec 19 2017
- akeshet +aytane I think we should start failing in run_suite where a test is aborted for an unknown reason. Marking that run as passed is too naive overall.
,
Dec 20 2017
,
Dec 20 2017
Why did this problem only occur once out of some half dozen opportunities?
,
Dec 21 2017
This happens when parsing dies in a weird way. It used to be that parsing would die every time it straddled a scheduler restart but I fixed that a few months ago. I don't know what was special about this one.
,
Jan 6 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/316180c95a3ff2f44db8b75804cab5760c571762 commit 316180c95a3ff2f44db8b75804cab5760c571762 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Sat Jan 06 05:14:13 2018 autotest: Tweak run_suite return_code inference to fail aborted tests. BUG=chromium:796348 TEST=None Change-Id: I44fde5fdb5b61527e84b1fdd68c4c18097094712 Reviewed-on: https://chromium-review.googlesource.com/835363 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Allen Li <ayatane@chromium.org> [modify] https://crrev.com/316180c95a3ff2f44db8b75804cab5760c571762/site_utils/run_suite.py [modify] https://crrev.com/316180c95a3ff2f44db8b75804cab5760c571762/site_utils/run_suite_unittest.py
,
Mar 30 2018
,
Mar 30 2018
,
Jun 8 2018
|
||||||
►
Sign in to add a comment |
||||||
Comment 1 by jrbarnette@chromium.org
, Dec 19 2017