Improve reporting of Tast test failures caused by lost SSH connections |
|||||||
Issue descriptionBoard: grunt Build: 11151.2.0 Logs: https://stainless.corp.google.com/browse/chromeos-autotest-results/248403643-chromeos-test/chromeos2-row3-rack10-host9/ The video.PlayH264 log seems to only contain 2 lines: 2018/10/14 07:54:58 Started test video.PlayH264 2018/10/14 07:54:58 [07:54:57.872] Restarting ui job I'm not very familiar with log structure used by Tast, so would appreciate any hints for picking out the interesting logs in the future.
,
Oct 16
I wonder if this is not a video problem, because - video_VideoSanityH264, which exercise same video features (by a different video), was passed and - if it had failed while playing video, tast.video.PlayH264 must have showed more detailed message than "Failed to run tests: context deadline exceeded". So, I think this failure was caused by somewhere else (e.g. network? UI?). Also, I guess issue 884939 and issue 895725 are same. +derat@ +nya@ I wonder how we can figure out who caused this failure actually. At least, I'd like to improve error messages for this kind of failures. Do you have any idea?
,
Oct 16
,
Oct 16
http://go/tast-running#Interpreting-test-results has information about test results, and http://go/tast-failures has some more info that may be helpful. The end of full.txt shows what happened here ("^^^" added by me to point out important bits): ... 2018/10/14 07:54:58 Completed test ui.VirtualKeyboardTyping in 2m0.185s with 1 error(s) 2018/10/14 07:54:58 Started test video.PlayH264 2018/10/14 07:54:58 [07:54:57.872] Restarting ui job 2018/10/14 07:58:39 Failed to run tests: context deadline exceeded 2018/10/14 07:58:39 Lost SSH connection to chromeos2-row3-rack10-host9:22: EOF ^^^ 2018/10/14 07:58:39 Collecting system information 2018/10/14 07:58:39 Connecting to chromeos2-row3-rack10-host9:22 2018/10/14 07:58:39 Copying /tmp/tast_logs.013093184 from host to /usr/local/autotest/results/lxc_job_folder/tast/results/system_logs 2018/10/14 07:58:40 Cleaning /tmp/tast_logs.013093184 on host 2018/10/14 07:58:40 Copying /tmp/tast_crashes.116442527 from host to /usr/local/autotest/results/lxc_job_folder/tast/results/crashes 2018/10/14 07:58:40 Cleaning /tmp/tast_crashes.116442527 on host 2018/10/14 07:58:40 -------------------------------------------------------------------------------- 2018/10/14 07:58:40 cryptohome.Login [ PASS ] 2018/10/14 07:58:40 graphics.ScreenshotCLI [ PASS ] 2018/10/14 07:58:40 graphics.ScreenshotChrome [ PASS ] 2018/10/14 07:58:40 platform.Histograms [ PASS ] 2018/10/14 07:58:40 platform.MLServiceBootstrap [ SKIP ] missing deps: ml_service 2018/10/14 07:58:40 security.GPUSandboxed [ PASS ] 2018/10/14 07:58:40 security.OpenFDs [ PASS ] 2018/10/14 07:58:40 security.SELinuxFileLabelWithChrome [ PASS ] 2018/10/14 07:58:40 security.Sandboxed [ PASS ] 2018/10/14 07:58:40 ui.Exceptions [ PASS ] 2018/10/14 07:58:40 ui.MashLogin [ PASS ] 2018/10/14 07:58:40 ui.SingleProcessMashLogin [ PASS ] 2018/10/14 07:58:40 ui.SupervisedUserCrash [ PASS ] 2018/10/14 07:58:40 ui.VirtualKeyboardOmnibox [ PASS ] 2018/10/14 07:58:40 ui.VirtualKeyboardTyping [ FAIL ] Failed to wait for the virtual keyboard to render: cdp.Runtime: Evaluate: context deadline exceeded 2018/10/14 07:58:40 Run did not finish successfully; results are incomplete ^^^ 2018/10/14 07:58:40 -------------------------------------------------------------------------------- 2018/10/14 07:58:40 Results saved to /usr/local/autotest/results/lxc_job_folder/tast/results 2018/10/14 07:58:40 Lost SSH connection to chromeos2-row3-rack10-host9:22: EOF ^^^ It's likely that the DUT just died here midway through the test run; it's probably unrelated to the tests that were being run. I'll try to make the SSH error also end up in the per-test log, since that would make things clearer.
,
Oct 17
,
Oct 17
Dan, reporting SSH errors in per-test logs sounds great. This is what we want. But, I wonder if we can reduce test sheriff's burden more. For example, can we report "N/A" instead of "FAIL" when the failure caused by a global error? Or, we might want to retry a test that failed by a global error automatically.
,
Oct 17
Issue 895725 has been merged into this issue.
,
Oct 18
> For example, can we report "N/A" instead of "FAIL" when the > failure caused by a global error? Or, we might want to retry > a test that failed by a global error automatically. I'm reluctant to do that (and I think I've heard arguments against doing it from ihf@ too) since it's possible for tests to trigger bugs that crash the DUT, and it'd be bad if we silently ignored those errors. I've thought about adding support to "tast run" itself (or maybe to the tast.py Autotest server test that runs the tast executable) to support retrying the test run (skipping already-completed tests) if the connection to the DUT is lost, but I'm not sure how we want to treat the test that was running at the time of the lost connection.
,
Oct 18
I think the test that was running while something really bad happened should be failed for that execution. Otherwise we are losing information. The question is what should be done with this information? It always should be surfaced to the sheriff to investigate - otherwise we won't get a fix for the condition and things will go downhill. The failure could be automatically retried[1] if it is part of the CQ for instance. But a success during retry should not hide the original failure (as we often do now). And maybe the retry bar should be larger than "Try N times and if it passes a single time assume it is ok". Maybe the bar should be to gather statistics running N times and make sure it doesn't fail more than 5 percent of the time? The 30k problem is the code base changes over time. Having JOB_RETRIES=2 everywhere (= run 3 times and if it passes once let it in) just makes the CQ let flakes through everywhere. It wouldn't be so bad if the current CQ system wouldn't hide the flakes at the same time (so does stainless by default - you have to uncheck "Exclude retried tests" to see flakes). The CQ and Tast cannot keep flakes out reliably. But they can make it harder for them to get in and they can surface them to create back pressure to the teams to fix flakes. For instance the graphics team likes to search for retries periodically. Other example: we run CTS with many retries, but surface the flakes as warnings. From time to time Kazuhiro chases these flakes and tries to reduce them. So even if you are willing to tolerate some amount of flakiness in the code base (which the loss of the ssh connection seems to be), it has to be made very clear that it occurred and if repeated needs fixing (medium term). [1] I would skip a test like that and move it to the very end so if we are dealing with something bad we don't hit it immediately again. --- TestNA is not a good choice. It potentially hides every badness.
,
Oct 18
Thanks, Dan and Ilja. Now, I agree that the result like "N/A" was a bad idea as you two said. On second thought, I think what I really want here is so precise failure reasons that the sheriff can determine whether the failure was not caused by video components at a glance. For example, failure reason here is "Test did not finish", which is not informative. https://stainless.corp.google.com/search?exclude_retried=true&first_date=2018-10-12&master_builder_name=&builder_name_number=&shard=&exclude_acts=true&builder_name=&master_builder_name_number=&owner=&retry=&exclude_cts=false&exclude_non_production=false&hostnam[e=&board=&test=tast.video.*&suite=&build=%5ER71%5C-11151%5C.2%5C.0%24&status=FAIL&status=ERROR&status=ABORT&reason=&waterfall=&exclude_not_run=false&last_date=2018-10-18&exclude_non_release=true&exclude_au=true&model=%5Egrunt%24&view=list If this message is like "lost SSH connection to DUT" (and the failure happened only once), video team's sheriff can ignore this failure easily. Dan, will your CL also improve this "Reason" message? If so, it would work well. --- Supporting retrying in "tast run" sounds great. Though I think there is no need to hurry, such features would be useful.
,
Oct 18
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast/+/a2cac93f928e818f893ec597bd57001c17d1123e commit a2cac93f928e818f893ec597bd57001c17d1123e Author: Daniel Erat <derat@chromium.org> Date: Thu Oct 18 21:53:36 2018 tast: Include global errors in per-test log messages. When the entire test run is aborted due to a global error that happens to occur in the middle of a test, write the error to the test's log.txt file to aid in debugging. BUG= chromium:895716 TEST=added unit tests; also manually verified that tests/<test-name>/log.txt in results dir contains an error when i unplug the DUT's network connection mid-test or pass a short timeout to "tast run" Change-Id: I376d1c0e1ec4f9eab004bab6e2c97006b74ef4a1 Reviewed-on: https://chromium-review.googlesource.com/1285850 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Shuhei Takahashi <nya@chromium.org> Reviewed-by: Hidehiko Abe <hidehiko@chromium.org> [modify] https://crrev.com/a2cac93f928e818f893ec597bd57001c17d1123e/src/chromiumos/cmd/tast/run/results_test.go [modify] https://crrev.com/a2cac93f928e818f893ec597bd57001c17d1123e/src/chromiumos/cmd/tast/run/results.go
,
Oct 18
#10: Thanks for clarifying. Re improving the "Test did not finish", that seems like a good idea to me. I'll use this bug to track that.
,
Oct 29
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast/+/0ab163fda06fbf065426123442a95bde71499a33 commit 0ab163fda06fbf065426123442a95bde71499a33 Author: Daniel Erat <derat@chromium.org> Date: Mon Oct 29 18:21:41 2018 tast: Write global error to run_error.txt. Make "tast run" write global errors (most commonly caused by SSH connections to DUTs being lost) to a run_error.txt file in the results directory. This will be used by the tast.py server test to provide better errors for interrupted tests than "Test did not finish". Also move the run package's deadlineBefore function to the ctxutil package, as it seems like a generally-useful thing to have when checking if there's enough time left to perform an operation. BUG= chromium:895716 TEST=manual: rebooted a DUT in the middle of a test and verified that run_error.txt contains an error like "Lost SSH connection to ___: EOF". also added a unit test to verify that run_error.txt is written. Change-Id: I1ec89b848ed8fe6944b56448cba14bb49ed4de2f Reviewed-on: https://chromium-review.googlesource.com/1303290 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Dan Erat <derat@chromium.org> Reviewed-by: Hidehiko Abe <hidehiko@chromium.org> [modify] https://crrev.com/0ab163fda06fbf065426123442a95bde71499a33/src/chromiumos/tast/ctxutil/ctxutil_test.go [modify] https://crrev.com/0ab163fda06fbf065426123442a95bde71499a33/src/chromiumos/cmd/tast/run/remote.go [modify] https://crrev.com/0ab163fda06fbf065426123442a95bde71499a33/src/chromiumos/cmd/tast/run/local.go [modify] https://crrev.com/0ab163fda06fbf065426123442a95bde71499a33/docs/running_tests.md [modify] https://crrev.com/0ab163fda06fbf065426123442a95bde71499a33/src/chromiumos/tast/ctxutil/ctxutil.go [modify] https://crrev.com/0ab163fda06fbf065426123442a95bde71499a33/src/chromiumos/cmd/tast/run/run.go [add] https://crrev.com/0ab163fda06fbf065426123442a95bde71499a33/src/chromiumos/cmd/tast/run/run_test.go
,
Nov 1
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/9624149e102879e6bf2e420041e89a6e14089ae0 commit 9624149e102879e6bf2e420041e89a6e14089ae0 Author: Daniel Erat <derat@chromium.org> Date: Thu Nov 01 04:59:26 2018 autotest: Make tast.py read run_error.txt. Make the tast.py server test read a run_error.txt file that is written by "tast run" when the entire test run is aborted (e.g. due to the SSH connection to the DUT being lost). This message is then reported as an error for any incomplete tests, as it provides a better clue to the underlying problem than "Test did not finish". BUG= chromium:895716 TEST=added unit test; also killed tests on DUT while using test_that to run tast.informational and checked that the run error was reported for the last test in status.log Change-Id: I90dc587e193cf5e8826991401f9c487020bcf1cb Reviewed-on: https://chromium-review.googlesource.com/1308747 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Ilja H. Friedel <ihf@chromium.org> [modify] https://crrev.com/9624149e102879e6bf2e420041e89a6e14089ae0/server/site_tests/tast/testdata/fake_tast.py [modify] https://crrev.com/9624149e102879e6bf2e420041e89a6e14089ae0/server/site_tests/tast/tast_unittest.py [modify] https://crrev.com/9624149e102879e6bf2e420041e89a6e14089ae0/server/site_tests/tast/tast.py
,
Nov 1
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by tfiga@chromium.org
, Oct 16