autotest client/cros/cros_logging.py:AbstractLogReader.set_start_by_reboot gets confused when crosvm logs its kernel output |
|||||||
Issue descriptionSample log from elm: https://storage.cloud.google.com/chromeos-autotest-results/179958285-chromeos-test/chromeos2-row7-rack9-host17/logging_UserCrash/debug/logging_UserCrash.DEBUG?_ga=2.199270378.-517888296.1519867600 Job: https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=179958285 02/28 02:57:06.736 INFO | crash_test:0694| ====================Running reporter_startup==================== 02/28 02:57:06.736 DEBUG| utils:0214| Running 'touch /var/lib/crash_sender_paused' 02/28 02:57:06.765 DEBUG| utils:0214| Running 'rm -rf /var/lib/crash_sender' 02/28 02:57:06.785 DEBUG| utils:0214| Running 'rm -rf /var/spool/crash' 02/28 02:57:06.805 DEBUG| utils:0214| Running 'rm -rf /home/chronos/u-*/crash /home/chronos/crash' 02/28 02:57:06.829 DEBUG| utils:0214| Running 'echo "|/sbin/crash_reporter --user=%P:%s:%u:%e" > /proc/sys/kernel/core_pattern' 02/28 02:57:06.857 DEBUG| utils:0214| Running 'ls -tr1 /var/log/messages.*' 02/28 02:57:06.883 DEBUG| utils:0214| Running 'ls -tr1 /var/log/messages.*' 02/28 02:57:06.907 DEBUG| test:0410| Test failed due to user space crash handling was not started during last boot. Exception log follows the after_iteration_hooks. 02/28 02:57:06.908 DEBUG| test:0415| Starting after_iteration_hooks for logging_UserCrash 02/28 02:57:06.909 DEBUG| base_sysinfo:0124| Loggable saves logs to /usr/local/autotest/results/default/logging_UserCrash/sysinfo/iteration.1/interrupts.after 02/28 02:57:06.911 DEBUG| base_sysinfo:0124| Loggable saves logs to /usr/local/autotest/results/default/logging_UserCrash/sysinfo/iteration.1/meminfo.after 02/28 02:57:06.913 DEBUG| base_sysinfo:0124| Loggable saves logs to /usr/local/autotest/results/default/logging_UserCrash/sysinfo/iteration.1/slabinfo.after 02/28 02:57:06.914 DEBUG| base_sysinfo:0124| Loggable saves logs to /usr/local/autotest/results/default/logging_UserCrash/sysinfo/iteration.1/schedstat.after 02/28 02:57:06.915 DEBUG| utils:0214| Running 'logger "autotest finished iteration /usr/local/autotest/results/default/logging_UserCrash/sysinfo/iteration.1"' 02/28 02:57:06.937 DEBUG| test:0420| after_iteration_hooks completed 02/28 02:57:06.938 WARNI| test:0637| The test failed with the following exception Traceback (most recent call last): File "/usr/local/autotest/common_lib/test.py", line 631, in _exec _call_test_function(self.execute, *p_args, **p_dargs) File "/usr/local/autotest/common_lib/test.py", line 831, in _call_test_function return func(*args, **dargs) File "/usr/local/autotest/common_lib/test.py", line 495, in execute dargs) File "/usr/local/autotest/common_lib/test.py", line 362, in _call_run_once_with_retry postprocess_profiled_run, args, dargs) File "/usr/local/autotest/common_lib/test.py", line 400, in _call_run_once self.run_once(*args, **dargs) File "/usr/local/autotest/tests/logging_UserCrash/logging_UserCrash.py", line 306, in run_once must_run_all=False) File "/usr/local/autotest/cros/crash/crash_test.py", line 711, in run_crash_tests getattr(self, '_test_' + test_name)() File "/usr/local/autotest/tests/logging_UserCrash/logging_UserCrash.py", line 38, in _test_reporter_startup 'user space crash handling was not started during last boot') TestFail: user space crash handling was not started during last boot https://stainless.corp.google.com/search?view=matrix&row=build&col=model&first_date=2018-02-02&last_date=2018-03-01&suite=bvt-cq&test=%5Elogging%5C_UserCrash%24&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&reason=user+space+crash+handling+was+not+started+during+last+boot&exclude_cts=true&exclude_not_run=true&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true
,
Nov 30
logging_UserCrash has multiple tests which need to be in the CQ "Enabling user crash handling" should be in the logs because it's written by the crash reporter init script which initializes things. why isn't in the logs ? i vaguely recall this used to be flaky because messages would get rotated, but i thought we fixed that by having it look at all the logs (including rotated ones).
,
Nov 30
The message looks like it's supposed to come from user_collector.cc:
bool UserCollector::SetUpInternal(bool enabled) {
CHECK(initialized_);
LOG(INFO) << (enabled ? "Enabling" : "Disabling") << " user crash handling";
,
Nov 30
yes, the code path is: - /etc/init/crash-reporter.conf: `crash_reporter --init` - crash_reporter.cc:Initialize() -> user_collector->Enable() - user_collector.h:Enable() -> SetUpInternal() - user_collector.cc:SetUpInternal() -> LOG Enabling user crash handling so the test is making sure that `crash_reporter --init` is called correctly during boot. if it wasn't, that'd be bad.
,
Nov 30
Please figure out why the log message wasn't found in https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8928559381630727616. I see it in the messages file that was collected, at least: 2018-11-29T05:21:29.401019+00:00 INFO crash_reporter[655]: Enabling user crash handling Ideally we just wouldn't run this test on the Chrome PFQ, but that's hard to accomplish with the way that we run Autotest tests. (It's automatic with Tast, but I haven't looked into porting this test yet.)
,
Dec 1
This failed again today: http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928414701064286384 If you want to keep this test in the CQ (and Chrome PFQ, regrettably), you need to make it non-flaky.
,
Dec 3
,
Dec 3
i've verified that forcibly rotating the logs first and then running the autotest doesn't break it. the test is able to find the log message in the previous messages.1 file.
the test is just using autotest's client/cros/cros_logging.py code to return all the syslog data. if that code is broken, it's not a bug in code that uses it.
my guess is that the set_start_by_reboot function is getting tricked by the logged output from crosvm:
client/cros/cros_logging.py
def set_start_by_reboot(self, index):
return self.set_start_by_regexp(index,
r'.*000\] Linux version \d')
from the messages file from that run:
2018-11-30T19:31:21.562898+00:00 NOTICE kernel: [ 0.000000] Linux version 4.4.164-15533-g40c5dfe24177 (chrome-bot@swarm-cros-434) (Chromium OS 8.0_pre339409_p20180926-r6 clang version 8.0.0 (/var/cache/chromeos-cache/distfiles/host/egit-src/clang.git 6601c8f525499269dba75f75bbd1ee2671aaa262) (/var/cache/chromeos-cache/distfiles/host/egit-src/llvm.git 36f54002c931a026f490f9fb074c11d91e3487a2) (based on LLVM 8.0.0svn)) #1 SMP PREEMPT Thu Nov 29 21:12:01 PST 2018
2018-11-30T19:31:21.591103+00:00 INFO crash_reporter[658]: Enabling user crash handling
2018-11-30T20:29:57.392029+00:00 INFO VM(3)[23251]: [ 0.000000] Linux version 4.19.2-02071-g869138e6e829 (chrome-bot@swarm-cros-632) (Chromium OS 8.0_pre339409_p20180926-r6 clang version 8.0.0 (/var/cache/chromeos-cache/distfiles/host/egit-src/clang.git 6601c8f525499269dba75f75bbd1ee2671aaa262) (/var/cache/chromeos-cache/distfiles/host/egit-src/llvm.git 36f54002c931a026f490f9fb074c11d91e3487a2) (based on LLVM 8.0.0svn)) #1 SMP PREEMPT Wed Nov 21 07:16:53 PST 2018#015
2018-11-30T20:31:00.479618+00:00 INFO VM(4)[24478]: [ 0.000000] Linux version 4.19.2-02071-g869138e6e829 (chrome-bot@swarm-cros-632) (Chromium OS 8.0_pre339409_p20180926-r6 clang version 8.0.0 (/var/cache/chromeos-cache/distfiles/host/egit-src/clang.git 6601c8f525499269dba75f75bbd1ee2671aaa262) (/var/cache/chromeos-cache/distfiles/host/egit-src/llvm.git 36f54002c931a026f490f9fb074c11d91e3487a2) (based on LLVM 8.0.0svn)) #1 SMP PREEMPT Wed Nov 21 07:16:53 PST 2018#015
so autotest stops searching once crosvm logs its kernel output. which means any test trying to use set_start_by_reboot is broken if it runs after crosvm starts up.
,
Dec 3
Thanks, Mike. So, I'll just reassign this to the person who owns the autotest cros_logging.py library for a prompt fix. Kidding. I uploaded https://crrev.com/c/1358712.
,
Dec 4
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/e99939627669f7d3bf65d34fc5a1c48b97b5cba6 commit e99939627669f7d3bf65d34fc5a1c48b97b5cba6 Author: Daniel Erat <derat@chromium.org> Date: Tue Dec 04 04:32:11 2018 autotest: Make cros_logging not match crosvm messages. Update the regexp used by the cros_logging module's AbstractLogReader.set_start_by_reboot to avoid matching kernel boot messages logged by crosvm. Without this, it sometimes doesn't scan far enough back in the log. BUG= chromium:817946 TEST=none Change-Id: I2bd2185f784e5a3bbc8e12c6f424faa84f3eef87 Reviewed-on: https://chromium-review.googlesource.com/1358712 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Jeffrey Kardatzke <jkardatzke@google.com> Reviewed-by: Stephen Barber <smbarber@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> [modify] https://crrev.com/e99939627669f7d3bf65d34fc5a1c48b97b5cba6/client/cros/cros_logging.py
,
Dec 4
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by derat@chromium.org
, Nov 30Labels: -Pri-2 -M-65 -M-66 Build-PFQ-Failures Pri-1
Summary: logging_UserCrash is flaky and fails with "User space crash handling was not started during last boot" (was: [BVT-CQ: daisy & oak] logging_UserCrash failing with "User space crash handling was not started during last boot" error)