New issue
Advanced search Search tips

Issue 817946 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Dec 4
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

autotest client/cros/cros_logging.py:AbstractLogReader.set_start_by_reboot gets confused when crosvm logs its kernel output

Project Member Reported by harpreet@chromium.org, Mar 1 2018

Issue description

Sample log from elm: https://storage.cloud.google.com/chromeos-autotest-results/179958285-chromeos-test/chromeos2-row7-rack9-host17/logging_UserCrash/debug/logging_UserCrash.DEBUG?_ga=2.199270378.-517888296.1519867600

Job: https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=179958285



02/28 02:57:06.736 INFO |        crash_test:0694| ====================Running reporter_startup====================
02/28 02:57:06.736 DEBUG|             utils:0214| Running 'touch /var/lib/crash_sender_paused'
02/28 02:57:06.765 DEBUG|             utils:0214| Running 'rm -rf /var/lib/crash_sender'
02/28 02:57:06.785 DEBUG|             utils:0214| Running 'rm -rf /var/spool/crash'
02/28 02:57:06.805 DEBUG|             utils:0214| Running 'rm -rf /home/chronos/u-*/crash /home/chronos/crash'
02/28 02:57:06.829 DEBUG|             utils:0214| Running 'echo "|/sbin/crash_reporter --user=%P:%s:%u:%e" > /proc/sys/kernel/core_pattern'
02/28 02:57:06.857 DEBUG|             utils:0214| Running 'ls -tr1 /var/log/messages.*'
02/28 02:57:06.883 DEBUG|             utils:0214| Running 'ls -tr1 /var/log/messages.*'
02/28 02:57:06.907 DEBUG|              test:0410| Test failed due to user space crash handling was not started during last boot. Exception log follows the after_iteration_hooks.
02/28 02:57:06.908 DEBUG|              test:0415| Starting after_iteration_hooks for logging_UserCrash
02/28 02:57:06.909 DEBUG|      base_sysinfo:0124| Loggable saves logs to /usr/local/autotest/results/default/logging_UserCrash/sysinfo/iteration.1/interrupts.after
02/28 02:57:06.911 DEBUG|      base_sysinfo:0124| Loggable saves logs to /usr/local/autotest/results/default/logging_UserCrash/sysinfo/iteration.1/meminfo.after
02/28 02:57:06.913 DEBUG|      base_sysinfo:0124| Loggable saves logs to /usr/local/autotest/results/default/logging_UserCrash/sysinfo/iteration.1/slabinfo.after
02/28 02:57:06.914 DEBUG|      base_sysinfo:0124| Loggable saves logs to /usr/local/autotest/results/default/logging_UserCrash/sysinfo/iteration.1/schedstat.after
02/28 02:57:06.915 DEBUG|             utils:0214| Running 'logger "autotest finished iteration /usr/local/autotest/results/default/logging_UserCrash/sysinfo/iteration.1"'
02/28 02:57:06.937 DEBUG|              test:0420| after_iteration_hooks completed
02/28 02:57:06.938 WARNI|              test:0637| The test failed with the following exception
Traceback (most recent call last):
  File "/usr/local/autotest/common_lib/test.py", line 631, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/common_lib/test.py", line 831, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/common_lib/test.py", line 495, in execute
    dargs)
  File "/usr/local/autotest/common_lib/test.py", line 362, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/common_lib/test.py", line 400, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/tests/logging_UserCrash/logging_UserCrash.py", line 306, in run_once
    must_run_all=False)
  File "/usr/local/autotest/cros/crash/crash_test.py", line 711, in run_crash_tests
    getattr(self, '_test_' + test_name)()
  File "/usr/local/autotest/tests/logging_UserCrash/logging_UserCrash.py", line 38, in _test_reporter_startup
    'user space crash handling was not started during last boot')
TestFail: user space crash handling was not started during last boot




https://stainless.corp.google.com/search?view=matrix&row=build&col=model&first_date=2018-02-02&last_date=2018-03-01&suite=bvt-cq&test=%5Elogging%5C_UserCrash%24&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&reason=user+space+crash+handling+was+not+started+during+last+boot&exclude_cts=true&exclude_not_run=true&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true
 
Cc: derat@chromium.org vapier@chromium.org satorux@chromium.org benchan@chromium.org
Labels: -Pri-2 -M-65 -M-66 Build-PFQ-Failures Pri-1
Summary: logging_UserCrash is flaky and fails with "User space crash handling was not started during last boot" (was: [BVT-CQ: daisy & oak] logging_UserCrash failing with "User space crash handling was not started during last boot" error)
I see what looks like the same error in the eve-tot-chrome-pfq-informational run at http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928559381630727616.

From the logs in http://stainless/browse/chromeos-autotest-results/262073007-chromeos-test/ :

11/28 22:32:49.019 INFO |        crash_test:0237| Created /run/crash_reporter/crash-test-in-progress
11/28 22:32:53.454 INFO |      base_sysinfo:0400| ChromeOS BOARD = eve_3.6GHz_16GB
11/28 22:32:53.482 INFO |        crash_test:0714| ====================Running reporter_startup====================
11/28 22:32:53.578 WARNI|              test:0606| The test failed with the following exception
Traceback (most recent call last):
  File "/usr/local/autotest/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/common_lib/test.py", line 800, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/common_lib/test.py", line 464, in execute
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/common_lib/test.py", line 371, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/tests/logging_UserCrash/logging_UserCrash.py", line 306, in run_once
    must_run_all=False)
  File "/usr/local/autotest/cros/crash/crash_test.py", line 731, in run_crash_tests
    getattr(self, '_test_' + test_name)()
  File "/usr/local/autotest/tests/logging_UserCrash/logging_UserCrash.py", line 38, in _test_reporter_startup
    'user space crash handling was not started during last boot')
TestFail: user space crash handling was not started during last boot

---

This test looks like it's failing because it doesn't see 'Enabling user crash handling' in /var/log/messages. I don't think that checking for messages in logs is a good way to verify behavior.

If anyone cares about this test enough to make it non-flaky, please speak up. Otherwise, I'm going to move it to bvt-perbuild -- we can't have flaky tests on the CQ and PFQ.
Components: -Internals>CrashReporting OS>Systems>CrashReporting
logging_UserCrash has multiple tests which need to be in the CQ

"Enabling user crash handling" should be in the logs because it's written by the crash reporter init script which initializes things.  why isn't in the logs ?

i vaguely recall this used to be flaky because messages would get rotated, but i thought we fixed that by having it look at all the logs (including rotated ones).
Owner: vapier@chromium.org
Status: Assigned (was: Untriaged)
The message looks like it's supposed to come from user_collector.cc:

bool UserCollector::SetUpInternal(bool enabled) {
  CHECK(initialized_);
  LOG(INFO) << (enabled ? "Enabling" : "Disabling") << " user crash handling";

yes, the code path is:
- /etc/init/crash-reporter.conf: `crash_reporter --init`
- crash_reporter.cc:Initialize() -> user_collector->Enable()
- user_collector.h:Enable() -> SetUpInternal()
- user_collector.cc:SetUpInternal() -> LOG Enabling user crash handling

so the test is making sure that `crash_reporter --init` is called correctly during boot.  if it wasn't, that'd be bad.
Please figure out why the log message wasn't found in https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8928559381630727616. I see it in the messages file that was collected, at least:

2018-11-29T05:21:29.401019+00:00 INFO crash_reporter[655]: Enabling user crash handling

Ideally we just wouldn't run this test on the Chrome PFQ, but that's hard to accomplish with the way that we run Autotest tests. (It's automatic with Tast, but I haven't looked into porting this test yet.)
This failed again today: http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8928414701064286384

If you want to keep this test in the CQ (and Chrome PFQ, regrettably), you need to make it non-flaky.
Cc: wzang@chromium.org lucmult@chromium.org
Labels: Hotlist-CrOS-Gardener
Owner: ----
Summary: autotest client/cros/cros_logging.py:AbstractLogReader.set_start_by_reboot gets confused when crosvm logs its kernel output (was: logging_UserCrash is flaky and fails with "User space crash handling was not started during last boot")
i've verified that forcibly rotating the logs first and then running the autotest doesn't break it.  the test is able to find the log message in the previous messages.1 file.

the test is just using autotest's client/cros/cros_logging.py code to return all the syslog data.  if that code is broken, it's not a bug in code that uses it.

my guess is that the set_start_by_reboot function is getting tricked by the logged output from crosvm:
client/cros/cros_logging.py
    def set_start_by_reboot(self, index):
        return self.set_start_by_regexp(index,
                                        r'.*000\] Linux version \d')

from the messages file from that run:
2018-11-30T19:31:21.562898+00:00 NOTICE kernel: [    0.000000] Linux version 4.4.164-15533-g40c5dfe24177 (chrome-bot@swarm-cros-434) (Chromium OS 8.0_pre339409_p20180926-r6 clang version 8.0.0 (/var/cache/chromeos-cache/distfiles/host/egit-src/clang.git 6601c8f525499269dba75f75bbd1ee2671aaa262) (/var/cache/chromeos-cache/distfiles/host/egit-src/llvm.git 36f54002c931a026f490f9fb074c11d91e3487a2) (based on LLVM 8.0.0svn)) #1 SMP PREEMPT Thu Nov 29 21:12:01 PST 2018
2018-11-30T19:31:21.591103+00:00 INFO crash_reporter[658]: Enabling user crash handling
2018-11-30T20:29:57.392029+00:00 INFO VM(3)[23251]: [    0.000000] Linux version 4.19.2-02071-g869138e6e829 (chrome-bot@swarm-cros-632) (Chromium OS 8.0_pre339409_p20180926-r6 clang version 8.0.0 (/var/cache/chromeos-cache/distfiles/host/egit-src/clang.git 6601c8f525499269dba75f75bbd1ee2671aaa262) (/var/cache/chromeos-cache/distfiles/host/egit-src/llvm.git 36f54002c931a026f490f9fb074c11d91e3487a2) (based on LLVM 8.0.0svn)) #1 SMP PREEMPT Wed Nov 21 07:16:53 PST 2018#015
2018-11-30T20:31:00.479618+00:00 INFO VM(4)[24478]: [    0.000000] Linux version 4.19.2-02071-g869138e6e829 (chrome-bot@swarm-cros-632) (Chromium OS 8.0_pre339409_p20180926-r6 clang version 8.0.0 (/var/cache/chromeos-cache/distfiles/host/egit-src/clang.git 6601c8f525499269dba75f75bbd1ee2671aaa262) (/var/cache/chromeos-cache/distfiles/host/egit-src/llvm.git 36f54002c931a026f490f9fb074c11d91e3487a2) (based on LLVM 8.0.0svn)) #1 SMP PREEMPT Wed Nov 21 07:16:53 PST 2018#015

so autotest stops searching once crosvm logs its kernel output.  which means any test trying to use set_start_by_reboot is broken if it runs after crosvm starts up.
Cc: smbar...@chromium.org jkardatzke@chromium.org dgreid@chromium.org
Owner: derat@chromium.org
Status: Started (was: Assigned)
Thanks, Mike.

So, I'll just reassign this to the person who owns the autotest cros_logging.py library for a prompt fix.

Kidding. I uploaded https://crrev.com/c/1358712.
Project Member

Comment 10 by bugdroid1@chromium.org, Dec 4

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/e99939627669f7d3bf65d34fc5a1c48b97b5cba6

commit e99939627669f7d3bf65d34fc5a1c48b97b5cba6
Author: Daniel Erat <derat@chromium.org>
Date: Tue Dec 04 04:32:11 2018

autotest: Make cros_logging not match crosvm messages.

Update the regexp used by the cros_logging module's
AbstractLogReader.set_start_by_reboot to avoid matching
kernel boot messages logged by crosvm. Without this, it
sometimes doesn't scan far enough back in the log.

BUG= chromium:817946 
TEST=none

Change-Id: I2bd2185f784e5a3bbc8e12c6f424faa84f3eef87
Reviewed-on: https://chromium-review.googlesource.com/1358712
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Jeffrey Kardatzke <jkardatzke@google.com>
Reviewed-by: Stephen Barber <smbarber@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/e99939627669f7d3bf65d34fc5a1c48b97b5cba6/client/cros/cros_logging.py

Status: Fixed (was: Started)

Sign in to add a comment