New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 702794 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Feature

Blocking:
issue 199893
issue 217282
issue 702500



Sign in to add a comment

Move crash-reporter --init before boot-services

Project Member Reported by vapier@chromium.org, Mar 17 2017

Issue description

since the adjustment to crash-reporter startup, logging_UserCrash has been flaky
https://chromium-review.googlesource.com/335467

see issue 702500 for lots of examples
 

Comment 1 by vapier@chromium.org, Mar 17 2017

Blocking: 702500
I'm fine with reverting that change if needed, but other than that I'm not really sure how to debug this. Every time I run it locally it seems to work fine. The error looks like crash_reporter --init wasn't run correctly, but no real indication why (since the execution moved earlier in the boot process, I don't think it could really be a race condition with the autotest starting too early?).

Could either of you guys who are more familiar with this stuff try to take a look?
Labels: -Pri-2 Pri-0
reverting, this is hitting the pre-cq

Comment 6 by ra...@google.com, Mar 18 2017

I took a look at this, all failures seem to happen on iteration 1.

I suspect that rather than being a race condition, maybe this happens on the first test after a fresh provision? Or maybe the first test after a repair. It must be looking for a string that only fails to exist the first time the test is run on a fresh DUT. That would also explain why jwerner@ was unable to replicate if he wasn't using a freshly-installed DUT.
Yes, I'm testing with the Kevin on my desk. I don't really know enough about the lab to simulate the provisioning process accurately. Can you guys give me some simple instructions for how I could either get my Kevin into that state or lock and prepare a lab DUT correctly so that I could play around with it on there?

Essentially, I think the test reads /proc/sys/kernel/core_pattern and fails because it doesn't find the expected string (|/sbin/crash_reporter --user=%%P:%%s:%%u:%%e). After every boot, upstart runs crash_reporter --init (with my patch a little earlier than before) and writes that string in there. I really have no idea where this could fail for a newly provisioned board. If crash_reporter --init had some sort of error it should write that to syslog (I think?) but I can't find anything in the logs here.

Comment 8 by ra...@google.com, Mar 18 2017

Looked a little more, the failure definitely happens during iteration 1 - there is no occurrence of the string "Enabling user crash handling" in /var/log/messages on the affected hosts by the time the test begins. The crash_reporter --init doesn't seem to have been run at all.

I no longer think this depends on the DUT state prior to the test run.

I've attached a copy of /var/log/messages from one of the testruns, could you take a look and see if anything jumps out at you Julius?

I'm now suspicious of the line "start on starting boot-services and started syslog" in the upstart script in your CL for crash-reporter.conf.

If boot-services doesn't start or syslog doesn't start successfully due to a failure unrelated to your CL, this test would fail for this reason.

Comment 9 by ra...@google.com, Mar 18 2017

here's that /var/log/messages
messages.txt
5.5 KB View Download

Comment 10 by ra...@google.com, Mar 18 2017

I believe I can confirm that function does not fail due to not finding that expected string "|/sbin/crash_reporter --user=%%P:%%s:%%u:%%e". That would raise "core pattern should have been %s, not %s" at https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/site_tests/logging_UserCrash/logging_UserCrash.py?l=38
Labels: -Pri-0 Pri-2
Yeah, good point... nice attention to detail there! So this would mean that the pattern was actually set correctly (otherwise we would've gotten the other error), which means crash_reporter --init probably did run correctly. It just couldn't output to syslog for some reason.

I think you may have uploaded the wrong file by accident there (or Monorail is having a weird bug when I click it right now)... here's another one. The bug happened during the last boot, and you can see that while we usually get "Enabling user crash handling" and shortly afterwards "Enabling kernel crash handling", the last boot only has the latter.

I don't think syslog didn't start at all because we obviously still got a log, but is it possible that there's a race condition and it can't accept messages for a short time window after launching? (Who's good with rsyslogd here?) Could we somehow fix that so it only returns to upstart once it is really ready to log stuff... or should we just not bother and remove that part of the test?

(Also lowering priority again. The patch is reverted and this is only about whether we want to try re-adding it.)
messages
959 KB View Download
This is not confined to the lab, it happens in VMTest. See for instance:

https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/pre_cq/builds/23492

Cc: jrbarnette@chromium.org
Looking at syslog.conf, it seems that we're currently launching rsyslogd in non-daemonizing mode. As far as I understand that wouldn't give upstart any chance to make sure it actually becomes ready to receive log entries before proceeding. The upstart cookbook suggests daemons to use the expect fork or expect daemon stanzas with a daemonizing executable to solve this, and all rsyslogd upstart scripts I can find online also do that.

It seems that we used to do that until http://crosreview.com/32253 (associated with  issue 212885 ), which changed it without explicit justification. +Richard who reviewed it back in the day.
Owner: jwer...@chromium.org
*ping*

Richard? Mike? Does anyone here know about syslog.conf or know who to ask? Should I just upload a patch to change it and see what breaks?
The rsyslogd invocation changed from this:
    expect fork
    exec rsyslogd -c4 -f /etc/rsyslog.chromeos

To this
    exec rsyslogd -c4 -n -f /etc/rsyslog.chromeos

Here's what the man page says about the -n option:
       -n     Avoid auto-backgrounding.  This  is  needed  especially  if  the
              rsyslogd is started and controlled by init(8).

So, on its face, it looks like a plausible change.

I don't know a whole lot more about why that change was made;
mostly I trusted the author to know enough.

Status: Assigned (was: Available)
I'm pretty sure the man page reference there refers to a less sophisticated version of /sbin/init, not upstart. upstart can explicitly track when a process is daemonizing and as far as I can tell that is how you're supposed to hook up rsyslogd with it (e.g. https://github.com/librato/rsyslog-cookbook/blob/master/files/default/rsyslog.upstart.conf). So if nobody knows a good reason why I shouldn't, I'll change it back to that.
the good reason would be to not waste resources forking & ptracing in the first place.  it's cheaper to not do all of that.

i don't think upstart differentiates "i've launched the daemon" and "the daemon is fully initialized & ready to process requests".  we'd have to add our own logic to support that.
> i don't think upstart differentiates "i've launched the daemon"
> and "the daemon is fully initialized & ready to process requests".
> we'd have to add our own logic to support that.

Upstart can't make that distinction.  If the daemon is actually
written so that it's ready to serve before it daemonizes, then
undoing the "-n" change will do the right thing, but I doubt we
should depend on that sort of behavior, even if it exists.

I think the right thing to do is along the lines of what vapier@
said:  Add a 'post-start' stanza to the upstart job that
specifically waits until rsyslogd can be observed to be serving.
Then 'start on started syslog' will be known to do what we want.

we could use `logger` in a post-start like:

 while ! logger --no-act --socket-errors=on testing; do
   sleep 0.1
 done

but considering this is about as early as you can get in the boot process, would be cleaner to write a small C++ program that just keeps trying to connect(/dev/log) and returns once it works.  if rsyslog hasn't started up and taken over the socket yet, you'll get ECONNREFUSED.
> while ! logger --no-act --socket-errors=on testing; do
>   sleep 0.1
> done

I'd recommend no 'sleep', just 'while ...; do true; done'.

As noted, this is part of the boot time critical path; any
sleep would likely guarantee a delay equivalent to the length
of the sleep.  A polling loop without the sleep will consume
CPU, but at that point in boot, the CPU isn't heavily used,
and the CPU load in any event wouldn't last more than a few
milliseconds, which is much more tolerable than 100ms of doing
nothing.

(As a practical matter, an empty loop body is likely to be free:
the fork/exec of 'logger' will cost more than rsyslogd initialization,
so that the result would be to invoke 'logger' at most twice, once to
see it fail, once to see it pass.)

A custom C++ program would be efficient, but I doubt the efficiency
would pay for the cost of writing and maintaining the code.

seems pretty easy in C:

#include <errno.h>
#include <sched.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
int main() {
  int s = socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0);
  if (s < 0)
    return -1;
  struct sockaddr_un sa;
  memset(&sa, 0, sizeof(sa));
  sa.sun_family = AF_UNIX;
  strcpy(sa.sun_path, "/dev/log");
  while (connect(s, (struct sockaddr *)&sa, sizeof(sa)) == -1) {
    if (errno != ECONNREFUSED && errno != ENOENT)
      return -1;
    sched_yield();
  }
  return 0;
}

because this is early boot, i tend to prefer not creating a shell to constantly fork a program (logger)
Labels: -Type-Bug-Regression Type-Feature
Owner: ----
Status: Available (was: Assigned)
Summary: Move crash-reporter --init before boot-services (was: logging_UserCrash flaking in vmtest)
In the upstart cookbook (http://upstart.ubuntu.com/cookbook/#expect and http://upstart.ubuntu.com/cookbook/#daemon-behaviour) it's made clear that this is the expected way to track service readiness. Of course, the service has to cooperate for that to work. I don't know if rsyslog does that... at least its ubuntu package configures upstart in the same way (https://github.com/rsyslog/rsyslog-pkg-ubuntu/blob/master/rsyslog/precise/v8-stable/debian/rsyslog.upstart). I'd say it's worth a try.

But if you really think it has to be done with explicit polling, then somebody else needs to take this over (if there's still enough interest in doing it at all). It doesn't have anything to do with what I was doing on crash-reporter anyway, Mike just suggested it would be easy to do and nice to have. But I'm not going to write and maintain some custom socket-checker C++ tool for it.
Blocking: 199893
Blocking: 217282
 Issue 758939  has been merged into this issue.
Owner: vapier@chromium.org
let me take a stab at picking this up again
in reading and tracing rsyslog, it has the following behavior:
- first process parses command line and config files
- fork:
  - parent process waits for the child to notify it
  - child returns to continue processing
- child loads all of its configured modules
- imuxsock is loaded which opens /dev/log
  - only here is rsyslog actually "available" by our definition
- child finishes loading/configuring everything and notifies parent
- parent exits
- child goes to main processing loop

this is good from the pov of a normal processor: the parent does not return until the child is ready to process log messages.

however, from reading the docs Julius referenced, and from reading the upstart code, i don't think this is sufficient from upstart's pov.  specifically:
* Signalling "readiness": Since Upstart tracks forks, it can only assume that once the final fork(2) call has been made (as indicated by the expect stanza specification), that the job is "ready" to accept work from other parts of the system.

that means as soon as rsyslog calls the first fork, upstart considers it ready, not when the parent exits.  i think the race window is smaller relative to upstart forking & execing rsyslog, but it isn't gone.

to double check i read that upstart description correctly, i looked at the upstart code.  init/job_process.c:job_process_trace_new_child seems to match.
/**
 * job_process_trace_new_child:
 * @job: job that changed,
 * @process: specific process. 
 *
 * This function is called whenever a traced @process attached to @job stops
 * after the fork() so that we can set the options before continuing it. 
 *  
 * Check to see whether we've reached the number of forks we expected, if
 * so detach the process and move towards the running state; otherwise we
 * set our important ptrace options, update the trace state so that
 * further SIGSTOP or SIGTRAP signals are treated as just that and allow
 * the process to continue.  
 **/
...
    /* We need to fork at least twice unless we're expecting a
     * single fork when we only need to fork once; once that limit
     * has been reached, end the trace.
     */
    job->trace_forks++;
    if ((job->trace_forks > 1) || (job->class->expect == EXPECT_FORK))
    { 
        if (ptrace (PTRACE_DETACH, job->pid[process], NULL, 0) < 0)
            nih_warn (_("Failed to detach traced "
                    "%s %s process (%d): %s"),
                  job_name (job), process_name (process),
                  job->pid[process], strerror (errno));
 
        job->trace_state = TRACE_NONE;
        job_change_state (job, job_next_state (job));
        return;
    }

it's the same logic as job_process_trace_exec, so afaict, it is still problematic.  before these funcs run, the job state is JOB_SPAWNING, so they'll both transition to the next state (and beyond).  although it's not easy to trace the upstart code as it's very event based, so there might be other logic hidden away where upstart will wait on the parent to exit.

to doubly check things, i wrote two test init scripts:
# cat fork.conf:
expect fork
script
  echo start test >> /tmp/log
  dash -c 'sleep 10'
  echo end test >> /tmp/log
end script

# cat fork-wait.conf
start on started fork
script
  echo start test2 >> /tmp/log
end script

when i ran `start fork`, i saw "start test" and "start test2" immediately in the log, then 10 seconds later i saw "end test".  upstart did not wait for the parent to exit, just for the fork to happen, which doesn't work with rsyslog's startup method :(.
Status: Started (was: Available)
posted CLs.  will add reviewers once testing finishes.
Project Member

Comment 29 by bugdroid1@chromium.org, Sep 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/c96d2da00a9720d2b780a8f953021391f6fdf797

commit c96d2da00a9720d2b780a8f953021391f6fdf797
Author: Mike Frysinger <vapier@chromium.org>
Date: Wed Sep 06 19:42:45 2017

init: add a helper for waiting for unix sockets

Currently our syslog init script returns before it's fully available
(namely, apps can syslog() and not lose messages).  We need a method
for polling the /dev/log UNIX socket and not returning until it's
usable.  Hence, this simple tool.

It's written in such a way that we could conceivably extend it to
TCP/UDP ports if the need arises.

BUG= chromium:702794 
TEST=precq passes

Change-Id: I6be60cee4c63289fc2a091b1ff6a3a44ff65364c
Reviewed-on: https://chromium-review.googlesource.com/639090
Commit-Ready: Mike Frysinger <vapier@chromium.org>
Tested-by: Mike Frysinger <vapier@chromium.org>
Reviewed-by: Mattias Nissler <mnissler@chromium.org>

[modify] https://crrev.com/c96d2da00a9720d2b780a8f953021391f6fdf797/init/init.gyp
[add] https://crrev.com/c96d2da00a9720d2b780a8f953021391f6fdf797/init/net_poll_tool.cc

Project Member

Comment 30 by bugdroid1@chromium.org, Sep 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/242f4c9e56310063169798c4694bf1c5148aa6f8

commit 242f4c9e56310063169798c4694bf1c5148aa6f8
Author: Mike Frysinger <vapier@chromium.org>
Date: Fri Sep 08 04:21:26 2017

chromeos-init: install new net_poll_tool

BUG= chromium:702794 
TEST=precq passes
CQ-DEPEND=CL:639090

Change-Id: I476de4301759fafe375f9870b69f68f057ffecf9
Reviewed-on: https://chromium-review.googlesource.com/639110
Commit-Ready: Mike Frysinger <vapier@chromium.org>
Tested-by: Mike Frysinger <vapier@chromium.org>
Reviewed-by: Chirantan Ekbote <chirantan@chromium.org>

[modify] https://crrev.com/242f4c9e56310063169798c4694bf1c5148aa6f8/chromeos-base/chromeos-init/chromeos-init-9999.ebuild

Project Member

Comment 31 by bugdroid1@chromium.org, Sep 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/54d1e4c9be35636f40d80d21b24922ba16e6aa50

commit 54d1e4c9be35636f40d80d21b24922ba16e6aa50
Author: Mike Frysinger <vapier@chromium.org>
Date: Fri Sep 08 22:05:13 2017

init: rsyslog: fix startup race

Upstart marks rsyslog as ready before it is actually ready (i.e. it
is accepting /dev/log messages via syslog()), so add a post-start
step to poll availability directly.

BUG= chromium:702794 
TEST=precq passes

Change-Id: I3d94ba5c5afc3f6e094d0f13ef1a260761f41e56
Reviewed-on: https://chromium-review.googlesource.com/639130
Commit-Ready: Mike Frysinger <vapier@chromium.org>
Tested-by: Mike Frysinger <vapier@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/54d1e4c9be35636f40d80d21b24922ba16e6aa50/init/upstart/syslog.conf

Status: Fixed (was: Started)
hopefully this should be all set now after relanding Julius's original CL

Comment 33 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Comment 34 by dchan@chromium.org, Jan 23 2018

Status: Fixed (was: Archived)

Sign in to add a comment