Move crash-reporter --init before boot-services |
||||||||||||||
Issue descriptionsince the adjustment to crash-reporter startup, logging_UserCrash has been flaky https://chromium-review.googlesource.com/335467 see issue 702500 for lots of examples
,
Mar 17 2017
I'm fine with reverting that change if needed, but other than that I'm not really sure how to debug this. Every time I run it locally it seems to work fine. The error looks like crash_reporter --init wasn't run correctly, but no real indication why (since the execution moved earlier in the boot process, I don't think it could really be a race condition with the autotest starting too early?). Could either of you guys who are more familiar with this stuff try to take a look?
,
Mar 17 2017
reverting, this is hitting the pre-cq
,
Mar 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/1c43d466bc82fe88f4ca3b5f640d54894980725f commit 1c43d466bc82fe88f4ca3b5f640d54894980725f Author: Aviv Keshet <akeshet@chromium.org> Date: Sat Mar 18 00:12:09 2017 Revert "crash: Move crash-reporter before boot-services" BUG= chromium:702794 TEST=None This reverts commit 570b787c4020b8cb29f0fb16f1c8d52f5f8371a9. Change-Id: Ie602001e82f5b5354615b70176d74622658d1d4e Reviewed-on: https://chromium-review.googlesource.com/456629 Reviewed-by: Aviv Keshet <akeshet@chromium.org> Commit-Queue: Aviv Keshet <akeshet@chromium.org> Tested-by: Aviv Keshet <akeshet@chromium.org> Trybot-Ready: Aviv Keshet <akeshet@chromium.org> [modify] https://crrev.com/1c43d466bc82fe88f4ca3b5f640d54894980725f/crash-reporter/init/crash-reporter.conf [modify] https://crrev.com/1c43d466bc82fe88f4ca3b5f640d54894980725f/crash-reporter/init/crash-reporter.service [modify] https://crrev.com/1c43d466bc82fe88f4ca3b5f640d54894980725f/crash-reporter/chrome_collector.cc [modify] https://crrev.com/1c43d466bc82fe88f4ca3b5f640d54894980725f/crash-reporter/init/crash-boot-collect.conf [modify] https://crrev.com/1c43d466bc82fe88f4ca3b5f640d54894980725f/crash-reporter/crash_collector.cc [modify] https://crrev.com/1c43d466bc82fe88f4ca3b5f640d54894980725f/crash-reporter/arc_collector.cc
,
Mar 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/1c43d466bc82fe88f4ca3b5f640d54894980725f commit 1c43d466bc82fe88f4ca3b5f640d54894980725f Author: Aviv Keshet <akeshet@chromium.org> Date: Sat Mar 18 00:12:09 2017 Revert "crash: Move crash-reporter before boot-services" BUG= chromium:702794 TEST=None This reverts commit 570b787c4020b8cb29f0fb16f1c8d52f5f8371a9. Change-Id: Ie602001e82f5b5354615b70176d74622658d1d4e Reviewed-on: https://chromium-review.googlesource.com/456629 Reviewed-by: Aviv Keshet <akeshet@chromium.org> Commit-Queue: Aviv Keshet <akeshet@chromium.org> Tested-by: Aviv Keshet <akeshet@chromium.org> Trybot-Ready: Aviv Keshet <akeshet@chromium.org> [modify] https://crrev.com/1c43d466bc82fe88f4ca3b5f640d54894980725f/crash-reporter/init/crash-reporter.conf [modify] https://crrev.com/1c43d466bc82fe88f4ca3b5f640d54894980725f/crash-reporter/init/crash-reporter.service [modify] https://crrev.com/1c43d466bc82fe88f4ca3b5f640d54894980725f/crash-reporter/chrome_collector.cc [modify] https://crrev.com/1c43d466bc82fe88f4ca3b5f640d54894980725f/crash-reporter/init/crash-boot-collect.conf [modify] https://crrev.com/1c43d466bc82fe88f4ca3b5f640d54894980725f/crash-reporter/crash_collector.cc [modify] https://crrev.com/1c43d466bc82fe88f4ca3b5f640d54894980725f/crash-reporter/arc_collector.cc
,
Mar 18 2017
I took a look at this, all failures seem to happen on iteration 1. I suspect that rather than being a race condition, maybe this happens on the first test after a fresh provision? Or maybe the first test after a repair. It must be looking for a string that only fails to exist the first time the test is run on a fresh DUT. That would also explain why jwerner@ was unable to replicate if he wasn't using a freshly-installed DUT.
,
Mar 18 2017
Yes, I'm testing with the Kevin on my desk. I don't really know enough about the lab to simulate the provisioning process accurately. Can you guys give me some simple instructions for how I could either get my Kevin into that state or lock and prepare a lab DUT correctly so that I could play around with it on there? Essentially, I think the test reads /proc/sys/kernel/core_pattern and fails because it doesn't find the expected string (|/sbin/crash_reporter --user=%%P:%%s:%%u:%%e). After every boot, upstart runs crash_reporter --init (with my patch a little earlier than before) and writes that string in there. I really have no idea where this could fail for a newly provisioned board. If crash_reporter --init had some sort of error it should write that to syslog (I think?) but I can't find anything in the logs here.
,
Mar 18 2017
Looked a little more, the failure definitely happens during iteration 1 - there is no occurrence of the string "Enabling user crash handling" in /var/log/messages on the affected hosts by the time the test begins. The crash_reporter --init doesn't seem to have been run at all. I no longer think this depends on the DUT state prior to the test run. I've attached a copy of /var/log/messages from one of the testruns, could you take a look and see if anything jumps out at you Julius? I'm now suspicious of the line "start on starting boot-services and started syslog" in the upstart script in your CL for crash-reporter.conf. If boot-services doesn't start or syslog doesn't start successfully due to a failure unrelated to your CL, this test would fail for this reason.
,
Mar 18 2017
here's that /var/log/messages
,
Mar 18 2017
I believe I can confirm that function does not fail due to not finding that expected string "|/sbin/crash_reporter --user=%%P:%%s:%%u:%%e". That would raise "core pattern should have been %s, not %s" at https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/site_tests/logging_UserCrash/logging_UserCrash.py?l=38
,
Mar 18 2017
Yeah, good point... nice attention to detail there! So this would mean that the pattern was actually set correctly (otherwise we would've gotten the other error), which means crash_reporter --init probably did run correctly. It just couldn't output to syslog for some reason. I think you may have uploaded the wrong file by accident there (or Monorail is having a weird bug when I click it right now)... here's another one. The bug happened during the last boot, and you can see that while we usually get "Enabling user crash handling" and shortly afterwards "Enabling kernel crash handling", the last boot only has the latter. I don't think syslog didn't start at all because we obviously still got a log, but is it possible that there's a race condition and it can't accept messages for a short time window after launching? (Who's good with rsyslogd here?) Could we somehow fix that so it only returns to upstart once it is really ready to log stuff... or should we just not bother and remove that part of the test? (Also lowering priority again. The patch is reverted and this is only about whether we want to try re-adding it.)
,
Mar 18 2017
This is not confined to the lab, it happens in VMTest. See for instance: https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/pre_cq/builds/23492
,
Mar 18 2017
Looking at syslog.conf, it seems that we're currently launching rsyslogd in non-daemonizing mode. As far as I understand that wouldn't give upstart any chance to make sure it actually becomes ready to receive log entries before proceeding. The upstart cookbook suggests daemons to use the expect fork or expect daemon stanzas with a daemonizing executable to solve this, and all rsyslogd upstart scripts I can find online also do that. It seems that we used to do that until http://crosreview.com/32253 (associated with issue 212885 ), which changed it without explicit justification. +Richard who reviewed it back in the day.
,
Mar 29 2017
*ping* Richard? Mike? Does anyone here know about syslog.conf or know who to ask? Should I just upload a patch to change it and see what breaks?
,
Mar 29 2017
The rsyslogd invocation changed from this:
expect fork
exec rsyslogd -c4 -f /etc/rsyslog.chromeos
To this
exec rsyslogd -c4 -n -f /etc/rsyslog.chromeos
Here's what the man page says about the -n option:
-n Avoid auto-backgrounding. This is needed especially if the
rsyslogd is started and controlled by init(8).
So, on its face, it looks like a plausible change.
I don't know a whole lot more about why that change was made;
mostly I trusted the author to know enough.
,
Mar 29 2017
I'm pretty sure the man page reference there refers to a less sophisticated version of /sbin/init, not upstart. upstart can explicitly track when a process is daemonizing and as far as I can tell that is how you're supposed to hook up rsyslogd with it (e.g. https://github.com/librato/rsyslog-cookbook/blob/master/files/default/rsyslog.upstart.conf). So if nobody knows a good reason why I shouldn't, I'll change it back to that.
,
Mar 30 2017
the good reason would be to not waste resources forking & ptracing in the first place. it's cheaper to not do all of that. i don't think upstart differentiates "i've launched the daemon" and "the daemon is fully initialized & ready to process requests". we'd have to add our own logic to support that.
,
Mar 30 2017
> i don't think upstart differentiates "i've launched the daemon" > and "the daemon is fully initialized & ready to process requests". > we'd have to add our own logic to support that. Upstart can't make that distinction. If the daemon is actually written so that it's ready to serve before it daemonizes, then undoing the "-n" change will do the right thing, but I doubt we should depend on that sort of behavior, even if it exists. I think the right thing to do is along the lines of what vapier@ said: Add a 'post-start' stanza to the upstart job that specifically waits until rsyslogd can be observed to be serving. Then 'start on started syslog' will be known to do what we want.
,
Mar 30 2017
we could use `logger` in a post-start like: while ! logger --no-act --socket-errors=on testing; do sleep 0.1 done but considering this is about as early as you can get in the boot process, would be cleaner to write a small C++ program that just keeps trying to connect(/dev/log) and returns once it works. if rsyslog hasn't started up and taken over the socket yet, you'll get ECONNREFUSED.
,
Mar 30 2017
> while ! logger --no-act --socket-errors=on testing; do > sleep 0.1 > done I'd recommend no 'sleep', just 'while ...; do true; done'. As noted, this is part of the boot time critical path; any sleep would likely guarantee a delay equivalent to the length of the sleep. A polling loop without the sleep will consume CPU, but at that point in boot, the CPU isn't heavily used, and the CPU load in any event wouldn't last more than a few milliseconds, which is much more tolerable than 100ms of doing nothing. (As a practical matter, an empty loop body is likely to be free: the fork/exec of 'logger' will cost more than rsyslogd initialization, so that the result would be to invoke 'logger' at most twice, once to see it fail, once to see it pass.) A custom C++ program would be efficient, but I doubt the efficiency would pay for the cost of writing and maintaining the code.
,
Mar 30 2017
seems pretty easy in C:
#include <errno.h>
#include <sched.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
int main() {
int s = socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0);
if (s < 0)
return -1;
struct sockaddr_un sa;
memset(&sa, 0, sizeof(sa));
sa.sun_family = AF_UNIX;
strcpy(sa.sun_path, "/dev/log");
while (connect(s, (struct sockaddr *)&sa, sizeof(sa)) == -1) {
if (errno != ECONNREFUSED && errno != ENOENT)
return -1;
sched_yield();
}
return 0;
}
because this is early boot, i tend to prefer not creating a shell to constantly fork a program (logger)
,
Mar 31 2017
In the upstart cookbook (http://upstart.ubuntu.com/cookbook/#expect and http://upstart.ubuntu.com/cookbook/#daemon-behaviour) it's made clear that this is the expected way to track service readiness. Of course, the service has to cooperate for that to work. I don't know if rsyslog does that... at least its ubuntu package configures upstart in the same way (https://github.com/rsyslog/rsyslog-pkg-ubuntu/blob/master/rsyslog/precise/v8-stable/debian/rsyslog.upstart). I'd say it's worth a try. But if you really think it has to be done with explicit polling, then somebody else needs to take this over (if there's still enough interest in doing it at all). It doesn't have anything to do with what I was doing on crash-reporter anyway, Mike just suggested it would be easy to do and nice to have. But I'm not going to write and maintain some custom socket-checker C++ tool for it.
,
May 8 2017
,
May 8 2017
,
Aug 25 2017
Issue 758939 has been merged into this issue.
,
Aug 25 2017
let me take a stab at picking this up again
,
Aug 26 2017
in reading and tracing rsyslog, it has the following behavior:
- first process parses command line and config files
- fork:
- parent process waits for the child to notify it
- child returns to continue processing
- child loads all of its configured modules
- imuxsock is loaded which opens /dev/log
- only here is rsyslog actually "available" by our definition
- child finishes loading/configuring everything and notifies parent
- parent exits
- child goes to main processing loop
this is good from the pov of a normal processor: the parent does not return until the child is ready to process log messages.
however, from reading the docs Julius referenced, and from reading the upstart code, i don't think this is sufficient from upstart's pov. specifically:
* Signalling "readiness": Since Upstart tracks forks, it can only assume that once the final fork(2) call has been made (as indicated by the expect stanza specification), that the job is "ready" to accept work from other parts of the system.
that means as soon as rsyslog calls the first fork, upstart considers it ready, not when the parent exits. i think the race window is smaller relative to upstart forking & execing rsyslog, but it isn't gone.
to double check i read that upstart description correctly, i looked at the upstart code. init/job_process.c:job_process_trace_new_child seems to match.
/**
* job_process_trace_new_child:
* @job: job that changed,
* @process: specific process.
*
* This function is called whenever a traced @process attached to @job stops
* after the fork() so that we can set the options before continuing it.
*
* Check to see whether we've reached the number of forks we expected, if
* so detach the process and move towards the running state; otherwise we
* set our important ptrace options, update the trace state so that
* further SIGSTOP or SIGTRAP signals are treated as just that and allow
* the process to continue.
**/
...
/* We need to fork at least twice unless we're expecting a
* single fork when we only need to fork once; once that limit
* has been reached, end the trace.
*/
job->trace_forks++;
if ((job->trace_forks > 1) || (job->class->expect == EXPECT_FORK))
{
if (ptrace (PTRACE_DETACH, job->pid[process], NULL, 0) < 0)
nih_warn (_("Failed to detach traced "
"%s %s process (%d): %s"),
job_name (job), process_name (process),
job->pid[process], strerror (errno));
job->trace_state = TRACE_NONE;
job_change_state (job, job_next_state (job));
return;
}
it's the same logic as job_process_trace_exec, so afaict, it is still problematic. before these funcs run, the job state is JOB_SPAWNING, so they'll both transition to the next state (and beyond). although it's not easy to trace the upstart code as it's very event based, so there might be other logic hidden away where upstart will wait on the parent to exit.
to doubly check things, i wrote two test init scripts:
# cat fork.conf:
expect fork
script
echo start test >> /tmp/log
dash -c 'sleep 10'
echo end test >> /tmp/log
end script
# cat fork-wait.conf
start on started fork
script
echo start test2 >> /tmp/log
end script
when i ran `start fork`, i saw "start test" and "start test2" immediately in the log, then 10 seconds later i saw "end test". upstart did not wait for the parent to exit, just for the fork to happen, which doesn't work with rsyslog's startup method :(.
,
Aug 28 2017
posted CLs. will add reviewers once testing finishes.
,
Sep 6 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/c96d2da00a9720d2b780a8f953021391f6fdf797 commit c96d2da00a9720d2b780a8f953021391f6fdf797 Author: Mike Frysinger <vapier@chromium.org> Date: Wed Sep 06 19:42:45 2017 init: add a helper for waiting for unix sockets Currently our syslog init script returns before it's fully available (namely, apps can syslog() and not lose messages). We need a method for polling the /dev/log UNIX socket and not returning until it's usable. Hence, this simple tool. It's written in such a way that we could conceivably extend it to TCP/UDP ports if the need arises. BUG= chromium:702794 TEST=precq passes Change-Id: I6be60cee4c63289fc2a091b1ff6a3a44ff65364c Reviewed-on: https://chromium-review.googlesource.com/639090 Commit-Ready: Mike Frysinger <vapier@chromium.org> Tested-by: Mike Frysinger <vapier@chromium.org> Reviewed-by: Mattias Nissler <mnissler@chromium.org> [modify] https://crrev.com/c96d2da00a9720d2b780a8f953021391f6fdf797/init/init.gyp [add] https://crrev.com/c96d2da00a9720d2b780a8f953021391f6fdf797/init/net_poll_tool.cc
,
Sep 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/242f4c9e56310063169798c4694bf1c5148aa6f8 commit 242f4c9e56310063169798c4694bf1c5148aa6f8 Author: Mike Frysinger <vapier@chromium.org> Date: Fri Sep 08 04:21:26 2017 chromeos-init: install new net_poll_tool BUG= chromium:702794 TEST=precq passes CQ-DEPEND=CL:639090 Change-Id: I476de4301759fafe375f9870b69f68f057ffecf9 Reviewed-on: https://chromium-review.googlesource.com/639110 Commit-Ready: Mike Frysinger <vapier@chromium.org> Tested-by: Mike Frysinger <vapier@chromium.org> Reviewed-by: Chirantan Ekbote <chirantan@chromium.org> [modify] https://crrev.com/242f4c9e56310063169798c4694bf1c5148aa6f8/chromeos-base/chromeos-init/chromeos-init-9999.ebuild
,
Sep 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/54d1e4c9be35636f40d80d21b24922ba16e6aa50 commit 54d1e4c9be35636f40d80d21b24922ba16e6aa50 Author: Mike Frysinger <vapier@chromium.org> Date: Fri Sep 08 22:05:13 2017 init: rsyslog: fix startup race Upstart marks rsyslog as ready before it is actually ready (i.e. it is accepting /dev/log messages via syslog()), so add a post-start step to poll availability directly. BUG= chromium:702794 TEST=precq passes Change-Id: I3d94ba5c5afc3f6e094d0f13ef1a260761f41e56 Reviewed-on: https://chromium-review.googlesource.com/639130 Commit-Ready: Mike Frysinger <vapier@chromium.org> Tested-by: Mike Frysinger <vapier@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> [modify] https://crrev.com/54d1e4c9be35636f40d80d21b24922ba16e6aa50/init/upstart/syslog.conf
,
Sep 9 2017
hopefully this should be all set now after relanding Julius's original CL
,
Jan 22 2018
,
Jan 23 2018
|
||||||||||||||
►
Sign in to add a comment |
||||||||||||||
Comment 1 by vapier@chromium.org
, Mar 17 2017