Some feedback reports silently omit debugd data |
||||||
Issue descriptionFor instance https://listnr.corp.google.com/product/208/report/70153125976 is missing at least dmesg, meminfo, and syslog. We should at least generate some error when those sections cannot be collected. This is not a one-off, I have seen it in other reports too, although now I don't know how to find them (perhaps search crbug for the three keywords "feedback" "report" and "missing").
,
Aug 12 2017
There was a separate bug for the syslog, issue 749969, recently fixed. I started looking into the Chrome code, I'll give this a stab.
,
Sep 21 2017
The log collection code is exceedingly hard to read for such basic functionality. But here's what I think happens.
src/extensions/browser/api/feedback_private/feedback_private_api.cc:FeedbackPrivateGetSystemInformationFunction::Run() calls CreateSystemLogsFetcher() which calls BuildChromeSystemLogsFetcher(), then invokes its Fetch() method.
The above builder (in src/chrome/browser/feedback/system_logs/) adds a number of sources as shown:
fetcher->AddSource(base::MakeUnique<ChromeInternalLogSource>());
fetcher->AddSource(base::MakeUnique<CrashIdsSource>());
fetcher->AddSource(base::MakeUnique<MemoryDetailsLogSource>());
#if defined(OS_CHROMEOS)
fetcher->AddSource(base::MakeUnique<CommandLineLogSource>());
fetcher->AddSource(base::MakeUnique<DBusLogSource>());
fetcher->AddSource(base::MakeUnique<DeviceEventLogSource>());
fetcher->AddSource(base::MakeUnique<LsbReleaseLogSource>());
fetcher->AddSource(base::MakeUnique<TouchLogSource>());
// Debug Daemon data source - currently only this data source supports
// the scrub_data parameter, but the others still get scrubbed by
// SystemLogsFetcher.
fetcher->AddSource(base::MakeUnique<DebugDaemonLogSource>(scrub_data));
#endif
So Dan's theory above might be correct, because the ChromeInternalLogSource appears to be collected before attempting to contact debugd. However, the DebugDaemonLogSource::Fetch() includes this line:
client->GetUserLogFiles(base::Bind(&DebugDaemonLogSource::OnGetUserLogFiles,
weak_ptr_factory_.GetWeakPtr()));
which is a dbus RPC:
<method name="GetUserLogFiles">
<tp:docstring>
Returns list of User log file names that Chrome itself must collect.
These logfiles are relative to the user's profile path and must be
collected separately for each user.
</tp:docstring>
implemented by src/platform2/debugd/src/log_tool.cc, which has this:
LogTool::LogMap LogTool::GetUserLogFiles() {
LogMap result;
for (size_t i = 0; kUserLogs[i].name; ++i)
result[kUserLogs[i].name] = kUserLogs[i].command;
return result;
}
Note that this returns a list of *filenames*, not the file contents.
// List of log files that must directly be collected by Chrome. This is because
// debugd is running under a VFS namespace and does not have access to later
// cryptohome mounts.
const Log kUserLogs[] = {
{"chrome_user_log", "log/chrome"},
{"login-times", "login-times"},
{"logout-times", "logout-times"},
{ nullptr, nullptr}
};
However, those are not the missing logs. Those are listed in kCommandLogs[], which is used in a few places including GetAllLogs(), GetFeedbackLogs(), and GetBigFeedbackLogs(). The most likely candidate is GetAllLogs(), which is called after GetUserLogFiles() in DebugDaemonLogSource::Fetch().
(to be continued)
,
Sep 21 2017
It turns out that GetScrubbedBigLogs() is calling GetBigFeedbackLogs() which is the RPC. But GetBigFeedbackLogs() is only dealing with the ARC++ bugreport... so I am puzzled---I must have made a mistake somewhere.
,
Sep 22 2017
,
Sep 22 2017
Semenzato, thanks for taking the deep dive. ChromeInternalLogSource doesn't actually read any log files. The log files are read by debugd. What you want to look at is GetBigFeedbackLogs(): https://cs.corp.google.com/chromeos_public/src/platform2/debugd/src/log_tool.cc?q=f:log_tool.cc&l=403
,
Sep 22 2017
Thank you Ahmed. I did see that those files are read by debugd and passed in another file, but that doesn't happen for the user logs (kUserLogs), so I still haven't figured out if the Dan's theory is correct and how to fix it. In any case, I will continue this---I just got momentarily tired after wading through dozens of files.
,
Sep 22 2017
Dan's theory is correct. GetUserLogFiles() is called here [1] *before* GetScrubbedBidLogs() (which calls GetBigFeedbackLogs()) That's why in the feedback report, we can see entries such as: - Profile[0] chrome_user_log=<multiline>, - Profile[0] login-times=<multiline>, - Profile[0] logout-times=<multiline> So whatever debugd error happens, it will most probably be logged after the the user logs were already collected. GetBigFeedbackLogs() are huge and take a long time to collect and close the pipe on which they're sent. We might want to investigate whether the 2-minute timeout [2] that we currently have is still or no longer (though I'd be very surprised!) enough, especially after ARC logs have increased the amounts of logs a lot. [1]: https://cs.chromium.org/chromium/src/chrome/browser/chromeos/system_logs/debug_daemon_log_source.cc?q=DebugDaemonLogSource::Fetch&sq=package:chromium&l=68 [2]: https://cs.chromium.org/chromium/src/chromeos/dbus/debug_daemon_client.cc?q=GetScrubbedBigLogs&sq=package:chromium&l=243
,
Sep 22 2017
Thanks. GetUserLogFiles() gets the file names from debugd, then it asynchronously reads them. I think we should delay that operation until GetBigFeedbackLogs() has completed. Then it will be a lot easier to see why those files are lost.
,
Sep 22 2017
SGTM.
,
Sep 25 2017
By the way, I am wondering if the existing code is racy. The pattern consists of scheduling various kinds of asynchronous log collection. The callback at the completion of each collection checks if num_pending_requests_ is 0, and if so it calls a function which packages and ships all collected logs. But if, say, the first collection (GetRoutes) ends before the next collection is scheduled, num_pending_requests_ will be zero and the ending function will be called prematurely. This is unlikely to happen but the code is still wrong, I think.
client->GetRoutes(true, // Numeric
false, // No IPv6
base::Bind(&DebugDaemonLogSource::OnGetRoutes,
weak_ptr_factory_.GetWeakPtr()));
++num_pending_requests_;
client->GetNetworkStatus(base::Bind(&DebugDaemonLogSource::OnGetNetworkStatus,
weak_ptr_factory_.GetWeakPtr()));
++num_pending_requests_;
client->GetModemStatus(base::Bind(&DebugDaemonLogSource::OnGetModemStatus,
weak_ptr_factory_.GetWeakPtr()));
++num_pending_requests_;
client->GetWiMaxStatus(base::Bind(&DebugDaemonLogSource::OnGetWiMaxStatus,
weak_ptr_factory_.GetWeakPtr()));
++num_pending_requests_;
,
Sep 26 2017
Not tested yet, but this is the idea: https://chromium-review.googlesource.com/c/chromium/src/+/683336
,
Sep 26 2017
Thanks, semenzato. Re#11: This shouldn't be a problem. The asynchronous tasks posted to run any response callbacks won't be scheduled until after DebugDaemonLogSource::Fetch() returns, which means that num_pending_requests_ will be in its expected state.
,
Sep 26 2017
Thanks! For my information, where/how would a reader of this code find out that Fetch() returns before the response callbacks are scheduled?
,
Sep 28 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/92a4cfcf1073517270650a5832d78859ca7f33e8 commit 92a4cfcf1073517270650a5832d78859ca7f33e8 Author: Luigi Semenzato <semenzato@chromium.org> Date: Thu Sep 28 22:58:27 2017 chromeos log collection: collect user logs last Chrome OS feedback reports often have missing sections, but it is difficult to understand why because errors in fetching those sections are logged in the chrome user logs, but those logs may be connected before the failures occur. This change forces the chrome user logs to be collected after all the other logs. BUG=chromium:753398 TEST=enabled feedbacks by adding these gn args: TEST=is_official_build=true is_chrome_branded=true TEST=then sent a report, and verified that it arrived with all files. Change-Id: Ib08c3bfa5b89dd3afd050abaf52bcf231a5ba79d Reviewed-on: https://chromium-review.googlesource.com/683336 Commit-Queue: Luigi Semenzato <semenzato@chromium.org> Reviewed-by: Ahmed Fakhry <afakhry@chromium.org> Cr-Commit-Position: refs/heads/master@{#505202} [modify] https://crrev.com/92a4cfcf1073517270650a5832d78859ca7f33e8/chrome/browser/chromeos/system_logs/debug_daemon_log_source.cc [modify] https://crrev.com/92a4cfcf1073517270650a5832d78859ca7f33e8/chrome/browser/chromeos/system_logs/debug_daemon_log_source.h
,
Sep 29 2017
We still don't know why those feedback reports were incomplete, but now we can hope to see more information. Leaving this open since it still has a manageable number of comments.
,
Oct 16 2017
https://listnr.corp.google.com/report/76219725956 is missing at least dmesg, meminfo, and syslog The reporter mentioned DUT was really slow on resume, lots of stutters while filing this feedback.
,
Jan 9 2018
Is there any update on this bug? is this bug still relevant?
,
Jan 9 2018
Sorry, my bad for not closing this. We think this is fixed as of comment #15
,
Jan 9 2018
Actually, sorry, I forgot about what I said in comment #16. So the first part of the bug is fixed. Now we're hoping that when this happens again, we'll have more information to fix the second part of the bug (missing logs). Apparently either nobody has noticed the missing logs, or maybe they did but are not aware of this bug, or maybe the logs are no longer missing. I'll leave it up to the author of comment 18 whether to close this now and reopen as needed, or leave it open. I would leave it open but I have no strong preference. Lowering priority to 3 since we have no new reports.
,
Jul 6
maybe we should have some UMA stats for ARC++ log collection timeouts to see if we're impacted at scale.
,
Jul 6
On a otherwise idle kevin I ran the following; it seems to take around 35s on my device. localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null 0.01user 0.21system 0:30.20elapsed 0%CPU (0avgtext+0avgdata 6768maxresident)k 0inputs+0outputs (0major+128minor)pagefaults 0swaps localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null 0.03user 0.14system 0:30.75elapsed 0%CPU (0avgtext+0avgdata 6496maxresident)k 0inputs+0outputs (0major+110minor)pagefaults 0swaps localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null 0.01user 0.18system 0:30.10elapsed 0%CPU (0avgtext+0avgdata 6736maxresident)k 0inputs+0outputs (0major+129minor)pagefaults 0swaps localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null 0.01user 0.16system 0:29.59elapsed 0%CPU (0avgtext+0avgdata 7056maxresident)k 0inputs+0outputs (0major+130minor)pagefaults 0swaps localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null 0.03user 0.10system 0:36.92elapsed 0%CPU (0avgtext+0avgdata 6800maxresident)k 0inputs+0outputs (0major+112minor)pagefaults 0swaps localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null 0.01user 0.16system 0:35.09elapsed 0%CPU (0avgtext+0avgdata 6576maxresident)k 0inputs+0outputs (2major+124minor)pagefaults 0swaps localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null 0.02user 0.15system 0:35.34elapsed 0%CPU (0avgtext+0avgdata 6592maxresident)k 0inputs+0outputs (0major+130minor)pagefaults 0swaps localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null 0.01user 0.14system 0:38.42elapsed 0%CPU (0avgtext+0avgdata 6864maxresident)k 0inputs+0outputs (0major+124minor)pagefaults 0swaps localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null 0.03user 0.12system 0:36.62elapsed 0%CPU (0avgtext+0avgdata 6768maxresident)k 0inputs+0outputs (0major+116minor)pagefaults 0swaps |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by derat@chromium.org
, Aug 8 2017Summary: Some feedback reports silently omit debugd data (was: some feedback reports silently omit sections (meminfo, syslog))