New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 753398 link

Starred by 2 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



Sign in to add a comment

Some feedback reports silently omit debugd data

Project Member Reported by semenzato@chromium.org, Aug 8 2017

Issue description

For instance 

https://listnr.corp.google.com/product/208/report/70153125976

is missing at least dmesg, meminfo, and syslog.  We should at least generate some error when those sections cannot be collected.
 
This is not a one-off, I have seen it in other reports too, although now I don't know how to find them (perhaps search crbug for the three keywords "feedback" "report" and "missing").
 

Comment 1 by derat@chromium.org, Aug 8 2017

Cc: r...@chromium.org
Summary: Some feedback reports silently omit debugd data (was: some feedback reports silently omit sections (meminfo, syslog))
It looks to me like none of the sections gathered by platform2/debugd/src/log_tool.cc are included. I'd guess that Chrome's D-Bus call to debugd failed. It looks like DebugDaemonClientImpl should log an error if GetDebugLogs fails, although I don't see an error in the user log (but maybe that gets grabbed before the call).
Owner: semenzato@chromium.org
There was a separate bug for the syslog, issue 749969, recently fixed.  I started looking into the Chrome code, I'll give this a stab.
The log collection code is exceedingly hard to read for such basic functionality.  But here's what I think happens.

src/extensions/browser/api/feedback_private/feedback_private_api.cc:FeedbackPrivateGetSystemInformationFunction::Run() calls CreateSystemLogsFetcher() which calls BuildChromeSystemLogsFetcher(), then invokes its Fetch() method.

The above builder (in src/chrome/browser/feedback/system_logs/) adds a number of sources as shown:

  fetcher->AddSource(base::MakeUnique<ChromeInternalLogSource>());
  fetcher->AddSource(base::MakeUnique<CrashIdsSource>());
  fetcher->AddSource(base::MakeUnique<MemoryDetailsLogSource>());

#if defined(OS_CHROMEOS)
  fetcher->AddSource(base::MakeUnique<CommandLineLogSource>());
  fetcher->AddSource(base::MakeUnique<DBusLogSource>());
  fetcher->AddSource(base::MakeUnique<DeviceEventLogSource>());
  fetcher->AddSource(base::MakeUnique<LsbReleaseLogSource>());
  fetcher->AddSource(base::MakeUnique<TouchLogSource>());

  // Debug Daemon data source - currently only this data source supports
  // the scrub_data parameter, but the others still get scrubbed by
  // SystemLogsFetcher.
  fetcher->AddSource(base::MakeUnique<DebugDaemonLogSource>(scrub_data));
#endif

So Dan's theory above might be correct, because the ChromeInternalLogSource appears to be collected before attempting to contact debugd.  However, the DebugDaemonLogSource::Fetch() includes this line:

client->GetUserLogFiles(base::Bind(&DebugDaemonLogSource::OnGetUserLogFiles,
                                     weak_ptr_factory_.GetWeakPtr()));

which is a dbus RPC:

    <method name="GetUserLogFiles">
      <tp:docstring>
        Returns list of User log file names that Chrome itself must collect.
        These logfiles are relative to the user's profile path and must be
        collected separately for each user.
      </tp:docstring>

implemented by src/platform2/debugd/src/log_tool.cc, which has this:

LogTool::LogMap LogTool::GetUserLogFiles() {
  LogMap result;
  for (size_t i = 0; kUserLogs[i].name; ++i)
    result[kUserLogs[i].name] = kUserLogs[i].command;
  return result;
}

Note that this returns a list of *filenames*, not the file contents.

// List of log files that must directly be collected by Chrome. This is because
// debugd is running under a VFS namespace and does not have access to later
// cryptohome mounts.
const Log kUserLogs[] = {
  {"chrome_user_log", "log/chrome"},
  {"login-times", "login-times"},
  {"logout-times", "logout-times"},
  { nullptr, nullptr}
};

However, those are not the missing logs.  Those are listed in kCommandLogs[], which is used in a few places including GetAllLogs(), GetFeedbackLogs(), and GetBigFeedbackLogs().  The most likely candidate is GetAllLogs(), which is called after GetUserLogFiles() in DebugDaemonLogSource::Fetch().

(to be continued)





It turns out that GetScrubbedBigLogs() is calling GetBigFeedbackLogs() which is the RPC.  But GetBigFeedbackLogs() is only dealing with the ARC++ bugreport... so I am puzzled---I must have made a mistake somewhere.

Comment 5 by r...@chromium.org, Sep 22 2017

Cc: afakhry@chromium.org
Semenzato, thanks for taking the deep dive.

ChromeInternalLogSource doesn't actually read any log files. The log files are read by debugd. What you want to look at is GetBigFeedbackLogs(): https://cs.corp.google.com/chromeos_public/src/platform2/debugd/src/log_tool.cc?q=f:log_tool.cc&l=403
Status: Started (was: Untriaged)
Thank you Ahmed.

I did see that those files are read by debugd and passed in another file, but that doesn't happen for the user logs (kUserLogs), so I still haven't figured out if the Dan's theory is correct and how to fix it.  In any case, I will continue this---I just got momentarily tired after wading through dozens of files.
Dan's theory is correct.

GetUserLogFiles() is called here [1] *before* GetScrubbedBidLogs() (which calls GetBigFeedbackLogs())


That's why in the feedback report, we can see entries such as:
- Profile[0] chrome_user_log=<multiline>,
- Profile[0] login-times=<multiline>,
- Profile[0] logout-times=<multiline>

So whatever debugd error happens, it will most probably be logged after the the user logs were already collected. GetBigFeedbackLogs() are huge and take a long time to collect and close the pipe on which they're sent.


We might want to investigate whether the 2-minute timeout [2] that we currently have is still or no longer (though I'd be very surprised!) enough, especially after ARC logs have increased the amounts of logs a lot.


[1]: https://cs.chromium.org/chromium/src/chrome/browser/chromeos/system_logs/debug_daemon_log_source.cc?q=DebugDaemonLogSource::Fetch&sq=package:chromium&l=68

[2]: https://cs.chromium.org/chromium/src/chromeos/dbus/debug_daemon_client.cc?q=GetScrubbedBigLogs&sq=package:chromium&l=243
Thanks. GetUserLogFiles() gets the file names from debugd, then it asynchronously reads them. I think we should delay that operation until GetBigFeedbackLogs() has completed.  Then it will be a lot easier to see why those files are lost.

SGTM.
By the way, I am wondering if the existing code is racy.  The pattern consists of scheduling  various kinds of asynchronous log collection.  The callback at the completion of each collection checks if num_pending_requests_ is 0, and if so it calls a function which packages and ships all collected logs.  But if, say, the first collection (GetRoutes) ends before the next collection is scheduled, num_pending_requests_ will be zero and the ending function will be called prematurely.  This is unlikely to happen but the code is still wrong, I think.


  client->GetRoutes(true,   // Numeric
                    false,  // No IPv6
                    base::Bind(&DebugDaemonLogSource::OnGetRoutes,
                               weak_ptr_factory_.GetWeakPtr()));
  ++num_pending_requests_;
  client->GetNetworkStatus(base::Bind(&DebugDaemonLogSource::OnGetNetworkStatus,
                                      weak_ptr_factory_.GetWeakPtr()));
  ++num_pending_requests_;
  client->GetModemStatus(base::Bind(&DebugDaemonLogSource::OnGetModemStatus,
                                    weak_ptr_factory_.GetWeakPtr()));
  ++num_pending_requests_;
  client->GetWiMaxStatus(base::Bind(&DebugDaemonLogSource::OnGetWiMaxStatus,
                                    weak_ptr_factory_.GetWeakPtr()));
  ++num_pending_requests_;
Not tested yet, but this is the idea:

https://chromium-review.googlesource.com/c/chromium/src/+/683336
Thanks, semenzato. Re#11: This shouldn't be a problem. The asynchronous tasks posted to run any response callbacks won't be scheduled until after  DebugDaemonLogSource::Fetch() returns, which means that num_pending_requests_ will be in its expected state.
Thanks!
For my information, where/how would a reader of this code find out that Fetch() returns before the response callbacks are scheduled?
Project Member

Comment 15 by bugdroid1@chromium.org, Sep 28 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/92a4cfcf1073517270650a5832d78859ca7f33e8

commit 92a4cfcf1073517270650a5832d78859ca7f33e8
Author: Luigi Semenzato <semenzato@chromium.org>
Date: Thu Sep 28 22:58:27 2017

chromeos log collection: collect user logs last

Chrome OS feedback reports often have missing sections, but it is
difficult to understand why because errors in fetching those sections
are logged in the chrome user logs, but those logs may be connected
before the failures occur.  This change forces the chrome user logs
to be collected after all the other logs.
    
BUG=chromium:753398
TEST=enabled feedbacks by adding these gn args:
TEST=is_official_build=true is_chrome_branded=true
TEST=then sent a report, and verified that it arrived with all files.

Change-Id: Ib08c3bfa5b89dd3afd050abaf52bcf231a5ba79d
Reviewed-on: https://chromium-review.googlesource.com/683336
Commit-Queue: Luigi Semenzato <semenzato@chromium.org>
Reviewed-by: Ahmed Fakhry <afakhry@chromium.org>
Cr-Commit-Position: refs/heads/master@{#505202}
[modify] https://crrev.com/92a4cfcf1073517270650a5832d78859ca7f33e8/chrome/browser/chromeos/system_logs/debug_daemon_log_source.cc
[modify] https://crrev.com/92a4cfcf1073517270650a5832d78859ca7f33e8/chrome/browser/chromeos/system_logs/debug_daemon_log_source.h

We still don't know why those feedback reports were incomplete, but now we can hope to see more information.  Leaving this open since it still has a manageable number of comments.

Comment 17 by pyeh@chromium.org, Oct 16 2017

https://listnr.corp.google.com/report/76219725956 is missing at least dmesg, meminfo, and syslog

The reporter mentioned DUT was really slow on resume, lots of stutters while filing this feedback.
Is there any update on this bug? is this bug still relevant?
Status: Fixed (was: Started)
Sorry, my bad for not closing this.  We think this is fixed as of comment #15
Cc: shahmoosh@google.com
Labels: -Pri-2 Pri-3
Status: Started (was: Fixed)
Actually, sorry, I forgot about what I said in comment #16.

So the first part of the bug is fixed.  Now we're hoping that when this happens again, we'll have more information to fix the second part of the bug (missing logs).

Apparently either nobody has noticed the missing logs, or maybe they did but are not aware of this bug, or maybe the logs are no longer missing.  I'll leave it up to the author of comment 18 whether to close this now and reopen as needed, or leave it open.  I would leave it open but I have no strong preference.  Lowering priority to 3 since we have no new reports.
maybe we should have some UMA stats for ARC++ log collection timeouts to see if we're impacted at scale.

On a otherwise idle kevin I ran the following; it seems to take around 35s on my device.

localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null
0.01user 0.21system 0:30.20elapsed 0%CPU (0avgtext+0avgdata 6768maxresident)k
0inputs+0outputs (0major+128minor)pagefaults 0swaps
localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null
0.03user 0.14system 0:30.75elapsed 0%CPU (0avgtext+0avgdata 6496maxresident)k
0inputs+0outputs (0major+110minor)pagefaults 0swaps
localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null
0.01user 0.18system 0:30.10elapsed 0%CPU (0avgtext+0avgdata 6736maxresident)k
0inputs+0outputs (0major+129minor)pagefaults 0swaps
localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null
0.01user 0.16system 0:29.59elapsed 0%CPU (0avgtext+0avgdata 7056maxresident)k
0inputs+0outputs (0major+130minor)pagefaults 0swaps
localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null
0.03user 0.10system 0:36.92elapsed 0%CPU (0avgtext+0avgdata 6800maxresident)k
0inputs+0outputs (0major+112minor)pagefaults 0swaps
localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null
0.01user 0.16system 0:35.09elapsed 0%CPU (0avgtext+0avgdata 6576maxresident)k
0inputs+0outputs (2major+124minor)pagefaults 0swaps
localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null
0.02user 0.15system 0:35.34elapsed 0%CPU (0avgtext+0avgdata 6592maxresident)k
0inputs+0outputs (0major+130minor)pagefaults 0swaps
localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null
0.01user 0.14system 0:38.42elapsed 0%CPU (0avgtext+0avgdata 6864maxresident)k
0inputs+0outputs (0major+124minor)pagefaults 0swaps
localhost ~ # \time cat /run/arc/bugreport/pipe > /dev/null 
0.03user 0.12system 0:36.62elapsed 0%CPU (0avgtext+0avgdata 6768maxresident)k
0inputs+0outputs (0major+116minor)pagefaults 0swaps


Sign in to add a comment