Liveness check fails occasionally on chrome restart |
||||||
Issue descriptionChrome could be restarted when having custom flags (or policy switches). In some case (busy device?), chrome restart is not fast enough and missed a liveness poing. As a result, it is killed when liveness ping timeouts. e.g. http://crash/fa1e98276659a75f http://crash/0ef826c8eabcc97a Typical session manager log: 2018-09-01T10:32:09.342430+02:00 INFO session_manager[27678]: [INFO:browser_job.cc(148)] Running browser /opt/google/chrome/chrome --ppapi-flash-path=/opt/google/chrome/pepper/libpepflashplayer.so --ppapi-flash-... 2018-09-01T10:32:24.360316+02:00 ERR session_manager[27678]: [ERROR:object_proxy.cc(580)] Failed to call method: org.chromium.LivenessServiceInterface.CheckLiveness: object_path= /org/chromium/LivenessService: org.freedesktop.DBus.Error.ServiceUnknown: The name org.chromium.LivenessService was not provided by any .service files 2018-09-01T10:32:39.368552+02:00 WARNING session_manager[27678]: [WARNING:liveness_checker_impl.cc(65)] Browser hang detected! Note the CheckLiveness call fail. Chrome is not aware of the ping at all and gets killed later on.
,
Sep 11
Not sure what is happening on these devices either. The device seems to be very sluggish. Using the first crash (http://crash/fa1e98276659a75f) as example. session_manager starts chrome in 2018-09-01T10:32:09.342430. Yet the chrome log file name is chrome_20180901-103232. So 10:32:09 -> 10:32:32, it takes 3s for chrome to create its log file. Did not see that very often.
,
Sep 22
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/fefb938a693e9c39c900b38ac4ddd0eedc237093 commit fefb938a693e9c39c900b38ac4ddd0eedc237093 Author: Xiyuan Xia <xiyuan@google.com> Date: Sat Sep 22 05:25:52 2018 login: Dump top tasks on browser hang Liveness check ping fails more frequently than before. Crash dump shows no trace of chrome busy or deadlock. However, the device seems to be busy when the ping fails according to the log line timestamp (e.g. 10s for session_manager start). Dump top tasks to get more data of the situation. Also remove lsof debug log since it is no longer needed. BUG=chromium:883029 TEST=Manual. Change-Id: I7302fbf0fbd68e49a30df297f90fc72cceeb5476 Reviewed-on: https://chromium-review.googlesource.com/1238939 Commit-Ready: Xiyuan Xia <xiyuan@chromium.org> Tested-by: Xiyuan Xia <xiyuan@chromium.org> Reviewed-by: Xiyuan Xia <xiyuan@chromium.org> [modify] https://crrev.com/fefb938a693e9c39c900b38ac4ddd0eedc237093/login_manager/liveness_checker_impl.cc
,
Sep 28
Quick look at crashes for 71.0.3558.0 and the top 20 clients. The 1st one is busy flashrom, probably issue 865731. The 2nd one is chrome. The crash stack looks not that busy though. The 3rd is cras The 4th is OOM, kswapd
,
Sep 28
,
Oct 2
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/6c6d2f3138d061739ddd83a774818f3ec0be238d commit 6c6d2f3138d061739ddd83a774818f3ec0be238d Author: Xiyuan Xia <xiyuan@google.com> Date: Tue Oct 02 12:19:59 2018 login: Collect command line of top tasks on browser hang - Collect top task command line (up to 512 chars); - Trim by lines instead of string size; BUG=chromium:883029 TEST=Manual. Change-Id: Ifadf5c30ff62b6f645e6f4ed852be2e30c4efa8c Reviewed-on: https://chromium-review.googlesource.com/1252548 Commit-Ready: Xiyuan Xia <xiyuan@chromium.org> Tested-by: Xiyuan Xia <xiyuan@chromium.org> Reviewed-by: Dan Erat <derat@chromium.org> [modify] https://crrev.com/6c6d2f3138d061739ddd83a774818f3ec0be238d/login_manager/liveness_checker_impl.cc
,
Nov 29
Any update here? The associated signature has creeped up to be the #1 crash in the latest beta.
,
Nov 29
Looks like we got another spike on the flashrom issue. Looked the the top 3 clients in the list: https://crash.corp.google.com/browse?q=product_name%3D%27Chrome_ChromeOS%27+AND+product.Version+LIKE+%2772.%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AMessagePumpDefault%3A%3ARun%27 Will ping the flashrom bug with the new data.
,
Nov 29
This bug was originally filed to handle an issue where Chrome could apparently get a SIGABRT from session_manager due to Chrome taking so long to restart (e.g. for custom flags) that session_manager thinks that Chrome is hanging. Please be sure to not conflate that concern with all crashes where Chrome received a SIGABRT from session_manager. All that these SIGABRTs tell us is that Chrome was hanging. More investigation needs to happen to determine the cause (typically in the form of inspection of the main thread of the browser process). If http://crash/browse?q=product_name%3D%27Chrome_ChromeOS%27+AND+product.Version+LIKE+%2772.%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AMessagePumpDefault%3A%3ARun%27 (from #8) is what's being looked at here, I don't think that those crashes are meaningful. As I click through the sample reports, I'm just seeing various processes receiving SIGABRT while idle. Please only look at crash reports from the browser process here.
,
Nov 29
The top 3 clients in the crash link are mostly (if not all) killed due to liveness check and seems because of flashrom per chrome.txt. We can jump to relevant browser process crash by client id, e.g. for the top 1 client e59189053fa244b2a6c76a149f21cc31 : https://crash.corp.google.com/browse?q=product_name%3D%27Chrome_ChromeOS%27+AND+product.Version+LIKE+%2772.%25%27+AND+clientid%3D%27e59189053fa244b2a6c76a149f21cc31%27+AND+expanded_custom_data.ChromeCrashProto.ptype%3D%27browser%27
,
Nov 29
Should we file a separate issue for the flashrom problem? Any idea what's triggering the flashrom call that leads to the failure?
,
Nov 29
,
Nov 29
,
Nov 30
,
Nov 30
Yes, filed issue 910409 for flashrom issue (was using issue 865731 but vbendeb@ says it is likely a different cause, hence the new bug). Also filed issue 910409 that wallpaper resizer also hogs CPU, the 2nd popular in the crashes I saw.
,
Dec 5
Figured out a crash query to show crashes caused by slow chrome start as in #0. Basically look at browser crashes with ProcessUptime < 30000. Here is the one for M72: https://crash.corp.google.com/browse?q=product_name%3D%27Chrome_ChromeOS%27+AND+product.Version+LIKE+%2772.%25%27+AND+expanded_custom_data.ChromeCrashProto.ptype%3D%27browser%27+AND+ProcessUptime+%3C+30000+AND+crash.Reason%3D%27SIGABRT%27#samplereports Master issue 842272 has about 7506 crashes today and the query above has 1291 crashes. 7506 could roughly be converted to relevant browser crash by divide 3 (renderer/extension/utility) to 2500+. So this is about 50%. Unfotuntely, I have not figured out a good way to further break it down to flashrom related, OOM/slash related, hang on chrome start (DRM/GPU related) but those are the noted causes so far. We could also increase the ProcessUptime to 60000 and see liveness check kills but it is not due to slow chrome start when 30000 <= ProcessUptime < 60000. flashrom seems popular there too.
,
Dec 7
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/f8ab5136f85ea9251b37a0ba4f59e1f42d4a2748 commit f8ab5136f85ea9251b37a0ba4f59e1f42d4a2748 Author: Xiyuan Xia <xiyuan@google.com> Date: Fri Dec 07 06:06:43 2018 login: Collect pstree of flashrom If flashrom shows up on the top output, dump pstree -sal -p <flashrom_pid> output to log. BUG=chromium:883029,chromium:910411 TEST=Manual Change-Id: I3ed8084763feea188b6b6b19ee64ee834ab22055 Reviewed-on: https://chromium-review.googlesource.com/1362396 Commit-Ready: Xiyuan Xia <xiyuan@chromium.org> Tested-by: Xiyuan Xia <xiyuan@chromium.org> Reviewed-by: Dan Erat <derat@chromium.org> [modify] https://crrev.com/f8ab5136f85ea9251b37a0ba4f59e1f42d4a2748/login_manager/liveness_checker_impl.cc |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by derat@chromium.org
, Sep 11Components: OS>Systems
Labels: -OS-iOS
I'm not an expert on this code, but it looks like session_manager is already trying to handle this. SessionManagerService::RunBrowser() calls LivenessChecker::Start(), and SessionManagerService::HandleExit() calls LivenessChecker::Stop() (before calling RunBrowser() again). I think that the logs that you quoted were from a non-stable-channel device since there was a 15s delay between Chrome being started and the D-Bus method call being made. See session_manager_main.cc: constexpr base::TimeDelta kHangDetectionIntervalStable = base::TimeDelta::FromSeconds(60); constexpr base::TimeDelta kHangDetectionIntervalDev = base::TimeDelta::FromSeconds(15); constexpr base::TimeDelta kHangDetectionIntervalTest = base::TimeDelta::FromSeconds(5); I don't know why we use different values here and wouldn't object to us always using 60s. But why had Chrome still not owned the org.chromium.LivenessService service name 15s after it was executed? That seems broken.