Latest R68 stable has unusably crashy Chrome |
||||||||||||||||
Issue descriptionChrome version: 68.0.3440.118 OS: R68-10718.88.2 djaklich@ reported it first on both eve and nocturne on stable channel. Steps: - cros flash R68-10718.88.2 over IP on eve (test image). - try to get past network screen onto enrollment/login. Expected: - can get to login Actual: - chrome crashes (PID before and after confirms this). - after a quick black screen, go back to network lists screen. Video: https://drive.google.com/file/d/1Z6QdtWsD17Bik-skXTHZzIcObrnmtYSb/view?usp=sharing Logs: sharing a drive link in a bit.
,
Sep 4
Does this only happen on certain platforms? Is this only seen on test images?
,
Sep 4
,
Sep 4
Dan noticed it on his corp-enrolled production image on Eve that got the R68 stable update. I haven't attempted repro-ing with a non-test image yet since its impossible to get logs or any debug info with such crashy chrome on a non-test / non-developer mode image. Last stable release was M68-STABLE-CHROMEOS-2: CrOS: 10718.71.5 Chrome: 68.0.3440.87 So the delta would be: CrOS delta: https://crosland.corp.google.com/log/10718.71.0..10718.88.0 Chrome delta: https://chromium.googlesource.com/chromium/src/+log/68.0.3440.87..68.0.3440.118 I can't find any smoking guns here - I am checking with 10718.71.0 if the issue is still repro-able, meanwhile, can we find an owner to bisect / root-cause this?
,
Sep 4
Hmm.. I can reproduce this on 10718.71.0 with a test image also :( Am I doing something wrong? I couldn't test the exact stable push (10718.71.5) as xbuddy refuses to serve that path.
,
Sep 4
likely a dup of bug 878002
,
Sep 4
If that is the case, do we think this might be something we just picked back into 68, or has something changed server side to cause the log in flow to blow up?
,
Sep 4
,
Sep 4
@Bernie: server side looks more likely since I repro-ed this on 10718.71.0 which has been out for > 1 month (pushed 3rd Aug per GE).
,
Sep 4
In that case crbug.com/878002 might not actually be a beta blocker, if it is server side and happening all over, holding back the OS does not really help.
,
Sep 4
I'm also unable to repro on my eve running 10718.88.2. There's got to be some other variable here. Has anyone been able to get the machine to a point where it can upload a crash report?
,
Sep 4
This is repeated 13 times: [1851:1851:0904/100329.142278:ERROR:device_cloud_policy_store_chromeos.cc(231)] Device policy read on enrolled device yields no DM token! Status: 0. [1851:1985:0904/100329.213199:ERROR:web_database_backend.cc(113)] Cannot initialize the web database: 2 [1851:1851:0904/100329.260957:ERROR:input_method_manager_impl.cc(1080)] IMEEngine for "jkghodnilhceideoidjikpgommlajknk" is not registered device-enumerator: scan all dirs device-enumerator: scanning /sys/bus device-enumerator: scanning /sys/class device-enumerator: scan all dirs device-enumerator: scanning /sys/bus device-enumerator: scanning /sys/class [1851:1851:0904/100332.332433:ERROR:network_type_pattern.cc(134)] NetworkTypePattern: wifi: Can not match empty type. [1851:1851:0904/100332.333074:ERROR:geolocation_handler.cc(146)] Failed to get Geolocation data: 0 device-enumerator: scan all dirs device-enumerator: scanning /sys/bus device-enumerator: scanning /sys/class [1851:1851:0904/100334.497991:ERROR:object_proxy.cc(616)] Failed to call method: org.chromium.CryptohomeInterface.TpmGetVersionStructured: object_path= /org/chromium/Cryptohome: org.freedesktop.DBus.Error.ServiceUnknown: The name org.chromium.Cryptohome was not provided by any .service files [1851:1851:0904/110301.634785:ERROR:event_rewriter_chromeos.cc(416)] Device ID 1 is unknown. Crash dump received by crash_reporter Device policy, enrollment, TPM are words that stand out.
,
Sep 4
,
Sep 4
,
Sep 4
Hmm.. I went back to R67 stable just to get a datapoint if it was indeed a server-side switch that got flipped. But R67 is fine. So we now have: R67: 10575.58.0: Good, can login fine. R68: 10718.71.0/ 10718.71.5: Bad, cannot get past network screen. R68: 10718.88.2: Also Bad, cannot get past network screen.
,
Sep 4
+apronin@
,
Sep 5
Looking at the logs from comment #1 from the TPM angle since it's mentioned in #12. I see that 'cryptohome not on dbus' error mentioned in #12 once in ui.20180904-100327: [1851:1851:0904/100334.497991:ERROR:object_proxy.cc(616)] Failed to call method: org.chromium.CryptohomeInterface.TpmGetVersionStructured: object_path= /org/chromium/Cryptohome: org.freedesktop.DBus.Error.ServiceUnknown: The name org.chromium.Cryptohome was not provided by any .service files That's WAI, I guess, because cryptohomed is not started yet by that point - it is still starting. It starts a bit later, and from that moment on, that msg is not present in chrome reboot sequence, which is down to just the following in chrome ui logs: [8498:8498:0904/110317.750835:ERROR:device_cloud_policy_store_chromeos.cc(231)] Device policy read on enrolled device yields no DM token! Status: 0. [8498:8551:0904/110317.760673:ERROR:web_database_backend.cc(113)] Cannot initialize the web database: 2 [8498:8498:0904/110317.767175:ERROR:input_method_manager_impl.cc(1080)] IMEEngine for "jkghodnilhceideoidjikpgommlajknk" is not registered device-enumerator: scan all dirs device-enumerator: scanning /sys/bus device-enumerator: scanning /sys/class device-enumerator: scan all dirs device-enumerator: scanning /sys/bus device-enumerator: scanning /sys/class Crash dump received by crash_reporter On syslog side each time I see: 2018-09-04T11:03:15.932442-07:00 INFO cryptohomed[2489]: HasAuthorization: TPM Owner password not available. 2018-09-04T11:03:15.932518-07:00 ERR cryptohomed[2489]: Destroy() called with insufficient authorization. 2018-09-04T11:03:16.000204-07:00 WARNING crash_reporter[8457]: Received crash notification for chrome[1851] user 1000 (called directly) ... 2018-09-04T11:03:22.330223-07:00 INFO cryptohomed[2489]: HasAuthorization: TPM Owner password not available. 2018-09-04T11:03:22.330250-07:00 ERR cryptohomed[2489]: Destroy() called with insufficient authorization. 2018-09-04T11:03:22.423545-07:00 WARNING crash_reporter[8742]: Received crash notification for chrome[8498] user 1000 (called directly) Looks like each time Chrome crashes immediately after calling cryptohome to set or remove FWMP, which is not allowed by cryptohome since the device is long owned & the owner password is long gone. By design, we are only allowed to set FWMP at first boot since it contains flags like "disable dev mode" coming from policy, which shouldn't be settable afterwards. 1) Maybe Chrome doesn't expect RemoveFirmwareManagementParameters or SetFirmwareManagementParameters to return an error and crashes? 2) Why does it call those functions in any case. No notion of OOBE in Chrome logs, so why set FWMP? Note that on the previous boot we also saw this from cryptohomed: 2018-09-04T09:43:44.357056-07:00 INFO cryptohomed[1260]: Configuring TPM, ownership taken: 0. 2018-09-04T09:43:45.176477-07:00 INFO cryptohomed[1260]: HasAuthorization: TPM Owner password not available. 2018-09-04T09:43:45.176535-07:00 ERR cryptohomed[1260]: Destroy() called with insufficient authorization. Which may mean that some on-disk files could have been wiped w/o clearing the TPM (so, we re-attempted to own the TPM since the state was lost) - I can't tell with more confidence at this point, entering wild-guessing area. Could it be that some Chrome state has been wiped leading to this behavior?
,
Sep 5
As a side note: there was apparently Chrome OS update before reboot at ~10:03. Note that though iiuc we AU'ed to M68, cr50 RW remained at 0.3.9, as before the update (0.3.9 landed only in M69). That's WAI, rolling cr50 back is not allowed. Error 8 here means that rollback was attempted and rejected: 2018-09-04T10:01:28.132201-07:00 NOTICE cr50-update.sh[5249]: /usr/sbin/gsctool -t attempt 3 error 3 2018-09-04T10:01:28.135234-07:00 NOTICE cr50-update.sh[5249]: Error 8 Generally, if TPM is indeed suspected, for repro purposes it's better not to run M68 with newer 0.3.9, which should only be paired with M69+. But in any case, I haven't noticed any negative effects from doing so.
,
Sep 5
,
Sep 5
Strange. We have not refactored any of the TPM methods in cryptohome yet, so those haven't changed - at least on my end.
,
Sep 5
+jdufault, +alemate to bring expertise from Chrome UI side. I don't see anything unusual on cryptohomed side either. Any ideas why Chrome crashes?
,
Sep 5
What does vpd -l say? If it happens on (previously) enrolled devices, it sounds very similar to issue 876328 .
,
Sep 5
+atwilson@ as well since the error log jhawkins@ found also points to a possible problem with policy (I missed that in my first look)
,
Sep 5
I cannot reproduce this on non-EE device: cros flash X.X.X.X xbuddy://remote/eve/R68-10718.71.2 --disable-rootfs-verification --clobber-stateful Update was successful. I was able to log-in, run ARC++, etc. cros flash X.X.X.X xbuddy://remote/eve/R68-10718.88.2 Update was successful. I was able to log-in, run ARC++, etc. cros flash X.X.X.X xbuddy://remote/eve/R68-10718.71.2 --clobber-stateful Update succeeded.
,
Sep 5
My eve IS enterprise enrolled and I'm unable to repro. Build 10718.88.2 The logs have lines saying that "Crash dump received by crash_reporter" is anyone able to grab a client ID from an affected device?
,
Sep 5
There is actually another problem with crash reporter: 2018-09-05T23:18:21.857097+00:00 WARNING crash_reporter[6885]: Failed to write audit message to /var/spool/crash/selinux_violation.20180905.161821.0.log: File exists (But it does not seem to affect user experience).
,
Sep 6
Just an observation. The last time when Chrome didn't crash after "Login WebUI >> sign in" (in chrome_20180904-094256) we had this: [1439:1439:0904/094345.279258:VERBOSE1:wizard_controller.cc(1459)] Wizard screen exit code: UPDATE_NOUPDATE [1439:1439:0904/094345.279329:VERBOSE1:wizard_controller.cc(698)] Showing Auto-enrollment check screen. [1439:1439:0904/094345.279360:VERBOSE1:wizard_controller.cc(1291)] SetCurrentScreenSmooth: auto-enrollment-check [1439:1439:0904/094345.280994:VERBOSE1:wizard_controller.cc(1459)] Wizard screen exit code: ENTERPRISE_AUTO_ENROLLMENT_CHECK_COMPLETED [1439:1439:0904/094345.284413:VERBOSE1:wizard_controller.cc(550)] Showing login screen. [1439:1439:0904/094345.284506:VERBOSE1:login_display_host_webui.cc(699)] Login WebUI >> sign in and then it merrily continued with [1439:1439:0904/094345.353640:VERBOSE1:gaia_screen_handler.cc(1150)] LoadAuthExtension, force: 1, offline: 0 [1439:1439:0904/094345.375916:INFO:signin_screen_handler.cc(1483)] Login WebUI >> active: 1, source: gaia-signin [1439:1439:0904/094345.376249:VERBOSE1:signin_screen_handler.cc(1437)] Login WebUI >> loginVisible, src: gaia-loading, webui_visible_: 1 [1439:1439:0904/094345.426615:VERBOSE1:gaia_screen_handler.cc(616)] Auth extension finished loading [1439:1439:0904/094346.234377:VERBOSE1:gaia_screen_handler.cc(798)] Gaia is loaded [1439:1439:0904/094346.234439:VERBOSE1:signin_screen_handler.cc(1437)] Login WebUI >> loginVisible, src: gaia-signin, webui_visible_: 1 ... etc ... In the cases where it crashes, in chrome_20180904-100328, chrome_20180904-110317, it all ends with [1851:1851:0904/110315.924827:VERBOSE1:wizard_controller.cc(1265)] Wizard screen exit code: UPDATE_NOUPDATE [1851:1851:0904/110315.924992:VERBOSE1:wizard_controller.cc(637)] Showing Auto-enrollment check screen. [1851:1851:0904/110315.925062:VERBOSE1:wizard_controller.cc(1118)] SetCurrentScreenSmooth: auto-enrollment-check [1851:1851:0904/110315.925274:VERBOSE1:auto_enrollment_controller.cc(546)] Auto-enrollment required: flag in VPD. [1851:1851:0904/110315.925353:VERBOSE1:auto_enrollment_controller.cc(555)] Proceeding with FRE [1851:1851:0904/110315.925395:VERBOSE1:auto_enrollment_controller.cc(730)] New auto-enrollment state: 1 [1851:1851:0904/110315.925681:VERBOSE1:auto_enrollment_controller.cc(635)] Device already owned, skipping auto-enrollment check. [1851:1851:0904/110315.925722:VERBOSE1:auto_enrollment_controller.cc(730)] New auto-enrollment state: 5 [1851:1851:0904/110315.934243:VERBOSE1:wizard_controller.cc(1265)] Wizard screen exit code: ENTERPRISE_AUTO_ENROLLMENT_CHECK_COMPLETED [1851:1851:0904/110315.937700:VERBOSE1:wizard_controller.cc(505)] Showing login screen. [1851:1851:0904/110315.937957:VERBOSE1:login_display_host_webui.cc(690)] Login WebUI >> sign in <last line in chrome log, right after that in messages:> 2018-09-04T11:03:16.000204-07:00 WARNING crash_reporter[8457]: Received crash notification for chrome[1851] user 1000 (called directly) So, I wonder if this additional processing *at this point* makes a difference. Apparently, we have a flag in VPD that tells the device to auto-enroll & force-enroll (can attempted FRE also be that difference that causes repro). We skip it since it's too late, but maybe the logic breaks here: [1851:1851:0904/110315.925274:VERBOSE1:auto_enrollment_controller.cc(546)] Auto-enrollment required: flag in VPD. [1851:1851:0904/110315.925353:VERBOSE1:auto_enrollment_controller.cc(555)] Proceeding with FRE [1851:1851:0904/110315.925395:VERBOSE1:auto_enrollment_controller.cc(730)] New auto-enrollment state: 1 [1851:1851:0904/110315.925681:VERBOSE1:auto_enrollment_controller.cc(635)] Device already owned, skipping auto-enrollment check. [1851:1851:0904/110315.925722:VERBOSE1:auto_enrollment_controller.cc(730)] New auto-enrollment state: 5 In that chrome_20180904-094256, which didn't crash, we also had attempted auto-enrollment and FRE, but in a different order: [1439:1439:0904/094345.164934:VERBOSE1:auto_enrollment_controller.cc(546)] Auto-enrollment required: flag in VPD. [1439:1439:0904/094345.165049:VERBOSE1:auto_enrollment_controller.cc(555)] Proceeding with FRE [1439:1439:0904/094345.165105:VERBOSE1:auto_enrollment_controller.cc(732)] New auto-enrollment state: 1 ... [1439:1439:0904/094345.165580:VERBOSE1:auto_enrollment_controller.cc(635)] Device already owned, skipping auto-enrollment check. [1439:1439:0904/094345.165622:VERBOSE1:auto_enrollment_controller.cc(732)] New auto-enrollment state: 5 ... [1439:1439:0904/094345.279329:VERBOSE1:wizard_controller.cc(698)] Showing Auto-enrollment check screen. [1439:1439:0904/094345.279360:VERBOSE1:wizard_controller.cc(1291)] SetCurrentScreenSmooth: auto-enrollment-check [1439:1439:0904/094345.280994:VERBOSE1:wizard_controller.cc(1459)] Wizard screen exit code: ENTERPRISE_AUTO_ENROLLMENT_CHECK_COMPLETED Also, if that's indeed the point which breaks it, maybe even b/113359995 is related, since we see (different) weird things around policies & enrollment there. +emaxx, +drcrash, any thoughts? Can it be related?
,
Sep 6
Btw, we apparently go through network, EULA, update and other OOBE screens in chrome_20180904-094256. But why? TPM is owned at boot, logs are not cleared: there was no powerwash. Why OOBE? This also leads to all those "Device already owned, skipping auto-enrollment check." because on one hand we think it's OOBE and try to auto-enroll. On the other hand, device is long owned, there should have been no OOBE at this point.
,
Sep 6
Re #28: ok, OOBE is explained. It's always been OOBE - on 09/02, 08/30, 08/27, didn't look earlier. But looks like all the way to 05/28. So, it booted to OOBE screen many times, never had an owner. Was updated in that state. Now, maybe that's a difference that can lead to repro.
,
Sep 6
pmarko@ made some changes recently in that area, but he is on baby bonding leave and those changes were in May or so (for bug https://crbug.comn839353). Emaxx are you familiar with them?
,
Sep 6
Re #30: > and those changes were in May or so Looks like CLs from issue 839353 landed in M68 (and 68 branched at the end of May), and per #15 we are talking about R67: 10575.58.0: Good, can login fine. R68: 10718.71.0/ 10718.71.5: Bad, cannot get past network screen. So, those FRE changes can still be relevant.
,
Sep 6
Do we have a stack trace for this e.g. a crash id? That would be extremely helpful. In essence the FRE things should be done with this line: [1851:1851:0904/110315.934243:VERBOSE1:wizard_controller.cc(1265)] Wizard screen exit code: ENTERPRISE_AUTO_ENROLLMENT_CHECK_COMPLETED So that may be a red herring. The VPD flag which triggers the FRE processing (check_enrollment=1) is set on enrollment. So it could be that the logs where it didn't crash were pre enrollment, and the logs where it crashed were post enrollment. Would that make sense? (I can't open the logs on my phone right now)
,
Sep 6
Can you try to powerwash the device before cros flash the new image? The reason is, powerwash (as well as recovery flow) does reset TPM, while simple crosh flash the image doesn't.
,
Sep 6
Lowering priority, I think this is relatively rare (I don't see anything obvious in crash/ either). I borrowed kirtika@'s machine but it does not repro anymore. kirtika@: any idea how you got the machine into the broken state? I've tried enrolling my @google and then flashing a new image, but everything works as expected. From previous discussion in the bug it sounds like this may be an issue w.r.t. FRE. Setting owner to atwilson@ to drive from that side.
,
Sep 6
I believe #29 is quite plausible. I don't know how to debug this without firmware folks help though.
,
Sep 6
I tried to find in Crash any reports with the client ID equal to the one from the feedback/CLIENT_ID file from logs in comment #1. No reports with this data were found. Or does anyone know - maybe that file is containing some different ID, not the one which the Crash uses?
,
Sep 7
,
Sep 19
I recall that there was one potential crasher fixed by r562649 in the Chrome's response handler of the SetFirmwareManagementParameters cryptohome call, and this CL went into 69.0.3445.0. Perhaps this CL wasn't merged back into M68/M67 since we saw no occurrences of this in the wild, and it was unclear whether this issue may affect significant range of users. Maybe that's exactly the crash we're seeing here? Then I guess our current crbug could be WontFix'ed since the fix is already in the current stable M69.
,
Sep 19
CL mentioned in #38 could be related based on the following point in comment #17: 1) Maybe Chrome doesn't expect RemoveFirmwareManagementParameters or SetFirmwareManagementParameters to return an error and crashes? I can't verify it at this point though.
,
Sep 28
Triage nag: This Chrome OS bug has an owner but no component. Please add a component so that this can be tracked by the relevant team.
,
Sep 28
Closing this as obsolete - the assumption is that the problem was known and already fixed in M69 which is the current stable. See comment 38 and comment 39 for details. Also crash reports support that there were crashes in policy::EnrollmentHandlerChromeOS::OnFirmwareManagementParametersDataSet() in M67..M68, and they don't occur in M69 anymore: https://crash.corp.google.com/browse?q=expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27policy%3A%3AEnrollmentHandlerChromeOS%3A%3AOnFirmwareManagementParametersDataSet%27#samplereports |
||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||
Comment 1 by kirtika@chromium.org
, Sep 4