M67: Caroline and Terra builds are RED; Chrome crashes at boot
Issue description
The Caroline build has been RED for the last week, starting with R67-10530.0.0. RED builds are causing missing CTS test results on stainless.
Stainless: https://stainless.corp.google.com/search?view=matrix&row=build&col=board&first_date=2018-03-27&last_date=2018-04-09&suite=%5Earc%5C-cts%24&build=R67&board=%5Ecaroline%7Ccyan%7Ceve%7Ckefka%7Csamus%7Cveyron_minnie%7Cfizz%7Csoraka%24&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=NOSTATUS&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=false&exclude_non_production=true
Goldeneye: https://cros-goldeneye.corp.google.com/chromeos/console/listBuild?boards=caroline&milestone=67&chromeOsVersion=&chromeVersion=&startTimeFrom=&startTimeTo=&token=AIQH9qMmRD0kdb4kvKXbJmkNz3xE%3A1523013130485#%2F
luci-milo: https://luci-milo.appspot.com/buildbot/chromeos/caroline-release/1636
,
Apr 9 2018
This symptom is also impacting terra:
https://uberchromegw.corp.google.com/i/chromeos/builders/terra-release/builds/2099
,
Apr 9 2018
,
Apr 9 2018
Is this a dup of https://bugs.chromium.org/p/chromium/issues/detail?id=826163 ? Based on c#33, we need a new Chrome to fix the crash loop.
,
Apr 9 2018
> is this dup of https://bugs.chromium.org/p/chromium/issues/detail?id=826163 ?
For purposes of bug tracking: NO. The prior bug was fixed, and caroline and terra both turned green at the time of the fix. This is a new failure, requiring a new bug. It could turn out that this new failure is related to the old failure in some way, but this would still be a new bug for all that.
,
Apr 9 2018
Note: I've checked caroline-chrome-pfq and terra-chrome-pfq: Neither of them is showing this failure. That's highly suspicious. Really, it shouldn't be possible.
,
Apr 9 2018
This is a Chrome failure, but the failure isn't due to a change
in Chrome. The last green caroline build was here:
https://uberchromegw.corp.google.com/i/chromeos/builders/caroline-release/builds/1606
The first red build was here:
https://uberchromegw.corp.google.com/i/chromeos/builders/caroline-release/builds/1607
Both of those builds used Chrome 67.0.3383.0.
The most obvious explanation would be a Chrome OS change that broke
Chrome. However, that would cause the PFQ to go red, and that hasn't
happened. So, we have a mystery on our hands.
,
Apr 9 2018
FTR, cros blamelist between the green/red release build: https://crosland.corp.google.com/log/10529.0.0..10530.0.0
,
Apr 9 2018
Adding current rotations.
,
Apr 9 2018
alemate@, when you've diagnosed the cause, please create a second bug (assigned to me) for how it got through the PFQ, with whatever info you gathered that's relevant.
,
Apr 9 2018
This bug could be related to Issue 825425 ?
,
Apr 9 2018
,
Apr 9 2018
,
Apr 9 2018
> This bug could be related to Issue 825425 ?
Looking at the history of bug 825425, that bug is both bug 826163 and this bug. Recent debugging on 825425 (anything after about 3/29) is probably relevant here.
Please note (in case it's not clear): This is likely an OS bug that causes Chrome to crash. But we need to study the Chrome crash in order to point the finger somewhere in the OS.
,
Apr 9 2018
I also think this looks like issue 825425 (which is actually issue 827188). CCing graphics folks.
,
Apr 10 2018
Tagging as a M67 blocker for caroline and terra
,
Apr 10 2018
This needs to be visible to users (see bug 826163 ), and there's nothing secret here. So, dropping RVG.
,
Apr 10 2018
I tried 10562 on caroline and it does bring up part of the UI (background image and bottom menu), but not the login prompt in the middle of the screen. I'm getting this in /var/log/ui/ui.LATEST:
[1566:1566:0410/095306.047447:ERROR:input_method_manager_impl.cc(1080)] IMEEngine for "jkghodnilhceideoidjikpgommlajknk" is not registered
device-enumerator: scan all dirs
device-enumerator: scanning /sys/bus
device-enumerator: scanning /sys/class
device-enumerator: scan all dirs
device-enumerator: scanning /sys/bus
device-enumerator: scanning /sys/class
[1566:1566:0410/095308.526384:FATAL:login_display_host_webui.cc(841)] Renderer crash on login window
Unexpected crash report id length
System crash_reporter failed to process crash report. Report Id:
,
Apr 10 2018
,
Apr 10 2018
I spent the whole day yesterday trying to reproduce this locally, and I could not. Basically, the dev image with Chrome 67.0.3390.0 on Chrome OS 10561.0.0 cannot start Chrome. When I deploy a locally built Chrome using the Simple Chrome workflow and sdk --version 10561.0.0, it works OK, no failures.
,
Apr 10 2018
You can repro after deploying if you reboot.
,
Apr 10 2018
Re #21: I cannot reproduce this. Are you sure you are not booting to the previous version?
,
Apr 10 2018
Hm, yes, I may have been doing that...
,
Apr 10 2018
FTR, yes, I was passing --target-dir=/usr/local/chrome --mount-dir=/opt/google/chrome --nostrip to deploy_chrome and that gets unmounted on reboot, of course. Rebuilding and deploying 3390 without these options, I can't repro either.
,
Apr 10 2018
,
Apr 11 2018
I'm tempted to think that crbug.com/831649 is similar/same? At least the error on chromeos4-row9-rack9-host2 for https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-chrome-pfq/2955 is identical.
,
Apr 11 2018
And the same for the last few peach_pit-chrome-pfq runs: https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-chrome-pfq/
,
Apr 11 2018
(not a kernel problem)
,
Apr 12 2018
Issue 826163 has been merged into this issue.
,
Apr 13 2018
Issue 831649 has been merged into this issue.
,
Apr 13 2018
contrary to subject, I see veyron-minnie-chrome-pfq and peach-pit-chrome-pfq blocking chrome PFQ. Is this the right bug?
,
Apr 13 2018
> contrary to subject, I see veyron-minnie-chrome-pfq and peach-pit-chrome-pfq blocking chrome PFQ. Is this the right bug?
My guess is that bug 831649 is an unrelated Chrome crash, but there's not enough data in the bug report to say this way or that.
,
Apr 13 2018
I've reopened bug 831649; it's not a duplicate.
,
Apr 14 2018
I could not reproduce this. My local Chrome OS image build succeeded. The image from the builder is definitely broken, but all the Chrome builds that I tried to deploy on it succeeded.
,
Apr 14 2018
,
Apr 14 2018
,
Apr 15 2018
Why not download one of the failing canaries, and see what can be
reproduced with that build?
Also, it looks like the reproduction attempts are using the simple
chrome workflow. It may be that to reproduce it requires building
with the OS workflow. Certainly, it's necessary to build with the
latest OS bits: Although it's a Chrome crash, this failure was
caused by an OS change.
Also, we do know the blamelist for the change:
https://crosland.corp.google.com/log/10529.0.0..10530.0.0
Given that we've been trying to blame graphics for the failures,
we might study the mesa changes.
Finally, every failure in the waterfall produces logs. This is
the most recent for terra:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/192171038-chromeos-test/chromeos4-row8-rack6-host3
I can't find any crash dumps there, but under the "crashinfo"
directory, there's a "messages" file, and it shows stuff like this:
2018-04-14T21:04:18.559280+00:00 INFO session_manager[1103]: [INFO:child_exit_handler.cc(77)] Handling 1153 exit.
2018-04-14T21:04:18.559559+00:00 ERR session_manager[1103]: [ERROR:child_exit_handler.cc(85)] Exited with signal 6
2018-04-14T21:04:18.559654+00:00 INFO session_manager[1103]: [INFO:session_manager_service.cc(296)] Exiting process is chrome.
From "crashinfo", digging down through var/log/ui or var/log/chrome,
you can find messages like this:
[6953:6953:0414/140818.466498:FATAL:login_display_host_webui.cc(841)] Renderer crash on login window
So, there's definitely a chrome crash involved.
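The check described above (look for a Chrome exit by signal in "messages", then for the FATAL line in the UI log) can be sketched roughly as follows. This is only an illustration: the regular expressions are modelled on the few log lines quoted in this comment, the names `scan`, `SIGNAL_RE` and `FATAL_RE` are made up for the example, and real logs may well contain variants these patterns miss.

```python
import re

# Patterns modelled only on the log lines quoted in this thread (an assumption,
# not a complete description of the session_manager or Chrome log formats).
SIGNAL_RE = re.compile(
    r"session_manager\[\d+\]: \[ERROR:child_exit_handler\.cc\(\d+\)\] "
    r"Exited with signal (\d+)"
)
FATAL_RE = re.compile(r"\[\d+:\d+:[\d/.]+:FATAL:([\w.]+)\((\d+)\)\] (.*)")


def scan(lines):
    """Return (signals, fatals) found in an iterable of log lines.

    signals: list of signal numbers reported by session_manager.
    fatals:  list of (source_file, line_number, message) tuples.
    """
    signals, fatals = [], []
    for line in lines:
        m = SIGNAL_RE.search(line)
        if m:
            signals.append(int(m.group(1)))
        m = FATAL_RE.search(line)
        if m:
            fatals.append((m.group(1), int(m.group(2)), m.group(3).strip()))
    return signals, fatals


# Sample input copied from the log excerpts quoted above.
SAMPLE = [
    "2018-04-14T21:04:18.559280+00:00 INFO session_manager[1103]: "
    "[INFO:child_exit_handler.cc(77)] Handling 1153 exit.",
    "2018-04-14T21:04:18.559559+00:00 ERR session_manager[1103]: "
    "[ERROR:child_exit_handler.cc(85)] Exited with signal 6",
    "[6953:6953:0414/140818.466498:FATAL:login_display_host_webui.cc(841)] "
    "Renderer crash on login window",
]

if __name__ == "__main__":
    signals, fatals = scan(SAMPLE)
    print("signals:", signals)
    print("fatals:", fatals)
```

Signal 6 is SIGABRT on Linux, which fits a process aborting after a FATAL log line.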
,
Apr 15 2018
There is a small chance that this is related to CFI (Control Flow Integrity) checking, which is currently enabled only on the terra and caroline release builders. However, CFI was enabled back on March 9, in Chrome OS R67-10475.0.0, and caroline and terra only started failing consistently quite a bit after that. I suppose some change may have been committed to Chrome since CFI was enabled, which might be causing a CFI failure. It might be worth building caroline and terra without CFI (turn off the USE="cfi" flag) to see if that fixes the issue.
,
Apr 16 2018
I created a CL to test disabling CFI on terra & caroline, then submitted tryjobs with that CL to the terra & caroline release tryjob builders.
The terra builder succeeded:
https://ci.chromium.org/p/chromeos/builds/b8949167561624961184
The caroline builder will probably fail because there are no working caroline boards in the suites pool, but the builder is here if you want to download & test the build image:
https://ci.chromium.org/p/chromeos/builds/b8949167562946755616
The CL, which I'm guessing we will probably want to commit, to unblock these builders, is here:
https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1013064
,
Apr 16 2018
> There is a small chance that this is related to CFI
> (Control Flow Integrity) checking, which is currently
> enabled only on the terra and caroline release builders
If CFI is enabled only on terra-release and caroline-release, then I'd rate the chance that this is related at well-nigh certain, since the fact that this failure is restricted to just those two builders is one of its key characteristics.
Another key characteristic is that we can't reproduce it with local builds. Local builds, it would seem, also don't enable CFI.
We're seeing this now presumably because one of the OS changes in the blamelist has tripped over an undiscovered problem with CFI.
Given where we are, I'd say the best option will be to commit the CL to turn off "cfi" in the builders, and see what happens. If we get that in before 11:00 today, we'll have a definitive answer this afternoon.
,
Apr 16 2018
The CL is already on its way through the commit queue...it will go in whenever the CQ gets through with it.
,
Apr 16 2018
> The CL is already on its way through the commit queue...
> it will go in whenever the CQ gets through with it.
It might be wise to chump the CL; likely, it's important to make that 11:00 deadline.
,
Apr 16 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/d88eaf5315d4963f1a16fc569aa90a5b7be531be
commit d88eaf5315d4963f1a16fc569aa90a5b7be531be
Author: Caroline Tice <cmtice@google.com>
Date: Mon Apr 16 15:49:41 2018
[release builders] Disable CFI on caroline & terra.
caroline & terra release builders have been failing recently. This Disabling CFI on those two builders seems to fix the issue.
BUG= chromium:830321
TEST=Tested on terra-release-tryjob builder and it passed.
Change-Id: I4e4709edc9ee2dade6b29486a6857bf2c6f440de
Reviewed-on: https://chromium-review.googlesource.com/1013064
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>
Commit-Queue: Caroline Tice <cmtice@chromium.org>
Tested-by: Caroline Tice <cmtice@chromium.org>
Trybot-Ready: Caroline Tice <cmtice@chromium.org>
[modify] https://crrev.com/d88eaf5315d4963f1a16fc569aa90a5b7be531be/cbuildbot/config_dump.json
[modify] https://crrev.com/d88eaf5315d4963f1a16fc569aa90a5b7be531be/cbuildbot/chromeos_config.py
,
Apr 16 2018
Ok, the change has been chumped.
,
Apr 16 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/b3eb773cd8b17c9aa4f37190d30d1040242d18c0
commit b3eb773cd8b17c9aa4f37190d30d1040242d18c0
Author: chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Mon Apr 16 16:58:07 2018
Roll src/third_party/chromite/ c90ccbc26..d88eaf531 (1 commit)
https://chromium.googlesource.com/chromiumos/chromite.git/+log/c90ccbc26d04..d88eaf5315d4
$ git log c90ccbc26..d88eaf531 --date=short --no-merges --format='%ad %ae %s'
2018-04-15 cmtice [release builders] Disable CFI on caroline & terra.
Created with:
roll-dep src/third_party/chromite
BUG= chromium:830321
The AutoRoll server is located here: https://chromite-chromium-roll.skia.org
Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md
If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary.
TBR=chrome-os-gardeners@chromium.org
Change-Id: I9289723920afdf2dda519c0cb1c750efadc2f29f
Reviewed-on: https://chromium-review.googlesource.com/1014175
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#551013}
[modify] https://crrev.com/b3eb773cd8b17c9aa4f37190d30d1040242d18c0/DEPS
,
Apr 16 2018
,
Apr 16 2018
Assigning to current gardener.
,
Apr 16 2018
Re #37: Richard, yes, I built at least two full images locally for caroline and peach_pit, and both of them worked.
,
Apr 16 2018
alemate, Did you use the same USE flags in build_packages/build-image when building local images? From the build_packages log at https://logs.chromium.org/v/?s=chromeos%2Fbb%2Fchromeos%2Fcaroline-release%2F1606%2F%2B%2Frecipes%2Fsteps%2FBuildPackages__afdo_use_%2F0%2Fstdout : 'USE=-cros-debug cfi chrome_internal thinlto afdo_use'
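To make the question above concrete, here is a small sketch that compares the builder's USE string (copied from the build_packages log quoted above) with the flags of a local build. The local value `LOCAL_USE` is purely hypothetical, as are the helper names; substitute whatever your own build actually used.

```python
def parse_use(use_string):
    """Split a 'USE=...' string into (enabled, disabled) sets of flag names."""
    value = use_string.split("=", 1)[1] if "=" in use_string else use_string
    enabled, disabled = set(), set()
    for flag in value.split():
        if flag.startswith("-"):
            disabled.add(flag[1:])  # a leading '-' turns the flag off
        else:
            enabled.add(flag)
    return enabled, disabled


def missing_locally(builder_use, local_use):
    """Flags the builder enables that the local build does not."""
    builder_on, _ = parse_use(builder_use)
    local_on, _ = parse_use(local_use)
    return sorted(builder_on - local_on)


# From the caroline-release build_packages log quoted in this comment.
BUILDER_USE = "USE=-cros-debug cfi chrome_internal thinlto afdo_use"
# Hypothetical local setting, for illustration only.
LOCAL_USE = "USE=chrome_internal"

if __name__ == "__main__":
    print(missing_locally(BUILDER_USE, LOCAL_USE))
```

With the hypothetical local value above, the difference includes cfi, which is the flag under suspicion in this bug.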
,
Apr 16 2018
No, I naively expected build_packages to create the correct build for the board.
,
Apr 17 2018
Caroline DEV is working again.
,
Apr 17 2018
(Removing RBD).
,
Apr 17 2018
Issue 833563 has been merged into this issue.
,
Apr 17 2018
Should this be a release blocker?
,
Apr 17 2018
Issue 825425 has been merged into this issue.
,
Apr 17 2018
Tagging as a beta blocker so we don't lose this. Per alemate@, scope is dependent on feedback from toolchain team.
,
Apr 17 2018
I mean we probably need Toolchain team feedback to decide on further actions.
,
Apr 17 2018
> Per alemate@, scope is dependent on feedback from toolchain team.
The confirmed failures were limited to caroline and terra, and the code change that fixed this problem was limited to caroline and terra. Also comment #38 says the configuration is limited to caroline and terra.
So, this bug is limited to caroline and terra.
,
Apr 17 2018
#58: I'm asking about scope since crbug/825425 was tagged as a DUP and it included daisy, Peppy, Kip and Reks. That bug was perhaps closed as a DUP incorrectly, however. I need to be absolutely sure of scope if we're tagging blockers.
,
Apr 17 2018
> #58: I'm asking about scope since crbug/825425 was tagged
> as a DUP and it included daisy, Peppy, Kip and Reks.
> That bug was perhaps closed as a DUP incorrectly, however.
Yeah, bug 825425 seemed to have become an agglomeration of multiple different bugs. It was originally the caroline and terra issue that preceded this one, but it seems to have been confused with other bugs, including this one. I've dropped the duplicate tag, for clarity.
This bug is definitely only caroline and terra, and it's definitely fixed in the canary.
,
Apr 17 2018
Just to confirm what jrbarnette@ already said: The issue reported in this bug is limited to caroline and terra release builds. If there are failures on any other boards, they are unrelated issues.
,
Apr 17 2018
It looks like caroline-release and terra-release cycled green [1][2] (but then went red due to [3]) so I'm going to close this as fixed. Please reopen if there is additional action that needs to be taken here.
1: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8949066683798464880
2: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8949066677820217136
3: https://bugs.chromium.org/p/chromium/issues/detail?id=833886
,
Apr 17 2018
,
Apr 17 2018
[Auto-generated comment by a script] We noticed that this issue is targeted for M-67; it appears the fix may have landed after branch point, meaning a merge might be required. Please confirm if a merge is required here - if so add Merge-Request-67 label, otherwise remove Merge-TBD label. Thanks.
,
Apr 23 2018
Did this get merged to M67 yet? Caroline is still failing at build and not making the RCs.
,
Apr 23 2018
,
Apr 23 2018
+tbarzic (current gardener) to check status of release builder.
,
Apr 23 2018
> Did this get merged to M67 yet? Caroline is still failing at build and not making the RCs.
It seems it didn't get merged. However, the failures in caroline and terra for M67 are different from the failures in this bug. A new bug should be opened for M67.
We should close this bug, presumably after merging the fix to M67, and probably also M66.
,
Apr 23 2018
,
Apr 23 2018
Approving merge to M67 Chrome OS.
,
Apr 23 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/a5f2995ea53207dec31ad8597feb31cc8205c6a1
commit a5f2995ea53207dec31ad8597feb31cc8205c6a1
Author: Caroline Tice <cmtice@google.com>
Date: Mon Apr 23 23:27:58 2018
[release builders] Disable CFI on caroline & terra.
caroline & terra release builders have been failing recently. This Disabling CFI on those two builders seems to fix the issue.
BUG= chromium:830321
TEST=Tested on terra-release-tryjob builder and it passed.
Change-Id: I4e4709edc9ee2dade6b29486a6857bf2c6f440de
Reviewed-on: https://chromium-review.googlesource.com/1013064
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>
Commit-Queue: Caroline Tice <cmtice@chromium.org>
Tested-by: Caroline Tice <cmtice@chromium.org>
Trybot-Ready: Caroline Tice <cmtice@chromium.org>
(cherry picked from commit d88eaf5315d4963f1a16fc569aa90a5b7be531be)
Reviewed-on: https://chromium-review.googlesource.com/1025130
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Commit-Queue: Bernie Thompson <bhthompson@chromium.org>
Tested-by: Bernie Thompson <bhthompson@chromium.org>
[modify] https://crrev.com/a5f2995ea53207dec31ad8597feb31cc8205c6a1/cbuildbot/config_dump.json
[modify] https://crrev.com/a5f2995ea53207dec31ad8597feb31cc8205c6a1/cbuildbot/chromeos_config.py
,
Apr 24 2018
Tested on Terra with build 10575.13.0/67.0.3396.17 and was able to sign in successfully after recovery with USB stick.
,
Apr 25 2018
Caroline started showing the results on stainless.
,
Apr 27 2018
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Apr 30 2018
Assigning to cmtice to verify whether this has to be merged anywhere else.
,
Apr 30 2018
No, this does not need to be merged anywhere else. The initial change only went into R67. Do I mark this as verified now?
,
May 1 2018
Yes, it looks like it's fixed. Selecting caroline and terra shows green:
https://cros-goldeneye.corp.google.com/chromeos/console/listBuild?boards=caroline%2Cterra&milestone=&chromeOsVersion=&chromeVersion=&startTimeFrom=&startTimeTo=&token=ALeBcqH5rDfkLdkMx-K_xxwuEKve%3A1525132922279#%2F
screenshot: https://screenshot.googleplex.com/sR5jdjQVD9K
,
May 8 2018
Just in case anybody here is interested, we finally figured out (and fixed) the cause of these failures. It was rather complicated, and a conjunction of multiple things occurring that caused the failure and made it hard to diagnose.
The basic issue involved Goma + a compiler change goma did not know about: In order to work properly CFI has a blacklist file of known issues -- functions/files not to check. Goma knows about this and puts/looks for the file in a certain place. LLVM changed the location of the file, and nobody thought to tell goma.
CFI was enabled in Chrome OS (on caroline & terra) on March 9, when LLVM & goma were both still using the old location. Everything worked properly, until LLVM was upgraded around March 20 (to start using the new location). The CFI files were now in a new location but goma was still looking for them in the old location. So goma builds (and ONLY goma builds) with CFI started failing.
The issue was muddied by two green builds on the builders, near the end of March, which made it look like the old issue was fixed and a new issue came up. In fact, those were two builds where goma failed, and the build system fell back onto local builds where the files were looked for in the correct location.
We now have several fixes either in flight or actually in place, to prevent this particular issue from arising again. We also are working on some changes in our processes to try to catch these types of issues sooner.
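For readers who want the sequence of events in a compact form, the following is a toy model of the explanation above, not a description of how goma or LLVM are actually implemented. The two paths are placeholders (the thread does not give the real locations), and `build` and `saw_blacklist` are names invented for this sketch.

```python
# Placeholder locations: the real paths are not stated in this thread.
OLD_PATH = "old/location/cfi_blacklist.txt"
NEW_PATH = "new/location/cfi_blacklist.txt"


def build(compiler_path, goma_path, goma_available=True):
    """Toy model of one CFI-enabled build.

    compiler_path: where the installed compiler keeps the CFI blacklist file.
    goma_path:     where goma expects that file to be.
    Returns (builder_used, saw_blacklist).
    """
    if goma_available:
        # The remote build only picks up the file if goma's expectation
        # matches the compiler's real location.
        return ("goma", goma_path == compiler_path)
    # Fallback: a local build uses the compiler's own location directly.
    return ("local", True)


if __name__ == "__main__":
    # March 9: CFI enabled; LLVM and goma agree on the old location.
    print(build(OLD_PATH, OLD_PATH))
    # After the LLVM upgrade around March 20: locations disagree.
    print(build(NEW_PATH, OLD_PATH))
    # The two green builds near the end of March: goma failed, local fallback.
    print(build(NEW_PATH, OLD_PATH, goma_available=False))
```

The model only captures the mismatch and the fallback; it says nothing about how the missing file turned into the Chrome crash at boot, which the comment above does not spell out either.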
,
Jun 5 2018
Hi, the Merge-Approved-67 label was never removed after this blocking merge request. Assume the merge was made and we can remove it?
,
Jun 5 2018
Yes, it was done.
,
Jun 5 2018
,
Jun 5 2018
unintended
,
Jul 18
Comment 1 by jrbarnette@chromium.org, Apr 9 2018
Components: -Infra>Client>ChromeOS
Labels: -Pri-1 Pri-0
Owner: alemate@chromium.org
Status: Assigned (was: Untriaged)
This is not an infrastructure bug. Looking at this build:
https://uberchromegw.corp.google.com/i/chromeos/builders/caroline-release/builds/1637
You see that the build failed with this error message:
" ... After update and reboot, Chrome failed to reach login screen within 180 seconds, ..."
That message is caused by a Chrome bug, so the gardener gets the task.
P0, because caroline is DOA, which will hold up releases.