Snappy snafu - no good repair build
Reported by
jrbarnette@chromium.org,
Feb 14 2017
Issue description

Executive summary: snappy is snafu. We need to get ToT green ASAP. The original history can be found in bug 691815.

Full details:

The snappy builders are all red. In an attempt to help things along, the repair build for snappy was moved from R56-9000.76.0 to R57-9202.24.0. To match the change, firmware was advanced from Google_Snappy.8969.0.0 to Google_Snappy.9042.33.0.

For snappy, build R57-9202.24.0 turned out not to be stable; it tends to cause failures when installed. The only plausible-looking repair build candidate for snappy seems to be R57-9170.0.0. However, that build delivers firmware Google_Snappy.9042.15.0, which is older than what many/most DUTs now have. Firmware downgrade doesn't work, so we can't use that firmware.

So, here's where we are now:
* The snappy repair version is set to R57-9202.24.
* The snappy firmware version is now unassigned.

This situation is unsustainable. Every week, the system will automatically re-assign firmware to snappy based on the current repair build. To get the system back on track, we need a green build with the latest firmware. We also need a green build because snappy has been red for a month.

Once we get a green build with recent enough firmware, we need to make it the snappy repair build.
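For illustration only, the firmware constraint described above (a candidate repair build is unusable if its firmware is older than what the DUTs already have) could be checked with something like the sketch below. This is not the lab's actual tooling. The function names `parse_firmware` and `is_downgrade` are hypothetical, and the sketch assumes that the three numeric fields of a firmware version string compare numerically from left to right.

```python
import re


def parse_firmware(version):
    """Split e.g. 'Google_Snappy.9042.33.0' into ('Google_Snappy', (9042, 33, 0)).

    Hypothetical helper; assumes a '<name>.<n>.<n>.<n>' layout.
    """
    match = re.fullmatch(r"(\w+)\.(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized firmware version: {version!r}")
    name = match.group(1)
    numbers = tuple(int(part) for part in match.group(2, 3, 4))
    return name, numbers


def is_downgrade(installed, candidate):
    """Return True if installing `candidate` over `installed` would go backwards.

    Assumption: numeric fields are ordered left to right, most significant first.
    """
    installed_name, installed_numbers = parse_firmware(installed)
    candidate_name, candidate_numbers = parse_firmware(candidate)
    if installed_name != candidate_name:
        raise ValueError("firmware versions are for different boards")
    return candidate_numbers < installed_numbers


# Firmware already on many DUTs vs. firmware delivered by R57-9170.0.0.
print(is_downgrade("Google_Snappy.9042.33.0", "Google_Snappy.9042.15.0"))
```

Under that ordering assumption, Google_Snappy.9042.15.0 counts as a downgrade from Google_Snappy.9042.33.0, which is why R57-9170.0.0 can't serve as the repair build.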
,
Feb 14 2017
We're seeing the same issues crop up on reef as well related to PaygenBuild. Who is an expert in this?
,
Feb 14 2017
,
Feb 14 2017
Gandof-release also once failed due to PaygenTestDev: https://uberchromegw.corp.google.com/i/chromeos/builders/gandof-release/builds/837
,
Feb 14 2017
+Aseda, who worked on paygen last month. Any ideas why we are having so much pain on this? See https://bugs.chromium.org/p/chromium/issues/detail?id=692129#c1 above. We are consistently seeing the PayGenCanary fail with filesystem size issues.
,
Feb 15 2017
+dgarrett

The Gandof PaygenTest failure is not related to this, so let's keep that separate. Regarding the snappy PaygenBuildCanary failures, I'm not sure what's going on. I don't know why the new image would have a smaller filesystem size. PaygenTest failures wouldn't be related to the PaygenBuild failures: if PaygenBuild fails, the PaygenTest stage won't run at all for that channel.
,
Feb 15 2017
This isn't really about Paygen failures. The failures I've been seeing are about devices that crash and burn when you install new builds on them. Paygen testing does that a lot. This belongs in the oh-so-capable hands of the Sheriffs, who (if they know what's good for them) will find an expert in ApolloLake to take care of it.
,
Feb 15 2017
Although I note that the problem with file size cited in c#1 is at least one of the causes of redness. That needs to be explained and fixed.
,
Feb 15 2017
https://code.google.com/p/chrome-os-partner/issues/detail?id=62911 is tracking the cr50 update stuff which we believe to be killing the AU. I would imagine we fail CQ if PaygenBuild fails regardless.
,
Feb 15 2017
I think we're pretty close to identifying the issue causing the reboots at PaygenTest (and provisioning for hwtest, etc.) for Reef family devices. As Aaron mentioned, that's crosbug.com/p/62911. I can file a new crbug to track the PaygenBuildCanary failures and hang it off of this one.
,
Feb 15 2017
> I would imagine we fail CQ if PaygenBuild fails regardless.

Alas, it doesn't work that way.
* We don't run Paygen tests in the CQ.
* We don't run HWTest on reef, pyro, or snappy in the CQ.

Please note that this bug is about more than merely "make the snappy canary be green". This bug is about getting a build for snappy that works well enough for automated repair and firmware upgrade on snappy in the test lab.
,
Feb 15 2017
Yes, I'm fully aware of that. I'm trying to understand the other issues that have cropped up in the pursuit of that goal. But the release builds are running those tests, so we'd be in trouble there as well once we sort out the larger problem of the cr50 updater rebooting systems?
,
Feb 15 2017
I've breezed through the history on reef, pyro, and snappy. The failures over time have shifted. It's likely that some of the older failures have resolved themselves; I think many of those failures were widespread and got fixed. So, we should focus on symptoms visible in current builds. That said, the current symptoms are blocking essentially *all* testing. So, if there are other bugs, we can't see them until we fix the cr50 updater bug.
,
Feb 15 2017
,
Feb 15 2017
Filed issue 692625 for the PaygenBuildCanary failure, and made it blocking this issue.
,
Feb 15 2017
The CQ does test our generic ability to generate a payload. Most PaygenBuild failures are because of signer issues, or because of problems with historical release artifacts. Looking.
,
Feb 22 2017
It seems snappy has made it to the Beta channel. In consequence, the repair and firmware builds are susceptible to automated updates. Snappy updated yesterday during the regular 4:00 AM run. Here's the relevant content from the logs:

    Default R56-9000.82.0 -> R57-9202.18.0
    Applying stable version changes:
    ...
        snappy (no change) -> R57-9202.27.0
    ...
    Applying firmware updates:
        snappy (nothing) -> Google_Snappy.9042.43.0

This is where we want to be, so _this_ problem is fixed.
Comment 1 by adurbin@chromium.org
, Feb 14 2017