Starred by 2 users
Status: Fixed
Owner:
Closed: Feb 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 0
Type: Bug

Blocked on:
issue chrome-os-partner:62911
issue 692625



Snappy snafu - no good repair build
Project Member Reported by jrbarnette@chromium.org, Feb 14 2017
Executive summary:  snappy is snafu.  We need to get ToT
green ASAP.

The original history can be found in  bug 691815 .

Full details:
The snappy builders are all red.  In an attempt to help things
along, the repair build for snappy was moved from R56-9000.76.0
to R57-9202.24.0.  To match the change, firmware was advanced
from Google_Snappy.8969.0.0 to Google_Snappy.9042.33.0.

For snappy, build R57-9202.24.0 turned out not to be stable; it
tends to cause failures when installed.  The only plausible-looking
repair build candidate for snappy seems to be R57-9170.0.0.  However,
that build delivers firmware Google_Snappy.9042.15.0, which is older
than what many/most DUTs now have.  Firmware downgrade doesn't work,
so we can't use that firmware.
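
The downgrade constraint above can be sketched in Python. This is only an illustration, assuming firmware version strings of the form Google_<Board>.<number>.<number>.<number> as they appear in this report; the helper names parse_firmware_version and would_downgrade are hypothetical and are not part of any lab tooling.

```python
def parse_firmware_version(version):
    """Split e.g. 'Google_Snappy.9042.33.0' into ('Google_Snappy', (9042, 33, 0))."""
    name, *numbers = version.split(".")
    return name, tuple(int(n) for n in numbers)


def would_downgrade(installed, candidate):
    """Return True if installing `candidate` firmware would be a downgrade."""
    installed_name, installed_nums = parse_firmware_version(installed)
    candidate_name, candidate_nums = parse_firmware_version(candidate)
    if installed_name != candidate_name:
        raise ValueError("firmware versions are for different boards")
    # Tuples compare element by element, so (9042, 15, 0) < (9042, 33, 0).
    return candidate_nums < installed_nums


# Firmware already on many DUTs vs. firmware delivered by R57-9170.0.0.
print(would_downgrade("Google_Snappy.9042.33.0", "Google_Snappy.9042.15.0"))
```

With the versions from this report, the candidate firmware compares as older than the installed one, which is why R57-9170.0.0 cannot supply the firmware.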

So, here's where we are now:
  * The snappy repair version is set to R57-9202.24.0.
  * The snappy firmware version is now unassigned.

This situation is unsustainable.  Every week, the system will
automatically re-assign firmware to snappy based on the current
repair build.  To get the system back on track, we need a green
build with the latest firmware.  We also need a green build
because snappy has been red for a month.

Once we get a green build with recent enough firmware, we need
to make it the snappy repair build.

 

PaygenBuildCanary failure:

https://uberchromegw.corp.google.com/i/chromeos/builders/snappy-release/builds/374/steps/PaygenBuildCanary/logs/stdio


[0214/070714:WARNING:delta_diff_generator.cc(101)] Old and new filesystems have different size.
[0214/070714:FATAL:delta_diff_generator.cc(106)] Shirking the filesystem size is not supported at the moment.
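
Read together, the two log lines suggest a size comparison in which any mismatch is a warning and a smaller new filesystem is fatal. The Python sketch below only mirrors that apparent logic. The real check lives in delta_diff_generator.cc (C++) and may differ in detail; the function name and exception type here are hypothetical.

```python
import logging


class UnsupportedShrinkError(Exception):
    """Raised when the new filesystem is smaller than the old one."""


def check_filesystem_sizes(old_size, new_size):
    """Mirror the apparent size check from the log; sizes are in bytes."""
    if old_size != new_size:
        logging.warning("Old and new filesystems have different size.")
    if new_size < old_size:
        # Assumption: only shrinking is fatal; growing is merely warned about.
        raise UnsupportedShrinkError(
            "Shrinking the filesystem size is not supported.")
    return new_size - old_size  # growth in bytes; 0 when the sizes match
```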

PaygenTestDev failed as well, with reboots/repairs not succeeding. Is this a cascade of the above failing?

https://uberchromegw.corp.google.com/i/chromeos/builders/snappy-release/builds/374/steps/PaygenTestDev/logs/stdio

HWTest fails...

https://uberchromegw.corp.google.com/i/chromeos/builders/snappy-release/builds/374/steps/HWTest%20%5Bsanity%5D/logs/stdio


How do those resets get done? I think snappy has some reset issues in the hardware. We probably want to audit those machines to ensure they have any of the hw workarounds needed:

https://crbug.com/p/61326 
We're seeing the same issues crop up on reef as well related to PaygenBuild. Who is an expert in this?
Owner: jrbarnette@chromium.org
Comment 4 by mqg@chromium.org, Feb 14 2017
Gandof-release also once failed due to PaygenTestDev: https://uberchromegw.corp.google.com/i/chromeos/builders/gandof-release/builds/837
Comment 5 by bleung@chromium.org, Feb 14 2017
Cc: aaboagye@chromium.org
+Aseda, who worked on paygen last month. Any ideas why we are having so much pain on this? See https://bugs.chromium.org/p/chromium/issues/detail?id=692129#c1 above.

We are consistently seeing the PayGenCanary fail with filesystem size issues.
Cc: dgarr...@chromium.org
+dgarrett

The Gandof PaygenTest failure is not related to this, so let's keep that separate. Regarding the snappy PaygenBuildCanary failures, I'm not sure what's going on. I don't know why the new image would have a smaller filesystem size.

PaygenTest failures wouldn't be related to the PaygenBuild failures: if PaygenBuild fails, the PaygenTest stage won't run at all for that channel.
Owner: adurbin@chromium.org
Status: Assigned
This isn't really about Paygen failures.  The failures I've been
seeing are about devices that crash and burn when you install
new builds on them.  Paygen testing does that a lot.

This belongs in the oh-so-capable hands of the Sheriffs, who
(if they know what's good for them) will find an expert in
ApolloLake to take care of it.

Although I note that the problem with file size cited in c#1 is
at least one of the causes of redness.  That needs to be explained
and fixed.

https://code.google.com/p/chrome-os-partner/issues/detail?id=62911 is tracking the cr50 update stuff which we believe to be killing the AU.

I would imagine we fail CQ if PaygenBuild fails regardless. 
Blockedon: chrome-os-partner:62911
I think we're pretty close to identifying the issue causing the reboots at PaygenTest (and provisioning for hwtest, etc.) for Reef family devices. As Aaron mentioned, that's crosbug.com/p/62911.

I can file a new crbug to track PaygenBuildCanary failures and hang it off of this one.
> I would imagine we fail CQ if PaygenBuild fails regardless. 

Alas, it doesn't work that way.
  * We don't run Paygen tests in the CQ.
  * We don't run HWTest on reef, pyro, or snappy in the CQ.

Please note that this bug is about more than merely "make the
snappy canary be green".  This bug is about getting a build for
snappy that works well enough for automated repair and firmware
upgrade on snappy in the test lab.

Yes, I'm fully aware of that. I'm trying to understand the other issues that have cropped up in the pursuit of that goal. But the release builds are running those tests, so we'd be in trouble there as well once we sort out the larger problem of the cr50 updater rebooting systems?
I've breezed through the history on reef, pyro, and snappy.
The failures over time have shifted.  It's likely that some
of the older failures have resolved themselves; I think many
of those failures were widespread and got fixed.  So, we should
focus on symptoms visible in current builds.

That said, the current symptoms are blocking essentially *all*
testing.  So, if there are other bugs, we can't see them until
we fix the cr50 updater bug.

Blockedon: 692625
Filed  issue 692625  for the PaygenBuildCanary failure, and made it blocking this issue. 
The CQ does test our generic ability to generate a payload.

Most PaygenBuild failures are because of signer issues, or because of problems with historical release artifacts. Looking.
Status: Fixed
It seems snappy has made it to the Beta channel.  In consequence,
the repair and firmware builds are susceptible to automated updates.
Snappy updated yesterday during the regular 4:00 AM run.  Here's
the relevant content from the logs:

Default R56-9000.82.0 -> R57-9202.18.0
Applying stable version changes:
...
   snappy                 (no change) -> R57-9202.27.0
...
Applying firmware updates:
   snappy                 (nothing) -> Google_Snappy.9042.43.0

This is where we want to be, so _this_ problem is fixed.
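
For anyone checking these logs by script, a small parser for the "board  old -> new" lines might look like the following. It is a sketch that assumes the line layout shown in the excerpt above; the function name parse_version_change is hypothetical and not part of the lab's tooling.

```python
import re

# Assumed layout: "<board>  <old> -> <new>", where <old> may be a
# parenthesized note such as "(no change)" or "(nothing)".
_CHANGE_RE = re.compile(r"^\s*(\S+)\s+(.+?)\s+->\s+(\S+)\s*$")


def parse_version_change(line):
    """Return (board, old, new) for a change line, or None if it doesn't match."""
    match = _CHANGE_RE.match(line)
    if match is None:
        return None
    return match.groups()


log = """\
Applying stable version changes:
...
   snappy                 (no change) -> R57-9202.27.0
...
Applying firmware updates:
   snappy                 (nothing) -> Google_Snappy.9042.43.0
"""

# Print only the lines that describe a version change.
for line in log.splitlines():
    change = parse_version_change(line)
    if change:
        print(change)
```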