
Issue 617825

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Closed: Jul 2017
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Feature
OKR




Automated repair image update should check that image exists

Reported by jrbarnette@chromium.org, Jun 7 2016

Issue description

Recently, we had an outage when the veyron_minnie-cheets repair
image got cleaned up because it got to be older than 6 months.

In site_utils/assign_stable_images.py, we should check for each
board that its selected image file actually exists, and report
an error for any image not found.

If a build is missing, it means that the board in question *will*
become a supply issue within a day, if not sooner.  So, this error
is urgent once it's found.  Since the e-mail output from the cron
job is so frequently routine, we should probably make sure that
the error is reported in a way that can't reasonably be ignored.
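
A minimal sketch of the proposed check, assuming the script already holds a mapping from board name to its selected repair version. The function names, the `image_exists` callback, and the version strings in the usage note are all hypothetical; none of them come from assign_stable_images.py. The existence test is passed in as a callback so that the real lookup (for example, something that wraps a `gsutil ls` call against the image location) can be swapped in without changing the reporting logic.

```python
def find_missing_images(board_versions, image_exists):
    """Return sorted (board, version) pairs whose repair image is not found.

    board_versions: dict mapping board name -> selected version string.
    image_exists:   hypothetical callback (board, version) -> bool that
                    performs the actual storage lookup.
    """
    return sorted(
        (board, version)
        for board, version in board_versions.items()
        if not image_exists(board, version)
    )


def format_report(missing):
    """Build a report that is hard to overlook, or None if nothing is missing."""
    if not missing:
        return None
    lines = ["ERROR: repair image missing for %d board(s):" % len(missing)]
    lines.extend("    %s: %s" % pair for pair in missing)
    return "\n".join(lines)
```

Called with `{"veyron_minnie-cheets": "R52-1.0.0"}` (a made-up version) and a callback that returns False, `find_missing_images` returns that single pair and `format_report` produces a two-line error message. Returning None for the clean case lets the caller stay silent on routine runs, so any output at all signals a problem.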
 
Labels: okr Hotlist-Fixit
Possible alternate solution (from chatting with Don)
1. Fetch repair image from stable channel instead (will be hard)
2. Be sure the repair images are touched / updated every 5 months (less hard) 



Putting on both fixit & OKR so that we can investigate both options 
> Possible alternate solution (from chatting with Don)
> 1. Fetch repair image from stable channel instead (will be hard)

Actually, in the case in question, there was no stable channel
(after all, there was no Beta channel, either).  It would be
relatively easy to make a rule like "latest from Dev channel
if there's no Beta channel image".  I haven't yet decided whether
that's a good enough idea.  More below...
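
For concreteness, the fallback rule could look something like the sketch below. It assumes the caller has already looked up the latest version on each channel and passes None when a channel has no image; the function name and argument shapes are hypothetical.

```python
def select_repair_version(beta_version, dev_version):
    """Prefer the Beta channel image; fall back to Dev when Beta has none.

    Each argument is a version string, or None if that channel has no
    image for the board.  Returns a (version, channel) pair so that the
    caller can log or flag boards that needed the fallback.
    """
    if beta_version is not None:
        return beta_version, "beta"
    return dev_version, "dev"
```

Returning the channel alongside the version keeps the "board never releases on Beta" case visible instead of silently papering over it.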

> 2. Be sure the repair images are touched / updated every 5 months (less hard) 

Trivial, I suppose, since the stable images cron job could just
touch whatever image is currently selected in place of checking
for its existence.  That's assuming gsutil makes it easy enough
to update timestamps on images...

I note that "the image expired and was deleted" isn't the only
way we might see "image doesn't exist."  The most obvious
other problem would be "a human being manually specified the
wrong image."

However, it's possible that the most important problem here
wasn't that the image was deleted, but rather that the builder
went 6 months without ever releasing on the Beta channel.
Catching that kind of event seems useful, and in that case,
allowing images to be deleted is likely the cheapest and most
effective answer.

So...  If "board never releases on Beta channel" is deemed
"never interesting", then either of the two options (use
the Dev channel, or touch the image) should be good enough
_but_ we may still need to detect missing images for the sake
of human error.  Otherwise, the best answer is merely "flag when
images are missing".

Labels: -okr OKR
Owner: jrbarnette@chromium.org
Status: Assigned (was: Available)
Still an issue?
This is still a problem that can happen, although to date it's been
quite rare.

Looking at the current repair version assignments, it seems like
veyron_rialto is on track for another glitch within a month or so;
its current build is timestamped 2016/7/21.
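
This kind of prediction could be automated. The sketch below assumes the 6-month cleanup can be approximated as 180 days after the build timestamp, which is a guess at the retention policy and not a confirmed figure; the function names and the 30-day warning window are likewise hypothetical.

```python
import datetime

# Assumption: "older than 6 months" approximated as 180 days.
RETENTION_DAYS = 180


def days_until_expiry(build_date, today, retention_days=RETENTION_DAYS):
    """Days left before a build dated build_date ages out (negative if past)."""
    expiry = build_date + datetime.timedelta(days=retention_days)
    return (expiry - today).days


def boards_at_risk(build_dates, today, warn_days=30):
    """Return sorted board names whose repair build expires within warn_days."""
    return sorted(
        board
        for board, build_date in build_dates.items()
        if days_until_expiry(build_date, today) <= warn_days
    )
```

Under these assumptions, a build timestamped 2016/7/21 would age out around 2017/1/17, so a run in mid-December 2016 would already list veyron_rialto.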

Labels: -Pri-2 Pri-1
Seems eminently fixable, and will have real impact. Promoting to P1.
> Looking at the current repair version assignments, it seems like
> veyron_rialto is on track for another glitch within a month or so;
> its current build is timestamped 2016/7/21.

... and right on time, the veyron_rialto repair build is now gone.

It turns out that in the presence of the automated firmware version assignment,
this problem is now fatal to the assignment script.  The script failed
this morning with this error:
Applying firmware updates:
Traceback (most recent call last):
  File "site_utils/stable_images/assign_stable_images.py", line 596, in <module>
    main(sys.argv)
  File "site_utils/stable_images/assign_stable_images.py", line 591, in main
    firmware_upgrades = _get_firmware_upgrades(updater, upgrade_versions)
  File "site_utils/stable_images/assign_stable_images.py", line 443, in _get_firmware_upgrades
    for board, version in cros_versions.iteritems()
  File "site_utils/stable_images/assign_stable_images.py", line 444, in <dictcomp>
    if board not in _FIRMWARE_UPGRADE_BLACKLIST
  File "site_utils/stable_images/assign_stable_images.py", line 420, in _get_firmware_version
    return _get_by_key_path(_read_gs_json_data(uri), key_path)
  File "site_utils/stable_images/assign_stable_images.py", line 308, in _read_gs_json_data
    json_object = json.load(sp.stdout)
  File "/usr/lib/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

The problem is, with the build deleted, there's no build metadata for
the firmware version lookup, and the current code doesn't catch the
exception.  :-(
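
One way to keep a missing build from being fatal is to treat undecodable metadata as "no metadata" and let the caller skip or report that board. The sketch below is written for Python 3 (the traceback above is from Python 2.7) and is not the actual code in assign_stable_images.py; `read_json_or_none` is a hypothetical helper. Catching ValueError works in both versions, since Python 3's json.JSONDecodeError is a subclass of it.

```python
import io
import json


def read_json_or_none(stream):
    """Parse JSON from a file-like object, or return None if it can't be decoded.

    An empty or non-JSON stream is what the script sees when the build
    (and with it the build metadata) has been deleted.
    """
    try:
        return json.load(stream)
    except ValueError:
        return None


# Usage: an empty stream stands in for the output of a failed fetch.
assert read_json_or_none(io.StringIO("")) is None
assert read_json_or_none(io.StringIO('{"key": "value"}')) == {"key": "value"}
```

The caller would still need to decide what a None result means for the board: skipping the firmware update quietly would hide exactly the missing-image condition this issue is about, so reporting it is probably the safer choice.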

Labels: -Pri-1 -Type-Bug Pri-3 Type-Feature
Status: WontFix (was: Assigned)
At this point, the only known source of this problem common enough
to warrant action has been veyron_rialto, because it wasn't releasing
on Beta channel.  We believe rialto will never again cause the problem,
and we have no expectation of new boards that will do that.

Human error is called out as a second possible source of this problem.
Probably, however, the proposed change here isn't the right answer for
human error:  Human error in assigning a repair version will cause
failures within hours, whereas the proposed change could only flag
errors once a week.

So, let's just give this up as a bad investment.
