Automated repair image update should check that image exists
Reported by jrbarnette@chromium.org, Jun 7 2016
Issue description
Recently, we had an outage when the veyron_minnie-cheets repair image got cleaned up because it got to be older than 6 months. In site_utils/assign_stable_images.py, we should check for each board that its selected image file actually exists, and report an error for any image not found. If a build is missing, it means that the board in question *will* become a supply issue within a day, if not sooner. So, this error is urgent once it's found. Since the e-mail output from the cron job is so frequently routine, we should probably make sure that the error is reported in a way that can't reasonably be ignored.
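A minimal sketch of the proposed check, not the actual assign_stable_images.py code. The bucket name, path layout, and every function name here are hypothetical, and it assumes that `gsutil ls` exits with a nonzero status when nothing matches the URI. The existence test is passed in as a parameter so the logic can be exercised without touching Google Storage.

```python
import os
import subprocess

# Hypothetical bucket and path layout, for illustration only.
IMAGE_URI_TEMPLATE = 'gs://example-image-bucket/{board}-release/{version}'


def image_uri(board, version):
    """Build the (assumed) storage URI for a board's repair image."""
    return IMAGE_URI_TEMPLATE.format(board=board, version=version)


def gs_object_exists(uri):
    """Return True if `gsutil ls` succeeds for uri.

    Assumes gsutil is on PATH and exits nonzero when nothing matches.
    """
    with open(os.devnull, 'w') as devnull:
        status = subprocess.call(['gsutil', 'ls', uri],
                                 stdout=devnull, stderr=devnull)
    return status == 0


def find_missing_images(board_versions, exists=gs_object_exists):
    """Return the sorted list of boards whose selected image is not found."""
    return sorted(board for board, version in board_versions.items()
                  if not exists(image_uri(board, version)))


def format_alert(missing):
    """Return an alert message, or None when every image was found."""
    if not missing:
        return None
    return 'URGENT: repair image missing for: ' + ', '.join(missing)
```

Keeping the alert text separate from the routine cron output (for example, a distinct subject line plus a nonzero exit status) is one way to make the error harder to overlook.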
Jun 7 2016
Putting on both fixit & OKR so that we can investigate both options
Jun 8 2016
> Possible alternate solution (from chatting with Don)
> 1. Fetch repair image from stable channel instead (will be hard)
Actually, in the case in question, there was no stable channel (after all, there was no Beta channel, either). It would be relatively easy to make a rule like "latest from Dev channel if there's no Beta channel image". I haven't yet decided whether that's a good enough idea. More below...
> 2. Be sure the repair images are touched / updated every 5 months (less hard)
Trivial, I suppose, since the stable images cron job could just touch whatever image is currently selected in place of checking for its existence. That's assuming gsutil makes it easy enough to update timestamps on images...
I note that "the image expired and was deleted" isn't the only way we might see "image doesn't exist." The most obvious other problem would be "a human being manually specified the wrong image."
However, it's possible that the most important problem here wasn't that the image was deleted, but rather that the builder went 6 months without ever releasing on the Beta channel. Catching that kind of event seems useful, and in that case, allowing images to be deleted is likely the cheapest and most effective answer.
So... If "board never releases on Beta channel" is deemed "never interesting", then either of the two options (use the Dev channel, or touch the image) should be good enough _but_ we may still need to detect missing images for the sake of human error. Otherwise, the best answer is merely "flag when images are missing".
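For illustration, the "latest from Dev channel if there's no Beta channel image" rule could look roughly like this. The function name, the channel keys, and the shape of the input mapping are all assumptions made for this sketch; the real script's data structures may differ.

```python
def select_repair_version(channel_versions, preference=('beta', 'dev')):
    """Pick a version from the first preferred channel that has one.

    channel_versions maps a channel name to its latest version string
    (or None when the board has never released on that channel).
    Returns a (channel, version) tuple, or None if no channel has a build.
    """
    for channel in preference:
        version = channel_versions.get(channel)
        if version:
            return channel, version
    return None
```

Returning the channel along with the version lets the caller log when a board fell back to Dev, which preserves the "board never releases on Beta" signal discussed above.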
Dec 6 2016
Still an issue?
Dec 6 2016
This is still a problem that can happen, although to date it's been quite rare. Looking at the current repair version assignments, it seems like veyron_rialto is on track for another glitch within a month or so; its current build is timestamped 2016/7/21.
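The same kind of estimate can be automated. The sketch below assumes the cleanup removes builds after a fixed retention period and uses 180 days as a stand-in for "6 months"; the exact cutoff, the warning threshold, and the function names are assumptions, not values taken from the cleanup job.

```python
from datetime import date

RETENTION_DAYS = 180  # assumed stand-in for "older than 6 months"


def days_until_expiry(build_date, today, retention_days=RETENTION_DAYS):
    """Days left before the build reaches the assumed retention limit."""
    return retention_days - (today - build_date).days


def boards_near_expiry(build_dates, today, warn_days=60):
    """Return sorted boards whose build expires within warn_days."""
    return sorted(board for board, built in build_dates.items()
                  if days_until_expiry(built, today) <= warn_days)


# A build timestamped 2016/7/21, checked on Dec 6 2016:
print(days_until_expiry(date(2016, 7, 21), date(2016, 12, 6)))  # 42
```

Under the 180-day assumption that leaves about six weeks, roughly in line with the "within a month or so" estimate.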
Jan 13 2017
Seems eminently fixable, and will have real impact. Promoting to P1.
Feb 9 2017
> Looking at the current repair version assignments, it seems like
> veyron_rialto is on track for another glitch within a month or so;
> its current build is timestamped 2016/7/21.
... and right on time, the veyron_rialto repair build is now gone.
It turns out that, in the presence of the automated firmware version assignment,
this problem is now fatal to the assignment script. The script failed
this morning with this error:
Applying firmware updates:
Traceback (most recent call last):
  File "site_utils/stable_images/assign_stable_images.py", line 596, in <module>
    main(sys.argv)
  File "site_utils/stable_images/assign_stable_images.py", line 591, in main
    firmware_upgrades = _get_firmware_upgrades(updater, upgrade_versions)
  File "site_utils/stable_images/assign_stable_images.py", line 443, in _get_firmware_upgrades
    for board, version in cros_versions.iteritems()
  File "site_utils/stable_images/assign_stable_images.py", line 444, in <dictcomp>
    if board not in _FIRMWARE_UPGRADE_BLACKLIST
  File "site_utils/stable_images/assign_stable_images.py", line 420, in _get_firmware_version
    return _get_by_key_path(_read_gs_json_data(uri), key_path)
  File "site_utils/stable_images/assign_stable_images.py", line 308, in _read_gs_json_data
    json_object = json.load(sp.stdout)
  File "/usr/lib/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
The problem is, with the build deleted, there's no build metadata for
the firmware version lookup, and the current code doesn't catch the
exception. :-(
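One way to keep a single missing build from aborting the whole run is to catch the decode failure per board and collect the affected boards for reporting. This is a sketch, not the script's code: the helper names, the shape of the input, and the key path used in any call are hypothetical. It relies on `json.loads` raising `ValueError` (or a subclass of it) for empty or non-JSON input.

```python
import json


def parse_metadata(raw_text):
    """Parse build metadata, returning None when no JSON can be decoded."""
    try:
        return json.loads(raw_text)
    except ValueError:
        # Empty or non-JSON output, e.g. because the build was deleted.
        return None


def collect_firmware_versions(raw_metadata_by_board, key_path):
    """Return (versions, boards_without_metadata).

    raw_metadata_by_board maps a board name to its raw metadata text;
    key_path is the sequence of keys leading to the firmware version.
    """
    versions = {}
    missing = []
    for board, raw_text in sorted(raw_metadata_by_board.items()):
        data = parse_metadata(raw_text)
        if data is None:
            missing.append(board)
            continue
        for key in key_path:
            data = data[key]
        versions[board] = data
    return versions, missing
```

The boards in the second return value are the ones whose repair build is probably gone, so they are worth flagging loudly rather than silently skipping.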
Jul 10 2017
At this point, the only known source of this problem common enough to warrant action has been veyron_rialto, because it wasn't releasing on Beta channel. We believe rialto will never again cause the problem, and we have no expectation of new boards that will do that. Human error is called out as a second possible source of this problem. However, the proposed change here probably isn't the right answer for human error: human error in assigning a repair version will cause failures within hours, whereas the proposed change could only flag errors once a week. So, let's just give this up as a bad investment.
Comment 1 by autumn@chromium.org
, Jun 7 2016