Paygen: Add retries for retrieving JSON files from GS
Issue description
The leon paygen stage failed with a "Failed: Build definition" error, which results in a JSON error:
13:20:11: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' -h x-goog-if-generation-match:1475094010041000 rm gs://chromeos-releases/dev-channel/leon/8845.0.0/payloads/LOCK_flag
13:20:12: ERROR: Failed: Build definition (board='leon', version='8845.0.0', channel='dev-channel')
@@@STEP_FAILURE@@@
13:49:48: ERROR: <type 'exceptions.ValueError'>: Expecting object: line 11636 column 4 (char 408742)
Traceback (most recent call last):
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 602, in TaskRunner
    task(*x, **task_kwargs)
  File "/b/cbuild/internal_master/chromite/cbuildbot/stages/release_stages.py", line 441, in _RunPaygenInProcess
    skip_duts_check=skip_duts_check)
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_build_lib.py", line 1482, in CreatePayloads
    skip_duts_check=skip_duts_check).CreatePayloads()
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_build_lib.py", line 1310, in CreatePayloads
    payload_manager = self._DiscoverRequiredPayloads()
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_build_lib.py", line 850, in _DiscoverRequiredPayloads
    previous_builds = [b for b in self._DiscoverNmoBuild()
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_build_lib.py", line 703, in _DiscoverNmoBuild
    contents = self._GetOmahaJson()
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_build_lib.py", line 423, in _GetOmahaJson
    self.cachedOmahaJson = json.loads(gslib.Cat(OMAHA_URI))
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 11636 column 4 (char 408742)
Waterfall for the run with the error:
https://uberchromegw.corp.google.com/i/chromeos/builders/leon-release/builds/1251
Paygen stdio:
https://uberchromegw.corp.google.com/i/chromeos/builders/leon-release/builds/1251/steps/Paygen/logs/stdio
Sep 28 2016
That file is generated by GE and uploaded to GS. Every single canary and release builder downloads and parses the exact same file during every build. So if this is the only builder affected, this was probably an issue in GS or with our download.
Sep 28 2016
Don, you have more expertise than me. :)
Sep 28 2016
Again, how many builders were affected?
Sep 29 2016
It looks like leon passed the last two canary build/tests, so it seems like a flake. However, I think it is not uncommon for us to run into GS issues or some other flaky issue. In this case, I suspect that if we had somehow determined that we had a bad download and redownloaded the JSON file, we would never have seen this issue. Do you guys think it would make sense to retry the download when we get a corrupt file, on the assumption that something may have happened in the network? Or do you think it's way too unlikely that this would repeat itself?
Sep 29 2016
That makes sense to me; it looks like we don't currently have any retries in place for that. My only question would be: do we retry immediately, or wait some time before retrying?
Sep 29 2016
I think that it wouldn't hurt to wait before retrying, to let whatever caused the failure fix itself. I'm just going to throw out a backoff of 1 min? If the infra guys have a better solution than the totally random number that I made up, please speak up :D.
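For illustration, the retry-with-wait idea could look roughly like the sketch below. This is not the code from the CL; the function name load_json_with_retries and its parameters (fetch, attempts, delay_seconds, sleep) are made up for the example, and fetch is a stand-in for whatever actually downloads the file (e.g. something like gslib.Cat(OMAHA_URI) from the traceback). The 60 second default is just the 1 min number suggested above.

```python
import json
import time


def load_json_with_retries(fetch, attempts=3, delay_seconds=60,
                           sleep=time.sleep):
    """Fetch and parse JSON, re-downloading when the payload does not parse.

    fetch is a hypothetical zero-argument callable that returns the raw
    text of the file. sleep is injectable so tests do not have to wait.
    """
    for attempt in range(1, attempts + 1):
        raw = fetch()
        try:
            return json.loads(raw)
        except ValueError:
            # json.loads raises ValueError on truncated or corrupt input
            # (JSONDecodeError in Python 3 is a ValueError subclass).
            # Assume a bad download and try again, unless out of attempts.
            if attempt == attempts:
                raise
            sleep(delay_seconds)
```

A caller would wrap the existing download, for example load_json_with_retries(lambda: gslib.Cat(OMAHA_URI)), so that a single corrupt download costs one extra minute instead of failing the stage.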
Sep 29 2016
Uploaded a CL here: https://chromium-review.googlesource.com/#/c/391052/
Sep 30 2016
The CL is probably in its final form. As it stands right now, it will dump the entire JSON received in the logs. It might be too much, but c#1 mentioned that it would be useful to see the file downloaded.
Comment 1 by dgarr...@chromium.org, Sep 28 2016