
Issue 651232

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Oct 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




Paygen: Add retries for retrieving JSON files from GS

Project Member Reported by shchen@chromium.org, Sep 28 2016

Issue description

The leon paygen stage failed with a "Failed: Build definition" error, which shows up as a JSON parse error:

13:20:11: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' -h x-goog-if-generation-match:1475094010041000 rm gs://chromeos-releases/dev-channel/leon/8845.0.0/payloads/LOCK_flag
13:20:12: ERROR: Failed: Build definition (board='leon', version='8845.0.0', channel='dev-channel')

@@@STEP_FAILURE@@@
13:49:48: ERROR: <type 'exceptions.ValueError'>: Expecting object: line 11636 column 4 (char 408742)
Traceback (most recent call last):
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 602, in TaskRunner
    task(*x, **task_kwargs)
  File "/b/cbuild/internal_master/chromite/cbuildbot/stages/release_stages.py", line 441, in _RunPaygenInProcess
    skip_duts_check=skip_duts_check)
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_build_lib.py", line 1482, in CreatePayloads
    skip_duts_check=skip_duts_check).CreatePayloads()
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_build_lib.py", line 1310, in CreatePayloads
    payload_manager = self._DiscoverRequiredPayloads()
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_build_lib.py", line 850, in _DiscoverRequiredPayloads
    previous_builds = [b for b in self._DiscoverNmoBuild()
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_build_lib.py", line 703, in _DiscoverNmoBuild
    contents = self._GetOmahaJson()
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_build_lib.py", line 423, in _GetOmahaJson
    self.cachedOmahaJson = json.loads(gslib.Cat(OMAHA_URI))
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 11636 column 4 (char 408742)

Waterfall for the failing run:
https://uberchromegw.corp.google.com/i/chromeos/builders/leon-release/builds/1251

paygen stdio:
https://uberchromegw.corp.google.com/i/chromeos/builders/leon-release/builds/1251/steps/Paygen/logs/stdio
 
Cc: leecy@chromium.org
Interesting. One line further up in the logs is very relevant.

13:20:11: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil cat gs://chromeos-build-release-console/omaha_status.json


We fetched this JSON file, and then failed to parse it.
  gs://chromeos-build-release-console/omaha_status.json

Either the file was corrupt, or our download somehow yielded a corrupt copy. It would be useful to include the bad JSON in the logs when this happens, since the decode error alone is not useful without the raw contents.
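For illustration, the logging suggestion above could look roughly like the sketch below. This is not the chromite code: `parse_json_or_log` and its `source` argument are hypothetical names, and the sketch assumes the raw text has already been fetched (for example by `gslib.Cat(OMAHA_URI)`, as in the traceback). It relies only on `json.loads` raising `ValueError` on bad input, which holds for Python 2.7 and Python 3.

```python
import json
import logging


def parse_json_or_log(raw, source):
    """Parse `raw` as JSON; on failure, log the raw text and re-raise.

    `source` is a hypothetical label (e.g. the GS URI) used only in the log.
    """
    try:
        return json.loads(raw)
    except ValueError:
        # In Python 3, json.JSONDecodeError is a subclass of ValueError,
        # so this handler covers both Python 2.7 and Python 3.
        logging.error('Failed to parse JSON from %s. Raw contents:\n%s',
                      source, raw)
        raise
```

Since the file in this report was over 400 KB (the error is at char 408742), dumping it in full may be noisy; logging only a window around the failing offset would be one possible alternative.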

That file is generated by GE, and uploaded to GS. Every single canary and release builder downloads and parses the exact same file during every build.

So, if this is the only builder affected, this was probably an issue in GS or with our download.
Cc: -dgarr...@chromium.org
Owner: dgarr...@chromium.org
Don, you have more expertise than me. :)
Again, how many builders were affected?

Comment 5 by shchen@chromium.org, Sep 29 2016

It looks like leon passed the last two canary builds/tests, so it seems like a flake. However, I don't think it's uncommon for us to run into GS issues or some other flaky issue. In this case, I suspect that if we had detected the bad download and re-downloaded the JSON file, we would never have seen this issue. Do you guys think this would make sense: when we get a corrupt file, retry the download on the assumption that something went wrong in the network? Or do you think it's way too unlikely that this would repeat itself?
That makes sense to me. It looks like we don't currently have any retries in place for that.

My only question: do we retry immediately, or wait some time before retrying?

Comment 7 by shchen@chromium.org, Sep 29 2016

I think that it wouldn't hurt to wait before retrying, to let whatever caused the failure fix itself. I'm just going to throw out a backoff of 1 min? If the infra guys have a better solution than the totally random number that I made up, please speak up :D.
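As a rough sketch of the retry idea discussed in this thread (not the actual CL): re-fetch the file when parsing fails, waiting a fixed delay between attempts. All names here (`fetch_json_with_retries`, `fetch`, `max_attempts`, `delay_secs`, `sleep`) are hypothetical. The 60-second default mirrors the 1 min suggested above; the attempt count of 3 is an arbitrary assumption.

```python
import json
import logging
import time


def fetch_json_with_retries(fetch, max_attempts=3, delay_secs=60,
                            sleep=time.sleep):
    """Call `fetch()` and parse its result as JSON, retrying on parse errors.

    `fetch` is any zero-argument callable returning the raw text, e.g. a
    wrapper around the GS download. `sleep` is injectable so tests do not
    have to wait for real.
    """
    for attempt in range(1, max_attempts + 1):
        raw = fetch()
        try:
            return json.loads(raw)
        except ValueError:
            logging.warning('JSON parse failed (attempt %d of %d)',
                            attempt, max_attempts)
            if attempt == max_attempts:
                # Out of attempts: surface the original decode error.
                raise
            # Fixed wait before re-downloading, per the 1 min suggestion.
            sleep(delay_secs)
```

A caller might use it as `fetch_json_with_retries(lambda: gslib.Cat(OMAHA_URI))`, assuming `gslib.Cat` returns the file contents as in the traceback. A fixed delay is the simplest choice; exponential backoff would be a reasonable variation if GS hiccups tend to last longer.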
Owner: aaboagye@chromium.org
Status: Started (was: Untriaged)
The CL is probably in its final form. As it stands right now, it will dump the entire JSON received in the logs. It might be too much, but c#1 mentioned that it would be useful to see the file downloaded.
Labels: BuildHealth iptaskforce Week-1640 Week-1639 OS-iOS
Summary: Paygen: Add retries for retrieving JSON files from GS (was: Failed: Build definition (board='leon', version='8845.0.0', channel='dev-channel'))
Labels: -OS-iOS OS-Chrome
Labels: -Week-1639 -Week-1640 Week-1641 Week-1642
Status: Fixed (was: Started)

Comment 14 by dchan@google.com, Jan 21 2017

Labels: VerifyIn-57

Comment 15 by dchan@google.com, Mar 4 2017

Labels: VerifyIn-58

Comment 16 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 17 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 19 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)
