Cloud Endpoints proxy server flakes with 404 when buildbot talks to buildbucket
Issue description
A bunch of CLs got stuck in CQ around 13:00-15:00 CET on July 8, e.g. https://codereview.chromium.org/2130293002/

CQ is stuck on this builder: https://build.chromium.org/p/tryserver.v8/builders/v8_win_rel_ng/builds/10201

The step that triggers the triggered trybot went purple, so CQ seems to be blocked waiting for a triggered trybot that never runs. The purple trigger step should probably have made the build fail, or the trigger should have retried internally.
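A minimal sketch of the "retry internally" option, assuming a hypothetical TransientHTTPError raised by the trigger's HTTP call (the real trigger code is not shown in this thread). Treating 404 as retryable is an assumption that only makes sense because of the flaky proxy in this bug. When retries run out, the error is re-raised so the step fails explicitly instead of leaving CQ waiting.

```python
import random
import time


class TransientHTTPError(Exception):
    """Hypothetical error raised by the trigger's HTTP call, with a status code."""

    def __init__(self, status):
        super().__init__("HTTP %d" % status)
        self.status = status


# Assumption: 404 is retried only because of the flaky Endpoints proxy.
RETRYABLE_STATUSES = frozenset([404, 500, 502, 503])


def call_with_retries(fn, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Calls fn() and retries retryable errors with exponential backoff.

    The delay doubles per attempt (1s, 2s, 4s, ... with base_delay=1.0) plus
    up to 50% random jitter. The last error is re-raised when attempts run
    out, so the caller can fail the step explicitly.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except TransientHTTPError as e:
            if e.status not in RETRYABLE_STATUSES or attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

Non-retryable statuses such as 403 propagate immediately, so a real configuration error still surfaces on the first attempt.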
Jul 8 2016
Filed two specific buildbucket bugs: http://crbug.com/626652 and http://crbug.com/626652. This bug should be fixed ASAP to avoid impact on V8 devs, while the ones I filed would (I think) fix the root cause.
Jul 8 2016
According to these logs, all GET requests from google-api-python-client/1.3.1 (gzip) fail with 404:
https://pantheon.corp.google.com/logs/viewer?project=cr-buildbucket&minLogLevel=0&expandAll=false&resource=appengine.googleapis.com%2Fmodule_id%2Fdefault%2Fversion_id%2F4743-1f0d017&logName=%2Fprojects%2Fcr-buildbucket%2Flogs%2Fappengine.googleapis.com%252Frequest_log&advancedFilter=metadata.serviceName%3D%22appengine.googleapis.com%22%0Ametadata.labels.%22appengine.googleapis.com%2Fmodule_id%22%3D%22default%22%0Ametadata.labels.%22appengine.googleapis.com%2Fversion_id%22%3D%224743-1f0d017%22%0Alog%3D%22appengine.googleapis.com%2Frequest_log%22%0AprotoPayload.userAgent%3D%22google-api-python-client%2F1.3.1%20(gzip)%22%0AprotoPayload.method%3D%22GET%22&lastVisibleTimestampNanos=1467984414773173000
However, I do not think all of those requests come from the CQ service, since many of the requested buckets are not in any CQ config, e.g. master.internal.client.kitchensync.dashboard.
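The advancedFilter parameter in the log link above is percent-encoded. This short standard-library sketch decodes it into its individual filter clauses; the helper name is hypothetical, and the encoded string is copied verbatim from the link.

```python
from urllib.parse import unquote

# advancedFilter value copied verbatim from the log viewer URL above.
ENCODED_FILTER = (
    "metadata.serviceName%3D%22appengine.googleapis.com%22%0A"
    "metadata.labels.%22appengine.googleapis.com%2Fmodule_id%22%3D%22default%22%0A"
    "metadata.labels.%22appengine.googleapis.com%2Fversion_id%22%3D%224743-1f0d017%22%0A"
    "log%3D%22appengine.googleapis.com%2Frequest_log%22%0A"
    "protoPayload.userAgent%3D%22google-api-python-client%2F1.3.1%20(gzip)%22%0A"
    "protoPayload.method%3D%22GET%22"
)


def decode_filter(encoded):
    """Percent-decodes a log viewer filter and returns one clause per line."""
    return unquote(encoded).split("\n")


for clause in decode_filter(ENCODED_FILTER):
    print(clause)
```

The decoded clauses include protoPayload.userAgent="google-api-python-client/1.3.1 (gzip)" and protoPayload.method="GET", so the query matches every GET from that client on version 4743-1f0d017, regardless of which service sent it.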
Jul 8 2016
Also here: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2023503002/1890001
The bot v8_linux_mips64el_compile_rel is marked as not started despite many idle bots on the tryserver.
Jul 8 2016
Reverted buildbucket to 6c01270 in https://pantheon.corp.google.com/appengine/versions?project=cr-buildbucket&moduleId=default.
Jul 8 2016
The reason for the revert is that, as tandrii noticed, CQ started failing at about the same time the new version was deployed.
Jul 8 2016
Making this public. Nothing internal.
Jul 8 2016
FTR, the revert didn't really help, at least not visibly; there are still weird 404s in the log for URLs that are certainly valid and return 202 in my browser:
/_ah/api/buildbucket/v1/search?max_builds=40&alt=json&tag=buildset%3Apatch%2Frietveld%2Fcodereview.chromium.org%2F2034083002%2F120001&bucket=master.tryserver.v8&bucket=master.tryserver.chromium.linux
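The logged search path can be rebuilt with the standard library, which helps when reproducing such a request by hand. The helper name is hypothetical, and the parameter roles (tag as a buildset filter, bucket repeated once per bucket) are read off the URL above rather than taken from buildbucket documentation.

```python
from urllib.parse import urlencode

SEARCH_PATH = "/_ah/api/buildbucket/v1/search"


def search_url(buildset, buckets, max_builds=40):
    """Builds a buildbucket v1 search path; `bucket` repeats once per bucket."""
    params = [
        ("max_builds", max_builds),
        ("alt", "json"),
        ("tag", "buildset:" + buildset),
    ]
    # A list of pairs keeps the order and allows repeated keys.
    params += [("bucket", b) for b in buckets]
    return SEARCH_PATH + "?" + urlencode(params)
```

Prefixing the result with the buildbucket host gives a URL that can be opened in a browser or fetched with curl to check whether the 404 is reproducible.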
Jul 8 2016
Switched back to the newest version, 1f0d017.
Jul 8 2016
Looks like a typical Cloud Endpoints outage. Pinged the internal bug. Will try a workaround.
Jul 8 2016
I also observed a wave of 404s last night around 11 PM PST. The leases expired, BuildBucket recovered, and the 404s stopped. The 404s did show up in the GAE logs, but with no additional information. Would that be expected if this was an Endpoints outage? Either way, if this is no longer ongoing, we should probably lower the priority.
Jul 8 2016
Do we have an internal bug to track this?
Jul 8 2016
The internal bug is b/25147957. I am landing https://codereview.chromium.org/2137583002/, which bypasses the Cloud Endpoints API server proxy and should permanently fix the 404s from Endpoints. The context for that change is https://codereview.chromium.org/2117833003/.
Jul 8 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/tools/build.git/+/01a9273aee506a2463a290c407eb4cd3599d53d3

commit 01a9273aee506a2463a290c407eb4cd3599d53d3
Author: nodir <nodir@chromium.org>
Date: Fri Jul 08 17:56:58 2016

buildbucket: bypass Cloud Endpoints API server

Cloud Endpoints API server causes occasional 404s and increases of latency. We don't use the benefits that it provides, so bypass it.

This CL also vendors discovery doc from buildbucket, primarily because apiclient API does not allow to manipulate API baseURL, but also it simplfies code.

R=dnj@chromium.org
BUG=626650

Review-Url: https://codereview.chromium.org/2137583002

[modify] https://crrev.com/01a9273aee506a2463a290c407eb4cd3599d53d3/scripts/master/buildbucket/__init__.py
[modify] https://crrev.com/01a9273aee506a2463a290c407eb4cd3599d53d3/scripts/master/buildbucket/client.py
[add] https://crrev.com/01a9273aee506a2463a290c407eb4cd3599d53d3/scripts/master/buildbucket/discovery_doc.json
[modify] https://crrev.com/01a9273aee506a2463a290c407eb4cd3599d53d3/scripts/master/buildbucket/status.py
[add] https://crrev.com/01a9273aee506a2463a290c407eb4cd3599d53d3/scripts/master/buildbucket/update_discovery_doc.sh
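The commit above vendors the discovery doc because the apiclient API does not allow manipulating the API base URL. One way a vendored document can serve that purpose is to rewrite its URL fields before building a client from it. This is a minimal sketch, not the code from the CL: the function names and example hosts are hypothetical, and it assumes the standard Google discovery-document fields rootUrl, servicePath, baseUrl and basePath.

```python
import copy
import json
from urllib.parse import urlsplit


def rebase_discovery_doc(doc, new_root_url):
    """Returns a copy of a discovery document that targets new_root_url.

    Only the URL fields are rewritten; methods and schemas are untouched.
    """
    rebased = copy.deepcopy(doc)
    if not new_root_url.endswith("/"):
        new_root_url += "/"
    rebased["rootUrl"] = new_root_url
    rebased["baseUrl"] = new_root_url + rebased.get("servicePath", "")
    # basePath is the path component of baseUrl, so keep the two consistent.
    rebased["basePath"] = urlsplit(rebased["baseUrl"]).path
    return rebased


def load_vendored_doc(path, new_root_url):
    """Reads a vendored discovery_doc.json and rebases it onto new_root_url."""
    with open(path) as f:
        return rebase_discovery_doc(json.load(f), new_root_url)
```

The rebased document could then be handed to the client library instead of letting the client fetch discovery through the proxy; google-api-python-client provides `build_from_document` for building a service from a local document.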
Jul 8 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/master-manager.git/+/b26d9382bf13c6a2bb29ac7e57e1e02acf925b3a

commit b26d9382bf13c6a2bb29ac7e57e1e02acf925b3a
Author: nodir <nodir@google.com>
Date: Fri Jul 08 18:01:35 2016
Jul 8 2016
The problem is fixed for master.tryserver.infra. https://codereview.chromium.org/2133263002/ removes the whitelist of buckets, so the fix will apply to all masters (masters still need to be restarted).
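The staged rollout described here (bypass the proxy first for one whitelisted bucket, then for all) can be sketched as a simple gate. This is a hypothetical illustration: the function name, the whitelist constant and the root URL arguments are not taken from client.py, whose actual logic is not shown in this thread.

```python
# Hypothetical first-stage whitelist: only this bucket bypassed the proxy.
BYPASS_WHITELIST = frozenset(["master.tryserver.infra"])


def api_root_for(bucket, proxied_root, direct_root, whitelist=BYPASS_WHITELIST):
    """Chooses which API root a master uses for a bucket.

    whitelist=None models the follow-up change that drops the whitelist,
    so every bucket bypasses the proxy.
    """
    if whitelist is None or bucket in whitelist:
        return direct_root
    return proxied_root
```

Either way, a master only picks up the new behavior after it restarts and reloads the code, which is why the remaining masters still needed a restart.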
Jul 8 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/tools/build.git/+/98030827a18cd840b8853d38ddd3c27c2f85aafb

commit 98030827a18cd840b8853d38ddd3c27c2f85aafb
Author: nodir <nodir@chromium.org>
Date: Fri Jul 08 18:26:53 2016

buildbucket: always bypass cloud endpoints server

The whitelisted master.tryserver.infra just worked, so apply the change to all masters to fix them

R=dnj@chromium.org, tandrii@chromium.org
BUG=626650

Review-Url: https://codereview.chromium.org/2133263002

[modify] https://crrev.com/98030827a18cd840b8853d38ddd3c27c2f85aafb/scripts/master/buildbucket/client.py
Jul 8 2016
Workarounds landed, but so far only master.tryserver.infra has picked them up.
The worst impact this problem may cause is:
1) longer cycle time, because **some** completed builds are retried;
2) longer pending queues.
However, we don't see unusually long pending queues in http://shortn/_WYRUuURIGt. Restarting the Chromium tryservers now would restart **all** builds, which would probably be worse than not restarting. Let's restart the tryservers in the evening.
Jul 18 2016
#20 helped
Jul 20 2016
Issue 626768 has been merged into this issue.
Sep 1 2016
Comment 1 by serg...@chromium.org, Jul 8 2016