Incomplete JSON response from AFE RPC causes parsing error
Issue description
See https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/60859617-chromeos-test/hostless/debug/

04/22 02:18:20.259 DEBUG| suite:1095| Adding status keyval for graphics_dEQP=60859975-chromeos-test
04/22 02:19:01.324 ERROR| suite:1055| Traceback (most recent call last):
  File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 989, in wait
    for result in results_generator:
  File "/usr/local/autotest/server/cros/dynamic_suite/job_status.py", line 387, in wait_for_child_results
    finished=True)
  File "/usr/local/autotest/server/frontend.py", line 292, in get_jobs
    jobs_data = self.run('get_jobs', **dargs)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 111, in run
    self, call, **dargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 71, in GenericRetry
    return functor(*args, **kwargs)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 81, in _run
    return super(RetryingAFE, self).run(call, **dargs)
  File "/usr/local/autotest/server/frontend.py", line 103, in run
    result = utils.strip_unicode(rpc_call(**dargs))
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 121, in __call__
    raise JSONRPCException('Error decoding JSON reponse:\n' + respdata)
JSONRPCException: Error decoding JSON reponse:
... then a long JSON response which is cut off.
Apr 25 2016
+ current deputy and secondary
,
Apr 26 2016
OK. I just stumbled across the failure in this build:
https://uberchromegw.corp.google.com/i/chromeos_release/builders/pineview-release-group%20release-R50-7978.B/builds/56
The bvt-inline suite for x86-alex failed with "bvt-inline: FAIL: Exception
waiting for results".
Diving down to the autotest logs, you find the failure above. I tried
the simple exercise of taking the truncated JSON content and running
it through 'wc -c'. Apparently, whatever is cutting the response off
does so after 4096 bytes.
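For the record, the equivalent check in Python (illustrative only; the filename is made up, assuming the truncated response body has been saved to a local file):
# Hypothetical repro of the 'wc -c' check on a saved response body.
with open('truncated_response.json', 'rb') as f:
    print(len(f.read()))  # prints 4096 for the failures above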
May 2 2016
Happened again twice on the same CQ run:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/61992408-chromeos-test/hostless/debug/?pli=1
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/61992055-chromeos-test/hostless/debug/
Should we be worried that we are sending larger JSON responses now?
May 4 2016
Trying to reproduce locally in a Python interpreter by calling get_jobs with one of the failed parent_job_ids doesn't seem to work.
However, looking at the code:
if not default_timeout:
    # If default timeout is None, socket will never time out.
    respdata = urllib2.urlopen(request).read()
else:
    timeout = max(min_rpc_timeout, default_timeout)
    respdata = urllib2.urlopen(request, timeout=timeout).read()
try:
    resp = decoder.JSONDecoder().decode(respdata)
except ValueError:
    raise JSONRPCException('Error decoding JSON reponse:\n' + respdata)
I suspect the urllib2.urlopen(...).read() call is cutting off at 4096 bytes.
My guess would be to wrap the read in a while loop, appending each chunk to the response, and to exit the loop once a read returns zero bytes.
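Roughly what I have in mind (a sketch only, untested; 'request' and 'timeout' are as in the proxy.py snippet above, the rest of the names are mine):
# Sketch: accumulate chunks until read() returns an empty string,
# instead of trusting a single read() to return the whole body.
url_file = urllib2.urlopen(request, timeout=timeout)
chunks = []
while True:
    chunk = url_file.read()
    if not chunk:  # empty string from read(): treat as EOF
        break
    chunks.append(chunk)
respdata = ''.join(chunks)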
Thoughts? It can't hurt IMO.
May 4 2016
Can we confirm the claimed behavior of urllib2.urlopen cutting off at 4k bytes? I did a quick test:
>>> import urllib2
>>> result=urllib2.urlopen('http://dshi1.mtv.corp.google.com/results/log.tar').read()
>>> len(result)
20480
It can download 20k just fine.
May 4 2016
Fair; maybe it would be better to add retry logic around the whole code segment instead. I'm also raising the priority of this, as it's causing a minimum of one CQ run failure a day.
May 4 2016
The Python specification for read() says it's _supposed_ to read all bytes until EOF. However, there seems to be some squishiness in the text, so having the code loop until read() returns an empty string is worth trying. It's possible, though, that the underlying library has already recorded "EOF" and will just refuse to return more data. In any event, the change should tell us something about which side the problem is on.
May 4 2016
In the code sequence that calls read(), is the timeout set? If there's a timeout, that could explain the misbehavior, and it would also mean that retrying would be worth the trouble.
May 5 2016
Added some logging; on MobLab there is no timeout for this code path. Just FYI.
May 6 2016
Issue 609921 has been merged into this issue.
May 10 2016
Issue 610735 has been merged into this issue.
May 10 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/9344c8078ad968e6f062d3ec4849be2f31739b92

commit 9344c8078ad968e6f062d3ec4849be2f31739b92
Author: Simran Basi <sbasi@google.com>
Date: Thu May 05 21:14:01 2016

[autotest] Loop over urllib2 reads of the RPC responses.

Recently some json RPC responses are being cut off at 4096 bytes.
In an attempt to solve this problem, we will try to do multiple
reads of the urlfile until no data is returned.

BUG=chromium:606071
TEST=suite support still works in moblab.

Change-Id: Iea04d48599deb21716b23cadff324fd3816118ad
Reviewed-on: https://chromium-review.googlesource.com/342841
Reviewed-by: Simran Basi <sbasi@chromium.org>
Tested-by: Simran Basi <sbasi@chromium.org>

[modify] https://crrev.com/9344c8078ad968e6f062d3ec4849be2f31739b92/frontend/afe/json_rpc/proxy.py
May 11 2016
Issue 611034 has been merged into this issue.
May 12 2016
Several builds today failed because of the same issue. Here is one of them:
https://uberchromegw.corp.google.com/i/chromeos/builders/banon-release/builds/19
May 13 2016
Looking at the logs from the failure, I don't see any messages being logged about retry attempts. That's mostly expected: the code wouldn't log if retrying produced EOF. What this means is that the root of the problem is probably on the server side. The server transmits one file buffer's worth of data, then hangs up the phone. The behavior smacks of excessive load somewhere in the chain; we need to get a better handle on where that load is coming from.
May 16 2016
Looking through the code, the server that's handling the load is the shard global AFE RPC server. That's a single server handling RPC calls from all the shards. As we increase the number of shards, the load on that server will increase, so if we've hit a limit there, it's not really surprising...
A quick fix might be to add a second global RPC server. However, the RPC servers would have to be assigned to shards manually, and that's not a scalable solution.
I also looked at the client side that generates the load. That's in server.cros.dynamic_suite.job_status.wait_for_child_results(). That code repeatedly fetches all the recently completed jobs, with this call:
new_finished_jobs = [job for job in
                     afe.get_jobs(parent_job_id=parent_job_id,
                                  finished=True)
                     if job.id in remaining_child_jobs]
Each completed job carries a large amount of data, including the job's complete control file. Essentially all of that data is then discarded. The result is that the server is transmitting what is likely 500x-1000x more data than it needs to.
We should probably change the RPC interface to provide a version of the "get_jobs" call that returns only a limited amount of data, to reduce bandwidth consumption and load on the RPC server.
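A possible shape for such a call (purely a sketch: the name get_jobs_summary and the field list are invented, assuming the Django-style model layer the AFE server is built on):
# Hypothetical trimmed-down RPC: return only the lightweight columns
# the suite poller actually consumes, omitting bulky fields such as
# the job's complete control file.
def get_jobs_summary(**filter_data):
    jobs = models.Job.objects.filter(**filter_data)
    return list(jobs.values('id', 'name', 'parent_job_id'))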
May 16 2016
Minor correction:
We do have 2 RPC servers for the shards. The following is the puppet code:
if ($server_type == 'autotest-shard' and
    $fqdn =~ /^chromeos-server(\d+)\..*$/) {
    $t_num = ($1 + 0) % 2
}
$global_afe_hostname = $t_num ? {
    0 => 'chromeos-server28.mtv.corp.google.com',
    1 => 'chromeos-server29.mtv.corp.google.com',
    default => undef,
}
Adding another one should be quite straightforward.
May 16 2016
Taking a second look at this call:
new_finished_jobs = [job for job in
                     afe.get_jobs(parent_job_id=parent_job_id,
                                  finished=True)
                     if job.id in remaining_child_jobs]
The call filters out jobs that have already finished and been processed, but that filtering happens on the client side. That means jobs that finish early are re-transmitted with every query, so (loosely speaking) the bandwidth consumed by this query for a given suite job grows _quadratically_ over time. That's bound to be spectacularly bad...
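A back-of-envelope illustration (all numbers invented):
# With N child jobs finishing at a uniform rate over P polls, poll k
# retransmits roughly k*N/P already-finished jobs, so the total data
# sent grows quadratically as the polling window lengthens.
N, P = 100, 720  # say, 100 child jobs polled every 5 seconds for an hour
total = sum(k * N // P for k in range(1, P + 1))
print(total)     # ~36000 job records transmitted, vs. the 100 needed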
So, other options for bandaids:
* The query is in a loop that polls every 5 seconds. Reducing the polling frequency would cut back on the bandwidth consumed, although it wouldn't stop the quadratic growth.
* The query could be restricted to just the jobs that are still of interest. That is, the "if job.id in remaining_child_jobs" filter should be part of the query on the server side. That would stop the quadratic growth and ensure that each job is only transmitted once.
May 16 2016
I think the following patch should stop the quadratic growth problem:
diff --git a/server/cros/dynamic_suite/job_status.py b/server/cros/dynamic_suite/job_status.py
index 6da5b58..5c2e4ef 100644
--- a/server/cros/dynamic_suite/job_status.py
+++ b/server/cros/dynamic_suite/job_status.py
@@ -382,10 +382,8 @@ def wait_for_child_results(afe, tko, parent_job_id):
     remaining_child_jobs = set(job.id for job in
                                afe.get_jobs(parent_job_id=parent_job_id))
     while remaining_child_jobs:
-        new_finished_jobs = [job for job in
-                             afe.get_jobs(parent_job_id=parent_job_id,
-                                          finished=True)
-                             if job.id in remaining_child_jobs]
+        new_finished_jobs = afe.get_jobs(id__in=remaining_child_jobs,
+                                         finished=True)
         for job in new_finished_jobs:
May 16 2016
Sounds good for now, could you upload a CL?
May 18 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/4e9e23657fde19bef5c126b86df1674280e07056

commit 4e9e23657fde19bef5c126b86df1674280e07056
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Tue May 17 15:26:08 2016

[autotest] Reduce polling load for dynamic suites.

Dynamic suites check for completion by polling with a query for all
finished child jobs. At each polling cycle the only jobs of interest
would be jobs that had finished since the previous cycle, but the
query fetched all finished jobs, meaning that data for jobs that
finished early would be retransmitted many times.

The excessive amount of data transmitted for the polling query is
suspected of causing load on the RPC server which in turn causes
mysterious server hang-ups. This changes the query to filter the
unneeded data on the server side, so that each finished job is only
transmitted once.

BUG=chromium:606071
TEST=unit tests, plus push_to_prod suite on local instance

Change-Id: I27c7be51f341e3d90a3fbd92ac8f6a93699a340a
Reviewed-on: https://chromium-review.googlesource.com/344895
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/4e9e23657fde19bef5c126b86df1674280e07056/server/cros/dynamic_suite/job_status.py
[modify] https://crrev.com/4e9e23657fde19bef5c126b86df1674280e07056/server/cros/dynamic_suite/job_status_unittest.py
May 19 2016
FYI, the polling in the suite job should be hitting "cautotest" if I remember correctly; servers 28 and 29 are for heartbeats only. But it wouldn't hurt to add a third AFE server for the shards.
May 25 2016
Passing on to this week's deputy.
If this problem has stopped happening, we should declare victory. I think that means reverting this CL:
https://chromium-review.googlesource.com/342841
... and then closing this bug as Fixed.
May 27 2016
https://chromium-review.googlesource.com/#/c/347517/
And then we can close the bug.
May 28 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d19a1df337f79ba28ddc6f9d573d0ffca7363c7b

commit d19a1df337f79ba28ddc6f9d573d0ffca7363c7b
Author: Richard Barnette <jrbarnette@google.com>
Date: Fri May 27 17:04:21 2016

Revert "[autotest] Loop over urllib2 reads of the RPC responses."

This reverts commit 9344c8078ad968e6f062d3ec4849be2f31739b92.

This change is debug code that's no longer needed.

BUG=chromium:606071
TEST=None

Change-Id: I0af49ccf71af11b17cfcbe56be4c069ae582c482
Reviewed-on: https://chromium-review.googlesource.com/347517
Commit-Ready: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>
Reviewed-by: Simran Basi <sbasi@chromium.org>

[modify] https://crrev.com/d19a1df337f79ba28ddc6f9d573d0ffca7363c7b/frontend/afe/json_rpc/proxy.py
Jun 2 2016
Haven't seen this bug surface during my shift, so closing out.
Jun 27 2016
Closing... please feel free to reopen if it's not fixed.
Jul 8 2016
This is occurring again:
https://uberchromegw.corp.google.com/i/chromeos/builders/wolf-tot-paladin/builds/7174
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/68504279-chromeos-test/hostless/debug/
Jul 28 2016
Linking https://bugs.chromium.org/p/chromium/issues/detail?id=632433, which appears related.
Jul 17 2017
ChromeOS Infra P1 Bugscrub. P1 Bugs in this component should be important enough to get weekly status updates.
Is this already fixed? -> Fixed
Is this no longer relevant? -> Archived or WontFix
Is this not a P1, based on go/chromeos-infra-bug-slo rubric? -> lower priority.
Is this a Feature Request rather than a bug? Type -> Feature
Is this missing important information or scope needed to decide how to proceed? -> Ask question on bug, possibly reassign.
Does this bug have the wrong owner? -> reassign.
Bugs that remain in this state next week will be downgraded to P2.