
Issue 606071

Starred by 4 users

Issue metadata

Status: Archived
Owner:
Last visit > 30 days ago
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 632433




Incomplete JSON response from AFE RPC causes parsing error

Reported by pho...@chromium.org (Project Member), Apr 22 2016

Issue description

See 
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/60859617-chromeos-test/hostless/debug/

04/22 02:18:20.259 DEBUG|             suite:1095| Adding status keyval for graphics_dEQP=60859975-chromeos-test
04/22 02:19:01.324 ERROR|             suite:1055| Traceback (most recent call last):
  File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 989, in wait
    for result in results_generator:
  File "/usr/local/autotest/server/cros/dynamic_suite/job_status.py", line 387, in wait_for_child_results
    finished=True)
  File "/usr/local/autotest/server/frontend.py", line 292, in get_jobs
    jobs_data = self.run('get_jobs', **dargs)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 111, in run
    self, call, **dargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 71, in GenericRetry
    return functor(*args, **kwargs)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 81, in _run
    return super(RetryingAFE, self).run(call, **dargs)
  File "/usr/local/autotest/server/frontend.py", line 103, in run
    result = utils.strip_unicode(rpc_call(**dargs))
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 121, in __call__
    raise JSONRPCException('Error decoding JSON reponse:\n' + respdata)
JSONRPCException: Error decoding JSON reponse:

... then a long JSON response which is cut off.
 

Comment 1 by dshi@chromium.org, Apr 22 2016

Labels: -Hardware-Lab Infra-ChromeOS
I'm wondering why JSONRPCException is not retried. If suite.py is really using frontend_wrappers.RetryingAFE, JSONRPCException should be retried.
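
For reference, a minimal sketch (not the actual RetryingAFE/chromite code; names are illustrative) of how a retry wrapper around the RPC call could treat JSONRPCException as retryable:

    import time

    class JSONRPCException(Exception):
        """Stand-in for the exception raised in frontend/afe/json_rpc/proxy.py."""

    def run_with_retries(rpc_call, max_retries=3, delay_secs=5, **dargs):
        """Call rpc_call(**dargs), retrying when JSON decoding fails.

        Hypothetical helper; the real code path goes through
        chromite.lib.retry_util.GenericRetry via RetryingAFE.run().
        """
        for attempt in range(max_retries + 1):
            try:
                return rpc_call(**dargs)
            except JSONRPCException:
                if attempt == max_retries:
                    raise
                time.sleep(delay_secs)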

Comment 2 by autumn@chromium.org, Apr 25 2016

Cc: jrbarnette@chromium.org sbasi@chromium.org
+ current deputy and secondary
OK.  I just stumbled across the failure in this build:
    https://uberchromegw.corp.google.com/i/chromeos_release/builders/pineview-release-group%20release-R50-7978.B/builds/56

The bvt-inline suite for x86-alex failed with "bvt-inline: FAIL: Exception
waiting for results".

Diving down to the autotest logs, you find the failure above.  I tried
the simple exercise of taking the truncated JSON content, and running
it through 'wc -c'.  Apparently, whatever is cutting the response off
does so after 4096 bytes.

Comment 4 by benhenry@google.com, Apr 26 2016

Components: Infra>Client>ChromeOS
Labels: -Infra-ChromeOS
Owner: sbasi@chromium.org

Comment 7 by sbasi@chromium.org, May 4 2016

Trying to reproduce this locally in a Python interpreter by calling get_jobs with one of these failed parent_job_ids doesn't seem to trigger the error.

However, looking at the code:

        if not default_timeout:
            # If default timeout is None, socket will never time out.
            respdata = urllib2.urlopen(request).read()
        else:
            timeout = max(min_rpc_timeout, default_timeout)
            respdata = urllib2.urlopen(request, timeout=timeout).read()
        try:
            resp = decoder.JSONDecoder().decode(respdata)
        except ValueError:
            raise JSONRPCException('Error decoding JSON reponse:\n' + respdata)


I suspect the urllib2.urlopen().read() call is cutting off at 4096 bytes.

My guess would be to wrap the read in a while loop, append the responses, and exit the loop once a read returns data of length 0.

Thoughts? It can't hurt IMO.
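
A minimal sketch of that looped read (request and timeout are the variables from the proxy.py snippet above; this is an illustration, not the CL that eventually landed):

    import urllib2

    def read_all(urlfile, chunk_size=4096):
        """Read from urlfile until read() returns an empty string."""
        chunks = []
        while True:
            data = urlfile.read(chunk_size)
            if not data:
                break
            chunks.append(data)
        return ''.join(chunks)

    # e.g.: respdata = read_all(urllib2.urlopen(request, timeout=timeout))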

Comment 8 by dshi@chromium.org, May 4 2016

Can we confirm that urllib2.urlopen().read() cuts off at 4k bytes? I did a quick test:
>>> import urllib2
>>> result=urllib2.urlopen('http://dshi1.mtv.corp.google.com/results/log.tar').read()
>>> len(result)
20480
It can download 20k just fine.

Comment 9 by sbasi@chromium.org, May 4 2016

Labels: -Pri-2 Pri-1
Fair; maybe it would be better to add retry logic to the whole code segment instead.

I'm also raising the priority of this, as it's causing at least one CQ run failure a day.
The Python specification for read() says it's _supposed_
to read all bytes until EOF.  However, there seems to be
some squishiness in the text, so having the code loop until
read() returns an empty string is worth trying.  That said, it's
possible that the underlying library has already recorded "EOF",
and will just refuse to return more data.

In any event, the change should tell us something about which
side the problem is on.

In the code sequence that calls read(), is the timeout set?
If there's a timeout, that could explain the misbehavior, and
it would also mean that retrying would be worth the trouble.

Added some logging; on MobLab there is no timeout for this code path. Just FYI.
Issue 609921 has been merged into this issue.
Issue 610735 has been merged into this issue.

Comment 15 by bugdroid1@chromium.org, May 10 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/9344c8078ad968e6f062d3ec4849be2f31739b92

commit 9344c8078ad968e6f062d3ec4849be2f31739b92
Author: Simran Basi <sbasi@google.com>
Date: Thu May 05 21:14:01 2016

[autotest] Loop over urllib2 reads of the RPC responses.

Recently some json RPC responses are being cut off at 4096 bytes.
In an attempt to solve this problem, we will try to do multiple
reads of the urlfile until no data is returned.

BUG= chromium:606071 
TEST=suite support still works in moblab.

Change-Id: Iea04d48599deb21716b23cadff324fd3816118ad
Reviewed-on: https://chromium-review.googlesource.com/342841
Reviewed-by: Simran Basi <sbasi@chromium.org>
Tested-by: Simran Basi <sbasi@chromium.org>

[modify] https://crrev.com/9344c8078ad968e6f062d3ec4849be2f31739b92/frontend/afe/json_rpc/proxy.py

Issue 611034 has been merged into this issue.
Several builds today failed because of the same issue. Here is one of them: https://uberchromegw.corp.google.com/i/chromeos/builders/banon-release/builds/19
Status: Assigned (was: Untriaged)
Looking at the logs from the failure doesn't show any messages
being logged about retry attempts.  That's mostly expected:
the code wouldn't log if retrying produced EOF.

What this means is that the root of the problem is probably
on the server side.  The server transmits one file buffer's
worth of data, then hangs up the phone.  The behavior smacks
of excessive load somewhere in the chain; we need to get a
better handle on where that load is coming from.

Cc: akes...@chromium.org
Looking through the code, the server that's handling the load
is the shard global AFE RPC server.  That's a single server handling
RPC calls from all the shards.  As we increase the number of shards,
the load on that server will increase, so if we've hit a limit there,
it's not really surprising...

A quick fix might be to add a second global RPC server.  However, the
RPC servers would have to be assigned to shards manually, and that's
not a scalable solution.

I also looked at the client side that generates the load.  That's
in server.cros.dynamic_suite.job_status.wait_for_child_results().
That code repeatedly fetches all the recently completed jobs, with
this call:
        new_finished_jobs = [job for job in
                             afe.get_jobs(parent_job_id=parent_job_id,
                                          finished=True)
                             if job.id in remaining_child_jobs]

Each completed job transmits a large amount of data, including the
job's complete control file.  Essentially all of that data is then
discarded.  The result is that the server is transmitting what is
likely 500x-1000x more data than it needs to.

We should probably change the RPC interface to provide a version
of the "get_jobs" call that returns only a limited amount of data,
to reduce bandwidth consumption and load on the RPC server.
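
As an illustration only (this is not an existing AFE call; the import path, model, and field names are guesses), such a reduced call might look something like this on the server side:

    from autotest_lib.frontend.afe import models  # assumed import path

    def get_finished_child_job_ids(parent_job_id):
        """Hypothetical RPC returning only the ids of finished child jobs."""
        jobs = models.Job.objects.filter(parent_job_id=parent_job_id,
                                         hostqueueentry__complete=True)
        return list(jobs.values_list('id', flat=True).distinct())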

Comment 22 by dshi@chromium.org, May 16 2016

Minor correction:
We do have 2 RPC servers for shards. The following is the Puppet code:
  if ($server_type == 'autotest-shard' and
      $fqdn =~ /^chromeos-server(\d+)\..*$/) {
    $t_num = ($1 + 0) % 2
  }
  $global_afe_hostname = $t_num ? {
    0 => 'chromeos-server28.mtv.corp.google.com',
    1 => 'chromeos-server29.mtv.corp.google.com',
    default => undef,
  }

Adding another one should be quite straightforward.
Taking a second look at this call:
        new_finished_jobs = [job for job in
                             afe.get_jobs(parent_job_id=parent_job_id,
                                          finished=True)
                             if job.id in remaining_child_jobs]

The call fetches every finished job and filters out the already-processed
ones on the client side.  That means jobs that finish early are
re-transmitted with every query.  That means that (loosely speaking)
the bandwidth consumed by this query for a given suite job grows
_quadratically_ over time.  That's bound to be spectacularly bad...
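
A toy back-of-the-envelope illustration (numbers are made up): if N child jobs finish at one per polling cycle, poll k re-transmits k job records, so the total transmitted is roughly N^2/2 records instead of N:

    N = 1000                          # hypothetical number of child jobs
    total_records = sum(range(1, N + 1))
    print(total_records)              # 500500 records, vs. N = 1000 if each
                                      # finished job were sent only once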

So, other options for bandaids:
  * The query is in a loop that polls every 5 seconds.  Reducing
    the polling frequency would cut back on bandwidth consumed,
    although it wouldn't stop the quadratic growth.
  * The query could be restricted to just the jobs that are still
    of interest.  That is, the "if job.id in remaining_child_jobs"
    should be part of the query on the server side.  That would
    stop the quadratic growth, and ensure that each job is only
    transmitted once.

I think the following patch should stop the quadratic growth problem:

diff --git a/server/cros/dynamic_suite/job_status.py b/server/cros/dynamic_suite/job_status.py
index 6da5b58..5c2e4ef 100644
--- a/server/cros/dynamic_suite/job_status.py
+++ b/server/cros/dynamic_suite/job_status.py
@@ -382,10 +382,8 @@ def wait_for_child_results(afe, tko, parent_job_id):
     remaining_child_jobs = set(job.id for job in
                                afe.get_jobs(parent_job_id=parent_job_id))
     while remaining_child_jobs:
-        new_finished_jobs = [job for job in
-                             afe.get_jobs(parent_job_id=parent_job_id,
-                                          finished=True)
-                             if job.id in remaining_child_jobs]
+        new_finished_jobs = afe.get_jobs(id__in=remaining_child_jobs,
+                                         finished=True)
 
         for job in new_finished_jobs:
 

Owner: jrbarnette@chromium.org
Sounds good for now, could you upload a CL?

Comment 27 by bugdroid1@chromium.org, May 18 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/4e9e23657fde19bef5c126b86df1674280e07056

commit 4e9e23657fde19bef5c126b86df1674280e07056
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Tue May 17 15:26:08 2016

[autotest] Reduce polling load for dynamic suites.

Dynamic suites check for completion by polling with a query for all
finished child jobs.  At each polling cycle the only jobs of
interest would be jobs that had finished since the previous cycle,
but the query fetched all finished jobs, meaning that data for jobs
that finished early would be retransmitted many times.

The excessive amount of data transmitted for the polling query is
suspected of causing load on the RPC server which in turn causes
mysterious server hang-ups.

This changes the query to filter the unneeded data on the server
side, so that each finished job is only transmitted once.

BUG= chromium:606071 
TEST=unit tests, plus push_to_prod suite on local instance

Change-Id: I27c7be51f341e3d90a3fbd92ac8f6a93699a340a
Reviewed-on: https://chromium-review.googlesource.com/344895
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/4e9e23657fde19bef5c126b86df1674280e07056/server/cros/dynamic_suite/job_status.py
[modify] https://crrev.com/4e9e23657fde19bef5c126b86df1674280e07056/server/cros/dynamic_suite/job_status_unittest.py

Comment 28 by fdeng@chromium.org, May 19 2016

FYI, the polling in the suite job should be hitting "cautotest" if I remember correctly; server 28 and 29 are for heartbeats only, but it wouldn't hurt to add a third AFE server for shards.
Owner: kevcheng@chromium.org
Passing on to this week's deputy.

If this problem has stopped happening, we should declare
victory.  I think that means revert this CL:
    https://chromium-review.googlesource.com/342841
... and then close this bug as Fixed.
https://chromium-review.googlesource.com/#/c/347517/

And then we can close the bug.


Comment 31 by bugdroid1@chromium.org, May 28 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d19a1df337f79ba28ddc6f9d573d0ffca7363c7b

commit d19a1df337f79ba28ddc6f9d573d0ffca7363c7b
Author: Richard Barnette <jrbarnette@google.com>
Date: Fri May 27 17:04:21 2016

Revert "[autotest] Loop over urllib2 reads of the RPC responses."

This reverts commit 9344c8078ad968e6f062d3ec4849be2f31739b92.

This change is debug code that's no longer needed.

BUG= chromium:606071 
TEST=None

Change-Id: I0af49ccf71af11b17cfcbe56be4c069ae582c482
Reviewed-on: https://chromium-review.googlesource.com/347517
Commit-Ready: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>
Reviewed-by: Simran Basi <sbasi@chromium.org>

[modify] https://crrev.com/d19a1df337f79ba28ddc6f9d573d0ffca7363c7b/frontend/afe/json_rpc/proxy.py

Status: Fixed (was: Assigned)
Haven't seen this bug surface during my shift, so closing out.
Closing... please feel free to reopen if it's not fixed.
Status: Verified (was: Fixed)
Blocking: 632433
Linking https://bugs.chromium.org/p/chromium/issues/detail?id=632433, which appears to be related.

Labels: akeshet-pending-downgrade
ChromeOS Infra P1 Bugscrub.

P1 Bugs in this component should be important enough to get weekly status updates.

Is this already fixed?  -> Fixed
Is this no longer relevant? -> Archived or WontFix
Is this not a P1, based on go/chromeos-infra-bug-slo rubric? -> lower priority.
Is this a Feature Request rather than a bug? Type -> Feature
Is this missing important information or scope needed to decide how to proceed? -> Ask question on bug, possibly reassign.
Does this bug have the wrong owner? -> reassign.

Bugs that remain in this state next week will be downgraded to P2.

Comment 38 by sosa@chromium.org, Jul 18 2017

Status: Archived (was: Available)
