
Issue 712274


Issue metadata

Status: Duplicate
Merged: issue 712283
Owner: ----
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Server-side test failed with timeout from download_extract

Reported by jrbarnette@chromium.org, Apr 17 2017

Issue description

This CQ run failed:
    https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14280
Because of this slave paladin run:
    https://luci-milo.appspot.com/buildbot/chromeos/peppy-paladin/14856
Because of this test failure:
    http://cautotest/tko/retrieve_logs.cgi?job=/results/112455541-chromeos-test/

The error reporting is (shall we say) obtuse, but if you dig
you'll find what you want is at this path:
    chromeos4-row9-rack11-host3/ssp_logs/debug/autoserv.DEBUG

The most relevant part seems to be this:
04/13 00:06:04.552 INFO |        server_job:0184| FAIL	----	----	timestamp=1492067164	localtime=Apr 13 00:06:04	Failed to setup container for test: retry exception (function="download_extract()"), timeout = 300s. Check logs in ssp_logs folder for more details.
04/13 00:06:04.553 DEBUG|        base_utils:0185| Running 'sudo -n chown -R 99158 "/usr/local/autotest/results/112455541-chromeos-test/chromeos4-row9-rack11-host3"'
04/13 00:06:05.188 DEBUG|        base_utils:0185| Running 'sudo -n chgrp -R 5000 "/usr/local/autotest/results/112455541-chromeos-test/chromeos4-row9-rack11-host3"'
04/13 00:06:05.833 ERROR|         traceback:0013| Traceback (most recent call last):
04/13 00:06:05.834 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 530, in run_autoserv
04/13 00:06:05.835 ERROR|         traceback:0013|     machines)
04/13 00:06:05.835 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 163, in _run_with_ssp
04/13 00:06:05.836 ERROR|         traceback:0013|     dut_name=dut_name)
04/13 00:06:05.836 ERROR|         traceback:0013|   File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
04/13 00:06:05.837 ERROR|         traceback:0013|     return fn(*args, **kwargs)
04/13 00:06:05.837 ERROR|         traceback:0013|   File "/usr/local/autotest/site_utils/lxc.py", line 235, in func_cleanup_if_fail
04/13 00:06:05.838 ERROR|         traceback:0013|     return func(*args, **kwargs)
04/13 00:06:05.839 ERROR|         traceback:0013|   File "/usr/local/autotest/site_utils/lxc.py", line 946, in setup_test
04/13 00:06:05.839 ERROR|         traceback:0013|     download_extract(server_package_url, autotest_pkg_path, usr_local_path)
04/13 00:06:05.840 ERROR|         traceback:0013|   File "/usr/local/autotest/client/common_lib/cros/retry.py", line 240, in func_retry
04/13 00:06:05.840 ERROR|         traceback:0013|     raise error.TimeoutException(exception_message)
04/13 00:06:05.841 ERROR|         traceback:0013| TimeoutException: retry exception (function="download_extract()"), timeout = 300s
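For context, the 'retry exception (function="download_extract()"), timeout = 300s'
message comes from a retry wrapper around download_extract() in retry.py.  A minimal
sketch of that pattern (assumed behavior, not the actual retry.py implementation):

    import time

    class TimeoutException(Exception):
        """Raised when the wrapped call doesn't succeed before the deadline."""

    def retry(exc_types, timeout_min=5, delay_sec=5):
        """Keep calling func until it succeeds or timeout_min minutes elapse."""
        def decorator(func):
            def wrapper(*args, **kwargs):
                deadline = time.time() + timeout_min * 60
                while True:
                    try:
                        return func(*args, **kwargs)
                    except exc_types:
                        # Out of budget: give up with the message seen in the log.
                        if time.time() + delay_sec >= deadline:
                            raise TimeoutException(
                                'retry exception (function="%s()"), timeout = %ds'
                                % (func.__name__, int(timeout_min * 60)))
                        time.sleep(delay_sec)
            return wrapper
        return decorator

So the 300s in the log is the overall retry budget for download_extract(), not a
single network timeout.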

That apparently happened because of a failure around this event:
04/13 00:01:02.571 DEBUG|        base_utils:0185| Running 'ssh 172.25.65.217 'curl "http://172.25.65.217:8082/static/peppy-paladin/R59-9457.0.0-rc2/autotest_server_package.tar.bz2"''

Host 172.25.65.217 is chromeos-gt-devserver1.cbf; it was serving an image
to the shard, not the DUT.
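For what it's worth, the step that's timing out boils down to fetching
autotest_server_package.tar.bz2 from that devserver and unpacking it into the SSP
container.  A rough sketch (hypothetical helper, not the actual lxc.py code) of
that download-and-extract step:

    import subprocess

    def download_extract(url, tarball_path, extract_dir):
        """Fetch the server package tarball from the devserver and unpack it."""
        # This is the curl shown in the log above; a slow or overloaded
        # devserver (or shard) makes this step blow past the 300s retry budget.
        subprocess.check_call(['curl', '-sSf', '-o', tarball_path, url])
        # Unpack the .tar.bz2 into the container's destination directory.
        subprocess.check_call(['tar', '-xjf', tarball_path, '-C', extract_dir])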

I haven't been able to figure out what really happened.  I suspect
that the devserver was merely overloaded.  This could be related
to bug 604473.

 
Another failed CQ master:
    https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14283
The corresponding slave:
    https://luci-milo.appspot.com/buildbot/chromeos/veyron_speedy-paladin/4991
These tests:
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=112503832
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=112503830

This time, the devserver in the logs was 100.120.7.235 (chromeos11-infra-devserver2).

This failure happened on the same shard (chromeos-server14.mtv) as the previous
one, so the problem could be shard load rather than devserver load.

Outgoing network traffic on the shard surged about a day before
the first failures.

Also, the failures in c#1 were accompanied by other, less distinct
failures in the same suite.  So this symptom looks more like it's
specific to shard load.


Mergedinto: 712283
Status: Duplicate (was: Available)
> This looks a lot like issue 712283?

Yeah.  It seems the left hand doesn't know what the
right hand is doing.
