New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 712283 link

Starred by 3 users

Issue metadata

Status: Archived
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

timeout for downloading autotest_server_package.tar.bz2 after 5 mins

Project Member Reported by nxia@chromium.org, Apr 17 2017

Issue description

https://luci-milo.appspot.com/buildbot/chromeos/veyron_speedy-paladin/5027

http://cautotest/tko/retrieve_logs.cgi?job=/results/113133368-chromeos-test/

04/17 11:01:46.444 DEBUG|        base_utils:0185| Running 'sudo -n chgrp -R 5000 "/usr/local/autotest/results/113133368-chromeos-test/chromeos4-row4-rack12-host18"'
04/17 11:01:47.048 ERROR|         traceback:0013| Traceback (most recent call last):
04/17 11:01:47.049 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 530, in run_autoserv
04/17 11:01:47.050 ERROR|         traceback:0013|     machines)
04/17 11:01:47.050 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 163, in _run_with_ssp
04/17 11:01:47.051 ERROR|         traceback:0013|     dut_name=dut_name)
04/17 11:01:47.051 ERROR|         traceback:0013|   File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
04/17 11:01:47.052 ERROR|         traceback:0013|     return fn(*args, **kwargs)
04/17 11:01:47.052 ERROR|         traceback:0013|   File "/usr/local/autotest/site_utils/lxc.py", line 235, in func_cleanup_if_fail
04/17 11:01:47.053 ERROR|         traceback:0013|     return func(*args, **kwargs)
04/17 11:01:47.054 ERROR|         traceback:0013|   File "/usr/local/autotest/site_utils/lxc.py", line 946, in setup_test
04/17 11:01:47.054 ERROR|         traceback:0013|     download_extract(server_package_url, autotest_pkg_path, usr_local_path)
04/17 11:01:47.055 ERROR|         traceback:0013|   File "/usr/local/autotest/client/common_lib/cros/retry.py", line 240, in func_retry
04/17 11:01:47.055 ERROR|         traceback:0013|     raise error.TimeoutException(exception_message)
04/17 11:01:47.056 ERROR|         traceback:0013| TimeoutException: retry exception (function="download_extract()"), timeout = 300s
04/17 11:01:47.070 INFO |            client:0570| Attempting refresh to obtain initial access_token
04/17 11:01:47.172 INFO |            client:0872| Refreshing access_token
04/17 11:01:48.185 ERROR|          autoserv:0765| 1
Traceback (most recent call last):
  File "/usr/local/autotest/server/autoserv", line 761, in main
    use_ssp)
  File "/usr/local/autotest/server/autoserv", line 574, in run_autoserv
    sys.exit(exit_code)
SystemExit: 1
 

Comment 1 by nxia@chromium.org, Apr 17 2017

Components: Infra>Client>ChromeOS
Labels: -Pri-3 OS-Chrome Pri-2

Comment 2 by nxia@chromium.org, Apr 17 2017

Cc: jrbarnette@chromium.org xixuan@chromium.org

Comment 3 by nxia@chromium.org, Apr 17 2017

how do I follow the download timeout error?

Comment 4 by xixuan@chromium.org, Apr 17 2017

It fails at

https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/site_utils/lxc.py?q=download_extract&sq=package:%5Echromeos_(internal%7Cpublic)$&l=281

04/17 10:56:44.153 DEBUG|        base_utils:0185| Running 'ssh 100.120.7.235 'curl "http://100.120.7.235:8082/static/veyron_speedy-paladin/R60-9470.0.0-rc3/autotest_server_package.tar.bz2"''

--- 300s ---

04/17 11:01:44.158 DEBUG|        base_utils:0185| Running 'sudo lxc-ls -P /usr/local/autotest/containers -f -F name,state'     (following cleanup)
....

But I tested it locally on server14.mtv just now by calling "dev_server.ImageServerBase.download_file('http://100.120.7.235:8082/static/veyron_speedy-paladin/R60-9470.0.0-rc3/autotest_server_package.tar.bz2', '/tmp/autotest.tar.bz2', timeout=300)" and it successfully finished. 

The package is downloaded at about 10:40 (has 8 hour difference between that devserver and our shard)
chromeos-test@chromeos11-infra-devserver2:/var/log/devserver$ ls -ltr ~/images/veyron_speedy-paladin/R60-9470.0.0-rc3/
total 14840
-rw-r--r-- 1 chromeos-test chromeos-test 15192987 Apr 17 18:40 autotest_server_package.tar.bz2


Conclusion:
1. package is staged well.
2. the devserver seems work normally ( http://shortn/_bJ7UGdYmHI )
3. the devserver (chromeos11-infra-devserver2) seems to be an android devserver. In the log, there's several exception/errors, but not related to this one.
4. the download_extract works well now, but not at that time.

looks like a flakeness. 100.120.7.235 doesn't work at that time, and this command (ssh 100.120.7.235 'curl "http://100.120.7.235:8082/static/veyron_speedy-paladin/R60-9470.0.0-rc3/autotest_server_package.tar.bz2"') hang there, not actually being retried.




Comment 5 by nxia@chromium.org, Apr 17 2017

Can we extend the retry.retry timeout to maybe 10 or 15 mins? As dev_server.ImageServerBase.download_file(url, tmp_file, timeout=300) takes up the 300s timeout, so this method isn't really being retried?


http://shortn/_4w2JKKELJQ

Comment 6 by nxia@chromium.org, Apr 18 2017

Another example on veyron_minnie-paladin which timed out at downloading.

https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/2288

http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=112903534

04/15 15:25:22.379 DEBUG|        base_utils:0185| Running 'ssh 100.120.7.235 'curl "http://100.120.7.235:8082/static/veyron_minnie-paladin/R60-9465.0.0-rc2/autotest_server_package.tar.bz2"''
04/15 15:30:22.407 DEBUG|        base_utils:0185| Running 'sudo lxc-ls -P /usr/local/autotest/containers -f -F name,state'




04/15 15:30:26.184 DEBUG|        base_utils:0185| Running 'sudo -n chgrp -R 5000 "/usr/local/autotest/results/112903534-chromeos-test/chromeos4-row9-rack10-host22"'
04/15 15:30:26.825 ERROR|         traceback:0013| Traceback (most recent call last):
04/15 15:30:26.826 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 530, in run_autoserv
04/15 15:30:26.828 ERROR|         traceback:0013|     machines)
04/15 15:30:26.828 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 163, in _run_with_ssp
04/15 15:30:26.829 ERROR|         traceback:0013|     dut_name=dut_name)
04/15 15:30:26.829 ERROR|         traceback:0013|   File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
04/15 15:30:26.830 ERROR|         traceback:0013|     return fn(*args, **kwargs)
04/15 15:30:26.831 ERROR|         traceback:0013|   File "/usr/local/autotest/site_utils/lxc.py", line 235, in func_cleanup_if_fail
04/15 15:30:26.832 ERROR|         traceback:0013|     return func(*args, **kwargs)
04/15 15:30:26.833 ERROR|         traceback:0013|   File "/usr/local/autotest/site_utils/lxc.py", line 946, in setup_test
04/15 15:30:26.834 ERROR|         traceback:0013|     download_extract(server_package_url, autotest_pkg_path, usr_local_path)
04/15 15:30:26.834 ERROR|         traceback:0013|   File "/usr/local/autotest/client/common_lib/cros/retry.py", line 240, in func_retry
04/15 15:30:26.836 ERROR|         traceback:0013|     raise error.TimeoutException(exception_message)
04/15 15:30:26.836 ERROR|         traceback:0013| TimeoutException: retry exception (function="download_extract()"), timeout = 300s
04/15 15:30:26.854 INFO |            client:0570| Attempting refresh to obtain initial access_token
04/15 15:30:26.952 INFO |            client:0872| Refreshing access_token
04/15 15:30:32.744 ERROR|          autoserv:0765| 1



Comment 7 by nxia@chromium.org, Apr 18 2017

Summary: dev_server.ImageServerBase.download_file timeout after 5 mins (was: veyron_speedy-paladin failed at generic_RebootTest_SERVER_JOB)

Comment 8 by nxia@chromium.org, Apr 18 2017

Another example:

https://luci-milo.appspot.com/buildbot/chromeos/veyron_speedy-paladin/5008

http://cautotest/tko/retrieve_logs.cgi?job=/results/113077836-chromeos-test/


04/17 00:01:26.629 DEBUG|        base_utils:0185| Running 'ssh 100.120.7.236 'curl "http://100.120.7.236:8082/static/veyron_mighty-paladin/R60-9469.0.0-rc3/autotest_server_package.tar.bz2"''
04/17 00:06:26.684 DEBUG|        base_utils:0185| Running 'sudo lxc-ls -P /usr/local/autotest/containers -f -F name,state'



04/17 00:06:30.054 ERROR|         traceback:0013|   File "/usr/local/autotest/client/common_lib/cros/retry.py", line 240, in func_retry
04/17 00:06:30.055 ERROR|         traceback:0013|     raise error.TimeoutException(exception_message)
04/17 00:06:30.055 ERROR|         traceback:0013| TimeoutException: retry exception (function="download_extract()"), timeout = 30



Comment 9 by nxia@chromium.org, Apr 18 2017


https://luci-milo.appspot.com/buildbot/chromeos/veyron_mighty-paladin/5027

http://cautotest/tko/retrieve_logs.cgi?job=/results/113077836-chromeos-test/


04/17 00:01:26.629 DEBUG|        base_utils:0185| Running 'ssh 100.120.7.236 'curl "http://100.120.7.236:8082/static/veyron_mighty-paladin/R60-9469.0.0-rc3/autotest_server_package.tar.bz2"''
04/17 00:06:26.684 DEBUG|        base_utils:0185| Running 'sudo lxc-ls -P /usr/local/autotest/containers -f -F name,state'


Comment 10 by nxia@chromium.org, Apr 20 2017

another example:

https://luci-milo.appspot.com/buildbot/chromeos/x86-alex-paladin/27534

FAIL	----	----	timestamp=1492649674	localtime=Apr 19 17:54:34	Failed to setup container for test: retry exception (function="download_extract()"), timeout = 300s. Check logs in ssp_logs folder for more details.
Cc: x...@chromium.org
This looks similar to  issue 712274 , we just saw "Failed to setup container for test: retry exception" on the last two cyan-pfq runs (links in  issue 712274 ).
Project Member

Comment 12 by bugdroid1@chromium.org, Apr 21 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d364662aa4eb1e82174de580ab49a8793eebe857

commit d364662aa4eb1e82174de580ab49a8793eebe857
Author: xixuan <xixuan@chromium.org>
Date: Fri Apr 21 20:37:26 2017

autotest: add timeout for downloading files for ssp.

Current the retry timeout limit is equal to the timeout for downloading
ssp files, which make retry not retryable. So increase the timeout for retry
to be 3 times of the timeout of download_file.

BUG= chromium:712283 
TEST=None

Change-Id: I639cfd24e400e6a4e5ab046dddb0d808205fb721
Reviewed-on: https://chromium-review.googlesource.com/480541
Reviewed-by: Ningning Xia <nxia@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/d364662aa4eb1e82174de580ab49a8793eebe857/site_utils/lxc.py

I chump the fix CL for this. But since push-to-prod is blocked for now, it won't take any effect.
 Issue 712274  has been merged into this issue.

Comment 15 Deleted

Comment 16 by nxia@chromium.org, Apr 21 2017

Summary: timeout for downloading autotest_server_package.tar.bz2 after 5 mins (was: dev_server.ImageServerBase.download_file timeout after 5 mins)

Comment 18 by aut...@google.com, Apr 25 2017

Owner: xixuan@chromium.org
Status: Fixed (was: Untriaged)

Comment 19 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 21 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Sign in to add a comment