New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 652770 link

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Last visit > 30 days ago
Closed: Oct 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Provision jobs for chameleon tests in android1758 lab fail inconsistently since yesterday with Unhandled DevServerException: CrOS auto-update failed ... has reached AU retry limit.

Project Member Reported by ka...@chromium.org, Oct 4 2016

Issue description

Missing chameleon tests results promptes to look at cautotest -https://screenshot.googleplex.com/t9HW4zKQUqh

Nyan_big -https://screenshot.googleplex.com/x6UEXKgLUNj
Auron_yuna -https://screenshot.googleplex.com/V7xhUJSYQmK
Chell - https://screenshot.googleplex.com/dQTZgzNm6b8

etc.

Mostly on M55 but observed on M54 too.
i.e. veyron_jerry - https://screenshot.googleplex.com/ArAxj3YUJu4


Looks same as cassandra lab issue 651974, but here it is inconsistent.

 

Comment 1 by dshi@chromium.org, Oct 4 2016

What's the exact error? android1758- devservers are having heavy load since yesterday. We are adding two more to address the load issue.

Comment 2 by ka...@chromium.org, Oct 4 2016

Unhandled DevServerException: CrOS auto-update failed for host android1758-audiobox2-host1: has reached AU retry limit.
Screenshot https://screenshot.googleplex.com/8UeRBMSfv3E

Comment 3 by ka...@chromium.org, Oct 5 2016

Cc: cychiang@chromium.org
This is still going on for audiobox DUTs(pool:audio_box) in b.1758

Comment 4 by ka...@chromium.org, Oct 5 2016

Cc: waihong@chromium.org
taking a look

Comment 6 by dshi@chromium.org, Oct 5 2016

I think that's mostly devserver load issue. We are adding two more devservers there, but being blocked by some puppet bug:
https://bugs.chromium.org/p/chromium/issues/detail?id=653140
Cc: kevcheng@chromium.org
Owner: xixuan@chromium.org
Status: Assigned (was: Untriaged)
for android1758-audiobox1-host1/296269-provision I see in the autoserv.DEBUG logs an error reaching the devserver: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/android1758-audiobox1-host1/296269-provision/20160510081058/

10/05 08:11:09.788 DEBUG|        dev_server:0798| Error occurred with exit_code 255 when executing the ssh call: Warning: Permanently added '172.25.65.217' (RSA) to the list of known hosts.
******************************
To get a cert for ssh, visit https://goto.google.com/corpssh
or run the "gcert" command in your terminal.
If you are still having trouble, please see
https://goto.google.com/corpssh-instructions
******************************

Permission denied (publickey,keyboard-interactive).
.
10/05 08:11:09.789 WARNI|             retry:0181| <class 'autotest_lib.client.common_lib.error.CmdError'>(Command <ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"'> failed, rc=255, Command returned non-zero exit status
* Command: 
    ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"'
Exit status: 255
Duration: 0.851129055023

stderr:
Warning: Permanently added '172.25.65.217' (RSA) to the list of known hosts.
******************************

So it tries some different devservers until one succeeds, then:
10/05 08:13:56.639 DEBUG|        base_utils:0185| Running 'ssh 100.107.127.227 'curl "http://100.107.127.227:8082/cros_au?build_name=nyan_big-release/R55-8865.0.0&async=True&force_update=True&host_name=android1758-audiobox1-host1&full_update=False"''
10/05 08:14:11.282 INFO |        dev_server:1642| Received response from devserver for cros_au call: '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n    <title>500 Internal Server Error</title>\n    <style type="text/css">\n    #powered_by {\n        margin-top: 20px;\n        border-top: 2px solid black;\n        font-style: italic;\n    }\n\n    #traceback {\n        color: red;\n    }\n    </style>\n</head>\n    <body>\n        <h2>500 Internal Server Error</h2>\n        <p>The server encountered an unexpected condition which prevented it from fulfilling the request.</p>\n        <pre id="traceback">Traceback (most recent call last):\n  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 656, in respond\n    response.body = self.handler()\n  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 188, in __call__\n    self.body = self.oldhandler(*args, **kwargs)\n  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 34, in __call__\n    return self.callable(*self.args, **self.kwargs)\n  File "/home/chromeos-test/chromiumos/src/platform/dev/devserver.py", line 843, in cros_au\n    progress_tracker.WriteStatus(\'CrOS update is just started.\')\n  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update_progress.py", line 116, in WriteStatus\n    if not self.track_status_file:\n  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update_progress.py", line 108, in track_status_file\n    return GetTrackStatusFile(self.host_name, self.pid)\n  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update_progress.py", line 81, in GetTrackStatusFile\n    osutils.SafeMakedirs(os.path.dirname(TRACK_LOG_FILE_PATH))\nAttributeError: \'NoneType\' object has no attribute \'SafeMakedirs\'\n</pre>\n    <div id="powered_by">\n    <span>Powered by <a href="http://www.cherrypy.org">CherryPy 3.2.2</a></span>\n    </div>\n    </body>\n</html>\n'
10/05 08:14:11.283 DEBUG|        dev_server:1744| Failed to trigger auto-update process on devserver

Parsing that it looks like an exception on the devserver:
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 656, in respond
    response.body = self.handler()
  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 188, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 34, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/home/chromeos-test/chromiumos/src/platform/dev/devserver.py", line 843, in cros_au
    progress_tracker.WriteStatus('CrOS update is just started.')
  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update_progress.py", line 116, in WriteStatus
    if not self.track_status_file:
  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update_progress.py", line 108, in track_status_file
    return GetTrackStatusFile(self.host_name, self.pid)
  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update_progress.py", line 81, in GetTrackStatusFile
    osutils.SafeMakedirs(os.path.dirname(TRACK_LOG_FILE_PATH))
AttributeError: 'NoneType' object has no attribute 'SafeMakedirs'


Looks like osutils can't be imported, not sure what the fix for that is but assigning to Xixuan.

Comment 8 by dshi@chromium.org, Oct 5 2016

Cc: shuqianz@chromium.org
+shuqianz to comment on the error "Permission denied (publickey,keyboard-interactive)"
I manually run $curl "http://172.25.65.217:8082/check_health?" in several devservers, there is no issue. Can you specify an android-xxx devserver I can test? 

The test was running on chromeos-server26.mtv.corp.google.com and was hitting the error following devservers (172.25.65.235, 172.25.65.217) but it's working now.  Looks like a transient issue.

Comment 11 by dshi@chromium.org, Oct 5 2016

Re #9

manually ssh and curl to a Ganeti devserver works because it could be using your corp ssh credential (might prompt you to touch security key)

I just checked suite scheduler log, seems the error is happening to following 3 servers.
172.25.65.217
172.24.184.161
172.25.65.235
Re #11, it didn't prompt for password or touching the security key. I run this $curl "http://172.25.65.217:8082/check_health?" on both my desktop and chromeos2-devserver6.cros. It just returns the result immediately without asking for anything.
I checked the suite_scheduler log in cautotest, the permission issue after 9/30. And I also checked the logs in chromeos-server26.mtv, I didn't find any log about the permission issue. So, I think the permission issue is not the cause. If anyone finds new logs about the permission issue, please correct me here.
Cc: xixuan@chromium.org
Owner: kevcheng@chromium.org
the issue is the devserver: 100.107.127.227, I don't know when it's added, but it's not updated. So the devserver and chromite is not synced.

If it's a newly added devserver, there must be sth wrong in the process of adding new devserver.

If it's not, we may have a bug in updating all devservers.

Reassign to deputy to update it.
For the permission issue, I think the fix works. The devserver 172.25.65.217 is doing its job since 9/30. However, I check the logs, it will fail to ssh into at certain time, for example around 0am, 3am every day. And then suite_scheduler keeps trying for tens of times. I think it is load at that time is super high on devserver, then it fail to ssh to. But at other times, it works just fine. So, I think it is really a load issue. Example logs:

^^^^^^^^^^^172.25.65.217 was working fine before 3am at 10/01^^^^^^^^^^^^
......
10/01 03:04:07.717 INFO |        base_event:0238| Handling nightly for veyron_minnie-cheets
10/01 03:04:07.729 DEBUG|        base_utils:0185| Running 'ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"''
10/01 03:04:07.731 INFO |            driver:0173| Finished handling nightly event for board ninja
10/01 03:04:07.733 INFO |            driver:0173| Finished handling nightly event for board veyron_minnie-cheets
10/01 03:04:07.835 DEBUG|        base_utils:0185| Running 'ssh 172.25.65.106 'curl "http://172.25.65.106:8082/check_health?"''
10/01 03:04:07.840 DEBUG| manifest_versions:0380| found build-name/ultima-release/pass/54/8743.50.0.xml

build-name/ultima-release/pass/55/8851.0.0.xml
10/01 03:04:07.827 ERROR|        dev_server:0389| Devserver check_health failed. Free disk space is low. Only 18GB is available.
10/01 03:04:07.842 DEBUG| manifest_versions:0254| parsing manifest path build-name/ultima-release/pass/54/8743.50.0.xml
10/01 03:04:07.849 DEBUG|        base_utils:0185| Running 'ssh 172.24.184.161 'curl "http://172.24.184.161:8082/check_health?"''
10/01 03:04:07.919 DEBUG| manifest_versions:0254| parsing manifest path build-name/ultima-release/pass/55/8851.0.0.xml
..............
10/01 03:04:12.148 DEBUG|        dev_server:0798| Error occurred with exit_code 255 when executing the ssh call: Warning: Permanently added '172.25.65.217' (RSA) to the list of known hosts.
******************************
To get a cert for ssh, visit https://goto.google.com/corpssh
or run the "gcert" command in your terminal.
If you are still having trouble, please see
https://goto.google.com/corpssh-instructions
******************************

Permission denied (publickey,keyboard-interactive).
.
10/01 03:04:12.227 WARNI|             retry:0181| <class 'autotest_lib.client.common_lib.error.CmdError'>(Command <ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"'> failed, rc=255, Command returned non-zero exit status
* Command: 
    ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"'
Exit status: 255
Duration: 0.933449983597

stderr:
Warning: Permanently added '172.25.65.217' (RSA) to the list of known hosts.
******************************
To get a cert for ssh, visit https://goto.google.com/corpssh
or run the "gcert" command in your terminal.
If you are still having trouble, please see
https://goto.google.com/corpssh-instructions
............

^^^^^^^^^^Somehow, it failed to ssh around 3am, and suite_scheduler retried for hundreds of times in 5-10 mins^^^^^^^^^^^^^^^^

.....

^^^^^^^^^then suite_scheduler stopped retried and did something else. After an hour. 172.25.65.217 was recovered and working fine again.^^^^^^^

......10/01 04:03:11.978 INFO |       timed_event:0061| Checking deadline 2016-10-01 04:00:00 for event nightly
10/01 04:03:11.978 INFO |            driver:0196| Handling nightly event for 240 boards
10/01 04:03:12.011 DEBUG|        base_utils:0185| Running 'ssh 172.24.184.160 'curl "http://172.24.184.160:8082/check_health?"''
10/01 04:03:12.126 DEBUG|        base_utils:0185| Running 'ssh 172.24.190.169 'curl "http://172.24.190.169:8082/check_health?"''
10/01 04:03:12.142 DEBUG|        base_utils:0185| Running 'ssh 172.24.190.169 'curl "http://172.24.190.169:8082/check_health?"''
10/01 04:03:12.143 DEBUG|        base_utils:0185| Running 'ssh 172.25.65.235 'curl "http://172.25.65.235:8082/check_health?"''
10/01 04:03:12.312 DEBUG|        base_utils:0185| Running 'ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"''
10/01 04:03:12.518 DEBUG|        base_utils:0185| Running 'ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"''
10/01 04:03:12.523 DEBUG|        base_utils:0185| Running 'ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"''
10/01 04:03:12.525 DEBUG|        base_utils:0185| Running 'ssh 172.24.184.160 'curl "http://172.24.184.160:8082/check_health?"''
10/01 04:03:12.529 DEBUG|        base_utils:0185| Running 'ssh 172.25.65.106 'curl "http://172.25.65.106:8082/check_health?"''
10/01 04:03:12.917 DEBUG|        base_utils:0185| Running '/usr/bin/git --git-dir=/tmp/_autotmp_WN9eH4_suite_scheduler/.git --work-tree=/tmp/_autotmp_WN9eH4_suite_scheduler log --since="2016-09-30 04:00:00" --pretty="format:%H" -- build-name/x86-alex-toolchain-gcc/pass/ build-name/x86-alex-release/pass/ build-name/x86-alex-release-group/pass/ build-name/x86-alex-toolchain-llvm/pass/'
10/01 04:03:12.998 DEBUG|        base_utils:0185| Running '/usr/bin/git --git-dir=/tmp/_autotmp_WN9eH4_suite_scheduler/.git --work-tree
......

Same pattern repeated the next day.I propose:
1. Stop the suite_scheduler crazy retries.
2. Balance the load at the night
This is still going on and failing tests for audio box equipped DUTs in b1758.
Status: Fixed (was: Assigned)
updated the devservers today and saw a successful provision  on android1758-audiobox2-host1.  I think this is resolved now.
Status: Verified (was: Fixed)
The devservers issue was gone. 
https://ubercautotest.corp.google.com/afe/#tab_id=view_host&object_id=6117

Sign in to add a comment