Provision jobs for chameleon tests in android1758 lab fail inconsistently since yesterday with Unhandled DevServerException: CrOS auto-update failed ... has reached AU retry limit. |
||||||||
Issue descriptionMissing chameleon tests results promptes to look at cautotest -https://screenshot.googleplex.com/t9HW4zKQUqh Nyan_big -https://screenshot.googleplex.com/x6UEXKgLUNj Auron_yuna -https://screenshot.googleplex.com/V7xhUJSYQmK Chell - https://screenshot.googleplex.com/dQTZgzNm6b8 etc. Mostly on M55 but observed on M54 too. i.e. veyron_jerry - https://screenshot.googleplex.com/ArAxj3YUJu4 Looks same as cassandra lab issue 651974, but here it is inconsistent.
,
Oct 4 2016
Unhandled DevServerException: CrOS auto-update failed for host android1758-audiobox2-host1: has reached AU retry limit. Screenshot https://screenshot.googleplex.com/8UeRBMSfv3E
,
Oct 5 2016
This is still going on for audiobox DUTs(pool:audio_box) in b.1758
,
Oct 5 2016
,
Oct 5 2016
taking a look
,
Oct 5 2016
I think that's mostly devserver load issue. We are adding two more devservers there, but being blocked by some puppet bug: https://bugs.chromium.org/p/chromium/issues/detail?id=653140
,
Oct 5 2016
for android1758-audiobox1-host1/296269-provision I see in the autoserv.DEBUG logs an error reaching the devserver: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/android1758-audiobox1-host1/296269-provision/20160510081058/ 10/05 08:11:09.788 DEBUG| dev_server:0798| Error occurred with exit_code 255 when executing the ssh call: Warning: Permanently added '172.25.65.217' (RSA) to the list of known hosts. ****************************** To get a cert for ssh, visit https://goto.google.com/corpssh or run the "gcert" command in your terminal. If you are still having trouble, please see https://goto.google.com/corpssh-instructions ****************************** Permission denied (publickey,keyboard-interactive). . 10/05 08:11:09.789 WARNI| retry:0181| <class 'autotest_lib.client.common_lib.error.CmdError'>(Command <ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"'> failed, rc=255, Command returned non-zero exit status * Command: ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"' Exit status: 255 Duration: 0.851129055023 stderr: Warning: Permanently added '172.25.65.217' (RSA) to the list of known hosts. ****************************** So it tries some different devservers until one succeeds, then: 10/05 08:13:56.639 DEBUG| base_utils:0185| Running 'ssh 100.107.127.227 'curl "http://100.107.127.227:8082/cros_au?build_name=nyan_big-release/R55-8865.0.0&async=True&force_update=True&host_name=android1758-audiobox1-host1&full_update=False"'' 10/05 08:14:11.282 INFO | dev_server:1642| Received response from devserver for cros_au call: '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n <title>500 Internal Server Error</title>\n <style type="text/css">\n #powered_by {\n margin-top: 20px;\n border-top: 2px solid black;\n font-style: italic;\n }\n\n #traceback {\n color: red;\n }\n </style>\n</head>\n <body>\n <h2>500 Internal Server Error</h2>\n <p>The server encountered an unexpected condition which prevented it from fulfilling the request.</p>\n <pre id="traceback">Traceback (most recent call last):\n File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 656, in respond\n response.body = self.handler()\n File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 188, in __call__\n self.body = self.oldhandler(*args, **kwargs)\n File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 34, in __call__\n return self.callable(*self.args, **self.kwargs)\n File "/home/chromeos-test/chromiumos/src/platform/dev/devserver.py", line 843, in cros_au\n progress_tracker.WriteStatus(\'CrOS update is just started.\')\n File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update_progress.py", line 116, in WriteStatus\n if not self.track_status_file:\n File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update_progress.py", line 108, in track_status_file\n return GetTrackStatusFile(self.host_name, self.pid)\n File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update_progress.py", line 81, in GetTrackStatusFile\n osutils.SafeMakedirs(os.path.dirname(TRACK_LOG_FILE_PATH))\nAttributeError: \'NoneType\' object has no attribute \'SafeMakedirs\'\n</pre>\n <div id="powered_by">\n <span>Powered by <a href="http://www.cherrypy.org">CherryPy 3.2.2</a></span>\n </div>\n </body>\n</html>\n' 10/05 08:14:11.283 DEBUG| dev_server:1744| Failed to trigger auto-update process on devserver Parsing that it looks like an exception on the devserver: File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 656, in respond response.body = self.handler() File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 188, in __call__ self.body = self.oldhandler(*args, **kwargs) File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 34, in __call__ return self.callable(*self.args, **self.kwargs) File "/home/chromeos-test/chromiumos/src/platform/dev/devserver.py", line 843, in cros_au progress_tracker.WriteStatus('CrOS update is just started.') File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update_progress.py", line 116, in WriteStatus if not self.track_status_file: File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update_progress.py", line 108, in track_status_file return GetTrackStatusFile(self.host_name, self.pid) File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update_progress.py", line 81, in GetTrackStatusFile osutils.SafeMakedirs(os.path.dirname(TRACK_LOG_FILE_PATH)) AttributeError: 'NoneType' object has no attribute 'SafeMakedirs' Looks like osutils can't be imported, not sure what the fix for that is but assigning to Xixuan.
,
Oct 5 2016
+shuqianz to comment on the error "Permission denied (publickey,keyboard-interactive)"
,
Oct 5 2016
I manually run $curl "http://172.25.65.217:8082/check_health?" in several devservers, there is no issue. Can you specify an android-xxx devserver I can test?
,
Oct 5 2016
The test was running on chromeos-server26.mtv.corp.google.com and was hitting the error following devservers (172.25.65.235, 172.25.65.217) but it's working now. Looks like a transient issue.
,
Oct 5 2016
Re #9 manually ssh and curl to a Ganeti devserver works because it could be using your corp ssh credential (might prompt you to touch security key) I just checked suite scheduler log, seems the error is happening to following 3 servers. 172.25.65.217 172.24.184.161 172.25.65.235
,
Oct 5 2016
Re #11, it didn't prompt for password or touching the security key. I run this $curl "http://172.25.65.217:8082/check_health?" on both my desktop and chromeos2-devserver6.cros. It just returns the result immediately without asking for anything.
,
Oct 5 2016
I checked the suite_scheduler log in cautotest, the permission issue after 9/30. And I also checked the logs in chromeos-server26.mtv, I didn't find any log about the permission issue. So, I think the permission issue is not the cause. If anyone finds new logs about the permission issue, please correct me here.
,
Oct 5 2016
the issue is the devserver: 100.107.127.227, I don't know when it's added, but it's not updated. So the devserver and chromite is not synced. If it's a newly added devserver, there must be sth wrong in the process of adding new devserver. If it's not, we may have a bug in updating all devservers. Reassign to deputy to update it.
,
Oct 5 2016
For the permission issue, I think the fix works. The devserver 172.25.65.217 is doing its job since 9/30. However, I check the logs, it will fail to ssh into at certain time, for example around 0am, 3am every day. And then suite_scheduler keeps trying for tens of times. I think it is load at that time is super high on devserver, then it fail to ssh to. But at other times, it works just fine. So, I think it is really a load issue. Example logs: ^^^^^^^^^^^172.25.65.217 was working fine before 3am at 10/01^^^^^^^^^^^^ ...... 10/01 03:04:07.717 INFO | base_event:0238| Handling nightly for veyron_minnie-cheets 10/01 03:04:07.729 DEBUG| base_utils:0185| Running 'ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"'' 10/01 03:04:07.731 INFO | driver:0173| Finished handling nightly event for board ninja 10/01 03:04:07.733 INFO | driver:0173| Finished handling nightly event for board veyron_minnie-cheets 10/01 03:04:07.835 DEBUG| base_utils:0185| Running 'ssh 172.25.65.106 'curl "http://172.25.65.106:8082/check_health?"'' 10/01 03:04:07.840 DEBUG| manifest_versions:0380| found build-name/ultima-release/pass/54/8743.50.0.xml build-name/ultima-release/pass/55/8851.0.0.xml 10/01 03:04:07.827 ERROR| dev_server:0389| Devserver check_health failed. Free disk space is low. Only 18GB is available. 10/01 03:04:07.842 DEBUG| manifest_versions:0254| parsing manifest path build-name/ultima-release/pass/54/8743.50.0.xml 10/01 03:04:07.849 DEBUG| base_utils:0185| Running 'ssh 172.24.184.161 'curl "http://172.24.184.161:8082/check_health?"'' 10/01 03:04:07.919 DEBUG| manifest_versions:0254| parsing manifest path build-name/ultima-release/pass/55/8851.0.0.xml .............. 10/01 03:04:12.148 DEBUG| dev_server:0798| Error occurred with exit_code 255 when executing the ssh call: Warning: Permanently added '172.25.65.217' (RSA) to the list of known hosts. ****************************** To get a cert for ssh, visit https://goto.google.com/corpssh or run the "gcert" command in your terminal. If you are still having trouble, please see https://goto.google.com/corpssh-instructions ****************************** Permission denied (publickey,keyboard-interactive). . 10/01 03:04:12.227 WARNI| retry:0181| <class 'autotest_lib.client.common_lib.error.CmdError'>(Command <ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"'> failed, rc=255, Command returned non-zero exit status * Command: ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"' Exit status: 255 Duration: 0.933449983597 stderr: Warning: Permanently added '172.25.65.217' (RSA) to the list of known hosts. ****************************** To get a cert for ssh, visit https://goto.google.com/corpssh or run the "gcert" command in your terminal. If you are still having trouble, please see https://goto.google.com/corpssh-instructions ............ ^^^^^^^^^^Somehow, it failed to ssh around 3am, and suite_scheduler retried for hundreds of times in 5-10 mins^^^^^^^^^^^^^^^^ ..... ^^^^^^^^^then suite_scheduler stopped retried and did something else. After an hour. 172.25.65.217 was recovered and working fine again.^^^^^^^ ......10/01 04:03:11.978 INFO | timed_event:0061| Checking deadline 2016-10-01 04:00:00 for event nightly 10/01 04:03:11.978 INFO | driver:0196| Handling nightly event for 240 boards 10/01 04:03:12.011 DEBUG| base_utils:0185| Running 'ssh 172.24.184.160 'curl "http://172.24.184.160:8082/check_health?"'' 10/01 04:03:12.126 DEBUG| base_utils:0185| Running 'ssh 172.24.190.169 'curl "http://172.24.190.169:8082/check_health?"'' 10/01 04:03:12.142 DEBUG| base_utils:0185| Running 'ssh 172.24.190.169 'curl "http://172.24.190.169:8082/check_health?"'' 10/01 04:03:12.143 DEBUG| base_utils:0185| Running 'ssh 172.25.65.235 'curl "http://172.25.65.235:8082/check_health?"'' 10/01 04:03:12.312 DEBUG| base_utils:0185| Running 'ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"'' 10/01 04:03:12.518 DEBUG| base_utils:0185| Running 'ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"'' 10/01 04:03:12.523 DEBUG| base_utils:0185| Running 'ssh 172.25.65.217 'curl "http://172.25.65.217:8082/check_health?"'' 10/01 04:03:12.525 DEBUG| base_utils:0185| Running 'ssh 172.24.184.160 'curl "http://172.24.184.160:8082/check_health?"'' 10/01 04:03:12.529 DEBUG| base_utils:0185| Running 'ssh 172.25.65.106 'curl "http://172.25.65.106:8082/check_health?"'' 10/01 04:03:12.917 DEBUG| base_utils:0185| Running '/usr/bin/git --git-dir=/tmp/_autotmp_WN9eH4_suite_scheduler/.git --work-tree=/tmp/_autotmp_WN9eH4_suite_scheduler log --since="2016-09-30 04:00:00" --pretty="format:%H" -- build-name/x86-alex-toolchain-gcc/pass/ build-name/x86-alex-release/pass/ build-name/x86-alex-release-group/pass/ build-name/x86-alex-toolchain-llvm/pass/' 10/01 04:03:12.998 DEBUG| base_utils:0185| Running '/usr/bin/git --git-dir=/tmp/_autotmp_WN9eH4_suite_scheduler/.git --work-tree ...... Same pattern repeated the next day.I propose: 1. Stop the suite_scheduler crazy retries. 2. Balance the load at the night
,
Oct 6 2016
This is still going on. E.g. https://ubercautotest.corp.google.com/afe/#tab_id=view_host&object_id=4912
,
Oct 6 2016
Wrong example (from yesterday) E.g. https://ubercautotest.corp.google.com/afe/#tab_id=view_host&object_id=4915
,
Oct 7 2016
This is still going on and failing tests for audio box equipped DUTs in b1758.
,
Oct 10 2016
updated the devservers today and saw a successful provision on android1758-audiobox2-host1. I think this is resolved now.
,
Jan 4 2017
The devservers issue was gone. https://ubercautotest.corp.google.com/afe/#tab_id=view_host&object_id=6117 |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by dshi@chromium.org
, Oct 4 2016