Verify/Repair fails for devservers in chromeos3 |
||||
Issue descriptionBackground: chromeos3 (Chaos) is under google_v2 ACL. There are currently four devservers in this lab. Please note Verify/Repair runs on this DUT: http://cautotest/afe/#tab_id=view_host&object_id=792 Verify fails on: 06/08 17:04:35.566 INFO | repair:0279| Verifying this condition: The most recent AU attempt on this DUT succeeded 06/08 17:04:35.566 DEBUG| ssh_host:0180| Running (ssh) 'test -f /var/tmp/provision_failed' 06/08 17:04:35.778 ERROR| repair:0284| Failed: The most recent AU attempt on this DUT succeeded Traceback (most recent call last): File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 281, in _verify_host self.verify(host) File "/usr/local/autotest/server/hosts/cros_repair.py", line 96, in verify 'Last AU on this DUT failed') AutoservVerifyError: Last AU on this DUT failed Repair logs show that: 06/08 17:09:13.875 DEBUG| base_utils:0185| Running 'ssh 172.22.39.163 'curl "http://172.22.39.163:8082/check_health?"'' 06/08 17:09:14.956 ERROR| repair:0397| Repair failed: Powerwash and then re-install the stable build via AU Traceback (most recent call last): File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 395, in _repair_host self.repair(host) File "/usr/local/autotest/server/hosts/cros_repair.py", line 317, in repair super(PowerWashRepair, self).repair(host) File "/usr/local/autotest/server/hosts/cros_repair.py", line 297, in repair afe_utils.machine_install_and_update_labels(host, repair=True) File "/usr/local/autotest/server/afe_utils.py", line 190, in machine_install_and_update_labels image_name, host_attributes = host.machine_install(*args, **dargs) File "/usr/local/autotest/server/hosts/cros_host.py", line 753, in machine_install updater.run_update() File "/usr/local/autotest/site-packages/statsd/timer.py", line 95, in _decorator return function(*args, **kwargs) File "/usr/local/autotest/client/common_lib/cros/autoupdater.py", line 479, in run_update 'Update server at %s not available' % auserver_host) ChromiumOSError: Update server at 172.22.39.163:8082 not available dshi@ is able to make the http call from the DUT to the devserver. This may potentially be another ACL issue. Assigning to deputy to triage. Marking as Pri-1 as other labs are waiting to be cut over to new ACL.
,
Jun 9 2016
The devserver is in chromeos3, with acl you can only ssh to it. However, that dut is in the same subnet and i verified that curl http://172.22.39.163:8082/check_health works from the dut. This is very odd.
,
Jun 9 2016
The job link above links to a device, and the device did not run a test since May 14th: http://cautotest/afe/#tab_id=view_job&object_id=63343297. So I don't know what job comment1 is referring to. Tien please include the correct job links. That said, I agree with Dan's assessment. Here are all available dev servers: # Begin Dev Server host chromeos3-infra-devserver { hardware ethernet 00:0c:29:48:61:57; fixed-address 172.22.39.161; ddns-hostname "chromeos3-infra-devserver"; option host-name "chromeos3-infra-devserver"; } host chromeos3-infra-devserver1 { hardware ethernet 00:0c:29:e1:a8:cf; fixed-address 172.22.39.162; ddns-hostname "chromeos3-infra-devserver1"; option host-name "chromeos3-infra-devserver1"; } host chromeos3-infra-devserver2 { hardware ethernet 00:0c:29:fe:5e:25; fixed-address 172.22.39.163; ddns-hostname "chromeos3-infra-devserver2"; option host-name "chromeos3-infra-devserver2"; } host chromeos3-infra-devserver3 { hardware ethernet 00:0c:29:b1:47:12; fixed-address 172.22.39.164; ddns-hostname "chromeos3-infra-devserver3"; option host-name "chromeos3-infra-devserver3"; } # End Dev Server When I run the test from my workstation: $ ssh chromeos-test@172.22.39.161 'curl "http://172.22.39.161:8082/check_health?"' chromeos-test@172.22.39.161's password: % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0{"disk_total_bytes_per_second": 46241.42507089223, "network_total_bytes_per_second": 1729.8773573667793, "network_sent_bytes_per_second": 806.0439225662822, "apache_client_count": 4, "disk_write_bytes_per_second": 46241.42507089223, "cpu_percent": 100.0, "di100 413 100 413 0 0 2876 0 --:--:-- --:--:-- --:--:-- 28884971, "free_disk": 176, "staging_thread_count": 0, "telemetry_test_count": 1} $ ssh chromeos-test@172.22.39.162 'curl "http://172.22.39.162:8082/check_health?"' chromeos-test@172.22.39.162's password: % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0{"disk_total_bytes_per_second": 104335.28675417618, "network_total_bytes_per_second": 979.641694621444, "network_sent_bytes_per_second": 490.7698221347154, "apache_client_count": 4, "disk_write_bytes_per_second": 104335.28675417618, "cpu_percent": 0.0, "disk100 412 100 412 0 0 3389 0 --:--:-- --:--:-- --:--:-- 3404853, "free_disk": 181, "staging_thread_count": 0, "telemetry_test_count": 1} $ ssh chromeos-test@172.22.39.163 'curl "http://172.22.39.163:8082/check_health?"' chromeos-test@172.22.39.163's password: % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0{"disk_total_bytes_per_second": 51574.8589681047, "network_total_bytes_per_second": 1023.3106938116013, "network_sent_bytes_per_second": 434.3074487602753, "apache_client_count": 4, "disk_write_bytes_per_second": 51574.8589681047, "cpu_percent": 0.0, "disk_r100 408 100 408 0 0 3381 0 --:--:-- --:--:-- --:--:-- 3400 "free_disk": 177, "staging_thread_count": 0, "telemetry_test_count": 1} $ ssh chromeos-test@172.22.39.164 'curl "http://172.22.39.164:8082/check_health?"' chromeos-test@172.22.39.164's password: % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0{"disk_total_bytes_per_second": 64683.740275662196, "network_total_bytes_per_second": 1035.4707730540763, "network_sent_bytes_per_second": 440.17502746430034, "apache_client_count": 4, "disk_write_bytes_per_second": 64683.740275662196, "cpu_percent": 0.0, "disk_read_bytes_per_second": 0.0, "gsutil_count": 1, "network_recv_bytes_per_second": 595.2957455897758, "free_disk": 179, "staging_thread_count": 0, "telemetry_test_cou100 433 100 433 0 0 2835 0 --:--:-- --:--:-- --:--:-- 2848 What this test is doing is sshing into a devserver and then curling from itself to make sure the web server on 8082 is running. Since it was the curl that failed and not the ssh into the server that is strange. Here is the code snippet: try: httplib.HTTPConnection(auserver_host).connect() except IOError: raise ChromiumOSError( 'Update server at %s not available' % auserver_host) I also re-ran from a DUT in the lab: $ ssh root@chromeos3-row1-rack2-host16.cros localhost ~ # curl "http://172.22.39.161:8082/check_health?" {"disk_total_bytes_per_second": 55645.39747660891, "network_total_bytes_per_second": 1411.9723929584582, "network_sent_bytes_per_second": 702.3401411312996, "apache_client_count": 4, "disk_write_bytes_per_second": 55645.39747660891, "cpu_percent": 0.0, "disk_read_bytes_per_second": 0.0, "network_recv_bytes_per_second": 709.6322518271586, "free_disk": 176, "staging_thread_count": 0, "telemetry_test_count": 1} localhost ~ # curl "http://172.22.39.162:8082/check_health?" {"disk_total_bytes_per_second": 52789.90200619686, "network_total_bytes_per_second": 1344.1651177767922, "network_sent_bytes_per_second": 629.521659514759, "apache_client_count": 4, "disk_write_bytes_per_second": 52789.90200619686, "cpu_percent": 11.1, "disk_read_bytes_per_second": 0.0, "network_recv_bytes_per_second": 714.6434582620332, "free_disk": 181, "staging_thread_count": 0, "telemetry_test_count": 1} localhost ~ # curl "http://172.22.39.163:8082/check_health?" {"disk_total_bytes_per_second": 58552.31794603099, "network_total_bytes_per_second": 1686.3101156707496, "network_sent_bytes_per_second": 801.2197271386009, "apache_client_count": 4, "disk_write_bytes_per_second": 58552.31794603099, "cpu_percent": 0.0, "disk_read_bytes_per_second": 0.0, "network_recv_bytes_per_second": 885.0903885321487, "free_disk": 177, "staging_thread_count": 0, "telemetry_test_count": 1} localhost ~ # curl "http://172.22.39.164:8082/check_health?" {"disk_total_bytes_per_second": 50340.51250573921, "network_total_bytes_per_second": 1306.4544449721723, "network_sent_bytes_per_second": 597.0222033429238, "apache_client_count": 4, "disk_write_bytes_per_second": 50340.51250573921, "cpu_percent": 0.0, "disk_read_bytes_per_second": 0.0, "gsutil_count": 1, "network_recv_bytes_per_second": 709.4322416292484, "free_disk": 179, "staging_thread_count": 0, "telemetry_test_count": 1} Disk space on that server also looks good: chromeos-test@chromeos3-infra-devserver2:~$ df -h Filesystem Size Used Avail Use% Mounted on udev 487M 4.0K 487M 1% /dev tmpfs 100M 4.5M 95M 5% /run /dev/sda1 196G 21G 165G 12% / none 4.0K 0 4.0K 0% /sys/fs/cgroup none 5.0M 0 5.0M 0% /run/lock none 497M 0 497M 0% /run/shm none 100M 0 100M 0% /run/user All devservers are available via ssh from corp and from http from inside the lab. I would recommend re-running the job.
,
Jun 9 2016
The jobs are the Verify/Repair jobs shown after "Show verifies, repairs, cleanups and resets" is checked off. Clicking on the latest Verify job brings me to these logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos3-row1-rack2-host16/56326113-verify?pli=1 Clicking on the second latest Repair job shows these logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos3-row1-rack2-host16/56326139-repair There's currently a Repair job running now that I did not invoke so I'll wait till that has completed. I'm not sure rerunning would help, however.
,
Jun 9 2016
If I ssh into the device here is what I get: $ ssh root@chromeos3-row1-rack2-host16.cros localhost ~ # python -bash: python: command not found This seems like a problem. Other hosts do this: $ ssh root@chromeos3-row1-rack2-host9.cros localhost ~ # python Python 2.7.3 (default, Dec 8 2015, 21:49:44) [GCC 4.9.x-google 20150123 (prerelease)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> This may need to be manually flashed since the chaos duts don't have servo.
,
Jun 9 2016
Oh yes that's the other odd thing: after flashing the DUT, confirming it has Python again, and hitting Reverify, that job failed due to "Last AU on this DUT failed". When I check the DUT, Python is gone again. What's interesting is this AU test in the Verify job fails but the Python test in the Verify job right afterward passes (but again, when I check the DUT itself after this Reverify run completes, Python is gone again). <status.log from Verify job> FAIL ---- verify.good_au timestamp=1465430675 localtime=Jun 08 17:04:35 Last AU on this DUT failed GOOD ---- verify.python timestamp=1465430676 localtime=Jun 08 17:04:36
,
Jun 9 2016
I can't get python too. kick off the repair, but it fails again. I can't explain it, so I plan to add some logging into it to see what's the real IOError.
,
Jun 9 2016
Update: manual flashing via USB seems to do the trick to bring a device's status from Repair Failed to Ready, as I was able to do so for nyan_big at chromeos3-row1-rack1-host13. http://cautotest/afe/#tab_id=view_host&object_id=1009 However, it looks like provisioning during jobs will cause the DUT to enter the "Update server not available, no Python" state. 1) I flashed with a USB the falco_li at chromeos3-row1-rack2-host16. 2) Hitting Reverify allowed the host to go from Repair Failed to Ready. 3) There was a queued job from last night (which has a 24 hr timeout) that began to provision the DUT. 4) Job failed due to provisioning. http://cautotest/afe/#tab_id=view_job&object_id=66137402 06/08 19:27:46.932 WARNI| cros_host:0755| Autoupdate did not complete. 06/08 19:27:46.933 ERROR|provision_AutoUpda:0138| Update server at 172.22.39.161:8082 not available 06/08 19:27:46.934 WARNI| test:0606| Autotest caught exception when running test: Traceback (most recent call last): File "/usr/local/autotest/client/common_lib/test.py", line 600, in _exec _call_test_function(self.execute, *p_args, **p_dargs) File "/usr/local/autotest/client/common_lib/test.py", line 804, in _call_test_function return func(*args, **dargs) File "/usr/local/autotest/client/common_lib/test.py", line 461, in execute dargs) File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry postprocess_profiled_run, args, dargs) File "/usr/local/autotest/client/common_lib/test.py", line 376, in _call_run_once self.run_once(*args, **dargs) File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 139, in run_once raise error.TestFail(str(e)) TestFail: Update server at 172.22.39.161:8082 not available
,
Jun 9 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/18ed0ea21e71afb9a5da1b7511c1efbbae8523d6 commit 18ed0ea21e71afb9a5da1b7511c1efbbae8523d6 Author: xixuan <xixuan@google.com> Date: Thu Jun 09 01:43:26 2016 Autotest: add logging to record the IOError in connection to devserver. BUG= chromium:618510 TEST=None Change-Id: I76e2dbc49a36a727df62af93f4e39790604a09c0 Reviewed-on: https://chromium-review.googlesource.com/351091 Reviewed-by: Kris Rambish <krisr@chromium.org> Commit-Queue: Xixuan Wu <xixuan@chromium.org> Tested-by: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/18ed0ea21e71afb9a5da1b7511c1efbbae8523d6/client/common_lib/cros/autoupdater.py
,
Jun 9 2016
+jrbarnette
Xixuan, does this seems to be a bug to you related to ACL change?
In autoupdater.py:
try:
httplib.HTTPConnection(auserver_host).connect()
except IOError as e:
...
apparently if auserver_host is in subnet with ACL, this call will fail because autoupdater.py runs in drone, which doesn't have http connection to the devserver.
,
Jun 9 2016
I think you're right, the new ACL will block such connect.
,
Jun 14 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ccf2e72170237a6111d56c6956eb1732972a9ad8 commit ccf2e72170237a6111d56c6956eb1732972a9ad8 Author: xixuan <xixuan@google.com> Date: Fri Jun 10 23:42:38 2016 Autotest: delete http call to devserver for new ACL Since new ACL forbiddens all ports except for port 22, this CL deletes the direct http call to devserver in pre-check process of auto-update. It's replaced by devserver healthy check, which uses inherited call to connect to devserver. The inherited call in devserver use port 22 when the new ACL applies. BUG= chromium:618510 TEST=run local autotest instance and 'repair' job for dut to check the logic. Change-Id: Ib30af06d34df68c0fca608b3263397ca49796b9f Reviewed-on: https://chromium-review.googlesource.com/351681 Commit-Ready: Xixuan Wu <xixuan@chromium.org> Tested-by: Xixuan Wu <xixuan@chromium.org> Reviewed-by: Dan Shi <dshi@google.com> [modify] https://crrev.com/ccf2e72170237a6111d56c6956eb1732972a9ad8/client/common_lib/cros/autoupdater.py
,
Jun 16 2016
The original problem of being unable to Verify/Repair seems resolved. (Currently one on DUT in Chaos is in Repair Failed and it's because it is unpingable.) I'll mark this bug as Fixed and let it sit for a while. I will file another bug for this Tuesday night's test run failure on provision_autoUpdate.
,
Jun 27 2016
Closing... please feel free to reopen if its not fixed.
,
Jun 27 2016
|
||||
►
Sign in to add a comment |
||||
Comment 1 by xixuan@chromium.org
, Jun 9 2016