New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 618510 link

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Closed: Jun 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Verify/Repair fails for devservers in chromeos3

Project Member Reported by tienchang@chromium.org, Jun 9 2016

Issue description

Background: chromeos3 (Chaos) is under google_v2 ACL. There are currently four devservers in this lab.

Please note Verify/Repair runs on this DUT:
http://cautotest/afe/#tab_id=view_host&object_id=792

Verify fails on:
06/08 17:04:35.566 INFO |            repair:0279| Verifying this condition: The most recent AU attempt on this DUT succeeded
06/08 17:04:35.566 DEBUG|          ssh_host:0180| Running (ssh) 'test -f /var/tmp/provision_failed'
06/08 17:04:35.778 ERROR|            repair:0284| Failed: The most recent AU attempt on this DUT succeeded
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 281, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 96, in verify
    'Last AU on this DUT failed')
AutoservVerifyError: Last AU on this DUT failed

Repair logs show that:
06/08 17:09:13.875 DEBUG|        base_utils:0185| Running 'ssh 172.22.39.163 'curl "http://172.22.39.163:8082/check_health?"''
06/08 17:09:14.956 ERROR|            repair:0397| Repair failed: Powerwash and then re-install the stable build via AU
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 395, in _repair_host
    self.repair(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 317, in repair
    super(PowerWashRepair, self).repair(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 297, in repair
    afe_utils.machine_install_and_update_labels(host, repair=True)
  File "/usr/local/autotest/server/afe_utils.py", line 190, in machine_install_and_update_labels
    image_name, host_attributes = host.machine_install(*args, **dargs)
  File "/usr/local/autotest/server/hosts/cros_host.py", line 753, in machine_install
    updater.run_update()
  File "/usr/local/autotest/site-packages/statsd/timer.py", line 95, in _decorator
    return function(*args, **kwargs)
  File "/usr/local/autotest/client/common_lib/cros/autoupdater.py", line 479, in run_update
    'Update server at %s not available' % auserver_host)
ChromiumOSError: Update server at 172.22.39.163:8082 not available


dshi@ is able to make the http call from the DUT to the devserver. This may potentially be another ACL issue.

Assigning to deputy to triage. Marking as Pri-1 as other labs are waiting to be cut over to new ACL.
 
how to login into the DUT to check? I can't make http call from my workstation to the devserver due to some network restrictions I guess.

So I guess there exists some network restrictions here too.

kris, David, any idea?

Comment 2 by dshi@chromium.org, Jun 9 2016

The devserver is in chromeos3, with acl you can only ssh to it. However, that dut is in the same subnet and i verified that curl http://172.22.39.163:8082/check_health works from the dut. This is very odd.

Comment 3 by krisr@google.com, Jun 9 2016

The job link above links to a device, and the device did not run a test since May 14th:
  http://cautotest/afe/#tab_id=view_job&object_id=63343297.

So I don't know what job comment1 is referring to.  Tien please include the correct job links.  That said, I agree with Dan's assessment.  

Here are all available dev servers:
# Begin Dev Server

                host chromeos3-infra-devserver {
                        hardware ethernet 00:0c:29:48:61:57;
                        fixed-address 172.22.39.161;
                        ddns-hostname "chromeos3-infra-devserver";
                        option host-name "chromeos3-infra-devserver";
                }     

                host chromeos3-infra-devserver1 {
                        hardware ethernet 00:0c:29:e1:a8:cf;
                        fixed-address 172.22.39.162;
                        ddns-hostname "chromeos3-infra-devserver1";
                        option host-name "chromeos3-infra-devserver1";
                }     

                host chromeos3-infra-devserver2 {
                        hardware ethernet 00:0c:29:fe:5e:25;
                        fixed-address 172.22.39.163;
                        ddns-hostname "chromeos3-infra-devserver2";
                        option host-name "chromeos3-infra-devserver2";
                }     

                host chromeos3-infra-devserver3 {
                        hardware ethernet 00:0c:29:b1:47:12;
                        fixed-address 172.22.39.164;
                        ddns-hostname "chromeos3-infra-devserver3";
                        option host-name "chromeos3-infra-devserver3";
                }     

# End Dev Server

When I run the test from my workstation:
$ ssh chromeos-test@172.22.39.161 'curl "http://172.22.39.161:8082/check_health?"'
chromeos-test@172.22.39.161's password: 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{"disk_total_bytes_per_second": 46241.42507089223, "network_total_bytes_per_second": 1729.8773573667793, "network_sent_bytes_per_second": 806.0439225662822, "apache_client_count": 4, "disk_write_bytes_per_second": 46241.42507089223, "cpu_percent": 100.0, "di100   413  100   413    0     0   2876      0 --:--:-- --:--:-- --:--:--  28884971, "free_disk": 176, "staging_thread_count": 0, "telemetry_test_count": 1}


$ ssh chromeos-test@172.22.39.162 'curl "http://172.22.39.162:8082/check_health?"'
chromeos-test@172.22.39.162's password: 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{"disk_total_bytes_per_second": 104335.28675417618, "network_total_bytes_per_second": 979.641694621444, "network_sent_bytes_per_second": 490.7698221347154, "apache_client_count": 4, "disk_write_bytes_per_second": 104335.28675417618, "cpu_percent": 0.0, "disk100   412  100   412    0     0   3389      0 --:--:-- --:--:-- --:--:--  3404853, "free_disk": 181, "staging_thread_count": 0, "telemetry_test_count": 1}


$ ssh chromeos-test@172.22.39.163 'curl "http://172.22.39.163:8082/check_health?"'
chromeos-test@172.22.39.163's password: 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{"disk_total_bytes_per_second": 51574.8589681047, "network_total_bytes_per_second": 1023.3106938116013, "network_sent_bytes_per_second": 434.3074487602753, "apache_client_count": 4, "disk_write_bytes_per_second": 51574.8589681047, "cpu_percent": 0.0, "disk_r100   408  100   408    0     0   3381      0 --:--:-- --:--:-- --:--:--  3400 "free_disk": 177, "staging_thread_count": 0, "telemetry_test_count": 1}


$ ssh chromeos-test@172.22.39.164 'curl "http://172.22.39.164:8082/check_health?"'
chromeos-test@172.22.39.164's password: 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{"disk_total_bytes_per_second": 64683.740275662196, "network_total_bytes_per_second": 1035.4707730540763, "network_sent_bytes_per_second": 440.17502746430034, "apache_client_count": 4, "disk_write_bytes_per_second": 64683.740275662196, "cpu_percent": 0.0, "disk_read_bytes_per_second": 0.0, "gsutil_count": 1, "network_recv_bytes_per_second": 595.2957455897758, "free_disk": 179, "staging_thread_count": 0, "telemetry_test_cou100   433  100   433    0     0   2835      0 --:--:-- --:--:-- --:--:--  2848

What this test is doing is sshing into a devserver and then curling from itself to make sure the web server on 8082 is running.  Since it was the curl that failed and not the ssh into the server that is strange.

Here is the code snippet:
try:
    httplib.HTTPConnection(auserver_host).connect()
except IOError:
    raise ChromiumOSError(
            'Update server at %s not available' % auserver_host)


I also re-ran from a DUT in the lab:
$ ssh root@chromeos3-row1-rack2-host16.cros
localhost ~ # curl "http://172.22.39.161:8082/check_health?"
{"disk_total_bytes_per_second": 55645.39747660891, "network_total_bytes_per_second": 1411.9723929584582, "network_sent_bytes_per_second": 702.3401411312996, "apache_client_count": 4, "disk_write_bytes_per_second": 55645.39747660891, "cpu_percent": 0.0, "disk_read_bytes_per_second": 0.0, "network_recv_bytes_per_second": 709.6322518271586, "free_disk": 176, "staging_thread_count": 0, "telemetry_test_count": 1}

localhost ~ # curl "http://172.22.39.162:8082/check_health?"              
{"disk_total_bytes_per_second": 52789.90200619686, "network_total_bytes_per_second": 1344.1651177767922, "network_sent_bytes_per_second": 629.521659514759, "apache_client_count": 4, "disk_write_bytes_per_second": 52789.90200619686, "cpu_percent": 11.1, "disk_read_bytes_per_second": 0.0, "network_recv_bytes_per_second": 714.6434582620332, "free_disk": 181, "staging_thread_count": 0, "telemetry_test_count": 1}

localhost ~ # curl "http://172.22.39.163:8082/check_health?"              
{"disk_total_bytes_per_second": 58552.31794603099, "network_total_bytes_per_second": 1686.3101156707496, "network_sent_bytes_per_second": 801.2197271386009, "apache_client_count": 4, "disk_write_bytes_per_second": 58552.31794603099, "cpu_percent": 0.0, "disk_read_bytes_per_second": 0.0, "network_recv_bytes_per_second": 885.0903885321487, "free_disk": 177, "staging_thread_count": 0, "telemetry_test_count": 1}

localhost ~ # curl "http://172.22.39.164:8082/check_health?"              
{"disk_total_bytes_per_second": 50340.51250573921, "network_total_bytes_per_second": 1306.4544449721723, "network_sent_bytes_per_second": 597.0222033429238, "apache_client_count": 4, "disk_write_bytes_per_second": 50340.51250573921, "cpu_percent": 0.0, "disk_read_bytes_per_second": 0.0, "gsutil_count": 1, "network_recv_bytes_per_second": 709.4322416292484, "free_disk": 179, "staging_thread_count": 0, "telemetry_test_count": 1}

Disk space on that server also looks good:
chromeos-test@chromeos3-infra-devserver2:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            487M  4.0K  487M   1% /dev
tmpfs           100M  4.5M   95M   5% /run
/dev/sda1       196G   21G  165G  12% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none            497M     0  497M   0% /run/shm
none            100M     0  100M   0% /run/user

All devservers are available via ssh from corp and from http from inside the lab.  I would recommend re-running the job.
The jobs are the Verify/Repair jobs shown after "Show verifies, repairs, cleanups and resets" is checked off.

Clicking on the latest Verify job brings me to these logs:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos3-row1-rack2-host16/56326113-verify?pli=1

Clicking on the second latest Repair job shows these logs:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos3-row1-rack2-host16/56326139-repair

There's currently a Repair job running now that I did not invoke so I'll wait till that has completed. I'm not sure rerunning would help, however.

Comment 5 by krisr@google.com, Jun 9 2016

If I ssh into the device here is what I get:

$ ssh root@chromeos3-row1-rack2-host16.cros
localhost ~ # python
-bash: python: command not found

This seems like a problem.

Other hosts do this:

$ ssh root@chromeos3-row1-rack2-host9.cros
localhost ~ # python
Python 2.7.3 (default, Dec  8 2015, 21:49:44) 
[GCC 4.9.x-google 20150123 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 

This may need to be manually flashed since the chaos duts don't have servo.
Oh yes that's the other odd thing: after flashing the DUT, confirming it has Python again, and hitting Reverify, that job failed due to "Last AU on this DUT failed".
When I check the DUT, Python is gone again.

What's interesting is this AU test in the Verify job fails but the Python test in the Verify job right afterward passes (but again, when I check the DUT itself after this Reverify run completes, Python is gone again).

<status.log from Verify job>
	FAIL	----	verify.good_au	timestamp=1465430675	localtime=Jun 08 17:04:35	Last AU on this DUT failed
	GOOD	----	verify.python	timestamp=1465430676	localtime=Jun 08 17:04:36
I can't get python too. kick off the repair, but it fails again. I can't explain it, so I plan to add some logging into it to see what's the real IOError.
Update: manual flashing via USB seems to do the trick to bring a device's status from Repair Failed to Ready, as I was able to do so for nyan_big at chromeos3-row1-rack1-host13.
http://cautotest/afe/#tab_id=view_host&object_id=1009

However, it looks like provisioning during jobs will cause the DUT to enter the "Update server not available, no Python" state. 
1) I flashed with a USB the falco_li at chromeos3-row1-rack2-host16.
2) Hitting Reverify allowed the host to go from Repair Failed to Ready.
3) There was a queued job from last night (which has a 24 hr timeout) that began to provision the DUT.
4) Job failed due to provisioning.
http://cautotest/afe/#tab_id=view_job&object_id=66137402

06/08 19:27:46.932 WARNI|         cros_host:0755| Autoupdate did not complete.
06/08 19:27:46.933 ERROR|provision_AutoUpda:0138| Update server at 172.22.39.161:8082 not available
06/08 19:27:46.934 WARNI|              test:0606| Autotest caught exception when running test:
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 804, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 461, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 376, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 139, in run_once
    raise error.TestFail(str(e))
TestFail: Update server at 172.22.39.161:8082 not available
Project Member

Comment 9 by bugdroid1@chromium.org, Jun 9 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/18ed0ea21e71afb9a5da1b7511c1efbbae8523d6

commit 18ed0ea21e71afb9a5da1b7511c1efbbae8523d6
Author: xixuan <xixuan@google.com>
Date: Thu Jun 09 01:43:26 2016

Autotest: add logging to record the IOError in connection to devserver.

BUG= chromium:618510 
TEST=None

Change-Id: I76e2dbc49a36a727df62af93f4e39790604a09c0
Reviewed-on: https://chromium-review.googlesource.com/351091
Reviewed-by: Kris Rambish <krisr@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/18ed0ea21e71afb9a5da1b7511c1efbbae8523d6/client/common_lib/cros/autoupdater.py

Comment 10 by dshi@chromium.org, Jun 9 2016

Cc: jrbarnette@chromium.org
+jrbarnette

Xixuan, does this seems to be a bug to you related to ACL change?
In autoupdater.py:
try:
    httplib.HTTPConnection(auserver_host).connect()
except IOError as e:
    ...

apparently if auserver_host is in subnet with ACL, this call will fail because autoupdater.py runs in drone, which doesn't have http connection to the devserver.
I think you're right, the new ACL will block such connect.
Project Member

Comment 12 by bugdroid1@chromium.org, Jun 14 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ccf2e72170237a6111d56c6956eb1732972a9ad8

commit ccf2e72170237a6111d56c6956eb1732972a9ad8
Author: xixuan <xixuan@google.com>
Date: Fri Jun 10 23:42:38 2016

Autotest: delete http call to devserver for new ACL

Since new ACL forbiddens all ports except for port 22, this CL deletes the
direct http call to devserver in pre-check process of auto-update. It's
replaced by devserver healthy check, which uses inherited call to connect
to devserver. The inherited call in devserver use port 22 when the new ACL
applies.

BUG= chromium:618510 
TEST=run local autotest instance and 'repair' job for dut to check the logic.

Change-Id: Ib30af06d34df68c0fca608b3263397ca49796b9f
Reviewed-on: https://chromium-review.googlesource.com/351681
Commit-Ready: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/ccf2e72170237a6111d56c6956eb1732972a9ad8/client/common_lib/cros/autoupdater.py

Status: Fixed (was: Assigned)
The original problem of being unable to Verify/Repair seems resolved. (Currently one on DUT in Chaos is in Repair Failed and it's because it is unpingable.) I'll mark this bug as Fixed and let it sit for a while.

I will file another bug for this Tuesday night's test run failure on provision_autoUpdate.
Closing... please feel free to reopen if its not fixed.
Status: Verified (was: Fixed)

Sign in to add a comment