
Issue 863217


Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




Repair task hits AFE for RPM information

Project Member Reported by pprabhu@chromium.org, Jul 12

Issue description

Example task: https://chrome-swarming.appspot.com/task?id=3ea8f55168c9b711

Logs weren't offloaded due to issue 863192
I've manually marked the results directory for offload, so the results _should_ become available at https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/swarming-3ea8f55168c9b711 in a bit.

From autoserv logs:

07/12 13:30:53.460 INFO |        server_job:0216|       START   ----    repair.rpm      timestamp=1531427453    localtime=Jul 12 13:30:53       
07/12 13:30:53.773 ERROR|        rpm_client:0044| <Fault 1: "<class 'rpm_infrastructure_exception.RPMInfrastructureException'>:Can not retrieve rpm information from AFE for chromeos4-row7-rack6-host19, no host found.">
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/rpm_control_system/rpm_client.py", line 42, in set_power
    default_result=False)
  File "/usr/local/autotest/client/common_lib/cros/retry.py", line 123, in timeout
    default_result = func(*args, **kwargs)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1233, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1587, in __request
    verbose=self.__verbose
  File "/usr/lib/python2.7/xmlrpclib.py", line 1273, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1306, in single_request
    return self.parse_response(response)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1482, in parse_response
    return u.close()
  File "/usr/lib/python2.7/xmlrpclib.py", line 794, in close
    raise Fault(**self._stack[0])
Fault: <Fault 1: "<class 'rpm_infrastructure_exception.RPMInfrastructureException'>:Can not retrieve rpm information from AFE for chromeos4-row7-rack6-host19, no host found.">
07/12 13:30:53.774 ERROR|            repair:0507| Repair failed: Power cycle the host with RPM
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 505, in _repair_host
    self.repair(host)
  File "/usr/local/autotest/server/hosts/repair.py", line 92, in repair
    host.power_cycle()
  File "/usr/local/autotest/server/hosts/cros_host.py", line 1731, in power_cycle
    rpm_client.set_power(self.hostname, 'CYCLE')
  File "/usr/local/autotest/site_utils/rpm_control_system/rpm_client.py", line 46, in set_power
    'Client call exception: ' + str(e))
RemotePowerException: Client call exception: <Fault 1: "<class 'rpm_infrastructure_exception.RPMInfrastructureException'>:Can not retrieve rpm information from AFE for chromeos4-row7-rack6-host19, no host found.">
07/12 13:30:53.775 INFO |        server_job:0216|               FAIL    ----    repair.rpm      timestamp=1531427453    localtime=Jul 12 13:30:53       Client call exception: <Fault 1: "<class 'rpm_infrastructure_exception.RPMInfrastructureException'>:Can not retrieve rpm information from AFE for chromeos4-row7-rack6-host19, no host found.">
07/12 13:30:53.775 INFO |        server_job:0216|       END FAIL        ----    repair.rpm      timestamp=1531427453    localtime=Jul 12 13:30:53       
07/12 13:30:53.775 INFO |            repair:0110| Attempting this repair action: Reset the DUT via keyboard sysrq-x

 
Status: Started (was: Assigned)
This is going to be a PITA to fix.

The RPM server itself hits the AFE to get powerunit information about the DUT.

The fix would involve changing the rpmserver's API to allow clients to provide this information. The clients can then provide this information from the HostInfo available to the test instead of going to the AFE.

Step 1: Add new RPCs that allow DUTs to supply the required RPM information instead of having the rpm_server hit AFE behind our back: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1136026

> The RPM server itself hits the AFE to get powerunit information about the DUT.

Why not change the RPM server to rely on the new source of truth?

Re #3: No servers in the lab should rely on Skylab services directly. We want to create a clear bifurcation between GCP services (which can call each other more freely) and baremetal / full stack deployments. The only flow of information from GCP services to the baremetal deployment will be via tasks running on skylab-drones.

These tasks obtain all the information necessary to execute, and may then pass it around to stuff deployed within the lab.
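
To make the proposed flow concrete, here is a rough sketch of a client that builds the RPM details from host attributes it already holds and passes them as RPC arguments, so the server never needs an AFE lookup. Everything named here is an assumption for illustration: the PowerUnitInfo tuple, the attribute keys, and the set_power_via_rpm RPC name and argument order are not taken from the CL.

```python
import collections

# Hypothetical container for what the RPM server needs to act on an outlet.
PowerUnitInfo = collections.namedtuple(
    'PowerUnitInfo',
    ['device_hostname', 'rpm_hostname', 'outlet', 'hydra_hostname'])


def powerunit_info_from_attributes(hostname, attributes):
    """Build PowerUnitInfo from host attributes already held by the task.

    The attribute keys are assumptions for illustration. A missing
    required key raises ValueError, so the caller fails locally instead
    of triggering a server-side lookup.
    """
    try:
        return PowerUnitInfo(
            device_hostname=hostname,
            rpm_hostname=attributes['powerunit_hostname'],
            outlet=attributes['powerunit_outlet'],
            # XML-RPC cannot marshal None by default, so use '' when unset.
            hydra_hostname=attributes.get('hydra_hostname', ''))
    except KeyError as e:
        raise ValueError('%s: missing RPM attribute %s' % (hostname, e))


def set_power_with_info(proxy, info, new_state):
    """Call a hypothetical RPC that takes the RPM details as arguments."""
    return proxy.set_power_via_rpm(
        info.device_hostname, info.rpm_hostname, info.outlet,
        info.hydra_hostname, new_state)
```

The point of the design is that the task running on the drone is the only place that needs the inventory data; the rpmserver just acts on what it is handed.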
Seems like this is needed in the current phase (mark skylab-based paladin important)
Project Member

Comment 6 by bugdroid1@chromium.org, Jul 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/c32b49367c0ec053855f7ed0d6b9646ff9e4f9d9

commit c32b49367c0ec053855f7ed0d6b9646ff9e4f9d9
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Jul 18 04:53:26 2018

rpm: Split _get_powerunit_info()

BUG=chromium:863217
TEST=Locally run rpmserver and ensure behaviour matches
     prod for both DUT and servo.

Change-Id: I1d4db04a7a3b17d9e4a94634e11b8402eb41bf1d
Reviewed-on: https://chromium-review.googlesource.com/1136024
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Congbin Guo <guocb@chromium.org>

[modify] https://crrev.com/c32b49367c0ec053855f7ed0d6b9646ff9e4f9d9/site_utils/rpm_control_system/frontend_server.py

Project Member

Comment 7 by bugdroid1@chromium.org, Jul 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/bc41d1c23137900ffef33ef0c08bc3a96736e652

commit bc41d1c23137900ffef33ef0c08bc3a96736e652
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Jul 18 08:40:06 2018

rpm: Extract _queue_once()

in preparation for replacement RPCs for queue_requests()

BUG=chromium:863217
TEST=Locally run rpmserver and ensure behaviour matches
     prod for both DUT and servo.

Change-Id: Ib1d37df1ef99c9889be10f2f40cbd3a1f09ac0f7
Reviewed-on: https://chromium-review.googlesource.com/1136025
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/bc41d1c23137900ffef33ef0c08bc3a96736e652/site_utils/rpm_control_system/frontend_server.py

Project Member

Comment 8 by bugdroid1@chromium.org, Jul 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0a931caac805f598ade1fdf54a8f7aa189f82877

commit 0a931caac805f598ade1fdf54a8f7aa189f82877
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Jul 18 11:08:11 2018

rpm: Add new RPCs to set power.

These RPCs are intended to replace the old queue_request() RPC.

BUG=chromium:863217
TEST=Locally run rpmserver and ensure behaviour matches
     prod for both DUT and servo.

Change-Id: I661f5769a5799772006ba0513d46f7573711503d
Reviewed-on: https://chromium-review.googlesource.com/1136026
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/0a931caac805f598ade1fdf54a8f7aa189f82877/site_utils/rpm_control_system/frontend_server.py

OK, now to push these changes to the rpmserver.

We do not update the rpmserver as part of push-to-prod.
In fact, the autotest checkout on the server is currently at:

chromeos-test@chromeos-server160:/usr/local/autotest$ git log -1
commit 1e3b52e60ea3e764af2281e43b8ab7e2b567103a (HEAD, m/master, cros/prod-next, cros/prod, cros/master)
Author: Sida Liu <sidal@chromium.org>
Date:   Fri Sep 22 10:34:18 2017 -0700

-------

Doing a manual update.
Manual server update steps:
[1]
chromeos-test@chromeos-server160:~/chromiumos$ repo sync

[2]
chromeos-test@chromeos-server160:/usr/local/autotest$ repo sync

[3] # Updates the chromite checkout in site-packages:
chromeos-test@chromeos-server160:/usr/local/autotest$ ./utils/build_externals.py

[4] # Any branch works; this checkout is neither tested nor automatically deployed.
chromeos-test@chromeos-server160:/usr/local/autotest$ git checkout cros/master

[5] # Restart relevant services
chromeos-test@chromeos-server160:/usr/local/autotest$ sudo service rpmserver_frontend_server stop
rpmserver_frontend_server stop/waiting
chromeos-test@chromeos-server160:/usr/local/autotest$ sudo service rpmserver_frontend_server start
rpmserver_frontend_server start/running, process 225267
chromeos-test@chromeos-server160:/usr/local/autotest$ sudo service rpmserver_dispatcher stop
rpmserver_dispatcher stop/waiting
chromeos-test@chromeos-server160:/usr/local/autotest$ sudo service rpmserver_dispatcher start
rpmserver_dispatcher start/running, process 230464

--------

chromeos-test@chromeos-server160:~$ tail /var/log/rpmserver/rpmserver_frontend_server.log
100.109.25.143 - - [18/Jul/2018 09:36:52] "POST /RPC2 HTTP/1.1" 200 -
100.109.25.143 - - [18/Jul/2018 09:36:53] "POST /RPC2 HTTP/1.1" 200 -
100.109.25.143 - - [18/Jul/2018 09:36:53] "POST /RPC2 HTTP/1.1" 200 -
100.109.25.143 - - [18/Jul/2018 09:36:54] "POST /RPC2 HTTP/1.1" 200 -
100.109.178.145 - - [18/Jul/2018 09:36:55] "POST /RPC2 HTTP/1.1" 200 -
100.109.178.145 - - [18/Jul/2018 09:36:55] "POST /RPC2 HTTP/1.1" 200 -
100.109.25.148 - - [18/Jul/2018 09:36:57] "POST /RPC2 HTTP/1.1" 200 -
100.109.25.143 - - [18/Jul/2018 09:36:58] "POST /RPC2 HTTP/1.1" 200 -
100.109.25.143 - - [18/Jul/2018 09:36:59] "POST /RPC2 HTTP/1.1" 200 -
100.108.189.50 - - [18/Jul/2018 09:37:01] "POST /RPC2 HTTP/1.1" 200 -
chromeos-test@chromeos-server160:~$ tail /var/log/rpmserver/rpmserver_dispatcher.log
100.108.133.208 - - [18/Jul/2018 09:36:44] "POST /RPC2 HTTP/1.1" 200 -
100.108.133.208 - - [18/Jul/2018 09:36:47] "POST /RPC2 HTTP/1.1" 200 -
100.108.133.208 - - [18/Jul/2018 09:36:52] "POST /RPC2 HTTP/1.1" 200 -
100.108.133.208 - - [18/Jul/2018 09:36:53] "POST /RPC2 HTTP/1.1" 200 -
100.108.133.208 - - [18/Jul/2018 09:36:53] "POST /RPC2 HTTP/1.1" 200 -
100.108.133.208 - - [18/Jul/2018 09:36:54] "POST /RPC2 HTTP/1.1" 200 -
100.108.133.208 - - [18/Jul/2018 09:36:57] "POST /RPC2 HTTP/1.1" 200 -
100.108.133.208 - - [18/Jul/2018 09:36:58] "POST /RPC2 HTTP/1.1" 200 -
100.108.133.208 - - [18/Jul/2018 09:36:59] "POST /RPC2 HTTP/1.1" 200 -
100.108.133.208 - - [18/Jul/2018 09:37:01] "POST /RPC2 HTTP/1.1" 200 -
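
For a quicker sanity check than eyeballing the tails above, something like the following could flag any request that did not return 200. This is only a sketch: the regular expression is written against the line format shown in these two logs, and non_ok_requests is a hypothetical helper, not an existing tool.

```python
import re

# Matches lines such as:
# 100.109.25.143 - - [18/Jul/2018 09:36:52] "POST /RPC2 HTTP/1.1" 200 -
_ACCESS_LINE = re.compile(
    r'^(?P<client>\S+) - - \[(?P<when>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) ')


def non_ok_requests(lines):
    """Return (client, when, status) for parsed lines whose status is not 200.

    Lines that do not match the access-log format are skipped.
    """
    failures = []
    for line in lines:
        match = _ACCESS_LINE.match(line)
        if match and match.group('status') != '200':
            failures.append((match.group('client'), match.group('when'),
                             int(match.group('status'))))
    return failures
```

Feeding it the lines of either log file should return an empty list when the restarted services are healthy.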
Components: -Infra>Client>ChromeOS>Test Infra>Client>ChromeOS>Test>Platform
Cc: ayatane@chromium.org pprabhu@chromium.org
Issue 917447 has been merged into this issue.
