Issue 729819

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



[Canary] banon-release: Servo initialize timed out during provision in HWTest phase

Project Member Reported by mcchou@chromium.org, Jun 5 2017

Issue description

Where the issue happened:
Canary banon-release

What the issue was:
Servo initialize timed out.

When the issue started:
This failure has been present since build #1181 (see https://chromegw.corp.google.com/i/chromeos/builders/banon-release?numbuilds=50).

Error messages:
06-05-2017 [12:50:05] Created suite job: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=121613966
  @@@STEP_LINK@Link to suite@http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=121613966@@@
  06-05-2017 [13:11:44] Suite job is finished.
  06-05-2017 [13:11:44] Start collecting test results and dump them to json.
  Suite job   [ PASSED ]
  provision   [ FAILED ]
  provision     FAIL: Servo initialize timed out., command execution error, completed successfully
 
Found related error messages under https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/121613975-chromeos-test/chromeos6-row1-rack17-host19/debug/.

Related error messages:
06/05 12:52:34.687 ERROR|            repair:0332| Failed: servod service is taking calls
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/servo_repair.py", line 211, in verify
    host.connect_servo()
  File "/usr/local/autotest/server/hosts/servo_host.py", line 138, in connect_servo
    'Servo initialize timed out.')
AutoservVerifyError: Servo initialize timed out.
06/05 12:52:34.690 INFO |        server_job:0184| 	FAIL	----	verify.servod	timestamp=1496692354	localtime=Jun 05 12:52:34	Servo initialize timed out.
06/05 12:52:34.690 INFO |            repair:0105| Skipping this operation: pwr_button control is normal
06/05 12:52:34.691 DEBUG|            repair:0106| The following dependencies failed:
06/05 12:52:34.691 DEBUG|            repair:0108|     servod service is taking calls
06/05 12:52:34.692 INFO |            repair:0105| Skipping this operation: lid_open control is normal
06/05 12:52:34.692 DEBUG|            repair:0106| The following dependencies failed:
06/05 12:52:34.692 DEBUG|            repair:0108|     servod service is taking calls
06/05 12:52:34.692 INFO |            repair:0105| Skipping this operation: All host verification checks pass
06/05 12:52:34.693 DEBUG|            repair:0106| The following dependencies failed:
06/05 12:52:34.693 DEBUG|            repair:0108|     servod service is taking calls
06/05 12:52:34.693 ERROR|        servo_host:0868| Servo verification failed for chromeos6-row1-rack17-labstation
Traceback (most recent call last):
  File "/usr/local/autotest/server/hosts/servo_host.py", line 864, in create_servo_host
    newhost.verify()
  File "/usr/local/autotest/server/hosts/servo_host.py", line 618, in verify
    self._repair_strategy.verify(self, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 685, in verify
    self._verify_root._verify_host(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 326, in _verify_host
    self._verify_dependencies(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 199, in _verify_dependencies
    self._verify_list(host, self._dependency_list, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 188, in _verify_list
    raise AutoservVerifyDependencyError(self, failures)
AutoservVerifyDependencyError: Servo initialize timed out.

This error doesn't happen 100% of the time, but is certainly happening a lot.

https://chromegw.corp.google.com/i/chromeos/builders/banon-release

I've been digging for a while, but don't understand the failures. I cloned and re-ran the most recent failure (on the same DUT and everything), but it passed.

http://chromeos-server100.mtv.corp.google.com/afe/#tab_id=view_job&object_id=121653340

That DUT is attached to a labstation, which is fairly new, and we're attaching more DUTs to them than originally intended.

Speculating wildly: could we be overloading them in some conditions, which causes servod to appear to hang, and then it self-corrects as load drops?
Owner: kevcheng@chromium.org
Checking chromeos6-row1-rack17-labstation, there are 143 servod processes. That seems wildly wrong.


Status: Fixed (was: Untriaged)
I found several servod configs that shared the same servo serial, which resulted in multiple servod processes on different ports for the same servo. That can cause servo init to time out. I removed the extraneous configs and shut down those servod processes. All should be well now.

Reopen if that is still a problem.

As for how this came about: these devices may have been deployed with an older version of the deployment script. With that version, if a deploy created the host but failed to install the image on the DUT, a redeploy would create a new duplicate servod config (and start a duplicate servod process) on a different port. This shouldn't be a problem anymore.

Is there any way to sanity check the configs lab wide for this kind of problem?

I was going to add a new verifier for servod to check for this problem. That should protect us moving forward.

Sweet!

I'm assuming that if one labstation was misconfigured, others are too. I just want a way to make sure they all get fixed.
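
For illustration, a check for servod configs that share a servo serial could look roughly like the sketch below. This is not the verifier that was added to autotest. It assumes each config is plain text with KEY=VALUE lines and a SERIAL key, and that the caller has already collected the config text per port from a labstation. The function names, the port-keyed mapping, and the sample serials are all hypothetical.

```python
import collections


def parse_config(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and comments.

    The KEY=VALUE layout is an assumption about the config format.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#') or '=' not in line:
            continue
        key, _, value = line.partition('=')
        values[key.strip()] = value.strip()
    return values


def find_duplicate_serials(configs):
    """Return {serial: sorted ports} for serials used by more than one config.

    `configs` maps a servod port to the raw text of its config
    (a hypothetical structure chosen for this sketch).
    """
    ports_by_serial = collections.defaultdict(list)
    for port, text in configs.items():
        serial = parse_config(text).get('SERIAL')
        if serial:
            ports_by_serial[serial].append(port)
    return {serial: sorted(ports)
            for serial, ports in ports_by_serial.items()
            if len(ports) > 1}


# Made-up sample data: two ports claim the same serial.
sample = {
    9999: 'BOARD=banon\nSERIAL=ABC123\n',
    9998: 'BOARD=banon\nSERIAL=ABC123\n',
    9997: 'BOARD=banon\nSERIAL=XYZ789\n',
}
print(find_duplicate_serials(sample))  # {'ABC123': [9998, 9999]}
```

Running something like this against every labstation would give a list of serials to clean up, along with the ports whose extra servod processes need to be shut down.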

Comment 10 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)