[Canary] banon-release: Servo initialize timed out during provision in HWTest phase
Issue description

Where the issue happened: Canary banon-release
What the issue was: Servo initialize timed out.
When the issue started: This failure has been there since build #1181 (see https://chromegw.corp.google.com/i/chromeos/builders/banon-release?numbuilds=50.)
Error messages:
06-05-2017 [12:50:05] Created suite job: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=121613966
@@@STEP_LINK@Link to suite@http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=121613966@@@
06-05-2017 [13:11:44] Suite job is finished.
06-05-2017 [13:11:44] Start collecting test results and dump them to json.
Suite job    [ PASSED ]
provision    [ FAILED ]
provision    FAIL: Servo initialize timed out., command execution error, completed successfully
Jun 5 2017
This error doesn't happen 100% of the time, but is certainly happening a lot. https://chromegw.corp.google.com/i/chromeos/builders/banon-release
Jun 5 2017
I've been digging for a while, but don't understand the failures. I cloned and re-ran the most recent failure (on the same DUT and everything), but it passed. http://chromeos-server100.mtv.corp.google.com/afe/#tab_id=view_job&object_id=121653340
Jun 6 2017
That DUT is attached to a labstation, which is fairly new, and we're attaching more DUTs to them than originally intended. Speculating wildly: could we be overloading them in some conditions, which causes servod to appear to hang, and then it self-corrects as load drops?
Jun 6 2017
Checking chromeos6-row1-rack17-labstation, there are 143 servod processes. That seems wildly wrong.
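For reference, here is a rough sketch of how one might tally servod processes on a labstation and see which serials have more than one instance. It assumes each servod command line carries "--port <N>" and "--serialname <SERIAL>", which may not match how servod is actually launched on a given labstation image, so treat it as illustrative only:

```python
#!/usr/bin/env python3
# Rough sketch: count servod processes and group them by servo serial so
# duplicates stand out. Assumes servod command lines include "--port" and
# "--serialname"; adjust the regexes if the flags differ on the labstation.
import collections
import re
import subprocess

def servod_cmdlines():
    # pgrep -af prints "<pid> <full command line>" for every matching process.
    out = subprocess.run(['pgrep', '-af', 'servod'],
                         capture_output=True, text=True).stdout
    return [line for line in out.splitlines() if line.strip()]

def main():
    cmdlines = servod_cmdlines()
    ports_by_serial = collections.defaultdict(list)
    for line in cmdlines:
        serial = re.search(r'--serialname[= ](\S+)', line)
        port = re.search(r'--port[= ](\d+)', line)
        key = serial.group(1) if serial else '<unknown>'
        ports_by_serial[key].append(port.group(1) if port else '<unknown>')
    print('total servod processes: %d' % len(cmdlines))
    for serial, ports in sorted(ports_by_serial.items()):
        note = '  <-- duplicate' if len(ports) > 1 else ''
        print('%s: ports %s%s' % (serial, ', '.join(ports), note))

if __name__ == '__main__':
    main()
```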
Jun 6 2017
I found several servod configs that shared the same servo serial (and consequently multiple servod processes on different ports for the same servo serial, which can cause servo init to time out). I removed the extraneous configs and shut down those servod processes. All should be well now; reopen if this is still a problem. As for how this came about: these devices may have been deployed with an older version of the deployment script, where a deploy that created the host but failed to install the image on the DUT would, on redeploy, create a new duplicate servod config (and start a duplicate servod process) on a different port. This shouldn't be a problem anymore.
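As a sketch of how one could audit a single labstation for this (and, run across all labstations, answer the lab-wide question in the next comment), something like the following could work. It assumes per-port servod configs live under a path like /var/lib/servod/ as files of KEY=VALUE lines containing a SERIAL entry; both the path and the file format here are assumptions, not the actual deployment layout:

```python
#!/usr/bin/env python3
# Rough sketch: scan per-port servod config files on a labstation and flag
# serials that appear in more than one config (i.e. on more than one port).
# The glob pattern and KEY=VALUE layout are assumptions; adjust them to match
# how servod configs are actually stored on the labstation.
import collections
import glob
import os

CONFIG_GLOB = '/var/lib/servod/config_*'   # assumed location of per-port configs

def read_serial(path):
    """Return the SERIAL value from a KEY=VALUE style config file, or None."""
    with open(path) as f:
        for line in f:
            key, _, value = line.strip().partition('=')
            if key == 'SERIAL':
                return value
    return None

def main():
    configs_by_serial = collections.defaultdict(list)
    for path in glob.glob(CONFIG_GLOB):
        serial = read_serial(path)
        if serial:
            configs_by_serial[serial].append(os.path.basename(path))
    for serial, configs in sorted(configs_by_serial.items()):
        if len(configs) > 1:
            print('DUPLICATE serial %s in: %s' % (serial, ', '.join(configs)))

if __name__ == '__main__':
    main()
```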
Jun 6 2017
Is there any way to sanity-check the configs lab-wide for this kind of problem?
Jun 6 2017
I was going to add a new verifier for servod to check for this problem. That should protect us moving forward.
Jun 6 2017
Sweet! I'm assuming that if one labstation was misconfigured, there's more than one; I just want a way to make sure they all get fixed.
Jan 22 2018