Servo initialization frequently times out - Failed to start XMLRPC server...FAIL retry exception (function="ready_test()"), timeout = 60s
Reported by
jrbarnette@chromium.org,
Aug 1
Issue description
Looking through recent logs, I've been seeing a lot of this
symptom during servo verification:
FAIL ---- verify.servod timestamp=1533143321 localtime=Aug 01 10:08:41 retry exception (function="ready_test()"), timeout = 60s
For a sample, here's a recent repair task:
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row1-rack9-host5/571154-repair/
Looking into the debug log, you see this:
08/01 10:07:41.095 DEBUG| abstract_ssh:0942| Started ssh tunnel, local = 34489 remote = 9999, pid = 33042
08/01 10:07:41.095 INFO |rpc_server_tracker:0196| Waiting 60 seconds for XMLRPC server to start.
08/01 10:07:41.096 WARNI| retry:0228| <class 'socket.error'>([Errno 111] Connection refused)
08/01 10:07:41.097 WARNI| retry:0183| Retrying in 0.838720 seconds...
08/01 10:07:41.941 WARNI| retry:0228| <class 'socket.error'>([Errno 111] Connection refused)
08/01 10:07:41.941 WARNI| retry:0183| Retrying in 1.245816 seconds...
08/01 10:07:43.194 WARNI| retry:0228| <class 'socket.error'>([Errno 111] Connection refused)
08/01 10:07:43.194 WARNI| retry:0183| Retrying in 1.318803 seconds...
08/01 10:08:41.520 ERROR|rpc_server_tracker:0201| Failed to start XMLRPC server.
08/01 10:08:41.520 DEBUG| abstract_ssh:0957| Terminated tunnel, pid 33042
08/01 10:08:41.520 ERROR| repair:0354| Failed: servod service is taking calls
The retries on "Connection refused" are normal; that's the start-up
lag for the ssh tunnel. The log shows the retries stopped after about
2 seconds. That _should_ be the indicator that the tunnel is working.
Instead, we see the timeout.
I checked connecting directly to the servo, and the servo is working fine;
the problem appears to be on the client side.
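One way to spot this symptom without reading each debug log by hand is to measure how long the log stays silent between the last "Connection refused" retry and the "Failed to start XMLRPC server" error. The sketch below is only an illustration and not an existing autotest tool: the names `parse_line` and `summarize_xmlrpc_wait` are hypothetical, the line format is inferred from the excerpt above, and the year is an assumption because the log timestamps do not carry one.

```python
import re
from datetime import datetime

# Prefix format inferred from the excerpt above, e.g.
# "08/01 10:07:41.095 WARNI| retry:0228| message".
LINE_RE = re.compile(
    r'^(\d\d/\d\d \d\d:\d\d:\d\d\.\d{3}) (\w+)\s*\|\s*([\w.]+):(\d+)\| (.*)$')


def parse_line(line, year=2018):
    """Split one debug-log line into (time, level, module, message).

    Returns None for lines that do not match the assumed format.
    The year is an assumption; the log lines do not include one.
    """
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    stamp, level, module, _lineno, message = m.groups()
    when = datetime.strptime('%d/%s' % (year, stamp), '%Y/%m/%d %H:%M:%S.%f')
    return when, level, module, message


def summarize_xmlrpc_wait(lines, year=2018):
    """Summarize the first XMLRPC wait that ends in a start-up failure.

    Returns None if no such failure is found. 'silent_gap_s' is measured
    from the last logged retry, so it includes that retry's own delay.
    """
    start = last_retry = failed = None
    retries = 0
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        when, _level, module, message = parsed
        if 'Waiting' in message and 'XMLRPC server to start' in message:
            # A new wait begins; reset the counters.
            start, last_retry, retries = when, None, 0
        elif start is not None and module == 'retry' \
                and message.startswith('Retrying in'):
            retries += 1
            last_retry = when
        elif start is not None and 'Failed to start XMLRPC server' in message:
            failed = when
            break
    if start is None or failed is None:
        return None
    quiet_from = last_retry if last_retry is not None else start
    return {
        'retries': retries,
        'retry_window_s': (quiet_from - start).total_seconds(),
        'silent_gap_s': (failed - quiet_from).total_seconds(),
        'total_s': (failed - start).total_seconds(),
    }
```

A short retry window followed by a silent gap close to the full 60-second timeout matches the pattern described above: the tunnel stopped refusing connections, but the readiness check never returned.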
,
Sep 30
Servo tests on several hosts in chromeos15 have been failing on OOB suites for this reason since 2018-09-28.
usb_detect - https://screenshot.googleplex.com/1ozLoH6tCMA
display_LidCloseOpen on chameleon suite - https://screenshot.googleplex.com/J6pBtE3XFQy
I am increasing the priority, as other suites are probably affected. I see wifi suite(s) have had it since earlier this month too.
,
Sep 30
Regarding c#1: I assume you scanned the log manually, or is there a tool to do that? Also, the board is leon; could it be a bad servo? Regarding c#2: for display_LidCloseOpen, I assume it passed before R69?
,
Oct 1
Several suites are affected; a query over the last 7 days shows the following.
Screenshots:
https://screenshot.googleplex.com/t3AsHX2ejF3
https://screenshot.googleplex.com/orCTm3iKgnS
Suites:
- bluetooth_sanity
- bluetooth_stress
- chameleon_hdmi_nightly
- chameleon_hdmi_perbuild
- ent-nightly
- faft_flashrom
- platform_test_nightly
- usb_detect
- wifi_matfunc
- wifi_perf
- wifi_release
- wifi_update_router
Most hosts (all but 3) are in chromeos15:
chromeos15-row1-rack1-host2
chromeos15-row1-rack1-host4
chromeos15-row1-rack2-host2
chromeos15-row1-rack5-host3
chromeos15-row1-rack5-host7
chromeos15-row1-rack6-host7
chromeos15-row1-rack8-host4
chromeos15-row1-rack10-host1
chromeos15-row1-rack10-host4
chromeos15-row1-rack11-host5
chromeos15-row1-rack11-host7
chromeos15-row2-rack1-host1
chromeos15-row3-rack2-host6
chromeos15-row4-rack12-host3
chromeos15-row13a-rack1-host5
chromeos15-row13a-rack1-host6
chromeos15-row13a-rack1-host7
chromeos15-row13a-rack1-host11
chromeos15-row13a-rack2-host9
chromeos15-row13a-rack2-host12
chromeos15-row13a-rack3-host7
chromeos15-row13a-rack4-host3
chromeos15-row13a-rack4-host4
chromeos15-row13b-rack2-host8
chromeos15-row13b-rack2-host9
chromeos15-row13b-rack4-host4
chromeos15-row13b-rack5-host1
chromeos15-row13b-rack5-host3
Only 3 hosts are in chromeos2:
chromeos2-row8-rack1-host22
chromeos2-row11-rack3-host4
chromeos2-row11-rack6-host1
+ some more TE folks
,
Oct 1
+ This Week's Infra Deputy guocb@
,
Oct 2
,
Oct 2
Congbin, please check the network connection to these servos.
,
Oct 2
From my desk and the testers' desks we are able to ssh root@ to the servo hosts for these hostnames. Tom seems to get a connection timeout. Joe and David, AFAIK there was a transition to the new net-block last week. Could that be related to this issue?
,
Oct 2
Please check whether the lab stations can see these servos. From my corp workstation, most of the servos are up (only 4 of them are down).
$ servo_monitor $(cat duts.txt)
chromeos15-row1-rack2-host2-servo down
chromeos15-row1-rack8-host4-servo down
chromeos15-row1-rack10-host1-servo down
chromeos2-row11-rack6-host1-servo down
So it does not seem to be a servo issue.
,
Oct 3
,
Oct 4
Could this be caused by the bad dev_server in issue 891765?
,
Oct 5
Re c#11, I don't think so. The devservers in issue 891765 were only deployed this Tuesday, i.e. Oct 2.
,
Oct 5
guocb@, are you diagnosing this or should it go to someone else?
,
Oct 5
I am not sure if we should care about hardware failures in chromeos15. Anyway, I checked the servos of the DUTs listed in c#4; some of them were not pingable.
PING chromeos15-row1-rack10-host1-servo.cros.corp.google.com (100.115.125.1) 56(84) bytes of data.
--- chromeos15-row1-rack10-host1-servo.cros.corp.google.com ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
PING chromeos15-row1-rack8-host4-servo.cros.corp.google.com (100.115.124.211) 56(84) bytes of data.
--- chromeos15-row1-rack8-host4-servo.cros.corp.google.com ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
PING chromeos2-row11-rack6-host1-servo.cros.corp.google.com (100.115.244.131) 56(84) bytes of data.
--- chromeos2-row11-rack6-host1-servo.cros.corp.google.com ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
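When checking many servo hosts at once, the ping summaries can be reduced to a list of unreachable hosts. The following is a minimal sketch, assuming the Linux `ping` summary format shown in the output above. The function names `parse_ping_stats` and `unreachable_hosts` are hypothetical and not part of any lab tooling.

```python
import re

# Matches the summary that Linux ping prints per host, as in the output
# above. The whitespace between the two summary lines may be a newline
# or a single space, so flattened output is handled as well.
STATS_RE = re.compile(
    r'--- (\S+) ping statistics ---\s+'
    r'(\d+) packets transmitted, (\d+) received, '
    r'(?:\+\d+ errors, )?(\d+(?:\.\d+)?)% packet loss')


def parse_ping_stats(text):
    """Map each pinged host name to its sent/received/loss figures."""
    results = {}
    for host, sent, received, loss in STATS_RE.findall(text):
        results[host] = {
            'sent': int(sent),
            'received': int(received),
            'loss_pct': float(loss),
        }
    return results


def unreachable_hosts(text):
    """Return the sorted host names that answered no pings at all."""
    stats = parse_ping_stats(text)
    return sorted(h for h, s in stats.items() if s['received'] == 0)
```

Note that a failed ping only shows that the host did not answer ICMP from the machine running the check. As the earlier comments show, reachability can differ between a corp workstation and the lab.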
,
Oct 8
I am not sure what you mean by 'if we should care about hardware failures in chromeos15'. The problem is that we are having issues with consistent test execution, regardless of whether it's hardware or software. +akeshet, it looks like guocb@ is OOO till 10/21; can someone help take a look at this issue? +kalin/harpreet are also pursuing a similar issue; not sure if they are related.
,
Oct 12
Other issues interfered with, and probably caused, this bug. Closing it, as the hosts' servos have started WAI.
Comment 1 by demitrio@chromium.org
, Sep 20
The same issue has been found at chromeos2-row4-rack1-host{7,8}.