
Issue 870051 link

Starred by 3 users

Issue metadata

Status: WontFix
Owner: (OOO)
Closed: Oct 12
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




Servo initialization frequently times out - Failed to start XMLRPC server...FAIL retry exception (function="ready_test()"), timeout = 60s

Reported by jrbarnette@chromium.org, Aug 1

Issue description

Looking through recent logs, I've been seeing a lot of this
symptom during servo verification:
	FAIL	----	verify.servod	timestamp=1533143321	localtime=Aug 01 10:08:41	retry exception (function="ready_test()"), timeout = 60s

For a sample, here's a recent repair task:
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row1-rack9-host5/571154-repair/

Looking into the debug log, you see this:
08/01 10:07:41.095 DEBUG|      abstract_ssh:0942| Started ssh tunnel, local = 34489 remote = 9999, pid = 33042
08/01 10:07:41.095 INFO |rpc_server_tracker:0196| Waiting 60 seconds for XMLRPC server to start.
08/01 10:07:41.096 WARNI|             retry:0228| <class 'socket.error'>([Errno 111] Connection refused)
08/01 10:07:41.097 WARNI|             retry:0183| Retrying in 0.838720 seconds...
08/01 10:07:41.941 WARNI|             retry:0228| <class 'socket.error'>([Errno 111] Connection refused)
08/01 10:07:41.941 WARNI|             retry:0183| Retrying in 1.245816 seconds...
08/01 10:07:43.194 WARNI|             retry:0228| <class 'socket.error'>([Errno 111] Connection refused)
08/01 10:07:43.194 WARNI|             retry:0183| Retrying in 1.318803 seconds...
08/01 10:08:41.520 ERROR|rpc_server_tracker:0201| Failed to start XMLRPC server.
08/01 10:08:41.520 DEBUG|      abstract_ssh:0957| Terminated tunnel, pid 33042
08/01 10:08:41.520 ERROR|            repair:0354| Failed: servod service is taking calls

The retries on "Connection refused" are normal; that's the start-up
lag for the ssh tunnel.  The log shows the retries stopped after about
2 seconds.  That _should_ be the indicator that the tunnel is working.
Instead, we see the timeout.
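
For reference, the wait loop in the log roughly corresponds to something like this (a minimal Python sketch, not the actual rpc_server_tracker code; the host, port, and back-off values are illustrative):

    import random
    import socket
    import time

    def wait_for_port(host, port, timeout=60):
        """Poll a TCP port until it accepts connections or the timeout expires."""
        deadline = time.time() + timeout
        delay = 0.5
        while time.time() < deadline:
            try:
                # "Connection refused" here is just the start-up lag of the tunnel.
                socket.create_connection((host, port), 5).close()
                return True
            except socket.error:
                time.sleep(delay)
                delay = min(delay * random.uniform(1.2, 1.6), 5)
        return False

In the failing runs the refusals stop after about 2 seconds (so the tunnel port is accepting connections), yet the 60-second wait still expires, which suggests the stall is in the ready_test() call on top of the tunnel rather than in the tunnel itself.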

I checked connecting directly to the servo, and the servo is working fine;
the problem appears to be on the client side.
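
A direct check like that can be made with a plain XMLRPC client, bypassing the autotest tunnel entirely; a minimal sketch (the hostname is an example, port 9999 is the servod port seen in the log above, and get() assumes servod's standard control interface):

    import xmlrpclib  # xmlrpc.client on Python 3

    # Talk to servod on the servo host directly, without the ssh tunnel.
    servo = xmlrpclib.ServerProxy('http://chromeos4-row1-rack9-host5-servo:9999')
    print(servo.get('ec_board'))  # any simple control read confirms servod answers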

 
Same issue has been found @ chromeos2-row4-rack1-host{7,8}.
Cc: dchan@google.com zamorzaev@chromium.org
Components: Test
Labels: -Pri-2 Pri-1
Servo tests for several hosts in chromeos15 are failing on OOB suites with this reason starting on 2018-09-28.

usb_detect - https://screenshot.googleplex.com/1ozLoH6tCMA
display_LidCloseOpen on chameleon suite - https://screenshot.googleplex.com/J6pBtE3XFQy

I am increasing the priority as other suites are probably affected.
I see the wifi suite(s) have had it since earlier this month too.

Regarding c#1, do you manually scan the logs, or is there a tool to do that?
Also, the board is leon; could it be a bad servo?

For c#2, regarding display_LidCloseOpen: I assume it passed before R69?
Cc: harpreet@chromium.org dschimmels@chromium.org kmshelton@chromium.org jashur@chromium.org
Labels: servov3
Several suites are affected - a last-7-days query shows:
Screenshots: 
https://screenshot.googleplex.com/t3AsHX2ejF3
https://screenshot.googleplex.com/orCTm3iKgnS

Suites: 
- bluetooth_sanity
- bluetooth_stress
- chameleon_hdmi_nightly
- chameleon_hdmi_perbuild
- ent-nightly
- faft_flashrom
- platform_test_nightly
- usb_detect
- wifi_matfunc
- wifi_perf
- wifi_release
- wifi_update_router

Most hosts (all but 3) are in chromeos15:
chromeos15-row1-rack1-host2
chromeos15-row1-rack1-host4
chromeos15-row1-rack2-host2
chromeos15-row1-rack5-host3
chromeos15-row1-rack5-host7
chromeos15-row1-rack6-host7
chromeos15-row1-rack8-host4
chromeos15-row1-rack10-host1
chromeos15-row1-rack10-host4
chromeos15-row1-rack11-host5
chromeos15-row1-rack11-host7
chromeos15-row2-rack1-host1
chromeos15-row3-rack2-host6
chromeos15-row4-rack12-host3
chromeos15-row13a-rack1-host5
chromeos15-row13a-rack1-host6
chromeos15-row13a-rack1-host7
chromeos15-row13a-rack1-host11
chromeos15-row13a-rack2-host9
chromeos15-row13a-rack2-host12
chromeos15-row13a-rack3-host7
chromeos15-row13a-rack4-host3
chromeos15-row13a-rack4-host4
chromeos15-row13b-rack2-host8
chromeos15-row13b-rack2-host9
chromeos15-row13b-rack4-host4
chromeos15-row13b-rack5-host1
chromeos15-row13b-rack5-host3

Only 3 hosts are in chromeos2:
chromeos2-row8-rack1-host22
chromeos2-row11-rack3-host4
chromeos2-row11-rack6-host1

+ some more TE folks
Cc: gu...@chromium.org
+ This Week's Infra Deputy guocb@
Cc: waihong@chromium.org
Summary: Servo initialization frequently times out - Failed to start XMLRPC server...FAIL retry exception (function="ready_test()"), timeout = 60s (was: Servo initialization frequently times out)
Owner: gu...@chromium.org
Congbin, please check the network connection to these servos.
From my desk and the testers' desks we are able to ssh as root@ to the servo hosts for these hostnames.

Tom seems to get a connection timeout.

Joe and David, AFAIK there was a transition to the new net-block last week. Could that have anything to do with this issue?
Please check if the lab stations see these servos.

From my corp workstation, most of the servos are up (only 4 of them are down).
$ servo_monitor $(cat duts.txt)
chromeos15-row1-rack2-host2-servo        down
chromeos15-row1-rack8-host4-servo        down
chromeos15-row1-rack10-host1-servo       down
chromeos2-row11-rack6-host1-servo        down

So it does not seem to be a servo issue.
Cc: cros-conn-test-team@google.com
Could this be caused by the bad dev_server in issue 891765?
Re c#11, I don't think so. The devservers in issue 891765 were only deployed this Tue, i.e. Oct 2.
Status: Assigned (was: Available)
guocb@, are you diagnosing this or should it go to someone else?
I am not sure if we should care about hardware failures in chromeos15. Anyway, I checked the servos of the DUTs listed in c#4; some of them were not pingable.

PING chromeos15-row1-rack10-host1-servo.cros.corp.google.com (100.115.125.1) 56(84) bytes of data.

--- chromeos15-row1-rack10-host1-servo.cros.corp.google.com ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

PING chromeos15-row1-rack8-host4-servo.cros.corp.google.com (100.115.124.211) 56(84) bytes of data.

--- chromeos15-row1-rack8-host4-servo.cros.corp.google.com ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

PING chromeos2-row11-rack6-host1-servo.cros.corp.google.com (100.115.244.131) 56(84) bytes of data.

--- chromeos2-row11-rack6-host1-servo.cros.corp.google.com ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
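
The same reachability sweep can be scripted over the full host list from c#4; a minimal sketch (only a few hostnames shown, and the single-echo check mirrors the output above):

    import os
    import subprocess

    SERVOS = [
        'chromeos15-row1-rack10-host1-servo.cros.corp.google.com',
        'chromeos15-row1-rack8-host4-servo.cros.corp.google.com',
        'chromeos2-row11-rack6-host1-servo.cros.corp.google.com',
        # ... remaining servo hostnames from c#4
    ]

    with open(os.devnull, 'w') as devnull:
        for host in SERVOS:
            # One ICMP echo per host with a 2-second wait; non-zero exit = no reply.
            rc = subprocess.call(['ping', '-c', '1', '-W', '2', host],
                                 stdout=devnull, stderr=devnull)
            print('%-60s %s' % (host, 'up' if rc == 0 else 'down'))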

Cc: -cros-conn-test-team@google.com akes...@chromium.org
I am not sure what you mean by 'if we should care about hardware failures in chromeos15'? 

The problem is that we are having issues with consistent test execution, regardless of whether it's hardware or software.

+akeshet, it looks like guocb@ is OOO till 10/21; can someone help take a look at this issue?

+kalin/harpreet are also pursuing a similar issue; not sure if they are related.


Status: WontFix (was: Assigned)
Other issues interfered and probably caused this bug. Closing it, as the hosts' servos have started WAI (working as intended).
