
Issue 884789

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 5
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




platform_PrinterPpds fails on daisy-skate (and daisy-spring)

Project Member Reported by ka...@chromium.org, Sep 17

Issue description

Dashboard view: https://stainless.corp.google.com/search?col=board&board=daisy&test=platform_PrinterPpds.100&exclude_not_run=true&view=matrix&row=build&days=7

Screenshot - https://screenshot.googleplex.com/TvtczRp7bnZ

daisy - all PASS
daisy-skate - mostly failing
daisy-spring - flaky

I'll get a daisy-skate board for pawliczek@ to take a closer look.
 
Cc: johndhong@chromium.org
This dashboard shows some of the units that fail - https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/testDetails?testName=platform_PrinterPpds&suite=&daysBack=7&board=daisy_skate&architecture=&boardFamily=&buildConfig=&reason=&version=&milestone=&dut=

They are
chromeos4-row9-rack7-host5
chromeos4-row9-rack7-host1
chromeos4-row9-rack6-host1
chromeos4-row9-rack5-host5

Piotr already checked a device from Test inventory at his desk, and it passed the test.

We need to get one or two boards from the list above. 
+johndhong@ 


I locked these hosts:
chromeos4-row9-rack7-host5
chromeos4-row9-rack7-host1
chromeos4-row9-rack6-host1
chromeos4-row9-rack5-host5
https://screenshot.googleplex.com/48mWfG5nmks

I will pick them up today and bring them to my desk.
I brought 3 DUTs from hosts
chromeos4-row9-rack7-host1
chromeos4-row9-rack6-host1
chromeos4-row9-rack5-host5

Piotr will re-run tests and diagnose next week.
Cc: englab-sys-cros@google.com
Rebalanced the BVT pool, since we had left it dangerously under-supported.

johndhong@phobrz:~$ balance_pool -t 6 bvt daisy_skate
daisy_skate bvt pool: Target of 6 is above minimum.

Balancing ['model:daisy_skate'] bvt pool:
Total 6 DUTs, 2 working, 4 broken, 0 reserved.
Target is 6 working DUTs; grow pool by 4 DUTs.
['model:daisy_skate'] suites pool has 19 spares available for balancing pool bvt
['model:daisy_skate'] bvt pool will return 4 broken DUTs, leaving 0 still in the pool.
ERROR: ['model:daisy_skate'] bvt pool: Refusing to act on pool with 4 broken DUTs.
ERROR: Please investigate this model to for a bug 
ERROR: that is bricking devices. Once you have finished your 
ERROR: investigation, you can force a rebalance with 
ERROR: --force-rebalance
Transferring 0 DUTs from bvt to suites.
Transferring 0 DUTs from suites to bvt.

johndhong@phobrz:~$ balance_pool --force-rebalance -t 6 bvt daisy_skate
daisy_skate bvt pool: Target of 6 is above minimum.

Balancing ['model:daisy_skate'] bvt pool:
Total 6 DUTs, 2 working, 4 broken, 0 reserved.
Target is 6 working DUTs; grow pool by 4 DUTs.
['model:daisy_skate'] suites pool has 19 spares available for balancing pool bvt
['model:daisy_skate'] bvt pool will return 4 broken DUTs, leaving 0 still in the pool.
Transferring 4 DUTs from bvt to suites.
Updating host: chromeos4-row9-rack5-host5.
Removing labels ['pool:bvt'] from host chromeos4-row9-rack5-host5
Adding labels ['pool:suites'] to host chromeos4-row9-rack5-host5
Updating host: chromeos4-row9-rack6-host1.
Removing labels ['pool:bvt'] from host chromeos4-row9-rack6-host1
Adding labels ['pool:suites'] to host chromeos4-row9-rack6-host1
Updating host: chromeos4-row9-rack7-host1.
Removing labels ['pool:bvt'] from host chromeos4-row9-rack7-host1
Adding labels ['pool:suites'] to host chromeos4-row9-rack7-host1
Updating host: chromeos4-row9-rack7-host5.
Removing labels ['pool:bvt'] from host chromeos4-row9-rack7-host5
Adding labels ['pool:suites'] to host chromeos4-row9-rack7-host5
Transferring 4 DUTs from suites to bvt.
Updating host: chromeos4-row9-rack6-host3.
Removing labels ['pool:suites'] from host chromeos4-row9-rack6-host3
Adding labels ['pool:bvt'] to host chromeos4-row9-rack6-host3
Updating host: chromeos4-row9-rack6-host5.
Removing labels ['pool:suites'] from host chromeos4-row9-rack6-host5
Adding labels ['pool:bvt'] to host chromeos4-row9-rack6-host5
Updating host: chromeos4-row9-rack5-host9.
Removing labels ['pool:suites'] from host chromeos4-row9-rack5-host9
Adding labels ['pool:bvt'] to host chromeos4-row9-rack5-host9
Updating host: chromeos4-row9-rack5-host11.
Removing labels ['pool:suites'] from host chromeos4-row9-rack5-host11
Adding labels ['pool:bvt'] to host chromeos4-row9-rack5-host11

Components: Internals>Printing>CUPS
Just to be sure: is there any plan to return them to the lab, or do we suspect they are broken beyond repair?

If so, please fill out go/atl-device-failure so that the lab team does not need to track them anymore.
The devices are at my desk and I'll bring them to 2081 tomorrow.

It is fine if you still need them as I placed a replacement request.
For daisy_skate, ONLY one host (chromeos4-row9-rack5-host11) is passing the test
https://screenshot.googleplex.com/STaPojrNAnP


For daisy spring ONLY one host (chromeos6-row2-rack16-host16) is failing the test
https://screenshot.googleplex.com/m3eKQANKP4F

I locked the host (at R71-11145.0.0) and re-ran the test (platform_PrinterPpds.ppd) from a local chroot against this host. The test passed within 7 minutes.

Most failing runs take more than 40 minutes to fail. The logs show where the time is lost:

10/04 09:48:24.846 DEBUG|          autotest:1281| Finished waiting on autotestd to start.
10/04 09:48:24.846 INFO |          autotest:1340| Finished waiting on autotestd to start.
10/04 09:48:25.424 DEBUG|          autotest:1281| AUTOTEST_STATUS::START	----	----	timestamp=1538671704	localtime=Oct 04 09:48:24	
10/04 09:48:25.425 INFO |        server_job:0217| START	----	----	timestamp=1538671704	localtime=Oct 04 09:48:24	
10/04 09:48:25.538 DEBUG|          autotest:1281| AUTOTEST_STATUS::	START	platform_PrinterPpds	platform_PrinterPpds	timestamp=1538671704	localtime=Oct 04 09:48:24	
10/04 09:48:25.539 INFO |        server_job:0217| 	START	platform_PrinterPpds	platform_PrinterPpds	timestamp=1538671704	localtime=Oct 04 09:48:24	

>>>> H E R E <<<<

10/04 10:19:19.044 DEBUG|          autotest:0956| Result exit status is 255.
10/04 10:19:19.049 DEBUG|             utils:0219| Running 'ping chromeos4-row9-rack5-host9 -w1 -c1'
10/04 10:19:20.336 DEBUG|             utils:0287| [stdout] PING chromeos4-row9-rack5-host9.cros.corp.google.com (100.115.200.57) 56(84) bytes of data.
10/04 10:19:20.336 DEBUG|             utils:0287| [stdout] 
10/04 10:19:20.336 DEBUG|             utils:0287| [stdout] --- chromeos4-row9-rack5-host9.cros.corp.google.com ping statistics ---
10/04 10:19:20.336 DEBUG|             utils:0287| [stdout] 1 packets transmitted, 0 received, 100% packet loss, time 0ms
10/04 10:19:20.337 DEBUG|             utils:0287| [stdout] 
10/04 10:19:20.341 INFO |        server_job:0217| 		FAIL	----	----	timestamp=1538673560	localtime=Oct 04 10:19:20	Autotest client terminated unexpectedly: DUT is no longer pingable, it may have rebooted or hung.

The client test log itself could not be collected. 
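
To locate a stall like the one marked above without reading the log by eye, the timestamps can be compared programmatically. The following is a minimal sketch, not part of autotest: the function names are hypothetical, it assumes every log line of interest starts with an `MM/DD HH:MM:SS.mmm` stamp as in the excerpt, and it assumes a year (the log carries none) so that the dates parse.

```python
from datetime import datetime

# A few lines copied from the log excerpt in this issue.
SAMPLE = [
    "10/04 09:48:24.846 DEBUG|          autotest:1281| Finished waiting on autotestd to start.",
    "10/04 09:48:24.846 INFO |          autotest:1340| Finished waiting on autotestd to start.",
    ">>>> H E R E <<<<",
    "10/04 10:19:19.044 DEBUG|          autotest:0956| Result exit status is 255.",
    "10/04 10:19:19.049 DEBUG|             utils:0219| Running 'ping chromeos4-row9-rack5-host9 -w1 -c1'",
]


def parse_timestamp(line, year=2018):
    """Return the leading 'MM/DD HH:MM:SS.mmm' stamp as a datetime, or None.

    The year is an assumption supplied by the caller; the log does not record it.
    """
    try:
        return datetime.strptime(f"{year}/{line[:18]}", "%Y/%m/%d %H:%M:%S.%f")
    except ValueError:
        return None  # line has no leading timestamp (blank line, marker, etc.)


def largest_gap(lines, year=2018):
    """Return (gap, line_before, line_after) for the longest silence, or None."""
    stamped = []
    for line in lines:
        stamp = parse_timestamp(line, year)
        if stamp is not None:
            stamped.append((stamp, line))

    best = None
    for (t0, before), (t1, after) in zip(stamped, stamped[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, before, after)
    return best
```

Calling `largest_gap(SAMPLE)` returns the gap as a `timedelta` together with the two lines that bracket it. On these sample lines the longest silence falls between the 09:48:24 and 10:19:19 entries, about half an hour.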

What would make the test stall for so long?
I guess that over that time the devices become so loaded that they hang, and the test is finally aborted.
Cc: stagenut@chromium.org matth...@chromium.org
pawliczek@, are we still investigating this?
No, I returned the devices a week ago.
The problem was caused by overheating.
I have sent Kalin the following (see attachments):
1. A simple script for testing the device (copy to /var/log/ and run for 3-5 minutes)
2. A plot of the CPU temperature recorded by the script for the 3 tested devices (2 of them always rebooted after 2-3 minutes, 1 always survived; the Y axis is temperature given as Celsius*1000)

As far as I know, the problem was redirected to the infrastructure team. It looks like a hardware problem with CPU cooling.
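
For readers without access to the attachments, here is a rough sketch of how a temperature trace like the one plotted could be collected. This is not the attached test_daisy.sh, whose contents are not shown in this thread. It assumes the CPU temperature is exposed through the standard Linux thermal sysfs node, which reports millidegrees Celsius (the same Celsius*1000 unit as the plot's Y axis). The zone index and all function names are hypothetical.

```python
import time
from pathlib import Path

# Assumption: thermal_zone0 is the CPU sensor on the device under test.
DEFAULT_SENSOR = Path("/sys/class/thermal/thermal_zone0/temp")


def read_millicelsius(sensor=DEFAULT_SENSOR):
    """Read one temperature sample in millidegrees Celsius."""
    return int(Path(sensor).read_text().strip())


def to_celsius(millicelsius):
    """Convert a millidegree reading to degrees Celsius."""
    return millicelsius / 1000.0


def sample(sensor=DEFAULT_SENSOR, count=5, interval_s=1.0, sleep=time.sleep):
    """Collect `count` readings, pausing `interval_s` seconds between them.

    `sleep` is injectable so the loop can be exercised without waiting.
    """
    readings = []
    for i in range(count):
        readings.append(read_millicelsius(sensor))
        if i + 1 < count:
            sleep(interval_s)
    return readings
```

To cover a 3-5 minute window like the one described above, something like `sample(count=300, interval_s=1.0)` would do. If the device reboots mid-run, the samples held in memory are lost, so appending each reading to a file as it is taken would be the safer variant.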
test_daisy.sh
346 bytes
plot.png
36.0 KB
Cc: -matth...@chromium.org
Labels: M-71
Status: Fixed (was: Untriaged)
