Issue 615474

Starred by 1 user

Issue metadata

Status: Verified
Owner: ----
Closed: May 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




bad device causing x86-alex-paladin bvt-cq to timeout.

Project Member Reported by waihong@chromium.org, May 27 2016

Issue description

The recent four builds (24593-24596) on x86-alex-paladin got a timeout abort in HwTest:
https://uberchromegw.corp.google.com/i/chromeos/builders/x86-alex-paladin

Previous good builds, like 24591, took ~24 min to run HwTest, but the failed ones took >90 min and were aborted.

I looked at the newly included CLs and they are not related to the test timeout.

 
Cc: -djkurtz@chromium.org wnhuang@chromium.org
Cc: dshi@chromium.org
Summary: suite finishes and all tests pass but suite marked as abort (was: x86-alex-paladin HwTest timeout abort)
This one I don't understand: it looks like all the jobs passed, but the suite got aborted anyway.

https://uberchromegw.corp.google.com/i/chromeos/builders/x86-alex-paladin/builds/24596/steps/HWTest%20%5Bbvt-cq%5D/logs/stdio

Suite timings:
Testing started at 2016-05-27 09:30:46
Testing ended at 2016-05-27 09:54:06

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/64813624-chromeos-test/hostless/

The .parse.log shows the abort happened at 2016-05-27 10:59:55:
STATUS: INFO	----	----	Job aborted by autotest_system on 2016-05-27 10:59:55

but the last test completed at May 27 09:53:56:
STATUS: END GOOD	64814136-chromeos-test/chromeos2-row1-rack6-host12/hardware_StorageWearoutDetect	hardware_StorageWearoutDetect	timestamp=1464368036	localtime=May 27 09:53:56


And the suite timings match up with when the last test ended, so... why the abort?
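For reference, a quick sanity check of those timestamps (a minimal sketch in Python; the UTC-7 offset is an assumption inferred from the timestamp/localtime pair in the status log):

from datetime import datetime, timedelta, timezone

# Offset inferred from timestamp=1464368036 vs. localtime=May 27 09:53:56
# in the status log (assumed lab timezone, UTC-7).
LAB_TZ = timezone(timedelta(hours=-7))

# Epoch timestamp from the last test's "END GOOD" record.
last_test_end = datetime.fromtimestamp(1464368036, tz=LAB_TZ)
print(last_test_end)                  # 2016-05-27 09:53:56-07:00

# Abort time reported in .parse.log, interpreted in the same timezone.
abort_time = datetime(2016, 5, 27, 10, 59, 55, tzinfo=LAB_TZ)
print(abort_time - last_test_end)     # 1:05:59 -- over an hour after the last test finished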
Components: Infra>Client>ChromeOS
Argh!  This sounds like bug 589673.

Summary: job gets scheduled but never runs causing suite to abort (was: suite finishes and all tests pass but suite marked as abort)
Oops! I glossed over an important bit from the HWTest stdio:

05-27-2016 [10:14:36] printing summary of incomplete jobs (1):

graphics_GLMark2.bvt-cq: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=64814075


In the debug log I see that it got scheduled at the very beginning, and then there is no mention of the test after that.

05/27 09:30:45.810 DEBUG|             suite:0825| Scheduling graphics_GLMark2.bvt-cq
05/27 09:30:46.165 DEBUG|             suite:1079| Adding job keyval for graphics_GLMark2.bvt-cq=64814075-chromeos-test


The job shows up in cautotest as aborted, with no logs.  Is it possible the scheduler ignored that job?
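One way to spot this kind of stuck job from the suite debug log (a rough sketch, not part of the suite code; it keys off the "Scheduling" and "Adding job keyval" lines shown above and assumes a test that actually ran gets mentioned again later in the log):

import re
import sys

def find_stuck_tests(debug_log_path):
    # Flag tests whose only appearances in the debug log are the initial
    # "Scheduling" and "Adding job keyval" lines, i.e. nothing was ever
    # reported back for them.
    with open(debug_log_path) as f:
        lines = f.readlines()

    scheduled = set()
    for line in lines:
        m = re.search(r'Scheduling (\S+)', line)
        if m:
            scheduled.add(m.group(1))

    stuck = []
    for test in sorted(scheduled):
        other_mentions = [ln for ln in lines
                          if test in ln
                          and 'Scheduling ' not in ln
                          and 'Adding job keyval' not in ln]
        if not other_mentions:
            stuck.append(test)
    return stuck

if __name__ == '__main__':
    for test in find_stuck_tests(sys.argv[1]):
        print('scheduled but never heard from again:', test)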
The problem was probably caused by the host chromeos2-row1-rack5-host7:
http://cautotest.corp.google.com/afe/#tab_id=view_host&object_id=480

The incomplete jobs from all of the recent four builds ran on this same host.
Summary: bad device causing x86-alex-paladin bvt-cq to timeout. (was: job gets scheduled but never runs causing suite to abort)
Ah... and I see that provisioning keeps failing but repair works.  Provisioning keeps getting stuck wget'ing http://172.17.40.28:8082/static/x86-alex-paladin/R53-8376.0.0-rc1/autotest/packages/client-autotest.tar.bz2 (I can download the file fine).

I tried running the command on the DUT and it just stays stuck there.  There might be an issue with the device's SSD; from /var/log/messages:
2016-05-27T19:42:01.424363+00:00 ERR kernel: [53990.384899] sd 4:0:0:0: [sdb] Asking for cache data failed
2016-05-27T19:42:01.424421+00:00 ERR kernel: [53990.384922] sd 4:0:0:0: [sdb] Assuming drive cache: write through

I'm going to lock this device and ask for repair on it.
I logged in to the host and verified that it did fail to fetch the file using wget:

# wget http://172.17.40.28:8082/static/x86-alex-paladin/R53-8376.0.0-rc1/autotest/packages/client-autotest.tar.bz2
--2016-05-27 12:49:48--  http://172.17.40.28:8082/static/x86-alex-paladin/R53-8376.0.0-rc1/autotest/packages/client-autotest.tar.bz2
Connecting to 172.17.40.28:8082... connected.
HTTP request sent, awaiting response... 


We need someone from the lab to check the network settings.
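FWIW, the same check can be scripted with a hard timeout so it fails fast instead of hanging the way wget does (a minimal sketch; the URL is the one from the provision failure above):

import socket
import urllib.error
import urllib.request

URL = ('http://172.17.40.28:8082/static/x86-alex-paladin/'
       'R53-8376.0.0-rc1/autotest/packages/client-autotest.tar.bz2')

try:
    # Give the devserver 30 seconds to answer instead of waiting forever.
    with urllib.request.urlopen(URL, timeout=30) as resp:
        print('HTTP', resp.status, '-', resp.headers.get('Content-Length'), 'bytes')
except (socket.timeout, urllib.error.URLError) as e:
    # On the bad DUT this is expected to be a timeout: the TCP connect
    # succeeds but no HTTP response ever comes back.
    print('fetch failed:', repr(e))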
Looks like all outbound connections from the host are blocked. I can't ssh from the host to the outside world.
Before you file a ticket for that DUT, there are still things we can/should do from here.

In particular, why is there an sdb?

Why can't we reach the devserver (especially since we can get into the DUT)?
The SSD error doesn't matter. The network issue is the cause.

$ strace wget http://172.17.40.28:8082/static/x86-alex-paladin/R53-8376.0.0-rc1/autotest/packages/client-autotest.tar.bz2
...
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(8082), sin_addr=inet_addr("172.17.40.28")}, 16) = 0
write(2, "connected.\n", 11connected.
)            = 11
select(4, NULL, [3], NULL, {900, 0})    = 1 (out [3], left {899, 999989})
write(3, "GET /static/x86-alex-paladin/R53"..., 226) = 226
write(2, "HTTP request sent, awaiting resp"..., 40HTTP request sent, awaiting response... ) = 40
select(4, [3], NULL, NULL, {900, 0}

....Timeout!

Actually, the host can send the first message out, but the response seems to be blocked.
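The same distinction (connect and send succeed, but the reply never arrives) can be shown with a small socket-level check mirroring the strace above (a sketch only; the hand-rolled HTTP request is simplified and has no error handling):

import socket

HOST, PORT = '172.17.40.28', 8082
PATH = ('/static/x86-alex-paladin/R53-8376.0.0-rc1/'
        'autotest/packages/client-autotest.tar.bz2')

s = socket.create_connection((HOST, PORT), timeout=10)
print('TCP connect OK')                  # succeeds, as in the strace

request = 'GET {} HTTP/1.1\r\nHost: {}:{}\r\nConnection: close\r\n\r\n'.format(
    PATH, HOST, PORT)
s.sendall(request.encode('ascii'))
print('request sent OK')                 # the write() also succeeds

s.settimeout(30)
try:
    data = s.recv(4096)                  # on the bad DUT this never returns
    print('got response:', data.split(b'\r\n', 1)[0].decode('ascii', 'replace'))
except socket.timeout:
    print('no reply within 30s: return traffic is being blocked somewhere')
finally:
    s.close()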
I rebooted the machine. It works fine now.
> The SSD error doesn't matter. The network issue is the cause.

There was no SSD error.  'sdb' is some sort of removable storage.
It may be the SD card reader.

More info: I ran `balance_pool cq x86-alex` earlier, so that the DUT won't be affecting the CQ any more.

I've unlocked the DUT; it seems to be working, and after being rebooted, it wiped the logs from stateful.  So there's not much debugging to be had from the DUT at this point.

Status: Fixed (was: Untriaged)
The new build came back good.
Closing... please feel free to reopen if it's not fixed.
Status: Verified (was: Fixed)
