Bad device causing x86-alex-paladin bvt-cq to time out
Issue description
The HWTest stage of the 4 most recent builds (24593-24596) on x86-alex-paladin was aborted on timeout: https://uberchromegw.corp.google.com/i/chromeos/builders/x86-alex-paladin
Some earlier good builds, such as 24591, took ~24 min to run HWTest, but the failed ones ran for >90 min and were aborted. I looked at the incoming CLs and they are not related to the test timeout.
,
May 27 2016
This one I don't understand: it looks like all the jobs passed, but the suite got aborted anyway.
https://uberchromegw.corp.google.com/i/chromeos/builders/x86-alex-paladin/builds/24596/steps/HWTest%20%5Bbvt-cq%5D/logs/stdio
Suite timings:
Testing started at 2016-05-27 09:30:46
Testing ended at 2016-05-27 09:54:06
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/64813624-chromeos-test/hostless/
The .parse.log shows the abort happened at 2016-05-27 10:59:55:
STATUS: INFO ---- ---- Job aborted by autotest_system on 2016-05-27 10:59:55
but the last test completed at May 27 09:53:56:
STATUS: END GOOD 64814136-chromeos-test/chromeos2-row1-rack6-host12/hardware_StorageWearoutDetect hardware_StorageWearoutDetect timestamp=1464368036 localtime=May 27 09:53:56
And the suite timings match up with when the last test ended, so... why the abort?
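To make the gap between the last completed test and the abort concrete, here is a small sketch that decodes the `timestamp=1464368036` keyval and subtracts it from the abort time. It assumes the lab logs are in Pacific Daylight Time (UTC-7) and that the abort timestamp is in the same zone; neither is stated explicitly in the logs.

```python
from datetime import datetime, timedelta, timezone

# Assumption: lab log times are PDT (UTC-7); the logs do not name the zone.
PDT = timezone(timedelta(hours=-7))

# Epoch value from the "END GOOD ... timestamp=1464368036" status line.
last_test_end = datetime.fromtimestamp(1464368036, tz=PDT)

# "Job aborted by autotest_system on 2016-05-27 10:59:55" from .parse.log.
abort_time = datetime(2016, 5, 27, 10, 59, 55, tzinfo=PDT)

# Time the suite sat idle between the last finished test and the abort.
gap = abort_time - last_test_end

print(last_test_end.strftime("%Y-%m-%d %H:%M:%S"))
print(gap)
```

Under that assumption the decoded timestamp agrees with the `localtime=May 27 09:53:56` field, and the suite sat for a little over an hour after the last test finished before it was aborted.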
,
May 27 2016
Argh! This sounds like bug 589673.
,
May 27 2016
Oops! I glossed over an important bit from the HWTest stdio:
05-27-2016 [10:14:36] printing summary of incomplete jobs (1):
graphics_GLMark2.bvt-cq: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=64814075
In the debug log I see that it got scheduled at the very beginning, and then there is no mention of the test after that.
05/27 09:30:45.810 DEBUG| suite:0825| Scheduling graphics_GLMark2.bvt-cq
05/27 09:30:46.165 DEBUG| suite:1079| Adding job keyval for graphics_GLMark2.bvt-cq=64814075-chromeos-test
The job shows up in cautotest as aborted, with no logs. Is it possible the scheduler ignored that job?
,
May 27 2016
The problem was probably caused by the host chromeos2-row1-rack5-host7: http://cautotest.corp.google.com/afe/#tab_id=view_host&object_id=480
The incomplete jobs from the 4 most recent builds all ran on this same host.
,
May 27 2016
Ah... and I see that provision keeps failing but repair works. Provision keeps getting stuck wget'ing http://172.17.40.28:8082/static/x86-alex-paladin/R53-8376.0.0-rc1/autotest/packages/client-autotest.tar.bz2 (I can download the file fine). I tried running the command on the DUT and it just stays stuck there.
There might be an issue with the device's SSD. In /var/log/messages:
2016-05-27T19:42:01.424363+00:00 ERR kernel: [53990.384899] sd 4:0:0:0: [sdb] Asking for cache data failed
2016-05-27T19:42:01.424421+00:00 ERR kernel: [53990.384922] sd 4:0:0:0: [sdb] Assuming drive cache: write through
I'm going to lock this device and ask for repair on it.
,
May 27 2016
I logged in to the host and verified that it did fail to fetch the file using wget:
# wget http://172.17.40.28:8082/static/x86-alex-paladin/R53-8376.0.0-rc1/autotest/packages/client-autotest.tar.bz2
--2016-05-27 12:49:48-- http://172.17.40.28:8082/static/x86-alex-paladin/R53-8376.0.0-rc1/autotest/packages/client-autotest.tar.bz2
Connecting to 172.17.40.28:8082... connected.
HTTP request sent, awaiting response...
We need someone in the lab to check the network settings.
,
May 27 2016
Looks like all outbound connections from the host are blocked. I can't ssh from the host to the outside world.
,
May 27 2016
Before you file a ticket for that DUT, there are still things we can/should do from here.
,
May 27 2016
In particular, why is there an sdb? Why can't we reach the devserver (especially since we can get into the DUT)?
,
May 27 2016
The SSD error doesn't matter. The network issue is the cause.
$ strace wget http://172.17.40.28:8082/static/x86-alex-paladin/R53-8376.0.0-rc1/autotest/packages/client-autotest.tar.bz2
...
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(8082), sin_addr=inet_addr("172.17.40.28")}, 16) = 0
write(2, "connected.\n", 11connected.
) = 11
select(4, NULL, [3], NULL, {900, 0}) = 1 (out [3], left {899, 999989})
write(3, "GET /static/x86-alex-paladin/R53"..., 226) = 226
write(2, "HTTP request sent, awaiting resp"..., 40HTTP request sent, awaiting response... ) = 40
select(4, [3], NULL, NULL, {900, 0}
....Timeout!
So the host can actually send the request out, but the response seems to be blocked.
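The same distinction the strace output draws (connect succeeds, request is written, then the read times out) can be checked without strace. The following is a minimal sketch, not a lab tool: `probe_http` and its return labels are hypothetical names introduced here, and it uses only the Python standard library `socket` module.

```python
import socket


def probe_http(host, port, path="/", timeout=5.0):
    """Report where a plain HTTP fetch stalls: at connect, or awaiting the response."""
    try:
        # Mirrors the connect() call in the strace output.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "connect_failed"
    with sock:
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
        try:
            sock.sendall(request)   # mirrors write(3, "GET ...")
            data = sock.recv(1)     # mirrors the select() that timed out
        except socket.timeout:
            # Request left the host, but nothing came back in time.
            return "no_response"
        except OSError:
            return "connection_error"
        return "response_received" if data else "closed_without_response"
```

On the affected DUT, a call such as `probe_http("172.17.40.28", 8082, "/static/...", timeout=30)` would be expected to return `"no_response"` if the behavior matches the strace above (the path is elided here on purpose; substitute the real one).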
,
May 27 2016
I rebooted the machine. It works fine now.
,
May 27 2016
> The SSD error doesn't matter. The network issue is the cause.
There was no SSD error. 'sdb' is some sort of removable storage. It may be the SD card reader.
More info: I ran `balance_pool cq x86-alex` earlier, so that the DUT won't be affecting the CQ any more.
,
May 27 2016
I've unlocked the DUT; it seems to be working. The reboot wiped the logs from stateful, so there's not much debug information to be had from the DUT at this point.
,
May 27 2016
The new build is good.
,
Jun 27 2016
Closing... please feel free to reopen if it's not fixed.
,
Jun 27 2016
Comment 1 by waihong@chromium.org
, May 27 2016