Filesystem errors reported after install for provisioning
Reported by jrbarnette@chromium.org, Jun 21 2018
Issue description
Looking at the history for the wolf CQ pool, I've observed three
distinct provisioning failures across two separate DUTs, each
failing the post-installation verify checks with this symptom:
FAIL ---- verify.ext4 timestamp=1529413090 localtime=Jun 19 05:58:10 Saw file system error: [ 1.386042] EXT4-fs error (device sda1): __ext4_get_inode_loc:3762: inode #142371: block 525106: comm find: unable to read itable block
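For context, the verify.ext4 check appears to flag ext4 errors it finds in the kernel log (note the quoted kernel timestamp). An equivalent manual check on a DUT amounts to something like the sketch below, using only standard shell tools; the actual verifier implementation may differ:

$ dmesg | grep 'EXT4-fs error' \
    && echo 'FAIL: saw file system error' \
    || echo 'PASS: no file system errors'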
These are the failed provision tasks:
2018-06-20 12:50:26 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack7-host1/1222296-provision/
2018-06-20 00:49:43 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack4-host9/1218597-provision/
2018-06-19 05:53:12 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack4-host9/1211689-provision/
In all cases, the DUT was fine after re-provisioning.
I believe that this behavior is new, and since it's occurring on two
different DUTs, I'm inclined to suspect software, not hardware.
Jun 21 2018
The cited I/O error on reads from /dev/sda is present on both DUTs
whenever the verify check fails.
Put more simply, /dev/sda is suspect on both of these DUTs:
chromeos4-row1-rack4-host9
chromeos4-row1-rack7-host1
grundler@ gwendal@ - If one of you can confirm the diagnosis, we can
try and kick off replacing the units.
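If it helps confirm the diagnosis, the drive's own health report is a quick first check. This is only a sketch, assuming smartmontools is installed on the DUT; attribute names and thresholds vary by vendor:

$ smartctl -H /dev/sda    # overall SMART health self-assessment
$ smartctl -A /dev/sda    # vendor attributes, e.g. reallocated/pending sectors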
Jun 22 2018
Yes, when you cannot read your SSD, it is more than time to replace it. Unfortunately, Kingston SSDs do not warn when they are about to die.
Jun 22 2018
I've filed b/110569219 for decommission and replacement. I've locked the hosts and swapped them out of the CQ for working spare wolf units.

$ atest host mod --lock -r b/110569219 ${HOSTS[@]}
Locked hosts: chromeos4-row1-rack4-host9, chromeos4-row1-rack7-host1

$ balance_pool -f cq wolf
wolf cq pool: Target of 9 is above minimum.
Balancing ['model:wolf'] cq pool:
Total 9 DUTs, 7 working, 2 broken, 0 reserved.
Target is 9 working DUTs; grow pool by 2 DUTs.
['model:wolf'] suites pool has 14 spares available for balancing pool cq
['model:wolf'] cq pool will return 2 broken DUTs, leaving 0 still in the pool.
Transferring 2 DUTs from cq to suites.
Updating host: chromeos4-row1-rack7-host1.
Removing labels ['pool:cq'] from host chromeos4-row1-rack7-host1
Adding labels ['pool:suites'] to host chromeos4-row1-rack7-host1
Updating host: chromeos4-row1-rack4-host9.
Removing labels ['pool:cq'] from host chromeos4-row1-rack4-host9
Adding labels ['pool:suites'] to host chromeos4-row1-rack4-host9
Transferring 2 DUTs from suites to cq.
Updating host: chromeos4-row1-rack3-host13.
Removing labels ['pool:suites'] from host chromeos4-row1-rack3-host13
Adding labels ['pool:cq'] to host chromeos4-row1-rack3-host13
Updating host: chromeos4-row1-rack4-host7.
Removing labels ['pool:suites'] from host chromeos4-row1-rack4-host7
Adding labels ['pool:cq'] to host chromeos4-row1-rack4-host7
Comment 1 by jrbarnette@chromium.org, Jun 21 2018

I cross-checked the pool's history for the past week. There have been 7 provision failures in that time. It's always the same two DUTs, and 6 of the 7 failures match the symptom above.

Looking in /var/log/messages from the most recent provision failure, you can find this:

2018-06-20T20:02:23.819782+00:00 ERR kernel: [ 3.124297] ata1.00: exception Emask 0x0 SAct 0x70000003 SErr 0x0 action 0x0
2018-06-20T20:02:23.819783+00:00 ERR kernel: [ 3.124311] ata1.00: irq_stat 0x40000008
2018-06-20T20:02:23.819784+00:00 ERR kernel: [ 3.124323] ata1.00: failed command: READ FPDMA QUEUED
2018-06-20T20:02:23.819786+00:00 ERR kernel: [ 3.124336] ata1.00: cmd 60/08:e0:e8:85:84/00:00:00:00:00/40 tag 28 ncq 4096 in
2018-06-20T20:02:23.819787+00:00 ERR kernel: [ 3.124336] res 41/40:00:ec:85:84/00:00:00:00:00/00 Emask 0x409 (media error) <F>
2018-06-20T20:02:23.819788+00:00 ERR kernel: [ 3.124353] ata1.00: status: { DRDY ERR }
2018-06-20T20:02:23.819789+00:00 ERR kernel: [ 3.124361] ata1.00: error: { UNC }
2018-06-20T20:02:23.819794+00:00 INFO kernel: [ 3.125935] ata1.00: configured for UDMA/133
2018-06-20T20:02:23.819795+00:00 INFO kernel: [ 3.125967] sd 0:0:0:0: [sda] Unhandled sense code
2018-06-20T20:02:23.819796+00:00 INFO kernel: [ 3.125977] sd 0:0:0:0: [sda]
2018-06-20T20:02:23.819797+00:00 NOTICE kernel: [ 3.125983] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
2018-06-20T20:02:23.819798+00:00 INFO kernel: [ 3.125992] sd 0:0:0:0: [sda]
2018-06-20T20:02:23.819800+00:00 NOTICE kernel: [ 3.125998] Sense Key : Medium Error [current] [descriptor]
2018-06-20T20:02:23.819806+00:00 NOTICE kernel: [ 3.126015] Descriptor sense data with sense descriptors (in hex):
2018-06-20T20:02:23.819807+00:00 INFO kernel: [ 3.126025] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
2018-06-20T20:02:23.819808+00:00 INFO kernel: [ 3.126076] 00 84 85 ec
2018-06-20T20:02:23.819810+00:00 INFO kernel: [ 3.126095] sd 0:0:0:0: [sda]
2018-06-20T20:02:23.819811+00:00 NOTICE kernel: [ 3.126102] Add. Sense: Unrecovered read error - auto reallocate failed
2018-06-20T20:02:23.819812+00:00 INFO kernel: [ 3.126114] sd 0:0:0:0: [sda] CDB:
2018-06-20T20:02:23.819813+00:00 NOTICE kernel: [ 3.126120] Read(10): 28 00 00 84 85 e8 00 00 08 00
2018-06-20T20:02:23.819818+00:00 ERR kernel: [ 3.126159] end_request: I/O error, dev sda, sector 8685036
2018-06-20T20:02:23.819819+00:00 INFO kernel: [ 3.126177] ata1: EH complete

... and that sure _does_ look like a hardware problem...
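For anyone triaging similar failures, the relevant records can be pulled from a DUT's logs in one pass. The patterns below are a sketch keyed to the messages quoted above, not an exhaustive list of disk-error signatures:

$ grep -E 'failed command|media error|I/O error, dev sda|EXT4-fs error' /var/log/messages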