
Issue 855210


Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Filesystem errors reported after install for provisioning

Reported by jrbarnette@chromium.org, Jun 21 2018

Issue description

Looking at history for the wolf CQ pool, I've observed three
distinct provisioning failures on two separate DUTs that
failed post-installation verify checks with this symptom:
	FAIL	----	verify.ext4	timestamp=1529413090	localtime=Jun 19 05:58:10	Saw file system error: [    1.386042] EXT4-fs error (device sda1): __ext4_get_inode_loc:3762: inode #142371: block 525106: comm find: unable to read itable block
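
For reference, that symptom can be reproduced by hand by scanning the
DUT's kernel log for EXT4 errors; a minimal sketch, assuming root SSH
access to the DUT (the hostname below is only an example):

    # Look for EXT4 errors reported by the kernel, both on the current
    # boot (dmesg) and in the persistent log (/var/log/messages).
    DUT=chromeos4-row1-rack4-host9
    ssh root@"${DUT}" 'dmesg | grep "EXT4-fs error"'
    ssh root@"${DUT}" 'grep "EXT4-fs error" /var/log/messages | tail -n 20'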

These are the failed provision tasks:
    2018-06-20 12:50:26  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack7-host1/1222296-provision/
    2018-06-20 00:49:43  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack4-host9/1218597-provision/
    2018-06-19 05:53:12  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack4-host9/1211689-provision/

In all cases, the DUT was fine after re-provisioning.

I believe that this behavior is new, and since it's occurring on two
different DUTs, I'm inclined to suspect software, not hardware.

I cross-checked the pool's history for the past week.  There
have been 7 provision failures in that time.  It's always the
same two DUTs, and 6 of the 7 failures match the symptom
above.

Looking in /var/log/messages from the most recent provision failure, you
can find this:
2018-06-20T20:02:23.819782+00:00 ERR kernel: [    3.124297] ata1.00: exception Emask 0x0 SAct 0x70000003 SErr 0x0 action 0x0
2018-06-20T20:02:23.819783+00:00 ERR kernel: [    3.124311] ata1.00: irq_stat 0x40000008
2018-06-20T20:02:23.819784+00:00 ERR kernel: [    3.124323] ata1.00: failed command: READ FPDMA QUEUED
2018-06-20T20:02:23.819786+00:00 ERR kernel: [    3.124336] ata1.00: cmd 60/08:e0:e8:85:84/00:00:00:00:00/40 tag 28 ncq 4096 in
2018-06-20T20:02:23.819787+00:00 ERR kernel: [    3.124336]          res 41/40:00:ec:85:84/00:00:00:00:00/00 Emask 0x409 (media error) <F>
2018-06-20T20:02:23.819788+00:00 ERR kernel: [    3.124353] ata1.00: status: { DRDY ERR }
2018-06-20T20:02:23.819789+00:00 ERR kernel: [    3.124361] ata1.00: error: { UNC }
2018-06-20T20:02:23.819794+00:00 INFO kernel: [    3.125935] ata1.00: configured for UDMA/133
2018-06-20T20:02:23.819795+00:00 INFO kernel: [    3.125967] sd 0:0:0:0: [sda] Unhandled sense code
2018-06-20T20:02:23.819796+00:00 INFO kernel: [    3.125977] sd 0:0:0:0: [sda]  
2018-06-20T20:02:23.819797+00:00 NOTICE kernel: [    3.125983] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
2018-06-20T20:02:23.819798+00:00 INFO kernel: [    3.125992] sd 0:0:0:0: [sda]  
2018-06-20T20:02:23.819800+00:00 NOTICE kernel: [    3.125998] Sense Key : Medium Error [current] [descriptor]
2018-06-20T20:02:23.819806+00:00 NOTICE kernel: [    3.126015] Descriptor sense data with sense descriptors (in hex):
2018-06-20T20:02:23.819807+00:00 INFO kernel: [    3.126025]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
2018-06-20T20:02:23.819808+00:00 INFO kernel: [    3.126076]         00 84 85 ec 
2018-06-20T20:02:23.819810+00:00 INFO kernel: [    3.126095] sd 0:0:0:0: [sda]  
2018-06-20T20:02:23.819811+00:00 NOTICE kernel: [    3.126102] Add. Sense: Unrecovered read error - auto reallocate failed
2018-06-20T20:02:23.819812+00:00 INFO kernel: [    3.126114] sd 0:0:0:0: [sda] CDB: 
2018-06-20T20:02:23.819813+00:00 NOTICE kernel: [    3.126120] Read(10): 28 00 00 84 85 e8 00 00 08 00
2018-06-20T20:02:23.819818+00:00 ERR kernel: [    3.126159] end_request: I/O error, dev sda, sector 8685036
2018-06-20T20:02:23.819819+00:00 INFO kernel: [    3.126177] ata1: EH complete

... and that sure _does_ look like a hardware problem...
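
As a cross-check (a decode of the numbers quoted above, not part of the
original logs): the failing READ(10) starts at LBA 0x008485e8 for 8
sectors, and the last bytes of the descriptor sense data (00 84 85 ec)
are the information field naming the exact LBA that could not be read,
which matches the sector in the end_request line:

    # Decode the LBAs quoted in the kernel log above.
    printf 'READ(10) start LBA: %d\n' 0x008485e8   # 8685032
    printf 'unreadable LBA:     %d\n' 0x008485ec   # 8685036, matches "sector 8685036"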

The cited I/O error on reads from /dev/sda is present on both DUTs
when the kernel verifier fails.

Put more simply, /dev/sda is suspect on both of these DUTs:
    chromeos4-row1-rack4-host9
    chromeos4-row1-rack7-host1

grundler@ gwendal@ - If one of you can confirm the diagnosis, we can
try and kick off replacing the units.
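
One way to confirm from the DUT side (a sketch, assuming smartmontools
is available on the image; the device path comes from the log above):

    # SMART health, logged drive errors, and reallocation-related
    # attributes for the suspect disk; pending/uncorrectable sector
    # counts would back up the media-error diagnosis.
    smartctl -H /dev/sda
    smartctl -l error /dev/sda
    smartctl -A /dev/sda | grep -iE 'realloc|pending|uncorrect'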

Yes, when you cannot read your SSD, it is more than time to replace it.
Unfortunately, Kingston SSDs do not tell you when they are about to die.

Status: Fixed (was: Assigned)
I've filed b/110569219 for decommission and replacement.
I've locked the hosts and swapped them out of the CQ for
working spare wolf units.

    $ atest host mod --lock -r b/110569219 ${HOSTS[@]}
    Locked hosts:
            chromeos4-row1-rack4-host9, chromeos4-row1-rack7-host1
    $ balance_pool -f cq wolf
    wolf cq pool: Target of 9 is above minimum.

    Balancing ['model:wolf'] cq pool:
    Total 9 DUTs, 7 working, 2 broken, 0 reserved.
    Target is 9 working DUTs; grow pool by 2 DUTs.
    ['model:wolf'] suites pool has 14 spares available for balancing pool cq
    ['model:wolf'] cq pool will return 2 broken DUTs, leaving 0 still in the pool.
    Transferring 2 DUTs from cq to suites.
    Updating host: chromeos4-row1-rack7-host1.
    Removing labels ['pool:cq'] from host chromeos4-row1-rack7-host1
    Adding labels ['pool:suites'] to host chromeos4-row1-rack7-host1
    Updating host: chromeos4-row1-rack4-host9.
    Removing labels ['pool:cq'] from host chromeos4-row1-rack4-host9
    Adding labels ['pool:suites'] to host chromeos4-row1-rack4-host9
    Transferring 2 DUTs from suites to cq.
    Updating host: chromeos4-row1-rack3-host13.
    Removing labels ['pool:suites'] from host chromeos4-row1-rack3-host13
    Adding labels ['pool:cq'] to host chromeos4-row1-rack3-host13
    Updating host: chromeos4-row1-rack4-host7.
    Removing labels ['pool:suites'] from host chromeos4-row1-rack4-host7
    Adding labels ['pool:cq'] to host chromeos4-row1-rack4-host7
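
(For completeness, ${HOSTS[@]} in the lock command above was not shown
being set; presumably something like the following, using the hostnames
from the "Locked hosts" output:)

    # Hypothetical setup for the lock command in the transcript above.
    HOSTS=( chromeos4-row1-rack4-host9 chromeos4-row1-rack7-host1 )
    atest host mod --lock -r b/110569219 "${HOSTS[@]}"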
