samus SSD becomes unresponsive after a few days
Issue description
I have seen this on multiple units:
2017-09-15T14:40:08.152459-07:00 ERR kernel: [344533.892736] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x50000 action 0x6 frozen
2017-09-15T14:40:08.152494-07:00 ERR kernel: [344533.892760] ata1: SError: { PHYRdyChg CommWake }
2017-09-15T14:40:08.152498-07:00 ERR kernel: [344533.892776] ata1.00: failed command: WRITE FPDMA QUEUED
2017-09-15T14:40:08.152502-07:00 ERR kernel: [344533.892796] ata1.00: cmd 61/08:00:f8:81:d0/00:00:02:00:00/40 tag 0 ncq 4096 out
2017-09-15T14:40:08.152505-07:00 ERR kernel: [344533.892796] res 40/00:f4:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
2017-09-15T14:40:08.152508-07:00 ERR kernel: [344533.892825] ata1.00: status: { DRDY }
2017-09-15T14:40:08.152511-07:00 INFO kernel: [344533.892845] ata1: hard resetting link
2017-09-15T14:40:13.311464-07:00 WARNING kernel: [344539.056855] ata1: link is slow to respond, please be patient (ready=0)
2017-09-15T14:40:18.156465-07:00 ERR kernel: [344543.906665] ata1: COMRESET failed (errno=-16)
2017-09-15T14:40:18.156494-07:00 INFO kernel: [344543.906687] ata1: hard resetting link
2017-09-15T14:40:23.316463-07:00 WARNING kernel: [344549.071887] ata1: link is slow to respond, please be patient (ready=0)
2017-09-15T14:40:28.161450-07:00 ERR kernel: [344553.921747] ata1: COMRESET failed (errno=-16)
2017-09-15T14:40:28.161479-07:00 INFO kernel: [344553.921768] ata1: hard resetting link
2017-09-15T14:40:33.320463-07:00 WARNING kernel: [344559.085919] ata1: link is slow to respond, please be patient (ready=0)
2017-09-15T14:41:03.206430-07:00 ERR kernel: [344589.001883] ata1: COMRESET failed (errno=-16)
2017-09-15T14:41:03.206461-07:00 WARNING kernel: [344589.001905] ata1: limiting SATA link speed to 3.0 Gbps
2017-09-15T14:41:03.206465-07:00 INFO kernel: [344589.001919] ata1: hard resetting link
2017-09-15T14:41:08.213583-07:00 ERR kernel: [344594.012901] ata1: COMRESET failed (errno=-16)
2017-09-15T14:41:08.213619-07:00 ERR kernel: [344594.012923] ata1: reset failed, giving up
2017-09-15T14:41:08.213623-07:00 WARNING kernel: [344594.012936] ata1.00: disabled
2017-09-15T14:41:08.213626-07:00 WARNING kernel: [344594.012950] ata1.00: device reported invalid CHS sector 0
2017-09-15T14:41:08.213631-07:00 INFO kernel: [344594.012994] ata1: EH complete
2017-09-15T14:41:08.213880-07:00 INFO kernel: [344594.013030] sd 0:0:0:0: [sda] Unhandled error code
2017-09-15T14:41:08.213892-07:00 INFO kernel: [344594.013043] sd 0:0:0:0: [sda]
2017-09-15T14:41:08.213896-07:00 NOTICE kernel: [344594.013053] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
2017-09-15T14:41:08.213900-07:00 INFO kernel: [344594.013067] sd 0:0:0:0: [sda] CDB:
2017-09-15T14:41:08.213903-07:00 NOTICE kernel: [344594.013077] Write(10): 2a 00 02 d0 81 f8 00 00 08 00
2017-09-15T14:41:08.213907-07:00 ERR kernel: [344594.013115] end_request: I/O error, dev sda, sector 47219192
2017-09-15T14:41:08.213911-07:00 WARNING kernel: [344594.013134] EXT4-fs warning (device sda1): ext4_end_bio:336: I/O error -5 writing to inode 1047649 (offset 0 size 4096 starting block 5902400)
Right after a reboot I dumped the SMART info:
localhost ~ # smartctl -x /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.14.0] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: KINGSTON RBU-SUS151S332GD
Serial Number: 50026B7E49D723FE
Firmware Version: S9FM02.3
User Capacity: 32,017,047,552 bytes [32.0 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: < 1.8 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 (minor revision not indicated)
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Sep 16 12:12:46 2017 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM level is: 254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is: Enabled
ATA Security is: Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 30) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 2) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate -O-R-- 100 100 000 - 0
2 Throughput_Performance P-S--- 100 100 050 - 0
5 Reallocated_Sector_Ct PO--C- 100 100 050 - 0
9 Power_On_Hours -O--C- 100 100 000 - 425
12 Power_Cycle_Count -O--C- 100 100 000 - 37720
167 Unknown_Attribute -O---K 100 100 000 - 0
168 Unknown_Attribute -O--C- 100 100 000 - 0
169 Unknown_Attribute ------ 100 100 000 - 16
170 Unknown_Attribute PO--C- 100 100 010 - 10
172 Unknown_Attribute -O--CK 100 100 000 - 0
173 Unknown_Attribute ------ 100 100 000 - 22216807
175 Program_Fail_Count_Chip PO--C- 100 100 010 - 0
181 Program_Fail_Cnt_Total -O--C- 100 100 000 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
192 Power-Off_Retract_Count -O--C- 100 100 000 - 1130
194 Temperature_Celsius PO---K 070 070 000 - 30
196 Reallocated_Event_Count ------ 100 100 000 - 0
197 Current_Pending_Sector -O--CK 100 100 000 - 0
199 UDMA_CRC_Error_Count -O--CK 100 100 000 - 0
218 Unknown_Attribute ------ 100 100 000 - 0
233 Media_Wearout_Indicator PO--C- 100 100 000 - 2305647
240 Unknown_SSD_Attribute PO--C- 100 100 050 - 0
241 Total_LBAs_Written -O--C- 100 100 000 - 1711154
242 Total_LBAs_Read -O--C- 100 100 000 - 2786909
243 Unknown_Attribute -O--C- 100 100 000 - 110382343808100
244 Unknown_Attribute ------ 100 100 000 - 103
245 Unknown_Attribute ------ 100 100 000 - 339
246 Unknown_Attribute ------ 100 100 000 - 440189
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 51 Comprehensive SMART error log
0x03 GPL R/O 64 Ext. Comprehensive SMART error log
0x04 GPL,SL R/O 8 Device Statistics log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 SATA NCQ Queued Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
SMART Extended Comprehensive Error Log Version: 1 (64 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Commands not supported
Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x07 ===== = = === == Solid State Device Statistics (rev 1) ==
0x07 0x008 1 3 --- Percentage Used Endurance Indicator
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0008 2 0 Device-to-host non-data FIS retries
0x0009 4 128 Transition from drive PhyRdy to drive PhyNRdy
0x000a 4 3 Device-to-host register FISes sent due to a COMRESET
0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
0x0010 2 0 R_ERR response for host-to-device data FIS, non-CRC
0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
0x0013 2 0 R_ERR response for host-to-device non-data FIS, non-CRC
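One thing that may be worth a second look in the attribute table is the ratio of Power_Cycle_Count to Power_On_Hours. The sketch below just does the arithmetic on lines copied from the output above. It assumes the raw values mean what the attribute names say, which is not guaranteed since smartctl reports this drive as not in its database; `parse_raw_values` is a made-up helper name.

```python
# Attribute lines copied from the smartctl output above.
ATTRS = """\
  9 Power_On_Hours -O--C- 100 100 000 - 425
 12 Power_Cycle_Count -O--C- 100 100 000 - 37720
192 Power-Off_Retract_Count -O--C- 100 100 000 - 1130
"""


def parse_raw_values(text):
    """Hypothetical helper: map attribute name -> raw value (last column)."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # ID, name, flags, value, worst, thresh, fail, raw
        if len(fields) >= 8 and fields[0].isdigit():
            raw[fields[1]] = int(fields[-1])
    return raw


raw = parse_raw_values(ATTRS)
# Assumption: raw values are plain hours and plain cycle counts.
cycles_per_hour = raw["Power_Cycle_Count"] / raw["Power_On_Hours"]
print(f"{cycles_per_hour:.1f} power cycles per power-on hour")
```

If the raw values can be taken at face value, that is on the order of 89 power cycles per power-on hour, so the drive may be counting something other than full system power cycles.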
/dev/sda never recovers, so attempts to access anything that isn't in the kernel's buffer cache will result in an I/O error.
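For reference, the time the kernel spent trying to bring the link back can be read off the log timestamps. The sketch below pairs each `hard resetting link` line with the following `COMRESET failed` line; the timestamps are copied from the kernel log above, and the `reset_durations` helper is a made-up name.

```python
# (event, kernel timestamp in seconds) copied from the kernel log above.
EVENTS = [
    ("reset", 344533.892845),
    ("failed", 344543.906665),
    ("reset", 344543.906687),
    ("failed", 344553.921747),
    ("reset", 344553.921768),
    ("failed", 344589.001883),
    ("reset", 344589.001919),
    ("failed", 344594.012901),
]


def reset_durations(events):
    """Hypothetical helper: seconds between each reset and its failure."""
    durations = []
    start = None
    for kind, timestamp in events:
        if kind == "reset":
            start = timestamp
        elif kind == "failed" and start is not None:
            durations.append(round(timestamp - start, 2))
            start = None
    return durations


durations = reset_durations(EVENTS)
print(durations)
```

The four attempts take roughly 10, 10, 35 and 5 seconds, which looks like libata's standard escalating reset timeouts rather than anything drive-specific. The drive did not answer COMRESET at all for about a minute, including after the link speed was limited to 3.0 Gbps, before error handling gave up and disabled ata1.00.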
The problem goes away after a refresh+power reset, but it eventually recurs. It happens on both the standard Chrome OS 3.14 kernel and the stock Ubuntu 4.10 kernel. I don't know what triggers it or how to reproduce it on demand.
If I run `dd if=/dev/sda of=/dev/null bs=1048576` when the device is in a good state, it finishes in about a minute (for 32GB) and I do not see any disk errors in dmesg.
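As a rough sanity check on that `dd` run, the implied sequential read rate can be estimated from the capacity in the smartctl output. The 60 second elapsed time is an assumption standing in for "about a minute", so the result is only a ballpark figure.

```python
CAPACITY_BYTES = 32_017_047_552  # "User Capacity" from the smartctl output
ELAPSED_SECONDS = 60             # assumption: "about a minute" taken as 60 s

throughput_mb_s = CAPACITY_BYTES / ELAPSED_SECONDS / 1e6
print(f"~{throughput_mb_s:.0f} MB/s")
```

That works out to a bit over 500 MB/s, which is plausible for a link negotiated at 6.0 Gb/s (about 600 MB/s of payload after 8b/10b encoding) and suggests the drive and link perform normally while the device is in a good state.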
Sep 17 2017
FWIW the machines on which I've seen this problem generally are not suspending/resuming much (if at all).
Comment 1 by gwendal@chromium.org, Sep 17 2017