
Issue 814499


Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




kefka-release HWTests are very flaky

Project Member Reported by norvez@chromium.org, Feb 21 2018

Issue description

Errors aren't always the same but seem to happen during provisioning:

https://luci-milo.appspot.com/buildbot/chromeos/kefka-release/1967
"
FAIL: Saw file system error: [ 1.767018] EXT4-fs error (device mmcblk0p1): ext4_lookup:1590: inode #147000: comm rm: deleted inode referenced: 260848, completed successfully
"

Same error on both
https://luci-milo.appspot.com/buildbot/chromeos/kefka-release/1966
https://luci-milo.appspot.com/buildbot/chromeos/kefka-release/1968
"
provision: FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row4-rack8-host11: 0) SSHConnectionError: None, 1) SSHConnectionError: ssh: connect to host 100.115.227.30 port 22: Connection timed out
"
 
Cc: grundler@chromium.org nxia@chromium.org pprabhu@chromium.org puneetster@chromium.org matthewmwang@chromium.org ejcaruso@chromium.org snanda@chromium.org
Adding current deputies, sheriffs, and a few other folks.

There seem to be a lot of failing kefka provisions: http://shortn/_gDgVw9kRIT  (Note that things seem to have gotten better since quick provisioning was enabled on Feb 1, so this issue is likely unrelated to that change.)  It seems to happen on many DUTs: http://shortn/_on5TrVoaIg

Here are some DUTs in repair loops. There might be more, but I did not search for them exhaustively.
chromeos2-row4-rack8-host11
chromeos2-row4-rack8-host18

From https://storage.cloud.google.com/chromeos-autotest-results/hosts/chromeos2-row4-rack8-host11/192382-repair/20180103083609/status.log?_ga=2.150178230.-1609996729.1510708542 it seems like the device is unresponsive after repair most of the time:
	FAIL	----	verify.ssh	timestamp=1519922768	localtime=Mar 01 08:46:08	No answer to ping from chromeos2-row4-rack8-host11
	START	----	repair.rpm	timestamp=1519922768	localtime=Mar 01 08:46:08	
		FAIL	----	repair.rpm	timestamp=1519923011	localtime=Mar 01 08:50:11	chromeos2-row4-rack8-host11 is still offline after powercycling
	END FAIL	----	repair.rpm	timestamp=1519923011	localtime=Mar 01 08:50:11	
	START	----	repair.sysrq	timestamp=1519923011	localtime=Mar 01 08:50:11	
		FAIL	----	repair.sysrq	timestamp=1519923255	localtime=Mar 01 08:54:15	Host chromeos2-row4-rack8-host11 is still offline after sysrq.
	END FAIL	----	repair.sysrq	timestamp=1519923255	localtime=Mar 01 08:54:15	
	START	----	repair.servoreset	timestamp=1519923255	localtime=Mar 01 08:54:15	
		FAIL	----	verify.ssh	timestamp=1519925737	localtime=Mar 01 09:35:37	No answer to ping from chromeos2-row4-rack8-host11
	END FAIL	----	repair.servoreset	timestamp=1519925737	localtime=Mar 01 09:35:37	
	START	----	repair.firmware	timestamp=1519925737	localtime=Mar 01 09:35:37	
		FAIL	----	repair.firmware	timestamp=1519925737	localtime=Mar 01 09:35:37	Firmware repair is not applicable to host chromeos2-row4-rack8-host11.
	END FAIL	----	repair.firmware	timestamp=1519925737	localtime=Mar 01 09:35:37	
	START	----	repair.usb	timestamp=1519925737	localtime=Mar 01 09:35:37	
		FAIL	----	repair.usb	timestamp=1519925866	localtime=Mar 01 09:37:46	Download image to usb failed.
	END FAIL	----	repair.usb	timestamp=1519925866	localtime=Mar 01 09:37:46	
END FAIL	----	repair	timestamp=1519925866	localtime=Mar 01 09:37:46	
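
For context, the status.log above reflects an escalating repair ladder: each strategy (RPM powercycle, sysrq, servo reset, firmware, USB install) runs only after the previous one failed, and success is gated on the host answering ssh again. A rough sketch of that control flow, with hypothetical callables standing in for the real repair actions:

    # Sketch of the escalating repair ladder reflected in status.log above.
    # The strategy callables are hypothetical stand-ins, not autotest's code.
    import subprocess

    def verify_ssh(host):
        # The kind of reachability check that gates each repair attempt.
        return subprocess.call(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=30",
             "root@%s" % host, "true"]) == 0

    def repair_host(host, strategies):
        for name, action in strategies:
            try:
                action(host)  # e.g. rpm powercycle, sysrq, servo reset, usb
            except Exception as err:
                print("FAIL repair.%s: %s" % (name, err))
                continue
            if verify_ssh(host):
                return True   # a good verify ends the ladder
            print("FAIL repair.%s: host still offline" % name)
        return False          # corresponds to the final "END FAIL repair"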

But not always according to dut-status:
    2018-03-01 03:59:45  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/192281-verify/
    2018-03-01 02:19:13  NO http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/191810-repair/
    2018-03-01 01:54:49  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/191647-provision/
    2018-03-01 01:34:37  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/191469-repair/
    2018-03-01 01:29:31  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/191422-verify/

The provision after the repair failed due to no network connectivity again, before even getting to the provision process:
https://storage.cloud.google.com/chromeos-autotest-results/hosts/chromeos2-row4-rack8-host11/191647-provision/20180103015449/status.log?_ga=2.179528100.-1609996729.1510708542
	START	provision_AutoUpdate	provision_AutoUpdate	timestamp=1519898109	localtime=Mar 01 01:55:09	
		FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1519899261	localtime=Mar 01 02:14:21	Unhandled AutoservSSHTimeout: ('ssh timed out', * Command: 
      /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_E60yKBssh-master/socket
      -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
      -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
      ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22
      chromeos2-row4-rack8-host11 "export LIBC_FATAL_STDERR_=1; if type
      \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack:
      :get_chromeos_release_milestone|_get_lsb_release_content|run] ->
      ssh_run(cat \\\"/etc/lsb-release\\\")\";fi; cat \"/etc/lsb-release\""
  Exit status: 255
  Duration: 63.5015308857
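
The failing step is just reading /etc/lsb-release over ssh, so it can be rerun by hand when triaging. The sketch below keeps the essential options from the logged command (control socket and logger wrapper omitted) and uses a hard deadline as a stand-in for the AutoservSSHTimeout behavior; exit status 255 is ssh's own failure, anything else is the remote command's:

    # Re-run the query that timed out. Host name from the log above; the
    # 120 s deadline is an arbitrary choice for this sketch.
    import subprocess

    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=30",
             "-o", "StrictHostKeyChecking=no",
             "-o", "UserKnownHostsFile=/dev/null",
             "root@chromeos2-row4-rack8-host11", "cat /etc/lsb-release"],
            capture_output=True, text=True, timeout=120)
        print(result.returncode)  # 255 = ssh itself failed, as in the log
        print(result.stdout)
    except subprocess.TimeoutExpired:
        print("ssh timed out")    # the AutoservSSHTimeout case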

chromeos2-row4-rack8-host5 is repeatedly failing provision but then succeeding on repair; logs seem to show loss of network in multiple places.

This likely requires action from a few people:
- deputies: Get repair-failed devices actually repaired offline.
- sheriffs: Triage some of these failures and potentially try to reproduce them to see what's going on.
- snanda/puneetster: Depending on the results of the sheriffs' triage, someone might need to spend more time investigating/fixing this board. kefka is one of the two worst boards in terms of provision failures.
I have essentially no kernel logs from these repair jobs for chromeos2-row4-rack8-host11:
2018-02-09-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-09 21:40:28  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/13343-repair/
2018-02-10-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-10 22:07:30  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/26275-repair/
2018-02-11-bvt.txt:chromeos2-row4-rack8-host11    OK  2018-02-11 21:06:58  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/35415-repair/
2018-02-12-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-12 22:02:05  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/46949-repair/
2018-02-13-bvt.txt:chromeos2-row4-rack8-host11    OK  2018-02-13 21:51:16  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/55640-repair/
2018-02-14-bvt.txt:chromeos2-row4-rack8-host11    OK  2018-02-14 16:51:04  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/62852-reset/
2018-02-15-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-15 22:38:57  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/72521-repair/
2018-02-16-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-16 22:21:09  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/81226-repair/
2018-02-17-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-17 21:52:16  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/89688-repair/
2018-02-18-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-18 22:46:45  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/99736-repair/
2018-02-19-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-19 22:01:47  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/110320-repair/
2018-02-20-bvt.txt:chromeos2-row4-rack8-host11    OK  2018-02-20 22:56:09  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/118914-repair/
2018-02-21-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-21 22:15:56  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/126014-repair/
2018-02-22-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-22 22:46:07  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/132472-repair/
2018-02-23-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-23 22:30:26  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/138229-repair/
2018-02-24-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-24 21:40:08  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/147376-repair/
2018-02-25-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-25 22:41:30  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/157841-repair/
2018-02-26-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-26 21:48:50  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/168989-repair/
2018-02-27-bvt.txt:chromeos2-row4-rack8-host11    OK  2018-02-27 22:02:05  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/180601-reset/
2018-02-28-bvt.txt:chromeos2-row4-rack8-host11    NO  2018-02-28 23:04:22  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack8-host11/190571-repair/
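
For reference, the OK/NO column above can be tallied mechanically; a sketch that parses lines in the pasted grep format from a local file (the file name is arbitrary):

    # Tally daily repair outcomes from lines in the grep format pasted above.
    # "dut_history.txt" is an arbitrary local file holding those lines.
    import collections

    counts = collections.Counter()
    with open("dut_history.txt") as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2 and fields[1] in ("OK", "NO"):
                counts[fields[1]] += 1
    print(counts)  # for the 20 days above: 5 OK vs 15 NO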

No logs likely means there is something really wrong with the HW, given that we are getting logs from other machines.

Same story for chromeos2-row4-rack8-host18

Given the lack of data, I believe some TLC from a labtech should be the next step, OR someone has to tell me where to find the system or kernel logs that correspond to those repair jobs.
I'm able to get these logs from /var/log/messages on chromeos2-row4-rack8-host18 before I'm disconnected from ssh:

2018-03-02T10:48:51.681051-08:00 INFO kernel: [  448.157704] usb 1-2.2: USB disconnect, device number 6
2018-03-02T10:48:51.693735-08:00 ERR kernel: [  448.169093] Aborting journal on device sda1-8.
2018-03-02T10:48:51.693775-08:00 ERR kernel: [  448.169128] JBD2: Error -5 detected when updating journal superblock for sda1-8.
2018-03-02T10:48:55.361720-08:00 WARNING kernel: [  451.839326] EXT4-fs warning (device sda1): __ext4_read_dirblock:734: error -5 reading directory block (ino 25925, block 0)
2018-03-02T10:48:55.361753-08:00 WARNING kernel: [  451.839371] EXT4-fs warning (device sda1): __ext4_read_dirblock:734: error -5 reading directory block (ino 25925, block 0)
2018-03-02T10:48:55.361757-08:00 WARNING kernel: [  451.839402] EXT4-fs warning (device sda1): __ext4_read_dirblock:734: error -5 reading directory block (ino 25925, block 0)
2018-03-02T10:48:55.361759-08:00 WARNING kernel: [  451.839430] EXT4-fs warning (device sda1): __ext4_read_dirblock:734: error -5 reading directory block (ino 25925, block 0)
2018-03-02T10:48:55.361761-08:00 WARNING kernel: [  451.839461] EXT4-fs warning (device sda1): __ext4_read_dirblock:734: error -5 reading directory block (ino 25925, block 0)
2018-03-02T10:48:55.362735-08:00 WARNING kernel: [  451.839595] EXT4-fs warning (device sda1): __ext4_read_dirblock:734: error -5 reading directory block (ino 25925, block 0)
2018-03-02T10:48:55.362765-08:00 WARNING kernel: [  451.839623] EXT4-fs warning (device sda1): __ext4_read_dirblock:734: error -5 reading directory block (ino 25925, block 0)
2018-03-02T10:48:55.362768-08:00 WARNING kernel: [  451.839771] EXT4-fs warning (device sda1): __ext4_read_dirblock:734: error -5 reading directory block (ino 25925, block 0)
2018-03-02T10:48:55.364735-08:00 WARNING kernel: [  451.841844] EXT4-fs warning (device sda1): __ext4_read_dirblock:734: error -5 reading directory block (ino 25925, block 0)
2018-03-02T10:48:55.364766-08:00 WARNING kernel: [  451.841978] EXT4-fs warning (device sda1): __ext4_read_dirblock:734: error -5 reading directory block (ino 25925, block 0)
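
Since the box drops ssh mid-session, one way to keep whatever arrives before the disconnect is to stream the log to a local file rather than reading it interactively. A sketch (host name from this report; the output path is arbitrary):

    # Stream the DUT's syslog locally so everything received before the ssh
    # disconnect is preserved. Output file name is arbitrary.
    import subprocess

    with open("host18-messages.log", "w") as out:
        subprocess.call(
            ["ssh", "-o", "ConnectTimeout=30",
             "root@chromeos2-row4-rack8-host18",
             "tail -F /var/log/messages"],
            stdout=out)  # blocks until the connection drops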

Labels: provisionflake

Comment 5 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org
