hardware_RamFio failing a lot. |
|||||||||||||||||||||
Issue descriptionWhile annotating builds, I noticed a lot of hardware_RamFio failures on veyron_speedy. I cannot tie this to a specific CL under test, and believe it is a test flake. https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_speedy-paladin/builds/2724/steps/HWTest%20%5Bbvt-cq%5D/logs/stdio hardware_RamFio ERROR: Command <mount -t ramfs -o context=u:object_r:tmpfs:s0 ramfs /tmp/ramdisk> failed, rc=32, Command returned non-zero exit status
,
Aug 6 2016
There have been no failures of this test on veyron in the longest time, which means there is a bad CL coming here. https://wmatrix.googleplex.com/platform/unfiltered?hide_missing=True&tests=hardware_RamFio&releases=tot&platforms=veyron*&days_back=100
,
Aug 6 2016
Mhh, strange, flaky only on paladin.
,
Aug 6 2016
08/02 02:38:40.279 DEBUG| base_utils:0185| Running 'status ui'
08/02 02:38:40.301 DEBUG| base_utils:0185| Running 'stop ui'
08/02 02:38:41.236 DEBUG| base_utils:0278| [stdout] ui stop/waiting
08/02 02:38:41.360 INFO | base_sysinfo:0380| ChromeOS BOARD = veyron_speedy_1.8GHz_4GB
08/02 02:38:41.362 DEBUG| base_utils:0185| Running 'logger "autotest starting iteration /usr/local/autotest/results/default/hardware_RamFio/sysinfo/iteration.1 on veyron_speedy_1.8GHz_4GB"'
08/02 02:38:41.375 DEBUG| base_utils:0185| Running 'grep MemFree /proc/meminfo'
08/02 02:38:41.399 INFO | hardware_RamFio:0042| Found 3427057664 bytes of usable memory.
08/02 02:38:41.403 DEBUG| base_utils:0185| Running 'mkdir -p /tmp/ramdisk'
08/02 02:38:41.415 DEBUG| base_utils:0185| Running 'mount -t ramfs -o context=u:object_r:tmpfs:s0 ramfs /tmp/ramdisk'
08/02 02:38:41.438 DEBUG| base_utils:0185| Running 'logger "autotest finished iteration /usr/local/autotest/results/default/hardware_RamFio/sysinfo/iteration.1"'
08/02 02:38:41.455 WARNI| test:0606| Autotest caught exception when running test:
Traceback (most recent call last):
File "/usr/local/autotest/common_lib/test.py", line 600, in _exec
_call_test_function(self.execute, *p_args, **p_dargs)
File "/usr/local/autotest/common_lib/test.py", line 804, in _call_test_function
return func(*args, **dargs)
File "/usr/local/autotest/common_lib/test.py", line 461, in execute
dargs)
File "/usr/local/autotest/common_lib/test.py", line 347, in _call_run_once_with_retry
postprocess_profiled_run, args, dargs)
File "/usr/local/autotest/common_lib/test.py", line 376, in _call_run_once
self.run_once(*args, **dargs)
File "/usr/local/autotest/tests/hardware_RamFio/hardware_RamFio.py", line 58, in run_once
self._RAMDISK)
File "/usr/local/autotest/common_lib/base_utils.py", line 738, in run
"Command returned non-zero exit status")
CmdError: Command <mount -t ramfs -o context=u:object_r:tmpfs:s0 ramfs /tmp/ramdisk> failed, rc=32, Command returned non-zero exit status
* Command:
mount -t ramfs -o context=u:object_r:tmpfs:s0 ramfs /tmp/ramdisk
Exit status: 32
Duration: 0.00902414321899
stderr:
mount: wrong fs type, bad option, bad superblock on ramfs,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
,
Aug 6 2016
I see nothing enlightening in the logs.
,
Aug 8 2016
Dan, do you mind if I take a look at this?
,
Aug 8 2016
This could be related. It could be a race with the security server. 2016-08-02T09:38:41.374767+00:00 NOTICE root[22457]: autotest starting iteration /usr/local/autotest/results/default/hardware_RamFio/sysinfo/iteration.1 on veyron_speedy_1.8GHz_4GB 2016-08-02T09:38:41.438378+00:00 WARNING kernel: [ 753.182116] SELinux: Unable to set superblock options before the security server is initialized 2016-08-02T09:38:41.453560+00:00 NOTICE root[22461]: autotest finished iteration /usr/local/autotest/results/default/hardware_RamFio/sysinfo/iteration.1
,
Aug 8 2016
The message is a little misleading. There is no "security server" as a separate process. The message is sent when ss_initialized is not set to 1 yet. That variable is set during initialization of the selinux module, as a side effect of calling static int __init init_sel_fs(void) from selinuxfs.c. That module is statically linked, so it's not obvious what's the source of the race (if it's a race). I'll check the CLs.
,
Aug 8 2016
I have looked at the CLs (about 40 of them) in the "CommitQueueSync" step here: https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_speedy-paladin/builds/2724/ Unfortunately I haven't found any that could be related to selinux initialization.
,
Aug 9 2016
#8: Slight correction, ss_initialized is set by security_load_policy, called by sel_write_load (i.e. userspace writing to /sys/fs/selinux/load). rickyz: I'm not sure who setups the SELinux policies on cheets and non-cheets boards, and how that could possibly race. Any idea?
,
Aug 9 2016
,
Aug 9 2016
#10: Right, now I recall. SELinux policies are loaded by upstart (init) before anything else. (I should know because I wrote that CL.) So it would seem like it cannot be a race. However, the policies may have failed to load for some reason. Opal, Nicolas: sorry, I meant to ask you these questions: 1. why are we mounting a tmpfs inside /tmp, which is already a tmpfs? Maybe we don't need to. 2. if we need to, how about we try the mount with the selinux options first, and if that fails, we also try a plain mount?
,
Aug 9 2016
> 1. why are we mounting a tmpfs inside /tmp, which is already a tmpfs? Maybe we don't need to. We are mounting a ramfs inside /tmp The goal of this test is to measure read/write performance for ram > 2. if we need to, how about we try the mount with the selinux options first, and if that fails, we also try a plain mount? I think that is fine.
,
Aug 9 2016
#13: /tmp is ALREADY a ramfs. We should just write to a file in there. Any objections?
,
Aug 9 2016
#14 Sound good
,
Aug 9 2016
Sorry, I should have explained more clearly. Ramfs and tmpfs are both ram-based. It's likely that for the purpose of this test they are equivalent.
,
Aug 9 2016
Actually they are not... /tmp is mounted with a max size, so the test fails. Back to strategy #2.
,
Aug 9 2016
#14 Actualy for machine with 16GB ram. tmpfs size is only 8GB which is not enough
,
Aug 9 2016
Mhh, maybe we should reboot before/after this test to make sure there is no garbage around?
,
Aug 10 2016
Another failure: https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_speedy-paladin/builds/2805 Looking up builds 2724 and 2805 in CIDB: mysql> select id from buildTable where build_config = 'veyron_speedy-paladin' and (build_number = 2724 or build_number = 2805) order by id desc limit 2; +--------+ | id | +--------+ | 997290 | | 983450 | +--------+ 2 rows in set (0.05 sec) Looking up changes in common: mysql> select change_number, count(*) from clActionTable where build_id = 983450 or build_id = 997290 group by change_number having count(*) > 1; +---------------+----------+ | change_number | count(*) | +---------------+----------+ | 264830 | 2 | | 266797 | 2 | | 272285 | 2 | | 363972 | 2 | | 365141 | 2 | | 367231 | 2 | +---------------+----------+ 6 rows in set (0.06 sec)
,
Aug 10 2016
Looking at a couple more tests, 363972 appears in common in 2729 and 2728 and 2760 too.
,
Aug 10 2016
Hi Ricky, do you know under what circumstances this could happen? From the initial comment: hardware_RamFio ERROR: Command <mount -t ramfs -o context=u:object_r:tmpfs:s0 ramfs /tmp/ramdisk> failed, rc=32, Command returned non-zero exit status From comment #7: 2016-08-02T09:38:41.438378+00:00 WARNING kernel: [ 753.182116] SELinux: Unable to set superblock options before the security server is initialized What the message really means is that ss_initialized is zero. I thought we initialized selinux at the very beginning of upstart, and then we leave it alone. Maybe we aren't initializing selinux on some builds? (But the bug appears to be intermittent.) Thanks!
,
Aug 10 2016
As davidriley points out in comment 21, this is very likely to be caused by https://chromium-review.googlesource.com/#/c/363972/ We ran trybots with and without this change and it fails in the same way, so presumably something is not right with the kernel config change. The next steps I have here are: 1. Determine the new config is really doing the same thing as the Android config (need to get a dump of the final config from this and from the Android configured build to compare). 2. Figure out why this does not fail on the existing cheets systems. Sorry for the churn on this. Of course if you have ideas as to how the config might be broken I am glad to take them :).
,
Aug 10 2016
,
Aug 10 2016
Thank you Bernie. I will play with that patch.
,
Aug 11 2016
FWIW if I extract the .config (/build/$BOARD/var/cache/portage/sys-kernel/chromeos-kernel-3_14/.config) from a veyron_speedy build with the patch, and compare it to a veyron_minnie without the patch, they do seem to come out the same. So I think we need to look at if this test fails on veyron_minnie, or if it even gets run currently.
,
Aug 11 2016
It appears we still run _only_ arc-bvt-cq on the cheets boards, so veyron_minnie does not necessarily ever run this test. http://cautotest/afe/#tab_id=view_job&object_id=72879941 may shed some light on if this failure has been in place for some time, but just not noticed since the test does not run.
,
Aug 11 2016
If you don't mind I'll pass this on to you. I should focus on reliability issues, I thought this may be one.
,
Aug 11 2016
,
Aug 11 2016
Np, sorry for the trouble this caused. It looks like the veyron_minnie does pass the test, and it seems to have the same kernel configuration, so perhaps something else must be involved to fail this beyond the kernel. I wonder if we enable veyron_speedy with full cheets if it still fails, if it passes that implies something else being cheets does allows this to pass with this kernel config.
,
Aug 11 2016
Ok, so https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/1267 also includes https://chrome-internal-review.googlesource.com/#/c/274317/ which should make it look much more like the veyron_minnie build as it will have the full cheets stack inherited. https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/1268 is a veyron_minnie build with the kernel config change included, the kernel config change should be entirely redundant in this one so it is merely a sanity control build.
,
Aug 11 2016
I don't understand how/when RamFio is run and why this isn't a problem during the cheets transition. Just in case, this CL should make it work with and without selinux. https://chromium-review.googlesource.com/#/c/368095/
,
Aug 11 2016
Neat the veyron_speedy build with cheets inherited does pass, so something in cheets allows this test to pass. I am guessing if we have selinux enabled, but don't do some portion of the cheets initialization we end up in half baked state for this test? I am leaning toward landing Luigi's test fix and going ahead if there are no objections.
,
Aug 11 2016
The patch probably needs more testing. I am doing it now.
,
Aug 11 2016
Taking it back. Hope to be done by tonight or tomorrow. I need to get to the bottom of this :)
,
Aug 12 2016
Working on ToT cyan, it appears that both the plain "mount" command and the one with the selinux label work fine. (So why was the test changed?) But the kernel log shows that ss_initialized was 0, which could mean that the selinux policy failed to load for that device, or there was some other error at initialization. Looking at one of the test failures, veyron_speedy-paladin/builds/2805, I can find /var/log/messages here: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/72728412-chromeos-test/chromeos4-row4-rack11-host4/sysinfo/var/log/?pli=1 In that log, it looks like selinux is run in permissive mode: 2016-08-09T23:39:40.081774+00:00 INFO kernel: [ 0.000566] Security Framework initialized 2016-08-09T23:39:40.081778+00:00 INFO kernel: [ 0.000575] SELinux: Initializing. 2016-08-09T23:39:40.081780+00:00 DEBUG kernel: [ 0.000609] SELinux: Starting in permissive mode which is a little strange, because I was told that we've been running in enforcing mode for months now. Maybe these boards have a different selinux setup? I will look into that next.
,
Aug 12 2016
The only reference to selinux in src/{private-,}overlays/*veyron* is this one:
overlay-variant-veyron-shark-private/scripts/build_kernel_image.sh:androidboot.selinux=permissive
The only reference to enforcing is here:
./project-cheets-private/chromeos-base/android-container/files/selinux_config:1:SELINUX=enforcing
I am puzzled. This is very strange. Are all ARC boards still inheriting from project-cheets-private? Bernie maybe you can help me here.
,
Aug 12 2016
I can reproduce the error by renaming /etc/selinux/arc to /etc/selinux/xxxarc so that the kernel fails to load the policy. I get the same mount error and the same kernel message. With my CL, the test passes. We can push my CL for now, but this shows that the policy files aren't set up correctly on some boards. We need a test for that so we don't have to puzzle ofer other similar failures. Strangely, there is no indication of the failed policy load in /var/log/messages.
,
Aug 12 2016
Thanks for tracking this down, semenzato@! For what it's worth, we do have an SELinux test that makes sure SELinux is enabled, and various things are labeled properly: https://chrome-internal.googlesource.com/chromeos/autotest-cheets/+/master/client/site_tests/cheets_SELinuxTest/cheets_SELinuxTest.py Are these tests run on the boards in question?
,
Aug 13 2016
#39 Hi Ricky, perhaps you are in a better position to know, since I imagine there is probably a strategy for migrating to selinux and arc++? Do you have any relevant docs/information? Thank you!
,
Aug 13 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7d94868bbb219deed7a558141a41a9689aeca865 commit 7d94868bbb219deed7a558141a41a9689aeca865 Author: Luigi Semenzato <semenzato@chromium.org> Date: Thu Aug 11 17:00:46 2016 hardware_RamFio: work both with and without selinux If selinux is enabled, mounting the ramfs requires a selinux context. Try both ways to ease transition. BUG= chromium:635206 TEST=tested on cyan Change-Id: I8b6076bec87e4e9c71806f2adf24f428d91200b1 Reviewed-on: https://chromium-review.googlesource.com/368095 Commit-Ready: Luigi Semenzato <semenzato@chromium.org> Tested-by: Luigi Semenzato <semenzato@chromium.org> Reviewed-by: Luigi Semenzato <semenzato@chromium.org> Reviewed-by: Puthikorn Voravootivat <puthik@chromium.org> [modify] https://crrev.com/7d94868bbb219deed7a558141a41a9689aeca865/client/site_tests/hardware_RamFio/hardware_RamFio.py
,
Aug 15 2016
chrome-on-chromeos gardener here. I just saw a hardware_RamFio on tricky-chrome-pfq, see issue 620567. Related to this one?
,
Aug 15 2016
I think that might be a separate flake, this one was failing with a particular set of kernel configs (which have not landed yet).
,
Aug 15 2016
Yes it looks like a separate issue. The client test log is empty, this is the server test log (autoserv.DEBUG). I'll copy this to the other bug too. 08/15 11:49:36.274 INFO | abstract_ssh:0743| Master ssh connection to chromeos4-row2-rack3-host1 is down. 08/15 11:49:36.275 DEBUG| abstract_ssh:0704| Nuking master_ssh_job. 08/15 11:49:36.275 DEBUG| abstract_ssh:0710| Cleaning master_ssh_tempdir. 08/15 11:49:36.277 INFO | abstract_ssh:0757| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp__suKnyssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row2-rack3-host1' 08/15 11:49:36.277 DEBUG| base_utils:0185| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp__suKnyssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row2-rack3-host1' 08/15 11:49:37.707 DEBUG| base_utils:0278| [stdout] 16a7526c-c220-42d2-8edf-216e18725e05 08/15 11:49:37.711 INFO | server_job:0129| FAIL ---- ---- timestamp=1471286977 localtime=Aug 15 11:49:37 Autotest client terminated unexpectedly: DUT rebooted during the test run. 08/15 11:49:37.712 INFO | server_job:0129| END FAIL ---- ---- timestamp=1471286977 localtime=Aug 15 11:49:37 08/15 11:49:37.712 INFO | server_job:0129| END GOOD ---- ---- timestamp=1471286977 localtime=Aug 15 11:49:37
,
Aug 15 2016
I think this is now fixed based on success in https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/1277 I will now land the new kernel configs which will double confirm it.
,
Sep 8 2016
,
Sep 12 2016
Has been fixed for a while, sorry.
,
Oct 7 2016
,
Nov 19 2016
,
Jan 21 2017
,
Mar 4 2017
,
Apr 17 2017
,
May 30 2017
,
Aug 1 2017
,
Aug 8 2017
test passed: https://wmatrix.googleplex.com/unfiltered?hide_missing=True&releases=tot&tests=hardware_RamFio |
|||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||
Comment 1 by dgarr...@chromium.org
, Aug 6 2016