
Issue 909934

Issue metadata

Status: Untriaged
Owner:
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




Crostini Disk IO Test Failures

Project Member Reported by cylee@chromium.org, Nov 28

Issue description

vm.CrostiniDiskIOPerf has been up on crosbolt-nightly for a while.

However, there are some failures. For example, on eve there is still the "out of disk space" error, which occurs when writing a 500M file to the home partition of the container even though there are still ~2G of available space.

2018/11/25 22:11:09 [22:11:09.349] Error at crostini_disk_io_perf.go:191: stress_rw with block size 1m failed: failed to run fio: fio: io_u error on file /home/testuser/fio_test_data: No space left on device: write offset=411041792, buflen=1048576

Another problem is that on some boards the returned fio result fails to parse.

2018/11/21 22:10:01 [22:10:01.154] Error at crostini_disk_io_perf.go:191: stress_rw with block size 16m failed: failed to parse fio result: invalid character 'i' in literal false (expecting 'a')

I'm trying to add more logging to see what happens.
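
To get more detail, one option is to dump the full, possibly multi-line fio output to a file under the test's output directory instead of only keeping the first error line. A minimal sketch in Go, assuming a plain os/exec invocation rather than the actual tast helpers (runFioWithLog, outDir, and the log file name are illustrative only):

package fiolog

import (
    "context"
    "io/ioutil"
    "os/exec"
    "path/filepath"
)

// runFioWithLog runs one fio command and writes its full combined output to a
// log file so that multi-line errors are preserved. Illustrative sketch only.
func runFioWithLog(ctx context.Context, outDir string, args ...string) ([]byte, error) {
    cmd := exec.CommandContext(ctx, "fio", args...)
    out, err := cmd.CombinedOutput() // stdout and stderr together
    logPath := filepath.Join(outDir, "fio_output.txt")
    if werr := ioutil.WriteFile(logPath, out, 0644); werr != nil {
        return out, werr
    }
    return out, err
}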
 
Project Member

Comment 2 by bugdroid1@chromium.org, Nov 29

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/c819d062d8f452db432d7a0c294d3ef7dd63cc50

commit c819d062d8f452db432d7a0c294d3ef7dd63cc50
Author: Cheng-Yu Lee <cylee@chromium.org>
Date: Thu Nov 29 20:11:18 2018

tast-tests: Write full multi-line fio error logs to a log file.

BUG=chromium:909934
TEST="tast --verbose run DUT vm.CrostiniDiskIOPerf"

Change-Id: Ic46770c5933221b662ff458cc4a80325e9bbe8a6
Reviewed-on: https://chromium-review.googlesource.com/1351971
Commit-Ready: Cheng-Yu Lee <cylee@chromium.org>
Tested-by: Cheng-Yu Lee <cylee@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/c819d062d8f452db432d7a0c294d3ef7dd63cc50/src/chromiumos/tast/local/bundles/cros/vm/crostini_disk_io_perf.go

Checked the detailed errors on various models. There are at least 4 types of failures:

1. stress_rw with block size 16m failed: failed to parse fio result: invalid character 'i' in literal false (expecting 'a')

This happens most frequently and on many boards. The parse error comes from error messages like the following being mixed into the fio output:
fio: pid=1307, got signal=9
fio: pid=1308, got signal=9
fio: pid=1305, got signal=9
fio: pid=1306, got signal=9
...

In the stress tests, 8 parallel fio jobs (processes) are created to simulate heavy IO load. When the block size is large, each fio process can take up to ~500M of memory. (The memory each fio process consumes looks proportional to the block size: when bs=512K it's ~18M, when bs=1M it's ~33M, and when bs=16M it grows to 500+M.) So the kernel OOM killer kills the processes that consume a lot of memory.
Since we don't know why fio eats so much memory at larger block sizes, the easiest solution is to restrict the block size in the stress tests.
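
The JSON decoder trips on these lines because the leading "f" of "fio:" looks like the start of the literal false, and the next character "i" isn't the expected "a". As a side note, one way to make the parser tolerant of such noise would be to skip everything before the first "{" of the JSON report; this is only a sketch (the actual fix, per the later CL, is to keep block sizes small so the OOM kill doesn't happen in the first place):

package fioparse

import (
    "encoding/json"
    "errors"
    "strings"
)

// parseFioJSON tolerates non-JSON noise (e.g. "fio: pid=1307, got signal=9")
// printed before fio's JSON report. Illustrative sketch, not the actual fix.
func parseFioJSON(out string, result interface{}) error {
    start := strings.Index(out, "{")
    if start < 0 {
        return errors.New("no JSON object found in fio output")
    }
    return json.Unmarshal([]byte(out[start:]), result)
}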

2. Failed to run tests: context deadline exceeded, Test did not finish

Seen on some machines like setzer and bob.
I set the running time of each time-based fio command to 10s, but on some models it frequently doesn't finish within ~10s. For example:

2018/12/04 22:29:45 [22:29:45.297] Running seq_write with bs 1m on host (1/3)
2018/12/04 22:29:55 [22:29:55.887] Reporting metric seq_write_bs_1m_write: 38820.0 47749.0 0.81
2018/12/04 22:29:55 [22:29:55.891] Running seq_write with block size 1m in container (2/3)
2018/12/04 22:30:16 [22:30:16.189] Running seq_write with bs 1m on host (2/3)
2018/12/04 22:30:26 [22:30:26.786] Reporting metric seq_write_bs_1m_write: 24120.0 47798.0 0.50
2018/12/04 22:30:26 [22:30:26.786] Running seq_write with block size 1m in container (3/3)
2018/12/04 22:30:41 [22:30:41.725] Running seq_write with bs 1m on host (3/3)
2018/12/04 22:30:52 [22:30:52.367] Reporting metric seq_write_bs_1m_write: 35384.0 47359.0 0.75

We can see that a fio command which should have finished in ~10s can take up to 20s when running in the container. I've locked some machines in the lab and found the problem can also happen when fio is running on the host.

2018/12/06 06:28:19 [14:28:20.199] Running seq_read with bs 4k on host (1/3)
2018/12/06 06:28:42 [14:28:42.898] Reporting metric seq_read_bs_4k_read: 10385.0 20601.0 0.50

I suspect it's because the system is under heavier swap pressure, but this needs more investigation. The easiest mitigation is to reduce the running time of each time-based fio test (e.g., from 10s to 8s).
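
A sketch of that mitigation, assuming a plain os/exec call (the flag values, the 8s runtime, and the headroom added to the context deadline are illustrative, not the actual test code):

package fiorun

import (
    "context"
    "fmt"
    "os/exec"
    "time"
)

// runTimedFio runs one time-based fio job with a reduced runtime (e.g. 8s)
// and a context deadline with extra headroom, since slow models were seen
// taking roughly twice the nominal runtime to finish.
func runTimedFio(ctx context.Context, testFile, rw, bs string, runtime time.Duration) ([]byte, error) {
    ctx, cancel := context.WithTimeout(ctx, runtime+20*time.Second) // headroom for slow DUTs
    defer cancel()
    args := []string{
        "--name=" + rw,
        "--filename=" + testFile,
        "--rw=" + rw,
        "--bs=" + bs,
        "--size=512m",
        "--time_based",
        fmt.Sprintf("--runtime=%d", int(runtime.Seconds())),
        "--output-format=json",
    }
    return exec.CommandContext(ctx, "fio", args...).Output()
}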

3. Failure dumping container log: exit status 1, rand_read with block size 16m failed: failed to run fio: : exit status 1, rand_read with block size 1m ......

Seen on hana, for example. The fio command simply exits abnormally without outputting anything to stdout or stderr. There's no kernel OOM either. I don't have a clue yet; I need to lock a machine and look into the details.

4. rand_write with block size 16m failed: failed to run fio: fio: io_u error on file /home/testuser/fio_test_data: No space left on device: write offset=268435456, buflen=16777216, ......

Only seen on eve. The disk runs out of space when creating a file of size 500M even though there should be at least 1-2G of disk space left. I reported it while developing the test but couldn't get much of a clue.
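
A hypothetical diagnostic here would be to log the space actually available on the target filesystem right before fio creates the test file (it would have to run inside the container, since /home/testuser is the container's home). A sketch using syscall.Statfs; the helper names are made up:

package diskcheck

import (
    "fmt"
    "syscall"
)

// availSpace returns the bytes available to unprivileged users on the
// filesystem containing path (Linux only). Diagnostic sketch only.
func availSpace(path string) (uint64, error) {
    var st syscall.Statfs_t
    if err := syscall.Statfs(path, &st); err != nil {
        return 0, err
    }
    return st.Bavail * uint64(st.Bsize), nil
}

func printAvail(path string) {
    if avail, err := availSpace(path); err == nil {
        fmt.Printf("%s: %d MiB available\n", path, avail/(1<<20))
    }
}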

I will try to solve problem 1 and 2 first.
Project Member

Comment 4 by bugdroid1@chromium.org, Dec 9

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/cca3b1cad3f65e646f396d47a35803f2962fc4bb

commit cca3b1cad3f65e646f396d47a35803f2962fc4bb
Author: Cheng-Yu Lee <cylee@chromium.org>
Date: Sun Dec 09 08:48:21 2018

tast-tests: Alter blocksize of vm.CrostiniDiskIOPerf to 4k, 16k, and
64k.

Large blocksize is impractical and a fio job can consume too much memory
when blocksize grows. It often leads to kernel OOM when running the test
with big blocksizes.

BUG=chromium:909934
TEST=locked a banon in lab and passed the test.

Change-Id: Ic93159d6313395aa95983591879b60aeb9ae4e50
Reviewed-on: https://chromium-review.googlesource.com/1365915
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Cheng-Yu Lee <cylee@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/cca3b1cad3f65e646f396d47a35803f2962fc4bb/src/chromiumos/tast/local/bundles/cros/vm/crostini_disk_io_perf.go
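
For reference, the shape of the change is roughly the following (the variable name is illustrative, not the actual source):

package fioparams

// Block sizes used after the change. 1m and 16m were dropped because per-job
// memory grows roughly with the block size (~500 MB per job at bs=16m, with
// 8 parallel jobs in the stress tests), which triggered the kernel OOM killer.
var blockSizes = []string{"4k", "16k", "64k"}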

Owner: cylee@chromium.org
Update: 
The fix is in, but a recent 9s problem caused all tests to fail (crbug/913540). That bug was fixed as well, but the test lab hasn't picked up that change yet. Expect the test to go back to green soon.
On most models the test was fixed by reducing the memory usage and running time of each fio command: http://shortn/_udqASEWlha

However, on some slow models the test still fails to finish within the time limit.
On setzer and reks it sometimes runs within the time limit and sometimes exceeds it, while on kefka it never finishes the work. celes is the worst: it easily hangs on a single fio command and never returns. There could be other problems on celes.
