Crostini Disk IO Test Failures
Issue description

vm.CrostiniDiskIOPerf has been up on crosbolt-nightly for a while, but there are still some failures. For example, on eve there is still the "out of disk space" error, which occurs when the test tries to write a 500M file to the home partition of the container even though there are still ~2G of available space:

2018/11/25 22:11:09 [22:11:09.349] Error at crostini_disk_io_perf.go:191: stress_rw with block size 1m failed: failed to run fio: fio: io_u error on file /home/testuser/fio_test_data: No space left on device: write offset=411041792, buflen=1048576

Another problem is that on some boards the returned fio result fails to parse:

2018/11/21 22:10:01 [22:10:01.154] Error at crostini_disk_io_perf.go:191: stress_rw with block size 16m failed: failed to parse fio result: invalid character 'i' in literal false (expecting 'a')

I'm adding more logging to see what happens.
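The parse failure makes sense if fio prints error lines on the same stream as its JSON report: Go's encoding/json sees the leading 'f' of "fio:", tries to read the JSON literal false, and fails at the 'i'. A minimal Go sketch of that failure mode and one way to salvage the report (the struct and helper names here are illustrative, not the test's actual code):

package main

import (
        "encoding/json"
        "errors"
        "fmt"
        "strings"
)

// fioResult models just enough of fio's --output-format=json report
// for this illustration.
type fioResult struct {
        Jobs []struct {
                Jobname string `json:"jobname"`
        } `json:"jobs"`
}

// parseFioOutput tolerates non-JSON error lines (e.g. "fio: pid=1307,
// got signal=9") that fio may emit before the JSON document by skipping
// ahead to the first '{'.
func parseFioOutput(out string) (*fioResult, error) {
        start := strings.Index(out, "{")
        if start < 0 {
                return nil, errors.New("no JSON object in fio output")
        }
        var res fioResult
        if err := json.Unmarshal([]byte(out[start:]), &res); err != nil {
                return nil, err
        }
        return &res, nil
}

func main() {
        out := "fio: pid=1307, got signal=9\n" + `{"jobs": [{"jobname": "stress_rw"}]}`
        // Naive parse reproduces the reported error.
        if err := json.Unmarshal([]byte(out), &fioResult{}); err != nil {
                fmt.Println(err) // invalid character 'i' in literal false (expecting 'a')
        }
        res, err := parseFioOutput(out)
        if err != nil {
                panic(err)
        }
        fmt.Println(res.Jobs[0].Jobname) // stress_rw
}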
Nov 29
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/c819d062d8f452db432d7a0c294d3ef7dd63cc50

commit c819d062d8f452db432d7a0c294d3ef7dd63cc50
Author: Cheng-Yu Lee <cylee@chromium.org>
Date: Thu Nov 29 20:11:18 2018

tast-tests: Write full multi-line fio error logs to a log file.

BUG=chromium:909934
TEST="tast --verbose run DUT vm.CrostiniDiskIOPerf"

Change-Id: Ic46770c5933221b662ff458cc4a80325e9bbe8a6
Reviewed-on: https://chromium-review.googlesource.com/1351971
Commit-Ready: Cheng-Yu Lee <cylee@chromium.org>
Tested-by: Cheng-Yu Lee <cylee@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/c819d062d8f452db432d7a0c294d3ef7dd63cc50/src/chromiumos/tast/local/bundles/cros/vm/crostini_disk_io_perf.go
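For context on why a CL like this helps: fio's failure output spans many lines, which is unwieldy inline in the test log. A rough sketch of the idea, assuming tast's testing.State and its OutDir accessor (the helper name and file layout are my assumptions, not the CL's actual code):

import (
        "io/ioutil"
        "path/filepath"

        "chromiumos/tast/testing"
)

// logFioOutput saves fio's full multi-line output under the test's
// output directory so only a short pointer needs to go into the log.
func logFioOutput(s *testing.State, jobName string, out []byte) {
        path := filepath.Join(s.OutDir(), jobName+"_fio.txt")
        if err := ioutil.WriteFile(path, out, 0644); err != nil {
                s.Logf("Failed to write fio log %s: %v", path, err)
        }
}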
Dec 5
Checked the detailed errors on various models. There are at least 4 types of failures:

1. stress_rw with block size 16m failed: failed to parse fio result: invalid character 'i' in literal false (expecting 'a')

This happens most frequently and on many boards. The parse error comes from error messages in the fio output like:

fio: pid=1307, got signal=9
fio: pid=1308, got signal=9
fio: pid=1305, got signal=9
fio: pid=1306, got signal=9
...

In the stress tests, 8 parallel fio jobs (processes) are created to simulate heavy IO load. When the block size is large, each fio process can take up to ~500M of memory; the memory each process consumes appears proportional to the block size (with bs=512k it's ~18M, with bs=1m ~33M, and with bs=16m it grows to 500+M; see the back-of-envelope sketch at the end of this comment). The kernel OOM killer then kills the processes that consume the most memory. Since we don't know why fio eats so much memory at larger block sizes, the easiest solution is to restrict the block size in the stress test.

2. Failed to run tests: context deadline exceeded, Test did not finish

Seen on some machines like setzer and bob. Although I set the running time of each time-based fio command to 10s, it frequently takes much longer than that to finish on some models. For example:

2018/12/04 22:29:45 [22:29:45.297] Running seq_write with bs 1m on host (1/3)
2018/12/04 22:29:55 [22:29:55.887] Reporting metric seq_write_bs_1m_write: 38820.0 47749.0 0.81
2018/12/04 22:29:55 [22:29:55.891] Running seq_write with block size 1m in container (2/3)
2018/12/04 22:30:16 [22:30:16.189] Running seq_write with bs 1m on host (2/3)
2018/12/04 22:30:26 [22:30:26.786] Reporting metric seq_write_bs_1m_write: 24120.0 47798.0 0.50
2018/12/04 22:30:26 [22:30:26.786] Running seq_write with block size 1m in container (3/3)
2018/12/04 22:30:41 [22:30:41.725] Running seq_write with bs 1m on host (3/3)
2018/12/04 22:30:52 [22:30:52.367] Reporting metric seq_write_bs_1m_write: 35384.0 47359.0 0.75

A fio command that should finish in ~10s can take up to ~20s when run in the container. I've locked some machines in the lab and found the problem can also happen when fio runs on the host:

2018/12/06 06:28:19 [14:28:20.199] Running seq_read with bs 4k on host (1/3)
2018/12/06 06:28:42 [14:28:42.898] Reporting metric seq_read_bs_4k_read: 10385.0 20601.0 0.50

I suspect the system is under heavy swap, but this needs more investigation. The easiest solution is to reduce the runtime of each time-based fio test (e.g., from 10s to 8s).

3. Failure dumping container log: exit status 1, rand_read with block size 16m failed: failed to run fio: : exit status 1, rand_read with block size 1m ...

Seen on hana, for example. The fio commands simply exit abnormally without writing anything to stdout or stderr, and there is no kernel OOM either. I don't have a clue yet; I need to lock a machine and look into the details.

4. rand_write with block size 16m failed: failed to run fio: fio: io_u error on file /home/testuser/fio_test_data: No space left on device: write offset=268435456, buflen=16777216, ...

Only seen on eve. The disk runs out of space when creating a 500M file even though there should be at least 1-2G of disk space left. I reported this while developing the test but couldn't find much of a clue.

I will try to solve problems 1 and 2 first.
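Back-of-envelope arithmetic for failure 1, using the per-process numbers observed above (the ~32x factor is inferred from those measurements, not from fio's documentation). With bs=16m and 8 jobs this lands around 4G total, plausibly enough to trigger the OOM killer on small-memory boards:

package main

import "fmt"

func main() {
        const (
                numJobs      = 8  // parallel fio processes in the stress test
                perJobFactor = 32 // observed per-process multiple of bs (~18M/512k, ~33M/1m, ~500M/16m)
        )
        for _, bs := range []int{512 << 10, 1 << 20, 16 << 20} { // 512k, 1m, 16m
                perJob := bs * perJobFactor
                fmt.Printf("bs=%dK: ~%dM per job, ~%dM total\n",
                        bs>>10, perJob>>20, (numJobs*perJob)>>20)
        }
}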
Dec 9
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/cca3b1cad3f65e646f396d47a35803f2962fc4bb

commit cca3b1cad3f65e646f396d47a35803f2962fc4bb
Author: Cheng-Yu Lee <cylee@chromium.org>
Date: Sun Dec 09 08:48:21 2018

tast-tests: Alter blocksize of vm.CrostiniDiskIOPerf to 4k, 16k, and 64k.

Large blocksize is impractical and a fio job can consume too much memory when blocksize grows. It often leads to kernel OOM when running the test with big blocksizes.

BUG=chromium:909934
TEST=locked a banon in lab and passed the test.

Change-Id: Ic93159d6313395aa95983591879b60aeb9ae4e50
Reviewed-on: https://chromium-review.googlesource.com/1365915
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Cheng-Yu Lee <cylee@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/cca3b1cad3f65e646f396d47a35803f2962fc4bb/src/chromiumos/tast/local/bundles/cros/vm/crostini_disk_io_perf.go
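A minimal sketch of what the change amounts to (the loop and helper are assumed shapes, not the actual test code): the test now iterates only over block sizes whose per-job memory footprint stays small.

package main

import "fmt"

// Block sizes after the CL: large sizes such as 1m and 16m are dropped
// because fio's per-job memory grows with the block size and can trigger
// the kernel OOM killer.
var blockSizes = []string{"4k", "16k", "64k"}

// runDiskIOJob is a hypothetical stand-in for the test's fio invocation.
func runDiskIOJob(job, bs string) {
        fmt.Printf("running fio job %s with bs=%s\n", job, bs)
}

func main() {
        for _, bs := range blockSizes {
                runDiskIOJob("stress_rw", bs)
        }
}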
Dec 14
Update: the fix is in, but a recent 9s problem caused all tests to fail (crbug/913540). That bug has been fixed as well, but the test lab hasn't picked up the change yet. Expect the test to go back to green soon.
Dec 21
The test was fixed on most models by reducing the memory usage and running time of each fio command: http://shortn/_udqASEWlha

However, on some slow models the test still fails to finish within the time limit. On setzer and reks it sometimes runs within the time limit and sometimes exceeds it, while on kefka it never finishes. celes is the worst: it easily hangs on a single fio command that never returns, and there could be other problems on celes as well.
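One mitigation for the hang on celes, sketched below, would be a per-command timeout so a stuck fio is killed and reported instead of consuming the test's whole deadline (the command-line flags and the 30s budget are illustrative assumptions, not the test's actual invocation):

package main

import (
        "context"
        "fmt"
        "os/exec"
        "time"
)

// runFioWithTimeout runs one fio command under its own deadline. If fio
// hangs, exec.CommandContext kills the process when the context expires,
// and the error surfaces here rather than as a whole-test timeout.
func runFioWithTimeout(ctx context.Context, args ...string) ([]byte, error) {
        // Allow some slack over fio's own --runtime, then kill the process.
        ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
        defer cancel()
        return exec.CommandContext(ctx, "fio", args...).CombinedOutput()
}

func main() {
        out, err := runFioWithTimeout(context.Background(),
                "--name=seq_write", "--rw=write", "--bs=64k",
                "--size=64m", "--runtime=8", "--time_based",
                "--output-format=json")
        if err != nil {
                fmt.Println("fio failed:", err)
        }
        _ = out
}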