Issue 796318

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Fuchsia
Pri: 1
Type: Bug




Fuchsia/x64 FYI bot reports "Success" but shows shard timed-out "exception" for net_unittests

Project Member Reported by w...@chromium.org, Dec 19 2017

Issue description

Recent builds, e.g.:
https://ci.chromium.org/buildbot/chromium.fyi/Fuchsia/12344

are mostly reporting "Success", but also show a net_unittests shard as timed out:

net_unittests net_unittests
Run on OS: 'Ubuntu-14.04'
Shard duration: 0:12:56.199630
stdout
some shards did not complete: 0
swarming.summary
step_metadata
missing shard #0
shard #0
shard #0 isolated out

Looking at the stdout from the shard, it seems that it did complete successfully.

Perhaps the shard took longer than allowed, but was otherwise still regarded as having succeeded?
 

Comment 1 by w...@chromium.org, Dec 19 2017

Cc: jbudorick@chromium.org
Owner: sergeyu@chromium.org
Looks like this started with build: https://ci.chromium.org/buildbot/chromium.fyi/Fuchsia/12315

I see a Fuchsia roll by sergeyu@ in there: https://chromium.googlesource.com/chromium/src/+/1c610bb9ca28d976c23c0f7c4712ff0bdbb33503



This is the same as bug 796026.

Have a look at https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia%2F12344%2F%2B%2Frecipes%2Fsteps%2Fnet_unittests%2F0%2Fstdout .

There's no `fsync` but we need something like that before shutdown.
That said, I'm not sure why the build is green in this case. That looks like a bug in the recipe, or maybe in swarming.

Comment 4 by w...@chromium.org, Dec 19 2017

We're getting full stdio, but truncated JSON - does the recipe gather the
pass/fail data from the JSON or from stdio?
From the json. We ignore the stdio.
It's from the JSON (because otherwise we wouldn't have even bothered generating the json in the first place, since exfiltrating it was quite a hassle).
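Since the pass/fail data comes only from the JSON, a truncated summary file can hide the real result. A minimal sketch of how a consumer could tell a complete summary from a truncated one. The function name `load_summary` and the sample keys are hypothetical, not taken from the recipe or from TestLauncher's actual output format:

```python
import json


def load_summary(text):
    """Return the parsed summary dict, or None if text is not complete JSON."""
    try:
        summary = json.loads(text)
    except json.JSONDecodeError:
        # A file cut off mid-write fails to parse; treat it as missing.
        return None
    return summary if isinstance(summary, dict) else None


# Hypothetical summary content, for illustration only.
complete = '{"tests": {"Foo.Bar": "SUCCESS"}}'
truncated = complete[:20]  # simulates a file that was not fully flushed

print(load_summary(complete))
print(load_summary(truncated))
```

A caller that gets `None` back could then report the shard as failed rather than silently treating missing results as success.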
argh.
Oh, there's /system/bin/sync. I'll try that.

Comment 10 by bugdroid1@chromium.org, Dec 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ef62139da55ec66c148556896ca3bf0b23b886a0

commit ef62139da55ec66c148556896ca3bf0b23b886a0
Author: Sergey Ulanov <sergeyu@chromium.org>
Date: Wed Dec 20 21:50:36 2017

TestLauncher: fsync() summary JSON file

In some instances Fuchsia's test runner script fails to extract test
summary file. This change updates TestLauncher to call fsync(), which
may fix the problem for Fuchsia.

Bug:  796318 
Change-Id: Idb802c5e3047b2205a2606bcaea3d31008bc4935
Reviewed-on: https://chromium-review.googlesource.com/835052
Commit-Queue: Sergey Ulanov <sergeyu@chromium.org>
Reviewed-by: Wez <wez@chromium.org>
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Reviewed-by: Lei Zhang <thestig@chromium.org>
Reviewed-by: Scott Graham <scottmg@chromium.org>
Cr-Commit-Position: refs/heads/master@{#525476}
[modify] https://crrev.com/ef62139da55ec66c148556896ca3bf0b23b886a0/base/test/launcher/test_results_tracker.cc
[modify] https://crrev.com/ef62139da55ec66c148556896ca3bf0b23b886a0/build/fuchsia/runner_common.py

The previous fix didn't solve the problem. I see the following error in the log:

[00699.252] 03759.03785> SUCCESS: all tests passed.
[00718.775] 03642.03669> VnodeMinfs::Sync Completion wait failure: -21
[00718.799] 03759.03785> [3:2004251015:1220/224602.520856:718799182:ERROR:test_launcher.cc(1159)] Failed to save test launcher output summary.

fsync() times out if it doesn't complete within 15 seconds. Given that the summary file for net_unittests is big (15MB), it may time out when the underlying filesystem is slow. Possible solutions:
 1. Create the minfs disk image under tmpfs on the host. Is /tmp mounted as tmpfs on swarming bots?
 2. Call fsync() more than once to force it to sync the FS, even if that takes more than 15 seconds.
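
The tmpfs question in option 1 can be answered on a Linux host by reading the mount table. A minimal sketch, assuming the standard /proc/mounts layout (device, mount point, filesystem type, options). The helper name `fs_type_of` and the sample table are made up for illustration and do not come from a swarming bot:

```python
def fs_type_of(mount_point, mounts_text):
    """Return the filesystem type mounted at mount_point, or None if absent.

    mounts_text uses the /proc/mounts layout: device, mount point, type, ...
    """
    fs_type = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            fs_type = fields[2]  # a later entry shadows an earlier one
    return fs_type


# Invented sample data, not taken from a real bot.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""

print(fs_type_of("/tmp", SAMPLE))
```

On a real host the second argument would be the contents of /proc/mounts, e.g. `fs_type_of("/tmp", open("/proc/mounts").read())`. If /tmp has no entry of its own, it is simply a directory on the root filesystem and the function returns `None`.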

Comment 12 by bugdroid1@chromium.org, Dec 21 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d6a88b16c1c35715a0dfba78e5878b8a77c48dd7

commit d6a88b16c1c35715a0dfba78e5878b8a77c48dd7
Author: Sergey Ulanov <sergeyu@chromium.org>
Date: Thu Dec 21 20:18:26 2017

[Fuchsia] Update test runner to use /tmp for output minfs image.

/tmp is expected to perform better on GCE when used as a backing
storage for minfs disk image, which should make fsync() less
likely to timeout when writing output.json.

Bug:  796318 
Change-Id: I6a410d9b7284623ad5ad490225e4f489395a881d
Reviewed-on: https://chromium-review.googlesource.com/838511
Commit-Queue: Sergey Ulanov <sergeyu@chromium.org>
Reviewed-by: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#525784}
[modify] https://crrev.com/d6a88b16c1c35715a0dfba78e5878b8a77c48dd7/build/fuchsia/runner_common.py


Comment 13 by bugdroid1@chromium.org, Dec 21 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/11400a4b68ae249b919ba918d9d302ff463c9c63

commit 11400a4b68ae249b919ba918d9d302ff463c9c63
Author: Sergey Ulanov <sergeyu@chromium.org>
Date: Thu Dec 21 21:44:13 2017

Revert "[Fuchsia] Update test runner to use /tmp for output minfs image."

This reverts commit d6a88b16c1c35715a0dfba78e5878b8a77c48dd7.

Reason for revert: net_unittests still fails.

Original change's description:
> [Fuchsia] Update test runner to use /tmp for output minfs image.
> 
> /tmp is expected to perform better on GCE when used as a backing
> storage for minfs disk image, which should make fsync() less
> likely to timeout when writing output.json.
> 
> Bug:  796318 
> Change-Id: I6a410d9b7284623ad5ad490225e4f489395a881d
> Reviewed-on: https://chromium-review.googlesource.com/838511
> Commit-Queue: Sergey Ulanov <sergeyu@chromium.org>
> Reviewed-by: Wez <wez@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#525784}

TBR=wez@chromium.org,sergeyu@chromium.org

Change-Id: Ieb709b65af577712862b8538d3e33a2f8aed81ee
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug:  796318 
Reviewed-on: https://chromium-review.googlesource.com/841242
Reviewed-by: Sergey Ulanov <sergeyu@chromium.org>
Commit-Queue: Sergey Ulanov <sergeyu@chromium.org>
Cr-Commit-Position: refs/heads/master@{#525816}
[modify] https://crrev.com/11400a4b68ae249b919ba918d9d302ff463c9c63/build/fuchsia/runner_common.py

Comment 14 by w...@chromium.org, Dec 21 2017

Shall we try adding some limited number of Flush() calls, each with a warning logged to stdio? If we can confirm that that "fixes" things then the timings, and OS-level tracing, may help diagnose the issue.

Comment 15 by bugdroid1@chromium.org, Dec 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/47148862c71e160dc651033322f5869c10039dd8

commit 47148862c71e160dc651033322f5869c10039dd8
Author: Sergey Ulanov <sergeyu@chromium.org>
Date: Fri Dec 22 02:27:17 2017

Retry fsync() when writing test summary file.

On Fuchsia fsync() times out after 15 seconds, which may not be enough.
Retry fsync() multiple times - this will allow to verify that the issue
is caused by disk IO being slow and not something else.

Bug:  796318 
Change-Id: I54f491e5f7b08bc765573a5eda0acee2f3ed31a7
Reviewed-on: https://chromium-review.googlesource.com/841626
Reviewed-by: Lei Zhang <thestig@chromium.org>
Commit-Queue: Sergey Ulanov <sergeyu@chromium.org>
Cr-Commit-Position: refs/heads/master@{#525905}
[modify] https://crrev.com/47148862c71e160dc651033322f5869c10039dd8/base/test/launcher/test_results_tracker.cc
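
The retry idea in this commit, with a warning logged per failed attempt as suggested in Comment 14, can be sketched in Python. This is an illustration of the technique only, not the actual C++ change in test_results_tracker.cc. The names `fsync_with_retries` and `write_summary`, the default of 5 attempts, and the injectable `sync` parameter are all assumptions:

```python
import os


def fsync_with_retries(fd, max_attempts=5, sync=os.fsync):
    """Call sync(fd) until it succeeds; return the number of attempts used.

    Re-raises the last OSError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            sync(fd)
            return attempt
        except OSError as error:
            # Log each failure so timings are visible in the test output.
            print(f"WARNING: fsync attempt {attempt} failed: {error}")
            if attempt == max_attempts:
                raise


def write_summary(path, data, **kwargs):
    """Write data to path, then sync it to disk with retries."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # push Python's buffer to the OS before syncing
        return fsync_with_retries(f.fileno(), **kwargs)
```

The `sync` parameter exists only so the retry path can be exercised with a fake that fails a few times. Bounding the attempts keeps a genuinely broken filesystem from hanging the test run indefinitely.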

net_unittests is green now, with the hack I landed above: https://ci.chromium.org/buildbot/chromium.fyi/Fuchsia/12441

Keeping this bug open to implement a better solution. We could switch to a virtio drive, which may provide better performance under a nested VM. fsync() also shouldn't time out (opened ZX-1513 to track that issue).

Comment 17 by w...@chromium.org, Dec 23 2017

Let's close this out and file a separate bug for improving the output
summary gathering implementation?
Status: Fixed (was: Assigned)
Opened bug 798642 with more details of what can be improved.
