
Issue 899991

Starred by 5 users

Issue metadata

Status: Duplicate
Owner: ----
Closed: Nov 23
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows, Mac
Pri: 2
Type: Bug




Swarming: 26 minute inputs overhead for layout test

Project Member Reported by martiniss@chromium.org, Oct 29

Issue description

https://chromium-swarm.appspot.com/task?id=40da2c1ac9df2510&refresh=10&show_raw=1 is a sample task. I've seen this more than a few times. 

This overhead is hard to discover: the swarming bot page at https://chromium-swarm.appspot.com/bot?id=build694-m4&sort_stats=total%3Adesc doesn't reflect it in the duration at all, and it also doesn't show up in the per-test numbers on the build page. I found it while looking at https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/173154. That build says:


Max shard duration: 0:42:01.316480 (shard #10)
Min shard duration: 0:14:22.323346 (shard #3)

This was confusing, because I didn't see that much of a difference in the times listed below:


shard #3 isolated out
shard #3 (794.0 sec)
shard #4 isolated out
shard #4 (813.6 sec)
shard #5 isolated out
shard #5 (984.3 sec)
shard #6 isolated out
shard #6 (837.4 sec)
shard #7 isolated out
shard #7 (942.8 sec)
shard #8 isolated out
shard #8 (935.6 sec)
shard #9 isolated out
shard #9 (784.8 sec)
shard #10 isolated out
shard #10 (951.4 sec)
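The discrepancy can be worked out directly from the numbers above; a minimal sketch (reported max duration from the build page, per-shard test runtime from the list above, both for shard #10):

```python
# Reported max shard duration from the build page (shard #10): 0:42:01.316480.
reported_max_s = 42 * 60 + 1.316480

# Longest per-shard test runtime listed under the test (shard #10).
listed_max_s = 951.4

# The difference is time the swarming task spent outside the test itself,
# dominated by downloading the isolated input files.
overhead_s = reported_max_s - listed_max_s
print(round(overhead_s / 60, 1))  # prints 26.2
```

That is where the "26 minute inputs overhead" in the summary comes from.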

This might also be happening on Windows bots. It could be causing tasks to take longer than necessary; I've been seeing elevated swarming task counts.
 
If it's Mac, then it's bug 851355.
Yeah, that looks like the issue. Is this known to be affecting Windows as well?
Labels: -Pri-1 OS-Windows Pri-2
Looks like the bots were able to execute all the swarming tasks, so no outage. I wonder whether it's known that Windows can have very large overheads as well.
Status: Available (was: Untriaged)
Cc: tikuta@chromium.org
Project Member

Comment 7 by bugdroid1@chromium.org, Nov 12

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1623ca39c23d59d987ce9064f1d90e1bcc85ea33

commit 1623ca39c23d59d987ce9064f1d90e1bcc85ea33
Author: Takuto Ikuta <tikuta@chromium.org>
Date: Mon Nov 12 15:16:08 2018

Specify binary directory only for perl used in webkit layout test on win

This is to reduce the number of files sent to swarming under the third_party/perl directory.
e.g.
https://isolateserver.appspot.com/browse?namespace=default-gzip&hash=91c35f024b8949020a3a25f3421f27689f35dff5
from
https://chromium-swarm.appspot.com/task?id=40da578fadf59310&refresh=10&show_raw=1

Bug:  899991 
Change-Id: I8608283399e817c8d3d8558782d67587a735e96a
Reviewed-on: https://chromium-review.googlesource.com/c/1331097
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Takuto Ikuta <tikuta@chromium.org>
Cr-Commit-Position: refs/heads/master@{#607227}
[modify] https://crrev.com/1623ca39c23d59d987ce9064f1d90e1bcc85ea33/BUILD.gn

Project Member

Comment 8 by bugdroid1@chromium.org, Nov 20

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/545b162322e64a82931bbe9c0fd03fbea6d94103

commit 545b162322e64a82931bbe9c0fd03fbea6d94103
Author: Takuto Ikuta <tikuta@chromium.org>
Date: Tue Nov 20 17:48:28 2018

[isolate] Increase archiveThreshold from 100kb to 1MB

I guess this threshold is too small when isolating webkit_layout_test.
There are too many files between 100kB and 1MB in size for this test, and due to the large number of files, webkit_layout_tests may have large overhead in swarming when downloading isolated input files.

Let me see whether the increased threshold mitigates this.

100kb archiveThreshold was introduced in https://codereview.chromium.org/2522803002

Bug:  899991 , 851355
Change-Id: I84218208655ff0a5e78e6f3a1bf3718378707a88
Reviewed-on: https://chromium-review.googlesource.com/c/1343920
Auto-Submit: Takuto Ikuta <tikuta@chromium.org>
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>

[modify] https://crrev.com/545b162322e64a82931bbe9c0fd03fbea6d94103/client/archiver/partitioning_walker_test.go
[modify] https://crrev.com/545b162322e64a82931bbe9c0fd03fbea6d94103/client/archiver/tarring_archiver.go
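The CL above raises the size cutoff the archiver uses to decide which files get batched into tar archives versus uploaded individually. A minimal sketch of that partitioning idea (the names and structure here are hypothetical, not the actual luci-go tarring archiver code):

```python
# Hypothetical sketch of a size-threshold partitioning strategy, NOT the
# real luci-go archiver: files at or below the threshold are batched into
# tar archives (fewer round-trips), larger files are transferred one by one.
ARCHIVE_THRESHOLD = 1 * 1024 * 1024  # raised from 100kB to 1MB by the CL

def partition(files):
    """files: list of (path, size_in_bytes) -> (batched_into_tar, individual)."""
    to_tar = [(p, s) for p, s in files if s <= ARCHIVE_THRESHOLD]
    individual = [(p, s) for p, s in files if s > ARCHIVE_THRESHOLD]
    return to_tar, individual

files = [("small.txt", 50_000), ("medium.png", 500_000), ("big.bin", 5_000_000)]
to_tar, individual = partition(files)
print(len(to_tar), len(individual))  # prints: 2 1
```

With the old 100kB threshold, the 500kB file would have been uploaded and downloaded individually; the point of the CL is that layout tests have a very large number of files in exactly that 100kB-1MB band.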

Components: Infra>Labs
Interestingly, the overhead here is on the download side;
https://chromium-swarm.appspot.com/task?id=412799e52a0d3710

This means we need to optimize the archiver to be more efficient. This is issue 854610.

That said, I suspect these machines have a serious disk issue, and I recommend redeploying the ones with significant overheads: HFS+ is known to exhibit really bad performance over time, and these are fairly old VMs.
Summary: Swarming: 26 minute inputs overhead for layout test (was: 26 minute overhead for layout test swarming task)
Cc: jpwilson@google.com
Issue 905012 has been merged into this issue.
It is important to note that *all* of the slow runs (Mac, Win7) I observed were in the Golo.

I really suspect heavy disk fragmentation here, and that VM recycling is the solution. The issue is that one can only create and delete 100k files so many times over the lifetime of a disk... Making this automatic is issue 855678.
Cc: c...@chromium.org jchin...@chromium.org martiniss@chromium.org st...@chromium.org
And just to be clear, I downloaded 91c35f024b8949020a3a25f3421f27689f35dff5 locally and it contains 167697 files. You can download it locally with:
./isolateserver.py download -I isolateserver.appspot.com -t foo -s 91c35f024b8949020a3a25f3421f27689f35dff5
find foo | wc -l

I can fine-tune as much as possible, but I cannot work miracles. Writing 167697 files on a crappy old VM *is* going to take time.

Right now the inputs are not sharded: all layout tests are mapped onto all shards. This is, to say the least, inefficient. It is worth a bug in itself, as this problem is highly specific to layout tests.
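The comment above notes that every shard currently downloads the full set of layout test inputs. A minimal sketch of what per-shard input partitioning could look like (purely illustrative; this is not how swarming or the layout test harness actually assigns inputs):

```python
# Illustrative sketch only: deterministically assign each test file to
# exactly one shard, so a shard only downloads its own slice of the inputs
# instead of the full isolated tree.
def inputs_for_shard(test_files, shard_index, total_shards):
    return [f for i, f in enumerate(sorted(test_files))
            if i % total_shards == shard_index]

tests = [f"fast/css/test-{n}.html" for n in range(10)]
shard0 = inputs_for_shard(tests, 0, 4)
shard1 = inputs_for_shard(tests, 1, 4)
# Each test lands on exactly one shard, so no shard downloads them all.
assert not set(shard0) & set(shard1)
```

In practice shared resources (harness scripts, baselines, fonts) would still need to be mapped onto every shard, which is part of what makes this layout-test specific.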
Mergedinto: 851355
Status: Duplicate (was: Available)
Duping into issue 851355 since it's all about mapping in too many files and it's the oldest bug.