Swarming: 26-minute input overhead for layout test
Issue description

https://chromium-swarm.appspot.com/task?id=40da2c1ac9df2510&refresh=10&show_raw=1 is a sample task. I've seen this more than a few times.

This overhead is hard to spot: the swarming bot page at https://chromium-swarm.appspot.com/bot?id=build694-m4&sort_stats=total%3Adesc doesn't show it in the duration at all, and it doesn't show up in the numbers under the test on the build page either.

I found this while looking at https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/173154. That build says:

Max shard duration: 0:42:01.316480 (shard #10)
Min shard duration: 0:14:22.323346 (shard #3)

Which was confusing, because I didn't see that much of a difference in the times listed below:

shard #3 (794.0 sec)
shard #4 (813.6 sec)
shard #5 (984.3 sec)
shard #6 (837.4 sec)
shard #7 (942.8 sec)
shard #8 (935.6 sec)
shard #9 (784.8 sec)
shard #10 (951.4 sec)

This might also be happening on Windows bots. It could be causing tasks to take longer than necessary; I've been seeing elevated swarming task counts.
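The gap isn't visible on the bot or build pages, but the raw task result does break it out. Below is a rough sketch (not an existing tool) that reads a task result saved as JSON, e.g. from the task page with show_raw=1, and compares the test run time against the isolated download/upload time. The field names (duration, performance_stats, isolated_download, isolated_upload) are assumptions about the result schema, not a verified API contract.

#!/usr/bin/env python
# Sketch: compare test runtime against isolate transfer time for one task,
# given a raw task result saved locally as JSON. Field names are assumed.
import json
import sys

def summarize(path):
    with open(path) as f:
        result = json.load(f)
    # Time spent actually running the test command.
    run_secs = float(result.get('duration', 0) or 0)
    perf = result.get('performance_stats', {})
    # Time spent downloading/uploading isolated files around the run.
    dl_secs = float(perf.get('isolated_download', {}).get('duration', 0) or 0)
    ul_secs = float(perf.get('isolated_upload', {}).get('duration', 0) or 0)
    print('run:      %8.1f s' % run_secs)
    print('download: %8.1f s' % dl_secs)
    print('upload:   %8.1f s' % ul_secs)
    if run_secs:
        print('overhead: %.0f%% of run time' % (100.0 * (dl_secs + ul_secs) / run_secs))

if __name__ == '__main__':
    summarize(sys.argv[1])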
Oct 29
If it's mac, then it's bug 851355.
Oct 29
Yeah, that looks like the issue. Is this known to be affecting Windows as well?
Oct 29
Looks like the bots were able to execute all the swarming tasks, so no outage. I wonder whether it's known that Windows can have very large overheads as well.
Nov 12
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1623ca39c23d59d987ce9064f1d90e1bcc85ea33

commit 1623ca39c23d59d987ce9064f1d90e1bcc85ea33
Author: Takuto Ikuta <tikuta@chromium.org>
Date: Mon Nov 12 15:16:08 2018

Specify binary directory only for perl used in webkit layout test on win

This is to reduce the number of files sent to swarming under third_party/perl directory.
e.g. https://isolateserver.appspot.com/browse?namespace=default-gzip&hash=91c35f024b8949020a3a25f3421f27689f35dff5
from https://chromium-swarm.appspot.com/task?id=40da578fadf59310&refresh=10&show_raw=1

Bug: 899991
Change-Id: I8608283399e817c8d3d8558782d67587a735e96a
Reviewed-on: https://chromium-review.googlesource.com/c/1331097
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Takuto Ikuta <tikuta@chromium.org>
Cr-Commit-Position: refs/heads/master@{#607227}

[modify] https://crrev.com/1623ca39c23d59d987ce9064f1d90e1bcc85ea33/BUILD.gn
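For context on why narrowing the perl dependency matters, here is a small illustrative script (not part of the CL) that tallies how many files and bytes a directory mapped into the isolate contributes; running it on third_party/perl versus just the binary subdirectory shows the reduction. The paths are whatever you pass on the command line; nothing here is taken from BUILD.gn.

#!/usr/bin/env python
# Sketch: count files and bytes under one or more directory trees, to see
# how much each mapped directory contributes to the isolated inputs.
import os
import sys

def tally(root):
    count, total = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.path.getsize(path)
            except OSError:
                continue  # skip broken symlinks and the like
            count += 1
    return count, total

if __name__ == '__main__':
    for root in sys.argv[1:]:
        count, total = tally(root)
        print('%s: %d files, %.1f MB' % (root, count, total / 1e6))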
Nov 20
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-go.git/+/545b162322e64a82931bbe9c0fd03fbea6d94103

commit 545b162322e64a82931bbe9c0fd03fbea6d94103
Author: Takuto Ikuta <tikuta@chromium.org>
Date: Tue Nov 20 17:48:28 2018

[isolate] Increase archiveThreshold from 100kb to 1MB

I guess this threshold is too small when isolating webkit_layout_test. There are too many files with sizes between 100kb and 1MB for that test, and due to the large number of files, webkit_layout_tests may have large overhead in swarming to download isolated input files. Let me see whether the increased threshold mitigates that.

The 100kb archiveThreshold was introduced in https://codereview.chromium.org/2522803002

Bug: 899991, 851355
Change-Id: I84218208655ff0a5e78e6f3a1bf3718378707a88
Reviewed-on: https://chromium-review.googlesource.com/c/1343920
Auto-Submit: Takuto Ikuta <tikuta@chromium.org>
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>

[modify] https://crrev.com/545b162322e64a82931bbe9c0fd03fbea6d94103/client/archiver/partitioning_walker_test.go
[modify] https://crrev.com/545b162322e64a82931bbe9c0fd03fbea6d94103/client/archiver/tarring_archiver.go
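As a rough illustration of what archiveThreshold controls (the real implementation lives in luci-go's client/archiver, in Go), the sketch below partitions a tree the way the commit message describes: files below the threshold would be bundled into a single tar item, larger files uploaded individually. Only the two threshold values come from the CL; the rest is an illustrative mock-up, not the actual archiver logic.

#!/usr/bin/env python
# Sketch of threshold-based partitioning: small files get grouped into one
# tar bundle, large files are treated as individual isolate items.
import os
import sys

ARCHIVE_THRESHOLD = 1 * 1024 * 1024  # bumped from 100 * 1024 in the CL

def partition(paths, threshold=ARCHIVE_THRESHOLD):
    to_tar, individual = [], []
    for path in paths:
        size = os.path.getsize(path)
        (to_tar if size < threshold else individual).append((path, size))
    return to_tar, individual

if __name__ == '__main__':
    root = sys.argv[1]
    paths = [os.path.join(d, f) for d, _, fs in os.walk(root) for f in fs]
    to_tar, individual = partition(paths)
    print('%d files would be tarred together, %d uploaded individually'
          % (len(to_tar), len(individual)))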
Nov 23
Interestingly, the overhead here is on the download side: https://chromium-swarm.appspot.com/task?id=412799e52a0d3710. That means we need to optimize the archiver to be more efficient; this is issue 854610. That said, I suspect these machines have a serious disk issue, and I recommend redeploying the ones with significant overheads, as HFS+ is known to exhibit really bad performance over time and these are fairly old VMs.
Nov 23
It's important to note that *all* of the slow runs (Mac, Win7) I observed were in Golo. I really suspect heavy disk fragmentation here, and VM recycling is the solution. The issue is that one can only create and delete 100k files so many times in the lifetime of a disk... Making this automatic is issue 855678.
Nov 23
And just to be clear, I downloaded 91c35f024b8949020a3a25f3421f27689f35dff5 locally and it is 167697 files. You can download it locally with:

./isolateserver.py download -I isolateserver.appspot.com -t foo -s 91c35f024b8949020a3a25f3421f27689f35dff5
find foo | wc -l

I can fine-tune as much as I can, but I cannot do miracles. Writing 167697 files on a crappy old VM *is* going to take time.

Right now the inputs are not sharded: all layout tests are mapped on all shards. That is, to say the least, inefficient. It's worth a bug in itself, as this problem is highly specific to layout tests.
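To put a number on the "writing 167697 files on an old VM is going to take time" theory, here is a quick-and-dirty benchmark sketch one could run on a suspect bot: it times creating a batch of small files spread over subdirectories, roughly what materializing an isolated tree does. The counts and sizes are arbitrary placeholders, not anything the isolate client actually uses; scale n up toward 167697 to approximate the layout test inputs.

#!/usr/bin/env python
# Sketch: time creation of many small files to check for a degraded disk.
import os
import shutil
import tempfile
import time

def bench_small_file_writes(n=10000, size=4096):
    workdir = tempfile.mkdtemp(prefix='smallfile_bench_')
    payload = b'x' * size
    start = time.time()
    for i in range(n):
        # Spread files over subdirectories, as an isolated tree would be.
        subdir = os.path.join(workdir, '%03d' % (i % 256))
        if not os.path.isdir(subdir):
            os.makedirs(subdir)
        with open(os.path.join(subdir, '%d.dat' % i), 'wb') as f:
            f.write(payload)
    elapsed = time.time() - start
    shutil.rmtree(workdir)
    print('%d files of %d bytes in %.1f s (%.0f files/s)'
          % (n, size, elapsed, n / elapsed))

if __name__ == '__main__':
    bench_small_file_writes()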
Nov 23
Duping into issue 851355 since it's all about mapping in too many files and it's the oldest bug.