Issue 793993

Issue metadata

Status: Fixed
Owner:
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 1
Type: Bug




Redo linux_chromium_asan_rel_ng sharding

Project Member Reported by jam@chromium.org, Dec 11 2017

Issue description

This is usually the slowest bot; it looks like browser_tests shards take 30 minutes, when they should take around 10. I'm not sure why it got so much slower, e.g. did asan itself get slower?


 
Cc: martiniss@chromium.org
Components: Infra>Client>Chrome
Labels: OS-Android
Labels: -OS-Android OS-Linux
Attached: shard percentiles over the entire period for which we have data in chrome_infra.swarming_tasks

The data only goes back six months, but it shows gradual growth in shard time rather than a single regression or a small group of regressions.
Screenshot from 2017-12-11 17-03-56.png (34.0 KB)
The asan bot seems to be much slower than other bots. The swarming section of https://viceroy.corp.google.com/chrome_infra/Buildbot/buildbot?refresh=-1&duration=1d&job=master.tryserver.chromium.linux&builder=linux_chromium_asan_rel_ng shows runtimes for several test suites, all of which take much longer on asan than on other configurations. I've confirmed this manually for the top 4 problems (components_unittests, webkit_unit_tests, net_unittests, and content_browsertests), but not for browser_tests. browser_tests does take longer on this builder, but the difference is not as large.

browser_tests is not limiting the builder in most cases, as far as I can tell. I can double check this, but based on the graph in the viceroy console above, the other tests take longer, and slow down the builder.

I also made the same graph as John. It seems to have regressed gradually over time.

Comment 5 by jam@chromium.org, Dec 12 2017

Can someone take care of changing the sharding so this bot isn't so slow?
Yes, I'll get to it.
Cc: kcc@chromium.org p...@chromium.org
Interesting that we were looking at this independently.

I ran the following query in dremel to compare the swarming task times for a regular linux build and the asan build from the same time:

-- Per-step swarming duration and cost for one Linux Tests build and one ASan build.
SELECT tags_master AS master,
       tags_buildername AS builder,
       tags_stepname AS stepname,
       SUM(completed_ts - started_ts) AS dur,
       SUM(cost_usd) AS cost,
       (SUM(cost_usd) / COUNT(DISTINCT tags_build_id)) AS cost_per_build
FROM
  -- FLATTEN unrolls each repeated tags_* field so it can be filtered and grouped on.
  FLATTEN(FLATTEN(FLATTEN(FLATTEN(FLATTEN(chrome_infra.swarming_tasks.yesterday, tags_project),
                                          tags_master),
                                  tags_buildername),
                          tags_stepname),
          tags_build_id)
WHERE state = 'COMPLETED'
  AND tags_project = 'chromium'
  AND ((tags_master = 'chromium.linux' AND tags_buildername = 'Linux Tests' AND tags_build_id = '65487') OR
       (tags_master = 'chromium.memory' AND tags_buildername = 'Linux ASan LSan Tests (1)' AND tags_build_id = '40806'))
  -- One-day window starting 2017-12-11 UTC (PARSE_UTC_USEC returns microseconds; 86400 s = 1 day).
  AND completed_ts > (PARSE_UTC_USEC('2017-12-11') / 1000000)
  AND completed_ts < (PARSE_UTC_USEC('2017-12-11') / 1000000) + 86400
GROUP BY master, builder, stepname
ORDER BY master ASC, builder ASC, stepname ASC

https://ci.chromium.org/buildbot/chromium.linux/Linux%20Tests/65487
https://ci.chromium.org/buildbot/chromium.memory/Linux%20ASan%20LSan%20Tests%20%281%29/40807

There are steps that are 20x to 100x slower (or more), e.g. webkit_unit_tests goes from 20s to 40m!
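For reference, the slowdown factor is just the ratio of the two step durations. A minimal Python sketch, using only the webkit_unit_tests figures quoted above (the function name `slowdown` is made up for illustration and is not part of any Chromium tooling):

```python
def slowdown(baseline_seconds: float, asan_seconds: float) -> float:
    """Return how many times slower the ASan step is than the baseline step."""
    if baseline_seconds <= 0:
        raise ValueError("baseline duration must be positive")
    return asan_seconds / baseline_seconds


# webkit_unit_tests: roughly 20s on Linux Tests vs. roughly 40m on the ASan bot.
webkit_ratio = slowdown(baseline_seconds=20, asan_seconds=40 * 60)
print(f"webkit_unit_tests is about {webkit_ratio:.0f}x slower")
```

With those two numbers the ratio comes out to 120x, which is in the "or more" range.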

I don't think simply resharding is the right answer (perhaps obviously :). We need to file bugs and possibly disable test steps until we figure out why things are so much slower.

linux_v_asan_test_times.csv (8.0 KB)
I don't think that *solely* resharding is the right answer either, but I suspect it will have minimal effect on total runtime while reducing user-visible runtime, so it may be worthwhile as a stopgap.

(Disabling suites, or the bot, would do that too, but both are a bit more drastic.)
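
To make that trade-off concrete: splitting a suite across more shards leaves the summed machine time roughly unchanged (plus some per-shard overhead) but cuts the time a user waits for each shard. A rough Python sketch, assuming tests split evenly across shards and a fixed per-shard setup cost. The 60-second overhead and the starting count of 10 shards are made-up illustration values, not measurements from this bot; only the 30-minute and 10-minute figures come from the issue description.

```python
import math


def shard_times(total_test_seconds, shards, overhead_seconds=60.0):
    """Return (wall_clock_seconds, total_machine_seconds) for an even split."""
    if shards < 1:
        raise ValueError("need at least one shard")
    per_shard = total_test_seconds / shards + overhead_seconds
    return per_shard, per_shard * shards


def shards_for_target(total_test_seconds, target_seconds, overhead_seconds=60.0):
    """Smallest shard count whose per-shard time is at or below the target."""
    budget = target_seconds - overhead_seconds
    if budget <= 0:
        raise ValueError("target must exceed the per-shard overhead")
    return max(1, math.ceil(total_test_seconds / budget))


# Hypothetical suite: 10 shards at 30 minutes each, 60s of which is overhead.
total_test_seconds = 10 * (30 * 60 - 60)

before = shard_times(total_test_seconds, shards=10)
needed = shards_for_target(total_test_seconds, target_seconds=10 * 60)
after = shard_times(total_test_seconds, shards=needed)

print("shards needed:", needed)
print("wall clock before/after (s):", before[0], round(after[0], 1))
print("machine time before/after (s):", before[1], round(after[1], 1))
```

Under these assumptions, going from 10 to 33 shards brings each shard from 30 minutes to just under 10, while total machine time rises by about 8% because of the extra per-shard overhead.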
I agree.

I'm going to file a separate bug for the overall "figure out what the heck is wrong w/ asan" problem, and we can leave this for the short-term fixes.
Filed bug 794372 for the larger issue.

Comment 11 by bugdroid1@chromium.org, Dec 13 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5052d557b432ddf90d80f73a428418550dd7b24f

commit 5052d557b432ddf90d80f73a428418550dd7b24f
Author: John Budorick <jbudorick@chromium.org>
Date: Wed Dec 13 02:59:34 2017

Reshard egregiously long suites on linux_chromium_asan_rel_ng.

This reshards the suites w/ 90th percentile task times > 10 minutes
on linux_chromium_asan_rel_ng.

Bug:  793993 
Change-Id: Id269a3a2466af43956e11b00e25637e02ba5f410
Reviewed-on: https://chromium-review.googlesource.com/822183
Commit-Queue: John Budorick <jbudorick@chromium.org>
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Cr-Commit-Position: refs/heads/master@{#523668}
[modify] https://crrev.com/5052d557b432ddf90d80f73a428418550dd7b24f/testing/buildbot/chromium.memory.json
[modify] https://crrev.com/5052d557b432ddf90d80f73a428418550dd7b24f/testing/buildbot/test_suite_exceptions.pyl
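
The selection rule in the commit message (reshard suites whose 90th percentile task time exceeds 10 minutes) could be sketched as below. This is not the script used for the change. The sample durations, the suite name `some_fast_suite`, and the helper names `p90` and `suites_to_reshard` are hypothetical, and the percentile comes from Python's `statistics.quantiles`, which may interpolate differently from whatever produced the graphs in this thread.

```python
from statistics import quantiles

TEN_MINUTES = 10 * 60  # threshold from the commit message, in seconds


def p90(durations):
    """90th percentile of a list of task durations, in seconds."""
    # quantiles(n=10) returns 9 cut points; the last one is the 90th percentile.
    return quantiles(durations, n=10, method="inclusive")[-1]


def suites_to_reshard(task_seconds_by_suite, threshold=TEN_MINUTES):
    """Return the sorted names of suites whose p90 task time exceeds the threshold."""
    return sorted(
        suite
        for suite, durations in task_seconds_by_suite.items()
        if len(durations) >= 2 and p90(durations) > threshold
    )


# Made-up per-task durations in seconds, for illustration only.
sample = {
    "webkit_unit_tests": [2300, 2350, 2400, 2450, 2500],
    "some_fast_suite": [50, 55, 60, 65, 70],
}

print(suites_to_reshard(sample))
```

Using a high percentile rather than the mean keeps the rule focused on the slow shards that actually hold up a build.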

50th and 90th percentiles of browser_tests shard time by hour
Screenshot 2017-12-13 at 8.32.07 AM.png (121 KB)
50th and 90th percentiles of webkit_unit_tests shard time by hour
Screenshot 2017-12-13 at 8.33.36 AM.png (122 KB)
Status: Fixed (was: Assigned)
Per #10, investigation into the underlying cause of unexpected slowness on the ASAN bot will continue over in issue 794372. The immediate intervention here is done, though.
