
Issue 763379


Issue metadata

Status: Fixed
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 2
Type: ----

Blocking:
issue 769263




telemetry_perf_unittests timing out when re-enabling smoke tests

Project Member Reported by rogerm@google.com, Sep 8 2017

Issue description

telemetry_perf_unittests failing on chromium.linux/Linux Tests

Builders failed on: 
- Linux Tests: 
  https://build.chromium.org/p/chromium.linux/builders/Linux%20Tests


Following a pair of commits:

- https://chromium-review.googlesource.com/654868
- https://chromium-review.googlesource.com/655299
 
Project Member

Comment 1 by bugdroid1@chromium.org, Sep 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e996f894d9e0a2a9cf7dfe67a2efda60cc8543df

commit e996f894d9e0a2a9cf7dfe67a2efda60cc8543df
Author: Juan Antonio Navarro Pérez <perezju@chromium.org>
Date: Fri Sep 08 13:58:01 2017

Revert "[tools/perf] Reenable multitab:misc:typical24 smoke test"

This reverts commit 748a8867042a160ddbec5945c19c9afe14e1362c.

Reason for revert: made telemetry_perf_unittests tip over some time limit

Original change's description:
> [tools/perf] Reenable multitab:misc:typical24 smoke test
> 
> The story may no longer be failing.
> 
> TBR=nednguyen@google.com
> 
> Bug:  698499 
> Change-Id: I9383bfee2e5882d75459e63a882b4e8fc10b2d3e
> Reviewed-on: https://chromium-review.googlesource.com/654868
> Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
> Commit-Queue: Juan Antonio Navarro Pérez <perezju@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#500567}

TBR=perezju@chromium.org

Change-Id: I7105c8fb94af1be724b1a13b82c7da81f9c9aecf
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug:  698499 , 763379 
Reviewed-on: https://chromium-review.googlesource.com/657658
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Commit-Queue: Juan Antonio Navarro Pérez <perezju@chromium.org>
Cr-Commit-Position: refs/heads/master@{#500581}
[modify] https://crrev.com/e996f894d9e0a2a9cf7dfe67a2efda60cc8543df/tools/perf/benchmarks/system_health_smoke_test.py

Cc: nedngu...@google.com
Status: Assigned (was: Available)
The error was "shard #6 timed out, took too much time to complete"

I assume we tipped over some time limit when re-enabling the stories.

For now I've disabled multitab:misc:typical24 again; let's see if that helps the bot recover.

I guess we'll need to decide between increasing the time limit or keeping that story permanently disabled.
Labels: -Pri-0 Pri-2
Project Member

Comment 4 by bugdroid1@chromium.org, Sep 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ed15e687ba59854b2124a8f6c2bda9e4bc5616e1

commit ed15e687ba59854b2124a8f6c2bda9e4bc5616e1
Author: Roger McFarlane <rogerm@chromium.org>
Date: Fri Sep 08 15:06:59 2017

Revert "[tools/perf] Re-enable meadia system health smoke tests"

This reverts commit 359bafdab6c2f69932618e2879f1f5689e08af59.

Reason for revert:

Seeing telemetry-perf bot failures.

Original change's description:
> [tools/perf] Re-enable meadia system health smoke tests
> 
> These are running fine on bots now.
> 
> Bug:  726439 
> Change-Id: I97a5f9e85d873c685af14085f9534081fd3a5ee5
> Reviewed-on: https://chromium-review.googlesource.com/655299
> Reviewed-by: Ned Nguyen <nednguyen@google.com>
> Commit-Queue: Juan Antonio Navarro Pérez <perezju@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#500564}

TBR=perezju@chromium.org,nednguyen@google.com

Change-Id: Ib3d38babf138c04ea32fa62cf257d20727593888
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug:  726439 ,  763379  
Reviewed-on: https://chromium-review.googlesource.com/657917
Reviewed-by: Roger McFarlane <rogerm@chromium.org>
Commit-Queue: Roger McFarlane <rogerm@chromium.org>
Cr-Commit-Position: refs/heads/master@{#500594}
[modify] https://crrev.com/ed15e687ba59854b2124a8f6c2bda9e4bc5616e1/tools/perf/benchmarks/system_health_smoke_test.py

Labels: -Sheriff-Chromium
Status Update:

The second revert has fixed the bots.

I'm removing the Sheriff-Chromium label.

I've synced with perezju@ who will investigate/fix/reland after the weekend.
Components: Speed>Benchmarks>Waterfall
Ned, do you know how tests are assigned to shards for telemetry_perf_unittests?

From a recent run [1] I see 12 shards with times ranging from 112 to 205 seconds. If these smoke tests could be added to the shards with lower load, would it all be fine?

Also what is the time limit for each shard to run?

[1]: https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.linux%2FLinux_Tests%2F62708%2F%2B%2Frecipes%2Fsteps%2Ftelemetry_perf_unittests%2F0%2Flogs%2Fswarming.summary%2F0
Summary: telemetry_perf_unittests timing out when re-enabling smoke tests (was: telemetry_perf_unittests failing on chromium.linux/Linux Tests)
Cc: jbudorick@chromium.org dpranke@chromium.org
The number of shards & timeout for Linux are configured in https://cs.chromium.org/chromium/src/testing/buildbot/chromium.linux.json?rcl=6569f043431d49051467dd9cc52df361034e3771&l=4180

+John/Dirk: can we just increase the number of shards here?
That 960-second timeout looks reasonable enough to hold the extra tests. I wonder why it failed on the previous attempt.

I'm going to go ahead and re-enable the media smoke tests, which should fit without much trouble.

Waiting to re-enable multitab:misc:typical24 after we see the impact of the media tests.
Before doing that, I'm also interested in what Juan asked in #7: how does telemetry_perf_unittests do sharding?

The current task execution timeout for each shard on this suite is 16 minutes. Over the past month, the 90th percentile shard execution time for this suite has never cracked 5 minutes. There was only one day where the maximum shard execution time for this suite was above 10 minutes, and that was this event.

My suspicion is that we should be looking at how we're sharding before arbitrarily increasing the shard count. We're not close to the currently allocated time.
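To put numbers on "not close to the allocated time", here is a small sketch using only figures quoted in this thread: the 960-second (16-minute) per-shard timeout and the 112 to 205 second range from the recent run. The function name `headroom` is made up for illustration, and the per-shard times between those two endpoints are not given here, so only the endpoints are used.

```python
from typing import Iterable, Tuple

# Per-shard timeout quoted in this thread: 960 seconds, i.e. 16 minutes.
TIMEOUT_SECONDS = 960


def headroom(shard_seconds: Iterable[float],
             timeout: float = TIMEOUT_SECONDS) -> Tuple[float, float]:
    """Return (slowest shard time, seconds left before the timeout)."""
    slowest = max(shard_seconds)
    return slowest, timeout - slowest


# Only the fastest and slowest shard times from the recent run are known.
slowest, spare = headroom([112, 205])
print(f"slowest shard: {slowest}s, spare time before timeout: {spare}s")
```

On those figures the slowest shard leaves 755 seconds spare, so a shard has to grow several times over before it hits the limit.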
(task data for #11 from an internal tool that I'm happy to share in a non-public setting)
... and my "before doing that" from #11 was in reference to increasing the number of shards as Ned proposed in #9. No objections to reenabling media smoke tests as Juan proposed in #10.
The telemetry_perf_unittests tests are sharded in contiguous chunks (https://github.com/catapult-project/catapult/blob/master/third_party/typ/typ/runner.py#L367)

It would be great if we could use a smarter sharding algorithm: either the greedy one (used by the gpu tests) or the cutting algorithm that Stephen proposed.
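As a rough illustration of the difference, here is a sketch contrasting contiguous-chunk sharding with a greedy, longest-test-first assignment. This is not the code in typ's runner.py nor the gpu test sharder. The function names, test names and durations are all hypothetical, and the greedy variant assumes per-test durations are known in advance.

```python
from typing import Dict, List

# Hypothetical test names and durations (seconds), listed in suite order.
DURATIONS: Dict[str, float] = {
    "a": 10, "b": 10, "slow_1": 300, "slow_2": 300, "c": 10, "d": 10,
}


def shard_contiguous(tests: List[str], total_shards: int) -> List[List[str]]:
    """Split tests into contiguous chunks of near-equal length."""
    chunk = -(-len(tests) // total_shards)  # ceiling division
    return [tests[i * chunk:(i + 1) * chunk] for i in range(total_shards)]


def shard_greedy(durations: Dict[str, float],
                 total_shards: int) -> List[List[str]]:
    """Assign each test, longest first, to the currently lightest shard."""
    shards: List[List[str]] = [[] for _ in range(total_shards)]
    loads = [0.0] * total_shards
    for name in sorted(durations, key=durations.get, reverse=True):
        lightest = loads.index(min(loads))
        shards[lightest].append(name)
        loads[lightest] += durations[name]
    return shards


def max_load(shards: List[List[str]], durations: Dict[str, float]) -> float:
    """Total duration of the slowest shard."""
    return max(sum(durations[t] for t in shard) for shard in shards)


contiguous = shard_contiguous(list(DURATIONS), 3)
greedy = shard_greedy(DURATIONS, 3)
print("contiguous:", contiguous, max_load(contiguous, DURATIONS))
print("greedy:    ", greedy, max_load(greedy, DURATIONS))
```

With this made-up data, the two slow tests are adjacent in suite order, so contiguous chunking puts both on one shard (600 seconds), while the greedy assignment separates them and the slowest shard takes 300 seconds.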
Project Member

Comment 15 by bugdroid1@chromium.org, Sep 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/71921771dba55d265b7f953a4c0809c36ee180a0

commit 71921771dba55d265b7f953a4c0809c36ee180a0
Author: Juan Antonio Navarro Pérez <perezju@chromium.org>
Date: Tue Sep 26 13:30:46 2017

Reland "[tools/perf] Re-enable meadia system health smoke tests"

This reverts commit ed15e687ba59854b2124a8f6c2bda9e4bc5616e1.

Reason for revert: Tests should now fit within the allotted time.

Original change's description:
> Revert "[tools/perf] Re-enable meadia system health smoke tests"
> 
> This reverts commit 359bafdab6c2f69932618e2879f1f5689e08af59.
> 
> Reason for revert:
> 
> Seeing telemetry-perf bot failures.
> 
> Original change's description:
> > [tools/perf] Re-enable meadia system health smoke tests
> > 
> > These are running fine on bots now.
> > 
> > Bug:  726439 
> > Change-Id: I97a5f9e85d873c685af14085f9534081fd3a5ee5
> > Reviewed-on: https://chromium-review.googlesource.com/655299
> > Reviewed-by: Ned Nguyen <nednguyen@google.com>
> > Commit-Queue: Juan Antonio Navarro Pérez <perezju@chromium.org>
> > Cr-Commit-Position: refs/heads/master@{#500564}
> 
> TBR=perezju@chromium.org,nednguyen@google.com
> 
> Change-Id: Ib3d38babf138c04ea32fa62cf257d20727593888
> No-Presubmit: true
> No-Tree-Checks: true
> No-Try: true
> Bug:  726439 ,  763379  
> Reviewed-on: https://chromium-review.googlesource.com/657917
> Reviewed-by: Roger McFarlane <rogerm@chromium.org>
> Commit-Queue: Roger McFarlane <rogerm@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#500594}

TBR=rogerm@chromium.org,perezju@chromium.org,nednguyen@google.com

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug:  726439 ,  763379 
Change-Id: I309dfdd0f7b32bd4dc275904fc24ef1e251f8c85
Reviewed-on: https://chromium-review.googlesource.com/684024
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Commit-Queue: Juan Antonio Navarro Pérez <perezju@chromium.org>
Cr-Commit-Position: refs/heads/master@{#504352}
[modify] https://crrev.com/71921771dba55d265b7f953a4c0809c36ee180a0/tools/perf/benchmarks/system_health_smoke_test.py

Project Member

Comment 16 by bugdroid1@chromium.org, Sep 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2f37bb9a989ac2130617f13b47984a8e972b7e81

commit 2f37bb9a989ac2130617f13b47984a8e972b7e81
Author: Marc Treib <treib@chromium.org>
Date: Tue Sep 26 15:48:01 2017

Revert "Reland "[tools/perf] Re-enable meadia system health smoke tests""

This reverts commit 71921771dba55d265b7f953a4c0809c36ee180a0.

Reason for revert: telemetry_perf_unittests failing again:
https://uberchromegw.corp.google.com/i/chromium.linux/builders/Linux%20Tests

Original change's description:
> Reland "[tools/perf] Re-enable meadia system health smoke tests"
> 
> This reverts commit ed15e687ba59854b2124a8f6c2bda9e4bc5616e1.
> 
> Reason for revert: Tests should now fit within the allotted time.
> 
> Original change's description:
> > Revert "[tools/perf] Re-enable meadia system health smoke tests"
> > 
> > This reverts commit 359bafdab6c2f69932618e2879f1f5689e08af59.
> > 
> > Reason for revert:
> > 
> > Seeing telemetry-perf bot failures.
> > 
> > Original change's description:
> > > [tools/perf] Re-enable meadia system health smoke tests
> > > 
> > > These are running fine on bots now.
> > > 
> > > Bug:  726439 
> > > Change-Id: I97a5f9e85d873c685af14085f9534081fd3a5ee5
> > > Reviewed-on: https://chromium-review.googlesource.com/655299
> > > Reviewed-by: Ned Nguyen <nednguyen@google.com>
> > > Commit-Queue: Juan Antonio Navarro Pérez <perezju@chromium.org>
> > > Cr-Commit-Position: refs/heads/master@{#500564}
> > 
> > TBR=perezju@chromium.org,nednguyen@google.com
> > 
> > Change-Id: Ib3d38babf138c04ea32fa62cf257d20727593888
> > No-Presubmit: true
> > No-Tree-Checks: true
> > No-Try: true
> > Bug:  726439 ,  763379  
> > Reviewed-on: https://chromium-review.googlesource.com/657917
> > Reviewed-by: Roger McFarlane <rogerm@chromium.org>
> > Commit-Queue: Roger McFarlane <rogerm@chromium.org>
> > Cr-Commit-Position: refs/heads/master@{#500594}
> 
> TBR=rogerm@chromium.org,perezju@chromium.org,nednguyen@google.com
> 
> # Not skipping CQ checks because original CL landed > 1 day ago.
> 
> Bug:  726439 ,  763379 
> Change-Id: I309dfdd0f7b32bd4dc275904fc24ef1e251f8c85
> Reviewed-on: https://chromium-review.googlesource.com/684024
> Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
> Commit-Queue: Juan Antonio Navarro Pérez <perezju@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#504352}

TBR=rogerm@chromium.org,perezju@chromium.org,nednguyen@google.com

Change-Id: Ib027b20abc609bab8a6269fe658ad5b135eb66b7
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug:  726439 ,  763379 
Reviewed-on: https://chromium-review.googlesource.com/685094
Reviewed-by: Marc Treib <treib@chromium.org>
Commit-Queue: Marc Treib <treib@chromium.org>
Cr-Commit-Position: refs/heads/master@{#504378}
[modify] https://crrev.com/2f37bb9a989ac2130617f13b47984a8e972b7e81/tools/perf/benchmarks/system_health_smoke_test.py

OK, after some digging, I found that the problem lies specifically in play:media:soundcloud, which blows up the run time of shard #6 from about 3 minutes to over the 15-minute timeout.

So, there is clearly something wrong with that story.

I'll re-enable the others for now before investigating a bit more.
Blocking: 769263
Project Member

Comment 19 by bugdroid1@chromium.org, Sep 27 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d3994c22b44a40d3aef04fb383c3ca76bf884d11

commit d3994c22b44a40d3aef04fb383c3ca76bf884d11
Author: Juan A. Navarro Perez <perezju@chromium.org>
Date: Wed Sep 27 13:35:57 2017

Re-enable media system health smoke tests

Re-enable both:
- load:media:soundcloud
- play:media:google_play_music

The following remains disabled as it causes timeouts:
- play:media:soundcloud

Bug:  763379 
Change-Id: I8bf40d45ec747aa027540dc4f6cc637b47f35c32
Reviewed-on: https://chromium-review.googlesource.com/686874
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Juan Antonio Navarro Pérez <perezju@chromium.org>
Cr-Commit-Position: refs/heads/master@{#504646}
[modify] https://crrev.com/d3994c22b44a40d3aef04fb383c3ca76bf884d11/tools/perf/benchmarks/system_health_smoke_test.py

The media stories started running fine on this bot:
https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=telemetry_perf_unittests&builder=chromium.linux%3ALinux%20Tests

I'll move on to re-enabling multitab:misc:typical24. From what I can gather, that test didn't actually fail when we tried to re-enable it last time.
Project Member

Comment 21 by bugdroid1@chromium.org, Sep 28 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2b4933396a571e992ca6b474949624153edc79ae

commit 2b4933396a571e992ca6b474949624153edc79ae
Author: Juan Antonio Navarro Pérez <perezju@chromium.org>
Date: Thu Sep 28 13:42:05 2017

Revert "Revert "[tools/perf] Reenable multitab:misc:typical24 smoke test""

This reverts commit e996f894d9e0a2a9cf7dfe67a2efda60cc8543df.

Reason for revert: Story should be able to run now.

Original change's description:
> Revert "[tools/perf] Reenable multitab:misc:typical24 smoke test"
> 
> This reverts commit 748a8867042a160ddbec5945c19c9afe14e1362c.
> 
> Reason for revert: made telemetry_perf_unittests tip over some time limit
> 
> Original change's description:
> > [tools/perf] Reenable multitab:misc:typical24 smoke test
> > 
> > The story may no longer be failing.
> > 
> > TBR=nednguyen@google.com
> > 
> > Bug:  698499 
> > Change-Id: I9383bfee2e5882d75459e63a882b4e8fc10b2d3e
> > Reviewed-on: https://chromium-review.googlesource.com/654868
> > Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
> > Commit-Queue: Juan Antonio Navarro Pérez <perezju@chromium.org>
> > Cr-Commit-Position: refs/heads/master@{#500567}
> 
> TBR=perezju@chromium.org
> 
> Change-Id: I7105c8fb94af1be724b1a13b82c7da81f9c9aecf
> No-Presubmit: true
> No-Tree-Checks: true
> No-Try: true
> Bug:  698499 , 763379 
> Reviewed-on: https://chromium-review.googlesource.com/657658
> Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
> Commit-Queue: Juan Antonio Navarro Pérez <perezju@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#500581}

TBR=perezju@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug:  698499 ,  763379 
Change-Id: I201f175ac822d6b51ad881a8e220b6e4f30e0397
Reviewed-on: https://chromium-review.googlesource.com/690194
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Commit-Queue: Juan Antonio Navarro Pérez <perezju@chromium.org>
Cr-Commit-Position: refs/heads/master@{#505004}
[modify] https://crrev.com/2b4933396a571e992ca6b474949624153edc79ae/tools/perf/benchmarks/system_health_smoke_test.py

.. aaand the multitab:misc:typical24 is now running too with no issues.

Closing this, will follow up on issue 769263 about play:media:soundcloud.
Status: Fixed (was: Assigned)
