New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 618330 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug



Sign in to add a comment

Android Tests bot occasionally fails telemetry_perf_unittests BattOrCases

Project Member Reported by timvolod...@chromium.org, Jun 8 2016

Issue description

https://uberchromegw.corp.google.com/i/chromium.linux/builders/Android%20Tests/builds/27146/steps/telemetry_perf_unittests%20on%20Android/logs/stdio

[27/177] benchmarks.benchmark_smoke_unittest.BenchmarkSmokeTest.BattOr.BattOrCases failed unexpectedly 8.4077s
 
Summary: Android Tests bot occasionally fails telemetry_perf_unittests BattOrCases (was: Android Tests bot occasionally fails telemetry_per_unittests BattOrCases)
in the log: seeing both 'device not found' and exceptions on Timeout threads, however that happens for other tests as well, not only for the 'BattOrCases', the latter is listed as failing in the end though.

  AdbCommandFailedError: (device: 06074a7c006236b6) adb shell '( pm path org.chromium.chrome );echo %$?': failed with exit status 255 and output:
  - error: device '06074a7c006236b6' not found
  
[26/177] benchmarks.benchmark_smoke_unittest.BenchmarkSmokeTest.page_cycler.typical_25 queued
[27/177] benchmarks.benchmark_smoke_unittest.BenchmarkSmokeTest.BattOr.BattOrCases failed unexpectedly 8.4077s:
Cc: rnep...@chromium.org
looks like this has been added recently (a week or so ago):
https://codereview.chromium.org/1984473002/

going to disable on android for now, adding the author of the above patch +cc: rnephew@chromium.org
Project Member

Comment 5 by bugdroid1@chromium.org, Jun 8 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e7e948c11a4f4d927ce6c92a5205d6dbd4660345

commit e7e948c11a4f4d927ce6c92a5205d6dbd4660345
Author: timvolodine <timvolodine@chromium.org>
Date: Wed Jun 08 16:08:39 2016

[Android] Disable flaky BattOrCases telemetry_perf_unittest

Flakes on Android Tests bot.
https://uberchromegw.corp.google.com/i/chromium.linux/builders/Android%20Tests

BUG= 618330 
CQ_EXTRA_TRYBOTS=tryserver.chromium.perf:android_s5_perf_cq;tryserver.chromium.perf:winx64_10_perf_cq;tryserver.chromium.perf:mac_retina_perf_cq
NOTRY=true
TBR=nednguyen@google.com,boliu@chromium.org

Review-Url: https://codereview.chromium.org/2049113002
Cr-Commit-Position: refs/heads/master@{#398583}

[modify] https://crrev.com/e7e948c11a4f4d927ce6c92a5205d6dbd4660345/tools/perf/benchmarks/battor.py

Cc: jbudorick@chromium.org stip@chromium.org
Components: -Infra Infra>Client>Android
cc-ing stip since we recently changed telemetry_perf_unittests.
I'd guess this is probably not due to the swarming transition.
Im wondering if it has something to do with this:
WARNING:root:Dependency battor_agent_binary not configured for platform linux_x86_64. Skipping prefetch.
But I would expect it to fail much differently than it is manifesting here in this case; either way this needs to be fixed but I dont expect it to be the route cause. I wonder if the real cause of the failure is being swallowed somewhere though.
Cc: alexandermont@chromium.org
Labels: -Pri-3 BattOr Pri-2
Owner: charliea@chromium.org
Status: Assigned (was: Untriaged)
Sorry - just saw this. This is definitely due to some failure in the power-infra code, not the swarming transition.

We have some code that detects whether a power monitoring device ("BattOr") is present, and only enables our benchmark if it is. It looks like this code is itself crashing, causing problems.

Assigning this to rnephew@, who's agreed to look into what we could be doing better here
Owner: rnep...@chromium.org
Cc: nednguyen@chromium.org
Anyone have any input on why it would not be retrying this test?
(Deleting unimportant lines)

Retrying failed tests (attempt #1 of 3)...

...
[1/22] benchmarks.benchmark_smoke_unittest.BenchmarkSmokeTest.BattOr.BattOrCases was skipped 2.8147s
...
All 21 other tests retry.

The cause of this issue is that telemetry is getting a device not responding error before it gets to the point where it determines if a battor is present and should continue with the run. Then no retries happen because in later attempts it detects that the test should be skipped. The problem is not with the battor benchmark itself, as the device offline error occurs during the FindBrowser step of a telemetry run.

The fix here is either to change how it is determined when this test runs (it is currently determined here for the battor test):
https://cs.chromium.org/chromium/src/tools/perf/benchmarks/battor.py?dr=CSs&q=ShouldDisable+battor&sq=package:chromium&l=22

or make changes to how failures are detected to detect this edge case.

Adding Ned on guidance for option one.

Ignore the first line of the last comment, I was writing it while I investigated and didn't clean up that line. I know why its not retrying, as the rest of the comment says.
Cc: dpranke@chromium.org
Thanks for your analysis, Randy. The unittest benchmark_smoke_unittest.BenchmarkSmokeTest.BattOr.BattOrCases though is generated dynamically through https://cs.chromium.org/chromium/src/tools/perf/benchmarks/benchmark_smoke_unittest.py?rcl=0&l=76 

So this seems like a problem with typ does not cache the test case that was dynamically generated?
/bump

Is this still happening? Any update on where we're at in fixing it?
It was disabled, and I found out WHY it was being flaky. I haven't done any work on fixing it because I am not sure that is the correct fix for it.
Owner: ----
Status: Available (was: Assigned)
We are bringing the android fyi bot online now, and it would be nice  if we could re-enable these tests on android. Ned said the problem was with typ not caching the test cases; could someone with more knowledge on typ take this over?

If no one is available that is familiar with typ I can take it back and work on it; it would just probably be faster if someone familiar took a look.
I can try to at least take a quick look at this tomorrow, and I should definitely have some time to look at this next week if need be.
Project Member

Comment 19 by bugdroid1@chromium.org, Jul 28 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/3bfff97f65f489046098d9b4fd60ecb99c7faa11

commit 3bfff97f65f489046098d9b4fd60ecb99c7faa11
Author: rnephew <rnephew@chromium.org>
Date: Thu Jul 28 17:32:06 2016

[Android] Reenable BattOr tests for android.

BUG=624962, 618330 
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.perf:android_s5_perf_cq;master.tryserver.chromium.perf:linux_perf_cq;master.tryserver.chromium.perf:mac_retina_perf_cq;master.tryserver.chromium.perf:winx64_10_perf_cq

Review-Url: https://codereview.chromium.org/2187183002
Cr-Commit-Position: refs/heads/master@{#408421}

[modify] https://crrev.com/3bfff97f65f489046098d9b4fd60ecb99c7faa11/tools/perf/benchmarks/battor.py
[modify] https://crrev.com/3bfff97f65f489046098d9b4fd60ecb99c7faa11/tools/perf/benchmarks/benchmark_smoke_unittest.py

Owner: nednguyen@chromium.org
Status: Assigned (was: Available)
Okay, I looked over this bug again, and I'm now realizing I don't actually know what is supposedly wrong with typ.

@nednguyen: why did you say that "this seems like a problem with typ does not cache the test case that was dynamically generated"?

From me looking at the code now, the "was skipped" result shows up when typ invokes a test, and it calls the TestCase.skip() method. So, it sounds like we retried the test, but the test itself decided it should be skipped.

What would you expect to have happened?
Hi Dirk,
The test benchmarks.benchmark_smoke_unittest.BenchmarkSmokeTest.BattOr.BattOrCase is supposed to be skipped when device does not has battor (See https://cs.chromium.org/chromium/src/tools/perf/benchmarks/battor.py?rcl=0&l=27 & https://cs.chromium.org/chromium/src/tools/perf/benchmarks/benchmark_smoke_unittest.py?rcl=0&l=75)

According to Randy's comment in #12, the test BattOr.BattOrCase is first run (which probable means the logic of determining whether the device has battor may be flaky), but failed. Then when typ retried the test, the test is skipped. So to summarize:

BattOr.BattOrCases run --> Failed
BattOr.BattOrCases run again in retries --> Skipped

All other failed test passed when retried.

Typ consider the whole suite as failed.

Based on this, the failure is caused by 2 bugs:
1) The check whether the device has battor device attached is flaky --> lead to flaky "skip vs not skip" state of the test case.
2) Even though the test case already skipped in retries, Typ still determines the test suite as failed.

I think the fix here include both 1) & 2)
Owner: dpranke@chromium.org
Ah, I see.

So, the test failed once, and never actually passed. Okay, let me think about how to handle this.
Owner: nednguyen@chromium.org
Okay, after having thought about this for a while, it's not actually obvious to me that typ is doing the wrong thing. After all, the test did fail once, and never actually passed.

However, I don't really feel that strongly about this, and there is a sense in which I can see how we should treat "failed then was skipped" like "failed then passed" and so in the sense of "the test didn't always fail" we might consider it to pass.

So, I've posted a CL that changes things to accomodate that:

https://codereview.chromium.org/2201663002/

It still feels a little weird to me, so I'm not sure I want to land it.

Ned, what do you think?
If I recall correctly, it is failing _before_ it gets to the part of code that detects a battor because of device flakiness. At least thats what I wrote in comment #12.
Owner: dpranke@chromium.org
Dirk: I think the interpretation "failed then was skipped" yields similar outcome to "failed then passed" is the elegant way to explain this decision. I think the weirdness come from the fact that the skip signal is issued dynamically during the test time instead of before test run time. 


#24: Sorry for missing this. But this shows that 1) in #21 is still a problem that need to be fixed, whereas 2) is probably not a problem. This makes it even a stronger case for making the fix in #23.

Assign to Dirk since he's initiated the fix. 
to be clear, my fix addresses 2), not 1).
Sorry for the mixed up, dpranke's fix addresses 2, and not 1) in #21.
Also 1) is an incorrect assessment. The test first failed because of the test android device failure, not because the battor discovering logic is flaky.
Status: Started (was: Assigned)
Labels: -Pri-2 Pri-1
Project Member

Comment 30 by bugdroid1@chromium.org, Aug 30 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/71c9ade4e91e16846779311f1ae800ed550c8b69

commit 71c9ade4e91e16846779311f1ae800ed550c8b69
Author: dpranke <dpranke@chromium.org>
Date: Tue Aug 30 19:25:32 2016

Roll typ to v0.9.7.

This picks up a change that will help address flaky telemetry failures.

This includes the following changes (v0.9.5..v0.9.7):

  79fe79db Change typ's interpretation of a test that first fails
           and is then skipped.
  101acd31 Bump version to 0.9.6, clean up Python 3 failures
  2f8787b8 rework run script to remove python3 log
  d5023636 rework sharding tests

TBR=rnephew@chromium.org, nednguyen@google.com
BUG= 618330 

Review-Url: https://codereview.chromium.org/2289303002
Cr-Commit-Position: refs/heads/master@{#415392}

[modify] https://crrev.com/71c9ade4e91e16846779311f1ae800ed550c8b69/third_party/typ/README.chromium
[add] https://crrev.com/71c9ade4e91e16846779311f1ae800ed550c8b69/third_party/typ/codereview.settings
[modify] https://crrev.com/71c9ade4e91e16846779311f1ae800ed550c8b69/third_party/typ/run
[modify] https://crrev.com/71c9ade4e91e16846779311f1ae800ed550c8b69/third_party/typ/typ/json_results.py
[modify] https://crrev.com/71c9ade4e91e16846779311f1ae800ed550c8b69/third_party/typ/typ/runner.py
[modify] https://crrev.com/71c9ade4e91e16846779311f1ae800ed550c8b69/third_party/typ/typ/tests/main_test.py
[modify] https://crrev.com/71c9ade4e91e16846779311f1ae800ed550c8b69/third_party/typ/typ/tests/pool_test.py
[modify] https://crrev.com/71c9ade4e91e16846779311f1ae800ed550c8b69/third_party/typ/typ/version.py

Owner: nednguyen@chromium.org
Status: Assigned (was: Started)
Okay, the typ fix has landed, so I'm punting this back to you telemetry folks :).
I'm not sure there is anything to do from the telemetry side now. We have the test blacklisted from running as part of a smoke test; and there has been active discussion on those taking too long for the CQ. 
Owner: dpranke@chromium.org
Status: Fixed (was: Assigned)
Dirk's fix should be enough if we re-enable the benchmark.
Project Member

Comment 34 by bugdroid1@chromium.org, Sep 6 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1023620c13385de9fcea55ec35f66dc22b454ff8

commit 1023620c13385de9fcea55ec35f66dc22b454ff8
Author: dpranke <dpranke@chromium.org>
Date: Tue Sep 06 01:51:33 2016

Revert of Roll typ to v0.9.7. (patchset #1 id:1 of https://codereview.chromium.org/2289303002/ )

Reason for revert:
Let's see if reverting this helps w/ crbug.com/643320.

Original issue's description:
> Roll typ to v0.9.7.
>
> This picks up a change that will help address flaky telemetry failures.
>
> This includes the following changes (v0.9.5..v0.9.7):
>
>   79fe79db Change typ's interpretation of a test that first fails
>            and is then skipped.
>   101acd31 Bump version to 0.9.6, clean up Python 3 failures
>   2f8787b8 rework run script to remove python3 log
>   d5023636 rework sharding tests
>
> TBR=rnephew@chromium.org, nednguyen@google.com
> BUG= 618330 
>
> Committed: https://crrev.com/71c9ade4e91e16846779311f1ae800ed550c8b69
> Cr-Commit-Position: refs/heads/master@{#415392}

TBR=nednguyen@google.com,rnephew@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG= 618330 , 643320

Review-Url: https://codereview.chromium.org/2317473002
Cr-Commit-Position: refs/heads/master@{#416573}

[modify] https://crrev.com/1023620c13385de9fcea55ec35f66dc22b454ff8/third_party/typ/README.chromium
[delete] https://crrev.com/8211426fca927c616b819505cc564a2de12c7715/third_party/typ/codereview.settings
[modify] https://crrev.com/1023620c13385de9fcea55ec35f66dc22b454ff8/third_party/typ/run
[modify] https://crrev.com/1023620c13385de9fcea55ec35f66dc22b454ff8/third_party/typ/typ/json_results.py
[modify] https://crrev.com/1023620c13385de9fcea55ec35f66dc22b454ff8/third_party/typ/typ/runner.py
[modify] https://crrev.com/1023620c13385de9fcea55ec35f66dc22b454ff8/third_party/typ/typ/tests/main_test.py
[modify] https://crrev.com/1023620c13385de9fcea55ec35f66dc22b454ff8/third_party/typ/typ/tests/pool_test.py
[modify] https://crrev.com/1023620c13385de9fcea55ec35f66dc22b454ff8/third_party/typ/typ/version.py

Project Member

Comment 35 by bugdroid1@chromium.org, Sep 8 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/00550736d827b1637ee38757d40f46a382953c79

commit 00550736d827b1637ee38757d40f46a382953c79
Author: nednguyen <nednguyen@google.com>
Date: Thu Sep 08 22:09:01 2016

Reland of ll typ to v0.9.7. (patchset #1 id:1 of https://codereview.chromium.org/2317473002/ )

Reason for revert:
This is probably not the root cause of crbug.com/643320

Original issue's description:
> Revert of Roll typ to v0.9.7. (patchset #1 id:1 of https://codereview.chromium.org/2289303002/ )
>
> Reason for revert:
> Let's see if reverting this helps w/ crbug.com/643320.
>
> Original issue's description:
> > Roll typ to v0.9.7.
> >
> > This picks up a change that will help address flaky telemetry failures.
> >
> > This includes the following changes (v0.9.5..v0.9.7):
> >
> >   79fe79db Change typ's interpretation of a test that first fails
> >            and is then skipped.
> >   101acd31 Bump version to 0.9.6, clean up Python 3 failures
> >   2f8787b8 rework run script to remove python3 log
> >   d5023636 rework sharding tests
> >
> > TBR=rnephew@chromium.org, nednguyen@google.com
> > BUG= 618330 
> >
> > Committed: https://crrev.com/71c9ade4e91e16846779311f1ae800ed550c8b69
> > Cr-Commit-Position: refs/heads/master@{#415392}
>
> TBR=nednguyen@google.com,rnephew@chromium.org
> # Not skipping CQ checks because original CL landed more than 1 days ago.
> BUG= 618330 , 643320
>
> Committed: https://crrev.com/1023620c13385de9fcea55ec35f66dc22b454ff8
> Cr-Commit-Position: refs/heads/master@{#416573}

TBR=rnephew@chromium.org,nednguyen@chromium.org,dpranke@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG= 618330 , 643320

Review-Url: https://codereview.chromium.org/2328753002
Cr-Commit-Position: refs/heads/master@{#417415}

[modify] https://crrev.com/00550736d827b1637ee38757d40f46a382953c79/third_party/typ/README.chromium
[add] https://crrev.com/00550736d827b1637ee38757d40f46a382953c79/third_party/typ/codereview.settings
[modify] https://crrev.com/00550736d827b1637ee38757d40f46a382953c79/third_party/typ/run
[modify] https://crrev.com/00550736d827b1637ee38757d40f46a382953c79/third_party/typ/typ/json_results.py
[modify] https://crrev.com/00550736d827b1637ee38757d40f46a382953c79/third_party/typ/typ/runner.py
[modify] https://crrev.com/00550736d827b1637ee38757d40f46a382953c79/third_party/typ/typ/tests/main_test.py
[modify] https://crrev.com/00550736d827b1637ee38757d40f46a382953c79/third_party/typ/typ/tests/pool_test.py
[modify] https://crrev.com/00550736d827b1637ee38757d40f46a382953c79/third_party/typ/typ/version.py

Sign in to add a comment