
Issue 794707

Starred by 3 users

Issue metadata

Status: Duplicate
Merged: issue 797109
Owner:
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




betty-pre-cq flaky.

Project Member Reported by dgarr...@chromium.org, Dec 13 2017

Issue description

VMTests in the betty PreCQ are overly flaky.

Can we restructure VMTest retries to be more reliable? Perhaps by only retesting failures, instead of the entire test suite?
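To illustrate the idea of retesting only failures, here is a minimal sketch. It is not the existing VMTest retry code: the function `run_with_targeted_retries`, the `run_test` callback, and the `max_attempts` parameter are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Iterable, List


def run_with_targeted_retries(
    tests: Iterable[str],
    run_test: Callable[[str], bool],
    max_attempts: int = 2,
) -> Dict[str, int]:
    """Run every test once, then re-run only the tests that failed.

    Hypothetical helper, not part of autotest or chromite.
    Returns a map of test name -> attempt number on which the test
    passed, or 0 if it never passed within max_attempts.
    """
    results: Dict[str, int] = {name: 0 for name in tests}
    pending: List[str] = list(results)
    for attempt in range(1, max_attempts + 1):
        still_failing: List[str] = []
        for name in pending:
            if run_test(name):
                results[name] = attempt
            else:
                still_failing.append(name)
        # Only the failures carry over to the next attempt.
        pending = still_failing
        if not pending:
            break
    return results
```

With this shape, a single flaky test costs one extra test run instead of a second pass over the whole suite, so tests that already passed cannot fail on the retry.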

Here is an example failure with no CLs:

https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/pre_cq/72859


The overall success stats for betty-pre-cq over the last month:

betty-pre-cq: 74% successes, 4% timeouts, 2573 builds.


By way of comparison with another default PreCQ builder:

samus-no-vmtest-pre-cq: 87% successes, 3% timeouts, 2328 builds.

 
Cc: jkop@chromium.org
>  Perhaps by only retesting failures, instead of the entire test suite?

Tracked at Issue 750918.
Mergedinto: 750918
Status: Duplicate (was: Untriaged)
Unless someone has a better way to improve betty stability, these should be duped together, and the data here should boost priority of the original bug.

betty does appear to be the least stable of our PreCQ builders, though I would have to enhance "cros stats" slightly to prove that.
Status: Assigned (was: Duplicate)
I don't see these as duplicates. There is likely a legitimate, fixable product bug that contributes to VMTest flakiness. That can be resolved independently of the VMTest retry semantics.

Comment 4 by ihf@chromium.org, Dec 15 2017

Cc: norvez@chromium.org kinaba@chromium.org
Labels: M-65 OS-Chrome
There are two failures in the log above affecting two different CTS runs. Each of these runs has the same symptoms though:
1) Android comes up.
2) A basic connection via adb is established.
3) Android doesn't respond to tradefed.

Overall this looks like a product issue (apparently betty only?)

I could try harder to restart/recover Android if this is common. But presumably we have issues that crept into betty.

---

Stepping back, though: a builder that both builds and tests will always show the worst success rate. After all, it needs to build and also pass all tests.
Cc: ihf@chromium.org ayatane@chromium.org
Issue 785613 has been merged into this issue.

Comment 6 by norvez@chromium.org, Dec 15 2017

betty-pre-cq seems pretty flaky recently, but betty-paladin doesn't look bad, even though afaict they're running the same tests (smoke suite). Could it be the difference between baremetal and GCE?
That's a thought.

I would expect performance differences between the two, which could affect timing-sensitive tests.

There shouldn't be many other differences, but there could be a few, for example in network connections and behavior.

Comment 8 by ihf@chromium.org, Dec 18 2017

Status: Started (was: Assigned)
I looked at a bunch of logs and vmtest can take a long time to start Android. I will relax timeouts for betty.
I've filed crbug.com/795976 to create a betty-tot-paladin builder.

Comment 10 by ihf@chromium.org, Dec 19 2017

Looks like the Chrome/Android start times on my change's pre-cq/GCE run were 63s three times and 80s once. No wonder it is hard to pass when the first timeout was 60s (120s on the second attempt, which often failed because the VM had problems rebooting?).

This change increases the login timeouts:
https://chromium-review.googlesource.com/#/c/chromiumos/third_party/autotest/+/833502/

Note, though, that betty failure recovery (via reboot) is likely not functioning.

Comment 11 by bugdroid1@chromium.org, Dec 19 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/6ad5aba532a163eb5b5f34c4204e5a716797ce67

commit 6ad5aba532a163eb5b5f34c4204e5a716797ce67
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Tue Dec 19 12:35:28 2017

tradefed_test: tune login timeouts.

We are interested in fairly tight login timeouts for the CQ to not
wait too long in case of problems. But it appears that we need to
be able to relax the login timeout for some boards like betty,
which can run on slow GCE instances.

This change
- increases the regular Chrome login timeout from 60s to 90s.
- increases betty timeout from 60s to 300s.

BUG= chromium:794707 
TEST=pre-cq will test.

Change-Id: Ifeea56cd609395a052ea7ee059de450a504b73b2
Reviewed-on: https://chromium-review.googlesource.com/833502
Commit-Ready: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Po-Hsien Wang <pwang@chromium.org>

[modify] https://crrev.com/6ad5aba532a163eb5b5f34c4204e5a716797ce67/server/cros/tradefed_test.py
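The per-board timeout described in the commit message above could look roughly like the sketch below. This is not the actual code in tradefed_test.py: the names `DEFAULT_LOGIN_TIMEOUT_SECS`, `BOARD_LOGIN_TIMEOUT_SECS`, `login_timeout_for_board`, and `starts_in_time` are hypothetical. Only the values (60s before, 90s default, 300s for betty) and the observed start times (63s three times, 80s once) come from this thread.

```python
# Hypothetical names; only the numeric values come from the bug thread.
OLD_LOGIN_TIMEOUT_SECS = 60
DEFAULT_LOGIN_TIMEOUT_SECS = 90
BOARD_LOGIN_TIMEOUT_SECS = {
    # betty can run on slow GCE instances, so it gets a relaxed timeout.
    "betty": 300,
}


def login_timeout_for_board(board: str) -> int:
    """Return the login timeout in seconds for a board.

    Boards without an explicit override use the default.
    """
    return BOARD_LOGIN_TIMEOUT_SECS.get(board, DEFAULT_LOGIN_TIMEOUT_SECS)


def starts_in_time(start_secs: int, timeout_secs: int) -> bool:
    """True if a start that took start_secs finishes within the timeout."""
    return start_secs <= timeout_secs


# Start times reported in comment 10 for the pre-cq/GCE run.
observed_start_secs = [63, 63, 63, 80]
passed_old = [starts_in_time(s, OLD_LOGIN_TIMEOUT_SECS)
              for s in observed_start_secs]
passed_new = [starts_in_time(s, login_timeout_for_board("betty"))
              for s in observed_start_secs]
```

Under the old 60s limit none of the four observed starts would have completed in time on the first attempt, while all four fit comfortably within 300s. Keeping the override in a per-board table leaves the default tight for the CQ, as the commit message intends.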

Comment 12 by ihf@chromium.org, Dec 21 2017

Mergedinto: -750918 797109
Status: Duplicate (was: Started)
To make a long story short, the server hangs for 2-4 minutes at times.
