
Issue 866768

Starred by 5 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 3
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 869430




Master-release/LATEST-master=R69-10837.0.0, which blocks suite-scheduler kicking off ToT suites.

Project Member Reported by kinaba@chromium.org, Jul 24

Issue description

I'm watching this dashboard daily:
https://stainless.corp.google.com/search?status=GOOD&status=WARN&status=FAIL&status=ERROR&exclude_retried=true&exclude_cts=false&exclude_non_production=true&exclude_acts=true&exclude_non_release=true&exclude_au=true&test=cheets_CTS_N.7.1_r19.arm&exclude_not_run=false&row=build&col=test&view=matrix&first_date=2018-07-10&last_date=2018-07-24
and found that all tests from the arc-cts suite stopped running after R69-10895.0.0 (while bvt-arc results are accumulating for M70).



To me, the situation looks quite similar to bug 847540, which we encountered at the M69 branch point.

In that bug, the solution was to check chromeos-image-archive/master-release/LATEST-master instead of master-paladin,
but this time, as far as I can tell, master-release/LATEST-master has fallen far behind:

gs://chromeos-image-archive/master-paladin/LATEST-master == R70-10903.0.0-rc2
gs://chromeos-image-archive/master-release/LATEST-master == R69-10837.0.0
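To make the gap concrete, here is a minimal Python sketch that parses the two LATEST-master values above and reports how far master-release trails master-paladin. The version format (R<milestone>-<build>.<branch>.<patch>, with an optional -rc suffix) is inferred only from these two strings, and `parse_version` / `release_lag` are hypothetical helpers, not part of chromite or suite-scheduler.

```python
import re
from typing import NamedTuple

# Assumed format, inferred only from the two values quoted above:
# R<milestone>-<build>.<branch>.<patch>, optionally followed by -rc<N>.
_VERSION_RE = re.compile(r"R(\d+)-(\d+)\.(\d+)\.(\d+)(?:-rc\d+)?")


class CrosVersion(NamedTuple):
    milestone: int  # e.g. 70 for "R70-..."
    build: int      # e.g. 10903 for "R70-10903.0.0"
    branch: int
    patch: int


def parse_version(text: str) -> CrosVersion:
    """Parse a LATEST-master value such as 'R70-10903.0.0-rc2' (rc suffix ignored)."""
    match = _VERSION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return CrosVersion(*(int(group) for group in match.groups()))


def release_lag(paladin_latest: str, release_latest: str) -> tuple[int, int]:
    """Return (milestones behind, builds behind) of master-release vs. master-paladin."""
    paladin = parse_version(paladin_latest)
    release = parse_version(release_latest)
    return paladin.milestone - release.milestone, paladin.build - release.build


print(release_lag("R70-10903.0.0-rc2", "R69-10837.0.0"))  # (1, 66)
```

With these inputs, master-release is one milestone and 66 platform builds behind master-paladin.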
 
Owner: xixuan@chromium.org
Status: Assigned (was: Untriaged)
xixuan to make initial diagnosis
Issue 867091 has been merged into this issue.
Cc: -jrbarnette@chromium.org -ihf@chromium.org sjg@chromium.org wuchengli@chromium.org shu...@chromium.org
Owner: ayatane@chromium.org
Summary: Master-release/LATEST-master=R69-10837.0.0, which blocks suite-scheduler kicking off ToT suites. (was: arc-cts suite (and maybe other nightly suites) not scheduled on M70 since the branch)
The master-release version is used as ToT, as we discussed in issue 847540. Given this assumption, I don't see anything wrong in suite-scheduler; the only alternative is to switch the ToT back to master-paladin/LATEST-master.

The arc-cts suite is not only kicked off as a nightly suite. There are also weekly and new_build events that kick off arc-cts for different branches:

[ArcCtsStablePerBuild]
run_on: new_build
suite: arc-cts
branch_specs: ==tot-2

[ArcCtsBetaPerWeek]
run_on: weekly
suite: arc-cts
branch_specs: ==tot-1

[ArcCtsDevPerWeekSat]
run_on: weekly
suite: arc-cts
branch_specs: ==tot

[ArcCtsDevPerWeekWed]
run_on: weekly
suite: arc-cts
branch_specs: ==tot

[ArcCtsBetaNightly]
run_on: nightly
suite: arc-cts
branch_specs: ==tot-1

[ArcCtsDevPerWeekMon]
run_on: weekly
suite: arc-cts
branch_specs: ==tot

[ArcCtsDevPerWeekThu]
run_on: weekly
suite: arc-cts
branch_specs: ==tot

[ArcCtsDevPerWeekTue]
run_on: weekly
suite: arc-cts
branch_specs: ==tot

[ArcCtsPerBuild]
run_on: new_build
suite: arc-cts
branch_specs: ==tot

Because ToT=R69, none of the events on the tot branch can be kicked off. Issue 867091 has the same problem. If master-release moves to R70, this problem will be solved.
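A simplified Python model of how the branch_specs values above might be resolved against ToT. It assumes `==tot-N` simply means "milestone ToT minus N" and ignores everything else suite-scheduler checks (boards, build status, branch names), so treat it as an illustration of why the tot events stall, not as suite-scheduler's actual logic. `target_milestone` and `should_schedule` are hypothetical names.

```python
import re

# Hypothetical model: a spec "==tot-N" targets milestone (ToT - N).
_SPEC_RE = re.compile(r"==tot(?:-(\d+))?")


def target_milestone(branch_spec: str, tot_milestone: int) -> int:
    """Resolve a branch_specs value such as '==tot' or '==tot-2' to a milestone."""
    match = _SPEC_RE.fullmatch(branch_spec.strip())
    if match is None:
        raise ValueError(f"unsupported branch_specs value: {branch_spec!r}")
    offset = int(match.group(1) or 0)
    return tot_milestone - offset


def should_schedule(branch_spec: str, tot_milestone: int, build_milestone: int) -> bool:
    """True if a build on build_milestone matches the event's branch_specs."""
    return build_milestone == target_milestone(branch_spec, tot_milestone)


for spec in ("==tot", "==tot-1", "==tot-2"):
    # Stale ToT (R69, taken from master-release) vs. the real ToT (R70).
    print(spec, target_milestone(spec, 69), target_milestone(spec, 70))

# A new R70 build never matches '==tot' while ToT is stuck at R69.
print(should_schedule("==tot", tot_milestone=69, build_milestone=70))  # False
```

In this model, a stale ToT also shifts the `==tot-1` and `==tot-2` events one milestone back, so the beta and stable events would be aimed at older milestones too.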

Assigning to the deputy to decide whether we want to solve this by pushing master-release to the real ToT (cc sheriffs) or by switching ToT in suite-scheduler's settings.


Hokay, so what's the issue here?

OOB suites are tracking master-release since they test against release builds.

It sounds like master-release should be R70 (which is DEV?)

But why are the release builders trailing behind the CQ?  Shouldn't the CQ also track DEV?
Labels: Hotlist-Deputy
master-release/LATEST-master == R69-10837.0.0 because master-release has been red since 2018-07-02:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8942091101558299760.

If we get a green master-release run, ToT will advance.
Cc: pwang@chromium.org
Owner: jrbarnette@chromium.org
Cc: dgarr...@chromium.org
So it sounds like, per comment 6, this is blocked on https://bugs.chromium.org/p/chromium/issues/detail?id=869430, which appears to be what has been causing the master-release builder to fail since early July.



It seems this may be related to the migration to Swarming?
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>CI
Owner: ----
Status: Available (was: Assigned)
I'm not sure what's required here, but the LATEST-master files
are created by chromite; they're not in the purview of test
infrastructure.

So, passing to CI for evaluation.

Owner: athilenius@chromium.org
Status: Assigned (was: Available)
... Actually, this is probably modestly urgent, so let's pass to the CI Bobby
for prioritized evaluation.

Issue 868078 has been merged into this issue.
Is there any chance someone could dumb this down for me? This is all new territory.

So one of the builders makes an image (either the master or the release builder) and that image gets uploaded to Cloud Storage. Then, transitively, something called "suite scheduler" kicks off several suite executions on that image, like CTS (which is the Android compatibility test, right?). What machines are running suite scheduler? I also don't follow what "checking" master-paladin means. Is build status polled?
Suite-scheduler is a GAE service that schedules tests based on user needs, e.g. "schedule suite dummy on branch ToT for all boards".

So suite-scheduler needs a source to decide what the current ToT is, e.g. whether it's R69 or R70. Currently, it uses the latest passing master-release build as ToT:

xixuan@xixuan0:~/chromiumos/src/third_party/autotest/files$ gsutil cat gs://chromeos-image-archive/master-release/LATEST-master
R69-10837.0.0

So it's R69.

However, R69 is not what users expect; they want R70. But master-release has kept failing since July 2:

https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=master-release&buildBranch=

LATEST-master has not been updated since then.
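The pinning behavior described above can be modeled in a few lines of Python: LATEST-master only moves when a master-release run is green, so a streak of red runs freezes ToT at the last green version. This is a sketch of the observable behavior, not chromite's code; `replay_latest` is a hypothetical name, and apart from R69-10837.0.0 the versions in the run history are made up for illustration.

```python
from typing import Optional


def replay_latest(runs: list[tuple[str, bool]], initial: Optional[str] = None) -> Optional[str]:
    """Replay master-release runs oldest-first; only green runs update LATEST-master."""
    latest = initial
    for version, passed in runs:
        if passed:
            latest = version  # a green run publishes its version
        # a red run leaves the existing LATEST-master value untouched
    return latest


# Hypothetical run history: only R69-10837.0.0 comes from this bug.
runs = [
    ("R69-10837.0.0", True),
    ("R69-10850.0.0", False),
    ("R70-10880.0.0", False),
]
print(replay_latest(runs))  # R69-10837.0.0

# A single green run is enough to move ToT forward.
print(replay_latest(runs + [("R70-10910.0.0", True)]))  # R70-10910.0.0
```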


To solve this problem, we need at least one green run of the master-release builder; otherwise ToT will stay at R69 forever. Based on #9, the blocker for a green master-release run may be issue 869430.

Can we base it on something other than the LATEST files? Maybe CIDB?
Re #15: technically, yes. But most suites scheduled by suite-scheduler are based on a green release build. If master-release or some release builders stay red, we will skip many suites anyway.

I wonder whether there is any official indication of ToT in GE or elsewhere. But "master-release has been red for nearly a month" is not expected anyway...
Thank you, Xixuan, I appreciate the explanation! So if I understand correctly, my options are to find a way around using the LATEST file (maybe CIDB, as Don suggested), wait for a green run on master-release, or switch to the LATEST file from master-paladin?
1) "Switch to the LATEST from master-paladin": we don't want this long-term, since suite-scheduler targets release builds. But we can do it as a temporary fix. It's just a revert of https://chromium-review.googlesource.com/c/chromiumos/infra/suite_scheduler/+/1077329.

2) "Debug issue 869430 to make master-release green" is the right solution, since red release builders also prevent suite-scheduler from scheduling suites on them, which is unexpected anyway.

3) "Find a way around using the LATEST file" needs more work and discussion than option 1). For the purpose of adjusting ToT, we can simply adopt 1).

Cc: jrbarnette@chromium.org
Owner: jrbarnette@chromium.org
Yes, we need to look at issue 869430, but also, per jrbarnette on IRC just now, the majority of the slaves are failing due to DUT shortages and have been for a while. That needs to be fixed in parallel.

Alec will look at issue 869430 while we use this bug (or another?) to track fixing DUT shortages.

Owner: jclinton@chromium.org
> Yes, we need to look at  issue 869430  but also, per jrbarnette on
> IRC just now, the majority of the slaves are failing due to DUT
> shortages and have been for awhile. That needs to be fixed in parallel.

AFAICT, this problem isn't about the DUT shortages.  Certainly, there's
nothing in the bug history explaining a connection.  If I've understood
the ask here properly, it requires chromite changes, not device repairs
or Autotest changes.

Blockedon: 869430
Owner: jrbarnette@chromium.org
I'll focus on issue 869430 today.
Owner: athilenius@chromium.org
Richard, where are we tracking getting the DUT shortages fixed? Because that also blocks this bug.
> Richard, where are we tracking getting the DUT shortages fixed? Because that also blocks this bug.

There's no bug filed; there's a daily e-mail with a list of the problems;
I'm working my way through that to see what's what.

Do we have specific builders failing that we know are blocking this bug?
Although I can guess at some of the failures, the mail doesn't identify
problems by builder, but only by model.

The LATEST files are just files in GS. We could manually create/update one if we want as a temp workaround.

Sorry for taking so long to get up to speed; let me double-check my (new) understanding here: master-release has actual failing builders, but that might be okay because master-release is allowed to pass with failures in specific builders. The problem is that it's failing while generating links to builds, because it thinks they are still on Buildbot. So once Don's change lands, both of these problems might go away, as long as the failing builders are allowed to fail. In addition, some builders are failing because of provisioning issues, which might still block a successful run.
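The understanding summarized above (two independent ways for master-release to go red) can be sketched as a small Python predicate. The `allowed_to_fail` flag, the builder names, and the `master_steps_ok` parameter are hypothetical stand-ins; this is an illustration, not how chromite actually computes master status.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaveResult:
    builder: str
    passed: bool
    allowed_to_fail: bool  # hypothetical flag: failures here don't fail the master


def master_passes(results: list[SlaveResult], master_steps_ok: bool) -> bool:
    """Green only if the master's own steps (e.g. link generation) succeed
    and every failing slave is one that is allowed to fail."""
    return master_steps_ok and all(r.passed or r.allowed_to_fail for r in results)


# Hypothetical builder names, for illustration only.
results = [
    SlaveResult("board-a-release", passed=True, allowed_to_fail=False),
    SlaveResult("board-b-release", passed=False, allowed_to_fail=True),
]
print(master_passes(results, master_steps_ok=False))  # False: link step broken
print(master_passes(results, master_steps_ok=True))   # True: only a non-blocking slave failed

# A provisioning failure on a blocking builder still keeps the master red.
blocked = results + [SlaveResult("board-c-release", passed=False, allowed_to_fail=False)]
print(master_passes(blocked, master_steps_ok=True))   # False
```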
We had a successful run on master-release last night (thank you, dgarrett, for the links fix!). Does that resolve this issue?
The fix needs merging back.
I've issued a merge request in https://crbug.com/869430.
Still blocked on crbug.com/869430. Don has a fix that needs to be merged in; it's pending review but should go in soon.
Status: Fixed (was: Assigned)
This should be fixed when the next R69 build starts. Calling it fixed, since this was proven out on R70.
