New issue
Advanced search Search tips

Issue 852993 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 823541



Sign in to add a comment

migration app has incorrectly marked all chromium.fyi builders as migrated

Project Member Reported by no...@chromium.org, Jun 14 2018

Issue description

migration app has incorrectly marked all chromium.fyi builders as migrated and closed the bugs. 
 

Comment 1 by efoo@chromium.org, Jun 14 2018

Labels: LUCI-Chromium LUCI-Blocker-Chromium

Comment 2 by no...@chromium.org, Jun 14 2018

https://pantheon-hourly-sso.corp.google.com/logs/viewer?project=luci-migration&organizationId=433637338589&minLogLevel=0&expandAll=false&timestamp=2018-06-03T08%3A38%3A00.000000000Z&customFacets&limitCustomFacetWidth=true&dateRangeStart=2018-06-03T07%3A38%3A00.000Z&dateRangeEnd=2018-06-03T09%3A38%3A00.000Z&interval=JUMP_TO_TIME&resource=gae_app&logName=projects%2Fluci-migration%2Flogs%2Fappengine.googleapis.com%252Frequest_log&scrollTimestamp=2018-06-03T08%3A54%3A00.445700000Z&filters=text%3A%22Afl%20Upload%20Linux%20ASan%22&advancedFilter=resource.type%3D%22gae_app%22%0Aresource.labels.zone%3D%22us10%22%0Aresource.labels.project_id%3D%22luci-migration%22%0Aresource.labels.version_id%3D%2215768-7b8a601%22%0Aresource.labels.module_id%3D%22default%22%0Atimestamp%3D%222018-06-03T08%3A54%3A00.445700000Z%22%0AinsertId%3D%225b13acae000d7b0b11a2ed9e%22

is the request that determined that the builders no longer exist. There are no errors related to chromium.fyi there, which means milo response didn't include those builders.

https://pantheon-hourly-sso.corp.google.com/logs/viewer?project=luci-milo&organizationId=433637338589&minLogLevel=0&expandAll=false&timestamp=2018-06-03T08%3A54%3A00.466557000Z&customFacets&limitCustomFacetWidth=true&dateRangeStart=2018-06-03T07%3A38%3A00.000Z&dateRangeEnd=2018-06-03T09%3A38%3A00.000Z&interval=JUMP_TO_TIME&resource=gae_app&logName=projects%2Fluci-milo%2Flogs%2Fappengine.googleapis.com%252Frequest_log&scrollTimestamp=2018-06-03T08%3A54%3A00.461477000Z&filters=text%3Aluci-migration&advancedFilter=resource.type%3D%22gae_app%22%0Aresource.labels.version_id%3D%223164-a268639%22%0Aresource.labels.module_id%3D%22default%22%0Aresource.labels.zone%3D%22us10%22%0Aresource.labels.project_id%3D%22luci-milo%22%0Atimestamp%3D%222018-06-03T08%3A54%3A00.461477000Z%22%0AinsertId%3D%225b13aca8000ea680605be0e6%22

is the milo request. I don't see any errors there either. It contains:
"Returning 83135 bytes"

https://pantheon-hourly-sso.corp.google.com/logs/viewer?project=luci-milo&organizationId=433637338589&minLogLevel=0&expandAll=false&timestamp=2018-06-03T08%3A54%3A00.466557000Z&customFacets&limitCustomFacetWidth=true&dateRangeStart=2018-06-03T07%3A38%3A00.000Z&dateRangeEnd=2018-06-03T09%3A38%3A00.000Z&interval=JUMP_TO_TIME&resource=gae_app&logName=projects%2Fluci-milo%2Flogs%2Fappengine.googleapis.com%252Frequest_log&scrollTimestamp=2018-06-03T07%3A54%3A00.896325000Z&filters=text%3Aluci-migration&advancedFilter=resource.type%3D%22gae_app%22%0Aresource.labels.zone%3D%22us10%22%0Aresource.labels.project_id%3D%22luci-milo%22%0Aresource.labels.version_id%3D%223164-a268639%22%0Aresource.labels.module_id%3D%22default%22%0Atimestamp%3D%222018-06-03T07%3A54%3A00.896325000Z%22%0AinsertId%3D%225b139e990001f7e25447b02f%22

is the same request a minute earlier. It contains "Returning 115298 bytes" which supports the theory.

Comment 3 by no...@chromium.org, Jun 14 2018

Cc: hinoka@chromium.org
this happened because milo stopped receiving pubsub messages from chromium.fyi for 2 hours
https://screenshot.googleplex.com/DdS28fhthwf
and decided that its builders no longer exist

this is because of heuristics builtin into milo

--

it is not obvious how to work around this behavior on migration app side. If milo says builder does not exist, migration app closes the bug. Humans are notified. Migration app does not expect a builder to reappear. It would mean that the bug needs to be reopened and humans are notified again.

Comment 4 by hinoka@chromium.org, Jun 14 2018

Milo's heuristics are that if it hasn't received a message for 1h+, the master is presumed dead.  Lacking any other signals, I think that's the best heuristic we can ask Milo to perform.

I've once had to delete a master (chromium.chrome) and then undo the deletion because problems were later discovered.  Giving luci-migration the ability to detect "back from the dead" masters might not be a bad idea.

Comment 5 by no...@chromium.org, Jun 14 2018

migration app can detect "back from the dead", but it means it will spam bugs, which would be considered incorrect by a migration app user. Here, the master wasn't actually deleted. It might be deadish for an hour, but it wasn't deleted. Another problem is that we've already allowed task force to mark bugs as fixed before builder is closed. The suggestion implies migration app will reopen bugs after humans close them.

WDYT about an *explicit* list of masters in milo, in a config? it would be changed as often, as we deleted masters (not that often).

Comment 6 by hinoka@chromium.org, Jun 14 2018

As someone who has been doing a lot of master deletions lately, I'm not a huge fan of an explicit list to delete in Milo, because that would inevitably become another source of inconsistency caused by human error.

Imagine if a client team wants to turn down a master (cit restart -s offline, and delete the master folder in build.git), but wonder why their data is inconsistent, and we tell them it is before they forgot to update a config line that is exclusive to Milo  (It's unlikely we can make a config change like this block the build.git CQ).

I prefer having machines in the loop here, whether that be to improve the heuristics of Milo, or add another data source.  But IMO an explicit list of masters in a config that Milo can read would inevitably become stale and require someone (most likely me) to manually maintain it perpetually.

Comment 7 by no...@chromium.org, Jun 14 2018

ok, can we increase the 1h to like 4h then? master deletions do not happen often. It seems that there is more harm if [milo returns incorrect data because a master or its network was unavailable for an hour] than [milo confirming that master does not exist with a 4h delay].

Comment 8 by hinoka@chromium.org, Jun 14 2018

4h sgtm.  

Comment 9 by no...@chromium.org, Jun 15 2018

efoo, please reopen those bugs (I don't have the list)

Comment 10 by no...@chromium.org, Jun 15 2018

Blockedon: 823541
apart from that, the remaining work is to unmark the builders as "migrated" on the migration app. However, we've already concluded that "migrated" shouldn't be a thing in the first place. The right thing to do is to fix bug 823541 which is a long overdue
Project Member

Comment 11 by bugdroid1@chromium.org, Jun 15 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/facf65b8cd0d715f8a0823cb00748697d6b21724

commit facf65b8cd0d715f8a0823cb00748697d6b21724
Author: Nodir Turakulov <nodir@google.com>
Date: Fri Jun 15 17:22:29 2018

[milo] increase masterExpiry to 4h

A master or its network may be offline for an hour. This leads milo to think
that master was deleted, it responds to buildbot API calls as "there are no
builders on this master" which leads other systems (e.g. luci-migration) to make
wrong decisions.

Increase expiration to 4h.

Bug:  852993 
Change-Id: I3f7ea76b652e2de667aff54fb828a96643657714
Reviewed-on: https://chromium-review.googlesource.com/1102191
Reviewed-by: Ryan Tseng <hinoka@chromium.org>
Commit-Queue: Nodir Turakulov <nodir@chromium.org>

[modify] https://crrev.com/facf65b8cd0d715f8a0823cb00748697d6b21724/milo/buildsource/buildbot/buildstore/master.go
[modify] https://crrev.com/facf65b8cd0d715f8a0823cb00748697d6b21724/milo/buildsource/buildbot/master_test.go

Comment 12 by no...@chromium.org, Jun 26 2018

Status: Fixed (was: Started)
The work in this bug is done. The remaining bit is bug 823541

Sign in to add a comment