migration app has incorrectly marked all chromium.fyi builders as migrated |
|||
Issue descriptionmigration app has incorrectly marked all chromium.fyi builders as migrated and closed the bugs.
,
Jun 14 2018
https://pantheon-hourly-sso.corp.google.com/logs/viewer?project=luci-migration&organizationId=433637338589&minLogLevel=0&expandAll=false×tamp=2018-06-03T08%3A38%3A00.000000000Z&customFacets&limitCustomFacetWidth=true&dateRangeStart=2018-06-03T07%3A38%3A00.000Z&dateRangeEnd=2018-06-03T09%3A38%3A00.000Z&interval=JUMP_TO_TIME&resource=gae_app&logName=projects%2Fluci-migration%2Flogs%2Fappengine.googleapis.com%252Frequest_log&scrollTimestamp=2018-06-03T08%3A54%3A00.445700000Z&filters=text%3A%22Afl%20Upload%20Linux%20ASan%22&advancedFilter=resource.type%3D%22gae_app%22%0Aresource.labels.zone%3D%22us10%22%0Aresource.labels.project_id%3D%22luci-migration%22%0Aresource.labels.version_id%3D%2215768-7b8a601%22%0Aresource.labels.module_id%3D%22default%22%0Atimestamp%3D%222018-06-03T08%3A54%3A00.445700000Z%22%0AinsertId%3D%225b13acae000d7b0b11a2ed9e%22 is the request that determined that the builders no longer exist. There are no errors related to chromium.fyi there, which means milo response didn't include those builders. https://pantheon-hourly-sso.corp.google.com/logs/viewer?project=luci-milo&organizationId=433637338589&minLogLevel=0&expandAll=false×tamp=2018-06-03T08%3A54%3A00.466557000Z&customFacets&limitCustomFacetWidth=true&dateRangeStart=2018-06-03T07%3A38%3A00.000Z&dateRangeEnd=2018-06-03T09%3A38%3A00.000Z&interval=JUMP_TO_TIME&resource=gae_app&logName=projects%2Fluci-milo%2Flogs%2Fappengine.googleapis.com%252Frequest_log&scrollTimestamp=2018-06-03T08%3A54%3A00.461477000Z&filters=text%3Aluci-migration&advancedFilter=resource.type%3D%22gae_app%22%0Aresource.labels.version_id%3D%223164-a268639%22%0Aresource.labels.module_id%3D%22default%22%0Aresource.labels.zone%3D%22us10%22%0Aresource.labels.project_id%3D%22luci-milo%22%0Atimestamp%3D%222018-06-03T08%3A54%3A00.461477000Z%22%0AinsertId%3D%225b13aca8000ea680605be0e6%22 is the milo request. I don't see any errors there either. It contains: "Returning 83135 bytes" https://pantheon-hourly-sso.corp.google.com/logs/viewer?project=luci-milo&organizationId=433637338589&minLogLevel=0&expandAll=false×tamp=2018-06-03T08%3A54%3A00.466557000Z&customFacets&limitCustomFacetWidth=true&dateRangeStart=2018-06-03T07%3A38%3A00.000Z&dateRangeEnd=2018-06-03T09%3A38%3A00.000Z&interval=JUMP_TO_TIME&resource=gae_app&logName=projects%2Fluci-milo%2Flogs%2Fappengine.googleapis.com%252Frequest_log&scrollTimestamp=2018-06-03T07%3A54%3A00.896325000Z&filters=text%3Aluci-migration&advancedFilter=resource.type%3D%22gae_app%22%0Aresource.labels.zone%3D%22us10%22%0Aresource.labels.project_id%3D%22luci-milo%22%0Aresource.labels.version_id%3D%223164-a268639%22%0Aresource.labels.module_id%3D%22default%22%0Atimestamp%3D%222018-06-03T07%3A54%3A00.896325000Z%22%0AinsertId%3D%225b139e990001f7e25447b02f%22 is the same request a minute earlier. It contains "Returning 115298 bytes" which supports the theory.
,
Jun 14 2018
this happened because milo stopped receiving pubsub messages from chromium.fyi for 2 hours https://screenshot.googleplex.com/DdS28fhthwf and decided that its builders no longer exist this is because of heuristics builtin into milo -- it is not obvious how to work around this behavior on migration app side. If milo says builder does not exist, migration app closes the bug. Humans are notified. Migration app does not expect a builder to reappear. It would mean that the bug needs to be reopened and humans are notified again.
,
Jun 14 2018
Milo's heuristics are that if it hasn't received a message for 1h+, the master is presumed dead. Lacking any other signals, I think that's the best heuristic we can ask Milo to perform. I've once had to delete a master (chromium.chrome) and then undo the deletion because problems were later discovered. Giving luci-migration the ability to detect "back from the dead" masters might not be a bad idea.
,
Jun 14 2018
migration app can detect "back from the dead", but it means it will spam bugs, which would be considered incorrect by a migration app user. Here, the master wasn't actually deleted. It might be deadish for an hour, but it wasn't deleted. Another problem is that we've already allowed task force to mark bugs as fixed before builder is closed. The suggestion implies migration app will reopen bugs after humans close them. WDYT about an *explicit* list of masters in milo, in a config? it would be changed as often, as we deleted masters (not that often).
,
Jun 14 2018
As someone who has been doing a lot of master deletions lately, I'm not a huge fan of an explicit list to delete in Milo, because that would inevitably become another source of inconsistency caused by human error. Imagine if a client team wants to turn down a master (cit restart -s offline, and delete the master folder in build.git), but wonder why their data is inconsistent, and we tell them it is before they forgot to update a config line that is exclusive to Milo (It's unlikely we can make a config change like this block the build.git CQ). I prefer having machines in the loop here, whether that be to improve the heuristics of Milo, or add another data source. But IMO an explicit list of masters in a config that Milo can read would inevitably become stale and require someone (most likely me) to manually maintain it perpetually.
,
Jun 14 2018
ok, can we increase the 1h to like 4h then? master deletions do not happen often. It seems that there is more harm if [milo returns incorrect data because a master or its network was unavailable for an hour] than [milo confirming that master does not exist with a 4h delay].
,
Jun 14 2018
4h sgtm.
,
Jun 15 2018
efoo, please reopen those bugs (I don't have the list)
,
Jun 15 2018
apart from that, the remaining work is to unmark the builders as "migrated" on the migration app. However, we've already concluded that "migrated" shouldn't be a thing in the first place. The right thing to do is to fix bug 823541 which is a long overdue
,
Jun 15 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-go.git/+/facf65b8cd0d715f8a0823cb00748697d6b21724 commit facf65b8cd0d715f8a0823cb00748697d6b21724 Author: Nodir Turakulov <nodir@google.com> Date: Fri Jun 15 17:22:29 2018 [milo] increase masterExpiry to 4h A master or its network may be offline for an hour. This leads milo to think that master was deleted, it responds to buildbot API calls as "there are no builders on this master" which leads other systems (e.g. luci-migration) to make wrong decisions. Increase expiration to 4h. Bug: 852993 Change-Id: I3f7ea76b652e2de667aff54fb828a96643657714 Reviewed-on: https://chromium-review.googlesource.com/1102191 Reviewed-by: Ryan Tseng <hinoka@chromium.org> Commit-Queue: Nodir Turakulov <nodir@chromium.org> [modify] https://crrev.com/facf65b8cd0d715f8a0823cb00748697d6b21724/milo/buildsource/buildbot/buildstore/master.go [modify] https://crrev.com/facf65b8cd0d715f8a0823cb00748697d6b21724/milo/buildsource/buildbot/master_test.go
,
Jun 26 2018
|
|||
►
Sign in to add a comment |
|||
Comment 1 by efoo@chromium.org
, Jun 14 2018