
Issue 882940


Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




Some build ids are missing from the cq_attempts BQ table but present in cq_events

Project Member Reported by liaoyuke@chromium.org, Sep 11

Issue description

In the first patchset of this CL (https://chromium-review.googlesource.com/c/chromium/src/+/1214081/1), the builder android-kitkat-arm-rel has a failed build: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-kitkat-arm-rel/77142. Its build id is 8935882956476713904.

However, when running the following query:
SELECT
  *
FROM
  `chrome-infra-events.aggregated.cq_attempts` AS ca
WHERE
  ca.issue = '1214081'
  AND ca.patchset = '1'

8935882956476713904 is not present in the contributing_buildbucket_ids.
 
Sorry, I'm not sure about the correct component to use.
Components: Infra>Platform>CQ
Owner: tandrii@chromium.org
Status: Assigned (was: Untriaged)
Interesting. If I query the raw CQ events, I can see it present in a bunch of CQ sent events:

SELECT
  issue, patchset, timestamp_millis, contributing_buildbucket_ids
FROM
  [chrome-infra-events:raw_events.cq]
WHERE
  issue = '1214081'
  AND patchset = '1'
  AND contributing_buildbucket_ids = 8935882956476713904
ORDER BY timestamp_millis ASC
LIMIT
  1000



However, I also don't see it in the cq_attempts table:

SELECT
  *
FROM
  [chrome-infra-events:aggregated.cq_attempts] AS ca
WHERE
  ca.issue = "1214081"
  AND ca.patchset = "1"
  AND contributing_bbucket_ids = 8935882956476713904;


This suggests a bug in the dataflow job itself. I wonder whether the failures to create new dataflow jobs on Sunday the 9th have something to do with this: https://screenshot.googleplex.com/V3DyzXpP4Qo
If not, then the bug is in the source code here:
https://cs.chromium.org/chromium/infra/packages/dataflow/cq_attempts.py?q=cq+dataflow+cq_attempt&sq=package:chromium&dr=C&l=10

Do you see any bug there?
Labels: -Pri-1 Pri-2
Owner: ----
Status: Available (was: Assigned)
Summary: Some build ids are missing from the cq_attempts BQ table but present in cq_events (was: Some build ids are missing from the cq_attempts BQ table)
Lowering priority and de-assigning. Among CQ bugs, this is Pri2.
And I must fix a few Pri1s. 

Unless someone volunteers to look into this, I intend to revisit this bug together with the rework of the CQ BQ schema.
Hi Andrii,

I just found out that for all the CQ attempts that have retries, only the build id of the first build is recorded in the cq_attempts table. For example, https://chromium-review.googlesource.com/c/chromium/src/+/1227495/1 has 3 builds on linux-chromeos-rel, but only one of them is present in the cq_attempts table.

Is this by design?
And this seems to apply to all the try jobs: if multiple builds of a builder are associated with an issue, then only the first build is recorded in the cq_attempts table.
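
To make the reported behaviour concrete, here is a minimal Python sketch contrasting "first build per builder" with "all builds per builder". It is not the code of the dataflow job: the event tuples, the build ids, and both function names are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical, simplified events: (builder, build_id) in arrival order.
# The build ids are made up for illustration.
events = [
    ("linux-chromeos-rel", 101),
    ("linux-chromeos-rel", 102),  # retry
    ("linux-chromeos-rel", 103),  # retry
    ("android-kitkat-arm-rel", 201),
]


def first_build_per_builder(events):
    """Keep only the first build id seen for each builder."""
    first = {}
    for builder, build_id in events:
        first.setdefault(builder, build_id)
    return first


def all_builds_per_builder(events):
    """Keep every distinct build id for each builder, in arrival order."""
    builds = defaultdict(list)
    for builder, build_id in events:
        if build_id not in builds[builder]:
            builds[builder].append(build_id)
    return dict(builds)


print(first_build_per_builder(events))
print(all_builds_per_builder(events))
```

With the sample data, the first function drops builds 102 and 103, which matches the symptom described above; the second keeps all three.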
liaoyuke@: I don't recall if it was by design or not.

In any case, your team are the primary consumers of this data. I'm personally in favor of changing it to suit your needs. So, I encourage you to modify the dataflow job to produce better data in cq_attempts (and btw, the history of that code may provide a clue, maybe...)
Sure, I'll look into it.

It is interesting that the build ids of the retries appear in the cq_attempts table now, just with a delay of a few hours.
tandrii@: what's the expected latency here for a build id to show up in cq_attempts?
The dataflow job is triggered every 3 hours. Looking at https://pantheon.corp.google.com/dataflow?project=chrome-infra-events, it appears the job runs for 40 minutes. So, a latency of approximately 4 hours is expected.

TBH, I don't know why the job is so slow. It seems like it loads the whole existing cq_events dataset and then computes cq_attempts. This appears quite wasteful to me.
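
The latency estimate above can be checked with a few lines of Python. The two input figures come from the thread; the assumption about when an event is picked up (only by the next triggered run) is a simplification and not a documented property of the job.

```python
from datetime import timedelta

trigger_interval = timedelta(hours=3)  # from the thread: job is triggered every 3 hours
job_runtime = timedelta(minutes=40)    # from the thread: observed run time

# Assumption: an event that arrives just after a run has read its input
# waits for the next trigger, then for that run to finish.
worst_case = trigger_interval + job_runtime

# Assumption: an event that arrives just before a trigger only waits for the run.
best_case = job_runtime

print(worst_case)  # 3:40:00
print(best_case)   # 0:40:00
```

Under these assumptions the worst case is 3 hours 40 minutes, which rounds to the "approximately 4 hours" quoted above.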
3 hours is too long for us.

Maybe it's better to derive the needed info from cq_events directly via a sql query? Is it possible?
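
As a rough illustration of the idea, the sketch below runs a query over raw events in an in-memory SQLite database. Everything here is an assumption made for the example: the local `cq_events` table, its one-build-id-per-row layout, and the sample rows are hypothetical. In the real BigQuery table, `contributing_buildbucket_ids` appears to hold several ids per event, so an equivalent query would presumably need to flatten that field first.

```python
import sqlite3

# In-memory stand-in for the raw events table (hypothetical, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cq_events ("
    "issue TEXT, patchset TEXT, timestamp_millis INTEGER, build_id INTEGER)"
)

# Made-up rows: the same build id can show up in several events.
rows = [
    ("1214081", "1", 1000, 11),
    ("1214081", "1", 2000, 11),
    ("1214081", "1", 3000, 12),
    ("1214081", "2", 4000, 13),
]
conn.executemany("INSERT INTO cq_events VALUES (?, ?, ?, ?)", rows)

# Every distinct build id for one patchset, ordered by first appearance.
query = """
SELECT build_id, MIN(timestamp_millis) AS first_seen_millis
FROM cq_events
WHERE issue = ? AND patchset = ?
GROUP BY build_id
ORDER BY first_seen_millis
"""
build_ids = [row[0] for row in conn.execute(query, ("1214081", "1"))]
print(build_ids)  # [11, 12]
```

Because the query reads the events directly, every build id is visible as soon as its event is stored, with no dependency on the periodic aggregation job.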
Project Member

Comment 12 by bugdroid1@chromium.org, Sep 18

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/4d7ff593e51a42149410366410129be75f925acf

commit 4d7ff593e51a42149410366410129be75f925acf
Author: Yuke Liao <liaoyuke@chromium.org>
Date: Tue Sep 18 22:51:40 2018

[Findit] Change sql query to use cq raw events table

This CL changes sql query to use cq raw events table instead of the
aggregated one because the aggregated one has about 4 days delay.

TBR=stgao@chromium.org

Bug: 882940
Change-Id: I96838bd91ddb33daa8d57defb24e55482840884f
Reviewed-on: https://chromium-review.googlesource.com/1231876
Commit-Queue: Yuke Liao <liaoyuke@chromium.org>
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>
Cr-Commit-Position: refs/heads/master@{#17680}
[modify] https://crrev.com/4d7ff593e51a42149410366410129be75f925acf/appengine/findit/services/flake_detection/flaky_tests.cq_false_rejection.sql
