New issue
Advanced search Search tips

Issue 838333 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: May 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 838005



Sign in to add a comment

ci_results_archiver chokes on tko_jobs with missing queued_time column

Project Member Reported by dgarr...@chromium.org, Apr 30 2018

Issue description

We are currently getting alerts for ArchiverExportRateLow and ArchiverSuccessRateLow.

The associated graphs show that the number of TKO jobs is zero. This seems likely to be related to the DB migration.
 
For page-level alerts, please post updates for this outage in the
corresponding Outalator entry:
https://o.corp.google.com/#Tickets:chrome-infra::::

Alert Details
------------------
Description:
Export rate of CI results archiver is very low.

name: ArchiverExportRateLow
current value: (0, 41.6666666667)
threshold: VAL[0] < VAL[1] for 3h
alert fields: {, metric:table=tko_jobs}

sent at: 2018-04-30 12:37:40
active since: 2018-04-26 20:07:22 (3 days 16 hours)
…


Useful Links
------------------
playbook:  
https://goto.google.com/chrome-os-infra-playbook#ArchiverExportRateLow
console: https://viceroy.corp.google.com/chrome_infra
silence:  
http://alertmanager.corp.google.com/#view=createSilence&query=alertname%3D%22ArchiverExportRateLow%22,monarch_metric_fields%3D%22metric:table%3Dtko_jobs%22,monarch_module_name%3D%22chromeos-infra-ci-results-archiver-alerts%22,monarch_target_fields%3D%22%5E%24%22,monitorname%3D%22monarch%22,service%3D%22chromeos-infra-alert-owners%22
alert manager:  
http://alertmanager.corp.google.com/#view=conditionSummary&query=alertname%3D%22ArchiverExportRateLow%22,+monarch_metric_fields%3D%22metric:table%3Dtko_jobs%22,+monarch_module_name%3D%22chromeos-infra-ci-results-archiver-alerts%22,+monarch_target_fields%3D%22%5E%24%22,+monitorname%3D%22monarch%22,+service%3D%22chromeos-infra-alert-owners%22

Query
------------------
graph:  
https://pcon.corp.google.com/p#chromeos-infra-alert-owners/queryplayground?query=mash&duration=1d&mash=%28Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/afe_jobs/export_count%27%29%2C%0A%20%20%20%20%20%20%20%7B%7D%29%0A%20%7C%20Window%28Delta%28%271h%27%29%29%0A%20%7C%20GroupBy%28%5B%5D%29%0A%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D0%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%0A%20%7C%20MapStreamId%28out_field_map%3D%7B%27metric%3Atable%27%3A%20Point%28%27afe_jobs%27%29%7D%2C%20out_target_schema_name%3D%27monarch.acquisitions.Task%27%29%2C%0A%20Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/cidb_builds/export_count%27%29%2C%0A%20%20%20%20%20%20%20%7B%7D%29%0A%20%7C%20Window%28Delta%28%271h%27%29%29%0A%20%7C%20GroupBy%28%5B%5D%29%0A%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D0%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%0A%20%7C%20MapStreamId%28out_field_map%3D%7B%27metric%3Atable%27%3A%20Point%28%27cidb_builds%27%29%7D%2C%20out_target_schema_name%3D%27monarch.acquisitions.Task%27%29%2C%0A%20Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/tko_jobs/export_count%27%29%2C%0A%20%20%20%20%20%20%20%7B%7D%29%0A%20%7C%20Window%28Delta%28%271h%27%29%29%0A%20%7C%20GroupBy%28%5B%5D%29%0A%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D0%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%0A%20%7C%20MapStreamId%28out_field_map%3D%7B%27metric%3Atable%27%3A%20Point%28%27tko_jobs%27%29%7D%2C%20out_target_schema_name%3D%27monarch.acquisitions.Task%27%29%29%0A%7C%20Union%28reducer%3DMin%28%29%29%0A%7C%20JoinWithLiteralTable%28fields%3D%28%27metric%3Atable%27%2C%29%2C%20input_default%3D0%2C%20streams%3D%5B%28%27afe_jobs%27%2C%2041.666666666666664%29%2C%20%28%27cidb_builds%27%2C%204.166666666666667%29%2C%20%28%27tko_jobs%27%2C%2041.666666666666664%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29&endtime=1524798442

For page-level alerts, please post updates for this outage in the
corresponding Outalator entry:
https://o.corp.google.com/#Tickets:chrome-infra::::

Alert Details
------------------
Description:
Success rate of CI results archiver is very low.

name: ArchiverSuccessRateLow
current value: 0
threshold: Lt(0.5) for 3h
alert fields: {, metric:table=tko_jobs}

sent at: 2018-04-30 12:37:40
active since: 2018-04-27 13:10:18 (2 days 23 hours)
…


Useful Links
------------------
playbook:  
https://goto.google.com/chrome-os-infra-playbook#ArchiverSuccessRateLow
console: https://viceroy.corp.google.com/chrome_infra
silence:  
http://alertmanager.corp.google.com/#view=createSilence&query=alertname%3D%22ArchiverSuccessRateLow%22,monarch_metric_fields%3D%22metric:table%3Dtko_jobs%22,monarch_module_name%3D%22chromeos-infra-ci-results-archiver-alerts%22,monarch_target_fields%3D%22%5E%24%22,monitorname%3D%22monarch%22,service%3D%22chromeos-infra-alert-owners%22
alert manager:  
http://alertmanager.corp.google.com/#view=conditionSummary&query=alertname%3D%22ArchiverSuccessRateLow%22,+monarch_metric_fields%3D%22metric:table%3Dtko_jobs%22,+monarch_module_name%3D%22chromeos-infra-ci-results-archiver-alerts%22,+monarch_target_fields%3D%22%5E%24%22,+monitorname%3D%22monarch%22,+service%3D%22chromeos-infra-alert-owners%22

Query
------------------
graph:  
https://pcon.corp.google.com/p#chromeos-infra-alert-owners/queryplayground?query=mash&duration=1d&mash=%28%28Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/afe_jobs/tick%27%29%2C%0A%20%20%20%20%20%20%20%20%7B%27metric%3Asuccess%27%3A%20True%7D%29%0A%20%20%7C%20Window%28Delta%28%273h%27%29%29%0A%20%20%7C%20GroupBy%28%5B%5D%29%0A%20%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D0%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%2C%0A%20%20Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/afe_jobs/tick%27%29%2C%0A%20%20%20%20%20%20%20%20%7B%7D%29%0A%20%20%7C%20Window%28Delta%28%273h%27%29%29%0A%20%20%7C%20GroupBy%28%5B%5D%29%0A%20%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D1%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%29%0A%20%7C%20Join%28left_default%3DNone%2C%20left_name%3D%27success%27%2C%20right_default%3DNone%2C%20right_name%3D%27total%27%29%0A%20%7C%20Point%28VAL%5B0%5D%20/%20VAL%5B1%5D%29%0A%20%7C%20MapStreamId%28out_field_map%3D%7B%27metric%3Atable%27%3A%20Point%28%27afe_jobs%27%29%7D%2C%20out_target_schema_name%3D%27monarch.acquisitions.Task%27%29%2C%0A%20%28Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/cidb_builds/tick%27%29%2C%0A%20%20%20%20%20%20%20%20%7B%27metric%3Asuccess%27%3A%20True%7D%29%0A%20%20%7C%20Window%28Delta%28%273h%27%29%29%0A%20%20%7C%20GroupBy%28%5B%5D%29%0A%20%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D0%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%2C%0A%20%20Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/cidb_builds/tick%27%29%2C%0A%20%20%20%20%20%20%20%20%7B%7D%29%0A%20%20%7C%20Window%28Delta%28%273h%27%29%29%0A%20%20%7C%20GroupBy%28%5B%5D%29%0A%20%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D1%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%29%0A%20%7C%20Join%28left_default%3DNone%2C%20left_name%3D%27success%27%2C%20right_default%3DNone%2C%20right_name%3D%27total%27%29%0A%20%7C%20Point%28VAL%5B0%5D%20/%20VAL%5B1%5D%29%0A%20%7C%20MapStreamId%28out_field_map%3D%7B%27metric%3Atable%27%3A%20Point%28%27cidb_builds%27%29%7D%2C%20out_target_schema_name%3D%27monarch.acquisitions.Task%27%29%2C%0A%20%28Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/tko_jobs/tick%27%29%2C%0A%20%20%20%20%20%20%20%20%7B%27metric%3Asuccess%27%3A%20True%7D%29%0A%20%20%7C%20Window%28Delta%28%273h%27%29%29%0A%20%20%7C%20GroupBy%28%5B%5D%29%0A%20%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D0%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%2C%0A%20%20Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/tko_jobs/tick%27%29%2C%0A%20%20%20%20%20%20%20%20%7B%7D%29%0A%20%20%7C%20Window%28Delta%28%273h%27%29%29%0A%20%20%7C%20GroupBy%28%5B%5D%29%0A%20%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D1%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%29%0A%20%7C%20Join%28left_default%3DNone%2C%20left_name%3D%27success%27%2C%20right_default%3DNone%2C%20right_name%3D%27total%27%29%0A%20%7C%20Point%28VAL%5B0%5D%20/%20VAL%5B1%5D%29%0A%20%7C%20MapStreamId%28out_field_map%3D%7B%27metric%3Atable%27%3A%20Point%28%27tko_jobs%27%29%7D%2C%20out_target_schema_name%3D%27monarch.acquisitions.Task%27%29%29%0A%7C%20Union%28reducer%3DMin%28%29%29&endtime=1524859818
Labels: -Pri-3 Pri-1
Owner: pprabhu@chromium.org
Status: Started (was: Untriaged)
stainless / ci_results_archiver stale data problems are P1.


Traceback (most recent call last):
  File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/ci_results_archiver/main.py", line 215, in <module>
    main()
  File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/ci_results_archiver/main.py", line 209, in main
    _BuilderMain(table_type=options.table_type, configs=configs)
  File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/ci_results_archiver/main.py", line 114, in _BuilderMain
    builder.Run()
  File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/venv/ci_results_archiver/archive_builder.py", line 154, in Run
    exported_table_suffixes = self._exporter.ExportNewEntries(entries)
  File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/venv/ci_results_archiver/bigquery_exporter.py", line 47, in ExportNewEntries
    entries, self._table_spec.partition_timestamp_column)
  File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/venv/ci_results_archiver/bigquery_exporter.py", line 91, in _PartitionByDate
    date = entry[timestamp_column].date()
AttributeError: 'NoneType' object has no attribute 'date'
ERROR 2018-04-30 14:03:03 scheduler [base.py:131] Job "tko_jobs (trigger: interval[0:20:00], next run at: 2018-04-30 21:20:00 UTC)" raised an exception
Traceback (most recent call last):
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-9b73359f13f2cf7b6ed8f86d2c7d6ff7/local/lib/python2.7/site-packages/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/ci_results_archiver/main.py", line 100, in _RunBuilder
    subprocess32.check_call(builder_argv, timeout=timeout.total_seconds())
  File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-9b73359f13f2cf7b6ed8f86d2c7d6ff7/local/lib/python2.7/site-packages/subprocess32.py", line 602, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-9b73359f13f2cf7b6ed8f86d2c7d6ff7/bin/python', '/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_arc
hiver/ci_results_archiver/main.py', '--config=/etc/ci_results_archiver/config.yaml', '--one-shot', '--table-type=tko_jobs']' returned non-zero exit status 1

What server did that come from?
ci_results_archiver is choking on this entry:

MySQL [chromeos_autotest_db]> select * from tko_jobs where job_idx = 176132539 \G
*************************** 1. row ***************************
          job_idx: 176132539
              tag: 195273431-chromeos-test/chromeos6-row2-rack18-host10
            label:
         username: chromeos-test
      machine_idx: 8082
      queued_time: NULL
     started_time: 2018-04-26 16:11:49
    finished_time: NULL
       afe_job_id: 195273431
afe_parent_job_id: 0
            build: NULL
    build_version: NULL
            suite: NULL
            board: NULL
1 row in set (0.04 sec)


because queued_time is NULL.
The logs for this job are at

gs://chromeos-autotest-results/195273431-chromeos-test/chromeos6-row2-rack18-host10/

tko/parse would have read the queued_time from keyval file here, which is surprisingly empty:
drone=cros-full-0010.mtv.corp.google.com
hostname=chromeos6-row2-rack18-host10
job_started=1524784309
label=
status_version=1
user=chromeos-test


Looks like that job died some horrific death immediately after starting. ci_results_archiver should skip such rows instead of choking on them.
Should we delete the row to get things working again, or harden the service, or both?
There are only two entries in this bad state:

MySQL [chromeos_autotest_db]> select * from tko_jobs where queued_time is NULL \G
*************************** 1. row ***************************
          job_idx: 176132539
              tag: 195273431-chromeos-test/chromeos6-row2-rack18-host10
            label:
         username: chromeos-test
      machine_idx: 8082
      queued_time: NULL
     started_time: 2018-04-26 16:11:49
    finished_time: NULL
       afe_job_id: 195273431
afe_parent_job_id: 0
            build: NULL
    build_version: NULL
            suite: NULL
            board: NULL
*************************** 2. row ***************************
          job_idx: 176356711
              tag: 195555511-chromeos-test/chromeos6-row4-rack24-host7
            label:
         username: chromeos-test
      machine_idx: 9461
      queued_time: NULL
     started_time: 2018-04-27 16:05:24
    finished_time: NULL
       afe_job_id: 195555511
afe_parent_job_id: 0
            build: NULL
    build_version: NULL
            suite: NULL
            board: NULL
2 rows in set (0.04 sec)


I killed those.

MySQL [chromeos_autotest_db]> delete from tko_tests where job_idx in (176132539, 176356711);
Query OK, 3 rows affected (0.10 sec)
MySQL [chromeos_autotest_db]> delete from tko_jobs where job_idx in (176132539, 176356711);
Query OK, 2 rows affected (0.06 sec)
ci_results_archiver is correctly exporting tko_db tables after my surgery in #9: http://shortn/_8i9fxWg7Dq
Summary: ci_results_archiver chokes on tko_jobs with missing queued_time column (was: ArchiverExportRateLow && ArchiverSuccessRateLow for TKO jobs.)

Comment 12 by nya@chromium.org, May 1 2018

Blocking: 838005
Project Member

Comment 13 by bugdroid1@chromium.org, May 1 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/infra/ci_results_archiver/+/20e1a250b7dfc624642f694eb9996c13ee9e75b2

commit 20e1a250b7dfc624642f694eb9996c13ee9e75b2
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Tue May 01 15:32:03 2018

Skip unpartitionable entries.

Exported bigquery tables are exported on some column from each of the
data sources. In case some rows do not have a valid entry in this
column, skip that row (reporting a metric) instead of choking on it.

BUG= chromium:838333 
TEST=unittests

Change-Id: I3f6df3746cc6216a80d59e4c9fef55445084ec0e
Reviewed-on: https://chromium-review.googlesource.com/1036523
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>

[modify] https://crrev.com/20e1a250b7dfc624642f694eb9996c13ee9e75b2/ci_results_archiver/bigquery_exporter_unittest.py
[modify] https://crrev.com/20e1a250b7dfc624642f694eb9996c13ee9e75b2/ci_results_archiver/bigquery_exporter.py
[modify] https://crrev.com/20e1a250b7dfc624642f694eb9996c13ee9e75b2/ci_results_archiver/archive_builder.py

Status: Fixed (was: Started)
Pending push-to-prod.

Sign in to add a comment