ci_results_archiver chokes on tko_jobs with missing queued_time column |
|||||
Issue descriptionWe are currently getting alerts for ArchiverExportRateLow and ArchiverSuccessRateLow. The associated graphs show that the number of TKO jobs is zero. This seems likely to be related to the DB migration.
,
Apr 30 2018
For page-level alerts, please post updates for this outage in the corresponding Outalator entry: https://o.corp.google.com/#Tickets:chrome-infra:::: Alert Details ------------------ Description: Success rate of CI results archiver is very low. name: ArchiverSuccessRateLow current value: 0 threshold: Lt(0.5) for 3h alert fields: {, metric:table=tko_jobs} sent at: 2018-04-30 12:37:40 active since: 2018-04-27 13:10:18 (2 days 23 hours) … Useful Links ------------------ playbook: https://goto.google.com/chrome-os-infra-playbook#ArchiverSuccessRateLow console: https://viceroy.corp.google.com/chrome_infra silence: http://alertmanager.corp.google.com/#view=createSilence&query=alertname%3D%22ArchiverSuccessRateLow%22,monarch_metric_fields%3D%22metric:table%3Dtko_jobs%22,monarch_module_name%3D%22chromeos-infra-ci-results-archiver-alerts%22,monarch_target_fields%3D%22%5E%24%22,monitorname%3D%22monarch%22,service%3D%22chromeos-infra-alert-owners%22 alert manager: http://alertmanager.corp.google.com/#view=conditionSummary&query=alertname%3D%22ArchiverSuccessRateLow%22,+monarch_metric_fields%3D%22metric:table%3Dtko_jobs%22,+monarch_module_name%3D%22chromeos-infra-ci-results-archiver-alerts%22,+monarch_target_fields%3D%22%5E%24%22,+monitorname%3D%22monarch%22,+service%3D%22chromeos-infra-alert-owners%22 Query ------------------ graph: https://pcon.corp.google.com/p#chromeos-infra-alert-owners/queryplayground?query=mash&duration=1d&mash=%28%28Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/afe_jobs/tick%27%29%2C%0A%20%20%20%20%20%20%20%20%7B%27metric%3Asuccess%27%3A%20True%7D%29%0A%20%20%7C%20Window%28Delta%28%273h%27%29%29%0A%20%20%7C%20GroupBy%28%5B%5D%29%0A%20%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D0%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%2C%0A%20%20Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/afe_jobs/tick%27%29%2C%0A%20%20%20%20%20%20%20%20%7B%7D%29%0A%20%20%7C%20Window%28Delta%28%273h%27%29%29%0A%20%20%7C%20GroupBy%28%5B%5D%29%0A%20%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D1%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%29%0A%20%7C%20Join%28left_default%3DNone%2C%20left_name%3D%27success%27%2C%20right_default%3DNone%2C%20right_name%3D%27total%27%29%0A%20%7C%20Point%28VAL%5B0%5D%20/%20VAL%5B1%5D%29%0A%20%7C%20MapStreamId%28out_field_map%3D%7B%27metric%3Atable%27%3A%20Point%28%27afe_jobs%27%29%7D%2C%20out_target_schema_name%3D%27monarch.acquisitions.Task%27%29%2C%0A%20%28Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/cidb_builds/tick%27%29%2C%0A%20%20%20%20%20%20%20%20%7B%27metric%3Asuccess%27%3A%20True%7D%29%0A%20%20%7C%20Window%28Delta%28%273h%27%29%29%0A%20%20%7C%20GroupBy%28%5B%5D%29%0A%20%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D0%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%2C%0A%20%20Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/cidb_builds/tick%27%29%2C%0A%20%20%20%20%20%20%20%20%7B%7D%29%0A%20%20%7C%20Window%28Delta%28%273h%27%29%29%0A%20%20%7C%20GroupBy%28%5B%5D%29%0A%20%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D1%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%29%0A%20%7C%20Join%28left_default%3DNone%2C%20left_name%3D%27success%27%2C%20right_default%3DNone%2C%20right_name%3D%27total%27%29%0A%20%7C%20Point%28VAL%5B0%5D%20/%20VAL%5B1%5D%29%0A%20%7C%20MapStreamId%28out_field_map%3D%7B%27metric%3Atable%27%3A%20Point%28%27cidb_builds%27%29%7D%2C%20out_target_schema_name%3D%27monarch.acquisitions.Task%27%29%2C%0A%20%28Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/tko_jobs/tick%27%29%2C%0A%20%20%20%20%20%20%20%20%7B%27metric%3Asuccess%27%3A%20True%7D%29%0A%20%20%7C%20Window%28Delta%28%273h%27%29%29%0A%20%20%7C%20GroupBy%28%5B%5D%29%0A%20%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D0%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%2C%0A%20%20Fetch%28Raw%28%27monarch.acquisitions.Task%27%2C%20%27/chrome/infra/chromeos/ci_results_archiver/builder/tko_jobs/tick%27%29%2C%0A%20%20%20%20%20%20%20%20%7B%7D%29%0A%20%20%7C%20Window%28Delta%28%273h%27%29%29%0A%20%20%7C%20GroupBy%28%5B%5D%29%0A%20%20%7C%20JoinWithLiteralTable%28fields%3D%28%29%2C%20input_default%3D1%2C%20streams%3D%5B%28%29%5D%2C%20target_schema_name%3D%27monarch.acquisitions.Task%27%29%29%0A%20%7C%20Join%28left_default%3DNone%2C%20left_name%3D%27success%27%2C%20right_default%3DNone%2C%20right_name%3D%27total%27%29%0A%20%7C%20Point%28VAL%5B0%5D%20/%20VAL%5B1%5D%29%0A%20%7C%20MapStreamId%28out_field_map%3D%7B%27metric%3Atable%27%3A%20Point%28%27tko_jobs%27%29%7D%2C%20out_target_schema_name%3D%27monarch.acquisitions.Task%27%29%29%0A%7C%20Union%28reducer%3DMin%28%29%29&endtime=1524859818
,
Apr 30 2018
stainless / ci_results_archiver stale data problems are P1.
Traceback (most recent call last):
File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/ci_results_archiver/main.py", line 215, in <module>
main()
File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/ci_results_archiver/main.py", line 209, in main
_BuilderMain(table_type=options.table_type, configs=configs)
File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/ci_results_archiver/main.py", line 114, in _BuilderMain
builder.Run()
File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/venv/ci_results_archiver/archive_builder.py", line 154, in Run
exported_table_suffixes = self._exporter.ExportNewEntries(entries)
File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/venv/ci_results_archiver/bigquery_exporter.py", line 47, in ExportNewEntries
entries, self._table_spec.partition_timestamp_column)
File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/venv/ci_results_archiver/bigquery_exporter.py", line 91, in _PartitionByDate
date = entry[timestamp_column].date()
AttributeError: 'NoneType' object has no attribute 'date'
ERROR 2018-04-30 14:03:03 scheduler [base.py:131] Job "tko_jobs (trigger: interval[0:20:00], next run at: 2018-04-30 21:20:00 UTC)" raised an exception
Traceback (most recent call last):
File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-9b73359f13f2cf7b6ed8f86d2c7d6ff7/local/lib/python2.7/site-packages/apscheduler/executors/base.py", line 125, in run_job
retval = job.func(*job.args, **job.kwargs)
File "/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_archiver/ci_results_archiver/main.py", line 100, in _RunBuilder
subprocess32.check_call(builder_argv, timeout=timeout.total_seconds())
File "/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-9b73359f13f2cf7b6ed8f86d2c7d6ff7/local/lib/python2.7/site-packages/subprocess32.py", line 602, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['/usr/local/google/home/chromeos-test/.cache/cros_venv/venv-2.7.6-9b73359f13f2cf7b6ed8f86d2c7d6ff7/bin/python', '/usr/local/google/home/chromeos-test/chromiumos/infra/ci_results_arc
hiver/ci_results_archiver/main.py', '--config=/etc/ci_results_archiver/config.yaml', '--one-shot', '--table-type=tko_jobs']' returned non-zero exit status 1
,
Apr 30 2018
What server did that come from?
,
Apr 30 2018
ci_results_archiver is choking on this entry:
MySQL [chromeos_autotest_db]> select * from tko_jobs where job_idx = 176132539 \G
*************************** 1. row ***************************
job_idx: 176132539
tag: 195273431-chromeos-test/chromeos6-row2-rack18-host10
label:
username: chromeos-test
machine_idx: 8082
queued_time: NULL
started_time: 2018-04-26 16:11:49
finished_time: NULL
afe_job_id: 195273431
afe_parent_job_id: 0
build: NULL
build_version: NULL
suite: NULL
board: NULL
1 row in set (0.04 sec)
because queued_time is NULL.
,
Apr 30 2018
The logs for this job are at gs://chromeos-autotest-results/195273431-chromeos-test/chromeos6-row2-rack18-host10/ tko/parse would have read the queued_time from keyval file here, which is surprisingly empty: drone=cros-full-0010.mtv.corp.google.com hostname=chromeos6-row2-rack18-host10 job_started=1524784309 label= status_version=1 user=chromeos-test Looks like that job died some horrific death immediately after starting. ci_results_archiver should skip such rows instead of choking on them.
,
Apr 30 2018
Should we delete the row to get things working again, or harden the service, or both?
,
Apr 30 2018
,
Apr 30 2018
There are only two entries in this bad state:
MySQL [chromeos_autotest_db]> select * from tko_jobs where queued_time is NULL \G
*************************** 1. row ***************************
job_idx: 176132539
tag: 195273431-chromeos-test/chromeos6-row2-rack18-host10
label:
username: chromeos-test
machine_idx: 8082
queued_time: NULL
started_time: 2018-04-26 16:11:49
finished_time: NULL
afe_job_id: 195273431
afe_parent_job_id: 0
build: NULL
build_version: NULL
suite: NULL
board: NULL
*************************** 2. row ***************************
job_idx: 176356711
tag: 195555511-chromeos-test/chromeos6-row4-rack24-host7
label:
username: chromeos-test
machine_idx: 9461
queued_time: NULL
started_time: 2018-04-27 16:05:24
finished_time: NULL
afe_job_id: 195555511
afe_parent_job_id: 0
build: NULL
build_version: NULL
suite: NULL
board: NULL
2 rows in set (0.04 sec)
I killed those.
MySQL [chromeos_autotest_db]> delete from tko_tests where job_idx in (176132539, 176356711);
Query OK, 3 rows affected (0.10 sec)
MySQL [chromeos_autotest_db]> delete from tko_jobs where job_idx in (176132539, 176356711);
Query OK, 2 rows affected (0.06 sec)
,
Apr 30 2018
ci_results_archiver is correctly exporting tko_db tables after my surgery in #9: http://shortn/_8i9fxWg7Dq
,
Apr 30 2018
,
May 1 2018
,
May 1 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/infra/ci_results_archiver/+/20e1a250b7dfc624642f694eb9996c13ee9e75b2 commit 20e1a250b7dfc624642f694eb9996c13ee9e75b2 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Tue May 01 15:32:03 2018 Skip unpartitionable entries. Exported bigquery tables are exported on some column from each of the data sources. In case some rows do not have a valid entry in this column, skip that row (reporting a metric) instead of choking on it. BUG= chromium:838333 TEST=unittests Change-Id: I3f6df3746cc6216a80d59e4c9fef55445084ec0e Reviewed-on: https://chromium-review.googlesource.com/1036523 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Don Garrett <dgarrett@chromium.org> Reviewed-by: Shuhei Takahashi <nya@chromium.org> [modify] https://crrev.com/20e1a250b7dfc624642f694eb9996c13ee9e75b2/ci_results_archiver/bigquery_exporter_unittest.py [modify] https://crrev.com/20e1a250b7dfc624642f694eb9996c13ee9e75b2/ci_results_archiver/bigquery_exporter.py [modify] https://crrev.com/20e1a250b7dfc624642f694eb9996c13ee9e75b2/ci_results_archiver/archive_builder.py
,
May 1 2018
Pending push-to-prod. |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by dgarr...@chromium.org
, Apr 30 2018