suite report tko data is incorrect |
||||||||||
Issue descriptionExample dashboard: https://viceroy.corp.google.com/chromeos/suite_details?build_id=1609289 That claims that each test was running for 27+ days :) I think we're somehow spewing the wrong hwtest name / suite_attrs for HWTest [jetstream_cq] stage. --> davidriley for FIR.
,
Jun 21 2017
Is this consistent, or was this a one-off occurrence?
,
Jun 21 2017
+ secondary
,
Jun 21 2017
Not sure. I think I saw it on a bunch of jetstream related builds though. Reassigning to find a proper owner.
,
Jun 30 2017
,
Jul 11 2017
This problem is getting worse. At some point I expect the suite report to start taking longer and affecting build times. https://viceroy.corp.google.com/chromeos/suite_details?build_id=1657426 https://storage.cloud.google.com/chromeos-autotest-results/127989581-chromeos-test/hostless/.parse.log?_ga=1.109879685.1008157659.1499732868 DEBUG:root:Suite job 127989581: DEBUG:root:Fetching child jobs... DEBUG:root:... fetched 11 child jobs. DEBUG:root:Fetching HQEs... DEBUG:root:... fetched 11 HQEs. DEBUG:root:6 distinct hosts participated in the suite. INFO:oauth2client.client:Attempting refresh to obtain initial access_token INFO:oauth2client.client:Refreshing access_token DEBUG:root:... generated 3027 entries DEBUG:root:Dumping 3027 entries >>> tko = frontend.TKO() >>> suite_job_id=127989581 >>> statuses = tko.get_job_test_statuses_from_db(suite_job_id) >>> print statuses[0] TEST STATUS: ['', '', 'SERVER_JOB'] status: GOOD hostname: job_tag: 127989581-chromeos-test/hostless test_finished_time: 2017-07-11 05:43:39 reason: test_started_time: 2017-07-11 05:19:34 subdir: ---- test_name: SERVER_JOB job_owner: chromeos-test afe_job_id: 127989581 id: ['', '', 'SERVER_JOB'] >>> print statuses[1] TEST STATUS: ['completed successfully', '', 'jetstream_NetworkInterfaces'] status: GOOD hostname: job_tag: 127989581-chromeos-test/hostless test_finished_time: 2017-06-27 07:18:52 reason: completed successfully test_started_time: 2017-06-27 07:17:56 subdir: 127989743-chromeos-test/chromeos4-row9-jetstream-host1/jetstream_NetworkInterfaces test_name: jetstream_NetworkInterfaces job_owner: chromeos-test afe_job_id: 127989581 id: ['completed successfully', '', 'jetstream_NetworkInterfaces'] >>> print statuses[2] TEST STATUS: ['completed successfully', '', 'jetstream_PrioritizedDevice'] status: GOOD hostname: job_tag: 127989581-chromeos-test/hostless test_finished_time: 2017-06-27 07:19:12 reason: completed successfully test_started_time: 2017-06-27 07:18:16 subdir: 127989746-chromeos-test/chromeos4-row9-jetstream-host5/jetstream_PrioritizedDevice test_name: jetstream_PrioritizedDevice job_owner: chromeos-test afe_job_id: 127989581 id: ['completed successfully', '', 'jetstream_PrioritizedDevice'] Is this something that is happening due to bad data or tests for the DUT, or something weird and whacky with the shard?
,
Jul 11 2017
We need a root cause. Is this because wrong entries are getting uploaded to TKO? If so, it may be investigated together with the other TKO related issues we are having.
,
Jul 11 2017
Re c#7: Agreed, I'm just commenting that the problem is still happening and it's getting worse. I haven't done any investigations on why TKO is reporting statuses spanning for weeks for the same suite job id.
,
Jul 14 2017
,
Jul 20 2017
Ping?
,
Jul 25 2017
Re #6, you say "I expect the suite report to start taking longer and affecting build times". Are you suggesting there is a relationship between this data-correctness issue and the run time of the suite report generator?
,
Jul 25 2017
degradation of a class 2 service (monitoring/dashboards). demoting to P2.
,
Jul 25 2017
btw regarding #6, the suite job in question is logged here: https://storage.cloud.google.com/chromeos-autotest-results/127989581-chromeos-test/hostless/debug/autoserv.DEBUG?_ga=1.11423415.2000744542.1478018485 The correct timeframe is 7/11 (statuses[0]). It is the second two statuses that are incorrect.
,
Jul 25 2017
What I don't understand about #6 are why those statuses claim to be statuses for afe id 127989581 which is the id of the suite job, when they are clearly statuses for a child job on a DUT.
,
Jul 26 2017
Here's a real child job of suite 127989581: http://cautotest/afe/#tab_id=view_job&object_id=127989743 The child's logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/127989743-chromeos-test/chromeos4-row9-jetstream-host1/ In particular, it's .parse.log: https://storage.cloud.google.com/chromeos-autotest-results/127989743-chromeos-test/chromeos4-row9-jetstream-host1/.parse.log?_ga=1.245902215.2000744542.1478018485 Which doesn't indicate anything obviously fishy. +dshi any reason you can think that tko parse would be reporting the wrong date in its job results? Maybe that shard's system clock was woefully wrong at the time tko/parse ran, and we rely on the it's timestamp?
,
Jul 26 2017
Assigning to current deputy
,
Jul 26 2017
c#111: It looks like the suite reports are getting longer and longer due to having greater re-use of the same job ID or something like that. Longer time frame to query, long suite report.
,
Jul 26 2017
@metrics.SecondsTimerDecorator(
'chromeos/autotest/tko/get_job_status_duration')
def get_job_test_statuses_from_db(self, job_id):
"""Get job test statuses from the database.
Retrieve a set of fields from a job that reflect the status of each test
run within a job.
fields retrieved: status, test_name, reason, test_started_time,
test_finished_time, afe_job_id, job_owner, hostname.
@param job_id: The afe job id to look up.
@returns a TestStatus object of the resulting information.
"""
if self._db is None:
self._db = db.db()
fields = ['status', 'test_name', 'subdir', 'reason',
'test_started_time', 'test_finished_time', 'afe_job_id',
'job_owner', 'hostname', 'job_tag']
table = 'tko_test_view_2'
where = 'job_tag like "%s-%%"' % job_id
test_status = []
# Run commit before we query to ensure that we are pulling the latest
# results.
self._db.commit()
for entry in self._db.select(','.join(fields), table, (where, None)):
status_dict = {}
for key,value in zip(fields, entry):
# All callers expect values to be a str object.
status_dict[key] = str(value)
# id is used by TestStatus to uniquely identify each Test Status
# obj.
status_dict['id'] = [status_dict['reason'], status_dict['hostname'],
status_dict['test_name']]
test_status.append(status_dict)
return [TestStatus(self, e) for e in test_status]
The fact that this happens to return child jobs seems like an accident to me. Either that, or when we do tko/parse do we use the suite job's afe id instead of the child job on purpose?
It looks like this (horrible) query is also doing a "LIKE" match on job tag, which sounds both error prone and horribly slow (will require table scan on each query). As the size of tko grows (between our 6 month cleanups) it will get slower.
,
Jul 26 2017
I hope that the LIKE clause is the culprit here - does anyone know whether that is required or can we just turn it into an equality clause?
,
Jul 26 2017
The query has a wildcard in it, so it indeed needs to be a LIKE query, at least as currently formulated.
,
Jul 26 2017
I think this is just an one off issue. That particular test has some system time problem on the dut. https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/124312103-chromeos-test/chromeos4-row9-jetstream-host5/ The test was started on 6/21, but the status log has timestamp in 5/25. That seems the dut has a totally off clock. Maybe we should add a check in verify, to make sure the system time is aligned with the drone.
,
Sep 3 2017
|
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by davidri...@chromium.org
, Jun 21 2017Adding some people here because I'm not really sure what is going on and who is the appropriate owner. The suite report gets all the child jobs and queries TKO for all the test statuses: statuses = tko.get_job_test_statuses_from_db(job.id) It then defines the start/finish time for the entire hwtest based on the min/max start/finish time. Looking at the example and the child job 124312103 I see the following three entries from TKO: DEBUG:root:status TEST STATUS: ['', 'chromeos4-row9-jetstream-host5', 'SERVER_JOB'] status: GOOD hostname: chromeos4-row9-jetstream-host5 job_tag: 124312103-chromeos-test/chromeos4-row9-jetstream-host5 test_finished_time: 2017-06-21 02:31:45 reason: test_started_time: 2017-06-21 02:25:22 subdir: ---- test_name: SERVER_JOB job_owner: chromeos-test afe_job_id: 124312103 id: ['', 'chromeos4-row9-jetstream-host5', 'SERVER_JOB'] DEBUG:root:status TEST STATUS: ['', 'chromeos4-row9-jetstream-host5', 'CLIENT_JOB.0'] status: GOOD hostname: chromeos4-row9-jetstream-host5 job_tag: 124312103-chromeos-test/chromeos4-row9-jetstream-host5 test_finished_time: 2017-05-25 13:28:33 reason: test_started_time: 2017-05-25 13:26:54 subdir: None test_name: CLIENT_JOB.0 job_owner: chromeos-test afe_job_id: 124312103 id: ['', 'chromeos4-row9-jetstream-host5', 'CLIENT_JOB.0'] DEBUG:root:status TEST STATUS: ['completed successfully', 'chromeos4-row9-jetstream-host5', 'jetstream_GcdCommands'] status: GOOD hostname: chromeos4-row9-jetstream-host5 job_tag: 124312103-chromeos-test/chromeos4-row9-jetstream-host5 test_finished_time: 2017-05-25 13:28:31 reason: completed successfully test_started_time: 2017-05-25 13:26:55 subdir: jetstream_GcdCommands test_name: jetstream_GcdCommands job_owner: chromeos-test afe_job_id: 124312103 id: ['completed successfully', 'chromeos4-row9-jetstream-host5', 'jetstream_GcdCommands'] I argue that the suite report is doing the correct thing and that there is some bogus stuff in TKO.