
Issue 735511

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




suite report tko data is incorrect

Project Member Reported by pprabhu@chromium.org, Jun 21 2017

Issue description

Example dashboard: https://viceroy.corp.google.com/chromeos/suite_details?build_id=1609289

That claims that each test was running for 27+ days :)

I think we're somehow spewing the wrong hwtest name / suite_attrs for HWTest [jetstream_cq] stage.

--> davidriley for FIR.
 
Cc: akes...@chromium.org dshi@chromium.org jrbarnette@chromium.org
Adding some people here because I'm not really sure what is going on and who is the appropriate owner.  The suite report gets all the child jobs and queries TKO for all the test statuses:
    statuses = tko.get_job_test_statuses_from_db(job.id)

It then defines the start/finish time for the entire HWTest stage as the minimum start time and maximum finish time across those statuses.  Looking at the example and the child job 124312103, I see the following three entries from TKO:
DEBUG:root:status TEST STATUS: ['', 'chromeos4-row9-jetstream-host5', 'SERVER_JOB']
              status: GOOD
            hostname: chromeos4-row9-jetstream-host5
             job_tag: 124312103-chromeos-test/chromeos4-row9-jetstream-host5
  test_finished_time: 2017-06-21 02:31:45
              reason:
   test_started_time: 2017-06-21 02:25:22
              subdir: ----
           test_name: SERVER_JOB
           job_owner: chromeos-test
          afe_job_id: 124312103
                  id: ['', 'chromeos4-row9-jetstream-host5', 'SERVER_JOB']

DEBUG:root:status TEST STATUS: ['', 'chromeos4-row9-jetstream-host5', 'CLIENT_JOB.0']
              status: GOOD
            hostname: chromeos4-row9-jetstream-host5
             job_tag: 124312103-chromeos-test/chromeos4-row9-jetstream-host5
  test_finished_time: 2017-05-25 13:28:33
              reason:
   test_started_time: 2017-05-25 13:26:54
              subdir: None
           test_name: CLIENT_JOB.0
           job_owner: chromeos-test
          afe_job_id: 124312103
                  id: ['', 'chromeos4-row9-jetstream-host5', 'CLIENT_JOB.0']

DEBUG:root:status TEST STATUS: ['completed successfully', 'chromeos4-row9-jetstream-host5', 'jetstream_GcdCommands']
              status: GOOD
            hostname: chromeos4-row9-jetstream-host5
             job_tag: 124312103-chromeos-test/chromeos4-row9-jetstream-host5
  test_finished_time: 2017-05-25 13:28:31
              reason: completed successfully
   test_started_time: 2017-05-25 13:26:55
              subdir: jetstream_GcdCommands
           test_name: jetstream_GcdCommands
           job_owner: chromeos-test
          afe_job_id: 124312103
                  id: ['completed successfully', 'chromeos4-row9-jetstream-host5', 'jetstream_GcdCommands']

I argue that the suite report is doing the correct thing and that there is some bogus stuff in TKO.
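To make that min/max failure mode concrete, here is a minimal sketch of the computation described above (the function name stage_timeframe and the direct attribute access on the returned TestStatus objects are illustrative assumptions, not the actual suite-report code):

    from datetime import datetime

    # Timestamp format as it appears in the TKO entries above.
    TKO_TIME_FMT = '%Y-%m-%d %H:%M:%S'

    def stage_timeframe(statuses):
        # Illustrative only: take the stage window as the earliest start and
        # the latest finish across all statuses returned for the child job.
        starts = [datetime.strptime(s.test_started_time, TKO_TIME_FMT)
                  for s in statuses]
        finishes = [datetime.strptime(s.test_finished_time, TKO_TIME_FMT)
                    for s in statuses]
        return min(starts), max(finishes)

With the three entries above, one row from 6/21 plus two stale rows from 5/25 yield a window of roughly 27 days, which matches the dashboard's "27+ days" per test.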
Is this consistent, or was this a one-off occurrence?
Cc: ayatane@chromium.org
+ secondary
Cc: davidri...@chromium.org
Owner: ayatane@chromium.org
Not sure.  I think I saw it on a bunch of jetstream related builds though.

Reassigning to find a proper owner.
Status: Assigned (was: Untriaged)
This problem is getting worse.  At some point I expect the suite report to start taking longer and affecting build times.

https://viceroy.corp.google.com/chromeos/suite_details?build_id=1657426
https://storage.cloud.google.com/chromeos-autotest-results/127989581-chromeos-test/hostless/.parse.log?_ga=1.109879685.1008157659.1499732868

DEBUG:root:Suite job 127989581:
DEBUG:root:Fetching child jobs...
DEBUG:root:... fetched 11 child jobs.
DEBUG:root:Fetching HQEs...
DEBUG:root:... fetched 11 HQEs.
DEBUG:root:6 distinct hosts participated in the suite.
INFO:oauth2client.client:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
DEBUG:root:... generated 3027 entries
DEBUG:root:Dumping 3027 entries

>>> tko = frontend.TKO()
>>> suite_job_id=127989581
>>> statuses = tko.get_job_test_statuses_from_db(suite_job_id)

>>> print statuses[0]
TEST STATUS: ['', '', 'SERVER_JOB']
              status: GOOD
            hostname:
             job_tag: 127989581-chromeos-test/hostless
  test_finished_time: 2017-07-11 05:43:39
              reason:
   test_started_time: 2017-07-11 05:19:34
              subdir: ----
           test_name: SERVER_JOB
           job_owner: chromeos-test
          afe_job_id: 127989581
                  id: ['', '', 'SERVER_JOB']

>>> print statuses[1]
TEST STATUS: ['completed successfully', '', 'jetstream_NetworkInterfaces']
              status: GOOD
            hostname:
             job_tag: 127989581-chromeos-test/hostless
  test_finished_time: 2017-06-27 07:18:52
              reason: completed successfully
   test_started_time: 2017-06-27 07:17:56
              subdir: 127989743-chromeos-test/chromeos4-row9-jetstream-host1/jetstream_NetworkInterfaces
           test_name: jetstream_NetworkInterfaces
           job_owner: chromeos-test
          afe_job_id: 127989581
                  id: ['completed successfully', '', 'jetstream_NetworkInterfaces']

>>> print statuses[2]
TEST STATUS: ['completed successfully', '', 'jetstream_PrioritizedDevice']
              status: GOOD
            hostname:
             job_tag: 127989581-chromeos-test/hostless
  test_finished_time: 2017-06-27 07:19:12
              reason: completed successfully
   test_started_time: 2017-06-27 07:18:16
              subdir: 127989746-chromeos-test/chromeos4-row9-jetstream-host5/jetstream_PrioritizedDevice
           test_name: jetstream_PrioritizedDevice
           job_owner: chromeos-test
          afe_job_id: 127989581
                  id: ['completed successfully', '', 'jetstream_PrioritizedDevice']
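
For what it's worth, a quick way to see how far those rows spread in time, continuing the interactive session above (illustrative; it reuses the statuses list and the timestamp format shown in the dumps):

    >>> from datetime import datetime
    >>> fmt = '%Y-%m-%d %H:%M:%S'
    >>> started = [datetime.strptime(s.test_started_time, fmt) for s in statuses]
    >>> print max(started) - min(started)

Just from the three statuses shown above, the spread is already about two weeks: the SERVER_JOB row is from 7/11 while the jetstream_* rows are from 6/27.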

Is this something that is happening due to bad data or tests for the DUT, or something weird and whacky with the shard?
We need a root cause.  Is this because wrong entries are getting uploaded to TKO?  If so, it may be investigated together with the other TKO related issues we are having.
Re c#7: Agreed, I'm just commenting that the problem is still happening and it's getting worse.  I haven't done any investigation into why TKO is reporting statuses spanning weeks for the same suite job id.
Labels: -Pri-3 Pri-1
Ping?
Re #6, you say "I expect the suite report to start taking longer and affecting build times". Are you suggesting there is a relationship between this data-correctness issue and the run time of the suite report generator?
Labels: -Pri-1 Pri-2
degradation of a class 2 service (monitoring/dashboards). demoting to P2.
Summary: suite report (was: Jetstream suite timeline collates across runs)
btw regarding #6, the suite job in question is logged here: https://storage.cloud.google.com/chromeos-autotest-results/127989581-chromeos-test/hostless/debug/autoserv.DEBUG?_ga=1.11423415.2000744542.1478018485

The correct timeframe is 7/11 (statuses[0]). It is the latter two statuses that are incorrect.
What I don't understand about #6 is why those statuses claim to be statuses for afe id 127989581, which is the id of the suite job, when they are clearly statuses for a child job on a DUT.
Here's a real child job of suite 127989581: http://cautotest/afe/#tab_id=view_job&object_id=127989743

The child's logs:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/127989743-chromeos-test/chromeos4-row9-jetstream-host1/

In particular, its .parse.log:
https://storage.cloud.google.com/chromeos-autotest-results/127989743-chromeos-test/chromeos4-row9-jetstream-host1/.parse.log?_ga=1.245902215.2000744542.1478018485

Which doesn't indicate anything obviously fishy.

+dshi any reason you can think of that tko parse would be reporting the wrong date in its job results?

Maybe that shard's system clock was woefully wrong at the time tko/parse ran, and we rely on its timestamp?

Owner: pho...@chromium.org
Assigning to current deputy
c#111: It looks like the suite reports are getting longer and longer due to greater re-use of the same job ID or something like that.  Longer time frame to query, longer suite report.
Cc: pho...@chromium.org
Owner: dshi@chromium.org
Summary: suite report tko data is incorrect (was: suite report )
    @metrics.SecondsTimerDecorator(
            'chromeos/autotest/tko/get_job_status_duration')
    def get_job_test_statuses_from_db(self, job_id):
        """Get job test statuses from the database.

        Retrieve a set of fields from a job that reflect the status of each test
        run within a job.
        fields retrieved: status, test_name, reason, test_started_time,
                          test_finished_time, afe_job_id, job_owner, hostname.

        @param job_id: The afe job id to look up.
        @returns A list of TestStatus objects with the resulting information.
        """
        if self._db is None:
            self._db = db.db()
        fields = ['status', 'test_name', 'subdir', 'reason',
                  'test_started_time', 'test_finished_time', 'afe_job_id',
                  'job_owner', 'hostname', 'job_tag']
        table = 'tko_test_view_2'
        where = 'job_tag like "%s-%%"' % job_id
        test_status = []
        # Run commit before we query to ensure that we are pulling the latest
        # results.
        self._db.commit()
        for entry in self._db.select(','.join(fields), table, (where, None)):
            status_dict = {}
            for key,value in zip(fields, entry):
                # All callers expect values to be a str object.
                status_dict[key] = str(value)
            # id is used by TestStatus to uniquely identify each Test Status
            # obj.
            status_dict['id'] = [status_dict['reason'], status_dict['hostname'],
                                 status_dict['test_name']]
            test_status.append(status_dict)

        return [TestStatus(self, e) for e in test_status]


The fact that this happens to return child jobs seems like an accident to me. Either that, or when we run tko/parse, do we use the suite job's afe id instead of the child job's on purpose?

It looks like this (horrible) query is also doing a "LIKE" match on job tag, which sounds both error-prone and horribly slow (it will require a table scan on each query). As the size of TKO grows (between our 6-month cleanups) it will get slower.
I hope that the LIKE clause is the culprit here - does anyone know whether that is required, or can we just turn it into an equality clause?
The query has a wildcard in it, so it indeed needs to be a LIKE query, at least as currently formulated.
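
For reference, the where clause in the code quoted above expands to a trailing-wildcard pattern, which is why plain equality on job_tag would not match (the session below is illustrative, using the suite job id from this bug):

    >>> job_id = 127989581
    >>> 'job_tag like "%s-%%"' % job_id
    'job_tag like "127989581-%"'

Since the wildcard only appears at the end, this is a prefix match on job_tag; whether the database can use an index for it depends on how tko_test_view_2 and the underlying job_tag column are indexed, which would need to be checked rather than assumed.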

Comment 21 by dshi@chromium.org, Jul 26 2017

I think this is just a one-off issue. That particular test had a system time problem on the DUT.

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/124312103-chromeos-test/chromeos4-row9-jetstream-host5/

The test was started on 6/21, but the status log has timestamps from 5/25. It seems the DUT's clock was totally off.

Maybe we should add a check in verify to make sure the DUT's system time is aligned with the drone's.
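
A sketch of the kind of verify check being suggested (illustrative only: the host object and its run() method are assumed to behave like autotest's host abstraction, and the 300-second threshold is arbitrary):

    import time

    MAX_SKEW_SECONDS = 300  # arbitrary threshold, for illustration only

    def check_dut_clock(host):
        # Compare the DUT's epoch time against the drone's local clock and
        # flag the DUT if the skew is large enough to corrupt TKO timestamps.
        dut_epoch = float(host.run('date +%s').stdout.strip())
        skew = abs(dut_epoch - time.time())
        if skew > MAX_SKEW_SECONDS:
            raise Exception('DUT clock is off by about %d seconds' % int(skew))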

Comment 22 by dshi@chromium.org, Sep 3 2017

Status: WontFix (was: Assigned)
