New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 611064 link

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Last visit > 30 days ago
Closed: Aug 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug
m53



Sign in to add a comment

Adding more duts to frontend job cause host-scheduler to crash.

Project Member Reported by shuqianz@chromium.org, May 11 2016

Issue description

host-scheduler stops because of database issue:

05/11 09:04:44.580 INFO |    host_scheduler:0430| Setting signal handler
05/11 09:04:44.583 INFO | metadata_reporter:0155| Metadata reporting thread is started.
05/11 09:04:44.782 INFO |    host_scheduler:0386| Calling new tick.
05/11 09:04:44.782 INFO |    host_scheduler:0387| Leasing hosts for frontend tasks.
05/11 09:04:44.788 INFO |    host_scheduler:0389| Finding hosts for new jobs.
05/11 09:04:44.812 WARNI|  scheduler_models:0183| /usr/local/autotest/scheduler/scheduler_models.py:183: DeprecationWarning: object() takes no parameters
  return super(DBObject, cls).__new__(cls, id=id, **kwargs)

05/11 09:04:45.223 DEBUG|    host_scheduler:0179| Minimum duts to get for suites (suite_id: min_duts): {}
05/11 09:04:45.303 DEBUG|               rdb:0395| Processing 648 host acquisition requests
05/11 09:04:45.560 DEBUG| rdb_cache_manager:0235| Cache stats: hit ratio: 39.39%, avg staleness per line: 0.00%.
05/11 09:04:45.560 DEBUG|               rdb:0416| Host acquisition stats: distinct requests: 33, leased hosts: 0, unsatisfied requests: 648
05/11 09:04:45.567 INFO |    host_scheduler:0391| Releasing unused hosts.
05/11 09:04:45.792 INFO |    query_managers:0359| Setting leased = False for the hosts that match {'hostname__in': [u'chromeos2-row3-rack6-host2', u'chromeos2-row3-rack6-host7', u'chromeos1-grover-host1', 
u'chromeos4-row9-rack7-host5', u'chromeos1-row2-rack7-host1', u'chromeos4-row9-rack7-host7', u'chromeos4-row9-rack7-host11', u'chromeos4-row9-rack7-host15', u'chromeos4-row9-rack7-host17', u'chromeos4-row9
-rack7-host21', u'chromeos4-row9-rack8-host7', u'chromeos4-row9-rack8-host11', u'chromeos2-row6-rack1-host2', u'chromeos2-row6-rack2-host4', u'chromeos1-row5-rack5-host2', u'chromeos2-row3-rack5-host13', u
'chromeos2-row3-rack5-host16', u'chromeos2-row3-rack5-host18']}
05/11 09:04:45.798 INFO |    host_scheduler:0393| Updating suite assignment with released hosts
05/11 09:04:45.799 INFO |    host_scheduler:0395| Calling email_manager.
05/11 09:04:45.799 INFO |    host_scheduler:0386| Calling new tick.
05/11 09:04:45.799 INFO |    host_scheduler:0387| Leasing hosts for frontend tasks.
05/11 09:04:45.804 INFO |    host_scheduler:0389| Finding hosts for new jobs.
05/11 09:04:46.203 DEBUG|    host_scheduler:0179| Minimum duts to get for suites (suite_id: min_duts): {}
05/11 09:04:46.263 DEBUG|               rdb:0395| Processing 648 host acquisition requests
05/11 09:04:46.847 DEBUG| rdb_cache_manager:0235| Cache stats: hit ratio: 39.39%, avg staleness per line: 0.00%.
05/11 09:04:46.848 DEBUG|               rdb:0416| Host acquisition stats: distinct requests: 33, leased hosts: 18, unsatisfied requests: 630
05/11 09:04:46.853 INFO |  scheduler_models:0530| Assigning host chromeos2-row6-rack2-host4 to entry HQE: 63135235, for job: 62804994 and host: no host has status:Queued
05/11 09:04:46.864 ERROR|     email_manager:0082| Uncaught exception; terminating host_scheduler.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 487, in main
    host_scheduler.tick()
  File "/usr/local/autotest/site-packages/statsd/timer.py", line 95, in _decorator
    return function(*args, **kwargs)
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 390, in tick
    self._schedule_jobs()
  File "/usr/local/autotest/site-packages/statsd/timer.py", line 95, in _decorator
    return function(*args, **kwargs)
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 335, in _schedule_jobs
    self.schedule_host_job(acquisition.host, acquisition.job)
  File "/usr/local/autotest/scheduler/host_scheduler.py", line 233, in schedule_host_job
    queue_entry.set_host(host)
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 531, in set_host
    self.update_field('host_id', host.id)
  File "/usr/local/autotest/scheduler/scheduler_models.py", line 308, in update_field
    _db.execute(query, (value, self.id))
  File "/usr/local/autotest/database/database_connection.py", line 312, in execute
    results = self._backend.execute(query, parameters)
  File "/usr/local/autotest/database/database_connection.py", line 132, in execute
    parameters=parameters)
  File "/usr/local/autotest/database/database_connection.py", line 54, in execute
    self._cursor.execute(query, parameters)
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 122, in execute
    six.reraise(utils.IntegrityError, utils.IntegrityError(*tuple(e.args)), sys.exc_info()[2])
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 120, in execute
    return self.cursor.execute(query, args)
  File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
IntegrityError: (1062, "Duplicate entry '62804994-3246' for key 'host_queue_entries_job_id_and_host_id'")

 
I checked the database, it seems the host-scheduler is scheduling a completed job to the host again.

E.g
05/11 09:04:46.853 INFO |  scheduler_models:0530| Assigning host chromeos2-row6-rack2-host4 to entry HQE: 63135235, for job: 62804994 and host: no host has status:Queued

$select * from afe_host_queue_entries where host_id=3246 and job_id>62804934 and job_id<62805000;
+----------+----------+---------+-----------+-----------+--------+----------+---------+----------------------------+-----------------+---------+---------------------+---------------------+
| id       | job_id   | host_id | status    | meta_host | active | complete | deleted | execution_subdir           | atomic_group_id | aborted | started_on          | finished_on         |
+----------+----------+---------+-----------+-----------+--------+----------+---------+----------------------------+-----------------+---------+---------------------+---------------------+
| 63135174 | 62804935 |    3246 | Completed |      NULL |      0 |        1 |       0 | chromeos2-row6-rack2-host4 |            NULL |       0 | 2016-05-10 12:07:54 | 2016-05-10 12:10:53 |
| 63135234 | 62804994 |    3246 | Completed |      NULL |      0 |        1 |       0 | chromeos2-row6-rack2-host4 |            NULL |       0 | 2016-05-10 12:11:20 | 2016-05-10 12:14:27 |
+----------+----------+---------+-----------+-----------+--------+----------+---------+----------------------------+-----------------+---------+---------------------+---------------------+
2 rows in set (0.00 sec)

The job 62804994 has finished yesterday.

Comment 2 by fdeng@chromium.org, May 11 2016

It is always failing for the same host-queue-entry 63135234. 

mysql> select * from afe_host_queue_entries where job_id=62804994;
+----------+----------+---------+-----------+-----------+--------+----------+---------+----------------------------+-----------------+---------+---------------------+---------------------+
| id       | job_id   | host_id | status    | meta_host | active | complete | deleted | execution_subdir           | atomic_group_id | aborted | started_on          | finished_on         |
+----------+----------+---------+-----------+-----------+--------+----------+---------+----------------------------+-----------------+---------+---------------------+---------------------+
| 63135234 | 62804994 |    3246 | Completed |      NULL |      0 |        1 |       0 | chromeos2-row6-rack2-host4 |            NULL |       0 | 2016-05-10 12:11:20 | 2016-05-10 12:14:27 |
| 63135235 | 62804994 |    3285 | Completed |      3217 |      0 |        1 |       0 | chromeos2-row6-rack3-host4 |            NULL |       0 | 2016-05-11 09:24:40 | 2016-05-11 09:28:17 |
+----------+----------+---------+-----------+-----------+--------+----------+---------+----------------------------+-----------------+---------+---------------------+---------------------+
2 rows in set (0.00 sec)

For some reason the job gets two host-queue-entries and one of them has meta_host=NULL. I believe this is incorrect. 
meta_host is used for determine jobs scheduled from frontend, so maybe scheduler got confused by it.
I manually updated it to 3217. Let's see if the problem will happen again.

Comment 3 by fdeng@chromium.org, May 11 2016

Cc: tienchang@chromium.org
+  tienchang@chromium.org
looks like this job is scheduled by Tien manually from frontend?
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=62804994

Tien, could you provide some details on how you scheduled the job?
Yes! I created the host via the frontend. I believe this is what I did ...

This was the job:
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=62804343

Wanting to rule out (or confirm) a failure, I cloned the job. I wanted to chose a specific hostname, but I already clicked "Reuse any specific hosts (default)". No matters, because I could just select a host while creating the job. In the Hostname field, I entered "chromeos2-row6-rack2-host4" while still on the "By browsing hosts" tab. I then selected (confirmed by the yellow highlight) chromeos2-row6-rack2-host4.

Maybe that's why there were 2 queued jobs? After reviewing my steps, I seem horrible at following directions!

Comment 5 by fdeng@chromium.org, May 11 2016

tien@ thanks for you information. That's very useful.
I think I know what's happening.

In #4, a job with two hosts are created, which resulting in two hqes
hqe 63135234 -- this one was created with a specific host chromeos2-row6-rack2-host4, and meta_host=NULL

hqe 63135235 -- this one was created with meta_host=3217 (because user selected Reuse any similar hosts), but not specific host.

Our host-scheduler can't deal with such situation.
It tried to assign chromeos2-row6-rack2-host4 to 63135235, because 
 - 63135235 needs a host
 - chromeos2-row6-rack2-host4 is ready/unleased
However,  chromeos2-row6-rack2-host4 was already assigned to 63135234 (as user requested), so the assignment triggers db error.

The problem went away after another dut become available (chromeos2-row6-rack3-host4), and 63135235 finally get this one.



Status: Fixed (was: Untriaged)
Interesting problem. Fang, thanks for figuring this out!

Comment 7 by fdeng@chromium.org, May 11 2016

Potential fix:
 - Ban such use case, that is, if user selects "Reusing any similar hosts", do now allow user to add an additional dut later on in the next tab. 

 - Or if we allow this use case, when user added an additional dut, it should not lead to an hqe with meta_host=NULL being created.

Comment 8 by fdeng@chromium.org, May 11 2016

Status: Available (was: Fixed)
The problem is still there :p I asked Tian to not do it again, but we can't guarantee someone else won't. P3 as not critical, good candidate for Fixit

Comment 9 by fdeng@chromium.org, May 11 2016

Components: Infra>Client>ChromeOS
Labels: Hotlist-Fixit OS-Chrome

Comment 10 by fdeng@chromium.org, May 11 2016

Summary: Adding more duts to frontend job cause host-scheduler to crash. (was: Duplicate entry for key 'host_queue_entries_job_id_and_host_id' on shard chromeos-server13.cbf)
Supported for all options reopen 
Cc: shuqianz@chromium.org ntang@chromium.org krk@chromium.org dchan@chromium.org jrbarnette@chromium.org jean@chromium.org stephenlin@chromium.org
 Issue 634064  has been merged into this issue.
Labels: -Pri-3 Pri-1
Hey guys we need to raise the priority of this. If Danny can make this mistake at Google, our partners can make this mistake with their MobLabs.

Either:
* Update the host scheduler to properly handle this situation (which is likely the correct action).
* Disable multiple host selection in the AFE (Sounds like we're hiding a problem to me).
will take a look

Comment 15 by krk@chromium.org, Aug 4 2016

Labels: Proj-Moblab m53
Project Member

Comment 16 by bugdroid1@chromium.org, Aug 11 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/61f5d31b6f6173069e1ad5ce28c739666398e237

commit 61f5d31b6f6173069e1ad5ce28c739666398e237
Author: Shuqian Zhao <shuqianz@chromium.org>
Date: Sat Aug 06 00:15:23 2016

[autotest]only accept hosts when both hosts and meta_hosts given for a clone job

When cloning a job via autotest frontend, if user chooses the default
clone action, which is 'Reuse any similar hosts' and then later still
select specific hosts in the next tab, the host scheduler will get
confused about the case and would try to schedule two same jobs to one
host. In this case, the host scheduler will crash because of duplicate entries
for one job in the database. The fix is to only accept the hosts
variables under this situation.

BUG= chromium:611064 
TEST=unittest

Change-Id: Id945a9f8f2b45271c903f1ec038366da47f5d436
Reviewed-on: https://chromium-review.googlesource.com/366795
Commit-Ready: Dan Shi <dshi@chromium.org>
Tested-by: Shuqian Zhao <shuqianz@chromium.org>
Reviewed-by: Fang Deng <fdeng@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/61f5d31b6f6173069e1ad5ce28c739666398e237/frontend/afe/rpc_interface.py

Status: Fixed (was: Available)
Labels: VerifyIn-54
Labels: VerifyIn-55

Comment 20 by dchan@chromium.org, Oct 10 2016

Labels: -VerifyIn-55

Comment 21 by dchan@google.com, Nov 19 2016

Labels: VerifyIn-56

Comment 22 by dchan@google.com, Jan 21 2017

Labels: VerifyIn-57

Comment 23 by dchan@google.com, Mar 4 2017

Labels: VerifyIn-58

Comment 24 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 25 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 27 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)

Sign in to add a comment