Adding more duts to frontend job cause host-scheduler to crash. |
||||||||||||||||||||
Issue description
host-scheduler stops because of database issue:
05/11 09:04:44.580 INFO | host_scheduler:0430| Setting signal handler
05/11 09:04:44.583 INFO | metadata_reporter:0155| Metadata reporting thread is started.
05/11 09:04:44.782 INFO | host_scheduler:0386| Calling new tick.
05/11 09:04:44.782 INFO | host_scheduler:0387| Leasing hosts for frontend tasks.
05/11 09:04:44.788 INFO | host_scheduler:0389| Finding hosts for new jobs.
05/11 09:04:44.812 WARNI| scheduler_models:0183| /usr/local/autotest/scheduler/scheduler_models.py:183: DeprecationWarning: object() takes no parameters
return super(DBObject, cls).__new__(cls, id=id, **kwargs)
05/11 09:04:45.223 DEBUG| host_scheduler:0179| Minimum duts to get for suites (suite_id: min_duts): {}
05/11 09:04:45.303 DEBUG| rdb:0395| Processing 648 host acquisition requests
05/11 09:04:45.560 DEBUG| rdb_cache_manager:0235| Cache stats: hit ratio: 39.39%, avg staleness per line: 0.00%.
05/11 09:04:45.560 DEBUG| rdb:0416| Host acquisition stats: distinct requests: 33, leased hosts: 0, unsatisfied requests: 648
05/11 09:04:45.567 INFO | host_scheduler:0391| Releasing unused hosts.
05/11 09:04:45.792 INFO | query_managers:0359| Setting leased = False for the hosts that match {'hostname__in': [u'chromeos2-row3-rack6-host2', u'chromeos2-row3-rack6-host7', u'chromeos1-grover-host1',
u'chromeos4-row9-rack7-host5', u'chromeos1-row2-rack7-host1', u'chromeos4-row9-rack7-host7', u'chromeos4-row9-rack7-host11', u'chromeos4-row9-rack7-host15', u'chromeos4-row9-rack7-host17', u'chromeos4-row9
-rack7-host21', u'chromeos4-row9-rack8-host7', u'chromeos4-row9-rack8-host11', u'chromeos2-row6-rack1-host2', u'chromeos2-row6-rack2-host4', u'chromeos1-row5-rack5-host2', u'chromeos2-row3-rack5-host13', u
'chromeos2-row3-rack5-host16', u'chromeos2-row3-rack5-host18']}
05/11 09:04:45.798 INFO | host_scheduler:0393| Updating suite assignment with released hosts
05/11 09:04:45.799 INFO | host_scheduler:0395| Calling email_manager.
05/11 09:04:45.799 INFO | host_scheduler:0386| Calling new tick.
05/11 09:04:45.799 INFO | host_scheduler:0387| Leasing hosts for frontend tasks.
05/11 09:04:45.804 INFO | host_scheduler:0389| Finding hosts for new jobs.
05/11 09:04:46.203 DEBUG| host_scheduler:0179| Minimum duts to get for suites (suite_id: min_duts): {}
05/11 09:04:46.263 DEBUG| rdb:0395| Processing 648 host acquisition requests
05/11 09:04:46.847 DEBUG| rdb_cache_manager:0235| Cache stats: hit ratio: 39.39%, avg staleness per line: 0.00%.
05/11 09:04:46.848 DEBUG| rdb:0416| Host acquisition stats: distinct requests: 33, leased hosts: 18, unsatisfied requests: 630
05/11 09:04:46.853 INFO | scheduler_models:0530| Assigning host chromeos2-row6-rack2-host4 to entry HQE: 63135235, for job: 62804994 and host: no host has status:Queued
05/11 09:04:46.864 ERROR| email_manager:0082| Uncaught exception; terminating host_scheduler.
Traceback (most recent call last):
File "/usr/local/autotest/scheduler/host_scheduler.py", line 487, in main
host_scheduler.tick()
File "/usr/local/autotest/site-packages/statsd/timer.py", line 95, in _decorator
return function(*args, **kwargs)
File "/usr/local/autotest/scheduler/host_scheduler.py", line 390, in tick
self._schedule_jobs()
File "/usr/local/autotest/site-packages/statsd/timer.py", line 95, in _decorator
return function(*args, **kwargs)
File "/usr/local/autotest/scheduler/host_scheduler.py", line 335, in _schedule_jobs
self.schedule_host_job(acquisition.host, acquisition.job)
File "/usr/local/autotest/scheduler/host_scheduler.py", line 233, in schedule_host_job
queue_entry.set_host(host)
File "/usr/local/autotest/scheduler/scheduler_models.py", line 531, in set_host
self.update_field('host_id', host.id)
File "/usr/local/autotest/scheduler/scheduler_models.py", line 308, in update_field
_db.execute(query, (value, self.id))
File "/usr/local/autotest/database/database_connection.py", line 312, in execute
results = self._backend.execute(query, parameters)
File "/usr/local/autotest/database/database_connection.py", line 132, in execute
parameters=parameters)
File "/usr/local/autotest/database/database_connection.py", line 54, in execute
self._cursor.execute(query, parameters)
File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 122, in execute
six.reraise(utils.IntegrityError, utils.IntegrityError(*tuple(e.args)), sys.exc_info()[2])
File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 120, in execute
return self.cursor.execute(query, args)
File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
self.errorhandler(self, exc, value)
File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
raise errorclass, errorvalue
IntegrityError: (1062, "Duplicate entry '62804994-3246' for key 'host_queue_entries_job_id_and_host_id'")
,
May 11 2016
It is always failing for the same host-queue-entry 63135234. mysql> select * from afe_host_queue_entries where job_id=62804994; +----------+----------+---------+-----------+-----------+--------+----------+---------+----------------------------+-----------------+---------+---------------------+---------------------+ | id | job_id | host_id | status | meta_host | active | complete | deleted | execution_subdir | atomic_group_id | aborted | started_on | finished_on | +----------+----------+---------+-----------+-----------+--------+----------+---------+----------------------------+-----------------+---------+---------------------+---------------------+ | 63135234 | 62804994 | 3246 | Completed | NULL | 0 | 1 | 0 | chromeos2-row6-rack2-host4 | NULL | 0 | 2016-05-10 12:11:20 | 2016-05-10 12:14:27 | | 63135235 | 62804994 | 3285 | Completed | 3217 | 0 | 1 | 0 | chromeos2-row6-rack3-host4 | NULL | 0 | 2016-05-11 09:24:40 | 2016-05-11 09:28:17 | +----------+----------+---------+-----------+-----------+--------+----------+---------+----------------------------+-----------------+---------+---------------------+---------------------+ 2 rows in set (0.00 sec) For some reason the job gets two host-queue-entries and one of them has meta_host=NULL. I believe this is incorrect. meta_host is used for determine jobs scheduled from frontend, so maybe scheduler got confused by it. I manually updated it to 3217. Let's see if the problem will happen again.
,
May 11 2016
+ tienchang@chromium.org looks like this job is scheduled by Tien manually from frontend? http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=62804994 Tien, could you provide some details on how you scheduled the job?
,
May 11 2016
Yes! I created the host via the frontend. I believe this is what I did ... This was the job: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=62804343 Wanting to rule out (or confirm) a failure, I cloned the job. I wanted to chose a specific hostname, but I already clicked "Reuse any specific hosts (default)". No matters, because I could just select a host while creating the job. In the Hostname field, I entered "chromeos2-row6-rack2-host4" while still on the "By browsing hosts" tab. I then selected (confirmed by the yellow highlight) chromeos2-row6-rack2-host4. Maybe that's why there were 2 queued jobs? After reviewing my steps, I seem horrible at following directions!
,
May 11 2016
tien@ thanks for you information. That's very useful. I think I know what's happening. In #4, a job with two hosts are created, which resulting in two hqes hqe 63135234 -- this one was created with a specific host chromeos2-row6-rack2-host4, and meta_host=NULL hqe 63135235 -- this one was created with meta_host=3217 (because user selected Reuse any similar hosts), but not specific host. Our host-scheduler can't deal with such situation. It tried to assign chromeos2-row6-rack2-host4 to 63135235, because - 63135235 needs a host - chromeos2-row6-rack2-host4 is ready/unleased However, chromeos2-row6-rack2-host4 was already assigned to 63135234 (as user requested), so the assignment triggers db error. The problem went away after another dut become available (chromeos2-row6-rack3-host4), and 63135235 finally get this one.
,
May 11 2016
Interesting problem. Fang, thanks for figuring this out!
,
May 11 2016
Potential fix: - Ban such use case, that is, if user selects "Reusing any similar hosts", do now allow user to add an additional dut later on in the next tab. - Or if we allow this use case, when user added an additional dut, it should not lead to an hqe with meta_host=NULL being created.
,
May 11 2016
The problem is still there :p I asked Tian to not do it again, but we can't guarantee someone else won't. P3 as not critical, good candidate for Fixit
,
May 11 2016
,
May 11 2016
,
May 20 2016
Supported for all options reopen
,
Aug 4 2016
Issue 634064 has been merged into this issue.
,
Aug 4 2016
Hey guys we need to raise the priority of this. If Danny can make this mistake at Google, our partners can make this mistake with their MobLabs. Either: * Update the host scheduler to properly handle this situation (which is likely the correct action). * Disable multiple host selection in the AFE (Sounds like we're hiding a problem to me).
,
Aug 4 2016
will take a look
,
Aug 4 2016
,
Aug 11 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/61f5d31b6f6173069e1ad5ce28c739666398e237 commit 61f5d31b6f6173069e1ad5ce28c739666398e237 Author: Shuqian Zhao <shuqianz@chromium.org> Date: Sat Aug 06 00:15:23 2016 [autotest]only accept hosts when both hosts and meta_hosts given for a clone job When cloning a job via autotest frontend, if user chooses the default clone action, which is 'Reuse any similar hosts' and then later still select specific hosts in the next tab, the host scheduler will get confused about the case and would try to schedule two same jobs to one host. In this case, the host scheduler will crash because of duplicate entries for one job in the database. The fix is to only accept the hosts variables under this situation. BUG= chromium:611064 TEST=unittest Change-Id: Id945a9f8f2b45271c903f1ec038366da47f5d436 Reviewed-on: https://chromium-review.googlesource.com/366795 Commit-Ready: Dan Shi <dshi@chromium.org> Tested-by: Shuqian Zhao <shuqianz@chromium.org> Reviewed-by: Fang Deng <fdeng@chromium.org> Reviewed-by: Dan Shi <dshi@google.com> [modify] https://crrev.com/61f5d31b6f6173069e1ad5ce28c739666398e237/frontend/afe/rpc_interface.py
,
Aug 12 2016
,
Aug 29 2016
,
Oct 7 2016
,
Oct 10 2016
,
Nov 19 2016
,
Jan 21 2017
,
Mar 4 2017
,
Apr 17 2017
,
May 30 2017
,
Aug 1 2017
,
Oct 14 2017
|
||||||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||||||
Comment 1 by shuqianz@chromium.org
, May 11 2016