make the sub-project 'modify AFE RPC more responsible' higher priority
Issue description: This device is listed as "Ready" but no tests will schedule on it. http://cautotest/afe/#tab_id=view_host&object_id=2747 It is unclear why tests cannot be scheduled.
,
Jun 9 2016
Issue 618837 has been merged into this issue.
,
Jun 9 2016
No answers yet. Possible solutions: 1. find a way to activate them; 2. change their states in the DB (to activate them). Still checking.
,
Jun 10 2016
With Fang's help, their statuses in the DB have been changed, and they are back in service. There are synchronization issues between the master and the shard, which cause a DUT's status to differ between the master DB and the shard DB. This can lead to many unknown problems; this bug is one of them. Next time, before updating a DUT's status, ACL group, or attributes, or locking a DUT, FIRST check with the deputy to make sure the database stays consistent after any moves.
,
Jun 10 2016
The swapping of devices in and out of the performance pool is completely automated; the ACL/label changes happen in a script run by a cron job every hour. When a bad device is detected in the pool, it is swapped out and a functioning replacement from the suites pool is moved in. There are 5-15 device swaps a day, so it is not really feasible to notify the deputy of every device move. Also, since we only have a single pool to run performance tests on, we need to be able to swap devices in and out of the pool faster than it would be reasonable to ask the deputy to reply. Just to be clear, my script is making AFE RPC calls; I am not writing to the DB directly.
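The hourly swap described above can be sketched roughly like this (a minimal in-memory simulation; the pools, the `is_healthy` check, and the `swap_bad_devices` helper are hypothetical stand-ins for the real script, which would make AFE RPC calls such as add_labels/remove_labels and add_acl/remove_acl instead):

```python
# Minimal sketch of an hourly pool-swap job. Pools are modeled as sets
# of hostnames; in the real script each move would be an AFE RPC call.

def swap_bad_devices(performance, suites, is_healthy):
    """Move unhealthy DUTs out of the performance pool and backfill
    from the suites pool. Returns a list of (removed, added) swaps."""
    swaps = []
    for dut in sorted(performance):          # snapshot; safe to mutate the set
        if not is_healthy(dut):
            performance.remove(dut)          # swap the bad device out
            replacement = next(
                (d for d in sorted(suites) if is_healthy(d)), None)
            if replacement is not None:
                suites.remove(replacement)   # take a working spare
                performance.add(replacement) # move it into the pool
            swaps.append((dut, replacement))
    return swaps
```

For example, with `performance = {'host19'}`, `suites = {'host20', 'host21'}`, and a health check that flags only `host19`, one call swaps `host19` out and `host20` in.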
,
Jun 10 2016
Oh, that makes sense; I agree. It seems I need to make the sub-project 'modify AFE RPC more responsible' higher priority, since it causes more and more inconsistencies between the databases.
,
Jun 10 2016
Thanks for the help
,
Jun 13 2016
,
Jun 13 2016
+ richard for input
,
Jun 13 2016
> Just to be clear my script is making AFE RPC calls, I am not writing to the DB directly.
Where can we find the source to this script?
,
Jun 13 2016
I am confused about why anything done to the device should affect the host "Status", which is not modifiable via the AFE frontend.
Before I filed the bug I did lock the device in the UI and request a repair to see if it would self-correct; however, since the problem with the status was already present, that could not be the root cause of this issue.
The script calls that actually modify the state of a device in the lab are:
afe = frontend.AFE()
hosts = afe.get_hosts(["chromeos4-row2-rack9-host19"])
host = hosts[0]
host.add_labels(["pool:performance"])
host.remove_labels(["pool:performance"])
host.add_acl("performance")
host.remove_acl("pool:performance")
I do other read-only operations like afe.get_hostnames and host.get_labels.
All tests are scheduled with run_suite
Fang/Xixuan, can you explain more about why you feel my code/actions contributed to this device's status getting out of sync?
,
Jun 14 2016
The code in Comment 11 won't work today for any hosts that are on a shard, and I believe that's the problem causing this bug. host.add_acl ends up calling afe/rpc_interface.py:acl_group_add_users, which currently only modifies the master DB. It needs to be modified to fan out to shards; remove_acl has the same issue. (Your use case is totally reasonable; we just need to fix our end to support it.)
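The fan-out fix being proposed could look roughly like this (a hedged sketch only: the dict-based DBs, the host-to-shard mapping, and this simplified `acl_group_add_users` are stand-ins, not the real afe/rpc_interface.py code):

```python
# Sketch of fanning an ACL change out from the master DB to the shard
# that owns the host, so the two databases stay consistent. DBs are
# modeled as dicts mapping ACL name -> set of hostnames.

def acl_group_add_users(master_db, shard_dbs, host_to_shard, acl, host):
    """Apply the ACL change on the master, then replay it on the
    shard that owns the host (if any)."""
    master_db.setdefault(acl, set()).add(host)  # current behavior: master only
    shard = host_to_shard.get(host)             # which shard owns this DUT?
    if shard is not None:
        # proposed fix: forward the same change to the owning shard
        shard_dbs[shard].setdefault(acl, set()).add(host)
```

Without the final fan-out step, a sharded host's ACL would change only in the master DB, which is exactly the master/shard divergence described earlier in this thread.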
,
Jun 14 2016
The easiest/most urgent fix would be to fix those calls and stop the bug from leaving DUTs stuck. I suggest doing that before we are able to fix the bigger DB-inconsistency issue.
,
Jun 14 2016
I suspect that the ACL call might be the most problematic; I have another bug open about that call being very slow (200+ seconds): https://bugs.chromium.org/p/chromium/issues/detail?id=618827
,
Jul 31 2016
,
Aug 24 2016
,
Jun 20 2017
Comment 1 by xixuan@chromium.org, Jun 9 2016