
Issue 618828

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




make the sub-project 'modify AFE RPC more responsible' higher priority

Project Member Reported by haddowk@chromium.org, Jun 9 2016

Issue description


This device is listed as "Ready" but no tests will schedule on it.

http://cautotest/afe/#tab_id=view_host&object_id=2747

It is unclear why tests cannot be scheduled.



 
checking
 Issue 618837  has been merged into this issue.
No answers yet. Possible solutions:

1. find a way to activate them
2. change their states in db (to activate them)

still checking.

Comment 4 by xixuan@chromium.org, Jun 10 2016

Cc: fdeng@chromium.org
With Fang's help, their statuses in the DB have been changed. Now they're back to work.

There are some synchronization issues between the master and the shards, which leave a DUT's status different in the master DB and the shard DB. This will lead to many unknown problems, and this bug is one of them.
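A minimal sketch of how such a master/shard mismatch could be detected, assuming each database's host statuses have already been read into a plain dict. The function name and the sample data are hypothetical and are not part of autotest:

```python
def find_status_mismatches(master_status, shard_status):
    """Return {hostname: (master_value, shard_value)} where the two differ.

    Both arguments are plain dicts of hostname -> status string. Only hosts
    known to the shard are compared, because the shard owns those hosts.
    """
    mismatches = {}
    for hostname, shard_value in shard_status.items():
        master_value = master_status.get(hostname)
        if master_value != shard_value:
            mismatches[hostname] = (master_value, shard_value)
    return mismatches


# Hypothetical sample data: the master says "Ready", the shard disagrees.
master = {"host-a": "Ready", "host-b": "Ready"}
shard = {"host-a": "Repair Failed", "host-b": "Ready"}
mismatched = find_status_mismatches(master, shard)
```

A report like this would only flag the inconsistency; deciding which side is correct would still need a human or a separate policy.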

Next time, before updating a DUT's status, ACL group, or attributes, or locking a DUT, FIRST check with the deputy to make sure the database stays consistent after any moves.
Cc: mshe...@chromium.org

The swapping of devices in and out of the performance pool is completely automated; the ACL/label changes happen in a script run by a cron job every hour. When a bad device is detected in the pool, it is swapped out and a functioning replacement from the suites pool is moved in.

There are 5 to 15 device swaps a day, so it is not really feasible to notify the deputy of every device move. Also, since we only have a single device to run performance tests on, we need to be able to swap devices in and out of the pool faster than it would be reasonable to ask the deputy to reply.

Just to be clear, my script is making AFE RPC calls; I am not writing to the DB directly.
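For readers unfamiliar with the swap, here is a rough sketch of the selection step only, assuming pool membership is available as plain lists and health is decided by a caller-supplied function. The name plan_swaps and the sample hosts are hypothetical; the real script is not shown in this thread:

```python
def plan_swaps(performance_pool, suites_pool, is_healthy):
    """Pair each bad performance-pool host with a healthy suites-pool spare.

    Returns a list of (bad_host, replacement) tuples. Planning stops early
    if the suites pool runs out of healthy spares.
    """
    spares = [host for host in suites_pool if is_healthy(host)]
    swaps = []
    for host in performance_pool:
        if is_healthy(host):
            continue
        if not spares:
            break
        swaps.append((host, spares.pop(0)))
    return swaps


# Hypothetical example: one bad device in the pool, two candidate spares.
bad_hosts = {"perf-1", "suite-1"}
swaps = plan_swaps(["perf-1"], ["suite-1", "suite-2"],
                   lambda host: host not in bad_hosts)
```

Each planned pair would then be applied through the label and ACL RPC calls rather than by writing to the DB.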

Comment 6 by xixuan@chromium.org, Jun 10 2016

Oh, that makes sense. I agree.

Seems I need to make the sub-project 'modify AFE RPC more responsible' higher priority since it causes more and more inconsistencies between databases.
Thanks for the help

Comment 8 by autumn@chromium.org, Jun 13 2016

Labels: -current-issue
Owner: xixuan@chromium.org
Status: Assigned (was: Untriaged)
Summary: make the sub-project 'modify AFE RPC more responsible' higher priority (was: chromeos4-row1-rack3-host7 is in "Ready" state but no tests will schedule on it)

Comment 9 by autumn@chromium.org, Jun 13 2016

Cc: jrbarnette@chromium.org
+ richard for input 
> Just to be clear my script is making AFE RPC calls,  I am not
> writing to the DB directly.

Where can we find the source to this script?

I am confused about why anything done to the device should affect the host "Status", which is not modifiable via the AFE FE.

Before I filed the bug, I did lock the device in the UI and request a repair to see if it would self-correct. However, since the problem with the status was already present, that could not be the root cause of this issue.

The script calls that actually modify the state of a device in the lab are:


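from autotest_lib.server import frontend

afe = frontend.AFE()

# Look up the host object by hostname.
hosts = afe.get_hosts(["chromeos4-row2-rack9-host19"])
host = hosts[0]

# Move the host into or out of the performance pool by label.
host.add_labels(["pool:performance"])
host.remove_labels(["pool:performance"])

# Add the host to or remove it from an ACL group.
host.add_acl("performance")
host.remove_acl("pool:performance")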

I also do other read-only operations, like afe.get_hostnames and host.get_labels.

All tests are scheduled with run_suite.

Fang/Xixuan, can you explain more about why you feel my code/actions contributed to this device's status getting out of sync?


Comment 12 by fdeng@chromium.org, Jun 14 2016

The code in Comment 11 won't work today for any hosts that are on a shard, and I believe that's the problem causing this bug.

host.add_acl ends up calling afe/rpc_interface.py:acl_group_add_users, which currently only modifies the master DB. It needs to be modified to fan out to the shards.

The same issue applies to remove_acl.

(Your use case is totally reasonable, just that we need to fix our end to support it)
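As an illustration of what "fan out to shards" could look like, here is a small sketch using in-memory stand-ins for the databases. Every name in it (FakeDb, fanout_acl_change, add_hosts_to_acl, the shard mapping) is hypothetical and is not the real autotest RPC interface:

```python
class FakeDb:
    """In-memory stand-in for one AFE database (the master or one shard)."""

    def __init__(self, name):
        self.name = name
        self.acl_hosts = {}  # ACL group name -> set of hostnames

    def add_hosts_to_acl(self, acl, hostnames):
        self.acl_hosts.setdefault(acl, set()).update(hostnames)

    def remove_hosts_from_acl(self, acl, hostnames):
        self.acl_hosts.setdefault(acl, set()).difference_update(hostnames)


def fanout_acl_change(master, shards, shard_of_host, method, acl, hostnames):
    """Apply an ACL change on the master, then on each shard owning a host.

    shards maps shard name -> FakeDb; shard_of_host maps hostname -> shard
    name. Hosts with no entry in shard_of_host live only on the master.
    """
    getattr(master, method)(acl, hostnames)
    by_shard = {}
    for hostname in hostnames:
        shard_name = shard_of_host.get(hostname)
        if shard_name is not None:
            by_shard.setdefault(shard_name, []).append(hostname)
    for shard_name, shard_hosts in by_shard.items():
        getattr(shards[shard_name], method)(acl, shard_hosts)


master = FakeDb("master")
shard1 = FakeDb("shard1")
fanout_acl_change(master, {"shard1": shard1}, {"host-a": "shard1"},
                  "add_hosts_to_acl", "performance", ["host-a", "host-b"])
```

The point of the sketch is only that the same change reaches both the master and the owning shard; error handling for an unreachable shard is left out.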

Comment 13 by fdeng@chromium.org, Jun 14 2016

The easiest and most urgent fix would be to fix the calls and stop the bug from leaving DUTs stuck. I suggest doing that before we are able to fix the bigger DB inconsistency issue.
I suspect that the ACL call might be the most problematic. I have another bug open about that call being very slow (200+ seconds):

https://bugs.chromium.org/p/chromium/issues/detail?id=618827



Cc: -mshe...@chromium.org
Cc: hctsai@chromium.org cywang@chromium.org waihong@chromium.org bccheng@chromium.org
Status: WontFix (was: Assigned)
