Issue 782797

Issue metadata

Status: Verified
Owner:
Closed: Nov 2017
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 780891



DUT (chromeos4-row11-rack11-host11) stuck in provision-repair loop due to AFE DB error

Project Member Reported by pprabhu@chromium.org, Nov 8 2017

Issue description

All provisions are failing so:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row11-rack11-host11/1951844-provision/20170811103121/
And repairs are passing so:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row11-rack11-host11/1951886-repair

status.log says:

		FAIL	provision_FirmwareUpdate	provision_FirmwareUpdate	timestamp=1510166057	localtime=Nov 08 10:34:17	JSONRPCException: DoesNotExist: Label matching query does not exist. Lookup parameters were {'pk': 652844}
  Traceback (most recent call last):
    File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
      results['result'] = self.invokeServiceEndpoint(meth, args)
    File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
      return meth(*args)
    File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
      return f(*args, **keyword_args)
    File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 1148, in replacement
      return func(**kwargs)
    File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 264, in label_remove_hosts
      remove_label_from_hosts(id, hosts)
    File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 251, in remove_label_from_hosts
      models.Label.smart_get(id).host_set.remove(*host_objs)
    File "/usr/local/autotest/frontend/afe/model_logic.py", line 833, in smart_get
      return manager.get(pk=id_or_name)
    File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 143, in get
      return self.get_query_set().get(*args, **kwargs)
    File "/usr/local/autotest/site-packages/django/db/models/query.py", line 389, in get
      (self.model._meta.object_name, kwargs))
  DoesNotExist: Label matching query does not exist. Lookup parameters were {'pk': 652844}


The error is from provision_FirmwareUpdate:
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 606, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 806, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 470, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 380, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/provision_FirmwareUpdate/provision_FirmwareUpdate.py", line 78, in run_once
    raise error.TestFail(str(e))

-------------------


I don't know how many of these there are.
jrbanette@'s new service for reporting such provision-repair loops would have told us whether this is widespread.

-->deputy.

I'm continuing to investigate a little more.
 
Blocking: 780891
Most likely point of failure is: http://shortn/_iHyvMpP4bb
Owner: pprabhu@chromium.org
Status: Started (was: Untriaged)
Three observations:
- ids for labels do not necessarily match between master and shard
- each database's afe_hosts_labels uses its own (mismatching) label_id to relate a host to the label, which is correct
- we're somehow using the label id obtained on the shard to try to update the afe_hosts_labels table on the master.
KABOOM

On the shard:
(4029 is the host_id of the host in question on both master and shard)

mysql> select * from afe_labels where name = 'fwrw-version:cyan-firmware/R46-7287.57.25';
+--------+-------------------------------------------+---------------+----------+---------+----------------+-----------------+
| id     | name                                      | kernel_config | platform | invalid | only_if_needed | atomic_group_id |
+--------+-------------------------------------------+---------------+----------+---------+----------------+-----------------+
| 652844 | fwrw-version:cyan-firmware/R46-7287.57.25 |               |        0 |       0 |              0 |            NULL |
+--------+-------------------------------------------+---------------+----------+---------+----------------+-----------------+
1 row in set (0.01 sec)

mysql> select * from afe_hosts_labels where host_id = 4029 and label_id = 652844;                                                                                                              
+-------+---------+----------+
| id    | host_id | label_id |
+-------+---------+----------+
| 55797 |    4029 |   652844 |
+-------+---------+----------+
1 row in set (0.00 sec)


--------------------
On the master:
mysql> select * from afe_labels where name = 'fwrw-version:cyan-firmware/R46-7287.57.25';                                                                                                      
+--------+-------------------------------------------+---------------+----------+---------+----------------+-----------------+
| id     | name                                      | kernel_config | platform | invalid | only_if_needed | atomic_group_id |
+--------+-------------------------------------------+---------------+----------+---------+----------------+-----------------+
| 653538 | fwrw-version:cyan-firmware/R46-7287.57.25 |               |        0 |       0 |              0 |            NULL |
+--------+-------------------------------------------+---------------+----------+---------+----------------+-----------------+
1 row in set (0.00 sec)

(Notice the id is different)

mysql>  select * from afe_hosts_labels where host_id = 4029 and label_id = 653538;
+----------+---------+----------+
| id       | host_id | label_id |
+----------+---------+----------+
| 16408164 |    4029 |   653538 |
+----------+---------+----------+
1 row in set (0.00 sec)

On the master, the correct host-label mapping exists under the master's own label_id.

But we try to update the master using the shard's label_id, which matches no row there:

mysql>  select * from afe_hosts_labels where host_id = 4029 and label_id = 652844;
Empty set (0.00 sec)
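
A minimal sketch of the mismatch shown in the queries above, using Python's sqlite3 in place of the real MySQL databases. The table names, column names, and ids are taken from the queries; the trimmed-down schema and the helpers make_db and label_id_by_name are made up for illustration and are not autotest code.

```python
import sqlite3

LABEL = 'fwrw-version:cyan-firmware/R46-7287.57.25'
HOST_ID = 4029


def make_db(label_id, link_id):
    """Build an in-memory stand-in for one AFE database (hypothetical helper)."""
    db = sqlite3.connect(':memory:')
    db.execute('CREATE TABLE afe_labels (id INTEGER PRIMARY KEY, name TEXT)')
    db.execute('CREATE TABLE afe_hosts_labels '
               '(id INTEGER PRIMARY KEY, host_id INTEGER, label_id INTEGER)')
    db.execute('INSERT INTO afe_labels VALUES (?, ?)', (label_id, LABEL))
    db.execute('INSERT INTO afe_hosts_labels VALUES (?, ?, ?)',
               (link_id, HOST_ID, label_id))
    return db


def label_id_by_name(db, name):
    """Return the local id for a label name, or None if it is missing."""
    row = db.execute('SELECT id FROM afe_labels WHERE name = ?',
                     (name,)).fetchone()
    return row[0] if row else None


def host_label_rows(db, host_id, label_id):
    """Count afe_hosts_labels rows for the given (host_id, label_id) pair."""
    return db.execute(
        'SELECT COUNT(*) FROM afe_hosts_labels '
        'WHERE host_id = ? AND label_id = ?',
        (host_id, label_id)).fetchone()[0]


# Same label name, different primary keys on shard and master.
shard = make_db(label_id=652844, link_id=55797)
master = make_db(label_id=653538, link_id=16408164)

shard_label_id = label_id_by_name(shard, LABEL)
master_label_id = label_id_by_name(master, LABEL)

# Using the shard's id against the master finds nothing;
# using the id the master assigned to the same name finds the mapping.
rows_by_shard_id = host_label_rows(master, HOST_ID, shard_label_id)
rows_by_master_id = host_label_rows(master, HOST_ID, master_label_id)

print(shard_label_id, master_label_id, rows_by_shard_id, rows_by_master_id)
```

The label name is the only key in this sketch that means the same thing in both databases.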

---------------
This looks like a problem with RPC forwarding from shard to master.
Yep, the client was using label ids directly: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/759103
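
For illustration only, a rough sketch of why sending the label name instead of the id avoids the failure. FakeAfeDb, remove_by_id and remove_by_name are hypothetical names invented here; this is not the code in the change linked above, only a model of the idea under the assumption that each database resolves the name to its own local id.

```python
class FakeAfeDb:
    """Hypothetical stand-in for one AFE database (master or shard)."""

    def __init__(self, label_ids):
        self.label_ids = dict(label_ids)  # label name -> local id
        self.host_labels = set()          # (host_id, label_id) pairs

    def attach(self, host_id, name):
        """Attach a label to a host using this database's own id."""
        self.host_labels.add((host_id, self.label_ids[name]))

    def remove_by_id(self, host_id, label_id):
        """Remove a label by id; fails if the id is unknown locally."""
        if label_id not in self.label_ids.values():
            raise LookupError(
                'Label matching query does not exist: pk=%d' % label_id)
        self.host_labels.discard((host_id, label_id))

    def remove_by_name(self, host_id, name):
        """Resolve the name to the local id first, then remove."""
        self.remove_by_id(host_id, self.label_ids[name])


LABEL = 'fwrw-version:cyan-firmware/R46-7287.57.25'
HOST_ID = 4029

shard = FakeAfeDb({LABEL: 652844})
master = FakeAfeDb({LABEL: 653538})
shard.attach(HOST_ID, LABEL)
master.attach(HOST_ID, LABEL)

# Forwarding the shard's id to the master fails, as in status.log.
failed_by_id = False
try:
    master.remove_by_id(HOST_ID, shard.label_ids[LABEL])
except LookupError:
    failed_by_id = True

# Forwarding the name lets the master look up its own id.
master.remove_by_name(HOST_ID, LABEL)
```

After the name-based call the master no longer holds the (4029, 653538) mapping, while the id-based call never got past the label lookup.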
Project Member

Comment 6 by bugdroid1@chromium.org, Nov 11 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ef7e5b0735a4664e6f40c8c756f87fabd180dff3

commit ef7e5b0735a4664e6f40c8c756f87fabd180dff3
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Sat Nov 11 11:24:22 2017

autotest: Use label names from AFE client instead of IDs.

master-shard databases do not necessarily have the same id for labels.
So use the label name instead of the id when updating them.

BUG= chromium:782797 
TEST=None

Change-Id: I2a8c1ec2d671151caff31121ca6f9585372dfed3
Reviewed-on: https://chromium-review.googlesource.com/759103
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/ef7e5b0735a4664e6f40c8c756f87fabd180dff3/server/frontend.py

Status: Verified (was: Started)
All planned work here is done.
That DUT has recovered.
