DUT (chromeos4-row11-rack11-host11) stuck in provision-repair loop due to AFE DB error
Issue description

All provisions are failing like so:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row11-rack11-host11/1951844-provision/20170811103121/

And repairs are passing like so:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row11-rack11-host11/1951886-repair

status.log says:

    FAIL provision_FirmwareUpdate provision_FirmwareUpdate timestamp=1510166057 localtime=Nov 08 10:34:17 JSONRPCException: DoesNotExist: Label matching query does not exist. Lookup parameters were {'pk': 652844}
    Traceback (most recent call last):
      File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
        results['result'] = self.invokeServiceEndpoint(meth, args)
      File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
        return meth(*args)
      File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
        return f(*args, **keyword_args)
      File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 1148, in replacement
        return func(**kwargs)
      File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 264, in label_remove_hosts
        remove_label_from_hosts(id, hosts)
      File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 251, in remove_label_from_hosts
        models.Label.smart_get(id).host_set.remove(*host_objs)
      File "/usr/local/autotest/frontend/afe/model_logic.py", line 833, in smart_get
        return manager.get(pk=id_or_name)
      File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 143, in get
        return self.get_query_set().get(*args, **kwargs)
      File "/usr/local/autotest/site-packages/django/db/models/query.py", line 389, in get
        (self.model._meta.object_name, kwargs))
    DoesNotExist: Label matching query does not exist. Lookup parameters were {'pk': 652844}

The error is from provision_FirmwareUpdate:

    Traceback (most recent call last):
      File "/usr/local/autotest/client/common_lib/test.py", line 606, in _exec
        _call_test_function(self.execute, *p_args, **p_dargs)
      File "/usr/local/autotest/client/common_lib/test.py", line 806, in _call_test_function
        return func(*args, **dargs)
      File "/usr/local/autotest/client/common_lib/test.py", line 470, in execute
        dargs)
      File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
        postprocess_profiled_run, args, dargs)
      File "/usr/local/autotest/client/common_lib/test.py", line 380, in _call_run_once
        self.run_once(*args, **dargs)
      File "/usr/local/autotest/server/site_tests/provision_FirmwareUpdate/provision_FirmwareUpdate.py", line 78, in run_once
        raise error.TestFail(str(e))

-------------------

I don't know how many of these there are. jrbanette@'s new service to tell us about such provision-repair loops would have told us whether this is widespread or not.

-->deputy. I'm continuing to investigate a little more.
,
Nov 8 2017
Ugh. When will we learn that hiding tracebacks is a terrible idea?
https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/site_tests/provision_FirmwareUpdate/provision_FirmwareUpdate.py?q=provision_FirmwareUpdate&sq=package:chromeos+file:src/third_party/autotest/files&dr=CSs&l=78
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/759099
,
Nov 8 2017
Most likely point of failure is: http://shortn/_iHyvMpP4bb
,
Nov 8 2017
Two observations:
- ids for labels do not necessarily match between master and shard
- afe_hosts_labels uses the correct (mismatching) label_id to relate a host to the label

Yet we're somehow using the id for the label obtained on the shard to try to update the afe_hosts_labels table on the master. KABOOM

On the shard (4029 is the host_id of the host in question on both master and shard):

    mysql> select * from afe_labels where name = 'fwrw-version:cyan-firmware/R46-7287.57.25';
    +--------+-------------------------------------------+---------------+----------+---------+----------------+-----------------+
    | id     | name                                      | kernel_config | platform | invalid | only_if_needed | atomic_group_id |
    +--------+-------------------------------------------+---------------+----------+---------+----------------+-----------------+
    | 652844 | fwrw-version:cyan-firmware/R46-7287.57.25 |               |        0 |       0 |              0 |            NULL |
    +--------+-------------------------------------------+---------------+----------+---------+----------------+-----------------+
    1 row in set (0.01 sec)

    mysql> select * from afe_hosts_labels where host_id = 4029 and label_id = 652844;
    +-------+---------+----------+
    | id    | host_id | label_id |
    +-------+---------+----------+
    | 55797 |    4029 |   652844 |
    +-------+---------+----------+
    1 row in set (0.00 sec)

--------------------

On the master:

    mysql> select * from afe_labels where name = 'fwrw-version:cyan-firmware/R46-7287.57.25';
    +--------+-------------------------------------------+---------------+----------+---------+----------------+-----------------+
    | id     | name                                      | kernel_config | platform | invalid | only_if_needed | atomic_group_id |
    +--------+-------------------------------------------+---------------+----------+---------+----------------+-----------------+
    | 653538 | fwrw-version:cyan-firmware/R46-7287.57.25 |               |        0 |       0 |              0 |            NULL |
    +--------+-------------------------------------------+---------------+----------+---------+----------------+-----------------+
    1 row in set (0.00 sec)

(Notice the id is different.)

    mysql> select * from afe_hosts_labels where host_id = 4029 and label_id = 653538;
    +----------+---------+----------+
    | id       | host_id | label_id |
    +----------+---------+----------+
    | 16408164 |    4029 |   653538 |
    +----------+---------+----------+
    1 row in set (0.00 sec)

The correct host-label mapping exists via the different label_id. But we try to update using the shard's label_id:

    mysql> select * from afe_hosts_labels where host_id = 4029 and label_id = 652844;
    Empty set (0.00 sec)

---------------

This looks like a problem with RPC forwarding from shard to master.
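
To make the mismatch above concrete, here is a minimal sketch that rebuilds the two tables in throwaway in-memory SQLite databases and repeats the lookups. The helper names (make_db, label_id, host_has_label) and the trimmed-down schema are assumptions for illustration only; the real AFE runs on MySQL behind Django models.

```python
import sqlite3

LABEL = "fwrw-version:cyan-firmware/R46-7287.57.25"
HOST_ID = 4029


def make_db(label_pk, link_pk):
    """Build an in-memory stand-in for one AFE database (hypothetical helper)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE afe_labels (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute(
        "CREATE TABLE afe_hosts_labels "
        "(id INTEGER PRIMARY KEY, host_id INTEGER, label_id INTEGER)"
    )
    db.execute("INSERT INTO afe_labels VALUES (?, ?)", (label_pk, LABEL))
    db.execute(
        "INSERT INTO afe_hosts_labels VALUES (?, ?, ?)",
        (link_pk, HOST_ID, label_pk),
    )
    return db


def label_id(db, name):
    """Return the local primary key for a label name, or None if absent."""
    row = db.execute(
        "SELECT id FROM afe_labels WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def host_has_label(db, host_id, label_pk):
    """True if this database relates the host to the given label id."""
    row = db.execute(
        "SELECT 1 FROM afe_hosts_labels WHERE host_id = ? AND label_id = ?",
        (host_id, label_pk),
    ).fetchone()
    return row is not None


# Same label name, different primary keys, as in the mysql output above.
shard = make_db(label_pk=652844, link_pk=55797)
master = make_db(label_pk=653538, link_pk=16408164)

shard_id = label_id(shard, LABEL)
master_id = label_id(master, LABEL)

# Using the shard's id against the master finds nothing (the "Empty set").
stale_lookup = host_has_label(master, HOST_ID, shard_id)
# Resolving the id on the master itself finds the mapping.
correct_lookup = host_has_label(master, HOST_ID, master_id)

print(shard_id, master_id, stale_lookup, correct_lookup)
```

The only value that means the same thing in both databases is the label name, which is why an id taken from one side cannot safely be sent to the other.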
,
Nov 8 2017
Yep, the client was using label ids directly: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/759103
,
Nov 11 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ef7e5b0735a4664e6f40c8c756f87fabd180dff3

commit ef7e5b0735a4664e6f40c8c756f87fabd180dff3
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Sat Nov 11 11:24:22 2017

    autotest: Use label names from AFE client instead of IDs.

    master-shard databases do not necessarily have the same id for labels.
    So use the label name instead of the id when updating them.

    BUG= chromium:782797
    TEST=None

    Change-Id: I2a8c1ec2d671151caff31121ca6f9585372dfed3
    Reviewed-on: https://chromium-review.googlesource.com/759103
    Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
    Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
    Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
    Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/ef7e5b0735a4664e6f40c8c756f87fabd180dff3/server/frontend.py
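
The idea behind this change can be sketched with a toy model. This is not the code in server/frontend.py: FakeAfe and its methods are hypothetical stand-ins, used only to show why a label name survives the hop from shard to master while a shard-local id does not.

```python
LABEL = "fwrw-version:cyan-firmware/R46-7287.57.25"


class FakeAfe:
    """Toy stand-in for one AFE database (hypothetical, for illustration)."""

    def __init__(self, label_ids):
        # label name -> primary key local to this database
        self._ids = dict(label_ids)
        # primary key -> set of host ids carrying that label
        self._hosts = {pk: set() for pk in self._ids.values()}

    def add_host(self, name, host_id):
        self._hosts[self._ids[name]].add(host_id)

    def hosts_with(self, name):
        return set(self._hosts[self._ids[name]])

    def remove_host_by_id(self, label_pk, host_id):
        # Mirrors the failing path: an unknown primary key is an error.
        if label_pk not in self._hosts:
            raise LookupError("Label matching query does not exist: %r" % label_pk)
        self._hosts[label_pk].discard(host_id)

    def remove_host_by_name(self, name, host_id):
        # The name is resolved to an id inside the receiving database.
        self.remove_host_by_id(self._ids[name], host_id)


SHARD_LABEL_ID = 652844
master = FakeAfe({LABEL: 653538})
master.add_host(LABEL, 4029)

# Forwarding the shard's id to the master fails.
try:
    master.remove_host_by_id(SHARD_LABEL_ID, 4029)
    id_based_failed = False
except LookupError:
    id_based_failed = True

# Forwarding the name lets the master pick its own id.
master.remove_host_by_name(LABEL, 4029)
remaining = master.hosts_with(LABEL)

print(id_based_failed, remaining)
```

Resolving the name on the receiving side keeps each database's primary keys private to that database.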
,
Nov 16 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/f5b2cb558ae426a0875fe53be44af8a08c6a20bf

commit f5b2cb558ae426a0875fe53be44af8a08c6a20bf
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Nov 16 04:21:13 2017

    autotest: reraise provision errors with the original traceback.

    Without this traceback, we have no idea why the provision actually
    failed.

    BUG= chromium:782797
    TEST=None

    Change-Id: I2b5bccc23d86a866d0cca9c80a77023ba4ebff66
    Reviewed-on: https://chromium-review.googlesource.com/759099
    Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
    Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
    Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
    Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/f5b2cb558ae426a0875fe53be44af8a08c6a20bf/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py
[modify] https://crrev.com/f5b2cb558ae426a0875fe53be44af8a08c6a20bf/server/site_tests/provision_FirmwareUpdate/provision_FirmwareUpdate.py
[modify] https://crrev.com/f5b2cb558ae426a0875fe53be44af8a08c6a20bf/server/site_tests/provision_AndroidUpdate/provision_AndroidUpdate.py
[modify] https://crrev.com/f5b2cb558ae426a0875fe53be44af8a08c6a20bf/server/site_tests/provision_TestbedUpdate/provision_TestbedUpdate.py
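
As a rough illustration of why the original traceback matters, the sketch below compares a wrapper that discards the underlying error with one that keeps it. It uses Python 3 exception chaining, which may well differ from the mechanism in the actual change. TestFail here is a local stand-in for autotest's error.TestFail, and update_firmware is a hypothetical function that fails the way the label lookup did.

```python
import traceback


class TestFail(Exception):
    """Local stand-in for autotest's error.TestFail (assumption)."""


def update_firmware():
    # Hypothetical failure, modelled on the DoesNotExist error in this bug.
    raise LookupError("Label matching query does not exist.")


def run_once_hiding():
    try:
        update_firmware()
    except Exception as e:
        # Only the message survives; "from None" drops the original error,
        # approximating the behaviour this bug complains about.
        raise TestFail(str(e)) from None


def run_once_chained():
    try:
        update_firmware()
    except Exception as e:
        # The original exception and its traceback stay attached.
        raise TestFail(str(e)) from e


def format_failure(fn):
    """Run fn and return the formatted traceback of the TestFail it raises."""
    try:
        fn()
    except TestFail as e:
        return "".join(traceback.format_exception(type(e), e, e.__traceback__))
    return ""


hidden = format_failure(run_once_hiding)
chained = format_failure(run_once_chained)

print(hidden)
print(chained)
```

The chained output contains the LookupError and the frame inside update_firmware, while the hidden one stops at the line that raised TestFail, which is the situation the status.log at the top of this bug was in.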
,
Nov 16 2017
All planned work here is done. That DUT has recovered.
Comment 1 by pprabhu@chromium.org
, Nov 8 2017