
Issue 757500


Issue metadata

Status: Archived
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




AFE: Large number of master-shard host_label desync since Aug 18th

Project Member Reported by pprabhu@chromium.org, Aug 21 2017

Issue description

https://viceroy.corp.google.com/chromeos/sentinel?duration=8d#_VG_PFGzLahW

sentinel is recovering a large number of afe_label inconsistencies.

No visible impact (yet).
 
chromeos-test@chromeos-server18:/var/log/autotest_sentinel$ grep -i 'delete label' sentinel.log.3 | awk '{print $6}'| sort | uniq -c | sort -n
      1 10974
      2 9952
      3 187672
      4 400055
      5 152520
     14 1500
     24 164763
chromeos-test@chromeos-server18:/var/log/autotest_sentinel$ grep -i 'add label' sentinel.log.3 | awk '{print $6}'| sort | uniq -c | sort -n
      1 9952
      2 152520
      2 187672
      6 400055
     19 164763
     26 1500
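
For anyone who wants to redo this tally without the shell pipeline, here is a rough Python sketch of the same idea: keep lines that mention the action (case-insensitive, like grep -i), take the 6th whitespace-separated field (like awk '{print $6}'), and count occurrences. The function name tally_label_ids and the sample lines are hypothetical placeholders. The real sentinel.log line format is not shown in this issue, so the sample only mimics "label ID in field 6".

```python
from collections import Counter


def tally_label_ids(lines, action):
    """Count the 6th whitespace-separated field of lines mentioning `action`.

    Mirrors: grep -i ACTION | awk '{print $6}' | sort | uniq -c
    """
    needle = action.lower()
    counts = Counter()
    for line in lines:
        if needle not in line.lower():
            continue
        fields = line.split()
        if len(fields) >= 6:
            counts[fields[5]] += 1
    return counts


# Placeholder lines, NOT real sentinel output: the only assumption carried
# over from the issue is that the label ID sits in field 6.
sample = [
    "f1 f2 f3 Delete label 1500 x",
    "f1 f2 f3 Delete label 1500 y",
    "f1 f2 f3 Add label 1500 z",
    "f1 f2 f3 delete label 9952 w",
]

deletes = tally_label_ids(sample, "delete label")
adds = tally_label_ids(sample, "add label")

# Ascending by count, like the trailing `sort -n`.
for label_id, count in sorted(deletes.items(), key=lambda kv: kv[1]):
    print(count, label_id)
```

To run it on a real log, pass an open file object instead of the sample list, assuming the field position matches.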

Cc: shuqianz@chromium.org
deleted labels:


mysql> select * from afe_labels where id in ('1500', '164763', '152520', '400055', '187672', '9952', '10974');
+--------+-----------------------+---------------+----------+---------+----------------+-----------------+
| id     | name                  | kernel_config | platform | invalid | only_if_needed | atomic_group_id |
+--------+-----------------------+---------------+----------+---------+----------------+-----------------+
|   1500 | pool:suites           |               |        0 |       0 |              0 |            NULL |
|   9952 | bluetooth             |               |        0 |       0 |              0 |            NULL |
|  10974 | webcam                |               |        0 |       0 |              0 |            NULL |
| 152520 | audio_loopback_dongle |               |        0 |       0 |              0 |            NULL |
| 164763 | pool:performance      |               |        0 |       0 |              0 |            NULL |
| 187672 | hw_video_acc_vp9      |               |        0 |       0 |              0 |            NULL |
| 400055 | pool:crosperf         |               |        0 |       0 |              0 |            NULL |
+--------+-----------------------+---------------+----------+---------+----------------+-----------------+
7 rows in set (0.00 sec)


added labels:
mysql> select * from afe_labels where id in ('1500', '164763', '400055', '187672', '152520');
+--------+-----------------------+---------------+----------+---------+----------------+-----------------+
| id     | name                  | kernel_config | platform | invalid | only_if_needed | atomic_group_id |
+--------+-----------------------+---------------+----------+---------+----------------+-----------------+
|   1500 | pool:suites           |               |        0 |       0 |              0 |            NULL |
| 152520 | audio_loopback_dongle |               |        0 |       0 |              0 |            NULL |
| 164763 | pool:performance      |               |        0 |       0 |              0 |            NULL |
| 187672 | hw_video_acc_vp9      |               |        0 |       0 |              0 |            NULL |
| 400055 | pool:crosperf         |               |        0 |       0 |              0 |            NULL |
+--------+-----------------------+---------------+----------+---------+----------------+-----------------+
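
A quick way to read the two tallies together is to compute adds minus deletes per label. The sketch below does that using only the counts from the grep output and the names from the afe_labels rows quoted in this issue. The helper name net_changes is made up for illustration. Note the counts cover sentinel.log.3 only, and a net count of sentinel operations is not necessarily the net change in hosts carrying a label.

```python
# Counts copied from the 'delete label' and 'add label' tallies in this issue.
deleted = {10974: 1, 9952: 2, 187672: 3, 400055: 4, 152520: 5, 1500: 14, 164763: 24}
added = {9952: 1, 152520: 2, 187672: 2, 400055: 6, 164763: 19, 1500: 26}

# Names copied from the afe_labels query results in this issue.
names = {
    1500: "pool:suites",
    9952: "bluetooth",
    10974: "webcam",
    152520: "audio_loopback_dongle",
    164763: "pool:performance",
    187672: "hw_video_acc_vp9",
    400055: "pool:crosperf",
}


def net_changes(added, deleted):
    """Return {label_id: adds - deletes} over every ID seen in either tally."""
    ids = set(added) | set(deleted)
    return {i: added.get(i, 0) - deleted.get(i, 0) for i in ids}


net = net_changes(added, deleted)
for label_id in sorted(net):
    print(f"{names[label_id]:<22} {net[label_id]:+d}")
```

On these numbers, pool:suites comes out at +12 and pool:performance at -5, so the pool labels account for most of the churn in this log file.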



---------------
And the label-desync has come back to its baseline level. Perhaps a fallout of the shard migration on Friday?
Cc: jrbarnette@chromium.org
This looks very fishy. Why do so many hosts have incorrect pool labels?

For example, one of the DUTs from which pool:suites was removed no longer has _any_ pool:
mysql> select * from afe_hosts where id = 6220;
+------+-----------------------------+--------+----------+--------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
| id   | hostname                    | locked | synch_id | status | invalid | protection | locked_by_id | lock_time | dirty | leased | shard_id | lock_reason |
+------+-----------------------------+--------+----------+--------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
| 6220 | chromeos6-row2-rack5-host16 |      0 |     NULL | Ready  |       0 |          0 |         NULL | NULL      |     1 |      0 |       85 |             |
+------+-----------------------------+--------+----------+--------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
1 row in set (0.00 sec)

mysql> ^CCtrl-C -- exit!
Aborted
pprabhu@pprabhu:~$ atest host list chromeos6-row2-rack5-host16
Host                         Status  Shard                                  Locked  Lock Reason  Locked by  Platform  Labels
chromeos6-row2-rack5-host16  Ready   chromeos-server50.hot.corp.google.com  False                None       stout     bluetooth, storage:ssd, os:cros, hw_jpeg_acc_dec, power:battery, board:stout, hw_video_acc_h264, cts_abi_x86, cts_abi_arm, webcam, internal_display, audio_loopback_dongle, variant:stout, sku:stout_intel_celeron_1007U_4Gb, touchpad, cros-version:stout-release/R62-9856.0.0
pprabhu@pprabhu:~$ atest host list chromeos6-row2-rack5-host16 | grep pool
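
The same check can be scripted against the Labels column of atest host list output. This is only a sketch: pool_labels is a hypothetical helper, and it assumes the column is a comma-separated list as shown above. It is slightly stricter than grep pool, since it only matches labels that start with "pool:".

```python
def pool_labels(labels_column):
    """Return the labels in a comma-separated Labels column that start with 'pool:'."""
    labels = [label.strip() for label in labels_column.split(",")]
    return [label for label in labels if label.startswith("pool:")]


# Labels column for chromeos6-row2-rack5-host16, copied from the output above.
labels = (
    "bluetooth, storage:ssd, os:cros, hw_jpeg_acc_dec, power:battery, "
    "board:stout, hw_video_acc_h264, cts_abi_x86, cts_abi_arm, webcam, "
    "internal_display, audio_loopback_dongle, variant:stout, "
    "sku:stout_intel_celeron_1007U_4Gb, touchpad, "
    "cros-version:stout-release/R62-9856.0.0"
)

print(pool_labels(labels))
```

An empty list here matches the empty grep output: the host is not in any pool.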


Status: Archived (was: Assigned)
The number of desynced hosts has stabilized at its baseline (which is way too high, imo).
