
Issue 609645

Starred by 2 users

Issue metadata

Status: Verified
Owner:
Closed: May 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




chameleon labels are removed and not added back to hosts. Unable to re-run display_HotPlugAtSuspend test job

Project Member Reported by ka...@chromium.org, May 5 2016

Issue description

Test job failed at https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=62370897

I am unable to re-run. Yellow message on top says "ValidationError: {'hosts': u'Host(s) failed to meet job dependencies (board:daisy, chameleon, chameleon:hdmi, pool:chameleon_hdmi_stable, cros-version:daisy-release/R52-8283.0.0): chromeos1-row5-rack5-host2'}"


I don't see why: it is the same host that ran this test earlier today.
 

Comment 1 by dshi@chromium.org, May 5 2016

The host doesn't have the label chameleon:hdmi, so the error is legit. Could it be that the label was removed after that test?

Comment 2 by ka...@chromium.org, May 5 2016

Cc: waihong@chromium.org
Status: WontFix (was: Untriaged)
Thanks Dan, this is odd. I have no idea why the label should go away. I'll get the label in again. I'll ask Kevin if he knows about this (just like the servo label).


Comment 3 by ka...@chromium.org, May 5 2016

Hmm, the label got in somehow. And now I am able to re-run.

Comment 4 by dshi@chromium.org, May 6 2016

Cc: kevcheng@chromium.org
+Kevin

Could this be auto-added/removed by label detection code?
Yeah, the chameleon label is affected like the servo label. Looks like I'll need to make the chameleon label detection mechanism more robust as well.

Comment 6 by ka...@chromium.org, May 6 2016

Cc: conradlo@chromium.org cychiang@chromium.org
Owner: kevcheng@chromium.org
Status: Assigned (was: WontFix)
Summary: chameleon labels are removed and not added back to hosts. Unable to re-run display_HotPlugAtSuspend test job (was: Unable to re-run display_HotPlugAtSuspend test job)
Can you please stop the mechanism of removing/adding labels, until it is fixed?
At least for chromeos1-* and android1758-* hosts
Sorry for the inconvenience Kalin, this label auto-updating is a bit difficult to debug without it being in production :\.  And I can't turn it off for specific (or globbed) hostnames.

I checked all DUTs in pool:chameleon and 5 out of 48 don't have the chameleon label. I tried to ping their hostname-chameleon, and none of them resolved, so either there's something funky with the DNS server or the chameleons got removed from these hosts. Can you confirm whether these 5 hosts are expected to have chameleons?

chromeos4-row10-rack6-host3
chromeos4-row10-rack6-host5
chromeos4-row10-rack6-host7
chromeos4-row9-rack4-host7
chromeos4-row9-rack4-host9

Comment 8 by ka...@chromium.org, May 6 2016

I do not know about these boards (3 auron yuna and 2 nyan_big) and do not expect them to have a chameleon. This is the first time I have heard of these boards.
Is it possible that the 'pool:chameleon' label was wrongly added by auto-labeling?
Pool labels must be manually specified (you can see the whole list of labels that can be auto-added/removed here: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/hosts/cros_label.py?rcl=7b271d0305b5f6ba1c746f3b6098695e9377f038&l=428)

If it's all right with you, can I remove the pool:chameleon labels from those hosts? Is there someone else I should check with as well?

As for the disappearing chameleon labels, I'll keep an eye out, but if you see it again, please let me know so I can debug. I have now added logging for when labels are added/removed.
Yes, you can remove the label from these hosts. Not sure about other devices. 

Logging is nice to have. Thanks.
pool:chameleon has been removed from:

chromeos4-row10-rack6-host3
chromeos4-row10-rack6-host5
chromeos4-row10-rack6-host7
chromeos4-row9-rack4-host7
chromeos4-row9-rack4-host9
The hosts below are missing their 'servo' label. The servo boards are pingable.

kalin@kalin:~$ atest host list | grep -E "pool:usb_peripherals|audio_box" | grep -v servo | cut -d ' ' -f 1
chromeos1-row2-rack4-host4
chromeos1-row1-rack3-host3
chromeos1-row2-rack3-host5
chromeos1-row5-rack6-host2
Labels: -Pri-3 Pri-1
Those 4 hosts didn't run any tests until after the fix was pushed (5/5 2pm), so I ran a reverify for them to pick up the label again, then reran their last test to make sure the label stuck (which it did).

If you see any more hosts that are missing their servo label after running a test/verify, please let me know!

Nice, now all is good.

Coming back to the original reason for filing this bug: being unable to re-run a test because a label is missing.

It seems auto-labeling will remove labels from a host and add them back according to its algorithm. Is this the case? If yes, this means that at some point, when I want to re-run a test because it failed the previous day, I will not be able to schedule the test.

Can you share more on the auto-labeling design?
For the missing chameleon:hdmi label, I found one host that does not have the 'chameleon:hdmi' label:
$ atest host list | grep -E "pool:chameleon_hdmi|pool:chameleon_hdmi_stable" | grep -v chameleon:hdmi | cut -d ' ' -f 1
chromeos1-row2-rack4-host2


Instead, this host has the 'chameleon:' label, which means ChameleonConnectionLabel() did not see the HDMI connection.

Therefore some chameleon suite(s) will not execute.
Certainly! 

The auto-labeling boils down to a simple detect and apply differences approach.
1. It runs through all the label detect methods to see what labels are applicable right now.
2. It checks which labels are no longer applicable (basically existing labels that weren't detected in step #1).
3. It removes the labels that are no longer applicable (from step #2).
4. It checks which labels from step #1 aren't already applied to the host.
5. It adds the labels from step #4.

And done!  This is done during verify (which is invoked during a reset which happens before any test runs).
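
A minimal sketch of the five steps above, expressed as set differences. This is not the actual autotest code: `diff_labels` and `is_managed` are hypothetical names, and the rule that only `pool:` labels are left alone is an assumption made for illustration (the thread only says that pool labels must be specified manually).

```python
def diff_labels(existing, detected, is_managed):
    """Return (to_remove, to_add) for one verify pass.

    existing:   labels currently on the host
    detected:   labels the detect methods report right now (step 1)
    is_managed: hypothetical predicate, True if auto-labeling may touch a label
    """
    existing = set(existing)
    detected = set(detected)
    # Steps 2-3: auto-managed labels on the host that were not detected.
    to_remove = {label for label in existing - detected if is_managed(label)}
    # Steps 4-5: detected labels that the host does not carry yet.
    to_add = detected - existing
    return to_remove, to_add


def is_managed(label):
    # Assumption for this sketch: manually assigned pool labels are never touched.
    return not label.startswith("pool:")


existing = {"board:daisy", "chameleon", "chameleon:hdmi",
            "pool:chameleon_hdmi_stable"}
detected = {"board:daisy"}  # the chameleon was not detected in this pass
to_remove, to_add = diff_labels(existing, detected, is_managed)
print(sorted(to_remove))  # ['chameleon', 'chameleon:hdmi']
print(sorted(to_add))     # []
```

A single failed detection is enough to put a label into `to_remove`, which is the behavior discussed in the rest of this thread.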


What can happen in the case of this chameleon label is that the host already has a chameleon label. Then in step #1, the chameleon is not detected (for whatever reason: chameleon host is down, DNS is broken, etc.), so it's marked for removal in step #2 and step #3 removes the chameleon label.

This can happen in the reset phase before the job starts: the test gets scheduled while the host has the chameleon label, the label gets removed during the reset job, and when the test starts it complains that the label is gone.
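
To illustrate that sequence, here is a small sketch of a dependency check evaluated at two points in time. `missing_dependencies` is a hypothetical helper, not the scheduler's real check; the label names are taken from the error message in the issue description.

```python
def missing_dependencies(job_dependencies, host_labels):
    """Return the job dependencies that the host does not currently carry."""
    return sorted(set(job_dependencies) - set(host_labels))


deps = ["board:daisy", "chameleon", "chameleon:hdmi",
        "pool:chameleon_hdmi_stable"]
host_labels = set(deps)

# At scheduling time the host satisfies every dependency.
at_schedule_time = missing_dependencies(deps, host_labels)

# The reset job fails to detect the chameleon and strips its labels.
host_labels -= {"chameleon", "chameleon:hdmi"}

# When the test starts (or on a manual re-run) the check now fails.
at_test_start = missing_dependencies(deps, host_labels)
print(at_schedule_time)  # []
print(at_test_start)     # ['chameleon', 'chameleon:hdmi']
```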

I believe the fix is to make the chameleon detection mechanism more robust. Currently it just detects whether the chameleon has been initialized on the host, but perhaps I should change it to just ping the chameleon host (just like what I did for servo).
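
One way to sketch a more forgiving check is to retry the probe a few times before declaring the chameleon absent. Everything here is an assumption for illustration: `detect_with_retries` and `ping_probe` are hypothetical names, the retry count is arbitrary, and the `ping -c 1 -W` flags assume the Linux iputils version of ping. Only the `<hostname>-chameleon` naming comes from this thread.

```python
import subprocess
import time


def detect_with_retries(probe, attempts=3, delay_seconds=0.0):
    """Return True as soon as probe() succeeds, False after all attempts fail."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False


def ping_probe(dut_hostname, timeout_seconds=2):
    """Ping <dut_hostname>-chameleon once; True if it answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_seconds),
         dut_hostname + "-chameleon"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Example with a stand-in probe that fails twice before succeeding.
flaky = iter([False, False, True])
print(detect_with_retries(lambda: next(flaky)))  # True
```

In real use the probe would be something like `lambda: ping_probe("chromeos1-row2-rack4-host2")`, so a single dropped ping no longer removes the label.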
Is it possible that the HDMI cable on chromeos1-row2-rack4-host2's chameleon got disconnected?


I went to the DUT and I see that the HDMI cable is connected.

The chameleon board is pingable:
$ ping chromeos1-row2-rack4-host2-chameleon
PING chromeos1-row2-rack4-host2-chameleon.cros.corp.google.com (172.27.214.41) 56(84) bytes of data.
64 bytes from 172.27.214.41: icmp_seq=1 ttl=62 time=1.14 ms
64 bytes from 172.27.214.41: icmp_seq=2 ttl=62 time=1.07 ms

display_ServerChameleonConnection fails with "FAIL: No port detected on the Chameleon board"

So the auto-labeling worked correctly. It looks like something is wrong with this particular chameleon. Even after a chameleon reboot the issue persists. I'll look into this.


Actually, after several reboots and unplug/plug cycles, the connection is now back.

So now a chameleon test will not be scheduled until another reset job is scheduled. If you are saying the reset job happens before a test is run, that means a test must already be scheduled. But that happens only once the label for the respective suite is present. This means neither of these two things will happen and the label will not be added, unless I run a Verify job manually.


Cable detection relies on the +5V power supply from the DUT. If the DUT is down (the +5V line is floating), Chameleon treats the cable as not physically connected.

A possible error case is that the DUT was down.
If the DUT is down, then I would expect none of the auto-labeling methods to run.

And if this is what happened, once the label is removed, how will a reset job be triggered to restore the labels needed for tests to proceed?

I think labels that trigger suites should not be part of auto-labeling, because once removed, there is no way to restore them automatically.
Hmm... interesting situation.  Which labels are involved that trigger suites?  

And the suites will fail if the connection isn't there anyways right? Might this be a good thing since suites won't run on hw that's expected to fail?
On second thought, that's probably what the display_ServerChameleonConnection test is for. I can update the label classes to skip updating so that the label will still be applied when creating the host but not during the verify phase (aka auto-updating).
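
A rough sketch of what "skip updating" could look like, assuming each label detector carries a per-label flag. `LabelDetector`, `update_on_verify`, and `detectors_for_phase` are hypothetical names for illustration and are not the real classes in cros_label.py.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LabelDetector:
    name: str
    detect: Callable[[], bool]
    # Hypothetical flag: False means "apply at host creation only".
    update_on_verify: bool = True


def detectors_for_phase(detectors, phase):
    """Select the detectors that take part in a phase ('create' or 'verify')."""
    if phase == "create":
        return list(detectors)
    return [d for d in detectors if d.update_on_verify]


detectors = [
    LabelDetector("servo", lambda: True),
    LabelDetector("chameleon:hdmi", lambda: True, update_on_verify=False),
]
print([d.name for d in detectors_for_phase(detectors, "create")])
# ['servo', 'chameleon:hdmi']
print([d.name for d in detectors_for_phase(detectors, "verify")])
# ['servo']
```

Labels left out of the verify phase would be neither added nor removed there, so a transient detection failure could not strip them.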
If you stop auto-updating labels (did you mean 'aka auto-labeling'?), what will be different from before?

What I am saying is that if a label is removed by auto-labeling, it should not need manual intervention in order to be auto-labeled again. Or... should it?

Is this the main purpose of your 'auto-labeling' change: if a label is removed, no tests related to this label should run until someone manually re-verifies the host (after checking the DUT)?

I can agree this will decrease the number of tests failing for infrastructure reasons, and thinking about it, this might be a good thing.


Kalin and I met and agreed to wait a bit to see how things play out (how often do expected labels get removed) to see what next steps we should take.

One idea Kalin had (which sounds like a good one) is to abort the test if, during the reset phase when labels get added/removed, one of the removed labels is a dependency for the test. Not sure how complicated that will be though.
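
The core of that check could be as small as the sketch below. The function names and the idea of comparing label snapshots taken before and after reset are assumptions for illustration; how this would hook into the scheduler is not covered here.

```python
def removed_dependencies(job_dependencies, labels_before, labels_after):
    """Return the job dependencies that the reset phase removed from the host."""
    removed = set(labels_before) - set(labels_after)
    return sorted(removed & set(job_dependencies))


def should_abort(job_dependencies, labels_before, labels_after):
    """True if reset removed at least one label the job depends on."""
    return bool(removed_dependencies(job_dependencies, labels_before, labels_after))


deps = ["board:daisy", "chameleon:hdmi"]
before = {"board:daisy", "chameleon", "chameleon:hdmi", "servo"}
after = {"board:daisy", "chameleon", "servo"}
print(removed_dependencies(deps, before, after))  # ['chameleon:hdmi']
print(should_abort(deps, before, after))          # True
```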

Either way, we'll keep this bug open until we're happy with auto-labeling.
... and it would be nice to have a missing-tests indicator on wmatrix, with a pointer to exactly which tests are missing.


Kevin stopped by, and we agreed to leave the change as is at this point, and monitor how things go for the next one or two weeks.

To be aware of:

- whenever a label is removed by auto-labeling, another job needs to be scheduled so that the reset stage verifies and auto-labels the host again.

- a manual re-run job for a test can NOT be scheduled until the label is back (added automatically or manually).

- chameleon connection labels (chameleon:hdmi, chameleon:dp, ...) for now serve the purpose of 'suite_dependency' labels. That means the suite will not be scheduled when they are removed. We can exclude them from auto-labeling and keep them only at the host creation level.

- it is not entirely clear whether the test job will proceed when a test is scheduled and a dependency label (for the suite or for the test) is removed after that.

- this mechanism can keep tests from running on failing infrastructure (servo or chameleon), which frees the lab from executing tests that are projected to fail anyway. This needs more investigation.


Hey Kalin,

Just wanted to check how this week of testing went, whether you noticed any differences in the number of tests failing, and whether there were any random disappearances of labels.

Comment 29 by ka...@chromium.org, May 13 2016

No issues on the test suites and pools of DUTs I observe. Things look good.


Status: Fixed (was: Assigned)
Closing out, as it seems the label detection has stabilized.

Comment 31 by ka...@chromium.org, May 25 2016

Status: Verified (was: Fixed)
Thanks Kevin. Yes, label detection is good.
