
Issue 762589


Issue metadata

Status: Verified
Owner:
Closed: Oct 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




afe_lock_machine does not work properly.

Project Member Reported by yunlian@chromium.org, Sep 6 2017

Issue description

src/third_party/toolchain-utils/afe_lock_machine.py --status --chromeos_root /ssd/clean/ --remote chromeos2-row9-rack9-host9.cros


does not show any result.

It seems that RpcClient::run in
src/third_party/autotest/files/server/frontend.py
does not return correctly. The call in this try block throws an exception
and keeps being retried, so the script gets stuck here:

        try:
            result = utils.strip_unicode(rpc_call(**dargs))
            if self.reply_debug:
                print result
            return result
        except Exception:
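
The hang described above is what an unbounded retry-on-any-exception loop looks like from the outside. A minimal sketch of a bounded retry that keeps the last traceback, so the underlying RPC error becomes visible, might look like this. The names `call_with_retries`, `max_attempts`, and `delay_secs` are hypothetical and are not part of autotest's frontend.py.

```python
import time
import traceback


def call_with_retries(rpc_call, max_attempts=3, delay_secs=0.0, **dargs):
    """Call rpc_call(**dargs), retrying at most max_attempts times.

    Hypothetical helper: instead of retrying forever, it raises a
    RuntimeError that carries the traceback of the last failure.
    """
    last_trace = None
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc_call(**dargs)
        except Exception:
            # Remember why this attempt failed so it can be reported.
            last_trace = traceback.format_exc()
            if attempt < max_attempts:
                time.sleep(delay_secs)
    raise RuntimeError(
        'RPC failed after %d attempts:\n%s' % (max_attempts, last_trace))
```

With a bound in place, a persistent failure surfaces as an error with a stack trace rather than a silent hang.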



 
Owner: yunlian@chromium.org
Status: Assigned (was: Untriaged)
I looked at the failing code: https://cs.corp.google.com/chromeos_public/src/third_party/toolchain-utils/afe_lock_machine.py?rcl=5a51638398691dee889202e043a42a30a6aef585&l=384

Running AFE RPCs directly through script is not supported (you can do it, but we won't promise it won't break / won't spend time debugging with you).

I've seen this code before and will reiterate what I said then: Please use

$ atest host mod -l -r 'reason' hostname

The current failure looks like the RPC failed. Dunno why. The only change we've made recently is that if you make this RPC on a shard, it gets forwarded to the master. This should always have been the case, but it wasn't working correctly (that was a bug).

Are you calling this against a shard?
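
For a script that wants to follow the `atest host mod -l -r 'reason' hostname` suggestion above rather than calling the RPC directly, a rough sketch of a wrapper follows. The function names are hypothetical, and it assumes `atest` is on PATH and usable from the calling machine; only the flags quoted in the comment above are used.

```python
import subprocess


def build_atest_lock_command(hostname, reason):
    """Return the argv for `atest host mod -l -r <reason> <hostname>`."""
    if not hostname or not reason:
        raise ValueError('hostname and reason are both required')
    return ['atest', 'host', 'mod', '-l', '-r', reason, hostname]


def lock_host(hostname, reason):
    """Run the lock command (assumes `atest` is installed and on PATH).

    Returns the CompletedProcess so the caller can inspect returncode,
    stdout, and stderr instead of guessing why a lock failed.
    """
    return subprocess.run(build_atest_lock_command(hostname, reason),
                          capture_output=True, text=True, check=False)
```

Passing the arguments as a list avoids shell quoting problems when the reason string contains spaces.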

Just FYI...it looks like our script was working fine up through Aug. 31, and started failing on Sept. 1.
Do you have a stack trace for the RPC (i.e., retry eventually fails, but do you know what the RPC is failing with?)
It seems this error only happens when I use the hostname (instead of the IP address) of the machines in the lab.
The IP of chromeos2-row9-rack9-host17.cros is 100.115.232.97

./afe_lock_machine.py --add --chromeos_root /usr/local/google/crostc/chromeos --remote chromeos2-row9-rack9-host17.cros (failed)
./afe_lock_machine.py --add --chromeos_root /usr/local/google/crostc/chromeos --remote 100.115.232.97 (successful)
./afe_lock_machine.py --add --chromeos_root /usr/local/google/crostc/chromeos --remote yunlian.svl (successful)
That is because when you use the name, it tries to go through the HW lab AFE server; when you use the IP address, it uses our local AFE server, which (by the way) is probably running older code, since I'm not sure when the chroot on chrotomation2 was last 'repo sync'd. It might be that if we do a repo sync on chrotomation2, the local AFE server will stop working as well...
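
The routing described above can be illustrated with a small sketch. This is not the actual logic in afe_lock_machine.py: the hostname pattern is an assumption based only on the lab names quoted in this thread, and the 'lab' / 'local' labels and function names are hypothetical.

```python
import ipaddress
import re

# Assumed pattern, inferred only from the lab hostnames quoted in this thread.
LAB_HOSTNAME_RE = re.compile(r'^chromeos\d+-row\d+-rack\d+-host\d+(\.cros)?$')


def is_ip_address(remote):
    """Return True if remote parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(remote)
        return True
    except ValueError:
        return False


def choose_afe_server(remote):
    """Return 'lab' or 'local' (hypothetical labels) for a --remote value."""
    if not is_ip_address(remote) and LAB_HOSTNAME_RE.match(remote):
        return 'lab'
    return 'local'
```

Under these assumptions, the three commands listed earlier map to 'lab', 'local', and 'local' respectively, which lines up with the one failure and two successes reported.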
Project Member

Comment 6 by bugdroid1@chromium.org, Sep 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/toolchain-utils/+/4260aae2fe0abb937543c6cafc44f68e99de7410

commit 4260aae2fe0abb937543c6cafc44f68e99de7410
Author: Yunlian Jiang <yunlian@google.com>
Date: Wed Sep 06 23:18:18 2017

USE local lock for buildbot_test_toolchain.

Currently the AFE lock machanism for our nightly test is broken,
we use local lock as a workaround for now.

BUG= chromium:762589 
TEST=Generate the crosperf command in another simple python file
     with the same code. Run the generated command line on
     crotomation2 and it goes though the locking machine stage.

Change-Id: Icd3132bc383b63aab6d6f5237a04348f11d8726d
Reviewed-on: https://chromium-review.googlesource.com/653213
Commit-Ready: Yunlian Jiang <yunlian@chromium.org>
Tested-by: Yunlian Jiang <yunlian@chromium.org>
Reviewed-by: Caroline Tice <cmtice@chromium.org>

[modify] https://crrev.com/4260aae2fe0abb937543c6cafc44f68e99de7410/buildbot_test_toolchains.py

Note that there is a weak correlation between the timing of this breakage and a pesky incorrect-DUT-locking problem disappearing from the lab: issue 732999.

If this script is ever resurrected, we should keep an eye out for that bug reappearing. (That bug is significant enough that, if it were caused by this script, we'd request suspension of the script.)
Owner: cmt...@chromium.org
Further investigation reveals:

'atest' only works on machines which are not 'on' BeyondCorp.  I can get it to work on my workstation (disabling BeyondCorp), but I have not been able to get it to work on chrotomation2.svl, neither from my own account nor from the role account (mobiletc-prebuild).  So 'atest' is not a viable solution for us.

I could try to fix the RPC issues, but the RPC interface is subject to change without notice and we have been told we will get no help from the chromeos-infra team if we try to go that route.

Eventually, something called "SkyLab" will come along and "replace all these corp RPCs with oauth-based appengine ones (and will provide replacement tools for atest and similar command line tools)."

So at this point, my recommendation is to keep using the not-perfect-but-it-works file locks mechanism for now, and wait for SkyLab.
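
For readers unfamiliar with the approach, a file-lock mechanism of the kind mentioned above can be sketched as follows. This is an illustration only and not the toolchain-utils implementation; the function names, the `.lock` suffix, and the lock directory are hypothetical. It relies on `O_CREAT | O_EXCL` making file creation atomic on a local filesystem.

```python
import os


def acquire_file_lock(lock_dir, machine):
    """Try to lock `machine` by atomically creating a lock file.

    Returns True if the lock was acquired, False if it is already held.
    """
    path = os.path.join(lock_dir, machine + '.lock')
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, 'w') as lock_file:
        # Record the owner's PID to help with debugging stale locks.
        lock_file.write(str(os.getpid()))
    return True


def release_file_lock(lock_dir, machine):
    """Release the lock; a missing lock file is not treated as an error."""
    path = os.path.join(lock_dir, machine + '.lock')
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

A lock like this is only visible to processes that share the lock directory, which is one reason it is "not perfect" compared with locking through the AFE.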
Status: WontFix (was: Assigned)
The beyondcorp conclusions are accurate: b/32303896

I think it's best to punt on resurrecting this until SkyLab. Note that, at that time, it's best to work with chromeos-infra to support your use case fully (i.e., SkyLab or not, we are not likely to support, or maybe even allow, automated administrative jobs that modify DUT inventory; locking a DUT counts as such a modification).
chromeos-infra should be able to support your use case easily via pools / some other mechanism.
Status: Verified (was: WontFix)
Now that chrotomation2 is moved to MPC, afe_lock_machine is working.
Project Member

Comment 12 by bugdroid1@chromium.org, Aug 21

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/toolchain-utils/+/b1afe3f2c2d4219ce490ffa111f530983b171141

commit b1afe3f2c2d4219ce490ffa111f530983b171141
Author: Ting-Yuan Huang <laszio@chromium.org>
Date: Tue Aug 21 23:48:17 2018

Revert "USE local lock for buildbot_test_toolchain."

This reverts commit 4260aae2fe0abb937543c6cafc44f68e99de7410.

Reason for revert: afe_lock_machine is fixed.

Original change's description:
> USE local lock for buildbot_test_toolchain.
> 
> Currently the AFE lock machanism for our nightly test is broken,
> we use local lock as a workaround for now.
> 
> BUG= chromium:762589 
> TEST=Generate the crosperf command in another simple python file
>      with the same code. Run the generated command line on
>      crotomation2 and it goes though the locking machine stage.
> 
> Change-Id: Icd3132bc383b63aab6d6f5237a04348f11d8726d
> Reviewed-on: https://chromium-review.googlesource.com/653213
> Commit-Ready: Yunlian Jiang <yunlian@chromium.org>
> Tested-by: Yunlian Jiang <yunlian@chromium.org>
> Reviewed-by: Caroline Tice <cmtice@chromium.org>

Bug:  chromium:762589 
Change-Id: Ib9f787ff48953d384dd36c72811ebd0f20dd25db
Reviewed-on: https://chromium-review.googlesource.com/1178897
Tested-by: Ting-Yuan Huang <laszio@chromium.org>
Reviewed-by: Luis Lozano <llozano@chromium.org>
Reviewed-by: Caroline Tice <cmtice@chromium.org>
Commit-Queue: Ting-Yuan Huang <laszio@chromium.org>

[modify] https://crrev.com/b1afe3f2c2d4219ce490ffa111f530983b171141/buildbot_test_toolchains.py
