Issue 671709

Issue metadata

Status: Archived
Owner:
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Repair can't diagnose an offline Moblab instance

Reported by jrbarnette@chromium.org, Dec 6 2016

Issue description

If a moblab instance fails by going offline, Repair tasks
don't provide proper diagnosis in status.log.

Here's what status.log looks like:
START	----	repair	timestamp=1480789105	localtime=Dec 03 10:18:25	
END FAIL	----	repair	timestamp=1480789234	localtime=Dec 03 10:20:34	

If you dive into the debug logs, you can find this traceback:
Traceback (most recent call last):
  File "/usr/local/autotest/server/control_segments/repair", line 20, in repair
    try_servo_repair=True)
  File "/usr/local/autotest/server/hosts/factory.py", line 252, in create_target_machine
    return create_host(machine, **kwargs)
  File "/usr/local/autotest/server/hosts/factory.py", line 188, in create_host
    host_instance = host_class(hostname, **args)
  File "/usr/local/autotest/server/hosts/base_classes.py", line 56, in __init__
    super(Host, self).__init__(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/hosts/base_classes.py", line 69, in __init__
    self._initialize(*args, **dargs)
  File "/usr/local/autotest/server/hosts/moblab_host.py", line 85, in _initialize
    self.run('rm -rf %s/*' % MOBLAB_IMAGE_STORAGE)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 187, in run
    options, stdin, args, ignore_timeout)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 148, in _run
    raise error.AutoservSSHTimeout("ssh timed out", result)

The failure happens in the MoblabHost initialization code:
    def _initialize(self, *args, **dargs):
        # [ ... ]
        # Clear the Moblab Image Storage so that staging an image is properly
        # tested.
        if dargs.get('retain_image_storage') is not True:
            self.run('rm -rf %s/*' % MOBLAB_IMAGE_STORAGE)
        # [ ... ]

When the moblab is down, the call to self.run() above fails, meaning
creating the MoblabHost fails, meaning repair can't run (because the
repair operation requires a Host object).

As a rule, Host _initialize() methods aren't allowed to fail, and so
aren't allowed to call self.run().
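
To illustrate the rule, here is a minimal sketch of a host whose constructor does no remote work. It is not the autotest implementation: SketchHost, SSHTimeout, offline_runner, clear_image_storage() and this repair() are hypothetical names invented for the example, and the injected runner callable stands in for the real ssh machinery. The point is only that construction succeeds while the device is offline, so repair gets a Host object and can report a diagnosis.

```python
class SSHTimeout(Exception):
    """Hypothetical stand-in for error.AutoservSSHTimeout."""


class SketchHost(object):
    """Hypothetical host whose constructor never touches the network."""

    def __init__(self, hostname, runner):
        self.hostname = hostname
        # Assumption: runner is a callable that executes a command remotely.
        self._runner = runner
        self._initialize()

    def _initialize(self):
        # Local bookkeeping only. No self.run() here, so creating the
        # host cannot fail just because the device is unreachable.
        self.image_storage_cleared = False

    def run(self, command):
        return self._runner(command)

    def clear_image_storage(self, storage_dir):
        # The remote cleanup lives in an explicit method that callers
        # invoke only when they need it (for example, from a test).
        self.run('rm -rf %s/*' % storage_dir)
        self.image_storage_cleared = True


def offline_runner(command):
    # Simulates a device that is down: every remote command times out.
    raise SSHTimeout('ssh timed out')


def repair(host):
    # With a Host object in hand, repair can record why it failed.
    try:
        host.run('true')
    except SSHTimeout as e:
        return 'FAIL: host %s is offline (%s)' % (host.hostname, e)
    return 'GOOD'
```

With this shape, SketchHost('moblab1', offline_runner) constructs without error, and repair() returns a failure string naming the offline host instead of dying in the constructor.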

 
I don't know the proper fix to this, aside from "the call to
self.run() needs to move somewhere else".  In particular, I don't
know where "somewhere else" should be.  I also don't know how to
prove whether the call is still needed, or if "somewhere else"
could be the bit bucket.

As things currently stand, this is a low-priority bug, for two reasons.
First, we don't have a reliable repair procedure for a Moblab that's
offline, so the impact is poor diagnosis rather than failure to repair.
Second, Moblab instances historically don't fail by going offline very
often.

For reference, this problem was discovered as part of bug 671279. In
that case, it was (reasonably) clear that the Moblab instances were offline,
and no amount of repair would have fixed the problem, because the root cause
was a failed network switch.

Cc: akes...@chromium.org
Labels: Hotlist-Fixit

Comment 4 by sbasi@chromium.org, Feb 3 2017

Owner: sbasi@chromium.org
Status: Assigned (was: Available)

Comment 5 by bugdroid1@chromium.org, Feb 15 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b79379f47294920f8d44227e93a704e12724f05b

commit b79379f47294920f8d44227e93a704e12724f05b
Author: Simran Basi <sbasi@google.com>
Date: Wed Feb 15 21:33:10 2017

[autotest] Move MobLab image storage cleanup and ensure moblab_RunSuite runs a test.

Fixing two problems with this CL.

1) Migrates the cleanup of the moblab image storage folder
   from moblab_host to the moblab_RunSuite test.

2) Updates the moblab_RunSuite test and run_suite to ensure
   atleast a single test is ran to prevent breakages that
   create 0 tests.

BUG=chromium:671709,chromium:642905
TEST=None

Change-Id: Ief1567c0c1a692d4ca10735b30f49cc43776e9ed
Reviewed-on: https://chromium-review.googlesource.com/437750
Commit-Ready: Simran Basi <sbasi@chromium.org>
Tested-by: Simran Basi <sbasi@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/b79379f47294920f8d44227e93a704e12724f05b/site_utils/run_suite.py
[modify] https://crrev.com/b79379f47294920f8d44227e93a704e12724f05b/server/site_tests/moblab_RunSuite/moblab_RunSuite.py
[modify] https://crrev.com/b79379f47294920f8d44227e93a704e12724f05b/server/hosts/moblab_host.py

Comment 6 by bugdroid1@chromium.org, Feb 17 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ba90ec86aa31ac93f2eeba1f5407f0426ced508d

commit ba90ec86aa31ac93f2eeba1f5407f0426ced508d
Author: Simran Basi <sbasi@chromium.org>
Date: Fri Feb 17 06:59:08 2017

Revert "[autotest] Move MobLab image storage cleanup and ensure moblab_RunSuite runs a test."

This reverts commit b79379f47294920f8d44227e93a704e12724f05b.

Reason for revert: <INSERT REASONING HERE>

Original change's description:
> [autotest] Move MobLab image storage cleanup and ensure moblab_RunSuite runs a test.
> 
> Fixing two problems with this CL.
> 
> 1) Migrates the cleanup of the moblab image storage folder
>    from moblab_host to the moblab_RunSuite test.
> 
> 2) Updates the moblab_RunSuite test and run_suite to ensure
>    atleast a single test is ran to prevent breakages that
>    create 0 tests.
> 
> BUG=chromium:671709,chromium:642905
> TEST=None
> 
> Change-Id: Ief1567c0c1a692d4ca10735b30f49cc43776e9ed
> Reviewed-on: https://chromium-review.googlesource.com/437750
> Commit-Ready: Simran Basi <sbasi@chromium.org>
> Tested-by: Simran Basi <sbasi@chromium.org>
> Reviewed-by: Dan Shi <dshi@google.com>
> 

TBR=sbasi@chromium.org,sbasi@google.com,dshi@google.com,dshi@chromium.org,akeshet@chromium.org
# Not skipping CQ checks because original CL landed > 1 day ago.
BUG=chromium:671709,chromium:642905

Change-Id: Ie2f7b78fe5fd39b004436b367ab62050ae76dc1b
Reviewed-on: https://chromium-review.googlesource.com/444310
Reviewed-by: Simran Basi <sbasi@chromium.org>
Tested-by: Simran Basi <sbasi@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/ba90ec86aa31ac93f2eeba1f5407f0426ced508d/site_utils/run_suite.py
[modify] https://crrev.com/ba90ec86aa31ac93f2eeba1f5407f0426ced508d/server/site_tests/moblab_RunSuite/moblab_RunSuite.py
[modify] https://crrev.com/ba90ec86aa31ac93f2eeba1f5407f0426ced508d/server/hosts/moblab_host.py

Comment 7 by sbasi@chromium.org, Jul 18 2017

Status: Archived (was: Assigned)