Issue 756671

Issue metadata

Status: Fixed
Owner:
Closed: Aug 2017
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



create a monarch metric (and eventually, an alert) for "all devservers in subnet X are down"

Project Member Reported by akes...@chromium.org, Aug 17 2017

Issue description

When all devservers in a subnet are down, we drop jobs. Let's add a monarch counter about that event, and add alerts around it.
 
Relevant autotest code in dev_server.py:

    @classmethod
    def resolve(cls, build, hostname=None, ban_list=None):
        """Resolves a build to a devserver instance.

        @param build: The build (e.g. x86-mario-release/R18-1586.0.0-a1-b1514).
        @param hostname: The hostname of dut that requests a devserver. It's
                         used to make sure a devserver in the same subnet is
                         preferred.
        @param ban_list: The blacklist of devservers that shouldn't be chosen.

        @raise DevServerException: If no devserver is available.
        """
        tried_devservers = set()
        devservers, can_retry = cls.get_available_devservers(hostname)
        if devservers:
            tried_devservers |= set(devservers)

        devserver = cls.get_healthy_devserver(build, devservers,
                                              ban_list=ban_list)

        if not devserver and can_retry:
            # Find available devservers without the dut location constraint.
            devservers, _ = cls.get_available_devservers()
            devserver = cls.get_healthy_devserver(build, devservers,
                                                  ban_list=ban_list)
            if devservers:
                tried_devservers |= set(devservers)
        if devserver:
            return devserver
        else:
            error_msg = ('All devservers are currently down: %s. '
                         'dut hostname: %s' %
                         (tried_devservers, hostname))
            logging.error(error_msg)
            raise DevServerException(error_msg)
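
As a rough illustration of the request, the failure path above could increment a counter keyed by the DUT's subnet just before raising. The sketch below is not the change that landed. `Counter` is a hypothetical in-memory stand-in for a real monitoring client, the metric name is made up, the /24 prefix length is an assumption, and the simplified `resolve` leaves out the retry without the location constraint.

```python
import collections
import ipaddress


class DevServerException(Exception):
    """Raised when no devserver can be resolved."""


class Counter(object):
    """Hypothetical in-memory stand-in for a monitoring counter."""

    def __init__(self, name):
        self.name = name
        self.values = collections.Counter()

    def increment(self, fields):
        # Key each count by its sorted field set, e.g. (('subnet', ...),).
        self.values[tuple(sorted(fields.items()))] += 1


# Hypothetical metric name, for illustration only.
ALL_DOWN_COUNTER = Counter('example/devserver/all_down_in_subnet')


def subnet_of(ip, prefix_len=24):
    """Returns the CIDR subnet containing ip (prefix length is assumed)."""
    network = ipaddress.ip_network(u'%s/%d' % (ip, prefix_len), strict=False)
    return str(network)


def resolve(dut_ip, devservers_by_subnet, is_healthy):
    """Simplified resolve: picks a healthy devserver in the DUT's subnet.

    Records one counter event when every devserver in that subnet is
    down, then raises, mirroring the error path in the excerpt above.
    """
    subnet = subnet_of(dut_ip)
    for devserver in devservers_by_subnet.get(subnet, []):
        if is_healthy(devserver):
            return devserver
    ALL_DOWN_COUNTER.increment(fields={'subnet': subnet})
    raise DevServerException(
            'All devservers in subnet %s are down.' % subnet)
```

Keying the counter by subnet lets an alert fire per subnet, rather than on one global "no devserver" count.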

Owner: xixuan@chromium.org
Labels: -Chase-Pending Chase
Project Member

Comment 4 by bugdroid1@chromium.org, Aug 24 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0f714f84a761288d5559d541bf84b87bab943a83

commit 0f714f84a761288d5559d541bf84b87bab943a83
Author: Xixuan Wu <xixuan@chromium.org>
Date: Thu Aug 24 06:13:38 2017

autotest: add metric to record 'all devservers in subnet X are down'.

BUG= chromium:756671 
TEST=Ran unittest. Ran resolve locally to check subnet.

Change-Id: I5c6d480aaba8c2543bade5c61eb3e34b21019143
Reviewed-on: https://chromium-review.googlesource.com/630096
Commit-Ready: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/0f714f84a761288d5559d541bf84b87bab943a83/client/common_lib/cros/dev_server.py
