create a monarch metric (and eventually, an alert) for "all devservers in subnet X are down"
Issue description
When all devservers in a subnet are down, we drop jobs. Let's add a monarch counter for that event, and add alerts around it.
Aug 21 2017
Aug 24 2017
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0f714f84a761288d5559d541bf84b87bab943a83

commit 0f714f84a761288d5559d541bf84b87bab943a83
Author: Xixuan Wu <xixuan@chromium.org>
Date: Thu Aug 24 06:13:38 2017

autotest: add metric to record 'all devservers in subnet X are down'.

BUG= chromium:756671
TEST=Ran unittest. Ran resolve locally to check subnet.

Change-Id: I5c6d480aaba8c2543bade5c61eb3e34b21019143
Reviewed-on: https://chromium-review.googlesource.com/630096
Commit-Ready: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/0f714f84a761288d5559d541bf84b87bab943a83/client/common_lib/cros/dev_server.py
Aug 28 2017
dashboard & alert:
https://critique.corp.google.com/#review/166547879
https://critique.corp.google.com/#review/166540613
Comment 1 by akes...@chromium.org, Aug 18 2017
Relevant autotest code in dev_server.py:

    @classmethod
    def resolve(cls, build, hostname=None, ban_list=None):
        """"Resolves a build to a devserver instance.

        @param build: The build (e.g. x86-mario-release/R18-1586.0.0-a1-b1514).
        @param hostname: The hostname of dut that requests a devserver. It's
                         used to make sure a devserver in the same subnet is
                         preferred.
        @param ban_list: The blacklist of devservers shouldn't be chosen.

        @raise DevServerException: If no devserver is available.
        """
        tried_devservers = set()
        devservers, can_retry = cls.get_available_devservers(hostname)
        if devservers:
            tried_devservers |= set(devservers)

        devserver = cls.get_healthy_devserver(build, devservers,
                                              ban_list=ban_list)

        if not devserver and can_retry:
            # Find available devservers without dut location constrain.
            devservers, _ = cls.get_available_devservers()
            devserver = cls.get_healthy_devserver(build, devservers,
                                                  ban_list=ban_list)
            if devservers:
                tried_devservers |= set(devservers)

        if devserver:
            return devserver
        else:
            error_msg = ('All devservers are currently down: %s. '
                         'dut hostname: %s' %
                         (tried_devservers, hostname))
            logging.error(error_msg)
            raise DevServerException(error_msg)
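For illustration, here is a minimal standalone sketch of where such a counter could be bumped in a resolve-style flow: the subnet-restricted lookup finds candidates but none is healthy, so we record the event before falling back to devservers elsewhere. This is not the change that landed in dev_server.py. `FakeCounter`, the metric name, `find_subnet`, and the simplified `resolve` signature are all hypothetical stand-ins, and the real Monarch client API is not shown.

```python
import collections
import ipaddress


class FakeCounter:
    """Hypothetical stand-in for a monitoring counter (not the Monarch API)."""

    def __init__(self, name):
        self.name = name
        self.values = collections.Counter()

    def increment(self, fields=None):
        # Key the count by its field values, e.g. the subnet.
        key = tuple(sorted((fields or {}).items()))
        self.values[key] += 1


# Hypothetical metric name, chosen only for this sketch.
SUBNET_DOWN = FakeCounter('hypothetical/devserver/all_down_in_subnet')


def find_subnet(ip, subnets):
    """Return the first CIDR string in subnets that contains ip, else None."""
    addr = ipaddress.ip_address(ip)
    for cidr in subnets:
        if addr in ipaddress.ip_network(cidr):
            return cidr
    return None


def resolve(dut_ip, devservers, subnets, is_healthy, counter=SUBNET_DOWN):
    """Pick a healthy devserver, preferring ones in the DUT's subnet.

    devservers maps a devserver name to its IP; is_healthy is a callable
    taking a devserver name and returning True or False.
    """
    subnet = find_subnet(dut_ip, subnets)
    same_subnet = [name for name, ip in sorted(devservers.items())
                   if subnet is not None
                   and find_subnet(ip, subnets) == subnet]
    for name in same_subnet:
        if is_healthy(name):
            return name

    # The subnet has devservers, but every one of them is down: record it.
    if same_subnet:
        counter.increment(fields={'subnet': subnet})

    # Retry without the subnet restriction.
    for name in sorted(devservers):
        if name not in same_subnet and is_healthy(name):
            return name
    raise RuntimeError('All devservers are currently down. dut ip: %s'
                       % dut_ip)
```

An alert could then fire on increases of this counter per `subnet` field. Whether to also count subnets that have no devservers configured at all is a separate design choice that this sketch leaves out.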