Starred by 1 user
Status: Verified
Owner:
Closed: Jun 20
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----



devserver load may contribute to some provision failures
Project Member Reported by xixuan@chromium.org, Feb 27 2017
In recent failed build: https://uberchromegw.corp.google.com/i/chromeos/builders/stumpy-paladin/builds/27560

There are 4 identical errors from one devserver, 100.115.219.133 (chromeos4-devserver5), to 4 different hosts:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103777573-chromeos-test/chromeos4-row2-rack9-host7/debug
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103777579-chromeos-test/chromeos4-row2-rack9-host5/debug
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103777585-chromeos-test/chromeos4-row2-rack9-host4/debug
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103777588-chromeos-test/chromeos4-row2-rack8-host16/debug

The ssh errors all happened around 02/27 07:16~07:17, when this devserver was experiencing a huge increase in process count: http://shortn/_MQa1pIkXhU.

After this point, the 4 DUTs were provisioned successfully with the same devserver, once its process count had dropped back to normal.

I tend to believe this is a "devserver-load-caused" provision failure, and I plan to work on a CL that resolves another devserver in this case.

Also, in Issue 695529, I see a case where the devserver cannot ping a DUT, but we still try that devserver twice and finally report a failure. This could also be solved by resolving another devserver, so I will work on that as well.
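
The idea of resolving another devserver after a devserver-side failure could be sketched roughly as below. This is only an illustration, not the code from the CL: `ProvisionError`, its `retryable` flag, `provision_with_fallback`, and the `provision` callable are all hypothetical names invented for this example.

```python
class ProvisionError(Exception):
    """Hypothetical error raised when provisioning through a devserver fails."""

    def __init__(self, message, retryable=False):
        super().__init__(message)
        # True for failures blamed on the devserver (load, cannot reach DUT).
        self.retryable = retryable


def provision_with_fallback(devservers, dut, provision, max_attempts=2):
    """Provision `dut`, moving to the next devserver after a retryable failure.

    `provision(devserver, dut)` is a caller-supplied callable (hypothetical).
    Returns the devserver that succeeded.
    """
    last_error = None
    for devserver in devservers[:max_attempts]:
        try:
            provision(devserver, dut)
            return devserver
        except ProvisionError as error:
            if not error.retryable:
                raise  # DUT-side failure: another devserver will not help.
            last_error = error
    raise ProvisionError('all devservers failed for %s: %s' % (dut, last_error))
```

The point of the sketch is that each attempt goes to a different devserver, instead of retrying the same one twice.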


 
Comment 1 by xixuan@chromium.org, Feb 27 2017
Summary: devserver load may contribute to some provision failures (was: devserver load may contributes to some provision failures)
Components: -Infra Infra>Platform
Components: -Infra>Platform Infra>Client>ChromeOS
> another pingable case:
>
> https://luci-milo.appspot.com/buildbot/chromeos/elm-paladin/1982
> https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103896160-chromeos-test/chromeos2-row7-rack7-host1/debug

The DUT was actually offline.  The sequence of events was this:
  * provision_Autoupdate.double ran and failed because the DUT
    went offline during testing.
  * The subsequent provision task failed because the DUT was
    still offline.
  * The subsequent repair passed after using servo to reset
    the DUT.

Here's the history from dut-status:
chromeos2-row7-rack7-host1
    2017-02-27 23:59:39  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row7-rack7-host1/648034-repair/
    2017-02-27 23:54:47  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row7-rack7-host1/648025-provision/
    2017-02-27 22:13:32  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/103864948-chromeos-test/
    2017-02-27 21:59:46  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row7-rack7-host1/647835-provision/

Project Member Comment 6 by bugdroid1@chromium.org, Mar 1 2017
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/dev-util/+/447ad9d610433e6d34e4db04057357ef9f73a0f2

commit 447ad9d610433e6d34e4db04057357ef9f73a0f2
Author: xixuan <xixuan@chromium.org>
Date: Wed Mar 01 04:21:07 2017

devserver: add ongoing au process number check in check_health.

This CL mainly does:
1. Add checking the number of current background au processes in
check_healthy devserver call.
2. Force kill_au_proc to kill the au process if the process's pid is
passed.

BUG= chromium:696606 
TEST=Run local devserver and call check_health & kill_au_proc.

Change-Id: I4fe44407b85659bd3aab309ac8efe11b0b457f68
Reviewed-on: https://chromium-review.googlesource.com/447821
Reviewed-by: Aviv Keshet <akeshet@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/447ad9d610433e6d34e4db04057357ef9f73a0f2/cros_update_progress.py
[modify] https://crrev.com/447ad9d610433e6d34e4db04057357ef9f73a0f2/devserver.py
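
As a rough illustration of item 1 in the commit message above, a health report that includes the number of background AU processes might look like the sketch below. It is not the devserver's actual code: the marker string, the report keys, and both function names are assumptions made for this example, and the process list is passed in as plain command-line strings rather than read from the system.

```python
import json

# Hypothetical marker identifying an auto-update worker in a command line.
AU_PROCESS_MARKER = 'cros_update'


def count_au_processes(command_lines, marker=AU_PROCESS_MARKER):
    """Count command lines that look like background AU processes."""
    return sum(1 for command in command_lines if marker in command)


def build_health_report(command_lines, free_disk_gb):
    """Return a JSON health report that includes the AU process count."""
    report = {
        'free_disk_gb': free_disk_gb,  # hypothetical key
        'au_process_count': count_au_processes(command_lines),  # hypothetical key
    }
    return json.dumps(report, sort_keys=True)
```

A caller could then compare `au_process_count` against a threshold before picking this devserver for a new provision.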

Project Member Comment 7 by bugdroid1@chromium.org, Mar 1 2017
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/016d95b92b13c252d6fb75f9075d8763ba913cd7

commit 016d95b92b13c252d6fb75f9075d8763ba913cd7
Author: xixuan <xixuan@chromium.org>
Date: Wed Mar 01 17:21:55 2017

autotest: resolve devserver if privision fails due to network/load issue.

This CL does:
1. Resolve a new devserver if this devserver fails to ping the DUT,
but double check the DUT's connectivity before resolving.
2. Resolve a new devserver if this devserver is overloaded in the
middle.
3. Change kill_au_proc logic to make sure the background au process is
killed.

BUG= chromium:696606 
TEST=locally run afe & use lab's devserver to test resolving.

Change-Id: I358fd1f49fb8f1ec0cf5865641717da6f1f6b07f
Reviewed-on: https://chromium-review.googlesource.com/447878
Reviewed-by: Dan Shi <dshi@google.com>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/016d95b92b13c252d6fb75f9075d8763ba913cd7/server/hosts/cros_host.py
[modify] https://crrev.com/016d95b92b13c252d6fb75f9075d8763ba913cd7/client/common_lib/cros/dev_server.py
[modify] https://crrev.com/016d95b92b13c252d6fb75f9075d8763ba913cd7/client/common_lib/cros/dev_server_unittest.py
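
The decision described in items 1 and 2 of the commit message above can be summarized in a small sketch. This is an interpretation of the commit message, not the logic in `dev_server.py` or `cros_host.py`: the function name, its parameters, the action strings, and the threshold are all hypothetical.

```python
def choose_action(devserver_can_ping_dut, dut_reachable_elsewhere,
                  au_process_count, max_au_processes):
    """Decide what to do after a provision attempt fails (hypothetical sketch).

    Returns one of 'resolve_new_devserver', 'fail_dut_offline',
    or 'keep_devserver'.
    """
    if not devserver_can_ping_dut:
        # Double check the DUT's connectivity before blaming the devserver.
        if dut_reachable_elsewhere:
            return 'resolve_new_devserver'
        return 'fail_dut_offline'
    if au_process_count > max_au_processes:
        # The devserver became overloaded in the middle of provisioning.
        return 'resolve_new_devserver'
    return 'keep_devserver'
```

The connectivity double check matters because a DUT that is genuinely offline will fail with any devserver, so resolving a new one would only waste time.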

Project Member Comment 9 by bugdroid1@chromium.org, Mar 18 2017
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/6f0cb6532b6b2a21732193924db031c992e045fc

commit 6f0cb6532b6b2a21732193924db031c992e045fc
Author: xixuan <xixuan@chromium.org>
Date: Sat Mar 18 21:24:51 2017

autotest: monitor the AU process number on devserver.

This CL adds monitoring AU process count on devservers. We add it since
recently it's found that more and more provision failure (may) come from
devserver's load issue.

BUG= chromium:696606 
TEST=Run unittest.

Change-Id: I5c0c1f571403b283ff0c07e283436850a7be9ff7
Reviewed-on: https://chromium-review.googlesource.com/450972
Commit-Ready: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/6f0cb6532b6b2a21732193924db031c992e045fc/client/common_lib/cros/dev_server.py

Comment 10 Deleted
Cc: dgarr...@chromium.org
Status: Fixed
+don: the AU process count is already monitored, but we don't use it when resolving a devserver.

I'm marking this as fixed for now, since you may want to file a more systematic bug for your project. Feel free to reopen it.
Status: Verified
Closing. Please reopen it if it's not fixed. Thanks!