devserver load may contribute to some provision failures
Issue description

In a recent failed build:
https://uberchromegw.corp.google.com/i/chromeos/builders/stumpy-paladin/builds/27560

There are 4 identical errors from one devserver, 100.115.219.133 (chromeos4-devserver5), to 4 different hosts:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103777573-chromeos-test/chromeos4-row2-rack9-host7/debug
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103777579-chromeos-test/chromeos4-row2-rack9-host5/debug
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103777585-chromeos-test/chromeos4-row2-rack9-host4/debug
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103777588-chromeos-test/chromeos4-row2-rack8-host16/debug

The ssh errors all happened around 02/27 07:16~07:17, when this devserver was experiencing a huge increase in process count: http://shortn/_MQa1pIkXhU. After that point, the 4 DUTs each went through a successful provision with the same devserver, once the process count had dropped back to normal.

I tend to believe this is a devserver-load-caused provision failure, and I plan to work on a CL that resolves another devserver in this case.

Also, in Issue 695529, I see a case where the devserver cannot ping a DUT, but we still try that devserver twice and finally report a failure. That could also be solved by resolving another devserver, so I will work on that as well.
,
Feb 27 2017
,
Feb 27 2017
,
Feb 28 2017
> another pingable case:
>
> https://luci-milo.appspot.com/buildbot/chromeos/elm-paladin/1982
> https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103896160-chromeos-test/chromeos2-row7-rack7-host1/debug

The DUT was actually offline. The sequence of events was this:
* provision_Autoupdate.double ran and failed because the DUT went offline during testing.
* The subsequent provision task failed because the DUT was still offline.
* The subsequent repair passed after using servo to reset the DUT.

Here's the history from dut-status:

chromeos2-row7-rack7-host1
    2017-02-27 23:59:39  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row7-rack7-host1/648034-repair/
    2017-02-27 23:54:47  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row7-rack7-host1/648025-provision/
    2017-02-27 22:13:32  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/103864948-chromeos-test/
    2017-02-27 21:59:46  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row7-rack7-host1/647835-provision/
,
Mar 1 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/platform/dev-util/+/447ad9d610433e6d34e4db04057357ef9f73a0f2

commit 447ad9d610433e6d34e4db04057357ef9f73a0f2
Author: xixuan <xixuan@chromium.org>
Date: Wed Mar 01 04:21:07 2017

devserver: add ongoing au process number check in check_health.

This CL mainly does:
1. Add checking the number of current background au processes in check_healthy devserver call.
2. Force kill_au_proc to kill the au process if the process's pid is passed.

BUG= chromium:696606
TEST=Run local devserver and call check_health & kill_au_proc.

Change-Id: I4fe44407b85659bd3aab309ac8efe11b0b457f68
Reviewed-on: https://chromium-review.googlesource.com/447821
Reviewed-by: Aviv Keshet <akeshet@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/447ad9d610433e6d34e4db04057357ef9f73a0f2/cros_update_progress.py
[modify] https://crrev.com/447ad9d610433e6d34e4db04057357ef9f73a0f2/devserver.py
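For readers following along, here is a rough sketch of how a caller might use an AU process count reported by a health check to decide whether a devserver is overloaded. This is not the code from the CL: the `au_process_count` field name, the `MAX_AU_PROCESSES` threshold, and the `is_overloaded` helper are all assumptions made for illustration.

```python
import json

# Hypothetical limit; the thread does not state the real threshold.
MAX_AU_PROCESSES = 20


def is_overloaded(health_json, max_au_processes=MAX_AU_PROCESSES):
    """Return True if the reported AU process count exceeds the limit.

    `health_json` is assumed to be a JSON object string that carries an
    `au_process_count` key (a hypothetical field name).
    """
    health = json.loads(health_json)
    # A missing field is treated as zero load rather than as an error.
    return health.get('au_process_count', 0) > max_au_processes
```

Treating a missing field as "not overloaded" keeps the check compatible with a devserver that does not report the count yet; a stricter caller could treat it as unknown instead.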
,
Mar 1 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/016d95b92b13c252d6fb75f9075d8763ba913cd7

commit 016d95b92b13c252d6fb75f9075d8763ba913cd7
Author: xixuan <xixuan@chromium.org>
Date: Wed Mar 01 17:21:55 2017

autotest: resolve devserver if privision fails due to network/load issue.

This CL does:
1. Resolve a new devserver if this devserver fails to ping the DUT, but double check the DUT's connectivity before resolving.
2. Resolve a new devserver if this devserver is overloaded in the middle.
3. Change kill_au_proc logic to make sure the background au process is killed.

BUG= chromium:696606
TEST=locally run afe & use lab's devserver to test resolving.

Change-Id: I358fd1f49fb8f1ec0cf5865641717da6f1f6b07f
Reviewed-on: https://chromium-review.googlesource.com/447878
Reviewed-by: Dan Shi <dshi@google.com>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/016d95b92b13c252d6fb75f9075d8763ba913cd7/server/hosts/cros_host.py
[modify] https://crrev.com/016d95b92b13c252d6fb75f9075d8763ba913cd7/client/common_lib/cros/dev_server.py
[modify] https://crrev.com/016d95b92b13c252d6fb75f9075d8763ba913cd7/client/common_lib/cros/dev_server_unittest.py
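The resolve logic described in this CL can be sketched roughly as below. This is an illustration only, not the autotest implementation: `pick_devserver` and the three injected callables (`can_ping`, `dut_is_up`, `is_overloaded`) are hypothetical names, and the real code in dev_server.py and cros_host.py may be structured quite differently.

```python
def pick_devserver(devservers, dut, can_ping, dut_is_up, is_overloaded):
    """Return the first devserver that looks usable for provisioning `dut`.

    All three callables are hypothetical stand-ins supplied by the caller:
      can_ping(devserver, dut) -> bool
      dut_is_up(dut) -> bool
      is_overloaded(devserver) -> bool
    """
    for devserver in devservers:
        # Skip a devserver that is already under heavy load.
        if is_overloaded(devserver):
            continue
        if can_ping(devserver, dut):
            return devserver
        # The ping failed: double check the DUT itself before blaming
        # the devserver, so an offline DUT is not retried needlessly.
        if not dut_is_up(dut):
            raise RuntimeError('DUT %s is offline' % dut)
    raise RuntimeError('no usable devserver for %s' % dut)
```

The point of the connectivity double check is the case discussed earlier in this bug: if the DUT itself is offline, trying another devserver cannot help, so the failure should be attributed to the DUT.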
,
Mar 18 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/6f0cb6532b6b2a21732193924db031c992e045fc

commit 6f0cb6532b6b2a21732193924db031c992e045fc
Author: xixuan <xixuan@chromium.org>
Date: Sat Mar 18 21:24:51 2017

autotest: monitor the AU process number on devserver.

This CL adds monitoring AU process count on devservers. We add it since recently it's found that more and more provision failure (may) come from devserver's load issue.

BUG= chromium:696606
TEST=Run unittest.

Change-Id: I5c0c1f571403b283ff0c07e283436850a7be9ff7
Reviewed-on: https://chromium-review.googlesource.com/450972
Commit-Ready: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/6f0cb6532b6b2a21732193924db031c992e045fc/client/common_lib/cros/dev_server.py
,
Jun 20 2017
+don, the AU process number is already monitored, but we don't use it when resolving a devserver. I'm marking this as fixed for now, since you may want to file a more systematic bug for your project. Feel free to re-open it.
,
Aug 3 2017
Closing. Please reopen it if it's not fixed. Thanks!
Comment 1 by xixuan@chromium.org
, Feb 27 2017