network.ShillStability failed on betty-arc64-paladin with multiple processes |
|||
Issue descriptionhttps://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/CQ/b8930293407931948512 2018/11/09 18:42:20 Started test network.ShillStability 2018/11/09 18:42:20 [18:42:19.687] Error at shill_stability.go:47: Found 2 shill processes ([5716 5776]); want 1 2018/11/09 18:42:20 [18:42:19.687] Stack trace: Found 2 shill processes ([5716 5776]); want 1 at chromiumos/tast/local/bundles/cros/network.ShillStability.func1 (shill_stability.go:47) at chromiumos/tast/local/bundles/cros/network.ShillStability (shill_stability.go:52) at chromiumos/tast/testing.(*Test).Run.func4 (test.go:189) at chromiumos/tast/testing.runStages.func1.1 (stage.go:39) at chromiumos/tast/testing.runAndRecover.func1 (stage.go:69) at runtime.goexit (asm_amd64.s:2361) 2018/11/09 18:42:20 Completed test network.ShillStability in 272ms with 1 error(s) Here's pstree from faillog (collected just after the test failed): root 5712 0.0 0.0 8716 1920 ? Ss 20:42 0:00 /sbin/minijail0 -u shill -g shill -G -n -B 20 -c 800003de0 --ambient -- /usr/bin/shill --log-level=0 --log-scopes= --vmodule=object_proxy=0,dbus_object=0,bus=0 --device-black-list=eth_test,faketap0,wlan1,wlan2,managed0,managed1,pseudomodem0p --accept-hostname-from=pseudoethernet0 --arc-device=arcbr0 --portal-list= --jail-vpn-clients root 5715 0.5 0.0 0 0 ? Z 20:42 0:00 \_ [metrics_client] <defunct> shill 5716 4.5 0.1 31472 13476 ? S 20:42 0:00 \_ /usr/bin/shill --log-level=0 --log-scopes= --vmodule=object_proxy=0,dbus_object=0,bus=0 --device-black-list=eth_test,faketap0,wlan1,wlan2,managed0,managed1,pseudomodem0p --accept-hostname-from=pseudoethernet0 --arc-device=arcbr0 --portal-list= --jail-vpn-clients syslog 5718 0.0 0.0 6544 996 ? S 20:42 0:00 \_ /usr/bin/logger --priority daemon err --tag /usr/bin/shill dhcp 5776 0.0 0.0 11292 1240 ? S 20:42 0:00 \_ /sbin/dhcpcd -B -q -4 eth0=ethernet_525400123456 root 5786 0.0 0.0 4380 904 ? Ss 20:42 0:00 /bin/sh /usr/share/cros/init/shill-event.sh 5776 is dhcpcd, so my best guess is the test was very unlucky and ran while shill was doing a fork-exec to start dhcpcd. The getPID function in shill_stability.go should be updated to handle this (likely by returning the main shill process in this case).
,
Nov 17
> Coincidentally, it also doesn't show me any results at all for > board=betty (where this failed). Anyone know why that is? As I understand it, the results that Stainless displays come from the full Autotest lab setup (i.e. Autotest test -> status.log -> Autotest TKO MySQL database -> Stainless BigQuery tables). We fudge things a bit to get individual Tast test results to show up in Stainless even though they're all wrapped into a single Autotest job when running in the lab. betty (and amd64-generic) builders don't use DUTs in the lab; they run VMs locally. As a result, test results don't go into the MySQL database, so they don't make it to Stainless. nya@, if you can think of some way that we can get Tast test results from these VM builders into Stainless, please let me know! Tast's results.json files end up in GCS, but I don't know if there's an easy way to ingest them into Stainless. This would be nice to have, since there's currently no easy way to see whether a test is passing reliably on VMs when we're considering removing its "informational" flag -- one needs to manually click through to different betty-release runs to see whether it's passing or not. This has resulted in us adding flaky tests to the CQ in the past. :-(
,
Nov 17
Your explanation pretty much gives the answer for this too, but just to note: > nya@, if you can think of some way that we can get Tast test results from these VM builders into Stainless, please let me know! It's not just Tast results. I see no betty results for Autotest either. I can't select it in the drop-down, and it doesn't give me any matches in the Custom View.
,
Dec 11
https://chromium-review.googlesource.com/c/chromiumos/platform/tast-tests/+/1370628
,
Dec 12
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/25299a78cc37981f72b9d0e02134c94983af96c4 commit 25299a78cc37981f72b9d0e02134c94983af96c4 Author: Brian Norris <briannorris@chromium.org> Date: Wed Dec 12 21:55:47 2018 tast-tests: ShillStability: Ignore shill children If network.ShillStability runs very soon after shill starts (or restarts), shill could still be forking new processes. Those processes would briefly (fork/exec) also be considered instances of /usr/bin/shill. Filter these false matches by only looking for shill as a child of init or minijail0 (minijail0 is optional today). BUG= chromium:904117 TEST=network.ShillStability Change-Id: I67339aa2833d081def7c50081bbf46a11f9a5d68 Reviewed-on: https://chromium-review.googlesource.com/1370628 Commit-Ready: Brian Norris <briannorris@chromium.org> Tested-by: Brian Norris <briannorris@chromium.org> Reviewed-by: Dan Erat <derat@chromium.org> Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org> [modify] https://crrev.com/25299a78cc37981f72b9d0e02134c94983af96c4/src/chromiumos/tast/local/bundles/cros/network/shill_stability.go
,
Dec 12
I still have no way to see how reliably this runs on betty at any scale. I suppose I can click through a few builder results, but that's not going to go very far... |
|||
►
Sign in to add a comment |
|||
Comment 1 by briannorris@chromium.org
, Nov 17Thanks for the report! I'll try to tackle this soon. Technically, the autotest also had this problem, but we only see this because... 1. the shill.DefaultProfile tast stops/starts shill // Stop shill temporarily and remove the default profile. if err := shill.SafeStop(ctx); err != nil { s.Fatal(err) } os.Remove(filePath) if err := shill.SafeStart(ctx); err != nil { s.Fatal(err) } 2. client-side tast can apparently keep running (and starting the next tast) even with the interruption in connectivity 3. tast is blazing fast! In 3 seconds it finished DefaultProfile and moved on to ShillStability: 2018/11/09 18:42:17 Started test network.DefaultProfile 2018/11/09 18:42:20 [18:42:17.245] Waiting for org.chromium.flimflam D-Bus service 2018/11/09 18:42:20 Completed test network.DefaultProfile in 2.998s with 0 error(s) 2018/11/09 18:42:20 Started test network.ShillStability 2018/11/09 18:42:20 [18:42:19.687] Error at shill_stability.go:47: Found 2 shill processes ([5716 5776]); want 1 That's...not enough time for shill to get dhcpcd totally up, apparently -- from net.log: 2018-11-10T02:42:17.334675+00:00 INFO shill[5716]: [INFO:manager.cc(297)] Manager started. ... 2018-11-10T02:42:17.509120+00:00 INFO shill[5716]: [INFO:device.cc(188)] Device created: eth0 index 2 2018-11-10T02:42:17.516069+00:00 INFO shill[5716]: [INFO:service.cc(271)] ethernet service 1 constructed. 2018-11-10T02:42:17.517834+00:00 INFO shill[5716]: [INFO:manager.cc(1237)] Device eth0 registered. 2018-11-10T02:42:17.517996+00:00 INFO shill[5716]: [INFO:manager.cc(1512)] Device eth0 updated: enabled 2018-11-10T02:42:19.614933+00:00 INFO shill[5716]: [INFO:ethernet.cc(109)] Registering eth0 with manager. ... 2018-11-10T02:42:19.696741+00:00 NOTICE dhcpcd[5776]: dhcpcd is running with reduced privileges 2018-11-10T02:42:19.704107+00:00 INFO dhcpcd[5776]: status changed to Init 2018-11-10T02:42:19.705123+00:00 INFO shill[5716]: [INFO:dhcp_config.cc(133)] Init DHCP Proxy: eth0 at :1.53 ... See how 18:42:19.687 (the Tast error timestamp) and dhcpcd's first log messages (2018-11-10T02:42:19.696741+00:00 NOTICE dhcpcd[5776]: dhcpcd is running with reduced privileges) are within a few milliseconds of each other. It's a great problem to have when your test framework is so fast (and network-quality tolerant) that you hit these sorts of race conditions. On a related note: Stainless still doesn't show me any tast.network.ShillStability failures in the last month: https://stainless.corp.google.com/search?view=matrix&row=board_model&col=build&first_date=2018-10-21&last_date=2018-11-17&test=%5Etast%5C.network%5C.ShillStability%24&status=WARN&status=FAIL&status=ERROR&exclude_cts=true&exclude_not_run=false&exclude_non_release=false&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=false Coincidentally, it also doesn't show me any results at all for board=betty (where this failed). Anyone know why that is? That makes it harder to evaluate stability for something like this...