VM Shutdowns failing in tests sometimes as of 11302.0.0
Issue description

First time I've seen this, it's in 11302.0.0+ (maybe earlier; it's only had 3 failures with this).

2018-11-27T15:56:05.287500-08:00 WARNING vm_concierge[535]: Shutdown RPC failed for VM 3 with error code 4: Deadline Exceeded
2018-11-27T15:56:05.306857-08:00 ERR localhos[554]: crosvm[4]: [src/linux.rs:1063] failed to send VmResponse: Send(Error(107))
2018-11-27T15:56:05.306935-08:00 INFO localhos[554]: crosvm[4]: [src/linux.rs:1066] control socket requested exit

From hana on 11302.0.0: https://stainless.corp.google.com/browse/chromeos-autotest-results/261718585-chromeos-test/
And from elm on 11302.0.0: https://stainless.corp.google.com/browse/chromeos-autotest-results/261704614-chromeos-test/
And hana again on 11307.0.0: https://stainless.corp.google.com/browse/chromeos-autotest-results/261928727-chromeos-test/
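For anyone unfamiliar with the error code in that first log line: it's the client-side gRPC deadline expiring. Here's a minimal sketch of that pattern; the service/message names (vm_tools::Maitred, EmptyMessage), the generated header, and the timeout constant are assumptions for illustration, not the actual concierge source.

#include <chrono>
#include <iostream>

#include <grpcpp/grpcpp.h>
// #include "vm_guest.grpc.pb.h"  // hypothetical generated header for the Maitred service

constexpr int kShutdownTimeoutSeconds = 30;  // assumed; roughly matches the ~30s gap in the logs

bool ShutdownVm(vm_tools::Maitred::Stub* stub) {  // stub/message types are assumptions
  grpc::ClientContext ctx;
  // Bound how long we wait for the guest to acknowledge the shutdown.
  ctx.set_deadline(std::chrono::system_clock::now() +
                   std::chrono::seconds(kShutdownTimeoutSeconds));

  vm_tools::EmptyMessage request, response;
  grpc::Status status = stub->Shutdown(&ctx, request, &response);
  if (!status.ok()) {
    // An unresponsive guest shows up as DEADLINE_EXCEEDED, numeric value 4,
    // which is the "Shutdown RPC failed ... with error code 4" line above.
    std::cerr << "Shutdown RPC failed with error code " << status.error_code()
              << ": " << status.error_message() << std::endl;
    return false;
  }
  return true;
}

In other words, error code 4 only means maitred never answered the Shutdown RPC within the deadline; it says nothing about why the guest stalled.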
Nov 29
So maitred is getting the shutdown request and garcon shuts down soon after:

2018-11-29T10:03:22.169511+00:00 INFO VM(3)[31999]: [ 57.142343] maitred: Received shutdown request#015
2018-11-29T10:03:22.178842+00:00 INFO VM(3)[31976]: tremplin[182]: 2018/11/29 10:03:21 Received GetContainerInfo RPC: penguin#012
2018-11-29T10:03:22.178854+00:00 INFO VM(3)[31976]: tremplin[182]: 2018/11/29 10:03:21 Received GetContainerUsername RPC: penguin#012
2018-11-29T10:03:22.219187+00:00 INFO vm_cicerone[31971]: process 10 did not exit from signal 9 in 0 seconds
2018-11-29T10:03:22.219238+00:00 INFO vm_cicerone[31971]: Shutdown of container penguin for VM termina

But then nothing happens for 30 seconds, which is odd because maitred only waits 10 seconds before sending SIGKILL to every process and then continues with the shutdown (https://chromium.googlesource.com/chromiumos/platform2/+/master/vm_tools/maitred/init.cc#1150). So we should be seeing logs that say "Detaching loopback" and "Unmounting filesystems" about 10 seconds after we receive the shutdown request. The final sync() before shutdown can potentially take a long time if there is a lot of buffered data, but we should see other messages in the logs before it even attempts the sync(). I have no idea what's happening there.
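To make the expected timeline concrete, here's a rough sketch of that ordering; it's paraphrased, not the actual init.cc code (the real implementation waits on process exits rather than sleeping a fixed interval, and the initial SIGTERM pass is my assumption about typical init behavior). The point is that the SIGKILL fallback is bounded at roughly 10 seconds, so the unmount messages should be logged well before the 30-second gap we're seeing, with only the final sync() able to run long.

#include <chrono>
#include <iostream>
#include <thread>

#include <signal.h>
#include <sys/reboot.h>
#include <unistd.h>

constexpr int kKillTimeoutSeconds = 10;  // the 10-second bound mentioned above

void ShutdownSequence() {
  // 1. Ask every remaining process to exit, then give them a bounded amount
  //    of time before forcing the issue with SIGKILL.
  kill(-1, SIGTERM);
  std::this_thread::sleep_for(std::chrono::seconds(kKillTimeoutSeconds));
  kill(-1, SIGKILL);

  // 2. These steps run right after the kill phase, so their log lines should
  //    show up ~10 seconds after "Received shutdown request".
  std::cout << "Detaching loopback" << std::endl;      // detach loop devices
  std::cout << "Unmounting filesystems" << std::endl;  // unmount everything

  // 3. Flush buffered data. This is the only step that can legitimately take
  //    a long time, and it only runs after the messages above are printed.
  sync();

  reboot(RB_POWER_OFF);
}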
Nov 29
This happened again on 11313.0.0...and again it was on hana. So 4 out of 5 times this has been on hana...which is very odd IMO. Ahhh...so hana and elm are the only 2 boards with the MT8173 chipset. Although they're not failing every time...sometimes it works.
Nov 30
And again in 11314.0.0 on elm. If you can get yourself an elm or hana device, I suspect you could reproduce this within a few tries of running the vm.CrostiniFiles test (and it'll likely also start causing failures in the vm.CrostiniStartEverything test since that was just updated to stop the VM again).
Dec 5
To add onto this...I'm seeing some other new failures that are specific to elm/hana devices. When we install an app, we have an inode watcher that sees the changes in the filesystem for the .desktop files...then it waits 5 seconds in case there are more changes...and then it parses them and sends them. In a normal container log file, there's generally about 3 seconds between when the app install finishes and when we start parsing these files, like this:

Dec 05 07:07:52 penguin garcon[221]: [221]: Finished installing Linux package result: 1
Dec 05 07:07:55 penguin garcon[221]: [221]: Searching for local Linux package that owns /usr/share/applications/x11_demo.desktop

But on elm/hana recently, this happens:

Dec 05 07:58:32 penguin garcon[212]: [212]: Finished installing Linux package result: 1
Dec 05 07:59:12 penguin garcon[212]: [212]: Searching for local Linux package that owns /usr/share/applications/x11_demo.desktop
Dec 05 07:59:12 penguin rtkit-daemon[29]: The canary thread is apparently starving. Taking action.
Dec 05 07:59:12 penguin rtkit-daemon[29]: Demoting known real-time threads.
Dec 05 07:59:12 penguin rtkit-daemon[29]: Demoted 0 threads.

So that's a 40 second delay. This looks suspicious because it's inducing a delay that we don't expect...similar to what is happening when shutdown isn't completing in a timely fashion (although I don't have container logs for that test). I'll try to get an elm/hana device and see if I can reproduce this.
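For context, the watcher behavior described above is essentially a debounce. Here's a minimal sketch using inotify and poll; the directory, watch flags, and 5-second constant are assumptions, and this is not garcon's actual implementation.

#include <poll.h>
#include <sys/inotify.h>
#include <unistd.h>

#include <cstdio>

constexpr int kDebounceMs = 5000;  // the 5-second settle window mentioned above

void WatchDesktopFiles(const char* dir) {
  int fd = inotify_init1(IN_NONBLOCK);
  inotify_add_watch(fd, dir, IN_CREATE | IN_MODIFY | IN_DELETE | IN_MOVED_TO);

  bool pending = false;
  char buf[4096];
  while (true) {
    struct pollfd pfd = {fd, POLLIN, 0};
    // If a change is pending, only wait out the debounce window; otherwise
    // block until the next filesystem event.
    int timeout = pending ? kDebounceMs : -1;
    int n = poll(&pfd, 1, timeout);
    if (n > 0 && (pfd.revents & POLLIN)) {
      while (read(fd, buf, sizeof(buf)) > 0) {}  // drain the queued events
      pending = true;  // restart the 5-second quiet period
    } else if (n == 0 && pending) {
      // 5 seconds with no further changes: parse the .desktop files and send
      // the application list to the host.
      printf("Parsing .desktop files in %s\n", dir);
      pending = false;
    }
  }
}

The key point is that the parse should start about 5 seconds after the last .desktop change, so a 40-second gap suggests either the filesystem events arrived very late or the process simply wasn't getting scheduled.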
Dec 5
And I went looking back further through the test results...not every failure with this timing has that rtkit-daemon message in the container log.
Dec 5
I have an elm device now and I can reproduce both of these problems. The VM takes 30 seconds to shut down (it's just timing out and forcibly killing it) and I can reproduce the slowdown that's causing the testing failure after package install as well (I don't know if the 2 are related yet...but both of these problems are fairly new and specific to this chipset).
Dec 6
And I looked through the test results again...and both of these issues started happening at the same time in 11302.0.0. I'll look through those CLs to see if I can find anything of interest to blame.
Dec 6
Well this CL is highly suspicious. It landed in 11302.0.0 and it's from somebody at mediatek. :) https://chromium-review.googlesource.com/c/chromiumos/platform/depthcharge/+/1350564/ It's a change to external interrupt handling for the GPIO driver. I'm not entirely sure if it'll affect elm/hana though...I'll build an image for it...reproduce the problem, revert that change and see if it fixes it.
Dec 6
It doesn't affect elm/hana. The CL is for MT8183. oak (elm/hana) are MT8173C. Also oak is on branch kernel-3.18.
Dec 6
OK thanks. I'll at least first prove that the failure started in 11302.0.0 with my own testing rather than relying on the automated testing (because sometimes it does still pass even with this problem).
Dec 6
I failed to reproduce this in 11302.0.0 on elm even though the tast test logs indicate there was a failure with the same problem in 11302.0.0 on hana. I tried probably half a dozen times...this is very odd. Then I loaded 11303.0.0 and also couldn't reproduce it (although it did take longer one time and had the rtkit-daemon log message, but the test still passed because it didn't take that much longer). On 11348.0.0 it was failing very consistently, I think it was every time I ran it. Time for a little better bisection...tried 11316.0.0 and it didn't reproduce there either. This isn't proceeding in any effective way...I'm going back to 11348.0.0 to see if I can still reproduce there and then just going to analyze that more to see if I can track this down.
Dec 6
And for clarification...I'm only reproducing the vm.CrostiniStartEverything failures (there were other test changes that keep my HEAD version of the tests from working against older Chrome OS versions for CrostiniFiles). But I did now reproduce the failure in CrostiniStartEverything on 11348.0.0 on my first try. Maybe the two aren't actually linked.
Dec 7
Reproduced it once on 11348.0.0 and then not again. I'm going to stop looking into this one for now because I just feel like I'm wasting my time...
Jan 8
We did have a VM push on the first day that this error showed up:

release_72 auto_minenv720.1_11283.0.0 11/27/2018 14:45:00

The last public push before this was 11220.0.0. That was the first push with the 4.19 kernel. So we're looking at a rather large changelog: https://crosland.corp.google.com/log/11220.0.0..11283.0.0
Jan 9
I tried reproducing this again on elm with R73-11528.0.0 and after 50+ tries of just running vm.CrostiniStartEverything it never failed. Then I ran the full set of tast tests (which I've done before) and it failed on the second try. Interestingly enough, this time in journalctl there is no mention of rtkit-daemon...but we still see the same long delay with searching for the installed desktop files after install is completed...and also the failure with stopping the VM in the CrostiniFiles test.
Jan 9
And after it failed I did run 'ps aux' in the container, and all processes except for packagekitd had zero cumulative time...packagekitd was only 9 seconds, which correctly corresponds to the amount of time it would have spent doing the install/uninstall during a normal test run.
Jan 10
Jeff, any luck checking ps/top in the VM or on the host? I'm hoping that there's a kernel thread (guest or host) or vcpu or something running amok so that it'll be easier to pin down the cause. I didn't see any obvious red flags in the 4.19 changelog for the component. Maybe it's time I picked up an oak device so I can repro locally too.
Jan 10
See Comment #17. I did run ps in the VM after that test failed and there was nothing out of line. I left my test script running overnight...and for the next 39 iterations it didn't fail (it failed after that, but it was some other odd failure, not this one, and it tricked my script into stopping the testing). I'll keep doing this and likely add more data extraction to it to hopefully get something useful out of it.
Jan 10
In #17 you said you ran ps in the container, I meant to ask if there was anything outside of the container in the VM that might be spinning :)
Jan 10
Ohhh...in the VM. Good point...I'll definitely run that next time it happens to check the VM as well.