use helper process in a sandbox to clean up in imageloader |
|||||
Issue description
This is working (call helper process in a sandbox) when manually executing imageloader as root user. but in autotest, minijail crashes after imageloader returns successfully.
The autotest issues:
return subprocess.check_output([
'/usr/sbin/imageloader', '--dry_run', '--unmount_all',
]).splitlines()
I could clearly see imageloader message sent back and force between imageloader and helper process and imageloader eventually exits successfully. However, minijail crashes after imageloader exits which leaves autotest in a state where it is not able to recover.
suspected causes:
autotest already does sandboxing and sandboxing again in imageloader causes conflicts.
,
Nov 15 2017
,
Nov 15 2017
This issue is also discussed in our code review: https://chromium-review.googlesource.com/c/chromiumos/platform/imageloader/+/767024
,
Nov 15 2017
So imageloader prints all the paths out and then crashes? You can paste an example or something?
,
Nov 15 2017
Manually run imageloader (call helper process to CleanupAll):
localhost ~ # imageloader --unmount_all --dry_run
/run/imageloader/cros-termina/10081.0.2
(works properly)
Run autotest on DuT:
Autotest stops here since subprocess never returns the result:
return subprocess.check_output([
'/usr/sbin/imageloader', '--dry_run', '--unmount_all',
]).splitlines()
/var/log/message:
2017-11-15T15:41:15.738867-08:00 NOTICE dbus[599]: [system] Activating service name='org.chromium.ImageLoader' (using servicehelper)
2017-11-15T15:41:15.785694-08:00 NOTICE dbus[599]: [system] Successfully activated service 'org.chromium.ImageLoader'
2017-11-15T15:41:37.277888-08:00 INFO kernel: [82789.022062] imageloader[19574]: segfault at 76f21ebb1ac0 ip 000076f21ebb1ac0 sp 00007ffd21414418 error 15
2017-11-15T15:41:37.307499-08:00 INFO crash_reporter[19619]: libminijail[19619]: mount /dev/log -> /dev/log type ''
2017-11-15T15:41:37.307530-08:00 INFO crash_reporter[19619]: libminijail[19619]: mount /dev/pstore -> /dev/pstore type ''
2017-11-15T15:41:37.610540-08:00 DEBUG kernel: [82789.355193] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
2017-11-15T15:41:37.615545-08:00 DEBUG kernel: [82789.360069] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
2017-11-15T15:41:37.629911-08:00 WARNING crash_reporter[19619]: Could not load the device policy file.
2017-11-15T15:41:37.630303-08:00 WARNING crash_reporter[19619]: [user] Received crash notification for imageloader[19574] sig 11, user 220 (developer build - not testing - always dumping)
2017-11-15T15:41:37.632704-08:00 INFO crash_reporter[19619]: State of crashed process [19574]: S (sleeping)
2017-11-15T15:41:37.667771-08:00 INFO crash_reporter[19619]: Stored minidump to /var/spool/crash/imageloader.20171115.154137.19574.dmp
2017-11-15T15:41:37.668204-08:00 INFO crash_reporter[19619]: Leaving core file at /var/spool/crash/imageloader.20171115.154137.19574.core due to developer image
2017-11-15T15:41:37.668959-08:00 WARNING crash_reporter[19619]: [ARC] Received crash notification for imageloader[19574] sig 11, user 220 (ignoring - crash origin is not ARC)
crash trace dump:
#0 0x000076f21ebb1ac0 in ?? () from /usr/lib64/libstdc++.so.6
No symbol table info available.
#1 0x00005d7ff8023ebb in std::default_delete<brillo::dbus_utils::ExportedObjectManager>::operator() (__ptr=0x76f21ebafd98 <std::wclog+216>, this=<optimized out>)
at /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/4.9.x/include/g++-v4/bits/unique_ptr.h:76
No locals.
#2 std::unique_ptr<brillo::dbus_utils::ExportedObjectManager, std::default_delete<brillo::dbus_utils::ExportedObjectManager> >::~unique_ptr (this=0x7ffd21414968)
at /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/4.9.x/include/g++-v4/bits/unique_ptr.h:236
No locals.
#3 brillo::DBusServiceDaemon::~DBusServiceDaemon (this=0x7ffd214146b0) at ../../../../../../../../usr/include/brillo/daemons/dbus_daemon.h:52
No locals.
#4 0x00005d7ff802009e in main (argc=1, argv=0x7ffd21414b58) at ../../../imageloader-9999/platform/imageloader/imageloader_main.cc:221
Noting that imageloader_main.cc has total 221 loc in my change. It looks like minijail crashes due to imageloader(or helper process) crash. In this case the root case isn't in minijail.
,
Nov 15 2017
Ok, we should root cause the crash before submitting your CL in case it's related to your CL specifically.
,
Nov 16 2017
Ok, so autotest is using subprocess.check_output to run imageloader which is wrong. The correct way to run command in autotest is: utils.system_output(...); And I have no idea why subprocess.check_output in autotest fails to cleanup the service process which causes crash in DBusServiceDaemon destructor. I will send a CL to fix autotest.
,
Nov 16 2017
Thanks for root causing that.
,
Nov 17 2017
I observed another interesting case. It looks like there is always an imageloader crash after each dbus call. The reason is following:
1. Each dbus call brings up imageloader service (imageloader_wrapper) and imageloader executes following code:
imageloader::ImageLoader daemon(std::move(config), std::move(helper_process));
daemon.Run();
2. imageloader blocks on daemon.Run() for several seconds. Then imageloader exits but daemon crashes. It looks like daemon is not properly cleaning up itself (or not initializing itself properly).
crash trace dump:
#0 0x0000746d38efcac0 in ?? () from /usr/lib64/libstdc++.so.6
No symbol table info available.
#1 0x00005fafa07ffebb in std::default_delete<brillo::dbus_utils::ExportedObjectManager>::operator() (__ptr=0x746d38efad98 <std::wclog+216>, this=<optimized out>)
at /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/4.9.x/include/g++-v4/bits/unique_ptr.h:76
No locals.
#2 std::unique_ptr<brillo::dbus_utils::ExportedObjectManager, std::default_delete<brillo::dbus_utils::ExportedObjectManager> >::~unique_ptr (this=0x7fff98b29c08)
at /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/4.9.x/include/g++-v4/bits/unique_ptr.h:236
No locals.
#3 brillo::DBusServiceDaemon::~DBusServiceDaemon (this=0x7fff98b29950) at ../../../../../../../../usr/include/brillo/daemons/dbus_daemon.h:52
No locals.
#4 0x00005fafa07fc09e in main (argc=1, argv=0x7fff98b29df8) at ../../../imageloader-9999/platform/imageloader/imageloader_main.cc:194
,
Nov 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/74554cc517ff14531ef5aa71a13101b4eeb4e0ca commit 74554cc517ff14531ef5aa71a13101b4eeb4e0ca Author: Xiaochu Liu <xiaochu@chromium.org> Date: Sat Nov 18 04:18:42 2017 platform_ImageLoader: use utils.system_output() Uses utils.system_output(...) instead of subprocess.check_output(...). subprocess.check_output fails to cleanup DBusServbiceDaemon and causes crash. BUG= chromium:785509 TEST=autotest Change-Id: Iba59c90ffca323f59fa54d6fb358f49e7c54fdc3 Reviewed-on: https://chromium-review.googlesource.com/775032 Commit-Ready: Xiaochu Liu <xiaochu@chromium.org> Tested-by: Greg Kerr <kerrnel@chromium.org> Tested-by: Xiaochu Liu <xiaochu@chromium.org> Reviewed-by: Greg Kerr <kerrnel@chromium.org> [modify] https://crrev.com/74554cc517ff14531ef5aa71a13101b4eeb4e0ca/client/site_tests/platform_ImageLoader/platform_ImageLoader.py
,
Nov 20 2017
Thanks for the analysis. Is the crash in comment #9 the crash fixed by the CL in comment #10, is there a separate crash that I should look into?
,
Nov 20 2017
Both are caused by the same reason. I observed flakeness of autotest after my imageloader change: https://wmatrix.googleplex.com/unfiltered?hide_missing=True&releases=tot&tests=platform_ImageLoaderServer The crash happens on other 'subprocess.call()' in autotest which I ought to fix in my last CL. I'm working on a patch to fix this.
,
Nov 20 2017
Ok, to be clear, does this crash affect actual users or is it just a crash in the autotest?
,
Nov 20 2017
Yes, it crashes for actual users.
,
Nov 20 2017
So there's a crash that's totally separate from the subprocess.check_output() issue?
,
Nov 20 2017
subprocess.check_output() is the effect of that crash. In my observation, each execution of 'daemon.Run();' in imageloader_main causes crash after it exits. I suspect this crash was there before my imageloader change but only observed in autotest after my change (my change invokes helper process which invokes imageloader_wrapper and imageloader_main).
,
Nov 21 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/fb2e7420ab1882ac1bab4f499996072aafa98bf8 commit fb2e7420ab1882ac1bab4f499996072aafa98bf8 Author: Xiaochu Liu <xiaochu@chromium.org> Date: Tue Nov 21 07:25:46 2017 platform_ImageLoader: use utils.system() Uses utils.system(...) instead of subprocess.call(...). This autotest is flaky (crashes at subprocess.call()) on certain boards. BUG= chromium:785509 TEST=autotest, print return value of utils.system to verify its return value is retcode (0 on success). Change-Id: I8e74bd5e352dbbdee3e765b0626e2f2948ea0d45 Reviewed-on: https://chromium-review.googlesource.com/779634 Commit-Ready: Xiaochu Liu <xiaochu@chromium.org> Tested-by: Xiaochu Liu <xiaochu@chromium.org> Reviewed-by: Greg Kerr <kerrnel@chromium.org> [modify] https://crrev.com/fb2e7420ab1882ac1bab4f499996072aafa98bf8/client/site_tests/platform_ImageLoader/platform_ImageLoader.py
,
Nov 21 2017
Hmm. I ran imageloader, without sending it any commands, and it exits clean: localhost ~ # imageloader_wrapper localhost ~ # echo $? 0 Nothing in the logs reports a crash. Let me try sending it some commands and then I'll look for a crash.
,
Nov 21 2017
This might be relevant. I checked ps aux after running the autotest, and it looks all the helper process's are sticking around. Is it possible those aren't be reaped correctly? localhost ~ # ps aux | grep imageloader root 3114 0.0 0.7 30676 7872 ? Ss 11:10 0:00 /usr/sbin/imageloader --mount --mount_component=TestFlashComponent --mount_point=/run/imageloader/TestFlashComponent_testing --mount_helper_fd=2 root 3131 0.0 0.7 30604 7416 ? Ss 11:10 0:00 /usr/sbin/imageloader --mount --mount_component=TestFlashComponent --mount_point=/run/imageloader/TestFlashComponent_testing --mount_helper_fd=2 root 3209 0.0 0.7 30604 7460 ? Ss 11:10 0:00 /usr/sbin/imageloader --unmount --mount_point=/run/imageloader/CorruptTestFlashComponent1/23.0.0.207 --mount_helper_fd=2 root 3220 0.0 0.7 30604 7432 ? Ss 11:10 0:00 /usr/sbin/imageloader --unmount --mount_point=/run/imageloader/FlashLoadedAtPath/23.0.0.207 --mount_helper_fd=2 root 3230 0.0 0.7 30604 7472 ? Ss 11:10 0:00 /usr/sbin/imageloader --unmount --mount_point=/run/imageloader/TestFlashComponent/24.0.0.186 --mount_helper_fd=2 root 3240 0.0 0.7 30604 7264 ? Ss 11:10 0:00 /usr/sbin/imageloader --unmount --mount_point=/run/imageloader/TestFlashComponent_testing --mount_helper_fd=2
,
Nov 21 2017
Ok that's a separate issue I caused when I first wrote imageloader, I'll file a separate bug report. I can't reproduce the crash though, can you still repro it?
,
Nov 21 2017
yes, I guess main function of all these commands exited correctly but for whatever reason the service process it invoked failed. Another way to reproduce this is to call a dbus API (e.g. LoadComponent) and then there should be a crash after imageloader_wrapper exits.
,
Nov 21 2017
Can you copy and paste exact reproduction instructions and the crash? Where do you see the crash (logs, command line, etc)? Because I can't repro this and I have your latest change. What board does the crash happen on?
,
Nov 27 2017
,
Jan 22 2018
,
Jan 23 2018
|
|||||
►
Sign in to add a comment |
|||||
Comment 1 by kerrnel@chromium.org
, Nov 15 2017