Crash at MetricsDaemon::ReportZram-ed8aa3dc |
|||||||||
Issue descriptionChrome Version: 9202.1.0 OS: ChromeOS Magic SignatureMetricsDaemon::ReportZram Sample crash report ID: f6d96f6680000000 Minidump: attached. Thread 0 CRASHED [SIGFPE @ 0x00000000 ] MAGIC SIGNATURE THREAD Stack Quality88%Show frame trust levels 0x00005ee2c8a76a7d (metrics_daemon -metrics_daemon.cc:868 ) MetricsDaemon::ReportZram(base::FilePath const&) 0x00005ee2c8a76142 (metrics_daemon -metrics_daemon.cc:820 ) MetricsDaemon::MeminfoCallback(base::TimeDelta) 0x000074cdeb566ab8 (libbase-core-395517.so -callback.h:397 ) base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask const&) 0x000074cdeb582a68 (libbase-core-395517.so -message_loop.cc:478 ) base::MessageLoop::RunTask(base::PendingTask const&) 0x000074cdeb582d8a (libbase-core-395517.so -message_loop.cc:487 ) base::MessageLoop::DeferOrRunPendingTask(base::PendingTask const&) 0x000074cdeb583340 (libbase-core-395517.so -message_loop.cc:642 ) base::MessageLoop::DoDelayedWork(base::TimeTicks*) 0x000074cdeb585c8c (libbase-core-395517.so -message_pump_libevent.cc:229 ) base::MessagePumpLibevent::Run(base::MessagePump::Delegate*) 0x000074cdeb5a7694 (libbase-core-395517.so -run_loop.cc:35 ) base::RunLoop::Run() 0x000074cdeb7ac9e5 (libbrillo-core-395517.so -base_message_loop.cc:212 ) brillo::BaseMessageLoop::Run() 0x000074cdeb778455 (libbrillo-core-395517.so -daemon.cc:29 ) brillo::Daemon::Run() 0x00005ee2c8a71b4e (metrics_daemon -metrics_daemon.cc:191 ) MetricsDaemon::Run() 0x00005ee2c8a70e5e (metrics_daemon -metrics_daemon_main.cc:101 ) main 0x000074cdeac23795 (libc-2.23.so -libc-start.c:289 ) __libc_start_main 0x00005ee2c8a70338 (metrics_daemon + 0x0000a338 ) _start 0x00007ffde86c6027 0x00005ee2c8a7030f (metrics_daemon + 0x0000a30f )
,
Jan 31 2017
I'll take a look, thanks.
,
Jan 31 2017
The exception happens at the second-to-last line where we divide by compr_data_size. It looks like something is overflowing (SIGFPE). That is weird though.
orig_data_size is provided by two calls to ReadFileToUint64 from sysfs entries. If the reads or the conversions to int fail, this code is not executed. Thus, orig_data_size would have to be very big to overflow when multiplied by 100. More precisely, it would have to be over 2^57. It is conceivable that a kernel bug could place unreasonable values in the sysfs entries.
The division by compr_data_size cannot hurt, because compr_data_size_mb > 1, thus compr_data_size > 2^20.
Ketaki, is this a one-off, or are we seeing multiple instances?
// |orig_data_size| does not include zero-filled pages.
orig_data_size += zero_pages * page_size;
const int compr_data_size_mb = compr_data_size >> 20;
const int savings_mb = (orig_data_size - compr_data_size) >> 20;
const int zero_ratio_percent = zero_pages * page_size * 100 / orig_data_size;
// Report compressed size in megabytes. 100 MB or less has little impact.
SendSample("Platform.ZramCompressedSize", compr_data_size_mb, 100, 4000, 50);
SendSample("Platform.ZramSavings", savings_mb, 100, 4000, 50);
// The compression ratio is multiplied by 100 for better resolution. The
// ratios of interest are between 1 and 6 (100% and 600% as reported). We
// don't want samples when very little memory is being compressed.
if (compr_data_size_mb >= 1) {
SendSample("Platform.ZramCompressionRatioPercent",
orig_data_size * 100 / compr_data_size, 100, 600, 50);
}
,
Jan 31 2017
Looking at kernel. CHROMEOS_RELEASE_BUILDER_PATH=gandof-release/R57-9202.1.0 using kernel 3.14 values into sysfs are generated by these: return sprintf(buf, "%u\n", atomic_read(&zram->stats.pages_zero)); return sprintf(buf, "%llu\n", (u64)(atomic_read(&zram->stats.pages_stored)) << PAGE_SHIFT); pages_zero is atomically incremented/decremented every time a page is allocated/freed. pages_stored is treated similarly. VERY unlikely that there are bugs here... except... mmm... what if one of them goes negative... Let's see if Minchan has ideas/suggestions.
,
Feb 1 2017
Hi,
It's v3.14 which is really old so it might have bug. However, I never heard such problem until now. Maybe, there is no one to test zram stat like chromium.
Thanks for the notice, Luigi.
At a first glance for v3.14, it seems to have race pages_{stored|zeroed} updating. For example, zram_bvec_write doesn't increase pages_stored yet but others on other CPU can free the page via zram_free_page by swap_slot_free_notify. In that case, pages_stored can be negative temporally and sysfs read can hit it.
It would be really rare but is possible theoretically.
stats.pages_stored increased side should be protected by tb_lock, too.
pages_zero is same.
,
Feb 1 2017
Thank you very much for your prompt reply! I might fix this by making the metrics_daemon code tolerate bad sysfs values. It is an easier fix than managing the kernel patches. Talk to you soon
,
Feb 1 2017
Ketaki, is this a common occurrence? If it's rare we don't need to fix it, because the metrics daemon will restart itself, and occasionally missing one set of samples is not a big deal.
,
Feb 8 2017
this is trending higher yes. So not one off.
,
Feb 8 2017
1866 crashes in 9202.12.0 the latest dev build. https://crash.corp.google.com/browse?q=product.name%3D%27ChromeOS%27%20AND%20product.version%20%3D%20%279202.12.0%27%20and%20%20stable_signature%3D%27MetricsDaemon%3A%3AReportZram-ed8aa3dc%27&ignore_case=false&enable_rewrite=false&omit_field_name=&omit_field_value=&omit_field_opt=
,
Feb 8 2017
CL in progress at https://chromium-review.googlesource.com/439005
,
Feb 8 2017
Thanks Luigi!
,
Feb 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/f56f45d1b6ec53ed52dc88061d3d6b03ccc4bc15 commit f56f45d1b6ec53ed52dc88061d3d6b03ccc4bc15 Author: Luigi Semenzato <semenzato@chromium.org> Date: Wed Feb 08 06:27:06 2017 metrics: avoid overflow from bad sysfs value A zram bug can create very large values for orig_data_size, which causes an overflow. Rather than fixing various kernel versions, we work around the problem here. BUG= chromium:686928 TEST=none Change-Id: If3657091cff5727adf0dd2a15ccfa3f94095cc83 Reviewed-on: https://chromium-review.googlesource.com/439005 Commit-Ready: Luigi Semenzato <semenzato@chromium.org> Tested-by: Luigi Semenzato <semenzato@chromium.org> Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org> [modify] https://crrev.com/f56f45d1b6ec53ed52dc88061d3d6b03ccc4bc15/metrics/metrics_daemon.cc
,
Feb 15 2017
,
Mar 14 2017
Could this bug result in this behavior ? https://goo.gl/photos/aSucBbkmrHQfc6ar7 We have a customer reporting multiple devices are impacted buy this.
,
Mar 30 2017
Luigi, this has gone up from 0% in 56 stable to 12% in 57 stable. Can you please take a look? https://crash.corp.google.com/browse?q=product.name%3D%27ChromeOS%27%20AND%20product.version%20in%20(%279202.56.1%27%2C%279202.60.0%27%2C%20%279000.92.0%27)&ignore_case=false&enable_rewrite=false&omit_field_name=&omit_field_value=&omit_field_opt=&compProp=product.Version&v1=9000.92.0&v2=9202.56.1
,
Mar 31 2017
I put this on ToT on 2/8 and never ported it to R57, sorry. Is it worth merging now in case we do another R57 release? The crash is noisy, but has very little consequences (we lose a small number of 0 samples). Otherwise it will be in R58, which branched in late February. I will call this fixed, but feel free to change it.
,
Mar 31 2017
Given the high crash rate for this signature I would say let us get this into 58 first and see if the crash rate there reduces. Then take it in stable by next monday. sg?
,
Mar 31 2017
OK good plan. I think we should also merge https://chromium-review.googlesource.com/c/442628 for issue 692271 .
,
Apr 1 2017
Your change meets the bar and is auto-approved for M58. Please go ahead and merge the CL to branch 3029 manually. Please contact milestone owner if you have questions. Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Apr 3 2017
My mistake---this was already merged.
,
Apr 5 2017
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Apr 5 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/d48e2fec85d1b87ed5a1202fe2b5ca77798dfe90 commit d48e2fec85d1b87ed5a1202fe2b5ca77798dfe90 Author: Luigi Semenzato <semenzato@chromium.org> Date: Wed Apr 05 15:34:20 2017 metrics: avoid overflow from bad sysfs value A zram bug can create very large values for orig_data_size, which causes an overflow. Rather than fixing various kernel versions, we work around the problem here. BUG= chromium:686928 TEST=none Change-Id: If3657091cff5727adf0dd2a15ccfa3f94095cc83 Reviewed-on: https://chromium-review.googlesource.com/439005 Commit-Ready: Luigi Semenzato <semenzato@chromium.org> Tested-by: Luigi Semenzato <semenzato@chromium.org> Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org> (cherry picked from commit f56f45d1b6ec53ed52dc88061d3d6b03ccc4bc15) Reviewed-on: https://chromium-review.googlesource.com/468529 Reviewed-by: Luigi Semenzato <semenzato@chromium.org> Commit-Queue: Luigi Semenzato <semenzato@chromium.org> [modify] https://crrev.com/d48e2fec85d1b87ed5a1202fe2b5ca77798dfe90/metrics/metrics_daemon.cc
,
Apr 5 2017
To recapitulate: no need to merge to 58 (already there) and just merged to 57.
,
Apr 19 2017
Issue 708406 has been merged into this issue.
,
Jul 19 2017
The crash is no longer can be found in crash server. |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by keta...@chromium.org
, Jan 31 2017Status: Assigned (was: Untriaged)