New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 686928 link

Starred by 2 users

Issue metadata

Status: Verified
Owner:
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Crash at MetricsDaemon::ReportZram-ed8aa3dc

Project Member Reported by keta...@chromium.org, Jan 30 2017

Issue description

Chrome Version: 9202.1.0
OS: ChromeOS 
Magic SignatureMetricsDaemon::ReportZram

Sample crash report ID: f6d96f6680000000
Minidump: attached.

Thread 0 CRASHED [SIGFPE @ 0x00000000 ] MAGIC SIGNATURE THREAD

Stack Quality88%Show frame trust levels
0x00005ee2c8a76a7d	(metrics_daemon -metrics_daemon.cc:868 )	MetricsDaemon::ReportZram(base::FilePath const&)
0x00005ee2c8a76142	(metrics_daemon -metrics_daemon.cc:820 )	MetricsDaemon::MeminfoCallback(base::TimeDelta)
0x000074cdeb566ab8	(libbase-core-395517.so -callback.h:397 )	base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask const&)
0x000074cdeb582a68	(libbase-core-395517.so -message_loop.cc:478 )	base::MessageLoop::RunTask(base::PendingTask const&)
0x000074cdeb582d8a	(libbase-core-395517.so -message_loop.cc:487 )	base::MessageLoop::DeferOrRunPendingTask(base::PendingTask const&)
0x000074cdeb583340	(libbase-core-395517.so -message_loop.cc:642 )	base::MessageLoop::DoDelayedWork(base::TimeTicks*)
0x000074cdeb585c8c	(libbase-core-395517.so -message_pump_libevent.cc:229 )	base::MessagePumpLibevent::Run(base::MessagePump::Delegate*)
0x000074cdeb5a7694	(libbase-core-395517.so -run_loop.cc:35 )	base::RunLoop::Run()
0x000074cdeb7ac9e5	(libbrillo-core-395517.so -base_message_loop.cc:212 )	brillo::BaseMessageLoop::Run()
0x000074cdeb778455	(libbrillo-core-395517.so -daemon.cc:29 )	brillo::Daemon::Run()
0x00005ee2c8a71b4e	(metrics_daemon -metrics_daemon.cc:191 )	MetricsDaemon::Run()
0x00005ee2c8a70e5e	(metrics_daemon -metrics_daemon_main.cc:101 )	main
0x000074cdeac23795	(libc-2.23.so -libc-start.c:289 )	__libc_start_main
0x00005ee2c8a70338	(metrics_daemon + 0x0000a338 )	_start
0x00007ffde86c6027		
0x00005ee2c8a7030f	(metrics_daemon + 0x0000a30f )	
 
upload_file_minidump-f6d96f6680000000.dmp
46.4 KB Download
Owner: semenzato@chromium.org
Status: Assigned (was: Untriaged)
Luigi, would you know who should look at this one?
I'll take a look, thanks.
The exception happens at the second-to-last line where we divide by compr_data_size.  It looks like something is overflowing (SIGFPE).  That is weird though.

orig_data_size is provided by two calls to ReadFileToUint64 from sysfs entries.  If the reads or the conversions to int fail, this code is not executed.  Thus, orig_data_size would have to be very big to overflow when multiplied by 100.  More precisely, it would have to be over 2^57.  It is conceivable that a kernel bug could place unreasonable values in the sysfs entries.

The division by compr_data_size cannot hurt, because compr_data_size_mb > 1, thus compr_data_size > 2^20.

Ketaki, is this a one-off, or are we seeing multiple instances?


  // |orig_data_size| does not include zero-filled pages.
  orig_data_size += zero_pages * page_size;

  const int compr_data_size_mb = compr_data_size >> 20;
  const int savings_mb = (orig_data_size - compr_data_size) >> 20;
  const int zero_ratio_percent = zero_pages * page_size * 100 / orig_data_size;

  // Report compressed size in megabytes.  100 MB or less has little impact.
  SendSample("Platform.ZramCompressedSize", compr_data_size_mb, 100, 4000, 50);
  SendSample("Platform.ZramSavings", savings_mb, 100, 4000, 50);
  // The compression ratio is multiplied by 100 for better resolution.  The
  // ratios of interest are between 1 and 6 (100% and 600% as reported).  We
  // don't want samples when very little memory is being compressed.
  if (compr_data_size_mb >= 1) {
    SendSample("Platform.ZramCompressionRatioPercent",
               orig_data_size * 100 / compr_data_size, 100, 600, 50);
  }
Cc: minc...@kernel.org
Components: OS>Kernel
Looking at kernel.

CHROMEOS_RELEASE_BUILDER_PATH=gandof-release/R57-9202.1.0
using kernel 3.14

values into sysfs are generated by these:

	return sprintf(buf, "%u\n", atomic_read(&zram->stats.pages_zero));

	return sprintf(buf, "%llu\n",
		(u64)(atomic_read(&zram->stats.pages_stored)) << PAGE_SHIFT);

pages_zero is atomically incremented/decremented every time a page is allocated/freed.

pages_stored is treated similarly.  VERY unlikely that there are bugs here... except... mmm... what if one of them goes negative...

Let's see if Minchan has ideas/suggestions.








Hi,

It's v3.14 which is really old so it might have bug. However, I never heard such problem until now. Maybe, there is no one to test zram stat like chromium.
Thanks for the notice, Luigi.

At a first glance for v3.14, it seems to have race pages_{stored|zeroed} updating. For example, zram_bvec_write doesn't increase pages_stored yet but others on other CPU can free the page via zram_free_page by swap_slot_free_notify. In that case, pages_stored can be negative temporally and sysfs read can hit it.

It would be really rare but is possible theoretically.
stats.pages_stored increased side should be protected by tb_lock, too.
pages_zero is same.

Thank you very much for your prompt reply!

I might fix this by making the metrics_daemon code tolerate bad sysfs values. It is an easier fix than managing the kernel patches.

Talk to you soon

Cc: keta...@chromium.org
Ketaki, is this a common occurrence?  If it's rare we don't need to fix it, because the metrics daemon will restart itself, and occasionally missing one set of samples is not a big deal.

this is trending higher yes. So not one off.
Thanks Luigi!
Project Member

Comment 12 by bugdroid1@chromium.org, Feb 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/f56f45d1b6ec53ed52dc88061d3d6b03ccc4bc15

commit f56f45d1b6ec53ed52dc88061d3d6b03ccc4bc15
Author: Luigi Semenzato <semenzato@chromium.org>
Date: Wed Feb 08 06:27:06 2017

metrics: avoid overflow from bad sysfs value

A zram bug can create very large values for orig_data_size,
which causes an overflow.  Rather than fixing various kernel
versions, we work around the problem here.

BUG= chromium:686928 
TEST=none

Change-Id: If3657091cff5727adf0dd2a15ccfa3f94095cc83
Reviewed-on: https://chromium-review.googlesource.com/439005
Commit-Ready: Luigi Semenzato <semenzato@chromium.org>
Tested-by: Luigi Semenzato <semenzato@chromium.org>
Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org>

[modify] https://crrev.com/f56f45d1b6ec53ed52dc88061d3d6b03ccc4bc15/metrics/metrics_daemon.cc

Cc: semenzato@chromium.org snanda@chromium.org
 Issue 692773  has been merged into this issue.

Comment 14 by roy...@google.com, Mar 14 2017

Could this bug result in this behavior ? https://goo.gl/photos/aSucBbkmrHQfc6ar7 
We have a customer reporting multiple devices are impacted buy this.
Status: Fixed (was: Assigned)
I put this on ToT on 2/8 and never ported it to R57, sorry.

Is it worth merging now in case we do another R57 release?  The crash is noisy, but has very little consequences (we lose a small number of 0 samples).  Otherwise it will be in R58, which branched in late February.

I will call this fixed, but feel free to change it.
Given the high crash rate for this signature I would say let us get this into 58 first and see if the crash rate there reduces. Then take it in stable by next monday. sg?
Labels: Merge-Request-58
OK good plan.  I think we should also merge https://chromium-review.googlesource.com/c/442628 for  issue 692271 .
Project Member

Comment 19 by sheriffbot@chromium.org, Apr 1 2017

Labels: -Merge-Request-58 Hotlist-Merge-Approved Merge-Approved-58
Your change meets the bar and is auto-approved for M58. Please go ahead and merge the CL to branch 3029 manually. Please contact milestone owner if you have questions.
Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
My mistake---this was already merged.
Project Member

Comment 21 by sheriffbot@chromium.org, Apr 5 2017

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Project Member

Comment 22 by bugdroid1@chromium.org, Apr 5 2017

Labels: merge-merged-release-R57-9202.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/d48e2fec85d1b87ed5a1202fe2b5ca77798dfe90

commit d48e2fec85d1b87ed5a1202fe2b5ca77798dfe90
Author: Luigi Semenzato <semenzato@chromium.org>
Date: Wed Apr 05 15:34:20 2017

metrics: avoid overflow from bad sysfs value

A zram bug can create very large values for orig_data_size,
which causes an overflow.  Rather than fixing various kernel
versions, we work around the problem here.

BUG= chromium:686928 
TEST=none

Change-Id: If3657091cff5727adf0dd2a15ccfa3f94095cc83
Reviewed-on: https://chromium-review.googlesource.com/439005
Commit-Ready: Luigi Semenzato <semenzato@chromium.org>
Tested-by: Luigi Semenzato <semenzato@chromium.org>
Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org>
(cherry picked from commit f56f45d1b6ec53ed52dc88061d3d6b03ccc4bc15)
Reviewed-on: https://chromium-review.googlesource.com/468529
Reviewed-by: Luigi Semenzato <semenzato@chromium.org>
Commit-Queue: Luigi Semenzato <semenzato@chromium.org>

[modify] https://crrev.com/d48e2fec85d1b87ed5a1202fe2b5ca77798dfe90/metrics/metrics_daemon.cc

Labels: -Merge-Approved-58 Merge-Merged
To recapitulate: no need to merge to 58 (already there) and just merged to 57.

Comment 24 by ka...@chromium.org, Apr 19 2017

Issue 708406 has been merged into this issue.
Status: Verified (was: Fixed)
The crash is no longer can be found in crash server.

Sign in to add a comment