New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 878145 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 29
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

mosys is reading /dev/mem and crashing machines

Project Member Reported by diand...@chromium.org, Aug 27

Issue description

Starting on Chrome OS 10998.0.0 we found that cheza boards would constantly reboot themselves.  After much searching, we found that the problem was:

1. "/usr/sbin/mosys" was accessing some memory addresses via "/dev/mem".

2. These accesses were getting trapped by ATF (ARM Trusted Firmware) on our system and directly causing a very hard to debug reboot.


We should certainly make the system easier to debug in this case (and we should make it so userspace can't crash us in this way), but also mosys shouldn't be accessing "/dev/mem" like this.


The current theory is that the problematic CL was <http://crosreview.com/1175110> and possibly the problem was that we now call smbios_find_table() on ARM.
 
Cc: ravisadineni@chromium.org
Somehow the crosreview.com URL isn't linking for me, but you can try:

https://chromium-review.googlesource.com/1175110

---

Also note that the bug to make cheza behave better is b/113294745
BTW: should we revert the CL in question?  I found that I could make mosys happy again with:

  git revert --no-edit 5ec537f409df8534096f228e62e4be1ef6f683a9
  git revert --no-edit d402d8411b3686d23f6f8149843a5cf2ad61f655
Since we were talking about SMBIOS ... wondering why mosys doesn't just read /sys/firmware/dmi/tables/smbios_entry_point and go from there.


5ec537f409df8534096f228e62e4be1ef6f683a9 is certainly not related to this in any way: it only increases the strictness with which C code is error checked at compile time; it doesn't change generated code.

d402d8411b3686d23f6f8149843a5cf2ad61f655 seems extremely unlikely: it's an integration test framework and only slightly modified the run-time mosys code to allow for mocking out hardware interfaces with fakes and most of that is behind #ifdef guards. Unless there's a corrupt pointer dereference somewhere, I don't see how it could be related.

A more likely explanation is that the ATF reboot is non-deterministic and you just happened to not trip it when reverting those two.

@5: 

* The first revert is only needed to make the second revert compile

* The 2nd revert is almost certainly the culprit.  We are now accessing /dev/mem on ARM devices and we weren't before.  The CL may look like a refactor but it does more than just that.  Specifically note that internal_read_identity_info() is now called on ARM devices and that calls a bunch of smbios functions.  Those are the problems.


I'll be posting the reverts now and we can just re-land stuff later.
Owner: diand...@chromium.org
Status: Started (was: Untriaged)
remote:   https://chromium-review.googlesource.com/c/chromiumos/platform/mosys/+/1194283 Revert "Adds a few clang flags to mosys meson.build file"        
remote:   https://chromium-review.googlesource.com/c/chromiumos/platform/mosys/+/1194284 Revert "mosys: Add integration test"        

Cc: -benchan@chromium.org adurbin@chromium.org
Thanks for the reverts.

I'll work on relanding these today and sorry for the long debugging session. As a follow-up, can you add an Autotest or Tast test that trips the ATF code path? What were your repro steps?

Re #4: like many things in Mosys, it's a result of the tragedy of the commons: Mosys was effectively unmaintained for two years so folks did the absolute minimum amount of work to make a platform work. Case in point: the CL that caused this was adding the first integration testing framework on Mosys; none existed before. At the current staffing level, it will take years to undo that damage and the poor architectural decisions that predate the current maintainers. We're currently discussing how to get more resources on Mosys. In the meantime, if you'd like to send a CL to change it, please feel encouraged to do so.

There are certain invariants that should have been held in any change that goes into mosys. In this particular case: non-x86 devices do not use smbios so no smbios paths should ever be taken on such systems. I think it's important to internalize the differences in the platform behavior before proceeding with large changes w/o understanding the requirements. Feel free to reach out on platform behaviors and expectations for future work.
> Thanks for the reverts.

They are still winding their way through the CQ.


> What were your repro steps?

Boot cheza and watch device reboot over and over again with no indication of why.


> can you add an Autotest or Tast test that trips the ATF code path

What are we trying to get at here?  Right now when you access a bad location in /dev/mem the whole machine reboots.  This is not desired behavior and we're trying to improve it.  See b/113294745.  ...but even if the whole machine shouldn't reboot that doesn't mean that mosys should be reading / writing random memory.  Ideally userspace shouldn't use "/dev/mem" at all and should prefer official kernel interfaces (see Guenter's comment).  If userspace uses "/dev/mem" it should be for debugging purposes.  There was some discussion that perhaps we should start working towards removing "/dev/mem" on production systems.
Yea, that was a big mistake. In this case, the original change didn't contain this error but it was introduced during a later round of refactoring requested in code review. Mistakes happen; we need hardware tests to prevent this from happening again.
I'd rather sit on re-landing these changes and put effort into hardening the general test coverage around mosys beforehand.

We can addresses this with more targeted testing over mosys versus upstream coverage.
> we need hardware tests to prevent this from happening again

Yes, cheza isn't running HW tests in the lab yet.  :(  We're working towards it.
> Boot cheza and watch device reboot over and over again with no indication of why.

What Mosys invocation tripped it? Just the initial platform probe, right?

> What are we trying to get at here?  Right now when you access a bad location in /dev/mem the whole machine reboots.  This is not desired behavior and we're trying to improve it.  See b/113294745.  ...but even if the whole machine shouldn't reboot that doesn't mean that mosys should be reading / writing random memory.  Ideally userspace shouldn't use "/dev/mem" at all and should prefer official kernel interfaces (see Guenter's comment).  If userspace uses "/dev/mem" it should be for debugging purposes.  There was some discussion that perhaps we should start working towards removing "/dev/mem" on production systems.

We're trying to prevent this from happening again. Mosys does many, many terrible things. This is just one example in one code path. If we have hardware test coverage, this kind of error wouldn't land in the future.

> All I needed to do was type "mosys" on the command line and it would trip it.

What Mosys invocation tripped it? Just the initial platform probe, right?

---

> If we have hardware test coverage, this kind of error wouldn't land in the future.

I think the root cause of why the really bad behavior wasn't caught is that cheza isn't to the point where it's running hw test in the lab.

...however, even on other Chromebooks it's undesirable for mosys to access "/dev/mem" like this.  It doesn't cause a crash but it's still not so great.  Someone could audit mosys and see if it ever should access "/dev/mem" on ARM.  If it shouldn't then perhaps we could simply disable that code path in mosys.  If we wanted to do more than this we should kick off a task to see if we can stop providing access to /dev/mem.
I'll take on those tasks ... auditing mosys and basic test coverage.

I can also reland the triggering changes once those two bits have been completed.


There is currently no unibuild + arm BVT in CQ. Since this also repros on scarlet, enabling that platform in CQ would catch the issue.
I'll can start with that.

It looks like we have 10 scarlets in the lab marked for CQ.


> Since this also repros on scarlet

Does it cause a reboot on scarlet or just crash mosys?  I wouldn't really expect a reboot but I haven't tested.
mosys just crashes. But I assume BVT would catch it.
if it doesn't, i'll add an autotest to make sure it does

for now, getting arm coverage seems correct regardless
> Since this also repros on scarlet, enabling that platform in CQ would catch the issue.

Indeed it does. Sometimes mosys just crashes with a bus error, but sometimes it triggers a system crash.

Somebody needs to revisit this arbitrary change if they want Scarlet to get any testing:

https://chromium-review.googlesource.com/1135997
Bug 878863 to track why the heck Scarlet still isn't getting HW tests.
Project Member

Comment 25 by bugdroid1@chromium.org, Aug 29

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/mosys/+/13fb1aac8dec52f72158a4fbd2919a3c16bd44c0

commit 13fb1aac8dec52f72158a4fbd2919a3c16bd44c0
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Aug 29 18:15:15 2018

Revert "Adds a few clang flags to mosys meson.build file"

This reverts commit 5ec537f409df8534096f228e62e4be1ef6f683a9.

This commit is needed to make the code compile after the next revert.

BUG= chromium:878145 ,  chromium:876106 
TEST=mosys is no longer accessing /dev/mem

Fixes: d402d8411b36 ("mosys: Add integration test")
Change-Id: I9d87c746478dc2ab7aa26d8906972923a10b4056
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1194283
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Evan Green <evgreen@chromium.org>

[modify] https://crrev.com/13fb1aac8dec52f72158a4fbd2919a3c16bd44c0/meson.build

Project Member

Comment 26 by bugdroid1@chromium.org, Aug 29

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/mosys/+/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e

commit d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Aug 29 18:15:16 2018

Revert "mosys: Add integration test"

This reverts commit d402d8411b3686d23f6f8149843a5cf2ad61f655.

By putting a printout in the kernel in
"arch/arm64/mm/mmap.c:devmem_is_allowed()" it can be seen that
"/dev/mem" is being accessed.  Reverting this change stops the bad
accesses.

It is believed that the problem is that the bad commit started calling
internal_read_identity_info() on ARM devices.  This calls
smbios_sysinfo_get_name().  That calls smbios_find_table().  If you
keep digging, you'll find that this eventually translates into reads
of address ranges in "/dev/mem".  It's not a good idea to do this.

This commit can be re-landed when it can be confirmed that random
/dev/mem accesses are no longer generated by it on ARM or ARM64
devices.

BUG= chromium:878145 
TEST=mosys is no longer accessing /dev/mem

Fixes: d402d8411b36 ("mosys: Add integration test")
Change-Id: Iabb27a9dd5e2a8e651302a43373eea19cfb110f9
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1194284
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Evan Green <evgreen@chromium.org>

[delete] https://crrev.com/13fb1aac8dec52f72158a4fbd2919a3c16bd44c0/platform/dummy/eeprom.c
[modify] https://crrev.com/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e/platform/dummy/dummy.c
[modify] https://crrev.com/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e/src/lib.rs
[modify] https://crrev.com/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e/src/main.rs
[modify] https://crrev.com/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e/platform/dummy/dummy.h
[delete] https://crrev.com/13fb1aac8dec52f72158a4fbd2919a3c16bd44c0/platform/dummy/sys.c
[delete] https://crrev.com/13fb1aac8dec52f72158a4fbd2919a3c16bd44c0/platform/dummy/ec.c
[modify] https://crrev.com/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e/meson.build
[modify] https://crrev.com/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e/lib/cros_config/meson.build
[modify] https://crrev.com/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e/lib/cros_config/cros_config.c
[modify] https://crrev.com/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e/lib/cros_config/config.dtb.S
[modify] https://crrev.com/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e/meson_options.txt
[modify] https://crrev.com/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e/platform/dummy/meson.build
[modify] https://crrev.com/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e/include/lib/cros_config.h
[delete] https://crrev.com/13fb1aac8dec52f72158a4fbd2919a3c16bd44c0/lib/cros_config/dummy_cros_config_data.c
[delete] https://crrev.com/13fb1aac8dec52f72158a4fbd2919a3c16bd44c0/tests/functional_test.rs
[modify] https://crrev.com/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e/lib/cros_config/cros_config_struct.c
[modify] https://crrev.com/d1ae3d1ff9aaf7ce38aa2055c26d2b58e4a9a50e/include/lib/cros_config_struct.h

Status: Fixed (was: Started)
OK, it's no longer reading /dev/mem so this is fixed.  I assume someone will open a new bug for re-landing the changes?  The original change had BUG=None so we can't just re-open the old bug...
We can use issue 795887 to re-land. I'll update it.

Sign in to add a comment