New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 920263 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 18
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



Sign in to add a comment

Nyan-big: release build failed with "unreasonably low CPU frequency"

Project Member Reported by martinroth@google.com, Jan 9

Issue description

Running on chromeos4-row5-rack9-host7, the build failed due to this error:
graphics_Gbm ERROR: Unhandled AssertionError: Unreasonably low CPU frequency.

Looking in the messages logfile, we see:
2019-01-09T12:44:56.996563+00:00 ERR kernel: [    0.687980] /cpus/cpu@0 missing clock-frequency property
2019-01-09T12:44:56.996566+00:00 ERR kernel: [    0.687988] /cpus/cpu@1 missing clock-frequency property
2019-01-09T12:44:56.996570+00:00 ERR kernel: [    0.687993] /cpus/cpu@2 missing clock-frequency property
2019-01-09T12:44:56.996573+00:00 ERR kernel: [    0.687997] /cpus/cpu@3 missing clock-frequency property
These may be a red herring though, since I see these messages in other builds that failed for other reasons.  Unfortunately logs aren't saved from passing runs.


Link to the failed build:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924811486768777968


 
Labels: Gfx-Guard M-73
Owner: ihf@chromium.org
Blocking: 920552
I took a quick look, the bug is in this function:

def _get_cpufreq_paths(filename, host=None):
    """
    Returns a list of paths to the governors.
    """
    run_func = host.run if host else run
    cmd = 'ls /sys/devices/system/cpu/cpu*/cpufreq/' + filename
    try:
        paths = run_func(cmd, verbose=False).stdout.splitlines()
    except error.CmdError:
        return []
    return paths


The race is between enumeration of cpufreq files and CPU enable/disable. Basically when we run 'ls /sys/devices/system/cpu/cpu*/cpufreq/' we enumerate all online CPUs. But the set of online CPUs can change over time, so by the time we try to read it in get_cpu_max_frequency():

def get_cpu_max_frequency():
    """
    Returns the largest of the max CPU core frequencies. The unit is Hz.
    """
    max_frequency = -1
    paths = utils._get_cpufreq_paths('cpuinfo_max_freq')
    for path in paths:
        # Convert from kHz to Hz.
        frequency = 1000 * _get_float_from_file(path, 0, None, None)
        max_frequency = max(frequency, max_frequency)
    # Sanity check.
    assert max_frequency > 1e8, 'Unreasonably low CPU frequency.'
    return max_frequency

...the file has disappeared and max_frequency is -1. This is why it fails sometimes.
Blocking: -920552
I am not sure I can make the code fully race free. I did not write it with sets of active CPUs changing so rapidly. I assume even a 'cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq' is not fully atomic? It might be better in this case, as I assume it would return at least one result.
Cc: diand...@chromium.org sonnyrao@chromium.org
CQ failed today because of this, in a slightly different form:

https://stainless.corp.google.com/browse/chromeos-autotest-results/277008350-chromeos-test/

IOError: [Errno 2] No such file or directory: '/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_max_freq'

Simply watching 'nproc' on a nyan-big device shows that we are constantly changing which CPUs are available -- we bump between 1 and 2 CPUs "present" every few seconds:

# for i in $(seq 1 10); do nproc; sleep 0.5; done
2
1
2
1
1
2
2
2
1
1

I don't know what kind of crazy CPU governors we have on these systems, but maybe something about the "idle" workload has changed recently? But clearly something should be done differently in the test code. For starters, we probably shouldn't let a disappearing CPU count as a failure. I'd bet you'd get 99% of what you want by just skipping those CPUs for which we fail to read their info. I highly doubt we're actually switching between disjoint sets of CPUs, such that you are ever likely to fail to read *all* CPU's state. I believe we always keep CPU0 on.
Cc: dgreid@chromium.org
I didn't work on nyan at all.  Perhaps Dylan remembers some of this, though?

IIRC, though, on nyan we switch between the 1 tiny CPU and 2 big CPUs.  All 3 CPUs can't be active at once, though?  ...so I think we actually do switch between two disjoint sets of CPUs...
@9: You may be right, but I'm seeing either CPU0 or CPU0+1 online.

# for i in $(seq 1 10); do cat /sys/devices/system/cpu/online ; sleep 0.5; done
0-1
0
0-1
0-1
0
0-1
0
0-1
0
0

Does CPU0 actually get hot-swapped somehow?? I see 4 total CPUs reported (possible = 0-3), and the datasheets suggest it's a "4-plus-1", so something seems a little odd here.
There are 5 cores on nyan. 4 "big" cores and 1 "low power" core. There is a switch between the two cpu clusters when we scale below some frequency, I believe 200-400 MHz but my memory isn't exact here.

you could work around this by forcing the cpu frequency to stay above that rate.
I'm not sure this bug is really about a "low frequency" (as long as no core has a max less than 100MHz??) -- it's just that the test code doesn't really know how to deal with constantly-changing number of (or indexing of) CPUs. If we understand the pattern of CPUs appearing/disappearing, then we can make better-educated attempts at fixing this code.

A simple question: do we ever expect CPU 0 to go offline? Or, does the cluster-switching always leave *some* valid CPU 0 running?
If you force a higher rate, you will prevent the cluster switch.

I think that there is always a cpu 0, although the physical core that it points to will change.
I don't think we need/want to force anything on the clock side, especially since these changes can sometimes be "sticky"; in other words affect subsequent tests.

IMHO we simply need to make the test code more robust against the disappearance of the file (i.e. when the file can't be read we just accept it and move on without failure). After all, the clock bit is just for sanity checking, not a critical part of the test.
BTW, I was confused a bit as to why the 'ls' command in #4 would fail. But I guess that's because the shell globbing runs prior to 'ls'. So this fails within a few seconds:

# i=0; while : ; do i=$((i+1)); ls /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq > /dev/null || break; sleep 0.2; done
ls: cannot access '/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_max_freq': No such file or directory
# echo $i
64


That could be easily fixed by making it an 'echo /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq', so we only do a single globbing, which will not produce errors.

Anyway, marcheu is probably right and skipping errors is smart (even if, AIUI, we can ensure that we always get at least 1 successful cpuinfo_max_freq).

Comment 17 by briannorris@chromium.org, Jan 16 (6 days ago)

Owner: briannorris@chromium.org
Status: Started

Comment 18 by marcheu@google.com, Jan 16 (6 days ago)

Thanks Brian for fixing this!
Project Member

Comment 19 by bugdroid1@chromium.org, Jan 18 (5 days ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/951f357958e218221c1555dede04e896275bbc71

commit 951f357958e218221c1555dede04e896275bbc71
Author: Brian Norris <briannorris@chromium.org>
Date: Fri Jan 18 01:26:08 2019

[autotest] utils: improve CPU max freq robustness

CPUs can come online/offline at arbitrary times, especially on some
Tegra CPUs. This causes problems in utils.get_cpu_max_frequency() as
follows:

 * the 'ls /sys/.../cpu*/' glob might match some set of CPU paths in the
   shell glob, but those paths might not all be there by the time 'ls'
   runs, and so 'ls' will totally fail (rather than, say, printing the
   subset of CPUs that are still there). To fix this, we can just 'echo'
   the glob result back, and ensure that it matched at least 1 valid path.
 * when iterating on paths returned from utils._get_cpufreq_paths(), the
   path may no longer exist, causing IOErrors in _get_float_from_file().
   We should just ignore these errors, as long as we get some valid max
   frequency from some other CPU.

With those fixes, we shouldn't see any further failures, unless
completely disjoint sets of CPUs show up at different times, causing us
to fail to read any single CPU properly. This shouldn't happen, since
CPU0 should always exist.

CL:1128445 already sorta noticed this problem, but it didn't extend its
pattern to existing cpufreq use cases.

Additionally, let's add a much clearer exception if a test is trying to
retrieve the max CPU frequency on a system where cpufreq isn't
supported (i.e., there are no cpufreq paths available). This may have
come up on deprecated boards (see chromium:867320)?

BUG= chromium:920263 
TEST=graphics_Gbm, on nyan-big and others

Change-Id: I9014cd48cb2c268001e8bbbb098aed8b4104bd2e
Reviewed-on: https://chromium-review.googlesource.com/c/1415676
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Brian Norris <briannorris@chromium.org>

[modify] https://crrev.com/951f357958e218221c1555dede04e896275bbc71/client/common_lib/utils.py
[modify] https://crrev.com/951f357958e218221c1555dede04e896275bbc71/client/bin/utils.py

Comment 20 by briannorris@chromium.org, Jan 18 (5 days ago)

Status: Fixed (was: Started)

Sign in to add a comment