Nyan-big: release build failed with "unreasonably low CPU frequency" |
|||||||||
Issue descriptionRunning on chromeos4-row5-rack9-host7, the build failed due to this error: graphics_Gbm ERROR: Unhandled AssertionError: Unreasonably low CPU frequency. Looking in the messages logfile, we see: 2019-01-09T12:44:56.996563+00:00 ERR kernel: [ 0.687980] /cpus/cpu@0 missing clock-frequency property 2019-01-09T12:44:56.996566+00:00 ERR kernel: [ 0.687988] /cpus/cpu@1 missing clock-frequency property 2019-01-09T12:44:56.996570+00:00 ERR kernel: [ 0.687993] /cpus/cpu@2 missing clock-frequency property 2019-01-09T12:44:56.996573+00:00 ERR kernel: [ 0.687997] /cpus/cpu@3 missing clock-frequency property These may be a red herring though, since I see these messages in other builds that failed for other reasons. Unfortunately logs aren't saved from passing runs. Link to the failed build: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924811486768777968
,
Jan 10
,
Jan 10
,
Jan 10
I took a quick look, the bug is in this function:
def _get_cpufreq_paths(filename, host=None):
"""
Returns a list of paths to the governors.
"""
run_func = host.run if host else run
cmd = 'ls /sys/devices/system/cpu/cpu*/cpufreq/' + filename
try:
paths = run_func(cmd, verbose=False).stdout.splitlines()
except error.CmdError:
return []
return paths
The race is between enumeration of cpufreq files and CPU enable/disable. Basically when we run 'ls /sys/devices/system/cpu/cpu*/cpufreq/' we enumerate all online CPUs. But the set of online CPUs can change over time, so by the time we try to read it in get_cpu_max_frequency():
def get_cpu_max_frequency():
"""
Returns the largest of the max CPU core frequencies. The unit is Hz.
"""
max_frequency = -1
paths = utils._get_cpufreq_paths('cpuinfo_max_freq')
for path in paths:
# Convert from kHz to Hz.
frequency = 1000 * _get_float_from_file(path, 0, None, None)
max_frequency = max(frequency, max_frequency)
# Sanity check.
assert max_frequency > 1e8, 'Unreasonably low CPU frequency.'
return max_frequency
...the file has disappeared and max_frequency is -1. This is why it fails sometimes.
,
Jan 11
There has been about one or two failure per month across all DUTs/jobs. But I agree this is due to the sets of CPUs switching (say from 0&1 active to 2&3 active). https://stainless.corp.google.com/search?view=list&first_date=2018-01-15&last_date=2019-01-11&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&reason=Unreasonably+low+CPU+frequency.&exclude_cts=false&exclude_not_run=false&exclude_non_release=false&exclude_au=false&exclude_acts=false&exclude_retried=false&exclude_non_production=false
,
Jan 11
,
Jan 11
I am not sure I can make the code fully race free. I did not write it with sets of active CPUs changing so rapidly. I assume even a 'cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq' is not fully atomic? It might be better in this case, as I assume it would return at least one result.
,
Jan 15
CQ failed today because of this, in a slightly different form: https://stainless.corp.google.com/browse/chromeos-autotest-results/277008350-chromeos-test/ IOError: [Errno 2] No such file or directory: '/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_max_freq' Simply watching 'nproc' on a nyan-big device shows that we are constantly changing which CPUs are available -- we bump between 1 and 2 CPUs "present" every few seconds: # for i in $(seq 1 10); do nproc; sleep 0.5; done 2 1 2 1 1 2 2 2 1 1 I don't know what kind of crazy CPU governors we have on these systems, but maybe something about the "idle" workload has changed recently? But clearly something should be done differently in the test code. For starters, we probably shouldn't let a disappearing CPU count as a failure. I'd bet you'd get 99% of what you want by just skipping those CPUs for which we fail to read their info. I highly doubt we're actually switching between disjoint sets of CPUs, such that you are ever likely to fail to read *all* CPU's state. I believe we always keep CPU0 on.
,
Jan 15
I didn't work on nyan at all. Perhaps Dylan remembers some of this, though? IIRC, though, on nyan we switch between the 1 tiny CPU and 2 big CPUs. All 3 CPUs can't be active at once, though? ...so I think we actually do switch between two disjoint sets of CPUs...
,
Jan 15
@9: You may be right, but I'm seeing either CPU0 or CPU0+1 online. # for i in $(seq 1 10); do cat /sys/devices/system/cpu/online ; sleep 0.5; done 0-1 0 0-1 0-1 0 0-1 0 0-1 0 0 Does CPU0 actually get hot-swapped somehow?? I see 4 total CPUs reported (possible = 0-3), and the datasheets suggest it's a "4-plus-1", so something seems a little odd here.
,
Jan 15
There are 5 cores on nyan. 4 "big" cores and 1 "low power" core. There is a switch between the two cpu clusters when we scale below some frequency, I believe 200-400 MHz but my memory isn't exact here. you could work around this by forcing the cpu frequency to stay above that rate.
,
Jan 15
I'm not sure this bug is really about a "low frequency" (as long as no core has a max less than 100MHz??) -- it's just that the test code doesn't really know how to deal with constantly-changing number of (or indexing of) CPUs. If we understand the pattern of CPUs appearing/disappearing, then we can make better-educated attempts at fixing this code. A simple question: do we ever expect CPU 0 to go offline? Or, does the cluster-switching always leave *some* valid CPU 0 running?
,
Jan 15
If you force a higher rate, you will prevent the cluster switch. I think that there is always a cpu 0, although the physical core that it points to will change.
,
Jan 15
I don't think we need/want to force anything on the clock side, especially since these changes can sometimes be "sticky"; in other words affect subsequent tests. IMHO we simply need to make the test code more robust against the disappearance of the file (i.e. when the file can't be read we just accept it and move on without failure). After all, the clock bit is just for sanity checking, not a critical part of the test.
,
Jan 15
BTW, I was confused a bit as to why the 'ls' command in #4 would fail. But I guess that's because the shell globbing runs prior to 'ls'. So this fails within a few seconds: # i=0; while : ; do i=$((i+1)); ls /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq > /dev/null || break; sleep 0.2; done ls: cannot access '/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_max_freq': No such file or directory # echo $i 64 That could be easily fixed by making it an 'echo /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq', so we only do a single globbing, which will not produce errors. Anyway, marcheu is probably right and skipping errors is smart (even if, AIUI, we can ensure that we always get at least 1 successful cpuinfo_max_freq).
,
Jan 16
(6 days ago)
I got a little bored: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1415676 https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1415677
,
Jan 16
(6 days ago)
,
Jan 16
(6 days ago)
Thanks Brian for fixing this!
,
Jan 18
(5 days ago)
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/951f357958e218221c1555dede04e896275bbc71 commit 951f357958e218221c1555dede04e896275bbc71 Author: Brian Norris <briannorris@chromium.org> Date: Fri Jan 18 01:26:08 2019 [autotest] utils: improve CPU max freq robustness CPUs can come online/offline at arbitrary times, especially on some Tegra CPUs. This causes problems in utils.get_cpu_max_frequency() as follows: * the 'ls /sys/.../cpu*/' glob might match some set of CPU paths in the shell glob, but those paths might not all be there by the time 'ls' runs, and so 'ls' will totally fail (rather than, say, printing the subset of CPUs that are still there). To fix this, we can just 'echo' the glob result back, and ensure that it matched at least 1 valid path. * when iterating on paths returned from utils._get_cpufreq_paths(), the path may no longer exist, causing IOErrors in _get_float_from_file(). We should just ignore these errors, as long as we get some valid max frequency from some other CPU. With those fixes, we shouldn't see any further failures, unless completely disjoint sets of CPUs show up at different times, causing us to fail to read any single CPU properly. This shouldn't happen, since CPU0 should always exist. CL:1128445 already sorta noticed this problem, but it didn't extend its pattern to existing cpufreq use cases. Additionally, let's add a much clearer exception if a test is trying to retrieve the max CPU frequency on a system where cpufreq isn't supported (i.e., there are no cpufreq paths available). This may have come up on deprecated boards (see chromium:867320)? BUG= chromium:920263 TEST=graphics_Gbm, on nyan-big and others Change-Id: I9014cd48cb2c268001e8bbbb098aed8b4104bd2e Reviewed-on: https://chromium-review.googlesource.com/c/1415676 Reviewed-by: Ilja H. Friedel <ihf@chromium.org> Tested-by: Brian Norris <briannorris@chromium.org> [modify] https://crrev.com/951f357958e218221c1555dede04e896275bbc71/client/common_lib/utils.py [modify] https://crrev.com/951f357958e218221c1555dede04e896275bbc71/client/bin/utils.py
,
Jan 18
(5 days ago)
|
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by drinkcat@chromium.org
, Jan 10Components: -OS>Kernel OS>Kernel>Graphics
Labels: OS-Chrome