Very high memory usage by TaskSchedulerSi (system lockup, OOM killer)
Reported by
eric.rannaud@gmail.com,
Mar 3 2018
Issue description
UserAgent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36

Steps to reproduce the problem:
1. Start top(1), set frequent refreshes to observe processes in real time
2. Open a new tab
3. Load a complex website, e.g. cnn.com

What is the expected behavior? What went wrong?
Many TaskSchedulerSi processes (name as reported by top) are running (8-10) for a while, each with 700MB+ of resident memory. There is also a lot of CPU activity, maxing out all cores on an i7-4750HQ; the load average reaches 10. This is on a machine with 8GB of RAM. This leads to heavy paging, system lockups (for up to a minute), and sometimes the OOM killer gets involved (an example log is included in the attachment; non-chromium processes are filtered out of the kernel report). This machine has /proc/sys/vm/admin_reserve_kbytes set to 131072, which typically helps prevent system lockups during OOM, but it is not effective here.

Did this work before? Yes. Not sure; the behavior was first seen in the last week, I believe. I had never seen TaskSchedulerSi in a process list on this system before. Chromium is typically updated within a day of a new version being available on Arch Linux.

Chrome version: 64.0.3282.186  Channel: stable
OS Version: Arch
Flash Version:
,
Mar 5 2018
@Reporter: Could you please explain step 1 in the above comment, i.e., let us know how to set frequent refreshes in top? If possible, please guide us with a video on how to do that. This info would help in further triaging the issue. Thanks!
,
Mar 5 2018
#2: That's not really important; using top is just a handy way to observe activity on the system, and other tools can be used too (or no tools at all, since the system lockup is pretty obvious). Nevertheless: start top from the command line, press "s", then enter something like ".1" and press ENTER to set the refresh interval.
,
Mar 5 2018
Thank you for providing more feedback. Adding the requester to the cc list. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Mar 5 2018
As an (extreme) workaround, running the following in the background:

    while true; do killall -9 TaskSchedulerSi; sleep .5; done

appears to prevent the worst symptoms, and does not appear to interfere in any obvious way with the browser.
,
Mar 6 2018
Tested the issue on Ubuntu 14.04 using 64.0.3282.186 and it is not reproducible with the steps mentioned above.
1. Launched Chrome, opened a terminal, and ran top -- observed si as 0.0.
2. Navigated to cnn.com, observed si as 0.0, and observed no system lockup.
3. In the terminal pressed "s" and gave 0.1 as input; observed the top process but no system lockup. Attaching a screencast for reference.
As the ET/In-house team doesn't have an Arch Linux machine, could someone from the Internals>TaskScheduler team please have a look at this issue? Hence adding the TE-NeedsTriageHelp label. Thanks!
,
Mar 13 2018
Can you record what's happening via chrome://tracing? If you can privately share a trace with me, I can help take a look.
,
Mar 13 2018
Can you provide us with the command line for this "TaskSchedulerSi" process? In the system monitor, you can right-click on the header of the table and check "Command Line". Thanks.
,
Mar 13 2018
Can you please look up in Chrome's Task Manager what those processes are actually named? I suspect they have nothing to do with TaskScheduler and that this is merely poor naming caused by issue 821543. hubbe@ reported at https://bugs.chromium.org/p/chromium/issues/detail?id=817314#c33 that these may be "Utility: Video Capture Service" processes.
,
Mar 13 2018
I'm unable to reproduce this issue at the moment. I have not seen the problem since upgrading to 65.0.3325.146 a few days ago, so I downgraded to 64.0.3282.186, which I previously had on this machine, but I haven't seen the issue there either, yet. I reduced the swap available on the machine to try to increase memory pressure, but to no avail. I will continue using this session for longer; maybe the problem will appear after some amount of use.
As I look right now, the command lines for the processes reported under the name TaskSchedulerSi are all "/usr/lib/chromium/chromium --disk-cache-dir=/tmp --disk-cache-size=200000000" (options found in my .config/chromium-flags.conf).
They are all direct children of the main chromium process (looking at a forest view). None of them appear in the Task Manager (looking by PID), at the moment anyway.
For reference, as there are many "top"-like programs, I'm using the top that comes from the package procps-ng, if that's relevant to figure out where the names reported come from ("c" toggles displaying the command line, "H" toggles what they call "thread display" which shows names like TaskSchedulerSi).
,
Mar 14 2018
I should note that as I downgraded from 65 to 64, I was unable to use my profile when starting chromium, if that's relevant.
,
Mar 14 2018
+chfremer@ FYI based on #10
,
Mar 14 2018
Assuming the rogue processes are indeed instances of the video capture service, what may help with the repro is to navigate to either https://hangouts.google.com or https://mail.google.com. The hangouts page (or the embedded hangouts in gmail) currently seems to query for video capture devices about every 1-2 seconds. With caching of these requests currently disabled (see Issue 810980), the video capture service starts up to serve these requests.

When I test on my machine, the service process stays alive for 5 seconds and then quits. Then a new process for the service is started with the next incoming request, and so on. There is never supposed to be more than one Video Capture Service process at a time, and the process should also not consume that much memory.

The only thing I can think of that could explain symptoms like this is if something bad happens inside the process that prevents it from quitting while the Chromium service manager has already released it and therefore starts up a new one for the next request.

Please let me know if the above information helps with reproing this. I am attempting to repro on my end as well, but so far without success.
,
Mar 14 2018
One more note: Hangouts seems to do the infinite polling of enumerateDevices() only when there are no video capture devices present. If a camera is present, only 1-2 requests are made and then it stops.
,
Mar 14 2018
I created https://jsfiddle.net/mq218m9q/6/ to simulate polling of enumerateDevices() every 100ms without the need for Hangouts. But so far I am unable to repro.
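For readers without access to the fiddle, a minimal sketch of that kind of polling loop might look like the following (an approximation in TypeScript, not the fiddle's exact contents):

    // Poll enumerateDevices() every 100 ms; each call briefly spins up the
    // browser-side video capture service to answer the device query.
    const POLL_INTERVAL_MS = 100;
    setInterval(() => {
      navigator.mediaDevices.enumerateDevices()
        .then((devices) => console.log(`enumerated ${devices.length} devices`))
        .catch((err) => console.error('enumerateDevices() failed:', err));
    }, POLL_INTERVAL_MS);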
,
Mar 14 2018
There are two separate issues AFAICT:
1. video_capture restarts every 5 seconds if used at regular intervals (it seems unlikely that every client just happens to reconnect exactly 5 seconds after the last one).
2. video_capture instances may hang instead of exiting once disconnected from the service manager.
Are you not able to repro either of them?
,
Mar 14 2018
re #18: I am able to repro 1., but this is the expected behavior. After enumerateDevices() has replied, the caller disconnects from the service, and the service starts a 5 second timer. After 5 seconds, it checks whether or not any client is connected. Since each enumerateDevices() call only takes a very short amount of time, there is almost never a client connected when the timer expires, so the service quits. This may not be the best strategy when getting polled every second; it might be better to cancel the timer when a new connection comes in. But the way it works today is how it was designed.

I am not able to reproduce 2. so far. That is the issue I am trying to track down.
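To make the described shutdown logic concrete, here is a hypothetical, much-simplified TypeScript sketch (class and method names are made up; this is not the actual Chromium C++ code): a service that starts a 5-second timer on disconnect and only inspects the client count at the moment the timer fires.

    class IdleQuitService {
      private connectedClients = 0;
      private idleTimer: ReturnType<typeof setTimeout> | null = null;

      constructor(private quit: () => void) {}

      onClientConnected(): void {
        this.connectedClients++;
        // Note: a pending idle timer is NOT cancelled here.
      }

      onClientDisconnected(): void {
        this.connectedClients--;
        if (this.idleTimer === null) {
          this.idleTimer = setTimeout(() => {
            this.idleTimer = null;
            // Only the instantaneous count matters; connect/disconnect churn
            // during the 5 seconds is invisible. With short-lived
            // enumerateDevices() calls this is almost always 0 when the
            // timer fires, so the service quits even while being polled.
            if (this.connectedClients === 0) this.quit();
          }, 5000);
        }
      }
    }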
,
Mar 14 2018
Indeed, the machine I have seen this on does not have a video capture device available (it does have a camera but Linux has no driver for it). And hangouts.google.com was one of the tabs open every time I experienced this issue. I cannot reproduce at the moment with the jsfiddle in #17.
,
Mar 14 2018
Maybe I've misunderstood. I totally get the logic of the service waiting 5 seconds to die if it's not in use. It was my impression, however, that the service *is* in use at high frequency, so restarting every 5 seconds would imply a flaw in the logic that decides whether or not it's in use. As for issue 2, that does seem much trickier. Possibly it's contingent on driver issues too?
,
Mar 14 2018
re #21: You are right. The service is in use at high frequency, but only for a few milliseconds per request. The logic that decides whether or not it is in use checks only at one exact moment in time (when the 5 second timer expires) whether any client is connected. It does not check whether clients have connected and disconnected during those 5 seconds. I agree that this is not a good strategy when being polled at regular intervals.

Regarding the hang on disconnect/process shutdown, I am not sure what needs to happen for process shutdown to succeed. Does it wait for all threads to exit, so that any hanging thread would block the shutdown? A driver issue might be able to cause this, but it seems unlikely for two reasons:
1. The same issue has been reported on Windows.
2. According to #20, the issue happens with no camera present.
,
Mar 14 2018
I see. Yeah, that seems broken, and it would be trivial to fix that part: just reset the timer every time a new client connects. A hanging thread would block process cleanup, yes.
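Continuing the hypothetical sketch from the earlier comment, the proposed fix would amount to cancelling the pending idle timer whenever a new client connects (again, illustrative TypeScript only, not the actual Chromium change):

    onClientConnected(): void {
      this.connectedClients++;
      if (this.idleTimer !== null) {
        // A new connection means the service is not idle: cancel the timer so
        // a regularly polled service is not torn down between requests.
        clearTimeout(this.idleTimer);
        this.idleTimer = null;
      }
    }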
,
Mar 14 2018
Thanks for these insights. I played around with the code a bit and found that by simulating a hang (via PlatformThread::Sleep()) in the destructor of video_capture::ServiceImpl [1], I can indeed trigger the symptom of new video capture service processes getting spawned while the old ones still hang.
,
Mar 21 2018
Issue 813444 has been merged into this issue.
,
Mar 21 2018
What's the status here? Is anyone working on this? It sounds like a pretty bad user experience. @chfremer per above investigation
,
Apr 2 2018
I haven't been able to reproduce this without intentionally adding code to simulate a hang as described in #14. Based on this and on #12, it seems the issue is not very widespread. We probably won't be able to find out what is causing the hang reported here unless we can reproduce it. With that, I am inclined to close this one as WontFix and reopen it in case we get new information or reports.
,
May 7 2018