Native / Java task ordering is random
Issue description

In issue 787647 I found that sometimes a native task posted after a Java task runs before the Java task. At first I suspected that this was because native tasks post asynchronous Java messages (Message.setAsynchronous()), but to my surprise the issue persisted even after I commented out the setAsynchronous() call. In the end I found that the problem is how delayed tasks are scheduled:

1. IncomingTaskQueue::AddToIncomingQueue() is called for both delayed and non-delayed tasks.
2. AddToIncomingQueue() calls IncomingTaskQueue::PostPendingTask(), which calls MessagePumpForUI::ScheduleWork() for each task posted to TYPE_UI / TYPE_JAVA message loops, since IncomingTaskQueue::always_schedule_work_ is true for them.
3. MessageLoop::DoWork() moves delayed tasks to the delayed queue until it finds a non-delayed task, which it then runs. In some cases ScheduleDelayedWork() is called during that process, but that's irrelevant here.

Imagine this sequence of tasks:

1. Native posts delayed task A; ScheduleWork() posts Java message mA. Java looper tasks: [mA].
2. Java posts task B. Java looper tasks: [mA, B].
3. Native posts task C; ScheduleWork() posts Java message mC. Java looper tasks: [mA, B, mC].

Now, when mA is handled, task A gets moved to the delayed queue, and task C runs before B, even though it was posted after. Since it's not deterministic when and how many delayed tasks are run, we end up with effectively random ordering between Java and native tasks. One example of that is issue 787647.

Another scenario that breaks ordering is canceled tasks. MessageLoop::DoWork() skips all IsCanceled() tasks, i.e. those weakly bound to deleted pointers. So step 1 above can be substituted with "post native task A weakly bound to ptr, delete ptr".

I think we need to accept that native / Java task ordering is not guaranteed, and rework our code accordingly: we probably don't need to call ScheduleWork() for all tasks, we can run multiple native tasks per Java message, etc.
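The sequence above can be condensed into a small self-contained sketch (hypothetical names, not the actual MessageLoop code). The Java looper queue holds both Java tasks and the wake-up messages posted by ScheduleWork(); handling a wake-up pulls native tasks off the incoming queue, skipping delayed ones, until a non-delayed task runs:

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Minimal model of the interleaving described above (names are made up).
struct NativeTask {
  std::string name;
  bool delayed;
};

// Runs the Java looper to completion and records the order in which real
// tasks (not wake-up messages) execute. Messages starting with 'm' are
// wake-ups from ScheduleWork(); each one makes DoWork() move delayed
// native tasks aside until it finds and runs one non-delayed task,
// mirroring step 3 above.
std::vector<std::string> RunLooper(std::deque<std::string> looper,
                                   std::deque<NativeTask> incoming) {
  std::vector<std::string> ran;
  while (!looper.empty()) {
    std::string msg = looper.front();
    looper.pop_front();
    if (msg[0] != 'm') {  // Plain Java task: just run it.
      ran.push_back(msg);
      continue;
    }
    // Wake-up: DoWork() skips delayed tasks and runs the first
    // non-delayed native task it finds.
    while (!incoming.empty()) {
      NativeTask task = incoming.front();
      incoming.pop_front();
      if (!task.delayed) {  // Delayed tasks only get re-queued.
        ran.push_back(task.name);
        break;
      }
    }
  }
  return ran;
}
```

With looper [mA, B, mC] and native tasks [A (delayed), C], the resulting run order is [C, B]: handling mA consumes delayed A and then runs C, so C jumps ahead of B even though it was posted later.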
Nov 23 2017
Assuming ordering between delayed and non-delayed tasks? That's pretty sketchy on a regular message loop too. Hopefully nothing relies on that ordering.

> Another concern is that I feel there are cases where the number of unprocessed messages posted from ScheduleWork() grows over time, affecting memory usage.

Hmm, I'm curious what causes that exactly? Are delayed tasks causing more of these messages than strictly necessary?
Nov 23 2017
The issue is that delayed tasks mess up the ordering of non-delayed native / Java tasks. Some comments in the code seem to imply that we wanted to guarantee some Java / native ordering, but it's effectively random, so we can probably simplify some code. Also, we should not call ScheduleWork() for delayed tasks.
Nov 23 2017
I did find it pretty confusing that posting a delayed task calls ScheduleWork(), and (usually) also ScheduleDelayedWork(). It seems like ScheduleDelayedWork() is assumed to always be called from the message pump's thread. When you say we shouldn't call ScheduleWork() for delayed tasks, are you implying we should change that assumption and call ScheduleDelayedWork() directly on the calling thread?

In my CL (https://chromium-review.googlesource.com/c/chromium/src/+/751322) I did find that a few things depended on Java/native ordering and are currently racy. There aren't many, though, so I'll be fixing those regardless of whether or not my CL goes through.

The other thing my CL does is make always_schedule_work_ nearly free: it only needs to increment an atomic integer for each task posted after the first. It would be easy enough to implement this without always_schedule_work_ and save a very small amount of time. Using the native looper API we can also avoid blocking native tasks on all of the Java tasks in the queue, if we do want to run multiple native tasks each time the looper calls us back.
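The "nearly free" idea above can be sketched roughly like this (assumed shape, not the actual CL): only the post that takes the pending-task count from 0 to 1 needs to call ScheduleWork(); every other post just bumps an atomic counter, and the pump resets it when it drains the queue.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sketch of the always_schedule_work_ optimization: an
// atomic pending-task count shared between posting threads and the pump.
class PendingTaskCounter {
 public:
  // Returns true iff the caller posted the first task while the queue
  // was empty, i.e. only this caller must call ScheduleWork().
  bool OnTaskPosted() {
    return pending_.fetch_add(1, std::memory_order_release) == 0;
  }

  // Called by the pump when it drains the incoming queue; returns the
  // number of tasks that had been posted since the last drain.
  int OnQueueDrained() {
    return pending_.exchange(0, std::memory_order_acquire);
  }

 private:
  std::atomic<int> pending_{0};
};
```

Every post after the first is a single uncontended atomic increment, which is why the overhead becomes negligible compared to waking the looper for each task.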
Feb 9 2018
Jul 10
With https://bugs.chromium.org/p/chromium/issues/detail?id=780100, we no longer call ScheduleWork() for all tasks. We now run all of the available native tasks before returning control to the Java looper to run Java tasks. I'm not sure there's anything actionable left in this bug? Maybe we should close it? I've also heard there's ongoing work to post critical Java tasks natively to get them to run earlier with proper ordering/priority.
Jul 10
> We now run all of the available native tasks before returning control to the java looper to run java tasks.

That's probably not desirable. This was actually the original behavior, and we had to change it so that native and Java tasks are fair, so that Java tasks are not penalized when there are a lot of native ones, causing performance issues. This was more important for WebView, where the UI thread is shared with the app, but I don't know how much Java stuff runs in Clank these days.
Jul 10
We're currently looking at ways to make C++/Java task ordering somewhat fairer. The current thinking is that we'll treat C++ and Java identically (as in, they end up in the same FIFO) and apply normal task prioritization logic between them. The exception is non-Clank Java work which we can't easily redirect to the scheduler (short of class loader hacks), so we may need to keep round-robining them with other work. Details TBD.
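The "same FIFO with normal prioritization" idea could look roughly like this (an assumed sketch, not the actual scheduler design): every task, C++ or Java, gets a priority plus a monotonic sequence number, and the queue orders by priority first and posting order within a priority.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Hypothetical unified queue treating C++ and Java tasks identically.
enum class Priority { kHigh = 0, kNormal = 1 };

struct Task {
  Priority priority;
  uint64_t sequence;  // Monotonic post order; gives FIFO within a priority.
  std::string name;
};

// std::priority_queue surfaces the "largest" element, so this comparator
// orders so that lower priority values and lower sequence numbers win.
struct TaskOrder {
  bool operator()(const Task& a, const Task& b) const {
    if (a.priority != b.priority)
      return a.priority > b.priority;  // kHigh (0) runs before kNormal (1).
    return a.sequence > b.sequence;    // Earlier post runs first.
  }
};

class UnifiedQueue {
 public:
  void Post(Priority p, const std::string& name) {
    queue_.push({p, next_sequence_++, name});
  }

  // Pops everything and returns the execution order.
  std::vector<std::string> Drain() {
    std::vector<std::string> order;
    while (!queue_.empty()) {
      order.push_back(queue_.top().name);
      queue_.pop();
    }
    return order;
  }

 private:
  uint64_t next_sequence_ = 0;
  std::priority_queue<Task, std::vector<Task>, TaskOrder> queue_;
};
```

Within kNormal, a Java task posted before a C++ task still runs first, so the random interleaving from the original report disappears; only a higher-priority task can jump the line.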
Jul 10
> That's probably not desirable. This was actually the original behavior, and we had to change it so that native and java tasks are fair, so that java tasks are not penalized when there are a lot of native ones, causing performance issues. This was more important for webview where the UI thread is shared with the app, but I don't know how much java stuff runs in clank these days..

For what it's worth, startup profiling on Go devices showed no change in startup times, and while there were a small number of tasks whose order was reversed, the ordering of those tasks didn't seem to have an impact on startup time.

As for trying to be fairer with C++/Java task ordering, I did have a version of my patch where I returned control to the looper after every native task, so it would go back and forth between Java and native tasks, but it turned out to be really complicated because a lot of Java code actually relied on the native tasks running first.
Jul 10
The previous effort to make things fairer was https://bugs.chromium.org/p/chromium/issues/detail?id=250924. Also mentioned in that bug (and something I hadn't thought about): if you run all the native tasks before returning to Java, you run the risk of causing more ANRs.
Jul 10
If we want to avoid ANRs, we could always do something like have a max time we spend in native code before returning control to the looper. This isn't possible to strictly enforce if an individual task takes a long time to run, but we could probably do a reasonable job there. In fact, I think what would help a lot here is what skyostil mentioned: if all (or most) of our own Java tasks are posted through the native queue, we could return control to the looper and have a much higher chance of, say, handling input.
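A minimal sketch of that time budget, assuming a hypothetical helper (not an existing Chromium API): run queued native tasks until either the queue is empty or the budget is spent, then yield back to the Java looper. Note the budget is only checked between tasks, so a single long task can still overshoot it.

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <functional>

// Hypothetical: run native tasks until the budget elapses, returning how
// many ran. Tasks left in the queue wait for the next looper callback.
int RunWithBudget(std::deque<std::function<void()>>& tasks,
                  std::chrono::steady_clock::duration budget) {
  const auto deadline = std::chrono::steady_clock::now() + budget;
  int ran = 0;
  while (!tasks.empty() && std::chrono::steady_clock::now() < deadline) {
    tasks.front()();  // May overshoot the deadline; checked only between tasks.
    tasks.pop_front();
    ++ran;
  }
  return ran;
}
```

The looper would call this each time it wakes us, so Java work (input, app tasks) gets a turn at least every `budget` interval instead of waiting for the entire native queue to drain.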
Jul 10
Unfortunately that doesn't help WebView, where most of the Java tasks are posted by the app and we have no control over how the posting happens.
Jul 10
Possibly Webview should have an entirely different message pump implementation in that case, with different priorities?
Jul 10
Do we have perf benchmarks or sample poorly-behaved webview apps to actually test changes on?
Jul 10
Not really. :/ Anyway, the behavior before crbug.com/250924 was to prefer Java tasks over native, which let a badly behaving app starve WebView work (mostly native). The current behavior is to prefer native over Java, so as long as WebView itself doesn't misbehave, it's probably not a huge deal. And hopefully we don't ever misbehave..
Jul 14
Well, looks like this was a ~15% startup time improvement on some Go phones :) https://bugs.chromium.org/p/chromium/issues/detail?id=862984
Jul 16
Nice work!
Comment 1 by dskiba@chromium.org, Nov 23 2017