If the user disables Android apps, the user data is wiped. While in this state, re-enabling Android is racy until after the data is done being wiped, and might either cause timeouts or enter invalid states.
We need to make sure Android cannot be enabled until after session_manager has finished wiping user data.
There's a bug (627664) for one specific race condition where where opting-in too quickly causes user data to never be removed, but I don't know of any others.
Last time I looked in to this I got the impression that the race condition as described here doesn't actually happen. If I understand correctly the threading prevents the user from opting in before RemoveArcData is called, and the threading of session_manager_impl prevents new containers from being created before it finishes.
That being said its all too indirect for comfort and stronger guarantees would be awesome.
The issue here is that if the user opts out and in again too quickly, and the wiping operation takes longer than the D-Bus timeout (which we have seen is very likely in the wild), starting ARC will fail from Chrome's perspective, so it would not retry. I don't know if the container itself would be started later or not, which also seems dangerous.
There was some logic added to the code that does data wiping to wait before restarting, but it needs to be moved one layer below since there are other ways in which ARC can be started.
Another bug is that if toggling is done more quickly, then enabling event happens before the ARC stopping (practically this is doable easily).
In such cases, UI shows underterministic error messages and sometimes flickering because of its internal state management.
ArcAuthService needs to manage more precise states of ABS as well as, as Luis said at #6, data wipe out states.
Looking at the code, let me propose alternative approach, which is;
Merging ArcUserDataService into ArcAuthService.
Conceptually, we have mutually exclusive two tasks.
- ARC instance management.
- User data removal procedure.
If either runs, the other should not run.
Merging user data removal into ABS is still possible approach, I think.
Though, considering current code, merging it to ArcAuthService looks smaller work.
Actually, in some cases, ArcAuthService also removes the user data (independently from ArcUserDataService, which is broken code path now...). So
we'll be able to reusing it. Although it is not complete fix, but this can reduce the much risk of mysterious state/buggy behaviors, IMHO.
Though, AAS is now complicated class, because of organic growth based on various requirements.
So, after some important bug fix, I'll clean ArcAuthService up, which needs relatively huge code change. I will send another refactoring proposal similar to what I did for ABS later. It should also be the fixes of edge cases.
For the first part (merging to ArcAuthService) should small. (Will make a CL).
Though latter part is huge. Thus, I think first part is the cherry-pick target, and the latter is not.
As for priority, I personally think, data removal race is critical (because it is user data breakage), though others look less important since retry later should work well. Luis, any thoughts?
WDYT?
Comment 1 by uekawa@chromium.org
, Aug 2 2016