Issue 871733

Starred by 4 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: 2019-02-10
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 850893
issue 867929




Validation of broken USS metadata records

Reported by ascheg...@yandex-team.ru, Aug 7

Issue description

Here at Yandex, we've found some users with broken cached DeviceInfo metadata records.

They have |sequence_number > acked_sequence_number| (e.g. 5 and 0) and no corresponding specifics in |DeviceInfo::all_data_|.  
Thus, |ProcessorEntityTracker::RequiresCommitRequest| returns true, |ClientTagBasedModelTypeProcessor::HasLocalChanges| returns true, and they constantly trigger NudgeForCommit.
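
For readers less familiar with the USS metadata fields, here is a simplified sketch of the loop, with illustrative stand-in types rather than the actual Chromium sources:

  #include <cstdint>

  // Simplified stand-in for the sync entity metadata proto; the real
  // logic lives in ProcessorEntityTracker and
  // ClientTagBasedModelTypeProcessor.
  struct EntityMetadata {
    int64_t sequence_number = 0;        // Incremented on every local change.
    int64_t acked_sequence_number = 0;  // Last change acked by the server.
  };

  // The essence of ProcessorEntityTracker::RequiresCommitRequest().
  bool RequiresCommitRequest(const EntityMetadata& m) {
    return m.sequence_number > m.acked_sequence_number;
  }

  // With the orphaned record above (sequence_number 5, acked 0) this stays
  // true forever, so HasLocalChanges() stays true and the client keeps
  // calling NudgeForCommit() even though there are no specifics to commit.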

So we're thinking that it would be reasonable to clear such broken metadata records (or maybe drop all the metadata), either in |ClientTagBasedModelTypeProcessor::ModelReadyToSync| for all the types, or in |DeviceInfoSyncBridge::OnReadAllMetadata| only for DeviceInfo (as we didn't yet encounter such issue with other types).

What are your thoughts on this?
 
Cc: mastiz@chromium.org jkrcal@chromium.org
Status: Available (was: New)
Thanks for the report! 

Looping in mastiz@. This could also be related to the infinite loop of commits with the loopback server, right?

My take is that I wonder what the root cause of such inconsistent data is. I am worried that such a fix would only hide the original bug (maybe in DeviceInfoBridge?).

An unrelated question: this bug slipped through our bug triage queue because it used the status "New", which is not in the list of standard statuses (in the dropdown menu). What kind of tool/template did you use to file the bug, so that we can fix it to use the "Unconfirmed" status?
Could this CL possibly be fixing the root cause of the issue?
https://chromium-review.googlesource.com/c/chromium/src/+/1167051
Blocking: 867929
Blocking: 850893
Cc: j...@opera.com yn...@vivaldi.com
I've made this bug block two other bugs (one related to DEVICE_INFO, the other related to TYPED_URL), as this seems to be the proper fix for those issues. I've cc'ed the relevant folks from those other bugs.

What we propose is to fix it in the ClientTagBasedModelTypeProcessor.
 - Not in ModelReadyToSync(), as this would involve asking the bridge for all data for which it has metadata, which could be a significant performance penalty for some data types.
 - Instead, we propose to fix it in GetLocalChanges() and in OnPendingDataLoaded(). The latter can detect whether the bridge omitted some entity that was requested. For each such entity, we would like to (see the sketch after this list):
   - untrack the entity locally,
   - ask the bridge to delete its metadata,
   - emit a UMA metric separately per data type, so that we can see which data types have this issue, hunt for the underlying bug, and assess whether the fix was successful.
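
A minimal sketch of that proposal (hypothetical helper names; the real processor and MetadataChangeList interfaces are richer):

  #include <set>
  #include <string>

  // Hypothetical stand-ins for the processor's dependencies.
  void UntrackEntity(const std::string& storage_key);
  void ClearMetadata(const std::string& storage_key);
  void RecordOrphanMetadataMetric();

  // Any entity the processor asked the bridge for that is absent from the
  // returned batch is treated as an orphan: its metadata survived but its
  // data is gone.
  void HandleMissingEntities(const std::set<std::string>& requested_keys,
                             const std::set<std::string>& loaded_keys) {
    for (const std::string& key : requested_keys) {
      if (loaded_keys.count(key) > 0)
        continue;  // The bridge returned data for this entity.
      UntrackEntity(key);            // Drop it from the processor's map.
      ClearMetadata(key);            // Have the bridge delete its metadata.
      RecordOrphanMetadataMetric();  // Per-data-type UMA, sketched below.
    }
  }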

What do you think?

Does anyone volunteer to start a CL to fix this?
To clarify the last question: there are (at least) two open CLs that try to address this issue. Please agree on who continues with fixing this bug :)
 - https://chromium-review.googlesource.com/c/chromium/src/+/1167051 by yngve@vivaldi.com
 - https://chromium-review.googlesource.com/c/chromium/src/+/1151322 by mzajaczkowski@opera.com
Cc: mzajaczk...@opera.com
> My take is that I wonder what the root cause of such inconsistent data is.

The leveldb log dump for a faulty entry shows that at some point the specifics record was deleted, and its metadata wasn't.

  'device_info-md-S47/8UfKTwy6LMBE4wwJIw==' @ 572 : val => '...'
  'device_info-dt-S47/8UfKTwy6LMBE4wwJIw==' @ 573 : val => '...'
  'device_info-dt-S47/8UfKTwy6LMBE4wwJIw==' @ 574 : del => ''

The metadata has 

  sequence_number: 5
  acked_sequence_number: 0
  server_version: -1

so it was never synchronized.

> An unrelated question: this bug slipped through our bug triage queue because it used the status "New", which is not in the list of standard statuses (in the dropdown menu). What kind of tool/template did you use to file the bug, so that we can fix it to use the "Unconfirmed" status?

http://crbug.com/new. It avoids all those forms, but apparently it shouldn't be used.
ascheglov@, thanks for the detailed analysis!

I guess this delete-before-upload situation can happen for almost any data type (in particular when offline) so it deserves a general fix like the one sketched in #5.

On the other hand, I suspect that some of the bridges are particularly prone to this situation and might deserve follow-up fixes to reduce the frequency of this happening.
Any updates on this? crbug.com/867929 and crbug.com/850893 are a bit worrisome.
We agreed to timebox external contributors and alternatively start a fix from our side. jkrcal@, please correct me but I think it's time to get ourselves actively involved.
sync-triage-ping
Yes, it is time to start the patch. Let's discuss it offline this week.
We need to talk this through, but here's a proposed plan as a recap and basis for discussion:
1. Introduce UMA metric(s) to assess the severity: at least one that detects the situation in OnPendingDataLoaded().
2. Deep dive into each affected data type (because I don't yet see a design flaw in USS, but rather buggy bridges).
3. Regardless of #2, introduce a mitigation in ClientTagBasedModelTypeProcessor, most likely one of the following:
   a) Mark entity metadata as synced (i.e. the last known server version prevails).
   b) Commit empty (default) specifics.
   c) Untrack affected entities.
Owner: jkrcal@chromium.org
Status: Assigned (was: Available)
I am picking up the bug, for step 1 from #14, in ClientTagBasedModelTypeProcessor::OnPendingDataLoaded():

If no data gets loaded and the processor still has the entity in entities_ (i.e. it has not been deleted or untracked by the bridge in the meantime), report it to UMA.

We need to report such failures to separate metrics per data type so that we can see which bridges are buggy.
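
A sketch of such a per-data-type metric: one enumerated histogram keyed by model type, so the dashboards can split orphan counts per type. The histogram name matches the metric later mentioned in #18's commit message; the exact recording helper available at the time may have differed (e.g. ModelTypeToHistogramInt in older trees):

  #include "base/metrics/histogram_functions.h"
  #include "components/sync/base/model_type.h"

  // Records one sample per detected orphan, bucketed by model type.
  void RecordOrphanMetadata(syncer::ModelType type) {
    base::UmaHistogramEnumeration("Sync.ModelTypeOrphanMetadata",
                                  syncer::ModelTypeHistogramValue(type));
  }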

Comment 16 by bugdroid1@chromium.org, Sep 12

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6f5ec1c6d6bec9e7a08170aca95bd101f6d0a1e0

commit 6f5ec1c6d6bec9e7a08170aca95bd101f6d0a1e0
Author: Jan Krcal <jkrcal@chromium.org>
Date: Wed Sep 12 22:38:05 2018

[Sync] Record orphaned metadata counts per data type to UMA

This CL makes a first step in addressing the issue that some bridges
leak orphan metadata: the data entities get deleted but the metadata
entities stay around and the processor does not get notified about the
deletion.

This CL introduces a metric that will help us understand which bridges
have such a leaky behavior.

Bug: 871733
Change-Id: I2c751c574823cddbb21640d21462b6b778d9cb6f
Reviewed-on: https://chromium-review.googlesource.com/1219615
Commit-Queue: Ilya Sherman <isherman@chromium.org>
Reviewed-by: Ilya Sherman <isherman@chromium.org>
Reviewed-by: Mikel Astiz <mastiz@chromium.org>
Cr-Commit-Position: refs/heads/master@{#590844}
[modify] https://crrev.com/6f5ec1c6d6bec9e7a08170aca95bd101f6d0a1e0/components/sync/model/fake_model_type_sync_bridge.cc
[modify] https://crrev.com/6f5ec1c6d6bec9e7a08170aca95bd101f6d0a1e0/components/sync/model/fake_model_type_sync_bridge.h
[modify] https://crrev.com/6f5ec1c6d6bec9e7a08170aca95bd101f6d0a1e0/components/sync/model_impl/client_tag_based_model_type_processor.cc
[modify] https://crrev.com/6f5ec1c6d6bec9e7a08170aca95bd101f6d0a1e0/components/sync/model_impl/client_tag_based_model_type_processor.h
[modify] https://crrev.com/6f5ec1c6d6bec9e7a08170aca95bd101f6d0a1e0/components/sync/model_impl/client_tag_based_model_type_processor_unittest.cc
[modify] https://crrev.com/6f5ec1c6d6bec9e7a08170aca95bd101f6d0a1e0/tools/metrics/histograms/histograms.xml

Labels: -Pri-3 Pri-1
Bumping to P1 after UMA metrics surfaced that this situation is more common than we thought.

Comment 18 by bugdroid1@chromium.org, Sep 20

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a6f711fd64fb69209c46ca3eb00b227e2267c1c4

commit a6f711fd64fb69209c46ca3eb00b227e2267c1c4
Author: Mikel Astiz <mastiz@chromium.org>
Date: Thu Sep 20 13:08:50 2018

Untrack sync entity metadata if data is gone

This is a mitigation to what are believed to be bugs in the bridges,
which may forget to appropriately report local deletions to the
processor, leading to orphan sync metadata.

We have recently instrumented a UMA metric
("Sync.ModelTypeOrphanMetadata") to assess the severity of the issue,
and the collected data suggests that the problem is not specific to a
single datatype. While we investigate each bridge individually, a
reasonable workaround is to let the processor remove the orphaned
metadata and untrack the entity.

Bugs in the bridges should still be surfaced by the aforementioned
UMA metric, although absolute numbers should tend to be smaller after
this patch, because issues will get fixed after emitting the metric
once.

Bug: 875671, 871733
Change-Id: Ieb06692f46739d73ce6f04115c25e14198ade1dd
Reviewed-on: https://chromium-review.googlesource.com/1233676
Commit-Queue: Mikel Astiz <mastiz@chromium.org>
Reviewed-by: Jan Krcal <jkrcal@chromium.org>
Cr-Commit-Position: refs/heads/master@{#592775}
[modify] https://crrev.com/a6f711fd64fb69209c46ca3eb00b227e2267c1c4/components/sync/model_impl/client_tag_based_model_type_processor.cc
[modify] https://crrev.com/a6f711fd64fb69209c46ca3eb00b227e2267c1c4/components/sync/model_impl/client_tag_based_model_type_processor_unittest.cc

Regarding device_info orphans:

I've looked into DeviceInfoSyncBridge and could not spot any obvious problem. The only potential fix is to get rid of the early return in ApplySyncChanges() (as earlier proposed by yngve@vivaldi.com). Still, after some deeper discussion, we cannot see how this early return contributes to orphan metadata.


The patch in #18 made sure we remove existing orphans. Thus, we propose to wait a week, until all reported orphans are actually new orphans. If there are still non-zero orphans in device_info, we propose to get rid of the early return.

Options for that:
 - Caching the current cache_guid in the bridge (so that a DeviceInfo instance is not needed); see the sketch below.
 - Exposing the current cache_guid from the processor (so that a DeviceInfo instance is not needed).
 - Avoiding the asynchrony between the controller and the processor & bridge for device info. This should make sure (we need to double-check) that the nullptr case cannot happen anymore.
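
A minimal sketch of the first option, with illustrative class and member names (the real DeviceInfoSyncBridge differs):

  #include <string>

  class DeviceInfoBridgeSketch {
   public:
    // The cache_guid is stable between OnSyncStarting() calls, so caching
    // it removes the dependency on a live DeviceInfo instance.
    void OnSyncStarting(const std::string& cache_guid) {
      local_cache_guid_ = cache_guid;
    }

    // ApplySyncChanges() can now identify (and delete metadata for) the
    // local entry even during shutdown, when DeviceInfo is already gone.
    bool IsLocalEntry(const std::string& entry_cache_guid) const {
      return entry_cache_guid == local_cache_guid_;
    }

   private:
    std::string local_cache_guid_;
  };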
There are a couple of "non-standard" behaviours on my side that /may/ have something to do with this.

1) Killing Chrome processes

Could running a script that kills Chrome tabs cause this issue? e.g.: https://github.com/sindresorhus/kill-tabs

I frequently use a script like this, or kill tabs in the Chrome Task Manager.

(I guess this wouldn't be an issue if sync always happens in the main thread.)


2) Copying the APPDATA directory from one PC to another.

In order to keep all the localstorage of various web apps and chrome extensions, I copied the "C:\Users\%USERNAME%\AppData\Local\Google\Chrome\User Data" folder from my previous PC to my current one. Could this have "confused" the device_info settings?

NB: I did this in January, but I didn't notice any problems until some months later.
jkrcal@: triage ping, could you please provide a status update? Can we close this?
Update for device_info:

The counts went down significantly. Nevertheless, there are non-zero new orphans on the newest versions of Canary.

I vote for a quick implementation of one of the simpler options from #19, to verify that we have the right root cause.

session and typed_url still generate quite a few new orphans.
Delayed by being sick. I am still working on a patch, as promised in #22.
Labels: sync-fixit-2018q4
Status: Started (was: Assigned)

Comment 26 by bugdroid1@chromium.org, Nov 23

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/417c517b73416d7c3f8697ef1d75a679c294d77e

commit 417c517b73416d7c3f8697ef1d75a679c294d77e
Author: Jan Krcal <jkrcal@chromium.org>
Date: Fri Nov 23 08:06:36 2018

[DeviceInfo USS] Fix the bridge leaking metadata orphans

This CL makes the device info USS bridge more robust in handling
ApplySyncChanges(). Previously, in a shutdown race condition, the bridge
could receive such calls when the device info is not available any more.
This could lead to the bridge not acting on the ApplySyncChanges() call,
not deleting any metadata.

This CL caches the current guid (which is stable between OnSyncStarting
calls) in the bridge.

Bug: 871733
Change-Id: I18d0bdcbcd9677e21196ee986a6af40a5abb29d9
Reviewed-on: https://chromium-review.googlesource.com/c/1273042
Commit-Queue: Jan Krcal <jkrcal@chromium.org>
Reviewed-by: Mikel Astiz <mastiz@chromium.org>
Cr-Commit-Position: refs/heads/master@{#610548}
[modify] https://crrev.com/417c517b73416d7c3f8697ef1d75a679c294d77e/components/sync/device_info/device_info_sync_bridge.cc
[modify] https://crrev.com/417c517b73416d7c3f8697ef1d75a679c294d77e/components/sync/device_info/device_info_sync_bridge_unittest.cc

NextAction: 2018-12-09
We believe that, after the patch from #26, device_info should become orphan-free.

Let me check for new orphans from DEVICE_INFO on Canary two weeks from now.
The NextAction date has arrived: 2018-12-09
NextAction: 2019-01-03
Not enough data to conclude anything clear. There are still a few new orphans on an updated Canary (M73) after Dec 01, but these could be older orphans.

Let me check again in early January.
The NextAction date has arrived: 2019-01-03
Canary/Dev data contains weird peaks (maybe due to crashes, corrupt profiles, etc.?). Data from Beta looks much more reliable.

The fix from #26 did not solve the problem: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=bcf733e2cac9a9a73fc47d0840cda011
(M71 contains both #16 and #18, M72 contains #26)

It is possible that #26 reduced the problem a bit, but there is still some source of new orphans. The volume is quite low, so it could be related to corner-case situations such as crashes... I need to dig deeper.
jkrcal@: triage ping, thanks! Should we adjust priority to P2 at this point?
Labels: -Pri-1 Pri-2
Yes, agreed, reducing prio to P2.

Apart from that there is no further update compared to #31.
NextAction: 2019-02-10
Setting Next-Action to remind myself to look at data from Stable after the M72 rollout. Nevertheless, I hope to dig deeper before that.

Sign in to add a comment