New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 668574 link

Starred by 3 users

Issue metadata

Status: Verified
Owner:
Closed: Dec 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----



Sign in to add a comment

Speed up cloud import's duplicate detection logic

Project Member Reported by satorux@chromium.org, Nov 25 2016

Issue description

Currently, cloud import's duplicate detection is done based on MD5 hashes of image files. The hashes are compared against local Drive metadata. The hashes are stored so these won't be recomputed for the same files.

However, computing hashes only once is still annoying because it will slow down the device while the computation is in progress.

One way to solve this problem is to switch to a simpler solution that uses only file names and sizes. This way is slightly inaccurate [1] but fast and simple, and would be good enough?

[1] cannot detect duplicates with different file names, and duplicates with different contents but the same file names and sizes (unlikely).
 
False positives (same metadata, but different contents detected as duplicates) may be dangerous, as it may result in some pictures not being backed up, then permanently lost.

Maybe we could flip the logic slightly and compute md5 only once metadata based deduplication returns true? Then all pictures would be backed up at least once (or more for false negatives). In such case md5 calculation should be very rare.
Sounds reasonable to me.
yamaguchi@ noticed that my solution from #1 won't work, as reinserting a USB stick would cause heavy scan of all files.

Maybe we could extend the light scan to include more metadata such as last modified time? We have an API on Drive [1] to attach extra metadata to files, so we could attach the original last modified time from the SD card. And then use the solution suggested by satorux@, which is removing the heavy scan entirely. Collisions in such case should be very unlikely (same file name, same size and same last modified time, but different contents).

Or, how about computing hashes for like only the first 16KB of data? We'd need to keep the checksum manually via [1], instead of the one provided by Drive API.

[1] https://cs.chromium.org/chromium/src/chrome/common/extensions/api/file_manager_private.idl?rcl=0&l=950
Relying on name+size+timestamp makes sense. If only timestamp is different, maybe we should compute the checksum, because timestamps can be less reliable.


> Or, how about computing hashes for like only the first 16KB of data? We'd need to keep the checksum manually via [1], instead of the one provided by Drive API.

This works efficiently only if files on Drive are cached locally.

Blocking: 671517
Here is the details of current logic I examined:

Step 1. History de-duplication, based on metadata.
If there’s a record that indicates Cloud Import has imported or copied the file in the past, we determine that the file is already backed up, without further check.
The "record" is based on metadata (timestamp, size) hash, and synced across devices by chromeSyncFilesSystem.
(note: if user removes backed up file in remote, it still says 'already backed up'. It's by design.)

Step 2. De-duplication based on content hash.
Calculate content hash of every local file and try to find in metadata of Drive filesystem.
We have local in-memory cache for the hash of local files. (key is metadata hash + URL). The cache is purged when user logs out.

The both runs every time a media directory is opened, or files are selected. (which is before user clicks the "back up" button)

If user actually uses backup by Cloud Import, most file will be detected as already backed up at step 1, thus skips step 2.
On the other hand, however, if user never uses backup by Cloud Import, it will always do step 2 for every file in the media directory after media is plugged. It can take more time after re-login not having cached hash.


After discussion outside this bug, a possible solution would be to omit Step 2 check at the scan stage before starting backup. We can consider those as "new files" which potentially require backup. Step 2 deduping can be conducted after "back up" button is clicked by the user.
This will detect more new files, including false new files, than currently it does. It means it won't add more risk to miss files by false positives.

Besides that, regarding Step 1, we already have issue with false positives in some special case (See crbug.com/672017). I think it should be addressed separately to reduce files that are not backed up by mistake.

Labels: -Pri-2 Pri-1
Project Member

Comment 8 by bugdroid1@chromium.org, Dec 19 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4703c1fc80d263275181e92c9bd1e124036f4e64

commit 4703c1fc80d263275181e92c9bd1e124036f4e64
Author: yamaguchi <yamaguchi@chromium.org>
Date: Mon Dec 19 07:05:47 2016

Add scan mode parameter to choose whether to do content duplication check or just check the import history.

This change will not actually change the behavior of the app but will be used by a following change https://codereview.chromium.org/2569163002/ .

BUG= 668574 
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:closure_compilation

Review-Url: https://codereview.chromium.org/2568723005
Cr-Commit-Position: refs/heads/master@{#439422}

[modify] https://crrev.com/4703c1fc80d263275181e92c9bd1e124036f4e64/ui/file_manager/file_manager/background/js/duplicate_finder.js
[modify] https://crrev.com/4703c1fc80d263275181e92c9bd1e124036f4e64/ui/file_manager/file_manager/background/js/media_scanner.js
[modify] https://crrev.com/4703c1fc80d263275181e92c9bd1e124036f4e64/ui/file_manager/file_manager/common/js/importer_common.js
[modify] https://crrev.com/4703c1fc80d263275181e92c9bd1e124036f4e64/ui/file_manager/file_manager/foreground/js/import_controller.js

Cc: weifangsun@chromium.org
If we run the step 2 (in Comment 6) after a user click the button, It'll require new message text to show its progress. The scan will take a significant amount of time (in a linear proportion to the number of the files) before the file import starts.

I think we can show a message and the progress bar for the scan in the same place as the "importing %d files" message which we currently show while importing files. (see the attached file.)

Weifang, will you get copies for such message(s)?

My proposals are:
- "Preparing to import files..." at the beginning, and
- "Preparing to import files... 3 files will be transferred..." after finding 1 or more files to be copied.

import-progress-messages.mp4
190 KB View Download
Blocking: -671517
Cc: smckay@chromium.org
Can we do something like:

1) Based on the step 1), get the number of candidate files, and uses the number for "XXX new photos found"
2) Iterate over the candidate files one by one. If a duplicate is found by the content hash, just count the file as success. If a non-duplicate is found, upload it.

If this works, no new messages would be required. This's what smckay@ suggested elsewhere.
I've made a changelist by the approach described in Comment 11.
https://codereview.chromium.org/2594523002/

Therefore we will not need new messages I described in Comment 9.
Project Member

Comment 13 by bugdroid1@chromium.org, Dec 21 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b02ae16b895b09fe5dd3319b37a32876abe58eaf

commit b02ae16b895b09fe5dd3319b37a32876abe58eaf
Author: yamaguchi <yamaguchi@chromium.org>
Date: Wed Dec 21 08:22:41 2016

Postpone the content duplication check until starting import.

Skip content duplication check at the directory or selection scan, which
runs automatically upon directory or selection change.
The de-duplication by content hash check is added to the import task
instead. When it finds a file already exist in Drive, it skips the
transfer of such file but just marks as imported.

BUG= 668574 
TEST=browser_test: FileManagerJsTest.* , and manual test : Put 2 new image files (A and B) in /DCIM folder of a removable drive. Copy a file (A) to a folder in Drive manually. See the both file detected as "new file" by cloud backup. Execute Cloud Backup for both files and confirm only (B) is copied to Cloud Backup folder, and see the both file gets Drive icon in the Status column.
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:closure_compilation

Review-Url: https://codereview.chromium.org/2594523002
Cr-Commit-Position: refs/heads/master@{#440052}

[modify] https://crrev.com/b02ae16b895b09fe5dd3319b37a32876abe58eaf/ui/file_manager/file_manager/background/js/background.js
[modify] https://crrev.com/b02ae16b895b09fe5dd3319b37a32876abe58eaf/ui/file_manager/file_manager/background/js/duplicate_finder.js
[modify] https://crrev.com/b02ae16b895b09fe5dd3319b37a32876abe58eaf/ui/file_manager/file_manager/background/js/media_import_handler.js
[modify] https://crrev.com/b02ae16b895b09fe5dd3319b37a32876abe58eaf/ui/file_manager/file_manager/background/js/media_import_handler_unittest.js
[modify] https://crrev.com/b02ae16b895b09fe5dd3319b37a32876abe58eaf/ui/file_manager/file_manager/foreground/js/import_controller.js

Status: Fixed (was: Assigned)
Project Member

Comment 15 by bugdroid1@chromium.org, Jan 19 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d31e43ce6af3cea0654218a0d4db7188addb71d3

commit d31e43ce6af3cea0654218a0d4db7188addb71d3
Author: yamaguchi <yamaguchi@chromium.org>
Date: Thu Jan 19 03:08:30 2017

Count and report number of duplicated files found during import.

Issue 2594523002 has changed the scan process for Cloud Backup.
The directory scan and selection scan no longer reports before import
content_duplicate files. This patch instead counts the number of such
duplicated files found during the import task and adds it as
ImprotEvents.CONTENT_DUPLICATE metric.

BUG= 668574 
TEST=browser_test FileManagerJsTest.MediaImportHandlerTest
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:closure_compilation

Review-Url: https://codereview.chromium.org/2607443002
Cr-Commit-Position: refs/heads/master@{#444618}

[modify] https://crrev.com/d31e43ce6af3cea0654218a0d4db7188addb71d3/ui/file_manager/file_manager/background/js/media_import_handler.js

Status: Verified (was: Fixed)

Sign in to add a comment