Speed up cloud import's duplicate detection logic
Issue description

Currently, cloud import's duplicate detection is based on MD5 hashes of image files. The hashes are compared against local Drive metadata, and they are stored so they won't be recomputed for the same files. However, even computing each hash only once is still a problem, because it slows down the device while the computation is in progress. One way to solve this is to switch to a simpler scheme that uses only file names and sizes. This is slightly inaccurate [1] but fast and simple, and may be good enough.

[1] It cannot detect duplicates that have different file names, nor duplicates that have different contents but the same file name and size (unlikely).
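As a rough illustration of the lightweight check, a duplicate key could be built from name and size alone; a minimal sketch, with hypothetical helper names rather than the actual Files app code:

  // Builds a cheap duplicate-detection key from file name and size only.
  // getMetadata() is the standard FileSystem API call on an Entry.
  function computeLightKey(entry) {
    return new Promise((resolve, reject) => {
      entry.getMetadata(
          metadata => resolve(entry.name + ':' + metadata.size), reject);
    });
  }

  // knownKeys would be a Set of name:size keys built from Drive metadata.
  async function isLikelyDuplicate(entry, knownKeys) {
    return knownKeys.has(await computeLightKey(entry));
  }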
Comment 1 by mtomasz@chromium.org, Nov 25 2016
Sounds reasonable to me.
Nov 28 2016
yamaguchi@ noticed that my solution from #1 won't work, as reinserting a USB stick would trigger a heavy scan of all files. Maybe we could extend the light scan to include more metadata, such as the last modified time? We have an API on Drive [1] to attach extra metadata to files, so we could attach the original last modified time from the SD card, and then use the solution suggested by satorux@, which is removing the heavy scan entirely. Collisions in that case should be very unlikely (same file name, same size and same last modified time, but different contents).

Or, how about computing hashes over only the first 16KB of data? We'd need to keep that checksum manually via [1], instead of using the one provided by the Drive API.

[1] https://cs.chromium.org/chromium/src/chrome/common/extensions/api/file_manager_private.idl?rcl=0&l=950
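For the partial-hash idea, a sketch of hashing only the first 16KB of each file. This is illustrative only; it uses SHA-256 via the Web Crypto API (crypto.subtle does not offer MD5), and the resulting checksum would have to be stored manually as described above:

  const PARTIAL_HASH_BYTES = 16 * 1024;

  // Resolves a FileEntry to its File object.
  function fileOf(entry) {
    return new Promise((resolve, reject) => entry.file(resolve, reject));
  }

  // Hashes only the first 16KB so large photos/videos are not read in full.
  async function computePartialHash(entry) {
    const file = await fileOf(entry);
    const head = await file.slice(0, PARTIAL_HASH_BYTES).arrayBuffer();
    const digest = await crypto.subtle.digest('SHA-256', head);
    return Array.from(new Uint8Array(digest))
        .map(b => b.toString(16).padStart(2, '0'))
        .join('');
  }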
Nov 29 2016
Relying on name+size+timestamp makes sense. If only the timestamp differs, maybe we should compute the checksum, because timestamps can be less reliable.

> Or, how about computing hashes over only the first 16KB of data? We'd need to keep that checksum manually via [1], instead of using the one provided by the Drive API.

This works efficiently only if the files on Drive are cached locally.
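In other words, the cheap checks decide most cases and the checksum is only a fallback. A sketch of that decision, with hypothetical field and function names:

  // localMeta / driveMeta: {name, size, mtimeMs, contentHash} records that
  // have already been fetched; computeLocalHash is the slow fallback path.
  async function isDuplicate(localMeta, driveMeta, computeLocalHash) {
    if (localMeta.name !== driveMeta.name || localMeta.size !== driveMeta.size)
      return false;  // Cheap checks fail: not a duplicate.
    if (localMeta.mtimeMs === driveMeta.mtimeMs)
      return true;   // Name, size and timestamp all match.
    // Timestamps disagree and can be unreliable, so fall back to comparing
    // the local content hash with the hash stored for the Drive file.
    return (await computeLocalHash(localMeta)) === driveMeta.contentHash;
  }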
Dec 7 2016
Here are the details of the current logic I examined:

Step 1. History-based de-duplication, using metadata. If there is a record indicating that Cloud Import has imported or copied the file in the past, we treat the file as already backed up, without any further check. The "record" is keyed by a hash of the file's metadata (timestamp, size) and is synced across devices via chrome.syncFileSystem. (Note: if the user removes a backed-up file on the remote side, the file is still reported as 'already backed up'. This is by design.)

Step 2. De-duplication based on content hash. We compute the content hash of every local file and look it up in the metadata of the Drive filesystem. There is a local in-memory cache for the hashes of local files (keyed by metadata hash + URL), which is purged when the user logs out.

Both steps run every time a media directory is opened or files are selected, i.e. before the user clicks the "back up" button. If the user actually backs up with Cloud Import, most files are detected as already backed up in Step 1, so Step 2 is skipped. If the user never backs up with Cloud Import, however, Step 2 runs for every file in the media directory each time media is plugged in, and it takes even longer after re-login because the hash cache is empty.

After discussion outside this bug, a possible solution is to omit the Step 2 check at the scan stage, before backup starts, and treat those files as "new files" that potentially require backup. Step 2 de-duplication can then be performed after the user clicks the "back up" button. This will report more new files than today, including false new files, so it does not add any risk of missing files due to false positives.

Separately, regarding Step 1, we already have an issue with false positives in a special case (see crbug.com/672017). I think that should be addressed separately, to reduce the number of files that are mistakenly left un-backed-up.
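A compressed sketch of the proposed flow, with Step 2 moved from scan time to import time. The names below are illustrative and do not correspond one-to-one to media_scanner.js or media_import_handler.js:

  // Step 1 only: fast scan that runs when a media directory is opened.
  // importHistory holds the synced records keyed by (timestamp, size) hash.
  function scanForCandidates(entries, importHistory) {
    return entries.filter(entry => !importHistory.wasImported(entry));
  }

  // Step 2 now runs per file, only after the user clicks "back up".
  async function importOne(entry, driveIndex, importHistory) {
    const hash = await computeContentHash(entry);  // the slow part, deferred
    if (!driveIndex.hasContentHash(hash)) {
      await copyToDrive(entry);
    }
    // Duplicates are recorded as imported without copying anything.
    importHistory.markAsImported(entry);
  }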
Dec 19 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/4703c1fc80d263275181e92c9bd1e124036f4e64

commit 4703c1fc80d263275181e92c9bd1e124036f4e64
Author: yamaguchi <yamaguchi@chromium.org>
Date: Mon Dec 19 07:05:47 2016

Add scan mode parameter to choose whether to do content duplication check or just check the import history.

This change will not actually change the behavior of the app but will be used by a following change https://codereview.chromium.org/2569163002/ .

BUG=668574
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:closure_compilation
Review-Url: https://codereview.chromium.org/2568723005
Cr-Commit-Position: refs/heads/master@{#439422}

[modify] https://crrev.com/4703c1fc80d263275181e92c9bd1e124036f4e64/ui/file_manager/file_manager/background/js/duplicate_finder.js
[modify] https://crrev.com/4703c1fc80d263275181e92c9bd1e124036f4e64/ui/file_manager/file_manager/background/js/media_scanner.js
[modify] https://crrev.com/4703c1fc80d263275181e92c9bd1e124036f4e64/ui/file_manager/file_manager/common/js/importer_common.js
[modify] https://crrev.com/4703c1fc80d263275181e92c9bd1e124036f4e64/ui/file_manager/file_manager/foreground/js/import_controller.js
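Conceptually, the new parameter lets callers choose between the two levels of checking; a rough, hypothetical shape (see the CL above for the real code):

  const ScanMode = {
    HISTORY_ONLY: 'history-only',  // import-history check only (Step 1)
    FULL: 'full',                  // history check plus content-hash check
  };

  // A scan created with ScanMode.HISTORY_ONLY skips the expensive
  // content-duplicate lookup; FULL keeps the previous behavior.
  function shouldCheckContentHash(mode) {
    return mode === ScanMode.FULL;
  }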
Dec 19 2016
If we run Step 2 (from Comment 6) after the user clicks the button, it will require new message text to show its progress. The scan will take a significant amount of time (linear in the number of files) before the file import starts. I think we can show a message and a progress bar for the scan in the same place as the "importing %d files" message which we currently show while importing files (see the attached file). Weifang, could you provide copy for such message(s)? My proposals are:
- "Preparing to import files..." at the beginning, and
- "Preparing to import files... 3 files will be transferred..." after finding 1 or more files to be copied.
Dec 20 2016
Can we do something like this:
1) Based on Step 1, get the number of candidate files and use that number for "XXX new photos found".
2) Iterate over the candidate files one by one. If a duplicate is found by content hash, just count the file as a success; if it is not a duplicate, upload it (see the sketch below).
If this works, no new messages would be required. This is what smckay@ suggested elsewhere.
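A sketch of that loop, with hypothetical names; the point is that content duplicates count toward the same progress total, so the number reported by Step 1 never changes:

  async function runImport(candidates, driveIndex, progress) {
    for (const entry of candidates) {
      const hash = await computeContentHash(entry);
      if (driveIndex.hasContentHash(hash)) {
        progress.markCompleted(entry);  // duplicate: counted as a success
        continue;
      }
      await uploadToDrive(entry);
      progress.markCompleted(entry);
    }
  }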
Dec 20 2016
I've made a changelist using the approach described in Comment 11: https://codereview.chromium.org/2594523002/

With this, we will not need the new messages I described in Comment 9.
Dec 21 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b02ae16b895b09fe5dd3319b37a32876abe58eaf

commit b02ae16b895b09fe5dd3319b37a32876abe58eaf
Author: yamaguchi <yamaguchi@chromium.org>
Date: Wed Dec 21 08:22:41 2016

Postpone the content duplication check until starting import.

Skip content duplication check at the directory or selection scan, which runs automatically upon directory or selection change. The de-duplication by content hash check is added to the import task instead. When it finds a file already exist in Drive, it skips the transfer of such file but just marks as imported.

BUG=668574
TEST=browser_test: FileManagerJsTest.* , and manual test: Put 2 new image files (A and B) in /DCIM folder of a removable drive. Copy a file (A) to a folder in Drive manually. See the both file detected as "new file" by cloud backup. Execute Cloud Backup for both files and confirm only (B) is copied to Cloud Backup folder, and see the both file gets Drive icon in the Status column.
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:closure_compilation
Review-Url: https://codereview.chromium.org/2594523002
Cr-Commit-Position: refs/heads/master@{#440052}

[modify] https://crrev.com/b02ae16b895b09fe5dd3319b37a32876abe58eaf/ui/file_manager/file_manager/background/js/background.js
[modify] https://crrev.com/b02ae16b895b09fe5dd3319b37a32876abe58eaf/ui/file_manager/file_manager/background/js/duplicate_finder.js
[modify] https://crrev.com/b02ae16b895b09fe5dd3319b37a32876abe58eaf/ui/file_manager/file_manager/background/js/media_import_handler.js
[modify] https://crrev.com/b02ae16b895b09fe5dd3319b37a32876abe58eaf/ui/file_manager/file_manager/background/js/media_import_handler_unittest.js
[modify] https://crrev.com/b02ae16b895b09fe5dd3319b37a32876abe58eaf/ui/file_manager/file_manager/foreground/js/import_controller.js
Jan 19 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/d31e43ce6af3cea0654218a0d4db7188addb71d3

commit d31e43ce6af3cea0654218a0d4db7188addb71d3
Author: yamaguchi <yamaguchi@chromium.org>
Date: Thu Jan 19 03:08:30 2017

Count and report number of duplicated files found during import.

Issue 2594523002 has changed the scan process for Cloud Backup. The directory scan and selection scan no longer report content_duplicate files before import. This patch instead counts the number of such duplicated files found during the import task and adds it as the ImportEvents.CONTENT_DUPLICATE metric.

BUG=668574
TEST=browser_test FileManagerJsTest.MediaImportHandlerTest
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:closure_compilation
Review-Url: https://codereview.chromium.org/2607443002
Cr-Commit-Position: refs/heads/master@{#444618}

[modify] https://crrev.com/d31e43ce6af3cea0654218a0d4db7188addb71d3/ui/file_manager/file_manager/background/js/media_import_handler.js
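A sketch of how such counting might look; recordMetric stands in for whatever metrics API the Files app actually uses:

  class ImportStats {
    constructor() {
      this.contentDuplicates = 0;  // duplicates found during the import task
    }
    onContentDuplicate() {
      this.contentDuplicates++;
    }
    reportOnFinish(recordMetric) {
      recordMetric('ImportEvents.CONTENT_DUPLICATE', this.contentDuplicates);
    }
  }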