New issue
Advanced search Search tips

Issue 793609 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: 2017-12-12
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

fetch_telemetry_binary_dependencies takes 90-120+ seconds on the bots for what should be a no-op

Project Member Reported by dpranke@chromium.org, Dec 9 2017

Issue description

It looks like the fetch_telemetry_binary_dependencies hook in `runhooks` takes 90+ seconds whenever it is run on a bot. I would assume that this is re-running the same downloads over and over again, even though the files should already exist on the machine.

This is a fairly serious cycle time regression and is hopefully an easy fix.

martiniss@, does it make sense for you to look at this?
 
Cc: -nednguyen@chromium.org martiniss@chromium.org perezju@chromium.org
Owner: nedngu...@google.com
Status: Started (was: Untriaged)
90% of time is spent computing file hash. Telemetry just has a quite inefficient filehash function:

def CalculateHash(file_path):
  """Calculates and returns the hash of the file at file_path."""
  sha1 = hashlib.sha1()
  with open(file_path, 'rb') as f:
    while True:
      # Read in 1mb chunks, so it doesn't all have to be loaded into memory.
      chunk = f.read(1024 * 1024)
      if not chunk:
        break
      sha1.update(chunk)
  return sha1.hexdigest()

(https://cs.chromium.org/chromium/src/third_party/catapult/common/py_utils/py_utils/cloud_storage.py?rcl=d624b3ced2c81d4fb4ea98a8dbb4532272cc1e0a&l=446)

What does swarming use to compute file hash? I guess switching Telemetry to reuse that may fix the problem.
Cc: mar...@chromium.org estaab@chromium.org
+Maruel/Erick for question about file hashing
Interestingly, the file hashing implementation in #4 seems almost the same as Telemetry (except I don't know what you use for 'algo' in production).

If that's true, the best we can do to improve the speed is by multiprocessing the code to check & fetch file. 

If the builder machines have 4 cores, that probably fetch_telemetry_binary_dependencies's runtime to 30+ s. Dirk: is that good enough for you? 
I think we need to figure out how to get this down to a couple seconds or less when it should be a no-op, just like the other hooks are, and I feel like we must be doing something less intelligently than we could be doing here, because none of the other download hooks take this long.

How much data are we downloading, and in how many files, again? Is this something we could solve by versioning something, maybe? Or adding a 'got' sha1 that we save when we do actually download the file, so all we have to do is compare the checked-in sha1 against the got sha1, rather than recomputing it every time?


#6: checking in the computed sha for the file is essentially the same as just ignoring the binary content of the file & check if it exists at all.

I think the best solution moving forward is modifying this to use cipd package system, rather than trying to invent yet-another-method of managing large number of binary files. Do you have any doc about how to use cipd? I would be happy to modify telemetry to use that instead.
Actually the short term fix for this is fetch_telemetry_binary_dependencies also produce precomputed hash files on the builder (which seems like the builder would keep these around). Upon detecting version mismatch between checked-in .sha1 files & .computed-sha1, it will delete binary & computed-sha1 & try to refetch things again.
Right, that's what I was suggesting (not checking in the computed sha for the file).

That said, I still feel like we're missing something. 
#9: I can prototype & see how it goes.
Project Member

Comment 11 by bugdroid1@chromium.org, Dec 11 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/e188f9648ada03c31681a688c4fa07407d43aa7a

commit e188f9648ada03c31681a688c4fa07407d43aa7a
Author: Nghia Nguyen <nednguyen@google.com>
Date: Mon Dec 11 23:24:10 2017

Make sure cloud_storage.GetIfHashChanged also save the binary's hash upon first time fetching it

Bug:  chromium:793609 
Change-Id: Ide32946e5c83d141978efc8d0b180136213616e0

TBR=perezju@chromium.org

Change-Id: Ide32946e5c83d141978efc8d0b180136213616e0
Reviewed-on: https://chromium-review.googlesource.com/820131
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Ned Nguyen <nednguyen@google.com>

[modify] https://crrev.com/e188f9648ada03c31681a688c4fa07407d43aa7a/common/py_utils/py_utils/cloud_storage_unittest.py
[modify] https://crrev.com/e188f9648ada03c31681a688c4fa07407d43aa7a/common/py_utils/py_utils/cloud_storage.py
[modify] https://crrev.com/e188f9648ada03c31681a688c4fa07407d43aa7a/.gitignore

NextAction: 2017-12-12
I will check the runtime of fetch_telemetry_binary_dependencies again tomorrow. Should be few seconds.
Components: Speed>Benchmarks
Project Member

Comment 14 by bugdroid1@chromium.org, Dec 11 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/98335bb2216f982670ac80bc969e1d7123ef4501

commit 98335bb2216f982670ac80bc969e1d7123ef4501
Author: Ned Nguyen <nednguyen@google.com>
Date: Mon Dec 11 23:32:32 2017

Ignore .fetchts

Bug:793609

NOTRY=true
TBR=dpranke@chromium.org

Change-Id: I770559876a6dd6b7de0a3575e02a1b854858a414
Reviewed-on: https://chromium-review.googlesource.com/820939
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#523248}
[modify] https://crrev.com/98335bb2216f982670ac80bc969e1d7123ef4501/tools/perf/page_sets/.gitignore

The NextAction date has arrived: 2017-12-12
Status: Fixed (was: Started)
"gclient runhooks" takes 38 sec now: https://ci.chromium.org/buildbot/chromium.win/Win7%20Tests%20%281%29/74488

So I think this is fixed
Nice! Thank you for fixing this so quickly :).

Sign in to add a comment