
Issue 747094


Issue metadata

Status: Fixed
Owner:
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 475693




CIPD fails to verify large packages

Project Member Reported by sergeybe...@chromium.org, Jul 20 2017

Issue description

Uploading a 13GB package (compressed to ~6GB):

[P28705 21:03:20.336 storage.go:67 I] cipd: uploading - 99%
[P28705 21:03:25.303 storage.go:67 I] cipd: uploading - 100%
[P28705 21:03:25.793 client.go:775 I] cipd: uploading - verifying
[P28705 21:03:26.871 client.go:775 I] cipd: uploading - verifying
[...]
[P28705 21:08:19.095 client.go:775 I] cipd: uploading - verifying
[P28705 21:08:23.173 client.go:775 I] cipd: uploading - verifying
[P28705 21:08:27.273 client.go:772 W] cipd: upload of 07d235xxxxxxxxxxxxxxxxxx failed: timeout
Error: timeout while waiting for CAS service to finalize the upload.

The root cause is a 10m timeout in a single taskqueue Task:

https://chromium.googlesource.com/infra/infra/+/master/appengine/chrome_infra_packages/cas/impl.py#411
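
For reference, the verification step has to stream the whole object back out of GCS and hash it within a single task request; roughly this kind of loop (a sketch using the GAE cloudstorage client library and made-up names, not the actual impl.py code):

import hashlib

import cloudstorage  # GAE Cloud Storage client library

def verify_sha1(gcs_path, expected_hex, chunk_size=8 * 1024 * 1024):
  # Streams '/bucket/object' out of GCS and checks its SHA-1. For a ~6GB
  # compressed package this loop alone can easily exceed the 10-minute
  # request deadline of an automatically scaled taskqueue task.
  h = hashlib.sha1()
  with cloudstorage.open(gcs_path) as f:
    while True:
      chunk = f.read(chunk_size)
      if not chunk:
        break
      h.update(chunk)
  return h.hexdigest() == expected_hex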
 
Interestingly, GS provides server-side MD5 (or CRC32c) validation: https://cloud.google.com/storage/docs/hashes-etags#_Validation

"Google Cloud Storage supports server-side validation for uploads, but client-side validation is also a common approach. If your application has already computed the object’s MD5 or CRC32c prior to starting the upload, you can supply it with the upload request, and Google Cloud Storage will only create the object if the hash you provided matches the value we calculated."

I wonder if we can use that instead of verifying ourselves?
No, MD5 is an awful hash for this sort of thing. We are using hashes as a security mechanism (to some extent), not just as a checksum. Even SHA1 is now considered inappropriate, and we'll need to migrate to SHA256 at some point.
OK, thanks, good to know.

On to the next problem: apparently, hashlib algos are not natively serializable / resumable. There is a workaround: https://stackoverflow.com/questions/2130892/persisting-hashlib-state
but I need to look closer at whether it'll work on AppEngine.
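
Concretely, the problem is that hashlib objects refuse to serialize, which is why that workaround has to poke at OpenSSL internals via ctypes; a tiny repro:

import hashlib
import pickle

h = hashlib.sha1()
h.update(b'first chunk of the package')
try:
  pickle.dumps(h)  # no supported way to persist the internal state
except TypeError as e:
  print('cannot persist hash state: %s' % e)
# h.copy() clones the state, but only within the same process/request.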
ctypes (and OpenSSL) won't work on GAE.

There may be something in PyCrypto package that IS available.

Doing the hashing in pure Python will be very slow :( (and will cost even more CPU $$$), so I'm against it.

The last resort is to move verification outside of GAE standard (into GKE or GAE Flex), but that is a big design change that is out of scope for now.
PyCrypto uses hashlib for SHA1 under the hood: https://github.com/dlitz/pycrypto/blob/master/lib/Crypto/Hash/SHA1.py#L51
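
In other words, something along these lines holds, so there is no performance to be gained there (assuming PyCrypto is even importable in the GAE sandbox):

import hashlib
from Crypto.Hash import SHA1  # PyCrypto's wrapper, delegates to hashlib

assert SHA1.new(b'payload').hexdigest() == hashlib.sha1(b'payload').hexdigest()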

So, I guess the answer is "no"...
Status: Started (was: Assigned)
From talking to vadimsh@ offline and doing more reading, it turns out GAE supports B4-class instances, which allow manual or basic scaling, both of which let taskqueue tasks run for up to 24h:
  https://cloud.google.com/appengine/docs/standard/python/an-overview-of-app-engine#scaling_types_and_instance_classes
  https://cloud.google.com/appengine/docs/standard/#instance_classes

I'm inclined to use "basic" scaling for the "backend" service and set "max_instances" to something definitely sufficient, like 10. It normally runs only 1 instance.
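
For concreteness, module-backend.yaml would end up looking roughly like this (a sketch of the intent, not the exact change that eventually landed):

module: backend
runtime: python27
api_version: 1
threadsafe: true
instance_class: B4
basic_scaling:
  max_instances: 10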
SGTM. Basic scaling looks ideal for our case.
When uploading a tainted version to the -dev instance, GAE failed to verify that Cloud Endpoints were updated, and the logs show this crash:

/_ah/spi/BackendService.getApiConfigs

Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 240, in Handle
    handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 351, in __getattr__
    self._update_configs()
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 287, in _update_configs
    self._registry.initialize()
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 160, in initialize
    import_func(self._modname)
  File "/base/data/home/apps/s~chrome-infra-packages-dev/10860-aaf3a2c-tainted-sergeyberezin.402814690065330475/appengine_config.py", line 20, in <module>
    import google.protobuf
ImportError: No module named protobuf

How did it work before??
OK, I give up for now. It's a known issue: the bug is in AppEngine code (not declaring "google" as a namespace package), and there is a workaround in https://github.com/google/protobuf/issues/1296:

"...import google.appengine would fail, but import google.protobuf; import google.appengine would work, presumably due to some namespace package magic on Python 2.7. We fixed it by adding a gae.pth file to our local site-packages that forces google to be a namespace package: import google; import pkgutil; pkgutil.extend_path(google.__path__, google.__name__)."

I'll try that on Monday. Reverted the -dev instance for the time being.
FTR, I had lots of "fun" with this last year, to no useful resolution: https://crbug.com/666985#c6
It worked fine until https://chromium.googlesource.com/infra/infra/+/f3d37c5bee642fdea02905cb395765187707f698

Then it broke, then it was supposedly fixed, but I haven't tried it for c-i-p yet. (I knew the fix worked for cr-buildbucket, which is structurally similar).

Make sure to run gclient sync and runhooks.
Updated infra.git to the latest commit and ran gclient sync. I still get the same problem after upload. BTW, I'm doing this from a Mac. I'll try it next on Linux, and if that doesn't work, I'll try a fresh checkout.
I think it's just broken.

We need to apply a fix similar to: https://chromium.googlesource.com/infra/infra/+/ce86b096dc556b656ed8f6efdf999c8e378df9c0

to reverse the effects of https://chromium.googlesource.com/infra/infra/+/f3d37c5bee642fdea02905cb395765187707f698

(Both cr-buildbucket and chrome_infra_packages were broken, but only cr-buildbucket was fixed).
Yes! It worked! Thanks, Vadim! Landing the CL with this fix.
Comment 15 by bugdroid1@chromium.org (Project Member), Jul 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/e355fed6c6c53484cfc3bffee3ba53182e80ebe6

commit e355fed6c6c53484cfc3bffee3ba53182e80ebe6
Author: Sergey Berezin <sergeyberezin@google.com>
Date: Wed Jul 26 01:53:36 2017

CIPD: change backend service instance type to B4

This is to allow basic scaling which in turn allows longer (24h)
taskqueue task requests. Specifically, this allows to complete
verification of very large packages (multiple GB).

R=vadimsh@chromium.org
BUG=747094

Change-Id: I1889046ae4cb45e6450cdc4128a882ce38769f6b
Reviewed-on: https://chromium-review.googlesource.com/580373
Commit-Queue: Sergey Berezin <sergeyberezin@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>

[modify] https://crrev.com/e355fed6c6c53484cfc3bffee3ba53182e80ebe6/appengine/chrome_infra_packages/appengine_config.py
[modify] https://crrev.com/e355fed6c6c53484cfc3bffee3ba53182e80ebe6/appengine/chrome_infra_packages/module-backend.yaml

I'm deploying the new version to production. I'll split traffic first for the "default" module to make sure I don't break it. The backend won't benefit from splitting, so I'll just migrate it directly.
All seems clear, migrating to 100%.
Now is the real test: I'll upload the new Xcode package and see how it goes.
No cheese:

[P13008 22:17:55.467 storage.go:67 I] cipd: uploading - 100%
[P13008 22:17:56.281 client.go:775 I] cipd: uploading - verifying
[P13008 22:17:57.387 client.go:775 I] cipd: uploading - verifying
[P13008 22:17:59.037 client.go:775 I] cipd: uploading - verifying
...
[P13008 22:22:51.968 client.go:775 I] cipd: uploading - verifying
[P13008 22:22:56.078 client.go:772 W] cipd: upload of 898ceacadb78c161a107fa83c8f1b1e4be739219 failed: timeout
Error: timeout while waiting for CAS service to finalize the upload.

D'oh, I forgot about the default -verification-timeout=5m. Setting it to 60m, trying again.
It eventually succeeded on the backend. Took ~16 min.
Status: Fixed (was: Started)
Huzzah! It only took 15m to verify :-)

[P21704 12:55:59.914 storage.go:67 I] cipd: uploading - 100%
[P21704 12:56:00.414 client.go:775 I] cipd: uploading - verifying
...
[P21704 13:11:34.980 client.go:775 I] cipd: uploading - verifying
[P21704 13:11:39.090 client.go:768 I] cipd: successfully uploaded 5dd9dda6e94e529b0230d43850d621b3222ee42e
[P21704 13:11:39.090 client.go:951 I] cipd: registering infra_internal/ios/xcode:5dd9dda6e94e529b0230d43850d621b3222ee42e
[P21704 13:11:39.435 client.go:966 I] cipd: instance infra_internal/ios/xcode:5dd9dda6e94e529b0230d43850d621b3222ee42e was successfully registered
[P21704 13:11:39.436 client.go:1025 I] cipd: attaching tag xcode_version:9.0
[P21704 13:11:39.436 client.go:1025 I] cipd: attaching tag build_version:9M174d
[P21704 13:11:39.601 client.go:1034 I] cipd: all tags attached

real	852m14.330s
user	16m46.552s
sys	3m16.167s
