CIPD fails to verify large packages
Issue description

Uploading a 13GB package (compressed to ~6GB):

[P28705 21:03:20.336 storage.go:67 I] cipd: uploading - 99%
[P28705 21:03:25.303 storage.go:67 I] cipd: uploading - 100%
[P28705 21:03:25.793 client.go:775 I] cipd: uploading - verifying
[P28705 21:03:26.871 client.go:775 I] cipd: uploading - verifying
[...]
[P28705 21:08:19.095 client.go:775 I] cipd: uploading - verifying
[P28705 21:08:23.173 client.go:775 I] cipd: uploading - verifying
[P28705 21:08:27.273 client.go:772 W] cipd: upload of 07d235xxxxxxxxxxxxxxxxxx failed: timeout
Error: timeout while waiting for CAS service to finalize the upload.

The root cause is a 10m timeout in a single taskqueue task:
https://chromium.googlesource.com/infra/infra/+/master/appengine/chrome_infra_packages/cas/impl.py#411
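[Editor's note: for context, a minimal sketch of the kind of per-task deadline guard that produces this failure. This is an illustration only; the function, constant, and exception names are assumptions, not the actual code in impl.py.]

    import time

    TASK_DEADLINE_SEC = 10 * 60  # the 10m taskqueue limit mentioned above

    class VerificationTimeout(Exception):
        """Raised when a single verification task runs out of time."""

    def verify_in_one_task(read_chunk, hasher):
        # Hash the uploaded blob chunk by chunk. If hashing a multi-GB
        # blob does not finish within the task deadline, the task fails
        # and the client eventually reports "failed: timeout".
        started = time.time()
        while True:
            chunk = read_chunk()
            if not chunk:
                return hasher.hexdigest()
            hasher.update(chunk)
            if time.time() - started > TASK_DEADLINE_SEC:
                raise VerificationTimeout('deadline hit before hashing finished')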
Jul 20 2017
No, MD5 is an awful hash for this sort of stuff. We use hashes as a security mechanism (to some extent), not just as a checksum. Even SHA1 is now considered inappropriate, and we'll need to migrate to SHA256 at some point.
Jul 20 2017
OK, thanks, good to know. On to the next problem: apparently, hashlib algos are not natively serializable / resumable. There is a workaround: https://stackoverflow.com/questions/2130892/persisting-hashlib-state but I need to look closer at whether it'll work on AppEngine.
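[Editor's note: to illustrate the problem being described: hashlib objects wrap C-level state and refuse to be pickled, so a hash in progress cannot simply be saved between taskqueue tasks. A quick self-contained demonstration:]

    import hashlib
    import pickle

    h = hashlib.sha1()
    h.update(b'first chunk of a very large blob')
    try:
        # Naively persisting the in-progress hash fails: the object
        # wraps opaque C state that pickle cannot serialize.
        pickle.dumps(h)
    except TypeError as e:
        print('cannot persist hash state:', e)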
Jul 20 2017
ctypes (and OpenSSL) won't work on GAE. There may be something in the PyCrypto package that IS available. Doing the hashing in pure Python will be very slow :( (and will cost even more CPU $$$), so I'm against it. The last resort is to move verification outside of GAE Standard (into GKE or GAE Flex), but that is a big design change that is out of scope for now.
Jul 20 2017
PyCrypto uses hashlib for SHA1 under the hood: https://github.com/dlitz/pycrypto/blob/master/lib/Crypto/Hash/SHA1.py#L51 So, I guess the answer is "no"...
Jul 21 2017
From talking to vadimsh@ offline and more reading, it turns out GAE supports B4-class instances, which support manual or basic scaling, both of which allow taskqueue tasks to run for up to 24h:
https://cloud.google.com/appengine/docs/standard/python/an-overview-of-app-engine#scaling_types_and_instance_classes
https://cloud.google.com/appengine/docs/standard/#instance_classes
I'm inclined to use "basic" scaling for the "backend" service and set "max_instances" to something definitely sufficient, like 10. It normally runs only 1 instance.
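[Editor's note: for reference, a minimal sketch of what this looks like in a GAE standard module .yaml. The exact contents here are an illustration, not the landed change; see the commit further down for the real diff to module-backend.yaml.]

    # module-backend.yaml (sketch): B4 instance class with basic scaling.
    module: backend
    runtime: python27
    api_version: 1
    instance_class: B4
    basic_scaling:
      max_instances: 10   # "definitely sufficient"; normally only 1 runs
      idle_timeout: 10m   # assumption: shut idle instances down after 10m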
Jul 21 2017
SGTM. Basic scaling looks ideal for our case.
Jul 21 2017
When uploading a tainted version to the -dev instance, GAE failed to verify that Cloud Endpoints were updated, and the logs show this crash:
/_ah/spi/BackendService.getApiConfigs
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 240, in Handle
handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 351, in __getattr__
self._update_configs()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 287, in _update_configs
self._registry.initialize()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 160, in initialize
import_func(self._modname)
File "/base/data/home/apps/s~chrome-infra-packages-dev/10860-aaf3a2c-tainted-sergeyberezin.402814690065330475/appengine_config.py", line 20, in <module>
import google.protobuf
ImportError: No module named protobuf
How did it work before??
Jul 21 2017
OK, I give up for now. It's a known issue: the bug is in AppEngine code (it doesn't declare "google" as a namespace package), and there is a workaround in https://github.com/google/protobuf/issues/1296 :

"...import google.appengine would fail, but import google.protobuf; import google.appengine would work, presumably due to some namespace package magic on Python 2.7. We fixed it by adding a gae.pth file to our local site-packages that forces google to be a namespace package: import google; import pkgutil; pkgutil.extend_path(google.__path__, google.__name__)."

I'll try that on Monday. Reverted the -dev instance for the time being.
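[Editor's note: spelled out, the workaround from that issue is a one-line .pth file. The file name and location follow the quote above; lines in a .pth file that begin with "import" are executed by site.py at interpreter startup, and lines starting with "#" are skipped.]

    # gae.pth -- drop into the local site-packages directory.
    # The single import line below is executed at startup and forces
    # "google" to become a namespace package, so that google.appengine
    # and google.protobuf can both be imported afterwards.
    import google; import pkgutil; pkgutil.extend_path(google.__path__, google.__name__)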
Jul 21 2017
FTR, I had lots of "fun" with this last year, to no useful resolution: https://crbug.com/666985#c6
Jul 21 2017
It worked fine until https://chromium.googlesource.com/infra/infra/+/f3d37c5bee642fdea02905cb395765187707f698. Then it broke, then it was supposedly fixed, but I haven't tried it for c-i-p yet. (I knew the fix worked for cr-buildbucket, which is structurally similar.) Make sure to run gclient sync and gclient runhooks.
Jul 25 2017
Updated infra.git to the latest commit and ran gclient sync. I still get the same problem after upload. BTW, I'm doing this from a Mac. I'll try it next on Linux, and if that doesn't work, I'll try a fresh checkout.
Jul 25 2017
I think it's just broken. We need to apply a fix similar to https://chromium.googlesource.com/infra/infra/+/ce86b096dc556b656ed8f6efdf999c8e378df9c0 to reverse the effects of https://chromium.googlesource.com/infra/infra/+/f3d37c5bee642fdea02905cb395765187707f698. (Both cr-buildbucket and chrome_infra_packages were broken, but only cr-buildbucket was fixed.)
Jul 26 2017
Yes! It worked! Thanks, Vadim! Landing the CL with this fix.
Jul 26 2017
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra/+/e355fed6c6c53484cfc3bffee3ba53182e80ebe6

commit e355fed6c6c53484cfc3bffee3ba53182e80ebe6
Author: Sergey Berezin <sergeyberezin@google.com>
Date: Wed Jul 26 01:53:36 2017

CIPD: change backend service instance type to B4

This is to allow basic scaling which in turn allows longer (24h) taskqueue task requests. Specifically, this allows to complete verification of very large packages (multiple GB).

R=vadimsh@chromium.org
BUG=747094
Change-Id: I1889046ae4cb45e6450cdc4128a882ce38769f6b
Reviewed-on: https://chromium-review.googlesource.com/580373
Commit-Queue: Sergey Berezin <sergeyberezin@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>

[modify] https://crrev.com/e355fed6c6c53484cfc3bffee3ba53182e80ebe6/appengine/chrome_infra_packages/appengine_config.py
[modify] https://crrev.com/e355fed6c6c53484cfc3bffee3ba53182e80ebe6/appengine/chrome_infra_packages/module-backend.yaml
Jul 26 2017
I'm deploying the new version to production. I'll split traffic first for the "default" module to make sure I don't break it. The backend won't benefit from splitting, so I'll just migrate it directly.
Jul 26 2017
All seems clear, migrating to 100%.
Jul 26 2017
Now for the real test: I'll upload the new Xcode package and see how it goes.
Jul 26 2017
No cheese:

[P13008 22:17:55.467 storage.go:67 I] cipd: uploading - 100%
[P13008 22:17:56.281 client.go:775 I] cipd: uploading - verifying
[P13008 22:17:57.387 client.go:775 I] cipd: uploading - verifying
[P13008 22:17:59.037 client.go:775 I] cipd: uploading - verifying
...
[P13008 22:22:51.968 client.go:775 I] cipd: uploading - verifying
[P13008 22:22:56.078 client.go:772 W] cipd: upload of 898ceacadb78c161a107fa83c8f1b1e4be739219 failed: timeout
Error: timeout while waiting for CAS service to finalize the upload.
Jul 26 2017
D'oh, I forgot about the default -verification-timeout=5m. Setting it to 60m, trying again.
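[Editor's note: for anyone hitting the same wall, the retry presumably looks something like the line below. The flag name and Go-style duration value come from the comment above; the subcommand and package file name are assumptions for illustration.]

    # Hypothetical invocation: register the built package with a longer
    # client-side verification timeout than the 5m default.
    cipd pkg-register xcode.cipd -verification-timeout 60m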
Jul 26 2017
It eventually succeeded on the backend. Took ~16 min.
Jul 26 2017
Huzzah! It only took 15m to verify :-)

[P21704 12:55:59.914 storage.go:67 I] cipd: uploading - 100%
[P21704 12:56:00.414 client.go:775 I] cipd: uploading - verifying
...
[P21704 13:11:34.980 client.go:775 I] cipd: uploading - verifying
[P21704 13:11:39.090 client.go:768 I] cipd: successfully uploaded 5dd9dda6e94e529b0230d43850d621b3222ee42e
[P21704 13:11:39.090 client.go:951 I] cipd: registering infra_internal/ios/xcode:5dd9dda6e94e529b0230d43850d621b3222ee42e
[P21704 13:11:39.435 client.go:966 I] cipd: instance infra_internal/ios/xcode:5dd9dda6e94e529b0230d43850d621b3222ee42e was successfully registered
[P21704 13:11:39.436 client.go:1025 I] cipd: attaching tag xcode_version:9.0
[P21704 13:11:39.436 client.go:1025 I] cipd: attaching tag build_version:9M174d
[P21704 13:11:39.601 client.go:1034 I] cipd: all tags attached

real	852m14.330s
user	16m46.552s
sys	3m16.167s