Possible slow response of hwid service (18:05 canary runs failing bvt-inline on many systems)
Issue description

The "18:05" canary run failed every night for the past 4 days for many boards (both x86 & arm). Some examples:

https://chromegw.corp.google.com/i/chromeos/builders/peach_pi-release
https://chromegw.corp.google.com/i/chromeos/builders/auron_paine-release

Example failure snippet: https://chromegw.corp.google.com/i/chromeos/builders/auron_paine-release/builds/73/steps/HWTest%20%5Bbvt-inline%5D/logs/stdio

"""
Reason: Suite job failed or provisioning failed.
Output below this line is for buildbot consumption:
@@@STEP_LINK@[Test-Logs]: Suite job: ABORT@http://cautotest/tko/retrieve_logs.cgi?job=/results/65190657-chromeos-test/@@@
@@@STEP_LINK@[Flake-Dashboard]: Suite job@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=Suite job@@@
Will return from run_suite with status: INFRA_FAILURE
21:11:00: INFO: RunCommand: /b/cbuild/internal_master/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpL6XJfH/tmpiJ5giG/temp_summary.json --raw-cmd --task-name auron_paine-release/R53-8392.0.0-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 10800 --io-timeout 10800 --hard-timeout 10800 --expiration 1200 -- /usr/local/autotest/site_utils/run_suite.py --build auron_paine-release/R53-8392.0.0 --board auron_paine --suite_name bvt-inline --pool bvt --num 6 --file_bugs True --priority Build --timeout_mins 120 --retry True --max_retries 10 --minimum_duts 4 --suite_min_duts 6 --offload_failures_only False --json_dump -m 65190657
#JSON_START#{"autotest_instance": "cautotest", "return_code": 3, "return_message": "Suite job failed or provisioning failed.", "suite_job_id": 65190657, "suite_timings": {"artifact_download_end": "2016-05-30 20:44:59", "download_start": "2016-05-30 20:44:00", "payload_download_end": "2016-05-30 20:44:03", "suite_start": "2016-05-30 20:44:56", "tests_end": "2016-05-30 20:52:26", "tests_start": "2016-05-30 20:51:43"}, "tests": {"Suite job": {"attributes": [], "link_to_logs": "http://cautotest/tko/retrieve_logs.cgi?job=/results/65190657-chromeos-test/", "reason": "", "retry_count": 0, "status": "ABORT", "wmatrix_link": "https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=Suite job"}, "platform_DMVerityBitCorruption.first": {"attributes": ["suite:bvt-inline", "subsystem:default"], "link_to_logs": "http://cautotest/tko/retrieve_logs.cgi?job=/results/65190830-chromeos-test/", "reason": "completed successfully", "retry_count": 0, "status": "GOOD"}, "platform_DMVerityBitCorruption.last": {"attributes": ["suite:bvt-inline", "subsystem:default"], "link_to_logs": "http://cautotest/tko/retrieve_logs.cgi?job=/results/65190830-chromeos-test/", "reason": "completed successfully", "retry_count": 0, "status": "GOOD"}, "platform_DMVerityBitCorruption.middle": {"attributes": ["suite:bvt-inline", "subsystem:default"], "link_to_logs": "http://cautotest/tko/retrieve_logs.cgi?job=/results/65190830-chromeos-test/", "reason": "completed successfully", "retry_count": 0, "status": "GOOD"}}}#JSON_END#
21:11:37: INFO: Running cidb query on pid 8885, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x8589810>
21:11:42: INFO: cidb query succeeded after 1 retries
"""
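For anyone scripting against this output: the #JSON_START#...#JSON_END# markers above delimit the machine-readable suite summary from run_suite.py. A minimal sketch of pulling that blob out of a captured stdout log (this is not the real chromite/buildbot parser, and the log filename is just an example):

import json
import re

def extract_run_suite_json(stdout_text):
    # Grab the JSON summary between the #JSON_START# / #JSON_END# markers,
    # matching the format shown in the snippet above.
    match = re.search(r'#JSON_START#(.*?)#JSON_END#', stdout_text, re.DOTALL)
    if not match:
        return None
    return json.loads(match.group(1))

# Example: report the overall return code and any non-GOOD tests.
summary = extract_run_suite_json(open('run_suite_stdout.log').read())
if summary:
    print('return_code:', summary['return_code'])
    for name, result in summary['tests'].items():
        if result['status'] != 'GOOD':
            print(name, result['status'], result['link_to_logs'])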
May 31 2016
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row8-rack5-host9/57279-reset/20163005211049/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row8-rack5-host9/57250-reset/20163005204607/debug/

05/30 20:46:37.386 INFO | repair:0279| Verifying this condition: Ensure the host labels are up-to-date
05/30 20:48:45.481 DEBUG| ssh_host:0180| Running (ssh) 'which ectool'

From the status logs, it seems that verify.label is the step causing the issue. From the reset jobs running more recently, I'm seeing the same issue, so it's not a problem related to AFE RPC load. More likely, it's something related to the board that leads to a slow verify.label.

+kevcheng and jrbarnette, who worked on verify.label.
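The two-minute gap above (20:46:37 to 20:48:45) is the kind of thing that can be found mechanically. A rough sketch of scanning a debug log for large gaps between consecutive timestamped lines (this is not an autotest tool; it just assumes the 'MM/DD HH:MM:SS.mmm' prefix seen in these logs, and the log path is a placeholder):

import datetime
import re

TS_RE = re.compile(r'^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})')

def find_gaps(log_path, threshold_secs=60, year=2016):
    # Print pairs of consecutive log lines separated by more than
    # threshold_secs, based on the leading 'MM/DD HH:MM:SS.mmm' timestamp.
    prev_ts, prev_line = None, None
    with open(log_path) as f:
        for line in f:
            m = TS_RE.match(line)
            if not m:
                continue
            ts = datetime.datetime.strptime(
                '%d/%s' % (year, m.group(1)), '%Y/%m/%d %H:%M:%S.%f')
            if prev_ts and (ts - prev_ts).total_seconds() > threshold_secs:
                print('%.0fs gap:\n  %s\n  %s' % (
                    (ts - prev_ts).total_seconds(),
                    prev_line.rstrip(), line.rstrip()))
            prev_ts, prev_line = ts, line

find_gaps('debug/autoserv.DEBUG', threshold_secs=60)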
May 31 2016
For the question in #1: it's not related to test overload. I checked that pool/board:
./site_utils/host_history.py --board auron_paine --pool bvt --start "2016-05-30 20:00:00" --end "2016-05-30 22:00:00"
Overall stats for hosts: board:auron_paine, pool:bvt
2016-05-30 20:00:00 - 2016-05-30 22:00:00
Number of total hosts: 8
Verifying: 0.00 %
Running: 18.54 %
Ready: 50.63 %
Repairing: 3.04 %
Repair Failed: 0.00 %
Cleaning: 0.00 %
Pending: 6.18 %
Resetting: 5.32 %
Provisioning: 16.30 %
Locked: 0.00 %
The DUTs were only used about 50% of the time. The suite timeout is only 24 mins, so a 5-minute reset job can really lead to the timeout.
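As a quick back-of-the-envelope check on that "~50% used" figure, summing the busy buckets from the host_history.py output above (just arithmetic on the numbers quoted, not an autotest tool):

# Percentages copied from the host_history.py output above.
stats = {
    'Running': 18.54, 'Repairing': 3.04, 'Pending': 6.18,
    'Resetting': 5.32, 'Provisioning': 16.30,
}
busy = sum(stats.values())   # ~49.4% of DUT time spent busy
print('busy: %.2f%%, ready/idle: %.2f%%' % (busy, 50.63))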
May 31 2016
Another cause of the timeout is provision failure:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host3/57256-provision/20163005204610/debug/

05/30 20:51:54.242 DEBUG| base_utils:0269| [stdout] Restoring state to factory_install with dev_image.
05/30 21:05:22.829 DEBUG| base_utils:0269| [stdout] [0530/210521:INFO:update_engine_client.cc(224)] Update succeeded -- reboot needed.

The restore took 14 min to finish.
May 31 2016
05/30 20:51:54.242 DEBUG| base_utils:0269| [stdout] Restoring state to factory_install with dev_image.
05/30 21:05:22.829 DEBUG| base_utils:0269| [stdout] [0530/210521:INFO:update_engine_client.cc(224)] Update succeeded -- reboot needed.

I believe the lines above don't indicate a long time for the "Restoring state to ..." operation; that's actually running in parallel with the AU. What took time was downloading the new image for the update. That started at 20:47 and didn't finish until the log message at 21:05, 18 minutes later.
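For the record, the 18 minutes is just the wall-clock gap between the download start and the update-succeeded message; a trivial check (the 20:47 start time is approximate, taken from the comment above rather than a quoted log line):

import datetime

download_start = datetime.datetime(2016, 5, 30, 20, 47, 0)   # approximate, per the comment
update_done    = datetime.datetime(2016, 5, 30, 21, 5, 22)   # 0530/210521 log line
print(update_done - download_start)   # ~0:18:22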
May 31 2016
I should add: Provision also runs verify(), which invokes the label detection. So, if label detection takes 5 minutes, that'll add to the time of every provision, too.
Jun 2 2016
The entire label process took over 4 minutes (it should take a couple of seconds):

05/30 21:11:06.626 INFO | repair:0279| Verifying this condition: Ensure the host labels are up-to-date
05/30 21:15:27.685 INFO | server_job:0128| GOOD ---- verify.label timestamp=1464668127 localtime=May 30 21:15:27

I note two slowdown spots:

05/30 21:11:06.626 INFO | repair:0279| Verifying this condition: Ensure the host labels are up-to-date
05/30 21:13:14.469 DEBUG| ssh_host:0180| Running (ssh) 'which ectool'

05/30 21:13:15.512 DEBUG| base_utils:0288| [stdout] PAINE D25-B2B-B6A-A7H
05/30 21:15:22.988 DEBUG| ssh_host:0180| Running (ssh) 'cat /etc/ui_use_flags.txt'

Both of these happen during access to the hwid service. Perhaps network congestion is causing a slow response? Or maybe we're getting rate-limited by GAE?

Keith, it looks like it's taking 2 minutes to get a response back from the service during the night. Do you see anything fishy on the GAE side of things, or those kinds of response times?
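One way to separate server-side slowness from something local (DNS, proxy, network path from the drone) would be to time the request directly from the client side. A rough sketch only; the endpoint URL below is a placeholder, not the real chromeoshwid API path, and this isn't the code path autotest actually uses:

import time
import urllib2  # Python 2, matching the autotest codebase of that era

HWID_URL = 'https://example.googleapis.com/chromeoshwid/v1/...'  # placeholder

def time_hwid_request(url, attempts=5):
    # Issue a few requests and print per-request latency to see whether
    # the multi-minute delays reproduce from this machine.
    for i in range(attempts):
        start = time.time()
        try:
            urllib2.urlopen(url, timeout=300).read()
            status = 'ok'
        except Exception as e:
            status = 'error: %s' % e
        print('attempt %d: %.2fs (%s)' % (i, time.time() - start, status))

time_hwid_request(HWID_URL)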
Jun 2 2016
I do not think so. The graphs show that the average response time from the server is 36 ms; the worst case is one trace I found that takes 2 seconds. It is possible there is a delay somewhere upstream of my server in the apiary system, but I doubt it; I will try to dig in to see if I can get numbers. https://screenshot.googleplex.com/ptK5rDDLa6B

I also see zero errors. If you were being rate-limited, you would get an error right back saying "Unauthorized" or similar; is there any logging like that? I have successfully managed to run at 60 qps in the past, and at the moment we are only running 8-10 qps.
Jun 2 2016
5/30 is right at the edge of history for the apiary stack console, but even with apiary overhead the average response time is in the 100-200 ms range, with very occasional spikes up to 2 s: https://screenshot.googleplex.com/sUTpJLWt0TN

I also see no quota issues: https://apiary-global-borgmon.corp.google.com/t/apiz?apiz=chromeoshwid&job=apiserver
Jun 2 2016
Thanks for looking into that, Keith. I'll need to add some more logging then to see what else is causing the holdup.
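If it helps, one low-effort way to add that logging is a small timing wrapper around the suspect calls. A generic sketch, not tied to the actual autotest label-detection code paths (the decorated function name below is hypothetical):

import logging
import time
from functools import wraps

def log_duration(label):
    # Decorator that logs how long the wrapped call took, so slow spots
    # (e.g. the HWID lookup vs. the ssh commands around it) show up in the logs.
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                logging.info('%s took %.1fs', label, time.time() - start)
        return wrapper
    return decorator

# Hypothetical usage around the label-detection helpers:
# @log_duration('hwid service lookup')
# def get_hwid_labels(host): ...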
Aug 22 2016
I just debugged failing builds on oak-release and tracked it down to what looks like the exact same symptom of always-failing "18:06" builds: https://chromegw.corp.google.com/i/chromeos/builders/oak-release

A quick spot-check shows a similar pattern for peach_pit, although it does not fail every night: https://chromegw.corp.google.com/i/chromeos/builders/peach_pit-release?numbuilds=200
Aug 22 2016
Label auto-detection has been disabled since 7/12 and has not been re-enabled yet (a pending CL is going through the CQ right now: https://chromium-review.googlesource.com/#/c/368472/). HWID is not the cause of these failures.
Jul 17 2017
ChromeOS Infra P1 Bugscrub. P1 bugs in this component should be important enough to get weekly status updates.
- Is this already fixed? -> Fixed
- Is this no longer relevant? -> Archived or WontFix
- Is this not a P1, based on the go/chromeos-infra-bug-slo rubric? -> lower priority
- Is this a Feature Request rather than a bug? -> Type: Feature
- Is this missing important information or scope needed to decide how to proceed? -> Ask a question on the bug, possibly reassign
- Does this bug have the wrong owner? -> Reassign
Bugs that remain in this state next week will be downgraded to P2.
Jul 18 2017