
Issue 615997

Starred by 3 users

Issue metadata

Status: Archived
Owner:
Last visit > 30 days ago
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug-Regression




Possible slow response of hwid service (18:05 canary runs failing bvt-inline on many systems)

Project Member Reported by djkurtz@chromium.org, May 31 2016

Issue description

The "18:05" canary run failed every night for the past 4 days for many boards (both x86 & arm):

Some examples:

https://chromegw.corp.google.com/i/chromeos/builders/peach_pi-release
https://chromegw.corp.google.com/i/chromeos/builders/auron_paine-release

Example failure snippet:

https://chromegw.corp.google.com/i/chromeos/builders/auron_paine-release/builds/73/steps/HWTest%20%5Bbvt-inline%5D/logs/stdio

"""
Reason: Suite job failed or provisioning failed.

Output below this line is for buildbot consumption:
@@@STEP_LINK@[Test-Logs]: Suite job: ABORT@http://cautotest/tko/retrieve_logs.cgi?job=/results/65190657-chromeos-test/@@@
@@@STEP_LINK@[Flake-Dashboard]: Suite job@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=Suite job@@@
Will return from run_suite with status: INFRA_FAILURE
21:11:00: INFO: RunCommand: /b/cbuild/internal_master/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpL6XJfH/tmpiJ5giG/temp_summary.json --raw-cmd --task-name auron_paine-release/R53-8392.0.0-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 10800 --io-timeout 10800 --hard-timeout 10800 --expiration 1200 -- /usr/local/autotest/site_utils/run_suite.py --build auron_paine-release/R53-8392.0.0 --board auron_paine --suite_name bvt-inline --pool bvt --num 6 --file_bugs True --priority Build --timeout_mins 120 --retry True --max_retries 10 --minimum_duts 4 --suite_min_duts 6 --offload_failures_only False --json_dump -m 65190657
#JSON_START#{"autotest_instance": "cautotest", "return_code": 3, "return_message": "Suite job failed or provisioning failed.", "suite_job_id": 65190657, "suite_timings": {"artifact_download_end": "2016-05-30 20:44:59", "download_start": "2016-05-30 20:44:00", "payload_download_end": "2016-05-30 20:44:03", "suite_start": "2016-05-30 20:44:56", "tests_end": "2016-05-30 20:52:26", "tests_start": "2016-05-30 20:51:43"}, "tests": {"Suite job": {"attributes": [], "link_to_logs": "http://cautotest/tko/retrieve_logs.cgi?job=/results/65190657-chromeos-test/", "reason": "", "retry_count": 0, "status": "ABORT", "wmatrix_link": "https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=Suite job"}, "platform_DMVerityBitCorruption.first": {"attributes": ["suite:bvt-inline", "subsystem:default"], "link_to_logs": "http://cautotest/tko/retrieve_logs.cgi?job=/results/65190830-chromeos-test/", "reason": "completed successfully", "retry_count": 0, "status": "GOOD"}, "platform_DMVerityBitCorruption.last": {"attributes": ["suite:bvt-inline", "subsystem:default"], "link_to_logs": "http://cautotest/tko/retrieve_logs.cgi?job=/results/65190830-chromeos-test/", "reason": "completed successfully", "retry_count": 0, "status": "GOOD"}, "platform_DMVerityBitCorruption.middle": {"attributes": ["suite:bvt-inline", "subsystem:default"], "link_to_logs": "http://cautotest/tko/retrieve_logs.cgi?job=/results/65190830-chromeos-test/", "reason": "completed successfully", "retry_count": 0, "status": "GOOD"}}}#JSON_END#
21:11:37: INFO: Running cidb query on pid 8885, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x8589810>
21:11:42: INFO: cidb query succeeded after 1 retries
"""
 
Owner: dshi@chromium.org
This sounds like we are scheduling more work than we can handle, perhaps because we have test suites scheduled to overlap with this load?

Comment 2 by dshi@chromium.org, May 31 2016

Cc: jrbarnette@chromium.org kevcheng@chromium.org
Summary: reset takes over 5min to finish (was: 18:05 canary runs failing bvt-inline on many systems)
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row8-rack5-host9/57279-reset/20163005211049/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row8-rack5-host9/57250-reset/20163005204607/debug/

05/30 20:46:37.386 INFO |            repair:0279| Verifying this condition: Ensure the host labels are up-to-date
05/30 20:48:45.481 DEBUG|          ssh_host:0180| Running (ssh) 'which ectool'

From the status logs, it seems that verify.label is the step causing the issue. From the reset jobs running more recently, I'm seeing the same issue, so it's not a problem related to AFE RPC load. More likely, it's something related to the board that leads to a slow verify.label.

+kevcheng and jrbarnette who worked on verify.label

Comment 3 by dshi@chromium.org, May 31 2016

Cc: -kevcheng@chromium.org dshi@chromium.org
Owner: kevcheng@chromium.org
For the question in #1, it's not related to test overload. I checked that pool/board:

./site_utils/host_history.py --board auron_paine --pool bvt --start "2016-05-30 20:00:00" --end "2016-05-30 22:00:00"
Overall stats for hosts: board:auron_paine, pool:bvt
 2016-05-30 20:00:00 - 2016-05-30 22:00:00
 Number of total hosts: 8
    Verifying:        0.00 %
    Running:          18.54 %
    Ready:            50.63 %
    Repairing:        3.04 %
    Repair Failed:    0.00 %
    Cleaning:         0.00 %
    Pending:          6.18 %
    Resetting:        5.32 %
    Provisioning:     16.30 %
    Locked:           0.00 %

The DUTs were only used about 50% of the time. The suite timeout is only 24 mins, so a 5 min reset job can really lead to the timeout.
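
(For reference, a minimal Python sketch of the utilization math behind a report like the one above; the (state, duration) sample values and the helper name are made up for illustration, and this is not the actual site_utils/host_history.py implementation.)

# Hypothetical sketch: express each host state as a percentage of the total
# host-time in the window, given (state, duration_seconds) intervals summed
# across all hosts in the pool. Not the real host_history.py code.
from collections import defaultdict

def state_percentages(intervals, total_host_seconds):
    """intervals: iterable of (state, duration_seconds) across all hosts."""
    totals = defaultdict(float)
    for state, seconds in intervals:
        totals[state] += seconds
    return {state: 100.0 * secs / total_host_seconds
            for state, secs in totals.items()}

# 8 hosts observed over a 2-hour window = 8 * 7200 host-seconds.
window = 8 * 2 * 3600
sample = [("Running", 10680), ("Ready", 29160), ("Provisioning", 9390),
          ("Resetting", 3065), ("Pending", 3560), ("Repairing", 1750)]
for state, pct in sorted(state_percentages(sample, window).items()):
    print("%-14s %6.2f %%" % (state + ":", pct))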

Comment 4 by dshi@chromium.org, May 31 2016

Another cause of the timeout is provision failure:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host3/57256-provision/20163005204610/debug/

05/30 20:51:54.242 DEBUG|        base_utils:0269| [stdout] Restoring state to factory_install with dev_image.
05/30 21:05:22.829 DEBUG|        base_utils:0269| [stdout] [0530/210521:INFO:update_engine_client.cc(224)] Update succeeded -- reboot needed.

The restore took 14 min to finish. 
05/30 20:51:54.242 DEBUG|        base_utils:0269| [stdout] Restoring state to factory_install with dev_image.
05/30 21:05:22.829 DEBUG|        base_utils:0269| [stdout] [0530/210521:INFO:update_engine_client.cc(224)] Update succeeded -- reboot needed.

I believe the lines above don't indicate a long time for the "Restoring state to ..." operation. That's actually running in parallel with the AU. What took the time was downloading the new image for the update: that started at 20:47 and didn't finish until the log message at 21:05, 18 minutes later.

I should add: provision also runs verify(), which invokes the label detection. So, if label detection takes 5 minutes, that'll add to the time of every provision, too.

Cc: haddowk@chromium.org
Summary: Possible slow response of hwid service (was: reset takes over 5min to finish)
The entire label process took over 4 mins (it should take a couple of seconds):
05/30 21:11:06.626 INFO |            repair:0279| Verifying this condition: Ensure the host labels are up-to-date
05/30 21:15:27.685 INFO |        server_job:0128| 	GOOD	----	verify.label	timestamp=1464668127	localtime=May 30 21:15:27


I note two slowdown spots:
05/30 21:11:06.626 INFO |            repair:0279| Verifying this condition: Ensure the host labels are up-to-date
05/30 21:13:14.469 DEBUG|          ssh_host:0180| Running (ssh) 'which ectool'


05/30 21:13:15.512 DEBUG|        base_utils:0288| [stdout] PAINE D25-B2B-B6A-A7H
05/30 21:15:22.988 DEBUG|          ssh_host:0180| Running (ssh) 'cat /etc/ui_use_flags.txt'

Both happen during access to the hwid service. Perhaps network congestion is causing a slow response? Or maybe we're getting rate-limited by GAE?
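
(For locating gaps like these mechanically, a minimal Python sketch that flags large jumps between consecutive timestamped lines; the "MM/DD HH:MM:SS.mmm" prefix is assumed from the snippets above, and the log path and 60-second threshold are illustrative only.)

# Sketch: report gaps larger than a threshold between consecutive
# timestamped lines in an autotest-style debug log.
import re
from datetime import datetime

TS_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")

def find_gaps(lines, threshold_secs=60, year=2016):
    prev_ts, prev_line = None, None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime("%d/%s" % (year, m.group(1)),
                               "%Y/%m/%d %H:%M:%S.%f")
        if prev_ts and (ts - prev_ts).total_seconds() > threshold_secs:
            yield (ts - prev_ts).total_seconds(), prev_line.rstrip(), line.rstrip()
        prev_ts, prev_line = ts, line

if __name__ == "__main__":
    with open("debug/autoserv.DEBUG") as f:  # hypothetical log path
        for gap, before, after in find_gaps(f):
            print("%.1fs gap:\n  %s\n  %s" % (gap, before, after))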

Keith, it looks like it's taking 2 mins to get a response back from the service during the night. Do you see anything fishy on the GAE side of things, or those kinds of response times?
I do not think so; the graphs show that the average response time from the server is 36 ms, and the worst case is one trace I found that takes 2 seconds. It is possible there is a delay somewhere upstream of my server in the apiary system, but I doubt it. I will try to dig in to see if I can get numbers.

https://screenshot.googleplex.com/ptK5rDDLa6B

I also see zero errors. If you were being rate-limited you would get an error right back saying "Unauthorized" or similar; is there any logging like that?

I have successfully managed to run at 60 qps in the past; at the moment we are only running at 8-10 qps.
5/30 is right at the edge of history for the apiary stack console, but even with apiary overhead the average response time is in the 100-200 ms range, with very occasional spikes up to 2 s.

https://screenshot.googleplex.com/sUTpJLWt0TN

I also see no quota issues

https://apiary-global-borgmon.corp.google.com/t/apiz?apiz=chromeoshwid&job=apiserver
Thanks for looking into that, Keith.

I'll need to add some more logging then to see what else is causing the holdup.
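
(A minimal Python sketch of the kind of timing instrumentation that could be added; timed() and fetch_hwid_info() are made-up names for illustration, not the actual autotest API.)

# Sketch: time a suspect step and warn when it exceeds a threshold.
# logging stands in for the autotest logger; fetch_hwid_info() is a
# placeholder for whatever call the label detection actually makes.
import logging
import time
from contextlib import contextmanager

@contextmanager
def timed(step_name, warn_after_secs=5.0):
    """Log how long a step took; warn if it exceeds the threshold."""
    start = time.time()
    try:
        yield
    finally:
        elapsed = time.time() - start
        log = logging.warning if elapsed > warn_after_secs else logging.debug
        log("%s took %.2fs", step_name, elapsed)

# Hypothetical call site inside the label detection code:
# with timed("hwid service lookup"):
#     info = fetch_hwid_info(hwid)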
Labels: -Pri-2 Pri-1
Summary: Possible slow response of hwid service (18:05 canary runs failing bvt-inline on many systems) (was: Possible slow response of hwid service)
I just debugged failing builds on oak-release, and tracked it down to what looks like the exact same always-failing "18:06" builds symptom:

https://chromegw.corp.google.com/i/chromeos/builders/oak-release

And a quick spot-check shows a similar pattern for peach_pit, although it does not fail every night:

https://chromegw.corp.google.com/i/chromeos/builders/peach_pit-release?numbuilds=200
Label auto-detection has been disabled since 7/12 and has not been re-enabled yet (a pending CL is going through the CQ right now: https://chromium-review.googlesource.com/#/c/368472/).

HWID is not the cause of these failures.
Labels: akeshet-pending-downgrade
ChromeOS Infra P1 Bugscrub.

P1 Bugs in this component should be important enough to get weekly status updates.

Is this already fixed?  -> Fixed
Is this no longer relevant? -> Archived or WontFix
Is this not a P1, based on go/chromeos-infra-bug-slo rubric? -> lower priority.
Is this a Feature Request rather than a bug? Type -> Feature
Is this missing important information or scope needed to decide how to proceed? -> Ask question on bug, possibly reassign.
Does this bug have the wrong owner? -> reassign.

Bugs that remain in this state next week will be downgraded to P2.

Comment 14 by sosa@chromium.org, Jul 18 2017

Status: Archived (was: Assigned)
