New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 731214 link

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

[Canary] autoupdate_EndToEndTest failures of timeout on receiving finished notification

Project Member Reported by mcchou@chromium.org, Jun 8 2017

Issue description

The following builder failed at autoupdate_EndToEndTest with timeout of finished notification which is different from  crbug.com/730134  and  crbug.com/730141 .

Builder: reef-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/reef-release/builds/1187
Logs from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122207283-chromeos-test/chromeos2-row4-rack6-host16/:
06/08 07:55:05.182 INFO |autoupdate_EndToEn:0274| Expecting event_result=1:success version=9630.0.0 event_type=14:download_finished previous_version=any, within 600 seconds
06/08 08:05:05.631 ERROR|autoupdate_EndToEn:0310| Timeout expired
06/08 08:05:05.632 ERROR|autoupdate_EndToEn:0278| Failed expected event: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).

Builder: chell-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/chell-release/builds/1177
Logs from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122207822-chromeos-test/chromeos4-row12-rack9-host11/:
06/08 08:04:38.699 INFO |autoupdate_EndToEn:0274| Expecting event_result=1:success version=9630.0.0 event_type=14:download_finished previous_version=any, within 600 seconds
06/08 08:14:39.678 ERROR|autoupdate_EndToEn:0310| Timeout expired
06/08 08:14:39.680 ERROR|autoupdate_EndToEn:0278| Failed expected event: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
Logs from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122208115-chromeos-test/chromeos4-row12-rack9-host3/:
06/08 08:52:40.413 INFO |autoupdate_EndToEn:0274| Expecting event_result=1:success version=9630.0.0 event_type=14:download_finished previous_version=any, within 600 seconds
06/08 09:02:40.454 ERROR|autoupdate_EndToEn:0310| Timeout expired
06/08 09:02:40.455 ERROR|autoupdate_EndToEn:0278| Failed expected event: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).

Builder: pyro-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/pyro-release/builds/714
Logs from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122208673-chromeos-test/chromeos2-row4-rack4-host5/:
06/08 08:22:14.412 INFO |autoupdate_EndToEn:0274| Expecting event_result=1:success version=9630.0.0 event_type=14:download_finished previous_version=any, within 600 seconds
06/08 08:32:14.511 ERROR|autoupdate_EndToEn:0310| Timeout expired
06/08 08:32:14.513 ERROR|autoupdate_EndToEn:0278| Failed expected event: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).

Builder: stumpy-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/stumpy-release/builds/5312
Logs from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122193972-chromeos-test/chromeos4-row2-rack8-host20/:
06/08 06:47:23.270 INFO |autoupdate_EndToEn:0274| Expecting event_result=1:success version=9630.0.0 event_type=14:download_finished previous_version=any, within 600 seconds
06/08 06:57:23.835 ERROR|autoupdate_EndToEn:0310| Timeout expired
06/08 06:57:23.836 ERROR|autoupdate_EndToEn:0278| Failed expected event: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).

Builder: parrot-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/parrot-release/builds/3836
Logs from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122185628-chromeos-test/chromeos6-row2-rack4-host2/:
06/08 06:25:37.974 INFO |autoupdate_EndToEn:0274| Expecting event_result=1:success version=9630.0.0 event_type=14:download_finished previous_version=any, within 600 seconds
06/08 06:35:38.922 ERROR|autoupdate_EndToEn:0310| Timeout expired
06/08 06:35:38.923 ERROR|autoupdate_EndToEn:0278| Failed expected event: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
Logs from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122185653-chromeos-test/chromeos6-row2-rack3-host2/:
06/08 06:30:40.740 INFO |autoupdate_EndToEn:0274| Expecting event_result=1:success version=9630.0.0 event_type=14:download_finished previous_version=any, within 600 seconds
06/08 06:40:41.601 ERROR|autoupdate_EndToEn:0310| Timeout expired
06/08 06:40:41.601 ERROR|autoupdate_EndToEn:0278| Failed expected event: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).

Builder: lumpy-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/lumpy-release/builds/5046
Logs from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122197192-chromeos-test/chromeos6-row2-rack7-host22/:
06/08 06:55:53.710 INFO |autoupdate_EndToEn:0274| Expecting event_result=1:success version=9630.0.0 event_type=14:download_finished previous_version=any, within 600 seconds
06/08 07:05:54.649 ERROR|autoupdate_EndToEn:0310| Timeout expired
06/08 07:05:54.649 ERROR|autoupdate_EndToEn:0278| Failed expected event: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
Logs from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122197194-chromeos-test/chromeos6-row2-rack8-host11/:
06/08 06:56:33.906 INFO |autoupdate_EndToEn:0274| Expecting event_result=1:success version=9630.0.0 event_type=14:download_finished previous_version=any, within 600 seconds
06/08 07:06:34.781 ERROR|autoupdate_EndToEn:0310| Timeout expired
06/08 07:06:34.782 ERROR|autoupdate_EndToEn:0278| Failed expected event: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).

Builder: reks-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/reks-release/builds/1194
Logs from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122203330-chromeos-test/chromeos4-row11-rack7-host2/:
06/08 07:31:50.169 INFO |autoupdate_EndToEn:0274| Expecting event_result=1:success version=9630.0.0 event_type=14:download_finished previous_version=any, within 600 seconds
06/08 07:41:51.135 ERROR|autoupdate_EndToEn:0310| Timeout expired
06/08 07:41:51.136 ERROR|autoupdate_EndToEn:0278| Failed expected event: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
 
Builder: cyan-release
Build:    https://chromegw.corp.google.com/i/chromeos/builders/cyan-release/builds/1196
Logs from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122194707-chromeos-test/chromeos4-row11-rack10-host5/:
06/08 06:49:56.331 INFO |autoupdate_EndToEn:0274| Expecting event_result=1:success version=9630.0.0 event_type=14:download_finished previous_version=any, within 600 seconds
06/08 06:59:56.530 ERROR|autoupdate_EndToEn:0310| Timeout expired
06/08 06:59:56.531 ERROR|autoupdate_EndToEn:0278| Failed expected event: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
Cc: adlr@chromium.org
Builder: auron_paine-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/auron_paine-release/builds/1194
Logs:
  autoupdate_EndToEndTest.paygen_au_canary_delta     FAIL: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
  autoupdate_EndToEndTest.paygen_au_canary_delta     retry_count: 1

Builder: auron_yuna-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/auron_yuna-release/builds/1194
Logs:
  autoupdate_EndToEndTest.paygen_au_canary_delta     FAIL: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
  autoupdate_EndToEndTest.paygen_au_canary_delta     retry_count: 1

Builder: banjo-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/banjo-release/builds/1179
Logs:
  autoupdate_EndToEndTest.paygen_au_dev_delta     FAIL: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
  autoupdate_EndToEndTest.paygen_au_dev_delta     retry_count: 1

Builder: candy-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/candy-release/builds/1194
Logs:
  autoupdate_EndToEndTest.paygen_au_canary_delta     FAIL: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
  autoupdate_EndToEndTest.paygen_au_canary_delta     retry_count: 1

Builder: gandof-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/gandof-release/builds/1186
Logs:
  autoupdate_EndToEndTest.paygen_au_canary_delta   [ FAILED ]
  autoupdate_EndToEndTest.paygen_au_canary_delta     FAIL: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
  autoupdate_EndToEndTest.paygen_au_canary_delta     retry_count: 1

Builder: gnawty-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/gnawty-release/builds/1194
Logs:
  autoupdate_EndToEndTest.paygen_au_canary_delta   [ FAILED ]
  autoupdate_EndToEndTest.paygen_au_canary_delta     FAIL: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
  autoupdate_EndToEndTest.paygen_au_canary_delta     retry_count: 1

Builder: heli-release
Bulid:   https://chromegw.corp.google.com/i/chromeos/builders/heli-release/builds/1197
Logs:
  autoupdate_EndToEndTest.paygen_au_canary_delta   [ FAILED ]
  autoupdate_EndToEndTest.paygen_au_canary_delta     FAIL: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
  autoupdate_EndToEndTest.paygen_au_canary_delta     retry_count: 1

Builder: lulu-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/lulu-release/builds/1196
Logs:
  autoupdate_EndToEndTest.paygen_au_canary_delta   [ FAILED ]
  autoupdate_EndToEndTest.paygen_au_canary_delta     FAIL: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
  autoupdate_EndToEndTest.paygen_au_canary_delta     retry_count: 1

Builder: tidus-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/tidus-release/builds/1191
Logs:
  autoupdate_EndToEndTest.paygen_au_canary_delta     FAIL: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
  autoupdate_EndToEndTest.paygen_au_canary_delta     retry_count: 1

Builder: veyron_minnie
Build:   https://chromegw.corp.google.com/i/chromeos/builders/veyron_minnie-release/builds/1198
Logs:
  autoupdate_EndToEndTest.paygen_au_dev_delta     FAIL: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
  autoupdate_EndToEndTest.paygen_au_dev_delta     retry_count: 1

Builder: veyron_speedy-release
Build:   https://chromegw.corp.google.com/i/chromeos/builders/veyron_speedy-release/builds/1194
Logs:
  autoupdate_EndToEndTest.paygen_au_canary_delta   [ FAILED ]
  autoupdate_EndToEndTest.paygen_au_canary_delta     FAIL: Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
  autoupdate_EndToEndTest.paygen_au_canary_delta     retry_count: 1
There is a chance this is related to my change:
   https://chromium-review.googlesource.com/c/502369/
   update_engine_client: add O_DSYNC flags

However, my first thought is 600 seconds (10 minutes) might not be enough time depending on how busy the device is and what the latency is to download parts.

Despite my change being a one-liner, it fundamentally changes the client behavior. We really needs this change for at least two customer facing issues. I'll talk with gwendal about this.
I looked at the stumpy failure above and it looks like the test passed:
   https://chromegw.corp.google.com/i/chromeos/builders/stumpy-release/builds/5312/steps/PaygenTestDev/logs/stdio


points at http://cautotest/tko/retrieve_logs.cgi?job=/results/122185183-chromeos-test/ which redirects me to:

   https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122185183-chromeos-test/chromeos4-row2-rack8-host18/

status.log contents:
START	autoupdate_EndToEndTest.paygen_au_dev_full	autoupdate_EndToEndTest.paygen_au_dev_full	timestamp=1496926855	localtime=Jun 08 06:00:55	
	START	----	----	timestamp=1496926872	localtime=Jun 08 06:01:12	
		GOOD	----	sysinfo.before	timestamp=1496926872	localtime=Jun 08 06:01:12	
	END GOOD	----	----	timestamp=1496926872	localtime=Jun 08 06:01:12	
	START	----	----	timestamp=1496926882	localtime=Jun 08 06:01:22	
		GOOD	----	sysinfo.iteration.before	timestamp=1496926882	localtime=Jun 08 06:01:22	
	END GOOD	----	----	timestamp=1496926882	localtime=Jun 08 06:01:22	
	START	----	reboot	timestamp=1496926928	localtime=Jun 08 06:02:08	
		GOOD	----	reboot.start	timestamp=1496926928	localtime=Jun 08 06:02:08	
		GOOD	----	reboot.verify	timestamp=1496926949	localtime=Jun 08 06:02:29	
	END GOOD	----	reboot	kernel=3.8.11	localtime=Jun 08 06:02:29	timestamp=1496926949	
	START	----	reboot	timestamp=1496927668	localtime=Jun 08 06:14:28	
		GOOD	----	reboot.start	timestamp=1496927668	localtime=Jun 08 06:14:28	
		GOOD	----	reboot.verify	timestamp=1496927802	localtime=Jun 08 06:16:42	
	END GOOD	----	reboot	kernel=3.8.11	localtime=Jun 08 06:16:44	timestamp=1496927804	
	START	----	reboot	timestamp=1496928041	localtime=Jun 08 06:20:41	
		GOOD	----	reboot.start	timestamp=1496928041	localtime=Jun 08 06:20:41	
		GOOD	----	reboot.verify	timestamp=1496928110	localtime=Jun 08 06:21:50	
	END GOOD	----	reboot	kernel=3.8.11	localtime=Jun 08 06:21:51	timestamp=1496928111	
	START	----	----	timestamp=1496928164	localtime=Jun 08 06:22:44	
		GOOD	----	sysinfo.iteration.after	timestamp=1496928164	localtime=Jun 08 06:22:44	
	END GOOD	----	----	timestamp=1496928164	localtime=Jun 08 06:22:44	
	START	----	----	timestamp=1496928175	localtime=Jun 08 06:22:55	
		GOOD	----	sysinfo.after	timestamp=1496928175	localtime=Jun 08 06:22:55	
	END GOOD	----	----	timestamp=1496928175	localtime=Jun 08 06:22:55	
	GOOD	autoupdate_EndToEndTest.paygen_au_dev_full	autoupdate_EndToEndTest.paygen_au_dev_full	timestamp=1496928214	localtime=Jun 08 06:23:34	completed successfully
END GOOD	autoupdate_EndToEndTest.paygen_au_dev_full	autoupdate_EndToEndTest.paygen_au_dev_full	timestamp=1496928214	localtime=Jun 08 06:23:34


I don't understand what was happening for ~12 minutes between the first and second reboots (from this log). That is the only time delta that exceeds 10 minutes (600 seconds).

I am certain the O_DSYNC will slow down total throughput - but I have found no update engine (AU) perf requirement since it's running background. Happen to know where the 600 seconds timeout is specified? I'd like to extend that to 900 or 1200 seconds.
I have evidence the timeout is too short: looks like stumpy needs at least 12 minutes. This is not a slow machine. I'm going to propose a CL to extend the timeout to 1200 seconds.

Some key bits from the autoserv.DEBUG log file:
    https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122185183-chromeos-test/chromeos4-row2-rack8-host18/debug/

06/08 06:00:55.811 INFO |        server_job:0199| START	autoupdate_EndToEndTest.paygen_au_dev_full	autoupdate_EndToEndTest.paygen_au_dev_full	timestamp=1496926855	localtime=Jun 08 06:00:55	
...
06/08 06:01:08.925 DEBUG|          autotest:0380| Autotest job starts on remote host: chromeos4-row2-rack8-host18
...
06/08 06:01:30.567 DEBUG|autoupdate_EndToEn:1800| Choose the least loaded devserver: http://100.115.219.135:8082
06/08 06:01:30.567 INFO |autoupdate_EndToEn:0997| Staging images onto autotest devserver (http://100.115.219.135:8082)
06/08 06:01:30.569 INFO |        dev_server:1108| Staging artifacts on devserver http://100.115.219.135:8082: build=stable-channel/stumpy/8530.96.0/payloads, artifacts=None, files=['chromeos_8530.96.0_stumpy_stable-channel_full_test.bin-857dd3431a264c5554e54b810a58df7c'], archive_url=gs://chromeos-releases/stable-channel/stumpy/8530.96.0/payloads
...
06/08 06:01:35.994 INFO |        dev_server:1126| Finished staging artifacts: build=stable-channel/stumpy/8530.96.0, artifacts=None, files=['stateful.tgz'], archive_url=gs://chromeos-releases/stable-channel/stumpy/8530.96.0
06/08 06:01:35.998 INFO |autoupdate_EndToEn:1028| Source full payload from gs://chromeos-releases/stable-channel/stumpy/8530.96.0/payloads/chromeos_8530.96.0_stumpy_stable-channel_full_test.bin-857dd3431a264c5554e54b810a58df7c staged at http://100.115.219.135:8082/static/stable-channel/stumpy/8530.96.0/payloads/chromeos_8530.96.0_stumpy_stable-channel_full_test.bin-857dd3431a264c5554e54b810a58df7c
...
06/08 06:02:07.119 INFO |autoupdate_EndToEn:1147| Installing a source image on the DUT
...
06/08 06:02:08.919 DEBUG|          ssh_host:0286| Running (ssh) '( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!'
...
06/08 06:02:29.320 DEBUG|      abstract_ssh:0587| Host chromeos4-row2-rack8-host18 is now up
...
06/08 06:02:32.271 WARNI|autoupdate_EndToEn:0486| Waiting for devserver to start up.
...
06/08 06:02:33.965 INFO |autoupdate_EndToEn:0494| Devserver pid is 29320, serving on port 45779
06/08 06:02:33.966 WARNI|autoupdate_EndToEn:0510| Waiting for devserver to accept network requests.
...
06/08 06:02:35.487 DEBUG|          ssh_host:0286| Running (ssh) '/usr/bin/update_engine_client --update --omaha_url=http://100.115.219.135:45779/update'
06/08 06:02:35.737 ERROR|             utils:0298| [stderr] [0608/060234:INFO:update_engine_client.cc(471)] Forcing an update by setting app_version to ForcedUpdate.
06/08 06:02:35.739 ERROR|             utils:0298| [stderr] [0608/060234:INFO:update_engine_client.cc(473)] Initiating update check and install.
06/08 06:02:35.740 ERROR|             utils:0298| [stderr] [0608/060234:INFO:update_engine_client.cc(502)] Waiting for update to complete.
**** TEST TIMEOUT HERE? ****
06/08 06:14:12.348 ERROR|             utils:0298| [stderr] [0608/061411:INFO:update_engine_client.cc(224)] Update succeeded -- reboot needed.
...
06/08 06:14:13.540 DEBUG|             utils:0298| [stdout] Downloading stateful payload from http://100.115.219.135:8082/static/stable-channel/stumpy/8530.96.0/stateful.tgz
...
06/08 06:14:25.947 DEBUG|             utils:0298| [stdout] Successfully downloaded update.
06/08 06:14:25.947 DEBUG|             utils:0298| [stdout] Restoring state to factory_install with dev_image.
...



Oops... I think I handed this bug to gwendal too quickly. 

"Failed to finish download" is not the same as the "python missing" bug he's been working on.

I really don't understand what this failure means yet.
Owner: grundler@chromium.org
I think the failure is related to my change but it's a problem with the test that extending the timeout will fix. I'll submit a CL shortly to extend the timeout...just need to find out where it is. :)
Status: Started (was: Untriaged)
I need to find the source to this test it seems:
   https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=autoupdate_EndToEndTest.paygen_au_dev_full

cs.corp isn't finding it. :(
dgarett stopped by and pointed me at the right file which specifies the 600 second timeout:
  chromite/lib/paygen/signer_payloads_client.py: _CleanSignerFilesByKeyset()

I've submitted:
   https://chromium-review.googlesource.com/528629
   paygen: increase EndToEnd timeout to 1200 seconds        

Data below supports the change to 1200 seconds (20 minutes).



So far,  7 boards have failed autoupdate_EndToEndTest.paygen_au_dev_full since my change landed:
   https://wmatrix.googleplex.com/failures/unfiltered?tests=autoupdate_EndToEndTest.paygen_au_dev_full

[Tip: search for "bin/update_engine_client" in the autoserv.DEBUG log]

  BOARD   Storage   UpdateTime[logs below]
  chell    eMMC     10:28  [1]
  lumpy    SATA     13:23  [2]
  parrot   HDD      18:37  [3]
  pyro     ---      12:01  [5]
  reef     ---      11:27  [6]
  sand     ---      11:29  [7]
  stumpy   SATA     11:59 (provisioning and timeout after 600 sec on test update) 

On a successive test that passed:
  stumpy   SATA     11:37  [4] (provisioning)

IIRC, stumpy and lumpy use the same M.2 SanDisk part that caused grief before we even released those products.

I was expecting more eMMC boards to fail with timeouts. But it's possible NEWER eMMC parts have lower write speeds due to MLC (older/faster) vs TLC (Newer/Lower $/bit cost).  Gwendal might know for sure - just my first guess.


There are two failure modes it seems and I'm only going address the "Failed to receive a download finished notification (download_finished) within 600 seconds" symptom.

The other failure seems to be "Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test." and is very likely unrelated to my change.




[1] CHELL UPDATE LOG SNIPPETS
06/08 08:33:11.441 DEBUG|             utils:0203| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:33632/check_health?"''
06/08 08:33:12.648 INFO |       autoupdater:0302| Updating image via: /usr/bin/update_engine_client --update --omaha_url=http://100.115.219.136:33632/update
06/08 08:33:12.650 DEBUG|          ssh_host:0286| Running (ssh) '/usr/bin/update_engine_client --update --omaha_url=http://100.115.219.136:33632/update'
06/08 08:33:13.105 ERROR|             utils:0298| [stderr] [0608/083312:INFO:update_engine_client.cc(471)] Forcing an update by setting app_version to ForcedUpdate.
06/08 08:33:13.106 ERROR|             utils:0298| [stderr] [0608/083312:INFO:update_engine_client.cc(473)] Initiating update check and install.
06/08 08:33:13.143 ERROR|             utils:0298| [stderr] [0608/083312:INFO:update_engine_client.cc(502)] Waiting for update to complete.
06/08 08:44:41.644 ERROR|             utils:0298| [stderr] [0608/084440:INFO:update_engine_client.cc(224)] Update succeeded -- reboot needed.
06/08 08:44:41.688 WARNI|       autoupdater:0315| Unable to parse build name /update for metrics. Continuing anyway.
06/08 08:44:41.690 DEBUG|          ssh_host:0286| Running (ssh) '/usr/bin/update_engine_client -status | grep CURRENT_OP'
06/08 08:44:42.102 ERROR|             utils:0298| [stderr] [0608/084441:INFO:update_engine_client.cc(493)] Querying Update Engine status...
06/08 08:44:42.139 DEBUG|             utils:0298| [stdout] CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT
06/08 08:44:42.143 INFO |       autoupdater:0521| Updating stateful partition...
...
06/08 08:46:16.548 DEBUG|             utils:0298| [stdout] Downloading command returns code 0.
06/08 08:46:16.548 DEBUG|             utils:0298| [stdout] Successfully downloaded update.
06/08 08:46:16.548 DEBUG|             utils:0298| [stdout] Restoring state to factory_install with dev_image.
...
06/08 08:48:03.506 DEBUG|          ssh_host:0286| Running (ssh) '/usr/bin/update_engine_client --check_for_update --omaha_url=http://100.115.219.136:43002/update'
...
06/08 08:58:14.060 ERROR|autoupdate_EndToEn:0310| Timeout expired


[2] LUMPY UPDATE LOG SNIPPETS
06/08 14:29:33.630 DEBUG|          ssh_host:0286| Running (ssh) '/usr/bin/update_engine_client --update --omaha_url=http://100.115.185.227:40618/update'
...
06/08 14:42:56.454 DEBUG|          ssh_host:0286| Running (ssh) '/usr/bin/update_engine_client -status | grep CURRENT_OP'
06/08 14:42:56.817 ERROR|             utils:0298| [stderr] [0608/144255:INFO:update_engine_client.cc(493)] Querying Update Engine status...
06/08 14:42:56.819 DEBUG|             utils:0298| [stdout] CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT
06/08 14:42:56.821 INFO |       autoupdater:0521| Updating stateful partition...
...
06/08 14:42:57.950 DEBUG|             utils:0298| [stdout] Downloading stateful payload from http://100.115.185.227:8082/static/dev-channel/lumpy/9631.0.0/stateful.tgz
...
06/08 14:45:21.275 DEBUG|          ssh_host:0286| Running (ssh) '/usr/bin/update_engine_client --check_for_update --omaha_url=http://100.115.185.227:44601/update'
...
06/08 14:55:28.377 ERROR|autoupdate_EndToEn:0310| Timeout expired


[3] PARROT UPDATE LOG SNIPPETS
06/08 06:09:01.141 DEBUG|          ssh_host:0286| Running (ssh) '/usr/bin/update_engine_client --update --omaha_url=http://100.115.185.227:46773/update'
...
06/08 06:27:38.762 ERROR|             utils:0298| [stderr] [0608/062737:INFO:update_engine_client.cc(224)] Update succeeded -- reboot needed.
...
06/08 06:31:43.930 DEBUG|          ssh_host:0286| Running (ssh) '/usr/bin/update_engine_client --check_for_update --omaha_url=http://100.115.185.227:46303/update'
...
06/08 06:41:49.873 ERROR|autoupdate_EndToEn:0310| Timeout expired


[4] STUMPY UPDATE LOG SNIPPETS
06/08 06:02:35.487 DEBUG|          ssh_host:0286| Running (ssh) '/usr/bin/update_engine_client --update --omaha_url=http://100.115.219.135:45779/update'
...
06/08 06:14:12.348 ERROR|             utils:0298| [stderr] [0608/061411:INFO:update_engine_client.cc(224)] Update succeeded -- reboot needed.
...
06/08 06:17:22.141 DEBUG|          ssh_host:0286| Running (ssh) '/usr/bin/update_engine_client --check_for_update --omaha_url=http://100.115.219.135:43124/update'
...
06/08 06:17:29.046 INFO |autoupdate_EndToEn:0274| Expecting event_result=1:success version=8530.96.0 event_type=14:download_finished previous_version=any, within 600 seconds
06/08 06:19:29.238 INFO |autoupdate_EndToEn:0374| Consumed new event: {u'event_result': '1', u'event_type': '14', u'track': 'stable-channel', u'timestamp': '2017-06-08 06:19:29', u'version': '8530.96.0', u'board': 'stumpy'}
06/08 06:19:29.239 INFO |autoupdate_EndToEn:0301| Event received after 120.1 seconds
...
06/08 06:21:51.283 DEBUG|          ssh_host:0286| Running (ssh) '/usr/bin/update_engine_client --check_for_update --omaha_url=http://100.115.219.135:43124/update'
...
06/08 06:21:51.520 INFO |autoupdate_EndToEn:0274| Expecting event_result=2:success_reboot version=9630.0.0 event_type=3:update_complete previous_version=8530.96.0 OR event_result=1:success version=9630.0.0 event_type=54:rebooted_after_update previous_version=8530.96.0, within 900 seconds
...
06/08 06:22:00.262 INFO |autoupdate_EndToEn:0374| Consumed new event: {u'event_result': '1', u'event_type': '54', u'previous_version': '8530.96.0', u'track': 'stable-channel', u'timestamp': '2017-06-08 06:21:59', u'version': '9630.0.0', u'board': 'stumpy'}
06/08 06:22:00.263 INFO |autoupdate_EndToEn:0301| Event received after 8.7 seconds

[5] pyro: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122255948-chromeos-test/chromeos2-row4-rack4-host1/debug/

[6] reef: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122207426-chromeos-test/chromeos2-row4-rack5-host3/debug/

[7] sand: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122206647-chromeos-test/chromeos4-row7-rack11-host5/debug/
Project Member

Comment 12 by bugdroid1@chromium.org, Jun 9 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/178afc7c3cf5b5d74c944328b62209b679a61677

commit 178afc7c3cf5b5d74c944328b62209b679a61677
Author: Grant Grundler <grundler@google.com>
Date: Fri Jun 09 03:16:02 2017

paygen: increase EndToEnd timeout to 1200 seconds

https://chromium-review.googlesource.com/c/502369/ adds O_DSYNC
to update_engine and this is required to fix OOM issues with
"small mem" (ie gale; 512MB RAM) devices.

One side effect is this slows down write throughput.
However, we want slower throughput on consumer devices since
we don't want to affect other activities (like video playback
or hangouts) during an update.  While this is "suboptimal" for
lab provisioning, we can add a "lab only" parameter to NOT
use O_DSYNC in Chromium test lab.

BUG= chromium:731214 
TEST=./signer_payloads_client_unittest
OK (skipped=2)
Note: 2 network test(s) skipped; use --network to run them.

Change-Id: Iec15fbd647b6483f6f303a7f7d748814edf9f532
Reviewed-on: https://chromium-review.googlesource.com/528629
Reviewed-by: Grant Grundler <grundler@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Tested-by: Grant Grundler <grundler@chromium.org>
Trybot-Ready: Grant Grundler <grundler@chromium.org>
Commit-Queue: Grant Grundler <grundler@chromium.org>

[modify] https://crrev.com/178afc7c3cf5b5d74c944328b62209b679a61677/lib/paygen/signer_payloads_client.py

Summary: [Canary] autoupdate_EndToEndTest failures of timeout on receiving finished notification (was: [Canary] autoupdate_EndToEndTest failures)
Other tests are still failing with this symptom:
   https://uberchromegw.corp.google.com/i/chromeos/builders/chell-release
   https://uberchromegw.corp.google.com/i/chromeos/builders/candy-release

I changed the wrong place. jrbarnette pointed out:
    _WAIT_FOR_DOWNLOAD_COMPLETED_SECONDS = 10 * 60

in server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py

since that is associated with the failure message:
1675             chain.add_event(
1676                     ExpectedUpdateEvent(
1677                         event_type=EVENT_TYPE_DOWNLOAD_FINISHED,
1678                         event_result=EVENT_RESULT_SUCCESS,
1679                         version=source_release,
1680                         on_error=self._error_download_finished),
1681                     self._WAIT_FOR_DOWNLOAD_COMPLETED_SECONDS,
1682                     on_timeout=self._timeout_err(
1683                             'a download finished notification',
1684                             self._WAIT_FOR_DOWNLOAD_COMPLETED_SECONDS,
1685                             event_type=EVENT_TYPE_DOWNLOAD_FINISHED))


[Personally, I would have NOT approved this code just for it's coding style.]

I've uploaded another CL to address this and would appreciate any help to land this since quite a few release builders (canaries) are red now due to this.

   https://chromium-review.googlesource.com/530054
Project Member

Comment 15 by bugdroid1@chromium.org, Jun 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/f3c5dcd63105ae2f98a9dbb4da1cdbeacd49e832

commit f3c5dcd63105ae2f98a9dbb4da1cdbeacd49e832
Author: Grant Grundler <grundler@google.com>
Date: Sat Jun 10 04:44:04 2017

autotest: increase "download" timeout to 1200 sec

Timeouts started showing up once this CL landed:
	https://chromium-review.googlesource.com/c/502369/

CL502369 adds O_DSYNC to update_engine and this is required
to fix OOM issues with "small mem" (ie gale; 512MB RAM) devices.
OOM reports doubled while any new release was pushed.

CL502369 has two side effects because writes are no longer asynchronous:
1) slightly slower write throughput
2) timeout now covers ALL data reaching media.

(1) is primarily due to not "pipelining" preparing (uncompressing) data
to be written and having to wait for the current write() to complete.

(2) isn't as certain but I expect "download completed" event to be
issued when the last write() completes. Thus if write() is asynchronous,
the storage device is "slow" (perhaps even busy with other activity),
quite a bit of data could still be in the linux kernel buffer cache
when the "download finished" event is sent back to autotest.

However, we _want_ slower throughput on consumer devices in order
to avoid competing with other activities (like video playback or
hangouts) during an update.  While this is "suboptimal" for lab
provisioning. We will have to follow up with additional changes
improve update performance again in the Chrome OS Test lab.

BUG= chromium:731214 
TEST=none (will try to run cbuildbot - no promises)

Change-Id: Ia0f7446d6639f06423de00b0650f6b2ce30383b7
Reviewed-on: https://chromium-review.googlesource.com/530054
Commit-Ready: Grant Grundler <grundler@chromium.org>
Tested-by: Grant Grundler <grundler@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Grant Grundler <grundler@chromium.org>

[modify] https://crrev.com/f3c5dcd63105ae2f98a9dbb4da1cdbeacd49e832/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py

Status: Fixed (was: Started)
I'm going to claim victory for now. My changes plus recent changes from ahassani (use BLKDISCARD instead of BLKSECDISCARD) should help resolve most of this issue.

If there are specific platforms still intermittently failing this test, I think it would be good to open a new bug specific to those platforms.
Status: Verified (was: Fixed)

Sign in to add a comment