Ensure the android test runner still emits test results json even if a SIGTERM is caught |
|||||||
Issue descriptionSee https://chromium-swarm.appspot.com/task?id=407d3fa49b092e10 The test got sent a sigterm from swarming due to timeout. It caught the signal and quit execution... except it never printed the test results it had so far. That can wreak havoc w/ the recipe result-handling/retry-logic. If we can, let's dump the results if we get sigterm'ed.
,
Oct 25
Investigating.
,
Oct 25
I've confirmed that //build/android/test_runner.py will correctly populate the file sent to --test-launcher-summary-output= on SIGTERM. This suggests that swarming is likely not picking up the output from the file after it sends SIGTERM. maruel@, any ideas? I'll continue to dig in the meanwhile.
,
Oct 26
The properties passed to the swarming remote client can be seen here: https://apis-explorer.appspot.com/apis-explorer/?base=https://chromium-swarm.appspot.com/_ah/api#p/swarming/v1/swarming.task.request?task_id=407d3fa49b092e10&_h=2& To repro a local run with a 100s hard timeout, 30s grace period: 1) Patch https://chromium-review.googlesource.com/c/infra/luci/luci-py/+/1301720. 2) """ python ~/projects/infra/infra/luci/client/isolateserver.py download -I https://isolateserver.appspot.com --namespace default-gzip -s 7e65e8492576d1a2763ad8642d39ac3162116948 --target bar """ 3) """ ~/projects/infra/infra/luci/client/run_isolated.py -I https://isolateserver.appspot.com -s 7e65e8492576d1a2763ad8642d39ac3162116948 --hard-timeout=100 --grace-period=30 --test-launcher-summary-output=\${ISOLATED_OUTDIR}/output.json --leak-temp-dir """ Output from local run: """ C 99.319s Main Received SIGTERM. Stopping test execution. C 99.334s Main ******************************************************************************** C 99.334s Main Detailed Logs C 99.334s Main ******************************************************************************** C 99.361s Main ******************************************************************************** C 99.361s Main Summary C 99.361s Main ******************************************************************************** C 99.366s Main [==========] 2552 tests ran. C 99.366s Main [ PASSED ] 15 tests. C 99.366s Main ******************************************************************************** I 99.367s TimeoutThread-1-for-individual_device_tear_down(FA79E1A05388) [host]> /usr/local/google/home/erikchen/projects/test_isolates/bar/ir/third_party/android_tools/sdk/platform-tools/adb -s FA79E1A05388 shell '( su 0 sh -c '"'"'echo -n '"'"'"'"'"'"'"'"'_ --strict-mode=testing --enable-test-intents'"'"'"'"'"'"'"'"' > /data/local/tmp/test-cmdline-file'"'"' );echo %$?' I 99.483s TimeoutThread-1-for-individual_device_tear_down(FA79E1A05388) [host]> /usr/local/google/home/erikchen/projects/test_isolates/bar/ir/third_party/android_tools/sdk/platform-tools/adb -s FA79E1A05388 shell '( test -e /data/local/tmp/test-cmdline-file );echo %$?' I 99.486s TimeoutThread-1-for-run_tests_on_device(FA79E1A05388) Setting permissions for org.chromium.chrome. I 99.487s TimeoutThread-1-for-run_tests_on_device(FA79E1A05388) [host]> /usr/local/google/home/erikchen/projects/test_isolates/bar/ir/third_party/android_tools/sdk/platform-tools/adb -s FA79E1A05388 shell '( ( p=org.chromium.chrome;for q in android.permission.ACCESS_COARSE_LOCATION android.permission.ACCESS_FINE_LOCATION android.permission.CAMERA android.permission.GET_ACCOUNTS android.permission.READ_EXTERNAL_STORAGE android.permission.RECORD_AUDIO android.permission.WRITE_EXTERNAL_STORAGE;do pm grant "$p" "$q";echo "~X~$q~X~$?~X~";done )>/data/local/tmp/temp_file-57d76c4e6a40 2>&1 );echo %$?' I 99.548s TimeoutThread-1-for-individual_device_tear_down(FA79E1A05388) [host]> /usr/local/google/home/erikchen/projects/test_isolates/bar/ir/third_party/android_tools/sdk/platform-tools/adb -s FA79E1A05388 shell '( su 0 sh -c '"'"'TZ=utc ls -a -l /data/local/tmp/'"'"' );echo %$?' I 99.604s TimeoutThread-1-for-individual_device_tear_down(FA79E1A05388) [host]> /usr/local/google/home/erikchen/projects/test_isolates/bar/ir/third_party/android_tools/sdk/platform-tools/adb -s FA79E1A05388 shell '( su 0 sh -c '"'"'cat /data/local/tmp/test-cmdline-file'"'"' );echo %$?' I 99.657s individual_device_tear_down(FA79E1A05388) Flags now set on the device: ['--strict-mode=testing', '--enable-test-intents'] I 99.658s TimeoutThread-1-for-individual_device_tear_down(FA79E1A05388) [host]> /usr/local/google/home/erikchen/projects/test_isolates/bar/ir/third_party/android_tools/sdk/platform-tools/adb -s FA79E1A05388 shell '( am clear-debug-app );echo %$?' I 99.758s TimeoutThread-1-for-individual_device_tear_down(FA79E1A05388) [host]> /usr/local/google/home/erikchen/projects/test_isolates/bar/ir/third_party/android_tools/sdk/platform-tools/adb -s FA79E1A05388 shell '( su 0 sh -c '"'"'rm -f /data/local/tmp/chrome_timeout_scale'"'"' );echo %$?' I 99.837s Main Total time spent symbolizing: 0.32s I 99.986s tear_down_device(FA79E1A05388) Wrote device cache: /usr/local/google/home/erikchen/projects/test_isolates/bar/ir/out/Release/device_cache_FA79E1A05388.json I 100.062s Main Generated json results file at /usr/local/google/home/erikchen/projects/test_isolates/bar/itToVaZS/tmpau0JkL I 100.062s Main Finishing archiving output. [run_isolated_out_hack]{"hash":"35896db86fa9875984d2cfbcc133f4113f2f3d0d","namespace":"default-gzip","storage":"https://isolateserver.appspot.com"}[/run_isolated_out_hack] """ Output from swarming run above: """ C 3598.871s Main Received SIGTERM. Stopping test execution. C 3598.876s Main ******************************************************************************** C 3598.876s Main Detailed Logs C 3598.876s Main ******************************************************************************** C 3598.877s Main ******************************************************************************** C 3598.877s Main Summary C 3598.877s Main ******************************************************************************** C 3598.880s Main [==========] 203 tests ran. C 3598.880s Main [ PASSED ] 177 tests. C 3598.880s Main [ SKIPPED ] Skipped 10 tests, listed below: C 3598.880s Main [ SKIPPED ] org.chromium.chrome.browser.browserservices.TrustedWebActivityClientTest#doesntUseIconFromServiceIfStatusBarBitmapsSupported C 3598.880s Main [ SKIPPED ] org.chromium.chrome.browser.compositor.overlays.strip.TabStripTest#testSwitchStripStackersWithFirstTabSelected C 3598.881s Main [ SKIPPED ] org.chromium.chrome.browser.externalnav.ExternalNavigationHandlerTest#testCanExternalAppHandleUrl C 3598.881s Main [ SKIPPED ] org.chromium.chrome.browser.externalnav.ExternalNavigationHandlerTest#testInstantAppsIntent_incomingIntentRedirect C 3598.881s Main [ SKIPPED ] org.chromium.chrome.browser.externalnav.ExternalNavigationHandlerTest#testLaunchWebApk_ShowIntentPickerInvalidWebApk C 3598.881s Main [ SKIPPED ] org.chromium.chrome.browser.externalnav.ExternalNavigationHandlerTest#testPageTransitionType C 3598.881s Main [ SKIPPED ] org.chromium.chrome.browser.externalnav.ExternalNavigationHandlerTest#testRedirectToMarketWithReferrer C 3598.881s Main [ SKIPPED ] org.chromium.chrome.browser.externalnav.ExternalNavigationHandlerTest#testSameDomainDifferentApps C 3598.881s Main [ SKIPPED ] org.chromium.chrome.browser.notifications.NotificationPlatformBridgeTest#testReplyToNotification C 3598.881s Main [ SKIPPED ] org.chromium.chrome.browser.notifications.channels.ChannelsInitializerTest#testDeleteLegacyChannels_noopOnCurrentDefinitions C 3598.881s Main [ FAILED ] 15 tests, listed below: C 3598.882s Main [ FAILED ] org.chromium.chrome.browser.download.DownloadNotificationServiceTest#testServiceWillNotStopOnInterruptedDownload (TIMEOUT) C 3598.882s Main [ FAILED ] org.chromium.chrome.browser.feature_engagement.ScreenshotTabObserverTest#testScreenshotActionReporting (TIMEOUT) C 3598.882s Main [ FAILED ] org.chromium.chrome.browser.firstrun.FirstRunIntegrationTest#testAbortFirstRun (TIMEOUT) C 3598.882s Main [ FAILED ] org.chromium.chrome.browser.ntp.cards.NewTabPageRecyclerViewTest#testClickSuggestion (TIMEOUT) C 3598.882s Main [ FAILED ] org.chromium.chrome.browser.offlinepages.OfflinePageBridgeTest#testOfflinePageBridgeDisabledInIncognito (TIMEOUT) C 3598.882s Main [ FAILED ] org.chromium.chrome.browser.offlinepages.OfflinePageBridgeTest#testRemoveRequestsFromQueue (TIMEOUT) C 3598.882s Main [ FAILED ] org.chromium.chrome.browser.omnibox.OmniboxTest#testDefaultText with --disable-features=SpannableInlineAutocomplete (TIMEOUT) C 3598.882s Main [ FAILED ] org.chromium.chrome.browser.omnibox.OmniboxTest#testRequestZeroSuggestAfterDelete with --disable-features=SpannableInlineAutocomplete (TIMEOUT) C 3598.883s Main [ FAILED ] org.chromium.chrome.browser.omnibox.UrlBarTest#testBatchModeChangesTriggerCorrectSuggestions with --disable-features=SpannableInlineAutocomplete (TIMEOUT) C 3598.883s Main [ FAILED ] org.chromium.chrome.browser.partnercustomizations.PartnerDisableIncognitoModeIntegrationTest#testIncognitoMenuItemEnabledBasedOnParentalControls (TIMEOUT) C 3598.883s Main [ FAILED ] org.chromium.chrome.browser.payments.CurrencyFormatterTest#testMultipleConversions (TIMEOUT) C 3598.883s Main [ FAILED ] org.chromium.chrome.browser.payments.PaymentRequestEmailAndFreeShippingTest#testPaymentRequestEventsMetric (TIMEOUT) C 3598.883s Main [ FAILED ] org.chromium.chrome.browser.tabmodel.UndoTabModelTest#testTwoTabs (TIMEOUT) C 3598.883s Main [ FAILED ] org.chromium.chrome.browser.webapps.WebappModeTest#testWebappLaunches (TIMEOUT) C 3598.883s Main [ FAILED ] org.chromium.chrome.browser.webapps.WebappNavigationTest#testNewTabLinkToExternalApp (TIMEOUT) C 3598.883s Main C 3598.883s Main 15 FAILED TESTS C 3598.883s Main ******************************************************************************** I 3598.884s TimeoutThread-1-for-individual_device_tear_down(04d6c3b700f4400b) [host]> /b/swarming/w/ir/third_party/android_tools/sdk/platform-tools/adb -s 04d6c3b700f4400b shell '( echo -n '"'"'_ --strict-mode=testing --disable-features=SpannableInlineAutocomplete --enable-test-intents'"'"' > /data/local/tmp/test-cmdline-file );echo %$?' C 3601.282s run_tests_on_device(04d6c3b700f4400b) (TimeoutThread-2-for-run_tests_on_device(04d6c3b700f4400b)) Exception on _RunAdbCmd(<class 'devil.android.sdk.adb_wrapper.AdbWrapper'>, ['shell', '( mkdir -p /storage/emulated/legacy/tmp-c58ebcf96e5de );echo %$?'], check_error=False, retries=2, device_serial=04d6c3b700f4400b, ensure_logs_on_timeout=False, timeout=30), attempt 2 of 3: TimeoutError() I 3601.283s TimeoutThread-3-for-run_tests_on_device(04d6c3b700f4400b) [host]> /b/swarming/w/ir/third_party/android_tools/sdk/platform-tools/adb -s 04d6c3b700f4400b shell '( mkdir -p /storage/emulated/legacy/tmp-c58ebcf96e5de );echo %$?' """
,
Oct 26
And now, looking at the diff in these logs we see a problem: """ C 3601.282s run_tests_on_device(04d6c3b700f4400b) (TimeoutThread-2-for-run_tests_on_device(04d6c3b700f4400b)) Exception on _RunAdbCmd(<class 'devil.android.sdk.adb_wrapper.AdbWrapper'>, ['shell', '( mkdir -p /storage/emulated/legacy/tmp-c58ebcf96e5de );echo %$?'], check_error=False, retries=2, device_serial=04d6c3b700f4400b, ensure_logs_on_timeout=False, timeout=30), attempt 2 of 3: TimeoutError() """ ADB is timing out. We retry up to 3 times, waiting 30s each. This causes us to hit the 30s grace period timeout, at which point the process is killed.
,
Oct 26
Note to self: The easiest way to test is: 1) following instructions in c#4 to create isolated python files. 2) Apply the patch below, and then directly make changes to the python files. The changes will not be overwritten and can be directly tested.
,
Oct 26
There are two problems. 1) The SIGTERM signal handler in test_runner.py: https://cs.chromium.org/chromium/src/build/android/test_runner.py?l=745 Is being temporarily overwritten by the SIGTERM signal handler in local_device_test_run.py: https://cs.chromium.org/chromium/src/build/android/pylib/local/device/local_device_test_run.py?l=116 Which is eventually catching, and then ignoring the exception: https://cs.chromium.org/chromium/src/build/android/pylib/local/device/local_device_test_run.py?l=174 2) LocalDeviceEnvironment.Teardown: https://cs.chromium.org/chromium/src/build/android/pylib/local/device/local_device_environment.py?type=cs&q=local_device_environment.py+&g=0&l=234 LocalDeviceInstrumentationTestRun.Teardown: https://cs.chromium.org/chromium/src/build/android/pylib/local/device/local_device_instrumentation_test_run.py?type=cs&q=local_device_instrumentation_test_run.py&g=0&l=333 Both attempt to invoke ADB. If ADB is borked and timing out, then these may never return before the process is hard killed. If we get a SIGTERM, then we should attempt to write the output results and exit. We should not attempt to make any ADB calls, since ADB may timeout. I'll investigate to figure out the best way to get this desired behavior.
,
Oct 26
For (1), we may want to use signal_handler.AddSignalHandler rather than signal_handler.SignalHandler. For (2), we'll need to teach timeout_retry to not retry in the event of a SIGTERM: https://codesearch.chromium.org/chromium/src/third_party/catapult/devil/devil/utils/timeout_retry.py?rcl=687f318e303c48b791fa608da923c23b31cd6afe&l=168
,
Oct 29
Thanks. I believe that the following 4 changes should fix the problem: 1) signal_handler.SignalHandler -> signal_handler.AddSignalHandler 2) teach timeout_retry to not retry in the event of a SIGTERM 3) Change Teardown methods to not call ADB on SIGTERM 4) Propagate test results to test_runner.py even on SIGTERM
,
Oct 29
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/d4d9c16f1a163d205ac631236dea509d30e65ac4 commit d4d9c16f1a163d205ac631236dea509d30e65ac4 Author: erikchen <erikchen@chromium.org> Date: Mon Oct 29 18:10:42 2018 Fix status/string conversion for android json NOTRUN status. Previously, the if/else block was missing the NOTRUN status. Since the statuses are already represented by strings, this CL converts the strings to be expected outputs. This allows us to remove the entire if/else block. Bug: 895027 Change-Id: Ieb426eb268c30dd07a6642dcfd97122cdb07c926 Reviewed-on: https://chromium-review.googlesource.com/c/1305414 Reviewed-by: John Budorick <jbudorick@chromium.org> Commit-Queue: Erik Chen <erikchen@chromium.org> Cr-Commit-Position: refs/heads/master@{#603562} [modify] https://crrev.com/d4d9c16f1a163d205ac631236dea509d30e65ac4/build/android/pylib/base/base_test_result.py [modify] https://crrev.com/d4d9c16f1a163d205ac631236dea509d30e65ac4/build/android/pylib/results/json_results.py
,
Oct 29
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/43aaf3b8b671171f29ffc9bb74917ebb16bce5ea commit 43aaf3b8b671171f29ffc9bb74917ebb16bce5ea Author: erikchen <erikchen@chromium.org> Date: Mon Oct 29 20:10:41 2018 Propagate Android test results through exceptions. Previously, Android test results were propagated after all tests finished running. In the event of an Exception, none of the test results would be written to file. This CL slightly changes the control flow of TestRun.RunTests(). Instead of returning a results array, it now receives a results array as input and modifies the array as it receives test results. This allows the json_writer context TearDown() to serialize the partial test results. This CL updates local_device_test_run.py to have the correct flow. The flow of local_device_perf_test_run.py and local_machine_junit_test_run.py are left untouched -- they can be modified in the future if it's deemed necessary. This minimizes the size of this CL. Bug: 895027 Change-Id: I60fe78da993de366d678252b84a91aaff044c10b Reviewed-on: https://chromium-review.googlesource.com/c/1305396 Commit-Queue: Erik Chen <erikchen@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> Cr-Commit-Position: refs/heads/master@{#603599} [modify] https://crrev.com/43aaf3b8b671171f29ffc9bb74917ebb16bce5ea/build/android/pylib/base/test_run.py [modify] https://crrev.com/43aaf3b8b671171f29ffc9bb74917ebb16bce5ea/build/android/pylib/local/device/local_device_perf_test_run.py [modify] https://crrev.com/43aaf3b8b671171f29ffc9bb74917ebb16bce5ea/build/android/pylib/local/device/local_device_test_run.py [modify] https://crrev.com/43aaf3b8b671171f29ffc9bb74917ebb16bce5ea/build/android/pylib/local/machine/local_machine_junit_test_run.py [modify] https://crrev.com/43aaf3b8b671171f29ffc9bb74917ebb16bce5ea/build/android/test_runner.py
,
Oct 30
The following revision refers to this bug: https://chromium.googlesource.com/catapult/+/5e1c1c293b07ef04a247dd8dff50972d207663a4 commit 5e1c1c293b07ef04a247dd8dff50972d207663a4 Author: erikchen <erikchen@chromium.org> Date: Tue Oct 30 15:10:31 2018 ADB commands should only retry errors, not exceptions. By default, all exceptions were swallowed and retried, including exceptions generated by SIGTERM handler. This is undesired. Bug: chromium:895027 Change-Id: Ic6ce39eae466399e2543410a16906f154de47179 Reviewed-on: https://chromium-review.googlesource.com/c/1305714 Reviewed-by: John Budorick <jbudorick@chromium.org> Commit-Queue: Erik Chen <erikchen@chromium.org> [modify] https://crrev.com/5e1c1c293b07ef04a247dd8dff50972d207663a4/devil/devil/android/sdk/adb_wrapper.py
,
Oct 30
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c5466799a1a5b20be34cc4c2a3b90f52d6d70660 commit c5466799a1a5b20be34cc4c2a3b90f52d6d70660 Author: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Date: Tue Oct 30 17:52:33 2018 Roll src/third_party/catapult d15a3be3ab45..5e1c1c293b07 (1 commits) https://chromium.googlesource.com/catapult.git/+log/d15a3be3ab45..5e1c1c293b07 git log d15a3be3ab45..5e1c1c293b07 --date=short --no-merges --format='%ad %ae %s' 2018-10-30 erikchen@chromium.org ADB commands should only retry errors, not exceptions. Created with: gclient setdep -r src/third_party/catapult@5e1c1c293b07 The AutoRoll server is located here: https://autoroll.skia.org/r/catapult-autoroll Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary. CQ_INCLUDE_TRYBOTS=luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel BUG= chromium:895027 TBR=sullivan@chromium.org Change-Id: Ifa4ab1888a5a3fd36652fbda11af4e269463b18c Reviewed-on: https://chromium-review.googlesource.com/c/1307753 Reviewed-by: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Commit-Queue: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Cr-Commit-Position: refs/heads/master@{#603954} [modify] https://crrev.com/c5466799a1a5b20be34cc4c2a3b90f52d6d70660/DEPS
,
Oct 30
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/bf9552573217f0ca53fa4a57fd6de9af5c82881b commit bf9552573217f0ca53fa4a57fd6de9af5c82881b Author: erikchen <erikchen@chromium.org> Date: Tue Oct 30 20:48:02 2018 local_device_test_run.py should not swallow SIGTERM signal. Previously, local_device_test_run.py used SignalHandler() to catch SIGTERM, raise an exception, and then handle the exception. This prevents other SIGTERM handlers from ever receiving the exception. This CL changes local_device_test_run.py to instead use AddSignalHandler(), so that the exception is propagated to the previously registered SIGTERM handler. Bug: 895027 Change-Id: Ia6f84022ee4cf54e7dc36afad012acb3212b51e2 Reviewed-on: https://chromium-review.googlesource.com/c/1308753 Reviewed-by: John Budorick <jbudorick@chromium.org> Commit-Queue: Erik Chen <erikchen@chromium.org> Cr-Commit-Position: refs/heads/master@{#604004} [modify] https://crrev.com/bf9552573217f0ca53fa4a57fd6de9af5c82881b/build/android/pylib/local/device/local_device_test_run.py
,
Nov 1
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b16aeb4251477d70bf27694c52597fdb33d37359 commit b16aeb4251477d70bf27694c52597fdb33d37359 Author: erikchen <erikchen@chromium.org> Date: Thu Nov 01 14:56:20 2018 Context TearDown should not invoke ADB if process receives SIGTERM. The TearDown method on several contexts will invokes ADB. In some instances [e.g. on receiving a SIGTERM due to timeout] there's a high probability that ADB is non-responsive. In these cases, sending an ADB command will potentially take a long time to time out. Before this happens, the process will be hard-killed for not responding to SIGTERM fast enough. This CL makes the test_runner.py SIGTERM handler set allow_adb_on_teardown=False. This prevents these contexts from invoking ADB, increasing the likelihood that SIGTERM will handled within the 30 second limit and emit JSON test results. Bug: 895027 Change-Id: Ic30cff73a82a5415c5f531eae5c8f62fb4c953aa Reviewed-on: https://chromium-review.googlesource.com/c/1311259 Reviewed-by: John Budorick <jbudorick@chromium.org> Commit-Queue: Erik Chen <erikchen@chromium.org> Cr-Commit-Position: refs/heads/master@{#604584} [modify] https://crrev.com/b16aeb4251477d70bf27694c52597fdb33d37359/build/android/pylib/base/environment.py [modify] https://crrev.com/b16aeb4251477d70bf27694c52597fdb33d37359/build/android/pylib/base/test_run.py [modify] https://crrev.com/b16aeb4251477d70bf27694c52597fdb33d37359/build/android/pylib/local/device/local_device_environment.py [modify] https://crrev.com/b16aeb4251477d70bf27694c52597fdb33d37359/build/android/pylib/local/device/local_device_gtest_run.py [modify] https://crrev.com/b16aeb4251477d70bf27694c52597fdb33d37359/build/android/pylib/local/device/local_device_instrumentation_test_run.py [modify] https://crrev.com/b16aeb4251477d70bf27694c52597fdb33d37359/build/android/test_runner.py
,
Nov 1
The following revision refers to this bug: https://chromium.googlesource.com/catapult/+/971689c0597c1b92614e0700e3c5d2518f0e70ee commit 971689c0597c1b92614e0700e3c5d2518f0e70ee Author: Ben Pastene <bpastene@chromium.org> Date: Thu Nov 01 22:33:17 2018 Revert "ADB commands should only retry errors, not exceptions." This reverts commit 5e1c1c293b07ef04a247dd8dff50972d207663a4. Reason for revert: speculative fix for crbug.com/900511 Original change's description: > ADB commands should only retry errors, not exceptions. > > By default, all exceptions were swallowed and retried, including exceptions > generated by SIGTERM handler. This is undesired. > > Bug: chromium:895027 > Change-Id: Ic6ce39eae466399e2543410a16906f154de47179 > Reviewed-on: https://chromium-review.googlesource.com/c/1305714 > Reviewed-by: John Budorick <jbudorick@chromium.org> > Commit-Queue: Erik Chen <erikchen@chromium.org> TBR=erikchen@chromium.org,jbudorick@chromium.org # Not skipping CQ checks because original CL landed > 1 day ago. Bug: chromium:895027 Change-Id: I9b76f5fda03d1aed1f569652e34770170914e9e7 Reviewed-on: https://chromium-review.googlesource.com/c/1313251 Commit-Queue: Ben Pastene <bpastene@chromium.org> Reviewed-by: Ben Pastene <bpastene@chromium.org> Reviewed-by: Erik Chen <erikchen@chromium.org> [modify] https://crrev.com/971689c0597c1b92614e0700e3c5d2518f0e70ee/devil/devil/android/sdk/adb_wrapper.py
,
Nov 2
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9ce5a6743ef251e6509ff1eae579367fe7485964 commit 9ce5a6743ef251e6509ff1eae579367fe7485964 Author: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Date: Fri Nov 02 03:01:04 2018 Roll src/third_party/catapult b281d52b301d..971689c0597c (1 commits) https://chromium.googlesource.com/catapult.git/+log/b281d52b301d..971689c0597c git log b281d52b301d..971689c0597c --date=short --no-merges --format='%ad %ae %s' 2018-11-01 bpastene@chromium.org Revert "ADB commands should only retry errors, not exceptions." Created with: gclient setdep -r src/third_party/catapult@971689c0597c The AutoRoll server is located here: https://autoroll.skia.org/r/catapult-autoroll Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary. CQ_INCLUDE_TRYBOTS=luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel BUG= chromium:895027 TBR=sullivan@chromium.org Change-Id: I805d1b998211fe410ea719e9a3486f8f90dc61ae Reviewed-on: https://chromium-review.googlesource.com/c/1313689 Reviewed-by: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Commit-Queue: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Cr-Commit-Position: refs/heads/master@{#604825} [modify] https://crrev.com/9ce5a6743ef251e6509ff1eae579367fe7485964/DEPS
,
Nov 2
The following revision refers to this bug: https://chromium.googlesource.com/catapult/+/4527bffcdfd7d9f9e1dff0781f3824681eb3a0b0 commit 4527bffcdfd7d9f9e1dff0781f3824681eb3a0b0 Author: Erik Chen <erikchen@chromium.org> Date: Fri Nov 02 14:24:39 2018 [Reland 1] ADB commands should only retry errors, not exceptions. Previously, TimeoutErrors were exceptions rather than errors, so they were not being retried. This has been fixed. > ADB commands should only retry errors, not exceptions. > > By default, all exceptions were swallowed and retried, including exceptions > generated by SIGTERM handler. This is undesired. > > Bug: chromium:895027 > Change-Id: Ic6ce39eae466399e2543410a16906f154de47179 > Reviewed-on: https://chromium-review.googlesource.com/c/1305714 > Reviewed-by: John Budorick <jbudorick@chromium.org> > Commit-Queue: Erik Chen <erikchen@chromium.org> Bug: chromium:895027 Change-Id: Icec85af8b978942ec7699636e68b7500df32e0ba Reviewed-on: https://chromium-review.googlesource.com/c/1313609 Reviewed-by: John Budorick <jbudorick@chromium.org> Commit-Queue: Erik Chen <erikchen@chromium.org> [modify] https://crrev.com/4527bffcdfd7d9f9e1dff0781f3824681eb3a0b0/devil/devil/utils/cmd_helper.py [modify] https://crrev.com/4527bffcdfd7d9f9e1dff0781f3824681eb3a0b0/devil/devil/android/sdk/adb_wrapper.py [modify] https://crrev.com/4527bffcdfd7d9f9e1dff0781f3824681eb3a0b0/devil/devil/utils/reraiser_thread.py
,
Nov 2
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/661fddf6f497b0392411d47198312c0b39608736 commit 661fddf6f497b0392411d47198312c0b39608736 Author: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Date: Fri Nov 02 18:11:14 2018 Roll src/third_party/catapult a6ef719b69f4..a8f088180a1c (2 commits) https://chromium.googlesource.com/catapult.git/+log/a6ef719b69f4..a8f088180a1c git log a6ef719b69f4..a8f088180a1c --date=short --no-merges --format='%ad %ae %s' 2018-11-02 chiniforooshan@chromium.org Telemetry: migrate thread_*_cpu_time_per_frame 2018-11-02 erikchen@chromium.org [Reland 1] ADB commands should only retry errors, not exceptions. Created with: gclient setdep -r src/third_party/catapult@a8f088180a1c The AutoRoll server is located here: https://autoroll.skia.org/r/catapult-autoroll Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary. CQ_INCLUDE_TRYBOTS=luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel BUG= chromium:890757 , chromium:895027 TBR=sullivan@chromium.org Change-Id: I070af3f6172541300ce6dc4d5ac183f117dd1db6 Reviewed-on: https://chromium-review.googlesource.com/c/1315408 Reviewed-by: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Commit-Queue: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Cr-Commit-Position: refs/heads/master@{#604985} [modify] https://crrev.com/661fddf6f497b0392411d47198312c0b39608736/DEPS
,
Nov 9
This still isn't fully functional. Looks like logdog/butler is not transmitting the output when the subprocess exits via sys.exit(constants.INFRA_EXIT_CODE). Example build: https://chromium-swarm.appspot.com/task?id=40e6b33438f2be10&refresh=10&show_raw=1 Output: """ I 3599.583s Main Generated json results file at /b/swarming/w/itYFNqM4/tmpkDNAgX I 3599.583s Main Finishing archiving output. E 3599.583s Main Unrecognized error occurred. Traceback (most recent call last): File "/b/swarming/w/ir/build/android/test_runner.py", line 1018, in main return RunTestsCommand(args) File "/b/swarming/w/ir/build/android/test_runner.py", line 715, in RunTestsCommand return RunTestsInPlatformMode(args) File "/b/swarming/w/ir/build/android/test_runner.py", line 854, in RunTestsInPlatformMode test_run.RunTests(raw_results) File "/b/swarming/w/ir/build/android/pylib/local/device/local_device_test_run.py", line 155, in RunTests run_tests_on_device, tc, try_results).pGet(None) File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/parallelizer.py", line 263, in pMap r.pFinish(None) File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/parallelizer.py", line 135, in pFinish self._objs.JoinAll() File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 196, in JoinAll self._JoinAll(watcher, timeout) File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 163, in _JoinAll thread.join(0.1) File "/b/swarming/w/ir/.swarming_module/lib/python2.7/threading.py", line 951, in join self.__block.wait(delay) File "/b/swarming/w/ir/.swarming_module/lib/python2.7/threading.py", line 359, in wait _sleep(delay) File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/signal_handler.py", line 41, in handler existing_handler(signum, frame) File "/b/swarming/w/ir/build/android/test_runner.py", line 758, in unexpected_sigterm infra_error('\n'.join(msg)) File "/b/swarming/w/ir/build/android/test_runner.py", line 738, in infra_error sys.exit(constants.INFRA_EXIT_CODE) SystemExit: 87 [W2018-11-01T07:07:02.603707Z 2412 0 butler.go:240] Butler Context was cancelled. Initiating shutdown. {"error":"context canceled"} INFO:devil.utils.timeout_retry:condition 'logdog_stopped' not met [E2018-11-01T07:07:02.603855Z 2412 0 main.go:205] Butler terminated with error. {"error":"context canceled"} [E2018-11-01T07:07:02.604048Z 2412 0 main.go:233] Failed to serve. original error: context canceled [I2018-11-01T07:07:02.604084Z 2412 0 main.go:349] Terminating. {"returnCode":250} INFO:devil.utils.timeout_retry:condition 'logdog_stopped' met """
,
Nov 9
jbudorick investigated. This is likely because of incorrect handling for state == 'TIMED_OUT' in standard_gtest_merge.py. I'm working on a fix.
,
Nov 12
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/cf68f1d78354078932322dc24b192d262f6b7b4e commit cf68f1d78354078932322dc24b192d262f6b7b4e Author: Erik Chen <erikchen@chromium.org> Date: Mon Nov 12 17:24:37 2018 Recipes should parse results output regardless of failure status. When a bot times out, it may still output json results. This CL updates the gtest_merge logic to always parse json results, if they exist. In addition to adding a test, this CL refactors the tests to make string comparisons independent of unicode vs str type, which otherwise causes failures. Bug: 895027 Change-Id: Ifc6b65ead4879c9ccb136a2ca31796b186b234c0 Reviewed-on: https://chromium-review.googlesource.com/c/1330161 Commit-Queue: Erik Chen <erikchen@chromium.org> Auto-Submit: Erik Chen <erikchen@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> [modify] https://crrev.com/cf68f1d78354078932322dc24b192d262f6b7b4e/scripts/slave/recipe_modules/swarming/resources/standard_gtest_merge.py [modify] https://crrev.com/cf68f1d78354078932322dc24b192d262f6b7b4e/scripts/slave/recipe_modules/swarming/unittests/standard_gtest_merge_test.py
,
Dec 13
,
Dec 13
Oops, c#23 had some bad links. Trying again. This build: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-kitkat-arm-rel/145816 Had a shard time out: https://chromium-swarm.appspot.com/task?id=41bc71b36a7b8310&refresh=10&show_raw=1 The output.json from the shard looks correct: https://isolateserver.appspot.com/browse?namespace=default-gzip&digest=87247384865ee6a3464527b1d8b7d4a835ec4107&as=output.json So there's probably a bug causing the chromium_test recipe to treat the step as "TEST RESULTS WERE INVALID".
,
Dec 13
Note to future self: To test this, I ran the following commands: """ ./tools/swarming_client/swarming.py collect -S chromium-swarm.appspot.com --task-output-dir=foo 41bc71b36a7b8310 build/android/pylib/results/presentation/test_results_presentation.py --test-name "blah" --build-number 10 --builder-name asdf --output-json /tmp/output2.json --summary-json ./foo/summary.json ./foo/0/output.json --bucket None """ This bug is caused by my failure to land an earlier fix, which had already been code reviewed. :( https://chromium-review.googlesource.com/c/chromium/tools/build/+/1330161
,
Dec 13
Wrong CL, it's this one: https://chromium-review.googlesource.com/c/chromium/src/+/1330066
,
Dec 13
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1b497b3b3e089f9149228b92998493d62d395408 commit 1b497b3b3e089f9149228b92998493d62d395408 Author: erikchen <erikchen@chromium.org> Date: Thu Dec 13 21:08:19 2018 Update gtest_merge logic to parse outputs regardless of failure status. When a bot times out, it may still output json results. This CL updates the gtest_merge logic to always parse json results, if they exist. Bug: 895027 Change-Id: I4c78e5f8a7270a100a97fb8d6f9ffe8f3cf30b83 Reviewed-on: https://chromium-review.googlesource.com/c/1330066 Commit-Queue: Erik Chen <erikchen@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> Cr-Commit-Position: refs/heads/master@{#616430} [modify] https://crrev.com/1b497b3b3e089f9149228b92998493d62d395408/build/android/pylib/results/presentation/standard_gtest_merge.py
,
Dec 13
I think this problem should be fixed. I'm closing for now and will reopen if I observe this again in the future. |
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by jbudorick@chromium.org
, Oct 12