devserver logs are too large |
|||||||||
Issue descriptioncyan-release build 482 failed most of the autoupdate_EndToEndTest jobs, and the log has oodles of error message. It is difficult to understand which ones are recoverable errors, which ones are cascading errors, and which one(s) point to the root cause. https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/80585806-chromeos-test/chromeos4-row6-rack9-host17/debug/ Tne status.log file for this job contains errors like this: FAIL autoupdate_EndToEndTest.paygen_au_canary_full autoupdate_EndToEndTest.paygen_au_canary_full timestamp=1476311631 localtime=Oct 12 15:33:51 Failed to install device image using payload at http://100.115.219.132:47062/update on chromeos4-row6-rack9-host17. Update failed. Returned update_engine error code: ERROR_CODE=0, ERROR_MESSAGE=ErrorCode::kSuccess. Reported error: AutoservRunError The debug/autoserv.DEBUG shows that the devserver is successfully staging image, although with many hiccups (connection reset by peer during the is_staged query). There are 5 is_staged queries, each eventually returning TRUE. The last one is here: 10/12 15:26:52.028 DEBUG| base_utils:0185| Running 'ssh 100.115.219.132 'curl "http://100.115.219.132:8082/is_staged?artifacts=&files=stateful.tgz&archive_url=gs://chromeos-image-archive/cyan-chrome-pfq/R56-8891.0.0-rc1"'' 10/12 15:27:08.948 DEBUG| dev_server:0869| whether artifact is staged: 'True' Then we get this: 10/12 15:27:54.182 DEBUG| ssh_host:0180| Running (ssh) '/usr/bin/update_engine_client --update --omaha_url=http://100.115.219.132:47062/update' 10/12 15:27:54.358 ERROR| base_utils:0280| [stderr] [1012/152754:INFO:update_engine_client.cc(471)] Forcing an update by setting app_version to ForcedUpdate. 10/12 15:27:54.359 ERROR| base_utils:0280| [stderr] [1012/152754:INFO:update_engine_client.cc(473)] Initiating update check and install. 10/12 15:27:54.359 ERROR| base_utils:0280| [stderr] [1012/152754:INFO:update_engine_client.cc(502)] Waiting for update to complete. 10/12 15:28:45.232 DEBUG| monitors:0191| ts_mon: sending 8 metrics to PubSub 10/12 15:32:30.915 ERROR| base_utils:0280| [stderr] [1012/153230:ERROR:update_engine_client.cc(217)] Update failed, current operation is UPDATE_STATUS_IDLE, last error code is ErrorCode::kSuccess(0) 10/12 15:32:30.993 ERROR| autoupdater:0179| command execution error * Command: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_MN7OtNssh- master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row6-rack9-host17 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::perform_update|update_image|_base_update_handler] -> ssh_run(/usr/bin/update_engine_client --update --omaha_url=http://100.115.219.132:47062/update)\";fi; /usr/bin/update_engine_client --update --omaha_url=http://100.115.219.132:47062/update" Exit status: 1 Duration: 276.765402079 Then we see this: stderr: [1012/152754:INFO:update_engine_client.cc(471)] Forcing an update by setting app_version to ForcedUpdate. [1012/152754:INFO:update_engine_client.cc(473)] Initiating update check and install. [1012/152754:INFO:update_engine_client.cc(502)] Waiting for update to complete. [1012/153230:ERROR:update_engine_client.cc(217)] Update failed, current operation is UPDATE_STATUS_IDLE, last error code is ErrorCode::kSuccess(0) 10/12 15:32:30.997 WARNI| autoupdater:0111| Cannot check contents of devserver, update url http://100.115.219.132:47062/update: http://100.115.219.132:47062/update is not an update url 10/12 15:32:30.997 DEBUG| ssh_host:0180| Running (ssh) '/usr/bin/update_engine_client --last_attempt_error' 10/12 15:32:31.176 DEBUG| base_utils:0280| [stdout] ERROR_CODE=0 10/12 15:32:31.307 DEBUG| base_utils:0280| [stdout] ERROR_MESSAGE=ErrorCode::kSuccess 10/12 15:32:31.307 INFO |autoupdate_EndToEn:0674| Killing omaha/devserver 10/12 15:32:31.307 DEBUG| ssh_host:0180| Running (ssh) 'test -e /proc/23371' 10/12 15:32:31.436 DEBUG| ssh_host:0180| Running (ssh) 'kill -s SIGTERM 23371' 10/12 15:32:31.567 DEBUG| ssh_host:0180| Running (ssh) 'test -e /proc/23371' 10/12 15:32:32.706 DEBUG| ssh_host:0180| Running (ssh) 'test -e /proc/23371' 10/12 15:32:33.842 DEBUG| ssh_host:0180| Running (ssh) 'test -e /proc/23371' 10/12 15:32:34.976 DEBUG| ssh_host:0180| Running (ssh) 'test -e /proc/23371' 10/12 15:32:36.113 DEBUG| ssh_host:0180| Running (ssh) 'test -e /proc/23371' 10/12 15:32:37.251 DEBUG| ssh_host:0180| Running (ssh) 'test -e /proc/23371' 10/12 15:32:38.382 DEBUG| ssh_host:0180| Running (ssh) 'test -e /proc/23371' 10/12 15:32:39.518 DEBUG| ssh_host:0180| Running (ssh) 'test -e /proc/23371' 10/12 15:32:40.654 DEBUG| ssh_host:0180| Running (ssh) 'test -e /proc/23371' 10/12 15:32:40.783 ERROR| site_utils:0939| Timed out waiting for condition: devserver down 10/12 15:32:40.783 WARNI|autoupdate_EndToEn:0601| Could not kill devserver with SIGTERM. 10/12 15:32:40.784 DEBUG| ssh_host:0180| Running (ssh) 'kill -s SIGKILL 23371' 10/12 15:32:40.917 DEBUG| ssh_host:0180| Running (ssh) 'test -e /proc/23371' 10/12 15:32:41.050 DEBUG|autoupdate_EndToEn:0676| Final devserver log before killing 10/12 15:32:41.051 DEBUG| ssh_host:0180| Running (ssh) 'cat /tmp/devserver-stdout.t35CHJ' This is chromeos4-devserver4, which has been up 18 days and was lightly loaded at the time. So I look at the server log, grep only the lines with "cyan" and see this: [12/Oct/2016:15:28:05] GSUTIL_UTIL crbug.com/639314 GSUtilRun(gsutil cp gs://chromeos-releases/dev-channel/cyan/8891.0.0/stateful.tgz /home/chromeos-test/images/dev-channel/cyan/8891.0.0/, Failed to download "gs://chromeos-releases/dev-channel/cyan/8891.0.0/stateful.tgz".)=0, stdout=<<<>>>, stderr=<<<Copying gs://chromeos-releases/dev-channel/cyan/8891.0.0/stateful.tgz... Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 0 B/271.33 MiB Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 2.03 MiB/271.33 MiB Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 4.06 MiB/271.33 MiB Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 6.09 MiB/271.33 MiB ... Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 268.13 MiB/271.33 MiB Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 270.16 MiB/271.33 MiB Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 271.33 MiB/271.33 MiB so the stateful download appears to have succeeded [12/Oct/2016:15:28:35] GOOGLE_STORAGE_DOWNLOADER Downloading artifacts chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b->/home/chromeos-test/images/dev-channel/cyan/8891.0.0/payloads. [12/Oct/2016:15:28:35] GOOGLE_STORAGE_DOWNLOADER Invoking background download of artifacts for [chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b->/home/chromeos-test/images/dev-channel/cyan/8891.0.0/payloads] ::ffff:127.0.0.1 - - [12/Oct/2016:15:28:35] "GET /stage?artifacts=&files=chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b&async=True&archive_url=gs://chromeos-releases/dev-channel/cyan/8891.0.0/payloads HTTP/1.1" 200 7 "" "curl/7.35.0" [12/Oct/2016:15:28:35] ARTIFACT No marker file, chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b->/home/chromeos-test/images/dev-channel/cyan/8891.0.0/payloads is not staged. "No marker file"? What's a marker file? And then we're back to bug 639314, is this the root cause? [12/Oct/2016:15:28:39] GSUTIL_UTIL crbug.com/639314 GSUtilRun(gsutil stat gs://chromeos-releases/dev-channel/cyan/8891.0.0/payloads/chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b, Failed to stat on the artifact file.)=0, stdout=<<<gs://chromeos-releases/dev-channel/cyan/8891.0.0/payloads/chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b: [12/Oct/2016:15:28:39] GSUTIL_UTIL crbug.com/639314 override_result=chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b [12/Oct/2016:15:28:39] GSUTIL_UTIL crbug.com/639314 uploaded_list=['chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b'] [12/Oct/2016:15:28:39] GSUTIL_UTIL crbug.com/639314 matching_names=['chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b'], is_regex_pattern=False [12/Oct/2016:15:28:39] ARTIFACT Downloading file chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b If that's the root cause, is there any way If not, what's the next step? This is hard enough to get to. Those server logs are 1GB each, I have to binary-search with dd to extract a section that's small enough to analyze (grep etc.)
,
Dec 1 2016
I wrote this script to get around the problem of huge log size. It extracts part of a log. "dd" would also work but this is a bit easier (the position is expressed in %, and the extraction size in lines, rather than byte counts)
extract.py:
#! /usr/bin/python
import os
import sys
if len(sys.argv) != 4:
print 'Usage: extract.py <file> <position (in %)> <length (in lines)>'
print 'extracts <length> lines starting at <position> in <file>'
sys.exit(1)
path = sys.argv[1]
fraction = float(sys.argv[2])
length = int(sys.argv[3])
size = os.stat(path).st_size
fd = open(path)
fd.seek(size * fraction / 100)
while True:
c = fd.read(1)
if not c:
print 'EOF reached without finding newline'
sys.exit(1)
if c == '\n':
break
while length > 0:
line = fd.readline()
if line == "":
break
print line,
length -= 1
,
Jan 11 2017
,
Jan 18 2017
See also issue 682417 for the complaint in the OP (multiple exception stacktraces)
,
Jan 31 2017
,
Jan 31 2017
,
Feb 3 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/dev-util/+/3d48bff3de541e32401d8f3a4f4d9dc1e5f2976d commit 3d48bff3de541e32401d8f3a4f4d9dc1e5f2976d Author: xixuan <xixuan@chromium.org> Date: Fri Feb 03 13:15:07 2017 devserver: rotate devserver log twice a day. This CL changes devserver log's rotation to twice a day. BUG= chromium:656829 TEST=test it locally with setting rotation to every 30s, verify it works. Change-Id: Ie7843dcb2e7f4191eed24f3b62f9cde65d46dbf4 Reviewed-on: https://chromium-review.googlesource.com/435021 Commit-Ready: Xixuan Wu <xixuan@chromium.org> Tested-by: Xixuan Wu <xixuan@chromium.org> Reviewed-by: Xixuan Wu <xixuan@chromium.org> [modify] https://crrev.com/3d48bff3de541e32401d8f3a4f4d9dc1e5f2976d/devserver.py
,
Feb 3 2017
,
Apr 17 2017
,
May 30 2017
,
Aug 1 2017
,
Oct 14 2017
|
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by semenzato@chromium.org
, Nov 10 2016Labels: -Pri-2 Pri-1
Summary: devserver logs are too large (was: problematic logging for autoupdate_EndToEndTest failure)