New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 656829 link

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Feb 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

devserver logs are too large

Project Member Reported by semenzato@chromium.org, Oct 17 2016

Issue description

cyan-release build 482 failed most of the autoupdate_EndToEndTest jobs, and the log has oodles of error message.  It is difficult to understand which ones are recoverable errors, which ones are cascading errors, and which one(s) point to the root cause.

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/80585806-chromeos-test/chromeos4-row6-rack9-host17/debug/

Tne status.log file for this job contains errors like this:

	FAIL	autoupdate_EndToEndTest.paygen_au_canary_full	autoupdate_EndToEndTest.paygen_au_canary_full	timestamp=1476311631	localtime=Oct 12 15:33:51	Failed to install device image using payload at http://100.115.219.132:47062/update on chromeos4-row6-rack9-host17. Update failed. Returned update_engine error code: ERROR_CODE=0, ERROR_MESSAGE=ErrorCode::kSuccess. Reported error: AutoservRunError

The debug/autoserv.DEBUG shows that the devserver is successfully staging image, although with many hiccups (connection reset by peer during the is_staged query).   There are 5 is_staged queries, each eventually returning TRUE.  The last one is here:

10/12 15:26:52.028 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.132 'curl "http://100.115.219.132:8082/is_staged?artifacts=&files=stateful.tgz&archive_url=gs://chromeos-image-archive/cyan-chrome-pfq/R56-8891.0.0-rc1"''
10/12 15:27:08.948 DEBUG|        dev_server:0869| whether artifact is staged: 'True'

Then we get this:

10/12 15:27:54.182 DEBUG|          ssh_host:0180| Running (ssh) '/usr/bin/update_engine_client --update --omaha_url=http://100.115.219.132:47062/update'
10/12 15:27:54.358 ERROR|        base_utils:0280| [stderr] [1012/152754:INFO:update_engine_client.cc(471)] Forcing an update by setting app_version to ForcedUpdate.
10/12 15:27:54.359 ERROR|        base_utils:0280| [stderr] [1012/152754:INFO:update_engine_client.cc(473)] Initiating update check and install.
10/12 15:27:54.359 ERROR|        base_utils:0280| [stderr] [1012/152754:INFO:update_engine_client.cc(502)] Waiting for update to complete.
10/12 15:28:45.232 DEBUG|          monitors:0191| ts_mon: sending 8 metrics to PubSub
10/12 15:32:30.915 ERROR|        base_utils:0280| [stderr] [1012/153230:ERROR:update_engine_client.cc(217)] Update failed, current operation is UPDATE_STATUS_IDLE, last error code is ErrorCode::kSuccess(0)
10/12 15:32:30.993 ERROR|       autoupdater:0179| command execution error
* Command: 
    /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_MN7OtNssh-
    master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
    ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
    chromeos4-row6-rack9-host17 "export LIBC_FATAL_STDERR_=1; if type
    \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::perform_update|update_image|_base_update_handler] ->
    ssh_run(/usr/bin/update_engine_client --update
    --omaha_url=http://100.115.219.132:47062/update)\";fi;
    /usr/bin/update_engine_client --update
    --omaha_url=http://100.115.219.132:47062/update"
Exit status: 1
Duration: 276.765402079

Then we see this:

stderr:
[1012/152754:INFO:update_engine_client.cc(471)] Forcing an update by setting app_version to ForcedUpdate.
[1012/152754:INFO:update_engine_client.cc(473)] Initiating update check and install.
[1012/152754:INFO:update_engine_client.cc(502)] Waiting for update to complete.
[1012/153230:ERROR:update_engine_client.cc(217)] Update failed, current operation is UPDATE_STATUS_IDLE, last error code is ErrorCode::kSuccess(0)
10/12 15:32:30.997 WARNI|       autoupdater:0111| Cannot check contents of devserver, update url http://100.115.219.132:47062/update: http://100.115.219.132:47062/update is not an update url
10/12 15:32:30.997 DEBUG|          ssh_host:0180| Running (ssh) '/usr/bin/update_engine_client --last_attempt_error'
10/12 15:32:31.176 DEBUG|        base_utils:0280| [stdout] ERROR_CODE=0
10/12 15:32:31.307 DEBUG|        base_utils:0280| [stdout] ERROR_MESSAGE=ErrorCode::kSuccess
10/12 15:32:31.307 INFO |autoupdate_EndToEn:0674| Killing omaha/devserver
10/12 15:32:31.307 DEBUG|          ssh_host:0180| Running (ssh) 'test -e /proc/23371'
10/12 15:32:31.436 DEBUG|          ssh_host:0180| Running (ssh) 'kill -s SIGTERM 23371'
10/12 15:32:31.567 DEBUG|          ssh_host:0180| Running (ssh) 'test -e /proc/23371'
10/12 15:32:32.706 DEBUG|          ssh_host:0180| Running (ssh) 'test -e /proc/23371'
10/12 15:32:33.842 DEBUG|          ssh_host:0180| Running (ssh) 'test -e /proc/23371'
10/12 15:32:34.976 DEBUG|          ssh_host:0180| Running (ssh) 'test -e /proc/23371'
10/12 15:32:36.113 DEBUG|          ssh_host:0180| Running (ssh) 'test -e /proc/23371'
10/12 15:32:37.251 DEBUG|          ssh_host:0180| Running (ssh) 'test -e /proc/23371'
10/12 15:32:38.382 DEBUG|          ssh_host:0180| Running (ssh) 'test -e /proc/23371'
10/12 15:32:39.518 DEBUG|          ssh_host:0180| Running (ssh) 'test -e /proc/23371'
10/12 15:32:40.654 DEBUG|          ssh_host:0180| Running (ssh) 'test -e /proc/23371'
10/12 15:32:40.783 ERROR|        site_utils:0939| Timed out waiting for condition: devserver down
10/12 15:32:40.783 WARNI|autoupdate_EndToEn:0601| Could not kill devserver with SIGTERM.
10/12 15:32:40.784 DEBUG|          ssh_host:0180| Running (ssh) 'kill -s SIGKILL 23371'
10/12 15:32:40.917 DEBUG|          ssh_host:0180| Running (ssh) 'test -e /proc/23371'
10/12 15:32:41.050 DEBUG|autoupdate_EndToEn:0676| Final devserver log before killing
10/12 15:32:41.051 DEBUG|          ssh_host:0180| Running (ssh) 'cat /tmp/devserver-stdout.t35CHJ'

This is chromeos4-devserver4, which has been up 18 days and was lightly loaded at the time.  So I look at the server log, grep only the lines with "cyan" and see this:

[12/Oct/2016:15:28:05] GSUTIL_UTIL crbug.com/639314 GSUtilRun(gsutil cp gs://chromeos-releases/dev-channel/cyan/8891.0.0/stateful.tgz /home/chromeos-test/images/dev-channel/cyan/8891.0.0/, Failed to download "gs://chromeos-releases/dev-channel/cyan/8891.0.0/stateful.tgz".)=0, stdout=<<<>>>, stderr=<<<Copying gs://chromeos-releases/dev-channel/cyan/8891.0.0/stateful.tgz...
Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 0 B/271.33 MiB    
Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 2.03 MiB/271.33 MiB    
Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 4.06 MiB/271.33 MiB    
Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 6.09 MiB/271.33 MiB  
...
Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 268.13 MiB/271.33 MiB    
Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 270.16 MiB/271.33 MiB    
Downloading ...st/images/dev-channel/cyan/8891.0.0/stateful.tgz: 271.33 MiB/271.33 MiB

so the stateful download appears to have succeeded
    
[12/Oct/2016:15:28:35] GOOGLE_STORAGE_DOWNLOADER Downloading artifacts chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b->/home/chromeos-test/images/dev-channel/cyan/8891.0.0/payloads.
[12/Oct/2016:15:28:35] GOOGLE_STORAGE_DOWNLOADER Invoking background download of artifacts for [chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b->/home/chromeos-test/images/dev-channel/cyan/8891.0.0/payloads]
::ffff:127.0.0.1 - - [12/Oct/2016:15:28:35] "GET /stage?artifacts=&files=chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b&async=True&archive_url=gs://chromeos-releases/dev-channel/cyan/8891.0.0/payloads HTTP/1.1" 200 7 "" "curl/7.35.0"
[12/Oct/2016:15:28:35] ARTIFACT No marker file, chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b->/home/chromeos-test/images/dev-channel/cyan/8891.0.0/payloads is not staged.

"No marker file"?  What's a marker file?  And then we're back to bug 639314, is this the root cause?

[12/Oct/2016:15:28:39] GSUTIL_UTIL crbug.com/639314 GSUtilRun(gsutil stat gs://chromeos-releases/dev-channel/cyan/8891.0.0/payloads/chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b, Failed to stat on the artifact file.)=0, stdout=<<<gs://chromeos-releases/dev-channel/cyan/8891.0.0/payloads/chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b:
[12/Oct/2016:15:28:39] GSUTIL_UTIL crbug.com/639314 override_result=chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b
[12/Oct/2016:15:28:39] GSUTIL_UTIL crbug.com/639314 uploaded_list=['chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b']
[12/Oct/2016:15:28:39] GSUTIL_UTIL crbug.com/639314 matching_names=['chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b'], is_regex_pattern=False
[12/Oct/2016:15:28:39] ARTIFACT Downloading file chromeos_8872.6.2-8891.0.0_cyan_dev-channel_delta_test.bin-f7f3daddef5f25e69e275372826bd03b

If that's the root cause, is there any way If not, what's the next step?  This is hard enough to get to.  Those server logs are 1GB each, I have to binary-search with dd to extract a section that's small enough to analyze (grep etc.)



 
Cc: nxia@chromium.org dgarr...@chromium.org dshi@chromium.org xixuan@chromium.org snanda@chromium.org
Labels: -Pri-2 Pri-1
Summary: devserver logs are too large (was: problematic logging for autoupdate_EndToEndTest failure)
I am repurposing this bug to a smaller and cleared mandate: make the devserver logs smaller for easier analysis.

The logs right now are too large for a quick analysis.  On chromeos4-devserver5, console.log is 2GB and server.log is 660MB.  The former is two weeks old.  We should rotate them at least once a day, better twice.


chromeos-test@chromeos4-devserver5:/var/log/devserver$ ls -l
total 17970976
-rw-r--r-- 1 root          root          6040407558 Nov 10 09:00 apache_log_metrics.log
-rw-r--r-- 1 root          root          2045149208 Nov 10 09:00 console.log
-rw-r--r-- 1 root          root             1381322 Oct 26 11:52 console.log.1
-rw-r--r-- 1 root          root          1957941127 Oct 26 11:04 console.log.2
-rw-r--r-- 1 root          root           108679313 Oct 13 11:08 console.log.3
-rw-r--r-- 1 root          root            77930718 Oct 12 23:52 console.log.4
-rw-r--r-- 1 root          root           222183424 Oct 12 14:05 console.log.5
-rw-r--r-- 1 chromeos-test chromeos-test  660860097 Nov 10 09:00 server.log
-rw-r--r-- 1 chromeos-test chromeos-test   31194039 Aug 12 23:59 server.log.2016-08-06
-rw-r--r-- 1 chromeos-test chromeos-test  240993520 Aug 19 23:59 server.log.2016-08-13
-rw-r--r-- 1 chromeos-test chromeos-test  173062708 Aug 26 23:59 server.log.2016-08-20
-rw-r--r-- 1 chromeos-test chromeos-test  296009289 Sep  2 23:59 server.log.2016-08-27
-rw-r--r-- 1 chromeos-test chromeos-test  326655013 Sep  9 23:59 server.log.2016-09-03
-rw-r--r-- 1 chromeos-test chromeos-test  623701907 Sep 16 23:59 server.log.2016-09-10
-rw-r--r-- 1 chromeos-test chromeos-test  524577943 Sep 23 23:59 server.log.2016-09-17
-rw-r--r-- 1 chromeos-test chromeos-test  673063455 Sep 30 23:59 server.log.2016-09-24
-rw-r--r-- 1 chromeos-test chromeos-test  920846470 Oct  7 23:59 server.log.2016-10-01
-rw-r--r-- 1 chromeos-test chromeos-test  857925669 Oct 14 23:59 server.log.2016-10-08
-rw-r--r-- 1 chromeos-test chromeos-test  798116629 Oct 21 23:59 server.log.2016-10-15
-rw-r--r-- 1 chromeos-test chromeos-test  951474535 Oct 28 23:59 server.log.2016-10-22
-rw-r--r-- 1 chromeos-test chromeos-test  869956548 Nov  4 23:59 server.log.2016-10-29
chromeos-test@chromeos4-devserver5:/var/log/devserver$ 
I wrote this script to get around the problem of huge log size.  It extracts part of a log.  "dd" would also work but this is a bit easier (the position is expressed in %, and the extraction size in lines, rather than byte counts)

extract.py:

#! /usr/bin/python

import os
import sys

if len(sys.argv) != 4:
    print 'Usage: extract.py <file> <position (in %)> <length (in lines)>'
    print 'extracts <length> lines starting at <position> in <file>'
    sys.exit(1)

path = sys.argv[1]
fraction = float(sys.argv[2])
length = int(sys.argv[3])

size = os.stat(path).st_size
fd = open(path)
fd.seek(size * fraction / 100)

while True:
    c = fd.read(1)
    if not c:
        print 'EOF reached without finding newline'
        sys.exit(1)
    if c == '\n':
        break

while length > 0:
    line = fd.readline()
    if line == "":
        break
    print line,
    length -= 1



Labels: Hotlist-Fixit
See also  issue 682417  for the complaint in the OP (multiple exception stacktraces)

Comment 5 by xixuan@chromium.org, Jan 31 2017

Owner: xixuan@chromium.org

Comment 6 by xixuan@chromium.org, Jan 31 2017

Status: Started (was: Untriaged)
Project Member

Comment 7 by bugdroid1@chromium.org, Feb 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/dev-util/+/3d48bff3de541e32401d8f3a4f4d9dc1e5f2976d

commit 3d48bff3de541e32401d8f3a4f4d9dc1e5f2976d
Author: xixuan <xixuan@chromium.org>
Date: Fri Feb 03 13:15:07 2017

devserver: rotate devserver log twice a day.

This CL changes devserver log's rotation to twice a day.

BUG= chromium:656829 
TEST=test it locally with setting rotation to every 30s, verify it
works.

Change-Id: Ie7843dcb2e7f4191eed24f3b62f9cde65d46dbf4
Reviewed-on: https://chromium-review.googlesource.com/435021
Commit-Ready: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/3d48bff3de541e32401d8f3a4f4d9dc1e5f2976d/devserver.py

Labels: cros-infra-fixedit-q117
Status: Fixed (was: Started)

Comment 9 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 10 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 12 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)

Sign in to add a comment