New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 774597 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Oct 2017
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 760652



Sign in to add a comment

provision-workqueue job logs not rotating properly

Reported by jrbarnette@chromium.org, Oct 13 2017

Issue description

The 'provision-workqueue' job starts a long-running Python program
that writes a log under /var/log/devserver.  The logging is configured
to rotate once a week, preserving a quarter's worth of logs.  The
handler setup looks like this:
  handler = handlers.TimedRotatingFileHandler(logfile, when='W4',
                                              backupCount=13)

That means "rotate at midnight between Friday and Saturday, keeping
13 weeks of history."

Looking at the log history, something's rotten.  Here's a sampling:
    chromeos-test@chromeos4-devserver2:~$ cd /var/log/devserver/
    chromeos-test@chromeos4-devserver2:/var/log/devserver$ ls workqueue.log*
    workqueue.log             workqueue.log.2017-09-09  workqueue.log.2017-09-30
    workqueue.log.2017-08-26  workqueue.log.2017-09-16
    workqueue.log.2017-09-02  workqueue.log.2017-09-23
    chromeos-test@chromeos4-devserver2:/var/log/devserver$ head -1 workqueue.log.2017-08-26
    2017-09-02 00:05:21 DEBUG      - Starting tick number 1476458

Basically, the log that was supposed to hold all the data from
the week of 8/26 through 9/1 instead holds logs starting on 9/2.

Equally curious (and presumably related) is this:
    chromeos-test@chromeos4-devserver2:/var/log/devserver$ tail -5 workqueue.log.2017-09-30
    2017-10-13 10:21:23 DEBUG      - Starting tick number 2729918
    2017-10-13 10:31:23 DEBUG      - Starting tick number 2732311
    2017-10-13 10:41:23 DEBUG      - Starting tick number 2734704
    2017-10-13 10:51:23 DEBUG      - Starting tick number 2737097
    2017-10-13 11:01:24 DEBUG      - Starting tick number 2739490
    chromeos-test@chromeos4-devserver2:/var/log/devserver$ tail -5 workqueue.log
    2017-10-13 08:47:57 INFO       - Refreshing access_token
    2017-10-13 09:47:58 INFO       - Refreshing due to a 401 (attempt 1/2)
    2017-10-13 09:47:58 INFO       - Refreshing access_token
    2017-10-13 10:48:57 INFO       - Refreshing due to a 401 (attempt 1/2)
    2017-10-13 10:48:57 INFO       - Refreshing access_token

"workqueue.log" is this week's log and the 2017-09-30 is last week's.
However, both contain entries from this week(!).

This is, at best, awkward.  At worst, it breaks debug.  In particular,
logs from the time of  bug 760652  are now gone.

 
I have a handful of theories as to possible causes (from least to
most likely):
  * A bug in the Python logging libraries, or some misunderstanding
    in how the libraries are supposed to work.
  * Some bad interaction between the Python multiprocessing
    module and Python logging.
  * Some bad interaction with how ts_mon does its logging.

Looking especially at the anomaly with the current logs (as of 10/13),
I note that the "Refreshing access_token" log messages only occur in
the unrotated log file.
I have a theory regarding the cause of the problem.

First, based on observed behavior, I believe that TimedRotatingFileHandler
operates like this:
  * Whenever a new request to write to the log comes in, the timestamp
    of the previous request is compared to the current timestamp.
  * If the timestamps cross over a rotation boundary, the old log file
    is closed and renamed, and a new file is opened.

Next, the ts_mon code creates a subprocess for its work, and that
subprocess will inherit the logging configuration of its parent.

My theory is that these factors trigger this sequence:
 1) A log message from the workqueue code triggers rotation as
    described above.  At the end, the logger is writing with a new
    file descriptor on the standard "workqueue.log" file.
 2) Shortly after, a log message from the ts_mon child process triggers
    rotation _again_.  Now, the workqueue parent process is writing
    to a rotated file named for last week, and the ts_mon child is
    writing to the current log file.

The ordering above (parent process first, ts_mon second) is the most
common ordering, because the parent logs no less often that once every
10 minutes, whereas the ts_mon process seems to log only once an hour.

Blocking: 760652
Labels: -Pri-3 Pri-1
Owner: jrbarnette@chromium.org
Status: Started (was: Untriaged)
Project Member

Comment 5 by bugdroid1@chromium.org, Oct 17 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/3ff41e90b0c10710d92bbaa748a397beeabcdad1

commit 3ff41e90b0c10710d92bbaa748a397beeabcdad1
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Tue Oct 17 18:07:46 2017

[workqueue] Fix workqueue_server log rotation.

The workqueue_server command sets up logging with a
TimedRotatingFileHandler.  However, the ts_mon setup used by the
program creates a child process that also writes log messages.  The
rotating log handler isn't multiprocess safe, with the result that
log rotation is wrong, and some content gets lost.

This fixes the logging setup so that ts_mon and the workqueue_server
process proper log to different files.

BUG= chromium:774597 
TEST=manually run the program locally

Change-Id: I1b5a0582f07f81cf9827725189c89a7116312384
Reviewed-on: https://chromium-review.googlesource.com/719465
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Jacob Kopczynski <jkop@chromium.org>

[modify] https://crrev.com/3ff41e90b0c10710d92bbaa748a397beeabcdad1/lib/workqueue/workqueue_server.py

Status: Fixed (was: Started)
Eventually, we'll want a push to devserver, after which we can call this
"Verified".

Sign in to add a comment