provision-workqueue job logs not rotating properly
Reported by
jrbarnette@chromium.org,
Oct 13 2017
|
||||
Issue description
The 'provision-workqueue' job starts a long-running Python program
that writes a log under /var/log/devserver. The logging is configured
to rotate once a week, preserving a quarter's worth of logs. The
handler setup looks like this:
handler = handlers.TimedRotatingFileHandler(logfile, when='W4',
backupCount=13)
That means "rotate at midnight between Friday and Saturday, keeping
13 weeks of history."
Looking at the log history, something's rotten. Here's a sampling:
chromeos-test@chromeos4-devserver2:~$ cd /var/log/devserver/
chromeos-test@chromeos4-devserver2:/var/log/devserver$ ls workqueue.log*
workqueue.log workqueue.log.2017-09-09 workqueue.log.2017-09-30
workqueue.log.2017-08-26 workqueue.log.2017-09-16
workqueue.log.2017-09-02 workqueue.log.2017-09-23
chromeos-test@chromeos4-devserver2:/var/log/devserver$ head -1 workqueue.log.2017-08-26
2017-09-02 00:05:21 DEBUG - Starting tick number 1476458
Basically, the log that was supposed to hold all the data from
the week of 8/26 through 9/1 instead holds logs starting on 9/2.
Equally curious (and presumably related) is this:
chromeos-test@chromeos4-devserver2:/var/log/devserver$ tail -5 workqueue.log.2017-09-30
2017-10-13 10:21:23 DEBUG - Starting tick number 2729918
2017-10-13 10:31:23 DEBUG - Starting tick number 2732311
2017-10-13 10:41:23 DEBUG - Starting tick number 2734704
2017-10-13 10:51:23 DEBUG - Starting tick number 2737097
2017-10-13 11:01:24 DEBUG - Starting tick number 2739490
chromeos-test@chromeos4-devserver2:/var/log/devserver$ tail -5 workqueue.log
2017-10-13 08:47:57 INFO - Refreshing access_token
2017-10-13 09:47:58 INFO - Refreshing due to a 401 (attempt 1/2)
2017-10-13 09:47:58 INFO - Refreshing access_token
2017-10-13 10:48:57 INFO - Refreshing due to a 401 (attempt 1/2)
2017-10-13 10:48:57 INFO - Refreshing access_token
"workqueue.log" is this week's log and the 2017-09-30 is last week's.
However, both contain entries from this week(!).
This is, at best, awkward. At worst, it breaks debug. In particular,
logs from the time of bug 760652 are now gone.
,
Oct 13 2017
I have a theory regarding the cause of the problem.
First, based on observed behavior, I believe that TimedRotatingFileHandler
operates like this:
* Whenever a new request to write to the log comes in, the timestamp
of the previous request is compared to the current timestamp.
* If the timestamps cross over a rotation boundary, the old log file
is closed and renamed, and a new file is opened.
Next, the ts_mon code creates a subprocess for its work, and that
subprocess will inherit the logging configuration of its parent.
My theory is that these factors trigger this sequence:
1) A log message from the workqueue code triggers rotation as
described above. At the end, the logger is writing with a new
file descriptor on the standard "workqueue.log" file.
2) Shortly after, a log message from the ts_mon child process triggers
rotation _again_. Now, the workqueue parent process is writing
to a rotated file named for last week, and the ts_mon child is
writing to the current log file.
The ordering above (parent process first, ts_mon second) is the most
common ordering, because the parent logs no less often that once every
10 minutes, whereas the ts_mon process seems to log only once an hour.
,
Oct 13 2017
,
Oct 13 2017
,
Oct 17 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/3ff41e90b0c10710d92bbaa748a397beeabcdad1 commit 3ff41e90b0c10710d92bbaa748a397beeabcdad1 Author: Richard Barnette <jrbarnette@chromium.org> Date: Tue Oct 17 18:07:46 2017 [workqueue] Fix workqueue_server log rotation. The workqueue_server command sets up logging with a TimedRotatingFileHandler. However, the ts_mon setup used by the program creates a child process that also writes log messages. The rotating log handler isn't multiprocess safe, with the result that log rotation is wrong, and some content gets lost. This fixes the logging setup so that ts_mon and the workqueue_server process proper log to different files. BUG= chromium:774597 TEST=manually run the program locally Change-Id: I1b5a0582f07f81cf9827725189c89a7116312384 Reviewed-on: https://chromium-review.googlesource.com/719465 Commit-Ready: Richard Barnette <jrbarnette@chromium.org> Tested-by: Richard Barnette <jrbarnette@chromium.org> Reviewed-by: Jacob Kopczynski <jkop@chromium.org> [modify] https://crrev.com/3ff41e90b0c10710d92bbaa748a397beeabcdad1/lib/workqueue/workqueue_server.py
,
Oct 18 2017
Eventually, we'll want a push to devserver, after which we can call this "Verified". |
||||
►
Sign in to add a comment |
||||
Comment 1 by jrbarnette@chromium.org
, Oct 13 2017I have a handful of theories as to possible causes (from least to most likely): * A bug in the Python logging libraries, or some misunderstanding in how the libraries are supposed to work. * Some bad interaction between the Python multiprocessing module and Python logging. * Some bad interaction with how ts_mon does its logging. Looking especially at the anomaly with the current logs (as of 10/13), I note that the "Refreshing access_token" log messages only occur in the unrotated log file.