clearly mark in test logs which errors are recoverable (and recovered)
Issue description

Error reporting in test logs is messy. This can happen:

1. An error occurs and is logged.
2. The Python code throws an exception.
3. The exception is caught, and various seemingly normal operations are logged. Then the exception is thrown again, or an error status is returned.
4. Cascading errors may occur and be logged.
5. More information related to the original error is printed.

Because the distance between 1 and 5 can be large, it is often difficult to tell whether an error is the root cause of a failure, since some errors are recoverable and silently ignored. For this reason, every time an error is recovered from, the recovery should be logged.
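For example, a minimal sketch of what "log the recovery" could look like (the helper name and the host.run() call are hypothetical, loosely based on the .checksum example in comment #4):

    import logging

    def fetch_checksum(host):
        # Hypothetical helper: the point is that the recovery itself is
        # logged at the site that decides to ignore the error.
        try:
            return host.run('cat /usr/local/autotest/.checksum').stdout
        except Exception as e:
            # Without this line, only the original ERROR appears in the
            # log, and the reader cannot tell the failure was harmless.
            logging.info('RECOVERED from error (%s): .checksum missing, '
                         'will regenerate it', e)
            return None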
Comment 1 by keta...@chromium.org, Mar 22 2017

Comment 2, Mar 28 2017
Makes (instinctive) sense - do you have an example or two?
Comment 3, Apr 4 2017
+semenzato do you have some examples?
Comment 4, Apr 5 2017
Sure, take for instance autoserv test logs and search for ERROR. I randomly picked the autoserv.DEBUG from this bucket:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/110935083-chromeos-test/chromeos2-row3-rack4-host5/debug/

First instance:

04/05 05:37:41.335 ERROR| base_utils:0280| [stderr] 2017-04-05 08:37:41 URL:http://100.115.245.198:8082/static/falco-release/R59-9433.0.0/autotest/packages/packages.checksum [34778/34778] -> "/usr/local/autotest/packages.checksum" [1]

and then 7 lines later:

04/05 05:37:42.647 ERROR| base_utils:0280| [stderr] cat: /usr/local/autotest/.checksum: No such file or directory

Apparently this error is recoverable, because the test keeps going. Then we have these errors from the TKO parser:

04/05 05:38:03.950 ERROR| logging_manager:0626| tko parser: {'builds': "{'cros-version': 'falco-release/R59-9433.0.0'}", 'job_started': 1491395851, 'hostname': 'chromeos2-row3-rack4-host5', 'status_version': 1, 'label': 'falco-release/R59-9433.0.0/bvt-inline/login_OwnershipNotRetaken', 'parent_job_id': 110934980, 'build': 'falco-release/R59-9433.0.0', 'drone': 'chromeos-server12.cbf.corp.google.com', 'user': 'chromeos-test', 'suite': 'bvt-inline', 'experimental': 'False', 'job_queued': 1491395114}
04/05 05:38:03.953 ERROR| logging_manager:0626| tko parser: MACHINE NAME: chromeos2-row3-rack4-host5
04/05 05:38:03.953 ERROR| logging_manager:0626| tko parser: MACHINE GROUP: falco
04/05 05:38:03.953 ERROR| logging_manager:0626| tko parser: parsing partial test ---- SERVER_JOB
04/05 05:38:03.954 ERROR| logging_manager:0626| tko parser: parsing partial test None CLIENT_JOB.0
04/05 05:38:03.954 ERROR| logging_manager:0626| tko parser: RUNNING: RUNNING
04/05 05:38:03.955 ERROR| logging_manager:0626| CLIENT_JOB.0
04/05 05:38:03.955 ERROR| logging_manager:0626|
04/05 05:38:03.956 ERROR| logging_manager:0626| tko parser: parsing partial test login_OwnershipNotRetaken login_OwnershipNotRetaken
04/05 05:38:03.956 ERROR| logging_manager:0626| tko parser: RUNNING: RUNNING
04/05 05:38:03.956 ERROR| logging_manager:0626| Subdir: login_OwnershipNotRetaken
04/05 05:38:03.957 ERROR| logging_manager:0626| Testname: login_OwnershipNotRetaken
04/05 05:38:03.957 ERROR| logging_manager:0626|
04/05 05:38:03.957 ERROR| logging_manager:0626| tko parser: update RUNNING reason: completed successfully
04/05 05:38:03.957 ERROR| logging_manager:0626| tko parser: parsing test login_OwnershipNotRetaken login_OwnershipNotRetaken
04/05 05:38:03.958 ERROR| logging_manager:0626| tko parser: ADD: GOOD
04/05 05:38:03.958 ERROR| logging_manager:0626| Subdir: login_OwnershipNotRetaken
04/05 05:38:03.958 ERROR| logging_manager:0626| Testname: login_OwnershipNotRetaken
04/05 05:38:03.958 ERROR| logging_manager:0626| completed successfully
04/05 05:38:03.959 ERROR| logging_manager:0626| tko parser: parsing test None CLIENT_JOB.0
04/05 05:38:03.959 ERROR| logging_manager:0626| tko parser: ADD: GOOD
04/05 05:38:03.959 ERROR| logging_manager:0626| Subdir: None
04/05 05:38:03.960 ERROR| logging_manager:0626| Testname: CLIENT_JOB.0
04/05 05:38:03.960 ERROR| logging_manager:0626|
04/05 05:38:03.960 ERROR| logging_manager:0626| tko parser: parsing test ---- SERVER_JOB

Why are they even errors? Let me get some more examples.
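For context, a plausible mechanism behind the lines above (a sketch only, assuming the tko parser logs through a dprint-style helper that writes everything to stderr, which logging_manager then records at ERROR severity regardless of content):

    import sys

    def dprint(msg):
        # tko-parser-style debug print (sketch): every message, including
        # routine status like "parsing partial test", goes to stderr.
        # A wrapper that tags all stderr output as ERROR will then log
        # these lines at ERROR severity even though nothing failed.
        sys.stderr.write('%s\n' % msg)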
Comment 5, Apr 5 2017
From the same build, paygen tests:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/110940933-chromeos-test/chromeos2-row3-rack4-host3/debug/

04/05 06:25:57.651 DEBUG| ssh_host:0284| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
04/05 06:25:57.862 DEBUG| base_utils:0280| [stdout] d4584ac5-7d82-4e4f-8504-9b605822164d
04/05 06:25:58.870 DEBUG| ssh_host:0284| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
04/05 06:26:29.825 ERROR| base_utils:0280| [stderr] mux_client_request_session: read from master failed: Broken pipe
04/05 06:26:30.458 DEBUG| base_utils:0280| [stdout] 03f0ce03-1cba-4ec2-add1-b9c605063478
04/05 06:26:30.462 DEBUG| abstract_ssh:0682| Host chromeos2-row3-rack4-host3 now has boot_id 03f0ce03-1cba-4ec2-add1-b9c605063478 and so must have rebooted

It's clear that the error here is not unexpected, but one has to understand what is going on. Then later:

04/05 06:26:37.198 DEBUG| ssh_host:0284| Running (ssh) '/usr/bin/update_engine_client --update --omaha_url=http://100.115.245.198:38072/update'
04/05 06:26:37.556 ERROR| base_utils:0280| [stderr] [0405/062636:INFO:update_engine_client.cc(471)] Forcing an update by setting app_version to ForcedUpdate.
04/05 06:26:37.558 ERROR| base_utils:0280| [stderr] [0405/062636:INFO:update_engine_client.cc(473)] Initiating update check and install.
04/05 06:26:37.559 ERROR| base_utils:0280| [stderr] [0405/062636:INFO:update_engine_client.cc(502)] Waiting for update to complete.
04/05 06:29:43.564 ERROR| base_utils:0280| [stderr] [0405/062942:INFO:update_engine_client.cc(224)] Update succeeded -- reboot needed.
04/05 06:29:43.569 WARNI| autoupdater:0315| Unable to parse build name /update for metrics. Continuing anyway.
04/05 06:29:43.569 DEBUG| ssh_host:0284| Running (ssh) '/usr/bin/update_engine_client -status | grep CURRENT_OP'
04/05 06:29:43.792 ERROR| base_utils:0280| [stderr] [0405/062942:INFO:update_engine_client.cc(493)] Querying Update Engine status...
04/05 06:29:43.794 DEBUG| base_utils:0280| [stdout] CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT
04/05 06:29:43.796 INFO | autoupdater:0521| Updating stateful partition...

Why are these even ERRORs? More similar errors later:

04/05 06:31:16.954 ERROR| base_utils:0280| [stderr] [0405/063115:INFO:update_engine_client.cc(493)] Querying Update Engine status...
04/05 06:31:16.957 DEBUG| base_utils:0280| [stdout] CURRENT_OP=UPDATE_STATUS_IDLE
04/05 06:31:16.958 INFO | autoupdater:0253| Triggering update via: /usr/bin/update_engine_client --check_for_update --omaha_url=http://100.115.245.198:36881/update
04/05 06:31:16.959 DEBUG| ssh_host:0284| Running (ssh) '/usr/bin/update_engine_client --check_for_update --omaha_url=http://100.115.245.198:36881/update'
04/05 06:31:17.142 ERROR| base_utils:0280| [stderr] [0405/063116:INFO:update_engine_client.cc(473)] Initiating update check and install.

These are not really addressing the problem in comment #0 though.
Comment 6, Apr 7 2017
So, there are three cases, I think.

1) First instance in #4: I think this is a real example of #0. A specific function hits an error, but a later phase recovers from it.

2) tko parser in #4: in TKO, "dprint" seems to be used for a mix of "INFO"-like and "ERROR"-like messages. We should probably walk through the TKO code and separate the two cleanly.

3) SSH in #5: by default, stderr from a subprocess run (including SSH) is redirected to ERROR logging. However, the severity of stderr output differs between commands, and in these cases the stderr output is really "INFO". Note, though, that if update_engine_client.cc outputs actual ERROR logs, we may still want to redirect those to ERROR.

TBH, 2) and 3) sound like a slightly different topic, namely "logging severity is not set properly", while 1) is, as mentioned above, an example of #0. Luigi, which do you think is more important? I think I can work quickly on (at least a point fix for) cases 2 and 3, while 1) sounds like a less clear-cut topic. WDYT, Luigi? (See the sketch below for how case 3 arises.)
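A sketch of case 3, i.e. how subprocess stderr ends up at ERROR severity no matter what the command meant by it (the function shape is an assumption based on the base_utils lines in the excerpts above, not the actual autotest implementation):

    import logging
    import subprocess

    def run(cmd, stderr_level=logging.ERROR):
        # Tee child stdout/stderr into the logging system, one fixed
        # severity per stream for the whole command.
        proc = subprocess.Popen(cmd, shell=True, universal_newlines=True,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        out, err = proc.communicate()
        for line in out.splitlines():
            logging.info('[stdout] %s', line)
        for line in err.splitlines():
            # Severity is fixed per call, not per message: an INFO-looking
            # status line from update_engine_client still lands at ERROR.
            logging.log(stderr_level, '[stderr] %s', line)
        return proc.returncode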
Comment 7, Apr 11 2017
Had a bit more investigation. Seems like the wget/cp errors come from unexpected stderr output from subprocess commands. So a quick fix would be something like https://chromium-review.googlesource.com/c/474786/. We should think about this a bit more, though, IMHO: if an actual error occurs, what is expected here? The stderr output can contain useful information, and if it has already been output to INFO (or ignored), there is no way to recover it. Several options:

1) Output stderr to INFO. If an error occurs, just output some short message about it at ERROR.

2) Output stderr to INFO, but buffer it as well. If an error occurs, output the buffered stderr to ERROR, too. Cons: it duplicates the log between INFO and ERROR, and buffering may need a lot of memory.

3) Just buffer the stderr, and decide on subprocess completion whether INFO or ERROR is the better level to output at. Similarly, this may need a big buffer.

4) A slightly more aggressive approach: rewrite the code so that we can learn the detailed error condition in the Python world. E.g., get rid of "wget" here and use urllib2.urlopen (probably with a small wrapper). Cons: diverges more from ToT autotest.

WDYT? (A sketch of option 2 follows.)
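To make option 2 concrete, a minimal sketch (run_buffered is a hypothetical helper, not the autotest run() API):

    import logging
    import subprocess

    def run_buffered(cmd):
        # Option 2: stream stderr at INFO, keep a copy, and replay it
        # at ERROR only when the command actually fails.
        proc = subprocess.Popen(cmd, shell=True, universal_newlines=True,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        out, err = proc.communicate()
        buffered = err.splitlines()
        for line in buffered:
            logging.info('[stderr] %s', line)
        if proc.returncode != 0:
            logging.error('command failed (rc=%d): %s',
                          proc.returncode, cmd)
            for line in buffered:
                # The INFO/ERROR duplication mentioned in the cons above.
                logging.error('[stderr] %s', line)
        return proc.returncode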
Comment 8, Apr 20 2017
Hi Hidehiko, and sorry for the delay here. I am not sure it is always possible to change the ERROR logging, because in some cases those failures actually result in an error, and in other cases they are recovered. The function that performs the low-level action doesn't know whether the caller will recover or not.

One problem is that we always log something instead of fully propagating the error data to the caller, which could then decide whether to discard it (because it's ignoring the error) or log it, together with more context information. One possibility is to pass an extra argument to the callee telling it whether the caller will recover on error, and on which kinds of error (maybe a list of exceptions). Then the callee could log the information as ERROR or WARNING depending on that. But that gets complicated too.

So ideally we should come up with guidelines for dealing with these situations, and then start applying them. I favor some discipline in propagating error information up, and letting the caller decide what to log; see the sketch below. Not sure how easy it is to move in that direction. Let me think some more.
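A minimal sketch of the "propagate up, let the caller log" discipline (the exception type, helper names, and the host.run()/ignore_status call are illustrative assumptions):

    import logging

    class ChecksumFetchError(Exception):
        # Carries the error context upward instead of logging ERROR at
        # the failure site.
        pass

    def fetch_checksum(host):
        result = host.run('cat /usr/local/autotest/.checksum',
                          ignore_status=True)
        if result.exit_status != 0:
            raise ChecksumFetchError(result.stderr)
        return result.stdout

    def ensure_checksum(host):
        # The caller knows this error is recoverable, so it logs the
        # recovery (per #0) instead of an ERROR line.
        try:
            return fetch_checksum(host)
        except ChecksumFetchError as e:
            logging.warning('no .checksum (%s); regenerating it', e)
            return None  # a real caller would regenerate the file here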
Comment 9, Apr 20 2017
Here's another example. Perhaps ERROR should not be used here.

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/113507434-chromeos-test/chromeos6-row2-rack20-host1/debug/

04/20 09:44:44.719 DEBUG| ssh_host:0284| Running (ssh) '/usr/bin/update_engine_client --update --omaha_url=http://100.115.185.227:33117/update'
04/20 09:44:44.900 ERROR| base_utils:0280| [stderr] [0420/094444:INFO:update_engine_client.cc(471)] Forcing an update by setting app_version to ForcedUpdate.
04/20 09:44:44.901 ERROR| base_utils:0280| [stderr] [0420/094444:INFO:update_engine_client.cc(473)] Initiating update check and install.
04/20 09:44:44.978 ERROR| base_utils:0280| [stderr] [0420/094444:INFO:update_engine_client.cc(502)] Waiting for update to complete.
04/20 09:50:43.773 ERROR| base_utils:0280| [stderr] [0420/095043:INFO:update_engine_client.cc(224)] Update succeeded -- reboot needed.
04/20 09:50:43.857 WARNI| autoupdater:0315| Unable to parse build name /update for metrics. Continuing anyway.
04/20 09:50:43.857 DEBUG| ssh_host:0284| Running (ssh) '/usr/bin/update_engine_client -status | grep CURRENT_OP'
04/20 09:50:44.178 ERROR| base_utils:0280| [stderr] [0420/095043:INFO:update_engine_client.cc(493)] Querying Update Engine status...
04/20 09:50:44.255 DEBUG| base_utils:0280| [stdout] CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT
04/20 09:50:44.258 INFO | autoupdater:0521| Updating stateful partition...
Comment 10, May 17 2017
Re #8: no update for a while, so I investigated this case and thought about it again. It is probably much simpler if we split the problem into two perspectives. Here's my proposal.

1) General logging strategy. IMHO, it is much simpler if the callee can decide the logging level by itself. Indeed, the reported cases are just verbose, so we'd like to suppress them. However, "whether the error is recoverable" is hard to determine, because callers and callees are often stacked, and which layer should handle the error is not trivial. Propagating the "recoverable" information sounds practically too complex to me, from that perspective. Instead, I believe (or assume) that, as long as we have enough control, a reasonable logging severity can be assigned at the logging site itself.

2) Logging strategy for subprocesses. Autotest runs many subprocesses, and the problematic cases Luigi reported fall into this category. We have less control over error messages here. What we can control is whether we output the subprocess's stdout/stderr at all, and, if we do, at what severity. Currently, by default, stdout is redirected to INFO and stderr to ERROR. I'd propose the following strategy for better logging:

1) If the error is easily recoverable by the direct command executor, redirect stderr to INFO or DEBUG.
2) If stderr is used not only for error messages but also for status updates in the success case, redirect stderr to INFO or DEBUG, too. In the unrecoverable case, the caller should log its own ERROR line.
3) Otherwise, keep redirecting stderr to ERROR.

Let's consider the examples. Case 1) covers #4's "cat" command error, which is generated when the file does not exist. It would be nice to also log something like ".checksum file is not found, so creating...". Case 2) covers the other cases: the ssh error, the tko parsing output, and the update_engine_client logging are actually status updates, so they should be redirected to INFO (or DEBUG). On an actual error, the output is unfortunately still at INFO (or DEBUG), because we have no simple way to redirect it to ERROR after the fact. We could tweak this, e.g. buffer the output and decide the severity later, but that is yet more complexity: it may shuffle the relative order of stdout and stderr lines, which is not good, and preserving the buffered log on an unexpected shutdown (e.g. a timeout) adds further complexity. Instead, the caller can easily log an ERROR message in the unrecoverable case, and the log reader (human) can check the lines just above, which should contain useful information, at least in autoserv.DEBUG files. The severity of a command's stderr can easily be switched by just passing the |stderr_is_expected| arg; see the sketch below. WDYT, Luigi?
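A hypothetical call site following this strategy (the exact run() signature is assumed; |stderr_is_expected| and |ignore_status| are modeled on autotest's utils.run()):

    import logging

    def trigger_update(host, omaha_url):
        # update_engine_client writes status to stderr even on success
        # (case 2 above), so mark stderr as expected; its lines then go
        # to INFO/DEBUG instead of ERROR.
        result = host.run('/usr/bin/update_engine_client --update '
                          '--omaha_url=%s' % omaha_url,
                          stderr_is_expected=True, ignore_status=True)
        if result.exit_status != 0:
            # The unrecoverable case: the caller logs its own ERROR line,
            # right after the INFO-level stderr it refers to.
            logging.error('update failed (rc=%d)', result.exit_status)
        return result.exit_status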