run_suite shows no error in stdio, but test side logging does. |
|||||||||||
Issue description01/09 09:28:57.231 INFO | server_job:0153| START 95318788-chromeos-test/chromeos4-row11-rack7-host22/autoupdate_EndToEndTest.paygen_au_dev_full autoupdate_EndToEndTest.paygen_au_dev_full timestamp=1483981838 localtime=Jan 09 09:10:38 01/09 09:28:57.232 INFO | server_job:0153| GOOD 95318788-chromeos-test/chromeos4-row11-rack7-host22/autoupdate_EndToEndTest.paygen_au_dev_full autoupdate_EndToEndTest.paygen_au_dev_full timestamp=1483982806 localtime=Jan 09 09:26:46 completed successfully 01/09 09:28:57.233 INFO | server_job:0153| END GOOD 95318788-chromeos-test/chromeos4-row11-rack7-host22/autoupdate_EndToEndTest.paygen_au_dev_full autoupdate_EndToEndTest.paygen_au_dev_full timestamp=1483982806 localtime=Jan 09 09:26:46 01/09 09:28:57.233 DEBUG| suite:1104| Adding status keyval for autoupdate_EndToEndTest.paygen_au_dev_full=95318788-chromeos-test 01/09 09:28:57.235 DEBUG| monitors:0191| ts_mon: sending 11 metrics to PubSub 01/09 09:35:23.684 ERROR| db:0025| 09:35:23 01/09/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet 01/09 09:36:00.108 INFO | server_job:0153| START 95318795-chromeos-test/chromeos4-row11-rack8-host1/autoupdate_EndToEndTest.paygen_au_dev_full autoupdate_EndToEndTest.paygen_au_dev_full timestamp=1483982005 localtime=Jan 09 09:13:25 01/09 09:36:00.109 INFO | server_job:0153| FAIL 95318795-chromeos-test/chromeos4-row11-rack8-host1/autoupdate_EndToEndTest.paygen_au_dev_full autoupdate_EndToEndTest.paygen_au_dev_full timestamp=1483983207 localtime=Jan 09 09:33:27 Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log). 01/09 09:36:00.110 INFO | server_job:0153| END FAIL 95318795-chromeos-test/chromeos4-row11-rack8-host1/autoupdate_EndToEndTest.paygen_au_dev_full autoupdate_EndToEndTest.paygen_au_dev_full timestamp=1483983207 localtime=Jan 09 09:33:27
,
Jan 9 2017
,
Jan 9 2017
I don't know of a reason why this is something that I both can and should fix.
,
Jan 9 2017
,
Jan 9 2017
Didn't you do work on run_suite recently? For some reason I thought you had, in addition to everything else.
,
Jan 11 2017
,
Jan 26 2017
,
Jan 27 2017
Looking at the timeline of events, the builder timed out at 9:50, while the suite finished after 9:56. There is the issue that while waiting for a suite, we don't return any "progress" information back to the builder. We just make one long run_suite call to get the results when the test finishes. It'd be better if all run_suite calls returned immediately (like the first one) with some information about the current state of the suite. That is a good feature request, and should be handled as such. Some things to consider: - We probably won't have much information about the error even then, just which of the subjobs have started/finished/failed etc. - This is not a fixit sized feature. This isn't on my plate, so unassigning.
,
Jan 27 2017
In particular note that when the builder timed out, the run_suite call hadn't failed, it was just waiting around. Hence no failure information back in the build logs.
,
Feb 1 2017
- possible approach: run suite in a loop with short timeout to update results - might be fixit size, adding back to fixit list.
,
Mar 22 2017
,
Apr 5 2017
Looking.
I think this is very useful for product-side ChromeOS engs,
because many people are familiar with bot page, and stdout is the way to see the log. Specifically, this is similar to Chrome.
I'm walking through the flow of logs, and looking for a good channel to communicate from DUT to bot.
Some high level random ideas;
- Read autoserve.DEBUG file on storage server, and let run_suite dump it to stdout.
A kind of most straightforward solution.
IIUC this needs some credential installed in the bot?
- Store the log in TKO.
Then, the server can return the log via RPC.
- Some concerns: Is TKO powerful enough to serve the whole log data?
Similarly, TKO DB has enough space to store such logs?
- Send some log from DUT to AFE, and let it return to bots.
- Concern: similar to TKO solution, is AFE powerful enough? Recently, looks overloaded?
- AFAIK, DUT is not directly communicating with AFE now (am I true?). If so another RPC point may be needed in AFE.
- Directly connect from bot to DUT somehow.
- DUT needs to be a server to return logs.
- Similarly, does credential need to be installed?
I'd like to hear recommendation/advice by chromeos-infra experts.
WDYT?
,
Apr 6 2017
#12 thank you for your questions and suggestions.
Looking back at this, I see the explanation for this behavior in comment #8. Essentially, there are two failures: the one in the snippet, and whatever else caused some tests to not complete before the timeout. IIUC, the timeout happens in the caller (by the way, what's the matter with the timeout length? 27,000 seconds is many hours, but the log spans less than 1 hour). There is no attempt to collect information about failures that may have happened in the remote devices.
I am not sure how important it is to collect and summarize that information. It would be useful for stats, but I haven't seen this situation happen a lot, and I think it would be a lot of effort to collect it. So I am not completely sure this is worth doing at this point.
To answer some of the questions in #12:
- Some concerns: Is TKO powerful enough to serve the whole log data?
Similarly, TKO DB has enough space to store such logs?
I am not the expert here, but I think this is a valid concern and I suspect that the answers are NO and NO. Same for the later AFE question.
,
Mar 31 2018
|
|||||||||||
►
Sign in to add a comment |
|||||||||||
Comment 1 by semenzato@chromium.org
, Jan 9 2017Components: Infra>Client>ChromeOS
Labels: -Pri-3 OS-Chrome Pri-2