New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 679423 link

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



Sign in to add a comment

run_suite shows no error in stdio, but test side logging does.

Project Member Reported by semenzato@chromium.org, Jan 9 2017

Issue description




01/09 09:28:57.231 INFO |        server_job:0153| START	95318788-chromeos-test/chromeos4-row11-rack7-host22/autoupdate_EndToEndTest.paygen_au_dev_full	autoupdate_EndToEndTest.paygen_au_dev_full	timestamp=1483981838	localtime=Jan 09 09:10:38	
01/09 09:28:57.232 INFO |        server_job:0153| 	GOOD	95318788-chromeos-test/chromeos4-row11-rack7-host22/autoupdate_EndToEndTest.paygen_au_dev_full	autoupdate_EndToEndTest.paygen_au_dev_full	timestamp=1483982806	localtime=Jan 09 09:26:46	completed successfully
01/09 09:28:57.233 INFO |        server_job:0153| END GOOD	95318788-chromeos-test/chromeos4-row11-rack7-host22/autoupdate_EndToEndTest.paygen_au_dev_full	autoupdate_EndToEndTest.paygen_au_dev_full	timestamp=1483982806	localtime=Jan 09 09:26:46	
01/09 09:28:57.233 DEBUG|             suite:1104| Adding status keyval for autoupdate_EndToEndTest.paygen_au_dev_full=95318788-chromeos-test
01/09 09:28:57.235 DEBUG|          monitors:0191| ts_mon: sending 11 metrics to PubSub
01/09 09:35:23.684 ERROR|                db:0025| 09:35:23 01/09/17: An operational error occurred during a database operation: (2006, 'MySQL server has gone away'); retrying, don't panic yet
01/09 09:36:00.108 INFO |        server_job:0153| START	95318795-chromeos-test/chromeos4-row11-rack8-host1/autoupdate_EndToEndTest.paygen_au_dev_full	autoupdate_EndToEndTest.paygen_au_dev_full	timestamp=1483982005	localtime=Jan 09 09:13:25	
01/09 09:36:00.109 INFO |        server_job:0153| 	FAIL	95318795-chromeos-test/chromeos4-row11-rack8-host1/autoupdate_EndToEndTest.paygen_au_dev_full	autoupdate_EndToEndTest.paygen_au_dev_full	timestamp=1483983207	localtime=Jan 09 09:33:27	Failed to receive a download finished notification (download_finished) within 600 seconds. This could be a problem with the updater or a connectivity issue. For more details, check the update_engine log (in sysinfo or on the DUT, also included in the test log).
01/09 09:36:00.110 INFO |        server_job:0153| END FAIL	95318795-chromeos-test/chromeos4-row11-rack8-host1/autoupdate_EndToEndTest.paygen_au_dev_full	autoupdate_EndToEndTest.paygen_au_dev_full	timestamp=1483983207	localtime=Jan 09 09:33:27	
 
Cc: rspangler@chromium.org dgarr...@chromium.org snanda@chromium.org kinaba@chromium.org
Components: Infra>Client>ChromeOS
Labels: -Pri-3 OS-Chrome Pri-2
(Sorry, typed <enter> while editing Summary.)

This is the builder:

https://uberchromegw.corp.google.com/i/chromeos/builders/lars-release/builds/744

There is no error in the stdio.

But there is an error (snippet above) in autoserv.DEBUG for the suite:

http://chromeos-server6.mtv.corp.google.com/results/95318765-chromeos-test/hostless/debug/autoserv.DEBUG

Shouldn't this error be propagated to the stdio?  Maybe it's another symptom of the swarming timeout bug?

Owner: jrbarnette@chromium.org
Summary: run_suite shows no error in stdio, but test side logging does. (was: PaygenTest on lars-release failed, but there is no error in the stdio)
Owner: dgarr...@chromium.org
I don't know of a reason why this is something that I both can and
should fix.

Status: Assigned (was: Untriaged)
Didn't you do work on run_suite recently? For some reason I thought you had, in addition to everything else.
Labels: -Pri-2 Hotlist-Fixit Pri-3
Owner: pprabhu@chromium.org
Labels: -Hotlist-Fixit current-issue
Owner: ----
Status: Untriaged (was: Assigned)
Looking at the timeline of events, the builder timed out at 9:50, while the suite finished after 9:56.

There is the issue that while waiting for a suite, we don't return any "progress" information back to the builder. We just make one long run_suite call to get the results when the test finishes. It'd be better if all run_suite calls returned immediately (like the first one) with some information about the current state of the suite.
That is a good feature request, and should be handled as such. Some things to consider:
- We probably won't have much information about the error even then, just which of the subjobs have started/finished/failed etc.
- This is not a fixit sized feature.

This isn't on my plate, so unassigning.
In particular note  that when the builder timed out, the run_suite call hadn't failed, it was just waiting around. Hence no failure information back in the build logs.
Labels: -current-issue Hotlist-Fixit
- possible approach: run suite in a loop with short timeout to update results 
- might be fixit size, adding back to fixit list. 
Labels: Logging-improv
Cc: akes...@chromium.org
Owner: hidehiko@chromium.org
Status: Assigned (was: Untriaged)
Looking.

I think this is very useful for product-side ChromeOS engs,
because many people are familiar with bot page, and stdout is the way to see the log. Specifically, this is similar to Chrome.

I'm walking through the flow of logs, and looking for a good channel to communicate from DUT to bot.

Some high level random ideas;
- Read autoserve.DEBUG file on storage server, and let run_suite dump it to stdout.
  A kind of most straightforward solution.
  IIUC this needs some credential installed in the bot?

- Store the log in TKO.
  Then, the server can return the log via RPC.
  - Some concerns: Is TKO powerful enough to serve the whole log data?
    Similarly, TKO DB has enough space to store such logs?

- Send some log from DUT to AFE, and let it return to bots.
  - Concern: similar to TKO solution, is AFE powerful enough? Recently, looks overloaded?
  - AFAIK, DUT is not directly communicating with AFE now (am I true?). If so another RPC point may be needed in AFE.

- Directly connect from bot to DUT somehow.
  - DUT needs to be a server to return logs.
  - Similarly, does credential need to be installed?

I'd like to hear recommendation/advice by chromeos-infra experts.
WDYT?

#12 thank you for your questions and suggestions.

Looking back at this, I see the explanation for this behavior in comment #8.  Essentially, there are two failures: the one in the snippet, and whatever else caused some tests to not complete before the timeout.  IIUC, the timeout happens in the caller (by the way, what's the matter with the timeout length?  27,000 seconds is many hours, but the log spans less than 1 hour).  There is no attempt to collect information about failures that may have happened in the remote devices.

I am not sure how important it is to collect and summarize that information.  It would be useful for stats, but I haven't seen this situation happen a lot, and I think it would be a lot of effort to collect it.  So I am not completely sure this is worth doing at this point.

To answer some of the questions in #12:

- Some concerns: Is TKO powerful enough to serve the whole log data?
    Similarly, TKO DB has enough space to store such logs?

I am not the expert here, but I think this is a valid concern and I suspect that the answers are NO and NO.  Same for the later AFE question.


Status: Archived (was: Assigned)

Sign in to add a comment