
Issue 903084


Issue metadata

Status: Fixed
Owner:
Closed: Nov 16
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 3
Type: Bug




lots of tasks fail with GOMAERROR when using the externally-accessible goma backend

Project Member Reported by most...@vewd.com, Nov 8

Issue description

I get lots of GOMAERROR tasks when using the externally-accessible goma backend: 1031 failed tasks for a content_shell build.

The response_header for these tasks says:

HTTP/1.1 502 Bad Gateway
 Content-Type: text/html; charset=UTF-8
 Referrer-Policy: no-referrer
 Content-Length: 332
 Date: Thu, 08 Nov 2018 08:22:57 GMT
 Alt-Svc: clear

Using Linux, near the tip of Chromium master.
 
So, not all tasks are GOMA_ERROR, right?

We need to investigate; probably we need to check our network configuration.


In tasks that succeeded with state RETRY, I also see the following in the error_message field:

A number of "Required file not on goma cache" lines (expected), followed by:
compiler_proxy:36.600457955sinput: rpc error: code = Unavailable desc = transport is closing
compiler_proxy:36.600470167sinput: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.47.244.101:5050: connect: connection refused"

The last error is then repeated many times, always mentioning the same address.

I'm not sure if this is a separate issue or not (since I don't know your backend architecture).

> So, not all tasks are GOMA_ERROR, right?

Correct; about 1000 compile steps out of ~25000 ninja steps failed.

The build succeeded, but it was slow: 28m29.102s.  Rebuilds took 12 and then 17 minutes.

> We have to investigate, but probably we have to check our network configuration?

Yes, probably.

RETRY may be a separate issue.

Ah, since the backend has only just started, there is not much in the file cache yet, so you will see lots of RETRY. If `missing inputs` (an error meaning that not all input files exist on the goma backend) occurs 4 times in a row, the goma client gives up and runs the compile locally. That is also reported as GOMA_ERROR, so you will probably see hundreds of GOMA_ERRORs on the first build, because you're the first user of this cluster. (We should have used the external cluster ourselves first...)

Note that we don't send all missing files at once (there is a heuristic here).
RETRY should resolve itself eventually. Your second build should probably work well (barring network errors).
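
For illustration, a minimal sketch of the retry-then-local-fallback behavior described above; the limit of four consecutive missing-input responses is the only detail taken from this comment, and all names and structure are invented rather than taken from the actual goma client:

# Illustrative sketch only, not goma client code: upload the files the backend
# reports missing, retry, and fall back to a local compile after too many
# consecutive misses.
from dataclasses import dataclass, field

@dataclass
class RemoteResult:
    ok: bool
    missing_inputs: list = field(default_factory=list)

def run_task(execute_remotely, compile_locally, inputs, max_missing_retries=4):
    """execute_remotely(inputs) -> RemoteResult; compile_locally() is the fallback."""
    missing_failures = 0
    while missing_failures < max_missing_retries:
        result = execute_remotely(inputs)
        if result.ok:
            return "remote"                      # finished on the backend
        if not result.missing_inputs:
            break                                # some other backend error
        # The backend reported missing inputs; upload them and retry (RETRY state).
        inputs = inputs + result.missing_inputs
        missing_failures += 1
    compile_locally()                            # this is what shows up as GOMA_ERROR
    return "local_fallback"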

Tasks with RETRY state are expected, but in tests with our own goma backend I have been able to populate the input cache from scratch within a single content_shell build without jobs retrying more than 4 times. I test our backend with GOMA_USE_LOCAL=false and GOMA_FALLBACK=false, so it would not have succeeded unless every task was handled entirely by the backend.

The connection refused errors are the unexpected part: most likely something specific to your backend, and maybe related to the HTTP 502 / Bad Gateway errors in the tasks with GOMAERROR state?
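
For reference, a backend-only test run like the one described above might look roughly as follows; the two GOMA_* flag names come from this thread, while the goma_ctl restart step, the output directory, the target, and the job count are assumptions for illustration:

# Hedged sketch of a backend-only test build; paths, target and -j value are
# placeholders, and it assumes GOMA_* flags are read by compiler_proxy at startup.
import os
import subprocess

env = dict(
    os.environ,
    GOMA_USE_LOCAL="false",   # don't run local compiles on idle cores
    GOMA_FALLBACK="false",    # don't fall back to a local compile when the backend fails
)

# Restart compiler_proxy so it picks up the flags above (assumed goma_ctl command).
subprocess.check_call(["goma_ctl.py", "restart"], env=env)

# Build against the backend only; any task the backend cannot handle fails the step.
subprocess.check_call(
    ["ninja", "-C", "out/Release", "-j", "800", "content_shell"], env=env
)
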
Owner: ukai@chromium.org
Status: Started (was: Untriaged)

How many -j do you use?

Maybe we need to raise the backend capacity.

I was using -j800.

Sorry, our backend currently has only 100 workers, so -j800 heavily oversubscribes it.
We will increase the number of workers soon (ETA: beginning of next week).

Good to know. You should probably suggest a reasonable upper limit on the number of parallel build jobs to other trial users (if there are any).

We increased the number of workers to 800.

I think you can use -j800 now, but we'll check how much parallelism we can actually accept.

I fixed the backend capacity settings.
I think that should reduce the 502 errors.

Thanks for the update. I tried a couple of Linux content_shell builds (with enable_nacl=false). Cache hits seem reasonably quick, but compile jobs are sometimes slower than expected. I'm still getting HTTP 502 errors, and at one point HTTP 500 errors. It took a few attempts before my build succeeded (with some goma client restarts in the middle), and I ended up with 320 failed/GOMAERROR tasks.

Also, the goma client aborted during my second attempt (filed crbug.com/904340).

It looks like the 502 errors have been reduced a lot, but there are still GOMAERRORs due to backend server errors.
It seems we need to allocate more resources to the servers.

I've fixed the resource allocation, so you should no longer see many GOMAERRORs.
In my own testing (a full chromium release build and a full chromium build), I observed 3 GOMAERRORs.
Could you try again?

It looks much healthier now: I can do complete content_shell builds without any GOMAERRORs, and the performance is about what I expect.

I guess we can mark this closed now.
Status: Fixed (was: Started)
thanks!
