Lots of tasks fail with GOMA_ERROR when using the externally-accessible goma backend
Issue description

I get lots of GOMA_ERROR tasks when using the externally-accessible goma backend: 1031 failed tasks for a content_shell build. The response_header for these tasks says:

    HTTP/1.1 502 Bad Gateway
    Content-Type: text/html; charset=UTF-8
    Referrer-Policy: no-referrer
    Content-Length: 332
    Date: Thu, 08 Nov 2018 08:22:57 GMT
    Alt-Svc: clear

Using Linux, near the tip of chromium master.
Nov 8
I also see, in tasks that succeeded with state RETRY, a bunch of "Required file not on goma cache" lines in the error_message field (expected), then:

    compiler_proxy:36.600457955s input: rpc error: code = Unavailable desc = transport is closing
    compiler_proxy:36.600470167s input: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.47.244.101:5050: connect: connection refused"

followed by the last error repeated lots of times, with the same address mentioned each time. I'm not sure if this is a separate issue or not (since I don't know your backend architecture).
Nov 8
> So, not all tasks are GOMA_ERROR, right?

Correct: about 1000 compile steps out of ~25000 ninja steps failed. The build succeeded, but it was slow: 28m29.102s. Rebuilds took 12 and then 17 minutes.

> We have to investigate, but probably we have to check our network configuration?

Yes, probably.
Nov 8
RETRY is maybe another issue. Ah, since the backend has only just started, it doesn't have much of a file cache yet, so you will see lots of RETRY. If `missing inputs` (an error meaning that not all files exist in the goma backend) occurs 4 times in a row, the goma client gives up and runs the compile locally. That is also reported as GOMA_ERROR, and you will probably see hundreds of GOMA_ERRORs on the first build, because you're the first user of the cluster. (We should have used the external cluster ourselves first...) Note that we don't send all missing files at once (there is a heuristic here). The RETRYs should resolve eventually, and your second build should probably work well (if there are no network errors).
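For illustration, here is a minimal sketch of the retry-then-fallback behaviour described above (this is not the actual goma client code; the names and callables are invented): the client retries a task that keeps hitting `missing inputs`, uploading some more of the missing files each time, and after 4 such attempts it gives up and compiles locally, which is what then shows up as a GOMA_ERROR task.

    MAX_MISSING_INPUT_RETRIES = 4  # limit described in the comment above

    def compile_with_retries(task, run_remotely, run_locally):
        # run_remotely and run_locally are hypothetical callables standing in
        # for the real remote-compile and local-fallback paths.
        for _ in range(MAX_MISSING_INPUT_RETRIES):
            result = run_remotely(task)
            if result.status != "missing inputs":
                return result                   # success, or some unrelated error
            task.upload_some_missing_files()    # not all at once; a heuristic picks them
        # Too many "missing inputs" responses: give up and compile locally.
        # The task is then reported as GOMA_ERROR.
        return run_locally(task)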
Nov 8
Tasks with RETRY state are expected, but in tests with our own goma backend I have been able to populate the input cache from scratch within a single content_shell build without jobs retrying more than 4 times. I test our backend with GOMA_USE_LOCAL=false and GOMA_FALLBACK=false, so this would not have succeeded unless it was handled entirely by the backend. The connection refused errors are the unexpected part: most likely something specific to your backend, and maybe related to the HTTP 502 / Bad Gateway errors in the tasks with GOMA_ERROR state?
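For reference, a backend-only test build like the one described above can be run roughly as follows. This is only a sketch: the goma_ctl.py path, output directory, -j value and target are examples, and only the two GOMA_* variables come from the description above.

    import os
    import subprocess

    env = dict(os.environ)
    env["GOMA_USE_LOCAL"] = "false"  # don't run compiles locally just because the machine is idle
    env["GOMA_FALLBACK"] = "false"   # don't fall back to a local compile when the remote one fails

    # Restart the goma client so the flags take effect, then build through it.
    subprocess.run(["python", "goma/goma_ctl.py", "restart"], env=env, check=True)
    subprocess.run(["ninja", "-C", "out/Release", "-j", "800", "content_shell"],
                   env=env, check=True)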
Nov 9
How many -j do you use? Maybe we need to raise the backend capacity.
Nov 9
I was using -j800.
Nov 9
Sorry, currently our backend only has 100 workers. We will increase the number of workers soon (ETA: beginning of next week).
Nov 9
Good to know. You should probably suggest a reasonable upper limit for the number of build threads to other trial users (if there are any).
Nov 12
We increased the number of workers to 800, so I think you can do -j800 now. But we'll check how much parallelism we can accept.
Nov 12
I fixed the backend capacity settings. I think that should reduce the 502 errors.
Nov 12
Thanks for the update. I tried a couple of Linux content_shell builds (with enable_nacl=false). Cache hits seem reasonably quick, but compile jobs are sometimes slower than expected. I'm still getting HTTP 502 errors, and at one point HTTP 500 errors. It took a few attempts before my build succeeded (with some goma client restarts in the middle), and I ended up with 320 failed/GOMA_ERROR tasks. Also, the goma client aborted during my second attempt (filed crbug.com/904340).
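For context, the build configuration for these tests was roughly the following. This is a sketch: only enable_nacl=false is stated above; the output directory and the other args (including use_goma=true and is_debug=false) are assumptions.

    import pathlib
    import subprocess

    out_dir = pathlib.Path("out/Release")
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumed args.gn contents; only enable_nacl=false comes from the comment above.
    (out_dir / "args.gn").write_text(
        "is_debug = false\n"
        "use_goma = true\n"
        "enable_nacl = false\n"
    )
    subprocess.run(["gn", "gen", str(out_dir)], check=True)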
Nov 14
It looks like the 502 errors are reduced a lot, but I'm still seeing GOMA_ERRORs due to backend server errors. It seems we need to allocate more resources for the servers.
Nov 15
I've fixed the resource allocation, so you should not see many GOMA_ERRORs now. As far as I tested (a full chromium release build and a full chromium build), I observed 3 GOMA_ERRORs. Could you try it again?
Nov 15
It looks much healthier now: I can make complete content_shell builds without any GOMA_ERRORs, and the performance is about what I expect. I guess we can mark this closed now.
Nov 16
Thanks!