New issue
Advanced search Search tips

Issue 598754 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

Isolate client segfaults when it can't reach remote service.

Project Member Reported by d...@chromium.org, Mar 29 2016

Issue description

Apparently Isolate client attempts to access the remote Isolate service when it tests, and will segfault if it fails.

ok  	github.com/luci/luci-go/client/isolate	0.103s
2016/03/29 09:52:19 POST /_ah/api/isolateservice/v1/preupload
2016/03/29 09:52:19 POST /_ah/api/isolateservice/v1/preupload
2016/03/29 09:52:19 POST /_ah/api/isolateservice/v1/preupload
2016/03/29 09:52:19 Task failed, retrying after a sleep of 0: http request failed: Service Unavailable (HTTP 503)
2016/03/29 09:52:19 POST /_ah/api/isolateservice/v1/preupload
2016/03/29 09:52:19 POST /_ah/api/isolateservice/v1/preupload
2016/03/29 09:52:19 POST /_ah/api/isolateservice/v1/store_inline
2016/03/29 09:52:19 POST /_ah/api/isolateservice/v1/store_inline
2016/03/29 09:52:19 PUT  /fake/cloudstorage
2016/03/29 09:52:19 POST /_ah/api/isolateservice/v1/store_inline
2016/03/29 09:52:19 Task failed, retrying after a sleep of 0: http request failed: Service Unavailable (HTTP 503)
2016/03/29 09:52:19 POST /_ah/api/isolateservice/v1/store_inline
2016/03/29 09:52:19 Task failed, retrying after a sleep of 0: http request failed: Service Unavailable (HTTP 503)
2016/03/29 09:52:19 PUT  /fake/cloudstorage
2016/03/29 09:52:19 PUT  /fake/cloudstorage
2016/03/29 09:52:19 Task failed, retrying after a sleep of 0: Put http://127.0.0.1:50976/fake/cloudstorage?digest=639b8c971d725db4b2eed1f1794c6c8a75ba09fb: read tcp 127.0.0.1:50977->127.0.0.1:50976: wsarecv: An existing connection was forcibly closed by the remote host.
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x28 pc=0x4896cc]

goroutine 34 [running]:
panic(0x7d12e0, 0xc082012060)
	E:/b/build/slave/luci-go-win64/build/golang/go/src/runtime/panic.go:464 +0x3f4
github.com/luci/luci-go/client/isolatedclient.(*compressed).Read(0xc082061350, 0xc0821ae000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	E:/b/build/slave/luci-go-win64/build/infra/go/src/github.com/luci/luci-go/client/isolatedclient/isolatedclient.go:250 +0x6c
io/ioutil.(*nopCloser).Read(0xc082013060, 0xc0821ae000, 0x8000, 0x8000, 0xc0821820b0, 0x0, 0x0)
	<autogenerated>:4 +0x89
io.(*multiReader).Read(0xc0820a9880, 0xc0821ae000, 0x8000, 0x8000, 0x1, 0x0, 0x0)
	E:/b/build/slave/luci-go-win64/build/golang/go/src/io/multi.go:13 +0xab
io.copyBuffer(0x28b01c0, 0xc082013390, 0x28b00f0, 0xc0820a9880, 0xc0821ae000, 0x8000, 0x8000, 0x4018, 0x0, 0x0)
	E:/b/build/slave/luci-go-win64/build/golang/go/src/io/io.go:380 +0x24e
io.Copy(0x28b01c0, 0xc082013390, 0x28b00f0, 0xc0820a9880, 0xc082013390, 0x0, 0x0)
	E:/b/build/slave/luci-go-win64/build/golang/go/src/io/io.go:350 +0x6b
net/http.(*transferWriter).WriteBody(0xc0821902a0, 0x28b0168, 0xc0821820b0, 0x0, 0x0)
	E:/b/build/slave/luci-go-win64/build/golang/go/src/net/http/transfer.go:218 +0x2b9
net/http.(*Request).write(0xc08239e620, 0x28b0078, 0xc082100080, 0x28b0000, 0xc082061470, 0x0, 0x0, 0x0)
	E:/b/build/slave/luci-go-win64/build/golang/go/src/net/http/request.go:497 +0xb9b
net/http.(*persistConn).writeLoop(0xc08203bd40)
	E:/b/build/slave/luci-go-win64/build/golang/go/src/net/http/transport.go:1279 +0x2e3
created by net/http.(*Transport).dialConn
	E:/b/build/slave/luci-go-win64/build/golang/go/src/net/http/transport.go:854 +0x10d2
FAIL	github.com/luci/luci-go/client/isolatedclient	0.157s

PTAL. I think the remote service alone is bad for tests, but the failure handling definitely seems like a bug.
 
I admit it's known issue to me, but I wasn't able to reliably and trivially repro this so far. Have you? Is this from buildbot waterfall/tryjob?

Comment 2 by d...@chromium.org, Mar 29 2016

Yep, this is from the Win64 luci-go tryserver logs.

It may be hard to reproduce since it relies on remote service flake. IMO the actual solution here is to mock the remote service during the test.

The line in question is: return c.r.Read(p)

AFAICT, "c" and "p" are not nil, meaning "c.r" probably is. This only really happens when Close is called, which it is in defer: https://github.com/luci/luci-go/blob/master/client/isolatedclient/isolatedclient.go#L182

Since the only thing that has a "compressed" instance is the HTTP library, and the stack trace is coming from the HTTP library, it seems like the problem is that the HTTP request retains the "compressed" instance and is trying to Read from the "compressed" object after it has been closed.

Perhaps the thing to do is have "Read" return io.EOF if "compressed" has already been closed?
Owner: tandrii@chromium.org
Status: Assigned (was: Untriaged)
assigning to myself to be looked in mid-april.
Project Member

Comment 5 by bugdroid1@chromium.org, Apr 1 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-go.git/+/50fe3c609c97b5effbc76d816a439fd19ab620ec

commit 50fe3c609c97b5effbc76d816a439fd19ab620ec
Author: dnj <dnj@chromium.org>
Date: Fri Apr 01 19:25:47 2016

Isolate: Use generators instead of seekers

Have Isolate use generators to obtain io.ReadCloser instances instead of
reusing a single io.ReadCloser instance via Seek.

This greatly simplifies the code and ensures a clean division between
HTTP request data on retry. In the standard case, this will require one
additional file open per archival, but will demand far fewer
simultaneously-open file handles.

BUG= chromium:598754 
TEST=None

Review URL: https://codereview.chromium.org/1846263002

[modify] https://crrev.com/50fe3c609c97b5effbc76d816a439fd19ab620ec/client/archiver/archiver.go
[modify] https://crrev.com/50fe3c609c97b5effbc76d816a439fd19ab620ec/client/archiver/archiver_test.go
[modify] https://crrev.com/50fe3c609c97b5effbc76d816a439fd19ab620ec/client/archiver/utils.go
[modify] https://crrev.com/50fe3c609c97b5effbc76d816a439fd19ab620ec/client/internal/lhttp/client.go
[modify] https://crrev.com/50fe3c609c97b5effbc76d816a439fd19ab620ec/client/internal/lhttp/client_test.go
[modify] https://crrev.com/50fe3c609c97b5effbc76d816a439fd19ab620ec/client/isolate/isolate.go
[modify] https://crrev.com/50fe3c609c97b5effbc76d816a439fd19ab620ec/client/isolatedclient/isolatedclient.go
[modify] https://crrev.com/50fe3c609c97b5effbc76d816a439fd19ab620ec/client/isolatedclient/isolatedclient_test.go

Comment 6 by d...@chromium.org, Apr 1 2016

Status: Fixed (was: Assigned)
I'm optimistically marking this "Fixed" with the CL in #5; please re-open if this sort of failure is observed again.

Comment 7 by aga...@chromium.org, Apr 27 2016

Components: Infra>Platform
Labels: -Infra-Platform

Sign in to add a comment