p2p_ShareFiles fails - "Expected exported file ..."
Reported by
jrbarnette@chromium.org,
Sep 6 2017
|
|||||
Issue description
The p2p_ShareFiles test has been seen to fail with this message:
Expected exported file on the DUT.
According to wmatrix, the first failures are in R63-9916.0.0, in
daisy and mighty. These are the suite jobs:
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=140170909
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=140170794
Additionally, the same failure has occurred in a recent CQ run
for nyan_kitty. This is the suite job:
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=140179965
Due to retries, the test fails at least twice each suite.
,
Sep 6 2017
This is the test failure log https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/140179965-chromeos-test/hostless/ from the recent failed CQ run http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=140179965 Chris, do you have context on p2p tests, or do you know who owns this test?
,
Sep 6 2017
wiley@ is no longer on the project; let's try adlr@
,
Sep 7 2017
Please escalate; this bug has caused a second CQ failure:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15984
https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/2850
,
Sep 7 2017
Recent failure & pass on Reef, https://wmatrix.googleplex.com/testrun/unfiltered?test_ids=550042427,549911530 Failure has same signature, File "/usr/local/autotest/tests/p2p_ShareFiles/p2p_ShareFiles.py", line 117, in run_once raise error.TestFail('Expected exported file on the DUT.') TestFail: Expected exported file on the DUT. code snippet, # Run the loop until the file is shared. Normally, the p2p-server takes # up to 1 second to detect a change on the shared directory and # announces it right away a few times. Wait until the file is announced, # what should not take more than a few seconds. If after 30 seconds the # files isn't announced, that is an error. self._run_lansim_loop(timeout=30, until=lambda: p2pcli.get_peer_files(peer_name)) files = p2pcli.get_peer_files(peer_name) if files != [('my_file=HASH==', 10)]: logging.info('Peer files: %r', files) raise error.TestFail('Expected exported file on the DUT.') Leading up to the failure, 09/06 21:44:36.539 DEBUG| dbus_send:0124| Got DBus response: 1 09/06 21:44:36.641 DEBUG| utils:0212| Running 'dbus-send --system --print-reply --reply-timeout=2000 --dest=org.freedesktop.Avahi / org.freedesktop.Avahi.Server.GetState' 09/06 21:44:36.665 DEBUG| dbus_send:0124| Got DBus response: 2 09/06 21:44:36.667 DEBUG| utils:0212| Running 'rm /tmp/avahi-conf.4X5B' 09/06 21:44:36.683 DEBUG| utils:0212| Running 'start p2p' 09/06 21:44:36.786 DEBUG| utils:0280| [stdout] p2p start/running, process 10128 09/06 21:44:56.847 INFO | p2p_ShareFiles:0116| Peer files: [('my_file=HASH==', 0)] While in the passing case, 09/06 17:20:07.611 DEBUG| dbus_send:0124| Got DBus response: 1 09/06 17:20:07.713 DEBUG| utils:0212| Running 'dbus-send --system --print-reply --reply-timeout=2000 --dest=org.freedesktop.Avahi / org.freedesktop.Avahi.Server.GetState' 09/06 17:20:07.736 DEBUG| dbus_send:0124| Got DBus response: 2 09/06 17:20:07.739 DEBUG| utils:0212| Running 'rm /tmp/avahi-conf.MUjr' 09/06 17:20:07.763 DEBUG| utils:0212| Running 'start p2p' 09/06 17:20:07.880 DEBUG| utils:0280| [stdout] p2p start/running, process 25237 09/06 17:20:30.901 DEBUG| test:0381| The test has completed successfully
,
Sep 7 2017
Third CQ failure:
https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/2853
,
Sep 7 2017
This appears in dmesg on the host that fails the p2p_ShareFiles test: 2017-09-07T19:00:00.518833+00:00 INFO p2p-client[7148]: Found peer for the given ID but there are already 100 download(s) in the LAN which exceeds the threshold of 3 download(s). Sleeping 30 seconds until retrying. [../../../../../../../../../mnt/host/source/src/platform2/p2p/client/peer_selector.cc:152]
,
Sep 7 2017
amin, can you take a look at this and see if you see anything suspicious?
,
Sep 8 2017
This doesn't seem to be causing any problem in the last two CQs at least. I'll keep looking into it to see if there is any problem.
,
Sep 8 2017
> This doesn't seem to be causing any problem in the last two CQs at least. I'll keep looking into it to see if there is any problem. The problem is intermittent, and has caused three failures in the past few days. Lack of failures in recent runs is no indication of any improvement.
,
Sep 8 2017
Till now it seems like that the p2p client gets the peer ID but it goes to sleep for 30 seconds because the number of connections in the system is exceeding 3 (actual 100) as it is shown in the error message in #7. On the other hand the test waits for 30 seconds for the p2p client to respond (https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/site_tests/p2p_ShareFiles/p2p_ShareFiles.py?rcl=c4eacc966febe1513d9cfab1f30da35cba9449bb&l=106), but p2p client is sleeping. I think the flake of this test comes from this 30 second border that sometimes crosses and causes failure. But the root cause might be the 100 connections and does anyone know how to see what those connections are and get their stats?
,
Sep 8 2017
Actually its not the number of connections in the client, but the accumulation of number of active downloads for all the peers that are found by the client. I looked at least two builds and for both the exact number 100 seems suspicious. Although this code hasn't changed since 2014 and I'm guessing this could be caused by some network issue too. But not sure yet.
,
Sep 8 2017
2017-09-07T18:59:58.461793+00:00 NOTICE p2p-client[7146]: TXT: len=18 data=num_connections=99 [../../../../../../../../../mnt/host/source/src/platform2/p2p/client/service_finder.cc:219] 99 num_connections maybe suspicious :(
,
Sep 11 2017
New sheriff here- this is still causing CQ failures. Latest one: https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/2909
,
Sep 11 2017
I looked into the code and cannot really replicate it. I'm looking more into it.
,
Sep 11 2017
I'm able to replicate it now. Looking for the root cause!
,
Sep 11 2017
The issue happened again. https://uberchromegw.corp.google.com/i/chromeos/builders/nyan_kitty-paladin/builds/2911. I am curious about why this issue only shows up occationally?
,
Sep 11 2017
p2p_SharedFile test code and p2p stuff has not changed in years. I'm not sure why this happening now. My initial guess is it is something related to network/dns. But I cannot exactly pinpoint where!
,
Sep 11 2017
Ok. I think I might have found the problem. in here: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/site_tests/p2p_ShareFiles/p2p_ShareFiles.py?rcl=d49d7f2f343d20e5c06896ea414c38cc3e7b2aa8&l=102 the p2p_ShareFiles test creates a small file (10 bytes) and waits for the p2p-server to announce the shared file. But, the p2p server is announcing the file with size 0. I'm guessing before the content of the file is written, the p2p-server detects the added file (with size zero) and reports it. This in the end causes this problem. This could be a symptom of a problem in kernel or disk or python problem too or just be this race condition. So a solution I'm suggesting is to run get_peer_files() function in a loop and check the result constantly unless the expected value is detected otherwise let it timeout.
,
Sep 11 2017
or simply do os.fsync to make sure the file writes to disk faster??!!
,
Sep 11 2017
I uploaded a fix CL here: https://chromium-review.googlesource.com/660783 Can someone add a reviewer except deymo@? deymo is not in the chromeos or aosp team anymore and he was the original owner of this project.
,
Sep 12 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2eaa7ae986afea881f3d9e54e5509f93effaffec commit 2eaa7ae986afea881f3d9e54e5509f93effaffec Author: Amin Hassani <ahassani@google.com> Date: Tue Sep 12 11:19:33 2017 p2p_ShareFiles: Fix file write race condition. Somewhere at the end of this test, a 10 byte files is created and assumed the p2p-server shares the file (with correct size) back to the simulator. But on some boards this may flake when the p2p-server shares the file info before its content is written to disk. This causes the test fail because the expected file on the DUT was not found. This patch wraps the call to get_peer_files() in another function to check for the correctness of the file name and size before complaining. However, if the file is never shared (due to some unknown problem), the simulator does not raise an exception on timeout. This additional problem is outside the scope of this bug. BUG= chromium:762579 TEST=test_that --board=nyan_kitty chromeos4-row13-rack1-host3.cros p2p_ShareFiles Change-Id: I3ea8bb5ba3f19f0ba5bb863ee4d67b7bf3fbc939 Reviewed-on: https://chromium-review.googlesource.com/660783 Commit-Ready: Shyh-In Hwang <josephsih@chromium.org> Tested-by: Amin Hassani <ahassani@chromium.org> Reviewed-by: Shuqian Zhao <shuqianz@chromium.org> [modify] https://crrev.com/2eaa7ae986afea881f3d9e54e5509f93effaffec/client/site_tests/p2p_ShareFiles/p2p_ShareFiles.py
,
Sep 12 2017
|
|||||
►
Sign in to add a comment |
|||||
Comment 1 by jrbarnette@chromium.org
, Sep 6 2017Labels: -Pri-3 Pri-1
Owner: jintao@chromium.org
Status: Assigned (was: Untriaged)