New issue
Advanced search Search tips
Starred by 2 users
Status: Fixed
Owner:
Closed: Sep 12
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment
p2p_ShareFiles fails - "Expected exported file ..."
Project Member Reported by jrbarnette@chromium.org, Sep 6 Back to list
The p2p_ShareFiles test has been seen to fail with this message:
    Expected exported file on the DUT.

According to wmatrix, the first failures are in R63-9916.0.0, in
daisy and mighty.  These are the suite jobs:
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=140170909
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=140170794

Additionally, the same failure has occurred in a recent CQ run
for nyan_kitty.  This is the suite job:
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=140179965

Due to retries, the test fails at least twice each suite.

 
Cc: shuqianz@chromium.org
Labels: -Pri-3 Pri-1
Owner: jintao@chromium.org
Status: Assigned
Passing to a sheriff for triage.  The p2p tests relate to AU, so this
needs an expert in update_engine stuff.

I'm calling it P1 for now, since it's caused a CQ failure.  This needs
more time to be sure of its seriousness.

Cc: wiley@chromium.org
This is the test failure log 
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/140179965-chromeos-test/hostless/

from the recent failed CQ run 
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=140179965

Chris, do you have context on p2p tests, or do you know who owns this test?


Cc: -wiley@chromium.org adlr@chromium.org
wiley@ is no longer on the project; let's try adlr@

Recent failure & pass on Reef,

https://wmatrix.googleplex.com/testrun/unfiltered?test_ids=550042427,549911530

Failure has same signature,

  File "/usr/local/autotest/tests/p2p_ShareFiles/p2p_ShareFiles.py", line 117, in run_once
    raise error.TestFail('Expected exported file on the DUT.')
TestFail: Expected exported file on the DUT.

code snippet,

        # Run the loop until the file is shared. Normally, the p2p-server takes
        # up to 1 second to detect a change on the shared directory and
        # announces it right away a few times. Wait until the file is announced,
        # what should not take more than a few seconds. If after 30 seconds the
        # files isn't announced, that is an error.
        self._run_lansim_loop(timeout=30,
                              until=lambda: p2pcli.get_peer_files(peer_name))

        files = p2pcli.get_peer_files(peer_name)
        if files != [('my_file=HASH==', 10)]:
            logging.info('Peer files: %r', files)
            raise error.TestFail('Expected exported file on the DUT.')

Leading up to the failure,


09/06 21:44:36.539 DEBUG|         dbus_send:0124| Got DBus response: 1
09/06 21:44:36.641 DEBUG|             utils:0212| Running 'dbus-send --system --print-reply --reply-timeout=2000 --dest=org.freedesktop.Avahi / org.freedesktop.Avahi.Server.GetState'
09/06 21:44:36.665 DEBUG|         dbus_send:0124| Got DBus response: 2
09/06 21:44:36.667 DEBUG|             utils:0212| Running 'rm /tmp/avahi-conf.4X5B'
09/06 21:44:36.683 DEBUG|             utils:0212| Running 'start p2p'
09/06 21:44:36.786 DEBUG|             utils:0280| [stdout] p2p start/running, process 10128
09/06 21:44:56.847 INFO |    p2p_ShareFiles:0116| Peer files: [('my_file=HASH==', 0)]

While in the passing case,

09/06 17:20:07.611 DEBUG|         dbus_send:0124| Got DBus response: 1
09/06 17:20:07.713 DEBUG|             utils:0212| Running 'dbus-send --system --print-reply --reply-timeout=2000 --dest=org.freedesktop.Avahi / org.freedesktop.Avahi.Server.GetState'
09/06 17:20:07.736 DEBUG|         dbus_send:0124| Got DBus response: 2
09/06 17:20:07.739 DEBUG|             utils:0212| Running 'rm /tmp/avahi-conf.MUjr'
09/06 17:20:07.763 DEBUG|             utils:0212| Running 'start p2p'
09/06 17:20:07.880 DEBUG|             utils:0280| [stdout] p2p start/running, process 25237
09/06 17:20:30.901 DEBUG|              test:0381| The test has completed successfully


This appears in dmesg on the host that fails the p2p_ShareFiles test:
2017-09-07T19:00:00.518833+00:00 INFO p2p-client[7148]: Found peer for the given ID but there are already 100 download(s) in the LAN which exceeds the threshold of 3 download(s). Sleeping 30 seconds until retrying. [../../../../../../../../../mnt/host/source/src/platform2/p2p/client/peer_selector.cc:152]
Owner: ahassani@chromium.org
amin, can you take a look at this and see if you see anything suspicious?
This doesn't seem to be causing any problem in the last two CQs at least. I'll keep looking into it to see if there is any problem. 
> This doesn't seem to be causing any problem in the last two CQs at least. I'll keep looking into it to see if there is any problem. 

The problem is intermittent, and has caused three failures in the past
few days.  Lack of failures in recent runs is no indication of any
improvement.

Till now it seems like that the p2p client gets the peer ID but it goes to sleep for 30 seconds because the number of connections in the system is exceeding 3 (actual 100) as it is shown in the error message in #7. On the other hand the test waits for 30 seconds for the p2p client to respond (https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/site_tests/p2p_ShareFiles/p2p_ShareFiles.py?rcl=c4eacc966febe1513d9cfab1f30da35cba9449bb&l=106), but p2p client is sleeping. I think the flake of this test comes from this 30 second border that sometimes crosses and causes failure. But the root cause might be the 100 connections and does anyone know how to see what those connections are and get their stats?

Actually its not the number of connections in the client, but the accumulation of number of active downloads for all the peers that are found by the client. 
I looked at least two builds and for both the exact number 100 seems suspicious. Although this code hasn't changed since 2014 and I'm guessing this could be caused by some network issue too. But not sure yet.
2017-09-07T18:59:58.461793+00:00 NOTICE p2p-client[7146]:  TXT: len=18 data=num_connections=99 [../../../../../../../../../mnt/host/source/src/platform2/p2p/client/service_finder.cc:219]

99 num_connections maybe suspicious :(
New sheriff here- this is still causing CQ failures.  Latest one:

https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/2909
I looked into the code and cannot really replicate it. I'm looking more into it.

I'm able to replicate it now. Looking for the root cause!
The issue happened again. https://uberchromegw.corp.google.com/i/chromeos/builders/nyan_kitty-paladin/builds/2911. I am curious about why this issue only shows up occationally?
p2p_SharedFile test code and p2p stuff has not changed in years. I'm not sure why this happening now. My initial guess is it is something related to network/dns. But I cannot exactly pinpoint where!
Ok. I think I might have found the problem. in here:
https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/site_tests/p2p_ShareFiles/p2p_ShareFiles.py?rcl=d49d7f2f343d20e5c06896ea414c38cc3e7b2aa8&l=102

the p2p_ShareFiles test creates a small file (10 bytes) and waits for the p2p-server to announce the shared file. But, the p2p server is announcing the file with size 0. I'm guessing before the content of the file is written, the p2p-server detects the added file (with size zero) and reports it. This in the end causes this problem. This could be a symptom of a problem in kernel or disk or python problem too or just be this race condition.

So a solution I'm suggesting is to run get_peer_files() function in a loop and check the result constantly unless the expected value is detected otherwise let it timeout.

or simply do os.fsync to make sure the file writes to disk faster??!!
I uploaded a fix CL here:  https://chromium-review.googlesource.com/660783

Can someone add a reviewer except deymo@? deymo is not in the chromeos or aosp team anymore and he was the original owner of this project.
Project Member Comment 22 by bugdroid1@chromium.org, Sep 12
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2eaa7ae986afea881f3d9e54e5509f93effaffec

commit 2eaa7ae986afea881f3d9e54e5509f93effaffec
Author: Amin Hassani <ahassani@google.com>
Date: Tue Sep 12 11:19:33 2017

p2p_ShareFiles: Fix file write race condition.

Somewhere at the end of this test, a 10 byte files is created and assumed the
p2p-server shares the file (with correct size) back to the simulator. But on
some boards this may flake when the p2p-server shares the file info before its
content is written to disk. This causes the test fail because the expected file
on the DUT was not found.

This patch wraps the call to get_peer_files() in another function to check for
the correctness of the file name and size before complaining. However, if the
file is never shared (due to some unknown problem), the simulator does not raise
an exception on timeout. This additional problem is outside the scope of this
bug.

BUG= chromium:762579 
TEST=test_that --board=nyan_kitty chromeos4-row13-rack1-host3.cros p2p_ShareFiles

Change-Id: I3ea8bb5ba3f19f0ba5bb863ee4d67b7bf3fbc939
Reviewed-on: https://chromium-review.googlesource.com/660783
Commit-Ready: Shyh-In Hwang <josephsih@chromium.org>
Tested-by: Amin Hassani <ahassani@chromium.org>
Reviewed-by: Shuqian Zhao <shuqianz@chromium.org>

[modify] https://crrev.com/2eaa7ae986afea881f3d9e54e5509f93effaffec/client/site_tests/p2p_ShareFiles/p2p_ShareFiles.py

Status: Fixed
Sign in to add a comment