netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Zoltan Kiss <zoltan.kiss@citrix.com>
To: Sander Eikelenboom <linux@eikelenboom.it>
Cc: Ian Campbell <Ian.Campbell@citrix.com>,
	"David S. Miller" <davem@davemloft.net>, <netdev@vger.kernel.org>,
	<xen-devel@lists.xen.org>
Subject: Re: [3.15-rc3] Bisected: xen-netback mangles packets between two guests on a bridge since merge of "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy" series.
Date: Thu, 1 May 2014 14:37:41 +0100	[thread overview]
Message-ID: <53624E25.8040404@citrix.com> (raw)
In-Reply-To: <94755525.20140501002553@eikelenboom.it>

On 30/04/14 23:25, Sander Eikelenboom wrote:
>
> Wednesday, April 30, 2014, 10:53:39 PM, you wrote:
>
>> On 30/04/14 11:45, Sander Eikelenboom wrote:
>>> Hi Zoltan,
>>>
>>> Your series "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy", merged into mainline with merge commit 4caeccb4de76440e433a15009636e77d003eb3d6,
>>> seem to introduce a subtle bug on network traffic between 2 guests on a bridge on the same host.
>>> I have one guest running apache as webdav server with SSL and another guest that is using that is uploading large files to that webdav server.
>>> Small requests (some get's and propfind's) seem to work ok, but when the bulk uploading begins it fails with:
>>>
>>> Attempt 1 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>> Attempt 2 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>> Attempt 3 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>> Attempt 4 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>>
>>> So some how large (probably fragmented) packets can get mangled when from guest to guest on the same host.
>>> I don't see this with clients that upload large files from external sources.
>>> Probably if SSL wasn't complaining it would probably be unnoticed for longer and doing some silent corruption.
>>>
>>> I first blamed openssl, since it started around all the latest openssl mayhem and updates, but it turns out it is all xen-netback related again.
>>>
>>> Since these commits break bisectabillity:
>>>       - 1bb332af4cd889e4b64dacbf4a793ceb3a70445d  (note in commit message && kernel panic)
>>>       - 62bad3199a4c20505fc36c169deef20b25e17c5f  (kernel panic)
>>> i stopped bisecting at this point.
>>>
>>> The upside is .. it's 100% reproduceable :-)
>> That's good :) Can you take tcpdump captures along the way (sending
>> guest, dom0, receiving guest), and try to work out which packets are
>> different, and where? Although taking captures in Dom0 might change your
>> result, as it triggers the pages to be copied and unmapped before they
>> reach their target.
>
>> Thanks,
>> Zoli
>
>
> Hrrmm that sounds like a lot of data and a lot of work ..
If you could make captures in the sending and receiving guest with 
tcpdump (take care of increasing snaplen so the whole packet is there, 
and filter to the SSH connection itself), and upload it somewhere for 
me, that would be enough for start. I will try to work out where the 
corruption happens.
Also, do you have timestamps for the above mentioned log entries? I 
guess they appear on the receiving side.
And some info about the componenets on the server, so I can work out 
where is that _ssl.c:1415, and which part of the packet it actually 
looks for.

>
> how ever .. could it be just a type and would the following make sense ?
>
> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> index 7666540..abeea10 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -1366,7 +1366,7 @@ static int xenvif_handle_frag_list(struct xenvif *vif, struct sk_buff *skb)
>
>          xenvif_fill_frags(vif, nskb);
>          /* Subtract frags size, we will correct it later */
> -       skb->truesize -= skb->data_len;
> +       skb->truesize -= nskb->data_len;
>          skb->len += nskb->len;
>          skb->data_len += nskb->len;

Nope, that's correct there: after that skb->truesize will be the size of 
the struct plus the linear buffer itself. The code is just about the 
ditch the original fragments plus the skb on the frag_list. When the new 
pages are created, it will update it again.
Also, this code path runs only if the guest sends more slots we can 
handle (so we put the extra one to the frag_list until we can get rid of 
it). On Linux it can only happen with 3.2 or older guest kernels, and 
only occasionally. As you said, this is 100% reproducible, so I would 
doubt the problem is with this part of the code.

Zoli

  reply	other threads:[~2014-05-01 13:37 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-30 10:45 [3.15-rc3] Bisected: xen-netback mangles packets between two guests on a bridge since merge of "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy" series Sander Eikelenboom
2014-04-30 15:24 ` Eric Dumazet
2014-04-30 20:40   ` Zoltan Kiss
2014-04-30 20:53 ` Zoltan Kiss
2014-04-30 22:25   ` Sander Eikelenboom
2014-05-01 13:37     ` Zoltan Kiss [this message]
2014-05-01 13:59       ` Sander Eikelenboom
2014-05-01 15:46         ` Zoltan Kiss
2014-05-01 17:39           ` Sander Eikelenboom
2014-05-01 17:46             ` Eric Dumazet
2014-05-01 19:39             ` [Xen-devel] " Sander Eikelenboom
2014-05-02 14:00               ` Zoltan Kiss
2014-05-02 14:06                 ` Sander Eikelenboom
2014-05-02 14:47                   ` Zoltan Kiss
2014-05-02 15:21                     ` Eric Dumazet
2014-05-02 15:26                       ` Zoltan Kiss
2014-05-02 16:28                         ` Sander Eikelenboom
2014-05-02 16:45                           ` Zoltan Kiss
2014-05-05 10:19                             ` Sander Eikelenboom
2014-05-06 17:07                               ` Steven Haigh
2014-05-06 17:13                                 ` Zoltan Kiss
2014-05-06 17:37                                   ` Sander Eikelenboom
2014-05-06 18:07                                     ` Steven Haigh
2014-05-07  8:16                                       ` [Xen-devel] " David Vrabel
2014-05-16  2:13                                         ` Steven Haigh
2014-05-06 17:08                               ` [Xen-devel] " Zoltan Kiss
2014-05-06 17:10                               ` Zoltan Kiss
2014-05-06 17:33                                 ` Sander Eikelenboom
2014-05-01 13:49 ` Zoltan Kiss
2014-05-01 14:05   ` Sander Eikelenboom
2014-05-01 15:16     ` Zoltan Kiss
2014-05-01 15:40       ` Sander Eikelenboom
2014-05-02 15:35         ` Eric Dumazet
2014-05-02 22:18           ` Sander Eikelenboom
2014-05-09 22:19           ` Neal Cardwell
2014-05-09 21:02 ` Zoltan Kiss
2014-05-13 13:40   ` Zoltan Kiss

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=53624E25.8040404@citrix.com \
    --to=zoltan.kiss@citrix.com \
    --cc=Ian.Campbell@citrix.com \
    --cc=davem@davemloft.net \
    --cc=linux@eikelenboom.it \
    --cc=netdev@vger.kernel.org \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).