All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Neil Brown <neilb@suse.de>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH v3 6/6] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
Date: Mon, 30 Jan 2012 20:06:53 +0200	[thread overview]
Message-ID: <20120130180652.GC9345@redhat.com> (raw)
In-Reply-To: <1327941813.26983.270.camel@zakaz.uk.xensource.com>

On Mon, Jan 30, 2012 at 04:43:33PM +0000, Ian Campbell wrote:
> On Mon, 2012-01-30 at 16:23 +0000, Michael S. Tsirkin wrote:
> > On Mon, Jan 30, 2012 at 03:51:53PM +0000, Ian Campbell wrote:
> > > On Thu, 2012-01-26 at 13:11 +0000, Michael S. Tsirkin wrote:
> > > > On Wed, Jan 25, 2012 at 12:27:14PM +0000, Ian Campbell wrote:
> > > > > This prevents an issue where an ACK is delayed, a retransmit is queued (either
> > > > > at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> > > > > wire. If this happens to an NFS WRITE RPC then the write() system call
> > > > > completes and the userspace process can continue, potentially modifying data
> > > > > referenced by the retransmission before the retransmission occurs.
> > > > > 
> > > > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > > > > Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> > > > > Cc: "David S. Miller" <davem@davemloft.net>
> > > > > Cc: Neil Brown <neilb@suse.de>
> > > > > Cc: "J. Bruce Fields" <bfields@fieldses.org>
> > > > > Cc: linux-nfs@vger.kernel.org
> > > > > Cc: netdev@vger.kernel.org
> > > > 
> > > > This doesn't include either of the two options you proposed to address
> > > > the sender blocked forever by receiver issue with bridged septups and
> > > > endpoints such a tap device or a socket on the same box,
> > > > does it?
> > > 
> > > There was never any response to Bruce's question:
> > > http://thread.gmane.org/gmane.linux.network/210873/focus=44849
> > > 
> > >         Stupid question: Is it a requirement that you be safe against DOS by a
> > >         rogue process with a tap device?  (And if so, does current code satisfy
> > >         that requirement?)
> > > 
> > > IMHO the answer to both questions is no, there are plenty of ways a
> > > rogue process with a tap device can wreak havoc.
> > 
> > I thought the answer is an obviious yes :(
> > What are these ways tap can wreak havoc?
> 
> Can't they spoof traffic
> and all sorts of things like that?
> Hrm. I
> suppose that the same as any peer on the network so we already handle
> that sort of thing. Maybe that's a red herring then.

Right. It typically does not include DOS on the sender :)

> 
> > > > How about patching __skb_queue_tail to do a deep copy?
> > > > That would seem to handle both tap and socket cases -
> > > > any other ones left?
> > > 
> > > Wouldn't that mean we were frequently (almost always) copying for lots
> > > of other cases too? That would rather defeat the purpose of being able
> > > to hand pages off to the network stack in a zero copy fashion.
> > 
> > Yes but the case of an rpc connection to the same box
> > is very rare I think, not worth optimizing for.
> 
> But changing __skb_queue_tail doesn't only impact rpc connections to the
> same box, does it? At least I can see plenty of callers of
> __skb_queue_tail which don't look like they would want a copy to occur
> -- plenty of drivers for one thing.
> Perhaps in combination with a per-queue flag or per-socket flag to
> enable it though it might work though?

Right. I missed that. I'm guessing drivers don't hang on to skbs
indefinitely.  Still, copying is always safe - maybe the right thing to
do is to add an __skb_queue_tail variant that does not copy, and
gradually convert drivers that care to that API?

> > > > If we do this, I think it would be beneficial to pass a flag
> > > > to the destructor indicating that a deep copy was done:
> > > > this would allow senders to detect that and adapt.
> > > 
> > > If you were doing a deep copy anyway you might as well create a
> > > completely new skb and release the old one, thereby causing the
> > > destructors to fire in the normal way for it SKB. The copy wouldn't have
> > > destructors because the pages would no longer be owned by the sender.
> > > 
> > > Ian.
> > 
> > What I mean is that page pin + deep copy might be more expensive
> > than directly copying. So the owner of the original skb
> > cares whether we did a deep copy or zero copy transmit worked.
> 
> You mean so they can adaptively do a copy directly next time?
> I think that would add a fair bit more complexity to what, as you point
> out, is a fairly rare occurrence.
> 
> Ian.

For sunrpc yes but I was thinking about utilizing this
mechanism for e.g. kvm in the future. It might be more common
there. I agree this might be a future extension.

-- 
MSt

WARNING: multiple messages have this Message-ID (diff)
From: "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Ian Campbell <Ian.Campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
Cc: "netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	"David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>,
	Neil Brown <neilb-l3A5Bk7waGM@public.gmane.org>,
	"J. Bruce Fields"
	<bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>,
	"linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: [PATCH v3 6/6] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
Date: Mon, 30 Jan 2012 20:06:53 +0200	[thread overview]
Message-ID: <20120130180652.GC9345@redhat.com> (raw)
In-Reply-To: <1327941813.26983.270.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>

On Mon, Jan 30, 2012 at 04:43:33PM +0000, Ian Campbell wrote:
> On Mon, 2012-01-30 at 16:23 +0000, Michael S. Tsirkin wrote:
> > On Mon, Jan 30, 2012 at 03:51:53PM +0000, Ian Campbell wrote:
> > > On Thu, 2012-01-26 at 13:11 +0000, Michael S. Tsirkin wrote:
> > > > On Wed, Jan 25, 2012 at 12:27:14PM +0000, Ian Campbell wrote:
> > > > > This prevents an issue where an ACK is delayed, a retransmit is queued (either
> > > > > at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> > > > > wire. If this happens to an NFS WRITE RPC then the write() system call
> > > > > completes and the userspace process can continue, potentially modifying data
> > > > > referenced by the retransmission before the retransmission occurs.
> > > > > 
> > > > > Signed-off-by: Ian Campbell <ian.campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
> > > > > Acked-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> > > > > Cc: "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> > > > > Cc: Neil Brown <neilb-l3A5Bk7waGM@public.gmane.org>
> > > > > Cc: "J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
> > > > > Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > > Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > 
> > > > This doesn't include either of the two options you proposed to address
> > > > the sender blocked forever by receiver issue with bridged septups and
> > > > endpoints such a tap device or a socket on the same box,
> > > > does it?
> > > 
> > > There was never any response to Bruce's question:
> > > http://thread.gmane.org/gmane.linux.network/210873/focus=44849
> > > 
> > >         Stupid question: Is it a requirement that you be safe against DOS by a
> > >         rogue process with a tap device?  (And if so, does current code satisfy
> > >         that requirement?)
> > > 
> > > IMHO the answer to both questions is no, there are plenty of ways a
> > > rogue process with a tap device can wreak havoc.
> > 
> > I thought the answer is an obviious yes :(
> > What are these ways tap can wreak havoc?
> 
> Can't they spoof traffic
> and all sorts of things like that?
> Hrm. I
> suppose that the same as any peer on the network so we already handle
> that sort of thing. Maybe that's a red herring then.

Right. It typically does not include DOS on the sender :)

> 
> > > > How about patching __skb_queue_tail to do a deep copy?
> > > > That would seem to handle both tap and socket cases -
> > > > any other ones left?
> > > 
> > > Wouldn't that mean we were frequently (almost always) copying for lots
> > > of other cases too? That would rather defeat the purpose of being able
> > > to hand pages off to the network stack in a zero copy fashion.
> > 
> > Yes but the case of an rpc connection to the same box
> > is very rare I think, not worth optimizing for.
> 
> But changing __skb_queue_tail doesn't only impact rpc connections to the
> same box, does it? At least I can see plenty of callers of
> __skb_queue_tail which don't look like they would want a copy to occur
> -- plenty of drivers for one thing.
> Perhaps in combination with a per-queue flag or per-socket flag to
> enable it though it might work though?

Right. I missed that. I'm guessing drivers don't hang on to skbs
indefinitely.  Still, copying is always safe - maybe the right thing to
do is to add an __skb_queue_tail variant that does not copy, and
gradually convert drivers that care to that API?

> > > > If we do this, I think it would be beneficial to pass a flag
> > > > to the destructor indicating that a deep copy was done:
> > > > this would allow senders to detect that and adapt.
> > > 
> > > If you were doing a deep copy anyway you might as well create a
> > > completely new skb and release the old one, thereby causing the
> > > destructors to fire in the normal way for it SKB. The copy wouldn't have
> > > destructors because the pages would no longer be owned by the sender.
> > > 
> > > Ian.
> > 
> > What I mean is that page pin + deep copy might be more expensive
> > than directly copying. So the owner of the original skb
> > cares whether we did a deep copy or zero copy transmit worked.
> 
> You mean so they can adaptively do a copy directly next time?
> I think that would add a fair bit more complexity to what, as you point
> out, is a fairly rare occurrence.
> 
> Ian.

For sunrpc yes but I was thinking about utilizing this
mechanism for e.g. kvm in the future. It might be more common
there. I agree this might be a future extension.

-- 
MSt
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2012-01-30 18:04 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-01-25 12:26 [PATCH v3 0/6] skb paged fragment destructors Ian Campbell
2012-01-25 12:27 ` [PATCH v3 1/6] net: pad skb data and shinfo as a whole rather than individually Ian Campbell
2012-01-25 12:51   ` Eric Dumazet
2012-01-25 13:09     ` Ian Campbell
2012-01-25 13:12       ` Eric Dumazet
2012-01-31 14:35         ` Ian Campbell
2012-01-31 14:45           ` Eric Dumazet
2012-02-01  9:39             ` Ian Campbell
2012-02-01 10:02               ` Eric Dumazet
2012-02-01 10:09                 ` Ian Campbell
2012-01-31 14:54           ` Francois Romieu
2012-02-01  9:38             ` Ian Campbell
2012-01-25 12:27 ` [PATCH v3 2/6] net: move destructor_arg to the front of sk_buff Ian Campbell
2012-01-25 12:27 ` [PATCH v3 3/6] net: add support for per-paged-fragment destructors Ian Campbell
2012-01-25 12:27 ` [PATCH v3 4/6] net: only allow paged fragments with the same destructor to be coalesced Ian Campbell
     [not found] ` <1327494389.24561.316.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
2012-01-25 12:27   ` [PATCH v3 5/6] net: add paged frag destructor support to kernel_sendpage Ian Campbell
2012-01-25 12:27     ` [Ocfs2-devel] " Ian Campbell
2012-01-25 12:27     ` Ian Campbell
2012-01-25 12:27 ` [PATCH v3 6/6] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack Ian Campbell
2012-01-26 13:11   ` Michael S. Tsirkin
2012-01-26 13:11     ` Michael S. Tsirkin
2012-01-30 15:51     ` Ian Campbell
2012-01-30 15:51       ` Ian Campbell
2012-01-30 16:23       ` Michael S. Tsirkin
2012-01-30 16:43         ` Ian Campbell
2012-01-30 16:43           ` Ian Campbell
2012-01-30 18:06           ` Michael S. Tsirkin [this message]
2012-01-30 18:06             ` Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120130180652.GC9345@redhat.com \
    --to=mst@redhat.com \
    --cc=Ian.Campbell@citrix.com \
    --cc=bfields@fieldses.org \
    --cc=davem@davemloft.net \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.