* [PATCH RFC 1/6] skbuff: support per-page destructors in copy_ubufs
2012-05-07 13:53 [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors) Michael S. Tsirkin
@ 2012-05-07 13:54 ` Michael S. Tsirkin
2012-05-10 17:46 ` Ian Campbell
2012-05-07 13:54 ` [PATCH RFC 2/6] skbuff: add an api to orphan frags Michael S. Tsirkin
` (6 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-07 13:54 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
sunrpc wants to use zero copy with tcp which means
some fragments are zero copy while others are not.
This in turn means there's no per skb destructor_arg,
instead some fragments have destructors. Teach
skb_copy_ubufs and skb_release_data to handle such skbs.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
net/core/skbuff.c | 18 ++++++++++++++----
1 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c81240c..bd28e80 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -423,7 +423,7 @@ static void skb_release_data(struct sk_buff *skb)
struct ubuf_info *uarg;
uarg = skb_shinfo(skb)->destructor_arg;
- if (uarg->callback)
+ if (uarg && uarg->callback)
uarg->callback(uarg);
}
@@ -721,6 +721,8 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
for (i = 0; i < num_frags; i++) {
u8 *vaddr;
skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+ if (unlikely((!uarg && !f->page.destructor)))
+ continue;
page = alloc_page(GFP_ATOMIC);
if (!page) {
@@ -740,13 +742,21 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
}
/* skb frags release userspace buffers */
- for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
- skb_frag_unref(skb, i);
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+ if (unlikely((!uarg && !f->page.destructor)))
+ continue;
+ __skb_frag_unref(f);
+ }
- uarg->callback(uarg);
+ if (uarg)
+ uarg->callback(uarg);
/* skb frags point to kernel buffers */
for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
+ skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+ if (unlikely((!uarg && !f->page.destructor)))
+ continue;
__skb_fill_page_desc(skb, i-1, head, 0,
skb_shinfo(skb)->frags[i - 1].size);
head = (struct page *)head->private;
--
MST
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 1/6] skbuff: support per-page destructors in copy_ubufs
2012-05-07 13:54 ` [PATCH RFC 1/6] skbuff: support per-page destructors in copy_ubufs Michael S. Tsirkin
@ 2012-05-10 17:46 ` Ian Campbell
2012-05-10 18:42 ` Michael S. Tsirkin
0 siblings, 1 reply; 30+ messages in thread
From: Ian Campbell @ 2012-05-10 17:46 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Mon, 2012-05-07 at 14:54 +0100, Michael S. Tsirkin wrote:
> /* skb frags point to kernel buffers */
> for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
> + skb_frag_t *f = &skb_shinfo(skb)->frags[i];
This needs to be ....->frags[i - 1]
otherwise you put every new frag one too high and don't do anything to
frag 0, which leaves the old destructor pointer in place and leads to a
double free.
I think skb_frag_set_destructor and skb_copy_frag_destructor need to
clear and propagate respectively (or maybe just clear in both cases) the
destructor_arg field since it is otherwise not initialised when we set
SKBTX_DEV_ZEROCOPY and that can trigger wrong behaviour in this
function.
Ian.
> + if (unlikely((!uarg && !f->page.destructor)))
> + continue;
> __skb_fill_page_desc(skb, i-1, head, 0,
> skb_shinfo(skb)->frags[i - 1].size);
> head = (struct page *)head->private;
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 1/6] skbuff: support per-page destructors in copy_ubufs
2012-05-10 17:46 ` Ian Campbell
@ 2012-05-10 18:42 ` Michael S. Tsirkin
2012-05-11 9:00 ` Ian Campbell
0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-10 18:42 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Thu, May 10, 2012 at 06:46:17PM +0100, Ian Campbell wrote:
> On Mon, 2012-05-07 at 14:54 +0100, Michael S. Tsirkin wrote:
>
> > /* skb frags point to kernel buffers */
> > for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
> > + skb_frag_t *f = &skb_shinfo(skb)->frags[i];
>
> This needs to be ....->frags[i - 1]
Good catch.
for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
skb_frag_t *f = &skb_shinfo(skb)->frags[i];
would be a bit clearer though.
> otherwise you put every new frag one too high and don't do anything to
> frag 0, which leaves the old destructor pointer in place and leads to a
> double free.
>
> I think skb_frag_set_destructor and skb_copy_frag_destructor need to
> clear and propagate respectively (or maybe just clear in both cases) the
> destructor_arg field since it is otherwise not initialised when we set
> SKBTX_DEV_ZEROCOPY and that can trigger wrong behaviour in this
> function.
>
> Ian.
Agree, let's just clear it.
> > + if (unlikely((!uarg && !f->page.destructor)))
> > + continue;
> > __skb_fill_page_desc(skb, i-1, head, 0,
> > skb_shinfo(skb)->frags[i - 1].size);
> > head = (struct page *)head->private;
>
So the below on top then. I pushed these on
top of my zerocopy branch - can you confirm pls?
---
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 930a50e..e52bc8d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1270,8 +1270,10 @@ static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
{
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
frag->page.destructor = destroy;
- if (destroy)
+ if (destroy) {
skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+ skb_shinfo(skb)->destructor_arg = NULL;
+ }
}
/**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b7fc47e..453f621 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -753,12 +753,11 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
uarg->callback(uarg);
/* skb frags point to kernel buffers */
- for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
+ for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
skb_frag_t *f = &skb_shinfo(skb)->frags[i];
if (unlikely((!uarg && !f->page.destructor)))
continue;
- __skb_fill_page_desc(skb, i-1, head, 0,
- skb_shinfo(skb)->frags[i - 1].size);
+ __skb_fill_page_desc(skb, i, head, 0, f->size);
head = (struct page *)head->private;
}
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 1/6] skbuff: support per-page destructors in copy_ubufs
2012-05-10 18:42 ` Michael S. Tsirkin
@ 2012-05-11 9:00 ` Ian Campbell
2012-05-11 10:58 ` Ian Campbell
0 siblings, 1 reply; 30+ messages in thread
From: Ian Campbell @ 2012-05-11 9:00 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Thu, 2012-05-10 at 19:42 +0100, Michael S. Tsirkin wrote:
> On Thu, May 10, 2012 at 06:46:17PM +0100, Ian Campbell wrote:
> > On Mon, 2012-05-07 at 14:54 +0100, Michael S. Tsirkin wrote:
> So the below on top then. I pushed these on
> top of my zerocopy branch - can you confirm pls?
I added these to my test branch:
93772fea6cd66616912101b9e0144dfed645d8fe fix per page destructors in copy ubufs
d72b7ab15f944c5df5f28cce7b5c9a0bca61ff6d clear destructor arg when set zerocopy
I think you also need, as part of the second one:
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index af2d10e..40ca43e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1744,6 +1744,7 @@ static inline void skb_copy_frag_destructor(struct sk_buff *to,
{
skb_shinfo(to)->tx_flags |= skb_shinfo(from)->tx_flags &
SKBTX_DEV_ZEROCOPY;
+ skb_shinfo(to)->destructor_arg = NULL;
}
/**
I'm seeing copy_ubufs called in my remote NFS test, which I don't think
I expected -- I'll investigate why this is happening today.
Ian.
>
> ---
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 930a50e..e52bc8d 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1270,8 +1270,10 @@ static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
> {
> skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> frag->page.destructor = destroy;
> - if (destroy)
> + if (destroy) {
> skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
> + skb_shinfo(skb)->destructor_arg = NULL;
> + }
> }
>
> /**
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b7fc47e..453f621 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -753,12 +753,11 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
> uarg->callback(uarg);
>
> /* skb frags point to kernel buffers */
> - for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
> + for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
> skb_frag_t *f = &skb_shinfo(skb)->frags[i];
> if (unlikely((!uarg && !f->page.destructor)))
> continue;
> - __skb_fill_page_desc(skb, i-1, head, 0,
> - skb_shinfo(skb)->frags[i - 1].size);
> + __skb_fill_page_desc(skb, i, head, 0, f->size);
> head = (struct page *)head->private;
> }
>
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 1/6] skbuff: support per-page destructors in copy_ubufs
2012-05-11 9:00 ` Ian Campbell
@ 2012-05-11 10:58 ` Ian Campbell
2012-05-11 12:08 ` Michael S. Tsirkin
0 siblings, 1 reply; 30+ messages in thread
From: Ian Campbell @ 2012-05-11 10:58 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Fri, 2012-05-11 at 10:00 +0100, Ian Campbell wrote:
> I'm seeing copy_ubufs called in my remote NFS test, which I don't
> think I expected -- I'll investigate why this is happening today.
It's tcp_transmit_skb which can (conditionally) call skb_clone
(backtrace below)
I suspect this means that the existing SKBTX_DEV_ZEROCOPY semantics are
a superset of what we need to consider for the destructor case. I'm
assuming here that the existing SKBTX_DEV_ZEROCOPY is copying aside
exactly the right amount and isn't conservatively coying more often than
necessary.
shinfo->tx_flags are pretty scarce -- can we afford a new one for this
usecase?
Or perhaps this is actually a function of the callsite not the of
individual skb and we want to have some concept of "deep" and "shallow"
clones combined with SKBTX_DEV_ZEROCOPY to decide when to copy_ubufs or
not? e.g. deep clone => always copy if SKBTX_DEV_ZEROCOPY and shallow
clone => only copy if SKBTX_DEV_ZEROCOPY && destructor_arg!=NULL
(neither copy if !SKBTX_DEV_ZEROCOPY).
Oh, I suppose that reintroduces the copy_ubufs under a (shallow) cloned
skb race if one of those skbs eventually finds itself in a situation
where a skb_frag_orphan is required doesn't it. Hrm :-/
Will have to have a think...
Ian.
[ 109.680828] ------------[ cut here ]------------
[ 109.685440] WARNING: at /local/scratch/ianc/devel/kernels/linux/include/linux/skbuff.h:1732 skb_clone+0xe6/0xf0()
[ 109.695678] Hardware name:
[ 109.699162] ORPHANING
[ 109.701434] Modules linked in:
[ 109.704495] Pid: 10, comm: kworker/0:1 Tainted: G W 3.4.0-rc4-x86_64-native+ #186
[ 109.712830] Call Trace:
[ 109.715278] [<ffffffff8107edfa>] warn_slowpath_common+0x7a/0xb0
[ 109.721273] [<ffffffff8107eed1>] warn_slowpath_fmt+0x41/0x50
[ 109.727007] [<ffffffff8170feea>] ? tcp_transmit_skb+0x9a/0x8f0
[ 109.732914] [<ffffffff8169b2d6>] skb_clone+0xe6/0xf0
[ 109.737957] [<ffffffff8170feea>] tcp_transmit_skb+0x9a/0x8f0
[ 109.743694] [<ffffffff81712d7a>] tcp_write_xmit+0x1ea/0x9c0
[ 109.749343] [<ffffffff8171357b>] tcp_push_one+0x2b/0x40
[ 109.754648] [<ffffffff81705b2b>] tcp_sendpage+0x64b/0x6d0
[ 109.760126] [<ffffffff8172785d>] inet_sendpage+0x4d/0xf0
[ 109.765518] [<ffffffff817afed7>] xs_sendpages+0x117/0x2a0
[ 109.770996] [<ffffffff817ad3f0>] ? xprt_reserve+0x2d0/0x2d0
[ 109.776647] [<ffffffff817b0178>] xs_tcp_send_request+0x58/0x110
[ 109.782644] [<ffffffff817ad5bb>] xprt_transmit+0x6b/0x2d0
[ 109.788123] [<ffffffff817aa9a0>] ? call_transmit_status+0xd0/0xd0
[ 109.794293] [<ffffffff817aab70>] call_transmit+0x1d0/0x290
[ 109.799857] [<ffffffff817aa9a0>] ? call_transmit_status+0xd0/0xd0
[ 109.806029] [<ffffffff817b3725>] __rpc_execute+0x65/0x260
[ 109.811505] [<ffffffff817b3920>] ? __rpc_execute+0x260/0x260
[ 109.817241] [<ffffffff817b3930>] rpc_async_schedule+0x10/0x20
[ 109.823066] [<ffffffff81098fff>] process_one_work+0x11f/0x460
[ 109.828895] [<ffffffff8109b0b3>] worker_thread+0x173/0x3f0
[ 109.834459] [<ffffffff8109af40>] ? manage_workers+0x210/0x210
[ 109.840283] [<ffffffff8109fa26>] kthread+0x96/0xa0
[ 109.845179] [<ffffffff81861654>] kernel_thread_helper+0x4/0x10
[ 109.851092] [<ffffffff8109f990>] ? kthread_freezable_should_stop+0x70/0x70
[ 109.858053] [<ffffffff81861650>] ? gs_change+0xb/0xb
[ 109.863087] ---[ end trace 3e3acdb7cc57c191 ]---
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 1/6] skbuff: support per-page destructors in copy_ubufs
2012-05-11 10:58 ` Ian Campbell
@ 2012-05-11 12:08 ` Michael S. Tsirkin
2012-05-11 16:30 ` Michael S. Tsirkin
2012-05-11 21:12 ` David Miller
0 siblings, 2 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-11 12:08 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Fri, May 11, 2012 at 11:58:12AM +0100, Ian Campbell wrote:
> On Fri, 2012-05-11 at 10:00 +0100, Ian Campbell wrote:
> > I'm seeing copy_ubufs called in my remote NFS test, which I don't
> > think I expected -- I'll investigate why this is happening today.
>
> It's tcp_transmit_skb which can (conditionally) call skb_clone
> (backtrace below)
Interesting. I didn't realise we clone skbs on data path:
tcp_write_xmit calls tcp_transmit_skb with clone_it flag.
Could someone comment on why we need to clone on good path
like this?
--
MST
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 1/6] skbuff: support per-page destructors in copy_ubufs
2012-05-11 12:08 ` Michael S. Tsirkin
@ 2012-05-11 16:30 ` Michael S. Tsirkin
2012-05-12 6:01 ` Ian Campbell
2012-05-11 21:12 ` David Miller
1 sibling, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-11 16:30 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Fri, May 11, 2012 at 03:08:36PM +0300, Michael S. Tsirkin wrote:
> On Fri, May 11, 2012 at 11:58:12AM +0100, Ian Campbell wrote:
> > On Fri, 2012-05-11 at 10:00 +0100, Ian Campbell wrote:
> > > I'm seeing copy_ubufs called in my remote NFS test, which I don't
> > > think I expected -- I'll investigate why this is happening today.
> >
> > It's tcp_transmit_skb which can (conditionally) call skb_clone
> > (backtrace below)
>
> Interesting. I didn't realise we clone skbs on data path:
> tcp_write_xmit calls tcp_transmit_skb with clone_it flag.
> Could someone comment on why we need to clone on good path
> like this?
Hmm, it's in case we need to retransmit it later.
> --
> MST
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 1/6] skbuff: support per-page destructors in copy_ubufs
2012-05-11 16:30 ` Michael S. Tsirkin
@ 2012-05-12 6:01 ` Ian Campbell
2012-05-13 10:10 ` Michael S. Tsirkin
0 siblings, 1 reply; 30+ messages in thread
From: Ian Campbell @ 2012-05-12 6:01 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Fri, 2012-05-11 at 17:30 +0100, Michael S. Tsirkin wrote:
> On Fri, May 11, 2012 at 03:08:36PM +0300, Michael S. Tsirkin wrote:
> > On Fri, May 11, 2012 at 11:58:12AM +0100, Ian Campbell wrote:
> > > On Fri, 2012-05-11 at 10:00 +0100, Ian Campbell wrote:
> > > > I'm seeing copy_ubufs called in my remote NFS test, which I don't
> > > > think I expected -- I'll investigate why this is happening today.
> > >
> > > It's tcp_transmit_skb which can (conditionally) call skb_clone
> > > (backtrace below)
> >
> > Interesting. I didn't realise we clone skbs on data path:
> > tcp_write_xmit calls tcp_transmit_skb with clone_it flag.
> > Could someone comment on why we need to clone on good path
> > like this?
>
> Hmm, it's in case we need to retransmit it later.
I wonder if we could avoid the copy_ubuf in this particular clone path
and have any subsequent calls to copy_ubufs use skb->fclone to determine
if it can safely replace the frags?
If it cannot then could it do a full copy of the skb (including new
shinfo, new frag pages etc) as a fallback?
Ian.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 1/6] skbuff: support per-page destructors in copy_ubufs
2012-05-12 6:01 ` Ian Campbell
@ 2012-05-13 10:10 ` Michael S. Tsirkin
0 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-13 10:10 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Sat, May 12, 2012 at 07:01:24AM +0100, Ian Campbell wrote:
> On Fri, 2012-05-11 at 17:30 +0100, Michael S. Tsirkin wrote:
> > On Fri, May 11, 2012 at 03:08:36PM +0300, Michael S. Tsirkin wrote:
> > > On Fri, May 11, 2012 at 11:58:12AM +0100, Ian Campbell wrote:
> > > > On Fri, 2012-05-11 at 10:00 +0100, Ian Campbell wrote:
> > > > > I'm seeing copy_ubufs called in my remote NFS test, which I don't
> > > > > think I expected -- I'll investigate why this is happening today.
> > > >
> > > > It's tcp_transmit_skb which can (conditionally) call skb_clone
> > > > (backtrace below)
> > >
> > > Interesting. I didn't realise we clone skbs on data path:
> > > tcp_write_xmit calls tcp_transmit_skb with clone_it flag.
> > > Could someone comment on why we need to clone on good path
> > > like this?
> >
> > Hmm, it's in case we need to retransmit it later.
>
> I wonder if we could avoid the copy_ubuf in this particular clone path
> and have any subsequent calls to copy_ubufs use skb->fclone to determine
> if it can safely replace the frags?
>
> If it cannot then could it do a full copy of the skb (including new
> shinfo, new frag pages etc) as a fallback?
>
> Ian.
>
Yes I think we should call a variant of clone that avoids copy_ubuf on
the first transmit. But need to be careful we don't access the frag
list while it is being modified.
For example very roughly, maybe we could have copy_ubuf detect
packet clone is queued and take some lock?
On retransmit we could check and if we are not the only clone left
(which should be uncommon) trigger copy ubuf then.
Thoughts?
--
MST
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 1/6] skbuff: support per-page destructors in copy_ubufs
2012-05-11 12:08 ` Michael S. Tsirkin
2012-05-11 16:30 ` Michael S. Tsirkin
@ 2012-05-11 21:12 ` David Miller
1 sibling, 0 replies; 30+ messages in thread
From: David Miller @ 2012-05-11 21:12 UTC (permalink / raw)
To: mst; +Cc: Ian.Campbell, netdev, eric.dumazet
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Fri, 11 May 2012 15:08:37 +0300
> On Fri, May 11, 2012 at 11:58:12AM +0100, Ian Campbell wrote:
>> On Fri, 2012-05-11 at 10:00 +0100, Ian Campbell wrote:
>> > I'm seeing copy_ubufs called in my remote NFS test, which I don't
>> > think I expected -- I'll investigate why this is happening today.
>>
>> It's tcp_transmit_skb which can (conditionally) call skb_clone
>> (backtrace below)
>
> Interesting. I didn't realise we clone skbs on data path:
> tcp_write_xmit calls tcp_transmit_skb with clone_it flag.
> Could someone comment on why we need to clone on good path
> like this?
We can't send the original SKB that's linked into the retransmit
queue. It's linkage must stay secure in that queue.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH RFC 2/6] skbuff: add an api to orphan frags
2012-05-07 13:53 [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors) Michael S. Tsirkin
2012-05-07 13:54 ` [PATCH RFC 1/6] skbuff: support per-page destructors in copy_ubufs Michael S. Tsirkin
@ 2012-05-07 13:54 ` Michael S. Tsirkin
2012-05-08 12:47 ` Michael S. Tsirkin
2012-05-07 13:54 ` [PATCH RFC 3/6] skbuff: convert to skb_orphan_frags Michael S. Tsirkin
` (5 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-07 13:54 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
Many places do
if ((skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY))
skb_copy_ubufs(skb, gfp_mask);
to copy and invoke frag destructors if necessary.
Add an inline helper for this.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/linux/skbuff.h | 24 ++++++++++++++++++++++++
1 files changed, 24 insertions(+), 0 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bb78f70..28d842e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1711,6 +1711,30 @@ static inline void skb_orphan(struct sk_buff *skb)
}
/**
+ * skb_orphan_frags - orphan the frags contained in a buffer
+ * @skb: buffer to orphan frags from
+ * @gfp_mask: allocation mask for replacement pages
+ *
+ * For each frag in the SKB which needs a destructor (i.e. has an
+ * owner) create a copy of that frag and release the original
+ * page by calling the destructor.
+ */
+static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
+{
+ if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
+ return 0;
+ return skb_copy_ubufs(skb, gfp_mask);
+}
+
+
+static inline void skb_copy_frag_destructor(struct sk_buff *to,
+ struct sk_buff *from)
+{
+ skb_shinfo(to)->tx_flags |= skb_shinfo(from)->tx_flags &
+ SKBTX_DEV_ZEROCOPY;
+}
+
+/**
* __skb_queue_purge - empty a list
* @list: list to empty
*
--
MST
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 2/6] skbuff: add an api to orphan frags
2012-05-07 13:54 ` [PATCH RFC 2/6] skbuff: add an api to orphan frags Michael S. Tsirkin
@ 2012-05-08 12:47 ` Michael S. Tsirkin
0 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-08 12:47 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Mon, May 07, 2012 at 04:54:36PM +0300, Michael S. Tsirkin wrote:
> Many places do
> if ((skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY))
> skb_copy_ubufs(skb, gfp_mask);
> to copy and invoke frag destructors if necessary.
> Add an inline helper for this.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> include/linux/skbuff.h | 24 ++++++++++++++++++++++++
> 1 files changed, 24 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index bb78f70..28d842e 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1711,6 +1711,30 @@ static inline void skb_orphan(struct sk_buff *skb)
> }
>
> /**
> + * skb_orphan_frags - orphan the frags contained in a buffer
> + * @skb: buffer to orphan frags from
> + * @gfp_mask: allocation mask for replacement pages
> + *
> + * For each frag in the SKB which needs a destructor (i.e. has an
> + * owner) create a copy of that frag and release the original
> + * page by calling the destructor.
> + */
> +static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
> +{
> + if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
> + return 0;
> + return skb_copy_ubufs(skb, gfp_mask);
> +}
> +
> +
> +static inline void skb_copy_frag_destructor(struct sk_buff *to,
> + struct sk_buff *from)
> +{
> + skb_shinfo(to)->tx_flags |= skb_shinfo(from)->tx_flags &
> + SKBTX_DEV_ZEROCOPY;
> +}
> +
> +/**
> * __skb_queue_purge - empty a list
> * @list: list to empty
> *
> --
> MST
>
This last chunk should actually be part of 6/6.
squashed it here mistakenly.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH RFC 3/6] skbuff: convert to skb_orphan_frags
2012-05-07 13:53 [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors) Michael S. Tsirkin
2012-05-07 13:54 ` [PATCH RFC 1/6] skbuff: support per-page destructors in copy_ubufs Michael S. Tsirkin
2012-05-07 13:54 ` [PATCH RFC 2/6] skbuff: add an api to orphan frags Michael S. Tsirkin
@ 2012-05-07 13:54 ` Michael S. Tsirkin
2012-05-07 13:54 ` [PATCH RFC 4/6] tun: orphan frags on xmit Michael S. Tsirkin
` (4 subsequent siblings)
7 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-07 13:54 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
Reduce dode duplication a bit using the new helper.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
net/core/skbuff.c | 22 ++++++++--------------
1 files changed, 8 insertions(+), 14 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bd28e80..bdf5b09 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -785,10 +785,8 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
{
struct sk_buff *n;
- if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
- if (skb_copy_ubufs(skb, gfp_mask))
- return NULL;
- }
+ if (skb_orphan_frags(skb, gfp_mask))
+ return NULL;
n = skb + 1;
if (skb->fclone == SKB_FCLONE_ORIG &&
@@ -908,12 +906,10 @@ struct sk_buff *__pskb_copy(struct sk_buff *skb, int headroom, gfp_t gfp_mask)
if (skb_shinfo(skb)->nr_frags) {
int i;
- if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
- if (skb_copy_ubufs(skb, gfp_mask)) {
- kfree_skb(n);
- n = NULL;
- goto out;
- }
+ if (skb_orphan_frags(skb, gfp_mask)) {
+ kfree_skb(n);
+ n = NULL;
+ goto out;
}
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
@@ -1005,10 +1001,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
skb_free_head(skb);
} else {
/* copy this zero copy skb frags */
- if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
- if (skb_copy_ubufs(skb, gfp_mask))
- goto nofrags;
- }
+ if (skb_orphan_frags(skb, gfp_mask))
+ goto nofrags;
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
skb_frag_ref(skb, i);
--
MST
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH RFC 4/6] tun: orphan frags on xmit
2012-05-07 13:53 [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors) Michael S. Tsirkin
` (2 preceding siblings ...)
2012-05-07 13:54 ` [PATCH RFC 3/6] skbuff: convert to skb_orphan_frags Michael S. Tsirkin
@ 2012-05-07 13:54 ` Michael S. Tsirkin
2012-05-08 14:36 ` Michael S. Tsirkin
2012-05-07 13:54 ` [PATCH RFC 5/6] net: orphan frags on receive Michael S. Tsirkin
` (3 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-07 13:54 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
tun xmit is actually receive of the internal tun
socket. Orphan the frags same as we do for normal rx path.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
drivers/net/tun.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index bb8c72c..1ccbfd0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -414,6 +414,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
/* Orphan the skb - required as we might hang on to it
* for indefinite time. */
+ if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+ goto drop;
skb_orphan(skb);
/* Enqueue packet */
--
MST
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 4/6] tun: orphan frags on xmit
2012-05-07 13:54 ` [PATCH RFC 4/6] tun: orphan frags on xmit Michael S. Tsirkin
@ 2012-05-08 14:36 ` Michael S. Tsirkin
0 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-08 14:36 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Mon, May 07, 2012 at 04:54:46PM +0300, Michael S. Tsirkin wrote:
> tun xmit is actually receive of the internal tun
> socket. Orphan the frags same as we do for normal rx path.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
If tun is built as a module, skb_copy_ubufs needs to be exported:
---->
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 83b04d9..a809014 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -765,7 +765,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
return 0;
}
-
+EXPORT_SYMBOL_GPL(skb_copy_ubufs);
/**
* skb_clone - duplicate an sk_buff
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH RFC 5/6] net: orphan frags on receive
2012-05-07 13:53 [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors) Michael S. Tsirkin
` (3 preceding siblings ...)
2012-05-07 13:54 ` [PATCH RFC 4/6] tun: orphan frags on xmit Michael S. Tsirkin
@ 2012-05-07 13:54 ` Michael S. Tsirkin
2012-05-09 13:54 ` Ian Campbell
2012-05-07 13:54 ` [PATCH RFC 6/6] skbuff: set zerocopy flag on frag destructor Michael S. Tsirkin
` (2 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-07 13:54 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
zero copy packets are normally sent to the outside
network, but bridging, tun etc might loop them
back to host networking stack. If this happens
destructors will never be called, so orphan
the frags immediately on receive.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
net/core/dev.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index a2be59f..c0cdc00 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1630,6 +1630,8 @@ static inline int deliver_skb(struct sk_buff *skb,
struct packet_type *pt_prev,
struct net_device *orig_dev)
{
+ if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+ return -ENOMEM;
atomic_inc(&skb->users);
return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}
--
MST
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 5/6] net: orphan frags on receive
2012-05-07 13:54 ` [PATCH RFC 5/6] net: orphan frags on receive Michael S. Tsirkin
@ 2012-05-09 13:54 ` Ian Campbell
2012-05-09 15:12 ` Michael S. Tsirkin
2012-05-14 19:18 ` Michael S. Tsirkin
0 siblings, 2 replies; 30+ messages in thread
From: Ian Campbell @ 2012-05-09 13:54 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Mon, 2012-05-07 at 14:54 +0100, Michael S. Tsirkin wrote:
> zero copy packets are normally sent to the outside
> network, but bridging, tun etc might loop them
> back to host networking stack. If this happens
> destructors will never be called, so orphan
> the frags immediately on receive.
I think this deceptively simply patch is actually the meat of the
series.
It's been a long time since I dug into the bridging code. Am I right
that this function is only reached when an SKB is going to be delivered
locally and not when forwarding to another port on the bridge?
Ian.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> net/core/dev.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a2be59f..c0cdc00 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1630,6 +1630,8 @@ static inline int deliver_skb(struct sk_buff *skb,
> struct packet_type *pt_prev,
> struct net_device *orig_dev)
> {
> + if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> + return -ENOMEM;
> atomic_inc(&skb->users);
> return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
> }
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 5/6] net: orphan frags on receive
2012-05-09 13:54 ` Ian Campbell
@ 2012-05-09 15:12 ` Michael S. Tsirkin
2012-05-14 19:18 ` Michael S. Tsirkin
1 sibling, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-09 15:12 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Wed, May 09, 2012 at 02:54:52PM +0100, Ian Campbell wrote:
> On Mon, 2012-05-07 at 14:54 +0100, Michael S. Tsirkin wrote:
> > zero copy packets are normally sent to the outside
> > network, but bridging, tun etc might loop them
> > back to host networking stack. If this happens
> > destructors will never be called, so orphan
> > the frags immediately on receive.
>
> I think this deceptively simply patch is actually the meat of the
> series.
> It's been a long time since I dug into the bridging code. Am I right
> that this function is only reached when an SKB is going to be delivered
> locally and not when forwarding to another port on the bridge?
>
> Ian.
I think so - bridge uses netdev_rx_handler_unregister
so it does not go through packet handlers.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > net/core/dev.c | 2 ++
> > 1 files changed, 2 insertions(+), 0 deletions(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index a2be59f..c0cdc00 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -1630,6 +1630,8 @@ static inline int deliver_skb(struct sk_buff *skb,
> > struct packet_type *pt_prev,
> > struct net_device *orig_dev)
> > {
> > + if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> > + return -ENOMEM;
> > atomic_inc(&skb->users);
> > return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
> > }
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 5/6] net: orphan frags on receive
2012-05-09 13:54 ` Ian Campbell
2012-05-09 15:12 ` Michael S. Tsirkin
@ 2012-05-14 19:18 ` Michael S. Tsirkin
1 sibling, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-14 19:18 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Wed, May 09, 2012 at 02:54:52PM +0100, Ian Campbell wrote:
> On Mon, 2012-05-07 at 14:54 +0100, Michael S. Tsirkin wrote:
> > zero copy packets are normally sent to the outside
> > network, but bridging, tun etc might loop them
> > back to host networking stack. If this happens
> > destructors will never be called, so orphan
> > the frags immediately on receive.
>
> I think this deceptively simply patch is actually the meat of the
> series.
Eric, does this patch look ok to you?
It's key to both my and ian's zerocopy work.
--
MST
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH RFC 6/6] skbuff: set zerocopy flag on frag destructor
2012-05-07 13:53 [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors) Michael S. Tsirkin
` (4 preceding siblings ...)
2012-05-07 13:54 ` [PATCH RFC 5/6] net: orphan frags on receive Michael S. Tsirkin
@ 2012-05-07 13:54 ` Michael S. Tsirkin
2012-05-08 12:46 ` Michael S. Tsirkin
2012-05-08 9:41 ` [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors) Michael S. Tsirkin
2012-05-09 13:18 ` Ian Campbell
7 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-07 13:54 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
set tx flags when adding a frag destructor to
force frags to be orphaned as appropriate.
copy when copying frags between skbs.
Note: rare false positives (where flag is set with no
destructors) are harmless.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/linux/skbuff.h | 19 ++++++++++++++++++-
net/core/skbuff.c | 3 +++
2 files changed, 21 insertions(+), 1 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 28d842e..2876e4d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -258,6 +258,13 @@ enum {
SKBTX_WIFI_STATUS = 1 << 5,
};
+static inline void skb_copy_frag_destructor(struct sk_buff *to,
+ struct sk_buff *from)
+{
+ skb_shinfo(to)->tx_flags |= skb_shinfo(from)->tx_flags &
+ SKBTX_DEV_ZEROCOPY;
+}
+
/*
* The callback notifies userspace to release buffers when skb DMA is done in
* lower device, the skb last reference should be 0 when calling this.
@@ -1260,6 +1267,8 @@ static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
{
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
frag->page.destructor = destroy;
+ if (destroy)
+ skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
}
/**
@@ -1726,7 +1735,15 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
return skb_copy_ubufs(skb, gfp_mask);
}
-
+/**
+ * skb_copy_frag_destructor - update skb after destructor copy
+ * @to: target skb to which frags were copied
+ * @from: source skb from which frags where copied
+ *
+ * Called after some frags move between skbs.
+ * If any frags in @from have a destructor, a flag in tx_flags is set.
+ * Set flag for @to so that it gets checked for destructors.
+ */
static inline void skb_copy_frag_destructor(struct sk_buff *to,
struct sk_buff *from)
{
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bdf5b09..83b04d9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -902,6 +902,7 @@ struct sk_buff *__pskb_copy(struct sk_buff *skb, int headroom, gfp_t gfp_mask)
n->truesize += skb->data_len;
n->data_len = skb->data_len;
n->len = skb->len;
+ skb_copy_frag_destructor(n, skb);
if (skb_shinfo(skb)->nr_frags) {
int i;
@@ -2302,6 +2303,7 @@ static inline void skb_split_no_header(struct sk_buff *skb,
void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len)
{
int pos = skb_headlen(skb);
+ skb_copy_frag_destructor(skb1, skb);
if (len < pos) /* Split line is inside header. */
skb_split_inside_header(skb, skb1, len, pos);
@@ -2781,6 +2783,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
if (unlikely(!nskb))
goto err;
+ skb_copy_frag_destructor(nskb, skb);
skb_reserve(nskb, headroom);
__skb_put(nskb, doffset);
}
--
MST
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 6/6] skbuff: set zerocopy flag on frag destructor
2012-05-07 13:54 ` [PATCH RFC 6/6] skbuff: set zerocopy flag on frag destructor Michael S. Tsirkin
@ 2012-05-08 12:46 ` Michael S. Tsirkin
0 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-08 12:46 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Mon, May 07, 2012 at 04:54:57PM +0300, Michael S. Tsirkin wrote:
> set tx flags when adding a frag destructor to
> force frags to be orphaned as appropriate.
> copy when copying frags between skbs.
> Note: rare false positives (where flag is set with no
> destructors) are harmless.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Ugh. I somehow managed to mangle this last patch and send out
a version that defines skb_copy_frag_destructor twice.
The following is needed on top (not resubmitting everything as it's an
RFC, and needs to be part of Ian's patchset anyway. Sorry about the
noise.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2876e4d..5fbefc3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -258,13 +258,6 @@ enum {
SKBTX_WIFI_STATUS = 1 << 5,
};
-static inline void skb_copy_frag_destructor(struct sk_buff *to,
- struct sk_buff *from)
-{
- skb_shinfo(to)->tx_flags |= skb_shinfo(from)->tx_flags &
- SKBTX_DEV_ZEROCOPY;
-}
-
/*
* The callback notifies userspace to release buffers when skb DMA is done in
* lower device, the skb last reference should be 0 when calling this.
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
2012-05-07 13:53 [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors) Michael S. Tsirkin
` (5 preceding siblings ...)
2012-05-07 13:54 ` [PATCH RFC 6/6] skbuff: set zerocopy flag on frag destructor Michael S. Tsirkin
@ 2012-05-08 9:41 ` Michael S. Tsirkin
2012-05-08 10:54 ` Ian Campbell
2012-05-09 13:18 ` Ian Campbell
7 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-08 9:41 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Mon, May 07, 2012 at 04:53:46PM +0300, Michael S. Tsirkin wrote:
> > > > But I fear we're about to toss Ian into yet another rabbit hole. :-)
> > > >
> > > > Let's try to converge on something quickly as I think integration of
> > > > his work has been delayed enough as-is.
...
> Here's a first stub at a fix. Basically to be able to modify frags on
> the fly we must make sure the skb isn't cloned, so the moment someone
> clones the skb we need to trigger the frag copy logic. And this is
> exactly what happens with SKBTX_DEV_ZEROCOPY so it seems to make sense
> to reuse that logic.
>
> The below patchset replaces patch 7
> ([PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
> in Ian's patchset and needs to be applied there.
>
>
> Compiled only but I'd like to hear what people think all the same
> because it does add a couple of branches on fast path. On the other
> hand this makes it generic so the same logic will be reusable for packet
> sockets (which IIRC are currently buggy in the same way as sunrpc) and
> for adding zero copy support to tun.
>
> Please comment,
> Thanks!
Ian, just to stress, need your ack that this does what you wanted.
--
MST
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
2012-05-08 9:41 ` [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors) Michael S. Tsirkin
@ 2012-05-08 10:54 ` Ian Campbell
2012-05-08 14:49 ` Michael S. Tsirkin
0 siblings, 1 reply; 30+ messages in thread
From: Ian Campbell @ 2012-05-08 10:54 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Tue, 2012-05-08 at 10:41 +0100, Michael S. Tsirkin wrote:
> On Mon, May 07, 2012 at 04:53:46PM +0300, Michael S. Tsirkin wrote:
> > > > > But I fear we're about to toss Ian into yet another rabbit hole. :-)
> > > > >
> > > > > Let's try to converge on something quickly as I think integration of
> > > > > his work has been delayed enough as-is.
>
> ...
>
> > Here's a first stub at a fix. Basically to be able to modify frags on
> > the fly we must make sure the skb isn't cloned, so the moment someone
> > clones the skb we need to trigger the frag copy logic. And this is
> > exactly what happens with SKBTX_DEV_ZEROCOPY so it seems to make sense
> > to reuse that logic.
> >
> > The below patchset replaces patch 7
> > ([PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
> > in Ian's patchset and needs to be applied there.
> >
> >
> > Compiled only but I'd like to hear what people think all the same
> > because it does add a couple of branches on fast path. On the other
> > hand this makes it generic so the same logic will be reusable for packet
> > sockets (which IIRC are currently buggy in the same way as sunrpc) and
> > for adding zero copy support to tun.
> >
> > Please comment,
> > Thanks!
>
> Ian, just to stress, need your ack that this does what you wanted.
Sure, yesterday was a public holiday and today looks likely to be crazy
busy. I'll try and have a look later in the week.
Ian.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
2012-05-08 10:54 ` Ian Campbell
@ 2012-05-08 14:49 ` Michael S. Tsirkin
0 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-08 14:49 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Tue, May 08, 2012 at 11:54:11AM +0100, Ian Campbell wrote:
> On Tue, 2012-05-08 at 10:41 +0100, Michael S. Tsirkin wrote:
> > On Mon, May 07, 2012 at 04:53:46PM +0300, Michael S. Tsirkin wrote:
> > > > > > But I fear we're about to toss Ian into yet another rabbit hole. :-)
> > > > > >
> > > > > > Let's try to converge on something quickly as I think integration of
> > > > > > his work has been delayed enough as-is.
> >
> > ...
> >
> > > Here's a first stub at a fix. Basically to be able to modify frags on
> > > the fly we must make sure the skb isn't cloned, so the moment someone
> > > clones the skb we need to trigger the frag copy logic. And this is
> > > exactly what happens with SKBTX_DEV_ZEROCOPY so it seems to make sense
> > > to reuse that logic.
> > >
> > > The below patchset replaces patch 7
> > > ([PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
> > > in Ian's patchset and needs to be applied there.
> > >
> > >
> > > Compiled only but I'd like to hear what people think all the same
> > > because it does add a couple of branches on fast path. On the other
> > > hand this makes it generic so the same logic will be reusable for packet
> > > sockets (which IIRC are currently buggy in the same way as sunrpc) and
> > > for adding zero copy support to tun.
> > >
> > > Please comment,
> > > Thanks!
> >
> > Ian, just to stress, need your ack that this does what you wanted.
>
> Sure, yesterday was a public holiday and today looks likely to be crazy
> busy. I'll try and have a look later in the week.
>
> Ian.
>
In case you or anyone else has cycles to test this, here's the
updated tree, net-next based - so far I verified that it builds and boots
and regular networking works, but didn't test nfs yet.
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git refs/heads/zerocopy
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
2012-05-07 13:53 [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors) Michael S. Tsirkin
` (6 preceding siblings ...)
2012-05-08 9:41 ` [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors) Michael S. Tsirkin
@ 2012-05-09 13:18 ` Ian Campbell
2012-05-09 13:52 ` Michael S. Tsirkin
7 siblings, 1 reply; 30+ messages in thread
From: Ian Campbell @ 2012-05-09 13:18 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Mon, 2012-05-07 at 14:53 +0100, Michael S. Tsirkin wrote:
> On Fri, May 04, 2012 at 11:51:31AM +0100, Ian Campbell wrote:
> > On Fri, 2012-05-04 at 11:03 +0100, Michael S. Tsirkin wrote:
> > > On Fri, May 04, 2012 at 02:54:33AM -0400, David Miller wrote:
> > > > From: "Michael S. Tsirkin" <mst@redhat.com>
> > > > Date: Fri, 4 May 2012 00:10:24 +0300
> > > >
> > > > > Hmm we orphan skbs when we loop them back so how about reusing the
> > > > > skb->destructor for this?
> > > >
> > > > That's one possibility.
>
> So originally I thought it would work: destructor would wake the
> original owner which would do data copy and modify the fragments. But
> then I unfortunately realized that would be racy: the new owner could be
By "new owner" you mean the owner of some other skb which refernences
the same shinfo (so either clone or the original if we currently hold a
clone).
> using the old frags and there appears no way for us to
> make sure it doesn't so that we can put the original page.
Right, the key issue is, I think, that cloning etc take an extra dataref
on the shinfo but do not take references on the frags. This is obviously
sane but does cause the issue above.
> And the same logic applies to modifying the frags at any
> other time if the skb is cloned. So it seems we must copy if we
> want to clone the skb.
Right. I had been hoping that the frag destructors could avoid this but
it seems like this is not going to be the case.
Just as a thought experiment would taking a frag ref on clone help us?
(lets assume for now that, iff there are destructors, this is cheaper
than the copy). I think this doesn't work -- since although the ref will
mean that the page doesn't go away under anyone elses feet after we've
orphaned it we will have lost the pointer to the original page by the
time that other ref is dropped, so freeing it will be tricky.
So following that train of thought a step further -- would creating a
new shinfo entirely on frag orphan buy us anything (and at what cost)?
Obviously it an extra allocation -- but we are already allocating N
pages of new frag. The others skb's referencing the shared shinfo would
still continue to see the old shinfo, pages and refs until they
themselves needed to orphan the frags (if they do, they might avoid it).
I think that could work, but I'm not sure it is worth the complexity.
Basically all it means that if you are bridging + flooding then you
would potentially save some number of a copies depending on the types of
devices on each port.
[...]
> Second option is what macvtap zero copy uses and it already
> does copy on clone too. So I hacked that to make it support tcp/udp
> used by sunrpc.
This does seem like it is the preferable solution.
I haven't read the patches yet, so maybe you already support this, but
it would be good to allow the creator of an skb to declare that it
doesn't want this behaviour, e.g. because it has some other mechanism
for dealing with this copy out for the case where an skb gets held
somewhere.
Ian.
> > > >
> > > > But I fear we're about to toss Ian into yet another rabbit hole. :-)
> > > >
> > > > Let's try to converge on something quickly as I think integration of
> > > > his work has been delayed enough as-is.
>
> ...
>
> > > It's weekend here, I'll work on a patch like this
> > > Sunday.
> >
> > Thanks, I was starting to feel my nose twitching and my ears beginning
> > to elongate ;-)
> >
> > Ian.
>
> Here's a first stub at a fix. Basically to be able to modify frags on
> the fly we must make sure the skb isn't cloned, so the moment someone
> clones the skb we need to trigger the frag copy logic. And this is
> exactly what happens with SKBTX_DEV_ZEROCOPY so it seems to make sense
> to reuse that logic.
>
> The below patchset replaces patch 7
> ([PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
> in Ian's patchset and needs to be applied there.
>
>
> Compiled only but I'd like to hear what people think all the same
> because it does add a couple of branches on fast path. On the other
> hand this makes it generic so the same logic will be reusable for packet
> sockets (which IIRC are currently buggy in the same way as sunrpc) and
> for adding zero copy support to tun.
>
> Please comment,
> Thanks!
>
> --
> MST
>
>
> Michael S. Tsirkin (6):
> skbuff: support per-page destructors in copy_ubufs
> skbuff: add an api to orphan frags
> skbuff: convert to skb_orphan_frags
> tun: orphan frags on xmit
> net: orphan frags on receive
> skbuff: set zerocopy flag on frag destructor
>
> drivers/net/tun.c | 2 ++
> include/linux/skbuff.h | 41 +++++++++++++++++++++++++++++++++++++++++
> net/core/dev.c | 2 ++
> net/core/skbuff.c | 43 +++++++++++++++++++++++++------------------
> 4 files changed, 70 insertions(+), 18 deletions(-)
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
2012-05-09 13:18 ` Ian Campbell
@ 2012-05-09 13:52 ` Michael S. Tsirkin
2012-05-09 14:01 ` Ian Campbell
2012-05-09 14:56 ` Michael S. Tsirkin
0 siblings, 2 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-09 13:52 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Wed, May 09, 2012 at 02:18:49PM +0100, Ian Campbell wrote:
> On Mon, 2012-05-07 at 14:53 +0100, Michael S. Tsirkin wrote:
> > On Fri, May 04, 2012 at 11:51:31AM +0100, Ian Campbell wrote:
> > > On Fri, 2012-05-04 at 11:03 +0100, Michael S. Tsirkin wrote:
> > > > On Fri, May 04, 2012 at 02:54:33AM -0400, David Miller wrote:
> > > > > From: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > Date: Fri, 4 May 2012 00:10:24 +0300
> > > > >
> > > > > > Hmm we orphan skbs when we loop them back so how about reusing the
> > > > > > skb->destructor for this?
> > > > >
> > > > > That's one possibility.
> >
> > So originally I thought it would work: destructor would wake the
> > original owner which would do data copy and modify the fragments. But
> > then I unfortunately realized that would be racy: the new owner could be
>
> By "new owner" you mean the owner of some other skb which refernences
> the same shinfo (so either clone or the original if we currently hold a
> clone).
>
> > using the old frags and there appears no way for us to
> > make sure it doesn't so that we can put the original page.
>
> Right, the key issue is, I think, that cloning etc take an extra dataref
> on the shinfo but do not take references on the frags. This is obviously
> sane but does cause the issue above.
>
> > And the same logic applies to modifying the frags at any
> > other time if the skb is cloned. So it seems we must copy if we
> > want to clone the skb.
>
> Right. I had been hoping that the frag destructors could avoid this but
> it seems like this is not going to be the case.
>
> Just as a thought experiment would taking a frag ref on clone help us?
> (lets assume for now that, iff there are destructors, this is cheaper
> than the copy). I think this doesn't work -- since although the ref will
> mean that the page doesn't go away under anyone elses feet after we've
> orphaned it we will have lost the pointer to the original page by the
> time that other ref is dropped, so freeing it will be tricky.
>
> So following that train of thought a step further -- would creating a
> new shinfo entirely on frag orphan buy us anything (and at what cost)?
> Obviously it an extra allocation -- but we are already allocating N
> pages of new frag. The others skb's referencing the shared shinfo would
> still continue to see the old shinfo, pages and refs until they
> themselves needed to orphan the frags (if they do, they might avoid it).
So it would address the second problem (cloned skb) but
not the first one (original owner racing with the first one).
> I think that could work, but I'm not sure it is worth the complexity.
> Basically all it means that if you are bridging + flooding then you
> would potentially save some number of a copies depending on the types of
> devices on each port.
>
> [...]
>
> > Second option is what macvtap zero copy uses and it already
> > does copy on clone too. So I hacked that to make it support tcp/udp
> > used by sunrpc.
>
> This does seem like it is the preferable solution.
>
> I haven't read the patches yet, so maybe you already support this, but
> it would be good to allow the creator of an skb to declare that it
> doesn't want this behaviour, e.g. because it has some other mechanism
> for dealing with this copy out for the case where an skb gets held
> somewhere.
>
> Ian.
Sure, if we find a way to do that we just don't set ZEROCOPY skb flag.
So I'll give people a bit more time to review, hope
you find the time too.
> > > > >
> > > > > But I fear we're about to toss Ian into yet another rabbit hole. :-)
> > > > >
> > > > > Let's try to converge on something quickly as I think integration of
> > > > > his work has been delayed enough as-is.
> >
> > ...
> >
> > > > It's weekend here, I'll work on a patch like this
> > > > Sunday.
> > >
> > > Thanks, I was starting to feel my nose twitching and my ears beginning
> > > to elongate ;-)
> > >
> > > Ian.
> >
> > Here's a first stub at a fix. Basically to be able to modify frags on
> > the fly we must make sure the skb isn't cloned, so the moment someone
> > clones the skb we need to trigger the frag copy logic. And this is
> > exactly what happens with SKBTX_DEV_ZEROCOPY so it seems to make sense
> > to reuse that logic.
> >
> > The below patchset replaces patch 7
> > ([PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
> > in Ian's patchset and needs to be applied there.
> >
> >
> > Compiled only but I'd like to hear what people think all the same
> > because it does add a couple of branches on fast path. On the other
> > hand this makes it generic so the same logic will be reusable for packet
> > sockets (which IIRC are currently buggy in the same way as sunrpc) and
> > for adding zero copy support to tun.
> >
> > Please comment,
> > Thanks!
> >
> > --
> > MST
> >
> >
> > Michael S. Tsirkin (6):
> > skbuff: support per-page destructors in copy_ubufs
> > skbuff: add an api to orphan frags
> > skbuff: convert to skb_orphan_frags
> > tun: orphan frags on xmit
> > net: orphan frags on receive
> > skbuff: set zerocopy flag on frag destructor
> >
> > drivers/net/tun.c | 2 ++
> > include/linux/skbuff.h | 41 +++++++++++++++++++++++++++++++++++++++++
> > net/core/dev.c | 2 ++
> > net/core/skbuff.c | 43 +++++++++++++++++++++++++------------------
> > 4 files changed, 70 insertions(+), 18 deletions(-)
> >
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
2012-05-09 13:52 ` Michael S. Tsirkin
@ 2012-05-09 14:01 ` Ian Campbell
2012-05-09 15:27 ` Michael S. Tsirkin
2012-05-09 14:56 ` Michael S. Tsirkin
1 sibling, 1 reply; 30+ messages in thread
From: Ian Campbell @ 2012-05-09 14:01 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Wed, 2012-05-09 at 14:52 +0100, Michael S. Tsirkin wrote:
> On Wed, May 09, 2012 at 02:18:49PM +0100, Ian Campbell wrote:
> > On Mon, 2012-05-07 at 14:53 +0100, Michael S. Tsirkin wrote:
> > > On Fri, May 04, 2012 at 11:51:31AM +0100, Ian Campbell wrote:
> > > > On Fri, 2012-05-04 at 11:03 +0100, Michael S. Tsirkin wrote:
> > > > > On Fri, May 04, 2012 at 02:54:33AM -0400, David Miller wrote:
> > > > > > From: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > Date: Fri, 4 May 2012 00:10:24 +0300
> > > > > >
> > > > > > > Hmm we orphan skbs when we loop them back so how about reusing the
> > > > > > > skb->destructor for this?
> > > > > >
> > > > > > That's one possibility.
> > >
> > > So originally I thought it would work: destructor would wake the
> > > original owner which would do data copy and modify the fragments. But
> > > then I unfortunately realized that would be racy: the new owner could be
> >
> > By "new owner" you mean the owner of some other skb which refernences
> > the same shinfo (so either clone or the original if we currently hold a
> > clone).
> >
> > > using the old frags and there appears no way for us to
> > > make sure it doesn't so that we can put the original page.
> >
> > Right, the key issue is, I think, that cloning etc take an extra dataref
> > on the shinfo but do not take references on the frags. This is obviously
> > sane but does cause the issue above.
> >
> > > And the same logic applies to modifying the frags at any
> > > other time if the skb is cloned. So it seems we must copy if we
> > > want to clone the skb.
> >
> > Right. I had been hoping that the frag destructors could avoid this but
> > it seems like this is not going to be the case.
> >
> > Just as a thought experiment would taking a frag ref on clone help us?
> > (lets assume for now that, iff there are destructors, this is cheaper
> > than the copy). I think this doesn't work -- since although the ref will
> > mean that the page doesn't go away under anyone elses feet after we've
> > orphaned it we will have lost the pointer to the original page by the
> > time that other ref is dropped, so freeing it will be tricky.
> >
> > So following that train of thought a step further -- would creating a
> > new shinfo entirely on frag orphan buy us anything (and at what cost)?
> > Obviously it an extra allocation -- but we are already allocating N
> > pages of new frag. The others skb's referencing the shared shinfo would
> > still continue to see the old shinfo, pages and refs until they
> > themselves needed to orphan the frags (if they do, they might avoid it).
>
> So it would address the second problem (cloned skb) but
> not the first one (original owner racing with the first one).
Right.
> > I think that could work, but I'm not sure it is worth the complexity.
> > Basically all it means that if you are bridging + flooding then you
> > would potentially save some number of a copies depending on the types of
> > devices on each port.
> >
> > [...]
> >
> > > Second option is what macvtap zero copy uses and it already
> > > does copy on clone too. So I hacked that to make it support tcp/udp
> > > used by sunrpc.
> >
> > This does seem like it is the preferable solution.
> >
> > I haven't read the patches yet, so maybe you already support this, but
> > it would be good to allow the creator of an skb to declare that it
> > doesn't want this behaviour, e.g. because it has some other mechanism
> > for dealing with this copy out for the case where an skb gets held
> > somewhere.
> >
> > Ian.
>
> Sure, if we find a way to do that we just don't set ZEROCOPY skb flag.
After reading "net: orphan frags on receive" I'm not sure how necessary
it will be, since the main case I was worried about was lots of
copying/orphaning on bridge forwarding, but I convinced myself (I think)
that the series only effects delivery and not forwarding.
Still, no harm in offering the option.
> So I'll give people a bit more time to review, hope
> you find the time too.
I took a look at a fairly high level and it all looked sensible, I've
not looked closely at the details, not run it yet, I hope I can do that
shortly.
BTW, I never said thanks for looking into this, so thanks ;-)
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
2012-05-09 14:01 ` Ian Campbell
@ 2012-05-09 15:27 ` Michael S. Tsirkin
0 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-09 15:27 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Wed, May 09, 2012 at 03:01:47PM +0100, Ian Campbell wrote:
> I took a look at a fairly high level and it all looked sensible, I've
> not looked closely at the details, not run it yet, I hope I can do that
> shortly.
In particular you can trace skb_copy_ubufs to verify that
it's called in the scenario where we want a copybreak
and not called where unexpected e.g. in bridge forwarding.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH RFC 0/6] copy aside frags with destructors (was [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
2012-05-09 13:52 ` Michael S. Tsirkin
2012-05-09 14:01 ` Ian Campbell
@ 2012-05-09 14:56 ` Michael S. Tsirkin
1 sibling, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-05-09 14:56 UTC (permalink / raw)
To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
On Wed, May 09, 2012 at 04:52:40PM +0300, Michael S. Tsirkin wrote:
> On Wed, May 09, 2012 at 02:18:49PM +0100, Ian Campbell wrote:
> > On Mon, 2012-05-07 at 14:53 +0100, Michael S. Tsirkin wrote:
> > > On Fri, May 04, 2012 at 11:51:31AM +0100, Ian Campbell wrote:
> > > > On Fri, 2012-05-04 at 11:03 +0100, Michael S. Tsirkin wrote:
> > > > > On Fri, May 04, 2012 at 02:54:33AM -0400, David Miller wrote:
> > > > > > From: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > Date: Fri, 4 May 2012 00:10:24 +0300
> > > > > >
> > > > > > > Hmm we orphan skbs when we loop them back so how about reusing the
> > > > > > > skb->destructor for this?
> > > > > >
> > > > > > That's one possibility.
> > >
> > > So originally I thought it would work: destructor would wake the
> > > original owner which would do data copy and modify the fragments. But
> > > then I unfortunately realized that would be racy: the new owner could be
> >
> > By "new owner" you mean the owner of some other skb which refernences
> > the same shinfo (so either clone or the original if we currently hold a
> > clone).
Sorry didn't clarify this well enough. We often do something like
the following
orphan
set owner
new owner is what is set afterwards.
> > > using the old frags and there appears no way for us to
> > > make sure it doesn't so that we can put the original page.
> >
> > Right, the key issue is, I think, that cloning etc take an extra dataref
> > on the shinfo but do not take references on the frags. This is obviously
> > sane but does cause the issue above.
> >
> > > And the same logic applies to modifying the frags at any
> > > other time if the skb is cloned. So it seems we must copy if we
> > > want to clone the skb.
> >
> > Right. I had been hoping that the frag destructors could avoid this but
> > it seems like this is not going to be the case.
> >
> > Just as a thought experiment would taking a frag ref on clone help us?
> > (lets assume for now that, iff there are destructors, this is cheaper
> > than the copy). I think this doesn't work -- since although the ref will
> > mean that the page doesn't go away under anyone elses feet after we've
> > orphaned it we will have lost the pointer to the original page by the
> > time that other ref is dropped, so freeing it will be tricky.
> >
> > So following that train of thought a step further -- would creating a
> > new shinfo entirely on frag orphan buy us anything (and at what cost)?
> > Obviously it an extra allocation -- but we are already allocating N
> > pages of new frag. The others skb's referencing the shared shinfo would
> > still continue to see the old shinfo, pages and refs until they
> > themselves needed to orphan the frags (if they do, they might avoid it).
>
> So it would address the second problem (cloned skb) but
> not the first one (original owner racing with the first one).
Oh and there's another issue: some code assumes that clones
are always identical. For example tcpdump
might show wrong packet content if the page is modified
while packet gets through networking core, so a malicious
application might confuse anything that relies on
this data to do e.g. security filtering.
I don't know whether anyone cares about tcpdump output
being always correct but it seems nicer to side-step the issue.
> > I think that could work, but I'm not sure it is worth the complexity.
> > Basically all it means that if you are bridging + flooding then you
> > would potentially save some number of a copies depending on the types of
> > devices on each port.
> >
> > [...]
> >
> > > Second option is what macvtap zero copy uses and it already
> > > does copy on clone too. So I hacked that to make it support tcp/udp
> > > used by sunrpc.
> >
> > This does seem like it is the preferable solution.
> >
> > I haven't read the patches yet, so maybe you already support this, but
> > it would be good to allow the creator of an skb to declare that it
> > doesn't want this behaviour, e.g. because it has some other mechanism
> > for dealing with this copy out for the case where an skb gets held
> > somewhere.
> >
> > Ian.
>
> Sure, if we find a way to do that we just don't set ZEROCOPY skb flag.
>
> So I'll give people a bit more time to review, hope
> you find the time too.
>
> > > > > >
> > > > > > But I fear we're about to toss Ian into yet another rabbit hole. :-)
> > > > > >
> > > > > > Let's try to converge on something quickly as I think integration of
> > > > > > his work has been delayed enough as-is.
> > >
> > > ...
> > >
> > > > > It's weekend here, I'll work on a patch like this
> > > > > Sunday.
> > > >
> > > > Thanks, I was starting to feel my nose twitching and my ears beginning
> > > > to elongate ;-)
> > > >
> > > > Ian.
> > >
> > > Here's a first stub at a fix. Basically to be able to modify frags on
> > > the fly we must make sure the skb isn't cloned, so the moment someone
> > > clones the skb we need to trigger the frag copy logic. And this is
> > > exactly what happens with SKBTX_DEV_ZEROCOPY so it seems to make sense
> > > to reuse that logic.
> > >
> > > The below patchset replaces patch 7
> > > ([PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors)
> > > in Ian's patchset and needs to be applied there.
> > >
> > >
> > > Compiled only but I'd like to hear what people think all the same
> > > because it does add a couple of branches on fast path. On the other
> > > hand this makes it generic so the same logic will be reusable for packet
> > > sockets (which IIRC are currently buggy in the same way as sunrpc) and
> > > for adding zero copy support to tun.
> > >
> > > Please comment,
> > > Thanks!
> > >
> > > --
> > > MST
> > >
> > >
> > > Michael S. Tsirkin (6):
> > > skbuff: support per-page destructors in copy_ubufs
> > > skbuff: add an api to orphan frags
> > > skbuff: convert to skb_orphan_frags
> > > tun: orphan frags on xmit
> > > net: orphan frags on receive
> > > skbuff: set zerocopy flag on frag destructor
> > >
> > > drivers/net/tun.c | 2 ++
> > > include/linux/skbuff.h | 41 +++++++++++++++++++++++++++++++++++++++++
> > > net/core/dev.c | 2 ++
> > > net/core/skbuff.c | 43 +++++++++++++++++++++++++------------------
> > > 4 files changed, 70 insertions(+), 18 deletions(-)
> > >
> >
^ permalink raw reply [flat|nested] 30+ messages in thread