* [PATCH] net: Frag list lost on head expansion. @ 2010-09-03 3:43 David Miller 2010-09-03 5:48 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: David Miller @ 2010-09-03 3:43 UTC (permalink / raw) To: netdev When pskb_expand_head() releases the data, with skb_release_data(), it tries to properly preserve any fragment list using skb_clone_fraglist(). Although skb_clone_fraglist() will properly grab a reference to all of the fragment list SKBs, it will not block skb_release_data() from NULL'ing out the ->frag_list pointer when it calls skb_drop_list() via skb_drop_fraglist(). As a result we lose the fragment SKBs and they are leaked forever. Instead, hide the fragment list pointer around the skb_release_data() call and restore it afterwards. This fixes the bug and also makes it cheaper since we won't grab and release every single fragment list SKB reference. Signed-off-by: David S. Miller <davem@davemloft.net> --- I found this via pure code inspection, please review. diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 26396ff..def2e49 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -789,6 +789,7 @@ EXPORT_SYMBOL(pskb_copy); int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, gfp_t gfp_mask) { + struct sk_buff *frag_list; int i; u8 *data; #ifdef NET_SKBUFF_DATA_USES_OFFSET @@ -822,11 +823,13 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) get_page(skb_shinfo(skb)->frags[i].page); - if (skb_has_frags(skb)) - skb_clone_fraglist(skb); + frag_list = skb_shinfo(skb)->frag_list; + skb_shinfo(skb)->frag_list = NULL; skb_release_data(skb); + skb_shinfo(skb)->frag_list = frag_list; + off = (data + nhead) - skb->head; skb->head = data; ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH] net: Frag list lost on head expansion. 2010-09-03 3:43 [PATCH] net: Frag list lost on head expansion David Miller @ 2010-09-03 5:48 ` Eric Dumazet 2010-09-03 6:19 ` Eric Dumazet 2010-09-03 9:09 ` [PATCH net-next-2.6] net: pskb_expand_head() optimization Eric Dumazet 0 siblings, 2 replies; 31+ messages in thread From: Eric Dumazet @ 2010-09-03 5:48 UTC (permalink / raw) To: David Miller; +Cc: netdev Le jeudi 02 septembre 2010 à 20:43 -0700, David Miller a écrit : > When pskb_expand_head() releases the data, with skb_release_data(), it > tries to properly preserve any fragment list using > skb_clone_fraglist(). > > Although skb_clone_fraglist() will properly grab a reference to all of > the fragment list SKBs, it will not block skb_release_data() from > NULL'ing out the ->frag_list pointer when it calls skb_drop_list() via > skb_drop_fraglist(). > > As a result we lose the fragment SKBs and they are leaked forever. > > Instead, hide the fragment list pointer around the skb_release_data() > call and restore it afterwards. This fixes the bug and also makes > it cheaper since we won't grab and release every single fragment > list SKB reference. > > Signed-off-by: David S. Miller <davem@davemloft.net> > --- > > I found this via pure code inspection, please review. > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 26396ff..def2e49 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -789,6 +789,7 @@ EXPORT_SYMBOL(pskb_copy); > int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, > gfp_t gfp_mask) > { > + struct sk_buff *frag_list; > int i; > u8 *data; > #ifdef NET_SKBUFF_DATA_USES_OFFSET > @@ -822,11 +823,13 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, > for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) > get_page(skb_shinfo(skb)->frags[i].page); > > - if (skb_has_frags(skb)) > - skb_clone_fraglist(skb); > + frag_list = skb_shinfo(skb)->frag_list; > + skb_shinfo(skb)->frag_list = NULL; > > skb_release_data(skb); > > + skb_shinfo(skb)->frag_list = frag_list; > + > off = (data + nhead) - skb->head; > > skb->head = data; David, I had this same idea some days ago when reviewing this code, but when I came to conclusion we could not avoid the get_page / put_page() on skb_shinfo(skb)->frags[i].page. I thought it was not worth trying to avoid the frag_list grab/release operation. But we are the only user of this skb and skb_shinfo(skb)->frag_list, so your patch seems good to me. Please note you are not fixing a bug, because the new frag_list pointer was correctly copied in the memcpy(data + size, skb_end_pointer(skb), offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags])); Please rewrite the changelog to say this patch is an optimization, to avoid the atomic ops on each skb found in frag_list ? Thanks ! ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] net: Frag list lost on head expansion. 2010-09-03 5:48 ` Eric Dumazet @ 2010-09-03 6:19 ` Eric Dumazet 2010-09-03 9:09 ` [PATCH net-next-2.6] net: pskb_expand_head() optimization Eric Dumazet 1 sibling, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2010-09-03 6:19 UTC (permalink / raw) To: David Miller; +Cc: netdev Le vendredi 03 septembre 2010 à 07:48 +0200, Eric Dumazet a écrit : > David, I had this same idea some days ago when reviewing this code, > but when I came to conclusion we could not avoid the get_page / > put_page() on skb_shinfo(skb)->frags[i].page. I thought it was not worth > trying to avoid the frag_list grab/release operation. > > But we are the only user of this skb and skb_shinfo(skb)->frag_list, so > your patch seems good to me. > > Please note you are not fixing a bug, because the new frag_list pointer > was correctly copied in the > > memcpy(data + size, skb_end_pointer(skb), > offsetof(struct skb_shared_info, > frags[skb_shinfo(skb)->nr_frags])); > > Please rewrite the changelog to say this patch is an optimization, to > avoid the atomic ops on each skb found in frag_list ? > Well, skb is not shared, but we have no guarantee skb_shinfo(skb) is not. To optimize this thing, you'll need to add a new parameter to skb_release_data() ? ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-03 5:48 ` Eric Dumazet 2010-09-03 6:19 ` Eric Dumazet @ 2010-09-03 9:09 ` Eric Dumazet 2010-09-03 13:46 ` David Miller 2010-09-07 1:25 ` David Miller 1 sibling, 2 replies; 31+ messages in thread From: Eric Dumazet @ 2010-09-03 9:09 UTC (permalink / raw) To: David Miller; +Cc: netdev Le vendredi 03 septembre 2010 à 07:48 +0200, Eric Dumazet a écrit : > David, I had this same idea some days ago when reviewing this code, > but when I came to conclusion we could not avoid the get_page / > put_page() on skb_shinfo(skb)->frags[i].page. I thought it was not worth > trying to avoid the frag_list grab/release operation. > Here is the patch I cooked, for net-next-2.6 [PATCH net-next-2.6] net: pskb_expand_head() optimization pskb_expand_head() blindly takes references on fragments before calling skb_release_data(), potentially releasing these references. We can add a fast path, avoiding these atomic operations, if we own the last reference on skb->head. Based on a previous patch from David Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- net/core/skbuff.c | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 231dff0..59b96fe 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -781,6 +781,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, u8 *data; int size = nhead + (skb_end_pointer(skb) - skb->head) + ntail; long off; + bool fastpath; BUG_ON(nhead < 0); @@ -802,14 +803,28 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, skb_shinfo(skb), offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags])); - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) - get_page(skb_shinfo(skb)->frags[i].page); + /* Check if we can avoid taking references on fragments if we own + * the last reference on skb->head. (see skb_release_data()) + */ + if (!skb->cloned) + fastpath = true; + else { + int delta = skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1; - if (skb_has_frag_list(skb)) - skb_clone_fraglist(skb); + fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta; + } - skb_release_data(skb); + if (fastpath) { + kfree(skb->head); + } else { + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) + get_page(skb_shinfo(skb)->frags[i].page); + if (skb_has_frag_list(skb)) + skb_clone_fraglist(skb); + + skb_release_data(skb); + } off = (data + nhead) - skb->head; skb->head = data; ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-03 9:09 ` [PATCH net-next-2.6] net: pskb_expand_head() optimization Eric Dumazet @ 2010-09-03 13:46 ` David Miller 2010-09-07 2:20 ` David Miller 2010-09-07 1:25 ` David Miller 1 sibling, 1 reply; 31+ messages in thread From: David Miller @ 2010-09-03 13:46 UTC (permalink / raw) To: eric.dumazet; +Cc: netdev From: Eric Dumazet <eric.dumazet@gmail.com> Date: Fri, 03 Sep 2010 11:09:32 +0200 > Le vendredi 03 septembre 2010 à 07:48 +0200, Eric Dumazet a écrit : > >> David, I had this same idea some days ago when reviewing this code, >> but when I came to conclusion we could not avoid the get_page / >> put_page() on skb_shinfo(skb)->frags[i].page. I thought it was not worth >> trying to avoid the frag_list grab/release operation. >> > > Here is the patch I cooked, for net-next-2.6 > > [PATCH net-next-2.6] net: pskb_expand_head() optimization > > pskb_expand_head() blindly takes references on fragments before calling > skb_release_data(), potentially releasing these references. > > We can add a fast path, avoiding these atomic operations, if we own the > last reference on skb->head. > > Based on a previous patch from David > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Ok, I see how this works now, thanks Eric. But this looks to be a bit too complicated to be worthwhile, I think, so let's hold off on this optimization for a while. Let me tell you why I was swimming around in this area and what my ultimate motivation is :-) In trying to eventually convert sk_buff to list_head one of the troubling areas is this frag_list business. Amusingly, if you look, virtually all of the code which constructs frag_list SKBs wants a doubly linked list so it can insert at the tail (all LRO gathering, GRO, you name it). There are only two operations that make a double-linked list currently impossible. This pskb_expand_head() thing, and pskb_copy(). They cause the situation where the list is shared between multiple SKB data shared areas. And because of this a doubly-linked list like list_head won't work at all. For the optimized case of pskb_expand_head() you discovered we can avoid this sharing. And at this point I imagine that for the remaining cases we can simply copy the full SKBs in the frag list to avoid this list sharing. So that's where I'm trying to go in all of this. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-03 13:46 ` David Miller @ 2010-09-07 2:20 ` David Miller 2010-09-07 5:02 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: David Miller @ 2010-09-07 2:20 UTC (permalink / raw) To: eric.dumazet; +Cc: netdev From: David Miller <davem@davemloft.net> Date: Fri, 03 Sep 2010 06:46:33 -0700 (PDT) > There are only two operations that make a double-linked list currently > impossible. This pskb_expand_head() thing, and pskb_copy(). > > They cause the situation where the list is shared between multiple SKB > data shared areas. > > And because of this a doubly-linked list like list_head won't work at > all. > > For the optimized case of pskb_expand_head() you discovered we can avoid > this sharing. And at this point I imagine that for the remaining cases > we can simply copy the full SKBs in the frag list to avoid this list > sharing. Eric, this goes on top of your patch and demonstrates the idea. Please review if you have a chance: diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 2d1bc76..aeb56af 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -327,6 +327,32 @@ static void skb_clone_fraglist(struct sk_buff *skb) skb_get(list); } +static struct sk_buff *skb_copy_fraglist(struct sk_buff *parent, + gfp_t gfp_mask) +{ + struct sk_buff *first_skb = NULL; + struct sk_buff *prev_skb = NULL; + struct sk_buff *skb; + + skb_walk_frags(parent, skb) { + struct sk_buff *nskb = pskb_copy(skb, gfp_mask); + + if (!nskb) + goto fail; + if (!first_skb) + first_skb = skb; + else + prev_skb->next = skb; + prev_skb = skb; + } + + return first_skb; + +fail: + skb_drop_list(&first_skb); + return NULL; +} + static void skb_release_data(struct sk_buff *skb) { if (!skb->cloned || @@ -812,17 +838,22 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta; } - if (fastpath) { - kfree(skb->head); - } else { + if (!fastpath) { + if (skb_has_frag_list(skb)) { + struct sk_buff *new_list; + + new_list = skb_copy_fraglist(skb, gfp_mask); + if (!new_list) + goto free_data; + skb_shinfo(skb)->frag_list = new_list; + } for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) get_page(skb_shinfo(skb)->frags[i].page); - if (skb_has_frag_list(skb)) - skb_clone_fraglist(skb); - - skb_release_data(skb); } + + kfree(skb->head); + off = (data + nhead) - skb->head; skb->head = data; @@ -848,6 +879,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, atomic_set(&skb_shinfo(skb)->dataref, 1); return 0; +free_data: + kfree(data); nodata: return -ENOMEM; } ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-07 2:20 ` David Miller @ 2010-09-07 5:02 ` Eric Dumazet 2010-09-07 5:05 ` David Miller 2010-09-07 9:16 ` Jarek Poplawski 0 siblings, 2 replies; 31+ messages in thread From: Eric Dumazet @ 2010-09-07 5:02 UTC (permalink / raw) To: David Miller; +Cc: netdev Le lundi 06 septembre 2010 à 19:20 -0700, David Miller a écrit : > Eric, this goes on top of your patch and demonstrates the idea. > > Please review if you have a chance: > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 2d1bc76..aeb56af 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -327,6 +327,32 @@ static void skb_clone_fraglist(struct sk_buff *skb) > skb_get(list); > } > > +static struct sk_buff *skb_copy_fraglist(struct sk_buff *parent, > + gfp_t gfp_mask) > +{ > + struct sk_buff *first_skb = NULL; > + struct sk_buff *prev_skb = NULL; > + struct sk_buff *skb; > + > + skb_walk_frags(parent, skb) { > + struct sk_buff *nskb = pskb_copy(skb, gfp_mask); > + > + if (!nskb) > + goto fail; > + if (!first_skb) > + first_skb = skb; > + else > + prev_skb->next = skb; > + prev_skb = skb; > + } > + > + return first_skb; > + > +fail: > + skb_drop_list(&first_skb); > + return NULL; > +} > + > static void skb_release_data(struct sk_buff *skb) > { > if (!skb->cloned || > @@ -812,17 +838,22 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, > fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta; > } > > - if (fastpath) { > - kfree(skb->head); > - } else { > + if (!fastpath) { > + if (skb_has_frag_list(skb)) { > + struct sk_buff *new_list; > + > + new_list = skb_copy_fraglist(skb, gfp_mask); > + if (!new_list) > + goto free_data; > + skb_shinfo(skb)->frag_list = new_list; Here, skb_shinfo(skb) still points to old shinfo, you should not touch it. An other user might need it :) > + } > for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) > get_page(skb_shinfo(skb)->frags[i].page); > > - if (skb_has_frag_list(skb)) > - skb_clone_fraglist(skb); > - > - skb_release_data(skb); > } I believe you cannot remove skb_release_data() call, we really need to perform the atomic operation, and test the result on it, or a double free could happen. > + > + kfree(skb->head); > + > off = (data + nhead) - skb->head; > > skb->head = data; > @@ -848,6 +879,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, > atomic_set(&skb_shinfo(skb)->dataref, 1); > return 0; > > +free_data: > + kfree(data); is it a leftover ? > nodata: > return -ENOMEM; > } I understand what you want to do, but problem is we need to perform a CAS2 operation : atomically changes two values (dataref and frag_list) ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-07 5:02 ` Eric Dumazet @ 2010-09-07 5:05 ` David Miller 2010-09-07 9:16 ` Jarek Poplawski 1 sibling, 0 replies; 31+ messages in thread From: David Miller @ 2010-09-07 5:05 UTC (permalink / raw) To: eric.dumazet; +Cc: netdev From: Eric Dumazet <eric.dumazet@gmail.com> Date: Tue, 07 Sep 2010 07:02:09 +0200 > I understand what you want to do, but problem is we need to perform a > CAS2 operation : atomically changes two values (dataref and frag_list) Ok, thanks for the review, I'll think some more about this. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-07 5:02 ` Eric Dumazet 2010-09-07 5:05 ` David Miller @ 2010-09-07 9:16 ` Jarek Poplawski 2010-09-07 9:37 ` Eric Dumazet 1 sibling, 1 reply; 31+ messages in thread From: Jarek Poplawski @ 2010-09-07 9:16 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev On 2010-09-07 07:02, Eric Dumazet wrote: > Le lundi 06 septembre 2010 Ă 19:20 -0700, David Miller a ĂŠcrit : > >> Eric, this goes on top of your patch and demonstrates the idea. >> >> Please review if you have a chance: >> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c >> index 2d1bc76..aeb56af 100644 >> --- a/net/core/skbuff.c >> +++ b/net/core/skbuff.c >> @@ -327,6 +327,32 @@ static void skb_clone_fraglist(struct sk_buff *skb) >> skb_get(list); >> } >> >> +static struct sk_buff *skb_copy_fraglist(struct sk_buff *parent, >> + gfp_t gfp_mask) >> +{ >> + struct sk_buff *first_skb = NULL; >> + struct sk_buff *prev_skb = NULL; >> + struct sk_buff *skb; >> + >> + skb_walk_frags(parent, skb) { >> + struct sk_buff *nskb = pskb_copy(skb, gfp_mask); >> + >> + if (!nskb) >> + goto fail; >> + if (!first_skb) >> + first_skb = skb; >> + else >> + prev_skb->next = skb; >> + prev_skb = skb; >> + } >> + >> + return first_skb; >> + >> +fail: >> + skb_drop_list(&first_skb); >> + return NULL; >> +} >> + >> static void skb_release_data(struct sk_buff *skb) >> { >> if (!skb->cloned || >> @@ -812,17 +838,22 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, >> fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta; >> } >> >> - if (fastpath) { >> - kfree(skb->head); >> - } else { >> + if (!fastpath) { >> + if (skb_has_frag_list(skb)) { >> + struct sk_buff *new_list; >> + >> + new_list = skb_copy_fraglist(skb, gfp_mask); >> + if (!new_list) >> + goto free_data; >> + skb_shinfo(skb)->frag_list = new_list; > > Here, skb_shinfo(skb) still points to old shinfo, you should not touch > it. An other user might need it :) Even if there were no users this is written to the area freed with kfree(skb->head) a few lines later, isn't it? > >> + } >> for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) >> get_page(skb_shinfo(skb)->frags[i].page); >> >> - if (skb_has_frag_list(skb)) >> - skb_clone_fraglist(skb); >> - >> - skb_release_data(skb); >> } > > I believe you cannot remove skb_release_data() call, we really need to > perform the atomic operation, and test the result on it, or a double > free could happen. > >> + >> + kfree(skb->head); >> + >> off = (data + nhead) - skb->head; >> >> skb->head = data; >> @@ -848,6 +879,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, >> atomic_set(&skb_shinfo(skb)->dataref, 1); >> return 0; >> >> +free_data: >> + kfree(data); > > is it a leftover ? > >> nodata: >> return -ENOMEM; >> } > > I understand what you want to do, but problem is we need to perform a > CAS2 operation : atomically changes two values (dataref and frag_list) Alas I can't understand why do you think these clone and atomic tests in skb_release_data() don't protect skb_shinfo(skb)->frag_list enough. Jarek P. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-07 9:16 ` Jarek Poplawski @ 2010-09-07 9:37 ` Eric Dumazet 2010-09-10 19:54 ` David Miller 0 siblings, 1 reply; 31+ messages in thread From: Eric Dumazet @ 2010-09-07 9:37 UTC (permalink / raw) To: Jarek Poplawski; +Cc: David Miller, netdev Le mardi 07 septembre 2010 à 09:16 +0000, Jarek Poplawski a écrit : > On 2010-09-07 07:02, Eric Dumazet wrote: > > > > I understand what you want to do, but problem is we need to perform a > > CAS2 operation : atomically changes two values (dataref and frag_list) > > Alas I can't understand why do you think these clone and atomic tests > in skb_release_data() don't protect skb_shinfo(skb)->frag_list enough. > It was early in the morning, before a cup of tea. David only had to set frag_list in the new shinfo, not the old. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-07 9:37 ` Eric Dumazet @ 2010-09-10 19:54 ` David Miller 2010-09-11 12:31 ` Jarek Poplawski 0 siblings, 1 reply; 31+ messages in thread From: David Miller @ 2010-09-10 19:54 UTC (permalink / raw) To: eric.dumazet; +Cc: jarkao2, netdev From: Eric Dumazet <eric.dumazet@gmail.com> Date: Tue, 07 Sep 2010 11:37:28 +0200 > Le mardi 07 septembre 2010 à 09:16 +0000, Jarek Poplawski a écrit : >> On 2010-09-07 07:02, Eric Dumazet wrote: > >> > >> > I understand what you want to do, but problem is we need to perform a >> > CAS2 operation : atomically changes two values (dataref and frag_list) >> >> Alas I can't understand why do you think these clone and atomic tests >> in skb_release_data() don't protect skb_shinfo(skb)->frag_list enough. >> > > It was early in the morning, before a cup of tea. > > David only had to set frag_list in the new shinfo, not the old. Ok, how does this look? diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 752c197..aaa9750 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -327,6 +327,32 @@ static void skb_clone_fraglist(struct sk_buff *skb) skb_get(list); } +static struct sk_buff *skb_copy_fraglist(struct sk_buff *parent, + gfp_t gfp_mask) +{ + struct sk_buff *first_skb = NULL; + struct sk_buff *prev_skb = NULL; + struct sk_buff *skb; + + skb_walk_frags(parent, skb) { + struct sk_buff *nskb = pskb_copy(skb, gfp_mask); + + if (!nskb) + goto fail; + if (!first_skb) + first_skb = skb; + else + prev_skb->next = skb; + prev_skb = skb; + } + + return first_skb; + +fail: + skb_drop_list(&first_skb); + return NULL; +} + static void skb_release_data(struct sk_buff *skb) { if (!skb->cloned || @@ -775,11 +801,12 @@ EXPORT_SYMBOL(pskb_copy); int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, gfp_t gfp_mask) { - int i; - u8 *data; int size = nhead + (skb_end_pointer(skb) - skb->head) + ntail; - long off; + struct skb_shared_info *new_shinfo; bool fastpath; + u8 *data; + long off; + int i; BUG_ON(nhead < 0); @@ -797,8 +824,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, */ memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head); - memcpy((struct skb_shared_info *)(data + size), - skb_shinfo(skb), + new_shinfo = (struct skb_shared_info *)(data + size); + memcpy(new_shinfo, skb_shinfo(skb), offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags])); /* Check if we can avoid taking references on fragments if we own @@ -815,14 +842,20 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, if (fastpath) { kfree(skb->head); } else { + if (skb_has_frag_list(skb)) { + struct sk_buff *new_list; + + new_list = skb_copy_fraglist(skb, gfp_mask); + if (!new_list) + goto free_data; + new_shinfo->frag_list = new_list; + } for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) get_page(skb_shinfo(skb)->frags[i].page); - if (skb_has_frag_list(skb)) - skb_clone_fraglist(skb); - skb_release_data(skb); } + off = (data + nhead) - skb->head; skb->head = data; @@ -848,6 +881,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, atomic_set(&skb_shinfo(skb)->dataref, 1); return 0; +free_data: + kfree(data); nodata: return -ENOMEM; } ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-10 19:54 ` David Miller @ 2010-09-11 12:31 ` Jarek Poplawski 2010-09-12 3:30 ` David Miller 2010-09-20 0:17 ` David Miller 0 siblings, 2 replies; 31+ messages in thread From: Jarek Poplawski @ 2010-09-11 12:31 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, netdev On Fri, Sep 10, 2010 at 12:54:49PM -0700, David Miller wrote: > From: Eric Dumazet <eric.dumazet@gmail.com> > Date: Tue, 07 Sep 2010 11:37:28 +0200 > > > Le mardi 07 septembre 2010 ? 09:16 +0000, Jarek Poplawski a écrit : > >> On 2010-09-07 07:02, Eric Dumazet wrote: > > > >> > > >> > I understand what you want to do, but problem is we need to perform a > >> > CAS2 operation : atomically changes two values (dataref and frag_list) > >> > >> Alas I can't understand why do you think these clone and atomic tests > >> in skb_release_data() don't protect skb_shinfo(skb)->frag_list enough. > >> > > > > It was early in the morning, before a cup of tea. > > > > David only had to set frag_list in the new shinfo, not the old. > > Ok, how does this look? > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 752c197..aaa9750 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -327,6 +327,32 @@ static void skb_clone_fraglist(struct sk_buff *skb) > skb_get(list); > } > > +static struct sk_buff *skb_copy_fraglist(struct sk_buff *parent, > + gfp_t gfp_mask) > +{ > + struct sk_buff *first_skb = NULL; > + struct sk_buff *prev_skb = NULL; > + struct sk_buff *skb; > + > + skb_walk_frags(parent, skb) { > + struct sk_buff *nskb = pskb_copy(skb, gfp_mask); > + > + if (!nskb) > + goto fail; > + if (!first_skb) > + first_skb = skb; Probably here and below: "= nskb" > + else > + prev_skb->next = skb; > + prev_skb = skb; > + } > + > + return first_skb; > + > +fail: With "if (first_skb)" here it would look better to me even if it currently doesn't matter. Otherwise seems OK, but I still would like to know the scenario demanding this change. Jarek P. > + skb_drop_list(&first_skb); > + return NULL; > +} > + > static void skb_release_data(struct sk_buff *skb) > { > if (!skb->cloned || > @@ -775,11 +801,12 @@ EXPORT_SYMBOL(pskb_copy); > int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, > gfp_t gfp_mask) > { > - int i; > - u8 *data; > int size = nhead + (skb_end_pointer(skb) - skb->head) + ntail; > - long off; > + struct skb_shared_info *new_shinfo; > bool fastpath; > + u8 *data; > + long off; > + int i; > > BUG_ON(nhead < 0); > > @@ -797,8 +824,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, > */ > memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head); > > - memcpy((struct skb_shared_info *)(data + size), > - skb_shinfo(skb), > + new_shinfo = (struct skb_shared_info *)(data + size); > + memcpy(new_shinfo, skb_shinfo(skb), > offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags])); > > /* Check if we can avoid taking references on fragments if we own > @@ -815,14 +842,20 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, > if (fastpath) { > kfree(skb->head); > } else { > + if (skb_has_frag_list(skb)) { > + struct sk_buff *new_list; > + > + new_list = skb_copy_fraglist(skb, gfp_mask); > + if (!new_list) > + goto free_data; > + new_shinfo->frag_list = new_list; > + } > for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) > get_page(skb_shinfo(skb)->frags[i].page); > > - if (skb_has_frag_list(skb)) > - skb_clone_fraglist(skb); > - > skb_release_data(skb); > } > + > off = (data + nhead) - skb->head; > > skb->head = data; > @@ -848,6 +881,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, > atomic_set(&skb_shinfo(skb)->dataref, 1); > return 0; > > +free_data: > + kfree(data); > nodata: > return -ENOMEM; > } ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-11 12:31 ` Jarek Poplawski @ 2010-09-12 3:30 ` David Miller 2010-09-12 10:45 ` Jarek Poplawski 2010-09-20 0:17 ` David Miller 1 sibling, 1 reply; 31+ messages in thread From: David Miller @ 2010-09-12 3:30 UTC (permalink / raw) To: jarkao2; +Cc: eric.dumazet, netdev From: Jarek Poplawski <jarkao2@gmail.com> Date: Sat, 11 Sep 2010 14:31:40 +0200 > Otherwise seems OK, but I still would like to know the scenario > demanding this change. I want to make sk_buff use list_head, including all uses such as frag_list et al. If the frag_list chain can be shared, a doubly linked list cannot be used. This is someting I've been gradually working on now for more than 2 years :) ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-12 3:30 ` David Miller @ 2010-09-12 10:45 ` Jarek Poplawski 2010-09-12 10:58 ` Jarek Poplawski 2010-09-12 15:58 ` David Miller 0 siblings, 2 replies; 31+ messages in thread From: Jarek Poplawski @ 2010-09-12 10:45 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, netdev On Sat, Sep 11, 2010 at 08:30:02PM -0700, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Sat, 11 Sep 2010 14:31:40 +0200 > > > Otherwise seems OK, but I still would like to know the scenario > > demanding this change. > > I want to make sk_buff use list_head, including all uses such as > frag_list et al. > > If the frag_list chain can be shared, a doubly linked list cannot be > used. > > This is someting I've been gradually working on now for more than 2 > years :) Hmm... Then the first message/changelog in this thread seems to describe the future bug, only with doubly linked lists. If so, it was a bit misleading to me ;-) Then a few more questions: 1) if doubly linked lists really require such pskb_copying, isn't it all too costly? 2) why skb_clone isn't enough instead of pskb_copy? 3) since skb_clone has some cost too, why e.g. saving only the pointer to the tail of the list in skb_shared_info isn't enough? Jarek P. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-12 10:45 ` Jarek Poplawski @ 2010-09-12 10:58 ` Jarek Poplawski 2010-09-12 15:58 ` David Miller 1 sibling, 0 replies; 31+ messages in thread From: Jarek Poplawski @ 2010-09-12 10:58 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, netdev On Sun, Sep 12, 2010 at 12:45:34PM +0200, Jarek Poplawski wrote: > On Sat, Sep 11, 2010 at 08:30:02PM -0700, David Miller wrote: > > From: Jarek Poplawski <jarkao2@gmail.com> > > Date: Sat, 11 Sep 2010 14:31:40 +0200 > > > > > Otherwise seems OK, but I still would like to know the scenario > > > demanding this change. > > > > I want to make sk_buff use list_head, including all uses such as > > frag_list et al. > > > > If the frag_list chain can be shared, a doubly linked list cannot be > > used. > > > > This is someting I've been gradually working on now for more than 2 > > years :) > > Hmm... Then the first message/changelog in this thread seems to > describe the future bug, only with doubly linked lists. If so, it was > a bit misleading to me ;-) > > Then a few more questions: > 1) if doubly linked lists really require such pskb_copying, isn't it > all too costly? > 2) why skb_clone isn't enough instead of pskb_copy? > 3) since skb_clone has some cost too, why e.g. saving only the pointer > to the tail of the list in skb_shared_info isn't enough? ...3a) IOW, do we really need this double linking... Jarek P. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-12 10:45 ` Jarek Poplawski 2010-09-12 10:58 ` Jarek Poplawski @ 2010-09-12 15:58 ` David Miller 2010-09-12 16:13 ` David Miller ` (2 more replies) 1 sibling, 3 replies; 31+ messages in thread From: David Miller @ 2010-09-12 15:58 UTC (permalink / raw) To: jarkao2; +Cc: eric.dumazet, netdev From: Jarek Poplawski <jarkao2@gmail.com> Date: Sun, 12 Sep 2010 12:45:34 +0200 > Then a few more questions: > 1) if doubly linked lists really require such pskb_copying, isn't it > all too costly? In the common case the data reference will be one, so we will not copy. > 2) why skb_clone isn't enough instead of pskb_copy? Can't share the metadata. > 3) since skb_clone has some cost too, why e.g. saving only the pointer > to the tail of the list in skb_shared_info isn't enough? Then we won't get the rest of the advantages of using list_head such as prefetching during traversals, automatic debugging facilities, et al. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-12 15:58 ` David Miller @ 2010-09-12 16:13 ` David Miller 2010-09-12 20:57 ` Jarek Poplawski 2010-09-12 19:55 ` Ben Pfaff 2010-09-12 20:45 ` Jarek Poplawski 2 siblings, 1 reply; 31+ messages in thread From: David Miller @ 2010-09-12 16:13 UTC (permalink / raw) To: jarkao2; +Cc: eric.dumazet, netdev BTW, Jarek, as to your idea to store a tail pointer in the shinfo, how will you sync that tail pointer in all of the shinfo instances referencing the frag list? It simply can't work, we have to copy. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-12 16:13 ` David Miller @ 2010-09-12 20:57 ` Jarek Poplawski 2010-09-12 22:08 ` David Miller 0 siblings, 1 reply; 31+ messages in thread From: Jarek Poplawski @ 2010-09-12 20:57 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, netdev On Sun, Sep 12, 2010 at 09:13:53AM -0700, David Miller wrote: > > BTW, Jarek, as to your idea to store a tail pointer in the shinfo, how > will you sync that tail pointer in all of the shinfo instances > referencing the frag list? > > It simply can't work, we have to copy. The question is if we need to sync at all? This is shared data at the moment, so I can't imagine how the list (especialy doubly linked) could be changed without locking? And even if it's possible, I doubt copying e.g. like in your current patch can help when an skb is added at the tail later. Jarek P. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-12 20:57 ` Jarek Poplawski @ 2010-09-12 22:08 ` David Miller 2010-09-13 7:49 ` Jarek Poplawski 0 siblings, 1 reply; 31+ messages in thread From: David Miller @ 2010-09-12 22:08 UTC (permalink / raw) To: jarkao2; +Cc: eric.dumazet, netdev From: Jarek Poplawski <jarkao2@gmail.com> Date: Sun, 12 Sep 2010 22:57:22 +0200 > On Sun, Sep 12, 2010 at 09:13:53AM -0700, David Miller wrote: >> >> BTW, Jarek, as to your idea to store a tail pointer in the shinfo, how >> will you sync that tail pointer in all of the shinfo instances >> referencing the frag list? >> >> It simply can't work, we have to copy. > > The question is if we need to sync at all? This is shared data at the > moment, so I can't imagine how the list (especialy doubly linked) > could be changed without locking? And even if it's possible, I doubt > copying e.g. like in your current patch can help when an skb is added > at the tail later. That's the fundamental issue. If you look, everywhere we curently do that trick of "use the skb->prev pointer to remmeber the frag_list tail" the code knows it has exclusive access to both the skb metadata and the underlying data. But for modifications of the frag list during the SKBs lifetime that's another issue, entirely. All of these functions trimming the head or tail of the SKB data which can modify the frag list elements, they can be called from all kinds of contexts. Look for Alexey Kuznetsov's comments in skbuff.c that read "mincing fragments" and similar. The real win with my work is complete unification of all list handling, and making our packet handling code much more "hackable" by non-networking kernel hackers. Really we have the last major core datastructures that do not use standard lists, and I'm going to convert it so we can be sane like the rest of the kernel. :-) ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-12 22:08 ` David Miller @ 2010-09-13 7:49 ` Jarek Poplawski 0 siblings, 0 replies; 31+ messages in thread From: Jarek Poplawski @ 2010-09-13 7:49 UTC (permalink / raw) To: David Miller; +Cc: jarkao2, eric.dumazet, netdev On 2010-09-13 00:08, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Sun, 12 Sep 2010 22:57:22 +0200 > >> On Sun, Sep 12, 2010 at 09:13:53AM -0700, David Miller wrote: >>> >>> BTW, Jarek, as to your idea to store a tail pointer in the shinfo, how >>> will you sync that tail pointer in all of the shinfo instances >>> referencing the frag list? >>> >>> It simply can't work, we have to copy. >> >> The question is if we need to sync at all? This is shared data at the >> moment, so I can't imagine how the list (especialy doubly linked) >> could be changed without locking? And even if it's possible, I doubt >> copying e.g. like in your current patch can help when an skb is added >> at the tail later. > > That's the fundamental issue. > > If you look, everywhere we curently do that trick of "use the > skb->prev pointer to remmeber the frag_list tail" the code knows > it has exclusive access to both the skb metadata and the > underlying data. > > But for modifications of the frag list during the SKBs lifetime > that's another issue, entirely. All of these functions trimming > the head or tail of the SKB data which can modify the frag > list elements, they can be called from all kinds of contexts. > Look for Alexey Kuznetsov's comments in skbuff.c that read > "mincing fragments" and similar. OK, I've found the skb_cow_data() comments, but I can see there only a modification of copied data. > > The real win with my work is complete unification of all list > handling, and making our packet handling code much more "hackable" > by non-networking kernel hackers. > > Really we have the last major core datastructures that do not > use standard lists, and I'm going to convert it so we can > be sane like the rest of the kernel. :-) I guess we can stay with different opinions, no problem, at least to me. IMHO, considering the need for additional copying or even only cloning data where it's not currently done, doubly linked list is too costly for the frag_list. It would affect pskb_expand_head(), pskb_copy() but probably also skb_segment(), and maybe more. So, with GROwing(!) importance of fragmented packets, I doubt this copying will be as unlikely as presumed. Jarek P. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-12 15:58 ` David Miller 2010-09-12 16:13 ` David Miller @ 2010-09-12 19:55 ` Ben Pfaff 2010-09-12 20:24 ` David Miller 2010-09-12 20:45 ` Jarek Poplawski 2 siblings, 1 reply; 31+ messages in thread From: Ben Pfaff @ 2010-09-12 19:55 UTC (permalink / raw) To: David Miller; +Cc: jarkao2, eric.dumazet, netdev David Miller <davem@davemloft.net> writes: >> 3) since skb_clone has some cost too, why e.g. saving only the pointer >> to the tail of the list in skb_shared_info isn't enough? > > Then we won't get the rest of the advantages of using list_head such > as prefetching during traversals, automatic debugging facilities, et al. Did you see the recent patch from Andi Kleen where he proposes removing this prefetching in most situations because the costs outweigh the benefits on most modern architectures? http://permalink.gmane.org/gmane.linux.kernel/1033281 I'm not saying that list_head doesn't have other advantages though. -- Ben Pfaff http://benpfaff.org ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-12 19:55 ` Ben Pfaff @ 2010-09-12 20:24 ` David Miller 0 siblings, 0 replies; 31+ messages in thread From: David Miller @ 2010-09-12 20:24 UTC (permalink / raw) To: blp; +Cc: jarkao2, eric.dumazet, netdev From: Ben Pfaff <blp@cs.stanford.edu> Date: Sun, 12 Sep 2010 12:55:54 -0700 > David Miller <davem@davemloft.net> writes: > >>> 3) since skb_clone has some cost too, why e.g. saving only the pointer >>> to the tail of the list in skb_shared_info isn't enough? >> >> Then we won't get the rest of the advantages of using list_head such >> as prefetching during traversals, automatic debugging facilities, et al. > > Did you see the recent patch from Andi Kleen where he proposes > removing this prefetching in most situations because the costs > outweigh the benefits on most modern architectures? > http://permalink.gmane.org/gmane.linux.kernel/1033281 > > I'm not saying that list_head doesn't have other advantages > though. Yes I saw it and I somewhat disagree with him, but don't care enough to argue with him about it. There are much more important things to apply my mind and time to at the moment :-) ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-12 15:58 ` David Miller 2010-09-12 16:13 ` David Miller 2010-09-12 19:55 ` Ben Pfaff @ 2010-09-12 20:45 ` Jarek Poplawski 2 siblings, 0 replies; 31+ messages in thread From: Jarek Poplawski @ 2010-09-12 20:45 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, netdev On Sun, Sep 12, 2010 at 08:58:33AM -0700, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Sun, 12 Sep 2010 12:45:34 +0200 > > > Then a few more questions: > > 1) if doubly linked lists really require such pskb_copying, isn't it > > all too costly? > > In the common case the data reference will be one, so we will not > copy. Even if so, one such a case on the fast path should hit performance, so it would need special reviewing. > > > 2) why skb_clone isn't enough instead of pskb_copy? > > Can't share the metadata. I'd really like to understand why the change in handling next/prev should affect more than skb pointers wrt. current sharing. > > > 3) since skb_clone has some cost too, why e.g. saving only the pointer > > to the tail of the list in skb_shared_info isn't enough? > > Then we won't get the rest of the advantages of using list_head such > as prefetching during traversals, automatic debugging facilities, et al. Right, we need to sum pros and cons. So, what's the pros? ;-) Jarek P. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-11 12:31 ` Jarek Poplawski 2010-09-12 3:30 ` David Miller @ 2010-09-20 0:17 ` David Miller 2010-09-20 7:21 ` Jarek Poplawski 1 sibling, 1 reply; 31+ messages in thread From: David Miller @ 2010-09-20 0:17 UTC (permalink / raw) To: jarkao2; +Cc: eric.dumazet, netdev From: Jarek Poplawski <jarkao2@gmail.com> Date: Sat, 11 Sep 2010 14:31:40 +0200 > Otherwise seems OK, but I still would like to know the scenario > demanding this change. Ok Jarek, after some more consideration I agree with you. Removing this kind of sharing would be unwise, ho hum... While pondering over this I thought about why we even need frag lists at all. We need them for two reasons: 1) Because we have an inability to turn kmalloc data references into page based ones easily. We've run into this sort of problem with socket splice() too. 2) For recording the segmentation points of a fragmented packet. We definitely don't use frag lists for performance, if you look in the GRO history the GRO code specifically has been changed to prefer accumulating into the page vector whenever possible because using frag lists is a lot slower. Anyways, even if we somehow solved #1 we'd still have a lot of code (even bluetooth) using it for the sake of issue #2. So for the time being there is no clear path for trying to eliminate frag lists altogether if we wanted to do that either. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-20 0:17 ` David Miller @ 2010-09-20 7:21 ` Jarek Poplawski 2010-09-20 9:02 ` Eric Dumazet 2010-09-20 16:59 ` David Miller 0 siblings, 2 replies; 31+ messages in thread From: Jarek Poplawski @ 2010-09-20 7:21 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, netdev On Sun, Sep 19, 2010 at 05:17:25PM -0700, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Sat, 11 Sep 2010 14:31:40 +0200 > > > Otherwise seems OK, but I still would like to know the scenario > > demanding this change. > > Ok Jarek, after some more consideration I agree with you. > Removing this kind of sharing would be unwise, ho hum... > > While pondering over this I thought about why we even need > frag lists at all. We need them for two reasons: > > 1) Because we have an inability to turn kmalloc data references into > page based ones easily. > > We've run into this sort of problem with socket splice() too. > > 2) For recording the segmentation points of a fragmented packet. > > We definitely don't use frag lists for performance, if you look > in the GRO history the GRO code specifically has been changed > to prefer accumulating into the page vector whenever possible > because using frag lists is a lot slower. Yes, and GRO made frag lists more popular btw. > Anyways, even if we somehow solved #1 we'd still have a lot of > code (even bluetooth) using it for the sake of issue #2. Since there exist drivers using entirely paged skbs and napi_gro_frags path I can't see why #1 or #2 could be unsolvable elsewhere. > So for the time being there is no clear path for trying to > eliminate frag lists altogether if we wanted to do that either. Probably we could start from enhacing moving drivers to paged skbs where possible. And maybe simplifying the skb model by not allowing frags and frag lists together? Btw, I wonder what is the exact reason we can't use only NET_SKBUFF_DATA_USES_OFFSET? Jarek P. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-20 7:21 ` Jarek Poplawski @ 2010-09-20 9:02 ` Eric Dumazet 2010-09-20 9:14 ` Jarek Poplawski 2010-09-20 16:59 ` David Miller 1 sibling, 1 reply; 31+ messages in thread From: Eric Dumazet @ 2010-09-20 9:02 UTC (permalink / raw) To: Jarek Poplawski; +Cc: David Miller, netdev Le lundi 20 septembre 2010 à 07:21 +0000, Jarek Poplawski a écrit : > Probably we could start from enhacing moving drivers to paged skbs > where possible. And maybe simplifying the skb model by not allowing > frags and frag lists together? > Sure. I believe current model, pre-allocating skb in huge tx rings is a waste of mem bandwidth anyway. (I am refering to the struct sk_buff itself, not the payload part) Of course some drivers are doing it right, using netdev_alloc_skb() right before feeding this skb to network stack, not an old one. > Btw, I wonder what is the exact reason we can't use only > NET_SKBUFF_DATA_USES_OFFSET? I see no real reason. On 32bit arches, it might be faster to manipulate pointers, and not 'base+offset' values. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-20 9:02 ` Eric Dumazet @ 2010-09-20 9:14 ` Jarek Poplawski 2010-09-20 12:12 ` Jarek Poplawski 0 siblings, 1 reply; 31+ messages in thread From: Jarek Poplawski @ 2010-09-20 9:14 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev On Mon, Sep 20, 2010 at 11:02:00AM +0200, Eric Dumazet wrote: > Le lundi 20 septembre 2010 ?? 07:21 +0000, Jarek Poplawski a écrit : > > > Probably we could start from enhacing moving drivers to paged skbs > > where possible. And maybe simplifying the skb model by not allowing > > frags and frag lists together? > > > > Sure. I believe current model, pre-allocating skb in huge tx rings is a > waste of mem bandwidth anyway. (I am refering to the struct sk_buff > itself, not the payload part) > > Of course some drivers are doing it right, using netdev_alloc_skb() > right before feeding this skb to network stack, not an old one. > > > Btw, I wonder what is the exact reason we can't use only > > NET_SKBUFF_DATA_USES_OFFSET? > > I see no real reason. > > On 32bit arches, it might be faster to manipulate pointers, and not > 'base+offset' values. The only reason is code readability and maintenance. Did anybody check how much faster? Jarek P. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-20 9:14 ` Jarek Poplawski @ 2010-09-20 12:12 ` Jarek Poplawski 2010-09-20 12:40 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: Jarek Poplawski @ 2010-09-20 12:12 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev On Mon, Sep 20, 2010 at 09:14:25AM +0000, Jarek Poplawski wrote: > On Mon, Sep 20, 2010 at 11:02:00AM +0200, Eric Dumazet wrote: > > Le lundi 20 septembre 2010 ?? 07:21 +0000, Jarek Poplawski a écrit : > > > Btw, I wonder what is the exact reason we can't use only > > > NET_SKBUFF_DATA_USES_OFFSET? > > > > I see no real reason. > > > > On 32bit arches, it might be faster to manipulate pointers, and not > > 'base+offset' values. > > The only reason is code readability and maintenance. Did anybody check > how much faster? Hmm... I probably misread your reasoning. So, if no real reason, really, how about turning this NET_SKBUFF_DATA_USES_OFFSET on unconditionally in net-next, until somebody spots the difference? Jarek P. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-20 12:12 ` Jarek Poplawski @ 2010-09-20 12:40 ` Eric Dumazet 0 siblings, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2010-09-20 12:40 UTC (permalink / raw) To: Jarek Poplawski; +Cc: David Miller, netdev Le lundi 20 septembre 2010 à 12:12 +0000, Jarek Poplawski a écrit : > Hmm... I probably misread your reasoning. So, if no real reason, > really, how about turning this NET_SKBUFF_DATA_USES_OFFSET on > unconditionally in net-next, until somebody spots the difference? Yes, but most developpers use 64bit kernels anyway, I suspect nobody will ever notice :( Here (with a typical config), here is the vmlinux size before and after this patch : $ size vmlinux.old text data bss dec hex filename 6061748 640208 7285056 13987012 d56cc4 vmlinux.old $ size vmlinux text data bss dec hex filename 6070420 640208 7285056 13995684 d58ea4 vmlinux Thats 8672 bytes of text increase. (1330326 instructions instead of 1328472 -> 1854 more instructions) ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-20 7:21 ` Jarek Poplawski 2010-09-20 9:02 ` Eric Dumazet @ 2010-09-20 16:59 ` David Miller 1 sibling, 0 replies; 31+ messages in thread From: David Miller @ 2010-09-20 16:59 UTC (permalink / raw) To: jarkao2; +Cc: eric.dumazet, netdev From: Jarek Poplawski <jarkao2@gmail.com> Date: Mon, 20 Sep 2010 07:21:49 +0000 > On Sun, Sep 19, 2010 at 05:17:25PM -0700, David Miller wrote: >> We definitely don't use frag lists for performance, if you look >> in the GRO history the GRO code specifically has been changed >> to prefer accumulating into the page vector whenever possible >> because using frag lists is a lot slower. > > Yes, and GRO made frag lists more popular btw. Yes. However, realize that this is only really a legacy from inet_lro. These drivers could restructure themselves to operate on pages. I am pretty sure the hardware would be amicable to this and it would likely improve performance just as favoring page vector accumulation did for GRO. > Btw, I wonder what is the exact reason we can't use only > NET_SKBUFF_DATA_USES_OFFSET? As Eric explained, and then demonstrated, it generates a non-trivial increased amount of code. This is almost certainly why Arnaldo didn't make it unconditional back when he implemented the SKB data offset changes for 64-bit. In my opinion, this duality of SKB pointer handling never causes real problems because any change mistakenly accessing the pointers directly usually gets caught by the first person who builds it on 64-bit :-) ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization 2010-09-03 9:09 ` [PATCH net-next-2.6] net: pskb_expand_head() optimization Eric Dumazet 2010-09-03 13:46 ` David Miller @ 2010-09-07 1:25 ` David Miller 1 sibling, 0 replies; 31+ messages in thread From: David Miller @ 2010-09-07 1:25 UTC (permalink / raw) To: eric.dumazet; +Cc: netdev From: Eric Dumazet <eric.dumazet@gmail.com> Date: Fri, 03 Sep 2010 11:09:32 +0200 > [PATCH net-next-2.6] net: pskb_expand_head() optimization > > pskb_expand_head() blindly takes references on fragments before calling > skb_release_data(), potentially releasing these references. > > We can add a fast path, avoiding these atomic operations, if we own the > last reference on skb->head. > > Based on a previous patch from David > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Eric, I ended up applying this patch after all, thanks! ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2010-09-20 16:59 UTC | newest] Thread overview: 31+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-09-03 3:43 [PATCH] net: Frag list lost on head expansion David Miller 2010-09-03 5:48 ` Eric Dumazet 2010-09-03 6:19 ` Eric Dumazet 2010-09-03 9:09 ` [PATCH net-next-2.6] net: pskb_expand_head() optimization Eric Dumazet 2010-09-03 13:46 ` David Miller 2010-09-07 2:20 ` David Miller 2010-09-07 5:02 ` Eric Dumazet 2010-09-07 5:05 ` David Miller 2010-09-07 9:16 ` Jarek Poplawski 2010-09-07 9:37 ` Eric Dumazet 2010-09-10 19:54 ` David Miller 2010-09-11 12:31 ` Jarek Poplawski 2010-09-12 3:30 ` David Miller 2010-09-12 10:45 ` Jarek Poplawski 2010-09-12 10:58 ` Jarek Poplawski 2010-09-12 15:58 ` David Miller 2010-09-12 16:13 ` David Miller 2010-09-12 20:57 ` Jarek Poplawski 2010-09-12 22:08 ` David Miller 2010-09-13 7:49 ` Jarek Poplawski 2010-09-12 19:55 ` Ben Pfaff 2010-09-12 20:24 ` David Miller 2010-09-12 20:45 ` Jarek Poplawski 2010-09-20 0:17 ` David Miller 2010-09-20 7:21 ` Jarek Poplawski 2010-09-20 9:02 ` Eric Dumazet 2010-09-20 9:14 ` Jarek Poplawski 2010-09-20 12:12 ` Jarek Poplawski 2010-09-20 12:40 ` Eric Dumazet 2010-09-20 16:59 ` David Miller 2010-09-07 1:25 ` David Miller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).