* [PATCH 0/2] Cleanups for head_frag usage and tcp_try_coalese @ 2012-05-03 3:38 Alexander Duyck 2012-05-03 3:38 ` [PATCH 1/2] net: Stop decapitating clones that have a head_frag Alexander Duyck 2012-05-03 3:39 ` [PATCH 2/2] tcp: cleanup tcp_try_coalesce Alexander Duyck 0 siblings, 2 replies; 15+ messages in thread From: Alexander Duyck @ 2012-05-03 3:38 UTC (permalink / raw) To: netdev; +Cc: davem I wrote up these patches earlier today to address issues related to the splitting of reference counting into two seperate counts in the case of skbs containing a head frag. In addition I have a secondary patch for tcp_try_coalesce which is meant to be a general cleanup as I found it difficult to read through the code as it was. I have done some performance testing on the series with netperf and saw no appreciable difference after these changes for the non-cloned case. --- Alexander Duyck (2): tcp: cleanup tcp_try_coalesce net: Stop decapitating clones that have a head_frag net/core/skbuff.c | 2 + net/ipv4/tcp_input.c | 85 +++++++++++++++++++++++++------------------------- 2 files changed, 44 insertions(+), 43 deletions(-) -- ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/2] net: Stop decapitating clones that have a head_frag 2012-05-03 3:38 [PATCH 0/2] Cleanups for head_frag usage and tcp_try_coalese Alexander Duyck @ 2012-05-03 3:38 ` Alexander Duyck 2012-05-03 3:56 ` Eric Dumazet 2012-05-03 3:39 ` [PATCH 2/2] tcp: cleanup tcp_try_coalesce Alexander Duyck 1 sibling, 1 reply; 15+ messages in thread From: Alexander Duyck @ 2012-05-03 3:38 UTC (permalink / raw) To: netdev; +Cc: davem, Alexander Duyck, Eric Dumazet, Jeff Kirsher This change is meant ot prevent stealing the skb->head to use as a page in the event that the skb->head was cloned. This allows the other clones to track each other via shinfo->dataref. Without this we break down to two methods for tracking the reference count, one being dataref, the other being the page count. As a result it becomes difficult to track how many references there are to skb->head. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> --- net/core/skbuff.c | 2 +- net/ipv4/tcp_input.c | 9 ++------- 2 files changed, 3 insertions(+), 8 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 52ba2b5..8703754 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1699,7 +1699,7 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe, struct splice_pipe_desc *spd, struct sock *sk) { int seg; - bool head_is_linear = !skb->head_frag; + bool head_is_linear = !skb->head_frag || skb_cloned(skb); /* map the linear part : * If skb->head_frag is set, this 'linear' part is backed diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 2f696ef..c6f78e2 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4589,7 +4589,7 @@ copyfrags: to->data_len += len; goto merge; } - if (from->head_frag) { + if (from->head_frag && !skb_cloned(from)) { struct page *page; unsigned int offset; @@ -4599,12 +4599,7 @@ copyfrags: offset = from->data - (unsigned char *)page_address(page); skb_fill_page_desc(to, skb_shinfo(to)->nr_frags, page, offset, skb_headlen(from)); - - if (skb_cloned(from)) - get_page(page); - else - *fragstolen = true; - + *fragstolen = true; delta = len; /* we dont know real truesize... */ goto copyfrags; } ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] net: Stop decapitating clones that have a head_frag 2012-05-03 3:38 ` [PATCH 1/2] net: Stop decapitating clones that have a head_frag Alexander Duyck @ 2012-05-03 3:56 ` Eric Dumazet 0 siblings, 0 replies; 15+ messages in thread From: Eric Dumazet @ 2012-05-03 3:56 UTC (permalink / raw) To: Alexander Duyck; +Cc: netdev, davem, Eric Dumazet, Jeff Kirsher On Wed, 2012-05-02 at 20:38 -0700, Alexander Duyck wrote: > This change is meant ot prevent stealing the skb->head to use as a page in > the event that the skb->head was cloned. This allows the other clones to > track each other via shinfo->dataref. > > Without this we break down to two methods for tracking the reference count, > one being dataref, the other being the page count. As a result it becomes > difficult to track how many references there are to skb->head. > > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> > Cc: Eric Dumazet <edumazet@google.com> > Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> > --- > > net/core/skbuff.c | 2 +- > net/ipv4/tcp_input.c | 9 ++------- > 2 files changed, 3 insertions(+), 8 deletions(-) > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 52ba2b5..8703754 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -1699,7 +1699,7 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe, > struct splice_pipe_desc *spd, struct sock *sk) > { > int seg; > - bool head_is_linear = !skb->head_frag; > + bool head_is_linear = !skb->head_frag || skb_cloned(skb); > Can you chose a better name than head_is_linear maybe ? > /* map the linear part : > * If skb->head_frag is set, this 'linear' part is backed > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 2f696ef..c6f78e2 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -4589,7 +4589,7 @@ copyfrags: > to->data_len += len; > goto merge; > } > - if (from->head_frag) { > + if (from->head_frag && !skb_cloned(from)) { > struct page *page; > unsigned int offset; > > @@ -4599,12 +4599,7 @@ copyfrags: > offset = from->data - (unsigned char *)page_address(page); > skb_fill_page_desc(to, skb_shinfo(to)->nr_frags, > page, offset, skb_headlen(from)); > - > - if (skb_cloned(from)) > - get_page(page); > - else > - *fragstolen = true; > - > + *fragstolen = true; > delta = len; /* we dont know real truesize... */ > goto copyfrags; > } Thanks ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 2/2] tcp: cleanup tcp_try_coalesce 2012-05-03 3:38 [PATCH 0/2] Cleanups for head_frag usage and tcp_try_coalese Alexander Duyck 2012-05-03 3:38 ` [PATCH 1/2] net: Stop decapitating clones that have a head_frag Alexander Duyck @ 2012-05-03 3:39 ` Alexander Duyck 2012-05-03 4:06 ` Eric Dumazet 1 sibling, 1 reply; 15+ messages in thread From: Alexander Duyck @ 2012-05-03 3:39 UTC (permalink / raw) To: netdev; +Cc: davem, Alexander Duyck, Eric Dumazet, Jeff Kirsher This change is mostly meant to help improve the readability of tcp_try_coalesce. I had a few issues since there were several points when the code would test for a conditional, fail, then succeed on another conditional take some action, and then follow a goto back into the previous conditional. I just tore all of that apart and made the whole thing one linear flow with a single goto. Also there were multiple ways of computing the delta, the one for head_frag made the least amount of sense to me since we were only dropping the sk_buff so I have updated the logic for the stolen head case so that delta is only truesize - sizeof(skb_buff), and for the case where we are dropping the head as well it is truesize - SKB_TRUESIZE(skb_end_pointer - head). This way we can also account for the head_frag with headlen == 0. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> --- net/ipv4/tcp_input.c | 80 +++++++++++++++++++++++++++----------------------- 1 files changed, 43 insertions(+), 37 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c6f78e2..23bc3ff 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4548,62 +4548,68 @@ static bool tcp_try_coalesce(struct sock *sk, int i, delta, len = from->len; *fragstolen = false; + if (tcp_hdr(from)->fin || skb_cloned(to)) return false; + if (len <= skb_tailroom(to)) { BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len)); -merge: - NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE); - TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq; - TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq; - return true; + goto merge; } if (skb_has_frag_list(to) || skb_has_frag_list(from)) return false; - if (skb_headlen(from) == 0 && - (skb_shinfo(to)->nr_frags + - skb_shinfo(from)->nr_frags <= MAX_SKB_FRAGS)) { - WARN_ON_ONCE(from->head_frag); - delta = from->truesize - ksize(from->head) - - SKB_DATA_ALIGN(sizeof(struct sk_buff)); - - WARN_ON_ONCE(delta < len); -copyfrags: - memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags, - skb_shinfo(from)->frags, - skb_shinfo(from)->nr_frags * sizeof(skb_frag_t)); - skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags; - - if (skb_cloned(from)) - for (i = 0; i < skb_shinfo(from)->nr_frags; i++) - skb_frag_ref(from, i); - else - skb_shinfo(from)->nr_frags = 0; - - to->truesize += delta; - atomic_add(delta, &sk->sk_rmem_alloc); - sk_mem_charge(sk, delta); - to->len += len; - to->data_len += len; - goto merge; - } - if (from->head_frag && !skb_cloned(from)) { + if (skb_headlen(from) != 0) { struct page *page; unsigned int offset; - if (skb_shinfo(to)->nr_frags + skb_shinfo(from)->nr_frags >= MAX_SKB_FRAGS) + if (skb_shinfo(to)->nr_frags + + skb_shinfo(from)->nr_frags >= MAX_SKB_FRAGS) + return false; + + if (!from->head_frag || skb_cloned(from)) return false; + + delta = from->truesize - sizeof(struct sk_buff); + page = virt_to_head_page(from->head); offset = from->data - (unsigned char *)page_address(page); + skb_fill_page_desc(to, skb_shinfo(to)->nr_frags, page, offset, skb_headlen(from)); *fragstolen = true; - delta = len; /* we dont know real truesize... */ - goto copyfrags; + } else { + if (skb_shinfo(to)->nr_frags + + skb_shinfo(from)->nr_frags > MAX_SKB_FRAGS) + return false; + + delta = from->truesize - + SKB_TRUESIZE(skb_end_pointer(from) - from->head); } - return false; + + memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags, + skb_shinfo(from)->frags, + skb_shinfo(from)->nr_frags * sizeof(skb_frag_t)); + skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags; + + if (!skb_cloned(from)) + skb_shinfo(from)->nr_frags = 0; + + for (i = 0; i < skb_shinfo(from)->nr_frags; i++) + skb_frag_ref(from, i); + + to->truesize += delta; + atomic_add(delta, &sk->sk_rmem_alloc); + sk_mem_charge(sk, delta); + to->len += len; + to->data_len += len; + +merge: + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE); + TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq; + TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq; + return true; } static void kfree_skb_partial(struct sk_buff *skb, bool head_stolen) ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce 2012-05-03 3:39 ` [PATCH 2/2] tcp: cleanup tcp_try_coalesce Alexander Duyck @ 2012-05-03 4:06 ` Eric Dumazet 2012-05-03 4:58 ` Alexander Duyck 0 siblings, 1 reply; 15+ messages in thread From: Eric Dumazet @ 2012-05-03 4:06 UTC (permalink / raw) To: Alexander Duyck; +Cc: netdev, davem, Eric Dumazet, Jeff Kirsher On Wed, 2012-05-02 at 20:39 -0700, Alexander Duyck wrote: > This change is mostly meant to help improve the readability of > tcp_try_coalesce. I had a few issues since there were several points when > the code would test for a conditional, fail, then succeed on another > conditional take some action, and then follow a goto back into the previous > conditional. I just tore all of that apart and made the whole thing one > linear flow with a single goto. > > Also there were multiple ways of computing the delta, the one for head_frag > made the least amount of sense to me since we were only dropping the > sk_buff so I have updated the logic for the stolen head case so that delta > is only truesize - sizeof(skb_buff), and for the case where we are dropping > the head as well it is truesize - SKB_TRUESIZE(skb_end_pointer - head). > This way we can also account for the head_frag with headlen == 0. > > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> > Cc: Eric Dumazet <edumazet@google.com> > Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> > --- > Sorry I prefer you dont touch this code like this. The truesize bits must stay as is, since it'll track drivers that lies about truesize. > net/ipv4/tcp_input.c | 80 +++++++++++++++++++++++++++----------------------- > 1 files changed, 43 insertions(+), 37 deletions(-) > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index c6f78e2..23bc3ff 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -4548,62 +4548,68 @@ static bool tcp_try_coalesce(struct sock *sk, > int i, delta, len = from->len; > > *fragstolen = false; > + > if (tcp_hdr(from)->fin || skb_cloned(to)) > return false; > + > if (len <= skb_tailroom(to)) { > BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len)); > -merge: > - NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE); > - TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq; > - TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq; > - return true; > + goto merge; > } > > if (skb_has_frag_list(to) || skb_has_frag_list(from)) > return false; > > - if (skb_headlen(from) == 0 && > - (skb_shinfo(to)->nr_frags + > - skb_shinfo(from)->nr_frags <= MAX_SKB_FRAGS)) { > - WARN_ON_ONCE(from->head_frag); > - delta = from->truesize - ksize(from->head) - > - SKB_DATA_ALIGN(sizeof(struct sk_buff)); This delta was done on purpose. We want the ksize() > - > - WARN_ON_ONCE(delta < len); > -copyfrags: > - memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags, > - skb_shinfo(from)->frags, > - skb_shinfo(from)->nr_frags * sizeof(skb_frag_t)); > - skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags; > - > - if (skb_cloned(from)) > - for (i = 0; i < skb_shinfo(from)->nr_frags; i++) > - skb_frag_ref(from, i); > - else > - skb_shinfo(from)->nr_frags = 0; > - > - to->truesize += delta; > - atomic_add(delta, &sk->sk_rmem_alloc); > - sk_mem_charge(sk, delta); > - to->len += len; > - to->data_len += len; > - goto merge; > - } > - if (from->head_frag && !skb_cloned(from)) { > + if (skb_headlen(from) != 0) { > struct page *page; > unsigned int offset; > > - if (skb_shinfo(to)->nr_frags + skb_shinfo(from)->nr_frags >= MAX_SKB_FRAGS) > + if (skb_shinfo(to)->nr_frags + > + skb_shinfo(from)->nr_frags >= MAX_SKB_FRAGS) > + return false; > + > + if (!from->head_frag || skb_cloned(from)) > return false; > + > + delta = from->truesize - sizeof(struct sk_buff); > + > page = virt_to_head_page(from->head); > offset = from->data - (unsigned char *)page_address(page); > + > skb_fill_page_desc(to, skb_shinfo(to)->nr_frags, > page, offset, skb_headlen(from)); > *fragstolen = true; > - delta = len; /* we dont know real truesize... */ > - goto copyfrags; > + } else { > + if (skb_shinfo(to)->nr_frags + > + skb_shinfo(from)->nr_frags > MAX_SKB_FRAGS) > + return false; > + > + delta = from->truesize - > + SKB_TRUESIZE(skb_end_pointer(from) - from->head); No... SKB_TRUESIZE() doesnt account of power-of-two roundings in kmalloc() > } > - return false; > + > + memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags, > + skb_shinfo(from)->frags, > + skb_shinfo(from)->nr_frags * sizeof(skb_frag_t)); > + skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags; > + > + if (!skb_cloned(from)) > + skb_shinfo(from)->nr_frags = 0; > + You break the code here... we had an else. > + for (i = 0; i < skb_shinfo(from)->nr_frags; i++) > + skb_frag_ref(from, i); > + > + to->truesize += delta; > + atomic_add(delta, &sk->sk_rmem_alloc); > + sk_mem_charge(sk, delta); > + to->len += len; > + to->data_len += len; > + > +merge: > + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE); > + TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq; > + TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq; > + return true; > } > > static void kfree_skb_partial(struct sk_buff *skb, bool head_stolen) > Really this patch is too hard to review. Thanks ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce 2012-05-03 4:06 ` Eric Dumazet @ 2012-05-03 4:58 ` Alexander Duyck 2012-05-03 5:19 ` Eric Dumazet 0 siblings, 1 reply; 15+ messages in thread From: Alexander Duyck @ 2012-05-03 4:58 UTC (permalink / raw) To: Eric Dumazet; +Cc: Alexander Duyck, netdev, davem, Eric Dumazet, Jeff Kirsher On 5/2/2012 9:06 PM, Eric Dumazet wrote: > On Wed, 2012-05-02 at 20:39 -0700, Alexander Duyck wrote: >> This change is mostly meant to help improve the readability of >> tcp_try_coalesce. I had a few issues since there were several points when >> the code would test for a conditional, fail, then succeed on another >> conditional take some action, and then follow a goto back into the previous >> conditional. I just tore all of that apart and made the whole thing one >> linear flow with a single goto. >> >> Also there were multiple ways of computing the delta, the one for head_frag >> made the least amount of sense to me since we were only dropping the >> sk_buff so I have updated the logic for the stolen head case so that delta >> is only truesize - sizeof(skb_buff), and for the case where we are dropping >> the head as well it is truesize - SKB_TRUESIZE(skb_end_pointer - head). >> This way we can also account for the head_frag with headlen == 0. >> >> Signed-off-by: Alexander Duyck<alexander.h.duyck@intel.com> >> Cc: Eric Dumazet<edumazet@google.com> >> Cc: Jeff Kirsher<jeffrey.t.kirsher@intel.com> >> --- >> > Sorry I prefer you dont touch this code like this. > > The truesize bits must stay as is, since it'll track drivers that lies > about truesize. I can understand that. But from what I can tell the only way they can lie about truesize is to actually change the value itself. The code I modified was just tightening things up so we didn't do things like use the length to track the truesize like we were in the original code. From what I can tell the new code should have been more accurate. >> net/ipv4/tcp_input.c | 80 +++++++++++++++++++++++++++----------------------- >> 1 files changed, 43 insertions(+), 37 deletions(-) >> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c >> index c6f78e2..23bc3ff 100644 >> --- a/net/ipv4/tcp_input.c >> +++ b/net/ipv4/tcp_input.c >> @@ -4548,62 +4548,68 @@ static bool tcp_try_coalesce(struct sock *sk, >> int i, delta, len = from->len; >> >> *fragstolen = false; >> + >> if (tcp_hdr(from)->fin || skb_cloned(to)) >> return false; >> + >> if (len<= skb_tailroom(to)) { >> BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len)); >> -merge: >> - NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE); >> - TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq; >> - TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq; >> - return true; >> + goto merge; >> } >> >> if (skb_has_frag_list(to) || skb_has_frag_list(from)) >> return false; >> >> - if (skb_headlen(from) == 0&& >> - (skb_shinfo(to)->nr_frags + >> - skb_shinfo(from)->nr_frags<= MAX_SKB_FRAGS)) { >> - WARN_ON_ONCE(from->head_frag); >> - delta = from->truesize - ksize(from->head) - >> - SKB_DATA_ALIGN(sizeof(struct sk_buff)); > > This delta was done on purpose. We want the ksize() The question I have is how can you get into a case where the ksize is different from the end offset plus the aligned size of skb_shared_info? >From what I can tell it looks like the only place we can lie is if we use build_skb with the frag_size option, and in that case we are using a page, not kmalloc memory. Otherwise in all other cases __alloc_skb or build_skb is using ksize(skb->head) - SKB_DATA_ALIGN(struct skb_shared_info) to set the end pointer, so reversing that should give us the same value as ksize(skb->head). >> - >> - WARN_ON_ONCE(delta< len); >> -copyfrags: >> - memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags, >> - skb_shinfo(from)->frags, >> - skb_shinfo(from)->nr_frags * sizeof(skb_frag_t)); >> - skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags; >> - >> - if (skb_cloned(from)) >> - for (i = 0; i< skb_shinfo(from)->nr_frags; i++) >> - skb_frag_ref(from, i); >> - else >> - skb_shinfo(from)->nr_frags = 0; >> - >> - to->truesize += delta; >> - atomic_add(delta,&sk->sk_rmem_alloc); >> - sk_mem_charge(sk, delta); >> - to->len += len; >> - to->data_len += len; >> - goto merge; >> - } >> - if (from->head_frag&& !skb_cloned(from)) { >> + if (skb_headlen(from) != 0) { >> struct page *page; >> unsigned int offset; >> >> - if (skb_shinfo(to)->nr_frags + skb_shinfo(from)->nr_frags>= MAX_SKB_FRAGS) >> + if (skb_shinfo(to)->nr_frags + >> + skb_shinfo(from)->nr_frags>= MAX_SKB_FRAGS) >> + return false; >> + >> + if (!from->head_frag || skb_cloned(from)) >> return false; >> + >> + delta = from->truesize - sizeof(struct sk_buff); >> It looks like you were thinking of something here since the quote lines were broken. This was the piece I saw in the original code that caught my eye as probably being wrong. We are letting packets using head_frag just report length as the truesize. This is why I favored using the end offset. In the case of us using the head_frag we should only be deducting SKB_DATA_ALIGN(sizeof(struct sk_buff)) (I just relized the line above is incorrect), and in the case of us dropping the head_frag as well we should be able to also deduct the size of the shared_info and everything up to the offset of end. >> + >> page = virt_to_head_page(from->head); >> offset = from->data - (unsigned char *)page_address(page); >> + >> skb_fill_page_desc(to, skb_shinfo(to)->nr_frags, >> page, offset, skb_headlen(from)); >> *fragstolen = true; >> - delta = len; /* we dont know real truesize... */ >> - goto copyfrags; >> + } else { >> + if (skb_shinfo(to)->nr_frags + >> + skb_shinfo(from)->nr_frags> MAX_SKB_FRAGS) >> + return false; >> + >> + delta = from->truesize - >> + SKB_TRUESIZE(skb_end_pointer(from) - from->head); > No... SKB_TRUESIZE() doesnt account of power-of-two roundings in > kmalloc() You used ksize in alloc_skb or build_skb to set the end pointer. All I am doing by using the end pointer is getting the value you had already extracted. The end pointer should be unmovable once it is set so I wouldn't think it would be an issue to use it to compute the truesize for the section including the sk_buff and skb->head. >> } >> - return false; >> + >> + memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags, >> + skb_shinfo(from)->frags, >> + skb_shinfo(from)->nr_frags * sizeof(skb_frag_t)); >> + skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags; >> + >> + if (!skb_cloned(from)) >> + skb_shinfo(from)->nr_frags = 0; >> + > You break the code here... we had an else. Yes and no. I just realized that I didn't need the else because the for loop below does nothing if nr_frags is 0. I suspect gcc is probably smart enough to just skip the loop if the skb is cloned. I should probably have added a one line comment explaining that above the loop. >> + for (i = 0; i< skb_shinfo(from)->nr_frags; i++) >> + skb_frag_ref(from, i); >> + >> + to->truesize += delta; >> + atomic_add(delta,&sk->sk_rmem_alloc); >> + sk_mem_charge(sk, delta); >> + to->len += len; >> + to->data_len += len; >> + >> +merge: >> + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE); >> + TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq; >> + TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq; >> + return true; >> } >> >> static void kfree_skb_partial(struct sk_buff *skb, bool head_stolen) >> > Really this patch is too hard to review. Yeah, it is a bit difficult. After Dave pulled your patches it ended up pushing most of the changes into this one. I'll see if I can break this back up into smaller bits. I just needed to flush the patches I had so I could get some feedback and stop talking in circles about the other patch. Thanks, Alex ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce 2012-05-03 4:58 ` Alexander Duyck @ 2012-05-03 5:19 ` Eric Dumazet 2012-05-03 5:25 ` David Miller 2012-05-03 5:41 ` Alexander Duyck 0 siblings, 2 replies; 15+ messages in thread From: Eric Dumazet @ 2012-05-03 5:19 UTC (permalink / raw) To: Alexander Duyck Cc: Alexander Duyck, netdev, davem, Eric Dumazet, Jeff Kirsher On Wed, 2012-05-02 at 21:58 -0700, Alexander Duyck wrote: > The question I have is how can you get into a case where the ksize is > different from the end offset plus the aligned size of skb_shared_info? > From what I can tell it looks like the only place we can lie is if we > use build_skb with the frag_size option, and in that case we are using a > page, not kmalloc memory. Otherwise in all other cases __alloc_skb or > build_skb is using ksize(skb->head) - SKB_DATA_ALIGN(struct > skb_shared_info) to set the end pointer, so reversing that should give > us the same value as ksize(skb->head). Right after skb is allocated (build_skb() or other skb_alloc... variants), truesize is correct by construction. Then drivers add fragments and can make truesize smaller than reality. And Intel drivers are known to abuse truesize. My last patch against iwlwifi is still waiting to make its way into official tree. http://www.spinics.net/lists/netdev/msg192629.html ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce 2012-05-03 5:19 ` Eric Dumazet @ 2012-05-03 5:25 ` David Miller [not found] ` <20120503.012502.44731688706812861.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> 2012-05-03 5:41 ` Alexander Duyck 1 sibling, 1 reply; 15+ messages in thread From: David Miller @ 2012-05-03 5:25 UTC (permalink / raw) To: eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w Cc: alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w, netdev-u79uwXL29TY76Z2rM5mHXA, edumazet-hpIqsD4AKlfQT0dZR+AlfA, jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w, linville-2XuSBdqkA4R54TAoqtyWWQ, linux-wireless-u79uwXL29TY76Z2rM5mHXA From: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Date: Thu, 03 May 2012 07:19:33 +0200 > My last patch against iwlwifi is still waiting to make its way into > official tree. > > http://www.spinics.net/lists/netdev/msg192629.html John, please rectify this situation. The Intel Wireless folks said they would test it, but that was more than a month ago. It's not acceptable to let bug fixes rot for that long, I don't care what their special internal testing procedure is. If they give you further pushback, please just ignore them and apply Eric's fix directly. Thank you. -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <20120503.012502.44731688706812861.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>]
* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce [not found] ` <20120503.012502.44731688706812861.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> @ 2012-05-03 15:14 ` John W. Linville 2012-05-03 15:24 ` Guy, Wey-Yi 0 siblings, 1 reply; 15+ messages in thread From: John W. Linville @ 2012-05-03 15:14 UTC (permalink / raw) To: David Miller Cc: eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w, netdev-u79uwXL29TY76Z2rM5mHXA, edumazet-hpIqsD4AKlfQT0dZR+AlfA, jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w, linux-wireless-u79uwXL29TY76Z2rM5mHXA, wey-yi.w.guy-ral2JQCrhuEAvxtiuMwx3w On Thu, May 03, 2012 at 01:25:02AM -0400, David Miller wrote: > From: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> > Date: Thu, 03 May 2012 07:19:33 +0200 > > > My last patch against iwlwifi is still waiting to make its way into > > official tree. > > > > http://www.spinics.net/lists/netdev/msg192629.html > > John, please rectify this situation. > > The Intel Wireless folks said they would test it, but that was more > than a month ago. > > It's not acceptable to let bug fixes rot for that long, I don't care > what their special internal testing procedure is. > > If they give you further pushback, please just ignore them and apply > Eric's fix directly. > > Thank you. I imagine that this somehow got lost in the shuffle during the merge window. That doesn't excuse it, of course. It has waited long enough already, so I'll just go ahead and take it. John -- John W. Linville Someday the world will need a hero, and you linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org might be all we have. Be ready. -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce 2012-05-03 15:14 ` John W. Linville @ 2012-05-03 15:24 ` Guy, Wey-Yi 2012-05-03 17:07 ` John W. Linville 0 siblings, 1 reply; 15+ messages in thread From: Guy, Wey-Yi @ 2012-05-03 15:24 UTC (permalink / raw) To: John W. Linville Cc: David Miller, eric.dumazet, alexander.duyck, alexander.h.duyck, netdev, edumazet, jeffrey.t.kirsher, linux-wireless Hi John, On Thu, 2012-05-03 at 11:14 -0400, John W. Linville wrote: > On Thu, May 03, 2012 at 01:25:02AM -0400, David Miller wrote: > > From: Eric Dumazet <eric.dumazet@gmail.com> > > Date: Thu, 03 May 2012 07:19:33 +0200 > > > > > My last patch against iwlwifi is still waiting to make its way into > > > official tree. > > > > > > http://www.spinics.net/lists/netdev/msg192629.html > > > > John, please rectify this situation. > > > > The Intel Wireless folks said they would test it, but that was more > > than a month ago. > > > > It's not acceptable to let bug fixes rot for that long, I don't care > > what their special internal testing procedure is. > > > > If they give you further pushback, please just ignore them and apply > > Eric's fix directly. > > > > Thank you. > > I imagine that this somehow got lost in the shuffle during the > merge window. That doesn't excuse it, of course. > > It has waited long enough already, so I'll just go ahead and take it. > it is my mistake to lost this patch during the iwlwifi re-factor work, the patch is no longer apply and I ask Eric to rebase the patch. Sorry again for the mistake Thanks Wey ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce 2012-05-03 15:24 ` Guy, Wey-Yi @ 2012-05-03 17:07 ` John W. Linville [not found] ` <20120503170727.GM9285-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: John W. Linville @ 2012-05-03 17:07 UTC (permalink / raw) To: Guy, Wey-Yi Cc: David Miller, eric.dumazet, alexander.duyck, alexander.h.duyck, netdev, edumazet, jeffrey.t.kirsher, linux-wireless, Stephen Rothwell On Thu, May 03, 2012 at 08:24:19AM -0700, Guy, Wey-Yi wrote: > Hi John, > > On Thu, 2012-05-03 at 11:14 -0400, John W. Linville wrote: > > On Thu, May 03, 2012 at 01:25:02AM -0400, David Miller wrote: > > > From: Eric Dumazet <eric.dumazet@gmail.com> > > > Date: Thu, 03 May 2012 07:19:33 +0200 > > > > > > > My last patch against iwlwifi is still waiting to make its way into > > > > official tree. > > > > > > > > http://www.spinics.net/lists/netdev/msg192629.html > > > > > > John, please rectify this situation. > > > > > > The Intel Wireless folks said they would test it, but that was more > > > than a month ago. > > > > > > It's not acceptable to let bug fixes rot for that long, I don't care > > > what their special internal testing procedure is. > > > > > > If they give you further pushback, please just ignore them and apply > > > Eric's fix directly. > > > > > > Thank you. > > > > I imagine that this somehow got lost in the shuffle during the > > merge window. That doesn't excuse it, of course. > > > > It has waited long enough already, so I'll just go ahead and take it. > > > it is my mistake to lost this patch during the iwlwifi re-factor work, > the patch is no longer apply and I ask Eric to rebase the patch. > > Sorry again for the mistake Well, it seems like a fix needed for 3.4. And, the patch applies there. It does cause some merge breakage in wireless-testing (and presumably in linux-next). I'll attach the commit diff for the wireless-testing merge fixup I did, for review and/or as a peace offering to sfr... :-) Please take a look at the result in wireless-testing and let me know if it is broken...thanks! John --- commit 22a101d22ad3296b55d87e92c4a94548aaf6f595 Merge: 20ef730 ed90542 Author: John W. Linville <linville@tuxdriver.com> Date: Thu May 3 12:58:38 2012 -0400 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless Conflicts: drivers/net/wireless/iwlwifi/iwl-agn-rx.c drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c drivers/net/wireless/iwlwifi/iwl-trans.h diff --cc drivers/net/wireless/iwlwifi/iwl-agn-rx.c index f941223,2247460..ff758fe --- a/drivers/net/wireless/iwlwifi/iwl-agn-rx.c +++ b/drivers/net/wireless/iwlwifi/iwl-agn-rx.c @@@ -752,15 -787,25 +751,25 @@@ static void iwlagn_pass_packet_to_mac80 iwlagn_set_decrypted_flag(priv, hdr, ampdu_status, stats)) return; - skb = dev_alloc_skb(128); + /* Dont use dev_alloc_skb(), we'll have enough headroom once + * ieee80211_hdr pulled. + */ + skb = alloc_skb(128, GFP_ATOMIC); if (!skb) { - IWL_ERR(priv, "dev_alloc_skb failed\n"); + IWL_ERR(priv, "alloc_skb failed\n"); return; } + hdrlen = min_t(unsigned int, len, skb_tailroom(skb)); + memcpy(skb_put(skb, hdrlen), hdr, hdrlen); + fraglen = len - hdrlen; + + if (fraglen) { - int offset = (void *)hdr + hdrlen - rxb_addr(rxb); ++ int offset = (void *)hdr + hdrlen - rxb_addr(rxb) ++ + rxb_offset(rxb); - offset = (void *)hdr - rxb_addr(rxb) + rxb_offset(rxb); - p = rxb_steal_page(rxb); - skb_add_rx_frag(skb, 0, p, offset, len, len); + skb_add_rx_frag(skb, 0, rxb_steal_page(rxb), offset, + fraglen, rxb->truesize); + } - iwl_update_stats(priv, false, fc, len); /* * Wake any queues that were stopped due to a passive channel tx diff --cc drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c index d2239aa,aa7aea1..08517d3 --- a/drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c +++ b/drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c @@@ -373,89 -374,72 +373,90 @@@ static void iwl_rx_handle_rxbuf(struct if (WARN_ON(!rxb)) return; - rxcb.truesize = PAGE_SIZE << hw_params(trans).rx_page_order; - dma_unmap_page(trans->dev, rxb->page_dma, - rxcb.truesize, - DMA_FROM_DEVICE); - - rxcb._page = rxb->page; - pkt = rxb_addr(&rxcb); - - IWL_DEBUG_RX(trans, "%s, 0x%02x\n", - get_cmd_string(pkt->hdr.cmd), pkt->hdr.cmd); + dma_unmap_page(trans->dev, rxb->page_dma, max_len, DMA_FROM_DEVICE); + while (offset + sizeof(u32) + sizeof(struct iwl_cmd_header) < max_len) { + struct iwl_rx_packet *pkt; + struct iwl_device_cmd *cmd; + u16 sequence; + bool reclaim; + int index, cmd_index, err, len; + struct iwl_rx_cmd_buffer rxcb = { + ._offset = offset, + ._page = rxb->page, + ._page_stolen = false, ++ .truesize = max_len, + }; - len = le32_to_cpu(pkt->len_n_flags) & FH_RSCSR_FRAME_SIZE_MSK; - len += sizeof(u32); /* account for status word */ - trace_iwlwifi_dev_rx(trans->dev, pkt, len); + pkt = rxb_addr(&rxcb); - /* Reclaim a command buffer only if this packet is a response - * to a (driver-originated) command. - * If the packet (e.g. Rx frame) originated from uCode, - * there is no command buffer to reclaim. - * Ucode should set SEQ_RX_FRAME bit if ucode-originated, - * but apparently a few don't get set; catch them here. */ - reclaim = !(pkt->hdr.sequence & SEQ_RX_FRAME); - if (reclaim) { - int i; + if (pkt->len_n_flags == cpu_to_le32(FH_RSCSR_FRAME_INVALID)) + break; - for (i = 0; i < trans_pcie->n_no_reclaim_cmds; i++) { - if (trans_pcie->no_reclaim_cmds[i] == pkt->hdr.cmd) { - reclaim = false; - break; + IWL_DEBUG_RX(trans, "cmd at offset %d: %s (0x%.2x)\n", + rxcb._offset, + trans_pcie_get_cmd_string(trans_pcie, pkt->hdr.cmd), + pkt->hdr.cmd); + + len = le32_to_cpu(pkt->len_n_flags) & FH_RSCSR_FRAME_SIZE_MSK; + len += sizeof(u32); /* account for status word */ + trace_iwlwifi_dev_rx(trans->dev, pkt, len); + + /* Reclaim a command buffer only if this packet is a response + * to a (driver-originated) command. + * If the packet (e.g. Rx frame) originated from uCode, + * there is no command buffer to reclaim. + * Ucode should set SEQ_RX_FRAME bit if ucode-originated, + * but apparently a few don't get set; catch them here. */ + reclaim = !(pkt->hdr.sequence & SEQ_RX_FRAME); + if (reclaim) { + int i; + + for (i = 0; i < trans_pcie->n_no_reclaim_cmds; i++) { + if (trans_pcie->no_reclaim_cmds[i] == + pkt->hdr.cmd) { + reclaim = false; + break; + } } } - } - sequence = le16_to_cpu(pkt->hdr.sequence); - index = SEQ_TO_INDEX(sequence); - cmd_index = get_cmd_index(&txq->q, index); + sequence = le16_to_cpu(pkt->hdr.sequence); + index = SEQ_TO_INDEX(sequence); + cmd_index = get_cmd_index(&txq->q, index); - if (reclaim) - cmd = txq->cmd[cmd_index]; - else - cmd = NULL; + if (reclaim) + cmd = txq->entries[cmd_index].cmd; + else + cmd = NULL; - err = iwl_op_mode_rx(trans->op_mode, &rxcb, cmd); + err = iwl_op_mode_rx(trans->op_mode, &rxcb, cmd); - /* - * XXX: After here, we should always check rxcb._page - * against NULL before touching it or its virtual - * memory (pkt). Because some rx_handler might have - * already taken or freed the pages. - */ + /* + * After here, we should always check rxcb._page_stolen, + * if it is true then one of the handlers took the page. + */ - if (reclaim) { - /* Invoke any callbacks, transfer the buffer to caller, - * and fire off the (possibly) blocking - * iwl_trans_send_cmd() - * as we reclaim the driver command queue */ - if (rxcb._page) - iwl_tx_cmd_complete(trans, &rxcb, err); - else - IWL_WARN(trans, "Claim null rxb?\n"); + if (reclaim) { + /* Invoke any callbacks, transfer the buffer to caller, + * and fire off the (possibly) blocking + * iwl_trans_send_cmd() + * as we reclaim the driver command queue */ + if (!rxcb._page_stolen) + iwl_tx_cmd_complete(trans, &rxcb, err); + else + IWL_WARN(trans, "Claim null rxb?\n"); + } + + page_stolen |= rxcb._page_stolen; + offset += ALIGN(len, FH_RSCSR_FRAME_ALIGN); } - /* page was stolen from us */ - if (rxcb._page == NULL) + /* page was stolen from us -- free our reference */ + if (page_stolen) { + __free_pages(rxb->page, trans_pcie->rx_page_order); rxb->page = NULL; + } /* Reuse the page if possible. For notification packets and * SKBs that fail to Rx correctly, add them back into the diff --cc drivers/net/wireless/iwlwifi/iwl-trans.h index 7018d31,fdf9788..79a1e7a --- a/drivers/net/wireless/iwlwifi/iwl-trans.h +++ b/drivers/net/wireless/iwlwifi/iwl-trans.h @@@ -256,8 -260,7 +256,9 @@@ static inline void iwl_free_resp(struc struct iwl_rx_cmd_buffer { struct page *_page; + int _offset; + bool _page_stolen; + unsigned int truesize; }; static inline void *rxb_addr(struct iwl_rx_cmd_buffer *r) -- John W. Linville Someday the world will need a hero, and you linville@tuxdriver.com might be all we have. Be ready. ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <20120503170727.GM9285-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>]
* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce [not found] ` <20120503170727.GM9285-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org> @ 2012-05-03 20:21 ` Guy, Wey-Yi 0 siblings, 0 replies; 15+ messages in thread From: Guy, Wey-Yi @ 2012-05-03 20:21 UTC (permalink / raw) To: John W. Linville Cc: David Miller, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w, alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w, alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w, netdev-u79uwXL29TY76Z2rM5mHXA, edumazet-hpIqsD4AKlfQT0dZR+AlfA, jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w, linux-wireless-u79uwXL29TY76Z2rM5mHXA, Stephen Rothwell Hi John, On Thu, 2012-05-03 at 13:07 -0400, John W. Linville wrote: > On Thu, May 03, 2012 at 08:24:19AM -0700, Guy, Wey-Yi wrote: > > Hi John, > > > > On Thu, 2012-05-03 at 11:14 -0400, John W. Linville wrote: > > > On Thu, May 03, 2012 at 01:25:02AM -0400, David Miller wrote: > > > > From: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> > > > > Date: Thu, 03 May 2012 07:19:33 +0200 > > > > > > > > > My last patch against iwlwifi is still waiting to make its way into > > > > > official tree. > > > > > > > > > > http://www.spinics.net/lists/netdev/msg192629.html > > > > > > > > John, please rectify this situation. > > > > > > > > The Intel Wireless folks said they would test it, but that was more > > > > than a month ago. > > > > > > > > It's not acceptable to let bug fixes rot for that long, I don't care > > > > what their special internal testing procedure is. > > > > > > > > If they give you further pushback, please just ignore them and apply > > > > Eric's fix directly. > > > > > > > > Thank you. > > > > > > I imagine that this somehow got lost in the shuffle during the > > > merge window. That doesn't excuse it, of course. > > > > > > It has waited long enough already, so I'll just go ahead and take it. > > > > > it is my mistake to lost this patch during the iwlwifi re-factor work, > > the patch is no longer apply and I ask Eric to rebase the patch. > > > > Sorry again for the mistake > > Well, it seems like a fix needed for 3.4. And, the patch applies there. > > It does cause some merge breakage in wireless-testing (and presumably > in linux-next). I'll attach the commit diff for the wireless-testing > merge fixup I did, for review and/or as a peace offering to sfr... :-) > > Please take a look at the result in wireless-testing and let me know > if it is broken...thanks! > Looks good to me, thanks very much for helping this. Wey > -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce 2012-05-03 5:19 ` Eric Dumazet 2012-05-03 5:25 ` David Miller @ 2012-05-03 5:41 ` Alexander Duyck 2012-05-03 5:50 ` David Miller 1 sibling, 1 reply; 15+ messages in thread From: Alexander Duyck @ 2012-05-03 5:41 UTC (permalink / raw) To: Eric Dumazet; +Cc: Alexander Duyck, netdev, davem, Eric Dumazet, Jeff Kirsher On 5/2/2012 10:19 PM, Eric Dumazet wrote: > On Wed, 2012-05-02 at 21:58 -0700, Alexander Duyck wrote: >> The question I have is how can you get into a case where the ksize is >> different from the end offset plus the aligned size of skb_shared_info? >> From what I can tell it looks like the only place we can lie is if we >> use build_skb with the frag_size option, and in that case we are using a >> page, not kmalloc memory. Otherwise in all other cases __alloc_skb or >> build_skb is using ksize(skb->head) - SKB_DATA_ALIGN(struct >> skb_shared_info) to set the end pointer, so reversing that should give >> us the same value as ksize(skb->head). > > Right after skb is allocated (build_skb() or other skb_alloc... > variants), truesize is correct by construction. > > Then drivers add fragments and can make truesize smaller than reality. > > And Intel drivers are known to abuse truesize. > > My last patch against iwlwifi is still waiting to make its way into > official tree. > > http://www.spinics.net/lists/netdev/msg192629.html I think the part that has me confused is how being more precise about removing from truesize gets in the way of detecting abuses of truesize. It seems like it should be more as good as or better then the original approach of just using skb->len. Then again we might just be talking in circles again. I have things broken out into 3 patches now that are much more readable. I will email them out in an hour or so once I do some quick tests to verify they are building and don't break anything. Thanks, Alex ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce 2012-05-03 5:41 ` Alexander Duyck @ 2012-05-03 5:50 ` David Miller 2012-05-03 7:08 ` Alexander Duyck 0 siblings, 1 reply; 15+ messages in thread From: David Miller @ 2012-05-03 5:50 UTC (permalink / raw) To: alexander.duyck Cc: eric.dumazet, alexander.h.duyck, netdev, edumazet, jeffrey.t.kirsher From: Alexander Duyck <alexander.duyck@gmail.com> Date: Wed, 02 May 2012 22:41:36 -0700 > I think the part that has me confused is how being more precise about > removing from truesize gets in the way of detecting abuses of > truesize. It seems like it should be more as good as or better then > the original approach of just using skb->len. You can only trim from truesize if you can be absolutely certain that you have removed any reference in the fraglist, or the page'd head, to the entire "block" of data. If the skb still refers to even just one byte in a particular block, we must still charge the entire block in the truesize. For example, NIU has three pools of power-of-2 blocks of data it maintainers and the device pulls from to build incoming packets. So if the chip used one of the 2048 byte buffers, we charge the entire 2048 bytes even of the packet is much smaller. Conversely this means we cannot trim the 2048 part of the truesize of that SKB unless we had some mechanism to know for certain 1) what the block size of the underlying data is and 2) that we've removed all references to that. Currently this is not really possible, so we therefore defer truesize adjustments. Behaving otherwise is dangerous, because then we'd potentially end up with a lot of memory used, but not actually accounted for by anyone. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce 2012-05-03 5:50 ` David Miller @ 2012-05-03 7:08 ` Alexander Duyck 0 siblings, 0 replies; 15+ messages in thread From: Alexander Duyck @ 2012-05-03 7:08 UTC (permalink / raw) To: David Miller Cc: eric.dumazet, alexander.h.duyck, netdev, edumazet, jeffrey.t.kirsher On 5/2/2012 10:50 PM, David Miller wrote: > From: Alexander Duyck<alexander.duyck@gmail.com> > Date: Wed, 02 May 2012 22:41:36 -0700 > >> I think the part that has me confused is how being more precise about >> removing from truesize gets in the way of detecting abuses of >> truesize. It seems like it should be more as good as or better then >> the original approach of just using skb->len. > You can only trim from truesize if you can be absolutely certain that > you have removed any reference in the fraglist, or the page'd head, to > the entire "block" of data. > > If the skb still refers to even just one byte in a particular block, > we must still charge the entire block in the truesize. I get that, and that is what my code was doing that the old code wasn't doing. That is why I am confused. I should have another patch ready in an hour or so that should help to make my position clear. > For example, NIU has three pools of power-of-2 blocks of data it > maintainers and the device pulls from to build incoming packets. > > So if the chip used one of the 2048 byte buffers, we charge the entire > 2048 bytes even of the packet is much smaller. > > Conversely this means we cannot trim the 2048 part of the truesize of > that SKB unless we had some mechanism to know for certain 1) what the > block size of the underlying data is and 2) that we've removed all > references to that. > > Currently this is not really possible, so we therefore defer truesize > adjustments. This is true for the frags, but not for the head buffer allocated with __alloc_skb or build_skb. For either of these we set both truesize and the end pointer offset based on the ksize(data). The truesize is the value plus SKB_DATA_ALIGNED(sizeof(struct sk_buff)), and the end pointer offset is the value minus SKB_DATA_ALIGNED(sizeof(struct skb_shared_info)). So in the case where we are removing the sk_buff and skb->head, I am subtracting the end pointer offset plus the aligned shared info and sk_buff values from skb->truesize to get the size of just the paged region. > Behaving otherwise is dangerous, because then we'd potentially end up > with a lot of memory used, but not actually accounted for by anyone. I fully agree that is why I was surprised when I was told not to use the values for skb->truesize that were computed back when we allocated the skb to determine the truesize to remove from the delta when we were removing it. Thanks, Alex ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2012-05-03 20:21 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-05-03 3:38 [PATCH 0/2] Cleanups for head_frag usage and tcp_try_coalese Alexander Duyck 2012-05-03 3:38 ` [PATCH 1/2] net: Stop decapitating clones that have a head_frag Alexander Duyck 2012-05-03 3:56 ` Eric Dumazet 2012-05-03 3:39 ` [PATCH 2/2] tcp: cleanup tcp_try_coalesce Alexander Duyck 2012-05-03 4:06 ` Eric Dumazet 2012-05-03 4:58 ` Alexander Duyck 2012-05-03 5:19 ` Eric Dumazet 2012-05-03 5:25 ` David Miller [not found] ` <20120503.012502.44731688706812861.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> 2012-05-03 15:14 ` John W. Linville 2012-05-03 15:24 ` Guy, Wey-Yi 2012-05-03 17:07 ` John W. Linville [not found] ` <20120503170727.GM9285-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org> 2012-05-03 20:21 ` Guy, Wey-Yi 2012-05-03 5:41 ` Alexander Duyck 2012-05-03 5:50 ` David Miller 2012-05-03 7:08 ` Alexander Duyck
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).