From: Alexander Duyck <alexander.h.duyck@intel.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Alexander Duyck" <alexander.duyck@gmail.com>,
"David Miller" <davem@davemloft.net>,
netdev <netdev@vger.kernel.org>,
"Neal Cardwell" <ncardwell@google.com>,
"Tom Herbert" <therbert@google.com>,
"Jeff Kirsher" <jeffrey.t.kirsher@intel.com>,
"Michael Chan" <mchan@broadcom.com>,
"Matt Carlson" <mcarlson@broadcom.com>,
"Herbert Xu" <herbert@gondor.apana.org.au>,
"Ben Hutchings" <bhutchings@solarflare.com>,
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>,
"Maciej Żenczykowski" <maze@google.com>
Subject: Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
Date: Wed, 02 May 2012 09:27:22 -0700 [thread overview]
Message-ID: <4FA1606A.6040607@intel.com> (raw)
In-Reply-To: <1335975168.22133.578.camel@edumazet-glaptop>
On 05/02/2012 09:12 AM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 08:52 -0700, Alexander Duyck wrote:
>> On 05/02/2012 01:13 AM, Eric Dumazet wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>>
>>> Before stealing fragments or skb head, we must make sure skb is not
>>> cloned.
>>>
>>> If skb is cloned, we must take references on pages instead.
>>>
>>> Bug happened using tcpdump (if not using mmap())
>>>
>>> Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>> ---
>>> net/ipv4/tcp_input.c | 17 ++++++++++++++---
>>> 1 file changed, 14 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>> index 96a631d..7686d7f 100644
>>> --- a/net/ipv4/tcp_input.c
>>> +++ b/net/ipv4/tcp_input.c
>>> @@ -4467,7 +4467,7 @@ static bool tcp_try_coalesce(struct sock *sk,
>>> struct sk_buff *from,
>>> bool *fragstolen)
>>> {
>>> - int delta, len = from->len;
>>> + int i, delta, len = from->len;
>>>
>>> *fragstolen = false;
>>> if (tcp_hdr(from)->fin)
>>> @@ -4497,7 +4497,13 @@ copyfrags:
>>> skb_shinfo(from)->frags,
>>> skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
>>> skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
>>> - skb_shinfo(from)->nr_frags = 0;
>>> +
>>> + if (skb_cloned(from))
>>> + for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
>>> + skb_frag_ref(from, i);
>>> + else
>>> + skb_shinfo(from)->nr_frags = 0;
>>> +
>>> to->truesize += delta;
>>> atomic_add(delta, &sk->sk_rmem_alloc);
>>> sk_mem_charge(sk, delta);
>> I am fairly certain the bug I saw is only masked over by this change.
>> The underlying problem is that we shouldn't be messing with nr_frags on
>> the from or the to if either one is clone. You now have a check in
>> place for the from, but what about the to? This function should
>> probably be calling a pskb_expand_head on the to skb in order to
>> guarantee that the skb->head isn't shared. Otherwise this is going to
>> cause other issues for any functions that are sharing these skbs that
>> just walk through frags without checking skb->len or skb->data_len first.
> Its safe to increase to->len and increase nr_frags in this context,
> because we hold a reference to dataref : It cannot disappear under us.
>
> clones will still have their skb->len at skb_clone() time and wont care
> we expanded the frags.
Are you sure about that? I think this may blow up if a bridge is
brought into play. In that case you will have clones that will be going
through the xmit path of network drivers and I know in the case of the
older e1000 driver it didn't stop to look at the length but would
instead just go through and start mapping all frags to the device. I am
fairly certain you are risking a data corruption any time you modify
nr_frags and dataref is != 1.
I really think what we should be doing is either not merge period, or we
have to go through slow paths if either the to or the from is cloned.
>>> @@ -4515,7 +4521,12 @@ copyfrags:
>>> offset = from->data - (unsigned char *)page_address(page);
>>> skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
>>> page, offset, skb_headlen(from));
>>> - *fragstolen = true;
>>> +
>>> + if (skb_cloned(from))
>>> + get_page(page);
>>> + else
>>> + *fragstolen = true;
>>> +
>>> delta = len; /* we dont know real truesize... */
>>> goto copyfrags;
>>> }
>>>
>>>
>> I don't see where we are now addressing the put_page call to release the
>> page afterwards. By calling get_page you are incrementing the page
>> count by one, but where are you decrementing dataref in the shared
>> info? Without that we are looking at a memory leak because __kfree_skb
>> will decrement the dataref but it will never reach 0 so it will never
>> call put_page on the head frag.
> really the dataref was already incremented at skb_clone() time
>
> It will be properly decremented since we call __kfree_skb()
>
> Only the last decrement will perform the put_page()
>
> Think about splice() is doing, its the same get_page() game.
I think you are missing the point. So skb_clone will bump the dataref
to 2, calling get_page will bump the page count to 2. After this
function you don't call __kfree_skb(skb) instead you call
kmem_cache_free(skbuff_head_cache, skb). This will free the sk_buff,
but not decrement dataref leaving it at 2. Eventually the raw socket
will call kfree_skb(skb) on the clone dropping the dataref to 1 and you
will call put_page dropping the page count to 1, but I don't see where
the last __kfree_skb call will come from that will drop dataref and the
page count to 0.
Thanks,
Alex
next prev parent reply other threads:[~2012-05-02 16:27 UTC|newest]
Thread overview: 45+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-04-27 10:37 [PATCH 3/4 net-next] net: make GRO aware of skb->head_frag Eric Dumazet
2012-04-30 17:54 ` Eric Dumazet
2012-04-30 18:10 ` [PATCH 3/4 v2 " Eric Dumazet
2012-04-30 23:36 ` Alexander Duyck
2012-05-01 1:27 ` Eric Dumazet
2012-05-01 5:33 ` Alexander Duyck
2012-05-01 6:39 ` Eric Dumazet
2012-05-01 16:17 ` Alexander Duyck
2012-05-01 17:04 ` Eric Dumazet
2012-05-01 19:45 ` Alexander Duyck
2012-05-02 2:45 ` Eric Dumazet
2012-05-02 8:24 ` Eric Dumazet
2012-05-02 16:16 ` Alexander Duyck
2012-05-02 16:19 ` Eric Dumazet
2012-05-02 16:27 ` Eric Dumazet
2012-05-02 17:04 ` Alexander Duyck
2012-05-02 17:02 ` Alexander Duyck
2012-05-02 17:16 ` Rick Jones
2012-05-01 22:58 ` Alexander Duyck
2012-05-01 23:10 ` Alexander Duyck
2012-05-02 2:47 ` Eric Dumazet
2012-05-02 3:54 ` Eric Dumazet
2012-05-02 8:13 ` [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce() Eric Dumazet
2012-05-02 15:52 ` Alexander Duyck
2012-05-02 16:12 ` Eric Dumazet
2012-05-02 16:27 ` Alexander Duyck [this message]
2012-05-02 16:46 ` Eric Dumazet
2012-05-02 17:55 ` [PATCH v2 " Eric Dumazet
2012-05-02 19:58 ` [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv() Eric Dumazet
2012-05-02 20:11 ` Joe Perches
2012-05-02 20:23 ` Eric Dumazet
2012-05-02 20:34 ` Joe Perches
2012-05-03 0:32 ` David Miller
2012-05-03 1:11 ` David Miller
2012-05-03 2:14 ` Eric Dumazet
2012-05-03 2:21 ` David Miller
2012-05-03 1:11 ` [PATCH v2 net-next] net: take care of cloned skbs in tcp_try_coalesce() David Miller
2012-05-02 18:05 ` [PATCH " Alexander Duyck
2012-05-02 18:15 ` Eric Dumazet
2012-05-02 20:55 ` Alexander Duyck
2012-05-03 1:52 ` Eric Dumazet
2012-05-03 3:00 ` Alexander Duyck
2012-05-03 3:14 ` Eric Dumazet
2012-05-03 3:28 ` Alexander Duyck
2012-05-01 1:48 ` [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag David Miller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4FA1606A.6040607@intel.com \
--to=alexander.h.duyck@intel.com \
--cc=alexander.duyck@gmail.com \
--cc=bhutchings@solarflare.com \
--cc=davem@davemloft.net \
--cc=eric.dumazet@gmail.com \
--cc=herbert@gondor.apana.org.au \
--cc=ilpo.jarvinen@helsinki.fi \
--cc=jeffrey.t.kirsher@intel.com \
--cc=maze@google.com \
--cc=mcarlson@broadcom.com \
--cc=mchan@broadcom.com \
--cc=ncardwell@google.com \
--cc=netdev@vger.kernel.org \
--cc=therbert@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).