Bug in computing data_len in tcp

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Bug in computing data_len in tcp_sendmsg?
@ 2011-12-01  1:48 Tom Herbert
  2011-12-01  3:42 ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Tom Herbert @ 2011-12-01  1:48 UTC (permalink / raw)
  To: Linux Netdev List; +Cc: Eric Dumazet, David Miller

I believe that skb->data_len might no be computed correctly in
tcp_sendmsg.  Specifically, when skb_add_data_nocache (or
skb_add_data) is called skb->data_len is not updated (skb_put only
updates skb->len).  This results in the datalen in the head skbuf
being zero so any subsequent uses of the value lead to incorrect
results.  For instance, skb_headlen returns the length of the head
skbu data and not just that of the headers.  If I'm reading this
correctly, it's a pretty fundamental bug.

I don't have a fix for this yet.

Tom

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-01  1:48 Bug in computing data_len in tcp_sendmsg? Tom Herbert
@ 2011-12-01  3:42 ` Eric Dumazet
  2011-12-01  4:06   ` Tom Herbert
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2011-12-01  3:42 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Netdev List, David Miller

Le mercredi 30 novembre 2011 à 17:48 -0800, Tom Herbert a écrit :
> I believe that skb->data_len might no be computed correctly in
> tcp_sendmsg.  Specifically, when skb_add_data_nocache (or
> skb_add_data) is called skb->data_len is not updated (skb_put only
> updates skb->len).  This results in the datalen in the head skbuf
> being zero so any subsequent uses of the value lead to incorrect
> results.  For instance, skb_headlen returns the length of the head
> skbu data and not just that of the headers.  If I'm reading this
> correctly, it's a pretty fundamental bug.
> 
> I don't have a fix for this yet.
> 
> Tom

On which tree do you see a problem ?

For example net-next seems fine to me :


static inline int skb_copy_to_page_nocache(struct sock *sk, char __user *from,
                                           struct sk_buff *skb,
                                           struct page *page,
                                           int off, int copy)
{
        int err;

        err = skb_do_copy_data_nocache(sk, skb, from, page_address(page) + off,
                                       copy, skb->len);
        if (err)
                return err;

        skb->len             += copy;
        skb->data_len        += copy;
        skb->truesize        += copy;
        sk->sk_wmem_queued   += copy;
        sk_mem_charge(sk, copy);
        return 0;
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-01  3:42 ` Eric Dumazet
@ 2011-12-01  4:06   ` Tom Herbert
  2011-12-01  4:16     ` David Miller
  2011-12-01  4:17     ` Eric Dumazet
  0 siblings, 2 replies; 29+ messages in thread
From: Tom Herbert @ 2011-12-01  4:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Netdev List, David Miller

On Wed, Nov 30, 2011 at 7:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mercredi 30 novembre 2011 à 17:48 -0800, Tom Herbert a écrit :
>> I believe that skb->data_len might no be computed correctly in
>> tcp_sendmsg.  Specifically, when skb_add_data_nocache (or
>> skb_add_data) is called skb->data_len is not updated (skb_put only
>> updates skb->len).  This results in the datalen in the head skbuf
>> being zero so any subsequent uses of the value lead to incorrect
>> results.  For instance, skb_headlen returns the length of the head
>> skbu data and not just that of the headers.  If I'm reading this
>> correctly, it's a pretty fundamental bug.
>>
>> I don't have a fix for this yet.
>>
>> Tom
>
> On which tree do you see a problem ?
>
> For example net-next seems fine to me :
>
>

On net-next.  Look at skb_add_data_nocache.  This doesn't touch
data_len, and really can't since skb_put wants skb linearized.

Actually, I think I am misinterpreting the meaning of data_len, looks
like it is intended to return the number of bytes in the page buffer
area (comment on the function would be have been nice ;-) ).  It would
seem that e1000e driver might be misinterpreting it also:

        segs = skb_shinfo(skb)->gso_segs ? : 1;
        /* multiply data chunks by size of headers */
        bytecount = ((segs - 1) * skb_headlen(skb)) + skb->len;

So maybe the bug is  that e1000e is misusing the function (and it
looks like some other drivers would be too)?.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-01  4:06   ` Tom Herbert
@ 2011-12-01  4:16     ` David Miller
  2011-12-01  4:17     ` Eric Dumazet
  1 sibling, 0 replies; 29+ messages in thread
From: David Miller @ 2011-12-01  4:16 UTC (permalink / raw)
  To: therbert; +Cc: eric.dumazet, netdev

From: Tom Herbert <therbert@google.com>
Date: Wed, 30 Nov 2011 20:06:09 -0800

> Actually, I think I am misinterpreting the meaning of data_len, looks
> like it is intended to return the number of bytes in the page buffer
> area

skb->len is all bytes, skb->data_len covers only segmented portion.

> (comment on the function would be have been nice ;-) ).

Patches welcome, otherwise:

http://vger.kernel.org/~davem/skb.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-01  4:06   ` Tom Herbert
  2011-12-01  4:16     ` David Miller
@ 2011-12-01  4:17     ` Eric Dumazet
  2011-12-01  5:09       ` Tom Herbert
  2011-12-01 22:18       ` Vijay Subramanian
  1 sibling, 2 replies; 29+ messages in thread
From: Eric Dumazet @ 2011-12-01  4:17 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Netdev List, David Miller

Le mercredi 30 novembre 2011 à 20:06 -0800, Tom Herbert a écrit :
> On Wed, Nov 30, 2011 at 7:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le mercredi 30 novembre 2011 à 17:48 -0800, Tom Herbert a écrit :
> >> I believe that skb->data_len might no be computed correctly in
> >> tcp_sendmsg.  Specifically, when skb_add_data_nocache (or
> >> skb_add_data) is called skb->data_len is not updated (skb_put only
> >> updates skb->len).  This results in the datalen in the head skbuf
> >> being zero so any subsequent uses of the value lead to incorrect
> >> results.  For instance, skb_headlen returns the length of the head
> >> skbu data and not just that of the headers.  If I'm reading this
> >> correctly, it's a pretty fundamental bug.
> >>
> >> I don't have a fix for this yet.
> >>
> >> Tom
> >
> > On which tree do you see a problem ?
> >
> > For example net-next seems fine to me :
> >
> >
> 
> On net-next.  Look at skb_add_data_nocache.  This doesn't touch
> data_len, and really can't since skb_put wants skb linearized.
> 
> Actually, I think I am misinterpreting the meaning of data_len, looks
> like it is intended to return the number of bytes in the page buffer
> area (comment on the function would be have been nice ;-) ).  It would
> seem that e1000e driver might be misinterpreting it also:
> 
>         segs = skb_shinfo(skb)->gso_segs ? : 1;
>         /* multiply data chunks by size of headers */
>         bytecount = ((segs - 1) * skb_headlen(skb)) + skb->len;
> 
> So maybe the bug is  that e1000e is misusing the function (and it
> looks like some other drivers would be too)?.

Or the "bug" was to assume that skb was headless.
It was true until recently.

We recently added commit f07d960df33c5aef
(tcp: avoid frag allocation for small frames)

to avoid page allocation for small frames.

So now, skb can contain in head part of tcp data.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-01  4:17     ` Eric Dumazet
@ 2011-12-01  5:09       ` Tom Herbert
  2011-12-01  9:15         ` Eric Dumazet
  2011-12-01 22:18       ` Vijay Subramanian
  1 sibling, 1 reply; 29+ messages in thread
From: Tom Herbert @ 2011-12-01  5:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Netdev List, David Miller

> Or the "bug" was to assume that skb was headless.
> It was true until recently.
>
> We recently added commit f07d960df33c5aef
> (tcp: avoid frag allocation for small frames)
>
> to avoid page allocation for small frames.
>
> So now, skb can contain in head part of tcp data.

Yes, reverting that patch seems to fix the problem.  Eric, do you have
an idea what should be used to determine length of headers in an skb
now (mac through transport).

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-01  5:09       ` Tom Herbert
@ 2011-12-01  9:15         ` Eric Dumazet
  2011-12-01 17:29           ` Jesse Brandeburg
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2011-12-01  9:15 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Netdev List, David Miller

Le mercredi 30 novembre 2011 à 21:09 -0800, Tom Herbert a écrit :
> > Or the "bug" was to assume that skb was headless.
> > It was true until recently.
> >
> > We recently added commit f07d960df33c5aef
> > (tcp: avoid frag allocation for small frames)
> >
> > to avoid page allocation for small frames.
> >
> > So now, skb can contain in head part of tcp data.
> 
> Yes, reverting that patch seems to fix the problem.  Eric, do you have
> an idea what should be used to determine length of headers in an skb
> now (mac through transport).

I dont know why its even necessary :

TSO enabled NIC all provide hardware counters, so why even bother
computing tx_bytes ourself ?

skb->len is appropriate for BQL, as long as producers/consumer use the
same skb->len. 1 or 2% error is not a problem if not cumulative ?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-01  9:15         ` Eric Dumazet
@ 2011-12-01 17:29           ` Jesse Brandeburg
  2011-12-01 18:04             ` David Miller
  2011-12-01 20:37             ` Eric Dumazet
  0 siblings, 2 replies; 29+ messages in thread
From: Jesse Brandeburg @ 2011-12-01 17:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Linux Netdev List, David Miller, alexander.h.duyck

On Thu, 1 Dec 2011 01:15:37 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mercredi 30 novembre 2011 à 21:09 -0800, Tom Herbert a écrit :
> > > We recently added commit f07d960df33c5aef
> > > (tcp: avoid frag allocation for small frames)
> > 
> > Yes, reverting that patch seems to fix the problem.  Eric, do you have
> > an idea what should be used to determine length of headers in an skb
> > now (mac through transport).

Tom, thanks very much for finding this subtle bug!  I bet that this
commit (which I had missed) broke a lot of other drivers in
very subtle ways due to changing a long standing behavior.  Auditing
every driver's tx path is going to be a lot of work unless a way can be
discovered to automate finding incorrect assumptions in drivers.

> I dont know why its even necessary :
> 
> TSO enabled NIC all provide hardware counters, so why even bother
> computing tx_bytes ourself ?

because we have runtime logic for adjusting interrupt rate that depends
on knowing how many bytes and packets were cleaned up in an interrupt
and reading MMIO to get latest stats causes CPU stall in hot path.
Counting the bytes sent on the wire via a segmented SKB was hard to get
right to begin with, I think we took a different approach in ixgbe
(recent versions) and compute the math in hard_start_xmit instead of tx
cleanup.

> skb->len is appropriate for BQL, as long as producers/consumer use the
> same skb->len. 1 or 2% error is not a problem if not cumulative ?

if skb->len access doesn't cause a cache miss in hot path (or at
least doesn't increase misses) then I say sure.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-01 17:29           ` Jesse Brandeburg
@ 2011-12-01 18:04             ` David Miller
  2011-12-01 20:37             ` Eric Dumazet
  1 sibling, 0 replies; 29+ messages in thread
From: David Miller @ 2011-12-01 18:04 UTC (permalink / raw)
  To: jesse.brandeburg; +Cc: eric.dumazet, therbert, netdev, alexander.h.duyck

From: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Thu, 1 Dec 2011 09:29:50 -0800

> Tom, thanks very much for finding this subtle bug!  I bet that this
> commit (which I had missed) broke a lot of other drivers in
> very subtle ways due to changing a long standing behavior.

I call bullshit on this.

It's always been possible, any entity between the TCP stack and your
driver can pull data from the segmented pages into the linear data
area.

F.e. netfilter, packet scheduler actions, you name it.

This code sequence has always been buggy.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-01 17:29           ` Jesse Brandeburg
  2011-12-01 18:04             ` David Miller
@ 2011-12-01 20:37             ` Eric Dumazet
  1 sibling, 0 replies; 29+ messages in thread
From: Eric Dumazet @ 2011-12-01 20:37 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Tom Herbert, Linux Netdev List, David Miller, alexander.h.duyck

Le jeudi 01 décembre 2011 à 09:29 -0800, Jesse Brandeburg a écrit :
> On Thu, 1 Dec 2011 01:15:37 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Tom, thanks very much for finding this subtle bug!  I bet that this
> commit (which I had missed) broke a lot of other drivers in
> very subtle ways due to changing a long standing behavior.  Auditing
> every driver's tx path is going to be a lot of work unless a way can be
> discovered to automate finding incorrect assumptions in drivers.
> 

I did a quick check an counted 3 intel drivers.

Others are fine. I'll redo a check.

> > I dont know why its even necessary :
> > 
> > TSO enabled NIC all provide hardware counters, so why even bother
> > computing tx_bytes ourself ?
> 
> because we have runtime logic for adjusting interrupt rate that depends
> on knowing how many bytes and packets were cleaned up in an interrupt
> and reading MMIO to get latest stats causes CPU stall in hot path.
> Counting the bytes sent on the wire via a segmented SKB was hard to get
> right to begin with, I think we took a different approach in ixgbe
> (recent versions) and compute the math in hard_start_xmit instead of tx
> cleanup.


> 
> > skb->len is appropriate for BQL, as long as producers/consumer use the
> > same skb->len. 1 or 2% error is not a problem if not cumulative ?
> 
> if skb->len access doesn't cause a cache miss in hot path (or at
> least doesn't increase misses) then I say sure.

You still can cache it like now in your bytecount field in start_xmit(),
but since you call skb_free(), I doubt this matters a lot.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-01  4:17     ` Eric Dumazet
  2011-12-01  5:09       ` Tom Herbert
@ 2011-12-01 22:18       ` Vijay Subramanian
  2011-12-01 22:30         ` Eric Dumazet
  1 sibling, 1 reply; 29+ messages in thread
From: Vijay Subramanian @ 2011-12-01 22:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, Linux Netdev List, David Miller

>
> Or the "bug" was to assume that skb was headless.
> It was true until recently.
>
> We recently added commit f07d960df33c5aef
> (tcp: avoid frag allocation for small frames)
>
> to avoid page allocation for small frames.
>
> So now, skb can contain in head part of tcp data.
>
>

Hi,
I am looking at tcp_mtu_probe() and was wondering if this commit also
impacts this function. Once the data are copied from skbs in the write
queue to the probe skb, copied data are cleared from the original skbs
in the write queue.

It looks like the code assumes that the original skb will have data
either in linear part or in paged part.  The call to
__pskb_trim_head(skb, copy) for example does not clear linear part.

Can someone more familiar with the code take a look? Apologies if I
have read this wrong.

Regards
Vijay Subramanian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-01 22:18       ` Vijay Subramanian
@ 2011-12-01 22:30         ` Eric Dumazet
  2011-12-02  6:18           ` Vijay Subramanian
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2011-12-01 22:30 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: Tom Herbert, Linux Netdev List, David Miller

Le jeudi 01 décembre 2011 à 14:18 -0800, Vijay Subramanian a écrit :
> >
> > Or the "bug" was to assume that skb was headless.
> > It was true until recently.
> >
> > We recently added commit f07d960df33c5aef
> > (tcp: avoid frag allocation for small frames)
> >
> > to avoid page allocation for small frames.
> >
> > So now, skb can contain in head part of tcp data.
> >
> >
> 
> Hi,
> I am looking at tcp_mtu_probe() and was wondering if this commit also
> impacts this function. Once the data are copied from skbs in the write
> queue to the probe skb, copied data are cleared from the original skbs
> in the write queue.
> 
> It looks like the code assumes that the original skb will have data
> either in linear part or in paged part.  The call to
> __pskb_trim_head(skb, copy) for example does not clear linear part.
> 
> Can someone more familiar with the code take a look? Apologies if I
> have read this wrong.
> 

tcp_mtu_probe() builds a linear skb, and populate it using
skb_copy_bits() [ this is frag aware, and aware of payload in header as
well ]

I see no problem in it.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-01 22:30         ` Eric Dumazet
@ 2011-12-02  6:18           ` Vijay Subramanian
  2011-12-02 11:59             ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Vijay Subramanian @ 2011-12-02  6:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, Linux Netdev List, David Miller

>> I am looking at tcp_mtu_probe() and was wondering if this commit also
>> impacts this function. Once the data are copied from skbs in the write
>> queue to the probe skb, copied data are cleared from the original skbs
>> in the write queue.
>>
>> It looks like the code assumes that the original skb will have data
>> either in linear part or in paged part.  The call to
>> __pskb_trim_head(skb, copy) for example does not clear linear part.
>>
>> Can someone more familiar with the code take a look? Apologies if I
>> have read this wrong.
>>
>
> tcp_mtu_probe() builds a linear skb, and populate it using
> skb_copy_bits() [ this is frag aware, and aware of payload in header as
> well ]
>
> I see no problem in it.
>

Eric,
I think you may have misunderstood me (I think my post was not very
clear). Let me try again.

The MTU probe is built  correctly as a linear skb. As you point out,
skb_copy_bits() is frag aware and copies data from both the header and
pages.
The issue is with the way the data is cleared from the write queue
later in the function tcp_mtu_probe().

For example, if the MTU probe size was N bytes, then the probe is
inserted at the front of the write_queue and N bytes are copied from
the original write-queue skbs.
As these N bytes are copied, if an skb is completely consumed, it is
unlinked from the write_queue and freed. If an skb is only partially
consumed, then the pointers are adjusted
accordingly to erase the data. For a paged skb, fully consumed pages
are unreferenced.

This is done as follows in tcp_mtu_probe()

           if (!skb_shinfo(skb)->nr_frags) {
               skb_pull(skb, copy);
               if (skb->ip_summed != CHECKSUM_PARTIAL)
                   skb->csum = csum_partial(skb->data,
                                skb->len, 0);
           } else {
               __pskb_trim_head(skb, copy);
               tcp_set_skb_tso_segs(sk, skb, mss_now);
           }

It appears that the code assumes the data will either be in the linear
part (the if condition) or in the paged part (else condition) but not
both. Is this a correct assumption after the
recent commit f07d960df33c5aef (tcp: avoid frag allocation for small frames)?

Since __pskb_trim_head() only only removes data from the non-linear
part, the data in the linear part is never removed. Maybe for paged
skbs, we need something like
headlen = skb_headlen(skb);
skb_pull(skb, headlen);
__pskb_trim_head(skb, copy - headlen);

Thanks for your patience and hope this makes more sense than my previous post.

Regards,
Vijay Subramanian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-02  6:18           ` Vijay Subramanian
@ 2011-12-02 11:59             ` Eric Dumazet
  2011-12-02 15:44               ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2011-12-02 11:59 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: Tom Herbert, Linux Netdev List, David Miller

Le jeudi 01 décembre 2011 à 22:18 -0800, Vijay Subramanian a écrit :
> >> I am looking at tcp_mtu_probe() and was wondering if this commit also
> >> impacts this function. Once the data are copied from skbs in the write
> >> queue to the probe skb, copied data are cleared from the original skbs
> >> in the write queue.
> >>
> >> It looks like the code assumes that the original skb will have data
> >> either in linear part or in paged part.  The call to
> >> __pskb_trim_head(skb, copy) for example does not clear linear part.
> >>
> >> Can someone more familiar with the code take a look? Apologies if I
> >> have read this wrong.
> >>
> >
> > tcp_mtu_probe() builds a linear skb, and populate it using
> > skb_copy_bits() [ this is frag aware, and aware of payload in header as
> > well ]
> >
> > I see no problem in it.
> >
> 
> Eric,
> I think you may have misunderstood me (I think my post was not very
> clear). Let me try again.
> 
> The MTU probe is built  correctly as a linear skb. As you point out,
> skb_copy_bits() is frag aware and copies data from both the header and
> pages.
> The issue is with the way the data is cleared from the write queue
> later in the function tcp_mtu_probe().
> 
> For example, if the MTU probe size was N bytes, then the probe is
> inserted at the front of the write_queue and N bytes are copied from
> the original write-queue skbs.
> As these N bytes are copied, if an skb is completely consumed, it is
> unlinked from the write_queue and freed. If an skb is only partially
> consumed, then the pointers are adjusted
> accordingly to erase the data. For a paged skb, fully consumed pages
> are unreferenced.
> 
> This is done as follows in tcp_mtu_probe()
> 
>            if (!skb_shinfo(skb)->nr_frags) {
>                skb_pull(skb, copy);
>                if (skb->ip_summed != CHECKSUM_PARTIAL)
>                    skb->csum = csum_partial(skb->data,
>                                 skb->len, 0);
>            } else {
>                __pskb_trim_head(skb, copy);
>                tcp_set_skb_tso_segs(sk, skb, mss_now);
>            }
> 
> It appears that the code assumes the data will either be in the linear
> part (the if condition) or in the paged part (else condition) but not
> both. Is this a correct assumption after the
> recent commit f07d960df33c5aef (tcp: avoid frag allocation for small frames)?
> 
> Since __pskb_trim_head() only only removes data from the non-linear
> part, the data in the linear part is never removed. Maybe for paged
> skbs, we need something like
> headlen = skb_headlen(skb);
> skb_pull(skb, headlen);
> __pskb_trim_head(skb, copy - headlen);
> 
> Thanks for your patience and hope this makes more sense than my previous post.
> 

Thanks for this detailed explanation !

And yes, you're probably right.

Are you willing to submit a patch to fix this ?

(If not, I can do it myself of course)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-02 11:59             ` Eric Dumazet
@ 2011-12-02 15:44               ` Eric Dumazet
  2011-12-02 16:05                 ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2011-12-02 15:44 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: Tom Herbert, Linux Netdev List, David Miller

Le vendredi 02 décembre 2011 à 12:59 +0100, Eric Dumazet a écrit :

> Thanks for this detailed explanation !
> 
> And yes, you're probably right.
> 
> Are you willing to submit a patch to fix this ?
> 
> (If not, I can do it myself of course)
> 

[PATCH net-next] tcp: fix tcp_trim_head()

commit f07d960df3 (tcp: avoid frag allocation for small frames)
breaked assumption in tcp stack that skb is either linear (data_len ==
0), either fully fragged (data_len == len)

Thanks to Vijay for providing a very detailed explanation.

Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/tcp_output.c |   13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 58f69ac..4a0f54a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1093,6 +1093,13 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
 {
 	int i, k, eat;
 
+	eat = min_t(int, len, skb_headlen(skb));
+	if (eat) {
+		__skb_pull(skb, eat);
+		len -= eat;
+		if (!len)
+			return;
+	}
 	eat = len;
 	k = 0;
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
@@ -1124,11 +1131,7 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
 	if (skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
 		return -ENOMEM;
 
-	/* If len == headlen, we avoid __skb_pull to preserve alignment. */
-	if (unlikely(len < skb_headlen(skb)))
-		__skb_pull(skb, len);
-	else
-		__pskb_trim_head(skb, len - skb_headlen(skb));
+	__pskb_trim_head(skb, len);
 
 	TCP_SKB_CB(skb)->seq += len;
 	skb->ip_summed = CHECKSUM_PARTIAL;

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-02 15:44               ` Eric Dumazet
@ 2011-12-02 16:05                 ` Eric Dumazet
  2011-12-02 18:13                   ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2011-12-02 16:05 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: Tom Herbert, Linux Netdev List, David Miller

Le vendredi 02 décembre 2011 à 16:44 +0100, Eric Dumazet a écrit :

> [PATCH net-next] tcp: fix tcp_trim_head()
> 
> commit f07d960df3 (tcp: avoid frag allocation for small frames)
> breaked assumption in tcp stack that skb is either linear (data_len ==
> 0), either fully fragged (data_len == len)
> 
> Thanks to Vijay for providing a very detailed explanation.
> 
> Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---

Another problem is the possible misalignement of skb->data if/when we
receive an ACK of an odd (not a 4 multiple) tcp sequence.

So when/if packet is restransmited, tcp header (and IP header as well)
could be misaligned.

Hmm, I probably miss something obvious, since this problem could happen
for linear skbs before my patch ?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-02 16:05                 ` Eric Dumazet
@ 2011-12-02 18:13                   ` David Miller
  2011-12-02 18:36                     ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2011-12-02 18:13 UTC (permalink / raw)
  To: eric.dumazet; +Cc: subramanian.vijay, therbert, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 02 Dec 2011 17:05:35 +0100

> Le vendredi 02 décembre 2011 à 16:44 +0100, Eric Dumazet a écrit :
> 
>> [PATCH net-next] tcp: fix tcp_trim_head()
>> 
>> commit f07d960df3 (tcp: avoid frag allocation for small frames)
>> breaked assumption in tcp stack that skb is either linear (data_len ==
>> 0), either fully fragged (data_len == len)
>> 
>> Thanks to Vijay for providing a very detailed explanation.
>> 
>> Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> ---
> 
> Another problem is the possible misalignement of skb->data if/when we
> receive an ACK of an odd (not a 4 multiple) tcp sequence.
> 
> So when/if packet is restransmited, tcp header (and IP header as well)
> could be misaligned.
> 
> Hmm, I probably miss something obvious, since this problem could happen
> for linear skbs before my patch ?

Unfortunately, even if netfilter or the packet scheduler pulled data, this
misalignment wouldn't happen before your change.

Maybe we should revert your frag allocation avoidance change until we can
sort this out.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-02 18:13                   ` David Miller
@ 2011-12-02 18:36                     ` Eric Dumazet
  2011-12-02 18:40                       ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2011-12-02 18:36 UTC (permalink / raw)
  To: David Miller; +Cc: subramanian.vijay, therbert, netdev

Le vendredi 02 décembre 2011 à 13:13 -0500, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 02 Dec 2011 17:05:35 +0100
> 
> > Le vendredi 02 décembre 2011 à 16:44 +0100, Eric Dumazet a écrit :
> > 
> >> [PATCH net-next] tcp: fix tcp_trim_head()
> >> 
> >> commit f07d960df3 (tcp: avoid frag allocation for small frames)
> >> breaked assumption in tcp stack that skb is either linear (data_len ==
> >> 0), either fully fragged (data_len == len)
> >> 
> >> Thanks to Vijay for providing a very detailed explanation.
> >> 
> >> Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
> >> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> >> ---
> > 
> > Another problem is the possible misalignement of skb->data if/when we
> > receive an ACK of an odd (not a 4 multiple) tcp sequence.
> > 
> > So when/if packet is restransmited, tcp header (and IP header as well)
> > could be misaligned.
> > 
> > Hmm, I probably miss something obvious, since this problem could happen
> > for linear skbs before my patch ?
> 
> Unfortunately, even if netfilter or the packet scheduler pulled data, this
> misalignment wouldn't happen before your change.
> 
> Maybe we should revert your frag allocation avoidance change until we can
> sort this out.

The patch I posted should solve the problem Vijay spotted.
(It really does, I tested it, allowing mtu probing)

What I ask now is following problem (even prior to my frag allocation
patch) :

1) We allocate a linear skb (SG being off) to cook a tcp frame of length
XXX bytes.

2) We send it.

3) We receive an ACK for first 31 bytes.

4) We trim 31 bytes from the head of skb.  (skb_pull(skb, 31))
    skb->data is now not anymore aligned to a 4 bytes boundary.

5) Later, we need to retransmit skb (XXX minus 31 bytes already ACKed)
   We push TCP header, and skb->data is not aligned.
   TCP header is not aligned anymore. x86 doesnt care, but what about
other arches with misalign traps ?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-02 18:36                     ` Eric Dumazet
@ 2011-12-02 18:40                       ` David Miller
  2011-12-02 20:22                         ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2011-12-02 18:40 UTC (permalink / raw)
  To: eric.dumazet; +Cc: subramanian.vijay, therbert, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 02 Dec 2011 19:36:29 +0100

> What I ask now is following problem (even prior to my frag allocation
> patch) :
> 
> 1) We allocate a linear skb (SG being off) to cook a tcp frame of length
> XXX bytes.
> 
> 2) We send it.
> 
> 3) We receive an ACK for first 31 bytes.
> 
> 4) We trim 31 bytes from the head of skb.  (skb_pull(skb, 31))
>     skb->data is now not anymore aligned to a 4 bytes boundary.
> 
> 5) Later, we need to retransmit skb (XXX minus 31 bytes already ACKed)
>    We push TCP header, and skb->data is not aligned.
>    TCP header is not aligned anymore. x86 doesnt care, but what about
> other arches with misalign traps ?

Yes, for non-SG this always was technically possible.

But now you've made it possible for SG too.

Frankly, I think we should:

1) Revert your patch, so it's not possible for SG once more.

2) Add code to reallocate the SKB linear data when this happens in
   the non-SG case.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-02 18:40                       ` David Miller
@ 2011-12-02 20:22                         ` Eric Dumazet
  2011-12-02 20:24                           ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2011-12-02 20:22 UTC (permalink / raw)
  To: David Miller; +Cc: subramanian.vijay, therbert, netdev

Le vendredi 02 décembre 2011 à 13:40 -0500, David Miller a écrit :

> Yes, for non-SG this always was technically possible.
> 

Yes this can, I reproduced it very easily.

I find this hard to believe...

[ 2179.697991] ------------[ cut here ]------------
[ 2179.698000] WARNING: at net/ipv4/tcp_output.c:850 tcp_transmit_skb+0x846/0x8f0()
[ 2179.698001] Hardware name: ProLiant BL460c G6
[ 2179.698011] Pid: 0, comm: swapper Tainted: G        W    3.2.0-rc2+ #187
[ 2179.698012] Call Trace:
[ 2179.698014]  <IRQ>  [<ffffffff8104715f>] warn_slowpath_common+0x7f/0xc0
[ 2179.698020]  [<ffffffff810471ba>] warn_slowpath_null+0x1a/0x20
[ 2179.698022]  [<ffffffff815d8386>] tcp_transmit_skb+0x846/0x8f0
[ 2179.698025]  [<ffffffff815d9621>] tcp_retransmit_skb+0x1a1/0x5f0
[ 2179.698027]  [<ffffffff815daf5d>] tcp_retransmit_timer+0x1bd/0x640
[ 2179.698030]  [<ffffffff815db578>] tcp_write_timer+0x198/0x200
[ 2179.698034]  [<ffffffff810579bf>] run_timer_softirq+0x12f/0x3e0
[ 2179.698039]  [<ffffffff81295924>] ? timerqueue_add+0x74/0xc0
[ 2179.698041]  [<ffffffff815db3e0>] ? tcp_retransmit_timer+0x640/0x640
[ 2179.698045]  [<ffffffff810753b5>] ? ktime_get+0x65/0xe0
[ 2179.698047]  [<ffffffff8104e929>] __do_softirq+0xa9/0x240
[ 2179.698050]  [<ffffffff81075b76>] ? do_timer+0x2b6/0x480
[ 2179.698053]  [<ffffffff8106fb90>] ? hrtimer_interrupt+0x130/0x220
[ 2179.698057]  [<ffffffff816d526c>] call_softirq+0x1c/0x30
[ 2179.698061]  [<ffffffff81004115>] do_softirq+0x55/0x90
[ 2179.698063]  [<ffffffff8104edde>] irq_exit+0x9e/0xc0
[ 2179.698066]  [<ffffffff816d5a3e>] smp_apic_timer_interrupt+0x6e/0x99
[ 2179.698068]  [<ffffffff816d3d8b>] apic_timer_interrupt+0x6b/0x70



diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 58f69ac..92ce7c3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -847,6 +847,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 
 	/* Build TCP header and checksum it. */
 	th = tcp_hdr(skb);
+	WARN_ON((((unsigned long)th) & 3) != 0);
 	th->source		= inet->inet_sport;
 	th->dest		= inet->inet_dport;
 	th->seq			= htonl(tcb->seq);

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-02 20:22                         ` Eric Dumazet
@ 2011-12-02 20:24                           ` David Miller
  2011-12-02 20:45                             ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2011-12-02 20:24 UTC (permalink / raw)
  To: eric.dumazet; +Cc: subramanian.vijay, therbert, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 02 Dec 2011 21:22:49 +0100

> Le vendredi 02 décembre 2011 à 13:40 -0500, David Miller a écrit :
> 
>> Yes, for non-SG this always was technically possible.
>> 
> 
> Yes this can, I reproduced it very easily.
> 
> I find this hard to believe...

Grrr :-)

Ok, I'll think about this some more.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-02 20:24                           ` David Miller
@ 2011-12-02 20:45                             ` Eric Dumazet
  2011-12-02 20:48                               ` Eric Dumazet
  2011-12-02 21:30                               ` David Miller
  0 siblings, 2 replies; 29+ messages in thread
From: Eric Dumazet @ 2011-12-02 20:45 UTC (permalink / raw)
  To: David Miller; +Cc: subramanian.vijay, therbert, netdev

Le vendredi 02 décembre 2011 à 15:24 -0500, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 02 Dec 2011 21:22:49 +0100
> 
> > Le vendredi 02 décembre 2011 à 13:40 -0500, David Miller a écrit :
> > 
> >> Yes, for non-SG this always was technically possible.
> >> 
> > 
> > Yes this can, I reproduced it very easily.
> > 
> > I find this hard to believe...
> 
> Grrr :-)
> 
> Ok, I'll think about this some more.

Maybe a quick fix would be to trim not len bytes but (len & ~3) bytes ?

This avoid reallocations and complex code for a 'should never happen in
normal circumstances'...

Retransmits could transmits 3 bytes already ACKed, is it a big deal ?

patch on top of linux/net tree :

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 63170e2..4e108c5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1126,7 +1126,7 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
 
 	/* If len == headlen, we avoid __skb_pull to preserve alignment. */
 	if (unlikely(len < skb_headlen(skb)))
-		__skb_pull(skb, len);
+		__skb_pull(skb, len & ~3); /* preserve alignement of tcp/ip headers */
 	else
 		__pskb_trim_head(skb, len - skb_headlen(skb));
 

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-02 20:45                             ` Eric Dumazet
@ 2011-12-02 20:48                               ` Eric Dumazet
  2011-12-02 21:30                               ` David Miller
  1 sibling, 0 replies; 29+ messages in thread
From: Eric Dumazet @ 2011-12-02 20:48 UTC (permalink / raw)
  To: David Miller; +Cc: subramanian.vijay, therbert, netdev

Le vendredi 02 décembre 2011 à 21:45 +0100, Eric Dumazet a écrit :
> Le vendredi 02 décembre 2011 à 15:24 -0500, David Miller a écrit :
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Fri, 02 Dec 2011 21:22:49 +0100
> > 
> > > Le vendredi 02 décembre 2011 à 13:40 -0500, David Miller a écrit :
> > > 
> > >> Yes, for non-SG this always was technically possible.
> > >> 
> > > 
> > > Yes this can, I reproduced it very easily.
> > > 
> > > I find this hard to believe...
> > 
> > Grrr :-)
> > 
> > Ok, I'll think about this some more.
> 
> Maybe a quick fix would be to trim not len bytes but (len & ~3) bytes ?
> 
> This avoid reallocations and complex code for a 'should never happen in
> normal circumstances'...
> 
> Retransmits could transmits 3 bytes already ACKed, is it a big deal ?
> 
> patch on top of linux/net tree :
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 63170e2..4e108c5 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1126,7 +1126,7 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
>  
>  	/* If len == headlen, we avoid __skb_pull to preserve alignment. */
>  	if (unlikely(len < skb_headlen(skb)))
> -		__skb_pull(skb, len);
> +		__skb_pull(skb, len & ~3); /* preserve alignement of tcp/ip headers */
>  	else
>  		__pskb_trim_head(skb, len - skb_headlen(skb));
>  
> 


oops, thats the wrong version, here is it again :

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 63170e2..492b917 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1125,11 +1125,12 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
 		return -ENOMEM;
 
 	/* If len == headlen, we avoid __skb_pull to preserve alignment. */
-	if (unlikely(len < skb_headlen(skb)))
+	if (unlikely(len < skb_headlen(skb))) {
+		len &= ~3;
 		__skb_pull(skb, len);
-	else
+	} else {
 		__pskb_trim_head(skb, len - skb_headlen(skb));
-
+	}
 	TCP_SKB_CB(skb)->seq += len;
 	skb->ip_summed = CHECKSUM_PARTIAL;
 

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-02 20:45                             ` Eric Dumazet
  2011-12-02 20:48                               ` Eric Dumazet
@ 2011-12-02 21:30                               ` David Miller
  2011-12-03 15:09                                 ` Eric Dumazet
  1 sibling, 1 reply; 29+ messages in thread
From: David Miller @ 2011-12-02 21:30 UTC (permalink / raw)
  To: eric.dumazet; +Cc: subramanian.vijay, therbert, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 02 Dec 2011 21:45:56 +0100

> Retransmits could transmits 3 bytes already ACKed, is it a big deal ?

Unfortunately this kind of adjustment doesn't work.

When we trim the head in response to ACK'd data, the stack assumes that
the first byte sitting at the front of the retransmit queue is ->snd_una.

So if you just back align the pull, and don't make amends for the setting
of ->snd_una, we'll retransmit the wrong bytes.  The send queue will be
out of sync with the sequence number state of the socket.

This has implications for SACK tagging state bit in the transmit queue
as well.

In fact, this is a real dangerous road to go down, I think :-)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Bug in computing data_len in tcp_sendmsg?
  2011-12-02 21:30                               ` David Miller
@ 2011-12-03 15:09                                 ` Eric Dumazet
  2011-12-04  7:39                                   ` [PATCH] tcp: take care of misalignments Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2011-12-03 15:09 UTC (permalink / raw)
  To: David Miller; +Cc: subramanian.vijay, therbert, netdev

Le vendredi 02 décembre 2011 à 16:30 -0500, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 02 Dec 2011 21:45:56 +0100
> 
> > Retransmits could transmits 3 bytes already ACKed, is it a big deal ?
> 
> Unfortunately this kind of adjustment doesn't work.
> 
> When we trim the head in response to ACK'd data, the stack assumes that
> the first byte sitting at the front of the retransmit queue is ->snd_una.
> 
> So if you just back align the pull, and don't make amends for the setting
> of ->snd_una, we'll retransmit the wrong bytes.  The send queue will be
> out of sync with the sequence number state of the socket.
> 
> This has implications for SACK tagging state bit in the transmit queue
> as well.
> 
> In fact, this is a real dangerous road to go down, I think :-)


Yeah, it was not a good idea :)

My plan is to add a third parameter to pskb_copy(struct sk_buff *skb,
gfp_t gfp_mask, int reserve)

and use pskb_copy() in tcp_retransmit_skb() if it appears we need
between 1 and 3 bytes to re-align skb head before calling
tcp_transmit_skb(), (if NET_IP_ALIGN is not null)

[ In the unlikely case we had to allocate a new skb with pskb_copy(), we
will pass clone_it=0 to tcp_transmit_skb() ]

I'll send a fully tested patch before the end of week end.

Thanks !

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] tcp: take care of misalignments
  2011-12-03 15:09                                 ` Eric Dumazet
@ 2011-12-04  7:39                                   ` Eric Dumazet
  2011-12-04 18:21                                     ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2011-12-04  7:39 UTC (permalink / raw)
  To: David Miller; +Cc: subramanian.vijay, therbert, netdev

Hi David

Here is the patch I tested this morning on my machine.

To simulate misalignement without a malicious peer, I did a quick hack
in tcp_ack() to ack N-1 bytes instead of N bytes every 10 acks.

Here is the patch suited for net tree.

Thanks

[PATCH] tcp: take care of misalignments

We discovered that TCP stack could retransmit misaligned skbs if a
malicious peer acknowledged sub MSS frame. This currently can happen
only if output interface is non SG enabled : If SG is enabled, tcp
builds headless skbs (all payload is included in fragments), so the tcp
trimming process only removes parts of skb fragments, header stay
aligned.

Some arches cant handle misalignments, so force a head reallocation and
shrink headroom to MAX_TCP_HEADER.

Dont care about misaligments on x86 and PPC (or other arches setting
NET_IP_ALIGN to 0)

This patch introduces __pskb_copy() which can specify the headroom of
new head, and pskb_copy() becomes a wrapper on top of __pskb_copy()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h |   11 +++++++++--
 net/core/skbuff.c      |   11 ++++++-----
 net/ipv4/tcp_output.c  |   10 +++++++++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fe86488..dd9a0cd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -561,8 +561,9 @@ extern struct sk_buff *skb_clone(struct sk_buff *skb,
 				 gfp_t priority);
 extern struct sk_buff *skb_copy(const struct sk_buff *skb,
 				gfp_t priority);
-extern struct sk_buff *pskb_copy(struct sk_buff *skb,
-				 gfp_t gfp_mask);
+extern struct sk_buff *__pskb_copy(struct sk_buff *skb,
+				 int headroom, gfp_t gfp_mask);
+
 extern int	       pskb_expand_head(struct sk_buff *skb,
 					int nhead, int ntail,
 					gfp_t gfp_mask);
@@ -1824,6 +1825,12 @@ static inline dma_addr_t skb_frag_dma_map(struct device *dev,
 			    frag->page_offset + offset, size, dir);
 }
 
+static inline struct sk_buff *pskb_copy(struct sk_buff *skb,
+					gfp_t gfp_mask)
+{
+	return __pskb_copy(skb, skb_headroom(skb), gfp_mask);
+}
+
 /**
  *	skb_clone_writable - is the header of a clone writable
  *	@skb: buffer to check
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3c30ee4..21c7361 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -791,8 +791,9 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 EXPORT_SYMBOL(skb_copy);
 
 /**
- *	pskb_copy	-	create copy of an sk_buff with private head.
+ *	__pskb_copy	-	create copy of an sk_buff with private head.
  *	@skb: buffer to copy
+ *	@headroom: headroom of new skb
  *	@gfp_mask: allocation priority
  *
  *	Make a copy of both an &sk_buff and part of its data, located
@@ -803,16 +804,16 @@ EXPORT_SYMBOL(skb_copy);
  *	The returned buffer has a reference count of 1.
  */
 
-struct sk_buff *pskb_copy(struct sk_buff *skb, gfp_t gfp_mask)
+struct sk_buff *__pskb_copy(struct sk_buff *skb, int headroom, gfp_t gfp_mask)
 {
-	unsigned int size = skb_end_pointer(skb) - skb->head;
+	unsigned int size = skb_headlen(skb) + headroom;
 	struct sk_buff *n = alloc_skb(size, gfp_mask);
 
 	if (!n)
 		goto out;
 
 	/* Set the data pointer */
-	skb_reserve(n, skb_headroom(skb));
+	skb_reserve(n, headroom);
 	/* Set the tail pointer and length */
 	skb_put(n, skb_headlen(skb));
 	/* Copy the bytes */
@@ -848,7 +849,7 @@ struct sk_buff *pskb_copy(struct sk_buff *skb, gfp_t gfp_mask)
 out:
 	return n;
 }
-EXPORT_SYMBOL(pskb_copy);
+EXPORT_SYMBOL(__pskb_copy);
 
 /**
  *	pskb_expand_head - reallocate header of &sk_buff
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 63170e2..0973631 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2147,7 +2147,15 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	 */
 	TCP_SKB_CB(skb)->when = tcp_time_stamp;
 
-	err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
+	/* make sure skb->data is aligned on arches that require it */
+	if (unlikely(NET_IP_ALIGN && ((unsigned long)skb->data & 3))) {
+		struct sk_buff *nskb = __pskb_copy(skb, MAX_TCP_HEADER,
+						   GFP_ATOMIC);
+		err = nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :
+			     -ENOBUFS;
+	} else {
+		err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
+	}
 
 	if (err == 0) {
 		/* Update global TCP statistics. */

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] tcp: take care of misalignments
  2011-12-04  7:39                                   ` [PATCH] tcp: take care of misalignments Eric Dumazet
@ 2011-12-04 18:21                                     ` David Miller
  2011-12-04 18:51                                       ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2011-12-04 18:21 UTC (permalink / raw)
  To: eric.dumazet; +Cc: subramanian.vijay, therbert, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 04 Dec 2011 08:39:53 +0100

> [PATCH] tcp: take care of misalignments
> 
> We discovered that TCP stack could retransmit misaligned skbs if a
> malicious peer acknowledged sub MSS frame. This currently can happen
> only if output interface is non SG enabled : If SG is enabled, tcp
> builds headless skbs (all payload is included in fragments), so the tcp
> trimming process only removes parts of skb fragments, header stay
> aligned.
> 
> Some arches cant handle misalignments, so force a head reallocation and
> shrink headroom to MAX_TCP_HEADER.
> 
> Dont care about misaligments on x86 and PPC (or other arches setting
> NET_IP_ALIGN to 0)
> 
> This patch introduces __pskb_copy() which can specify the headroom of
> new head, and pskb_copy() becomes a wrapper on top of __pskb_copy()
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks good, applied.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] tcp: take care of misalignments
  2011-12-04 18:21                                     ` David Miller
@ 2011-12-04 18:51                                       ` Eric Dumazet
  2011-12-05 23:45                                         ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2011-12-04 18:51 UTC (permalink / raw)
  To: David Miller; +Cc: subramanian.vijay, therbert, netdev

Le dimanche 04 décembre 2011 à 13:21 -0500, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sun, 04 Dec 2011 08:39:53 +0100
> 
> > [PATCH] tcp: take care of misalignments
> > 
> > We discovered that TCP stack could retransmit misaligned skbs if a
> > malicious peer acknowledged sub MSS frame. This currently can happen
> > only if output interface is non SG enabled : If SG is enabled, tcp
> > builds headless skbs (all payload is included in fragments), so the tcp
> > trimming process only removes parts of skb fragments, header stay
> > aligned.
> > 
> > Some arches cant handle misalignments, so force a head reallocation and
> > shrink headroom to MAX_TCP_HEADER.
> > 
> > Dont care about misaligments on x86 and PPC (or other arches setting
> > NET_IP_ALIGN to 0)
> > 
> > This patch introduces __pskb_copy() which can specify the headroom of
> > new head, and pskb_copy() becomes a wrapper on top of __pskb_copy()
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Looks good, applied.

Thanks !

Here a respin of previous patch then ?

[PATCH net-next] tcp: fix tcp_trim_head()

commit f07d960df3 (tcp: avoid frag allocation for small frames)
breaked assumption in tcp stack that skb is either linear (skb->data_len
== 0), or fully fragged (skb->data_len == skb->len)

tcp_trim_head() made this assumption, we must fix it.

Thanks to Vijay for providing a very detailed explanation.

Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/tcp_output.c |   13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 50788d6..cf30680 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1093,6 +1093,13 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
 {
 	int i, k, eat;
 
+	eat = min_t(int, len, skb_headlen(skb));
+	if (eat) {
+		__skb_pull(skb, eat);
+		len -= eat;
+		if (!len)
+			return;
+	}
 	eat = len;
 	k = 0;
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
@@ -1124,11 +1131,7 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
 	if (skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
 		return -ENOMEM;
 
-	/* If len == headlen, we avoid __skb_pull to preserve alignment. */
-	if (unlikely(len < skb_headlen(skb)))
-		__skb_pull(skb, len);
-	else
-		__pskb_trim_head(skb, len - skb_headlen(skb));
+	__pskb_trim_head(skb, len);
 
 	TCP_SKB_CB(skb)->seq += len;
 	skb->ip_summed = CHECKSUM_PARTIAL;

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] tcp: take care of misalignments
  2011-12-04 18:51                                       ` Eric Dumazet
@ 2011-12-05 23:45                                         ` David Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David Miller @ 2011-12-05 23:45 UTC (permalink / raw)
  To: eric.dumazet; +Cc: subramanian.vijay, therbert, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 04 Dec 2011 19:51:08 +0100

> [PATCH net-next] tcp: fix tcp_trim_head()
> 
> commit f07d960df3 (tcp: avoid frag allocation for small frames)
> breaked assumption in tcp stack that skb is either linear (skb->data_len
> == 0), or fully fragged (skb->data_len == skb->len)
> 
> tcp_trim_head() made this assumption, we must fix it.
> 
> Thanks to Vijay for providing a very detailed explanation.
> 
> Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2011-12-05 23:45 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-01  1:48 Bug in computing data_len in tcp_sendmsg? Tom Herbert
2011-12-01  3:42 ` Eric Dumazet
2011-12-01  4:06   ` Tom Herbert
2011-12-01  4:16     ` David Miller
2011-12-01  4:17     ` Eric Dumazet
2011-12-01  5:09       ` Tom Herbert
2011-12-01  9:15         ` Eric Dumazet
2011-12-01 17:29           ` Jesse Brandeburg
2011-12-01 18:04             ` David Miller
2011-12-01 20:37             ` Eric Dumazet
2011-12-01 22:18       ` Vijay Subramanian
2011-12-01 22:30         ` Eric Dumazet
2011-12-02  6:18           ` Vijay Subramanian
2011-12-02 11:59             ` Eric Dumazet
2011-12-02 15:44               ` Eric Dumazet
2011-12-02 16:05                 ` Eric Dumazet
2011-12-02 18:13                   ` David Miller
2011-12-02 18:36                     ` Eric Dumazet
2011-12-02 18:40                       ` David Miller
2011-12-02 20:22                         ` Eric Dumazet
2011-12-02 20:24                           ` David Miller
2011-12-02 20:45                             ` Eric Dumazet
2011-12-02 20:48                               ` Eric Dumazet
2011-12-02 21:30                               ` David Miller
2011-12-03 15:09                                 ` Eric Dumazet
2011-12-04  7:39                                   ` [PATCH] tcp: take care of misalignments Eric Dumazet
2011-12-04 18:21                                     ` David Miller
2011-12-04 18:51                                       ` Eric Dumazet
2011-12-05 23:45                                         ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).