netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* skb_try_coalesce bug?
@ 2014-04-22 12:01 Erik Hugne
  2014-04-22 13:11 ` Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Erik Hugne @ 2014-04-22 12:01 UTC (permalink / raw)
  To: netdev

It seems that if the head skb of a reassembly chain have enough tailroom
to hold the data of a received fragment, skb_try_coalesce() will append this
directly to the head, even if preceding fragments have been put on a frag list.
This will cause a corrupted buffer to be passed to userland when 
skb_copy_datagram_iovec() later copies the contents of head, and then each frag
one by one to the target iovec.
 
Is skb_try_coalesce() broken, or are we using it wrongly in tipc?

//E

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: skb_try_coalesce bug?
  2014-04-22 12:01 skb_try_coalesce bug? Erik Hugne
@ 2014-04-22 13:11 ` Eric Dumazet
  2014-04-22 19:38   ` Jon Maloy
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2014-04-22 13:11 UTC (permalink / raw)
  To: Erik Hugne; +Cc: netdev

On Tue, 2014-04-22 at 14:01 +0200, Erik Hugne wrote:
> It seems that if the head skb of a reassembly chain have enough tailroom
> to hold the data of a received fragment, skb_try_coalesce() will append this
> directly to the head, even if preceding fragments have been put on a frag list.
> This will cause a corrupted buffer to be passed to userland when 
> skb_copy_datagram_iovec() later copies the contents of head, and then each frag
> one by one to the target iovec.
>  
> Is skb_try_coalesce() broken, or are we using it wrongly in tipc?

I am not sure how it could happen with the current implementation ?

static inline bool skb_is_nonlinear(const struct sk_buff *skb)
{
	return skb->data_len;
}

static inline int skb_tailroom(const struct sk_buff *skb)
{
	return skb_is_nonlinear(skb) ? 0 : skb->end - skb->tail;
}

/**
 * skb_try_coalesce - try to merge skb to prior one
 * @to: prior buffer
 * @from: buffer to add
 * @fragstolen: pointer to boolean
 * @delta_truesize: how much more was allocated than was requested
 */
bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
                      bool *fragstolen, int *delta_truesize)
{
        int i, delta, len = from->len;

        *fragstolen = false;

        if (skb_cloned(to))
                return false;

        if (len <= skb_tailroom(to)) {
                BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: skb_try_coalesce bug?
  2014-04-22 13:11 ` Eric Dumazet
@ 2014-04-22 19:38   ` Jon Maloy
  2014-04-22 20:05     ` Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Jon Maloy @ 2014-04-22 19:38 UTC (permalink / raw)
  To: Eric Dumazet, Erik Hugne; +Cc: netdev

On 04/22/2014 09:11 AM, Eric Dumazet wrote:
> On Tue, 2014-04-22 at 14:01 +0200, Erik Hugne wrote:
>> It seems that if the head skb of a reassembly chain have enough tailroom
>> to hold the data of a received fragment, skb_try_coalesce() will append this
>> directly to the head, even if preceding fragments have been put on a frag list.
>> This will cause a corrupted buffer to be passed to userland when 
>> skb_copy_datagram_iovec() later copies the contents of head, and then each frag
>> one by one to the target iovec.
>>  
>> Is skb_try_coalesce() broken, or are we using it wrongly in tipc?
> 
> I am not sure how it could happen with the current implementation ?
> 
> static inline bool skb_is_nonlinear(const struct sk_buff *skb)
> {
> 	return skb->data_len;
> }
> 
> static inline int skb_tailroom(const struct sk_buff *skb)
> {
> 	return skb_is_nonlinear(skb) ? 0 : skb->end - skb->tail;
> }
> 
> /**
>  * skb_try_coalesce - try to merge skb to prior one
>  * @to: prior buffer
>  * @from: buffer to add
>  * @fragstolen: pointer to boolean
>  * @delta_truesize: how much more was allocated than was requested
>  */
> bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>                       bool *fragstolen, int *delta_truesize)
> {
>         int i, delta, len = from->len;
> 
>         *fragstolen = false;
> 
>         if (skb_cloned(to))
>                 return false;
> 
>         if (len <= skb_tailroom(to)) {
>                 BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));

In the case I encountered, our head buffer is linear (skb->data_len == 0),
so it is the real tailroom value that is returned. An alas, that one is big
enough to contain the last (small) fragment of the message.

///jon


> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: skb_try_coalesce bug?
  2014-04-22 19:38   ` Jon Maloy
@ 2014-04-22 20:05     ` Eric Dumazet
  2014-04-22 20:35       ` Jon Maloy
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2014-04-22 20:05 UTC (permalink / raw)
  To: Jon Maloy; +Cc: Erik Hugne, netdev

On Tue, 2014-04-22 at 15:38 -0400, Jon Maloy wrote:

> 
> In the case I encountered, our head buffer is linear (skb->data_len == 0),
> so it is the real tailroom value that is returned. An alas, that one is big
> enough to contain the last (small) fragment of the message.


Whole point of skb_try_coalesce() is to coalesce as much as possible,
without guarantee of keeping some sort of 'segments'

skb_try_coalesce - try to merge skb to prior one

If you do not want this to happen, (you seem to want nothing else in
your head buffer skb->head), you need to add some logic.

A helper temporarily setting head->tail = head->end would do it I guess.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: skb_try_coalesce bug?
  2014-04-22 20:05     ` Eric Dumazet
@ 2014-04-22 20:35       ` Jon Maloy
  2014-04-22 21:28         ` Jon Maloy
  2014-04-22 21:29         ` Eric Dumazet
  0 siblings, 2 replies; 12+ messages in thread
From: Jon Maloy @ 2014-04-22 20:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Erik Hugne, netdev

On 04/22/2014 04:05 PM, Eric Dumazet wrote:
> On Tue, 2014-04-22 at 15:38 -0400, Jon Maloy wrote:
> 
>>
>> In the case I encountered, our head buffer is linear (skb->data_len == 0),
>> so it is the real tailroom value that is returned. An alas, that one is big
>> enough to contain the last (small) fragment of the message.
> 
> 
> Whole point of skb_try_coalesce() is to coalesce as much as possible,
> without guarantee of keeping some sort of 'segments'
> 
> skb_try_coalesce - try to merge skb to prior one
> 
> If you do not want this to happen, (you seem to want nothing else in
> your head buffer skb->head), you need to add some logic.

Ok. I should have given a little background.

1: We send a message of 3041 bytes, inclusive TIPC header, via loopback interface.

2: This one gets chopped up in three fragments: 1420, 1420,and 201 bytes.
   (The mtu was of course wrong, but this is how I discovered the problem).

3: First fragment is received, uncloned, and serves as head.

4;  Second fragment (a clone) is received. skb_try_coalesce() fails at
    the skb_head_is_locked() test, because the buffer is a clone.
    Because of this, we add the buffer to skb_shinfo(head)->frag_list
    instead.

5: Third fragment (also a clone) is received. Now, since we check for
   space in tailroom of header before we do anything else, it slips
   in there, and bypasses the already chained-up second segment.

Regards
///jon


> 
> A helper temporarily setting head->tail = head->end would do it I guess.
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: skb_try_coalesce bug?
  2014-04-22 20:35       ` Jon Maloy
@ 2014-04-22 21:28         ` Jon Maloy
  2014-04-22 21:29         ` Eric Dumazet
  1 sibling, 0 replies; 12+ messages in thread
From: Jon Maloy @ 2014-04-22 21:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Erik Hugne, netdev

On 04/22/2014 04:35 PM, Jon Maloy wrote:
> On 04/22/2014 04:05 PM, Eric Dumazet wrote:
>> On Tue, 2014-04-22 at 15:38 -0400, Jon Maloy wrote:
>>
>>>
>>> In the case I encountered, our head buffer is linear (skb->data_len == 0),
>>> so it is the real tailroom value that is returned. An alas, that one is big
>>> enough to contain the last (small) fragment of the message.
>>
>>
>> Whole point of skb_try_coalesce() is to coalesce as much as possible,
>> without guarantee of keeping some sort of 'segments'
>>
>> skb_try_coalesce - try to merge skb to prior one
>>
>> If you do not want this to happen, (you seem to want nothing else in
>> your head buffer skb->head), you need to add some logic.
> 
> Ok. I should have given a little background.
> 
> 1: We send a message of 3041 bytes, inclusive TIPC header, via loopback interface.
> 
> 2: This one gets chopped up in three fragments: 1420, 1420,and 201 bytes.
>    (The mtu was of course wrong, but this is how I discovered the problem).
> 
> 3: First fragment is received, uncloned, and serves as head.
> 
> 4;  Second fragment (a clone) is received. skb_try_coalesce() fails at
>     the skb_head_is_locked() test, because the buffer is a clone.
>     Because of this, we add the buffer to skb_shinfo(head)->frag_list
>     instead.
> 
> 5: Third fragment (also a clone) is received. Now, since we 

i.e., skb_try_coalesce(head, frag)

check for
>    space in tailroom of header before we do anything else, it slips
>    in there, and bypasses the already chained-up second segment.

More background: our reassembly code is based on the one found in 
ip_fragment.c::ip_frag_reasm(), which always first try to coalecse
a buffer with head. That is a bad idea, I guess, but I wonder why
they don't see this problem in ipv4.

> 
> Regards
> ///jon
> 
> 
>>
>> A helper temporarily setting head->tail = head->end would do it I guess.

That would work. Or just check skb_has_frag_list(head) first, and make
the call to skb_try_coalesce() conditional to the the result.
It just feels a little unnecessary, since that test is done inside
skb_try_coalesce() anyway.

///jon



>>
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: skb_try_coalesce bug?
  2014-04-22 20:35       ` Jon Maloy
  2014-04-22 21:28         ` Jon Maloy
@ 2014-04-22 21:29         ` Eric Dumazet
  2014-04-22 21:31           ` Jon Maloy
  2014-04-22 21:37           ` Eric Dumazet
  1 sibling, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2014-04-22 21:29 UTC (permalink / raw)
  To: Jon Maloy; +Cc: Erik Hugne, netdev

On Tue, 2014-04-22 at 16:35 -0400, Jon Maloy wrote:
> On 04/22/2014 04:05 PM, Eric Dumazet wrote:
> > On Tue, 2014-04-22 at 15:38 -0400, Jon Maloy wrote:
> > 
> >>
> >> In the case I encountered, our head buffer is linear (skb->data_len == 0),
> >> so it is the real tailroom value that is returned. An alas, that one is big
> >> enough to contain the last (small) fragment of the message.
> > 
> > 
> > Whole point of skb_try_coalesce() is to coalesce as much as possible,
> > without guarantee of keeping some sort of 'segments'
> > 
> > skb_try_coalesce - try to merge skb to prior one
> > 
> > If you do not want this to happen, (you seem to want nothing else in
> > your head buffer skb->head), you need to add some logic.
> 
> Ok. I should have given a little background.
> 
> 1: We send a message of 3041 bytes, inclusive TIPC header, via loopback interface.
> 
> 2: This one gets chopped up in three fragments: 1420, 1420,and 201 bytes.
>    (The mtu was of course wrong, but this is how I discovered the problem).
> 
> 3: First fragment is received, uncloned, and serves as head.
> 
> 4;  Second fragment (a clone) is received. skb_try_coalesce() fails at
>     the skb_head_is_locked() test, because the buffer is a clone.
>     Because of this, we add the buffer to skb_shinfo(head)->frag_list
>     instead.

Then if you do that, you also need to change head->data_len !

> 
> 5: Third fragment (also a clone) is received. Now, since we check for
>    space in tailroom of header before we do anything else, it slips
>    in there, and bypasses the already chained-up second segment.

Only because you failed to tell the world @head was no longer linear.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: skb_try_coalesce bug?
  2014-04-22 21:29         ` Eric Dumazet
@ 2014-04-22 21:31           ` Jon Maloy
  2014-04-22 21:37           ` Eric Dumazet
  1 sibling, 0 replies; 12+ messages in thread
From: Jon Maloy @ 2014-04-22 21:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Erik Hugne, netdev

On 04/22/2014 05:29 PM, Eric Dumazet wrote:
> On Tue, 2014-04-22 at 16:35 -0400, Jon Maloy wrote:
>> On 04/22/2014 04:05 PM, Eric Dumazet wrote:
>>> On Tue, 2014-04-22 at 15:38 -0400, Jon Maloy wrote:
>>>
>>>>
>>>> In the case I encountered, our head buffer is linear (skb->data_len == 0),
>>>> so it is the real tailroom value that is returned. An alas, that one is big
>>>> enough to contain the last (small) fragment of the message.
>>>
>>>
>>> Whole point of skb_try_coalesce() is to coalesce as much as possible,
>>> without guarantee of keeping some sort of 'segments'
>>>
>>> skb_try_coalesce - try to merge skb to prior one
>>>
>>> If you do not want this to happen, (you seem to want nothing else in
>>> your head buffer skb->head), you need to add some logic.
>>
>> Ok. I should have given a little background.
>>
>> 1: We send a message of 3041 bytes, inclusive TIPC header, via loopback interface.
>>
>> 2: This one gets chopped up in three fragments: 1420, 1420,and 201 bytes.
>>    (The mtu was of course wrong, but this is how I discovered the problem).
>>
>> 3: First fragment is received, uncloned, and serves as head.
>>
>> 4;  Second fragment (a clone) is received. skb_try_coalesce() fails at
>>     the skb_head_is_locked() test, because the buffer is a clone.
>>     Because of this, we add the buffer to skb_shinfo(head)->frag_list
>>     instead.
> 
> Then if you do that, you also need to change head->data_len !
> 
>>
>> 5: Third fragment (also a clone) is received. Now, since we check for
>>    space in tailroom of header before we do anything else, it slips
>>    in there, and bypasses the already chained-up second segment.
> 
> Only because you failed to tell the world @head was no longer linear.

Ok. Thank you. Lesson learned.

///jon

> 
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: skb_try_coalesce bug?
  2014-04-22 21:29         ` Eric Dumazet
  2014-04-22 21:31           ` Jon Maloy
@ 2014-04-22 21:37           ` Eric Dumazet
  2014-04-23 16:56             ` Jon Maloy
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2014-04-22 21:37 UTC (permalink / raw)
  To: Jon Maloy; +Cc: Erik Hugne, netdev

On Tue, 2014-04-22 at 14:29 -0700, Eric Dumazet wrote:

> Then if you do that, you also need to change head->data_len !

Untested patch would be :

diff --git a/net/tipc/link.c b/net/tipc/link.c
index c5190ab75290..85077dd7c63e 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -2349,6 +2349,7 @@ int tipc_link_frag_rcv(struct sk_buff **head, struct sk_buff **tail,
 			(*tail)->next = frag;
 		*tail = frag;
 		(*head)->truesize += frag->truesize;
+		(*head)->data_len += frag->len;
 	}
 	if (fragid == LAST_FRAGMENT) {
 		*fbuf = *head;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: skb_try_coalesce bug?
  2014-04-22 21:37           ` Eric Dumazet
@ 2014-04-23 16:56             ` Jon Maloy
  2014-04-23 17:33               ` David Miller
  0 siblings, 1 reply; 12+ messages in thread
From: Jon Maloy @ 2014-04-23 16:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Erik Hugne, netdev

On 04/22/2014 05:37 PM, Eric Dumazet wrote:
> On Tue, 2014-04-22 at 14:29 -0700, Eric Dumazet wrote:
> 
>> Then if you do that, you also need to change head->data_len !
> 
> Untested patch would be :
> 
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index c5190ab75290..85077dd7c63e 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -2349,6 +2349,7 @@ int tipc_link_frag_rcv(struct sk_buff **head, struct sk_buff **tail,
>  			(*tail)->next = frag;
>  		*tail = frag;
>  		(*head)->truesize += frag->truesize;
> +		(*head)->data_len += frag->len;

Just to confirm, does this mean that head's own (linear) data is not
included in data_len? 

///jon

>  	}
>  	if (fragid == LAST_FRAGMENT) {
>  		*fbuf = *head;
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: skb_try_coalesce bug?
  2014-04-23 16:56             ` Jon Maloy
@ 2014-04-23 17:33               ` David Miller
  2014-04-23 17:54                 ` Jon Maloy
  0 siblings, 1 reply; 12+ messages in thread
From: David Miller @ 2014-04-23 17:33 UTC (permalink / raw)
  To: jon.maloy; +Cc: eric.dumazet, erik.hugne, netdev

From: Jon Maloy <jon.maloy@ericsson.com>
Date: Wed, 23 Apr 2014 12:56:20 -0400

> On 04/22/2014 05:37 PM, Eric Dumazet wrote:
>> On Tue, 2014-04-22 at 14:29 -0700, Eric Dumazet wrote:
>> 
>>> Then if you do that, you also need to change head->data_len !
>> 
>> Untested patch would be :
>> 
>> diff --git a/net/tipc/link.c b/net/tipc/link.c
>> index c5190ab75290..85077dd7c63e 100644
>> --- a/net/tipc/link.c
>> +++ b/net/tipc/link.c
>> @@ -2349,6 +2349,7 @@ int tipc_link_frag_rcv(struct sk_buff **head, struct sk_buff **tail,
>>  			(*tail)->next = frag;
>>  		*tail = frag;
>>  		(*head)->truesize += frag->truesize;
>> +		(*head)->data_len += frag->len;
> 
> Just to confirm, does this mean that head's own (linear) data is not
> included in data_len? 

For a given SKB, skb->len is the entire length of the packet, fragments and
all.

skb->data_len counts the sum of all of the page and SKB based fragments, ie.
all bytes which are not in the top-level SKBs linear area.

So the linear length is always "skb->len - skb->data_len", and this is exactly
what skb_headlen() does.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: skb_try_coalesce bug?
  2014-04-23 17:33               ` David Miller
@ 2014-04-23 17:54                 ` Jon Maloy
  0 siblings, 0 replies; 12+ messages in thread
From: Jon Maloy @ 2014-04-23 17:54 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, erik.hugne, netdev

On 04/23/2014 01:33 PM, David Miller wrote:
> From: Jon Maloy <jon.maloy@ericsson.com>
> Date: Wed, 23 Apr 2014 12:56:20 -0400
> 
>> On 04/22/2014 05:37 PM, Eric Dumazet wrote:
>>> On Tue, 2014-04-22 at 14:29 -0700, Eric Dumazet wrote:
>>>
>>>> Then if you do that, you also need to change head->data_len !
>>>
>>> Untested patch would be :
>>>
>>> diff --git a/net/tipc/link.c b/net/tipc/link.c
>>> index c5190ab75290..85077dd7c63e 100644
>>> --- a/net/tipc/link.c
>>> +++ b/net/tipc/link.c
>>> @@ -2349,6 +2349,7 @@ int tipc_link_frag_rcv(struct sk_buff **head, struct sk_buff **tail,
>>>  			(*tail)->next = frag;
>>>  		*tail = frag;
>>>  		(*head)->truesize += frag->truesize;
>>> +		(*head)->data_len += frag->len;
>>
>> Just to confirm, does this mean that head's own (linear) data is not
>> included in data_len? 
> 
> For a given SKB, skb->len is the entire length of the packet, fragments and
> all.
> 
> skb->data_len counts the sum of all of the page and SKB based fragments, ie.
> all bytes which are not in the top-level SKBs linear area.
> 
> So the linear length is always "skb->len - skb->data_len", and this is exactly
> what skb_headlen() does.
> 

Thank you for the clarification. We'll have to fix this. I am just puzzled that
our defragmentation algorithm has worked as well as it has until now.

///jon

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-04-23 17:54 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-04-22 12:01 skb_try_coalesce bug? Erik Hugne
2014-04-22 13:11 ` Eric Dumazet
2014-04-22 19:38   ` Jon Maloy
2014-04-22 20:05     ` Eric Dumazet
2014-04-22 20:35       ` Jon Maloy
2014-04-22 21:28         ` Jon Maloy
2014-04-22 21:29         ` Eric Dumazet
2014-04-22 21:31           ` Jon Maloy
2014-04-22 21:37           ` Eric Dumazet
2014-04-23 16:56             ` Jon Maloy
2014-04-23 17:33               ` David Miller
2014-04-23 17:54                 ` Jon Maloy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).