From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jon Maloy <jon.maloy@ericsson.com>
Subject: Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for
 connectionless sockets
Date: Mon, 10 Dec 2012 03:46:08 -0500
Message-ID: <50C5A150.3090102@ericsson.com>
References: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com> <1354890498-6448-4-git-send-email-paul.gortmaker@windriver.com> <20121207192030.GA30339@hmsreliant.think-freely.org> <50C26DF3.90409@ericsson.com> <20121209165020.GA4362@neilslaptop.think-freely.org> <50C580E6.7030905@windriver.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit
Cc: Neil Horman <nhorman@tuxdriver.com>,
	Paul Gortmaker <paul.gortmaker@windriver.com>,
	David Miller <davem@davemloft.net>, <netdev@vger.kernel.org>
To: Ying Xue <ying.xue@windriver.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from imr3.ericy.com ([198.24.6.13]:40114 "EHLO imr3.ericy.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751064Ab2LJIqb (ORCPT <rfc822;netdev@vger.kernel.org>);
	Mon, 10 Dec 2012 03:46:31 -0500
In-Reply-To: <50C580E6.7030905@windriver.com>
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On 12/10/2012 01:27 AM, Ying Xue wrote:
> Neil Horman wrote:
>> On Fri, Dec 07, 2012 at 05:30:11PM -0500, Jon Maloy wrote:
>>   
>>> On 12/07/2012 02:20 PM, Neil Horman wrote:
>>>     
>>>> On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
>>>>       
>>>>> From: Ying Xue <ying.xue@windriver.com>
>>>>>
>>>>> The sk_receive_queue limit control is currently performed for
>>>>> all arriving messages, disregarding socket and message type.
>>>>> But for connected sockets this check is redundant, since the protocol
>>>>> flow control already makes queue overflow impossible.
>>>>>
>>>>>         
>>>> Can you explain where that occurs?  
>>>>       
>>> It happens in the functions port_dispatcher_sigh() and tipc_send(), 
>>> among other places. Both are to be found in the file port.c, which 
>>> was supposed to contain the 'generic' (i.e., API independent) part 
>>> of the send/receive code.
>>> Now that we have only one API left, the socket API, we are 
>>> planning to merge the code in socket.c and port.c, and get rid of 
>>> some code overhead.
>>>
>>> The flow control in TIPC is message based: the sender
>>> must receive an explicit acknowledge message for every
>>> 512 messages the receiver reads to user space.
>>> If the sender has more than 1024 messages outstanding without having
>>> received an acknowledge, it will be suspended or receive EAGAIN until
>>> it does.
>>> The plan going forward is to replace this mechanism with a more 
>>> standard looking byte based flow control, while maintaining 
>>> backwards compatibility.
>>>
>>>     
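
The window mechanism described above can be sketched roughly as follows. This is an illustration only, not the code in port.c: the struct and function names are hypothetical, the two constants are simply the figures quoted above, and whether the 1024 boundary is inclusive is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Figures from the description above; the names are hypothetical. */
#define FLOW_ACK_INTERVAL 512u   /* receiver acks every 512 messages read */
#define FLOW_WINDOW_LIMIT 1024u  /* sender stops at this many unacked messages */

struct flow_sender {
	uint32_t unacked;        /* messages sent but not yet acknowledged */
};

struct flow_receiver {
	uint32_t read_since_ack; /* messages read to user space since last ack */
};

/* Returns true if the message may be sent; false means suspend or EAGAIN. */
static bool flow_try_send(struct flow_sender *s)
{
	if (s->unacked >= FLOW_WINDOW_LIMIT)
		return false;
	s->unacked++;
	return true;
}

/* Called when the receiver hands one message to user space.
 * Returns true when an acknowledge message should be sent back. */
static bool flow_msg_read(struct flow_receiver *r)
{
	if (++r->read_since_ack < FLOW_ACK_INTERVAL)
		return false;
	r->read_since_ack = 0;
	return true;
}

/* Called on the sender when an acknowledge arrives. */
static void flow_ack_received(struct flow_sender *s)
{
	s->unacked = (s->unacked > FLOW_ACK_INTERVAL) ?
		     s->unacked - FLOW_ACK_INTERVAL : 0;
}
```

Note that the receive queue stays bounded only as long as the sender actually honours the result of flow_try_send().
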
>> Ok, That makes more sense, thank you.  Although I still don't think this is
>> safe (but the problem may not be solely introduced by this patch).  Using a
>> global limit that assumes the sender will block when the congestion window is
>> reached just doesn't seem sane to me.  It clearly works with the Linux
>> implementation, as it conforms to your expectations, but an alternate
>> implementation could create a DOS situation by simply ignoring the window limit,
>> and continuing to send.  I see that we drop frames over the global limit in
>> filter_rcv, and that the check in rx_queue_full raises that limit based on the
>> value of msg_importance(msg).  However, the threshold is ignored if the value of
>> msg_importance is invalid.  All a sender needs to do is flood a receiver with
>> frames containing an invalid set of message importance bits, and you will queue
>> frames indefinitely.  In fact, that will also happen if you send messages of
>> CRITICAL importance, so you don't even need to supply an invalid value
>> here.
>>
>>   
> 
> You are absolutely right. I will correct these drawbacks in the next version.

I think we should simply drop this patch. It introduces a major vulnerability,
as Neil correctly points out, and we will have to rework this code anyway.
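
To make the problem concrete, here is a rough sketch of the kind of check Neil describes, next to one possible bounded variant. The enum, the function names and the multipliers are illustrative assumptions, not the actual rx_queue_full code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical importance levels, for illustration only. */
enum importance { IMP_LOW, IMP_MEDIUM, IMP_HIGH, IMP_CRITICAL };

/* Mirrors the structure described in this thread: no threshold is
 * applied to CRITICAL or to out-of-range importance values.
 * The multipliers are assumed values. */
static bool queue_full_unbounded(unsigned int imp, uint32_t queue_len,
				 uint32_t base)
{
	uint32_t threshold;

	if (imp == IMP_LOW)
		threshold = base;
	else if (imp == IMP_MEDIUM)
		threshold = base * 2;
	else if (imp == IMP_HIGH)
		threshold = base * 100;
	else
		return false; /* never "full": the queue can grow without limit */

	return queue_len >= threshold;
}

/* One possible fix: every importance value maps to a finite threshold. */
static bool queue_full_bounded(unsigned int imp, uint32_t queue_len,
			       uint32_t base)
{
	if (imp > IMP_CRITICAL)
		imp = IMP_LOW; /* treat invalid values as lowest importance */
	return queue_len >= base * (imp + 1);
}
```

With the first variant, a peer that ignores the send window and marks its messages CRITICAL (or uses an invalid value) is never rejected; the second variant rejects it once the finite limit is reached.
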

> 
>>>> I see where the tipc dispatch function calls
>>>> sk_add_backlog, which checks the per socket receive queue (regardless of whether
>>>> the receiving socket is connection oriented or connectionless), but if the
>>>> receiver doesn't call receive very often, this just adds a check against your
>>>> global limit, doing nothing for your per-socket limits. 
>>>>       
>>> OVERLOAD_LIMIT_BASE is tested against a per-socket message counter, so it _is_
>>> our per-socket limit. In fact, TIPC connectionless overflow control currently 
>>> is a kind of a hybrid, based on a message counter when the socket is not locked, 
>>> and based on sk_rcv_queue's byte limit when a message has to be added to the 
>>> backlog.
>>> We are planning to fix this inconsistency too.
>>>     
>> Good, thank you,  that was seeming quite wrong to me.
>>
>>   
>>>> In fact it seems to
>>>> repeat the same check twice, as in the worst case of the incoming message being
>>>> TIPC_LOW_IMPORTANCE, it's just going to check that the global limit is exactly
>>>> OVERLOAD_LIMIT_BASE/2 again.
>>>>       
>>> Yes, you are right. The intention is that only the first test, 
>>> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {..}
>>> will be run for the vast majority of messages, since we must assume
>>> that there is no overload most of the time.
>>> An inelegant optimization, perhaps, but not logically wrong.
>>>     
>> No, not logically wrong, but not an optimization either.  With this change,
>> your only use of rx_queue_full passes OVERLOAD_LIMIT_BASE/2 as the base value to
>> rx_queue_full, and then you do some multiplication based on that.  

It is still in the "unlikely" (in fact, very unlikely) branch, and the multiplication
is by two, i.e. just a left-shift operation. Our approach was rather to let the
compiler decide about inlining, which in this case might be suboptimal.

>> If you really
>> want to optimize this, leave OVERLOAD_LIMIT_BASE where it is (rather than
>> doubling it like this patch series does), mark rx_queue_full as inline, and just
>> pass OVERLOAD_LIMIT_BASE as the argument, it will save you a division operation,
>> the conditional branch and a call instruction.  If you add a multiplication
>> factor table, you can eliminate the if/else clauses in rx_queue_full as well.
>>
>>   
> 
> Good suggestion with a factor table. Maybe it's unnecessary to 
> explicitly mark rx_queue_full as inline. Currently we let the
> compiler decide whether a function is inlined or not.
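
For reference, the factor table idea could look roughly like this. It is a sketch under assumptions: the table values, the value of OVERLOAD_LIMIT_BASE and the function name rx_queue_full_table are all hypothetical, chosen only to show how a lookup replaces the if/else chain.

```c
#include <stdbool.h>
#include <stdint.h>

#define OVERLOAD_LIMIT_BASE 5000u /* assumed value, for illustration */
#define IMPORTANCE_LEVELS   4u

/* Hypothetical per-importance multipliers, indexed by importance level. */
static const uint32_t overload_factor[IMPORTANCE_LEVELS] = { 1, 2, 100, 200 };

/* Table lookup instead of an if/else chain. Out-of-range importance
 * values fall back to the lowest level, so a limit always applies. */
static inline bool rx_queue_full_table(unsigned int imp, uint32_t queue_len)
{
	if (imp >= IMPORTANCE_LEVELS)
		imp = 0;
	return queue_len >= OVERLOAD_LIMIT_BASE * overload_factor[imp];
}
```
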

One approach I had in mind was to just left-shift OVERLOAD_LIMIT_BASE by the
message importance, and compare that to the per-socket counter. This way,
we obtain the limit set [10000,20000,40000,80000] without having to read
data memory. The limits will not be the same as now, but probably good
enough. We don't even need a separate function for this check.
This is something we should look into when we move on to make this mechanism
byte-based.
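
A minimal sketch of the shift-based check, assuming a base of 10000 and importance values 0..3. The helper names are hypothetical, and the clamping of out-of-range values is an addition of the sketch. Since a left shift doubles the limit at each level, the four limits evaluate to 10000, 20000, 40000 and 80000.

```c
#include <stdbool.h>
#include <stdint.h>

#define OVERLOAD_LIMIT_BASE 10000u /* assumed value, for illustration */
#define MAX_IMPORTANCE      3u     /* assumed highest importance level */

/* Limit for a given importance: base << importance.
 * Out-of-range values are clamped to the lowest level. */
static inline uint32_t overload_limit(unsigned int imp)
{
	if (imp > MAX_IMPORTANCE)
		imp = 0;
	return OVERLOAD_LIMIT_BASE << imp;
}

/* Compare the per-socket message counter against the shifted limit. */
static inline bool rcv_queue_over_limit(unsigned int imp, uint32_t msg_count)
{
	return msg_count >= overload_limit(imp);
}
```
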

> 
> Regards,
> Ying
> 
>> Neil
>>
>>   
>>> ///jon
>>>
>>>     
>>>> Neil
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>       
>>>     
>