From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jon Maloy
Subject: Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
Date: Mon, 10 Dec 2012 03:46:08 -0500
Message-ID: <50C5A150.3090102@ericsson.com>
References: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>
 <1354890498-6448-4-git-send-email-paul.gortmaker@windriver.com>
 <20121207192030.GA30339@hmsreliant.think-freely.org>
 <50C26DF3.90409@ericsson.com>
 <20121209165020.GA4362@neilslaptop.think-freely.org>
 <50C580E6.7030905@windriver.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit
Cc: Neil Horman , Paul Gortmaker , David Miller ,
To: Ying Xue
Return-path:
Received: from imr3.ericy.com ([198.24.6.13]:40114 "EHLO imr3.ericy.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1751064Ab2LJIqb (ORCPT ); Mon, 10 Dec 2012 03:46:31 -0500
In-Reply-To: <50C580E6.7030905@windriver.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

On 12/10/2012 01:27 AM, Ying Xue wrote:
> Neil Horman wrote:
>> On Fri, Dec 07, 2012 at 05:30:11PM -0500, Jon Maloy wrote:
>>
>>> On 12/07/2012 02:20 PM, Neil Horman wrote:
>>>
>>>> On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
>>>>
>>>>> From: Ying Xue
>>>>>
>>>>> The sk_receive_queue limit control is currently performed for
>>>>> all arriving messages, disregarding socket and message type.
>>>>> But for connected sockets this check is redundant, since the protocol
>>>>> flow control already makes queue overflow impossible.
>>>>>
>>>>>
>>>> Can you explain where that occurs?
>>>>
>>> It happens in the functions port_dispatcher_sigh() and tipc_send(),
>>> among other places. Both are to be found in the file port.c, which
>>> was supposed to contain the 'generic' (i.e., API independent) part
>>> of the send/receive code.
>>> Now that we have only one API left, the socket API, we are
>>> planning to merge the code in socket.c and port.c, and get rid of
>>> some code overhead.
>>>
>>> The flow control in TIPC is message based, where the sender
>>> requires an explicit acknowledge message for every
>>> 512 messages the receiver reads to user space.
>>> If the sender has more than 1024 messages outstanding without having
>>> received an acknowledge he will be suspended or receive EAGAIN until
>>> he does.
>>> The plan going forward is to replace this mechanism with a more
>>> standard looking byte based flow control, while maintaining
>>> backwards compatibility.
>>>
>>>
>> Ok, that makes more sense, thank you. Although I still don't think this is
>> safe (but the problem may not be solely introduced by this patch). Using a
>> global limit that assumes the sender will block when the congestion window is
>> reached just doesn't seem sane to me. It clearly works with the Linux
>> implementation, as it conforms to your expectations, but an alternate
>> implementation could create a DOS situation by simply ignoring the window limit,
>> and continuing to send. I see that we drop frames over the global limit in
>> filter_rcv, but the check in rx_queue_full bumps up that limit based on the
>> value of msg_importance(msg), but that threshold is ignored if the value of
>> msg_importance is invalid. All a sender needs to do is flood a receiver with
>> frames containing an invalid set of message importance bits, and you will queue
>> frames indefinitely. In fact that will also happen if you send messages of
>> CRITICAL importance as well, so you don't even need to supply an invalid value
>> here.
>>
>>
>
> You are absolutely right. I will correct these drawbacks in the next version.

I think we should rather just drop this patch. We introduce a major
vulnerability, as Neil correctly points out.
We will anyway have to do a rework of this code.
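
To make the problem Neil describes concrete, here is a minimal user-space
sketch of a per-importance queue check of the kind discussed above. It is a
model only, not the kernel source: the enum values, the name
queue_full_model() and the multipliers are assumptions made for illustration.
The point is the final else branch, where CRITICAL or out-of-range importance
values get no threshold at all.

```c
#include <stdbool.h>

/* Importance levels; the numeric encoding is assumed for this sketch. */
enum importance { LOW = 0, MEDIUM = 1, HIGH = 2, CRITICAL = 3 };

/*
 * Model of a per-importance queue check as described in the thread.
 * The multipliers are illustrative assumptions, not taken from the patch.
 */
static bool queue_full_model(unsigned int imp, unsigned int queue_len,
			     unsigned int base)
{
	unsigned int threshold;

	if (imp == LOW)
		threshold = base;
	else if (imp == MEDIUM)
		threshold = base * 2;
	else if (imp == HIGH)
		threshold = base * 4;
	else
		return false;	/* CRITICAL or invalid: never reported as full */

	return queue_len >= threshold;
}
```

With a base of 5000, queue_full_model(LOW, 6000, 5000) reports the queue as
full, while queue_full_model(CRITICAL, 1000000, 5000) and
queue_full_model(7, 1000000, 5000) do not, which is the unbounded queueing
case raised in the review.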
>
>>>> I see where the tipc dispatch function calls
>>>> sk_add_backlog, which checks the per socket receive queue (regardless of whether
>>>> the receiving socket is connection oriented or connectionless), but if the
>>>> receiver doesn't call receive very often, this just adds a check against your
>>>> global limit, doing nothing for your per-socket limits.
>>>>
>>> OVERLOAD_LIMIT_BASE is tested against a per-socket message counter, so it _is_
>>> our per-socket limit. In fact, TIPC connectionless overflow control currently
>>> is a kind of a hybrid, based on a message counter when the socket is not locked,
>>> and based on sk_rcv_queue's byte limit when a message has to be added to the
>>> backlog.
>>> We are planning to fix this inconsistency too.
>>>
>> Good, thank you, that was seeming quite wrong to me.
>>
>>
>>>> In fact it seems to
>>>> repeat the same check twice, as in the worst case of the incoming message being
>>>> TIPC_LOW_IMPORTANCE, it's just going to check that the global limit is exactly
>>>> OVERLOAD_LIMIT_BASE/2 again.
>>>>
>>> Yes, you are right. The intention is that only the first test,
>>> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {..}
>>> will be run for the vast majority of messages, since we must assume
>>> that there is no overload most of the time.
>>> An inelegant optimization, perhaps, but not logically wrong.
>>>
>> No, not logically wrong, but not an optimization either. With this change,
>> your only use of rx_queue_full passes OVERLOAD_LIMIT_BASE/2 as the base value to
>> rx_queue_full, and then you do some multiplication based on that.

It is still in the "unlikely" (in fact, very unlikely) branch. And the
multiplication is by two, i.e. just a left-shift operation.
Our approach was rather to let the compiler decide about inlining, which
in this case might be a sub-optimization.
>> If you really
>> want to optimize this, leave OVERLOAD_LIMIT_BASE where it is (rather than
>> doubling it like this patch series does), mark rx_queue_full as inline, and just
>> pass OVERLOAD_LIMIT_BASE as the argument, it will save you a division operation,
>> the conditional branch and a call instruction. If you add a multiplication
>> factor table, you can eliminate the if/else clauses in rx_queue_full as well.
>>
>>
>
> Good suggestion with a factor table. Maybe it's unnecessary to
> explicitly mark rx_queue_full as inline. Currently it sounds like we let
> the compiler decide whether a function is defined as inline or not.

One approach I had in mind was to just left-shift OVERLOAD_LIMIT_BASE with
message priority, and compare that to the per-socket counter.
This way, we obtain the limit set [10000,20000,30000,40000] without having
to read data memory. The limits will not be the same as now, but probably
good enough. We don't even need a separate function for this check.

Something we should look into when we move on to make this mechanism
byte-based.

>
> Regards,
> Ying
>
>> Neil
>>
>>
>>> ///jon
>>>
>>>
>>>> Neil
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>
>>
>>
>
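
The shift-based limit proposed in the reply above can be sketched as follows.
This is a user-space illustration under stated assumptions, not the TIPC
code: LIMIT_BASE, MAX_IMPORTANCE and the function names are hypothetical, the
base of 10000 is only implied by the first value of the quoted limit set, and
importance is assumed to be encoded as 0..3. Note that a plain left shift
doubles the limit per level (10000, 20000, 40000, 80000), whereas the evenly
spaced set quoted in the message corresponds to multiplying by
(importance + 1); both variants are shown.

```c
#include <stdbool.h>

/* Assumed for illustration: base of 10000, importance encoded as 0..3. */
#define LIMIT_BASE	10000u
#define MAX_IMPORTANCE	3u

/* Variant A: left shift, the limit doubles per importance level. */
static unsigned int limit_shift(unsigned int imp)
{
	return LIMIT_BASE << imp;
}

/* Variant B: multiply, the limits are evenly spaced. */
static unsigned int limit_mul(unsigned int imp)
{
	return LIMIT_BASE * (imp + 1);
}

/*
 * True when the message should be rejected. Out-of-range importance
 * values are rejected outright (a choice made for this sketch), so no
 * message escapes the limit.
 */
static bool over_limit(unsigned int imp, unsigned int queued)
{
	if (imp > MAX_IMPORTANCE)
		return true;
	return queued >= limit_shift(imp);
}
```

Neither variant needs a lookup table, and because every valid importance
level maps to a finite limit, the unbounded queueing case raised in the
review does not arise in this model.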