From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jason Wang
Subject: Re: TUN problems (regression?)
Date: Fri, 04 Jan 2013 13:04:21 +0800
Message-ID: <50E662D5.8010007@redhat.com>
References: <4151394.nMo40zlg68@sifl>
 <1356046697.21834.3606.camel@edumazet-glaptop>
 <20121220155001.538bbdb0@nehalam.linuxnetplumber.net>
 <50D3D85B.1070605@redhat.com>
 <1356061179.21834.4515.camel@edumazet-glaptop>
 <50D3E510.6020008@redhat.com>
 <20121227164106.078604a8@nehalam.linuxnetplumber.net>
 <50DD319A.5000708@redhat.com>
 <20121227222513.394d8234@nehalam.linuxnetplumber.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: Eric Dumazet, Paul Moore, netdev@vger.kernel.org
To: Stephen Hemminger
Received: from mx1.redhat.com ([209.132.183.28]:48742 "EHLO mx1.redhat.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1750837Ab3ADFE1 (ORCPT ); Fri, 4 Jan 2013 00:04:27 -0500
In-Reply-To: <20121227222513.394d8234@nehalam.linuxnetplumber.net>
Sender: netdev-owner@vger.kernel.org

On 12/28/2012 02:25 PM, Stephen Hemminger wrote:
> On Fri, 28 Dec 2012 13:43:54 +0800
> Jason Wang wrote:
>
>> On 12/28/2012 08:41 AM, Stephen Hemminger wrote:
>>> On Fri, 21 Dec 2012 12:26:56 +0800
>>> Jason Wang wrote:
>>>
>>>> On 12/21/2012 11:39 AM, Eric Dumazet wrote:
>>>>> On Fri, 2012-12-21 at 11:32 +0800, Jason Wang wrote:
>>>>>> On 12/21/2012 07:50 AM, Stephen Hemminger wrote:
>>>>>>> On Thu, 20 Dec 2012 15:38:17 -0800
>>>>>>> Eric Dumazet wrote:
>>>>>>>
>>>>>>>> On Thu, 2012-12-20 at 18:16 -0500, Paul Moore wrote:
>>>>>>>>> [CC'ing netdev in case this is a known problem I just missed ...]
>>>>>>>>>
>>>>>>>>> Hi Jason,
>>>>>>>>>
>>>>>>>>> I started doing some more testing with the multiqueue TUN changes and I ran
>>>>>>>>> into a problem when running tunctl: running it once w/o arguments works as
>>>>>>>>> expected, but running it a second time results in failure and a
>>>>>>>>> kmem_cache_sanity_check() failure.
>>>>>>>>> The problem appears to be very repeatable
>>>>>>>>> on my test VM and happens independent of the LSM/SELinux fixup patches.
>>>>>>>>>
>>>>>>>>> Have you seen this before?
>>>>>>>>>
>>>>>>>> Obviously code in tun_flow_init() is wrong...
>>>>>>>>
>>>>>>>> static int tun_flow_init(struct tun_struct *tun)
>>>>>>>> {
>>>>>>>>         int i;
>>>>>>>>
>>>>>>>>         tun->flow_cache = kmem_cache_create("tun_flow_cache",
>>>>>>>>                                             sizeof(struct tun_flow_entry), 0, 0,
>>>>>>>>                                             NULL);
>>>>>>>>         if (!tun->flow_cache)
>>>>>>>>                 return -ENOMEM;
>>>>>>>> ...
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> I have no idea why we would need a kmem_cache per tun_struct,
>>>>>>>> and why we even need a kmem_cache.
>>>>>>> Normally flow malloc/free should be good enough.
>>>>>>> It might make sense to use private kmem_cache if doing hlist_nulls.
>>>>>>>
>>>>>>>
>>>>>>> Acked-by: Stephen Hemminger
>>>>>> Should be at least a global cache, I thought I can get some speed-up by
>>>>>> using kmem_cache.
>>>>>>
>>>>>> Acked-by: Jason Wang
>>>>> Was it with SLUB or SLAB ?
>>>>>
>>>>> Using generic kmalloc-64 is better than a dedicated kmem_cache of 48
>>>>> bytes per object, as we guarantee each object is on a single cache line.
>>>>>
>>>>>
>>>> Right, thanks for the explanation.
>>>>
>>> I wonder if TUN would be better if it used a array to translate
>>> receive hash to receive queue. This is how real hardware works with the
>>> indirection table, and it would allow RFS acceleration. The current flow
>>> cache stuff is prone to DoS attack and scaling problems with lots of
>>> short lived flows.
>> The problem of indirection table is hash collision which may even happen
>> when few flows existed.
> Hash collision is fine, as long as the the statistical average of
> hash across queue's is approximately equal it will be faster. A simple
> array indirection is much faster than walking a hash table.

True, but hash collisions can have negative effects, such as loss of
flow affinity and packet re-ordering in the guest, which do not occur
with a perfect filter. Maybe we can implement both and let the user
choose.

>
>> For the RFS, we can open a API/ioctl for userspace to add or remove a
>> flow cache.
> RFS acceleration relies on programming the table. It is easier if
> TUN looks more like hardware.
>
>> For the DoS/scaling issue, I have an idea of:
>> - limit the total number of flow entries in tun/tap
>> - only update the flow entry every N (say 20 like ixgbe) packets or the
>> the tcp packet has sync flag
>> - I'm not sure skb_get_rxhash() is lightweight enough, or change to more
>> lightweight one?
> Ideally the hash should be programmable L2 vs L3, but that is splitting
> hairs at this point.
>
> Flow tables are scaling problem, especially on highly loaded servers where
> they are most needed.