From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754360Ab3FGMGe (ORCPT );
	Fri, 7 Jun 2013 08:06:34 -0400
Received: from endeavour.telenet.dn.ua ([195.39.211.45]:48795 "EHLO endeavour.telenet.dn.ua"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752311Ab3FGMGd (ORCPT );
	Fri, 7 Jun 2013 08:06:33 -0400
X-Greylist: delayed 631 seconds by postgrey-1.27 at vger.kernel.org; Fri, 07 Jun 2013 08:06:33 EDT
Message-ID: <51B1CA50.30702@telenet.dn.ua>
Date: Fri, 07 Jun 2013 14:56:00 +0300
From: "Vitaly V. Bursov"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130106 Thunderbird/17.0.2
MIME-Version: 1.0
To: linux-kernel@vger.kernel.org
Subject: Scaling problem with a lot of AF_PACKET sockets on different interfaces
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hello,

I have a Linux router with a lot of interfaces (hundreds or thousands of
VLANs) and an application that creates one AF_PACKET socket per interface
and bind()s each socket to its interface. Each socket also has a BPF
filter attached.

The problem is observed on linux-3.8.13, but as far as I can see from the
source, the latest version behaves the same way.
I noticed that the box has strange performance problems, with most of the
CPU time spent in __netif_receive_skb:

 86.15%  [k] __netif_receive_skb
  1.41%  [k] _raw_spin_lock
  1.09%  [k] fib_table_lookup
  0.99%  [k] local_bh_enable_ip

and this is the assembly with the "hot spot":

       │      shr    $0x8,%r15w
       │      and    $0xf,%r15d
  0.00 │      shl    $0x4,%r15
       │      add    $0xffffffff8165ec80,%r15
       │      mov    (%r15),%rax
  0.09 │      mov    %rax,0x28(%rsp)
       │      mov    0x28(%rsp),%rbp
  0.01 │      sub    $0x28,%rbp
       │      jmp    5c7
  1.72 │5b0:  mov    0x28(%rbp),%rax
  0.05 │      mov    0x18(%rsp),%rbx
  0.00 │      mov    %rax,0x28(%rsp)
  0.03 │      mov    0x28(%rsp),%rbp
  5.67 │      sub    $0x28,%rbp
  1.71 │5c7:  lea    0x28(%rbp),%rax
  1.73 │      cmp    %r15,%rax
       │      je     640
  1.74 │      cmp    %r14w,0x0(%rbp)
       │      jne    5b0
 81.36 │      mov    0x8(%rbp),%rax
  2.74 │      cmp    %rax,%r8
       │      je     5eb
  1.37 │      cmp    0x20(%rbx),%rax
       │      je     5eb
  1.39 │      cmp    %r13,%rax
       │      jne    5b0
  0.04 │5eb:  test   %r12,%r12
  0.04 │      je     6f4
       │      mov    0xc0(%rbx),%eax
       │      mov    0xc8(%rbx),%rdx
       │      testb  $0x8,0x1(%rdx,%rax,1)
       │      jne    6d5

This corresponds to the handler lookup in net/core/dev.c:

	type = skb->protocol;
	list_for_each_entry_rcu(ptype,
			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
		if (ptype->type == type &&
		    (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
		     ptype->dev == orig_dev)) {
			if (pt_prev)
				ret = deliver_skb(skb, pt_prev, orig_dev);
			pt_prev = ptype;
		}
	}

This works perfectly OK until there are a lot of AF_PACKET sockets, since
each socket adds a protocol entry to the ptype list:

# cat /proc/net/ptype
Type Device      Function
0800 eth2.1989   packet_rcv+0x0/0x400
0800 eth2.1987   packet_rcv+0x0/0x400
0800 eth2.1986   packet_rcv+0x0/0x400
0800 eth2.1990   packet_rcv+0x0/0x400
0800 eth2.1995   packet_rcv+0x0/0x400
0800 eth2.1997   packet_rcv+0x0/0x400
.......
0800 eth2.1004   packet_rcv+0x0/0x400
0800             ip_rcv+0x0/0x310
0011             llc_rcv+0x0/0x3a0
0004             llc_rcv+0x0/0x3a0
0806             arp_rcv+0x0/0x150

And this obviously results in a huge performance penalty. By the looks of
it, ptype_all suffers from the same problem.
Probably one way to fix this is to perform the interface matching in the
af_packet handler itself, but there could be other cases and other
protocols affected. Ideas are welcome :)

--
Thanks,
Vitaly