* Re: Scaling problem with a lot of AF_PACKET sockets on different interfaces
[not found] <51B1CA50.30702@telenet.dn.ua>
@ 2013-06-07 12:41 ` Mike Galbraith
2013-06-07 13:05 ` Daniel Borkmann
2013-06-07 13:30 ` David Laight
0 siblings, 2 replies; 9+ messages in thread
From: Mike Galbraith @ 2013-06-07 12:41 UTC (permalink / raw)
To: Vitaly V. Bursov; +Cc: linux-kernel, netdev
(CC's net-fu dojo)
On Fri, 2013-06-07 at 14:56 +0300, Vitaly V. Bursov wrote:
> Hello,
>
> I have a Linux router with a lot of interfaces (hundreds or
> thousands of VLANs) and an application that creates AF_PACKET
> socket per interface and bind()s sockets to interfaces.
>
> Each socket has a BPF filter attached too.
>
> The problem is observed on linux-3.8.13, but as far as I can see
> from the source, the latest version behaves similarly.
>
> I noticed that the box has strange performance problems, with
> most of the CPU time spent in __netif_receive_skb:
> 86.15% [k] __netif_receive_skb
> 1.41% [k] _raw_spin_lock
> 1.09% [k] fib_table_lookup
> 0.99% [k] local_bh_enable_ip
>
> and this is the assembly with the "hot spot":
> │ shr $0x8,%r15w
> │ and $0xf,%r15d
> 0.00 │ shl $0x4,%r15
> │ add $0xffffffff8165ec80,%r15
> │ mov (%r15),%rax
> 0.09 │ mov %rax,0x28(%rsp)
> │ mov 0x28(%rsp),%rbp
> 0.01 │ sub $0x28,%rbp
> │ jmp 5c7
> 1.72 │5b0: mov 0x28(%rbp),%rax
> 0.05 │ mov 0x18(%rsp),%rbx
> 0.00 │ mov %rax,0x28(%rsp)
> 0.03 │ mov 0x28(%rsp),%rbp
> 5.67 │ sub $0x28,%rbp
> 1.71 │5c7: lea 0x28(%rbp),%rax
> 1.73 │ cmp %r15,%rax
> │ je 640
> 1.74 │ cmp %r14w,0x0(%rbp)
> │ jne 5b0
> 81.36 │ mov 0x8(%rbp),%rax
> 2.74 │ cmp %rax,%r8
> │ je 5eb
> 1.37 │ cmp 0x20(%rbx),%rax
> │ je 5eb
> 1.39 │ cmp %r13,%rax
> │ jne 5b0
> 0.04 │5eb: test %r12,%r12
> 0.04 │ je 6f4
> │ mov 0xc0(%rbx),%eax
> │ mov 0xc8(%rbx),%rdx
> │ testb $0x8,0x1(%rdx,%rax,1)
> │ jne 6d5
>
> This corresponds to:
>
> net/core/dev.c:
> type = skb->protocol;
> list_for_each_entry_rcu(ptype,
> &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
> if (ptype->type == type &&
> (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
> ptype->dev == orig_dev)) {
> if (pt_prev)
> ret = deliver_skb(skb, pt_prev, orig_dev);
> pt_prev = ptype;
> }
> }
>
> Which works perfectly OK until there are a lot of AF_PACKET sockets, since
> each socket adds an entry to the ptype list:
>
> # cat /proc/net/ptype
> Type Device Function
> 0800 eth2.1989 packet_rcv+0x0/0x400
> 0800 eth2.1987 packet_rcv+0x0/0x400
> 0800 eth2.1986 packet_rcv+0x0/0x400
> 0800 eth2.1990 packet_rcv+0x0/0x400
> 0800 eth2.1995 packet_rcv+0x0/0x400
> 0800 eth2.1997 packet_rcv+0x0/0x400
> .......
> 0800 eth2.1004 packet_rcv+0x0/0x400
> 0800 ip_rcv+0x0/0x310
> 0011 llc_rcv+0x0/0x3a0
> 0004 llc_rcv+0x0/0x3a0
> 0806 arp_rcv+0x0/0x150
>
> And this obviously results in a huge performance penalty.
>
> ptype_all, by the looks, should be the same.
>
> Probably one way to fix this is to perform interface name matching in
> the af_packet handler, but there could be other cases, other protocols.
>
> Ideas are welcome :)
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Scaling problem with a lot of AF_PACKET sockets on different interfaces
2013-06-07 12:41 ` Scaling problem with a lot of AF_PACKET sockets on different interfaces Mike Galbraith
@ 2013-06-07 13:05 ` Daniel Borkmann
2013-06-07 14:17 ` Vitaly V. Bursov
2013-06-07 13:30 ` David Laight
1 sibling, 1 reply; 9+ messages in thread
From: Daniel Borkmann @ 2013-06-07 13:05 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Vitaly V. Bursov, linux-kernel, netdev
On 06/07/2013 02:41 PM, Mike Galbraith wrote:
> (CC's net-fu dojo)
>
> On Fri, 2013-06-07 at 14:56 +0300, Vitaly V. Bursov wrote:
>> Hello,
>>
>> I have a Linux router with a lot of interfaces (hundreds or
>> thousands of VLANs) and an application that creates AF_PACKET
>> socket per interface and bind()s sockets to interfaces.
>>
>> Each socket has a BPF filter attached too.
>>
>> The problem is observed on linux-3.8.13, but as far as I can see
>> from the source, the latest version behaves similarly.
>>
>> I noticed that the box has strange performance problems, with
>> most of the CPU time spent in __netif_receive_skb:
>> 86.15% [k] __netif_receive_skb
>> 1.41% [k] _raw_spin_lock
>> 1.09% [k] fib_table_lookup
>> 0.99% [k] local_bh_enable_ip
>>
>> and this is the assembly with the "hot spot":
>> │ shr $0x8,%r15w
>> │ and $0xf,%r15d
>> 0.00 │ shl $0x4,%r15
>> │ add $0xffffffff8165ec80,%r15
>> │ mov (%r15),%rax
>> 0.09 │ mov %rax,0x28(%rsp)
>> │ mov 0x28(%rsp),%rbp
>> 0.01 │ sub $0x28,%rbp
>> │ jmp 5c7
>> 1.72 │5b0: mov 0x28(%rbp),%rax
>> 0.05 │ mov 0x18(%rsp),%rbx
>> 0.00 │ mov %rax,0x28(%rsp)
>> 0.03 │ mov 0x28(%rsp),%rbp
>> 5.67 │ sub $0x28,%rbp
>> 1.71 │5c7: lea 0x28(%rbp),%rax
>> 1.73 │ cmp %r15,%rax
>> │ je 640
>> 1.74 │ cmp %r14w,0x0(%rbp)
>> │ jne 5b0
>> 81.36 │ mov 0x8(%rbp),%rax
>> 2.74 │ cmp %rax,%r8
>> │ je 5eb
>> 1.37 │ cmp 0x20(%rbx),%rax
>> │ je 5eb
>> 1.39 │ cmp %r13,%rax
>> │ jne 5b0
>> 0.04 │5eb: test %r12,%r12
>> 0.04 │ je 6f4
>> │ mov 0xc0(%rbx),%eax
>> │ mov 0xc8(%rbx),%rdx
>> │ testb $0x8,0x1(%rdx,%rax,1)
>> │ jne 6d5
>>
>> This corresponds to:
>>
>> net/core/dev.c:
>> type = skb->protocol;
>> list_for_each_entry_rcu(ptype,
>> &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>> if (ptype->type == type &&
>> (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
>> ptype->dev == orig_dev)) {
>> if (pt_prev)
>> ret = deliver_skb(skb, pt_prev, orig_dev);
>> pt_prev = ptype;
>> }
>> }
>>
>> Which works perfectly OK until there are a lot of AF_PACKET sockets, since
>> each socket adds an entry to the ptype list:
>>
>> # cat /proc/net/ptype
>> Type Device Function
>> 0800 eth2.1989 packet_rcv+0x0/0x400
>> 0800 eth2.1987 packet_rcv+0x0/0x400
>> 0800 eth2.1986 packet_rcv+0x0/0x400
>> 0800 eth2.1990 packet_rcv+0x0/0x400
>> 0800 eth2.1995 packet_rcv+0x0/0x400
>> 0800 eth2.1997 packet_rcv+0x0/0x400
>> .......
>> 0800 eth2.1004 packet_rcv+0x0/0x400
>> 0800 ip_rcv+0x0/0x310
>> 0011 llc_rcv+0x0/0x3a0
>> 0004 llc_rcv+0x0/0x3a0
>> 0806 arp_rcv+0x0/0x150
>>
>> And this obviously results in a huge performance penalty.
>>
>> ptype_all, by the looks, should be the same.
>>
>> Probably one way to fix this is to perform interface name matching in
>> the af_packet handler, but there could be other cases, other protocols.
>>
>> Ideas are welcome :)
Probably, that depends on _your scenario_ and/or BPF filter, but would it be
an alternative if you have only a few packet sockets (maybe one pinned to each
cpu) and cluster/load-balance them together via packet fanout? (Where you
bind the socket to ifindex 0, so that you get traffic from all devs...) That
would at least avoid that "hot spot", and you could post-process the interface
via sockaddr_ll. But I'd agree that this will not solve the actual problem you've
observed. ;-)
* RE: Scaling problem with a lot of AF_PACKET sockets on different interfaces
2013-06-07 12:41 ` Scaling problem with a lot of AF_PACKET sockets on different interfaces Mike Galbraith
2013-06-07 13:05 ` Daniel Borkmann
@ 2013-06-07 13:30 ` David Laight
2013-06-07 13:54 ` Eric Dumazet
1 sibling, 1 reply; 9+ messages in thread
From: David Laight @ 2013-06-07 13:30 UTC (permalink / raw)
To: Mike Galbraith, Vitaly V. Bursov; +Cc: linux-kernel, netdev
> > I have a Linux router with a lot of interfaces (hundreds or
> > thousands of VLANs) and an application that creates AF_PACKET
> > socket per interface and bind()s sockets to interfaces.
...
> > I noticed that box has strange performance problems with
> > most of the CPU time spent in __netif_receive_skb:
> > 86.15% [k] __netif_receive_skb
> > 1.41% [k] _raw_spin_lock
> > 1.09% [k] fib_table_lookup
> > 0.99% [k] local_bh_enable_ip
...
> > This corresponds to:
> >
> > net/core/dev.c:
> > type = skb->protocol;
> > list_for_each_entry_rcu(ptype,
> > &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
> > if (ptype->type == type &&
> > (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
> > ptype->dev == orig_dev)) {
> > if (pt_prev)
> > ret = deliver_skb(skb, pt_prev, orig_dev);
> > pt_prev = ptype;
> > }
> > }
> >
> > Which works perfectly OK until there are a lot of AF_PACKET sockets, since
> > the socket adds a protocol to ptype list:
Presumably the 'ethertype' is the same for all the sockets?
(And probably the '& PTYPE_HASH_MASK' doesn't separate it from 0800
or 0806 (IIRC IP and ARP))
How often is that deliver_skb() inside the loop called?
If the code could be arranged so that the scan loop didn't contain
a function call then the loop code would be a lot faster since
the compiler can cache values in registers.
While that would speed the code up somewhat, there would still be a
significant cost to iterate 1000+ times.
Looks like the ptype_base[] should be per 'dev'?
Or just put entries where ptype->dev != null_or_dev on a per-interface
list and do two searches?
David
* RE: Scaling problem with a lot of AF_PACKET sockets on different interfaces
2013-06-07 13:30 ` David Laight
@ 2013-06-07 13:54 ` Eric Dumazet
2013-06-07 14:09 ` David Laight
0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2013-06-07 13:54 UTC (permalink / raw)
To: David Laight; +Cc: Mike Galbraith, Vitaly V. Bursov, linux-kernel, netdev
On Fri, 2013-06-07 at 14:30 +0100, David Laight wrote:
> Looks like the ptype_base[] should be per 'dev'?
> Or just put entries where ptype->dev != null_or_dev on a per-interface
> list and do two searches?
Yes, but then we would have two searches instead of one in the fast path.
ptype_base[] is currently 16 slots, 256 bytes on x86_64.
Presumably the per-device list could be a single list, instead of a hash
table, but still...
If the application creating hundreds or thousands of AF_PACKET sockets is
a single process, I really question why a single AF_PACKET socket was not
chosen.
We now have a FANOUT capability on AF_PACKET, so that it's scalable to
millions of packets per second.
I would rather try this way before adding yet another section in
__netif_receive_skb().
* RE: Scaling problem with a lot of AF_PACKET sockets on different interfaces
2013-06-07 13:54 ` Eric Dumazet
@ 2013-06-07 14:09 ` David Laight
2013-06-07 14:30 ` Eric Dumazet
0 siblings, 1 reply; 9+ messages in thread
From: David Laight @ 2013-06-07 14:09 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Mike Galbraith, Vitaly V. Bursov, linux-kernel, netdev
> > Looks like the ptype_base[] should be per 'dev'?
> > Or just put entries where ptype->dev != null_or_dev on a per-interface
> > list and do two searches?
>
> Yes, but then we would have two searches instead of one in fast path.
Usually it would be empty - so the search would be very quick!
David
* Re: Scaling problem with a lot of AF_PACKET sockets on different interfaces
2013-06-07 13:05 ` Daniel Borkmann
@ 2013-06-07 14:17 ` Vitaly V. Bursov
2013-06-07 14:33 ` Daniel Borkmann
0 siblings, 1 reply; 9+ messages in thread
From: Vitaly V. Bursov @ 2013-06-07 14:17 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: Mike Galbraith, linux-kernel, netdev
07.06.2013 16:05, Daniel Borkmann wrote:
> On 06/07/2013 02:41 PM, Mike Galbraith wrote:
>> (CC's net-fu dojo)
>>
>> On Fri, 2013-06-07 at 14:56 +0300, Vitaly V. Bursov wrote:
>>> Hello,
>>>
>>> I have a Linux router with a lot of interfaces (hundreds or
>>> thousands of VLANs) and an application that creates AF_PACKET
>>> socket per interface and bind()s sockets to interfaces.
>>>
>>> Each socket has a BPF filter attached too.
>>>
>>> The problem is observed on linux-3.8.13, but as far as I can see
>>> from the source, the latest version behaves similarly.
>>>
>>> I noticed that the box has strange performance problems, with
>>> most of the CPU time spent in __netif_receive_skb:
>>> 86.15% [k] __netif_receive_skb
>>> 1.41% [k] _raw_spin_lock
>>> 1.09% [k] fib_table_lookup
>>> 0.99% [k] local_bh_enable_ip
>>>
>>> and this is the assembly with the "hot spot":
>>> │ shr $0x8,%r15w
>>> │ and $0xf,%r15d
>>> 0.00 │ shl $0x4,%r15
>>> │ add $0xffffffff8165ec80,%r15
>>> │ mov (%r15),%rax
>>> 0.09 │ mov %rax,0x28(%rsp)
>>> │ mov 0x28(%rsp),%rbp
>>> 0.01 │ sub $0x28,%rbp
>>> │ jmp 5c7
>>> 1.72 │5b0: mov 0x28(%rbp),%rax
>>> 0.05 │ mov 0x18(%rsp),%rbx
>>> 0.00 │ mov %rax,0x28(%rsp)
>>> 0.03 │ mov 0x28(%rsp),%rbp
>>> 5.67 │ sub $0x28,%rbp
>>> 1.71 │5c7: lea 0x28(%rbp),%rax
>>> 1.73 │ cmp %r15,%rax
>>> │ je 640
>>> 1.74 │ cmp %r14w,0x0(%rbp)
>>> │ jne 5b0
>>> 81.36 │ mov 0x8(%rbp),%rax
>>> 2.74 │ cmp %rax,%r8
>>> │ je 5eb
>>> 1.37 │ cmp 0x20(%rbx),%rax
>>> │ je 5eb
>>> 1.39 │ cmp %r13,%rax
>>> │ jne 5b0
>>> 0.04 │5eb: test %r12,%r12
>>> 0.04 │ je 6f4
>>> │ mov 0xc0(%rbx),%eax
>>> │ mov 0xc8(%rbx),%rdx
>>> │ testb $0x8,0x1(%rdx,%rax,1)
>>> │ jne 6d5
>>>
>>> This corresponds to:
>>>
>>> net/core/dev.c:
>>> type = skb->protocol;
>>> list_for_each_entry_rcu(ptype,
>>> &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>>> if (ptype->type == type &&
>>> (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
>>> ptype->dev == orig_dev)) {
>>> if (pt_prev)
>>> ret = deliver_skb(skb, pt_prev, orig_dev);
>>> pt_prev = ptype;
>>> }
>>> }
>>>
>>> Which works perfectly OK until there are a lot of AF_PACKET sockets, since
>>> each socket adds an entry to the ptype list:
>>>
>>> # cat /proc/net/ptype
>>> Type Device Function
>>> 0800 eth2.1989 packet_rcv+0x0/0x400
>>> 0800 eth2.1987 packet_rcv+0x0/0x400
>>> 0800 eth2.1986 packet_rcv+0x0/0x400
>>> 0800 eth2.1990 packet_rcv+0x0/0x400
>>> 0800 eth2.1995 packet_rcv+0x0/0x400
>>> 0800 eth2.1997 packet_rcv+0x0/0x400
>>> .......
>>> 0800 eth2.1004 packet_rcv+0x0/0x400
>>> 0800 ip_rcv+0x0/0x310
>>> 0011 llc_rcv+0x0/0x3a0
>>> 0004 llc_rcv+0x0/0x3a0
>>> 0806 arp_rcv+0x0/0x150
>>>
>>> And this obviously results in a huge performance penalty.
>>>
>>> ptype_all, by the looks, should be the same.
>>>
>>> Probably one way to fix this is to perform interface name matching in
>>> the af_packet handler, but there could be other cases, other protocols.
>>>
>>> Ideas are welcome :)
>
> Probably, that depends on _your scenario_ and/or BPF filter, but would it be
> an alternative if you have only a few packet sockets (maybe one pinned to each
> cpu) and cluster/load-balance them together via packet fanout? (Where you
> bind the socket to ifindex 0, so that you get traffic from all devs...) That
> would at least avoid that "hot spot", and you could post-process the interface
> via sockaddr_ll. But I'd agree that this will not solve the actual problem you've
> observed. ;-)
I wasn't aware of the ifindex 0 thing, it can help, thanks! Of course, if it'll
work for me (the application is a custom DHCP server) it'll surely
increase the overhead of BPF (I don't need to tap the traffic from all
interfaces); there are vlans, bridges and bonds, so the server will likely
receive the same packets multiple times, and replies must be sent too...
but it still should be faster.
I just checked isc-dhcpd-V3.1.3 running on multiple interfaces
(another system with 2.6.32):
$ cat /proc/net/ptype
Type Device Function
ALL eth0 packet_rcv_spkt+0x0/0x190
ALL eth0.10 packet_rcv_spkt+0x0/0x190
ALL eth0.11 packet_rcv_spkt+0x0/0x190
....
As I understand it, this will hit the following code:
list_for_each_entry_rcu(ptype, &ptype_all, list) {
if (!ptype->dev || ptype->dev == skb->dev) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
}
}
which scales the same.
Thanks.
* RE: Scaling problem with a lot of AF_PACKET sockets on different interfaces
2013-06-07 14:09 ` David Laight
@ 2013-06-07 14:30 ` Eric Dumazet
0 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2013-06-07 14:30 UTC (permalink / raw)
To: David Laight; +Cc: Mike Galbraith, Vitaly V. Bursov, linux-kernel, netdev
On Fri, 2013-06-07 at 15:09 +0100, David Laight wrote:
> > > Looks like the ptype_base[] should be per 'dev'?
> > > Or just put entries where ptype->dev != null_or_dev on a per-interface
> > > list and do two searches?
> >
> > Yes, but then we would have two searches instead of one in fast path.
>
> Usually it would be empty - so the search would be very quick!
quick + quick + quick + quick + quick == not so quick ;)
Plus adding more code for /proc/net/packet
Plus adding the xmit side (AF_PACKET captures receive and xmit)
* Re: Scaling problem with a lot of AF_PACKET sockets on different interfaces
2013-06-07 14:17 ` Vitaly V. Bursov
@ 2013-06-07 14:33 ` Daniel Borkmann
2013-06-10 6:34 ` Vitaly V. Bursov
0 siblings, 1 reply; 9+ messages in thread
From: Daniel Borkmann @ 2013-06-07 14:33 UTC (permalink / raw)
To: Vitaly V. Bursov; +Cc: Mike Galbraith, linux-kernel, netdev
On 06/07/2013 04:17 PM, Vitaly V. Bursov wrote:
> 07.06.2013 16:05, Daniel Borkmann wrote:
[...]
>>>> Ideas are welcome :)
>>
>> Probably, that depends on _your scenario_ and/or BPF filter, but would it be
>> an alternative if you have only a few packet sockets (maybe one pinned to each
>> cpu) and cluster/load-balance them together via packet fanout? (Where you
>> bind the socket to ifindex 0, so that you get traffic from all devs...) That
>> would at least avoid that "hot spot", and you could post-process the interface
>> via sockaddr_ll. But I'd agree that this will not solve the actual problem you've
>> observed. ;-)
>
> I wasn't aware of the ifindex 0 thing, it can help, thanks! Of course, if it'll
> work for me (the application is a custom DHCP server) it'll surely
> increase the overhead of BPF (I don't need to tap the traffic from all
> interfaces); there are vlans, bridges and bonds, so the server will likely
> receive the same packets multiple times, and replies must be sent too...
> but it still should be faster.
Well, as already said, if you use a fanout socket group, then you won't receive the
_exact_ same packet twice. Rather, packets are balanced by different policies among
your packet sockets in that group. What you could do is to have (e.g.) a single BPF
filter (JITed) for all those sockets that lets needed packets pass; you can then
access the interface they came from via sockaddr_ll, which is further processed
in your fast path (or dropped depending on the iface). There's also a BPF extension
(BPF_S_ANC_IFINDEX) that lets you load the ifindex of the skb into the BPF accumulator,
so you could also filter early from there for a range of ifindexes (in combination
with binding the sockets to ifindex 0). Probably that could work.
* Re: Scaling problem with a lot of AF_PACKET sockets on different interfaces
2013-06-07 14:33 ` Daniel Borkmann
@ 2013-06-10 6:34 ` Vitaly V. Bursov
0 siblings, 0 replies; 9+ messages in thread
From: Vitaly V. Bursov @ 2013-06-10 6:34 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: Mike Galbraith, linux-kernel, netdev
07.06.2013 17:33, Daniel Borkmann wrote:
> On 06/07/2013 04:17 PM, Vitaly V. Bursov wrote:
>> 07.06.2013 16:05, Daniel Borkmann wrote:
> [...]
>>>>> Ideas are welcome :)
>>>
>>> Probably, that depends on _your scenario_ and/or BPF filter, but would it be
>>> an alternative if you have only a few packet sockets (maybe one pinned to each
>>> cpu) and cluster/load-balance them together via packet fanout? (Where you
>>> bind the socket to ifindex 0, so that you get traffic from all devs...) That
>>> would at least avoid that "hot spot", and you could post-process the interface
>>> via sockaddr_ll. But I'd agree that this will not solve the actual problem you've
>>> observed. ;-)
>>
>> I wasn't aware of the ifindex 0 thing, it can help, thanks! Of course, if it'll
>> work for me (the application is a custom DHCP server) it'll surely
>> increase the overhead of BPF (I don't need to tap the traffic from all
>> interfaces); there are vlans, bridges and bonds, so the server will likely
>> receive the same packets multiple times, and replies must be sent too...
>> but it still should be faster.
>
> Well, as already said, if you use a fanout socket group, then you won't receive the
> _exact_ same packet twice. Rather, packets are balanced by different policies among
> your packet sockets in that group. What you could do is to have (e.g.) a single BPF
> filter (JITed) for all those sockets that lets needed packets pass; you can then
> access the interface they came from via sockaddr_ll, which is further processed
> in your fast path (or dropped depending on the iface). There's also a BPF extension
> (BPF_S_ANC_IFINDEX) that lets you load the ifindex of the skb into the BPF accumulator,
> so you could also filter early from there for a range of ifindexes (in combination
> with binding the sockets to ifindex 0). Probably that could work.
Thanks everybody, this should help a lot.
--
Vitaly
Thread overview: 9+ messages
[not found] <51B1CA50.30702@telenet.dn.ua>
2013-06-07 12:41 ` Scaling problem with a lot of AF_PACKET sockets on different interfaces Mike Galbraith
2013-06-07 13:05 ` Daniel Borkmann
2013-06-07 14:17 ` Vitaly V. Bursov
2013-06-07 14:33 ` Daniel Borkmann
2013-06-10 6:34 ` Vitaly V. Bursov
2013-06-07 13:30 ` David Laight
2013-06-07 13:54 ` Eric Dumazet
2013-06-07 14:09 ` David Laight
2013-06-07 14:30 ` Eric Dumazet