* Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
@ 2004-03-30 14:23 Yusuf Goolamabbas
2004-04-03 23:02 ` jamal
0 siblings, 1 reply; 14+ messages in thread
From: Yusuf Goolamabbas @ 2004-03-30 14:23 UTC (permalink / raw)
To: netdev
Maybe this is of interest to netdev hackers:
Luca Deri's paper
Improving Passive Packet Capture: Beyond Device Polling
http://www.net-security.org/dl/articles/Ring.pdf
Regards, Yusuf
* Re: Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
2004-03-30 14:23 Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling Yusuf Goolamabbas
@ 2004-04-03 23:02 ` jamal
2004-04-05 16:03 ` Jason Lunz
0 siblings, 1 reply; 14+ messages in thread
From: jamal @ 2004-04-03 23:02 UTC (permalink / raw)
To: Yusuf Goolamabbas; +Cc: netdev
On Tue, 2004-03-30 at 09:23, Yusuf Goolamabbas wrote:
> Maybe this might be of interest to netdev hackers
>
> Luca Deri's paper
> Improving Passive Packet Capture: Beyond Device Polling
>
> http://www.net-security.org/dl/articles/Ring.pdf
>
Thanks for posting this.
I am a little surprised at some of the results, especially with
NAPI + packet mmap.
I have had no problem at all handling wire-rate 100 Mbps with
tulip-NAPI in 2.4.x using packet mmap on a P3, or maybe it was a P2; that
is zero drops at 148.8 Kpps.
Jason Lunz actually seems to have been doing more work on this and
the e1000 - he could provide better performance numbers.
It should also be noted that, in fact, packet mmap already uses rings.
cheers,
jamal
* Re: Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
2004-04-03 23:02 ` jamal
@ 2004-04-05 16:03 ` Jason Lunz
2004-04-06 10:30 ` P
2004-04-06 14:18 ` jamal
0 siblings, 2 replies; 14+ messages in thread
From: Jason Lunz @ 2004-04-05 16:03 UTC (permalink / raw)
To: netdev
hadi@cyberus.ca said:
> Jason Lunz actually seemed to have been doing more work on this and
> e1000 - he could provide better perfomance numbers.
Well, not really. What I have is still available at:
http://gtf.org/lunz/linux/net/perf/
...but those are mainly measurements of very outdated versions of the
e1000 napi driver backported to 2.4, running on 1.8 GHz Xeon systems.
That work hasn't really been kept up to date, I'm afraid.
> It should also be noted that infact packet mmap already uses rings.
Yes, I read the paper (but not his code). What stood out to me is that
the description of his custom socket implementation matches exactly what
packet-mmap already is.
I noticed he only mentioned testing of libpcap-mmap, but did not use
mmap packet sockets directly -- maybe there's something about libpcap
that limits performance? I haven't looked.
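For reference, setting up the existing PF_PACKET ring from userland looks
roughly like this (a hedged sketch along the lines of
Documentation/networking/packet_mmap.txt; error handling omitted, ring sizes
arbitrary, and the ifindex would come from if_nametoindex()):

    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    int open_rx_ring(int ifindex, void **ring_out, size_t *ring_len)
    {
        /* protocol 0 here: nothing is delivered until the ring exists
         * and we bind() below, so packets cannot race the ring setup  */
        int fd = socket(PF_PACKET, SOCK_RAW, 0);

        struct tpacket_req req;
        memset(&req, 0, sizeof(req));
        req.tp_block_size = 4096;             /* one page per block    */
        req.tp_block_nr   = 64;
        req.tp_frame_size = 2048;             /* two frames per block  */
        req.tp_frame_nr   = (req.tp_block_size / req.tp_frame_size)
                            * req.tp_block_nr;
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

        /* the kernel copies frames straight into this area; userland
         * just polls the tp_status word in each frame header instead
         * of paying a recvfrom() syscall per packet                   */
        *ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
        *ring_out = mmap(NULL, *ring_len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        struct sockaddr_ll ll;
        memset(&ll, 0, sizeof(ll));
        ll.sll_family   = AF_PACKET;
        ll.sll_protocol = htons(ETH_P_ALL);   /* now start receiving   */
        ll.sll_ifindex  = ifindex;
        bind(fd, (struct sockaddr *)&ll, sizeof(ll));
        return fd;
    }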
What I can say for sure is that the napi + packet-mmap performance with
many small packets is almost surely limited by problems with irq/softirq
load. There was an excellent thread last week about this with Andrea
Arcangeli, Robert Olsson and others about the balancing of softirq and
userspace load; they eventually were beginning to agree that running
softirqs on return from hardirq and bh was a bigger load than expected
when there was lots of napi work to do. So despite NAPI, too much kernel
time is spent handling (soft)irq load with many small packets.
It appears this problem became worse in 2.6 with HZ=1000, because now
the napi rx softirq work is being done 10X as much on return from the
timer interrupt. I'm not sure if a solution was reached.
Jason
* Re: Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
2004-04-05 16:03 ` Jason Lunz
@ 2004-04-06 10:30 ` P
2004-04-06 12:25 ` Luca Deri
2004-04-06 14:18 ` jamal
1 sibling, 1 reply; 14+ messages in thread
From: P @ 2004-04-06 10:30 UTC (permalink / raw)
To: Jason Lunz; +Cc: netdev, cpw, luca.deri
Jason Lunz wrote:
> hadi@cyberus.ca said:
>
>>Jason Lunz actually seemed to have been doing more work on this and
>>e1000 - he could provide better perfomance numbers.
>
>
> Well, not really. What I have is still available at:
>
> http://gtf.org/lunz/linux/net/perf/
>
> ...but those are mainly measurements of very outdated versions of the
> e1000 napi driver backported to 2.4, running on 1.8Ghz Xeon systems.
> That work hasn't really been kept up to date, I'm afraid.
>
>
>>It should also be noted that infact packet mmap already uses rings.
>
>
> Yes, I read the paper (but not his code). What stood out to me is that
> the description of his custom socket implementation matches exactly what
> packet-mmap already is.
>
> I noticed he only mentioned testing of libpcap-mmap, but did not use
> mmap packet sockets directly -- maybe there's something about libpcap
> that limits performance? I haven't looked.
That's my experience. I'm thinking of redoing libpcap-mmap completely,
as it has a huge amount of statistics handling cluttering the fast path.
Also, the ring gets corrupted if packets are being received while
the ring buffer is being set up.
I've a patch for http://public.lanl.gov/cpw/libpcap-0.8.030808.tar.gz
here: http://www.pixelbeat.org/patches/libpcap-0.8.030808-pb.diff
(you need to compile with PB defined)
Note this only addresses the speed issue.
Also there are newer versions of libpcap-mmap available which I
haven't looked at yet.
> What I can say for sure is that the napi + packet-mmap performance with
> many small packets is almost surely limited by problems with irq/softirq
> load. There was an excellent thread last week about this with Andrea
> Arcangeli, Robert Olsson and others about the balancing of softirq and
> userspace load; they eventually were beginning to agree that running
> softirqs on return from hardirq and bh was a bigger load than expected
> when there was lots of napi work to do. So despite NAPI, too much kernel
> time is spent handling (soft)irq load with many small packets.
agreed.
> It appears this problem became worse in 2.6 with HZ=1000, because now
> the napi rx softirq work is being done 10X as much on return from the
> timer interrupt. I'm not sure if a solution was reached.
Pádraig.
* Re: Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
2004-04-06 10:30 ` P
@ 2004-04-06 12:25 ` Luca Deri
2004-04-06 16:01 ` Jason Lunz
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Luca Deri @ 2004-04-06 12:25 UTC (permalink / raw)
To: P; +Cc: Jason Lunz, netdev, cpw, ntop-misc
Hi all,
the problem with libpcap-mmap is that:
- it does not reduce the journey of the packet from NIC to userland,
except for the last part of it (a syscall replaced with mmap). This has a
negative impact on the overall performance.
- it does not offer things like kernel packet sampling, which forces
people to fetch all the packets off a NIC and then discard most of them
(i.e. CPU cycles not very well spent). In part this is a limitation of
pcap itself, which has no pcap_sample call.
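To make the point concrete, here is a minimal sketch of the userland
sampling that stock libpcap forces on you (the interface name "eth0" and the
1-in-100 ratio are purely illustrative): every packet still makes the whole
trip to userland before most of them are thrown away.

    #include <pcap.h>
    #include <stdio.h>

    #define SAMPLE_ONE_IN 100        /* illustrative ratio */

    static unsigned long seen, kept;

    /* every packet reaching this callback has already made the full trip
     * NIC -> kernel -> userland; most of them are dropped right here    */
    static void handler(u_char *user, const struct pcap_pkthdr *h,
                        const u_char *pkt)
    {
        (void)user; (void)h; (void)pkt;
        if (seen++ % SAMPLE_ONE_IN)
            return;                  /* cycles already spent, packet gone */
        kept++;                      /* only this one is actually used    */
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *p = pcap_open_live("eth0", 96, 1, 100, errbuf);
        if (!p) { fprintf(stderr, "%s\n", errbuf); return 1; }
        pcap_loop(p, -1, handler, NULL);
        printf("seen %lu, kept %lu\n", seen, kept);
        pcap_close(p);
        return 0;
    }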
In addition, if you do care about performance, I believe you're willing
to turn off packet transmission and only do packet receive.
Unfortunately I have no access to a "real" traffic generator (I use a PC
as a traffic generator). However, if you read my paper you can see that
with a 1.7 GHz Pentium IV you can capture over 500,000 pkt/sec, so in your
setup (Xeon + Spirent) you should see even better figures.
IRQ: Linux has far too much latency, in particular at high speeds. I'm
not the right person to say "this is the way to go"; however, I
believe that we need some sort of interrupt prioritization like RTIRQ provides.
FYI, I've just polished the code and added kernel packet filtering to
PF_RING. As soon as I have completed my tests I will release a new version.
Finally, it would be nice to have some packet capture improvements in
the standard Linux core. They could be based either on my work or on somebody
else's work. It doesn't really matter as long as Linux gets faster.
Cheers, Luca
P@draigBrady.com wrote:
> Jason Lunz wrote:
>
>> hadi@cyberus.ca said:
>>
>>> Jason Lunz actually seemed to have been doing more work on this and
>>> e1000 - he could provide better perfomance numbers.
>>
>>
>>
>> Well, not really. What I have is still available at:
>>
>> http://gtf.org/lunz/linux/net/perf/
>>
>> ...but those are mainly measurements of very outdated versions of the
>> e1000 napi driver backported to 2.4, running on 1.8Ghz Xeon systems.
>> That work hasn't really been kept up to date, I'm afraid.
>>
>>
>>> It should also be noted that infact packet mmap already uses rings.
>>
>>
>>
>> Yes, I read the paper (but not his code). What stood out to me is that
>> the description of his custom socket implementation matches exactly what
>> packet-mmap already is.
>>
>> I noticed he only mentioned testing of libpcap-mmap, but did not use
>> mmap packet sockets directly -- maybe there's something about libpcap
>> that limits performance? I haven't looked.
>
>
> That's my experience. I'm thinking of redoing libpcap-mmap completely
> as it has huge amounts of statistics messing in the fast path.
> Also the ring gets corrupted if packets are being received while
> the ring buffer is being setup.
>
> I've a patch for http://public.lanl.gov/cpw/libpcap-0.8.030808.tar.gz
> here: http://www.pixelbeat.org/patches/libpcap-0.8.030808-pb.diff
> (you need to compile with PB defined)
> Note this only addresses the speed issue.
> Also there are newer versions of libpcap-mmap available which I
> haven't looked at yet.
>
>> What I can say for sure is that the napi + packet-mmap performance with
>> many small packets is almost surely limited by problems with irq/softirq
>> load. There was an excellent thread last week about this with Andrea
>> Arcangeli, Robert Olsson and others about the balancing of softirq and
>> userspace load; they eventually were beginning to agree that running
>> softirqs on return from hardirq and bh was a bigger load than expected
>> when there was lots of napi work to do. So despite NAPI, too much kernel
>> time is spent handling (soft)irq load with many small packets.
>
>
> agreed.
>
>> It appears this problem became worse in 2.6 with HZ=1000, because now
>> the napi rx softirq work is being done 10X as much on return from the
>> timer interrupt. I'm not sure if a solution was reached.
>
>
> Pádraig
> .
--
Luca Deri <deri@ntop.org> http://luca.ntop.org/
Hacker: someone who loves to program and enjoys being
clever about it - Richard Stallman
* Re: Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
2004-04-05 16:03 ` Jason Lunz
2004-04-06 10:30 ` P
@ 2004-04-06 14:18 ` jamal
2004-04-06 15:31 ` Robert Olsson
1 sibling, 1 reply; 14+ messages in thread
From: jamal @ 2004-04-06 14:18 UTC (permalink / raw)
To: Jason Lunz; +Cc: netdev, Luca Deri, Robert Olsson
On Mon, 2004-04-05 at 12:03, Jason Lunz wrote:
> Yes, I read the paper (but not his code). What stood out to me is that
> the description of his custom socket implementation matches exactly what
> packet-mmap already is.
The problem is that when you first glance at the paper you think it's
something new. So that piece is a little misleading.
> I noticed he only mentioned testing of libpcap-mmap, but did not use
> mmap packet sockets directly -- maybe there's something about libpcap
> that limits performance? I haven't looked.
More than likely turbo packet was used, from some of Alexey's old patches -
now obsolete.
> What I can say for sure is that the napi + packet-mmap performance with
> many small packets is almost surely limited by problems with irq/softirq
> load. There was an excellent thread last week about this with Andrea
> Arcangeli, Robert Olsson and others about the balancing of softirq and
> userspace load; they eventually were beginning to agree that running
> softirqs on return from hardirq and bh was a bigger load than expected
> when there was lots of napi work to do. So despite NAPI, too much kernel
> time is spent handling (soft)irq load with many small packets.
I didn't follow that discussion; archived for later entertaining reading.
My take on it was that it is 2.6.x-related, and in particular that the
misbehavior observed has to do with the use of RCU in the route cache.
> It appears this problem became worse in 2.6 with HZ=1000, because now
> the napi rx softirq work is being done 10X as much on return from the
> timer interrupt. I'm not sure if a solution was reached.
Robert?
cheers,
jamal
* Re: Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
2004-04-06 14:18 ` jamal
@ 2004-04-06 15:31 ` Robert Olsson
2004-04-07 7:03 ` Luca Deri
0 siblings, 1 reply; 14+ messages in thread
From: Robert Olsson @ 2004-04-06 15:31 UTC (permalink / raw)
To: hadi; +Cc: Jason Lunz, netdev, Luca Deri, Robert Olsson
jamal writes:
> I didnt follow that discussion; archived for later entertaining reading.
> My take on it was it is 2.6.x related and in particular the misbehavior
> observed has to do with use of rcu in the route cache.
>
> > It appears this problem became worse in 2.6 with HZ=1000, because now
> > the napi rx softirq work is being done 10X as much on return from the
> > timer interrupt. I'm not sure if a solution was reached.
>
> Robert?
Well, it's a general problem of controlling softirq versus userland, and the
RCU locking put this on our agenda, as the dst hash was among the first
applications to use RCU locking - which in turn had problems making progress
in the hard-softirq environment that occurs during a route cache DoS.
NAPI is part of RX_SOFTIRQ, which is well-behaved. NAPI addresses only
the irq/softirq problem and is totally innocent of do_softirq() runs from
other parts of the kernel causing userland starvation.
Under normal high-load conditions RX_SOFTIRQ reschedules itself when the
netdev_max_backlog quota is used up. do_softirq sees this and defers execution
to ksoftirqd, and things come under (scheduler) control.
During a route DoS, code that does a lot of do_softirq() is run for hash and
fib lookups, GC, etc. The effect is that ksoftirqd is more or less bypassed.
Again it's a general problem... We are just the unlucky guys getting
into this.
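In code terms, the loop looks roughly like this (a simplified paraphrase,
not verbatim kernel source; the helpers below stand in for the real softirq
machinery):

    #define MAX_SOFTIRQ_RESTART 10

    extern unsigned long local_softirq_pending_sketch(void);
    extern void run_pending_handlers(unsigned long pending);
    extern void wakeup_softirqd_sketch(void);

    void do_softirq_sketch(void)
    {
        int restart = MAX_SOFTIRQ_RESTART;
        unsigned long pending = local_softirq_pending_sketch();

        while (pending && restart--) {
            /* NET_RX_SOFTIRQ -> net_rx_action(), which re-raises itself
             * once its netdev_max_backlog quota is eaten, so under load
             * we keep looping here                                      */
            run_pending_handlers(pending);
            pending = local_softirq_pending_sketch();
        }

        /* only when the loop gives up is the work handed to ksoftirqd,
         * and only then can the scheduler balance it against userland.
         * Every other do_softirq() call site in the kernel runs the
         * handlers inline as above - that is the bypass during a
         * route-cache DoS.                                              */
        if (pending)
            wakeup_softirqd_sketch();
    }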
I don't know if the packet capture tests done by Luca ran into these problems.
A profile could have helped...
Cheers.
--ro
* Re: Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
2004-04-06 12:25 ` Luca Deri
@ 2004-04-06 16:01 ` Jason Lunz
2004-04-06 18:40 ` Ben Greear
[not found] ` <1081262228.1046.25.camel@jzny.localdomain>
[not found] ` <E1BAt0s-0003V8-00@crown.reflexsecurity.com>
2 siblings, 1 reply; 14+ messages in thread
From: Jason Lunz @ 2004-04-06 16:01 UTC (permalink / raw)
To: netdev
deri@ntop.org said:
> In addition if you do care about performance, I believe you're willing
> to turn off packet transmission and only do packet receive.
I don't understand what you mean by this. packet-mmap works perfectly
well on an UP|PROMISC interface with no addresses bound to it. As long
as no packets are injected through a packet socket, the tx path never
gets involved.
> IRQ: Linux has far too much latency, in particular at high speeds. I'm
> not the right person who can say "this is the way to go", however I
> believe that we need some sort of interrupt prioritization like RTIRQ
> does.
I don't think this is the problem, since small-packet performance is bad
even with a fully-polling e1000 in NAPI mode. As Robert Olsson has
demonstrated, a highly-loaded napi e1000 only generates a few hundred
interrupts per second. So the vast majority of packets received are
coming in without a hardware interrupt occurring at all.
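For reference, the dev->poll() contract of that era looks roughly like this
(a simplified sketch; the helper names are placeholders, not real e1000
functions) - the driver stays in polled mode and no rx interrupt fires until
the ring drains:

    static int sketch_poll(struct net_device *dev, int *budget)
    {
        int work_to_do = (*budget < dev->quota) ? *budget : dev->quota;
        /* drain up to work_to_do frames from the rx ring, feeding each
         * one to netif_receive_skb() - no hardware interrupt for any   */
        int work_done  = drain_rx_ring(dev, work_to_do);

        *budget    -= work_done;
        dev->quota -= work_done;

        if (rx_ring_empty(dev)) {
            netif_rx_complete(dev);      /* leave the poll list...      */
            enable_rx_interrupts(dev);   /* ...and re-arm the interrupt */
            return 0;
        }
        return 1;                        /* still busy: stay polled     */
    }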
Could it be that each time a hw irq _is_ generated, it causes many
packets to be lost? That's a possibility. Can you explain in more detail
how you measured the effect of interrupt latency on receive efficiency?
> Finally It would be nice to have in the standard Linux core some
> packet capture improvements. It could either be based on my work or on
> somebody else's work. It doesn't really matter as long as Linux gets
> faster.
I agree. I think a good place to start would be reading and
understanding this thread:
http://thread.gmane.org/gmane.linux.kernel/193758
There's some disagreement for a while about where all this softirq load
is coming from. It looks like an interaction of softirqs and RCU, but
the first patch doesn't help. Finally Olsson pointed out:
http://article.gmane.org/gmane.linux.kernel/194412
that the majority of softirqs are being run from hardirq exit. Even
with NAPI. At this point, I think, it's clear that the problem exists
regardless of rcu, and indeed, Linux is bad at doing packet-mmap RX of a
small-packet gigabit flood on both 2.4 and 2.6 (my old 2.4 measurements
earlier in this thread show this).
I'm particularly interested in trying Andrea's suggestion from
http://article.gmane.org/gmane.linux.kernel/194486 , but I won't have
the time anytime soon.
Jason
* Re: Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
2004-04-06 16:01 ` Jason Lunz
@ 2004-04-06 18:40 ` Ben Greear
0 siblings, 0 replies; 14+ messages in thread
From: Ben Greear @ 2004-04-06 18:40 UTC (permalink / raw)
To: Jason Lunz; +Cc: netdev
Jason Lunz wrote:
> deri@ntop.org said:
>
>>In addition if you do care about performance, I believe you're willing
>>to turn off packet transmission and only do packet receive.
>
>
> I don't understand what you mean by this. packet-mmap works perfectly
> well on an UP|PROMISC interface with no addresses bound to it. As long
> as no packets are injected through a packet socket, the tx path never
> gets involved.
>
>
>>IRQ: Linux has far too much latency, in particular at high speeds. I'm
>>not the right person who can say "this is the way to go", however I
>>believe that we need some sort of interrupt prioritization like RTIRQ
>>does.
>
>
> I don't think this is the problem, since small-packet performance is bad
> even with a fully-polling e1000 in NAPI mode. As Robert Olsson has
> demonstrated, a highly-loaded napi e1000 only generates a few hundred
> interrupts per second. So the vast majority of packets recieved are
> coming in without a hardware interrupt occurring at all.
If the polling is delayed, then you get plenty of latency.
With something like e1000, you can have up to 4096 rx buffers too,
which can also increase latency, but you do drop fewer packets.
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
[not found] ` <1081262228.1046.25.camel@jzny.localdomain>
@ 2004-04-07 6:59 ` Luca Deri
2004-04-07 12:20 ` jamal
0 siblings, 1 reply; 14+ messages in thread
From: Luca Deri @ 2004-04-07 6:59 UTC (permalink / raw)
To: hadi; +Cc: P, Jason Lunz, netdev, cpw, ntop-misc, Robert.Olsson
Hi Jamal,
from what I read below it seems that you read the first version of my
paper/code. The current paper is available here
http://luca.ntop.org/Ring.pdf and the code here
http://sourceforge.net/project/showfiles.php?group_id=17233&package_id=110128
(as I have said before I plan to have a new release soon).
Briefly:
- with the new release I don't have to patch the NIC driver anymore
- the principle is simple. At the beginning of netif_rx/netif_receive_skb
I have added some code that does this: if there's an incoming packet for
a device to which a PF_RING socket is bound, the packet is processed by
the socket and the function returns NET_RX_SUCCESS with no further
processing (sketched below).
This means that:
- Linux does not have to do anything else with the packet and is immediately
ready to do something else
- the PF_RING is mapped to userland via mmap (like libpcap-mmap) but lower
in the stack (for instance, I sit below netfilter), so for each incoming
packet there's no extra overhead like queuing into data structures,
netfilter processing, etc.
This work has been done to improve passive packet capture in order to
speed up pcap-based apps like ntop, snort, ethereal...
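Roughly, the hook works like this (a sketch of the idea rather than the
actual code; ring_bound_to(), copy_to_ring() and pass_up_the_stack() are
placeholder names):

    int netif_receive_skb(struct sk_buff *skb)
    {
            struct pf_ring_sock *ring = ring_bound_to(skb->dev);

            if (ring != NULL) {
                    copy_to_ring(ring, skb);  /* into the mmap()ed ring */
                    kfree_skb(skb);
                    return NET_RX_SUCCESS;    /* stack sees nothing else:
                                                 no netfilter, no routing,
                                                 no per-socket queueing  */
            }

            /* ...otherwise the usual protocol-layer processing follows */
            return pass_up_the_stack(skb);
    }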
jamal wrote:
>On Tue, 2004-04-06 at 08:25, Luca Deri wrote:
>
>
>>Hi all,
>>the problem with libpcap-mmap is that:
>>- it does not reduce the journey of the packet from NIC to userland
>>beside the last part of it (syscall replaced with mmap). This has a
>>negative impact on the overall performance.
>>
>>
>
>By how much does it add to the overall cost? I would say not by much if
>your other approach is also to cross user space.
>Can you post the userland program you used?
>Can you also capture profiles and post them?
>
>
The code is available at the URL I have specified before.
>
>
>>- it does not feature things like kernel packet sampling that pushes
>>people to fetch all the packets out of a NIC then discard most of them
>>(i.e. CPU cycles not very well spent). Somehow this is a limitation of
>>pcap that does not feature a pcap_sample call.
>>
>>
>
>Sampling can be done easily with tc extensions if you are willing to
>accept patches - 2.4.x only at the moment.
>Infact if all you want is to account and drop at the kernel,
>this would be the easiest way to do it (the kernel will gather stats for
>you which you can collect in user space).
>
>
What I did is not simply for accounting. In fact, as you pointed out,
accounting can be done in the kernel. What I did is for apps that need
to access the raw packet and do something with it. Moreover, do not
forget that at high speeds (or even at 100 Mbit under attack) the
standard Linux kernel is not always able to receive all the traffic.
This means that even using kernel facilities like tc you will not account
traffic properly.
>
>
>
>>In addition if you do care about performance, I believe you're willing
>>to turn off packet transmission and only do packet receive.
>>
>>
>
>You are talking very speacilized machine here. If all you want is to
>recieve and make sure the transmission never happens - consider looking
>at the patches i suggested.
>
>
I probably missed your patches. Can you please send them again?
>I still think you want to manage this device, so some packets should be
>transmitted.
>I have a small issue with your approach btw:
>Any solution which hacks things so that they run at the driver level
>would always get a good performance at the expense of flexibility.
>You might as well stop using Linux - what is the point? Write a bare
>bone OS constituting of said driver.
>
>
I agree; that's why in release 2 I have decided not to hack the driver,
as this is not too smart.
>What we should do instead is improve Linux so it can be at the same
>level performance wise as your bare bone driver. I have never seen you
>post on this subject and then you show up with a a paper showing an
>ancient OS like BSD beating us at performance (or worse Win2k).
>
>
>
>>
>>Unfortunately I have no access to a "real" traffic generator (I use a PC
>>as traffic generator). However if you read my paper you can see that
>>with a Pentium IV 1.7 you can capture over 500'000 pkt/sec, so in your
>>setup (Xeon + Spirent) you can have better figures.
>>
>>
>
>Do you have a new version of the paper? In the version i have you
>only talk about 100Mbps rates.
>
>
See above.
>
>
>>IRQ: Linux has far too much latency, in particular at high speeds. I'm
>>not the right person who can say "this is the way to go", however I
>>believe that we need some sort of interrupt prioritization like RTIRQ does.
>>
>>
>
>Is this still relevant with NAPI?
>
>
Not really. I have written a simple kernel module with a dummy poll()
implementation that returns immediately. Well, under high system load the
time it takes to process this poll call is much, much greater (and
totally unpredictable). You should read this:
http://home.t-online.de/home/Bernhard_Kuhn/rtirq/20040304/rtirq.html
>
>
>>FYI, I've just polished the code and added kernel packet filtering to
>>PF_RING. As soon as I have completed my tests I will release a new version.
>>
>>Finally It would be nice to have in the standard Linux core some packet
>>capture improvements. It could either be based on my work or on somebody
>>else's work. It doesn't really matter as long as Linux gets faster.
>>
>>
>
>You should be involved since you have the energy. You also have to be
>willing to provide data when things dont work well - I am willing to
>help as i am sure many netdev people are if you are going to be positive
>in your approach. Provide results when needed and be willing to invest
>time and experiment.
>
>
So tell me what to do in order to integrate my work into Linux and I'll
do my best to serve the community.
>cheers,
>jamal
>
>
>
Cheers, Luca
--
Luca Deri <deri@ntop.org> http://luca.ntop.org/
Hacker: someone who loves to program and enjoys being
clever about it - Richard Stallman
* Re: Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
2004-04-06 15:31 ` Robert Olsson
@ 2004-04-07 7:03 ` Luca Deri
2004-04-07 15:11 ` [Ntop-misc] " Robert Olsson
0 siblings, 1 reply; 14+ messages in thread
From: Luca Deri @ 2004-04-07 7:03 UTC (permalink / raw)
To: Robert Olsson; +Cc: hadi, Jason Lunz, netdev, ntop-misc
Robert Olsson wrote:
>jamal writes:
>
> > I didnt follow that discussion; archived for later entertaining reading.
> > My take on it was it is 2.6.x related and in particular the misbehavior
> > observed has to do with use of rcu in the route cache.
> >
> > > It appears this problem became worse in 2.6 with HZ=1000, because now
> > > the napi rx softirq work is being done 10X as much on return from the
> > > timer interrupt. I'm not sure if a solution was reached.
> >
> > Robert?
>
> Well it's a general problem controlling softirq/user and the RCU locking
> put this on our agenda as the dst hash was among the first applications
> to use the RCU locking. Which in turn had problem doing progress in hard
> softirq environment which happens during route cache DoS.
>
> NAPI is a part of RX_SOFTIRQ which is well-behaved. NAPI addresses only
> irq/sofirq problem and is totally innocent for do_sofirq() run from other
> parts of kernel causing userland starvation.
>
> Under normal hi-load conditions RX_SOFTIRQ schedules itself when the
> netdev_max_backlog is done. do_softirq sees this and defers execution
> to ksoftirqd and things get under (scheduler) control.
>
> During route DoS, code that does a lot do_softirq() is run for hash and
> fib-lookup, GC etc. The effect is that ksoftirqd is more or less bypassed.
> Again it's a general problem... We are just the unlucky guys getting
> into this.
>
> I don't know if packet capture tests done by Luca ran into this problems.
> A profile could have helped...
>
>
Robert,
yes, I ran into these problems and solved them using the RTIRQ kernel patch.
Cheers, Luca
> Cheers.
> --ro
>
>
--
Luca Deri <deri@ntop.org> http://luca.ntop.org/
Hacker: someone who loves to program and enjoys being
clever about it - Richard Stallman
* Re: Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
[not found] ` <E1BAt0s-0003V8-00@crown.reflexsecurity.com>
@ 2004-04-07 7:11 ` Luca Deri
0 siblings, 0 replies; 14+ messages in thread
From: Luca Deri @ 2004-04-07 7:11 UTC (permalink / raw)
To: Jason Lunz; +Cc: ntop-misc, netdev, Robert.Olsson, hadi
Jason,
Jason Lunz wrote:
>deri@ntop.org said:
>
>
>>In addition if you do care about performance, I believe you're willing
>>to turn off packet transmission and only do packet receive.
>>
>>
>
>I don't understand what you mean by this. packet-mmap works perfectly
>well on an UP|PROMISC interface with no addresses bound to it. As long
>as no packets are injected through a packet socket, the tx path never
>gets involved.
>
>
>
my PF_RING does not allow you to send data, but just to receive it. I
have not implemented transmission, as this work is mainly for receiving
data and not for sending (although it should be fairly easy to add
this feature). So 1) everything is optimized for receiving packets and
2) as I have explained before, the trip of a packet from the NIC to
userland is much shorter than with pcap-mmap (for instance, you don't
cross netfilter at all).
>>IRQ: Linux has far too much latency, in particular at high speeds. I'm
>>not the right person who can say "this is the way to go", however I
>>believe that we need some sort of interrupt prioritization like RTIRQ
>>does.
>>
>>
>
>I don't think this is the problem, since small-packet performance is bad
>even with a fully-polling e1000 in NAPI mode. As Robert Olsson has
>demonstrated, a highly-loaded napi e1000 only generates a few hundred
>interrupts per second. So the vast majority of packets recieved are
>coming in without a hardware interrupt occurring at all.
>
>Could it be that each time an hw irq _is_ generated, it causes many
>packets to be lost? That's a possibility. Can you explain in more detail
>how you measured the effect of interrupt latency on recieve efficiency?
>
>
I'm not an expert here. All I can tell you is that, measuring performance
with rdtsc, I have realized that even at high load, even if there are
few incoming interrupts (as Robert demonstrated), the kernel latency is
not acceptable. That's why I used RTIRQ.
>
>
>>Finally It would be nice to have in the standard Linux core some
>>packet capture improvements. It could either be based on my work or on
>>somebody else's work. It doesn't really matter as long as Linux gets
>>faster.
>>
>>
>
>I agree. I think a good place to start would be reading and
>understanding this thread:
>
>http://thread.gmane.org/gmane.linux.kernel/193758
>
>There's some disagreement for a while about where all this softirq load
>is coming from. It looks like an interaction of softirqs and RCU, but
>the first patch doesn't help. Finally Olsson pointed out:
>
>http://article.gmane.org/gmane.linux.kernel/194412
>
>that the majority of softirq's are being run from hardirq exit. Even
>with NAPI. At this point, I think, it's clear that the problem exists
>regardless of rcu, and indeed, Linux is bad at doing packet-mmap RX of a
>small-packet gigabit flood on both 2.4 and 2.6 (my old 2.4 measurements
>earlier in this thread show this).
>
>I'm particularly interested in trying Andrea's suggestion from
>http://article.gmane.org/gmane.linux.kernel/194486 , but I won't have
>the time anytime soon.
>
>Jason
>
>
I'll read them.
Thanks, Luca
--
Luca Deri <deri@ntop.org> http://luca.ntop.org/
Hacker: someone who loves to program and enjoys being
clever about it - Richard Stallman
* Re: Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
2004-04-07 6:59 ` Luca Deri
@ 2004-04-07 12:20 ` jamal
0 siblings, 0 replies; 14+ messages in thread
From: jamal @ 2004-04-07 12:20 UTC (permalink / raw)
To: Luca Deri; +Cc: P, Jason Lunz, netdev, cpw, ntop-misc, Robert.Olsson
On Wed, 2004-04-07 at 02:59, Luca Deri wrote:
> Hi Jamal,
> from what I read below it seems that you read my first version of the
> paper/code. The current paper is available here
> http://luca.ntop.org/Ring.pdf and the code here
> http://sourceforge.net/project/showfiles.php?group_id=17233&package_id=110128
> (as I have said before I plan to have a new release soon).
Thanks. I will take a look at the above. The paper I looked at was
posted on netdev by someone else.
> Briefly:
> - with the new release I don't have to patch the NIC driver anymore
> - the principle is simple. At the beginning of netif_rx/netif_receive_sk
> I have added some code that does this: if there's an incoming packet for
> a device where a PF_RING socket was bound, the packet is processed by
> the socket and the functions return NET_RX_SUCCESS with no further
> processing.
I think there's a good connection with what I am working on, since the
patches I have are at the same level. On my TODO list was "fast packet
diverting to userspace" - but this meant either stealing or sharing,
unlike your case where the packet is always stolen. My intention
was just to mmap PF_PACKET at the level you are referring to. So maybe
I could use your work instead if it's clean.
I will send you the patches privately.
> This means that:
> - Linux does not have to do anything else with the packet and it's ready
> to do something else
This should be policy driven. In some cases you may want that packet
to be shared/copied (i.e. this is a more generic solution) -
e.g. you add a policy which says to divert all packets from
10.0.0.1 arriving on eth0 to user space with a tag x.
User space binds to tag x, or to * to receive all. Filtering at that
low level provides early discard opportunities.
Of course, what the above means is that you may need to have several rings
even within a device.
There is also nothing that should stop packet capture from happening on the
egress side (what you referred to as transmit).
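In code terms it could look something like this (illustrative only - no such
API exists today, all the names are made up):

    #include <linux/types.h>
    #include <linux/if.h>           /* IFNAMSIZ */

    enum divert_mode { DIVERT_STEAL, DIVERT_COPY };

    struct divert_policy {
            char             indev[IFNAMSIZ]; /* e.g. "eth0"             */
            __u32            src_ip;          /* e.g. 10.0.0.1, 0 = any  */
            __u16            tag;             /* userland binds to this
                                                 tag, or to a wildcard   */
            enum divert_mode mode;            /* steal the skb or copy   */
    };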
> - the PF_RING is mapped to userland via mmap (like libpcap-mmap) but
> down the stack (for instance I'm below netfilter) so for each incoming
> packet there's no extra overhead like queuing into data structures,
> netfilter processing etc.
Netfilter is definitely not something to be proud of performance-wise, but
I think you may be overstating the impact of the other pieces.
> This work has been done to improve passive packet capture in order to
> speedup apps based on pcap like ntop, snort, ethereal...
Again, note that we want to get as close as possible to the performance you
get from specialized work while still maintaining Linux as a general OS.
For example, creating a new socket family like you have MUST have a good
reason; could you not have reused PF_PACKET? [1]
> jamal wrote:
>
> >On Tue, 2004-04-06 at 08:25, Luca Deri wrote:
> >
> >
> >By how much does it add to the overall cost? I would say not by much if
> >your other approach is also to cross user space.
> >Can you post the userland program you used?
> >Can you also capture profiles and post them?
> >
> >
> The code is available at the URL I have specified before.
But I asked for your profiles since you did the work ;->
Don't expect me to get very enthusiastic and collect profiles for you.
For example, I didn't know why you could not get packet mmap to work.
I certainly could do about 200 Kpps with it on what I remember to be an
average machine.
>
> What I did is not for simply accounting. In fact as you pointed out
> accounting can be done with the kernel. What i did is for apps that need
> to access the raw packet and do something with it. Moreover, do not
> forget that at high speeds (or even at 100 Mbit under attack) the
> standard Linux kernel is not always able to receive all the traffic.
> This means that even using kernel apps like tc you will not account
> traffic properly
I think software not receiving all packets will always be an issue
regardless - actually, I should say even well-designed NICs will have
problems. So whatever sampling methodology you use should factor that in
to account for it properly.
> >
> >
> >>IRQ: Linux has far too much latency, in particular at high speeds. I'm
> >>not the right person who can say "this is the way to go", however I
> >>believe that we need some sort of interrupt prioritization like RTIRQ does.
> >>
> >>
> >
> >Is this still relevant with NAPI?
> >
> >
> Not really. I have written a simple kernel module with a dummy poll()
> implementation what returns immediately. Well under high system load the
> time it takes to process this poll call is much much greater (and
> totally unpredictable). You should read this:
> http://home.t-online.de/home/Bernhard_Kuhn/rtirq/20040304/rtirq.html
>
This may be related to what Robert and co. are chasing; how did 2.4.x
treat you?
I took a quick glance at the above work and I am curious how they address
shared interrupts. Let's say you have a PC with a sound card and video card
sharing an IRQ and the video card is considered high prio - how do you
control priorities then?
>
> So tell me what to do in order to integrate my work into Linux and I'll
> do my best to serve the community.
For one, provide results when people ask. I asked you for profiles above
and you point me to the code ;-> You should do the work ;->
cheers,
jamal
[1] A good reason not to use PF_PACKET may be that it would have
required too many changes and might break backward compatibility. But
these are the kinds of things you need to show. I would also suggest you
look at other work like relayfs, which I have not had time to look at
myself.
* [Ntop-misc] Re: Luca Deri's paper: Improving Passive Packet Capture: Beyond Device Polling
2004-04-07 7:03 ` Luca Deri
@ 2004-04-07 15:11 ` Robert Olsson
0 siblings, 0 replies; 14+ messages in thread
From: Robert Olsson @ 2004-04-07 15:11 UTC (permalink / raw)
To: ntop-misc; +Cc: Robert Olsson, hadi, Jason Lunz, netdev
Luca Deri writes:
> Robert,
> yes I run into this problems and I have solved using the RTIRQ kernel patch.
Hello!
I found a reference in your mail... After a very quick look I don't see how RTIRQ
can solve any of the softirq/userland balance problems. It seems to deal with
getting realtime-like interrupt responsiveness under load.
Cheers.
--ro