netdev.vger.kernel.org archive mirror
* Using ethernet device as efficient small packet generator
@ 2010-12-21  9:56 juice
  2010-12-21 18:22 ` Stephen Hemminger
  0 siblings, 1 reply; 28+ messages in thread
From: juice @ 2010-12-21  9:56 UTC (permalink / raw)
  To: netdev


Hi net-devvers.

I am involved in telecom equipment R&D, and I need to do some network
performance benchmarking. We need to generate streams of Ethernet/IP/UDP
traffic consisting of payloads of different sizes, ranging from the
smallest AMR payload up to the Ethernet MTU.

We have various tools, including Spirent traffic generators as well as
in-house software that generates 3GPP-specified protocol streams. The
problem with the off-the-shelf generators is that they are inflexible
for our needs and are not always available to R&D personnel.

For larger packet sizes our Linux-based generator is quite sufficient,
as I can use it to fully saturate a GE link with packet sizes around 1kB.
However, as packet sizes get smaller, Ethernet performance suffers.

I did some benchmarking using pktgen with 64B packets against an AX4000
and confirmed that the maximum throughput is only around 25% of GE
capacity. I got about the same speed using my own custom module that
writes skbuffs directly to the netdev's *xmit function in the kernel.
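
For reference, such a run is driven through /proc/net/pktgen; a minimal
single-queue sketch (pgset is the small helper from the kernel's pktgen
documentation, and the interface, addresses and counts below are only
example values):

# helper from Documentation/networking/pktgen.txt
pgset() {
    local result
    echo $1 > $PGDEV
    result=`cat $PGDEV | fgrep "Result: OK:"`
    if [ "$result" = "" ]; then
        cat $PGDEV | fgrep Result:
    fi
}

# bind the interface to one pktgen kernel thread
PGDEV=/proc/net/pktgen/kpktgend_0
 pgset "rem_device_all"
 pgset "add_device eth1"

# per-device parameters: 10M minimum-size frames, sent flat out
PGDEV=/proc/net/pktgen/eth1
 pgset "count 10000000"
 pgset "clone_skb 10000000"
 pgset "pkt_size 60"
 pgset "delay 0"
 pgset "dst 10.10.11.2"
 pgset "dst_mac 00:04:23:08:91:dc"

# start the run, then read back the result line
PGDEV=/proc/net/pktgen/pgctrl
 pgset "start"
cat /proc/net/pktgen/eth1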

Now, it is evident that something is not optimized to the maximum here,
as the PCI bus allows for much higher transfer speeds. If large packets
can fully saturate the Ethernet link, the same should apply to
minimum-sized packets too, unless there is some per-packet overhead I am
unaware of.
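
(For scale: a minimum-size frame occupies 64 bytes plus 8 bytes of
preamble and 12 bytes of inter-frame gap, i.e. 84 bytes or 672 bit-times
on the wire, so full GE line rate is about 1.488 Mpps, while ~1kB
packets need only on the order of 120 kpps; the per-packet rate required
at 64B is therefore more than ten times higher.)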

I have a couple of questions here:

1.) Is it possible to enhance a normally behaving network driver so
    that the device still works as a regular ethernet device (ethXX)?

    Currently the test stream is generated in a userland process that
    writes to a raw socket, but it would also be fine for me to write
    the packet-generating part as a kernel module that is configured
    from userland to send the prepared stream out.

2.) If the needed performance cannot be obtained through the normal
    network architecture, is it possible to make a "generate only"
    ethernet device that I could use to replace the network card driver?

    For example, RX is not really needed at all by my application, so
    just optimizing the driver to send out packets from memory as fast
    as possible is enough.

    Are there notable differences between ethernet chipsets/cards
    regarding the raw output speed they are capable of?
    I have benchmarked e1000, r8169 and tg3 based cards, and with all
    of those I get about the same throughput of 64-byte ethernet frames.

    For my purpose, it would be OK, for example, to remove the normal
    r8169 driver and replace it with a custom TX-only driver, and use
    some other normal driver tied to another card to access the box.

I appreciate your comments and any pointers to existing projects that
implement something similar to what I need.

Yours, Jussi Ohenoja





* Re: Using ethernet device as efficient small packet generator
  2010-12-21  9:56 juice
@ 2010-12-21 18:22 ` Stephen Hemminger
  0 siblings, 0 replies; 28+ messages in thread
From: Stephen Hemminger @ 2010-12-21 18:22 UTC (permalink / raw)
  To: juice; +Cc: netdev

On Tue, 21 Dec 2010 11:56:42 +0200
"juice" <juice@swagman.org> wrote:

> 
> Hi net-devvers.
> 
> I am involved in telecom equipment R&D, and I need to do some network
> performance benchmarking. We need to generate streams of Ethernet/IP/UDP
> traffic that consists of different sized payloads ranging from smallest
> AMR payload to ethernet MTU.
> 
> We have various tools including for example Spirent traffic generators
> as well as in-house made software generating 3GPP specified protocol
> streams. Now, the problem with the off-the-shelf generators is the
> inflexibility in our needs and the unavailability to R&D personnel to
> have the generator available at any given time.
> 
> For larger packet sizes our linux-based generator is quite sufficent,
> as I can use it to fully saturate GE link with packet sizes around 1kB.
> However, as packet sizes get smalles ethernet performance suffers.
> 
> I did some benchmarking using pktgen with 64B packets against AX4000 and
> confirmed that the maximun throughput is only around 25% of GE capacity.
> I managed to get to about same speeds using own custom module that writes
> skbuffs directly to kernel *xmit of the netdev.
> 
> Now, it is evident that something is not optimized to the maximum here
> as PCI bus allows for way higher transfer speeds. If large packets can
> fully saturate the ethernet link same should apply for minimum sized
> packets too, unless there is some overhead I am unaware of.
> 
> I have couple of questions here:
> 
> 1.) Is it possible to enhance the "normal" behaving network driver so
>     that the device would still work as an ethernet device (ethxx)?
> 
>     Currently the test stream is generated in userland process that
>     writes to RAW_SOCK, but it is OK for me if I need to write the
>     packet generating part as a kernel module that is configured
>     from the userland part to send the prepared stream out.
> 
> 2.) If it is not possible to get the needed performance from normal
>     network architecture, is it possible to make a "generate only"
>     ethernet device that I can use to replace the network card driver?
> 
>     For example, RX is not really needed at all by my application, so
>     just optimizing the driver to send out packets from memory as fast
>     as possible is enough.
> 
>     Are there notable differences between ethernet chipsets/cards
>     regarding to the raw output speed they are capable?
>     I have benchmarked e1000, r8169 ang tg3 based cards and with all
>     of those I get about same throughput of 64byte ethernet frames.
> 
>     For my purpose, it would be OK, for example, to remove the normal
>     r8169 driver and replace it with a custom TX-only driver, and use
>     some other normal driver tied to another card to access the box.
> 
> I appreciate your comments and any pointers to existing projects that
> have similar implementation that I require.
> 
> Yours, Jussi Ohenoja

I regularly get the full 1G line rate with 64-byte packets using an old
Opteron box and pktgen. It does require some tuning of IRQs and
interrupt mitigation, but no patches. Did you remember to do the basic
stuff like setting IRQ affinity and not enabling debugging or tracing in
the kernel? This is on sky2, but also using e1000 and tg3. Others have
reported 7M packets per second over 10G cards. The r8169 hardware is
low-end consumer hardware and doesn't work as well.

It is possible to get close to 1G line rate forwarding with a single core with current
generation processors. Actual rate depends on hardware and configuration (size of route
table, firewalling, etc).  Much better performance with multi-queue hardware to spread load
over multiple cores.
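
A minimal sketch of that basic tuning, assuming the transmit interface
is eth1 (the IRQ number and the coalescing knobs are examples and vary
by driver):

service irqbalance stop              # otherwise it may rewrite the masks
grep eth1 /proc/interrupts           # find the NIC's IRQ(s)
echo 1 > /proc/irq/NN/smp_affinity   # NN = IRQ from above, 1 = CPU0 only
ethtool -c eth1                      # current interrupt mitigation settings
ethtool -C eth1 rx-usecs 100         # which -C parameters exist depends on the driver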


-- 


* Re: Using ethernet device as efficient small packet generator
@ 2010-12-22  7:30 juice
  2010-12-22  8:08 ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: juice @ 2010-12-22  7:30 UTC (permalink / raw)
  To: Stephen Hemminger, netdev

> On Tue, 21 Dec 2010 11:56:42 +0200 shemminger wrote:
> I regularly get full 1G line rate of 64 byte packets using old Opteron
box and pktgen.  It does require some tuning of IRQ's and interrupt
mitigation but
> no patches. Did you remember to do the basic stuff like setting IRQ
affinity
> and not enabling debugging or tracing in the kernel? This is on sky2,
but
> also using e1000 and tg3. Others have reported 7M packets per second
over
> 10G cards.
> The r8169 hardware is low end consumer hardware and doesn't work as
well.
> It is possible to get close to 1G line rate forwarding with a single
core
> with current
> generation processors. Actual rate depends on hardware and configuration
(size of route
> table, firewalling, etc).  Much better performance with multi-queue
hardware to spread load
> over multiple cores.

I did my testing on two kinds of boxes we use in our lab: an older Pomi
Supermicro with e1000 and a newer Dell T3500 with tg3 and r8169.
Both computers have dual-core 2.4 GHz Xeon CPUs, but with somewhat
different model and stepping.
Both boxes run the same OS, Ubuntu with a 2.6.32-26-generic #48 kernel.

Could you share some information on the required interrupt tuning? It
would certainly be easiest if the full line rate can be achieved without
any patching of drivers or hindering normal eth/ip interface operation.

Yours, Jussi Ohenoja






* Re: Using ethernet device as efficient small packet generator
  2010-12-22  7:30 juice
@ 2010-12-22  8:08 ` Eric Dumazet
  2010-12-22 11:11   ` juice
  2010-12-22 15:48   ` Jon Zhou
  0 siblings, 2 replies; 28+ messages in thread
From: Eric Dumazet @ 2010-12-22  8:08 UTC (permalink / raw)
  To: juice; +Cc: Stephen Hemminger, netdev

Le mercredi 22 décembre 2010 à 09:30 +0200, juice a écrit :
> > On Tue, 21 Dec 2010 11:56:42 +0200 shemminger wrote:
> > I regularly get full 1G line rate of 64 byte packets using old Opteron
> box and pktgen.  It does require some tuning of IRQ's and interrupt
> mitigation but
> > no patches. Did you remember to do the basic stuff like setting IRQ
> affinity
> > and not enabling debugging or tracing in the kernel? This is on sky2,
> but
> > also using e1000 and tg3. Others have reported 7M packets per second
> over
> > 10G cards.
> > The r8169 hardware is low end consumer hardware and doesn't work as
> well.
> > It is possible to get close to 1G line rate forwarding with a single
> core
> > with current
> > generation processors. Actual rate depends on hardware and configuration
> (size of route
> > table, firewalling, etc).  Much better performance with multi-queue
> hardware to spread load
> > over multiple cores.
> 
> I did my testing on two kinds of boxes we use in our lab, an older Pomi
> Supermicro with e1000 and a newer Dell T3500 with tg3 and r8169.
> Both computers have dual-core 2.4G Xeon Cpus, but with somewhat different
> model and stepping.
> Both boxes are running the same OS, Ubuntu 2.6.32-26-generic #48.
> 

Hmm, might be better with 10.10 ubuntu, with 2.6.35 kernels

> Could you share some information on the required interrupt tuning? It
> would certainly be easiest if the full line rate can be achieved without
> any patching of drivers or hindering normal eth/ip interface operation.
> 

That's pretty easy.

Say your card has 8 queues, do :

echo 01 >/proc/irq/*/eth1-fp-0/../smp_affinity
echo 02 >/proc/irq/*/eth1-fp-1/../smp_affinity
echo 04 >/proc/irq/*/eth1-fp-2/../smp_affinity
echo 08 >/proc/irq/*/eth1-fp-3/../smp_affinity
echo 10 >/proc/irq/*/eth1-fp-4/../smp_affinity
echo 20 >/proc/irq/*/eth1-fp-5/../smp_affinity
echo 40 >/proc/irq/*/eth1-fp-6/../smp_affinity
echo 80 >/proc/irq/*/eth1-fp-7/../smp_affinity

Then, start your pktgen threads on each queue, so that the TX completion
IRQs run on the same CPU.
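
The eth1-fp-N names above are driver-specific (ixgbe, for example,
registers its vectors as ethX-TxRx-N); check what the driver in use
actually calls them before pinning:

grep eth1 /proc/interrupts
cat /proc/irq/*/eth1*/../smp_affinity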

I confirm getting 6Mpps (or more) out of the box is OK.

I did it one year ago on ixgbe, no patches needed.

With recent kernels, it should even be faster.





* Re: Using ethernet device as efficient small packet generator
  2010-12-22  8:08 ` Eric Dumazet
@ 2010-12-22 11:11   ` juice
  2010-12-22 11:28     ` Eric Dumazet
  2010-12-22 15:48   ` Jon Zhou
  1 sibling, 1 reply; 28+ messages in thread
From: juice @ 2010-12-22 11:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: juice, Stephen Hemminger, netdev


>> Could you share some information on the required interrupt tuning? It
>> would certainly be easiest if the full line rate can be achieved without
>> any patching of drivers or hindering normal eth/ip interface operation.
>>
>
> Thats pretty easy.
>
> Say your card has 8 queues, do :
>
> echo 01 >/proc/irq/*/eth1-fp-0/../smp_affinity
> echo 02 >/proc/irq/*/eth1-fp-1/../smp_affinity
> echo 04 >/proc/irq/*/eth1-fp-2/../smp_affinity
> echo 08 >/proc/irq/*/eth1-fp-3/../smp_affinity
> echo 10 >/proc/irq/*/eth1-fp-4/../smp_affinity
> echo 20 >/proc/irq/*/eth1-fp-5/../smp_affinity
> echo 40 >/proc/irq/*/eth1-fp-6/../smp_affinity
> echo 80 >/proc/irq/*/eth1-fp-7/../smp_affinity
>
> Then, start your pktgen threads on each queue, so that TX completion IRQ
> are run on same CPU.
>
> I confirm getting 6Mpps (or more) out of the box is OK.
>
> I did it one year ago on ixgbe, no patches needed.
>
> With recent kernels, it should even be faster.
>

I guess the irq structures are different on 2.6.31, as there are no such
files there. However, this is what it looks like:

root@a2labralinux:/home/juice#
root@a2labralinux:/home/juice# cat /proc/interrupts
           CPU0       CPU1
  0:         46          0   IO-APIC-edge      timer
  1:       1917          0   IO-APIC-edge      i8042
  3:          2          0   IO-APIC-edge
  4:          2          0   IO-APIC-edge
  6:          5          0   IO-APIC-edge      floppy
  7:          0          0   IO-APIC-edge      parport0
  8:          0          0   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:      41310          0   IO-APIC-edge      i8042
 14:     132126          0   IO-APIC-edge      ata_piix
 15:    3747771          0   IO-APIC-edge      ata_piix
 16:          0          0   IO-APIC-fasteoi   uhci_hcd:usb1
 18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 19:          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
 28:   11678379          0   IO-APIC-fasteoi   eth0
 29:    1659580     305890   IO-APIC-fasteoi   eth1
 72:    1667572          0   IO-APIC-fasteoi   eth2
NMI:          0          0   Non-maskable interrupts
LOC:   42109031   78473986   Local timer interrupts
SPU:          0          0   Spurious interrupts
CNT:          0          0   Performance counter interrupts
PND:          0          0   Performance pending work
RES:     654819     680053   Rescheduling interrupts
CAL:        137       1534   Function call interrupts
TLB:     102720     606381   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:       1724       1724   Machine check polls
ERR:          0
MIS:          0
root@a2labralinux:/home/juice# ls -la /proc/irq/28/
total 0
dr-xr-xr-x  3 root root 0 2010-12-22 15:23 .
dr-xr-xr-x 24 root root 0 2010-12-22 15:23 ..
dr-xr-xr-x  2 root root 0 2010-12-22 15:23 eth0
-rw-------  1 root root 0 2010-12-22 15:23 smp_affinity
-r--r--r--  1 root root 0 2010-12-22 15:23 spurious
root@a2labralinux:/home/juice#
root@a2labralinux:/home/juice# cat /proc/irq/28/smp_affinity
1
root@a2labralinux:/home/juice#

The smp_affinity was previously 3, so I guess both CPUs handled the
interrupts.
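
The value is a hexadecimal CPU bitmask, so on this two-CPU box 3 lets
either CPU take the interrupt and 1 pins it to CPU0. For the
transmitting interface the relevant vector is the one listed for it in
/proc/interrupts, for example:

echo 1 > /proc/irq/29/smp_affinity   # IRQ 29 is eth1 in the listing above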

Now, with the affinity set to CPU0, I get somewhat better results, but
still nothing near full GE saturation:

root@a2labralinux:/home/juice# cat /proc/net/pktgen/eth1
Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 10000000  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
     src_min:   src_max:
     src_mac: 00:30:48:2a:2a:61 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 1293021547122748us  stopped: 1293021562952096us idle: 2118707us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0xb090914  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 15829348(c13710641+d2118707) usec, 10000000 (60byte,0frags)
  631737pps 303Mb/sec (303233760bps) errors: 0
root@a2labralinux:/home/juice#

This result is from the Pomi Supermicro using the e1000 network interface.
Previously the small-packet throughput was about 180Mb/s, now 303Mb/s.

From the Dell machine using the tg3 interface, there was really no
difference when I set the interrupt affinity to a single CPU; the
results are about the same as before:

root@d8labralinux:/home/juice# cat /proc/net/pktgen/eth2
Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 10000000  ifname: eth2
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
        src_min:   src_max:
     src_mac: b8:ac:6f:95:d5:f7 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 169829200145us  stopped: 169856889850us idle: 1296us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x4030201  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 27689705(c27688408+d1296) nsec, 10000000 (60byte,0frags)
  361145pps 173Mb/sec (173349600bps) errors: 0
root@d8labralinux:/home/juice#


>
> Hmm, might be better with 10.10 ubuntu, with 2.6.35 kernels
>

So, is the interrupt handling different in newer kernels?
Should I try to update the kernel version before doing any more optimizing?

As the boxes are also running other software, I would like to keep
them on Ubuntu LTS.

Yours, Jussi Ohenoja




* Re: Using ethernet device as efficient small packet generator
  2010-12-22 11:11   ` juice
@ 2010-12-22 11:28     ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2010-12-22 11:28 UTC (permalink / raw)
  To: juice; +Cc: Stephen Hemminger, netdev

Le mercredi 22 décembre 2010 à 13:11 +0200, juice a écrit :
> >> Could you share some information on the required interrupt tuning? It
> >> would certainly be easiest if the full line rate can be achieved without
> >> any patching of drivers or hindering normal eth/ip interface operation.
> >>
> >
> > Thats pretty easy.
> >
> > Say your card has 8 queues, do :
> >
> > echo 01 >/proc/irq/*/eth1-fp-0/../smp_affinity
> > echo 02 >/proc/irq/*/eth1-fp-1/../smp_affinity
> > echo 04 >/proc/irq/*/eth1-fp-2/../smp_affinity
> > echo 08 >/proc/irq/*/eth1-fp-3/../smp_affinity
> > echo 10 >/proc/irq/*/eth1-fp-4/../smp_affinity
> > echo 20 >/proc/irq/*/eth1-fp-5/../smp_affinity
> > echo 40 >/proc/irq/*/eth1-fp-6/../smp_affinity
> > echo 80 >/proc/irq/*/eth1-fp-7/../smp_affinity
> >
> > Then, start your pktgen threads on each queue, so that TX completion IRQ
> > are run on same CPU.
> >
> > I confirm getting 6Mpps (or more) out of the box is OK.
> >
> > I did it one year ago on ixgbe, no patches needed.
> >
> > With recent kernels, it should even be faster.
> >
> 
> I guess the irq structures are different on 2.6.31, as there are no such
> files there. However, this is what it looks like:
> 
> root@a2labralinux:/home/juice#
> root@a2labralinux:/home/juice# cat /proc/interrupts
>            CPU0       CPU1
>   0:         46          0   IO-APIC-edge      timer
>   1:       1917          0   IO-APIC-edge      i8042
>   3:          2          0   IO-APIC-edge
>   4:          2          0   IO-APIC-edge
>   6:          5          0   IO-APIC-edge      floppy
>   7:          0          0   IO-APIC-edge      parport0
>   8:          0          0   IO-APIC-edge      rtc0
>   9:          0          0   IO-APIC-fasteoi   acpi
>  12:      41310          0   IO-APIC-edge      i8042
>  14:     132126          0   IO-APIC-edge      ata_piix
>  15:    3747771          0   IO-APIC-edge      ata_piix
>  16:          0          0   IO-APIC-fasteoi   uhci_hcd:usb1
>  18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
>  19:          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
>  28:   11678379          0   IO-APIC-fasteoi   eth0
>  29:    1659580     305890   IO-APIC-fasteoi   eth1
>  72:    1667572          0   IO-APIC-fasteoi   eth2
> NMI:          0          0   Non-maskable interrupts
> LOC:   42109031   78473986   Local timer interrupts
> SPU:          0          0   Spurious interrupts
> CNT:          0          0   Performance counter interrupts
> PND:          0          0   Performance pending work
> RES:     654819     680053   Rescheduling interrupts
> CAL:        137       1534   Function call interrupts
> TLB:     102720     606381   TLB shootdowns
> TRM:          0          0   Thermal event interrupts
> THR:          0          0   Threshold APIC interrupts
> MCE:          0          0   Machine check exceptions
> MCP:       1724       1724   Machine check polls
> ERR:          0
> MIS:          0
> root@a2labralinux:/home/juice# ls -la /proc/irq/28/
> total 0
> dr-xr-xr-x  3 root root 0 2010-12-22 15:23 .
> dr-xr-xr-x 24 root root 0 2010-12-22 15:23 ..
> dr-xr-xr-x  2 root root 0 2010-12-22 15:23 eth0
> -rw-------  1 root root 0 2010-12-22 15:23 smp_affinity
> -r--r--r--  1 root root 0 2010-12-22 15:23 spurious
> root@a2labralinux:/home/juice#
> root@a2labralinux:/home/juice# cat /proc/irq/28/smp_affinity
> 1
> root@a2labralinux:/home/juice#
> 
> The smp_affinity was previously 3, so I guess both CPU's handled the
> interrupts.
> 
> Now, with affinity set to CPU0, I get a bit better results but still
> nothing near full GE saturation:
> 
> root@a2labralinux:/home/juice# cat /proc/net/pktgen/eth1
> Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
>      frags: 0  delay: 0  clone_skb: 10000000  ifname: eth1
>      flows: 0 flowlen: 0
>      queue_map_min: 0  queue_map_max: 0
>      dst_min: 10.10.11.2  dst_max:
>      src_min:   src_max:
>      src_mac: 00:30:48:2a:2a:61 dst_mac: 00:04:23:08:91:dc
>      udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
>      src_mac_count: 0  dst_mac_count: 0
>      Flags:
> Current:
>      pkts-sofar: 10000000  errors: 0
>      started: 1293021547122748us  stopped: 1293021562952096us idle: 2118707us
>      seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
>      cur_saddr: 0xb090914  cur_daddr: 0x20b0a0a
>      cur_udp_dst: 9  cur_udp_src: 9
>      cur_queue_map: 0
>      flows: 0
> Result: OK: 15829348(c13710641+d2118707) usec, 10000000 (60byte,0frags)
>   631737pps 303Mb/sec (303233760bps) errors: 0
> root@a2labralinux:/home/juice#
> 
> This result is from the Pomi micro using e1000 network interface.
> Previously the small packet throghput was about 180Mb/s, now 303Mb/s.
> 
> From the Dell machine using tg3 interface, there was really no difference
> when I set the interrupt affinity to single CPU, the results are about
> same as before:
> 
> root@d8labralinux:/home/juice# cat /proc/net/pktgen/eth2
> Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
>      frags: 0  delay: 0  clone_skb: 10000000  ifname: eth2
>      flows: 0 flowlen: 0
>      queue_map_min: 0  queue_map_max: 0
>      dst_min: 10.10.11.2  dst_max:
>         src_min:   src_max:
>      src_mac: b8:ac:6f:95:d5:f7 dst_mac: 00:04:23:08:91:dc
>      udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
>      src_mac_count: 0  dst_mac_count: 0
>      Flags:
> Current:
>      pkts-sofar: 10000000  errors: 0
>      started: 169829200145us  stopped: 169856889850us idle: 1296us
>      seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
>      cur_saddr: 0x4030201  cur_daddr: 0x20b0a0a
>      cur_udp_dst: 9  cur_udp_src: 9
>      cur_queue_map: 0
>      flows: 0
> Result: OK: 27689705(c27688408+d1296) nsec, 10000000 (60byte,0frags)
>   361145pps 173Mb/sec (173349600bps) errors: 0
> root@d8labralinux:/home/juice#
> 
> 
> >
> > Hmm, might be better with 10.10 ubuntu, with 2.6.35 kernels
> >
> 
> So, is the interrupt handling different in newer kernels?
> Should I try to update the linux version before doing any more optimizing?
> 

I don't know whether distro kernels have too much debugging stuff
enabled for this kind of use.

> As the boxes are also running other software I would like to keep
> them in Ubuntu-LTS.
> 
> Yours, Jussi Ohenoja
> 
> 

Reaching 1Gbps should not be a problem (I was speaking about 10Gbps).

I reach link speed with my tg3 card and a single CPU :)

(Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3))

Please provide :

ethtool -S eth0




* RE: Using ethernet device as efficient small packet generator
  2010-12-22  8:08 ` Eric Dumazet
  2010-12-22 11:11   ` juice
@ 2010-12-22 15:48   ` Jon Zhou
  2010-12-22 15:59     ` Eric Dumazet
  1 sibling, 1 reply; 28+ messages in thread
From: Jon Zhou @ 2010-12-22 15:48 UTC (permalink / raw)
  To: Eric Dumazet, juice@swagman.org; +Cc: Stephen Hemminger, netdev@vger.kernel.org



-----Original Message-----
From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Eric Dumazet
Sent: Wednesday, December 22, 2010 4:08 PM
To: juice@swagman.org
Cc: Stephen Hemminger; netdev@vger.kernel.org
Subject: Re: Using ethernet device as efficient small packet generator

Le mercredi 22 décembre 2010 à 09:30 +0200, juice a écrit :
> > On Tue, 21 Dec 2010 11:56:42 +0200 shemminger wrote:
> > I regularly get full 1G line rate of 64 byte packets using old Opteron
> box and pktgen.  It does require some tuning of IRQ's and interrupt
> mitigation but
> > no patches. Did you remember to do the basic stuff like setting IRQ
> affinity
> > and not enabling debugging or tracing in the kernel? This is on sky2,
> but
> > also using e1000 and tg3. Others have reported 7M packets per second
> over
> > 10G cards.
> > The r8169 hardware is low end consumer hardware and doesn't work as
> well.
> > It is possible to get close to 1G line rate forwarding with a single
> core
> > with current
> > generation processors. Actual rate depends on hardware and configuration
> (size of route
> > table, firewalling, etc).  Much better performance with multi-queue
> hardware to spread load
> > over multiple cores.
> 
> I did my testing on two kinds of boxes we use in our lab, an older Pomi
> Supermicro with e1000 and a newer Dell T3500 with tg3 and r8169.
> Both computers have dual-core 2.4G Xeon Cpus, but with somewhat different
> model and stepping.
> Both boxes are running the same OS, Ubuntu 2.6.32-26-generic #48.
> 

Hmm, might be better with 10.10 ubuntu, with 2.6.35 kernels

> Could you share some information on the required interrupt tuning? It
> would certainly be easiest if the full line rate can be achieved without
> any patching of drivers or hindering normal eth/ip interface operation.
> 

Thats pretty easy.

Say your card has 8 queues, do :

echo 01 >/proc/irq/*/eth1-fp-0/../smp_affinity
echo 02 >/proc/irq/*/eth1-fp-1/../smp_affinity
echo 04 >/proc/irq/*/eth1-fp-2/../smp_affinity
echo 08 >/proc/irq/*/eth1-fp-3/../smp_affinity
echo 10 >/proc/irq/*/eth1-fp-4/../smp_affinity
echo 20 >/proc/irq/*/eth1-fp-5/../smp_affinity
echo 40 >/proc/irq/*/eth1-fp-6/../smp_affinity
echo 80 >/proc/irq/*/eth1-fp-7/../smp_affinity

Then, start your pktgen threads on each queue, so that TX completion IRQ
are run on same CPU.

Hi Eric, are any special settings needed in pktgen.conf?

PGDEV=/proc/net/pktgen/kpktgend_0
  echo "Removing all devices"
 pgset "rem_device_all"
  echo "Adding eth1-fp-0" //or eth1?
 pgset "add_device eth1"
  echo "Setting max_before_softirq 10000"
 pgset "max_before_softirq 10000"

So all I need to do is set CPU affinity and start 8 pktgen threads? (PGDEV=/proc/net/pktgen/kpktgend_0~7 with "eth1")


I confirm getting 6Mpps (or more) out of the box is OK.

I did it one year ago on ixgbe, no patches needed.

With recent kernels, it should even be faster.





* RE: Using ethernet device as efficient small packet generator
  2010-12-22 15:48   ` Jon Zhou
@ 2010-12-22 15:59     ` Eric Dumazet
  2010-12-22 16:52       ` Jon Zhou
  2010-12-22 17:15       ` Jon Zhou
  0 siblings, 2 replies; 28+ messages in thread
From: Eric Dumazet @ 2010-12-22 15:59 UTC (permalink / raw)
  To: Jon Zhou; +Cc: juice@swagman.org, Stephen Hemminger, netdev@vger.kernel.org

Le mercredi 22 décembre 2010 à 07:48 -0800, Jon Zhou a écrit :
> 

> Hi eric, any special setting in pktgen.conf?
> 
> PGDEV=/proc/net/pktgen/kpktgend_0
>   echo "Removing all devices"
>  pgset "rem_device_all"
>   echo "Adding eth1-fp-0" //or eth1?

eth1

>  pgset "add_device eth1"
>   echo "Setting max_before_softirq 10000"
>  pgset "max_before_softirq 10000"

Not sure you need to tweak max_before_softirq (I never did)

> 
> All things I need to do is set cpu affinity and start 8 pktgen threads? (PGDEV=/proc/net/pktgen/kpktgend_0~7 with "eth1")

Yes, but you must also use queue_map_min and queue_map_max pktgen
parameters so that each cpu manipulates its own 'queue'

CPU 0 : 

   pgset "queue_map_min 0"
   pgset "queue_map_max 0"

...

CPU 3 : 

   pgset "queue_map_min 3"
   pgset "queue_map_max 3"
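
Concretely, one thread/queue pair looks roughly like this (using the
pgset helper from the pktgen documentation; note that, at least on the
kernels discussed here, queue_map_min/max are device parameters and are
written to /proc/net/pktgen/ethX rather than to the kpktgend_N thread
file):

# thread 0 drives queue 0
PGDEV=/proc/net/pktgen/kpktgend_0
 pgset "rem_device_all"
 pgset "add_device eth1"

PGDEV=/proc/net/pktgen/eth1
 pgset "count 0"                 # 0 = run until stopped
 pgset "clone_skb 1000"
 pgset "pkt_size 60"
 pgset "delay 0"
 pgset "dst 10.10.11.2"
 pgset "dst_mac 00:04:23:08:91:dc"
 pgset "queue_map_min 0"
 pgset "queue_map_max 0"

# repeat for kpktgend_1..7 / queues 1..7; how the same NIC is bound to
# several threads depends on the pktgen version (newer versions accept a
# per-thread device suffix such as eth1@0, see
# Documentation/networking/pktgen.txt for the kernel in use)

PGDEV=/proc/net/pktgen/pgctrl
 pgset "start"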






* RE: Using ethernet device as efficient small packet generator
  2010-12-22 15:59     ` Eric Dumazet
@ 2010-12-22 16:52       ` Jon Zhou
  2010-12-22 17:18         ` Eric Dumazet
  2010-12-22 17:15       ` Jon Zhou
  1 sibling, 1 reply; 28+ messages in thread
From: Jon Zhou @ 2010-12-22 16:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: juice@swagman.org, Stephen Hemminger, netdev@vger.kernel.org



-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
Sent: Wednesday, December 22, 2010 11:59 PM
To: Jon Zhou
Cc: juice@swagman.org; Stephen Hemminger; netdev@vger.kernel.org
Subject: RE: Using ethernet device as efficient small packet generator

Le mercredi 22 décembre 2010 à 07:48 -0800, Jon Zhou a écrit :
> 

> Hi eric, any special setting in pktgen.conf?
> 
> PGDEV=/proc/net/pktgen/kpktgend_0
>   echo "Removing all devices"
>  pgset "rem_device_all"
>   echo "Adding eth1-fp-0" //or eth1?

eth1

>  pgset "add_device eth1"
>   echo "Setting max_before_softirq 10000"
>  pgset "max_before_softirq 10000"

Not sure you need to tweak max_before_softirq (I never did)

> 
> All things I need to do is set cpu affinity and start 8 pktgen threads? (PGDEV=/proc/net/pktgen/kpktgend_0~7 with "eth1")

Yes, but you must also use queue_map_min and queue_map_max pktgen
parameters so that each cpu manipulates its own 'queue'

CPU 0 : 

   pgset "queue_map_min 0"
   pgset "queue_map_max 0"

...

CPU 3 : 

   pgset "queue_map_min 3"
   pgset "queue_map_max 3"


PGDEV=/proc/net/pktgen/kpktgend_0
  echo "Removing all devices"
 pgset "rem_device_all" 
  echo "Adding eth4"
 pgset "add_device eth4" 
  echo "Setting max_before_softirq 10000"
 pgset "queue_map_min 0"
 pgset "queue_map_max 0"

-->
It said: 
queue_map_min 0
./pktgen.conf-8-1: line 10: echo: write error: Invalid argument
queue_map_max 0
./pktgen.conf-8-1: line 10: echo: write error: Invalid argument


PGDEV=/proc/net/pktgen/eth4
  echo "Configuring $PGDEV"
 pgset "$COUNT"
 pgset "$CLONE_SKB"
 pgset "$PKT_SIZE"
 pgset "$DELAY"
 pgset "dst 10.10.11.2" 
 pgset "queue_map_min 0"
 pgset "queue_map_max 7"
 pgset "dst_mac  00:04:23:08:91:dc"

->it is ok


Here is the top result; why is only kpktgend_0 running?
Eric, can you share the pktgen script? Thank you.

top - 00:43:59 up  7:00,  6 users,  load average: 0.95, 0.66, 0.51
Tasks:   8 total,   1 running,   7 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  3.2%sy,  0.0%ni, 86.1%id,  0.1%wa,  0.0%hi, 10.6%si,  0.0%st
Mem:     32228M total,      933M used,    31295M free,       97M buffers
Swap:     2055M total,        0M used,     2055M free,      138M cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8806 root      20   0     0    0    0 R  100  0.0   5:27.12 kpktgend_0
 8807 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_1
 8808 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_2
 8810 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_3
 8811 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_4
 8812 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_5
 8813 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_6
 8814 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_7

Already set affinity:
cat /proc/interrupts |grep eth4
  78:   10625257          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth4-TxRx-0
  79:      10451     581007          0          0          0          0          0          0  IR-PCI-MSI-edge      eth4-TxRx-1
  80:      10447          0     535185          0          0          0          0          0  IR-PCI-MSI-edge      eth4-TxRx-2
  81:      10441          0          0     575911          0          0          0          0  IR-PCI-MSI-edge      eth4-TxRx-3
  82:      10444          0          0          0     521068          0          0          0  IR-PCI-MSI-edge      eth4-TxRx-4
  83:      10448          0          0          0          0     564710          0          0  IR-PCI-MSI-edge      eth4-TxRx-5
  84:      10429          0          0          0          0          0     516087          0  IR-PCI-MSI-edge      eth4-TxRx-6
  85:      10444          0          0          0          0          0          0     558530  IR-PCI-MSI-edge      eth4-TxRx-7
  86:          2          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth4:lsc




* RE: Using ethernet device as efficient small packet generator
  2010-12-22 15:59     ` Eric Dumazet
  2010-12-22 16:52       ` Jon Zhou
@ 2010-12-22 17:15       ` Jon Zhou
  1 sibling, 0 replies; 28+ messages in thread
From: Jon Zhou @ 2010-12-22 17:15 UTC (permalink / raw)
  To: Jon Zhou, Eric Dumazet
  Cc: juice@swagman.org, Stephen Hemminger, netdev@vger.kernel.org



-----Original Message-----
From: Jon Zhou 
Sent: Thursday, December 23, 2010 12:53 AM
To: 'Eric Dumazet'
Cc: juice@swagman.org; Stephen Hemminger; netdev@vger.kernel.org
Subject: RE: Using ethernet device as efficient small packet generator



-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
Sent: Wednesday, December 22, 2010 11:59 PM
To: Jon Zhou
Cc: juice@swagman.org; Stephen Hemminger; netdev@vger.kernel.org
Subject: RE: Using ethernet device as efficient small packet generator

Le mercredi 22 décembre 2010 à 07:48 -0800, Jon Zhou a écrit :
> 

> Hi eric, any special setting in pktgen.conf?
> 
> PGDEV=/proc/net/pktgen/kpktgend_0
>   echo "Removing all devices"
>  pgset "rem_device_all"
>   echo "Adding eth1-fp-0" //or eth1?

eth1

>  pgset "add_device eth1"
>   echo "Setting max_before_softirq 10000"
>  pgset "max_before_softirq 10000"

Not sure you need to tweak max_before_softirq (I never did)

> 
> All things I need to do is set cpu affinity and start 8 pktgen threads? (PGDEV=/proc/net/pktgen/kpktgend_0~7 with "eth1")

Yes, but you must also use queue_map_min and queue_map_max pktgen
parameters so that each cpu manipulates its own 'queue'

CPU 0 : 

   pgset "queue_map_min 0"
   pgset "queue_map_max 0"

...

CPU 3 : 

   pgset "queue_map_min 3"
   pgset "queue_map_max 3"


PGDEV=/proc/net/pktgen/kpktgend_0
  echo "Removing all devices"
 pgset "rem_device_all" 
  echo "Adding eth4"
 pgset "add_device eth4" 
  echo "Setting max_before_softirq 10000"
 pgset "queue_map_min 0"
 pgset "queue_map_max 0"

-->
It said: 
queue_map_min 0
./pktgen.conf-8-1: line 10: echo: write error: Invalid argument
queue_map_max 0
./pktgen.conf-8-1: line 10: echo: write error: Invalid argument

Forgot to mention: the kernel is 2.6.32.

PGDEV=/proc/net/pktgen/eth4
  echo "Configuring $PGDEV"
 pgset "$COUNT"
 pgset "$CLONE_SKB"
 pgset "$PKT_SIZE"
 pgset "$DELAY"
 pgset "dst 10.10.11.2" 
 pgset "queue_map_min 0"
 pgset "queue_map_max 7"
 pgset "dst_mac  00:04:23:08:91:dc"

->it is ok


Here is the top result, why only kpktgend_0 is running?
Eric,can you share the pktgen script? Thank you

top - 00:43:59 up  7:00,  6 users,  load average: 0.95, 0.66, 0.51
Tasks:   8 total,   1 running,   7 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  3.2%sy,  0.0%ni, 86.1%id,  0.1%wa,  0.0%hi, 10.6%si,  0.0%st
Mem:     32228M total,      933M used,    31295M free,       97M buffers
Swap:     2055M total,        0M used,     2055M free,      138M cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8806 root      20   0     0    0    0 R  100  0.0   5:27.12 kpktgend_0
 8807 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_1
 8808 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_2
 8810 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_3
 8811 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_4
 8812 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_5
 8813 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_6
 8814 root      20   0     0    0    0 S    0  0.0   0:00.00 kpktgend_7

Already set affinity:
cat /proc/interrupts |grep eth4
  78:   10625257          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth4-TxRx-0
  79:      10451     581007          0          0          0          0          0          0  IR-PCI-MSI-edge      eth4-TxRx-1
  80:      10447          0     535185          0          0          0          0          0  IR-PCI-MSI-edge      eth4-TxRx-2
  81:      10441          0          0     575911          0          0          0          0  IR-PCI-MSI-edge      eth4-TxRx-3
  82:      10444          0          0          0     521068          0          0          0  IR-PCI-MSI-edge      eth4-TxRx-4
  83:      10448          0          0          0          0     564710          0          0  IR-PCI-MSI-edge      eth4-TxRx-5
  84:      10429          0          0          0          0          0     516087          0  IR-PCI-MSI-edge      eth4-TxRx-6
  85:      10444          0          0          0          0          0          0     558530  IR-PCI-MSI-edge      eth4-TxRx-7
  86:          2          0          0          0          0          0          0          0  IR-PCI-MSI-edge      eth4:lsc




* RE: Using ethernet device as efficient small packet generator
  2010-12-22 16:52       ` Jon Zhou
@ 2010-12-22 17:18         ` Eric Dumazet
  2010-12-22 17:40           ` Jon Zhou
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2010-12-22 17:18 UTC (permalink / raw)
  To: Jon Zhou; +Cc: juice@swagman.org, Stephen Hemminger, netdev@vger.kernel.org


> PGDEV=/proc/net/pktgen/eth4

You meant :  PGDEV=/proc/net/pktgen/kpktgend_0   ???

>   echo "Configuring $PGDEV"
>  pgset "$COUNT"
>  pgset "$CLONE_SKB"
>  pgset "$PKT_SIZE"
>  pgset "$DELAY"
>  pgset "dst 10.10.11.2" 
>  pgset "queue_map_min 0"
>  pgset "queue_map_max 7"
>  pgset "dst_mac  00:04:23:08:91:dc"
> 
> ->it is ok
> 
> 
> Here is the top result, why only kpktgend_0 is running?
> Eric,can you share the pktgen script? Thank you

If you want to control several kpktgend you need to send each one of
them a full pktgen script.

PGDEV=/proc/net/pktgen/kpktgend_0
script0
PGDEV=/proc/net/pktgen/kpktgend_1
script1
PGDEV=/proc/net/pktgen/kpktgend_2
script2
PGDEV=/proc/net/pktgen/kpktgend_3
script3





* RE: Using ethernet device as efficient small packet generator
  2010-12-22 17:18         ` Eric Dumazet
@ 2010-12-22 17:40           ` Jon Zhou
  2010-12-22 17:51             ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Jon Zhou @ 2010-12-22 17:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: juice@swagman.org, Stephen Hemminger, netdev@vger.kernel.org



-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
Sent: Thursday, December 23, 2010 1:18 AM
To: Jon Zhou
Cc: juice@swagman.org; Stephen Hemminger; netdev@vger.kernel.org
Subject: RE: Using ethernet device as efficient small packet generator


> PGDEV=/proc/net/pktgen/eth4

You meant :  PGDEV=/proc/net/pktgen/kpktgend_0   ???


With "PGDEV=/proc/net/pktgen/kpktgend_0" I can't pgset "queue_map_min 0";
I am only able to pgset "queue_map_min 0" with "PGDEV=/proc/net/pktgen/eth4".

>   echo "Configuring $PGDEV"
>  pgset "$COUNT"
>  pgset "$CLONE_SKB"
>  pgset "$PKT_SIZE"
>  pgset "$DELAY"
>  pgset "dst 10.10.11.2" 
>  pgset "queue_map_min 0"
>  pgset "queue_map_max 7"
>  pgset "dst_mac  00:04:23:08:91:dc"
> 
> ->it is ok
> 
> 
> Here is the top result, why only kpktgend_0 is running?
> Eric,can you share the pktgen script? Thank you

If you want to control several kpktgend you need to send each one of
them a full pktgen script.

PGDEV=/proc/net/pktgen/kpktgend_0
script0
PGDEV=/proc/net/pktgen/kpktgend_1
script1
PGDEV=/proc/net/pktgen/kpktgend_2
script2
PGDEV=/proc/net/pktgen/kpktgend_3
script3

But these kpktgend threads cannot share the same device?
I.e., I can't add the device again?

PGDEV=/proc/net/pktgen/kpktgend_1
 #echo "Removing all devices"
 #pgset "rem_device_all" 
 #echo "Adding eth4"
 pgset "add_device eth4"   
 pgset "queue_map_min 1"
 pgset "queue_map_max 1"

As I saw, when the pps reaches 1M and the throughput reaches 900Mbps, kpktgend_0 also reaches 100% CPU (with an Intel 10G X520 NIC, ixgbe).




* RE: Using ethernet device as efficient small packet generator
  2010-12-22 17:40           ` Jon Zhou
@ 2010-12-22 17:51             ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2010-12-22 17:51 UTC (permalink / raw)
  To: Jon Zhou; +Cc: juice@swagman.org, Stephen Hemminger, netdev@vger.kernel.org

Le mercredi 22 décembre 2010 à 09:40 -0800, Jon Zhou a écrit :
> 
> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
> Sent: Thursday, December 23, 2010 1:18 AM
> To: Jon Zhou
> Cc: juice@swagman.org; Stephen Hemminger; netdev@vger.kernel.org
> Subject: RE: Using ethernet device as efficient small packet generator
> 
> 
> > PGDEV=/proc/net/pktgen/eth4
> 
> You meant :  PGDEV=/proc/net/pktgen/kpktgend_0   ???
> 
> 
> With "PGDEV=/proc/net/pktgen/kpktgend_0", I can't pgset "queue_map_min 0"
> Only able to pgset "queue_map_min 0" with "PGDEV=/proc/net/pktgen/eth4"
> 
> >   echo "Configuring $PGDEV"
> >  pgset "$COUNT"
> >  pgset "$CLONE_SKB"
> >  pgset "$PKT_SIZE"
> >  pgset "$DELAY"
> >  pgset "dst 10.10.11.2" 
> >  pgset "queue_map_min 0"
> >  pgset "queue_map_max 7"
> >  pgset "dst_mac  00:04:23:08:91:dc"
> > 
> > ->it is ok
> > 
> > 
> > Here is the top result, why only kpktgend_0 is running?
> > Eric,can you share the pktgen script? Thank you
> 
> If you want to control several kpktgend you need to send each one of
> them a full pktgen script.
> 
> PGDEV=/proc/net/pktgen/kpktgend_0
> script0
> PGDEV=/proc/net/pktgen/kpktgend_1
> script1
> PGDEV=/proc/net/pktgen/kpktgend_2
> script2
> PGDEV=/proc/net/pktgen/kpktgend_3
> script3
> 
> but these kpktgend can not share the same device?
> i.e.
> I can't add the device again?
> 
> PGDEV=/proc/net/pktgen/kpktgend_1
>  #echo "Removing all devices"
>  #pgset "rem_device_all" 
>  #echo "Adding eth4"
>  pgset "add_device eth4"   
>  pgset "queue_map_min 1"
>  pgset "queue_map_max 1"
> 
> as I saw when pps reach 1M, throughput reach 900Mbps,the kpktgend_0 also reach 100% cpu (with intel 10G X520 nic,ixgbe)
> 
> 

Please quote messages normally.

You are mixing several things; please read:

http://www.spinics.net/lists/netdev/msg71164.html





* Re: Using ethernet device as efficient small packet generator
@ 2010-12-23  5:15 juice
  2010-12-23  8:57 ` Jon Zhou
  0 siblings, 1 reply; 28+ messages in thread
From: juice @ 2010-12-23  5:15 UTC (permalink / raw)
  To: Eric Dumazet, Stephen Hemminger, netdev

> Reaching 1Gbs should not be a problem (I was speaking about 10Gbps)
> I reach link speed with my tg3 card and one single cpu :)
> (Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3))
>
> Please provide : ethtool -S eth0
>

This is from the e1000 interface:
03:02.1 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet
Controller (Copper) (rev 01)

root@a2labralinux:/home/juice# ethtool -S eth1
NIC statistics:
     rx_packets: 192069
     tx_packets: 60000313
     rx_bytes: 33850492
     tx_bytes: 3840026215
     rx_broadcast: 192069
     tx_broadcast: 3
     rx_multicast: 0
     tx_multicast: 310
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 0
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 1806437
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 0
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 33850492
     rx_csum_offload_good: 8978
     rx_csum_offload_errors: 0
     rx_header_split: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 0
     rx_smbus: 0
     dropped_smbus: 0
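
If I read the e1000 counters right, the interesting one above is
tx_restart_queue (1806437): the driver keeps stopping and restarting the
TX queue because the TX descriptor ring fills up, i.e. pktgen feeds the
NIC faster than it completes transmits. Something to try (assuming the
hardware supports it) is enlarging the TX ring and watching that counter:

ethtool -g eth1                 # current ring sizes
ethtool -G eth1 tx 4096         # maximum depends on the NIC
ethtool -S eth1 | grep restart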


This is from the tg3 interface:
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5761
Gigabit Ethernet PCIe (rev 10)

root@d8labralinux:/home/juice# ethtool -S eth2
NIC statistics:
     rx_octets: 10814
     rx_fragments: 0
     rx_ucast_packets: 20
     rx_mcast_packets: 0
     rx_bcast_packets: 26
     rx_fcs_errors: 0
     rx_align_errors: 0
     rx_xon_pause_rcvd: 0
     rx_xoff_pause_rcvd: 0
     rx_mac_ctrl_rcvd: 0
     rx_xoff_entered: 0
     rx_frame_too_long_errors: 0
     rx_jabbers: 0
     rx_undersize_packets: 0
     rx_in_length_errors: 0
     rx_out_length_errors: 0
     rx_64_or_less_octet_packets: 0
     rx_65_to_127_octet_packets: 0
     rx_128_to_255_octet_packets: 0
     rx_256_to_511_octet_packets: 0
     rx_512_to_1023_octet_packets: 0
     rx_1024_to_1522_octet_packets: 0
     rx_1523_to_2047_octet_packets: 0
     rx_2048_to_4095_octet_packets: 0
     rx_4096_to_8191_octet_packets: 0
     rx_8192_to_9022_octet_packets: 0
     tx_octets: 5120013863
     tx_collisions: 0
     tx_xon_sent: 0
     tx_xoff_sent: 0
     tx_flow_control: 0
     tx_mac_errors: 0
     tx_single_collisions: 0
     tx_mult_collisions: 0
     tx_deferred: 0
     tx_excessive_collisions: 0
     tx_late_collisions: 0
     tx_collide_2times: 0
     tx_collide_3times: 0
     tx_collide_4times: 0
     tx_collide_5times: 0
     tx_collide_6times: 0
     tx_collide_7times: 0
     tx_collide_8times: 0
     tx_collide_9times: 0
     tx_collide_10times: 0
     tx_collide_11times: 0
     tx_collide_12times: 0
     tx_collide_13times: 0
     tx_collide_14times: 0
     tx_collide_15times: 0
     tx_ucast_packets: 80000034
     tx_mcast_packets: 42
     tx_bcast_packets: 40
     tx_carrier_sense_errors: 0
     tx_discards: 0
     tx_errors: 0
     dma_writeq_full: 0
     dma_write_prioq_full: 0
     rxbds_empty: 0
     rx_discards: 0
     rx_errors: 0
     rx_threshold_hit: 0
     dma_readq_full: 0
     dma_read_prioq_full: 0
     tx_comp_queue_full: 0
     ring_set_send_prod_index: 0
     ring_status_update: 0
     nic_irqs: 0
     nic_avoided_irqs: 0
     nic_tx_threshold_hit: 0





* RE: Using ethernet device as efficient small packet generator
  2010-12-23  5:15 juice
@ 2010-12-23  8:57 ` Jon Zhou
  2010-12-23 10:50   ` juice
  0 siblings, 1 reply; 28+ messages in thread
From: Jon Zhou @ 2010-12-23  8:57 UTC (permalink / raw)
  To: juice@swagman.org, Eric Dumazet, Stephen Hemminger,
	netdev@vger.kernel.org



> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of juice
> Sent: Thursday, December 23, 2010 1:16 PM
> To: Eric Dumazet; Stephen Hemminger; netdev@vger.kernel.org
> Subject: Re: Using ethernet device as efficient small packet generator
> 
> > Reaching 1Gbs should not be a problem (I was speaking about 10Gbps)
> > I reach link speed with my tg3 card and one single cpu :)
> > (Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3))
> >
> > Please provide : ethtool -S eth0
> >
> 
> This is from the e1000 interface:
> 03:02.1 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet
> Controller (Copper) (rev 01)
> 
> root@a2labralinux:/home/juice# ethtool -S eth1
> NIC statistics:
>      rx_packets: 192069
>      tx_packets: 60000313
>      rx_bytes: 33850492
>      tx_bytes: 3840026215
>      rx_broadcast: 192069
>      tx_broadcast: 3
>      rx_multicast: 0
>      tx_multicast: 310
>      rx_errors: 0
>      tx_errors: 0
>      tx_dropped: 0
>      multicast: 0
>      collisions: 0
>      rx_length_errors: 0
>      rx_over_errors: 0
>      rx_crc_errors: 0
>      rx_frame_errors: 0
>      rx_no_buffer_count: 0
>      rx_missed_errors: 0
>      tx_aborted_errors: 0
>      tx_carrier_errors: 0
>      tx_fifo_errors: 0
>      tx_heartbeat_errors: 0
>      tx_window_errors: 0
>      tx_abort_late_coll: 0
>      tx_deferred_ok: 0
>      tx_single_coll_ok: 0
>      tx_multi_coll_ok: 0
>      tx_timeout_count: 0
>      tx_restart_queue: 1806437
>      rx_long_length_errors: 0
>      rx_short_length_errors: 0
>      rx_align_errors: 0
>      tx_tcp_seg_good: 0
>      tx_tcp_seg_failed: 0
>      rx_flow_control_xon: 0
>      rx_flow_control_xoff: 0
>      tx_flow_control_xon: 0
>      tx_flow_control_xoff: 0
>      rx_long_byte_count: 33850492
>      rx_csum_offload_good: 8978
>      rx_csum_offload_errors: 0
>      rx_header_split: 0
>      alloc_rx_buff_failed: 0
>      tx_smbus: 0
>      rx_smbus: 0
>      dropped_smbus: 0
> 
> 
> This is from the tg3 interface:
> 05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5761
> Gigabit Ethernet PCIe (rev 10)
> 
> root@d8labralinux:/home/juice# ethtool -S eth2
> NIC statistics:
>      rx_octets: 10814
>      rx_fragments: 0
>      rx_ucast_packets: 20
>      rx_mcast_packets: 0
>      rx_bcast_packets: 26
>      rx_fcs_errors: 0
>      rx_align_errors: 0
>      rx_xon_pause_rcvd: 0
>      rx_xoff_pause_rcvd: 0
>      rx_mac_ctrl_rcvd: 0
>      rx_xoff_entered: 0
>      rx_frame_too_long_errors: 0
>      rx_jabbers: 0
>      rx_undersize_packets: 0
>      rx_in_length_errors: 0
>      rx_out_length_errors: 0
>      rx_64_or_less_octet_packets: 0
>      rx_65_to_127_octet_packets: 0
>      rx_128_to_255_octet_packets: 0
>      rx_256_to_511_octet_packets: 0
>      rx_512_to_1023_octet_packets: 0
>      rx_1024_to_1522_octet_packets: 0
>      rx_1523_to_2047_octet_packets: 0
>      rx_2048_to_4095_octet_packets: 0
>      rx_4096_to_8191_octet_packets: 0
>      rx_8192_to_9022_octet_packets: 0
>      tx_octets: 5120013863
>      tx_collisions: 0
>      tx_xon_sent: 0
>      tx_xoff_sent: 0
>      tx_flow_control: 0
>      tx_mac_errors: 0
>      tx_single_collisions: 0
>      tx_mult_collisions: 0
>      tx_deferred: 0
>      tx_excessive_collisions: 0
>      tx_late_collisions: 0
>      tx_collide_2times: 0
>      tx_collide_3times: 0
>      tx_collide_4times: 0
>      tx_collide_5times: 0
>      tx_collide_6times: 0
>      tx_collide_7times: 0
>      tx_collide_8times: 0
>      tx_collide_9times: 0
>      tx_collide_10times: 0
>      tx_collide_11times: 0
>      tx_collide_12times: 0
>      tx_collide_13times: 0
>      tx_collide_14times: 0
>      tx_collide_15times: 0
>      tx_ucast_packets: 80000034
>      tx_mcast_packets: 42
>      tx_bcast_packets: 40
>      tx_carrier_sense_errors: 0
>      tx_discards: 0
>      tx_errors: 0
>      dma_writeq_full: 0
>      dma_write_prioq_full: 0
>      rxbds_empty: 0
>      rx_discards: 0
>      rx_errors: 0
>      rx_threshold_hit: 0
>      dma_readq_full: 0
>      dma_read_prioq_full: 0
>      tx_comp_queue_full: 0
>      ring_set_send_prod_index: 0
>      ring_status_update: 0
>      nic_irqs: 0
>      nic_avoided_irqs: 0
>      nic_tx_threshold_hit: 0

ethtool -S on my Intel X520 10G NIC shows that there are 8 RX/TX queues.

I just reached 5M pps with 64-byte packets, following the link given by
Eric Dumazet (the two ports of the NIC connected to each other, Xeon
E5540, kernel 2.6.32, IRQ affinity set; note that I have an abnormal
ksoftirqd/2 which occupies 30% CPU even when idle, so there is still
room for improvement).

On another, older kernel (2.6.16) with tg3 and bnx2 1G NICs and a Xeon
E5450 I only got 490K pps (about 300Mbps, 30% of GE); I think the reason
is that multiqueue is not supported in that kernel.

I will do a test with a 1Gb NIC on the new kernel later.

> 
> 
> 


* RE: Using ethernet device as efficient small packet generator
  2010-12-23  8:57 ` Jon Zhou
@ 2010-12-23 10:50   ` juice
  0 siblings, 0 replies; 28+ messages in thread
From: juice @ 2010-12-23 10:50 UTC (permalink / raw)
  To: Jon Zhou, Eric Dumazet, Stephen Hemminger, netdev@vger.kernel.org

>
> Ethtool -S "My intel x520 10G nic" will show there are 8 rx/tx queues
>
> I just made 5M pps with 64 bytes packet according to link given by eric
> Dumazet.
> (connect the 2 ports with each other of the NIC, XEON E5540,kernel
> 2.6.32,set irq affinity, Noted that I have an abnormal ksoftirqd/2 which
> occupy 30%cpu even at idle state, so the result still has space to
> improve)
>
> At another old kernel(2.6.16) with tg3 and bnx2 1G NIC,XEON E5450, I only
> got 490K pps(it is about 300Mbps,30% GE), I think the reason is multiqueue
> unsupported in this kernel.
>
> I will do a test with 1Gb nic on the new kernel later.
>

Which do you suppose is the reason for the poor performance on my
setup: is it the lack of multiqueue hardware in the GE NICs I am using,
or the lack of multiqueue support in the kernel (2.6.32) I am running?

Is multiqueue really necessary to achieve full 1GE saturation, or is it
only needed on 10GE NICs?

As I understand it, multiqueue is useful only if there are many CPU
cores available, each handling one queue.

The application I am thinking of (preloading a packet sequence into the
kernel from a userland application and then sending from that buffer)
probably does not benefit much from many cores; it would be enough for
one CPU to handle the sending while the other core(s) handle other
tasks.

Yours, Jussi Ohenoja




* RE: Using ethernet device as efficient small packet generator
@ 2010-12-30  1:11 Loke, Chetan
  2011-01-21 11:44 ` juice
  0 siblings, 1 reply; 28+ messages in thread
From: Loke, Chetan @ 2010-12-30  1:11 UTC (permalink / raw)
  To: Jon Zhou, juice, Eric Dumazet, Stephen Hemminger, netdev

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Jon Zhou
> Sent: December 23, 2010 3:58 AM
> To: juice@swagman.org; Eric Dumazet; Stephen Hemminger;
> netdev@vger.kernel.org
> Subject: RE: Using ethernet device as efficient small packet generator
> 
> 
> At another old kernel(2.6.16) with tg3 and bnx2 1G NIC,XEON E5450, I
> only got 490K pps(it is about 300Mbps,30% GE), I think the reason is
> multiqueue unsupported in this kernel.
> 
> I will do a test with 1Gb nic on the new kernel later.
> 


I can hit close to 1M pps (first time, every time) with a 64-byte
payload on my virtual machine (running 2.6.33) via a vmxnet3 vNIC:


[root@localhost ~]# cat /proc/net/pktgen/eth2
Params: count 0  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 0  ifname: eth2
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 192.168.222.2  dst_max:
        src_min:   src_max:
     src_mac: 00:50:56:b1:00:19 dst_mac: 00:50:56:c0:00:3e
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 59241012  errors: 0
     started: 1898437021us  stopped: 1957709510us idle: 9168us
     seq_num: 59241013  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x2dea8c0
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 59272488(c59263320+d9168) nsec, 59241012 (60byte,0frags)
  999468pps 479Mb/sec (479744640bps) errors: 0



Chetan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Using ethernet device as efficient small packet generator
  2010-12-30  1:11 Using ethernet device as efficient small packet generator Loke, Chetan
@ 2011-01-21 11:44 ` juice
  2011-01-21 11:51   ` Eric Dumazet
  2011-01-21 22:09   ` Brandeburg, Jesse
  0 siblings, 2 replies; 28+ messages in thread
From: juice @ 2011-01-21 11:44 UTC (permalink / raw)
  To: Loke, Chetan, Jon Zhou, Eric Dumazet, Stephen Hemminger, netdev

>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-
>> owner@vger.kernel.org] On Behalf Of Jon Zhou
>> Sent: December 23, 2010 3:58 AM
>> To: juice@swagman.org; Eric Dumazet; Stephen Hemminger;
>> netdev@vger.kernel.org
>> Subject: RE: Using ethernet device as efficient small packet generator
>>
>>
>> At another old kernel(2.6.16) with tg3 and bnx2 1G NIC,XEON E5450, I
>> only got 490K pps(it is about 300Mbps,30% GE), I think the reason is
>> multiqueue unsupported in this kernel.
>>
>> I will do a test with 1Gb nic on the new kernel later.
>>
>
>
> I can hit close to 1M pps(first time every time) w/ a 64-byte payload on
> my VirtualMachine(running 2.6.33) via vmxnet3 vNIC -
>
>
> [root@localhost ~]# cat /proc/net/pktgen/eth2
> Params: count 0  min_pkt_size: 60  max_pkt_size: 60
>      frags: 0  delay: 0  clone_skb: 0  ifname: eth2
>      flows: 0 flowlen: 0
>      queue_map_min: 0  queue_map_max: 0
>      dst_min: 192.168.222.2  dst_max:
>         src_min:   src_max:
>      src_mac: 00:50:56:b1:00:19 dst_mac: 00:50:56:c0:00:3e
>      udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
>      src_mac_count: 0  dst_mac_count: 0
>      Flags:
> Current:
>      pkts-sofar: 59241012  errors: 0
>      started: 1898437021us  stopped: 1957709510us idle: 9168us
>      seq_num: 59241013  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
>      cur_saddr: 0x0  cur_daddr: 0x2dea8c0
>      cur_udp_dst: 9  cur_udp_src: 9
>      cur_queue_map: 0
>      flows: 0
> Result: OK: 59272488(c59263320+d9168) nsec, 59241012 (60byte,0frags)
>   999468pps 479Mb/sec (479744640bps) errors: 0
>
>
>
> Chetan
>


Hi again.

It has been a while since I was last able to test this, as there have
been some other matters at hand.
However, I have now managed to rerun my tests on several different kernels.

I am now using a PCIe Intel e1000e card, which should be able to handle
the needed traffic.

The statistics that I get are as follows:

kernel 2.6.32-27 (ubuntu 10.10 default)
    pktgen:           750064pps 360Mb/sec (360030720bps)
    AX4000 analyser:  Total bitrate:             383.879 MBits/s
                      Bandwidth:                 38.39% GE
                      Average packet interval:   1.33 us

kernel 2.6.37 (latest stable from kernel.org)
    pktgen:           786848pps 377Mb/sec (377687040bps)
    AX4000 analyser:  Total bitrate:             402.904 MBits/s
                      Bandwidth:                 40.29% GE
                      Average packet interval:   1.27 us

kernel 2.6.38-rc1 (latest from kernel.org)
    pktgen:           795297pps 381Mb/sec (381742560bps)
    AX4000 analyser:  Total bitrate:             407.117 MBits/s
                      Bandwidth:                 40.72% GE
                      Average packet interval:   1.26 us


In every case I have set the IRQ affinity of eth1 to CPU0 and started
the test running in kpktgend_0.
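
(For reference, a sketch of that pinning; the IRQ number comes from
/proc/interrupts, and 69 happens to be the one shown in the 2.6.38-rc1
dmesg output further down:)

  grep eth1 /proc/interrupts          # find the IRQ number assigned to eth1
  echo 1 > /proc/irq/69/smp_affinity  # bitmask 0x1 = CPU0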

The complete data of my measurements follows at the end of this post.

It looks like the small-packet sending efficiency of the ethernet driver
is improving all the time, albeit quite slowly.

Now, I would be interested in knowing whether it is indeed possible to
increase the sending rate to near full 1GE capacity with the current
ethernet card I am using, or whether I have a hardware limitation here.

I recall hearing that there are some enhanced versions of the e1000
network card that have been geared towards higher performance at the
expense of some functionality or general system efficiency.
Can anybody point me in the right direction?

As I stated before, quoting myself:

> Which do you suppose is the reason for poor performance on my setup,
> is it lack of multiqueue HW in the GE NIC's I am using or is it lack
> of multiqueue support in the kernel (2.6.32) that I am using?
>
> Is multiqueue really necessary to achieve the full 1GE saturation, or
> is it only needed on 10GE NIC's?
>
> As I understand multiqueue is useful only if there are lots of CPU cores
> to run, each handling one queue.
>
> The application I am thinking of, preloading a packet sequence into
> kernel from userland application and then starting to send from buffer
> propably does not benefit so much from many cores, it would be enough
> that one CPU would handle the sending and other core(s) would handle
> other tasks.

Yours, Jussi Ohenoja


*** Measurement details follow ***


root@d8labralinux:/var/home/juice# lspci -vvv -s 04:00.0
04:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet
Controller (Copper) (rev 06)
	Subsystem: Intel Corporation Device 1082
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 11
	Region 0: Memory at f3cc0000 (32-bit, non-prefetchable) [size=128K]
	Region 1: Memory at f3ce0000 (32-bit, non-prefetchable) [size=128K]
	Region 2: I/O ports at cce0 [size=32]
	Expansion ROM at f3d00000 [disabled] [size=128K]
	Capabilities: [c8] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0
Enable-
		Address: 0000000000000000  Data: 0000
	Capabilities: [e0] Express (v1) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 <4us, L1
<64us
			ClockPM- Suprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive-
BWMgmt- ABWMgmt-
	Capabilities: [100] Advanced Error Reporting <?>
	Capabilities: [140] Device Serial Number b1-e5-7c-ff-ff-21-1b-00
	Kernel modules: e1000e

root@d8labralinux:/var/home/juice# ethtool eth1
Settings for eth1:
	Supported ports: [ TP ]
	Supported link modes:   10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Full
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Full
	Advertised pause frame use: No
	Advertised auto-negotiation: Yes
	Link partner advertised link modes:  Not reported
	Link partner advertised pause frame use: No
	Link partner advertised auto-negotiation: No
	Speed: 1000Mb/s
	Duplex: Full
	Port: Twisted Pair
	PHYAD: 1
	Transceiver: internal
	Auto-negotiation: on
	MDI-X: on
	Supports Wake-on: pumbag
	Wake-on: d
	Current message level: 0x00000001 (1)
	Link detected: yes





2.6.38-rc1
----------

dmesg:

[  195.685655] e1000e: Intel(R) PRO/1000 Network Driver - 1.2.20-k2
[  195.685658] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[  195.685677] e1000e 0000:04:00.0: Disabling ASPM  L1
[  195.685690] e1000e 0000:04:00.0: PCI INT A -> GSI 16 (level, low) ->
IRQ 16
[  195.685707] e1000e 0000:04:00.0: setting latency timer to 64
[  195.685852] e1000e 0000:04:00.0: irq 69 for MSI/MSI-X
[  195.869917] e1000e 0000:04:00.0: eth1: (PCI Express:2.5GB/s:Width x1)
00:1b:21:7c:e5:b1
[  195.869921] e1000e 0000:04:00.0: eth1: Intel(R) PRO/1000 Network
Connection
[  195.870006] e1000e 0000:04:00.0: eth1: MAC: 1, PHY: 4, PBA No: D50861-006
[  196.017285] e1000e 0000:04:00.0: irq 69 for MSI/MSI-X
[  196.073144] e1000e 0000:04:00.0: irq 69 for MSI/MSI-X
[  196.073630] ADDRCONF(NETDEV_UP): eth1: link is not ready
[  198.746000] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow
Control: None
[  198.746162] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[  209.564433] eth1: no IPv6 routers present


pktgen:

Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 1  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
        src_min:   src_max:
     src_mac: 00:1b:21:7c:e5:b1 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 77203892067us  stopped: 77216465982us idle: 1325us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 12573914(c12572589+d1325) nsec, 10000000 (60byte,0frags)
  795297pps 381Mb/sec (381742560bps) errors: 0


AX4000 analyser:

   Total bitrate:             407.117 MBits/s
   Bandwidth:                 40.72% GE
   Average packet interval:   1.26 us






2.6.37
------


dmesg:

[ 1810.959907] e1000e: Intel(R) PRO/1000 Network Driver - 1.2.7-k2
[ 1810.959909] e1000e: Copyright (c) 1999 - 2010 Intel Corporation.
[ 1810.959928] e1000e 0000:04:00.0: Disabling ASPM  L1
[ 1810.959942] e1000e 0000:04:00.0: PCI INT A -> GSI 16 (level, low) ->
IRQ 16
[ 1810.959961] e1000e 0000:04:00.0: setting latency timer to 64
[ 1810.960103] e1000e 0000:04:00.0: irq 66 for MSI/MSI-X
[ 1811.137269] e1000e 0000:04:00.0: eth1: (PCI Express:2.5GB/s:Width x1)
00:1b:21:7c:e5:b1
[ 1811.137272] e1000e 0000:04:00.0: eth1: Intel(R) PRO/1000 Network
Connection
[ 1811.137358] e1000e 0000:04:00.0: eth1: MAC: 1, PHY: 4, PBA No: d50861-006
[ 1811.286173] e1000e 0000:04:00.0: irq 66 for MSI/MSI-X
[ 1811.342065] e1000e 0000:04:00.0: irq 66 for MSI/MSI-X
[ 1811.342575] ADDRCONF(NETDEV_UP): eth1: link is not ready
[ 1814.010736] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow
Control: None
[ 1814.010949] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[ 1824.082148] eth1: no IPv6 routers present


pktgen:

Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 1  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
        src_min:   src_max:
     src_mac: 00:1b:21:7c:e5:b1 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 265936151us  stopped: 278645077us idle: 1651us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 12708925(c12707274+d1651) nsec, 10000000 (60byte,0frags)
  786848pps 377Mb/sec (377687040bps) errors: 0


AX4000 analyser:

   Total bitrate:             402.904 MBits/s
   Bandwidth:                 40.29% GE
   Average packet interval:   1.27 us






2.6.32-27
---------


dmesg:

[    2.178800] e1000e: Intel(R) PRO/1000 Network Driver - 1.0.2-k2
[    2.178802] e1000e: Copyright (c) 1999-2008 Intel Corporation.
[    2.178854] e1000e 0000:04:00.0: PCI INT A -> GSI 16 (level, low) ->
IRQ 16
[    2.178887] e1000e 0000:04:00.0: setting latency timer to 64
[    2.179039] e1000e 0000:04:00.0: irq 53 for MSI/MSI-X
[    2.360700] 0000:04:00.0: eth1: (PCI Express:2.5GB/s:Width x1)
00:1b:21:7c:e5:b1
[    2.360702] 0000:04:00.0: eth1: Intel(R) PRO/1000 Network Connection
[    2.360787] 0000:04:00.0: eth1: MAC: 1, PHY: 4, PBA No: d50861-006
[    9.551486] e1000e 0000:04:00.0: irq 53 for MSI/MSI-X
[    9.607309] e1000e 0000:04:00.0: irq 53 for MSI/MSI-X
[    9.607876] ADDRCONF(NETDEV_UP): eth1: link is not ready
[   12.448302] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow
Control: None
[   12.448544] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[   23.068498] eth1: no IPv6 routers present


pktgen:

Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 1  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
        src_min:   src_max:
     src_mac: 00:1b:21:7c:e5:b1 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 799760010us  stopped: 813092189us idle: 1314us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 13332178(c13330864+d1314) nsec, 10000000 (60byte,0frags)
  750064pps 360Mb/sec (360030720bps) errors: 0


AX4000 analyser:

   Total bitrate:             383.879 MBits/s
   Bandwidth:                 38.39% GE
   Average packet interval:   1.33 us




root@d8labralinux:/var/home/juice/pkt_test# cat ./pktgen_conf
#!/bin/bash

#modprobe pktgen

function pgset() {
  local result
  echo $1 > $PGDEV
  result=`cat $PGDEV | fgrep "Result: OK:"`
  if [ "$result" = "" ]; then
    cat $PGDEV | fgrep Result:
  fi
}

function pg() {
  echo inject > $PGDEV
  cat $PGDEV
}

# Config Start Here -----------------------------------------------------------

# thread config
# Each CPU has its own thread (two-CPU example); here only eth1 is added, on kpktgend_0.
PGDEV=/proc/net/pktgen/kpktgend_0
echo "Removing all devices"
pgset "rem_device_all"
PGDEV=/proc/net/pktgen/kpktgend_1
pgset "rem_device_all"

PGDEV=/proc/net/pktgen/kpktgend_0
echo "Adding eth1"
pgset "add_device eth1"
#echo "Setting max_before_softirq 10000"
#pgset "max_before_softirq 10000"

# device config
# ipg is inter packet gap. 0 means maximum speed.
CLONE_SKB="clone_skb 1"
# NIC adds 4 bytes CRC
PKT_SIZE="pkt_size 60"
# COUNT 0 means forever
#COUNT="count 0"
COUNT="count 10000000"
IPG="delay 0"
PGDEV=/proc/net/pktgen/eth1
echo "Configuring $PGDEV"
pgset "$COUNT"
pgset "$CLONE_SKB"
pgset "$PKT_SIZE"
pgset "$IPG"
pgset "dst 10.10.11.2"
pgset "dst_mac 00:04:23:08:91:dc"
pgset "queue_map_min 0"

# Time to run
PGDEV=/proc/net/pktgen/pgctrl
echo "Running... ctrl^C to stop"
pgset "start"
echo "Done"

# Result can be viewed in /proc/net/pktgen/eth1
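
(To actually run the script above and read the result back, the pktgen
module has to be loaded first, as the commented-out modprobe line hints;
this assumes the script is saved executable as ./pktgen_conf:)

  modprobe pktgen
  ./pktgen_conf
  cat /proc/net/pktgen/eth1   # the "Result: OK: ..." line appears after the run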





^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Using ethernet device as efficient small packet generator
  2011-01-21 11:44 ` juice
@ 2011-01-21 11:51   ` Eric Dumazet
  2011-01-21 12:12     ` juice
  2011-01-21 22:09   ` Brandeburg, Jesse
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2011-01-21 11:51 UTC (permalink / raw)
  To: juice; +Cc: Loke, Chetan, Jon Zhou, Stephen Hemminger, netdev

Le vendredi 21 janvier 2011 à 13:44 +0200, juice a écrit :

> Hi again.
> 
> It has been a while since last time I got to be able to test this
> again, as there have been some other matters at hand.
> However, now I managed to rerun my tests in several different kernels.
> 
> I am using now a PCIe Intel e1000e card, that should be able to handle
> the needed traffic amount.
> 
> The statistics that I get are as follows:
> 
> kernel 2.6.32-27 (ubuntu 10.10 default)
>     pktgen:           750064pps 360Mb/sec (360030720bps)
>     AX4000 analyser:  Total bitrate:             383.879 MBits/s
>                       Bandwidth:                 38.39% GE
>                       Average packet intereval:  1.33 us
> 
> kernel 2.6.37 (latest stable from kernel.org)
>     pktgen:           786848pps 377Mb/sec (377687040bps)
>     AX4000 analyser:  Total bitrate:             402.904 MBits/s
>                       Bandwidth:                 40.29% GE
>                       Average packet intereval:  1.27 us
> 
> kernel 2.6.38-rc1 (latest from kernel.org)
>     pktgen:           795297pps 381Mb/sec (381742560bps)
>     AX4000 analyser:  Total bitrate:             407.117 MBits/s
>                       Bandwidth:                 40.72% GE
>                       Average packet intereval:  1.26 us
> 
> 

...

> pktgen:
> 
> Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
>      frags: 0  delay: 0  clone_skb: 1  ifname: eth1
>      flows: 0 flowlen: 0
>      queue_map_min: 0  queue_map_max: 0
>      dst_min: 10.10.11.2  dst_max:
>         src_min:   src_max:
>      src_mac: 00:1b:21:7c:e5:b1 dst_mac: 00:04:23:08:91:dc
>      udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
>      src_mac_count: 0  dst_mac_count: 0
>      Flags:
> Current:
>      pkts-sofar: 10000000  errors: 0
>      started: 77203892067us  stopped: 77216465982us idle: 1325us
>      seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
>      cur_saddr: 0x0  cur_daddr: 0x20b0a0a
>      cur_udp_dst: 9  cur_udp_src: 9
>      cur_queue_map: 0
>      flows: 0
> Result: OK: 12573914(c12572589+d1325) nsec, 10000000 (60byte,0frags)
>   795297pps 381Mb/sec (381742560bps) errors: 0
> 
> 
> AX4000 analyser:
> 
>    Total bitrate:             407.117 MBits/s
>    Bandwidth:                 40.72% GE
>    Average packet intereval:  1.26 us
> 
> 

You should try

CLONE_SKB="clone_skb 10"
...
pgset "$CLONE_SKB"


Because I suspect you hit a performance problem on skb
allocation/filling/use/freeing
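
(The same knob can also be set directly through the pktgen /proc
interface, which is all the pgset wrapper in the earlier script does:)

  echo "clone_skb 10" > /proc/net/pktgen/eth1
  cat /proc/net/pktgen/eth1   # "clone_skb: 10" should now appear under Params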

You can use the perf tool to get a performance profile while your pktgen
session is running:

# cd tools/perf
# make
...
# ./perf top
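
(perf lives in the kernel source tree, so the cd above is relative to an
unpacked kernel. A sketch of the same steps, with an optional CPU filter;
the path is illustrative and whether your perf version accepts -C for
this varies:)

  cd linux-2.6.37/tools/perf
  make
  ./perf top -C 0   # sample only CPU0, where kpktgend_0 and eth1's IRQ are pinned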




^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Using ethernet device as efficient small packet generator
  2011-01-21 11:51   ` Eric Dumazet
@ 2011-01-21 12:12     ` juice
  2011-01-21 13:38       ` Ben Greear
  0 siblings, 1 reply; 28+ messages in thread
From: juice @ 2011-01-21 12:12 UTC (permalink / raw)
  To: Eric Dumazet, Loke, Chetan, Jon Zhou, Stephen Hemminger, netdev

> Le vendredi 21 janvier 2011 à 13:44 +0200, juice a écrit :
>
>
> You should try
>
> CLONE_SKB="clone_skb 10"
> ...
> pgset "$CLONE_SKB"
>
>
> Because I suspect you hit a performance problem on skb
> allocation/filling/use/freeing

Actually, that makes the performance worse:
(Now I tried it with kernel 2.6.37, which is currently running)

root@d8labralinux:/var/home/juice/pkt_test# cat /proc/net/pktgen/eth1
Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 10  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
        src_min:   src_max:
     src_mac: 00:1b:21:7c:e5:b1 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 2555660074us  stopped: 2569239323us idle: 3484us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 13579248(c13575763+d3484) nsec, 10000000 (60byte,0frags)
  736417pps 353Mb/sec (353480160bps) errors: 0


> You can use perf tool to get some performance profile while your pktgen
> session is running
>
> # cd tools/perf
> # make
> ...
> # ./perf top
>

I can try that.
Where do I get the performance profiler tool?


Yours, Jussi Ohenoja



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Using ethernet device as efficient small packet generator
  2011-01-21 12:12     ` juice
@ 2011-01-21 13:38       ` Ben Greear
  0 siblings, 0 replies; 28+ messages in thread
From: Ben Greear @ 2011-01-21 13:38 UTC (permalink / raw)
  To: juice; +Cc: Eric Dumazet, Loke, Chetan, Jon Zhou, Stephen Hemminger, netdev

On 01/21/2011 04:12 AM, juice wrote:
>> Le vendredi 21 janvier 2011 à 13:44 +0200, juice a écrit :
>>
>>
>> You should try
>>
>> CLONE_SKB="clone_skb 10"
>> ...
>> pgset "$CLONE_SKB"
>>
>>
>> Because I suspect you hit a performance problem on skb
>> allocation/filling/use/freeing
>
> Actually, that makes the performance worse:
> (Now I tried it with kernel 2.6.37, which is currently running)

Maybe try clone_skb of 1000 or so.  pktgen zeroes the skb's memory when
allocating a packet, which can be quite expensive.

Also note that the Ethernet inter-frame gap isn't accounted for in
the BPS figure, but it is a significant fraction of the total bandwidth
when using 64-byte packets.
You are pushing a bit more than half of the theoretical limit of
around 1,400,000 64-byte packets per second for 1Gbps ethernet.
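
(A back-of-envelope check of that limit, assuming the standard per-frame
overhead of 8 bytes preamble/SFD and 12 bytes inter-frame gap on top of a
64-byte frame, i.e. 84 bytes or 672 bits on the wire per packet:)

  echo $((1000000000 / 672))
  1488095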

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Using ethernet device as efficient small packet generator
  2011-01-21 11:44 ` juice
  2011-01-21 11:51   ` Eric Dumazet
@ 2011-01-21 22:09   ` Brandeburg, Jesse
  2011-01-23 21:48     ` juice
  1 sibling, 1 reply; 28+ messages in thread
From: Brandeburg, Jesse @ 2011-01-21 22:09 UTC (permalink / raw)
  To: juice
  Cc: Loke, Chetan, Jon Zhou, Eric Dumazet, Stephen Hemminger,
	netdev@vger.kernel.org



On Fri, 21 Jan 2011, juice wrote:
> I am using now a PCIe Intel e1000e card, that should be able to handle
> the needed traffic amount.
> 
> The statistics that I get are as follows:
> 
> kernel 2.6.32-27 (ubuntu 10.10 default)
>     pktgen:           750064pps 360Mb/sec (360030720bps)
>     AX4000 analyser:  Total bitrate:             383.879 MBits/s
>                       Bandwidth:                 38.39% GE
>                       Average packet intereval:  1.33 us
> 
> kernel 2.6.37 (latest stable from kernel.org)
>     pktgen:           786848pps 377Mb/sec (377687040bps)
>     AX4000 analyser:  Total bitrate:             402.904 MBits/s
>                       Bandwidth:                 40.29% GE
>                       Average packet intereval:  1.27 us
> 
> kernel 2.6.38-rc1 (latest from kernel.org)
>     pktgen:           795297pps 381Mb/sec (381742560bps)
>     AX4000 analyser:  Total bitrate:             407.117 MBits/s
>                       Bandwidth:                 40.72% GE
>                       Average packet intereval:  1.26 us
> 

your computation of Bandwidth (as Ben Greear said) is not accounting for
the interframe gaps.  Maybe more useful is to note that wire speed for
64-byte packets is about 1.44 million packets per second.

> In every case I have set the IRQ affinity of eth1 to CPU0 and started
> the test running in kpktgend_0.
> 
> The complete data of my measurements follows in the end of this post.
> 
> It looks like the small packet sending effiency of the ethernet driver
> is improving all the time, albeit quite slowly.
> 
> Now, I would be intrested in knowing whether it is indeed possible to
> increase the sending rate near full 1GE capacity with the current
> ethernet card I am using or do I have here a hardware limitation here?
> 
> I recall hearing that there are some enhanced versions of the e1000
> network card, such that have been geared towards higher performance
> at the expense of some functionality or general system effiency.
> Can anybody point me how to do that?

I think you need different hardware (again), as you have saddled yourself
with an x1 PCIe connected adapter.  This adapter is not well suited to
small-packet traffic because the sheer number of transactions is affected
by the added latency of the x1 connector (vs our dual-port 1GbE
adapters with an x4 connector).


> As I stated before, quoting myself:
> 
> > Which do you suppose is the reason for poor performance on my setup,
> > is it lack of multiqueue HW in the GE NIC's I am using or is it lack
> > of multiqueue support in the kernel (2.6.32) that I am using?
> >
> > Is multiqueue really necessary to achieve the full 1GE saturation, or
> > is it only needed on 10GE NIC's?

With Core i3/i5/i7 or newer CPUs you should be able to saturate a 1Gb link
with a single core/queue.  With Core2-era processors you may have some
difficulty; with anything older than that you won't make it. :-)

> > As I understand multiqueue is useful only if there are lots of CPU cores
> > to run, each handling one queue.
> >
> > The application I am thinking of, preloading a packet sequence into
> > kernel from userland application and then starting to send from buffer
> > propably does not benefit so much from many cores, it would be enough
> > that one CPU would handle the sending and other core(s) would handle
> > other tasks.
> 
> Yours, Jussi Ohenoja
> 
> 
> *** Measurement details follows ***
> 
> 
> root@d8labralinux:/var/home/juice# lspci -vvv -s 04:00.0
> 04:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet
> Controller (Copper) (rev 06)

My suggestion is to get one of the igb-based adapters (82576 or 82580
based) that run the igb driver.

If you can't get hold of those, you should still be able to easily get
1.1M pps from an 82571 adapter.

You may also want to try reducing the TX descriptor ring count to 128
using ethtool, and changing the ethtool -C rx-usecs setting; try
20, 30, 40, 50 and 60.
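
(A concrete sketch of the two knobs suggested above, using eth1 and the
values from this thread; the exact supported ranges depend on the driver
and NIC:)

  ethtool -G eth1 tx 128        # shrink the TX descriptor ring to 128 entries
  ethtool -C eth1 rx-usecs 20   # interrupt coalescing; try 20, 30, 40, 50, 60
  ethtool -g eth1               # read back the ring sizes
  ethtool -c eth1               # read back the coalescing settings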

Jesse

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Using ethernet device as efficient small packet generator
  2011-01-21 22:09   ` Brandeburg, Jesse
@ 2011-01-23 21:48     ` juice
  2011-01-24  8:10       ` juice
  2011-02-02  8:13       ` juice
  0 siblings, 2 replies; 28+ messages in thread
From: juice @ 2011-01-23 21:48 UTC (permalink / raw)
  To: Brandeburg, Jesse, Loke, Chetan, Jon Zhou, Eric Dumazet,
	"Stephen Hemming


> your computation of Bandwidth (as Ben Greear said) is not accounting for
> the interframe gaps.  Maybe more useful is to note that wire speed 64 byte
> packets is 1.44 Million packets per second.

I am aware of the fact that the interframe gap eats away some of the
bandwidth from actual data bytes, and I am taking that into consideration.
My benchmark here is the Spirent AX4000 network analyzer, which can send
and receive at full utilization of the GE line.

The measurements when sending at full line rate from the AX4000 are:
  Total bitrate:             761.903 MBits/s
  Packet rate:               1488090 packets/s
  Bandwidth:                 76.19% GE
  Average packet interval:   0.67 us


> I think you need different hardware (again) as you have saddled yourself
> with a x1 PCIe connected adapter.  This adapter is not well suited to
> small packet traffic because the sheer amount of transactions is effected
> by the added latency due to the x1 connector (vs our dual port 1GbE
> adapters with a x4 connector)
>
> with Core i3/5/7 or newer cpus you should be able to saturate a 1Gb link
> with a single core/queue.  With Core2 era processors you may have some
> difficulty, with anything older than that you won't make it. :-)

The CPU I have on the machine driving the card is a dual-core Xeon:
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           W3503  @ 2.40GHz
stepping	: 5
cpu MHz		: 2399.926
cache size	: 4096 KB

I do hope this is enough to go with, as it is easier for me to get a
better network adapter than to order a new, faster machine, since I just
got this one last December :)


> My suggestion is to get one of the igb based adapters, 82576, or 82580
> based that run the igb driver.
>
> If you can't get a hold of those you should be able to easily get 1.1M pps
> from an 82571 adapter.

I will order the 82576 card and try my tests with that.


> you may also want to try reducing the tx descriptor ring count to 128
> using ethtool, and change the ethtool -C rx-usecs 20 setting, try
> 20,30,40,50,60

So this could make my current network card a little faster?
If I can reach 1.1 Mpackets/s, that is about 560 Mbit/s. At least it would
get me a little closer to what I am trying to achieve.

Yours, Jussi Ohenoja





^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Using ethernet device as efficient small packet generator
  2011-01-23 21:48     ` juice
@ 2011-01-24  8:10       ` juice
  2011-01-24  9:18         ` Eric Dumazet
  2011-01-24 16:34         ` Eric Dumazet
  2011-02-02  8:13       ` juice
  1 sibling, 2 replies; 28+ messages in thread
From: juice @ 2011-01-24  8:10 UTC (permalink / raw)
  To: Brandeburg, Jesse, Loke, Chetan, Jon Zhou, Eric Dumazet,
	"Stephen Hemming


>> you may also want to try reducing the tx descriptor ring count to 128
>> using ethtool, and change the ethtool -C rx-usecs 20 setting, try
>> 20,30,40,50,60
>
> So this could up my current network card to a little faster?
> If I can reach 1.1Mpackets/s, thats about 560Mbits/s. At least it would
> get me a little closet to what I am trying to achieve.
>

I tried these tunings, and it turns out that I am able to get the best
performance with pktgen when I set the options "ethtool -G eth1 tx 128"
and "ethtool -C eth1 rx-usecs 10". Anything different will lower the TX
performance.

Now I can get these rates:

root@d8labralinux:/var/home/juice/pkt_test# cat /proc/net/pktgen/eth1
Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 1  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
        src_min:   src_max:
     src_mac: 00:1b:21:7c:e5:b1 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 1205660106us  stopped: 1218005650us idle: 804us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 12345544(c12344739+d804) nsec, 10000000 (60byte,0frags)
  810008pps 388Mb/sec (388803840bps) errors: 0

AX4000:
  Total bitrate:             414.629 MBits/s
  Packet rate:               809824 packets/s
  Bandwidth:                 41.46% GE
  Average packet interval:   1.23 us

This is a bit better than the previous maximum of 750064 pps / 360 Mb/sec
that I was able to achieve without tuning parameters with ethtool, but
still not near the 1.1 Mpackets/s that should be doable with my card.

Are there other tunings or an alternative driver that I could use to get
the best performance out of the card? Basically, what puzzles me is the
fact that I can get much better performance using larger packets, which
suggests to me that the bottleneck cannot be the PCIe interface, as I can
push enough data through it. Is there any way of doing larger transfers
on the bus, like grouping many smaller packets together, to avoid the
problems caused by so many TX interrupts?

Yours, Jussi Ohenoja




^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Using ethernet device as efficient small packet generator
  2011-01-24  8:10       ` juice
@ 2011-01-24  9:18         ` Eric Dumazet
  2011-01-24 16:34         ` Eric Dumazet
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2011-01-24  9:18 UTC (permalink / raw)
  To: juice
  Cc: Brandeburg, Jesse, Loke, Chetan, Jon Zhou, Stephen Hemminger,
	netdev@vger.kernel.org

Le lundi 24 janvier 2011 à 10:10 +0200, juice a écrit :
> >> you may also want to try reducing the tx descriptor ring count to 128
> >> using ethtool, and change the ethtool -C rx-usecs 20 setting, try
> >> 20,30,40,50,60
> >
> > So this could up my current network card to a little faster?
> > If I can reach 1.1Mpackets/s, thats about 560Mbits/s. At least it would
> > get me a little closet to what I am trying to achieve.
> >
> 
> I tried these tunings, and it turns out that I am able to get the best
> performance with pktgen when I set the options "ethtool -G eth1 tx 128"
> and "ethtool -C eth1 rx-usecs 10". Anything different will lower the TX
> performance.
> 

That (rx-usecs 10) makes no sense.
pktgen sends packets.
You should not be receiving packets, should you?

> Now I can get these rates:
> 
> root@d8labralinux:/var/home/juice/pkt_test# cat /proc/net/pktgen/eth1
> Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
>      frags: 0  delay: 0  clone_skb: 1  ifname: eth1
>      flows: 0 flowlen: 0
>      queue_map_min: 0  queue_map_max: 0
>      dst_min: 10.10.11.2  dst_max:
>         src_min:   src_max:
>      src_mac: 00:1b:21:7c:e5:b1 dst_mac: 00:04:23:08:91:dc
>      udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
>      src_mac_count: 0  dst_mac_count: 0
>      Flags:
> Current:
>      pkts-sofar: 10000000  errors: 0
>      started: 1205660106us  stopped: 1218005650us idle: 804us
>      seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
>      cur_saddr: 0x0  cur_daddr: 0x20b0a0a
>      cur_udp_dst: 9  cur_udp_src: 9
>      cur_queue_map: 0
>      flows: 0
> Result: OK: 12345544(c12344739+d804) nsec, 10000000 (60byte,0frags)
>   810008pps 388Mb/sec (388803840bps) errors: 0
> 
> AX4000:
>   Total bitrate:             414.629 MBits/s
>   Packet rate:               809824 packets/s
>   Bandwidth:                 41.46% GE
>   Average packet intereval:  1.23 us
> 
> This is a bit better than the previous maxim of 750064pps / 360Mb/sec
> that I was able to achieve without tuning parameters with ethtool, but
> still not near the 1.1Mpacks/s that shoud be doable with my card?
> 
> Are there other tunings or alternate driver that I could use to get the
> best performance out of the card? Basically what puzzles me is the fact
> that I can get a lot better performance using larger packets, so that
> suggests to me that the bottleneck cannot be the PCIe interface, as I can
> push enough data through it. Is there any way of doing larger transfers
> on the bus, like grouping many smaller packets together to avoid the
> problems caused by so many TX interrupts?
> 

What matters is not the size of the packets, but the number of
transactions (packets per second).

You need an x4 or x8 connector to get more transactions per second by an
order of magnitude, rather than 10% or 15% ;)

TX interrupts are already 'grouped', one for ~50 packets

ethtool -c :

tx-usecs: 72
tx-frames: 53
tx-usecs-irq: 0
tx-frames-irq: 53




^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Using ethernet device as efficient small packet generator
  2011-01-24  8:10       ` juice
  2011-01-24  9:18         ` Eric Dumazet
@ 2011-01-24 16:34         ` Eric Dumazet
  2011-01-24 20:51           ` juice
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2011-01-24 16:34 UTC (permalink / raw)
  To: juice
  Cc: Brandeburg, Jesse, Loke, Chetan, Jon Zhou, Stephen Hemminger,
	netdev@vger.kernel.org

Le lundi 24 janvier 2011 à 10:10 +0200, juice a écrit :

> Result: OK: 12345544(c12344739+d804) nsec, 10000000 (60byte,0frags)
>   810008pps 388Mb/sec (388803840bps) errors: 0
> 

> 
> This is a bit better than the previous maxim of 750064pps / 360Mb/sec
> that I was able to achieve without tuning parameters with ethtool, but
> still not near the 1.1Mpacks/s that shoud be doable with my card?

Please check what numbers you can get using the dummy0 device instead of
the real ethernet driver.

Here : (E5540  @ 2.53GHz) clone = 1

Result: OK: 34775941(c34775225+d716) nsec, 100000000 (60byte,0frags)
  2875551pps 1380Mb/sec (1380264480bps) errors: 0
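
(A minimal sketch of that dummy-device baseline, assuming the dummy module
creates dummy0 as usual; pktgen is then pointed at dummy0 instead of eth1
in the config script:)

  modprobe dummy            # provides dummy0 by default
  ip link set dummy0 up     # pktgen needs the interface to be up
  # then: pgset "add_device dummy0" on kpktgend_0 instead of eth1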




^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Using ethernet device as efficient small packet generator
  2011-01-24 16:34         ` Eric Dumazet
@ 2011-01-24 20:51           ` juice
  0 siblings, 0 replies; 28+ messages in thread
From: juice @ 2011-01-24 20:51 UTC (permalink / raw)
  To: Eric Dumazet, Brandeburg, Jesse, Loke, Chetan, Jon Zhou,
	"Stephen Hemming



> Please check what numbers you can get using dummy0 device instead of
> real ethernet driver.
>
> Here : (E5540  @ 2.53GHz) clone = 1
>
> Result: OK: 34775941(c34775225+d716) nsec, 100000000 (60byte,0frags)
>   2875551pps 1380Mb/sec (1380264480bps) errors: 0
>

My result on the machine (W3503 @ 2.40GHz):

Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 0  ifname: dummy0
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
     src_min:   src_max:
     src_mac: b6:b2:a2:f4:8e:dc dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 1295902048722173us  stopped: 1295902052312514us idle: 3664us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 3590341(c3586677+d3664) usec, 10000000 (60byte,0frags)
  2785250pps 1336Mb/sec (1336920000bps) errors: 0


Yours, Jussi Ohenoja



^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Using ethernet device as efficient small packet generator
  2011-01-23 21:48     ` juice
  2011-01-24  8:10       ` juice
@ 2011-02-02  8:13       ` juice
  1 sibling, 0 replies; 28+ messages in thread
From: juice @ 2011-02-02  8:13 UTC (permalink / raw)
  To: Brandeburg, Jesse, Loke, Chetan, Jon Zhou, Eric Dumazet,
	"Stephen Hemming

>
>> your computation of Bandwidth (as Ben Greear said) is not accounting for
>> the interframe gaps.  Maybe more useful is to note that wire speed 64
>> byte packets is 1.44 Million packets per second.
>
> I am aware of the fact that interframe gap eats away some of the bandwidth
> from actual data bytes, and I am taking that into consideration.
> My benchmark here is the Spirent AX4000 network analyzer, which can send
> and receive full utilization of GE line.
>
> The measurement when sending full line rate from AX4000 are:
>   Total bitrate:             761.903 MBits/s
>   Packet rate:               1488090 packets/s
>   Bandwidth:                 76.19% GE
>   Average packet intereval:  0.67 us
>
>
>> I think you need different hardware (again) as you have saddled yourself
>> with a x1 PCIe connected adapter.  This adapter is not well suited to
>> small packet traffic because the sheer amount of transactions is
>> effected
>> by the added latency due to the x1 connector (vs our dual port 1GbE
>> adapters with a x4 connector)
>>
>> with Core i3/5/7 or newer cpus you should be able to saturate a 1Gb link
>> with a single core/queue.  With Core2 era processors you may have some
>> difficulty, with anything older than that you won't make it. :-)
>>
>> My suggestion is to get one of the igb based adapters, 82576, or 82580
>> based that run the igb driver.
>>
>> If you can't get a hold of those you should be able to easily get 1.1M
>> pps from an 82571 adapter.
>
> I will order the 82576 card and try my tests with that.
>

Okay, I have now installed the hot new 82576 dual-GE adapter and compiled
the igb module for the 2.6.38-rc2 kernel I am running.

The results with this adapter look very promising: I am now able to reach
the full GE wire rate for 64-byte packets with only interrupt CPU affinity
tuning; no other tweaks were needed:

root@d8labralinux:/var/home/juice/pkt_test# cat /proc/net/pktgen/eth1
Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 0  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
        src_min:   src_max:
     src_mac: 00:1b:21:97:21:76 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 1941436194us  stopped: 1948155853us idle: 179us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 6719658(c6719479+d179) nsec, 10000000 (60byte,0frags)
  1488170pps 714Mb/sec (714321600bps) errors: 0

AX4000 measurements:
   Total bitrate:             761.910 MBits/s
   Packet rate:               1488106 packets/s
   Bandwidth:                 76.19% GE
   Average packet interval:   0.67 us

Now I need to check whether I can send at similar rates from the userspace
socket interface. If that is possible, I may not even need to create a
kernel driver for my application.

Yours, Jussi Ohenoja



^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2011-02-02  8:13 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-12-30  1:11 Using ethernet device as efficient small packet generator Loke, Chetan
2011-01-21 11:44 ` juice
2011-01-21 11:51   ` Eric Dumazet
2011-01-21 12:12     ` juice
2011-01-21 13:38       ` Ben Greear
2011-01-21 22:09   ` Brandeburg, Jesse
2011-01-23 21:48     ` juice
2011-01-24  8:10       ` juice
2011-01-24  9:18         ` Eric Dumazet
2011-01-24 16:34         ` Eric Dumazet
2011-01-24 20:51           ` juice
2011-02-02  8:13       ` juice
  -- strict thread matches above, loose matches on Subject: below --
2010-12-23  5:15 juice
2010-12-23  8:57 ` Jon Zhou
2010-12-23 10:50   ` juice
2010-12-22  7:30 juice
2010-12-22  8:08 ` Eric Dumazet
2010-12-22 11:11   ` juice
2010-12-22 11:28     ` Eric Dumazet
2010-12-22 15:48   ` Jon Zhou
2010-12-22 15:59     ` Eric Dumazet
2010-12-22 16:52       ` Jon Zhou
2010-12-22 17:18         ` Eric Dumazet
2010-12-22 17:40           ` Jon Zhou
2010-12-22 17:51             ` Eric Dumazet
2010-12-22 17:15       ` Jon Zhou
2010-12-21  9:56 juice
2010-12-21 18:22 ` Stephen Hemminger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).