linuxppc-dev.lists.ozlabs.org archive mirror
* Re: [PATCH] mlx4_en: map entire pages to increase throughput
       [not found] ` <50044F1D.6000703@hp.com>
@ 2012-07-16 19:06   ` Thadeu Lima de Souza Cascardo
  2012-07-16 19:42     ` Rick Jones
  0 siblings, 1 reply; 8+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2012-07-16 19:06 UTC (permalink / raw)
  To: Rick Jones
  Cc: netdev@vger.kernel.org, leitao@linux.vnet.ibm.com,
	amirv@mellanox.com, yevgenyp@mellanox.co.il,
	klebers@linux.vnet.ibm.com, anton@samba.org, brking@linux.vnet.ibm.com,
	ogerlitz@mellanox.com, linuxppc-dev@lists.ozlabs.org, davem@davemloft.net

On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote:
> On 07/16/2012 10:01 AM, Thadeu Lima de Souza Cascardo wrote:
> >In its receive path, mlx4_en driver maps each page chunk that it pushes
> >to the hardware and unmaps it when pushing it up the stack. This limits
> >throughput to about 3Gbps on a Power7 8-core machine.
> 
> That seems rather extraordinarily low - Power7 is supposed to be a
> rather high performance CPU.  The last time I noticed O(3Gbit/s) on
> 10G for bulk transfer was before the advent of LRO/GRO - that was in
> the x86 space though.  Is mapping really that expensive with Power7?
> 

Copying linuxppc-dev and Anton here. But I can tell you that we have
lock contention when doing the mapping on the same adapter (the map
table is per device). Anton has sent some patches that improve that
*a lot*.

However, for 1500 MTU, mlx4_en was doing two unmaps and two maps per
packet. The problem is not the CPU power needed to do the mappings, but
the lock contention: we end up with the CPUs spending more than 30% of
their time on spin locking.
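
To make the contrast concrete, here is a minimal sketch of the idea
(illustrative names, not the actual mlx4_en code): map the whole page
once and hand out fragments as offsets into it, so that dma_map_page(),
and the lock it contends on, runs once per page instead of twice per
packet.

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Per-ring allocator state; hypothetical structure, for illustration. */
struct rx_page_alloc {
	struct page *page;	/* page currently being carved up */
	dma_addr_t dma;		/* DMA address of the whole page */
	unsigned int offset;	/* next unused offset within the page */
};

/*
 * Hand out the next fragment of an already-mapped page, mapping a
 * fresh page only when the current one is exhausted.
 */
static int rx_alloc_frag(struct device *dev, struct rx_page_alloc *a,
			 unsigned int frag_size, dma_addr_t *frag_dma)
{
	if (!a->page || a->offset + frag_size > PAGE_SIZE) {
		struct page *page = alloc_page(GFP_ATOMIC);
		dma_addr_t dma;

		if (!page)
			return -ENOMEM;
		dma = dma_map_page(dev, page, 0, PAGE_SIZE,
				   DMA_FROM_DEVICE);
		if (dma_mapping_error(dev, dma)) {
			__free_page(page);
			return -ENOMEM;
		}
		a->page = page;
		a->dma = dma;
		a->offset = 0;
	}
	/* Fast path: no map/unmap, just an offset into a mapped page. */
	*frag_dma = a->dma + a->offset;
	a->offset += frag_size;
	return 0;
}

The page can only be unmapped after its last fragment has been consumed,
which is why the patch description quoted below insists that fragments be
released in allocation order and that a descriptor be refilled either
completely or not at all.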

> 
> >One solution is to map the entire allocated page at once. However, this
> >requires that we keep track of every page fragment we give to a
> >descriptor. We also need to work with the discipline that all fragments will
> >be released (in the sense that they will not be reused by the driver
> >anymore) in the order they are allocated to the driver.
> >
> >This requires that we don't reuse any fragments; every single one of
> >them must be reallocated. We do that by releasing all the fragments that
> >are processed, and only after we have finished processing the descriptors
> >do we start the refill.
> >
> >We also must somehow guarantee that we either refill all fragments in a
> >descriptor or none at all, without resorting to giving up a page
> >fragment that we would have already given. Otherwise, we would break the
> >discipline of only releasing the fragments in the order they were
> >allocated.
> >
> >This has passed page allocation fault injections (restricted to the
> >driver by using required-start and required-end) and device hotplug
> >while 16 TCP streams were able to deliver more than 9Gbps.
> 
> What is the effect on packet-per-second performance?  (eg aggregate,
> burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR)
> 

I used uperf with TCP_NODELAY and 16 threads sending 64000-byte writes
from another machine for 60 seconds.

I get 5898 ops/s (3.02 Gb/s) without the patch against 18022 ops/s
(9.23 Gb/s) with the patch.

Best regards.
Cascardo.


> rick jones


* Re: [PATCH] mlx4_en: map entire pages to increase throughput
  2012-07-16 19:06   ` [PATCH] mlx4_en: map entire pages to increase throughput Thadeu Lima de Souza Cascardo
@ 2012-07-16 19:42     ` Rick Jones
  2012-07-16 20:36       ` Or Gerlitz
                         ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Rick Jones @ 2012-07-16 19:42 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: netdev@vger.kernel.org, leitao@linux.vnet.ibm.com,
	amirv@mellanox.com, yevgenyp@mellanox.co.il,
	klebers@linux.vnet.ibm.com, anton@samba.org,
	brking@linux.vnet.ibm.com, ogerlitz@mellanox.com,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net

On 07/16/2012 12:06 PM, Thadeu Lima de Souza Cascardo wrote:
> On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote:
>
>> What is the effect on packet-per-second performance?  (eg aggregate,
>> burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR)
>>
> I used uperf with TCP_NODELAY and 16 threads sending 64000-byte writes
> from another machine for 60 seconds.
>
> I get 5898 ops/s (3.02 Gb/s) without the patch against 18022 ops/s
> (9.23 Gb/s) with the patch.

I was thinking more along the lines of an additional comparison, 
explicitly using netperf TCP_RR or something like it, not just the 
packets per second from a bulk transfer test.
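
One plausible invocation of the kind meant here, assuming a netperf
built with --enable-burst (the target address is a placeholder):

netperf -H 10.0.0.2 -t TCP_RR -l 60 -- -r 1,1 -D -b 16

Here -r 1,1 requests one-byte requests and responses, -D sets
TCP_NODELAY, and -b keeps 16 transactions in flight, so the test
measures packet-per-second capacity rather than single-transaction
latency alone.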

rick


* Re: [PATCH] mlx4_en: map entire pages to increase throughput
  2012-07-16 19:42     ` Rick Jones
@ 2012-07-16 20:36       ` Or Gerlitz
  2012-07-16 20:43       ` Or Gerlitz
  2012-07-16 20:47       ` Thadeu Lima de Souza Cascardo
  2 siblings, 0 replies; 8+ messages in thread
From: Or Gerlitz @ 2012-07-16 20:36 UTC (permalink / raw)
  To: Rick Jones
  Cc: netdev@vger.kernel.org, leitao@linux.vnet.ibm.com,
	amirv@mellanox.com, yevgenyp@mellanox.co.il,
	klebers@linux.vnet.ibm.com, Thadeu Lima de Souza Cascardo,
	brking@linux.vnet.ibm.com, ogerlitz@mellanox.com,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net,
	anton@samba.org


On Mon, Jul 16, 2012 at 10:42 PM, Rick Jones <rick.jones2@hp.com> wrote:

> I was thinking more along the lines of an additional comparison,
> explicitly using netperf TCP_RR or something like it, not just the packets
> per second from a bulk transfer test.
>

TCP_STREAM would be good to know here as well

Or.



* Re: [PATCH] mlx4_en: map entire pages to increase throughput
  2012-07-16 19:42     ` Rick Jones
  2012-07-16 20:36       ` Or Gerlitz
@ 2012-07-16 20:43       ` Or Gerlitz
  2012-07-16 20:57         ` Thadeu Lima de Souza Cascardo
  2012-07-16 20:47       ` Thadeu Lima de Souza Cascardo
  2 siblings, 1 reply; 8+ messages in thread
From: Or Gerlitz @ 2012-07-16 20:43 UTC (permalink / raw)
  To: Rick Jones, Thadeu Lima de Souza Cascardo
  Cc: netdev@vger.kernel.org, leitao@linux.vnet.ibm.com,
	amirv@mellanox.com, yevgenyp@mellanox.co.il,
	klebers@linux.vnet.ibm.com, anton@samba.org,
	brking@linux.vnet.ibm.com, ogerlitz@mellanox.com,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net

On Mon, Jul 16, 2012 at 10:42 PM, Rick Jones <rick.jones2@hp.com> wrote:

> I was thinking more along the lines of an additional comparison,
> explicitly using netperf TCP_RR or something like it, not just the packets
> per second from a bulk transfer test.


TCP_STREAM from this setup before the patch would be good to know as well


* Re: [PATCH] mlx4_en: map entire pages to increase throughput
  2012-07-16 19:42     ` Rick Jones
  2012-07-16 20:36       ` Or Gerlitz
  2012-07-16 20:43       ` Or Gerlitz
@ 2012-07-16 20:47       ` Thadeu Lima de Souza Cascardo
  2012-07-16 21:08         ` Rick Jones
  2 siblings, 1 reply; 8+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2012-07-16 20:47 UTC (permalink / raw)
  To: Rick Jones
  Cc: netdev@vger.kernel.org, leitao@linux.vnet.ibm.com,
	amirv@mellanox.com, yevgenyp@mellanox.co.il,
	klebers@linux.vnet.ibm.com, anton@samba.org,
	brking@linux.vnet.ibm.com, ogerlitz@mellanox.com,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net

On Mon, Jul 16, 2012 at 12:42:41PM -0700, Rick Jones wrote:
> On 07/16/2012 12:06 PM, Thadeu Lima de Souza Cascardo wrote:
> >On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote:
> >
> >>What is the effect on packet-per-second performance?  (eg aggregate,
> >>burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR)
> >>
> >I used uperf with TCP_NODELAY and 16 threads sending 64000-byte writes
> >from another machine for 60 seconds.
> >
> >I get 5898 ops/s (3.02 Gb/s) without the patch against 18022 ops/s
> >(9.23 Gb/s) with the patch.
> 
> I was thinking more along the lines of an additional comparison,
> explicitly using netperf TCP_RR or something like it, not just the
> packets per second from a bulk transfer test.
> 
> rick

I used a uperf profile that is similar to TCP_RR. It writes, then reads
some bytes. I kept the TCP_NODELAY flag.
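
The exact profile is not shown, but one of roughly that shape, modeled
on the tcp.xml listed at the end of this thread (with the write/read
size varied to produce each row of the tables below), would look like:

<?xml version="1.0"?>
<profile name="TCP_RR_LIKE">
  <group nthreads="16">
        <transaction iterations="1">
            <flowop type="connect" options="remotehost=10.0.0.2 protocol=tcp tcp_nodelay"/>
        </transaction>
        <transaction duration="60">
            <flowop type="write" options="size=1"/>
            <flowop type="read" options="size=1"/>
        </transaction>
        <transaction iterations="1">
            <flowop type="disconnect" />
        </transaction>
  </group>
</profile>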

Without the patch, I saw the following:

packet size	ops/s		Gb/s
1		337024		0.0027
90		276620		0.199
900		190455		1.37
4000		68863		2.20
9000		45638		3.29
60000		9409		4.52

With the patch:

packet size	ops/s		Gb/s
1		451738		0.0036
90		345682		0.248
900		272258		1.96
4000		127055		4.07
9000		106614		7.68
60000		30671		14.72


* Re: [PATCH] mlx4_en: map entire pages to increase throughput
  2012-07-16 20:43       ` Or Gerlitz
@ 2012-07-16 20:57         ` Thadeu Lima de Souza Cascardo
  2012-07-18 14:59           ` Or Gerlitz
  0 siblings, 1 reply; 8+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2012-07-16 20:57 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: netdev@vger.kernel.org, leitao@linux.vnet.ibm.com, Rick Jones,
	amirv@mellanox.com, yevgenyp@mellanox.co.il,
	klebers@linux.vnet.ibm.com, anton@samba.org,
	brking@linux.vnet.ibm.com, ogerlitz@mellanox.com,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net

On Mon, Jul 16, 2012 at 11:43:33PM +0300, Or Gerlitz wrote:
> On Mon, Jul 16, 2012 at 10:42 PM, Rick Jones <rick.jones2@hp.com> wrote:
> 
> > I was thinking more along the lines of an additional comparison,
> > explicitly using netperf TCP_RR or something like it, not just the packets
> > per second from a bulk transfer test.
> 
> 
> TCP_STREAM from this setup before the patch would be good to know as well
> 

Hi, Or.

Does the stream test that I did with uperf using messages of 64000 bytes
fit?

TCP_NODELAY does not make a difference in this case. I get something
around 3Gbps before the patch and something around 9Gbps after the
patch.

Before the patch:

# ./uperf-1.0.3-beta/src/uperf -m tcp.xml
Starting 16 threads running profile:tcp_stream ...   0.00 seconds
Txn1          0 /1.00(s) =            0          16op/s
Txn2    20.81GB /59.26(s) =     3.02Gb/s        5914op/s
Txn3          0 /0.00(s) =            0      128295op/s
-------------------------------------------------------------------------------------------------------------------------------
Total   20.81GB /61.37(s) =     2.91Gb/s        5712op/s

Netstat statistics for this run
-------------------------------------------------------------------------------------------------------------------------------
Nic       opkts/s     ipkts/s     obits/s     ibits/s
eth6       252459       31694   3.06Gb/s  16.74Mb/s
eth0            2          18   3.87Kb/s  14.28Kb/s
-------------------------------------------------------------------------------------------------------------------------------

Run Statistics
Hostname           Time        Data   Throughput   Operations   Errors
-------------------------------------------------------------------------------------------------------------------------------
10.0.0.2         61.47s     20.81GB     2.91Gb/s       350528     0.00
master           61.37s     20.81GB     2.91Gb/s       350528     0.00
-------------------------------------------------------------------------------------------------------------------------------
Difference(%)     -0.16%      0.00%        0.16%        0.00%    0.00%


After the patch:

# ./uperf-1.0.3-beta/src/uperf -m tcp.xml
Starting 16 threads running profile:tcp_stream ...   0.00 seconds
Txn1          0 /1.00(s) =            0          16op/s
Txn2    64.50GB /60.27(s) =     9.19Gb/s       17975op/s
Txn3          0 /0.00(s) =            0
-------------------------------------------------------------------------------------------------------------------------------
Total   64.50GB /62.27(s) =     8.90Gb/s       17397op/s

Netstat statistics for this run
-------------------------------------------------------------------------------------------------------------------------------
Nic       opkts/s     ipkts/s     obits/s     ibits/s
eth6       769428       96018   9.31Gb/s  50.72Mb/s
eth0            1          15   2.48Kb/s  13.59Kb/s
-------------------------------------------------------------------------------------------------------------------------------

Run Statistics
Hostname           Time        Data   Throughput   Operations   Errors
-------------------------------------------------------------------------------------------------------------------------------
10.0.0.2         62.27s     64.36GB     8.88Gb/s      1081096     0.00
master           62.27s     64.50GB     8.90Gb/s      1083325     0.00
-------------------------------------------------------------------------------------------------------------------------------
Difference(%)     -0.00%      0.21%        0.21%        0.21%    0.00%


Profile tcp.xml:

<?xml version="1.0"?>
<profile name="TCP_STREAM">
  <group nthreads="16">
        <transaction iterations="1">
            <flowop type="connect" options="remotehost=10.0.0.2 protocol=tcp tcp_nodelay"/>
        </transaction>
        <transaction duration="60">
            <flowop type="write" options="count=160 size=64000"/>
        </transaction>
        <transaction iterations="1">
            <flowop type="disconnect" />
        </transaction>
  </group>
</profile>
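
For completeness: the master side is started with "uperf -m tcp.xml" as
shown above, and the peer at 10.0.0.2 is assumed to be running uperf in
slave mode ("uperf -s"), which the profile's remotehost option points at.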


* Re: [PATCH] mlx4_en: map entire pages to increase throughput
  2012-07-16 20:47       ` Thadeu Lima de Souza Cascardo
@ 2012-07-16 21:08         ` Rick Jones
  0 siblings, 0 replies; 8+ messages in thread
From: Rick Jones @ 2012-07-16 21:08 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: netdev@vger.kernel.org, leitao@linux.vnet.ibm.com,
	amirv@mellanox.com, yevgenyp@mellanox.co.il,
	klebers@linux.vnet.ibm.com, anton@samba.org,
	brking@linux.vnet.ibm.com, ogerlitz@mellanox.com,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net


> > I was thinking more along the lines of an additional comparison,
> > explicitly using netperf TCP_RR or something like it, not just the
> > packets per second from a bulk transfer test.
> >
> > rick

> I used a uperf profile that is similar to TCP_RR. It writes, then reads
> some bytes. I kept the TCP_NODELAY flag.
>
> Without the patch, I saw the following:
>
> packet size	ops/s		Gb/s
> 1		337024		0.0027
> 90		276620		0.199
> 900		190455		1.37
> 4000		68863		2.20
> 9000		45638		3.29
> 60000		9409		4.52
>
> With the patch:
>
> packet size	ops/s		Gb/s
> 1		451738		0.0036
> 90		345682		0.248
> 900		272258		1.96
> 4000		127055		4.07
> 9000		106614		7.68
> 60000		30671		14.72
>

So, on the surface it looks like it did good things for PPS, though it
would be nice to know what the CPU utilizations/service demands were as
a sanity check - does uperf not have that sort of functionality?

I'm guessing there were several writes at a time - the 1-byte "packet
size" (sic - that is payload, not packet, and without TCP_NODELAY not
necessarily even payload) suggests as much. How many writes does it
have outstanding before it does a read?  And does it take care to build
up to that number of writes to avoid batching during slow start, even
with TCP_NODELAY set?

rick jones


* Re: [PATCH] mlx4_en: map entire pages to increase throughput
  2012-07-16 20:57         ` Thadeu Lima de Souza Cascardo
@ 2012-07-18 14:59           ` Or Gerlitz
  0 siblings, 0 replies; 8+ messages in thread
From: Or Gerlitz @ 2012-07-18 14:59 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo, Yevgeny Petrilin
  Cc: Or Gerlitz, netdev@vger.kernel.org, Rick Jones,
	amirv@mellanox.com, leitao@linux.vnet.ibm.com,
	klebers@linux.vnet.ibm.com, anton@samba.org,
	brking@linux.vnet.ibm.com, linuxppc-dev@lists.ozlabs.org,
	davem@davemloft.net

On 7/16/2012 11:57 PM, Thadeu Lima de Souza Cascardo wrote:
> On Mon, Jul 16, 2012 at 11:43:33PM +0300, Or Gerlitz wrote:
>>
>>
>> TCP_STREAM from this setup before the patch would be good to know as well
>>
>
> Does the stream test that I did with uperf using messages of 64000 bytes fit?

netperf TCP_STREAM is very common, and it would help to better compare
the numbers you get on your systems before/after the patch with runs
done here. As for review of the patch itself and the related discussion,
Yevgeny Petrilin should be looking at your patch; he'll be in by early
next week.

Or.


Thread overview: 8+ messages
     [not found] <1342458113-10384-1-git-send-email-cascardo@linux.vnet.ibm.com>
     [not found] ` <50044F1D.6000703@hp.com>
2012-07-16 19:06   ` [PATCH] mlx4_en: map entire pages to increase throughput Thadeu Lima de Souza Cascardo
2012-07-16 19:42     ` Rick Jones
2012-07-16 20:36       ` Or Gerlitz
2012-07-16 20:43       ` Or Gerlitz
2012-07-16 20:57         ` Thadeu Lima de Souza Cascardo
2012-07-18 14:59           ` Or Gerlitz
2012-07-16 20:47       ` Thadeu Lima de Souza Cascardo
2012-07-16 21:08         ` Rick Jones
