From: jszhang@marvell.com (Jisheng Zhang)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH net-next v3 0/4] net: mvneta: improve rx/tx performance
Date: Tue, 21 Feb 2017 12:37:40 +0800
Message-ID: <20170221123726.2db7db19@xhacker>
In-Reply-To: <877f4laqog.fsf@free-electrons.com>
Hi Gregory,
On Mon, 20 Feb 2017 15:21:35 +0100 Gregory CLEMENT wrote:
> Hi Jisheng,
>
> On Mon., Feb. 20 2017, Jisheng Zhang <jszhang@marvell.com> wrote:
>
> > In hot code paths such as mvneta_rx_swbm(), we access fields of rx_desc
> > and tx_desc. These DMA descriptors are allocated by dma_alloc_coherent(),
> > so they are uncacheable if the device isn't cache coherent, and reading
> > from uncached memory is fairly slow.
> >
> > patch1 reuses the status value that was already read instead of getting
> > the status field of rx_desc again.
> >
> > patch2 avoids getting buf_phys_addr from rx_desc again in
> > mvneta_rx_hwbm() by reusing the phys_addr variable.
> >
> > patch3 avoids reading from tx_desc as much as possible by storing what
> > we need in local variables.
> >
> > We get the following performance data on Marvell BG4CT Platforms
> > (tested with iperf):
> >
> > before the patch:
> > sending 1GB in mvneta_tx()(disabled TSO) costs 793553760ns
> >
> > after the patch:
> > sending 1GB in mvneta_tx()(disabled TSO) costs 719953800ns
> >
> > we saved 9.2% time.
> >
> > patch4 uses cacheable memory to store the rx buffer DMA address.
> >
> > We get the following performance data on Marvell BG4CT Platforms
> > (tested with iperf):
> >
> > before the patch:
> > receiving 1GB in mvneta_rx_swbm() costs 1492659600 ns
> >
> > after the patch:
> > receiving 1GB in mvneta_rx_swbm() costs 1421565640 ns
>
> Could you explain how you got this number?
Thanks for your review.
The measurement is simple: record how much time we spend in mvneta_rx_swbm()
while receiving 1GB of data, something like the following:

mvneta_rx_swbm()
{
	static u64 total_time;	/* ns spent in this function so far */
	static u64 count;	/* bytes received so far */
	u64 t1, t2;

	t1 = sched_clock();
	...
	if (rcvd_pkts) {
		...
		t2 = sched_clock() - t1;
		total_time += t2;
		count += rcvd_bytes;
		if (count >= 0x40000000) {	/* 1GB received: report, reset */
			printk("!!!! %llu %llu\n", total_time, count);
			total_time = 0;
			count = 0;
		}
		...
	}
}
>
> receiving 1GB in 1.42 second means having a bandwidth of
> 8/1.42=5.63 Gb/s, that means that you are using at least a 10Gb
> interface.
Hmm, we only measured the time spent inside mvneta_rx_swbm(), not the
wall-clock time of the transfer, so we can't compute the bandwidth as 8/1.42.
What do you think?
>
> When I used iperf I didn't have this kind of granularity:
> iperf -c 192.168.10.1 -n 1024M
> ------------------------------------------------------------
> Client connecting to 192.168.10.19, TCP port 5001
> TCP window size: 43.8 KByte (default)
> ------------------------------------------------------------
> [ 3] local 192.168.10.28 port 53086 connected with 192.168.10.1 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0- 9.1 sec 1.00 GBytes 942 Mbits/sec
>
> Also without HWBM enabled (so with the same configuration as your test),
> I didn't notice any improvement with the patch set applied. But at
From a bandwidth point of view, yes, there's no improvement. But from a CPU
time/load point of view, I do see a small improvement. Could you also
do a simple test on your side to see whether you get similar improvement
data?
Thanks,
Jisheng
> least I didn't see any regression with or without HWBM.
>
> Gregory
>
> >
> > We saved 4.76% time.
> >
> > Basically, patch1 and patch4 do what Arnd mentioned in [1].
> >
> > Hi Arnd,
> >
> > I added a "Suggested-by" tag for you, I hope you don't mind ;)
> >
> > Thanks
> >
> > [1] https://www.spinics.net/lists/netdev/msg405889.html
> >
> > Since v2:
> > - add Gregory's ack to patch1
> > - only get rx buffer DMA address from cacheable memory for mvneta_rx_swbm()
> > - add patch 2 to read rx_desc->buf_phys_addr once in mvneta_rx_hwbm()
> > - add patch 3 to avoid reading from tx_desc as much as possible
> >
> > Since v1:
> > - correct the performance data typo
> >
> >
> > Jisheng Zhang (4):
> > net: mvneta: avoid getting status from rx_desc as much as possible
> > net: mvneta: avoid getting buf_phys_addr from rx_desc again
> > net: mvneta: avoid reading from tx_desc as much as possible
> > net: mvneta: Use cacheable memory to store the rx buffer DMA address
> >
> > drivers/net/ethernet/marvell/mvneta.c | 80 +++++++++++++++++++----------------
> > 1 file changed, 43 insertions(+), 37 deletions(-)
> >
> > --
> > 2.11.0
> >
>