* [RFC] myri10ge: small rx_done refactoring
@ 2011-03-23 12:52 Stanislaw Gruszka
2011-03-23 15:21 ` Andrew Gallatin
2011-03-23 15:33 ` Stephen Hemminger
0 siblings, 2 replies; 8+ messages in thread
From: Stanislaw Gruszka @ 2011-03-23 12:52 UTC (permalink / raw)
To: netdev; +Cc: Andrew Gallatin, Brice Goglin
myri10ge: small rx_done refactoring
Add lro_enable variable to read NETIF_F_LRO flag only once per napi poll
call. This should fix theoretical race condition with
myri10ge_set_rx_csum() and myri10ge_set_flags() where flag NETIF_F_LRO
can be changed.
On the way reduce myri10ge_rx_done() number of arguments and calls by
moving mgp->small_bytes check into that function. That reduce code size
from:
text data bss dec hex filename
36644 248 100 36992 9080 drivers/net/myri10ge/myri10ge.o
to:
text data bss dec hex filename
36037 247 100 36384 8e20 drivers/net/myri10ge/myri10ge.o
on my i686 system, what should also make myri10ge_clean_rx_done()
being faster.
diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index 24386a8..2e71240 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -1312,17 +1312,26 @@ myri10ge_unmap_rx_page(struct pci_dev *pdev,
* page into an skb */
static inline int
-myri10ge_rx_done(struct myri10ge_slice_state *ss, struct myri10ge_rx_buf *rx,
- int bytes, int len, __wsum csum)
+myri10ge_rx_done(struct myri10ge_slice_state *ss, int len, __wsum csum,
+ bool lro_enabled)
{
struct myri10ge_priv *mgp = ss->mgp;
struct sk_buff *skb;
struct skb_frag_struct rx_frags[MYRI10GE_MAX_FRAGS_PER_FRAME];
- int i, idx, hlen, remainder;
+ struct myri10ge_rx_buf *rx;
+ int i, idx, hlen, remainder, bytes;
struct pci_dev *pdev = mgp->pdev;
struct net_device *dev = mgp->dev;
u8 *va;
+ if (len <= mgp->small_bytes) {
+ rx = &ss->rx_small;
+ bytes = mgp->small_bytes;
+ } else {
+ rx = &ss->rx_big,
+ bytes = mgp->big_bytes;
+ }
+
len += MXGEFW_PAD;
idx = rx->cnt & rx->mask;
va = page_address(rx->info[idx].page) + rx->info[idx].page_offset;
@@ -1341,7 +1350,7 @@ myri10ge_rx_done(struct myri10ge_slice_state *ss, struct myri10ge_rx_buf *rx,
remainder -= MYRI10GE_ALLOC_SIZE;
}
- if (dev->features & NETIF_F_LRO) {
+ if (lro_enabled) {
rx_frags[0].page_offset += MXGEFW_PAD;
rx_frags[0].size -= MXGEFW_PAD;
len -= MXGEFW_PAD;
@@ -1464,6 +1473,7 @@ myri10ge_clean_rx_done(struct myri10ge_slice_state *ss, int budget)
struct myri10ge_rx_done *rx_done = &ss->rx_done;
struct myri10ge_priv *mgp = ss->mgp;
struct net_device *netdev = mgp->dev;
+ bool lro_enabled = (netdev->features & NETIF_F_LRO) ? true : false;
unsigned long rx_bytes = 0;
unsigned long rx_packets = 0;
unsigned long rx_ok;
@@ -1478,14 +1488,7 @@ myri10ge_clean_rx_done(struct myri10ge_slice_state *ss, int budget)
length = ntohs(rx_done->entry[idx].length);
rx_done->entry[idx].length = 0;
checksum = csum_unfold(rx_done->entry[idx].checksum);
- if (length <= mgp->small_bytes)
- rx_ok = myri10ge_rx_done(ss, &ss->rx_small,
- mgp->small_bytes,
- length, checksum);
- else
- rx_ok = myri10ge_rx_done(ss, &ss->rx_big,
- mgp->big_bytes,
- length, checksum);
+ rx_ok = myri10ge_rx_done(ss, length, checksum, lro_enabled);
rx_packets += rx_ok;
rx_bytes += rx_ok * (unsigned long)length;
cnt++;
@@ -1497,7 +1500,7 @@ myri10ge_clean_rx_done(struct myri10ge_slice_state *ss, int budget)
ss->stats.rx_packets += rx_packets;
ss->stats.rx_bytes += rx_bytes;
- if (netdev->features & NETIF_F_LRO)
+ if (lro_enabled)
lro_flush_all(&rx_done->lro_mgr);
/* restock receive rings if needed */
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [RFC] myri10ge: small rx_done refactoring
2011-03-23 12:52 [RFC] myri10ge: small rx_done refactoring Stanislaw Gruszka
@ 2011-03-23 15:21 ` Andrew Gallatin
2011-03-23 15:33 ` Stephen Hemminger
1 sibling, 0 replies; 8+ messages in thread
From: Andrew Gallatin @ 2011-03-23 15:21 UTC (permalink / raw)
To: Stanislaw Gruszka; +Cc: netdev, Brice Goglin
On 03/23/11 08:52, Stanislaw Gruszka wrote:
> on my i686 system, what should also make myri10ge_clean_rx_done()
> being faster.
I tested this on my very old, very weak dual-core athlon64 systems.
These machines can barely achieve 10Gb/s using a 1500b MTU with LRO.
Running 35 60 second netperf tests into the machines with the stock
driver, and again with this patch applied, I see a tiny bandwidth
increase (1.4Mb/s on average) which is statistically significant
( p < 0.001). There is no statistically significant CPU load
reduction.
> + if (len<= mgp->small_bytes) {
> + rx =&ss->rx_small;
> + bytes = mgp->small_bytes;
> + } else {
> + rx =&ss->rx_big,
Small nit: the "," above should be a ";"
Between the small bandwidth increase, and the code size reduction,
I'm very appreciative of this patch.
Thank you,
Drew
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] myri10ge: small rx_done refactoring
2011-03-23 12:52 [RFC] myri10ge: small rx_done refactoring Stanislaw Gruszka
2011-03-23 15:21 ` Andrew Gallatin
@ 2011-03-23 15:33 ` Stephen Hemminger
2011-03-24 8:16 ` Stanislaw Gruszka
1 sibling, 1 reply; 8+ messages in thread
From: Stephen Hemminger @ 2011-03-23 15:33 UTC (permalink / raw)
To: Stanislaw Gruszka; +Cc: netdev, Andrew Gallatin, Brice Goglin
On Wed, 23 Mar 2011 13:52:04 +0100
Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> Add lro_enable variable to read NETIF_F_LRO flag only once per napi poll
> call. This should fix theoretical race condition with
> myri10ge_set_rx_csum() and myri10ge_set_flags() where flag NETIF_F_LRO
> can be changed.
You may need a barrier or the race may still be there.
The driver seems to use mb() where wmb() is intended, and never use rmb()?
--
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] myri10ge: small rx_done refactoring
2011-03-23 15:33 ` Stephen Hemminger
@ 2011-03-24 8:16 ` Stanislaw Gruszka
2011-03-24 15:15 ` Stephen Hemminger
0 siblings, 1 reply; 8+ messages in thread
From: Stanislaw Gruszka @ 2011-03-24 8:16 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, Andrew Gallatin, Brice Goglin
On Wed, Mar 23, 2011 at 08:33:57AM -0700, Stephen Hemminger wrote:
> On Wed, 23 Mar 2011 13:52:04 +0100
> Stanislaw Gruszka <sgruszka@redhat.com> wrote:
>
> > Add lro_enable variable to read NETIF_F_LRO flag only once per napi poll
> > call. This should fix theoretical race condition with
> > myri10ge_set_rx_csum() and myri10ge_set_flags() where flag NETIF_F_LRO
> > can be changed.
>
> You may need a barrier or the race may still be there.
I don't understand why barrier in that case is need.
What I tried to avoid is.
myri10ge_clean_rx_done():
if (dev->features & NETIF_F_LRO)
setup lro
myri10ge_set_flags()
if (dev->features & NETIF_F_LRO)
flush lro
Now we read dev->features & NETIF_F_LRO only once to local
lro_enabled variable. So we can not flush without setup
or setup without flush. No idea why memory barries is still
needed.
> The driver seems to use mb() where wmb() is intended, and never use rmb()?
Yes, I think we can have some optimalization here.
Stanislaw
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] myri10ge: small rx_done refactoring
2011-03-24 8:16 ` Stanislaw Gruszka
@ 2011-03-24 15:15 ` Stephen Hemminger
2011-03-24 15:59 ` Stanislaw Gruszka
2011-03-24 17:29 ` David Howells
0 siblings, 2 replies; 8+ messages in thread
From: Stephen Hemminger @ 2011-03-24 15:15 UTC (permalink / raw)
To: Stanislaw Gruszka; +Cc: netdev, Andrew Gallatin, Brice Goglin
On Thu, 24 Mar 2011 09:16:21 +0100
Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> On Wed, Mar 23, 2011 at 08:33:57AM -0700, Stephen Hemminger wrote:
> > On Wed, 23 Mar 2011 13:52:04 +0100
> > Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> >
> > > Add lro_enable variable to read NETIF_F_LRO flag only once per napi poll
> > > call. This should fix theoretical race condition with
> > > myri10ge_set_rx_csum() and myri10ge_set_flags() where flag NETIF_F_LRO
> > > can be changed.
> >
> > You may need a barrier or the race may still be there.
>
> I don't understand why barrier in that case is need.
>
> What I tried to avoid is.
>
> myri10ge_clean_rx_done():
>
> if (dev->features & NETIF_F_LRO)
> setup lro
> myri10ge_set_flags()
>
> if (dev->features & NETIF_F_LRO)
> flush lro
>
> Now we read dev->features & NETIF_F_LRO only once to local
> lro_enabled variable. So we can not flush without setup
> or setup without flush. No idea why memory barries is still
> needed.
>
> > The driver seems to use mb() where wmb() is intended, and never use rmb()?
>
> Yes, I think we can have some optimalization here.
>
Without barrier there is no guarantee that compiler read the flags
into a local variable. It is free to do the same thing as the original
code.
--
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] myri10ge: small rx_done refactoring
2011-03-24 15:15 ` Stephen Hemminger
@ 2011-03-24 15:59 ` Stanislaw Gruszka
2011-03-24 16:42 ` Ben Hutchings
2011-03-24 17:29 ` David Howells
1 sibling, 1 reply; 8+ messages in thread
From: Stanislaw Gruszka @ 2011-03-24 15:59 UTC (permalink / raw)
To: Stephen Hemminger, David Howells; +Cc: netdev, Andrew Gallatin, Brice Goglin
On Thu, Mar 24, 2011 at 08:15:39AM -0700, Stephen Hemminger wrote:
> > On Wed, Mar 23, 2011 at 08:33:57AM -0700, Stephen Hemminger wrote:
> > > On Wed, 23 Mar 2011 13:52:04 +0100
> > > Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> > >
> > > > Add lro_enable variable to read NETIF_F_LRO flag only once per napi poll
> > > > call. This should fix theoretical race condition with
> > > > myri10ge_set_rx_csum() and myri10ge_set_flags() where flag NETIF_F_LRO
> > > > can be changed.
> > >
> > > You may need a barrier or the race may still be there.
> >
> > I don't understand why barrier in that case is need.
> >
> > What I tried to avoid is.
> >
> > myri10ge_clean_rx_done():
> >
> > if (dev->features & NETIF_F_LRO)
> > setup lro
> > myri10ge_set_flags()
> >
> > if (dev->features & NETIF_F_LRO)
> > flush lro
> >
> > Now we read dev->features & NETIF_F_LRO only once to local
> > lro_enabled variable. So we can not flush without setup
> > or setup without flush. No idea why memory barries is still
> > needed.
> >
> > > The driver seems to use mb() where wmb() is intended, and never use rmb()?
> >
> > Yes, I think we can have some optimalization here.
> >
>
> Without barrier there is no guarantee that compiler read the flags
> into a local variable. It is free to do the same thing as the original
> code.
Ok, so C code like:
code1
if (dev->features & NETIF_F_LRO)
branch1
code2;
if (dev->features & NETIF_F_LRO)
branch2
and
bool lro_enabled = dev->features & NETIF_F_LRO;
code1
if (lro_enabled)
branch1
code2
if (lro_enabled)
branch2
can give the same assembly output.
It's really hard for me to understand that. I could
understand, if we would get global variable directly
like:
bool lro_enabled = dev->lro_enabled;
instead of:
bool lro_enabled = dev->features & NETIF_F_LRO;
David, can you confirm that Staphen is correct?
Also where this barrier() should go. Before
"bool lro_enabled = dev->features & NETIF_F_LRO;"
or after?
Stanislaw
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] myri10ge: small rx_done refactoring
2011-03-24 15:59 ` Stanislaw Gruszka
@ 2011-03-24 16:42 ` Ben Hutchings
0 siblings, 0 replies; 8+ messages in thread
From: Ben Hutchings @ 2011-03-24 16:42 UTC (permalink / raw)
To: Stanislaw Gruszka
Cc: Stephen Hemminger, David Howells, netdev, Andrew Gallatin,
Brice Goglin
On Thu, 2011-03-24 at 16:59 +0100, Stanislaw Gruszka wrote:
> On Thu, Mar 24, 2011 at 08:15:39AM -0700, Stephen Hemminger wrote:
> > > On Wed, Mar 23, 2011 at 08:33:57AM -0700, Stephen Hemminger wrote:
> > > > On Wed, 23 Mar 2011 13:52:04 +0100
> > > > Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> > > >
> > > > > Add lro_enable variable to read NETIF_F_LRO flag only once per napi poll
> > > > > call. This should fix theoretical race condition with
> > > > > myri10ge_set_rx_csum() and myri10ge_set_flags() where flag NETIF_F_LRO
> > > > > can be changed.
> > > >
> > > > You may need a barrier or the race may still be there.
> > >
> > > I don't understand why barrier in that case is need.
> > >
> > > What I tried to avoid is.
> > >
> > > myri10ge_clean_rx_done():
> > >
> > > if (dev->features & NETIF_F_LRO)
> > > setup lro
> > > myri10ge_set_flags()
> > >
> > > if (dev->features & NETIF_F_LRO)
> > > flush lro
> > >
> > > Now we read dev->features & NETIF_F_LRO only once to local
> > > lro_enabled variable. So we can not flush without setup
> > > or setup without flush. No idea why memory barries is still
> > > needed.
> > >
> > > > The driver seems to use mb() where wmb() is intended, and never use rmb()?
> > >
> > > Yes, I think we can have some optimalization here.
> > >
> >
> > Without barrier there is no guarantee that compiler read the flags
> > into a local variable. It is free to do the same thing as the original
> > code.
>
> Ok, so C code like:
>
> code1
> if (dev->features & NETIF_F_LRO)
> branch1
> code2;
> if (dev->features & NETIF_F_LRO)
> branch2
>
> and
>
> bool lro_enabled = dev->features & NETIF_F_LRO;
> code1
> if (lro_enabled)
> branch1
> code2
> if (lro_enabled)
> branch2
>
> can give the same assembly output.
[...]
Yes. A C compiler is allowed to assume that data are not shared between
multiple threads, and apply any transformations that would not affect
the behaviour of a single-threaded program.
We can make use of some gcc extensions (wrapped up in macros like
barrier()) to inhibit some such transformations. We also assume that
access to an int, long or pointer variable can be atomic. The
ACCESS_ONCE() macro adds volatile-qualification to such memory access,
which inhibits duplication of the access.
So, if you only care that this function has a consistent value for
lro_enabled, you can read dev->features with ACCESS_ONCE():
bool lro_enabled = ACCESS_ONCE(dev->features) & NETIF_F_LRO;
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] myri10ge: small rx_done refactoring
2011-03-24 15:15 ` Stephen Hemminger
2011-03-24 15:59 ` Stanislaw Gruszka
@ 2011-03-24 17:29 ` David Howells
1 sibling, 0 replies; 8+ messages in thread
From: David Howells @ 2011-03-24 17:29 UTC (permalink / raw)
To: Stanislaw Gruszka
Cc: dhowells, Stephen Hemminger, netdev, Andrew Gallatin,
Brice Goglin
Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> David, can you confirm that Staphen is correct?
Stephen is correct. The compiler is perfectly at liberty to merge the two
loads if the value being read is not marked volatile.
If you stick a barrier() in there between the reads, then I think the compiler
will be required to emit two load instructions. However, the _CPU_ is then
entitled to merge them.
If you don't want the CPU to merge them, you have to use smp_rmb() or smp_mb()
between.
However, the compiler is also allowed to re-read the variable (ie. emit two
load instructions) if it would otherwise have to save the value on the stack
to free up a register, unless the pointed to value is marked volatile.
If you want to ensure that the value is read once only, then you need to use
ACCESS_ONCE() or stick a read/full barrier of some degree after the read.
Note the use of a barrier implies a partial ordering between two memory
accesses in the same instruction stream. If you can't say which two memory
memory accesses you want to order, you probably shouldn't be using a barrier.
David
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2011-03-24 17:29 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-23 12:52 [RFC] myri10ge: small rx_done refactoring Stanislaw Gruszka
2011-03-23 15:21 ` Andrew Gallatin
2011-03-23 15:33 ` Stephen Hemminger
2011-03-24 8:16 ` Stanislaw Gruszka
2011-03-24 15:15 ` Stephen Hemminger
2011-03-24 15:59 ` Stanislaw Gruszka
2011-03-24 16:42 ` Ben Hutchings
2011-03-24 17:29 ` David Howells
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).