Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] cxgb4: Address various sparse warnings
From: David Miller @ 2012-10-08 18:48 UTC (permalink / raw)
  To: vipul; +Cc: netdev, divy, dm, leedom, felix, jay
In-Reply-To: <1349701183-25093-1-git-send-email-vipul@chelsio.com>

From: Vipul Pandya <vipul@chelsio.com>
Date: Mon,  8 Oct 2012 18:29:43 +0530

> This patch fixes type assignment issues, function definition and symbol
> shadowing which triggered sparse warnings.
> 
> Signed-off-by: Jay Hernandez <jay@chelsio.com>
> Signed-off-by: Vipul Pandya <vipul@chelsio.com>

Applied.

^ permalink raw reply

* Re: [PATCH] pch_gbe: Fix build error by selecting all the possible dependencies.
From: David Miller @ 2012-10-08 18:48 UTC (permalink / raw)
  To: haicheng.lee; +Cc: haicheng.li, linux-kernel, netdev, fengguang.wu
In-Reply-To: <1349695107-19524-1-git-send-email-haicheng.lee@gmail.com>


This didn't make it to any mailing list, so nobody can review it.

^ permalink raw reply

* Re: [PATCH 0/3] ipv4: pmtu fixes
From: David Miller @ 2012-10-08 18:47 UTC (permalink / raw)
  To: steffen.klassert; +Cc: netdev
In-Reply-To: <20121008084642.GB15622@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Mon, 8 Oct 2012 10:46:42 +0200

> This patchset fixes some issues that came with the routing cache removal.
> 
> 1) IPsec and others (udp, ipvs) may cache output routes, these routes
> need to be invalidated on pmtu events in the same way e.g. tcp socket
> cached routes are invalidated. With this we always invalidate or update
> (if we already use a nh exeption route) the old route on pmtu events. 
> 
> This has the drawback that we may needlessly invalidate an uncached route,
> but this fixes all the users that cache routes and pmtu events are rare, so
> this should not be a real issue.
> 
> 2) We create nh exeptions if a user (e.g. tracepath) tries to do pmtu
> dicsovery with packets bigger than the output device mtu. The device mtu
> is not learned and does not expire, so don't create an exeption route.
> 
> 3) We report cached pmtu values to userspace even if they are expired.
> Fix this by checking for expiration before we report.

All applied and queued up for -stable, thanks!

^ permalink raw reply

* Re: [GIT PULL nf-next] IPVS for 3.7 #2
From: Jesper Dangaard Brouer @ 2012-10-08 18:44 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: David Miller, Simon Horman, lvs-devel, netdev, netfilter-devel,
	Wensong Zhang, Julian Anastasov, Hans Schillstrom,
	Hans Schillstrom
In-Reply-To: <1348800904-23902-1-git-send-email-horms@verge.net.au>

Hey Pablo,

These changes were intended for 3.7, but just checked you git tree...
and it looks like you didn't pull in Simon's changes, and thus they have
not hit DaveM's tree for the merge window :-(

--Jesper


On Fri, 2012-09-28 at 11:54 +0900, Simon Horman wrote:
> Hi Pablo,
> 
> please consider the following enhancements to IPVS for inclusion in 3.7.
> 
> ----------------------------------------------------------------
> The following changes since commit 82c93fcc2e1737fede2752520f1bf8f4de6304d8:
> 
>   x86: bpf_jit_comp: add XOR instruction for BPF JIT (2012-09-24 16:54:35 -0400)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git master
> 
> for you to fetch changes up to 92eec78d25aee6bbc9bd295f51c022ddfa80cdd9:
> 
>   ipvs: SIP fragment handling (2012-09-28 11:37:16 +0900)
> 
> ----------------------------------------------------------------
> Jesper Dangaard Brouer (7):
>       ipvs: Trivial changes, use compressed IPv6 address in output
>       ipvs: IPv6 extend ICMPv6 handling for future types
>       ipvs: Use config macro IS_ENABLED()
>       ipvs: Fix faulty IPv6 extension header handling in IPVS
>       ipvs: Complete IPv6 fragment handling for IPVS
>       ipvs: API change to avoid rescan of IPv6 exthdr
>       ipvs: SIP fragment handling
> 
>  include/net/ip_vs.h                     |  194 +++++++++++----
>  net/netfilter/ipvs/Kconfig              |    7 +-
>  net/netfilter/ipvs/ip_vs_conn.c         |   15 +-
>  net/netfilter/ipvs/ip_vs_core.c         |  404 +++++++++++++++++--------------
>  net/netfilter/ipvs/ip_vs_dh.c           |    2 +-
>  net/netfilter/ipvs/ip_vs_lblc.c         |    2 +-
>  net/netfilter/ipvs/ip_vs_lblcr.c        |    2 +-
>  net/netfilter/ipvs/ip_vs_pe_sip.c       |   18 +-
>  net/netfilter/ipvs/ip_vs_proto.c        |    6 +-
>  net/netfilter/ipvs/ip_vs_proto_ah_esp.c |    9 +-
>  net/netfilter/ipvs/ip_vs_proto_sctp.c   |   42 ++--
>  net/netfilter/ipvs/ip_vs_proto_tcp.c    |   40 ++-
>  net/netfilter/ipvs/ip_vs_proto_udp.c    |   41 ++--
>  net/netfilter/ipvs/ip_vs_sched.c        |    2 +-
>  net/netfilter/ipvs/ip_vs_sh.c           |    2 +-
>  net/netfilter/ipvs/ip_vs_xmit.c         |   73 +++---
>  net/netfilter/xt_ipvs.c                 |    4 +-
>  17 files changed, 501 insertions(+), 362 deletions(-)



^ permalink raw reply

* Re: [PATCH v2] vlan: don't deliver frames for unknown vlans to protocols
From: David Miller @ 2012-10-08 18:42 UTC (permalink / raw)
  To: florz; +Cc: kaber, eric.dumazet, netdev, linux-kernel, jpirko
In-Reply-To: <20121008015158.GE25895@florz.florz.dyndns.org>

From: Florian Zumbiehl <florz@florz.de>
Date: Mon, 8 Oct 2012 03:51:58 +0200

> This version completely avoids any new state that could need to be spilled
> to RAM, and instead re-checks existence and non-zeroness of the tag. What
> do you think?

At a high level it looks fine and doesn't have the problems mentioned
earlier.

But I wonder if it breaks things, since you do the assignment so late
we no longer handle the case where the VLAN device's MAC address
matches the packet MAC address and the top-level device's does not.

That's handled by logic in vlan_do_receive() which checks for
PACKET_OTHERHOST.

But you're going to unconditionally set PACKET_OTHERHOST, overriding
any decision that code makes.

This turns out to be a really non-trivial area and it's going to take
some time to get this right and audit the change appropriately.

^ permalink raw reply

* [PATCH v2] skge: Add DMA mask quirk for Marvell 88E8001 on ASUS P5NSLI motherboard
From: Jan Ceuleers @ 2012-10-08 18:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Graham Gower, Stephen Hemminger
In-Reply-To: <1349715275-12730-1-git-send-email-jan.ceuleers@computer.org>

From: Graham Gower <graham.gower@gmail.com>

Marvell 88E8001 on an ASUS P5NSLI motherboard is unable to send/receive
packets on a system with >4gb ram unless a 32bit DMA mask is used.

This issue has been around for years and a fix was sent 3.5 years ago, but
there was some debate as to whether it should instead be fixed as a PCI quirk.
http://www.spinics.net/lists/netdev/msg88670.html

However, 18 months later a similar workaround was introduced for another
chipset exhibiting the same problem.
http://www.spinics.net/lists/netdev/msg142287.html

Signed-off-by: Graham Gower <graham.gower@gmail.com>
Signed-off-by: Jan Ceuleers <jan.ceuleers@computer.org>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
---
I am resubmitting Graham's v4 patch hoping that doing so will avoid further
frustration while still getting this fix into the tree.
I have not tested this; I don't have the hardware. It compiles though.

v2: Fixed whitespace error

 drivers/net/ethernet/marvell/skge.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/marvell/skge.c b/drivers/net/ethernet/marvell/skge.c
index 5a30bf8..05b3c83 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -4153,6 +4153,13 @@ static struct dmi_system_id skge_32bit_dma_boards[] = {
 			DMI_MATCH(DMI_BOARD_NAME, "nForce"),
 		},
 	},
+	{
+		.ident = "ASUS P5NSLI",
+		.matches = {
+			DMI_MATCH(DMI_BOARD_VENDOR, "ASUSTeK Computer INC."),
+			DMI_MATCH(DMI_BOARD_NAME, "P5NSLI")
+		},
+	},
 	{}
 };
 
-- 
1.7.9.5

^ permalink raw reply related

* Re: [RFC] GRO scalability
From: Eric Dumazet @ 2012-10-08 18:28 UTC (permalink / raw)
  To: Rick Jones; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross
In-Reply-To: <5073199F.6040408@hp.com>

On Mon, 2012-10-08 at 11:21 -0700, Rick Jones wrote:

> Did I then mis-interpret:
> 
> > 2) Use a LRU list to eventually be able to 'flush' too old packets,
> >    even if the napi never completes. Each time we process a new packet,
> >    being a GRO candidate or not, we increment a napi->sequence, and we
> >    flush the oldest packet in gro_lru_list if its own sequence is too
> >    old.
> 
> in your initial RFC email?  Because I took that as flushing a given 
> packet if N packets have come through since that packet was queued to 
> await coalescing.

Yes, this was refined in the patch I sent.

Currently using jiffies, but my plan is trying to use get_cycles() or
ktime_get(), after some experiments, allowing cycles/ns resolution.

(One ktime_get() per napi batch seems OK)

^ permalink raw reply

* Re: [RFC] GRO scalability
From: Rick Jones @ 2012-10-08 18:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross
In-Reply-To: <1349718955.21172.3534.camel@edumazet-glaptop>

On 10/08/2012 10:55 AM, Eric Dumazet wrote:
> On Mon, 2012-10-08 at 10:49 -0700, Rick Jones wrote:
>
>> So, with my term shuffle better defined, let's circle back to your
>> proposal to try to GRO-service a very much larger group of flows, with a
>> "flush queued packets older than N packets" heuristic as part of the
>> latency minimization.  If N were 2 there - half the number of flows, the
>> "perfect" shuffle" doesn't get aggregated at all right? N would have to
>> be 4 or the number of concurrent flows.   What I'm trying to get at is
>> just to how many concurrent flows you are trying to get GRO to scale,
>> and whether at that level you have asymptotically approached having a
>> hash/retained state that is, basically, a duplicate of what is happening
>> in TCP.
>>
>
> I didnt said "flush queued packets older than N packets" but instead
> suggested to use a time limit, eventually a sysctl.

Did I then mis-interpret:

> 2) Use a LRU list to eventually be able to 'flush' too old packets,
>    even if the napi never completes. Each time we process a new packet,
>    being a GRO candidate or not, we increment a napi->sequence, and we
>    flush the oldest packet in gro_lru_list if its own sequence is too
>    old.

in your initial RFC email?  Because I took that as flushing a given 
packet if N packets have come through since that packet was queued to 
await coalescing.

rick

^ permalink raw reply

* Re: [RFC:PATCH 3.6.0-] net/ipconfig: Extend ipconfig retries to device open
From: David Miller @ 2012-10-08 18:14 UTC (permalink / raw)
  To: srinivas.kandagatla; +Cc: netdev
In-Reply-To: <5072C961.3030807@st.com>

From: Srinivas KANDAGATLA <srinivas.kandagatla@st.com>
Date: Mon, 08 Oct 2012 13:38:57 +0100

> On 07/10/12 05:11, David Miller wrote:
>> From: Srinivas KANDAGATLA <srinivas.kandagatla@st.com>
>> Date: Thu,  4 Oct 2012 16:38:43 +0100
>>
>>> This patch adds retries to ipconfig at device open, the reason to do
>>> this is: Lets say If some mdio bus driver decide to use defered probe
>>> when it does not find any phys on the bus. The same mdio-bus driver is
>>> re-probed as part of lateinit calls. However ipconfig also fits into
>>> lateinit calls, so if ipconfig is called before the re-probe of mdio-bus
>>> driver, the mac driver will fail to find a valid PHY on the mdio-bus.
>> Real device drivers for real devices should not probe using late
>> initcalls.
> I agree,
> Let me summarize what I did try.
> 
> I wanted to use "Defered Probe Feature" which went in 3.4 kernel to
> solve a sequencing issue.
> So modified Mdio-driver accordingly, mdio-driver decided to defer its
> probe for the first-time when It could not detect any PHY's on the BUS.
> (second time) device probe is actually called in lateinit call sequence
> by "Defered Probe Code".

That's not how we probe "devices on a bus" which is the model you
should be following here to fix this bug.

Just delaying is going to run into corner cases where things do not
work.  You need cooperation between bus provider and bus consumer,
full stop.

^ permalink raw reply

* Re: [RFC] GRO scalability
From: Eric Dumazet @ 2012-10-08 17:56 UTC (permalink / raw)
  To: Rick Jones; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross
In-Reply-To: <1349718955.21172.3534.camel@edumazet-glaptop>

On Mon, 2012-10-08 at 19:55 +0200, Eric Dumazet wrote:

> If your hardware is capable of receiving one packet per us, this would
> allow 50 packets in GRO queue.

Sorry, I meant : 10 us per packet.

^ permalink raw reply

* Re: [RFC] GRO scalability
From: Eric Dumazet @ 2012-10-08 17:55 UTC (permalink / raw)
  To: Rick Jones; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross
In-Reply-To: <5073121B.2070300@hp.com>

On Mon, 2012-10-08 at 10:49 -0700, Rick Jones wrote:

> So, with my term shuffle better defined, let's circle back to your 
> proposal to try to GRO-service a very much larger group of flows, with a 
> "flush queued packets older than N packets" heuristic as part of the 
> latency minimization.  If N were 2 there - half the number of flows, the 
> "perfect" shuffle" doesn't get aggregated at all right? N would have to 
> be 4 or the number of concurrent flows.   What I'm trying to get at is 
> just to how many concurrent flows you are trying to get GRO to scale, 
> and whether at that level you have asymptotically approached having a 
> hash/retained state that is, basically, a duplicate of what is happening 
> in TCP.
> 

I didnt said "flush queued packets older than N packets" but instead
suggested to use a time limit, eventually a sysctl.

"flush packet if the oldest part of it is aged by xxx us"

Say the default would be 500 us

If your hardware is capable of receiving 2 packets per us, this would
allow 1000 packets in GRO queue. So using a hash table with 128 slots,
and keeping the current limit of 8 entries per bucket would allow this
kind of workload.

If your hardware is capable of receiving one packet per us, this would
allow 50 packets in GRO queue.

^ permalink raw reply

* Re: [RFC] GRO scalability
From: Rick Jones @ 2012-10-08 17:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross
In-Reply-To: <1349715561.21172.3463.camel@edumazet-glaptop>

On 10/08/2012 09:59 AM, Eric Dumazet wrote:
> On Mon, 2012-10-08 at 09:40 -0700, Rick Jones wrote:
>> Whe I say shuffle I mean something along the lines of interleave.  So,
>> if we have four flows, 1-4, a perfect shuffle of their segments would be
>> something like:
>>
>> 1 2 3 4 1 2 3 4 1 2 3 4
>>
>> but not well shuffled might look like
>>
>> 1 1 3 2 3 2 4 4 4 1 3 2
>>
>
> If all these packets are delivered in the same NAPI run, and correctly
> aggregated, their order doesnt matter.
>
> In first case, we will deliver  B1, B2, B3, B4   (B being a GRO packet
> with 3 MSS)
>
> In second case we will deliver
>
> B1 B3 B2 B4

So, with my term shuffle better defined, let's circle back to your 
proposal to try to GRO-service a very much larger group of flows, with a 
"flush queued packets older than N packets" heuristic as part of the 
latency minimization.  If N were 2 there - half the number of flows, the 
"perfect" shuffle" doesn't get aggregated at all right? N would have to 
be 4 or the number of concurrent flows.   What I'm trying to get at is 
just to how many concurrent flows you are trying to get GRO to scale, 
and whether at that level you have asymptotically approached having a 
hash/retained state that is, basically, a duplicate of what is happening 
in TCP.

rick

^ permalink raw reply

* Re: [PATCH v3] iproute2: add support for tcp_metrics
From: Stephen Hemminger @ 2012-10-08 17:26 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: netdev
In-Reply-To: <1349302059-3283-1-git-send-email-ja@ssi.bg>

On Thu,  4 Oct 2012 01:07:39 +0300
Julian Anastasov <ja@ssi.bg> wrote:

> 	ip tcp_metrics/tcpmetrics
> 
> 	We support get/del for single entry and dump for
> show/flush.
> 
> v3:
>  - fix rtt/rttvar shifts as suggested by Eric Dumazet
>  - show rtt/rttvar usecs as suggested by David Laight
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>

Applied. Thanks

^ permalink raw reply

* Re: [PATCH] net: gro: selective flush of packets
From: Eric Dumazet @ 2012-10-08 17:10 UTC (permalink / raw)
  To: Rick Jones
  Cc: Herbert Xu, David Miller, netdev, Jesse Gross, Tom Herbert,
	Yuchung Cheng
In-Reply-To: <50730273.4000408@hp.com>

On Mon, 2012-10-08 at 09:42 -0700, Rick Jones wrote:

> > By the way, one of the beauty of GRO is it helps under load to aggregate
> > packets and reduce cpu load. People wanting very low latencies should
> > probably not use GRO, and if they use it or not, receiving a full 64
> > packets batch on a particular NIC makes latencies very unpredictable.
> >
> > So if we consumed all budget in a napi->poll() handler, its because we
> > are under load and we dont really want to cancel GRO aggregation.
> 
> Is that actually absolute, or does it depend on GRO aggregation actually 
> aggregating?  In your opening message you talked about how with though 
> flows GRO is defeated but its overhead remains.
> 

Sorry, I dont understand the question.

We consume all budget when 64 packets are fetched from NIC.

This has nothing to do with GRO, but NAPI behavior.

Sure, if these packets are UDP messages and cross GRO stack for nothing,
its pure overhead.

Current situation is :

You receive a burst of packets, with one (or few) TCP message(s), and
other frames are UDP only.

This TCP message is held in GRO queue, and stay here as long as we dont
receive another packet for the same flow, or the burst ends.

Note that I dont really care of these few TCP messages right now,
but when/if we use a hash table and allow XXX packets in GRO stack,
things are different ;)

^ permalink raw reply

* Re: [PATCH] iproute2: inform user when a neighbor is removed
From: Stephen Hemminger @ 2012-10-08 16:49 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev
In-Reply-To: <1349424172-5602-1-git-send-email-nicolas.dichtel@6wind.com>

On Fri,  5 Oct 2012 10:02:52 +0200
Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:

> When running 'ip monitor neigh', there is no hint to tell if a neighbor is
> updated or deleted.
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> ---

Applied

^ permalink raw reply

* Re: [RFC] GRO scalability
From: Eric Dumazet @ 2012-10-08 16:59 UTC (permalink / raw)
  To: Rick Jones; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross
In-Reply-To: <50730217.6020206@hp.com>

On Mon, 2012-10-08 at 09:40 -0700, Rick Jones wrote:
> On 10/05/2012 01:06 PM, Eric Dumazet wrote:
> > On Fri, 2012-10-05 at 12:35 -0700, Rick Jones wrote:
> >
> >> Just how much code path is there between NAPI and the socket?? (And I
> >> guess just how much combining are you hoping for?)
> >>
> >
> > When GRO correctly works, you can save about 30% of cpu cycles, it
> > depends...
> >
> > Doubling MAX_SKB_FRAGS (allowing 32+1 MSS per GRO skb instead of 16+1)
> > gives an improvement as well...
> 
> OK, but how much of that 30% come from where?  Each coalesced segment is 
> saving the cycles between NAPI and the socket.  Each avoided ACK is 
> saving the cycles from TCP to the bottom of the driver and a (share of) 
> transmit completion.

It comes from the fact that you have less competition between Bottom
Half handler and application on socket lock, not counting all layers
that we have to cross (IP, netfilter ...)

Each time a TCP packet is delivered and socket owned by the user, packet
is placed on a special 'backlog queue', and application has to process
this packet right before releasing socket lock. It sucks because it adds
latencies, and other frames are queued to backlokg since application
processes the backlog (very expensive because of cache line misses)

So GRO really makes this kind of event less probable.

> 
> Whe I say shuffle I mean something along the lines of interleave.  So, 
> if we have four flows, 1-4, a perfect shuffle of their segments would be 
> something like:
> 
> 1 2 3 4 1 2 3 4 1 2 3 4
> 
> but not well shuffled might look like
> 
> 1 1 3 2 3 2 4 4 4 1 3 2
> 

If all these packets are delivered in the same NAPI run, and correctly
aggregated, their order doesnt matter.

In first case, we will deliver  B1, B2, B3, B4   (B being a GRO packet
with 3 MSS)

In second case we will deliver

B1 B3 B2 B4

^ permalink raw reply

* Re: [PATCH] skge: Add DMA mask quirk for Marvell 88E8001 on ASUS P5NSLI motherboard
From: Stephen Hemminger @ 2012-10-08 16:55 UTC (permalink / raw)
  To: Jan Ceuleers; +Cc: David S. Miller, netdev, Graham Gower
In-Reply-To: <1349715275-12730-1-git-send-email-jan.ceuleers@computer.org>

On Mon,  8 Oct 2012 18:54:35 +0200
Jan Ceuleers <jan.ceuleers@computer.org> wrote:

> From: Graham Gower <graham.gower@gmail.com>
> 
> Marvell 88E8001 on an ASUS P5NSLI motherboard is unable to send/receive
> packets on a system with >4gb ram unless a 32bit DMA mask is used.
> 
> This issue has been around for years and a fix was sent 3.5 years ago, but
> there was some debate as to whether it should instead be fixed as a PCI quirk.
> http://www.spinics.net/lists/netdev/msg88670.html
> 
> However, 18 months later a similar workaround was introduced for another
> chipset exhibiting the same problem.
> http://www.spinics.net/lists/netdev/msg142287.html
> 
> Signed-off-by: Graham Gower <graham.gower@gmail.com>
> Signed-off-by: Jan Ceuleers <jan.ceuleers@computer.org>

Acked-by: Stephen Hemminger <shemminger@vyatta.com>

^ permalink raw reply

* [PATCH] skge: Add DMA mask quirk for Marvell 88E8001 on ASUS P5NSLI motherboard
From: Jan Ceuleers @ 2012-10-08 16:54 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Graham Gower, Stephen Hemminger

From: Graham Gower <graham.gower@gmail.com>

Marvell 88E8001 on an ASUS P5NSLI motherboard is unable to send/receive
packets on a system with >4gb ram unless a 32bit DMA mask is used.

This issue has been around for years and a fix was sent 3.5 years ago, but
there was some debate as to whether it should instead be fixed as a PCI quirk.
http://www.spinics.net/lists/netdev/msg88670.html

However, 18 months later a similar workaround was introduced for another
chipset exhibiting the same problem.
http://www.spinics.net/lists/netdev/msg142287.html

Signed-off-by: Graham Gower <graham.gower@gmail.com>
Signed-off-by: Jan Ceuleers <jan.ceuleers@computer.org>
---
I am resubmitting Graham's v4 patch hoping that doing so will avoid further
frustration while still getting this fix into the tree.
I have not tested this; I don't have the hardware. It compiles though.

 drivers/net/ethernet/marvell/skge.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/marvell/skge.c b/drivers/net/ethernet/marvell/skge.c
index 5a30bf8..05b3c83 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -4153,6 +4153,13 @@ static struct dmi_system_id skge_32bit_dma_boards[] = {
 			DMI_MATCH(DMI_BOARD_NAME, "nForce"),
 		},
 	},
+	{
+		.ident = "ASUS P5NSLI",
+		.matches = {
+			DMI_MATCH(DMI_BOARD_VENDOR, "ASUSTeK Computer INC."),
+			DMI_MATCH(DMI_BOARD_NAME, "P5NSLI")
+		},
+	}, 
 	{}
 };
 
-- 
1.7.9.5

^ permalink raw reply related

* RE: [PATCH net-next v2] netxen: write IP address to firmware when using bonding
From: Rajesh Borundia @ 2012-10-08 16:48 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Sony Chacko
  Cc: agospoda@redhat.com, David Miller, netdev
In-Reply-To: <506C0346.8010708@redhat.com>

>-----Original Message-----
>From: Nikolay Aleksandrov [mailto:nikolay@redhat.com]
>Sent: Wednesday, October 03, 2012 2:50 PM
>To: Sony Chacko
>Cc: agospoda@redhat.com; Rajesh Borundia; David Miller; netdev
>Subject: Re: [PATCH net-next v2] netxen: write IP address to firmware
>when using bonding
>
>On 10/02/2012 11:16 AM, Nikolay Aleksandrov wrote:
>> This patch allows LRO aggregation on bonded devices that contain an
>NX3031
>> device. It also adds a for_each_netdev_in_bond_rcu(bond, slave) macro
>> which executes for each slave that has bond as master.
>>
>> V2: Remove local ip caching, retrieve addresses dynamically and
>>      restore them if necessary.
>>
>> Note: Tested with NX3031 adapter.
>>
>> Signed-off-by: Andy Gospodarek <agospoda@redhat.com>
>> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
>> ---
>>   drivers/net/ethernet/qlogic/netxen/netxen_nic.h    |   6 -
>>   .../net/ethernet/qlogic/netxen/netxen_nic_main.c   | 192
>+++++++++++----------
>>   include/linux/netdevice.h                          |   3 +
>>   3 files changed, 100 insertions(+), 101 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic.h
>b/drivers/net/ethernet/qlogic/netxen/netxen_nic.h
>> index eb3dfdb..cb4f2ce 100644
>> --- a/drivers/net/ethernet/qlogic/netxen/netxen_nic.h
>> +++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic.h
>> @@ -955,11 +955,6 @@ typedef struct nx_mac_list_s {
>>   	uint8_t mac_addr[ETH_ALEN+2];
>>   } nx_mac_list_t;
>>
>> -struct nx_vlan_ip_list {
>> -	struct list_head list;
>> -	__be32 ip_addr;
>> -};
>> -
>>   /*
>>    * Interrupt coalescing defaults. The defaults are for 1500 MTU. It
>is
>>    * adjusted based on configured MTU.
>> @@ -1605,7 +1600,6 @@ struct netxen_adapter {
>>   	struct net_device *netdev;
>>   	struct pci_dev *pdev;
>>   	struct list_head mac_list;
>> -	struct list_head vlan_ip_list;
>>
>>   	spinlock_t tx_clean_lock;
>>
>> diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
>b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
>> index e2a4858..8e3eb61 100644
>> --- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
>> +++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
>> @@ -90,7 +90,6 @@ static irqreturn_t netxen_intr(int irq, void *data);
>>   static irqreturn_t netxen_msi_intr(int irq, void *data);
>>   static irqreturn_t netxen_msix_intr(int irq, void *data);
>>
>> -static void netxen_free_vlan_ip_list(struct netxen_adapter *);
>>   static void netxen_restore_indev_addr(struct net_device *dev,
>unsigned long);
>>   static struct rtnl_link_stats64 *netxen_nic_get_stats(struct
>net_device *dev,
>>   						      struct rtnl_link_stats64
>*stats);
>> @@ -1451,7 +1450,6 @@ netxen_nic_probe(struct pci_dev *pdev, const
>struct pci_device_id *ent)
>>
>>   	spin_lock_init(&adapter->tx_clean_lock);
>>   	INIT_LIST_HEAD(&adapter->mac_list);
>> -	INIT_LIST_HEAD(&adapter->vlan_ip_list);
>>
>>   	err = netxen_setup_pci_map(adapter);
>>   	if (err)
>> @@ -1586,7 +1584,7 @@ static void __devexit netxen_nic_remove(struct
>pci_dev *pdev)
>>
>>   	cancel_work_sync(&adapter->tx_timeout_task);
>>
>> -	netxen_free_vlan_ip_list(adapter);
>> +	netxen_restore_indev_addr(netdev, NETDEV_DOWN);
>>   	netxen_nic_detach(adapter);
>>
>>   	nx_decr_dev_ref_cnt(adapter);
>> @@ -3135,66 +3133,22 @@ netxen_destip_supported(struct netxen_adapter
>*adapter)
>>   	return 1;
>>   }
>>
>> -static void
>> -netxen_free_vlan_ip_list(struct netxen_adapter *adapter)
>> +static inline __be32
>> +netxen_get_addr(struct net_device *dev)
>>   {
>> -	struct nx_vlan_ip_list  *cur;
>> -	struct list_head *head = &adapter->vlan_ip_list;
>> -
>> -	while (!list_empty(head)) {
>> -		cur = list_entry(head->next, struct nx_vlan_ip_list, list);
>> -		netxen_config_ipaddr(adapter, cur->ip_addr, NX_IP_DOWN);
>> -		list_del(&cur->list);
>> -		kfree(cur);
>> -	}
>> -
>> -}
>> -static void
>> -netxen_list_config_vlan_ip(struct netxen_adapter *adapter,
>> -		struct in_ifaddr *ifa, unsigned long event)
>> -{
>> -	struct net_device *dev;
>> -	struct nx_vlan_ip_list *cur, *tmp_cur;
>> -	struct list_head *head;
>> -
>> -	dev = ifa->ifa_dev ? ifa->ifa_dev->dev : NULL;
>> -
>> -	if (dev == NULL)
>> -		return;
>> -
>> -	if (!is_vlan_dev(dev))
>> -		return;
>> +	struct in_device *in_dev;
>> +	__be32 addr = 0;
>>
>> -	switch (event) {
>> -	case NX_IP_UP:
>> -		list_for_each(head, &adapter->vlan_ip_list) {
>> -			cur = list_entry(head, struct nx_vlan_ip_list, list);
>> +	rcu_read_lock();
>> +	in_dev = __in_dev_get_rcu(dev);
>>
>> -			if (cur->ip_addr == ifa->ifa_address)
>> -				return;
>> -		}
>> +	if (in_dev)
>> +		addr = inet_confirm_addr(in_dev, 0, 0, RT_SCOPE_HOST);
>>
>> -		cur = kzalloc(sizeof(struct nx_vlan_ip_list), GFP_ATOMIC);
>> -		if (cur == NULL) {
>> -			printk(KERN_ERR "%s: failed to add vlan ip to list\n",
>> -					adapter->netdev->name);
>> -			return;
>> -		}
>> -
>> -		cur->ip_addr = ifa->ifa_address;
>> -		list_add_tail(&cur->list, &adapter->vlan_ip_list);
>> -		break;
>> -	case NX_IP_DOWN:
>> -		list_for_each_entry_safe(cur, tmp_cur,
>> -					&adapter->vlan_ip_list, list) {
>> -			if (cur->ip_addr == ifa->ifa_address) {
>> -				list_del(&cur->list);
>> -				kfree(cur);
>> -				break;
>> -			}
>> -		}
>> -	}
>> +	rcu_read_unlock();
>> +	return addr;
>>   }
>> +
>>   static void
>>   netxen_config_indev_addr(struct netxen_adapter *adapter,
>>   		struct net_device *dev, unsigned long event)
>> @@ -3213,12 +3167,10 @@ netxen_config_indev_addr(struct netxen_adapter
>*adapter,
>>   		case NETDEV_UP:
>>   			netxen_config_ipaddr(adapter,
>>   					ifa->ifa_address, NX_IP_UP);
>> -			netxen_list_config_vlan_ip(adapter, ifa, NX_IP_UP);
>>   			break;
>>   		case NETDEV_DOWN:
>>   			netxen_config_ipaddr(adapter,
>>   					ifa->ifa_address, NX_IP_DOWN);
>> -			netxen_list_config_vlan_ip(adapter, ifa, NX_IP_DOWN);
>>   			break;
>>   		default:
>>   			break;
>> @@ -3233,15 +3185,54 @@ netxen_restore_indev_addr(struct net_device
>*netdev, unsigned long event)
>>
>>   {
>>   	struct netxen_adapter *adapter = netdev_priv(netdev);
>> -	struct nx_vlan_ip_list *pos, *tmp_pos;
>> +	struct net_device *vlan_dev, *real_dev;
>>   	unsigned long ip_event;
>> +	__be32 addr = 0;
>>
>>   	ip_event = (event == NETDEV_UP) ? NX_IP_UP : NX_IP_DOWN;
>>   	netxen_config_indev_addr(adapter, netdev, event);
>>
>> -	list_for_each_entry_safe(pos, tmp_pos, &adapter->vlan_ip_list,
>list) {
>> -		netxen_config_ipaddr(adapter, pos->ip_addr, ip_event);
>> +	rcu_read_lock();
>> +	for_each_netdev_rcu(&init_net, vlan_dev) {
>> +		if (vlan_dev->priv_flags & IFF_802_1Q_VLAN) {
>> +			real_dev = vlan_dev_real_dev(vlan_dev);
>> +
>> +			if (real_dev == netdev) {
>> +				addr = netxen_get_addr(vlan_dev);
>> +
>> +				if (addr) {
>> +					netxen_config_ipaddr(adapter, addr,
>> +							     ip_event);
>> +				}
>> +			}
>> +		}
>>   	}
>> +
>> +	if (netdev->master) {
>> +		addr = netxen_get_addr(netdev->master);
>> +		if (addr)
>> +			netxen_config_ipaddr(adapter, addr, ip_event);
>> +	}
>> +	rcu_read_unlock();
>> +}
>> +
>> +static inline int
>> +netxen_config_checkdev(struct net_device *dev)
>> +{
>> +	struct netxen_adapter *adapter;
>> +
>> +	if (!is_netxen_netdev(dev))
>> +		return -ENODEV;
>> +
>> +	adapter = netdev_priv(dev);
>> +
>> +	if (!adapter)
>> +		return -ENODEV;
>> +
>> +	if (adapter->is_up != NETXEN_ADAPTER_UP_MAGIC)
>> +		return -ENODEV;
>> +
>> +	return 0;
>>   }
>>
>>   static int netxen_netdev_event(struct notifier_block *this,
>> @@ -3260,18 +3251,26 @@ recheck:
>>   		goto recheck;
>>   	}
>>
>> -	if (!is_netxen_netdev(dev))
>> -		goto done;
>> +	/* If this is a bonding device, look for netxen-based slaves*/
>> +	if (dev->priv_flags & IFF_BONDING) {
>> +		struct net_device *slave;
>>
>> -	adapter = netdev_priv(dev);
>> +		rcu_read_lock();
>> +		for_each_netdev_in_bond_rcu(dev, slave) {
>> +			if (netxen_config_checkdev(slave) < 0)
>> +				continue;
>>
>> -	if (!adapter)
>> -		goto done;
>> -
>> -	if (adapter->is_up != NETXEN_ADAPTER_UP_MAGIC)
>> -		goto done;
>> +			adapter = netdev_priv(slave);
>> +			netxen_config_indev_addr(adapter, orig_dev, event);
>> +		}
>> +		rcu_read_unlock();
>> +	} else {
>> +		if (netxen_config_checkdev(dev) < 0)
>> +			goto done;
>>
>> -	netxen_config_indev_addr(adapter, orig_dev, event);
>> +		adapter = netdev_priv(dev);
>> +		netxen_config_indev_addr(adapter, orig_dev, event);
>> +	}
>>   done:
>>   	return NOTIFY_DONE;
>>   }
>> @@ -3282,10 +3281,11 @@ netxen_inetaddr_event(struct notifier_block
>*this,
>>   {
>>   	struct netxen_adapter *adapter;
>>   	struct net_device *dev;
>> -
>> +	unsigned long ip_event;
>>   	struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
>>
>>   	dev = ifa->ifa_dev ? ifa->ifa_dev->dev : NULL;
>> +	ip_event = (event == NETDEV_UP) ? NX_IP_UP : NX_IP_DOWN;
>>
>>   recheck:
>>   	if (dev == NULL)
>> @@ -3296,30 +3296,35 @@ recheck:
>>   		goto recheck;
>>   	}
>>
>> -	if (!is_netxen_netdev(dev))
>> -		goto done;
>> +	/* If this is a bonding device, look for netxen-based slaves*/
>> +	if (dev->priv_flags & IFF_BONDING) {
>> +		struct net_device *slave;
>>
>> -	adapter = netdev_priv(dev);
>> +		rcu_read_lock();
>> +		for_each_netdev_in_bond_rcu(dev, slave) {
>> +			if (netxen_config_checkdev(slave) < 0)
>> +				continue;
>>
>> -	if (!adapter || !netxen_destip_supported(adapter))
>> -		goto done;
>> +			adapter = netdev_priv(slave);
>>
>> -	if (adapter->is_up != NETXEN_ADAPTER_UP_MAGIC)
>> -		goto done;
>> +			if (!netxen_destip_supported(adapter))
>> +				continue;
>>
>> -	switch (event) {
>> -	case NETDEV_UP:
>> -		netxen_config_ipaddr(adapter, ifa->ifa_address, NX_IP_UP);
>> -		netxen_list_config_vlan_ip(adapter, ifa, NX_IP_UP);
>> -		break;
>> -	case NETDEV_DOWN:
>> -		netxen_config_ipaddr(adapter, ifa->ifa_address, NX_IP_DOWN);
>> -		netxen_list_config_vlan_ip(adapter, ifa, NX_IP_DOWN);
>> -		break;
>> -	default:
>> -		break;
>> -	}
>> +			netxen_config_ipaddr(adapter, ifa->ifa_address,
>> +					     ip_event);
>> +		}
>> +		rcu_read_unlock();
>> +	} else {
>> +		if (netxen_config_checkdev(dev) < 0)
>> +			goto done;
>> +
>> +		adapter = netdev_priv(dev);
>>
>> +		if (!netxen_destip_supported(adapter))
>> +			goto done;
>> +
>> +		netxen_config_ipaddr(adapter, ifa->ifa_address, ip_event);
>> +	}
>>   done:
>>   	return NOTIFY_DONE;
>>   }
>> @@ -3335,9 +3340,6 @@ static struct notifier_block netxen_inetaddr_cb
>= {
>>   static void
>>   netxen_restore_indev_addr(struct net_device *dev, unsigned long
>event)
>>   { }
>> -static void
>> -netxen_free_vlan_ip_list(struct netxen_adapter *adapter)
>> -{ }
>>   #endif
>>
>>   static struct pci_error_handlers netxen_err_handler = {
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 59dc05f3..463bb40 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1578,6 +1578,9 @@ extern rwlock_t
>	dev_base_lock;		/* Device list lock */
>>   		list_for_each_entry_continue(d, &(net)->dev_base_head,
>dev_list)
>>   #define for_each_netdev_continue_rcu(net, d)		\
>>   	list_for_each_entry_continue_rcu(d, &(net)->dev_base_head,
>dev_list)
>> +#define for_each_netdev_in_bond_rcu(bond, slave)	\
>> +	for_each_netdev_rcu(&init_net, slave)		\
>> +		if (slave->master == bond)
>>   #define net_device_entry(lh)	list_entry(lh, struct net_device,
>dev_list)
>>
>>   static inline struct net_device *next_net_device(struct net_device
>*dev)
>Hello Dave,
>I just synced with upstream, I've missed a few patches and it seems
>that it doesn't apply cleanly because my previous patch was
>changed before it was applied. There is one character missing from
>a comment - "/* root bus? */", in upstream it was changed to
>/* root bus */.
>("netxen: check for root bus in netxen_mask_aer_correctable")
>
>About the rest, after QLogic test the functionality I'll clean-up the
>empty lines and re-send it.
>
>Thank you,
>  Nik
>
Hello Nikolay,

While testing the patch I found that mac address does not get added when
We perform following steps.
a. bond is created and assigned the ip address.
b. then the slave device is enslaved.

We add ip address in case of NETDEV_UP event. But in above case after enslave is done
we get NETDEV_FEAT_CHANGE event so ip address does not get added. 

Rajesh

^ permalink raw reply

* Re: [PATCH] RDS: Fix spinlock recursion for rds over tcp transmit
From: Venkat Venkatsubra @ 2012-10-08 16:47 UTC (permalink / raw)
  To: jeff.liu; +Cc: rds-devel, Dan Carpenter, davem, James Morris, netdev
In-Reply-To: <506FC4CD.7070509@oracle.com>

On 10/6/2012 12:42 AM, Jeff Liu wrote:
> Hello,
>
> RDS ping/pong over TCP feature has broke for years(2.6.39 to 3.6.0) since we have to set TCP cork and
> call kerenel_sendmsg() to reply a ping requirement which both need to lock "struct sock *sk".
> However, this lock has already been hold before our rda_tcp_data_ready() callback is triggerred.
> As a result, we always facing spinlock recursion which would resulting in system panic...
>
> Given that RDS ping is a special kind of message, we don't need to reply it as
> soon as possible, IMHO, we can schedule it to work queue as a delayed response to
> make TCP transport totally works.  Also, I think we can using the system default
> work queue to serve it to reduce the possible impact on general TCP transmit.
>
> With below patch, I have run rds-ping(run multiple ping between two hosts at the same time) and
> rds-stress(hostA listen, hostB send packets) for half day, it works to me.
>
> Thanks,
> -Jeff
>
>
> Reported-by: Dan Carpenter<dan.carpenter@oracle.com>
> CC: Venkat Venkatsubra<venkat.x.venkatsubra@oracle.com>
> CC: David S. Miller<davem@davemloft.net>
> CC: James Morris<james.l.morris@oracle.com>
> Signed-off-by: Jie Liu<jeff.liu@oracle.com>
>
> ---
>   net/rds/send.c |   80 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>   net/rds/tcp.h  |    7 +++++
>   2 files changed, 82 insertions(+), 5 deletions(-)
>
> diff --git a/net/rds/send.c b/net/rds/send.c
> index 96531d4..011006e 100644
> --- a/net/rds/send.c
> +++ b/net/rds/send.c
> @@ -38,8 +38,10 @@
>   #include<linux/list.h>
>   #include<linux/ratelimit.h>
>   #include<linux/export.h>
> +#include<linux/workqueue.h>
>
>   #include "rds.h"
> +#include "tcp.h"
>
>   /* When transmitting messages in rds_send_xmit, we need to emerge from
>    * time to time and briefly release the CPU. Otherwise the softlock watchdog
> @@ -55,6 +57,12 @@ static int send_batch_count = 64;
>   module_param(send_batch_count, int, 0444);
>   MODULE_PARM_DESC(send_batch_count, " batch factor when working the send queue");
>
> +/* RDS over TCP ping/pong */
> +static void rds_tcp_pong_worker(struct work_struct *work);
> +static DEFINE_SPINLOCK(rds_tcp_pong_lock);
> +static LIST_HEAD(rds_tcp_pong_list);
> +static DECLARE_WORK(rds_tcp_pong_work, rds_tcp_pong_worker);
> +
>   static void rds_send_remove_from_sock(struct list_head *messages, int status);
>
>   /*
> @@ -1082,11 +1090,7 @@ out:
>   	return ret;
>   }
>
> -/*
> - * Reply to a ping packet.
> - */
> -int
> -rds_send_pong(struct rds_connection *conn, __be16 dport)
> +static int rds_tcp_send_pong(struct rds_connection *conn, __be16 dport)
>   {
>   	struct rds_message *rm;
>   	unsigned long flags;
> @@ -1132,3 +1136,69 @@ out:
>   		rds_message_put(rm);
>   	return ret;
>   }
> +
> +static void rds_tcp_pong_worker(struct work_struct *work)
> +{
> +	struct rds_tcp_pong *tp;
> +	struct rds_connection *conn;
> +	__be16 dport;
> +
> +	spin_lock(&rds_tcp_pong_lock);
> +	if (list_empty(&rds_tcp_pong_list))
> +		goto out_unlock;
> +
> +	/*
> +	 * Process on tcp pong once one time to reduce the possbile impact
> +	 * on normal transmit.
> +	 */
> +	tp = list_entry(rds_tcp_pong_list.next, struct rds_tcp_pong, tp_node);
> +	conn = tp->tp_conn;
> +	dport = tp->tp_dport;
> +	list_del(&tp->tp_node);
> +	spin_unlock(&rds_tcp_pong_lock);
> +
> +	kfree(tp);
> +	rds_tcp_send_pong(conn, dport);
> +	goto out;
> +
> +out_unlock:
> +	spin_unlock(&rds_tcp_pong_lock);
> +out:
> +	return;
> +}
> +
> +/*
> + * RDS over TCP transport support ping/pong message.  However, it
> + * always resulting in sock spinlock recursion up to 3.7.0.  To solve
> + * this issue, we can shedule it to work queue as a delayed response.
> + * Here we using the system default work queue.
> + */
> +int rds_tcp_pong(struct rds_connection *conn, __be16 dport)
> +{
> +	struct rds_tcp_pong *tp;
> +	int ret = 0;
> +
> +	tp = kmalloc(sizeof(*tp), GFP_ATOMIC);
> +	if (!tp) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	tp->tp_conn = conn;
> +	tp->tp_dport = dport;
> +	spin_lock(&rds_tcp_pong_lock);
> +	list_add_tail(&tp->tp_node,&rds_tcp_pong_list);
> +	spin_unlock(&rds_tcp_pong_lock);
> +	schedule_work(&rds_tcp_pong_work);
> +
> +out:
> +	return ret;
> +}
> +
> +/*
> + * Reply to a ping package, TCP only.
> + */
> +int rds_send_pong(struct rds_connection *conn, __be16 dport)
> +{
> +	return rds_tcp_pong(conn, dport);
> +}
> diff --git a/net/rds/tcp.h b/net/rds/tcp.h
> index 9cf2927..c4c7e01 100644
> --- a/net/rds/tcp.h
> +++ b/net/rds/tcp.h
> @@ -3,6 +3,13 @@
>
>   #define RDS_TCP_PORT	16385
>
> +/* RDS over TCP ping/pong message entry */
> +struct rds_tcp_pong {
> +	struct list_head	tp_node;
> +	struct rds_connection	*tp_conn;
> +	__be16			tp_dport;
> +};
> +
>   struct rds_tcp_incoming {
>   	struct rds_incoming	ti_inc;
>   	struct sk_buff_head	ti_skb_list;
Hi Jeff,

I was looking at the history of changes to rds_send_pong.
At one time rds_send_pong did this to transmit the pong message:
        queue_delayed_work(rds_wq, &conn->c_send_w, 0);
instead of the current
        ret = rds_send_xmit(conn);
i.e. the older versions did not have the deadlock problem and used to 
work once. ;-)

I have suggestions for fixing it in a couple of other ways which you may 
want to consider
to reduce the amount of code changes in a transport independent layer 
such as "send.c" for a specific underlying transport (tcp in this case).

1. One option is to move back to the old way for all transports (IB, 
tcp, loopback) since
queuing delay shouldn't be an issue for a diagnostic tool like rds-ping 
which is typically used just to test the connectivity
and not for serious performance measurements.

2. The underlying transport such as IB, loopback,TCP tells which method 
it wants to send the pong: queued way or send immediately.
      And the code change in rds_send_pong could then simply be:
    if (conn->c_flags & QUEUE_PONG)
       queue_delayed_work(rds_wq, &conn->c_send_w,0);
    else
       ret = rds_send_xmit(conn);
(The above example codes are not complete. You will need to propagate 
this new flag to "conn" from "rds_transport", etc. at connection setup time)

Venkat

^ permalink raw reply

* Re: [PATCH] net: gro: selective flush of packets
From: Rick Jones @ 2012-10-08 16:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, netdev, Jesse Gross, Tom Herbert,
	Yuchung Cheng
In-Reply-To: <1349681967.21172.2866.camel@edumazet-glaptop>

On 10/08/2012 12:39 AM, Eric Dumazet wrote:
> On Sun, 2012-10-07 at 07:29 +0200, Eric Dumazet wrote:
>> 	On Sun, 2012-10-07 at 08:32 +0800, Herbert Xu wrote:
>
>>> Why don't we just always flush everything?
>>
>
>> This is what I tried first, but it lowered performance on several
>> typical workloads.
>>
>> Using this simple heuristic increases performance.
>>
>>
>
> By the way, one of the beauty of GRO is it helps under load to aggregate
> packets and reduce cpu load. People wanting very low latencies should
> probably not use GRO, and if they use it or not, receiving a full 64
> packets batch on a particular NIC makes latencies very unpredictable.
>
> So if we consumed all budget in a napi->poll() handler, its because we
> are under load and we dont really want to cancel GRO aggregation.

Is that actually absolute, or does it depend on GRO aggregation actually 
aggregating?  In your opening message you talked about how with though 
flows GRO is defeated but its overhead remains.

rick jones

^ permalink raw reply

* Re: [RFC] GRO scalability
From: Rick Jones @ 2012-10-08 16:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross
In-Reply-To: <1349467578.21172.178.camel@edumazet-glaptop>

On 10/05/2012 01:06 PM, Eric Dumazet wrote:
> On Fri, 2012-10-05 at 12:35 -0700, Rick Jones wrote:
>
>> Just how much code path is there between NAPI and the socket?? (And I
>> guess just how much combining are you hoping for?)
>>
>
> When GRO correctly works, you can save about 30% of cpu cycles, it
> depends...
>
> Doubling MAX_SKB_FRAGS (allowing 32+1 MSS per GRO skb instead of 16+1)
> gives an improvement as well...

OK, but how much of that 30% come from where?  Each coalesced segment is 
saving the cycles between NAPI and the socket.  Each avoided ACK is 
saving the cycles from TCP to the bottom of the driver and a (share of) 
transmit completion.


> I took this 1ms delay, but I never said it was a fixed value ;)
>
> Also remember one thing, this is the _max_ delay in case your napi
> handler is flooded. This almost never happen (tm)

We can still ignore the FSI types and probably the HPC types because 
they will insist on never happens (tm) :)


>
> Not sure what you mean by shuffle. We use a hash table to locate a flow,
> but we also have a LRU list to get the packets ordered by their entry in
> the 'GRO unit'.

Whe I say shuffle I mean something along the lines of interleave.  So, 
if we have four flows, 1-4, a perfect shuffle of their segments would be 
something like:

1 2 3 4 1 2 3 4 1 2 3 4

but not well shuffled might look like

1 1 3 2 3 2 4 4 4 1 3 2

rick

^ permalink raw reply

* Fw: [Bug 48601] New: Pauses while downloading
From: Stephen Hemminger @ 2012-10-08 16:29 UTC (permalink / raw)
  To: netdev



Begin forwarded message:

Date: Mon,  8 Oct 2012 09:12:17 +0000 (UTC)
From: bugzilla-daemon@bugzilla.kernel.org
To: shemminger@linux-foundation.org
Subject: [Bug 48601] New: Pauses while downloading


https://bugzilla.kernel.org/show_bug.cgi?id=48601

           Summary: Pauses while downloading
           Product: Networking
           Version: 2.5
    Kernel Version: 3.6
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: low
          Priority: P1
         Component: IPV4
        AssignedTo: shemminger@linux-foundation.org
        ReportedBy: colbec@start.ca
        Regression: No


Using OpenSUSE 12.2, Gnome 3.6 with Network Manager.
Recently upgraded from kernel 3.5 to 3.6, at which point the issue began.

Observed symptoms: while downloading, the stream will stop at random points for
5 to 120+ seconds, and then resume the download.
Example 1: while browsing, click on a link one time in 10 (approx) will fail to
get the page, and a second click is successful. Sometimes it will begin to get
the page, then stop, use back button to get previous page back, click link
again and the page is fetched successfully.
Example 2: try to test download speed using speedtest.net, one test in two
shows a normal start to the download side of the test, during the test the
process stops, then recommences after a varying period 5 to 120+ seconds.
Pauses happen one or multiple times.
Example 3: attempt to download a 20MB mp4 using Firefox. Process begins
correctly, but during the download the process stops, again for 5-120 sec, and
resumes. Downloads take a long time, but will eventually complete. If I kick it
(pause, resume in download manager) it will complete faster.

I have replaced cables, NIC; return to kernel 3.5 eliminates the issue.

.config file was copied from a 3.4.2 kernel version, worked well for 3.5, and
also seemed to be good for 3.6 with the exception of failed webcam which needed
V4L reinstated.

--- Comment #1 from Alan <alan@lxorguk.ukuu.org.uk>  2012-10-08 15:03:59 ---
Please provide detailed information on the hardware or network setup. Without
that it's hard to do anything.

--- Comment #2 from colbec@start.ca  2012-10-08 15:43:58 ---
OK, first to note is that two Windows machines on the same network using the
same wireless route to the WAN do not show this issue. Since kernel 3.5 is
clear and only 3.6 is the problem this seems to rule out the network
components, but anyway I am connecting to the WAN through two Linksys WRT4GL
running DD-WRT. Route continues to WAN through a wireless high speed connection
via Motorola Canopy technology.

I have replaced the cabling with identical results.
I have added a PCI NIC (Realtek) which behaves the same way as the onboard NIC
(Atheros AR8131).

Memory 3.8 GiB, Intel Core 2 Quad Q8300 @ 2.5 GHz x 4
Graphics Intel G41 64 bit OS, Disk 500 GB.
Correction to Gnome version 3.4.2. (not stated 3.6)

I have a ream of further hardware info from Yast, please specify what detail
would be helpful.

^ permalink raw reply

* Fw: BUG: unable to handle kernel NULL pointer dereference in qfq_dequeue()
From: Stephen Hemminger @ 2012-10-08 16:27 UTC (permalink / raw)
  To: netdev



Begin forwarded message:

Date: Mon, 08 Oct 2012 17:15:56 +0800
From: Cong Wang <amwang@redhat.com>
To: stephen hemminger <shemminger@vyatta.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>,        "David S. Miller" <davem@davemloft.net>, netdev@vger.kernel.org,        Thomas Graf <tgraf@redhat.com>
Subject: BUG: unable to handle kernel NULL pointer dereference in qfq_dequeue()


Hi, all,

We got the following kernel crash on RHEL6 and I confirmed upstream has
the same problem (I didn't save this kernel log though):

BUG: unable to handle kernel NULL pointer dereference at
0000000000000010
IP: [<ffffffffa02c3dca>] qfq_dequeue+0x30a/0x490 [sch_qfq]
PGD 1fbed067 PUD 1b103067 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs
file: /sys/devices/pci0000:00/0000:00:08.0/virtio4/net/eth2/address
CPU 0 
Modules linked in: cls_u32 sch_qfq sch_cbq ip6t_REJECT nf_conntrack_ipv6
nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6
virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq
snd_seq_device
snd_pcm snd_timer snd soundcore snd_page_alloc virtio_net i2c_piix4
i2c_core
ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi
ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last
unloaded:
scsi_wait_scan]

Pid: 0, comm: swapper Not tainted 2.6.32-259.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffffa02c3dca>]  [<ffffffffa02c3dca>] qfq_dequeue
+0x30a/0x490
[sch_qfq]
RSP: 0018:ffff880002203da0  EFLAGS: 00010287
RAX: ffffffffffffffb0 RBX: ffff88001f45e0c0 RCX: 0000000000000029
RDX: fffffe0000000000 RSI: 0000000000000001 RDI: ffff88001f45f718
RBP: ffff880002203de0 R08: 0000000000000007 R09: 0000000225c602e3
R10: 00000000ffffffff R11: dead000000200200 R12: 0000000000000013
R13: ffff88001f124ea8 R14: ffff88001f45f6b8 R15: 0028940000000000
FS:  0000000000000000(0000) GS:ffff880002200000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 000000001b277000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81a00000, task
ffffffff81a8d020)
Stack:
 ffff88001f45e000 0028900000000000 ffff880002203de0 ffff88001f4fcc00
<d> ffff88001f4fcc00 0000000000000000 0000000000000001 ffff88001ad640c0
<d> ffff880002203e60 ffffffffa02b9c85 ffff88001f4fcc00 ffff88001f4fcc00
Call Trace:
 <IRQ> 
 [<ffffffffa02b9c85>] cbq_dequeue+0x365/0x730 [sch_cbq]
 [<ffffffff81456c3f>] __qdisc_run+0x3f/0xe0
 [<ffffffff81436c00>] net_tx_action+0x130/0x1c0
 [<ffffffff8102b46d>] ? lapic_next_event+0x1d/0x30
 [<ffffffff81073d81>] __do_softirq+0xc1/0x1e0
 [<ffffffff81096b10>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff8100c24c>] call_softirq+0x1c/0x30
 [<ffffffff8100de85>] do_softirq+0x65/0xa0
 [<ffffffff81073b65>] irq_exit+0x85/0x90
 [<ffffffff81502bc0>] smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bc13>] apic_timer_interrupt+0x13/0x20
 <EOI> 
 [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10
 [<ffffffff810149cd>] default_idle+0x4d/0xb0
 [<ffffffff81009e06>] cpu_idle+0xb6/0x110
 [<ffffffff814e137a>] rest_init+0x7a/0x80
 [<ffffffff81c21f7b>] start_kernel+0x424/0x430
 [<ffffffff81c2133a>] x86_64_start_reservations+0x125/0x129
 [<ffffffff81c21438>] x86_64_start_kernel+0xfa/0x109
Code: 7c 03 50 4d 8b 7e 58 e8 b5 f6 ff ff 48 85 c0 0f 84 3c 01 00 00 41
8b 4e
60 be 01 00 00 00 49 8d 7e 60 48 89 f2 48 d3 e2 48 f7 da <48> 23 50 60
49 39 56
50 0f 84 d6 00 00 00 b8 02 00 00 00 49 89 
RIP  [<ffffffffa02c3dca>] qfq_dequeue+0x30a/0x490 [sch_qfq]
 RSP <ffff880002203da0>
CR2: 0000000000000010


This crash can be easily reproduced in KVM guests by the following
steps:

1. on virt-guest1 setup qdisc with qfq with this script:
http://pastebin.com/BRaSXLzq
2. on virt-guest2 start listening on ports 1234, 1235
# nc -l 1234 > /dev/null 2>&1
# nc -l 1235 > /dev/null 2>&1
3. on virt-guest1 send traffic to virt-guest2
# yes | nc $virt-guest2_ip_addr 1234
# yes | nc $virt-guest2_ip_addr 1235

I am not familiar with qfq qdisc. Any ideas?

Thanks!

^ permalink raw reply

* Re: skge: Add DMA mask quirk for Marvell 88E8001 on ASUS P5NSLI motherboard.
From: Stephen Hemminger @ 2012-10-08 16:16 UTC (permalink / raw)
  To: Graham Gower; +Cc: netdev
In-Reply-To: <50711961.1010101@gmail.com>

On Sun, 07 Oct 2012 16:25:45 +1030
Graham Gower <graham.gower@gmail.com> wrote:

> Marvell 88E8001 on an ASUS P5NSLI motherboard is unable to send/receive
> packets on a system with >4gb ram unless a 32bit DMA mask is used.
> 
> This issue has been around for years and a fix was sent 3.5 years ago, but
> there was some debate as to whether it should instead be fixed as a PCI quirk.
> http://www.spinics.net/lists/netdev/msg88670.html
> 
> However, 18 months later a similar workaround was introduced for another
> chipset exhibiting the same problem.
> http://www.spinics.net/lists/netdev/msg142287.html
> 
> Signed-off-by: Graham Gower <graham.gower@gmail.com>
> 
> --- skge.c.bak	2012-10-07 13:00:56.000000000 +1030
> +++ skge.c	2012-10-07 13:26:03.000000000 +1030
> @@ -4143,6 +4143,13 @@
>   			DMI_MATCH(DMI_BOARD_NAME, "nForce"),
>   		},
>   	},
> +	{
> +		.ident = "ASUS P5NSLI",
> +		.matches = {
> +			DMI_MATCH(DMI_BOARD_VENDOR, "ASUSTeK Computer INC."),
> +			DMI_MATCH(DMI_BOARD_NAME, "P5NSLI")
> +		},
> +	},
>   	{}
>   };
>   

This fix is fine, but like Jan said it needs to be based at the
right place in the filesystem tree for inclusion.

I wonder if it would be a good idea to add a way to automatically
check for this quirk using the debug registers on the chip which allow
peeking. Using the test registers B3_RAM_DATA_LO/B3_RAM_DATA_HI.

The problem is that might introduce worse problems if it causes
DMA access to non-existent memory and trying to recover from the fault.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox