Netdev List
 help / color / mirror / Atom feed
* [PATCH] bnx2x: fix rx checksum validation for IPv6
From: Michal Schmidt @ 2012-09-13 22:59 UTC (permalink / raw)
  To: netdev
  Cc: Eilon Greenstein, Eric Dumazet, Yaniv Rosner, Yuval Mintz,
	Merav Sicron, Robert Evans, Tom Herbert, Willem de Bruijn,
	David Miller

Commit d6cb3e41 "bnx2x: fix checksum validation" caused a performance
regression for IPv6. Rx checksum offload does not work. IPv6 packets
are passed to the stack with CHECKSUM_NONE.

The hardware obviously cannot perform IP checksum validation for IPv6,
because there is no checksum in the IPv6 header. This should not prevent
us from setting CHECKSUM_UNNECESSARY.

Tested on BCM57711.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index af20c6e..e8e97a7 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -662,14 +662,16 @@ void bnx2x_csum_validate(struct sk_buff *skb, union eth_rx_cqe *cqe,
 				 struct bnx2x_fastpath *fp,
 				 struct bnx2x_eth_q_stats *qstats)
 {
-	/* Do nothing if no IP/L4 csum validation was done */
-
+	/* Do nothing if no L4 csum validation was done.
+	 * We do not check whether IP csum was validated. For IPv4 we assume
+	 * that if the card got as far as validating the L4 csum, it also
+	 * validated the IP csum. IPv6 has no IP csum.
+	 */
 	if (cqe->fast_path_cqe.status_flags &
-	    (ETH_FAST_PATH_RX_CQE_IP_XSUM_NO_VALIDATION_FLG |
-	     ETH_FAST_PATH_RX_CQE_L4_XSUM_NO_VALIDATION_FLG))
+	    ETH_FAST_PATH_RX_CQE_L4_XSUM_NO_VALIDATION_FLG)
 		return;
 
-	/* If both IP/L4 validation were done, check if an error was found. */
+	/* If L4 validation was done, check if an error was found. */
 
 	if (cqe->fast_path_cqe.type_error_flags &
 	    (ETH_FAST_PATH_RX_CQE_IP_BAD_XSUM_FLG |
-- 
1.7.11.4

^ permalink raw reply related

* [PATCH] ipconfig: Inform user if carrier is not ready
From: Erwan Velu @ 2012-09-13 21:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120913.164525.1171098883605242394.davem@davemloft.net>

From: Erwan Velu <erwanaliasr1@gmail.com>

While using the ip= option at the cmdline, the kernel can hold the boot
process for 2 minutes (CONF_CARRIER_TIMEOUT) if the carrier is not
present.

While waiting the carrier, user is not informed about this situation and
so could think the kernel is frozen.

This patch is just adding a simple message every second telling we are
waiting the carrier to come up.
---
  net/ipv4/ipconfig.c |    8 ++++++++
  1 file changed, 8 insertions(+)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 67e8a6b..d9f34b7 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -205,6 +205,7 @@ static int __init ic_open_devs(void)
      struct net_device *dev;
      unsigned short oflags;
      unsigned long start;
+    unsigned int loops=0;

      last = &ic_first_dev;
      rtnl_lock();
@@ -266,6 +267,13 @@ static int __init ic_open_devs(void)
              if (ic_is_init_dev(dev) && netif_carrier_ok(dev))
                  goto have_carrier;

+        loops++;
+        /* This loop is blocking the boot process until we get the 
carrier or reach the timeout.
+         * We have to inform the user about the situation as it could 
look like a kernel freeze.
+         * Every second, we display a short message indicating we wait 
the carrier */
+        if ((loops % 1000) == 0) {
+            pr_info("IP-Config: Waiting Carrier (%d/%d):\n",loops / 
1000, CONF_CARRIER_TIMEOUT / 1000);
+        }
          msleep(1);
      }
  have_carrier:
-- 
1.7.10

^ permalink raw reply related

* Re: [PATCH v2 net-next] ipv6: prevent useless neigh alloc on PTP or lo routes
From: Eric Dumazet @ 2012-09-13 21:51 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, lorenzo, maze, therbert, willemb
In-Reply-To: <20120913.171305.713716058425991240.davem@davemloft.net>

On Thu, 2012-09-13 at 17:13 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 13 Sep 2012 05:15:58 +0200
> 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > We have special handling of SIT devices in addrconf_prefix_route()
> > to avoid allocating a neighbour for each destination.
> > 
> > If routing entry is :
> > 
> > ip -6 route add 2001:db8::/64 dev sit1
> > 
> > Then the kernel will create a new route and neighbour for every new
> > address under 2001:db8::/64 that we send a packet to 
> > (potentially, 2^64 routes and neighbours).
> > 
> > Under load, we immediately get the infamous "Neighbour table overflow"
> > message and machine eventually crash.
> > 
> > This does not happen if we specify a next-hop explicitly, like so:
> > 
> > ip -6 route add 2001:db8::/64 via fe80:: dev sit1
> > 
> > Same problem happens if we use routes to loopback.
> > 
> > Idea of this patch is to move existing SIT related code from
> > addrconf_prefix_route() to a more generic one in ip6_route_add(). 
> > 
> > This permits ip6_pol_route() to clone route instead of calling
> > rt6_alloc_cow() and allocate a neighbour.
> > 
> > Many thanks to Lorenzo for his help and suggestions.
> > 
> > Reported-by: Lorenzo Colitti <lorenzo@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> This patch lacks the desired effect without your clone-caching-removal
> patch, which I will not apply.
> 
> Therefore it doesn't make any sense to apply this either, as it won't
> fix the stated problem.
> 
> Doing a proper conversion of ipv6 to ref-count-less neigh's will solve
> this problem and allow all of the clone/cow caching code to be elided
> for the majority of cases and is the correct approach to these problems.


This seems quite different patches to me. Addressing two problems.

This patch is about not allocating a neighbour for a given route, but
reusing one existing neighbour. What could be possibly wrong with this,
since all neighbours are exactly the sames ?

I understand we can make it better with big surgery later, but right now
it seems quite reasonable.

This is already done (in part) for SIT devices, which are a subclass of
PtP device.

For the other patch, it seems problem was introduced in 2.6.38 when
CLONE_OFFLINK_ROUTE was removed in commit d80bc0fd26.

^ permalink raw reply

* [PATCH] xfrm_user: return error pointer instead of NULL
From: Mathias Krause @ 2012-09-13 21:41 UTC (permalink / raw)
  To: David S. Miller
  Cc: Steffen Klassert, netdev, linux-kernel, Mathias Krause, stable

When dump_one_state() returns an error, e.g. because of a too small
buffer to dump the whole xfrm state, xfrm_state_netlink() returns NULL
instead of an error pointer. But its callers expect an error pointer
and therefore continue to operate on a NULL skbuff.

This could lead to a privilege escalation (execution of user code in
kernel context) if the attacker has CAP_NET_ADMIN and is able to map
address 0.

Cc: stable@vger.kernel.org
Signed-off-by: Mathias Krause <minipli@googlemail.com>
---

A test case can be provided on request.

 net/xfrm/xfrm_user.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index e75d8e4..dac08e2 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -878,6 +878,7 @@ static struct sk_buff *xfrm_state_netlink(struct sk_buff *in_skb,
 {
 	struct xfrm_dump_info info;
 	struct sk_buff *skb;
+	int err;
 
 	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
 	if (!skb)
@@ -888,9 +889,10 @@ static struct sk_buff *xfrm_state_netlink(struct sk_buff *in_skb,
 	info.nlmsg_seq = seq;
 	info.nlmsg_flags = 0;
 
-	if (dump_one_state(x, 0, &info)) {
+	err = dump_one_state(x, 0, &info);
+	if (err) {
 		kfree_skb(skb);
-		return NULL;
+		return ERR_PTR(err);
 	}
 
 	return skb;
-- 
1.7.10.4

^ permalink raw reply related

* Re: [net-next.git 3/8 (V2)] stmmac: add the initial tx coalesce schema
From: David Miller @ 2012-09-13 21:37 UTC (permalink / raw)
  To: bhutchings; +Cc: peppe.cavallaro, netdev
In-Reply-To: <1347570650.13258.40.camel@deadeye.wl.decadent.org.uk>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Thu, 13 Sep 2012 22:10:50 +0100

> On Thu, 2012-09-13 at 16:46 -0400, David Miller wrote:
>> From: Ben Hutchings <bhutchings@solarflare.com>
>> Date: Thu, 13 Sep 2012 21:42:51 +0100
>> 
>> Well written NAPI drivers never need to disable hardware interrupts
>> in their ->poll() method and it's callers, neither should you.
> 
> Perhaps you should get round to reviewing netpoll, because it does
> exactly this.

Then I don't understand the point you're trying to make.

Hardware interrupt disabling has absolutely no place in the
NAPI polling fast paths.

If NAPI drivers can't be implemented without hardware interrupt
toggling in ->poll(), we've failed.

^ permalink raw reply

* [PATCH] tcp: restore rcv_wscale in a repair mode
From: Andrew Vagin @ 2012-09-13 21:28 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, Pavel Emelyanov, Andrew Vagin, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

rcv_wscale is a symetric parameter with snd_wscale.

Both this parameters are set on a connection handshake.

Without this value a remote window size can not be interpreted correctly,
because a value from a packet should be shifted on rcv_wscale.

And one more thing is that wscale_ok should be set too.

This patch doesn't break a backward compatibility.
If someone uses it in a old scheme, a rcv window
will be restored with the same bug (rcv_wscale = 0).

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 net/ipv4/tcp.c |   20 ++++++++++++++++----
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index df83d74..ed22cd7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2348,10 +2348,22 @@ static int tcp_repair_options_est(struct tcp_sock *tp,
 			tp->rx_opt.mss_clamp = opt.opt_val;
 			break;
 		case TCPOPT_WINDOW:
-			if (opt.opt_val > 14)
-				return -EFBIG;
-
-			tp->rx_opt.snd_wscale = opt.opt_val;
+			{
+				union {
+					struct {
+						u16 snd_wscale;
+						u16 rcv_wscale;
+					};
+					u32 raw;
+				} val = { .raw = opt.opt_val };
+
+				if (val.snd_wscale > 14 || val.rcv_wscale > 14)
+					return -EFBIG;
+
+				tp->rx_opt.snd_wscale = val.snd_wscale;
+				tp->rx_opt.rcv_wscale = val.rcv_wscale;
+				tp->rx_opt.wscale_ok = 1;
+			}
 			break;
 		case TCPOPT_SACK_PERM:
 			if (opt.opt_val != 0)
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH v2 net-next] ipv6: prevent useless neigh alloc on PTP or lo routes
From: David Miller @ 2012-09-13 21:13 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, lorenzo, maze, therbert, willemb
In-Reply-To: <1347506158.13103.1365.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 13 Sep 2012 05:15:58 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> We have special handling of SIT devices in addrconf_prefix_route()
> to avoid allocating a neighbour for each destination.
> 
> If routing entry is :
> 
> ip -6 route add 2001:db8::/64 dev sit1
> 
> Then the kernel will create a new route and neighbour for every new
> address under 2001:db8::/64 that we send a packet to 
> (potentially, 2^64 routes and neighbours).
> 
> Under load, we immediately get the infamous "Neighbour table overflow"
> message and machine eventually crash.
> 
> This does not happen if we specify a next-hop explicitly, like so:
> 
> ip -6 route add 2001:db8::/64 via fe80:: dev sit1
> 
> Same problem happens if we use routes to loopback.
> 
> Idea of this patch is to move existing SIT related code from
> addrconf_prefix_route() to a more generic one in ip6_route_add(). 
> 
> This permits ip6_pol_route() to clone route instead of calling
> rt6_alloc_cow() and allocate a neighbour.
> 
> Many thanks to Lorenzo for his help and suggestions.
> 
> Reported-by: Lorenzo Colitti <lorenzo@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

This patch lacks the desired effect without your clone-caching-removal
patch, which I will not apply.

Therefore it doesn't make any sense to apply this either, as it won't
fix the stated problem.

Doing a proper conversion of ipv6 to ref-count-less neigh's will solve
this problem and allow all of the clone/cow caching code to be elided
for the majority of cases and is the correct approach to these problems.

^ permalink raw reply

* Re: [net-next.git 3/8 (V2)] stmmac: add the initial tx coalesce schema
From: Ben Hutchings @ 2012-09-13 21:10 UTC (permalink / raw)
  To: David Miller; +Cc: peppe.cavallaro, netdev
In-Reply-To: <20120913.164645.1268091771093604148.davem@davemloft.net>

On Thu, 2012-09-13 at 16:46 -0400, David Miller wrote:
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Thu, 13 Sep 2012 21:42:51 +0100
> 
> > On Thu, 2012-09-13 at 16:23 -0400, David Miller wrote:
> >> From: Giuseppe CAVALLARO <peppe.cavallaro@st.com>
> >> Date: Tue, 11 Sep 2012 08:55:09 +0200
> >> 
> >> > +	unsigned long flags;
> >> > +
> >> > +	spin_lock_irqsave(&priv->tx_lock, flags);
> >> >  
> >> > -	spin_lock(&priv->tx_lock);
> >> > +	priv->xstats.tx_clean++;
> >> 
> >> You are changing the locking here for the sake of the new timer.
> >> 
> >> But timers run in software interrupt context, so this change is
> >> completely unnecessary since NAPI runs in software interrupt context
> >> as well, and neither timers nor NAPI run in hardware interrupts
> >> context.
> > 
> > NAPI pollers can be called by netpoll in hardware interrupt context, and
> > this driver supports netpoll.
> 
> And the netpoll handler need to make amends to make sure that
> hardware the environment expected by the software interrupt
> code is preserved.
>
> Well written NAPI drivers never need to disable hardware interrupts
> in their ->poll() method and it's callers, neither should you.

Perhaps you should get round to reviewing netpoll, because it does
exactly this.

> I'm not considering your patches until you implement this properly.

These aren't my patches.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] pktgen: fix crash with vlan and packet size less than 46
From: David Miller @ 2012-09-13 21:10 UTC (permalink / raw)
  To: nistrive; +Cc: netdev, benve
In-Reply-To: <1347492769-32409-1-git-send-email-nistrive@cisco.com>

From: Nishank Trivedi <nistrive@cisco.com>
Date: Wed, 12 Sep 2012 16:32:49 -0700

> If vlan option is being specified in the pktgen and packet size
> being requested is less than 46 bytes, despite being illogical
> request, pktgen should not crash the kernel.
> 
> BUG: unable to handle kernel paging request at ffff88021fb82000
> Process kpktgend_0 (pid: 1184, threadinfo ffff880215f1a000, task ffff880218544530)
> Call Trace:
> [<ffffffffa0637cd2>] ? pktgen_finalize_skb+0x222/0x300 [pktgen]
> [<ffffffff814f0084>] ? build_skb+0x34/0x1c0
> [<ffffffffa0639b11>] pktgen_thread_worker+0x5d1/0x1790 [pktgen]
> [<ffffffffa03ffb10>] ? igb_xmit_frame_ring+0xa30/0xa30 [igb]
> [<ffffffff8107ba20>] ? wake_up_bit+0x40/0x40
> [<ffffffff8107ba20>] ? wake_up_bit+0x40/0x40
> [<ffffffffa0639540>] ? spin+0x240/0x240 [pktgen]
> [<ffffffff8107b4e3>] kthread+0x93/0xa0
> [<ffffffff81615de4>] kernel_thread_helper+0x4/0x10
> [<ffffffff8107b450>] ? flush_kthread_worker+0x80/0x80
> [<ffffffff81615de0>] ? gs_change+0x13/0x13
> 
> The root cause of why pktgen is not able to handle this case is due
> to comparison of signed (datalen) and unsigned data (sizeof), which
> eventually passes a huge number to skb_put().
> 
> Signed-off-by: Nishank Trivedi <nistrive@cisco.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] drivers/net: Enable IOMMU pass through for be2net
From: David Miller @ 2012-09-13 21:09 UTC (permalink / raw)
  To: craig.hada
  Cc: netdev, sathya.perla, subbu.seetharaman, ajit.khaparde,
	linux-kernel
In-Reply-To: <505212A3.1030307@hp.com>

From: Craig Hada <craig.hada@hp.com>
Date: Thu, 13 Sep 2012 10:06:43 -0700

> On 9/13/2012 9:27 AM, Hada, Craig M wrote:
>> This patch sets the coherent DMA mask to 64-bit after the be2net
>> driver has been acknowledged that the system is 64-bit DMA
>> capable. The coherent DMA mask is examined by the Intel IOMMU driver
>> to determine whether to allow pass through context mapping for all
>> devices. With this patch, the be2net driver combined with be2net
>> compatible hardware provides comparable performance to the case where
>> vt-d is disabled. The main use case for this change is to decrease the
>> time necessary to copy virtual machine memory during KVM live
>> migration instantiations.
>>
>> This patch was tested on a system that enables the IOMMU in
>> non-coherent mode. Two DMA remapper issues were encountered and both
>> are in the Intel IOMMU driver with the following patches submitted
>> upstream but not yet commited.
>>
>> Patch 1 - DMAR:[fault reason 02] Present bit in context entry is clear
>> https://lkml.org/lkml/2012/6/15/20
> 
> My apologies for posting a truncated link for the above. The correct
> link is https://lkml.org/lkml/2012/6/15/204

First of all you've made this email reply in such a way it didn't
get logged in the patch in patchwork, perhaps because either the
Message-Id got changed or flat-out removed, I can't say for sure.

Secondly, it is not appropriate to install a change that will
knowingly break usage of the device until the IOMMU reaper fixes
actually exist upstream.  This patch is absolutely dependent upon
those fixes, and therefore must only get applied to trees that
have the IOMMU fixes installed.

You must therefore wait for upstream to acquire the fixes, upstream to
get sync'd into the networking trees, and then you apply a patch with
these requirements.

^ permalink raw reply

* Re: [net-next 0/6][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2012-09-13 21:05 UTC (permalink / raw)
  To: davem@davemloft.net
  Cc: netdev@vger.kernel.org, gospo@redhat.com, sassmann@redhat.com,
	Vick, Matthew, Ben Hutchings, Keller, Jacob E, Richard Cochran
In-Reply-To: <1346888106-25638-1-git-send-email-jeffrey.t.kirsher@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1636 bytes --]

On Wed, 2012-09-05 at 16:35 -0700, Kirsher, Jeffrey T wrote:
> This series contains updates to igb (specifically PTP code).
> 
> The following are changes since commit f6fe569fe056388166575af1cfaed0bcbc688305:
>   Revert "usbnet: drop unneeded check for NULL"
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master
> 
> Matthew Vick (6):
>   igb: Tidy up wrapping for CONFIG_IGB_PTP.
>   igb: Update PTP function names/variables and locations.
>   igb: Correct PTP support query from ethtool.
>   igb: Store the MAC address in the name in the PTP struct.
>   igb: Prevent dropped Tx timestamps via work items and interrupts.
>   igb: Add 1588 support to I210/I211.
> 
>  drivers/net/ethernet/intel/igb/e1000_defines.h |   8 +
>  drivers/net/ethernet/intel/igb/e1000_regs.h    |   2 +
>  drivers/net/ethernet/intel/igb/igb.h           |  29 +-
>  drivers/net/ethernet/intel/igb/igb_ethtool.c   |  84 +--
>  drivers/net/ethernet/intel/igb/igb_main.c      | 329 +++---------
>  drivers/net/ethernet/intel/igb/igb_ptp.c       | 676 ++++++++++++++++++++-----
>  6 files changed, 708 insertions(+), 420 deletions(-)
> 
> --

Dave-
I see you have set this series to "Changes Requested" in patchworks, and
I am assuming that is from the discussion that occurred on patch 04 of
the series.  That discussion came to the conclusion that changes should
happen in the PTP core, and that the patch is fine as is currently.

If there was something else you want changed, let me/Matthew know so
that we can make the necessary changes.

Cheers,
Jeff

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH 9/9] drivers/isdn/gigaset/common.c: Remove useless kfree
From: David Miller @ 2012-09-13 21:05 UTC (permalink / raw)
  To: tilman
  Cc: peter.senna, hjlipp, kernel-janitors, isdn, gigaset307x-common,
	netdev, linux-kernel
In-Reply-To: <5051ACDB.5030407@imap.cc>

From: Tilman Schmidt <tilman@imap.cc>
Date: Thu, 13 Sep 2012 11:52:27 +0200

> Am 12.09.2012 17:06, schrieb Peter Senna Tschudin:
>> From: Peter Senna Tschudin <peter.senna@gmail.com>
>> 
>> Remove useless kfree() and clean up code related to the removal.
>> 
>> The semantic patch that finds this problem is as follows:
> [...]
>> 
>> Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
> 
> Acked-by: Tilman Schmidt <tilman@imap.cc>

Applied to net-next, thanks.

^ permalink raw reply

* Re: [PATCH net-next 2/2] ipv6: dont cache cloned routes
From: David Miller @ 2012-09-13 21:02 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, lorenzo, maze, therbert
In-Reply-To: <1347451307.13103.885.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 12 Sep 2012 14:01:47 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> We can now destroy cloned routes immediately from dst_release() instead
> of depending on garbage collection.
> 
> Set DST_NOCACHE in rt6_alloc_clone() so that :
> 
> 1) we avoid calling ip6_ins_rt() on such routes
> 
> 2) dst_release() can call destroy when refcount becomes 0
> 
> This allows machines to resist to DDOS.
> 
> Reported-by: Lorenzo Colitti <lorenzo@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Tom Herbert <therbert@google.com>

This current behavior is very much intentional and cannot be removed
so trivially.  The scope of this change is much wider than some DDOS
test.

This change is the moral equivalent of the ipv4 routing cache removal,
but we have not done anything to compensate for the resulting ipv6
performance loss as the routing cache removal changes did.

The insertion of ipv6 route clones into the tree is how the ipv6 code
caches routes.

The only legitimate way to make this change is to revamp ipv6 route
handling properly like we did for ipv4.

This means making it such that, when legitimate, prefixed routes found
directly into the route tree are used directly.

To achieve this you need to:

1) Convert ipv6 to do ref-count-less neighbour handling and not cache
   neighbours in the ipv6 routes, instead doing the lookup on demand
   in ip6_output as we do on the ipv4 side.

2) Stop caching inetpeers in the ipv6 routes.

3) Make ipv6 in-route metrics read-only, again as we already do in
   ipv4.

And so on and so forth, until direct use of prefixed ipv6 routes is
possible.

I really can't even remotely entertain applying this patch, sorry.

^ permalink raw reply

* Re: [PATCH v2] ipv6: replace write lock with read lock when get route info
From: David Miller @ 2012-09-13 20:54 UTC (permalink / raw)
  To: roy.qing.li; +Cc: netdev
In-Reply-To: <1347436741-344-1-git-send-email-roy.qing.li@gmail.com>

From: roy.qing.li@gmail.com
Date: Wed, 12 Sep 2012 15:59:01 +0800

> From: Li RongQing <roy.qing.li@gmail.com>
> 
> geting route info does not write rt->rt6i_table, so replace
> write lock with read lock
> 
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Li RongQing <roy.qing.li@gmail.com>

I'm surprised it's been like this for so long, but I can't find
any problems with your patch :-)

Applied to net-next, thanks.

^ permalink raw reply

* Re: [PATCH net-next] ipv6: route templates can be const
From: David Miller @ 2012-09-13 20:53 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1347436071.13103.695.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 12 Sep 2012 09:47:51 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> We kmemdup() templates, so they can be const.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: bnx2 cards intermittantly going offline
From: Michael Chan @ 2012-09-13 20:30 UTC (permalink / raw)
  To: Sven Ulland; +Cc: netdev, Marc A. Donges
In-Reply-To: <5051FFAA.8060501@opera.com>


On Thu, 2012-09-13 at 17:45 +0200, Sven Ulland wrote:
> On 09/13/2012 03:51 PM, Marc A. Donges wrote:
> > After 55 days of operation the machine (A) suddenly was no longer
> > reachable via network. Strangely, a second machine (B) that should
> > take over the IP addresses (keepalived) did not take over. Only
> > after shutting the switchport to which A is attached did B take
> > over.

The rx_ftq_discards problem is a firmware problem.  FTQ discards mean
that the firmware is no longer running and the packets are dropped at
the FTQ.  This is likely fixed in:

commit 22fa159d37efbfe781bbb99279efe83f58b87d29
Author: Michael Chan <mchan@broadcom.com>
Date:   Mon Oct 11 16:12:00 2010 -0700

    bnx2: Update firmware to 6.0.x.


> 
> Hi. We've had the same symptom with our BCM5709S [14e4:163a] on
> Debian. Like you, we were on stable's 2.6.32-41squeeze2. Google led us
> to many similar issues [1,2,3]. They concluded with the fix being in
> mainline commit c441b8d2 [4]: "bnx2: Fix lost MSI-X problem on 5709
> NICs".

This is a different problem and will not result in FTQ discards.

> 
> Broadcom: Can you publish a tool that decodes ethtool -d dumps to make
> debugging easier, or do you deem it no longer necessary with the the
> register dump commits in 555069da?

The register dump during tx timeout is now quite comprehensive.

> 
> Now, Debian's 2.6.32-41squeeze2 is based on longterm release 2.6.32.54
> [5]. That version includes commit 0b7817ed [6], which is a backport of
> the already mentioned mainline commit c441b8d2.
> 
> So we tried digging further and applying some seemingly relevant
> commits [7,8] to our 2.6.32, but without any change in behaviour. Our
> temporary fix was to run 'ethtool -t ethX' to reset the device every
> time it locked up.
> 
> This dragged on with various builds, until we ended up on mainline
> 2.6.38 where we no longer saw any symptoms. I don't know in which
> kernel version it was fixed, but we ended up on that one, sort of by
> chance. Unfortunately, it had severe issues with kswapd memory
> compaction causing CPU soft lockups [9], so we went straight to
> squeeze-backports' 3.2.23-1~bpo60+2. We've been happy since then.
> 
> > We have five pairs of basically identical machines performing the
> > same task (each pair for one site). The error has not occured with
> > any other one, but this site is the busiest:
> 
> We also saw the issue only at a site with generally higher load
> compared to other sites.
> 
> I'd love to know exactly which commit fixed the issue, but it's fairly
> tricky to reproduce the issue, and the bisect count is fairly high (it
> need not be a specific fix for bnx2).

If you see the same FTQ discards, please try that firmware commit
mentioned above.  Thanks.

> 
> sven
> 
> 
> [1]: bnx2 driver crashes under random circumstances
> https://bugzilla.redhat.com/show_bug.cgi?id=520888
> 
> [2]: Access denied. Come on, Red Hat!
> https://bugzilla.redhat.com/show_bug.cgi?id=511368
> 
> [3]: NIC doesn't register packets [rhel-5.5.z]
> https://bugzilla.redhat.com/show_bug.cgi?id=587799
> 
> [4]: bnx2: Fix lost MSI-X problem on 5709 NICs.
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=object;h=c441b8d2cb2194b05550a558d6d95d8944e56a84
> 
> [5]: Debian Changelog linux-2.6 (2.6.32-45)
> http://packages.debian.org/changelogs/pool/main/l/linux-2.6/linux-2.6_2.6.32-45/changelog#version2.6.32-41
> 
> [6]: bnx2: Fix lost MSI-X problem on 5709 NICs.
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commit;h=0b7817edda5e44e5fa769645bd1220f5e7b0beb5
> 
> [7]: bnx2: reset_task is crashing the kernel. Fixing it.
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4529819c45161e4a119134f56ef504e69420bc98
> 
> [8]: bnx2: fixing a timout error due not refreshing TX timers correctly
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e6bf95ffa8d6f8f4b7ee33ea01490d95b0bbeb6e
> 
> [9]: [PATCH] remove compaction from kswapd
> http://thread.gmane.org/gmane.linux.kernel.mm/58962
> https://lkml.org/lkml/2011/3/25/664
> 
> 

^ permalink raw reply

* Re: [net-next.git 3/8 (V2)] stmmac: add the initial tx coalesce schema
From: David Miller @ 2012-09-13 20:46 UTC (permalink / raw)
  To: bhutchings; +Cc: peppe.cavallaro, netdev
In-Reply-To: <1347568971.13258.19.camel@deadeye.wl.decadent.org.uk>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Thu, 13 Sep 2012 21:42:51 +0100

> On Thu, 2012-09-13 at 16:23 -0400, David Miller wrote:
>> From: Giuseppe CAVALLARO <peppe.cavallaro@st.com>
>> Date: Tue, 11 Sep 2012 08:55:09 +0200
>> 
>> > +	unsigned long flags;
>> > +
>> > +	spin_lock_irqsave(&priv->tx_lock, flags);
>> >  
>> > -	spin_lock(&priv->tx_lock);
>> > +	priv->xstats.tx_clean++;
>> 
>> You are changing the locking here for the sake of the new timer.
>> 
>> But timers run in software interrupt context, so this change is
>> completely unnecessary since NAPI runs in software interrupt context
>> as well, and neither timers nor NAPI run in hardware interrupts
>> context.
> 
> NAPI pollers can be called by netpoll in hardware interrupt context, and
> this driver supports netpoll.

And the netpoll handler need to make amends to make sure that
hardware the environment expected by the software interrupt
code is preserved.

Well written NAPI drivers never need to disable hardware interrupts
in their ->poll() method and it's callers, neither should you.

I'm not considering your patches until you implement this properly.

^ permalink raw reply

* Re: Remarks and comments about ipconfig behavior
From: David Miller @ 2012-09-13 20:45 UTC (permalink / raw)
  To: erwanaliasr1; +Cc: netdev
In-Reply-To: <50524419.4080404@gmail.com>

From: Erwan Velu <erwanaliasr1@gmail.com>
Date: Thu, 13 Sep 2012 22:37:45 +0200

> Ok but shouldn't we display some message to the user trying to explain
> why the kernel is stopped and perfectly silent during its booting
> process ? 2 minutes, that's a pretty long time with an almost frozen
> kernel isn't it ?

Patches welcome.

^ permalink raw reply

* Re: [net-next.git 3/8 (V2)] stmmac: add the initial tx coalesce schema
From: Ben Hutchings @ 2012-09-13 20:42 UTC (permalink / raw)
  To: David Miller; +Cc: peppe.cavallaro, netdev
In-Reply-To: <20120913.162333.1518469374321928795.davem@davemloft.net>

On Thu, 2012-09-13 at 16:23 -0400, David Miller wrote:
> From: Giuseppe CAVALLARO <peppe.cavallaro@st.com>
> Date: Tue, 11 Sep 2012 08:55:09 +0200
> 
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&priv->tx_lock, flags);
> >  
> > -	spin_lock(&priv->tx_lock);
> > +	priv->xstats.tx_clean++;
> 
> You are changing the locking here for the sake of the new timer.
> 
> But timers run in software interrupt context, so this change is
> completely unnecessary since NAPI runs in software interrupt context
> as well, and neither timers nor NAPI run in hardware interrupts
> context.

NAPI pollers can be called by netpoll in hardware interrupt context, and
this driver supports netpoll.

Ben.

> Therefore, disabling hardware interrupts for this lock is unnecessary
> and will decrease performance.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [net PATCH v2 1/7] bnx2x: Avoid sending multiple statistics queries
From: David Miller @ 2012-09-13 20:38 UTC (permalink / raw)
  To: yuvalmin; +Cc: netdev, dmitry, eilong
In-Reply-To: <979A8436335E3744ADCD3A9F2A2B68A504B81E31@SJEXCHMB10.corp.ad.broadcom.com>

From: "Yuval Mintz" <yuvalmin@broadcom.com>
Date: Thu, 13 Sep 2012 20:34:57 +0000

> But there is a difference between:
> If (A && B) {
> 	C;
> 	D;
> }
> 
> And:
> if (A) {
> 	if (B) {
> 		C;
> 	}
> 	D;
> }

Indeed, I missed the return, it make a lot more sense now.

I've applied this series to 'net' and will push it out after
build testing.

Thanks.

^ permalink raw reply

* Re: [PATCH 4/4] net_sched: gred: actually perform idling in WRED mode
From: Jamal Hadi Salim @ 2012-09-13 20:37 UTC (permalink / raw)
  To: Ward, David - 0663 - MITLL
  Cc: netdev@vger.kernel.org, Bruce Osler, Cyril Chemparathy
In-Reply-To: <50523632.60703@ll.mit.edu>

On 12-09-13 03:38 PM, Ward, David - 0663 - MITLL wrote:
> Before applying this patch, the average queue size (as seen with "tc 
> -s qdisc") remained constant forever after I stopped sending any 
> packets through the interface -- it didn't taper off as you would 
> expect.  After the patch, the average queue size will now taper off if 
> packets are not being sent.
>


That makes sense. thanks

cheers,
jamal

^ permalink raw reply

* Re: Remarks and comments about ipconfig behavior
From: Erwan Velu @ 2012-09-13 20:37 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120913.163134.361379133013906511.davem@davemloft.net>

Le 13/09/2012 22:31, David Miller a écrit :
> From: Erwan Velu<erwanaliasr1@gmail.com>
> Date: Thu, 13 Sep 2012 22:21:15 +0200
>
>> That lever for me two points :
>> - why is this timeout setup for so long ? Even with a spantree
>> - configuration, not having a carrier for 2mn is *waow* ... Does a 30sec
>> - could not be enough ? What is the need of waiting so long time ?
> I've seen PHY/switch/hub combinations that take longer than 30 seconds
> to fully negotiate the link.
>
> There is really no upper limit to the link speed/duplex/etc.
> negoatiation process.
>
> Even if the actual negoatiation protocol had an upper limit on
> negoatiation time, hardware implementations do things like try
> sampling the quality of the cable signal and may choose to
> down-rev the advertised features and restart the negoatiation.
Ok but shouldn't we display some message to the user trying to explain 
why the kernel is stopped and perfectly silent during its booting 
process ? 2 minutes, that's a pretty long time with an almost frozen 
kernel isn't it ?

^ permalink raw reply

* RE: [net PATCH v2 1/7] bnx2x: Avoid sending multiple statistics queries
From: Dmitry Kravkov @ 2012-09-13 20:35 UTC (permalink / raw)
  To: David Miller, Yuval Mintz; +Cc: netdev@vger.kernel.org, Eilon Greenstein
In-Reply-To: <20120913.161711.2180618409918407558.davem@davemloft.net>

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Thursday, September 13, 2012 11:17 PM
> To: Yuval Mintz
> Cc: netdev@vger.kernel.org; Dmitry Kravkov; Eilon Greenstein
> Subject: Re: [net PATCH v2 1/7] bnx2x: Avoid sending multiple statistics queries
> 
> From: "Yuval Mintz" <yuvalmin@broadcom.com>
> Date: Tue, 11 Sep 2012 17:34:08 +0300
> 
> > From: Dmitry Kravkov <dmitry@broadcom.com>
> >
> > During traffic when DCB is enabled, it is possible for multiple instances
> > of statistics queries to be sent to the chip - this may cause the FW to assert.
> >
> > This patch prevents the sending of an additional instance of statistics query
> > while the previous query hasn't completed.
> >
> > Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
> > Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
> > Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
> 
> This change results in no change in behavior as far as I can tell.
> 
> > -	if (bnx2x_storm_stats_update(bp) && (bp->stats_pending++ == 3)) {
> > -		BNX2X_ERR("storm stats were not updated for 3 times\n");
> > -		bnx2x_panic();
> > +	if (bnx2x_storm_stats_update(bp)) {
> > +		if (bp->stats_pending++ == 3) {
> > +			BNX2X_ERR("storm stats were not updated for 3
> times\n");
> > +			bnx2x_panic();
> > +		}
> 
> There is no difference between:
> 
> 	if (A && B) {
> 		C;
> 	}
> 
> and:
> 
> 	if (A) {
> 		if (B) {
> 			C;
> 		}
> 	}
> 
> Yet that's exactly what is happening in this patch.
> 
> And such a do-nothing change is certainly not appropriate this late in
> the -rc series.
> 
> I'm tossing this entire series, please sort this out and submit
> the real actual critical bug fixes.
return statement is not seen in the patch:
Before the change we returned from the function if (A &&B)
Now we return even if (A)

^ permalink raw reply

* RE: [net PATCH v2 1/7] bnx2x: Avoid sending multiple statistics queries
From: Yuval Mintz @ 2012-09-13 20:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org, Dmitry Kravkov, Eilon Greenstein
In-Reply-To: <20120913.161711.2180618409918407558.davem@davemloft.net>

> > This patch prevents the sending of an additional instance of statistics
> query
> > while the previous query hasn't completed.
> >
> This change results in no change in behavior as far as I can tell.
> 
> > -	if (bnx2x_storm_stats_update(bp) && (bp->stats_pending++ == 3)) {
> > -		BNX2X_ERR("storm stats were not updated for 3 times\n");
> > -		bnx2x_panic();
> > +	if (bnx2x_storm_stats_update(bp)) {
> > +		if (bp->stats_pending++ == 3) {
> > +			BNX2X_ERR("storm stats were not updated for 3
> times\n");
> > +			bnx2x_panic();
> > +		}

But you're missing the 'return;' statement at the end.

> 
> There is no difference between:
> 
> 	if (A && B) {
> 		C;
> 	}
> 
> and:
> 
> 	if (A) {
> 		if (B) {
> 			C;
> 		}
> 	}

But there is a difference between:
If (A && B) {
	C;
	D;
}

And:
if (A) {
	if (B) {
		C;
	}
	D;
}

The point of this patch was not to change the condition of the print & panic but rather to guarantee that
if bnx2x_storm_stats_update failed, no more ramrods will be posted, as the function will return 
(regardless of stats_pending value).

Sorry it wasn't clearer in the patch.

^ permalink raw reply

* Re: [PATCH 2/2] ipv6: Compare addresses only bits up to the prefix length (RFC6724).
From: David Miller @ 2012-09-13 20:34 UTC (permalink / raw)
  To: yoshfuji; +Cc: netdev
In-Reply-To: <201209110508.q8B58Min015633@94.43.138.210.xn.2iij.net>

From: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Date: Tue, 11 Sep 2012 13:41:07 +0900

> Compare bits up to the source address's prefix length only to
> allows DNS load balancing to continue to be used as a tie breaker.
> 
> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

Applied to net-next

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox