Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 0/6] bonding: only rely on arp packets if arp monitor is used
From: Veaceslav Falico @ 2014-01-17  8:02 UTC (permalink / raw)
  To: David Miller; +Cc: fubar, netdev, andy
In-Reply-To: <20140115.220132.1518410490710218099.davem@davemloft.net>

On Wed, Jan 15, 2014 at 10:01:32PM -0800, David Miller wrote:
>From: Jay Vosburgh <fubar@us.ibm.com>
>Date: Wed, 15 Jan 2014 21:09:57 -0800
>
>> 	The main reason for preserving the non-validate behavior (any
>> traffic counts) is for the loadbalance (xor and rr) modes.  In those
>> modes, the switch decides which slave receives the incoming traffic, and
>> so it's to our advantage to permit any incoming traffic to count for
>> "up-ness."  The arp_validate option is not allowed in these modes
>> because it won't work.
>>
>> 	With these changes, I suspect that the loadbalance ARP monitor
>> will be less reliable with these changes (granted that it's already a
>> bit dodgy in its dependence on the switch to hit all slaves with
>> incoming packets regularly).  Particularly if the switch ports are
>> configured into an Etherchannel ("static link aggregation") group, in
>> which case only one slave will receive any given frame (broadcast /
>> multicast traffic will not be duplicated across all slaves).
>>
>> 	I'm not sure that this change (the "only count ARPs even without
>> arp_validate" bit) won't break existing configurations.  Did you test
>> the -rr and -xor modes with ARP monitor after your changes (with and
>> without configuring a channel group on the switch ports)?
>
>Sorry Jay, I only read this just now.  I won't push these changes
>out until you've had some time to discuss them.

Sorry David, this change indeed might break some configurations (leaving
aside the question if they're legit or not...). So, please, drop them, I'll
discuss with Jay and send a v2.

Thanks a lot, and sorry for the noise.

>
>To my untrained eye they looked rather straightforward :-)

^ permalink raw reply

* Re: Linux kernel patch: elide fib_validate_source() completely when possible - bad side effect?
From: Julian Anastasov @ 2014-01-17  8:24 UTC (permalink / raw)
  To: gregory hoggarth; +Cc: netdev
In-Reply-To: <52D945400200005D00047758@gwia2.atlnz.lc>


	Hello,

On Fri, 17 Jan 2014, gregory hoggarth wrote:

> Hi,
> 
> I work for Allied Telesis, a company that manufactures and sells routers and switches.
> 
> Earlier in the year we upgraded the linux kernel on our L3 switch firmware to v3.6.2 which includes this patch. Our testers have picked up a change in behaviour which I have traced back to this patch and I'd like to get comments on whether this is a bug or not.
> 
> We found that in a configuration with a device under test (DUT) that has an IP address 192.168.0.1 / 24, sending an ICMP echo request with a source IP address of 192.168.0.255 will now be accepted by the DUT, which will send an ICMP echo reply back to the 192.168.0.255 address with a broadcast MAC address of FFFF.FFFF.FFFF, meaning any other devices in the L2 domain will pick up and process the ICMP reply.

...

> Initially we were thinking that a device configured with 192.168.0.255 / 23 sending a ping to 192.168.0.1 / 24 should expect to get a reply, even though this is somewhat of a misconfiguration in that the subnets aren't matching and that some things may not work properly. For example ping from a device with IP address 192.168.0.254 would behave correctly, so it seemed correct that a ping from 192.168.0.255 should also work properly. However it appears that some other part of the kernel is detecting that the echo reply packet sent by the DUT has a destination IP address of the subnet broadcast, so instead of sending it with a unicast MAC address it sends it with the broadcast MAC address, which means all other devices on the subnet would process the ICMP echo reply. This therefore makes it
  seem like the ICMP echo request should have been dropped in the first place.
> 
> The patch added the "if (!r && !fib_num_tclassid_users)" check to fib_validate_source, which for our devices evaluates to true, hence returning early and the source address check inside __fib_validate_source is not carried out.
> 
> 
> Does anyone have comments as to what the correct behaviour should be in this case, and whether this is an unexpected side-effect from the patch or not?

	It seems only __fib_validate_source can reject all kind
of broadcast addresses in saddr. ip_route_input_slow() rejects
only the well known broadcasts. Without rp_filter may be we
can at least drop attempts to send replies back to broadcast
addresses? For example, checking result of ip_route_output_key() in
icmp_reply():

	if (rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
		=> ip_rt_put()

	For icmp_send() in icmp_route_lookup() ?

	For TCP in inet_csk_route_req() ?

	I did not checked all places...

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: ipv6: default route for link local address is not added while assigning a address
From: sohny thomas @ 2014-01-17  8:34 UTC (permalink / raw)
  To: netdev, linux-kernel, yoshfuji, davem, kumuda, hannes
In-Reply-To: <52D3A885.2080107@gmail.com>

Hi All,

Any updates on my reply, Any more info is required.
Can this be pulled into the kernel tree?

Thanks & Regards,
Sohny


On Monday 13 January 2014 02:19 PM, sohny thomas wrote:
> On Friday 10 January 2014 10:46 PM, Hannes Frederic Sowa wrote:
>> On Fri, Jan 10, 2014 at 05:33:08PM +0100, Nicolas Dichtel wrote:
>>> CC: netdev
>>>
>>> Le 10/01/2014 13:20, sohny thomas a écrit :
>>>> Default route for link local address is configured automatically if
>>>> NETWORKING_IPV6=yes is in ifcfg-eth*.
>>>> When the route table for the interface is flushed and a new address is
>>>> added to
>>>> the same device with out removing linklocal addr, default route for
>>>> link
>>>> local
>>>> address has to added by default.
>>> I would say that removing the link local route but not the link local
>>> address
>>> is a configuration problem.
>>> If you remove a connected route but not the associated address, you will
>>> have
>>> the same problem.
>> We have some user accessible routes that are essential for IPv6 stack
>> to work at all. So I don't know if I can agree with that.
>>
>> Maybe flush is a bit too aggressive?
>>
> Hi ,
>
> Thank you for the inputs.
>
> In the test for ipv6 default address selection , we are testing the rule
> 2 as specified in RFC 6724
>
>    If Scope(SA) < Scope(SB): If Scope(SA) < Scope(D), then prefer SB
>      and otherwise prefer SA.
>      Similarly, if Scope(SB) < Scope(SA): If Scope(SB) < Scope(D), then
>      prefer SA and otherwise prefer SB.
>
> Test:
>
>    Check 04:
>       Destination: ff08::2(OS)
>       Candidate Source Addresses: fec0::1(SS) or LLA(LS)
>       Result: fec0::1(SS)
>
>       Scope(LLA) < Scope(fec0::1): If Scope(LLA) < Scope(ff08::2),  yes,
> prefer fec0::1
>
>
> Now in the test its flushing all the routes and adding an address ,
> which in causes to add route into the routing table including the link
> local routes.
> Earlier in 2.6.32 it used to work fine now due to the above mentioned
> check-in this is not happening
>
> Of course we can still just delete a route and add , but even if we
> delete the link local route, IMHO i think it should update the LLA route
> when the interface is next added an address or bought up which ever is
> the case.
>
> Regards,
> Sohny
>
>

^ permalink raw reply

* Re: [Patch net-next] vxlan: do not use vxlan_net before checking event type
From: Daniel Borkmann @ 2014-01-17  8:39 UTC (permalink / raw)
  To: Fan Du; +Cc: Cong Wang, netdev, David S. Miller
In-Reply-To: <52D8B17A.6030301@windriver.com>

On 01/17/2014 05:28 AM, Fan Du wrote:
> Hi, Cong
>
> On 2014年01月17日 12:20, Cong Wang wrote:
>> When cloning a netns, loopback device will be registered
>> and therefore an event will be notified. Of course
>> vxlan doesn't care about it, therefore should check if it
>> is NETDEV_UNREGISTER before getting the vxlan_net struct.
>> Otherwise, vxlan_net is probably not initialized yet at
>> this point.
>
> I'm bit new to vxlan, but in vxlan_init_module
>
> register_pernet_device is called before register_netdevice_notifier.
> By my understanding once register_pernet_device is called,
> then subsequent vxlan_notifier_block callback see a valid vxlan_net_id.
> I mean execute vxlan_notifier_block callback indicates a valid vxlan_net_id,
> or I miss somewhere else.

That was also my understanding, but apparently I missed that.
Very sorry. Then, it would have been enough to just provide this:

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index a2dee80..d6ec71f 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2681,10 +2681,12 @@ static int vxlan_lowerdev_event(struct notifier_block *unused,
  				unsigned long event, void *ptr)
  {
  	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
-	struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
+	struct vxlan_net *vn;

-	if (event == NETDEV_UNREGISTER)
+	if (event == NETDEV_UNREGISTER) {
+		vn = net_generic(dev_net(dev), vxlan_net_id);
  		vxlan_handle_lowerdev_unregister(vn, dev);
+	}

  	return NOTIFY_DONE;
  }

^ permalink raw reply related

* Re: [PATCH v2 net] bpf: do not use reciprocal divide
From: Heiko Carstens @ 2014-01-17  8:59 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, schwidefsky, hannes, netdev, dborkman, darkjames-ws,
	mgherzan, rmk+kernel, matt
In-Reply-To: <20140115.170230.640619926492650356.davem@davemloft.net>

On Wed, Jan 15, 2014 at 05:02:30PM -0800, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 15 Jan 2014 06:50:07 -0800
> 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > At first Jakub Zawadzki noticed that some divisions by reciprocal_divide
> > were not correct. (off by one in some cases)
> > http://www.wireshark.org/~darkjames/reciprocal-buggy.c
> > 
> > He could also show this with BPF:
> > http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
> > 
> > The reciprocal divide in linux kernel is not generic enough,
> > lets remove its use in BPF, as it is not worth the pain with
> > current cpus.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
> 
> Applied and queued up for -stable, thanks Eric.

While thinking a bit more the s390 variant is still broken with special
casing the "divide with 1" case (see below).

Could you please also apply the patch below to your tree? It would only
generate a merge conflict, that would need fixing, if it would sit in the
s390 tree.

>From bf0f2dc84dd3774944fc4dddef8c7e699277aa96 Mon Sep 17 00:00:00 2001
From: Heiko Carstens <heiko.carstens@de.ibm.com>
Date: Fri, 17 Jan 2014 09:37:15 +0100
Subject: [PATCH] s390/bpf,jit: fix 32 bit divisions, use unsigned divide
 instructions

The s390 bpf jit compiler emits the signed divide instructions "dr" and "d"
for unsigned divisions.
This can cause problems: the dividend will be zero extended to a 64 bit value
and the divisor is the 32 bit signed value as specified A or X accumulator,
even though A and X are supposed to be treated as unsigned values.

The divide instrunctions will generate an exception if the result cannot be
expressed with a 32 bit signed value.
This is the case if e.g. the dividend is 0xffffffff and the divisor either 1
or also 0xffffffff (signed: -1).

To avoid all these issues simply use unsigned divide instructions.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 arch/s390/net/bpf_jit_comp.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
index fc0fa77728e1..708d60e40066 100644
--- a/arch/s390/net/bpf_jit_comp.c
+++ b/arch/s390/net/bpf_jit_comp.c
@@ -368,16 +368,16 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter,
 		EMIT4_PCREL(0xa7840000, (jit->ret0_ip - jit->prg));
 		/* lhi %r4,0 */
 		EMIT4(0xa7480000);
-		/* dr %r4,%r12 */
-		EMIT2(0x1d4c);
+		/* dlr %r4,%r12 */
+		EMIT4(0xb997004c);
 		break;
 	case BPF_S_ALU_DIV_K: /* A /= K */
 		if (K == 1)
 			break;
 		/* lhi %r4,0 */
 		EMIT4(0xa7480000);
-		/* d %r4,<d(K)>(%r13) */
-		EMIT4_DISP(0x5d40d000, EMIT_CONST(K));
+		/* dl %r4,<d(K)>(%r13) */
+		EMIT6_DISP(0xe340d000, 0x0097, EMIT_CONST(K));
 		break;
 	case BPF_S_ALU_MOD_X: /* A %= X */
 		jit->seen |= SEEN_XREG | SEEN_RET0;
@@ -387,8 +387,8 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter,
 		EMIT4_PCREL(0xa7840000, (jit->ret0_ip - jit->prg));
 		/* lhi %r4,0 */
 		EMIT4(0xa7480000);
-		/* dr %r4,%r12 */
-		EMIT2(0x1d4c);
+		/* dlr %r4,%r12 */
+		EMIT4(0xb997004c);
 		/* lr %r5,%r4 */
 		EMIT2(0x1854);
 		break;
@@ -400,8 +400,8 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter,
 		}
 		/* lhi %r4,0 */
 		EMIT4(0xa7480000);
-		/* d %r4,<d(K)>(%r13) */
-		EMIT4_DISP(0x5d40d000, EMIT_CONST(K));
+		/* dl %r4,<d(K)>(%r13) */
+		EMIT6_DISP(0xe340d000, 0x0097, EMIT_CONST(K));
 		/* lr %r5,%r4 */
 		EMIT2(0x1854);
 		break;
-- 
1.8.4.5

^ permalink raw reply related

* Fwd: Linux bridge for route
From: tingwei liu @ 2014-01-17  9:14 UTC (permalink / raw)
  To: netfilter-devel, netfilter, netdev; +Cc: tingwei liu
In-Reply-To: <CA+qZnSRGKiFh8FiKMpVQku7CiWCMPq47FpPwrawvTC6KBPwz7A@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 821 bytes --]

Dear all,

       There is  a question has puzzled me for a long time.
       You can find the topology from attachment.

       Normal traffic is:

               PC(192.168.1.8)--->Bridge(eth0)--->Bridget(eth1)--->NAT
server-->switch--->Server(192.168.5.3)

       Now I want the ssh traffic like this:
              PC(182.168.1.8)--->Bridge(eth0)--->eth2--->NAT
server--->switch--->Server(192.168.5.3)


      What I have done on LINUX Server:
              #net.bridge.bridge-nf-call-iptables = 1
              #iptables -t nat -A POSTROUTING -s 192.168.1.8 -p tcp
--dport 22 -j SNAT --to-source 192.168.5.2

      I have find the rule matched through command "iptables -t nat
-nvL", but the packets doesn't sent to 192.168.5.3.
      and "tcpdump -i eth2 tcp port 22" can not capture any packet!

Thanks very much!

[-- Attachment #2: topology.jpg --]
[-- Type: image/jpeg, Size: 84648 bytes --]

^ permalink raw reply

* [net-next 0/3] Intel Wired LAN Driver Updates
From: Aaron Brown @ 2014-01-17  9:21 UTC (permalink / raw)
  To: davem; +Cc: Aaron Brown, netdev, gospo, sassmann

This series contains an updates to ixgbe and ixgbevf.

Jacob add braces around some ixgbe_qv_lock_* calls lto better adhere 
to Kernel style guidelines.  Don bumps the versions on ixgbe and
ixgbevf to match internal driver functionality better.

Don Skidmore (2):
  ixgbe: bump version number
  ixgbevf: bump version

Jacob Keller (1):
  ixgbe: add braces around else condition in ixgbe_qv_lock_* calls

 drivers/net/ethernet/intel/ixgbe/ixgbe.h          | 6 ++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     | 2 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

-- 
1.8.5.GIT

^ permalink raw reply

* [net-next 2/3] ixgbe: bump version number
From: Aaron Brown @ 2014-01-17  9:21 UTC (permalink / raw)
  To: davem; +Cc: Don Skidmore, netdev, gospo, sassmann, Aaron Brown
In-Reply-To: <1389950498-8820-1-git-send-email-aaron.f.brown@intel.com>

From: Don Skidmore <donald.c.skidmore@intel.com>

Bump the version number to better match functionality provided with out
of tree driver of the same version.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index b445ad1..72e2035 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -64,7 +64,7 @@ char ixgbe_default_device_descr[] =
 static char ixgbe_default_device_descr[] =
 			      "Intel(R) 10 Gigabit Network Connection";
 #endif
-#define DRV_VERSION "3.15.1-k"
+#define DRV_VERSION "3.19.1-k"
 const char ixgbe_driver_version[] = DRV_VERSION;
 static const char ixgbe_copyright[] =
 				"Copyright (c) 1999-2013 Intel Corporation.";
-- 
1.8.5.GIT

^ permalink raw reply related

* [net-next 1/3] ixgbe: add braces around else condition in ixgbe_qv_lock_* calls
From: Aaron Brown @ 2014-01-17  9:21 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, gospo, sassmann, Aaron Brown
In-Reply-To: <1389950498-8820-1-git-send-email-aaron.f.brown@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

This patch adds braces around the ixgbe_qv_lock_* calls which previously only
had braces around the if portion. Kernel style guidelines for this require
parenthesis around all conditions if they are required around one. In addition
the comment while not illegal C syntax makes the code look wrong at a cursory
glance. This patch corrects the style and adds braces so that the full if-else
block is uniform.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 3a4373f..0186ea2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -424,9 +424,10 @@ static inline bool ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
 #ifdef BP_EXTENDED_STATS
 		q_vector->tx.ring->stats.yields++;
 #endif
-	} else
+	} else {
 		/* we don't care if someone yielded */
 		q_vector->state = IXGBE_QV_STATE_NAPI;
+	}
 	spin_unlock_bh(&q_vector->lock);
 	return rc;
 }
@@ -458,9 +459,10 @@ static inline bool ixgbe_qv_lock_poll(struct ixgbe_q_vector *q_vector)
 #ifdef BP_EXTENDED_STATS
 		q_vector->rx.ring->stats.yields++;
 #endif
-	} else
+	} else {
 		/* preserve yield marks */
 		q_vector->state |= IXGBE_QV_STATE_POLL;
+	}
 	spin_unlock_bh(&q_vector->lock);
 	return rc;
 }
-- 
1.8.5.GIT

^ permalink raw reply related

* [net-next 3/3] ixgbevf: bump version
From: Aaron Brown @ 2014-01-17  9:21 UTC (permalink / raw)
  To: davem; +Cc: Don Skidmore, netdev, gospo, sassmann, Aaron Brown
In-Reply-To: <1389950498-8820-1-git-send-email-aaron.f.brown@intel.com>

From: Don Skidmore <donald.c.skidmore@intel.com>

Bump the version number to better match functionality provided with out of
tree driver of the same version.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 6cf4120..9c92918 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -58,7 +58,7 @@ const char ixgbevf_driver_name[] = "ixgbevf";
 static const char ixgbevf_driver_string[] =
 	"Intel(R) 10 Gigabit PCI Express Virtual Function Network Driver";
 
-#define DRV_VERSION "2.11.3-k"
+#define DRV_VERSION "2.12.1-k"
 const char ixgbevf_driver_version[] = DRV_VERSION;
 static char ixgbevf_copyright[] =
 	"Copyright (c) 2009 - 2012 Intel Corporation.";
-- 
1.8.5.GIT

^ permalink raw reply related

* [PATCH net-next] virtio-net: fix build error when CONFIG_AVERAGE is not enabled
From: Michael Dalton @ 2014-01-17  9:27 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet

Commit ab7db91705e9 ("virtio-net: auto-tune mergeable rx buffer size for
improved performance") introduced a virtio-net dependency on EWMA.
The inclusion of EWMA is controlled by CONFIG_AVERAGE. Fix build error
when CONFIG_AVERAGE is not enabled by adding select AVERAGE to
virtio-net's Kconfig entry.

Build failure reported using config make ARCH=s390 defconfig.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
 drivers/net/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index b45b240..f342278 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -236,6 +236,7 @@ config VETH
 config VIRTIO_NET
 	tristate "Virtio network driver"
 	depends on VIRTIO
+	select AVERAGE
 	---help---
 	  This is the virtual network driver for virtio.  It can be used with
 	  lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.
-- 
1.8.5.2

^ permalink raw reply related

* RE: [PATCH 11/13] net: mvneta: implement rx_copybreak
From: David Laight @ 2014-01-17  9:28 UTC (permalink / raw)
  To: 'Willy Tarreau', David Miller
  Cc: netdev@vger.kernel.org, thomas.petazzoni@free-electrons.com,
	gregory.clement@free-electrons.com
In-Reply-To: <20140116200757.GA19969@1wt.eu>

From: Willy Tarreau 
..
> Ah that's an interesting trick! We don't have an IOMMU on this platform
> so the call is cheap. However, it relies on an I/O barrier to wait for
> a cache snoop completion. From what I read, it's not needed when going
> to the device. But when going to the CPU for the Rx case, we're calling
> it for every packet which is counter productive because I'd like to do
> it only once when entering the rx loop and avoid any other call during
> the loop. I measured that I could save 300 ns per packet by doing this
> (thus slightly modifying the DMA API to add a new dma_io_barrier function).
> 
> I think it could be interesting (at least for this platform, I don't know
> if other platforms work with cache coherent DMA which only require to
> verify end of snooping), to have two distinct set of DMA calls, something
> like this :
> 
>     rx_loop(napi, budget)
>     {
>          dma_wait_for_snoop_completion(dev);
>          ...
>          for_each_rx_packet(pkt) {
>              dma_sync_single_range_for_cpu_unless_completed(dev, pkt->addr);
>              /* use packet */
>          }
>     }

You'd need to scan the rx ring for completed entries before the waiting
for the snoop to complete - and then only process those entries.
(Or read a block of rx ring entries...)

Otherwise you might try to process a rx that finished after you
waited for the snoop to complete.

	David

^ permalink raw reply

* RE: [PATCH 11/13] net: mvneta: implement rx_copybreak
From: David Laight @ 2014-01-17  9:32 UTC (permalink / raw)
  To: 'David Miller', w@1wt.eu
  Cc: netdev@vger.kernel.org, thomas.petazzoni@free-electrons.com,
	gregory.clement@free-electrons.com
In-Reply-To: <20140116.114905.1582218754259744747.davem@davemloft.net>

From: David Miller
> I don't know if this is relevant for your hardware, but there is a trick
> with IOMMUs that I at least at one point was doing on sparc64.
> 
> I never flushed the IOMMU mappings explicitly on a dma_unmap call.
> 
> Instead I always allocate by moving the allocation cursor to higher
> addresses in the IOMMU range looking for free space.
> 
> Then when I hit the end of the range and had to wrap the allocation
> cursor back to the beginning, I flush the entire IOMMU.
> 
> So for %99 of IOMMU mapping creation and teardown calls, I didn't even
> touch the hardware.

Another option that ought to work is splitting the allocation of
the io address space and the mapping of the addresses to physical
memory.
So for rx you'd leave the rx ring pointers constant and change the
iommu to refer to the correct buffer.

	David

^ permalink raw reply

* [PATCH net] net: core: orphan frags before queuing to slow qdisc
From: Jason Wang @ 2014-01-17  9:42 UTC (permalink / raw)
  To: davem, netdev, linux-kernel; +Cc: Jason Wang, Michael S. Tsirkin

Many qdiscs can queue a packet for a long time, this will lead an issue
with zerocopy skb. It means the frags will not be orphaned in an expected
short time, this breaks the assumption that virtio-net will transmit the
packet in time.

So if guest packets were queued through such kind of qdisc and hit the
limitation of the max pending packets for virtio/vhost. All packets that
go to another destination from guest will also be blocked.

A case for reproducing the issue:

- Boot two VMs and connect them to the same bridge kvmbr.
- Setup tbf with a very low rate/burst on eth0 which is a port of kvmbr.
- Let VM1 send lots of packets thorugh eth0
- After a while, VM1 is unable to send any packets out since the number of
  pending packets (queued to tbf) were exceeds the limitation of vhost/virito

Solve this issue by orphaning the frags before queuing it to a slow qdisc (the
one without TCQ_F_CAN_BYPASS).

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 net/core/dev.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 0ce469e..1209774 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2700,6 +2700,12 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	contended = qdisc_is_running(q);
 	if (unlikely(contended))
 		spin_lock(&q->busylock);
+	if (!(q->flags & TCQ_F_CAN_BYPASS) &&
+	    unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
+		kfree_skb(skb);
+		rc = NET_XMIT_DROP;
+		goto out;
+	}

 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
@@ -2739,6 +2745,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		}
 	}
 	spin_unlock(root_lock);
+out:
 	if (unlikely(contended))
 		spin_unlock(&q->busylock);
 	return rc;
-- 
1.8.3.2

^ permalink raw reply related

* Re: [PATCH 11/13] net: mvneta: implement rx_copybreak
From: Willy Tarreau @ 2014-01-17  9:48 UTC (permalink / raw)
  To: David Laight
  Cc: David Miller, netdev@vger.kernel.org,
	thomas.petazzoni@free-electrons.com,
	gregory.clement@free-electrons.com
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D45EA2B@AcuExch.aculab.com>

On Fri, Jan 17, 2014 at 09:28:13AM +0000, David Laight wrote:
> From: Willy Tarreau 
> ..
> > Ah that's an interesting trick! We don't have an IOMMU on this platform
> > so the call is cheap. However, it relies on an I/O barrier to wait for
> > a cache snoop completion. From what I read, it's not needed when going
> > to the device. But when going to the CPU for the Rx case, we're calling
> > it for every packet which is counter productive because I'd like to do
> > it only once when entering the rx loop and avoid any other call during
> > the loop. I measured that I could save 300 ns per packet by doing this
> > (thus slightly modifying the DMA API to add a new dma_io_barrier function).
> > 
> > I think it could be interesting (at least for this platform, I don't know
> > if other platforms work with cache coherent DMA which only require to
> > verify end of snooping), to have two distinct set of DMA calls, something
> > like this :
> > 
> >     rx_loop(napi, budget)
> >     {
> >          dma_wait_for_snoop_completion(dev);
> >          ...
> >          for_each_rx_packet(pkt) {
> >              dma_sync_single_range_for_cpu_unless_completed(dev, pkt->addr);
> >              /* use packet */
> >          }
> >     }
> 
> You'd need to scan the rx ring for completed entries before the waiting
> for the snoop to complete - and then only process those entries.
> (Or read a block of rx ring entries...)
> 
> Otherwise you might try to process a rx that finished after you
> waited for the snoop to complete.

That's already the case, the number of descs to use is computed first
thing when entering the function. That's why it's already safe.

Willy

^ permalink raw reply

* RE: [PATCH RFC 1/6] net: rfkill: gpio: fix gpio name buffer size off by 1
From: David Laight @ 2014-01-17  9:46 UTC (permalink / raw)
  To: 'Chen-Yu Tsai', Johannes Berg, David S. Miller
  Cc: netdev@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-sunxi@googlegroups.com, Maxime Ripard,
	linux-wireless@vger.kernel.org
In-Reply-To: <1389941251-32692-2-git-send-email-wens@csie.org>

From: Chen-Yu Tsai
> snprintf should be passed the complete size of the buffer, including
> the space for '\0'. The previous code resulted in the *_reset and
> *_shutdown strings being truncated.
...
> diff --git a/net/rfkill/rfkill-gpio.c b/net/rfkill/rfkill-gpio.c
...
> -	snprintf(rfkill->reset_name, len + 6 , "%s_reset", rfkill->name);
> -	snprintf(rfkill->shutdown_name, len + 9, "%s_shutdown", rfkill->name);
> +	snprintf(rfkill->reset_name, len + 7 , "%s_reset", rfkill->name);
> +	snprintf(rfkill->shutdown_name, len + 10, "%s_shutdown", rfkill->name);

I can't find the context for the above, but they look very dubious.
I'd expect: snprintf(foo, sizeof foo, ...).
If you are trying to truncate rfkill->name you need to use %.*s.

	David

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: fix build error when CONFIG_AVERAGE is not enabled
From: Jason Wang @ 2014-01-17  9:51 UTC (permalink / raw)
  To: Michael Dalton, David S. Miller
  Cc: netdev, Eric Dumazet, virtualization, Michael S. Tsirkin
In-Reply-To: <1389950828-24039-1-git-send-email-mwdalton@google.com>

On 01/17/2014 05:27 PM, Michael Dalton wrote:
> Commit ab7db91705e9 ("virtio-net: auto-tune mergeable rx buffer size for
> improved performance") introduced a virtio-net dependency on EWMA.
> The inclusion of EWMA is controlled by CONFIG_AVERAGE. Fix build error
> when CONFIG_AVERAGE is not enabled by adding select AVERAGE to
> virtio-net's Kconfig entry.
>
> Build failure reported using config make ARCH=s390 defconfig.
>
> Signed-off-by: Michael Dalton <mwdalton@google.com>
> ---
>  drivers/net/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index b45b240..f342278 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -236,6 +236,7 @@ config VETH
>  config VIRTIO_NET
>  	tristate "Virtio network driver"
>  	depends on VIRTIO
> +	select AVERAGE
>  	---help---
>  	  This is the virtual network driver for virtio.  It can be used with
>  	  lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.

Acked-by: Jason Wang <jasowang@redhat.com>

^ permalink raw reply

* RE: [PATCH net-next v2 2/3] net: add trim helper and convert users
From: David Laight @ 2014-01-17  9:52 UTC (permalink / raw)
  To: 'Hannes Frederic Sowa', netdev@vger.kernel.org
  Cc: Daniel Borkmann, Jakub Zawadzki, Eric Dumazet,
	linux-kernel@vger.kernel.org
In-Reply-To: <1389918519-23779-3-git-send-email-hannes@stressinduktion.org>

From: Hannes Frederic Sowa
...
> +/**
> + * trim - perform a reciprocal multiplication in order to "clamp" a
> + *        value into range [0, ep_ro), where the upper interval
> + *        endpoint is right-open. This is useful, e.g. for accessing
> + *        a index of an array containing ep_ro elements, for example.
> + *        Think of it as sort of modulus, only that the result isn't
> + *        that of modulo. ;)
> + *        More: http://homepage.cs.uiowa.edu/~jones/bcd/divide.html

It isn't appropriate to put urls into code comments (or commit messages).
They are very likely to get out of date.

	David

^ permalink raw reply

* Re: [PATCH net-next] bonding: trivial: rename slave->jiffies to ->last_link_up
From: Veaceslav Falico @ 2014-01-17  9:53 UTC (permalink / raw)
  To: netdev; +Cc: Jay Vosburgh, Andy Gospodarek
In-Reply-To: <1389838837-10283-1-git-send-email-vfalico@redhat.com>

On Thu, Jan 16, 2014 at 03:20:37AM +0100, Veaceslav Falico wrote:
>slave->jiffies is updated every time the slave becomes active, which, for
>bonding, means that its link is 'up'.
>
>CC: Jay Vosburgh <fubar@us.ibm.com>
>CC: Andy Gospodarek <andy@greyhouse.net>
>Signed-off-by: Veaceslav Falico <vfalico@redhat.com>

Self-NAK, as the first patchset was dropped and will be sent as v2, I'll
just slip this patch in there.

Sorry for the noise.

>---
>
>Notes:
>    On top of
>
>    [PATCH net-next 0/6] bonding: only rely on arp packets if arp monitor is used
>
>    Sorry for the trivial patch, but I'm always loosing some time
>    trying to remember what that actually means.
>
> drivers/net/bonding/bond_main.c | 20 ++++++++++----------
> drivers/net/bonding/bonding.h   |  3 ++-
> 2 files changed, 12 insertions(+), 11 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index f9e5512..0f613ae 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -846,7 +846,7 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
> 		return;
>
> 	if (new_active) {
>-		new_active->jiffies = jiffies;
>+		new_active->last_link_up = jiffies;
>
> 		if (new_active->link == BOND_LINK_BACK) {
> 			if (USES_PRIMARY(bond->params.mode)) {
>@@ -1488,7 +1488,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
> 	}
>
> 	if (new_slave->link != BOND_LINK_DOWN)
>-		new_slave->jiffies = jiffies;
>+		new_slave->last_link_up = jiffies;
> 	pr_debug("Initial state of slave_dev is BOND_LINK_%s\n",
> 		new_slave->link == BOND_LINK_DOWN ? "DOWN" :
> 			(new_slave->link == BOND_LINK_UP ? "UP" : "BACK"));
>@@ -1923,7 +1923,7 @@ static int bond_miimon_inspect(struct bonding *bond)
> 				 * recovered before downdelay expired
> 				 */
> 				slave->link = BOND_LINK_UP;
>-				slave->jiffies = jiffies;
>+				slave->last_link_up = jiffies;
> 				pr_info("%s: link status up again after %d ms for interface %s.\n",
> 					bond->dev->name,
> 					(bond->params.downdelay - slave->delay) *
>@@ -1998,7 +1998,7 @@ static void bond_miimon_commit(struct bonding *bond)
>
> 		case BOND_LINK_UP:
> 			slave->link = BOND_LINK_UP;
>-			slave->jiffies = jiffies;
>+			slave->last_link_up = jiffies;
>
> 			if (bond->params.mode == BOND_MODE_8023AD) {
> 				/* prevent it from being the active one */
>@@ -2345,7 +2345,7 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond,
> 		bond_validate_arp(bond, slave, sip, tip);
> 	else if (bond->curr_active_slave &&
> 		 time_after(slave_last_arp_rx(bond, bond->curr_active_slave),
>-			    bond->curr_active_slave->jiffies))
>+			    bond->curr_active_slave->last_link_up))
> 		bond_validate_arp(bond, slave, tip, sip);
>
> out_unlock:
>@@ -2392,9 +2392,9 @@ static void bond_loadbalance_arp_mon(struct work_struct *work)
> 	oldcurrent = ACCESS_ONCE(bond->curr_active_slave);
> 	/* see if any of the previous devices are up now (i.e. they have
> 	 * xmt and rcv traffic). the curr_active_slave does not come into
>-	 * the picture unless it is null. also, slave->jiffies is not needed
>-	 * here because we send an arp on each slave and give a slave as
>-	 * long as it needs to get the tx/rx within the delta.
>+	 * the picture unless it is null. also, slave->last_link_up is not
>+	 * needed here because we send an arp on each slave and give a slave
>+	 * as long as it needs to get the tx/rx within the delta.
> 	 * TODO: what about up/down delay in arp mode? it wasn't here before
> 	 *       so it can wait
> 	 */
>@@ -2516,7 +2516,7 @@ static int bond_ab_arp_inspect(struct bonding *bond)
> 		 * active.  This avoids bouncing, as the last receive
> 		 * times need a full ARP monitor cycle to be updated.
> 		 */
>-		if (bond_time_in_interval(bond, slave->jiffies, 2))
>+		if (bond_time_in_interval(bond, slave->last_link_up, 2))
> 			continue;
>
> 		/*
>@@ -2709,7 +2709,7 @@ static void bond_ab_arp_probe(struct bonding *bond)
> 	new_slave->link = BOND_LINK_BACK;
> 	bond_set_slave_active_flags(new_slave);
> 	bond_arp_send_all(bond, new_slave);
>-	new_slave->jiffies = jiffies;
>+	new_slave->last_link_up = jiffies;
> 	rcu_assign_pointer(bond->current_arp_slave, new_slave);
> }
>
>diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>index 99126b2..9f07af1 100644
>--- a/drivers/net/bonding/bonding.h
>+++ b/drivers/net/bonding/bonding.h
>@@ -184,7 +184,8 @@ struct slave {
> 	struct net_device *dev; /* first - useful for panic debug */
> 	struct bonding *bond; /* our master */
> 	int    delay;
>-	unsigned long jiffies;
>+	/* all three in jiffies */
>+	unsigned long last_link_up;
> 	unsigned long last_arp_rx;
> 	unsigned long target_last_arp_rx[BOND_MAX_ARP_TARGETS];
> 	s8     link;    /* one of BOND_LINK_XXXX */
>-- 
>1.8.4
>

^ permalink raw reply

* Re: [PATCH RFC 1/6] net: rfkill: gpio: fix gpio name buffer size off by 1
From: Chen-Yu Tsai @ 2014-01-17  9:59 UTC (permalink / raw)
  To: David Laight
  Cc: Johannes Berg, David S. Miller,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org,
	Maxime Ripard,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D45EA9D-VkEWCZq2GCInGFn1LkZF6NBPR1lH4CV8@public.gmane.org>

On Fri, Jan 17, 2014 at 5:46 PM, David Laight <David.Laight-JxhZ9S5GRejQT0dZR+AlfA@public.gmane.org> wrote:
> From: Chen-Yu Tsai
>> snprintf should be passed the complete size of the buffer, including
>> the space for '\0'. The previous code resulted in the *_reset and
>> *_shutdown strings being truncated.
> ...
>> diff --git a/net/rfkill/rfkill-gpio.c b/net/rfkill/rfkill-gpio.c
> ...
>> -     snprintf(rfkill->reset_name, len + 6 , "%s_reset", rfkill->name);
>> -     snprintf(rfkill->shutdown_name, len + 9, "%s_shutdown", rfkill->name);
>> +     snprintf(rfkill->reset_name, len + 7 , "%s_reset", rfkill->name);
>> +     snprintf(rfkill->shutdown_name, len + 10, "%s_shutdown", rfkill->name);
>
> I can't find the context for the above, but they look very dubious.
> I'd expect: snprintf(foo, sizeof foo, ...).
> If you are trying to truncate rfkill->name you need to use %.*s.

The driver allocates these buffers on the fly, a few lines above:

        len = strlen(rfkill->name);
        rfkill->reset_name = devm_kzalloc(&pdev->dev, len + 7, GFP_KERNEL);
        rfkill->shutdown_name = devm_kzalloc(&pdev->dev, len + 10, GFP_KERNEL);

I am not trying to truncate rfkill->name. Rather, the buffer length passed
to snprintf was wrong, so the resulting name was truncated by one character.


Thanks,
ChenYu

^ permalink raw reply

* Re: [PATCH net-next v2 2/3] net: add trim helper and convert users
From: Daniel Borkmann @ 2014-01-17 10:06 UTC (permalink / raw)
  To: David Laight
  Cc: 'Hannes Frederic Sowa', netdev@vger.kernel.org,
	Jakub Zawadzki, Eric Dumazet, linux-kernel@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D45EAC0@AcuExch.aculab.com>

On 01/17/2014 10:52 AM, David Laight wrote:
> From: Hannes Frederic Sowa
> ...
>> +/**
>> + * trim - perform a reciprocal multiplication in order to "clamp" a
>> + *        value into range [0, ep_ro), where the upper interval
>> + *        endpoint is right-open. This is useful, e.g. for accessing
>> + *        a index of an array containing ep_ro elements, for example.
>> + *        Think of it as sort of modulus, only that the result isn't
>> + *        that of modulo. ;)
>> + *        More: http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
>
> It isn't appropriate to put urls into code comments (or commit messages).
> They are very likely to get out of date.

Then please grep the git log for some references/urls, you'll find plenty.
This link is just a pointer for people interested to read some more
background; there's no need to overly elaborate in the kernel doc right
here.

Thanks !

> 	David
>
>
>

^ permalink raw reply

* RE: [PATCH net-next 2/2] reciprocal_divide: correction/update of the algorithm
From: David Laight @ 2014-01-17 10:05 UTC (permalink / raw)
  To: 'Hannes Frederic Sowa', Christoph Lameter
  Cc: Daniel Borkmann, davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Eric Dumazet, Austin S Hemmelgarn,
	Jesse Gross, Jamal Hadi Salim, Stephen Hemminger, Matt Mackall,
	Pekka Enberg, Andy Gospodarek, Veaceslav Falico, Jay Vosburgh,
	Jakub Zawadzki
In-Reply-To: <20140116172406.GA17529@order.stressinduktion.org>

From: Hannes Frederic Sowa
> On Thu, Jan 16, 2014 at 10:37:37AM -0600, Christoph Lameter wrote:
> > On Thu, 16 Jan 2014, Daniel Borkmann wrote:
> >
> > > - * or else the performance is slower than a normal divide.
> > > - */
> > > -extern u32 reciprocal_value(u32 B);
> > > +struct reciprocal_value {
> > > +	u32 m;
> > > +	u8 sh1, sh2;
> > > +};
> > >
> > > +#define RECIPROCAL_VALUE_RESULT_TO_ZERO ((struct reciprocal_value){.sh1 = 32})
> > >
> > > -static inline u32 reciprocal_divide(u32 A, u32 R)
> > > +struct reciprocal_value reciprocal_value(u32 d);
> >
> > A function that returns a struct? That works? Which gcc versions support
> > it?
> 
> Sure, that works and I actually like it. This is supported by the c standard,
> but please don't ask me since which one. ;)

I think it has always been supported.
Originally by adding an extra hidden argument that points to an on-stack
structure.
More recent ABI tend to allow 'small' structures be returned in registers.
(Typically ones that would fit in the register(s) used for an integral result).

This doesn't mean that it is a good idea!

OTOH if the function is 'static inline' (and inlined) it probably doesn't matter.

	David


^ permalink raw reply

* Re: [PATCH 4/11]  use ether_addr_equal_64bits
From: Dan Carpenter @ 2014-01-17 10:18 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Henrique de Moraes Holschuh, Julia Lawall,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA, Emmanuel Grumbach,
	Intel Linux Wireless, John W. Linville,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20140106104802.GN30234@mwanda>

We're worried about reading beyond the end of the array and it's a heap
allocation and the last char of the eth addr is the last byte of the
page.  This causes an oops.

It's almost impossible to hit that bug.

1) You would have to have the eth addr at the end of the array.
2) It would have to be a packed struct.
3) The struct size would have to be a multiple of 4 because otherwise we
   can't put it at the end of the page.
4) It would need to be allocated on the heap.

You add all those up which is pretty rare so I wasn't able to find
anything like that.  Then you have to get extremely unlucky.

The closest thing I could find were a couple places like like:

static struct mac_addr null_mac_addr = { { 0, 0, 0, 0, 0, 0 } };

It meets criteria 1 and 2 but not 3 and 4.

regards,
dan carpenter

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v2] sch_htb: let skb->priority refer to non-leaf class
From: Harry Mason @ 2014-01-17 10:19 UTC (permalink / raw)
  To: Eric Dumazet, Jamal Hadi Salim; +Cc: linux-netdev
In-Reply-To: <1389889520.31367.403.camel@edumazet-glaptop2.roam.corp.google.com>

If the class in skb->priority is not a leaf, apply filters from the
selected class, not the qdisc. This lets netfilter or user space
partially classify the packet.

Signed-off-by: Harry Mason <harry.mason@smoothwall.net>
---

On Thu, 2014-01-16 at 08:25 -0800, Eric Dumazet wrote:
> On Thu, 2014-01-16 at 14:45 +0000, Harry Mason wrote:
> 
>> + /* Start with inner filter chain if a non-leaf class is selected */
>> + if (cl)
>> +     tcf = cl->filter_list;
>> + else
>> +     tcf = q->filter_list;
> 
> Could this break some existing htb setups ?

I think it is unlikely. Setting skb->priority to a non-leaf class would
be equivalent to setting it to the base qdisc. In theory an application
might rely on this if it expects the classes to be dynamic, but adding
a filter could restore the old behaviour.

To me this is intuitively how it should behave, and reproduces what would
happen if a tc filter instead of netfilter had first assigned the
non-leaf class.

> Also we test cl being NULL at line 222, it would be nice to not
> test it again...

Updated below.

 net/sched/sch_htb.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 717b210..8073d92 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -219,11 +219,15 @@ static struct htb_class *htb_classify(struct sk_buff *skb, struct Qdisc *sch,
 	if (skb->priority == sch->handle)
 		return HTB_DIRECT;	/* X:0 (direct flow) selected */
 	cl = htb_find(skb->priority, sch);
-	if (cl && cl->level == 0)
-		return cl;
+	if (cl) {
+		if (cl->level == 0)
+			return cl;
+		/* Start with inner filter chain if a non-leaf class is selected */
+		tcf = cl->filter_list;
+	} else
+		tcf = q->filter_list;
 
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
-	tcf = q->filter_list;
 	while (tcf && (result = tc_classify(skb, tcf, &res)) >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
 		switch (result) {
-- 
1.7.10.4

^ permalink raw reply related

* RE: PROBLEM: usbnet / ax88179_178a: Panic in usb_hcd_map_urb_for_dma
From: David Laight @ 2014-01-17 10:38 UTC (permalink / raw)
  To: 'Thomas Kear', netdev@vger.kernel.org
In-Reply-To: <CAHNSM4Kv5gcJn=H4k8QjPggWzjF0zRm_t_icAYXjCTqXx4xL2w@mail.gmail.com>

From: Thomas Kear
> USB3 gigabit ethernet adapters with the ASIX AX88179 chipset (LevelOne
> USB0401-V3, Plugable USB3-E1000, SIIG JU-NE0211-S1 and others) are
> experiencing kernel panics in usb_hcd_map_urb_for_dma since 3.12...

I've worked out why I didn't hit that in my ax8179_18a testing.
I was running with a patch to xhci-ring.c that sends the zero length
packet, so the usbnet code wasn't adding the extra zero byte.

That patch is somewhere in Sarah's 'pending' queue.

	David

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox