Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] can: dev: fix nlmsg size calculation in can_get_size()
From: Marc Kleine-Budde @ 2013-10-07 14:14 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: netdev, linux-can, kernel
In-Reply-To: <52507B7D.6030008@grandegger.com>

[-- Attachment #1: Type: text/plain, Size: 1045 bytes --]

On 10/05/2013 10:50 PM, Wolfgang Grandegger wrote:
> On 10/05/2013 09:25 PM, Marc Kleine-Budde wrote:
>> This patch fixes the calculation of the nlmsg size, by adding the missing
>> nla_total_size().
>>
>> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
>> ---
>> Hello,
>>
>> this patch touches the rtnl_link_ops get_size() callback:
>>
>>     static struct rtnl_link_ops can_link_ops __read_mostly = {
>>     ...
>>     	.get_size	= can_get_size,
>>     ...
>>     };
>>
>> By looking at other nlmsg size calculation I think a nla_total_size() for all
>> contributers is needed. Am I correct?
> 
> Yes, seems so, nla_put() calls this code:
> 
>   http://lxr.free-electrons.com/source/lib/nlattr.c#L328

Is this an Acked-by? :)

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply

* Re: [PATCHv1 net] xen-netback: transition to CLOSED when removing a VIF
From: Wei Liu @ 2013-10-07 14:15 UTC (permalink / raw)
  To: David Vrabel
  Cc: Wei Liu, xen-devel, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	netdev, Ian Campbell, Paul Durrant
In-Reply-To: <5252BDD1.1000301@citrix.com>

On Mon, Oct 07, 2013 at 02:57:37PM +0100, David Vrabel wrote:
> On 07/10/13 14:43, Wei Liu wrote:
> > On Mon, Oct 07, 2013 at 01:55:19PM +0100, David Vrabel wrote:
> >> From: David Vrabel <david.vrabel@citrix.com>
> >>
> >> If a guest is destroyed without transitioning its frontend to CLOSED,
> >> the domain becomes a zombie as netback was not grant unmapping the
> >> shared rings.
> >>
> >> When removing a VIF, transition the backend to CLOSED so the VIF is
> >> disconnected if necessary (which will unmap the shared rings etc).
> >>
> >> This fixes a regression introduced by
> >> 279f438e36c0a70b23b86d2090aeec50155034a9 (xen-netback: Don't destroy
> >> the netdev until the vif is shut down).
> >>
> > 
> > Is this regression solely caused by 279f438e36c or caused by both
> > ea732dff5c and 279f438e36c? I ask because you make use of the new state
> > machine introduced in ea732dff5c. Or are you simply using the new state
> > machine to fix the regression instead of going back to old code?
> 
> I bisected it to 279f438.  I'm just using the handy new state machine to
> fix it.
> 

Thanks for the explanation.

Acked-by: Wei Liu <wei.liu2@citrix.com>

Wei.

> David

^ permalink raw reply

* Re: [PATCH] can: dev: fix nlmsg size calculation in  can_get_size()
From: Wolfgang Grandegger @ 2013-10-07 14:24 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: netdev, linux-can, kernel
In-Reply-To: <5252C1CC.4060809@pengutronix.de>

On Mon, 07 Oct 2013 16:14:36 +0200, Marc Kleine-Budde <mkl@pengutronix.de>

wrote:

> On 10/05/2013 10:50 PM, Wolfgang Grandegger wrote:

>> On 10/05/2013 09:25 PM, Marc Kleine-Budde wrote:

>>> This patch fixes the calculation of the nlmsg size, by adding the

>>> missing

>>> nla_total_size().

>>>

>>> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

>>> ---

>>> Hello,

>>>

>>> this patch touches the rtnl_link_ops get_size() callback:

>>>

>>>     static struct rtnl_link_ops can_link_ops __read_mostly = {

>>>     ...

>>>     	.get_size	= can_get_size,

>>>     ...

>>>     };

>>>

>>> By looking at other nlmsg size calculation I think a nla_total_size()

>>> for all

>>> contributers is needed. Am I correct?

>> 

>> Yes, seems so, nla_put() calls this code:

>> 

>>   http://lxr.free-electrons.com/source/lib/nlattr.c#L328

> 

> Is this an Acked-by? :)



Yep, obviously a long time ago that I did something for Linux-CAN :(.



Wolfgang.



> 

> Marc

^ permalink raw reply

* Re: [PATCH] can: dev: fix nlmsg size calculation in  can_get_size()
From: Marc Kleine-Budde @ 2013-10-07 14:26 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: netdev, linux-can, kernel
In-Reply-To: <705c23ba0332d9658b8760cd5d460e8a@grandegger.com>

[-- Attachment #1: Type: text/plain, Size: 500 bytes --]

On 10/07/2013 04:24 PM, Wolfgang Grandegger wrote:

>> Is this an Acked-by? :)
> Yep, obviously a long time ago that I did something for Linux-CAN :(.

Thx, there are some netlink related patches coming soon :)

Marc
-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply

* RE: [PATCHv1 net] xen-netback: transition to CLOSED when removing a VIF
From: Paul Durrant @ 2013-10-07 14:28 UTC (permalink / raw)
  To: David Vrabel, xen-devel@lists.xen.org
  Cc: David Vrabel, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	netdev@vger.kernel.org, Ian Campbell, Wei Liu
In-Reply-To: <1381150519-14557-1-git-send-email-david.vrabel@citrix.com>

> -----Original Message-----
> From: David Vrabel [mailto:david.vrabel@citrix.com]
> Sent: 07 October 2013 13:55
> To: xen-devel@lists.xen.org
> Cc: David Vrabel; Konrad Rzeszutek Wilk; Boris Ostrovsky;
> netdev@vger.kernel.org; Ian Campbell; Wei Liu; Paul Durrant
> Subject: [PATCHv1 net] xen-netback: transition to CLOSED when removing a
> VIF
> 
> From: David Vrabel <david.vrabel@citrix.com>
> 
> If a guest is destroyed without transitioning its frontend to CLOSED,
> the domain becomes a zombie as netback was not grant unmapping the
> shared rings.
> 
> When removing a VIF, transition the backend to CLOSED so the VIF is
> disconnected if necessary (which will unmap the shared rings etc).
> 
> This fixes a regression introduced by
> 279f438e36c0a70b23b86d2090aeec50155034a9 (xen-netback: Don't destroy
> the netdev until the vif is shut down).
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Cc: Ian Campbell <ian.campbell@citrix.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> Cc: Paul Durrant <Paul.Durrant@citrix.com>
> ---
>  drivers/net/xen-netback/xenbus.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-
> netback/xenbus.c
> index b45bce2..1b08d87 100644
> --- a/drivers/net/xen-netback/xenbus.c
> +++ b/drivers/net/xen-netback/xenbus.c
> @@ -39,11 +39,15 @@ static int connect_rings(struct backend_info *);
>  static void connect(struct backend_info *);
>  static void backend_create_xenvif(struct backend_info *be);
>  static void unregister_hotplug_status_watch(struct backend_info *be);
> +static void set_backend_state(struct backend_info *be,
> +			      enum xenbus_state state);
> 
>  static int netback_remove(struct xenbus_device *dev)
>  {
>  	struct backend_info *be = dev_get_drvdata(&dev->dev);
> 
> +	set_backend_state(be, XenbusStateClosed);
> +
>  	unregister_hotplug_status_watch(be);
>  	if (be->vif) {
>  		kobject_uevent(&dev->dev.kobj, KOBJ_OFFLINE);
> --
> 1.7.2.5

Looks good to me.

Reviewed-by:  Paul Durrant <paul.durrant@citrix.com>

  Paul

^ permalink raw reply

* Re: IPv6 path MTU discovery broken
From: Hannes Frederic Sowa @ 2013-10-07 14:32 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: netdev, edumazet, fan.du
In-Reply-To: <20131007083228.GB18903@sesse.net>

On Mon, Oct 07, 2013 at 10:32:28AM +0200, Steinar H. Gunderson wrote:
> On Mon, Oct 07, 2013 at 05:09:10AM +0200, Hannes Frederic Sowa wrote:
> > Could you try to check (with e.g. nstat) if any of the following counters
> > change if the icmp messages hit the host?
> > 
> > TcpExtOutOfWindowIcmps
> > Icmp6InErrors
> > TcpExtTCPMinTTLDrop
> > TcpExtListenDrops
> 
> Icmp6InErrors is 0, so that's not it. How do I find the Tcp* counters?
> None of them show up in nstat, although other Tcp* counters do.

Strange, I have them in nstat:

$ nstat -z | egrep  '(TcpExtOutOfWindowIcmps|Icmp6InErrors|TcpExtTCPMinTTLDrop|TcpExtListenDrops)'
Icmp6InErrors                   0                  0.0
TcpExtOutOfWindowIcmps          0                  0.0
TcpExtListenDrops               0                  0.0
TcpExtTCPMinTTLDrop             0                  0.0

Otherwise you can manually extract them from /proc/net/netstat.

Greetings,

  Hannes

^ permalink raw reply

* Re: IPv6 path MTU discovery broken
From: Steinar H. Gunderson @ 2013-10-07 14:34 UTC (permalink / raw)
  To: netdev, edumazet, fan.du
In-Reply-To: <20131007143206.GH9295@order.stressinduktion.org>

On Mon, Oct 07, 2013 at 04:32:06PM +0200, Hannes Frederic Sowa wrote:
> Strange, I have them in nstat:
> 
> $ nstat -z | egrep  '(TcpExtOutOfWindowIcmps|Icmp6InErrors|TcpExtTCPMinTTLDrop|TcpExtListenDrops)'

-z was the trick. (I've never used nstat before.)

pannekake:~> nstat -z | egrep '(TcpExtOutOfWindowIcmps|Icmp6InErrors|TcpExtTCPMinTTLDrop|TcpExtListenDrops)'              
Icmp6InErrors                   0                  0.0
TcpExtOutOfWindowIcmps          4                  0.0
TcpExtListenDrops               0                  0.0
TcpExtTCPMinTTLDrop             0                  0.0

Next time it triggers, I'll see if TcpExtOutOfWindowIcmps increases, then.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

^ permalink raw reply

* [PATCH net-next] net_sched: increment drop counters in qdisc_tree_decrease_qlen()
From: Eric Dumazet @ 2013-10-07 15:32 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

qdisc_tree_decrease_qlen() is called when some packets are dropped
on a qdisc, and we want to notify parents of qlen changes.

We also can increment parents qdisc qstats drop counters.

This permits more accurate drop counters up to root qdisc.

For example a graft operation typically resets a qdisc 
(drops all packets) and call qdisc_tree_decrease_qlen()

Note that callers are responsible for their drop counters.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/sched/sch_api.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 2adda7f..cd81505 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -737,9 +737,11 @@ void qdisc_tree_decrease_qlen(struct Qdisc *sch, unsigned int n)
 	const struct Qdisc_class_ops *cops;
 	unsigned long cl;
 	u32 parentid;
+	int drops;
 
 	if (n == 0)
 		return;
+	drops = max_t(int, n, 0);
 	while ((parentid = sch->parent)) {
 		if (TC_H_MAJ(parentid) == TC_H_MAJ(TC_H_INGRESS))
 			return;
@@ -756,6 +758,7 @@ void qdisc_tree_decrease_qlen(struct Qdisc *sch, unsigned int n)
 			cops->put(sch, cl);
 		}
 		sch->q.qlen -= n;
+		sch->qstats.drops += drops;
 	}
 }
 EXPORT_SYMBOL(qdisc_tree_decrease_qlen);

^ permalink raw reply related

* Re: IPv6 path MTU discovery broken
From: Eric Dumazet @ 2013-10-07 15:49 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: netdev, edumazet, fan.du
In-Reply-To: <20131007143404.GA24997@sesse.net>

On Mon, 2013-10-07 at 16:34 +0200, Steinar H. Gunderson wrote:
> On Mon, Oct 07, 2013 at 04:32:06PM +0200, Hannes Frederic Sowa wrote:
> > Strange, I have them in nstat:
> > 
> > $ nstat -z | egrep  '(TcpExtOutOfWindowIcmps|Icmp6InErrors|TcpExtTCPMinTTLDrop|TcpExtListenDrops)'
> 
> -z was the trick. (I've never used nstat before.)

Best might be to use -a  (absolute counters).

^ permalink raw reply

* Re: [PATCH] static_key: WARN on usage before jump_label_init was called
From: Steven Rostedt @ 2013-10-07 15:51 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: netdev, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Jason Baron, Peter Zijlstra, Eric Dumazet,
	andi @ firstfloor. org David S. Miller, x86
In-Reply-To: <20131006182919.GD9295@order.stressinduktion.org>

On Sun, 6 Oct 2013 20:29:19 +0200
Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:


> diff --git a/lib/jump_label_initialized.c b/lib/jump_label_initialized.c
> new file mode 100644
> index 0000000..a668a40
> --- /dev/null
> +++ b/lib/jump_label_initialized.c
> @@ -0,0 +1,6 @@
> +#include <linux/types.h>
> +#include <linux/cache.h>
> +
> +bool static_key_initialized __read_mostly = false;
> +EXPORT_SYMBOL_GPL(static_key_initialized);
> +

So far, the only thing I don't like about this patch is the creation of
this file for the sole purpose of adding this variable.

Perhaps it can just be added to init/main.c?

-- Steve

^ permalink raw reply

* [PATCH net-next v3 1/3] udp: Only allow busy read/poll on connected sockets
From: Shawn Bohrer @ 2013-10-07 16:01 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, tomk, Eric Dumazet, Shawn Bohrer
In-Reply-To: <1381161700-14453-1-git-send-email-shawn.bohrer@gmail.com>

From: Shawn Bohrer <sbohrer@rgmadvisors.com>

UDP sockets can receive packets from multiple endpoints and thus may be
received on multiple receive queues.  Since packets packets can arrive
on multiple receive queues we should not mark the napi_id for all
packets.  This makes busy read/poll only work for connected UDP sockets.

This additionally enables busy read/poll for UDP multicast packets as
long as the socket is connected by moving the check into
__udp_queue_rcv_skb().

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/udp.c |    5 +++--
 net/ipv6/udp.c |    5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c41833e..5950e12 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1405,8 +1405,10 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	int rc;
 
-	if (inet_sk(sk)->inet_daddr)
+	if (inet_sk(sk)->inet_daddr) {
 		sock_rps_save_rxhash(sk, skb);
+		sk_mark_napi_id(sk, skb);
+	}
 
 	rc = sock_queue_rcv_skb(sk, skb);
 	if (rc < 0) {
@@ -1716,7 +1718,6 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (sk != NULL) {
 		int ret;
 
-		sk_mark_napi_id(sk, skb);
 		ret = udp_queue_rcv_skb(sk, skb);
 		sock_put(sk);
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8119791..3753247 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -549,8 +549,10 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	int rc;
 
-	if (!ipv6_addr_any(&inet6_sk(sk)->daddr))
+	if (!ipv6_addr_any(&inet6_sk(sk)->daddr)) {
 		sock_rps_save_rxhash(sk, skb);
+		sk_mark_napi_id(sk, skb);
+	}
 
 	rc = sock_queue_rcv_skb(sk, skb);
 	if (rc < 0) {
@@ -844,7 +846,6 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (sk != NULL) {
 		int ret;
 
-		sk_mark_napi_id(sk, skb);
 		ret = udpv6_queue_rcv_skb(sk, skb);
 		sock_put(sk);
 
-- 
1.7.7.6

^ permalink raw reply related

* [PATCH net-next v3 0/3] Improve UDP multicast receive latency
From: Shawn Bohrer @ 2013-10-07 16:01 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, tomk, Eric Dumazet, Shawn Bohrer

From: Shawn Bohrer <sbohrer@rgmadvisors.com>

The removal of the routing cache in 3.6 had impacted the latency of our
UDP multicast workload.  This patch series brings down the latency to
what we were seeing with 3.4.

Patch 1 "udp: Only allow busy read/poll on connected sockets" is mostly
done for correctness and because it allows unifying the unicast and
multicast paths when a socket is found in early demux.  It can also
improve latency for a connected multicast socket if busy read/poll is
used.

Patches 2&3 remove the fib lookups and restore latency for our workload
to the pre 3.6 levels.

Benchmark results from a netperf UDP_RR test:
v3.12-rc3-447-g40dc9ab kernel   87961.22 transactions/s
v3.12-rc3-447-g40dc9ab + series 90587.62 transactions/s

Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
v3.12-rc3-447-g40dc9ab kernel   12.97us RTT
v3.12-rc3-447-g40dc9ab + series 12.48us RTT

v2 Changes:
- Unicast UDP early demux now requires an exact socket match and only
  tests first socket in UDP hash chain.
- ipv4_pktinfo_prepare() now takes a const struct sock*

v3 Changes:
- Use secondary hash for UDP unicast early demux lookup
- Double check socket match after increasing refcount in both unicast
  and multicast early demux lookups

Shawn Bohrer (3):
  udp: Only allow busy read/poll on connected sockets
  udp: ipv4: Add udp early demux
  net: ipv4 only populate IP_PKTINFO when needed

 include/net/ip.h       |    2 +-
 include/net/sock.h     |    2 +-
 include/net/udp.h      |    1 +
 net/ipv4/af_inet.c     |    1 +
 net/ipv4/ip_sockglue.c |    5 +-
 net/ipv4/raw.c         |    2 +-
 net/ipv4/udp.c         |  209 +++++++++++++++++++++++++++++++++++++++++++-----
 net/ipv6/udp.c         |    5 +-
 8 files changed, 199 insertions(+), 28 deletions(-)

-- 
1.7.7.6

^ permalink raw reply

* [PATCH net-next v3 2/3] udp: ipv4: Add udp early demux
From: Shawn Bohrer @ 2013-10-07 16:01 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, tomk, Eric Dumazet, Shawn Bohrer
In-Reply-To: <1381161700-14453-1-git-send-email-shawn.bohrer@gmail.com>

From: Shawn Bohrer <sbohrer@rgmadvisors.com>

The removal of the routing cache introduced a performance regression for
some UDP workloads since a dst lookup must be done for each packet.
This change caches the dst per socket in a similar manner to what we do
for TCP by implementing early_demux.

For UDP multicast we can only cache the dst if there is only one
receiving socket on the host.  Since caching only works when there is
one receiving socket we do the multicast socket lookup using RCU.

For UDP unicast we only demux sockets with an exact match in order to
not break forwarding setups.  Additionally since the hash chains may be
long we only check the first socket to see if it is a match and not
waste extra time searching the whole chain when we might not find an
exact match.

Benchmark results from a netperf UDP_RR test:
Before 87961.22 transactions/s
After  89789.68 transactions/s

Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.97us RTT
After  12.63us RTT

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
---
v3 Changes:
- Use secondary hash for UDP unicast early demux lookup
- Double check socket match after increasing refcount in both unicast
  and multicast early demux lookups

 include/net/sock.h |    2 +-
 include/net/udp.h  |    1 +
 net/ipv4/af_inet.c |    1 +
 net/ipv4/udp.c     |  202 +++++++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 187 insertions(+), 19 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index e3bf213..7953254 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -218,7 +218,7 @@ struct cg_proto;
   *	@sk_lock:	synchronizer
   *	@sk_rcvbuf: size of receive buffer in bytes
   *	@sk_wq: sock wait queue and async head
-  *	@sk_rx_dst: receive input route used by early tcp demux
+  *	@sk_rx_dst: receive input route used by early demux
   *	@sk_dst_cache: destination cache
   *	@sk_dst_lock: destination cache lock
   *	@sk_policy: flow policy
diff --git a/include/net/udp.h b/include/net/udp.h
index 510b8cb..fe4ba9f 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -175,6 +175,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 		     unsigned int hash2_nulladdr);
 
 /* net/ipv4/udp.c */
+void udp_v4_early_demux(struct sk_buff *skb);
 int udp_get_port(struct sock *sk, unsigned short snum,
 		 int (*saddr_cmp)(const struct sock *,
 				  const struct sock *));
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index cfeb85c..35913fb 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1546,6 +1546,7 @@ static const struct net_protocol tcp_protocol = {
 };
 
 static const struct net_protocol udp_protocol = {
+	.early_demux =	udp_v4_early_demux,
 	.handler =	udp_rcv,
 	.err_handler =	udp_err,
 	.no_policy =	1,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5950e12..262ea39 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -103,6 +103,7 @@
 #include <linux/seq_file.h>
 #include <net/net_namespace.h>
 #include <net/icmp.h>
+#include <net/inet_hashtables.h>
 #include <net/route.h>
 #include <net/checksum.h>
 #include <net/xfrm.h>
@@ -565,6 +566,26 @@ struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
 }
 EXPORT_SYMBOL_GPL(udp4_lib_lookup);
 
+static inline bool __udp_is_mcast_sock(struct net *net, struct sock *sk,
+				       __be16 loc_port, __be32 loc_addr,
+				       __be16 rmt_port, __be32 rmt_addr,
+				       int dif, unsigned short hnum)
+{
+	struct inet_sock *inet = inet_sk(sk);
+
+	if (!net_eq(sock_net(sk), net) ||
+	    udp_sk(sk)->udp_port_hash != hnum ||
+	    (inet->inet_daddr && inet->inet_daddr != rmt_addr) ||
+	    (inet->inet_dport != rmt_port && inet->inet_dport) ||
+	    (inet->inet_rcv_saddr && inet->inet_rcv_saddr != loc_addr) ||
+	    ipv6_only_sock(sk) ||
+	    (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif))
+		return false;
+	if (!ip_mc_sf_allow(sk, loc_addr, rmt_addr, dif))
+		return false;
+	return true;
+}
+
 static inline struct sock *udp_v4_mcast_next(struct net *net, struct sock *sk,
 					     __be16 loc_port, __be32 loc_addr,
 					     __be16 rmt_port, __be32 rmt_addr,
@@ -575,20 +596,11 @@ static inline struct sock *udp_v4_mcast_next(struct net *net, struct sock *sk,
 	unsigned short hnum = ntohs(loc_port);
 
 	sk_nulls_for_each_from(s, node) {
-		struct inet_sock *inet = inet_sk(s);
-
-		if (!net_eq(sock_net(s), net) ||
-		    udp_sk(s)->udp_port_hash != hnum ||
-		    (inet->inet_daddr && inet->inet_daddr != rmt_addr) ||
-		    (inet->inet_dport != rmt_port && inet->inet_dport) ||
-		    (inet->inet_rcv_saddr &&
-		     inet->inet_rcv_saddr != loc_addr) ||
-		    ipv6_only_sock(s) ||
-		    (s->sk_bound_dev_if && s->sk_bound_dev_if != dif))
-			continue;
-		if (!ip_mc_sf_allow(s, loc_addr, rmt_addr, dif))
-			continue;
-		goto found;
+		if (__udp_is_mcast_sock(net, s,
+					loc_port, loc_addr,
+					rmt_port, rmt_addr,
+					dif, hnum))
+			goto found;
 	}
 	s = NULL;
 found:
@@ -1581,6 +1593,14 @@ static void flush_stack(struct sock **stack, unsigned int count,
 		kfree_skb(skb1);
 }
 
+static void udp_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+
+	dst_hold(dst);
+	sk->sk_rx_dst = dst;
+}
+
 /*
  *	Multicasts and broadcasts go to each listener.
  *
@@ -1709,11 +1729,28 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (udp4_csum_init(skb, uh, proto))
 		goto csum_error;
 
-	if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
-		return __udp4_lib_mcast_deliver(net, skb, uh,
-				saddr, daddr, udptable);
+	if (skb->sk) {
+		int ret;
+		sk = skb->sk;
 
-	sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
+		if (unlikely(sk->sk_rx_dst == NULL))
+			udp_sk_rx_dst_set(sk, skb);
+
+		ret = udp_queue_rcv_skb(sk, skb);
+
+		/* a return value > 0 means to resubmit the input, but
+		 * it wants the return to be -protocol, or 0
+		 */
+		if (ret > 0)
+			return -ret;
+		return 0;
+	} else {
+		if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
+			return __udp4_lib_mcast_deliver(net, skb, uh,
+					saddr, daddr, udptable);
+
+		sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
+	}
 
 	if (sk != NULL) {
 		int ret;
@@ -1771,6 +1808,135 @@ drop:
 	return 0;
 }
 
+/* We can only early demux multicast if there is a single matching socket.
+ * If more than one socket found returns NULL
+ */
+static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
+						  __be16 loc_port, __be32 loc_addr,
+						  __be16 rmt_port, __be32 rmt_addr,
+						  int dif)
+{
+	struct sock *sk, *result;
+	struct hlist_nulls_node *node;
+	unsigned short hnum = ntohs(loc_port);
+	unsigned int count, slot = udp_hashfn(net, hnum, udp_table.mask);
+	struct udp_hslot *hslot = &udp_table.hash[slot];
+
+	rcu_read_lock();
+begin:
+	count = 0;
+	result = NULL;
+	sk_nulls_for_each_rcu(sk, node, &hslot->head) {
+		if (__udp_is_mcast_sock(net, sk,
+					loc_port, loc_addr,
+					rmt_port, rmt_addr,
+					dif, hnum)) {
+			result = sk;
+			++count;
+		}
+	}
+	/*
+	 * if the nulls value we got at the end of this lookup is
+	 * not the expected one, we must restart lookup.
+	 * We probably met an item that was moved to another chain.
+	 */
+	if (get_nulls_value(node) != slot)
+		goto begin;
+
+	if (result) {
+		if (count != 1 ||
+		    unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
+			result = NULL;
+		else if (unlikely(!__udp_is_mcast_sock(net, sk,
+						       loc_port, loc_addr,
+						       rmt_port, rmt_addr,
+						       dif, hnum))) {
+			sock_put(result);
+			result = NULL;
+		}
+	}
+	rcu_read_unlock();
+	return result;
+}
+
+/* For unicast we should only early demux connected sockets or we can
+ * break forwarding setups.  The chains here can be long so only check
+ * if the first socket is an exact match and if not move on.
+ */
+static struct sock *__udp4_lib_demux_lookup(struct net *net,
+					    __be16 loc_port, __be32 loc_addr,
+					    __be16 rmt_port, __be32 rmt_addr,
+					    int dif)
+{
+	struct sock *sk, *result;
+	struct hlist_nulls_node *node;
+	unsigned short hnum = ntohs(loc_port);
+	unsigned int hash2 = udp4_portaddr_hash(net, loc_addr, hnum);
+	unsigned int slot2 = hash2 & udp_table.mask;
+	struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
+	INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr)
+	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
+
+	rcu_read_lock();
+	result = NULL;
+	udp_portaddr_for_each_entry_rcu(sk, node, &hslot2->head) {
+		if (INET_MATCH(sk, net, acookie,
+			       rmt_addr, loc_addr, ports, dif))
+			result = sk;
+		/* Only check first socket in chain */
+		break;
+	}
+
+	if (result) {
+		if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
+			result = NULL;
+		else if (unlikely(!INET_MATCH(sk, net, acookie,
+					      rmt_addr, loc_addr,
+					      ports, dif))) {
+			sock_put(result);
+			result = NULL;
+		}
+	}
+	rcu_read_unlock();
+	return result;
+}
+
+void udp_v4_early_demux(struct sk_buff *skb)
+{
+	const struct iphdr *iph = ip_hdr(skb);
+	const struct udphdr *uh = udp_hdr(skb);
+	struct sock *sk;
+	struct dst_entry *dst;
+	struct net *net = dev_net(skb->dev);
+	int dif = skb->dev->ifindex;
+
+	/* validate the packet */
+	if (!pskb_may_pull(skb, skb_transport_offset(skb) + sizeof(struct udphdr)))
+		return;
+
+	if (skb->pkt_type == PACKET_BROADCAST ||
+	    skb->pkt_type == PACKET_MULTICAST)
+		sk = __udp4_lib_mcast_demux_lookup(net, uh->dest, iph->daddr,
+						   uh->source, iph->saddr, dif);
+	else if (skb->pkt_type == PACKET_HOST)
+		sk = __udp4_lib_demux_lookup(net, uh->dest, iph->daddr,
+					     uh->source, iph->saddr, dif);
+	else
+		return;
+
+	if (!sk)
+		return;
+
+	skb->sk = sk;
+	skb->destructor = sock_edemux;
+	dst = sk->sk_rx_dst;
+
+	if (dst)
+		dst = dst_check(dst, 0);
+	if (dst)
+		skb_dst_set_noref(skb, dst);
+}
+
 int udp_rcv(struct sk_buff *skb)
 {
 	return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);
-- 
1.7.7.6

^ permalink raw reply related

* [PATCH net-next v3 3/3] net: ipv4 only populate IP_PKTINFO when needed
From: Shawn Bohrer @ 2013-10-07 16:01 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, tomk, Eric Dumazet, Shawn Bohrer
In-Reply-To: <1381161700-14453-1-git-send-email-shawn.bohrer@gmail.com>

From: Shawn Bohrer <sbohrer@rgmadvisors.com>

The since the removal of the routing cache computing
fib_compute_spec_dst() does a fib_table lookup for each UDP multicast
packet received.  This has introduced a performance regression for some
UDP workloads.

This change skips populating the packet info for sockets that do not have
IP_PKTINFO set.

Benchmark results from a netperf UDP_RR test:
Before 89789.68 transactions/s
After  90587.62 transactions/s

Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.63us RTT
After  12.48us RTT

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 include/net/ip.h       |    2 +-
 net/ipv4/ip_sockglue.c |    5 +++--
 net/ipv4/raw.c         |    2 +-
 net/ipv4/udp.c         |    2 +-
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 16078f4..b39ebe5 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -459,7 +459,7 @@ int ip_options_rcv_srr(struct sk_buff *skb);
  *	Functions provided by ip_sockglue.c
  */
 
-void ipv4_pktinfo_prepare(struct sk_buff *skb);
+void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb);
 void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb);
 int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc);
 int ip_setsockopt(struct sock *sk, int level, int optname, char __user *optval,
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 56e3445..0626f2c 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -1052,11 +1052,12 @@ e_inval:
  * destination in skb->cb[] before dst drop.
  * This way, receiver doesnt make cache line misses to read rtable.
  */
-void ipv4_pktinfo_prepare(struct sk_buff *skb)
+void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb)
 {
 	struct in_pktinfo *pktinfo = PKTINFO_SKB_CB(skb);
 
-	if (skb_rtable(skb)) {
+	if ((inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO) &&
+	    skb_rtable(skb)) {
 		pktinfo->ipi_ifindex = inet_iif(skb);
 		pktinfo->ipi_spec_dst.s_addr = fib_compute_spec_dst(skb);
 	} else {
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index b2fa14c..41e1d28 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -299,7 +299,7 @@ static int raw_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	/* Charge it to the socket. */
 
-	ipv4_pktinfo_prepare(skb);
+	ipv4_pktinfo_prepare(sk, skb);
 	if (sock_queue_rcv_skb(sk, skb) < 0) {
 		kfree_skb(skb);
 		return NET_RX_DROP;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 262ea39..4226c53 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1544,7 +1544,7 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
 	rc = 0;
 
-	ipv4_pktinfo_prepare(skb);
+	ipv4_pktinfo_prepare(sk, skb);
 	bh_lock_sock(sk);
 	if (!sock_owned_by_user(sk))
 		rc = __udp_queue_rcv_skb(sk, skb);
-- 
1.7.7.6

^ permalink raw reply related

* Re: [PATCH] mac80211: Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...))
From: Johannes Berg @ 2013-10-07 16:16 UTC (permalink / raw)
  To: djduanjiong
  Cc: John W. Linville, David S. Miller, linux-wireless, netdev,
	Duan Jiong
In-Reply-To: <1381190953-6362-1-git-send-email-duanj.fnst@cn.fujitsu.com>

On Mon, 2013-10-07 at 17:09 -0700, djduanjiong@gmail.com wrote:

>  	if (IS_ERR(key))
> -		return ERR_PTR(PTR_ERR(key));
> +		return ERR_CAST(key);

I already have this patch in my tree.

johannes

^ permalink raw reply

* [PATCH] wimax: Use WARN(1,...) rather than printk followed by WARN_ON(1)
From: djduanjiong @ 2013-10-08  0:30 UTC (permalink / raw)
  To: Inaky Perez-Gonzalez, David S. Miller, linux-wimax
  Cc: wimax, netdev, Duan Jiong

From: Duan Jiong <duanj.fnst@cn.fujitsu.com>

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
---
 net/wimax/id-table.c |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/wimax/id-table.c b/net/wimax/id-table.c
index 72273ab..635a4d2 100644
--- a/net/wimax/id-table.c
+++ b/net/wimax/id-table.c
@@ -136,10 +136,8 @@ void wimax_id_table_release(void)
 	return;
 #endif
 	spin_lock(&wimax_id_table_lock);
-	list_for_each_entry(wimax_dev, &wimax_id_table, id_table_node) {
-		printk(KERN_ERR "BUG: %s wimax_dev %p ifindex %d not cleared\n",
+	list_for_each_entry(wimax_dev, &wimax_id_table, id_table_node)
+		WARN(1, KERN_ERR "BUG: %s wimax_dev %p ifindex %d not cleared\n",
 		       __func__, wimax_dev, wimax_dev->net_dev->ifindex);
-		WARN_ON(1);
-	}
 	spin_unlock(&wimax_id_table_lock);
 }
-- 
1.7.7.6

^ permalink raw reply related

* Re: [PATCH net v2] be2net: Warn users of possible broken functionality on BE2 cards with very old FW versions with latest driver
From: David Miller @ 2013-10-07 16:31 UTC (permalink / raw)
  To: somnath.kotur; +Cc: netdev
In-Reply-To: <cca19523-3f51-4f87-b0e1-2ecec6ba37d4@CMEXHTCAS1.ad.emulex.com>

From: Somnath Kotur <somnath.kotur@emulex.com>
Date: Thu, 3 Oct 2013 15:34:29 +0530

> +	if (BE2_chip(adapter) && memcmp(adapter->fw_ver, "4.", 2) < 0) {
> +		dev_err(dev, "Firmware version is too old.IRQs may not work\n");

So many grammatical mistakes in one line.

First sentence got a period, second one did not.

Missing space between period and second sentence.

^ permalink raw reply

* Re: [PATCHv2] IPv6: Allow the MTU of ipip6 tunnel to be set below 1280
From: David Miller @ 2013-10-07 16:33 UTC (permalink / raw)
  To: hannes; +Cc: ou.ghorbel, kuznet, jmorris, yoshfuji, kaber, netdev,
	linux-kernel
In-Reply-To: <20131004133223.GA11410@order.stressinduktion.org>

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Fri, 4 Oct 2013 15:32:23 +0200

> On Thu, Oct 03, 2013 at 02:49:26PM +0100, Oussama Ghorbel wrote:
>> The (inner) MTU of a ipip6 (IPv4-in-IPv6) tunnel cannot be set below 1280, which is the minimum MTU in IPv6.
>> However, there should be no IPv6 on the tunnel interface at all, so the IPv6 rules should not apply.
>> More info at https://bugzilla.kernel.org/show_bug.cgi?id=15530
>> 
>> This patch allows to check the minimum MTU for ipv6 tunnel according to these rules:
>> -In case the tunnel is configured with ipip6 mode the minimum MTU is 68.
>> -In case the tunnel is configured with ip6ip6 or any mode the minimum MTU is 1280.
>> 
>> Signed-off-by: Oussama Ghorbel <ou.ghorbel@gmail.com>
 ...
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Applied, but please do not capitalize "IPv6" when using it as a subsystem
prefix in Subject lines.

Thanks.

^ permalink raw reply

* Re: [PATCH net v2] be2net: Warn users of possible broken functionality on BE2 cards with very old FW versions with latest driver
From: Joe Perches @ 2013-10-07 16:44 UTC (permalink / raw)
  To: David Miller; +Cc: somnath.kotur, netdev
In-Reply-To: <20131007.123116.1462434377931881914.davem@davemloft.net>

On Mon, 2013-10-07 at 12:31 -0400, David Miller wrote:
> From: Somnath Kotur <somnath.kotur@emulex.com>
> Date: Thu, 3 Oct 2013 15:34:29 +0530
> > +     if (BE2_chip(adapter) && memcmp(adapter->fw_ver, "4.", 2) < 0) {
> > +             dev_err(dev, "Firmware version is too old.IRQs may not work\n");
> 
> So many grammatical mistakes in one line.
> 
> First sentence got a period, second one did not.
> 
> Missing space between period and second sentence.

Periods are unnecessary here.
Just use a comma or a dash.

Also, it might be nicer to show the current firmware version too.

Maybe:

		dev_err(dev, "Firmware version is '%s' - Version 5 or higher is required for properly functional IRQs\n",
			adapter->fw_ver);

^ permalink raw reply

* Re: [PATCH nf-next] netfilter: xtables: lightweight process control group matching
From: Tejun Heo @ 2013-10-07 16:46 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: pablo, netfilter-devel, netdev, cgroups
In-Reply-To: <1380910855-12325-1-git-send-email-dborkman@redhat.com>

Hello,

On Fri, Oct 04, 2013 at 08:20:55PM +0200, Daniel Borkmann wrote:
> It would be useful e.g. in a server or desktop environment to have
> a facility in the notion of fine-grained "per application" or "per
> application group" firewall policies. Probably, users in the mobile/
> embedded area (e.g. Android based) with different security policy
> requirements for application groups could have great benefit from
> that as well. For example, with a little bit of configuration effort,
> an admin could whitelist well-known applications, and thus block
> otherwise unwanted "hard-to-track" applications like [1] from a
> user's machine.
> 
> Implementation of PID-based matching would not be appropriate
> as they frequently change, and child tracking would make that
> even more complex and ugly. Cgroups would be a perfect candidate
> for accomplishing that as they associate a set of tasks with a
> set of parameters for one or more subsystems, in our case the
> netfilter subsystem, which, of course, can be combined with other
> cgroup subsystems into something more complex.
> 
> As mentioned, to overcome this constraint, such processes could
> be placed into one or multiple cgroups where different fine-grained
> rules can be defined depending on the application scenario, while
> e.g. everything else that is not part of that could be dropped (or
> vice versa), thus making life harder for unwanted processes to
> communicate to the outside world. So, we make use of cgroups here
> to track jobs and limit their resources in terms of iptables
> policies; in other words, limiting what they are allowed to
> communicate.
> 
> We have similar cgroup facilities in networking for traffic
> classifier, and netprio cgroups. This feature adds a lightweight
> cgroup id matching in terms of network security resp. network
> traffic isolation as part of netfilter's xtables subsystem.

I don't think the two net cgroups were a good idea and definitely
don't want to continue the trend.  I think this is being done
backwards.  Wouldn't it be more logical to implement netfilter rule to
match the target cgroup paths?  It really doesn't make much sense to
me to add separate controllers to just tag processes.  Please classify
tasks in cgroup and let netfilter match the cgroups.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH RFC 54/77] ntb: Ensure number of MSIs on SNB is enough for the link interrupt
From: Jon Mason @ 2013-10-07 16:50 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-kernel, Bjorn Helgaas, Ralf Baechle, Michael Ellerman,
	Benjamin Herrenschmidt, Martin Schwidefsky, Ingo Molnar,
	Tejun Heo, Dan Williams, Andy King, Matt Porter, stable,
	linux-pci, linux-mips, linuxppc-dev, linux390, linux-s390, x86,
	linux-ide, iss_storagedev, linux-nvme, linux-rdma, netdev,
	e1000-devel, linux-driver, Solarflare linux maintainers,
	VMware, Inc.
In-Reply-To: <20131005214303.GA21589@dhcp-26-207.brq.redhat.com>

On Sat, Oct 05, 2013 at 11:43:04PM +0200, Alexander Gordeev wrote:
> On Wed, Oct 02, 2013 at 05:48:05PM -0700, Jon Mason wrote:
> > On Wed, Oct 02, 2013 at 12:49:10PM +0200, Alexander Gordeev wrote:
> > > Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
> > > ---
> > >  drivers/ntb/ntb_hw.c |    2 +-
> > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/drivers/ntb/ntb_hw.c b/drivers/ntb/ntb_hw.c
> > > index de2062c..eccd5e5 100644
> > > --- a/drivers/ntb/ntb_hw.c
> > > +++ b/drivers/ntb/ntb_hw.c
> > > @@ -1066,7 +1066,7 @@ static int ntb_setup_msix(struct ntb_device *ndev)
> > >  		/* On SNB, the link interrupt is always tied to 4th vector.  If
> > >  		 * we can't get all 4, then we can't use MSI-X.
> > >  		 */
> > > -		if (ndev->hw_type != BWD_HW) {
> > > +		if ((rc < SNB_MSIX_CNT) && (ndev->hw_type != BWD_HW)) {
> > 
> > Nack, this check is unnecessary.
> 
> If SNB can do more than SNB_MSIX_CNT MSI-Xs then this check is needed
> to enable less than maximum MSI-Xs in case the maximum was not allocated.
> Otherwise SNB will fallback to single MSI instead of multiple MSI-Xs.

Per the comment in the code snippet above, "If we can't get all 4,
then we can't use MSI-X".  There is already a check to see if more
than 4 were acquired.  So it's not possible to hit this.  Even if it
was, don't use SNB_MSIX_CNT here (limits.msix_cnt is the preferred
variable).  Also, the "()" are unnecessary.

Thanks,
Jon

> -- 
> Regards,
> Alexander Gordeev
> agordeev@redhat.com

^ permalink raw reply

* Re: [PATCH] Fix the upper MTU limit in ipv6 GRE tunnel
From: David Miller @ 2013-10-07 16:53 UTC (permalink / raw)
  To: ou.ghorbel; +Cc: kuznet, jmorris, yoshfuji, kaber, netdev, linux-kernel
In-Reply-To: <1380880333-3546-1-git-send-email-ou.ghorbel@gmail.com>

From: Oussama Ghorbel <ou.ghorbel@gmail.com>
Date: Fri,  4 Oct 2013 10:52:13 +0100

> Unlike ipv4, the struct member hlen holds the length of the GRE and ipv6
> headers. This length is also counted in dev->hard_header_len.
> Perhaps, it's more clean to modify the hlen to count only the GRE header
> without ipv6 header as the variable name suggest, but the simple way to fix
> this without regression risk is simply modify the calculation of the limit
> in ip6gre_tunnel_change_mtu function.
> Verified in kernel version v3.11.
> 
> Signed-off-by: Oussama Ghorbel <ou.ghorbel@gmail.com>

Please resubmit this with a proper Subject line.

It should be "[PATCH] " then a legitimate subsystem prefix.
In this case "ipv6: " would be appropriate.  And then
the "ipv6" in your existing Subject text is redundant so
can be removed.

Thanks.

^ permalink raw reply

* Re: Neterion and UFO handling [was: Re: [PATCH] ipv6: udp packets following an UFO enqueued packet need also be handled by UFO]
From: Hannes Frederic Sowa @ 2013-10-07 16:53 UTC (permalink / raw)
  To: jon.mason
  Cc: jdmason, Jiri Pirko, netdev, yoshfuji, davem, kuznet, jmorris,
	kaber, herbert, eric.dumazet
In-Reply-To: <20131002162730.GA3991@order.stressinduktion.org>

Hi Jon!

Maybe I got the wrong email address for the neterion driver from the
maintainers filer? If you are (still) affiliated with the neterion driver,
maybe you could have a short look at the quoted mail below?

Thanks,

  Hannes

On Wed, Oct 02, 2013 at 06:27:30PM +0200, Hannes Frederic Sowa wrote:
> Hi!
> 
> I have a question regarding UFO and the neterion driver, which as the only one
> advertises hardware UFO support:
> 
> The patch discusses in this thread
> http://thread.gmane.org/gmane.linux.network/284348/focus=285405 could change
> some semantics how packets are constructed before submitted to the driver.
> 
> We currently guarantee that we have the MAC/IP/UDP header in skb->data and the
> payload is attached in the skb's frags. With the changes discussed in this
> thread it is possible that we also append to skb->data some amount of data
> which is not targeted for the header. From reading the driver sources it seems
> the hardware interprets the skb->data to skb_headlen as the header, so we
> could include some data in the fragments more than once.
> 
> Do you think this change is safe? Otherwise I would suggest that the UFO
> capability is switched off until the driver signals the hardware the start and
> end of the headers correctly?
> 
> I left the mail below intact which points to the specific place in s2io.c
> where I think the problem is.
> 
> On Wed, Oct 02, 2013 at 08:14:27AM -0700, Eric Dumazet wrote:
> > On Wed, 2013-10-02 at 15:03 +0200, Hannes Frederic Sowa wrote:
> > > On Wed, Oct 02, 2013 at 02:12:07PM +0200, Hannes Frederic Sowa wrote:
> > > > Hi Eric!
> > > > 
> > > > On Wed, Oct 02, 2013 at 03:41:28AM -0700, Eric Dumazet wrote:
> > > > > On Wed, 2013-10-02 at 10:58 +0200, Jiri Pirko wrote:
> > > > > > Wed, Oct 02, 2013 at 01:25:34AM CEST, hannes@stressinduktion.org wrote:
> > > > > > >-	if (((length > mtu) || (skb && skb_is_gso(skb))) &&
> > > > > > >+	if (((length > mtu) || (skb && skb_has_frags(skb))) &&
> > > > > 
> > > > > > 
> > > > > > This seems correct to me. sk_is_gso would work as well is you apply my
> > > > > > patch "[patch net] ip6_output: do skb ufo init for peeked non ufo skb as
> > > > > > well" which does the setting of gso_size.
> > > > > 
> > > > > Well, skb having frags or not should not be a concern :
> > > > > Thats an allocation choice (lets say to avoid high order allocations). 
> > > > > 
> > > > > Setting gso_size is probably better.
> > > > 
> > > > e89e9cf539a28df7d0eb1d0a545368e9920b34ac ("[IPv4/IPv6]: UFO Scatter-gather
> > > > approach") states:
> > > > 
> > > > "
> > > > skb->data will contain MAC/IP/UDP header and skb_shinfo(skb)->frags[]
> > > > contains the data payload. The skb->ip_summed will be set to CHECKSUM_HW
> > > > indicating that hardware has to do checksum calculation. Hardware should
> > > > compute the UDP checksum of complete datagram and also ip header checksum of
> > > > each fragmented IP packet.
> > > > "
> > > > 
> > > > This is the reason why I tried not to update the gso_size. If it is ok, I am
> > > > fine with that.
> > > 
> > > Especially, drivers/net/ethernet/neterion/s2io.c states that the first dma
> > > mapping (skb->data with skb_headlen, which is fine) is used as the inband
> > > header:
> > > 
> > >         if (offload_type == SKB_GSO_UDP)
> > >                 frg_cnt++; /* as Txd0 was used for inband header */
> > > 
> > > That is my only other hint that we maybe should not update gso_size and
> > > gso_type. I guess software fallback does not have this problem, but I won't
> > > have time to check until this evening.
> > > 
> > > I am really not sure if just setting gso_size does not break neterion UFO
> > > offloading. :/
> > 
> > Well, just ask Jon Mason to double check ;)
> > 
> > I think the commit intent was to set gso_size :
> > 
> >    skb_shinfo(skb)->ufo_size will indicate the length of data part in each IP
> >     fragment going out of the adapter after IP fragmentation by hardware.
> > 
> > The fact that it states "skb->data will contain MAC/IP/UDP header and
> > skb_shinfo(skb)->frags[] contains the data payload." seems irrelevant.
> > 
> > If Neterion driver mandates that skb->head *only* contains the
> > MAC/IP/UDP header, that should be handled in the driver itself.
> 
> Thanks Eric for clearing this up.
> 
> I really thought it would be the common pattern for UFO to have only headers
> in skb->data, so I didn't bother to ask in the first place.
> 
> Thanks,
> 
>   Hannes
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* [PATCH v2] static_key: WARN on usage before jump_label_init was called
From: Hannes Frederic Sowa @ 2013-10-07 16:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: netdev, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Jason Baron, Peter Zijlstra, Eric Dumazet,
	andi @ firstfloor. org David S. Miller, x86
In-Reply-To: <20131007115152.0964afde@gandalf.local.home>

On Mon, Oct 07, 2013 at 11:51:52AM -0400, Steven Rostedt wrote:
> On Sun, 6 Oct 2013 20:29:19 +0200
> Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> 
> 
> > diff --git a/lib/jump_label_initialized.c b/lib/jump_label_initialized.c
> > new file mode 100644
> > index 0000000..a668a40
> > --- /dev/null
> > +++ b/lib/jump_label_initialized.c
> > @@ -0,0 +1,6 @@
> > +#include <linux/types.h>
> > +#include <linux/cache.h>
> > +
> > +bool static_key_initialized __read_mostly = false;
> > +EXPORT_SYMBOL_GPL(static_key_initialized);
> > +
> 
> So far, the only thing I don't like about this patch is the creation of
> this file for the sole purpose of adding this variable.
> 
> Perhaps it can just be added to init/main.c?

Yes, init/main.c seems to be a good fit. I also fixed some whitespace issues
and simplified the macro (as it is only one statement).

[PATCH v2] static_key: WARN on usage before jump_label_init was called

Based on a patch from Andi Kleen.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/linux/jump_label.h           | 10 ++++++++++
 include/linux/jump_label_ratelimit.h |  2 ++
 init/main.c                          |  7 +++++++
 kernel/jump_label.c                  |  5 +++++
 4 files changed, 24 insertions(+)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index a507907..e96be72 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -48,6 +48,13 @@
 
 #include <linux/types.h>
 #include <linux/compiler.h>
+#include <linux/bug.h>
+
+extern bool static_key_initialized;
+
+#define STATIC_KEY_CHECK_USE() WARN(!static_key_initialized,		      \
+				    "%s used before call to jump_label_init", \
+				    __func__)
 
 #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
 
@@ -128,6 +135,7 @@ struct static_key {
 
 static __always_inline void jump_label_init(void)
 {
+	static_key_initialized = true;
 }
 
 static __always_inline bool static_key_false(struct static_key *key)
@@ -146,11 +154,13 @@ static __always_inline bool static_key_true(struct static_key *key)
 
 static inline void static_key_slow_inc(struct static_key *key)
 {
+	STATIC_KEY_CHECK_USE();
 	atomic_inc(&key->enabled);
 }
 
 static inline void static_key_slow_dec(struct static_key *key)
 {
+	STATIC_KEY_CHECK_USE();
 	atomic_dec(&key->enabled);
 }
 
diff --git a/include/linux/jump_label_ratelimit.h b/include/linux/jump_label_ratelimit.h
index 1137883..089f70f 100644
--- a/include/linux/jump_label_ratelimit.h
+++ b/include/linux/jump_label_ratelimit.h
@@ -23,12 +23,14 @@ struct static_key_deferred {
 };
 static inline void static_key_slow_dec_deferred(struct static_key_deferred *key)
 {
+	STATIC_KEY_CHECK_USE();
 	static_key_slow_dec(&key->key);
 }
 static inline void
 jump_label_rate_limit(struct static_key_deferred *key,
 		unsigned long rl)
 {
+	STATIC_KEY_CHECK_USE();
 }
 #endif	/* HAVE_JUMP_LABEL */
 #endif	/* _LINUX_JUMP_LABEL_RATELIMIT_H */
diff --git a/init/main.c b/init/main.c
index af310af..6735b631 100644
--- a/init/main.c
+++ b/init/main.c
@@ -136,6 +136,13 @@ static char *execute_command;
 static char *ramdisk_execute_command;
 
 /*
+ * Used to generate warnings if static_key manipulation functions are used
+ * before boot
+ */
+bool static_key_initialized __read_mostly = false;
+EXPORT_SYMBOL_GPL(static_key_initialized);
+
+/*
  * If set, this is an indication to the drivers that reset the underlying
  * device before going ahead with the initialization otherwise driver might
  * rely on the BIOS and skip the reset operation.
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 297a924..9019f15 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -58,6 +58,7 @@ static void jump_label_update(struct static_key *key, int enable);
 
 void static_key_slow_inc(struct static_key *key)
 {
+	STATIC_KEY_CHECK_USE();
 	if (atomic_inc_not_zero(&key->enabled))
 		return;
 
@@ -103,12 +104,14 @@ static void jump_label_update_timeout(struct work_struct *work)
 
 void static_key_slow_dec(struct static_key *key)
 {
+	STATIC_KEY_CHECK_USE();
 	__static_key_slow_dec(key, 0, NULL);
 }
 EXPORT_SYMBOL_GPL(static_key_slow_dec);
 
 void static_key_slow_dec_deferred(struct static_key_deferred *key)
 {
+	STATIC_KEY_CHECK_USE();
 	__static_key_slow_dec(&key->key, key->timeout, &key->work);
 }
 EXPORT_SYMBOL_GPL(static_key_slow_dec_deferred);
@@ -116,6 +119,7 @@ EXPORT_SYMBOL_GPL(static_key_slow_dec_deferred);
 void jump_label_rate_limit(struct static_key_deferred *key,
 		unsigned long rl)
 {
+	STATIC_KEY_CHECK_USE();
 	key->timeout = rl;
 	INIT_DELAYED_WORK(&key->work, jump_label_update_timeout);
 }
@@ -212,6 +216,7 @@ void __init jump_label_init(void)
 		key->next = NULL;
 #endif
 	}
+	static_key_initialized = true;
 	jump_label_unlock();
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: Neterion and UFO handling [was: Re: [PATCH] ipv6: udp packets following an UFO enqueued packet need also be handled by UFO]
From: Jon Mason @ 2013-10-07 17:19 UTC (permalink / raw)
  To: jdmason, Jiri Pirko, netdev, yoshfuji, davem, kuznet, jmorris,
	kaber, herbert, eric.dumazet
In-Reply-To: <20131007165343.GI9295@order.stressinduktion.org>

On Mon, Oct 07, 2013 at 06:53:43PM +0200, Hannes Frederic Sowa wrote:
> Hi Jon!
> 
> Maybe I got the wrong email address for the neterion driver from the
> maintainers filer? If you are (still) affiliated with the neterion driver,
> maybe you could have a short look at the quoted mail below?

Both are valid email addresses, but I prefer to address non-Intel
issues with my kudzu.us email account.

I apologize for not addressing your question yet.  What you are saying
makes sense, but I want to dig through the documentation and verify.
However, I haven't had the time.  I'll brew up a pot of coffee when I
get home and I'll get an answer to you before I go to bed tonight :)

Thanks,
Jon

> 
> Thanks,
> 
>   Hannes
> 
> On Wed, Oct 02, 2013 at 06:27:30PM +0200, Hannes Frederic Sowa wrote:
> > Hi!
> > 
> > I have a question regarding UFO and the neterion driver, which as the only one
> > advertises hardware UFO support:
> > 
> > The patch discusses in this thread
> > http://thread.gmane.org/gmane.linux.network/284348/focus=285405 could change
> > some semantics how packets are constructed before submitted to the driver.
> > 
> > We currently guarantee that we have the MAC/IP/UDP header in skb->data and the
> > payload is attached in the skb's frags. With the changes discussed in this
> > thread it is possible that we also append to skb->data some amount of data
> > which is not targeted for the header. From reading the driver sources it seems
> > the hardware interprets the skb->data to skb_headlen as the header, so we
> > could include some data in the fragments more than once.
> > 
> > Do you think this change is safe? Otherwise I would suggest that the UFO
> > capability is switched off until the driver signals the hardware the start and
> > end of the headers correctly?
> > 
> > I left the mail below intact which points to the specific place in s2io.c
> > where I think the problem is.
> > 
> > On Wed, Oct 02, 2013 at 08:14:27AM -0700, Eric Dumazet wrote:
> > > On Wed, 2013-10-02 at 15:03 +0200, Hannes Frederic Sowa wrote:
> > > > On Wed, Oct 02, 2013 at 02:12:07PM +0200, Hannes Frederic Sowa wrote:
> > > > > Hi Eric!
> > > > > 
> > > > > On Wed, Oct 02, 2013 at 03:41:28AM -0700, Eric Dumazet wrote:
> > > > > > On Wed, 2013-10-02 at 10:58 +0200, Jiri Pirko wrote:
> > > > > > > Wed, Oct 02, 2013 at 01:25:34AM CEST, hannes@stressinduktion.org wrote:
> > > > > > > >-	if (((length > mtu) || (skb && skb_is_gso(skb))) &&
> > > > > > > >+	if (((length > mtu) || (skb && skb_has_frags(skb))) &&
> > > > > > 
> > > > > > > 
> > > > > > > This seems correct to me. sk_is_gso would work as well is you apply my
> > > > > > > patch "[patch net] ip6_output: do skb ufo init for peeked non ufo skb as
> > > > > > > well" which does the setting of gso_size.
> > > > > > 
> > > > > > Well, skb having frags or not should not be a concern :
> > > > > > Thats an allocation choice (lets say to avoid high order allocations). 
> > > > > > 
> > > > > > Setting gso_size is probably better.
> > > > > 
> > > > > e89e9cf539a28df7d0eb1d0a545368e9920b34ac ("[IPv4/IPv6]: UFO Scatter-gather
> > > > > approach") states:
> > > > > 
> > > > > "
> > > > > skb->data will contain MAC/IP/UDP header and skb_shinfo(skb)->frags[]
> > > > > contains the data payload. The skb->ip_summed will be set to CHECKSUM_HW
> > > > > indicating that hardware has to do checksum calculation. Hardware should
> > > > > compute the UDP checksum of complete datagram and also ip header checksum of
> > > > > each fragmented IP packet.
> > > > > "
> > > > > 
> > > > > This is the reason why I tried not to update the gso_size. If it is ok, I am
> > > > > fine with that.
> > > > 
> > > > Especially, drivers/net/ethernet/neterion/s2io.c states that the first dma
> > > > mapping (skb->data with skb_headlen, which is fine) is used as the inband
> > > > header:
> > > > 
> > > >         if (offload_type == SKB_GSO_UDP)
> > > >                 frg_cnt++; /* as Txd0 was used for inband header */
> > > > 
> > > > That is my only other hint that we maybe should not update gso_size and
> > > > gso_type. I guess software fallback does not have this problem, but I won't
> > > > have time to check until this evening.
> > > > 
> > > > I am really not sure if just setting gso_size does not break neterion UFO
> > > > offloading. :/
> > > 
> > > Well, just ask Jon Mason to double check ;)
> > > 
> > > I think the commit intent was to set gso_size :
> > > 
> > >    skb_shinfo(skb)->ufo_size will indicate the length of data part in each IP
> > >     fragment going out of the adapter after IP fragmentation by hardware.
> > > 
> > > The fact that it states "skb->data will contain MAC/IP/UDP header and
> > > skb_shinfo(skb)->frags[] contains the data payload." seems irrelevant.
> > > 
> > > If Neterion driver mandates that skb->head *only* contains the
> > > MAC/IP/UDP header, that should be handled in the driver itself.
> > 
> > Thanks Eric for clearing this up.
> > 
> > I really thought it would be the common pattern for UFO to have only headers
> > in skb->data, so I didn't bother to ask in the first place.
> > 
> > Thanks,
> > 
> >   Hannes
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox