Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: Thomas Graf @ 2014-12-02 17:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Du, Fan, 'Jason Wang', netdev@vger.kernel.org,
	davem@davemloft.net, fw@strlen.de, dev, jesse, pshelar
In-Reply-To: <20141202173401.GB4126@redhat.com>

On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> > > What about containers or any other virtualization environment that
> > > doesn't use Virtio?
> > 
> > The host can dictate the MTU in that case for both veth or OVS
> > internal which would be primary container plumbing techniques.
> 
> It typically can't do this easily for VMs with emulated devices:
> real ethernet uses a fixed MTU.
> 
> IMHO it's confusing to suggest MTU as a fix for this bug, it's
> an unrelated optimization.
> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.

PMTU discovery only resolves the issue if an actual IP stack is
running inside the VM. This may not be the case at all.

I agree that exposing an MTU towards the guest is not applicable
in all situations, in particular because it is difficult to decide
what MTU to expose. It is a relatively elegant solution in a lot
of virtualization host cases hooked up to an orchestration system
though.

^ permalink raw reply

* Fw: [Bug 89161] New: Regression in bonding driver with devices that have no MAC address
From: Stephen Hemminger @ 2014-12-02 17:34 UTC (permalink / raw)
  To: netdev



Begin forwarded message:

Date: Tue, 2 Dec 2014 02:15:05 -0800
From: "bugzilla-daemon@bugzilla.kernel.org" <bugzilla-daemon@bugzilla.kernel.org>
To: "stephen@networkplumber.org" <stephen@networkplumber.org>
Subject: [Bug 89161] New: Regression in bonding driver with devices that have no MAC address


https://bugzilla.kernel.org/show_bug.cgi?id=89161

            Bug ID: 89161
           Summary: Regression in bonding driver with devices that have no
                    MAC address
           Product: Networking
           Version: 2.5
    Kernel Version: 3.17.4
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
          Assignee: shemminger@linux-foundation.org
          Reporter: tjc@wintrmute.net
        Regression: No

In kernel 3.13 and earlier, a user could use the "bonding" network module to
enslave devices without MAC addresses, such as ppp devices, in certain modes
(such as balance-rr).

At some point between 3.13 and 3.17, this ability was removed -- even though
apparently a commit was made to specifically ALLOW this to happen:
https://github.com/torvalds/linux/commit/f54424412b6b2f64cae4d7c39d981ca14ce0052c


On kernel 3.13, this was printed to syslog upon enslaving ppp0:
bonding: bond0: Warning: The first slave device specified does not support
setting the MAC address. Setting fail_over_mac to active.
bonding: bond0: enslaving ppp0 as an active interface with an up link.

On kernel 3.17.2 and 3.17.4 (and probably others) instead this error comes up:
bond0: Adding slave ppp0
bond0: The slave device specified does not support setting the MAC address

And the slave is not added.


In case it was relevant, I resorted to manually creating the bond0 with
appropriate options (including fail_over_mac) preset, and yet the problem
persists.

To replicate the problem, setup at least one ppp connection, and then follow
these instructions:

    modprobe bonding
    echo '+bond0' > /sys/class/net/bonding_masters
    echo 'active' > /sys/class/net/bond0/bonding/fail_over_mac
    ifconfig bond0 down
    echo 'balance-rr' > /sys/class/net/bond0/bonding/mode
    ifconfig bond0 202.xx.xx.xx netmask 255.255.255.255 mtu 1492 up
    echo '500' > /sys/class/net/bond0/bonding/miimon

    ifconfig ppp0 down
    echo '+ppp0' > /sys/class/net/bond0/bonding/slaves


Under earlier kernel versions, that would work -- but current stable kernels
fail.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: Michael S. Tsirkin @ 2014-12-02 17:34 UTC (permalink / raw)
  To: Thomas Graf
  Cc: dev-yBygre7rU0TnMu66kgdUjQ,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	'Jason Wang', Du, Fan,
	fw-HFFVJYpyMKqzQB+pC5nmwQ@public.gmane.org,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org
In-Reply-To: <20141202170927.GA9457-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>

On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> > What about containers or any other virtualization environment that
> > doesn't use Virtio?
> 
> The host can dictate the MTU in that case for both veth or OVS
> internal which would be primary container plumbing techniques.

It typically can't do this easily for VMs with emulated devices:
real ethernet uses a fixed MTU.

IMHO it's confusing to suggest MTU as a fix for this bug, it's
an unrelated optimization.
ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.


> ivshmem would need it's own fix.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply

* Re: [PATCH] net: mvneta: fix Tx interrupt delay
From: Dave Taht @ 2014-12-02 17:31 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: Eric Dumazet, Willy Tarreau, netdev@vger.kernel.org,
	Maggie Mae Roxas, Thomas Petazzoni, Gregory CLEMENT
In-Reply-To: <547DF2EA.2020908@free-electrons.com>

On Tue, Dec 2, 2014 at 9:12 AM, Ezequiel Garcia
<ezequiel.garcia@free-electrons.com> wrote:
> Eric,
>
> On 12/02/2014 09:18 AM, Eric Dumazet wrote:
> [..]
>>
>> I am surprised TCP even worked correctly with this problem.
>>
>> I highly suggest BQL for this driver, now this issue is fixed.
>>
>
> Implementing BQL for the mvneta driver was something I wanted to do a
> while ago, but you explained that these kind drivers (i.e. those with
> software TSO) didn't need BQL, because the latency that resulted from
> the ring was too small.
>
> Quoting (http://www.spinics.net/lists/netdev/msg284439.html):
> ""
> Note that a full size TSO packet (44 or 45 MSS) requires about 88 or 90
> descriptors.
>
> So I do not think BQL is really needed, because a 512 slots TX ring wont
> add a big latency : About 5 ms max.

at a gigabit.

Wildly variable with pause frames. And: 50ms at 100mbit. 500ms at 10mbit.

Secondly, why accept any latency in the driver when you can get it well below
2ms in most circumstances with a couple lines of BQL code?

>
> BQL is generally nice for NIC supporting hardware TSO, where a 64KB TSO
> packet consumes 3 or 4 descriptors.

BQL is nice at any speed, and darn useful at lower speeds,
as well.

>
> Also note that TCP Small Queues should limit TX ring occupancy of a
> single bulk flow anyway.

We only do one upload or one download at a time on our systems,
followed by a ping measurement... never do more than one flow
at the same time, right?


> ""
>
> Maybe I misunderstood something?

Try it, test it, you'll like it.

> --
> Ezequiel García, Free Electrons
> Embedded Linux, Kernel and Android Engineering
> http://free-electrons.com
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks

^ permalink raw reply

* Re: [PATCH net-next] rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()
From: Roopa Prabhu @ 2014-12-02 17:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mahesh Bandewar, netdev, David Miller, Eric Dumazet,
	Toshiaki Makita
In-Reply-To: <1417539160.5303.73.camel@edumazet-glaptop2.roam.corp.google.com>

On 12/2/14, 8:52 AM, Eric Dumazet wrote:
> On Tue, 2014-12-02 at 08:41 -0800, Roopa Prabhu wrote:
>
>> fair point. But the commit that moved things around was done to handle
>> cases where,
>> the ndo_uninit() already sends some notifications to userspace for the
>> changes
>> during uninit (example bond driver).
>>
>> The only point i was making was that the dellink after the ndo_uninit in
>> your
>> case now contains state that was prior to uninit for these drivers.
> I think Mahesh forgot to mention your patch probably broke some drivers.
>
> calling rtmsg_ifinfo() after uninit() is probably breaking dummy device,
> as it does :
>
> static void dummy_dev_uninit(struct net_device *dev)
> {
> 	free_percpu(dev->dstats);
> }
>
> It looks like 'fixing' ipvlan is not going to help.
>
> Instead of checking all drivers for such interesting side effects,
> and revert your patch, we had the idea of this solution.

ok fair enough. I could go through all drivers again and check and fix them.
but dont want to miss any such cases.
The patch idea is good.
Worth a comment in the source regarding the state in the dellink. 
Hopefully the drivers that are affected by this are very few.

Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>

Thanks!.

^ permalink raw reply

* Re: [PATCH] net: mvneta: fix Tx interrupt delay
From: Ezequiel Garcia @ 2014-12-02 17:12 UTC (permalink / raw)
  To: Eric Dumazet, Willy Tarreau
  Cc: netdev, Maggie Mae Roxas, Thomas Petazzoni, Gregory CLEMENT
In-Reply-To: <1417522688.5303.35.camel@edumazet-glaptop2.roam.corp.google.com>

Eric,

On 12/02/2014 09:18 AM, Eric Dumazet wrote:
[..]
> 
> I am surprised TCP even worked correctly with this problem.
> 
> I highly suggest BQL for this driver, now this issue is fixed.
> 

Implementing BQL for the mvneta driver was something I wanted to do a
while ago, but you explained that these kind drivers (i.e. those with
software TSO) didn't need BQL, because the latency that resulted from
the ring was too small.

Quoting (http://www.spinics.net/lists/netdev/msg284439.html):
""
Note that a full size TSO packet (44 or 45 MSS) requires about 88 or 90
descriptors.

So I do not think BQL is really needed, because a 512 slots TX ring wont
add a big latency : About 5 ms max.

BQL is generally nice for NIC supporting hardware TSO, where a 64KB TSO
packet consumes 3 or 4 descriptors.

Also note that TCP Small Queues should limit TX ring occupancy of a
single bulk flow anyway.
""

Maybe I misunderstood something?
-- 
Ezequiel García, Free Electrons
Embedded Linux, Kernel and Android Engineering
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: Thomas Graf @ 2014-12-02 17:09 UTC (permalink / raw)
  To: Du, Fan, 'Jason Wang', netdev@vger.kernel.org,
	davem@davemloft.net, fw@strlen.de, dev, mst, jesse, pshelar
In-Reply-To: <20141202154839.GB5344@t520.home>

On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> What about containers or any other virtualization environment that
> doesn't use Virtio?

The host can dictate the MTU in that case for both veth or OVS
internal which would be primary container plumbing techniques.

ivshmem would need it's own fix.

^ permalink raw reply

* [patch net-next 5/6] net_sched: cls_flow: remove duplicate assignments
From: Jiri Pirko @ 2014-12-02 17:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, jhs
In-Reply-To: <1417539636-12710-1-git-send-email-jiri@resnulli.us>

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 net/sched/cls_flow.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index a605739..819c230 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -426,10 +426,7 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 			goto err2;
 
 		/* Copy fold into fnew */
-		fnew->handle = fold->handle;
-		fnew->keymask = fold->keymask;
 		fnew->tp = fold->tp;
-
 		fnew->handle = fold->handle;
 		fnew->nkeys = fold->nkeys;
 		fnew->keymask = fold->keymask;
-- 
1.9.3

^ permalink raw reply related

* [patch net-next 2/6] net_sched: cls_bpf: remove unnecessary iteration and use passed arg
From: Jiri Pirko @ 2014-12-02 17:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, jhs
In-Reply-To: <1417539636-12710-1-git-send-email-jiri@resnulli.us>

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 net/sched/cls_bpf.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index eed49d1..cbfaf6f 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -109,19 +109,12 @@ static void __cls_bpf_delete_prog(struct rcu_head *rcu)
 
 static int cls_bpf_delete(struct tcf_proto *tp, unsigned long arg)
 {
-	struct cls_bpf_head *head = rtnl_dereference(tp->root);
-	struct cls_bpf_prog *prog, *todel = (struct cls_bpf_prog *) arg;
-
-	list_for_each_entry(prog, &head->plist, link) {
-		if (prog == todel) {
-			list_del_rcu(&prog->link);
-			tcf_unbind_filter(tp, &prog->res);
-			call_rcu(&prog->rcu, __cls_bpf_delete_prog);
-			return 0;
-		}
-	}
+	struct cls_bpf_prog *prog = (struct cls_bpf_prog *) arg;
 
-	return -ENOENT;
+	list_del_rcu(&prog->link);
+	tcf_unbind_filter(tp, &prog->res);
+	call_rcu(&prog->rcu, __cls_bpf_delete_prog);
+	return 0;
 }
 
 static void cls_bpf_destroy(struct tcf_proto *tp)
-- 
1.9.3

^ permalink raw reply related

* [patch net-next 6/6] net_sched: cls_cgroup: remove unnecessary if
From: Jiri Pirko @ 2014-12-02 17:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, jhs
In-Reply-To: <1417539636-12710-1-git-send-email-jiri@resnulli.us>

since head->handle == handle (checked before), just assign handle.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 net/sched/cls_cgroup.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index d61a801..dbee65e 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -117,11 +117,7 @@ static int cls_cgroup_change(struct net *net, struct sk_buff *in_skb,
 		return -ENOBUFS;
 
 	tcf_exts_init(&new->exts, TCA_CGROUP_ACT, TCA_CGROUP_POLICE);
-	if (head)
-		new->handle = head->handle;
-	else
-		new->handle = handle;
-
+	new->handle = handle;
 	new->tp = tp;
 	err = nla_parse_nested(tb, TCA_CGROUP_MAX, tca[TCA_OPTIONS],
 			       cgroup_policy);
-- 
1.9.3

^ permalink raw reply related

* [patch net-next 3/6] net_sched: cls_bpf: remove faulty use of list_for_each_entry_rcu
From: Jiri Pirko @ 2014-12-02 17:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, jhs
In-Reply-To: <1417539636-12710-1-git-send-email-jiri@resnulli.us>

rcu variant is not correct here. The code is called by updater (rtnl
lock is held), not by reader (no rcu_read_lock is held).

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 net/sched/cls_bpf.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index cbfaf6f..d0de979 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -141,7 +141,7 @@ static unsigned long cls_bpf_get(struct tcf_proto *tp, u32 handle)
 	if (head == NULL)
 		return 0UL;
 
-	list_for_each_entry_rcu(prog, &head->plist, link) {
+	list_for_each_entry(prog, &head->plist, link) {
 		if (prog->handle == handle) {
 			ret = (unsigned long) prog;
 			break;
@@ -337,7 +337,7 @@ static void cls_bpf_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 	struct cls_bpf_head *head = rtnl_dereference(tp->root);
 	struct cls_bpf_prog *prog;
 
-	list_for_each_entry_rcu(prog, &head->plist, link) {
+	list_for_each_entry(prog, &head->plist, link) {
 		if (arg->count < arg->skip)
 			goto skip;
 		if (arg->fn(tp, (unsigned long) prog, arg) < 0) {
-- 
1.9.3

^ permalink raw reply related

* [patch net-next 4/6] net_sched: cls_flow: remove faulty use of list_for_each_entry_rcu
From: Jiri Pirko @ 2014-12-02 17:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, jhs
In-Reply-To: <1417539636-12710-1-git-send-email-jiri@resnulli.us>

rcu variant is not correct here. The code is called by updater (rtnl
lock is held), not by reader (no rcu_read_lock is held).

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 net/sched/cls_flow.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index 4ac515f..a605739 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -578,7 +578,7 @@ static unsigned long flow_get(struct tcf_proto *tp, u32 handle)
 	struct flow_head *head = rtnl_dereference(tp->root);
 	struct flow_filter *f;
 
-	list_for_each_entry_rcu(f, &head->filters, list)
+	list_for_each_entry(f, &head->filters, list)
 		if (f->handle == handle)
 			return (unsigned long)f;
 	return 0;
@@ -654,7 +654,7 @@ static void flow_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 	struct flow_head *head = rtnl_dereference(tp->root);
 	struct flow_filter *f;
 
-	list_for_each_entry_rcu(f, &head->filters, list) {
+	list_for_each_entry(f, &head->filters, list) {
 		if (arg->count < arg->skip)
 			goto skip;
 		if (arg->fn(tp, (unsigned long)f, arg) < 0) {
-- 
1.9.3

^ permalink raw reply related

* [patch net-next 1/6] net_sched: cls_basic: remove unnecessary iteration and use passed arg
From: Jiri Pirko @ 2014-12-02 17:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, jhs
In-Reply-To: <1417539636-12710-1-git-send-email-jiri@resnulli.us>

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 net/sched/cls_basic.c | 16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index cd61280..1c122c7 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -113,18 +113,12 @@ static void basic_destroy(struct tcf_proto *tp)
 
 static int basic_delete(struct tcf_proto *tp, unsigned long arg)
 {
-	struct basic_head *head = rtnl_dereference(tp->root);
-	struct basic_filter *t, *f = (struct basic_filter *) arg;
-
-	list_for_each_entry(t, &head->flist, link)
-		if (t == f) {
-			list_del_rcu(&t->link);
-			tcf_unbind_filter(tp, &t->res);
-			call_rcu(&t->rcu, basic_delete_filter);
-			return 0;
-		}
+	struct basic_filter *f = (struct basic_filter *) arg;
 
-	return -ENOENT;
+	list_del_rcu(&f->link);
+	tcf_unbind_filter(tp, &f->res);
+	call_rcu(&f->rcu, basic_delete_filter);
+	return 0;
 }
 
 static const struct nla_policy basic_policy[TCA_BASIC_MAX + 1] = {
-- 
1.9.3

^ permalink raw reply related

* [patch net-next 0/6] net_sched: cls_*: couple of fixes
From: Jiri Pirko @ 2014-12-02 17:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, jhs

Jiri Pirko (6):
  net_sched: cls_basic: remove unnecessary iteration and use passed arg
  net_sched: cls_bpf: remove unnecessary iteration and use passed arg
  net_sched: cls_bpf: remove faulty use of list_for_each_entry_rcu
  net_sched: cls_flow: remove faulty use of list_for_each_entry_rcu
  net_sched: cls_flow: remove duplicate assignments
  net_sched: cls_cgroup: remove unnecessary if

 net/sched/cls_basic.c  | 16 +++++-----------
 net/sched/cls_bpf.c    | 21 +++++++--------------
 net/sched/cls_cgroup.c |  6 +-----
 net/sched/cls_flow.c   |  7 ++-----
 4 files changed, 15 insertions(+), 35 deletions(-)

-- 
1.9.3

^ permalink raw reply

* Re: net-PA Semi: Deletion of unnecessary checks before the function call "pci_dev_put"
From: Johannes Berg @ 2014-12-02 16:53 UTC (permalink / raw)
  To: Julia Lawall
  Cc: SF Markus Elfring, Lino Sanfilippo, Olof Johansson,
	netdev-u79uwXL29TY76Z2rM5mHXA, backports-u79uwXL29TY76Z2rM5mHXA,
	LKML, kernel-janitors-u79uwXL29TY76Z2rM5mHXA, Luis R. Rodriguez
In-Reply-To: <alpine.DEB.2.02.1412012134290.2076-bi+AKbBUZKagILUCTcTcHdKyNwTtLsGr@public.gmane.org>

On Mon, 2014-12-01 at 21:34 +0100, Julia Lawall wrote:

> > So this kind of evolution is no problem for the (automated) backports
> > using the backports project - although it can be difficult to detect
> > such a thing is needed.
> 
> That is exactly the problem...

I'm not convinced though that it should stop such progress in mainline.

johannes

^ permalink raw reply

* Re: [PATCH net-next] rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()
From: Eric Dumazet @ 2014-12-02 16:52 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Mahesh Bandewar, netdev, David Miller, Eric Dumazet,
	Toshiaki Makita
In-Reply-To: <547DEBAC.5000005@cumulusnetworks.com>

On Tue, 2014-12-02 at 08:41 -0800, Roopa Prabhu wrote:

> fair point. But the commit that moved things around was done to handle 
> cases where,
> the ndo_uninit() already sends some notifications to userspace for the 
> changes
> during uninit (example bond driver).
> 
> The only point i was making was that the dellink after the ndo_uninit in 
> your
> case now contains state that was prior to uninit for these drivers.

I think Mahesh forgot to mention your patch probably broke some drivers.

calling rtmsg_ifinfo() after uninit() is probably breaking dummy device,
as it does :

static void dummy_dev_uninit(struct net_device *dev)
{
	free_percpu(dev->dstats);
}

It looks like 'fixing' ipvlan is not going to help.

Instead of checking all drivers for such interesting side effects,
and revert your patch, we had the idea of this solution.

^ permalink raw reply

* [PATCH iproute2] ip route: don't assume default route
From: =?UTF-8?q?Pavel=20=C5=A0imerda?= @ 2014-12-02 16:45 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger

Just print the help when "ip route del" is called without any other
arguments.

Resolves:

 * https://bugzilla.redhat.com/show_bug.cgi?id=997965

Signed-off-by: Pavel Šimerda <psimerda@redhat.com>
---
 ip/iproute.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/ip/iproute.c b/ip/iproute.c
index d77b1e3..c171f29 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -975,6 +975,9 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 		argc--; argv++;
 	}
 
+	if (!dst_ok)
+		usage();
+
 	if (d || nhs_ok)  {
 		int idx;
 
-- 
1.8.5.5

^ permalink raw reply related

* Re: [PATCH net-next] rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()
From: Roopa Prabhu @ 2014-12-02 16:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mahesh Bandewar, netdev, David Miller, Eric Dumazet,
	Toshiaki Makita
In-Reply-To: <1417537150.5303.67.camel@edumazet-glaptop2.roam.corp.google.com>

On 12/2/14, 8:19 AM, Eric Dumazet wrote:
> On Tue, 2014-12-02 at 08:08 -0800, Roopa Prabhu wrote:
>
>> interestingly I have never seen this. We use this heavily with most
>> other logical devices.
>> Which tells me most logical devices do have checks in their fill_info.
>> The patch idea is good. My only concern is stale information
>> in the DELLINK notification. Because,  ndo_uninit() does do a lot of
>> cleanup, sending
>> newlink's for some of these cleanup changes. And now with your patch the
>> dellink notification
>> skb probably  contains information that has been already deleted by
>> ndo_uninit ?
> What do you mean ? We send a message and device is deleted.
>
> Its an asynchronous message.
>
> At the time you read it, lot of things might already have changed, no
> matter how careful you are.
>
> This patch permits to get a precise snapshot of device info right before
> dismantle (like stats counters for accounting purpose)

fair point. But the commit that moved things around was done to handle 
cases where,
the ndo_uninit() already sends some notifications to userspace for the 
changes
during uninit (example bond driver).

The only point i was making was that the dellink after the ndo_uninit in 
your
case now contains state that was prior to uninit for these drivers.

For the bug being discussed, i was thinking fill_info in ipvlan should 
be fixed to handle this correctly.
  ie fill_info could be fixed to not barf if the private information is 
not present and just return nothing.
But understand that this will not give you the state prior to uninit for 
some drivers.

^ permalink raw reply

* Re: e1000_netpoll(): disable_irq() triggers might_sleep() on linux-next
From: Sabrina Dubroca @ 2014-12-02 16:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, netdev, linux-kernel, jeffrey.t.kirsher
In-Reply-To: <20141029193603.GS12706@worktop.programming.kicks-ass.net>

Hello, sorry for the delay.

2014-10-29, 20:36:03 +0100, Peter Zijlstra wrote:
> On Wed, Oct 29, 2014 at 07:33:00PM +0100, Thomas Gleixner wrote:
> > Yuck. No. You are just papering over the problem.
> > 
> > What happens if you add 'threadirqs' to the kernel command line? Or if
> > the interrupt line is shared with a real threaded interrupt user?
> > 
> > The proper solution is to have a poll_lock for e1000 which serializes
> > the hardware interrupt against netpoll instead of using
> > disable/enable_irq().
> > 
> > In fact that's less expensive than the disable/enable_irq() dance and
> > the chance of contention is pretty low. If done right it will be a
> > NOOP for the CONFIG_NET_POLL_CONTROLLER=n case.
> > 
> 
> OK a little something like so then I suppose.. But I suspect most all
> the network drivers will need this and maybe more, disable_irq() is a
> popular little thing and we 'just' changed semantics on them.
> 
> ---
>  drivers/net/ethernet/intel/e1000/e1000.h      |  2 ++
>  drivers/net/ethernet/intel/e1000/e1000_main.c | 22 +++++++++++++++++-----
>  kernel/irq/manage.c                           |  2 +-
>  3 files changed, 20 insertions(+), 6 deletions(-)

I've been running with variants of this patch, things seem ok.

As noted earlier, there are a lot of drivers doing this disable_irq +
irq_handler + enable_irq sequence.  I found about 60.
Many already take a lock in the interrupt handler, and look like we
could just remove the call to disable_irq (example: cp_interrupt,
drivers/net/ethernet/realtek/8139cp.c).

Here's how I modified your patch.  The locking compiles away if
CONFIG_NET_POLL_CONTROLLER=n.

I can work on converting all the drivers from disable_irq to
netpoll_irq_lock, if that's okay with you.

In igb there's also a synchronize_irq() called from the netpoll
controller (in igb_irq_disable()), I think a similar locking scheme
would work.
I also saw a few disable_irq_nosync and disable_percpu_irq. These are
okay?


Thanks.

---

diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
index 69707108d23c..79444125b9bd 100644
--- a/drivers/net/ethernet/intel/e1000/e1000.h
+++ b/drivers/net/ethernet/intel/e1000/e1000.h
@@ -68,6 +68,7 @@
 #include <linux/mii.h>
 #include <linux/ethtool.h>
 #include <linux/if_vlan.h>
+#include <linux/netpoll.h>
 
 #define BAR_0		0
 #define BAR_1		1
@@ -323,6 +324,8 @@ struct e1000_adapter {
 	struct delayed_work watchdog_task;
 	struct delayed_work fifo_stall_task;
 	struct delayed_work phy_info_task;
+
+	struct netpoll_irq_lock netpoll_lock;
 };
 
 enum e1000_state_t {
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 24f3986cfae2..5749a27e5c5e 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -32,6 +32,7 @@
 #include <linux/prefetch.h>
 #include <linux/bitops.h>
 #include <linux/if_vlan.h>
+#include <linux/netpoll.h>
 
 char e1000_driver_name[] = "e1000";
 static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
@@ -1313,6 +1314,7 @@ static int e1000_sw_init(struct e1000_adapter *adapter)
 	e1000_irq_disable(adapter);
 
 	spin_lock_init(&adapter->stats_lock);
+	netpoll_irq_lock_init(&adapter->netpoll_lock);
 
 	set_bit(__E1000_DOWN, &adapter->flags);
 
@@ -3751,10 +3753,8 @@ void e1000_update_stats(struct e1000_adapter *adapter)
  * @irq: interrupt number
  * @data: pointer to a network interface device structure
  **/
-static irqreturn_t e1000_intr(int irq, void *data)
+static irqreturn_t __e1000_intr(int irq, struct e1000_adapter *adapter)
 {
-	struct net_device *netdev = data;
-	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
 	u32 icr = er32(ICR);
 
@@ -3796,6 +3796,19 @@ static irqreturn_t e1000_intr(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
+static irqreturn_t e1000_intr(int irq, void *data)
+{
+	struct net_device *netdev = data;
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	irqreturn_t ret;
+
+	netpoll_irq_lock(&adapter->netpoll_lock);
+	ret = __e1000_intr(irq, adapter);
+	netpoll_irq_unlock(&adapter->netpoll_lock);
+
+	return ret;
+}
+
 /**
  * e1000_clean - NAPI Rx polling callback
  * @adapter: board private structure
@@ -5220,9 +5233,7 @@ static void e1000_netpoll(struct net_device *netdev)
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 
-	disable_irq(adapter->pdev->irq);
 	e1000_intr(adapter->pdev->irq, netdev);
-	enable_irq(adapter->pdev->irq);
 }
 #endif
 
diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
index b25ee9ffdbe6..a171f1a50e0e 100644
--- a/include/linux/netpoll.h
+++ b/include/linux/netpoll.h
@@ -117,4 +117,35 @@ static inline bool netpoll_tx_running(struct net_device *dev)
 }
 #endif
 
+struct netpoll_irq_lock {
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	spinlock_t lock;
+#endif
+};
+
+#ifdef CONFIG_NET_POLL_CONTROLLER
+static inline void netpoll_irq_lock_init(struct netpoll_irq_lock *np_lock)
+{
+	spin_lock_init(&np_lock->lock);
+}
+static inline void netpoll_irq_lock(struct netpoll_irq_lock *np_lock)
+{
+	spin_lock(&np_lock->lock);
+}
+static inline void netpoll_irq_unlock(struct netpoll_irq_lock *np_lock)
+{
+	spin_unlock(&np_lock->lock);
+}
+#else
+static inline void netpoll_irq_lock_init(struct netpoll_irq_lock *np_lock)
+{
+}
+static inline void netpoll_irq_lock(struct netpoll_irq_lock *np_lock)
+{
+}
+static inline void netpoll_irq_unlock(struct netpoll_irq_lock *np_lock)
+{
+}
+#endif
+
 #endif



> 
> diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
> index 69707108d23c..3f48609f2318 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000.h
> +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> @@ -323,6 +323,8 @@ struct e1000_adapter {
>  	struct delayed_work watchdog_task;
>  	struct delayed_work fifo_stall_task;
>  	struct delayed_work phy_info_task;
> +
> +	spinlock_t irq_lock;
>  };
>  
>  enum e1000_state_t {
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 5f6aded512f5..d12cbffe2149 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -1310,6 +1310,7 @@ static int e1000_sw_init(struct e1000_adapter *adapter)
>  	e1000_irq_disable(adapter);
>  
>  	spin_lock_init(&adapter->stats_lock);
> +	spin_lock_init(&adapter->irq_lock);
>  
>  	set_bit(__E1000_DOWN, &adapter->flags);
>  
> @@ -3748,10 +3749,8 @@ void e1000_update_stats(struct e1000_adapter *adapter)
>   * @irq: interrupt number
>   * @data: pointer to a network interface device structure
>   **/
> -static irqreturn_t e1000_intr(int irq, void *data)
> +static irqreturn_t __e1000_intr(int irq, struct e1000_adapter *adapter)
>  {
> -	struct net_device *netdev = data;
> -	struct e1000_adapter *adapter = netdev_priv(netdev);
>  	struct e1000_hw *hw = &adapter->hw;
>  	u32 icr = er32(ICR);
>  
> @@ -3793,6 +3792,19 @@ static irqreturn_t e1000_intr(int irq, void *data)
>  	return IRQ_HANDLED;
>  }
>  
> +static irqreturn_t e1000_intr(int irq, void *data)
> +{
> +	struct net_device *netdev = data;
> +	struct e1000_adapter *adapter = netdev_priv(netdev);
> +	irqreturn_t ret;
> +
> +	spin_lock(&adapter->irq_lock);
> +	ret = __e1000_intr(irq, adapter);
> +	spin_unlock(&adapter->irq_lock);
> +
> +	return ret;
> +}
> +
>  /**
>   * e1000_clean - NAPI Rx polling callback
>   * @adapter: board private structure
> @@ -5217,9 +5229,9 @@ static void e1000_netpoll(struct net_device *netdev)
>  {
>  	struct e1000_adapter *adapter = netdev_priv(netdev);
>  
> -	disable_irq(adapter->pdev->irq);
> +	spin_lock(&adapter->irq_lock)
>  	e1000_intr(adapter->pdev->irq, netdev);
> -	enable_irq(adapter->pdev->irq);
> +	spin_unlock(&adapter->irq_lock)
>  }
>  #endif
>  
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 0a9104b4608b..b5a4a06bf2fd 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -427,7 +427,7 @@ EXPORT_SYMBOL(disable_irq_nosync);
>   *	to complete before returning. If you use this function while
>   *	holding a resource the IRQ handler may need you will deadlock.
>   *
> - *	This function may be called - with care - from IRQ context.
> + *	This function may _NOT_ be called from IRQ context.
>   */
>  void disable_irq(unsigned int irq)
>  {

-- 
Sabrina

^ permalink raw reply related

* Re: [PATCH net-next] rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()
From: Eric Dumazet @ 2014-12-02 16:19 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Mahesh Bandewar, netdev, David Miller, Eric Dumazet,
	Toshiaki Makita
In-Reply-To: <547DE418.9000309@cumulusnetworks.com>

On Tue, 2014-12-02 at 08:08 -0800, Roopa Prabhu wrote:

> interestingly I have never seen this. We use this heavily with most 
> other logical devices.
> Which tells me most logical devices do have checks in their fill_info.
> The patch idea is good. My only concern is stale information
> in the DELLINK notification. Because,  ndo_uninit() does do a lot of 
> cleanup, sending
> newlink's for some of these cleanup changes. And now with your patch the 
> dellink notification
> skb probably  contains information that has been already deleted by 
> ndo_uninit ?

What do you mean ? We send a message and device is deleted.

Its an asynchronous message.

At the time you read it, lot of things might already have changed, no
matter how careful you are.

This patch permits to get a precise snapshot of device info right before
dismantle (like stats counters for accounting purpose)

^ permalink raw reply

* [PATCH net-next V3 1/2] ethtool: Support for configurable RSS hash function
From: Amir Vadai @ 2014-12-02 16:12 UTC (permalink / raw)
  To: David S. Miller, Ben Hutchings
  Cc: netdev, Or Gerlitz, Eyal Perry, Yevgeny Petrilin, Tom Lendacky,
	Ariel Elior, Prashant Sreedharan, Michael Chan, Hariprasad S,
	Sathya Perla, Subbu Seetharaman, Ajit Khaparde, Jeff Kirsher,
	Jesse Brandeburg, Bruce Allan, Carolyn Wyborny, Don Skidmore,
	Greg Rose, Matthew Vick, John Ronciak, Mitch Williams, Amir Vadai
In-Reply-To: <1417536731-11211-1-git-send-email-amirv@mellanox.com>

From: Eyal Perry <eyalpe@mellanox.com>

This patch extends the set/get_rxfh ethtool-options for getting or
setting the RSS hash function.

It modifies drivers implementation of set/get_rxfh accordingly.

This change also delegates the responsibility of checking whether a
modification to a certain RX flow hash parameter is supported to the
driver implementation of set_rxfh.

User-kernel API is done through the new hfunc bitmask field in the
ethtool_rxfh struct. A bit set in the hfunc field is corresponding to an
index in the new string-set ETH_SS_RSS_HASH_FUNCS.

Got approval from most of the relevant driver maintainers that their
driver is using Toeplitz, and for the few that didn't answered, also
assumed it is Toeplitz.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Ariel Elior <ariel.elior@qlogic.com>
Cc: Prashant Sreedharan <prashant@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Bruce Allan <bruce.w.allan@intel.com>
Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
Cc: Don Skidmore <donald.c.skidmore@intel.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Cc: Matthew Vick <matthew.vick@intel.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Mitch Williams <mitch.a.williams@intel.com>
Cc: Amir Vadai <amirv@mellanox.com>
Cc: Solarflare linux maintainers <linux-net-drivers@solarflare.com>
Cc: Shradha Shah <sshah@solarflare.com>
Cc: Shreyas Bhatewara <sbhatewara@vmware.com>
Cc: "VMware, Inc." <pv-drivers@vmware.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c       | 11 +++-
 .../net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c    | 20 ++++++-
 drivers/net/ethernet/broadcom/tg3.c                | 20 ++++++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    | 18 +++++-
 drivers/net/ethernet/emulex/benet/be_ethtool.c     | 12 +++-
 drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c   | 12 +++-
 drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c | 17 +++++-
 drivers/net/ethernet/intel/igb/igb_ethtool.c       | 16 ++++-
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c    | 13 +++-
 drivers/net/ethernet/sfc/ethtool.c                 | 18 ++++--
 drivers/net/vmxnet3/vmxnet3_ethtool.c              | 15 ++++-
 include/linux/ethtool.h                            | 42 +++++++++----
 include/uapi/linux/ethtool.h                       | 10 +++-
 net/core/ethtool.c                                 | 69 ++++++++++++----------
 14 files changed, 223 insertions(+), 70 deletions(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c b/drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c
index 95d4453..ebf4893 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c
@@ -511,7 +511,8 @@ static u32 xgbe_get_rxfh_indir_size(struct net_device *netdev)
 	return ARRAY_SIZE(pdata->rss_table);
 }
 
-static int xgbe_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key)
+static int xgbe_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key,
+			 u8 *hfunc)
 {
 	struct xgbe_prv_data *pdata = netdev_priv(netdev);
 	unsigned int i;
@@ -525,16 +526,22 @@ static int xgbe_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key)
 	if (key)
 		memcpy(key, pdata->rss_key, sizeof(pdata->rss_key));
 
+	if (hfunc)
+		*hfunc = ETH_RSS_HASH_TOP;
+
 	return 0;
 }
 
 static int xgbe_set_rxfh(struct net_device *netdev, const u32 *indir,
-			 const u8 *key)
+			 const u8 *key, const u8 hfunc)
 {
 	struct xgbe_prv_data *pdata = netdev_priv(netdev);
 	struct xgbe_hw_if *hw_if = &pdata->hw_if;
 	unsigned int ret;
 
+	if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
+		return -EOPNOTSUPP;
+
 	if (indir) {
 		ret = hw_if->set_rss_lookup_table(pdata, indir);
 		if (ret)
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 1edc931..ffe4e00 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -3358,12 +3358,18 @@ static u32 bnx2x_get_rxfh_indir_size(struct net_device *dev)
 	return T_ETH_INDIRECTION_TABLE_SIZE;
 }
 
-static int bnx2x_get_rxfh(struct net_device *dev, u32 *indir, u8 *key)
+static int bnx2x_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
+			  u8 *hfunc)
 {
 	struct bnx2x *bp = netdev_priv(dev);
 	u8 ind_table[T_ETH_INDIRECTION_TABLE_SIZE] = {0};
 	size_t i;
 
+	if (hfunc)
+		*hfunc = ETH_RSS_HASH_TOP;
+	if (!indir)
+		return 0;
+
 	/* Get the current configuration of the RSS indirection table */
 	bnx2x_get_rss_ind_table(&bp->rss_conf_obj, ind_table);
 
@@ -3383,11 +3389,21 @@ static int bnx2x_get_rxfh(struct net_device *dev, u32 *indir, u8 *key)
 }
 
 static int bnx2x_set_rxfh(struct net_device *dev, const u32 *indir,
-			  const u8 *key)
+			  const u8 *key, const u8 hfunc)
 {
 	struct bnx2x *bp = netdev_priv(dev);
 	size_t i;
 
+	/* We require at least one supported parameter to be changed and no
+	 * change in any of the unsupported parameters
+	 */
+	if (key ||
+	    (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP))
+		return -EOPNOTSUPP;
+
+	if (!indir)
+		return 0;
+
 	for (i = 0; i < T_ETH_INDIRECTION_TABLE_SIZE; i++) {
 		/*
 		 * The same as in bnx2x_get_rxfh: we can't use a memcpy()
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 43fd1b7..bb48a61 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -12561,22 +12561,38 @@ static u32 tg3_get_rxfh_indir_size(struct net_device *dev)
 	return size;
 }
 
-static int tg3_get_rxfh(struct net_device *dev, u32 *indir, u8 *key)
+static int tg3_get_rxfh(struct net_device *dev, u32 *indir, u8 *key, u8 *hfunc)
 {
 	struct tg3 *tp = netdev_priv(dev);
 	int i;
 
+	if (hfunc)
+		*hfunc = ETH_RSS_HASH_TOP;
+	if (!indir)
+		return 0;
+
 	for (i = 0; i < TG3_RSS_INDIR_TBL_SIZE; i++)
 		indir[i] = tp->rss_ind_tbl[i];
 
 	return 0;
 }
 
-static int tg3_set_rxfh(struct net_device *dev, const u32 *indir, const u8 *key)
+static int tg3_set_rxfh(struct net_device *dev, const u32 *indir, const u8 *key,
+			const u8 hfunc)
 {
 	struct tg3 *tp = netdev_priv(dev);
 	size_t i;
 
+	/* We require at least one supported parameter to be changed and no
+	 * change in any of the unsupported parameters
+	 */
+	if (key ||
+	    (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP))
+		return -EOPNOTSUPP;
+
+	if (!indir)
+		return 0;
+
 	for (i = 0; i < TG3_RSS_INDIR_TBL_SIZE; i++)
 		tp->rss_ind_tbl[i] = indir[i];
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 3aea82b..e7342bc 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -2923,21 +2923,35 @@ static u32 get_rss_table_size(struct net_device *dev)
 	return pi->rss_size;
 }
 
-static int get_rss_table(struct net_device *dev, u32 *p, u8 *key)
+static int get_rss_table(struct net_device *dev, u32 *p, u8 *key, u8 *hfunc)
 {
 	const struct port_info *pi = netdev_priv(dev);
 	unsigned int n = pi->rss_size;
 
+	if (hfunc)
+		*hfunc = ETH_RSS_HASH_TOP;
+	if (!p)
+		return 0;
 	while (n--)
 		p[n] = pi->rss[n];
 	return 0;
 }
 
-static int set_rss_table(struct net_device *dev, const u32 *p, const u8 *key)
+static int set_rss_table(struct net_device *dev, const u32 *p, const u8 *key,
+			 const u8 hfunc)
 {
 	unsigned int i;
 	struct port_info *pi = netdev_priv(dev);
 
+	/* We require at least one supported parameter to be changed and no
+	 * change in any of the unsupported parameters
+	 */
+	if (key ||
+	    (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP))
+		return -EOPNOTSUPP;
+	if (!p)
+		return 0;
+
 	for (i = 0; i < pi->rss_size; i++)
 		pi->rss[i] = p[i];
 	if (pi->adapter->flags & FULL_INIT_DONE)
diff --git a/drivers/net/ethernet/emulex/benet/be_ethtool.c b/drivers/net/ethernet/emulex/benet/be_ethtool.c
index e42a791..73a500c 100644
--- a/drivers/net/ethernet/emulex/benet/be_ethtool.c
+++ b/drivers/net/ethernet/emulex/benet/be_ethtool.c
@@ -1171,7 +1171,8 @@ static u32 be_get_rxfh_key_size(struct net_device *netdev)
 	return RSS_HASH_KEY_LEN;
 }
 
-static int be_get_rxfh(struct net_device *netdev, u32 *indir, u8 *hkey)
+static int be_get_rxfh(struct net_device *netdev, u32 *indir, u8 *hkey,
+		       u8 *hfunc)
 {
 	struct be_adapter *adapter = netdev_priv(netdev);
 	int i;
@@ -1185,16 +1186,23 @@ static int be_get_rxfh(struct net_device *netdev, u32 *indir, u8 *hkey)
 	if (hkey)
 		memcpy(hkey, rss->rss_hkey, RSS_HASH_KEY_LEN);
 
+	if (hfunc)
+		*hfunc = ETH_RSS_HASH_TOP;
+
 	return 0;
 }
 
 static int be_set_rxfh(struct net_device *netdev, const u32 *indir,
-		       const u8 *hkey)
+		       const u8 *hkey, const u8 hfunc)
 {
 	int rc = 0, i, j;
 	struct be_adapter *adapter = netdev_priv(netdev);
 	u8 rsstable[RSS_INDIR_TABLE_LEN];
 
+	/* We do not allow change in unsupported parameters */
+	if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
+		return -EOPNOTSUPP;
+
 	if (indir) {
 		struct be_rx_obj *rxo;
 
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c b/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
index 2d04464..651f53b 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
@@ -916,11 +916,15 @@ static u32 fm10k_get_rssrk_size(struct net_device *netdev)
 	return FM10K_RSSRK_SIZE * FM10K_RSSRK_ENTRIES_PER_REG;
 }
 
-static int fm10k_get_rssh(struct net_device *netdev, u32 *indir, u8 *key)
+static int fm10k_get_rssh(struct net_device *netdev, u32 *indir, u8 *key,
+			  u8 *hfunc)
 {
 	struct fm10k_intfc *interface = netdev_priv(netdev);
 	int i, err;
 
+	if (hfunc)
+		*hfunc = ETH_RSS_HASH_TOP;
+
 	err = fm10k_get_reta(netdev, indir);
 	if (err || !key)
 		return err;
@@ -932,12 +936,16 @@ static int fm10k_get_rssh(struct net_device *netdev, u32 *indir, u8 *key)
 }
 
 static int fm10k_set_rssh(struct net_device *netdev, const u32 *indir,
-			  const u8 *key)
+			  const u8 *key, const u8 hfunc)
 {
 	struct fm10k_intfc *interface = netdev_priv(netdev);
 	struct fm10k_hw *hw = &interface->hw;
 	int i, err;
 
+	/* We do not allow change in unsupported parameters */
+	if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
+		return -EOPNOTSUPP;
+
 	err = fm10k_set_reta(netdev, indir);
 	if (err || !key)
 		return err;
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c b/drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c
index 69a269b..69b97ba 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c
@@ -627,13 +627,19 @@ static u32 i40evf_get_rxfh_indir_size(struct net_device *netdev)
  *
  * Reads the indirection table directly from the hardware. Always returns 0.
  **/
-static int i40evf_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key)
+static int i40evf_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key,
+			   u8 *hfunc)
 {
 	struct i40evf_adapter *adapter = netdev_priv(netdev);
 	struct i40e_hw *hw = &adapter->hw;
 	u32 hlut_val;
 	int i, j;
 
+	if (hfunc)
+		*hfunc = ETH_RSS_HASH_TOP;
+	if (!indir)
+		return 0;
+
 	for (i = 0, j = 0; i <= I40E_VFQF_HLUT_MAX_INDEX; i++) {
 		hlut_val = rd32(hw, I40E_VFQF_HLUT(i));
 		indir[j++] = hlut_val & 0xff;
@@ -654,13 +660,20 @@ static int i40evf_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key)
  * returns 0 after programming the table.
  **/
 static int i40evf_set_rxfh(struct net_device *netdev, const u32 *indir,
-			   const u8 *key)
+			   const u8 *key, const u8 hfunc)
 {
 	struct i40evf_adapter *adapter = netdev_priv(netdev);
 	struct i40e_hw *hw = &adapter->hw;
 	u32 hlut_val;
 	int i, j;
 
+	/* We do not allow change in unsupported parameters */
+	if (key ||
+	    (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP))
+		return -EOPNOTSUPP;
+	if (!indir)
+		return 0;
+
 	for (i = 0, j = 0; i <= I40E_VFQF_HLUT_MAX_INDEX; i++) {
 		hlut_val = indir[j++];
 		hlut_val |= indir[j++] << 8;
diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 02cfd3b..d5673eb 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2842,11 +2842,16 @@ static u32 igb_get_rxfh_indir_size(struct net_device *netdev)
 	return IGB_RETA_SIZE;
 }
 
-static int igb_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key)
+static int igb_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key,
+			u8 *hfunc)
 {
 	struct igb_adapter *adapter = netdev_priv(netdev);
 	int i;
 
+	if (hfunc)
+		*hfunc = ETH_RSS_HASH_TOP;
+	if (!indir)
+		return 0;
 	for (i = 0; i < IGB_RETA_SIZE; i++)
 		indir[i] = adapter->rss_indir_tbl[i];
 
@@ -2889,13 +2894,20 @@ void igb_write_rss_indir_tbl(struct igb_adapter *adapter)
 }
 
 static int igb_set_rxfh(struct net_device *netdev, const u32 *indir,
-			const u8 *key)
+			const u8 *key, const u8 hfunc)
 {
 	struct igb_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
 	int i;
 	u32 num_queues;
 
+	/* We do not allow change in unsupported parameters */
+	if (key ||
+	    (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP))
+		return -EOPNOTSUPP;
+	if (!indir)
+		return 0;
+
 	num_queues = adapter->rss_queues;
 
 	switch (hw->mac.type) {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index c45e06a..28c3fc5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -978,7 +978,8 @@ static u32 mlx4_en_get_rxfh_key_size(struct net_device *netdev)
 	return MLX4_EN_RSS_KEY_SIZE;
 }
 
-static int mlx4_en_get_rxfh(struct net_device *dev, u32 *ring_index, u8 *key)
+static int mlx4_en_get_rxfh(struct net_device *dev, u32 *ring_index, u8 *key,
+			    u8 *hfunc)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_en_rss_map *rss_map = &priv->rss_map;
@@ -990,16 +991,20 @@ static int mlx4_en_get_rxfh(struct net_device *dev, u32 *ring_index, u8 *key)
 	rss_rings = 1 << ilog2(rss_rings);
 
 	while (n--) {
+		if (!ring_index)
+			break;
 		ring_index[n] = rss_map->qps[n % rss_rings].qpn -
 			rss_map->base_qpn;
 	}
 	if (key)
 		memcpy(key, priv->rss_key, MLX4_EN_RSS_KEY_SIZE);
+	if (hfunc)
+		*hfunc = ETH_RSS_HASH_TOP;
 	return err;
 }
 
 static int mlx4_en_set_rxfh(struct net_device *dev, const u32 *ring_index,
-			    const u8 *key)
+			    const u8 *key, const u8 hfunc)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_en_dev *mdev = priv->mdev;
@@ -1008,6 +1013,10 @@ static int mlx4_en_set_rxfh(struct net_device *dev, const u32 *ring_index,
 	int i;
 	int rss_rings = 0;
 
+	/* We do not allow change in unsupported parameters */
+	if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
+		return -EOPNOTSUPP;
+
 	/* Calculate RSS table size and make sure flows are spread evenly
 	 * between rings
 	 */
diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c
index cad258a..4835bc0 100644
--- a/drivers/net/ethernet/sfc/ethtool.c
+++ b/drivers/net/ethernet/sfc/ethtool.c
@@ -1086,19 +1086,29 @@ static u32 efx_ethtool_get_rxfh_indir_size(struct net_device *net_dev)
 		0 : ARRAY_SIZE(efx->rx_indir_table));
 }
 
-static int efx_ethtool_get_rxfh(struct net_device *net_dev, u32 *indir, u8 *key)
+static int efx_ethtool_get_rxfh(struct net_device *net_dev, u32 *indir, u8 *key,
+				u8 *hfunc)
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
 
-	memcpy(indir, efx->rx_indir_table, sizeof(efx->rx_indir_table));
+	if (hfunc)
+		*hfunc = ETH_RSS_HASH_TOP;
+	if (indir)
+		memcpy(indir, efx->rx_indir_table, sizeof(efx->rx_indir_table));
 	return 0;
 }
 
-static int efx_ethtool_set_rxfh(struct net_device *net_dev,
-				const u32 *indir, const u8 *key)
+static int efx_ethtool_set_rxfh(struct net_device *net_dev, const u32 *indir,
+				const u8 *key, const u8 hfunc)
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
 
+	/* We do not allow change in unsupported parameters */
+	if (key ||
+	    (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP))
+		return -EOPNOTSUPP;
+	if (!indir)
+		return 0;
 	memcpy(efx->rx_indir_table, indir, sizeof(efx->rx_indir_table));
 	efx->type->rx_push_rss_config(efx);
 	return 0;
diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c b/drivers/net/vmxnet3/vmxnet3_ethtool.c
index b725fd9..b7b5332 100644
--- a/drivers/net/vmxnet3/vmxnet3_ethtool.c
+++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
@@ -583,12 +583,16 @@ vmxnet3_get_rss_indir_size(struct net_device *netdev)
 }
 
 static int
-vmxnet3_get_rss(struct net_device *netdev, u32 *p, u8 *key)
+vmxnet3_get_rss(struct net_device *netdev, u32 *p, u8 *key, u8 *hfunc)
 {
 	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
 	struct UPT1_RSSConf *rssConf = adapter->rss_conf;
 	unsigned int n = rssConf->indTableSize;
 
+	if (hfunc)
+		*hfunc = ETH_RSS_HASH_TOP;
+	if (!p)
+		return 0;
 	while (n--)
 		p[n] = rssConf->indTable[n];
 	return 0;
@@ -596,13 +600,20 @@ vmxnet3_get_rss(struct net_device *netdev, u32 *p, u8 *key)
 }
 
 static int
-vmxnet3_set_rss(struct net_device *netdev, const u32 *p, const u8 *key)
+vmxnet3_set_rss(struct net_device *netdev, const u32 *p, const u8 *key,
+		const u8 hfunc)
 {
 	unsigned int i;
 	unsigned long flags;
 	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
 	struct UPT1_RSSConf *rssConf = adapter->rss_conf;
 
+	/* We do not allow change in unsupported parameters */
+	if (key ||
+	    (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP))
+		return -EOPNOTSUPP;
+	if (!p)
+		return 0;
 	for (i = 0; i < rssConf->indTableSize; i++)
 		rssConf->indTable[i] = p[i];
 
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index c1a2d60..653dc9c 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -59,6 +59,26 @@ enum ethtool_phys_id_state {
 	ETHTOOL_ID_OFF
 };
 
+enum {
+	ETH_RSS_HASH_TOP_BIT, /* Configurable RSS hash function - Toeplitz */
+	ETH_RSS_HASH_XOR_BIT, /* Configurable RSS hash function - Xor */
+
+	/*
+	 * Add your fresh new hash function bits above and remember to update
+	 * rss_hash_func_strings[] in ethtool.c
+	 */
+	ETH_RSS_HASH_FUNCS_COUNT
+};
+
+#define __ETH_RSS_HASH_BIT(bit)	((u32)1 << (bit))
+#define __ETH_RSS_HASH(name)	__ETH_RSS_HASH_BIT(ETH_RSS_HASH_##name##_BIT)
+
+#define ETH_RSS_HASH_TOP	__ETH_RSS_HASH(TOP)
+#define ETH_RSS_HASH_XOR	__ETH_RSS_HASH(XOR)
+
+#define ETH_RSS_HASH_UNKNOWN	0
+#define ETH_RSS_HASH_NO_CHANGE	0
+
 struct net_device;
 
 /* Some generic methods drivers may use in their ethtool_ops */
@@ -158,17 +178,14 @@ static inline u32 ethtool_rxfh_indir_default(u32 index, u32 n_rx_rings)
  *	Returns zero if not supported for this specific device.
  * @get_rxfh_indir_size: Get the size of the RX flow hash indirection table.
  *	Returns zero if not supported for this specific device.
- * @get_rxfh: Get the contents of the RX flow hash indirection table and hash
- *	key.
- *	Will only be called if one or both of @get_rxfh_indir_size and
- *	@get_rxfh_key_size are implemented and return non-zero.
- *	Returns a negative error code or zero.
- * @set_rxfh: Set the contents of the RX flow hash indirection table and/or
- *	hash key.  In case only the indirection table or hash key is to be
- *	changed, the other argument will be %NULL.
- *	Will only be called if one or both of @get_rxfh_indir_size and
- *	@get_rxfh_key_size are implemented and return non-zero.
+ * @get_rxfh: Get the contents of the RX flow hash indirection table, hash key
+ *	and/or hash function.
  *	Returns a negative error code or zero.
+ * @set_rxfh: Set the contents of the RX flow hash indirection table, hash
+ *	key, and/or hash function.  Arguments which are set to %NULL or zero
+ *	will remain unchanged.
+ *	Returns a negative error code or zero. An error code must be returned
+ *	if at least one unsupported change was requested.
  * @get_channels: Get number of channels.
  * @set_channels: Set number of channels.  Returns a negative error code or
  *	zero.
@@ -241,9 +258,10 @@ struct ethtool_ops {
 	int	(*reset)(struct net_device *, u32 *);
 	u32	(*get_rxfh_key_size)(struct net_device *);
 	u32	(*get_rxfh_indir_size)(struct net_device *);
-	int	(*get_rxfh)(struct net_device *, u32 *indir, u8 *key);
+	int	(*get_rxfh)(struct net_device *, u32 *indir, u8 *key,
+			    u8 *hfunc);
 	int	(*set_rxfh)(struct net_device *, const u32 *indir,
-			    const u8 *key);
+			    const u8 *key, const u8 hfunc);
 	void	(*get_channels)(struct net_device *, struct ethtool_channels *);
 	int	(*set_channels)(struct net_device *, struct ethtool_channels *);
 	int	(*get_dump_flag)(struct net_device *, struct ethtool_dump *);
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index eb2095b..5f66d9c 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -534,6 +534,7 @@ struct ethtool_pauseparam {
  * @ETH_SS_NTUPLE_FILTERS: Previously used with %ETHTOOL_GRXNTUPLE;
  *	now deprecated
  * @ETH_SS_FEATURES: Device feature names
+ * @ETH_SS_RSS_HASH_FUNCS: RSS hush function names
  */
 enum ethtool_stringset {
 	ETH_SS_TEST		= 0,
@@ -541,6 +542,7 @@ enum ethtool_stringset {
 	ETH_SS_PRIV_FLAGS,
 	ETH_SS_NTUPLE_FILTERS,
 	ETH_SS_FEATURES,
+	ETH_SS_RSS_HASH_FUNCS,
 };
 
 /**
@@ -884,6 +886,8 @@ struct ethtool_rxfh_indir {
  * @key_size: On entry, the array size of the user buffer for the hash key,
  *	which may be zero.  On return from %ETHTOOL_GRSSH, the size of the
  *	hardware hash key.
+ * @hfunc: Defines the current RSS hash function used by HW (or to be set to).
+ *	Valid values are one of the %ETH_RSS_HASH_*.
  * @rsvd:	Reserved for future extensions.
  * @rss_config: RX ring/queue index for each hash value i.e., indirection table
  *	of @indir_size __u32 elements, followed by hash key of @key_size
@@ -893,14 +897,16 @@ struct ethtool_rxfh_indir {
  * size should be returned.  For %ETHTOOL_SRSSH, an @indir_size of
  * %ETH_RXFH_INDIR_NO_CHANGE means that indir table setting is not requested
  * and a @indir_size of zero means the indir table should be reset to default
- * values.
+ * values. An hfunc of zero means that hash function setting is not requested.
  */
 struct ethtool_rxfh {
 	__u32   cmd;
 	__u32	rss_context;
 	__u32   indir_size;
 	__u32   key_size;
-	__u32	rsvd[2];
+	__u8	hfunc;
+	__u8	rsvd8[3];
+	__u32	rsvd32;
 	__u32   rss_config[0];
 };
 #define ETH_RXFH_INDIR_NO_CHANGE	0xffffffff
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 715f51f..550892c 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -100,6 +100,12 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
 	[NETIF_F_BUSY_POLL_BIT] =        "busy-poll",
 };
 
+static const char
+rss_hash_func_strings[ETH_RSS_HASH_FUNCS_COUNT][ETH_GSTRING_LEN] = {
+	[ETH_RSS_HASH_TOP_BIT] =	"toeplitz",
+	[ETH_RSS_HASH_XOR_BIT] =	"xor",
+};
+
 static int ethtool_get_features(struct net_device *dev, void __user *useraddr)
 {
 	struct ethtool_gfeatures cmd = {
@@ -185,6 +191,9 @@ static int __ethtool_get_sset_count(struct net_device *dev, int sset)
 	if (sset == ETH_SS_FEATURES)
 		return ARRAY_SIZE(netdev_features_strings);
 
+	if (sset == ETH_SS_RSS_HASH_FUNCS)
+		return ARRAY_SIZE(rss_hash_func_strings);
+
 	if (ops->get_sset_count && ops->get_strings)
 		return ops->get_sset_count(dev, sset);
 	else
@@ -199,6 +208,9 @@ static void __ethtool_get_strings(struct net_device *dev,
 	if (stringset == ETH_SS_FEATURES)
 		memcpy(data, netdev_features_strings,
 			sizeof(netdev_features_strings));
+	else if (stringset == ETH_SS_RSS_HASH_FUNCS)
+		memcpy(data, rss_hash_func_strings,
+		       sizeof(rss_hash_func_strings));
 	else
 		/* ops->get_strings is valid because checked earlier */
 		ops->get_strings(dev, stringset, data);
@@ -618,7 +630,7 @@ static noinline_for_stack int ethtool_get_rxfh_indir(struct net_device *dev,
 	if (!indir)
 		return -ENOMEM;
 
-	ret = dev->ethtool_ops->get_rxfh(dev, indir, NULL);
+	ret = dev->ethtool_ops->get_rxfh(dev, indir, NULL, NULL);
 	if (ret)
 		goto out;
 
@@ -679,7 +691,7 @@ static noinline_for_stack int ethtool_set_rxfh_indir(struct net_device *dev,
 			goto out;
 	}
 
-	ret = ops->set_rxfh(dev, indir, NULL);
+	ret = ops->set_rxfh(dev, indir, NULL, ETH_RSS_HASH_NO_CHANGE);
 
 out:
 	kfree(indir);
@@ -697,12 +709,11 @@ static noinline_for_stack int ethtool_get_rxfh(struct net_device *dev,
 	u32 total_size;
 	u32 indir_bytes;
 	u32 *indir = NULL;
+	u8 dev_hfunc = 0;
 	u8 *hkey = NULL;
 	u8 *rss_config;
 
-	if (!(dev->ethtool_ops->get_rxfh_indir_size ||
-	      dev->ethtool_ops->get_rxfh_key_size) ||
-	      !dev->ethtool_ops->get_rxfh)
+	if (!ops->get_rxfh)
 		return -EOPNOTSUPP;
 
 	if (ops->get_rxfh_indir_size)
@@ -710,16 +721,14 @@ static noinline_for_stack int ethtool_get_rxfh(struct net_device *dev,
 	if (ops->get_rxfh_key_size)
 		dev_key_size = ops->get_rxfh_key_size(dev);
 
-	if ((dev_key_size + dev_indir_size) == 0)
-		return -EOPNOTSUPP;
-
 	if (copy_from_user(&rxfh, useraddr, sizeof(rxfh)))
 		return -EFAULT;
 	user_indir_size = rxfh.indir_size;
 	user_key_size = rxfh.key_size;
 
 	/* Check that reserved fields are 0 for now */
-	if (rxfh.rss_context || rxfh.rsvd[0] || rxfh.rsvd[1])
+	if (rxfh.rss_context || rxfh.rsvd8[0] || rxfh.rsvd8[1] ||
+	    rxfh.rsvd8[2] || rxfh.rsvd32)
 		return -EINVAL;
 
 	rxfh.indir_size = dev_indir_size;
@@ -727,13 +736,6 @@ static noinline_for_stack int ethtool_get_rxfh(struct net_device *dev,
 	if (copy_to_user(useraddr, &rxfh, sizeof(rxfh)))
 		return -EFAULT;
 
-	/* If the user buffer size is 0, this is just a query for the
-	 * device table size and key size.  Otherwise, if the User size is
-	 * not equal to device table size or key size it's an error.
-	 */
-	if (!user_indir_size && !user_key_size)
-		return 0;
-
 	if ((user_indir_size && (user_indir_size != dev_indir_size)) ||
 	    (user_key_size && (user_key_size != dev_key_size)))
 		return -EINVAL;
@@ -750,14 +752,19 @@ static noinline_for_stack int ethtool_get_rxfh(struct net_device *dev,
 	if (user_key_size)
 		hkey = rss_config + indir_bytes;
 
-	ret = dev->ethtool_ops->get_rxfh(dev, indir, hkey);
-	if (!ret) {
-		if (copy_to_user(useraddr +
-				 offsetof(struct ethtool_rxfh, rss_config[0]),
-				 rss_config, total_size))
-			ret = -EFAULT;
-	}
+	ret = dev->ethtool_ops->get_rxfh(dev, indir, hkey, &dev_hfunc);
+	if (ret)
+		goto out;
 
+	if (copy_to_user(useraddr + offsetof(struct ethtool_rxfh, hfunc),
+			 &dev_hfunc, sizeof(rxfh.hfunc))) {
+		ret = -EFAULT;
+	} else if (copy_to_user(useraddr +
+			      offsetof(struct ethtool_rxfh, rss_config[0]),
+			      rss_config, total_size)) {
+		ret = -EFAULT;
+	}
+out:
 	kfree(rss_config);
 
 	return ret;
@@ -776,33 +783,31 @@ static noinline_for_stack int ethtool_set_rxfh(struct net_device *dev,
 	u8 *rss_config;
 	u32 rss_cfg_offset = offsetof(struct ethtool_rxfh, rss_config[0]);
 
-	if (!(ops->get_rxfh_indir_size || ops->get_rxfh_key_size) ||
-	    !ops->get_rxnfc || !ops->set_rxfh)
+	if (!ops->get_rxnfc || !ops->set_rxfh)
 		return -EOPNOTSUPP;
 
 	if (ops->get_rxfh_indir_size)
 		dev_indir_size = ops->get_rxfh_indir_size(dev);
 	if (ops->get_rxfh_key_size)
 		dev_key_size = dev->ethtool_ops->get_rxfh_key_size(dev);
-	if ((dev_key_size + dev_indir_size) == 0)
-		return -EOPNOTSUPP;
 
 	if (copy_from_user(&rxfh, useraddr, sizeof(rxfh)))
 		return -EFAULT;
 
 	/* Check that reserved fields are 0 for now */
-	if (rxfh.rss_context || rxfh.rsvd[0] || rxfh.rsvd[1])
+	if (rxfh.rss_context || rxfh.rsvd8[0] || rxfh.rsvd8[1] ||
+	    rxfh.rsvd8[2] || rxfh.rsvd32)
 		return -EINVAL;
 
-	/* If either indir or hash key is valid, proceed further.
-	 * It is not valid to request that both be unchanged.
+	/* If either indir, hash key or function is valid, proceed further.
+	 * Must request at least one change: indir size, hash key or function.
 	 */
 	if ((rxfh.indir_size &&
 	     rxfh.indir_size != ETH_RXFH_INDIR_NO_CHANGE &&
 	     rxfh.indir_size != dev_indir_size) ||
 	    (rxfh.key_size && (rxfh.key_size != dev_key_size)) ||
 	    (rxfh.indir_size == ETH_RXFH_INDIR_NO_CHANGE &&
-	     rxfh.key_size == 0))
+	     rxfh.key_size == 0 && rxfh.hfunc == ETH_RSS_HASH_NO_CHANGE))
 		return -EINVAL;
 
 	if (rxfh.indir_size != ETH_RXFH_INDIR_NO_CHANGE)
@@ -845,7 +850,7 @@ static noinline_for_stack int ethtool_set_rxfh(struct net_device *dev,
 		}
 	}
 
-	ret = ops->set_rxfh(dev, indir, hkey);
+	ret = ops->set_rxfh(dev, indir, hkey, rxfh.hfunc);
 
 out:
 	kfree(rss_config);
-- 
1.8.3.4

^ permalink raw reply related

* [PATCH net-next V3 2/2] net/mlx4_en: Support for configurable RSS hash function
From: Amir Vadai @ 2014-12-02 16:12 UTC (permalink / raw)
  To: David S. Miller, Ben Hutchings
  Cc: netdev, Or Gerlitz, Eyal Perry, Yevgeny Petrilin, Amir Vadai
In-Reply-To: <1417536731-11211-1-git-send-email-amirv@mellanox.com>

From: Eyal Perry <eyalpe@mellanox.com>

The ConnectX HW is capable of using one of the following hash functions:
Toeplitz and an XOR hash function. This patch extends the implementation
of the mlx4_en driver set/get_rxfh callbacks to support getting and
setting the RSS hash function used by the device.

Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 34 +++++++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  | 11 ++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c      | 14 +++++++++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |  2 +-
 4 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 28c3fc5..90e0f04 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -978,6 +978,27 @@ static u32 mlx4_en_get_rxfh_key_size(struct net_device *netdev)
 	return MLX4_EN_RSS_KEY_SIZE;
 }
 
+static int mlx4_en_check_rxfh_func(struct net_device *dev, u8 hfunc)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+
+	/* check if requested function is supported by the device */
+	if ((hfunc == ETH_RSS_HASH_TOP &&
+	     !(priv->mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS_TOP)) ||
+	    (hfunc == ETH_RSS_HASH_XOR &&
+	     !(priv->mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS_XOR)))
+		return -EINVAL;
+
+	priv->rss_hash_fn = hfunc;
+	if (hfunc == ETH_RSS_HASH_TOP && !(dev->features & NETIF_F_RXHASH))
+		en_warn(priv,
+			"Toeplitz hash function should be used in conjunction with RX hashing for optimal performance\n");
+	if (hfunc == ETH_RSS_HASH_XOR && (dev->features & NETIF_F_RXHASH))
+		en_warn(priv,
+			"Enabling both XOR Hash function and RX Hashing can limit RPS functionality\n");
+	return 0;
+}
+
 static int mlx4_en_get_rxfh(struct net_device *dev, u32 *ring_index, u8 *key,
 			    u8 *hfunc)
 {
@@ -999,7 +1020,7 @@ static int mlx4_en_get_rxfh(struct net_device *dev, u32 *ring_index, u8 *key,
 	if (key)
 		memcpy(key, priv->rss_key, MLX4_EN_RSS_KEY_SIZE);
 	if (hfunc)
-		*hfunc = ETH_RSS_HASH_TOP;
+		*hfunc = priv->rss_hash_fn;
 	return err;
 }
 
@@ -1013,10 +1034,6 @@ static int mlx4_en_set_rxfh(struct net_device *dev, const u32 *ring_index,
 	int i;
 	int rss_rings = 0;
 
-	/* We do not allow change in unsupported parameters */
-	if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
-		return -EOPNOTSUPP;
-
 	/* Calculate RSS table size and make sure flows are spread evenly
 	 * between rings
 	 */
@@ -1037,6 +1054,12 @@ static int mlx4_en_set_rxfh(struct net_device *dev, const u32 *ring_index,
 	if (!is_power_of_2(rss_rings))
 		return -EINVAL;
 
+	if (hfunc != ETH_RSS_HASH_NO_CHANGE) {
+		err = mlx4_en_check_rxfh_func(dev, hfunc);
+		if (err)
+			return err;
+	}
+
 	mutex_lock(&mdev->state_lock);
 	if (priv->port_up) {
 		port_up = 1;
@@ -1047,6 +1070,7 @@ static int mlx4_en_set_rxfh(struct net_device *dev, const u32 *ring_index,
 		priv->prof->rss_rings = rss_rings;
 	if (key)
 		memcpy(priv->rss_key, key, MLX4_EN_RSS_KEY_SIZE);
+
 	if (port_up) {
 		err = mlx4_en_start_port(dev);
 		if (err)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index b7c9978..dcc0d3d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2608,6 +2608,17 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	if (mdev->dev->caps.steering_mode != MLX4_STEERING_MODE_A0)
 		dev->priv_flags |= IFF_UNICAST_FLT;
 
+	/* Setting a default hash function value */
+	if (mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS_TOP) {
+		priv->rss_hash_fn = ETH_RSS_HASH_TOP;
+	} else if (mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS_XOR) {
+		priv->rss_hash_fn = ETH_RSS_HASH_XOR;
+	} else {
+		en_warn(priv,
+			"No RSS hash capabilities exposed, using Toeplitz\n");
+		priv->rss_hash_fn = ETH_RSS_HASH_TOP;
+	}
+
 	mdev->pndev[port] = dev;
 
 	netif_carrier_off(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 946d352..4ca396e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1223,7 +1223,19 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 
 	rss_context->flags = rss_mask;
 	rss_context->hash_fn = MLX4_RSS_HASH_TOP;
-	memcpy(rss_context->rss_key, priv->rss_key, MLX4_EN_RSS_KEY_SIZE);
+	if (priv->rss_hash_fn == ETH_RSS_HASH_XOR) {
+		rss_context->hash_fn = MLX4_RSS_HASH_XOR;
+	} else if (priv->rss_hash_fn == ETH_RSS_HASH_TOP) {
+		rss_context->hash_fn = MLX4_RSS_HASH_TOP;
+		memcpy(rss_context->rss_key, priv->rss_key,
+		       MLX4_EN_RSS_KEY_SIZE);
+		netdev_rss_key_fill(rss_context->rss_key,
+				    MLX4_EN_RSS_KEY_SIZE);
+	} else {
+		en_err(priv, "Unknown RSS hash function requested\n");
+		err = -EINVAL;
+		goto indir_err;
+	}
 	err = mlx4_qp_to_ready(mdev->dev, &priv->res.mtt, &context,
 			       &rss_map->indir_qp, &rss_map->indir_state);
 	if (err)
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index aaa7efb..ac48a8d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -376,7 +376,6 @@ struct mlx4_en_port_profile {
 };
 
 struct mlx4_en_profile {
-	int rss_xor;
 	int udp_rss;
 	u8 rss_mask;
 	u32 active_ports;
@@ -619,6 +618,7 @@ struct mlx4_en_priv {
 
 	u32 pflags;
 	u8 rss_key[MLX4_EN_RSS_KEY_SIZE];
+	u8 rss_hash_fn;
 };
 
 enum mlx4_en_wol {
-- 
1.8.3.4

^ permalink raw reply related

* hola
From: hi @ 2014-12-02 16:01 UTC (permalink / raw)


iphone 6plus, 480 euro
ipad 4th,280euro,
samsung s5,260euro,
moto,laptop,camera....

si te:   assroooo . com

^ permalink raw reply

* [PATCH net-next V3 0/2] ethtool, net/mlx4_en: RSS hash function selection
From: Amir Vadai @ 2014-12-02 16:12 UTC (permalink / raw)
  To: David S. Miller, Ben Hutchings
  Cc: netdev, Or Gerlitz, Eyal Perry, Yevgeny Petrilin, Amir Vadai

Hi,

This patchset by Eyal adds support in set/get of RSS hash function. Current
supported functions are Toeplitz and XOR. The API is design to enable adding
new hash functions without breaking backward compatibility.
Userspace patch will be sent after API is available in kernel.


The patchset was applied and tested over commit cd4c910 ("netpoll: delete
defconfig references to obsolete NETPOLL_TRAP")

Amir

Changes from V2:
- Patch 1/2 - ethtool: Support for configurable RSS hash function:
  - Return ETH_RSS_HASH_TOP in get_rxfh() callback at sfc driver.

Changes from V1:
- Patch 1/2 - ethtool: Support for configurable RSS hash function:
   - Accept an hfunc the chip is using instead of only ETH_RSS_HASH_NO_CHANGE.
   - Provide an accurate hfunc value in get_rxfh() call.
   - Changed the behavior of the get_rxfh() w.r.t the validation of the
     arguments. Function will return 0 instead of -EOPNOSUPP when all arguments
     are NULL.

Changes from V0:
- Patch 1/2 - ethtool: Support for configurable RSS hash function:
 - Add ETH prefix to RSS_HASH_* definitions
 - Moved the strings array to ethtool.c
 - Extend {get,set}_rxfh with additional arg instead of adding new
   ethtool_option and adopt the change into drivers implementations.
 - Moved indir_size and key_size validation into drivers implantation
 - Documented hfunc filed in ethtool_rxfh struct
- Patch 2/2 - net/mlx4_en: Support for configurable RSS hash function
 - Remove redundant priv->rss_hash_fn_caps 
 - Use == operator instead & when determining requested hash function. 

Eyal Perry (2):
  ethtool: Support for configurable RSS hash function
  net/mlx4_en: Support for configurable RSS hash function

 drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c       | 11 +++-
 .../net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c    | 20 ++++++-
 drivers/net/ethernet/broadcom/tg3.c                | 20 ++++++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    | 18 +++++-
 drivers/net/ethernet/emulex/benet/be_ethtool.c     | 12 +++-
 drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c   | 12 +++-
 drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c | 17 +++++-
 drivers/net/ethernet/intel/igb/igb_ethtool.c       | 16 ++++-
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c    | 37 +++++++++++-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c     | 11 ++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         | 14 ++++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h       |  2 +-
 drivers/net/ethernet/sfc/ethtool.c                 | 18 ++++--
 drivers/net/vmxnet3/vmxnet3_ethtool.c              | 15 ++++-
 include/linux/ethtool.h                            | 42 +++++++++----
 include/uapi/linux/ethtool.h                       | 10 +++-
 net/core/ethtool.c                                 | 69 ++++++++++++----------
 17 files changed, 272 insertions(+), 72 deletions(-)

-- 
1.8.3.4

^ permalink raw reply

* Re: [PATCH net-next] rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()
From: Roopa Prabhu @ 2014-12-02 16:08 UTC (permalink / raw)
  To: Mahesh Bandewar; +Cc: netdev, David Miller, Eric Dumazet, Toshiaki Makita
In-Reply-To: <1417499650-29176-1-git-send-email-maheshb@google.com>

On 12/1/14, 9:54 PM, Mahesh Bandewar wrote:
> The commit 56bfa7ee7c ("unregister_netdevice : move RTM_DELLINK to
> until after ndo_uninit") tried to do this ealier but while doing so
> it created a problem. Unfortunately the delayed rtmsg_ifinfo() also
> delayed call to fill_info(). So this translated into asking driver
> to remove private state and then quert it's private state. This
> could have catastropic consequences.
>
> This change breaks the rtmsg_ifinfo() into two parts - one fills the
> skb by calling fill_info() prior to calling ndo_uninit() and the second
> part just sends the message using the skb filled earlier.
>
> It was brought to notice when last link is deleted from an ipvlan device
> when it has free-ed the port and the subsequent .fill_info() call is
> trying to get the info from the port.
>
> kernel: [  255.139429] ------------[ cut here ]------------
> kernel: [  255.139439] WARNING: CPU: 12 PID: 11173 at net/core/rtnetlink.c:2238 rtmsg_ifinfo+0x100/0x110()
> kernel: [  255.139493] Modules linked in: ipvlan bonding w1_therm ds2482 wire cdc_acm ehci_pci ehci_hcd i2c_dev i2c_i801 i2c_core msr cpuid bnx2x ptp pps_core mdio libcrc32c
> kernel: [  255.139513] CPU: 12 PID: 11173 Comm: ip Not tainted 3.18.0-smp-DEV #167
> kernel: [  255.139514] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 1.0.10 05/15/2012
> kernel: [  255.139515]  0000000000000009 ffff880851b6b828 ffffffff815d87f4 00000000000000e0
> kernel: [  255.139516]  0000000000000000 ffff880851b6b868 ffffffff8109c29c 0000000000000000
> kernel: [  255.139518]  00000000ffffffa6 00000000000000d0 ffffffff81aaf580 0000000000000011
> kernel: [  255.139520] Call Trace:
> kernel: [  255.139527]  [<ffffffff815d87f4>] dump_stack+0x46/0x58
> kernel: [  255.139531]  [<ffffffff8109c29c>] warn_slowpath_common+0x8c/0xc0
> kernel: [  255.139540]  [<ffffffff8109c2ea>] warn_slowpath_null+0x1a/0x20
> kernel: [  255.139544]  [<ffffffff8150d570>] rtmsg_ifinfo+0x100/0x110
> kernel: [  255.139547]  [<ffffffff814f78b5>] rollback_registered_many+0x1d5/0x2d0
> kernel: [  255.139549]  [<ffffffff814f79cf>] unregister_netdevice_many+0x1f/0xb0
> kernel: [  255.139551]  [<ffffffff8150acab>] rtnl_dellink+0xbb/0x110
> kernel: [  255.139553]  [<ffffffff8150da90>] rtnetlink_rcv_msg+0xa0/0x240
> kernel: [  255.139557]  [<ffffffff81329283>] ? rhashtable_lookup_compare+0x43/0x80
> kernel: [  255.139558]  [<ffffffff8150d9f0>] ? __rtnl_unlock+0x20/0x20
> kernel: [  255.139562]  [<ffffffff8152cb11>] netlink_rcv_skb+0xb1/0xc0
> kernel: [  255.139563]  [<ffffffff8150a495>] rtnetlink_rcv+0x25/0x40
> kernel: [  255.139565]  [<ffffffff8152c398>] netlink_unicast+0x178/0x230
> kernel: [  255.139567]  [<ffffffff8152c75f>] netlink_sendmsg+0x30f/0x420
> kernel: [  255.139571]  [<ffffffff814e0b0c>] sock_sendmsg+0x9c/0xd0
> kernel: [  255.139575]  [<ffffffff811d1d7f>] ? rw_copy_check_uvector+0x6f/0x130
> kernel: [  255.139577]  [<ffffffff814e11c9>] ? copy_msghdr_from_user+0x139/0x1b0
> kernel: [  255.139578]  [<ffffffff814e1774>] ___sys_sendmsg+0x304/0x310
> kernel: [  255.139581]  [<ffffffff81198723>] ? handle_mm_fault+0xca3/0xde0
> kernel: [  255.139585]  [<ffffffff811ebc4c>] ? destroy_inode+0x3c/0x70
> kernel: [  255.139589]  [<ffffffff8108e6ec>] ? __do_page_fault+0x20c/0x500
> kernel: [  255.139597]  [<ffffffff811e8336>] ? dput+0xb6/0x190
> kernel: [  255.139606]  [<ffffffff811f05f6>] ? mntput+0x26/0x40
> kernel: [  255.139611]  [<ffffffff811d2b94>] ? __fput+0x174/0x1e0
> kernel: [  255.139613]  [<ffffffff814e2129>] __sys_sendmsg+0x49/0x90
> kernel: [  255.139615]  [<ffffffff814e2182>] SyS_sendmsg+0x12/0x20
> kernel: [  255.139617]  [<ffffffff815df092>] system_call_fastpath+0x12/0x17
> kernel: [  255.139619] ---[ end trace 5e6703e87d984f6b ]---

interestingly I have never seen this. We use this heavily with most 
other logical devices.
Which tells me most logical devices do have checks in their fill_info.
The patch idea is good. My only concern is stale information
in the DELLINK notification. Because,  ndo_uninit() does do a lot of 
cleanup, sending
newlink's for some of these cleanup changes. And now with your patch the 
dellink notification
skb probably  contains information that has been already deleted by 
ndo_uninit ?





>
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> Report-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
> Cc: David S. Miller <davem@davemloft.net>
> ---
>   drivers/net/bonding/bond_main.c |  4 ++--
>   include/linux/rtnetlink.h       |  6 +++++-
>   include/net/bonding.h           |  8 ++++----
>   net/core/dev.c                  | 26 ++++++++++++++++----------
>   net/core/rtnetlink.c            | 20 ++++++++++++++++----
>   5 files changed, 43 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 184c434ae305..06206a1439a4 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -1135,7 +1135,7 @@ static int bond_master_upper_dev_link(struct net_device *bond_dev,
>   	if (err)
>   		return err;
>   	slave_dev->flags |= IFF_SLAVE;
> -	rtmsg_ifinfo(RTM_NEWLINK, slave_dev, IFF_SLAVE, GFP_KERNEL);
> +	rtmsg_ifinfo(RTM_NEWLINK, slave_dev, IFF_SLAVE, GFP_KERNEL, false);
>   	return 0;
>   }
>   
> @@ -1144,7 +1144,7 @@ static void bond_upper_dev_unlink(struct net_device *bond_dev,
>   {
>   	netdev_upper_dev_unlink(slave_dev, bond_dev);
>   	slave_dev->flags &= ~IFF_SLAVE;
> -	rtmsg_ifinfo(RTM_NEWLINK, slave_dev, IFF_SLAVE, GFP_KERNEL);
> +	rtmsg_ifinfo(RTM_NEWLINK, slave_dev, IFF_SLAVE, GFP_KERNEL, false);
>   }
>   
>   static struct slave *bond_alloc_slave(struct bonding *bond)
> diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
> index 6cacbce1a06c..545dd0b8c83d 100644
> --- a/include/linux/rtnetlink.h
> +++ b/include/linux/rtnetlink.h
> @@ -16,7 +16,11 @@ extern int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics);
>   extern int rtnl_put_cacheinfo(struct sk_buff *skb, struct dst_entry *dst,
>   			      u32 id, long expires, u32 error);
>   
> -void rtmsg_ifinfo(int type, struct net_device *dev, unsigned change, gfp_t flags);
> +struct sk_buff *rtmsg_ifinfo(int type, struct net_device *dev, unsigned change,
> +			     gfp_t flags, bool fill_only);
> +void rtmsg_ifinfo_send(struct sk_buff *skb, struct net_device *dev,
> +		       gfp_t flags);
> +
>   
>   /* RTNL is used as a global lock for all changes to network configuration  */
>   extern void rtnl_lock(void);
> diff --git a/include/net/bonding.h b/include/net/bonding.h
> index 983a94b86b95..ea09f6c5af51 100644
> --- a/include/net/bonding.h
> +++ b/include/net/bonding.h
> @@ -315,7 +315,7 @@ static inline void bond_set_active_slave(struct slave *slave)
>   {
>   	if (slave->backup) {
>   		slave->backup = 0;
> -		rtmsg_ifinfo(RTM_NEWLINK, slave->dev, 0, GFP_ATOMIC);
> +		rtmsg_ifinfo(RTM_NEWLINK, slave->dev, 0, GFP_ATOMIC, false);
>   	}
>   }
>   
> @@ -323,7 +323,7 @@ static inline void bond_set_backup_slave(struct slave *slave)
>   {
>   	if (!slave->backup) {
>   		slave->backup = 1;
> -		rtmsg_ifinfo(RTM_NEWLINK, slave->dev, 0, GFP_ATOMIC);
> +		rtmsg_ifinfo(RTM_NEWLINK, slave->dev, 0, GFP_ATOMIC, false);
>   	}
>   }
>   
> @@ -335,7 +335,7 @@ static inline void bond_set_slave_state(struct slave *slave,
>   
>   	slave->backup = slave_state;
>   	if (notify) {
> -		rtmsg_ifinfo(RTM_NEWLINK, slave->dev, 0, GFP_ATOMIC);
> +		rtmsg_ifinfo(RTM_NEWLINK, slave->dev, 0, GFP_ATOMIC, false);
>   		slave->should_notify = 0;
>   	} else {
>   		if (slave->should_notify)
> @@ -365,7 +365,7 @@ static inline void bond_slave_state_notify(struct bonding *bond)
>   
>   	bond_for_each_slave(bond, tmp, iter) {
>   		if (tmp->should_notify) {
> -			rtmsg_ifinfo(RTM_NEWLINK, tmp->dev, 0, GFP_ATOMIC);
> +			rtmsg_ifinfo(RTM_NEWLINK, tmp->dev, 0, GFP_ATOMIC, false);
>   			tmp->should_notify = 0;
>   		}
>   	}
> diff --git a/net/core/dev.c b/net/core/dev.c
> index ac4836241a96..29bc78d5e6cb 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1230,7 +1230,7 @@ void netdev_state_change(struct net_device *dev)
>   		change_info.flags_changed = 0;
>   		call_netdevice_notifiers_info(NETDEV_CHANGE, dev,
>   					      &change_info.info);
> -		rtmsg_ifinfo(RTM_NEWLINK, dev, 0, GFP_KERNEL);
> +		rtmsg_ifinfo(RTM_NEWLINK, dev, 0, GFP_KERNEL, false);
>   	}
>   }
>   EXPORT_SYMBOL(netdev_state_change);
> @@ -1319,7 +1319,7 @@ int dev_open(struct net_device *dev)
>   	if (ret < 0)
>   		return ret;
>   
> -	rtmsg_ifinfo(RTM_NEWLINK, dev, IFF_UP|IFF_RUNNING, GFP_KERNEL);
> +	rtmsg_ifinfo(RTM_NEWLINK, dev, IFF_UP|IFF_RUNNING, GFP_KERNEL, false);
>   	call_netdevice_notifiers(NETDEV_UP, dev);
>   
>   	return ret;
> @@ -1396,7 +1396,8 @@ static int dev_close_many(struct list_head *head)
>   	__dev_close_many(head);
>   
>   	list_for_each_entry_safe(dev, tmp, head, close_list) {
> -		rtmsg_ifinfo(RTM_NEWLINK, dev, IFF_UP|IFF_RUNNING, GFP_KERNEL);
> +		rtmsg_ifinfo(RTM_NEWLINK, dev, IFF_UP|IFF_RUNNING,
> +			     GFP_KERNEL, false);
>   		call_netdevice_notifiers(NETDEV_DOWN, dev);
>   		list_del_init(&dev->close_list);
>   	}
> @@ -5683,7 +5684,7 @@ void __dev_notify_flags(struct net_device *dev, unsigned int old_flags,
>   	unsigned int changes = dev->flags ^ old_flags;
>   
>   	if (gchanges)
> -		rtmsg_ifinfo(RTM_NEWLINK, dev, gchanges, GFP_ATOMIC);
> +		rtmsg_ifinfo(RTM_NEWLINK, dev, gchanges, GFP_ATOMIC, false);
>   
>   	if (changes & IFF_UP) {
>   		if (dev->flags & IFF_UP)
> @@ -5925,6 +5926,8 @@ static void rollback_registered_many(struct list_head *head)
>   	synchronize_net();
>   
>   	list_for_each_entry(dev, head, unreg_list) {
> +		struct sk_buff *skb = NULL;
> +
>   		/* Shutdown queueing discipline. */
>   		dev_shutdown(dev);
>   
> @@ -5934,6 +5937,10 @@ static void rollback_registered_many(struct list_head *head)
>   		*/
>   		call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>   
> +		if (!dev->rtnl_link_ops ||
> +		    dev->rtnl_link_state == RTNL_LINK_INITIALIZED)
> +			skb = rtmsg_ifinfo(RTM_DELLINK, dev, ~0U, GFP_KERNEL, true);
> +
>   		/*
>   		 *	Flush the unicast and multicast chains
>   		 */
> @@ -5943,9 +5950,8 @@ static void rollback_registered_many(struct list_head *head)
>   		if (dev->netdev_ops->ndo_uninit)
>   			dev->netdev_ops->ndo_uninit(dev);
>   
> -		if (!dev->rtnl_link_ops ||
> -		    dev->rtnl_link_state == RTNL_LINK_INITIALIZED)
> -			rtmsg_ifinfo(RTM_DELLINK, dev, ~0U, GFP_KERNEL);
> +		if (skb)
> +			rtmsg_ifinfo_send(skb, dev, GFP_KERNEL);
>   
>   		/* Notifier chain MUST detach us all upper devices. */
>   		WARN_ON(netdev_has_any_upper_dev(dev));
> @@ -6334,7 +6340,7 @@ int register_netdevice(struct net_device *dev)
>   	 */
>   	if (!dev->rtnl_link_ops ||
>   	    dev->rtnl_link_state == RTNL_LINK_INITIALIZED)
> -		rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL);
> +		rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL, false);
>   
>   out:
>   	return ret;
> @@ -6959,7 +6965,7 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
>   	call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>   	rcu_barrier();
>   	call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
> -	rtmsg_ifinfo(RTM_DELLINK, dev, ~0U, GFP_KERNEL);
> +	rtmsg_ifinfo(RTM_DELLINK, dev, ~0U, GFP_KERNEL, false);
>   
>   	/*
>   	 *	Flush the unicast and multicast chains
> @@ -7000,7 +7006,7 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
>   	 *	Prevent userspace races by waiting until the network
>   	 *	device is fully setup before sending notifications.
>   	 */
> -	rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL);
> +	rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL, false);
>   
>   	synchronize_net();
>   	err = 0;
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index b9b7dfaf202b..1035d8cdbc08 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -2220,8 +2220,16 @@ static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
>   	return skb->len;
>   }
>   
> -void rtmsg_ifinfo(int type, struct net_device *dev, unsigned int change,
> -		  gfp_t flags)
> +void rtmsg_ifinfo_send(struct sk_buff *skb, struct net_device *dev, gfp_t flags)
> +{
> +	struct net *net = dev_net(dev);
> +
> +	rtnl_notify(skb, net, 0, RTNLGRP_LINK, NULL, flags);
> +}
> +EXPORT_SYMBOL(rtmsg_ifinfo_send);
> +
> +struct sk_buff *rtmsg_ifinfo(int type, struct net_device *dev,
> +			     unsigned int change, gfp_t flags, bool fill_only)
>   {
>   	struct net *net = dev_net(dev);
>   	struct sk_buff *skb;
> @@ -2239,11 +2247,15 @@ void rtmsg_ifinfo(int type, struct net_device *dev, unsigned int change,
>   		kfree_skb(skb);
>   		goto errout;
>   	}
> +	if (fill_only)
> +	    return skb;
> +
>   	rtnl_notify(skb, net, 0, RTNLGRP_LINK, NULL, flags);
> -	return;
> +	return NULL;
>   errout:
>   	if (err < 0)
>   		rtnl_set_sk_err(net, RTNLGRP_LINK, err);
> +	return NULL;
>   }
>   EXPORT_SYMBOL(rtmsg_ifinfo);
>   
> @@ -3011,7 +3023,7 @@ static int rtnetlink_event(struct notifier_block *this, unsigned long event, voi
>   	case NETDEV_JOIN:
>   		break;
>   	default:
> -		rtmsg_ifinfo(RTM_NEWLINK, dev, 0, GFP_KERNEL);
> +		rtmsg_ifinfo(RTM_NEWLINK, dev, 0, GFP_KERNEL, false);
>   		break;
>   	}
>   	return NOTIFY_DONE;

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox