Netdev List


* Re: [PATCH v2] net: Allow ethtool to set interface in loopback mode.
From: Mahesh Bandewar @ 2011-01-05  2:06 UTC
  To: Rick Jones
  Cc: Stephen Hemminger, Ben Hutchings, David Miller, Laurent Chavey,
	Tom Herbert, netdev
In-Reply-To: <4D23CAA5.7060902@hp.com>

On Tue, Jan 4, 2011 at 5:34 PM, Rick Jones <rick.jones2@hp.com> wrote:
>>>> Since this is a boolean it SHOULD go into ethtool_flags rather than
>>>> being a high level operation.
>>>
>>> It could do, but I thought ETHTOOL_{G,S}FLAGS were intended for
>>> controlling offload features.
>>
>>
>> It just seems the number of hooks keeps growing which takes more space
>> and increases complexity.
>
> Is there any complication/downside to using flags in the (un?)likely event
> of wanting different flavors of loopback in the card?

The purpose of the patch is to stress / exercise the ingress/egress
path(s). So, as Ben suggested earlier, keeping the loopback
implementation as close as possible to the host would streamline /
simplify the implementation & usage.

This is not a new patch and the earlier thread has an answer for this.
It's just that when I re-submitted this patch today, it went in as a
new patch! Here is a reference to the old thread -

http://marc.info/?l=linux-netdev&w=3&r=1&s=Allow+ethtool+to+set+interface&q=t

>
> rick jones
>


* Re: [PATCH v2] net: Allow ethtool to set interface in loopback mode.
From: Ben Hutchings @ 2011-01-05  1:59 UTC
  To: Rick Jones
  Cc: Stephen Hemminger, Mahesh Bandewar, David Miller, Laurent Chavey,
	Tom Herbert, netdev
In-Reply-To: <4D23CAA5.7060902@hp.com>

On Tue, 2011-01-04 at 17:34 -0800, Rick Jones wrote:
> >>>Since this is a boolean it SHOULD go into ethtool_flags rather than
> >>>being a high level operation.
> >>
> >>It could do, but I thought ETHTOOL_{G,S}FLAGS were intended for
> >>controlling offload features.
> > 
> > 
> > It just seems the number of hooks keeps growing which takes more space
> > and increases complexity.
> 
> Is there any complication/downside to using flags in the (un?)likely event of 
> wanting different flavors of loopback in the card?

You have to define the flags.   And once you start, where would you
stop?  The sfc driver alone recognises 18 host-side and 8 wire-side
loopback modes.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* Re: [PATCH v2] net: Allow ethtool to set interface in loopback mode.
From: Stephen Hemminger @ 2011-01-05  1:53 UTC
  To: Rick Jones
  Cc: Ben Hutchings, Mahesh Bandewar, David Miller, Laurent Chavey,
	Tom Herbert, netdev
In-Reply-To: <4D23CAA5.7060902@hp.com>

On Tue, 04 Jan 2011 17:34:29 -0800
Rick Jones <rick.jones2@hp.com> wrote:

> >>>Since this is a boolean it SHOULD go into ethtool_flags rather than
> >>>being a high level operation.
> >>
> >>It could do, but I thought ETHTOOL_{G,S}FLAGS were intended for
> >>controlling offload features.
> > 
> > 
> > It just seems the number of hooks keeps growing which takes more space
> > and increases complexity.
> 
> Is there any complication/downside to using flags in the (un?)likely event of 
> wanting different flavors of loopback in the card?
> 
> rick jones

Then let's keep it as a command and have a parameter to set the mode.
There might be drivers that want to loop back in SW, HW, PHY or even the switch.
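
One way to read the suggestion above (a command that takes a mode
parameter) is sketched below in plain C. The enum values, struct demo_nic
and demo_set_loopback() are all hypothetical; none of them come from the
posted patch or from any driver. The sketch only shows a driver rejecting
modes that its hardware does not advertise.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical loopback modes; not taken from the patch. */
enum demo_loopback_mode {
	DEMO_LOOPBACK_OFF = 0,
	DEMO_LOOPBACK_SW,	/* looped back in software */
	DEMO_LOOPBACK_MAC,	/* looped back inside the MAC */
	DEMO_LOOPBACK_PHY,	/* looped back at the PHY */
	DEMO_LOOPBACK_MAX
};

/* Hypothetical per-device state. */
struct demo_nic {
	uint32_t supported;	/* bitmask of (1u << mode) the device offers */
	uint32_t current_mode;	/* currently selected mode */
};

/* Select a mode; returns 0 or a negative errno value. */
static int demo_set_loopback(struct demo_nic *nic, uint32_t mode)
{
	assert(nic != NULL);

	if (mode >= DEMO_LOOPBACK_MAX)
		return -EINVAL;
	/* Turning loopback off is always allowed. */
	if (mode != DEMO_LOOPBACK_OFF && !(nic->supported & (1u << mode)))
		return -EOPNOTSUPP;
	nic->current_mode = mode;
	return 0;
}
```

With a single u32 parameter the command number stays fixed while each
driver decides which modes it accepts.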

-- 


* Re: [PATCH v2] net: Allow ethtool to set interface in loopback mode.
From: Mahesh Bandewar @ 2011-01-05  1:39 UTC
  To: Stephen Hemminger
  Cc: Ben Hutchings, David Miller, Laurent Chavey, Tom Herbert, netdev
In-Reply-To: <20110104172939.711b758d@nehalam>

On Tue, Jan 4, 2011 at 5:29 PM, Stephen Hemminger <shemminger@vyatta.com> wrote:
> On Wed, 05 Jan 2011 01:21:44 +0000
> Ben Hutchings <bhutchings@solarflare.com> wrote:
>
>> On Tue, 2011-01-04 at 16:36 -0800, Stephen Hemminger wrote:
>> > On Tue,  4 Jan 2011 16:30:01 -0800
>> > Mahesh Bandewar <maheshb@google.com> wrote:
>> >
>> > > This patch enables ethtool to set the loopback mode on a given interface.
>> > > By configuring the interface in loopback mode in conjunction with a policy
>> > > route / rule, a userland application can stress the egress / ingress path
>> > > exposing the flaws of the change in progress and potentially help developer(s)
>> > > understand the impact of those changes without even sending a packet out
>> > > on the network.
>> > >
>> > > The following set of commands illustrates one such example -
>> > >   a) ip -4 addr add 192.168.1.1/24 dev eth1
>> > >   b) ip -4 rule add from all iif eth1 lookup 250
>> > >   c) ip -4 route add local 0/0 dev lo proto kernel scope host table 250
>> > >   d) arp -Ds 192.168.1.100 eth1
>> > >   e) arp -Ds 192.168.1.200 eth1
>> > >   f) sysctl -w net.ipv4.ip_nonlocal_bind=1
>> > >   g) sysctl -w net.ipv4.conf.all.accept_local=1
>> > >   # Assuming that the machine has 8 cores
>> > >   h) taskset 000f netserver -L 192.168.1.200
>> > >   i) taskset 00f0 netperf -t TCP_CRR -L 192.168.1.100 -H 192.168.1.200 -l 30
>> > >
>> > > Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>> > > Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
>> >
>> > Since this is a boolean it SHOULD go into ethtool_flags rather than
>> > being a high level operation.
>>
>> It could do, but I thought ETHTOOL_{G,S}FLAGS were intended for
>> controlling offload features.
>
> It just seems the number of hooks keeps growing which takes more space
> and increases complexity.
>
> There was some talk about changing GRO/TSO/UFO .. to be bits in FLAGS
> but not sure how far along that is.
> --
>

This is not merely getting / setting flags but involves invoking a
method from the driver(s). If done this way, the code in
ethtool_op_set_flags() would have to be special-cased to handle this
flag, which (I think) would not be clean.


* Re: [PATCH v2] net: Allow ethtool to set interface in loopback mode.
From: Rick Jones @ 2011-01-05  1:34 UTC
  To: Stephen Hemminger
  Cc: Ben Hutchings, Mahesh Bandewar, David Miller, Laurent Chavey,
	Tom Herbert, netdev
In-Reply-To: <20110104172939.711b758d@nehalam>

>>>Since this is a boolean it SHOULD go into ethtool_flags rather than
>>>being a high level operation.
>>
>>It could do, but I thought ETHTOOL_{G,S}FLAGS were intended for
>>controlling offload features.
> 
> 
> It just seems the number of hooks keeps growing which takes more space
> and increases complexity.

Is there any complication/downside to using flags in the (un?)likely event of 
wanting different flavors of loopback in the card?

rick jones


* Re: [PATCH v2] net: Allow ethtool to set interface in loopback mode.
From: Stephen Hemminger @ 2011-01-05  1:29 UTC
  To: Ben Hutchings
  Cc: Mahesh Bandewar, David Miller, Laurent Chavey, Tom Herbert,
	netdev
In-Reply-To: <1294190504.2992.3.camel@localhost>

On Wed, 05 Jan 2011 01:21:44 +0000
Ben Hutchings <bhutchings@solarflare.com> wrote:

> On Tue, 2011-01-04 at 16:36 -0800, Stephen Hemminger wrote:
> > On Tue,  4 Jan 2011 16:30:01 -0800
> > Mahesh Bandewar <maheshb@google.com> wrote:
> > 
> > > This patch enables ethtool to set the loopback mode on a given interface.
> > > By configuring the interface in loopback mode in conjunction with a policy
> > > route / rule, a userland application can stress the egress / ingress path
> > > exposing the flaws of the change in progress and potentially help developer(s)
> > > understand the impact of those changes without even sending a packet out
> > > on the network.
> > > 
> > > The following set of commands illustrates one such example -
> > > 	a) ip -4 addr add 192.168.1.1/24 dev eth1
> > > 	b) ip -4 rule add from all iif eth1 lookup 250
> > > 	c) ip -4 route add local 0/0 dev lo proto kernel scope host table 250
> > > 	d) arp -Ds 192.168.1.100 eth1
> > > 	e) arp -Ds 192.168.1.200 eth1
> > > 	f) sysctl -w net.ipv4.ip_nonlocal_bind=1
> > > 	g) sysctl -w net.ipv4.conf.all.accept_local=1
> > > 	# Assuming that the machine has 8 cores
> > > 	h) taskset 000f netserver -L 192.168.1.200
> > > 	i) taskset 00f0 netperf -t TCP_CRR -L 192.168.1.100 -H 192.168.1.200 -l 30
> > > 
> > > Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> > > Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
> > 
> > Since this is a boolean it SHOULD go into ethtool_flags rather than
> > being a high level operation.
> 
> It could do, but I thought ETHTOOL_{G,S}FLAGS were intended for
> controlling offload features.

It just seems the number of hooks keeps growing which takes more space
and increases complexity.

There was some talk about changing GRO/TSO/UFO .. to be bits in FLAGS
but not sure how far along that is.
-- 


* Re: [PATCH v2] net: Allow ethtool to set interface in loopback mode.
From: Ben Hutchings @ 2011-01-05  1:21 UTC
  To: Stephen Hemminger
  Cc: Mahesh Bandewar, David Miller, Laurent Chavey, Tom Herbert,
	netdev
In-Reply-To: <20110104163645.0b3a3687@nehalam>

On Tue, 2011-01-04 at 16:36 -0800, Stephen Hemminger wrote:
> On Tue,  4 Jan 2011 16:30:01 -0800
> Mahesh Bandewar <maheshb@google.com> wrote:
> 
> > This patch enables ethtool to set the loopback mode on a given interface.
> > By configuring the interface in loopback mode in conjunction with a policy
> > route / rule, a userland application can stress the egress / ingress path
> > exposing the flaws of the change in progress and potentially help developer(s)
> > understand the impact of those changes without even sending a packet out
> > on the network.
> > 
> > The following set of commands illustrates one such example -
> > 	a) ip -4 addr add 192.168.1.1/24 dev eth1
> > 	b) ip -4 rule add from all iif eth1 lookup 250
> > 	c) ip -4 route add local 0/0 dev lo proto kernel scope host table 250
> > 	d) arp -Ds 192.168.1.100 eth1
> > 	e) arp -Ds 192.168.1.200 eth1
> > 	f) sysctl -w net.ipv4.ip_nonlocal_bind=1
> > 	g) sysctl -w net.ipv4.conf.all.accept_local=1
> > 	# Assuming that the machine has 8 cores
> > 	h) taskset 000f netserver -L 192.168.1.200
> > 	i) taskset 00f0 netperf -t TCP_CRR -L 192.168.1.100 -H 192.168.1.200 -l 30
> > 
> > Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> > Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
> 
> Since this is a boolean it SHOULD go into ethtool_flags rather than
> being a high level operation.

It could do, but I thought ETHTOOL_{G,S}FLAGS were intended for
controlling offload features.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* Re: [net-next-2.6 PATCH] ethtool: update get_rx_ntuple to correctly interpret string count
From: Alexander Duyck @ 2011-01-05  1:06 UTC
  To: Ben Hutchings; +Cc: davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <1294185662.3636.61.camel@bwh-desktop>

On 1/4/2011 4:01 PM, Ben Hutchings wrote:
> On Tue, 2011-01-04 at 15:29 -0800, Alexander Duyck wrote:
>> Currently any strings returned via the get_rx_ntuple call will just be
>> dropped because the num_strings will be zero.  In order to correct this I
>> am updating things so that the return value of get_rx_ntuple is the number
>> of strings that were written, or a negative value if there was an error.
> [...]
>
> Nothing implements ethtool_ops::get_rx_ntuple, anyway.
>
> The fallback implementation is totally bogus, too.  Maximum of 1024
> filters?  Erm, sfc can handle more than that.  And doing complex string
> formatting in the kernel, even though all the parsing is in ethtool?
>
> Please, let's write off ETHTOOL_GRXNTUPLE as a failed experiment and
> replace it with a command that behaves more like ETHTOOL_GRXCLSRLALL.
>
> Ben.

In order to address several different issues in the perfect filters 
provided by 82599 I found it necessary to implement get_rx_ntuple so 
that the driver could maintain the filter list inside of the driver 
instead of having it maintained by the stack.  In doing so though I 
found the bug.

I agree the fallback implementation has a limitation on the number and 
format of filters it supports.  However declaring the function a "failed 
experiment" and just dropping it isn't exactly constructive since we 
have customers that are making use of the feature.  The fallback 
implementation is meant to be just that, and the patch I provided makes 
it possible to support more filters if needed by implementing a means of 
tracking/displaying the filters within the driver itself.

Thanks,

Alex


* Re: [PATCH v2] net: Allow ethtool to set interface in loopback mode.
From: Stephen Hemminger @ 2011-01-05  0:36 UTC
  To: Mahesh Bandewar
  Cc: David Miller, Ben Hutchings, Laurent Chavey, Tom Herbert, netdev
In-Reply-To: <1294187401-4662-1-git-send-email-maheshb@google.com>

On Tue,  4 Jan 2011 16:30:01 -0800
Mahesh Bandewar <maheshb@google.com> wrote:

> This patch enables ethtool to set the loopback mode on a given interface.
> By configuring the interface in loopback mode in conjunction with a policy
> route / rule, a userland application can stress the egress / ingress path
> exposing the flaws of the change in progress and potentially help developer(s)
> understand the impact of those changes without even sending a packet out
> on the network.
> 
> The following set of commands illustrates one such example -
> 	a) ip -4 addr add 192.168.1.1/24 dev eth1
> 	b) ip -4 rule add from all iif eth1 lookup 250
> 	c) ip -4 route add local 0/0 dev lo proto kernel scope host table 250
> 	d) arp -Ds 192.168.1.100 eth1
> 	e) arp -Ds 192.168.1.200 eth1
> 	f) sysctl -w net.ipv4.ip_nonlocal_bind=1
> 	g) sysctl -w net.ipv4.conf.all.accept_local=1
> 	# Assuming that the machine has 8 cores
> 	h) taskset 000f netserver -L 192.168.1.200
> 	i) taskset 00f0 netperf -t TCP_CRR -L 192.168.1.100 -H 192.168.1.200 -l 30
> 
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>

Since this is a boolean it SHOULD go into ethtool_flags rather than
being a high level operation.


-- 


* [PATCH v2] net: Allow ethtool to set interface in loopback mode.
From: Mahesh Bandewar @ 2011-01-05  0:30 UTC
  To: David Miller, Ben Hutchings, Laurent Chavey, Tom Herbert
  Cc: netdev, Mahesh Bandewar
In-Reply-To: <AANLkTikW24AcLEAioYtCwuQOoVnTrqVoAaZw3sksZ_jU@mail.gmail.com>

This patch enables ethtool to set the loopback mode on a given interface.
By configuring the interface in loopback mode in conjunction with a policy
route / rule, a userland application can stress the egress / ingress path
exposing the flaws of the change in progress and potentially help developer(s)
understand the impact of those changes without even sending a packet out
on the network.

The following set of commands illustrates one such example -
	a) ip -4 addr add 192.168.1.1/24 dev eth1
	b) ip -4 rule add from all iif eth1 lookup 250
	c) ip -4 route add local 0/0 dev lo proto kernel scope host table 250
	d) arp -Ds 192.168.1.100 eth1
	e) arp -Ds 192.168.1.200 eth1
	f) sysctl -w net.ipv4.ip_nonlocal_bind=1
	g) sysctl -w net.ipv4.conf.all.accept_local=1
	# Assuming that the machine has 8 cores
	h) taskset 000f netserver -L 192.168.1.200
	i) taskset 00f0 netperf -t TCP_CRR -L 192.168.1.100 -H 192.168.1.200 -l 30

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
---
 include/linux/ethtool.h |   16 ++++++++++++++++
 net/core/ethtool.c      |   39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 6628a50..c036347 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -616,6 +616,18 @@ void ethtool_ntuple_flush(struct net_device *dev);
  *	Should validate the magic field.  Don't need to check len for zero
  *	or wraparound.  Update len to the amount written.  Returns an error
  *	or zero.
+ *
+ * get_loopback:
+ * set_loopback:
+ *	These are the driver-specific get / set methods to report or
+ *	enable / disable loopback mode. The idea is to stress test the
+ *	ingress/egress paths by enabling this mode. There are multiple places
+ *	where this could be done, and the choice will most likely depend on
+ *	the device capabilities. As a guiding principle, implement loopback
+ *	mode as close to the host as possible. This maximizes the soft-path
+ *	length and keeps comparisons between different drivers on an equal
+ *	footing.
+ *
  */
 struct ethtool_ops {
 	int	(*get_settings)(struct net_device *, struct ethtool_cmd *);
@@ -678,6 +690,8 @@ struct ethtool_ops {
 				  struct ethtool_rxfh_indir *);
 	int	(*set_rxfh_indir)(struct net_device *,
 				  const struct ethtool_rxfh_indir *);
+	int	(*get_loopback)(struct net_device *, u32 *);
+	int	(*set_loopback)(struct net_device *, u32);
 };
 #endif /* __KERNEL__ */
 
@@ -741,6 +755,8 @@ struct ethtool_ops {
 #define ETHTOOL_GSSET_INFO	0x00000037 /* Get string set info */
 #define ETHTOOL_GRXFHINDIR	0x00000038 /* Get RX flow hash indir'n table */
 #define ETHTOOL_SRXFHINDIR	0x00000039 /* Set RX flow hash indir'n table */
+#define ETHTOOL_SLOOPBACK	0x0000003a /* Enable / Disable Loopback */
+#define ETHTOOL_GLOOPBACK	0x0000003b /* Get Loopback status */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET		ETHTOOL_GSET
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 956a9f4..5c87c93 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1434,6 +1434,39 @@ static noinline_for_stack int ethtool_flash_device(struct net_device *dev,
 	return dev->ethtool_ops->flash_device(dev, &efl);
 }
 
+static int ethtool_set_loopback(struct net_device *dev, void __user *useraddr)
+{
+	struct ethtool_value edata;
+	const struct ethtool_ops *ops = dev->ethtool_ops;
+
+	if (!ops || !ops->set_loopback)
+		return -EOPNOTSUPP;
+
+	if (copy_from_user(&edata, useraddr, sizeof(edata)))
+		return -EFAULT;
+
+	return ops->set_loopback(dev, edata.data);
+}
+
+static int ethtool_get_loopback(struct net_device *dev, void __user *useraddr)
+{
+	struct ethtool_value edata = { .cmd = ETHTOOL_GLOOPBACK };
+	const struct ethtool_ops *ops = dev->ethtool_ops;
+	int err;
+
+	if (!ops || !ops->get_loopback)
+		return -EOPNOTSUPP;
+
+	err = ops->get_loopback(dev, &edata.data);
+	if (err)
+		return err;
+
+	if (copy_to_user(useraddr, &edata, sizeof(edata)))
+		return -EFAULT;
+
+	return 0;
+}
+
 /* The main entry point in this file.  Called from net/core/dev.c */
 
 int dev_ethtool(struct net *net, struct ifreq *ifr)
@@ -1678,6 +1711,12 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
 	case ETHTOOL_SRXFHINDIR:
 		rc = ethtool_set_rxfh_indir(dev, useraddr);
 		break;
+	case ETHTOOL_SLOOPBACK:
+		rc = ethtool_set_loopback(dev, useraddr);
+		break;
+	case ETHTOOL_GLOOPBACK:
+		rc = ethtool_get_loopback(dev, useraddr);
+		break;
 	default:
 		rc = -EOPNOTSUPP;
 	}
-- 
1.7.3.1



* [RFC] sched: CHOKe packet scheduler
From: Stephen Hemminger @ 2011-01-05  0:29 UTC
  To: David Miller; +Cc: netdev

This implements the CHOKe packet scheduler on top of the existing
Linux RED scheduler, following the algorithm described in the paper.
Configuration is the same as RED; only the name changes.

The core idea is:
  For every packet arrival:
        Calculate Qave
        if (Qave < minth) {
             Queue the new packet
        }
        Else {
             Select a packet at random from the queue
             Compare the flow id of the arriving packet with that of
             the randomly selected packet.
             If they have the same flow id {
                  Drop both the packets
             }
             Else {
                  if (Qave ≥ maxth) {
                       Drop the new packet
                  }
                  Else {
                       Calculate the dropping probability pa
                       Drop the new packet with probability pa
                  }
             }
        }
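
The outline above can be sketched as a small, self-contained decision
function. This is only an illustration: it follows the branch order used
by choke_enqueue() in the patch below (forced drop above the max
threshold, probabilistic drop between the thresholds), it uses floating
point and a caller-supplied random number instead of the kernel's
fixed-point RED helpers, and the names choke_verdict and choke_decide()
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum choke_verdict {
	CHOKE_ENQUEUE,		/* admit the arriving packet */
	CHOKE_DROP_BOTH,	/* drop the arriving packet and the sampled one */
	CHOKE_DROP_NEW		/* drop only the arriving packet */
};

/*
 * Hypothetical, simplified decision function.
 *   qavg, minth, maxth: average queue length and RED thresholds
 *   same_flow: arriving packet matches the randomly sampled queued packet
 *   pa:   drop probability in [0, 1]
 *   draw: uniform random number in [0, 1), supplied by the caller
 */
static enum choke_verdict choke_decide(double qavg, double minth, double maxth,
				       bool same_flow, double pa, double draw)
{
	assert(minth <= maxth);

	if (qavg < minth)
		return CHOKE_ENQUEUE;
	if (same_flow)
		return CHOKE_DROP_BOTH;
	if (qavg >= maxth)
		return CHOKE_DROP_NEW;	/* forced drop */
	/* Between the thresholds: RED-style probabilistic drop. */
	return draw < pa ? CHOKE_DROP_NEW : CHOKE_ENQUEUE;
}
```

Passing the random number in keeps the function deterministic, which
makes the individual branches easy to exercise in isolation.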

This is an early access version.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
 net/sched/Kconfig     |   11 +
 net/sched/Makefile    |    1 
 net/sched/sch_choke.c |  364 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 376 insertions(+)

--- a/net/sched/Kconfig	2011-01-04 16:25:18.000000000 -0800
+++ b/net/sched/Kconfig	2011-01-04 16:26:02.335973715 -0800
@@ -205,6 +205,17 @@ config NET_SCH_DRR
 
 	  If unsure, say N.
 
+config NET_SCH_CHOKE
+	tristate "CHOose and Keep responsive flow scheduler (CHOKE)"
+	help
+	  Say Y here if you want to use the CHOKe packet scheduler (CHOose
+	  and Keep for responsive flows, CHOose and Kill for unresponsive
+	  flows). This is a variation of RED which tries to penalize flows
+	  that monopolize the queue.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called sch_choke.
+
 config NET_SCH_INGRESS
 	tristate "Ingress Qdisc"
 	depends on NET_CLS_ACT
--- a/net/sched/Makefile	2011-01-04 16:25:18.000000000 -0800
+++ b/net/sched/Makefile	2011-01-04 16:26:16.048938937 -0800
@@ -32,6 +32,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ)	+= sch_mult
 obj-$(CONFIG_NET_SCH_ATM)	+= sch_atm.o
 obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
 obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
+obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
 obj-$(CONFIG_NET_CLS_FW)	+= cls_fw.o
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ b/net/sched/sch_choke.c	2011-01-04 16:25:33.913971468 -0800
@@ -0,0 +1,364 @@
+/*
+ * net/sched/sch_choke.c	CHOKE scheduler
+ *
+ * Copyright (c) 2011 Stephen Hemminger <shemminger@vyatta.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/skbuff.h>
+#include <linux/ipv6.h>
+#include <linux/jhash.h>
+#include <net/pkt_sched.h>
+#include <net/ip.h>
+#include <net/red.h>
+#include <net/ipv6.h>
+
+/*	CHOKe stateless AQM for fair bandwidth allocation
+        =================================================
+
+	Source:
+	R. Pan, B. Prabhakar, and K. Psounis, "CHOKe, A Stateless
+	Active Queue Management Scheme for Approximating Fair Bandwidth Allocation",
+	IEEE INFOCOM, 2000.
+
+	A. Tang, J. Wang, S. Low, "Understanding CHOKe: Throughput and Spatial
+	Characteristics", IEEE/ACM Transactions on Networking, 2004
+
+	ADVANTAGE:
+	- Penalizes unfair flows
+	- Random drops provide gradual feedback
+
+	DRAWBACKS:
+	- Small queue for single flow
+	- Can be gamed by opening lots of connections
+	- Hard to get correct parameters (same problem as RED)
+
+ */
+
+struct choke_sched_data
+{
+	u32		  limit;
+	unsigned char	  flags;
+
+	struct red_parms  parms;
+	struct red_stats  stats;
+};
+
+/* Select a packet at random from the list.
+ * Same caveats as skb_peek.
+ */
+static struct sk_buff *skb_peek_random(struct sk_buff_head *list)
+{
+	struct sk_buff *skb = list->next;
+	unsigned int idx = net_random() % list->qlen;
+
+	while (skb && idx-- > 0)
+		skb = skb->next;
+
+	return skb;
+}
+
+/* Given IP header and size find src/dst port pair */
+static inline u32 get_ports(const void *hdr, size_t hdr_size, int offset)
+{
+	return *(u32 *)(hdr + hdr_size + offset);
+}
+
+
+static bool same_flow(struct sk_buff *nskb, const struct sk_buff *oskb)
+{
+	if (nskb->protocol != oskb->protocol)
+		return false;
+
+	switch (nskb->protocol) {
+	case htons(ETH_P_IP):
+	{
+		const struct iphdr *iph1, *iph2;
+		int poff;
+
+		if (!pskb_network_may_pull(nskb, sizeof(*iph1)))
+			return false;
+
+		iph1 = ip_hdr(nskb);
+		iph2 = ip_hdr(oskb);
+
+		if (iph1->protocol != iph2->protocol ||
+		    iph1->daddr != iph2->daddr ||
+		    iph1->saddr != iph2->saddr)
+			return false;
+
+		/* Be hostile to new fragmented packets */
+		if (iph1->frag_off & htons(IP_MF|IP_OFFSET))
+			return true;
+
+		if (iph2->frag_off & htons(IP_MF|IP_OFFSET))
+			return false;
+
+		poff = proto_ports_offset(iph1->protocol);
+		if (poff >= 0 &&
+		    pskb_network_may_pull(nskb, iph1->ihl * 4 + 4 + poff)) {
+			iph1 = ip_hdr(nskb);
+
+			return get_ports(iph1, iph1->ihl * 4, poff)
+				== get_ports(iph2, iph2->ihl * 4, poff);
+		}
+
+		return false;
+	}
+
+	case htons(ETH_P_IPV6):
+	{
+		const struct ipv6hdr *iph1, *iph2;
+		int poff;
+
+		if (!pskb_network_may_pull(nskb, sizeof(*iph1)))
+			return false;
+
+		iph1 = ipv6_hdr(nskb);
+		iph2 = ipv6_hdr(oskb);
+
+		if (iph1->nexthdr != iph2->nexthdr ||
+		    ipv6_addr_cmp(&iph1->daddr, &iph2->daddr) != 0 ||
+		    ipv6_addr_cmp(&iph1->saddr, &iph2->saddr) != 0)
+			return false;
+
+		poff = proto_ports_offset(iph1->nexthdr);
+		if (poff >= 0 &&
+		    pskb_network_may_pull(nskb, sizeof(*iph1) + 4 + poff)) {
+			iph1 = ipv6_hdr(nskb);
+
+			return get_ports(iph1, sizeof(*iph1), poff)
+				== get_ports(iph2, sizeof(*iph2), poff);
+		}
+		return false;
+	}
+	default:
+		return false;
+	}
+
+}
+
+/*
+ * Decide what to do with a new packet based on the average queue size.
+ * Returns the result of qdisc_enqueue_tail() if the packet is admitted,
+ * NET_XMIT_DROP if the queue is full, or NET_XMIT_CN on a congestion drop.
+ */
+static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct red_parms *p = &q->parms;
+
+	p->qavg = red_calc_qavg(p, skb_queue_len(&sch->q));
+	if (red_is_idling(p))
+		red_end_of_idle_period(p);
+
+	if (p->qavg <= p->qth_min || skb_queue_empty(&sch->q))
+		p->qcount = -1;
+	else {
+		struct sk_buff *oskb;
+
+		/* Draw a packet at random from queue */
+		oskb = skb_peek_random(&sch->q);
+
+		/* Both packets from same flow? */
+		if (same_flow(skb, oskb)) {
+			/* Drop both packets */
+			__skb_unlink(oskb, &sch->q);
+			qdisc_drop(oskb, sch);
+			goto congestion_drop;
+		}
+
+		if (p->qavg > p->qth_max) {
+			p->qcount = -1;
+
+			sch->qstats.overlimits++;
+			q->stats.forced_drop++;
+			goto congestion_drop;
+		}
+
+		if (++p->qcount) {
+			if (red_mark_probability(p, p->qavg)) {
+				p->qcount = 0;
+				p->qR = red_random(p);
+
+				sch->qstats.overlimits++;
+				q->stats.prob_drop++;
+				goto congestion_drop;
+			}
+		} else
+			p->qR = red_random(p);
+	}
+
+	/* Admit new packet */
+	if (likely(skb_queue_len(&sch->q) < q->limit))
+		return qdisc_enqueue_tail(skb, sch);
+
+	q->stats.pdrop++;
+	sch->qstats.drops++;
+	kfree_skb(skb);
+	return NET_XMIT_DROP;
+
+ congestion_drop:
+	qdisc_drop(skb, sch);
+	return NET_XMIT_CN;
+}
+
+static struct sk_buff *choke_dequeue(struct Qdisc* sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+
+	skb = qdisc_dequeue_head(sch);
+	if (!skb) {
+		if (!red_is_idling(&q->parms))
+			red_start_of_idle_period(&q->parms);
+	}
+
+	return skb;
+}
+
+static unsigned int choke_drop(struct Qdisc* sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	unsigned int len;
+
+	len = qdisc_queue_drop(sch);
+
+	if (len > 0)
+		q->stats.other++;
+	else {
+		if (!red_is_idling(&q->parms))
+			red_start_of_idle_period(&q->parms);
+	}
+
+	return len;
+}
+
+static void choke_reset(struct Qdisc* sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+
+	red_restart(&q->parms);
+}
+
+static const struct nla_policy choke_policy[TCA_RED_MAX + 1] = {
+	[TCA_RED_PARMS]	= { .len = sizeof(struct tc_red_qopt) },
+	[TCA_RED_STAB]	= { .len = RED_STAB_SIZE },
+};
+
+static int choke_change(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_RED_MAX + 1];
+	struct tc_red_qopt *ctl;
+	int err;
+
+	if (opt == NULL)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_RED_MAX, opt, choke_policy);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_RED_PARMS] == NULL ||
+	    tb[TCA_RED_STAB] == NULL)
+		return -EINVAL;
+
+	ctl = nla_data(tb[TCA_RED_PARMS]);
+
+	sch_tree_lock(sch);
+	q->flags = ctl->flags;
+	q->limit = ctl->limit;
+
+	red_set_parms(&q->parms, ctl->qth_min, ctl->qth_max, ctl->Wlog,
+		      ctl->Plog, ctl->Scell_log,
+		      nla_data(tb[TCA_RED_STAB]));
+
+	if (skb_queue_empty(&sch->q))
+		red_end_of_idle_period(&q->parms);
+
+	sch_tree_unlock(sch);
+	return 0;
+}
+
+static int choke_init(struct Qdisc* sch, struct nlattr *opt)
+{
+	return choke_change(sch, opt);
+}
+
+static int choke_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct nlattr *opts = NULL;
+	struct tc_red_qopt opt = {
+		.limit		= q->limit,
+		.flags		= q->flags,
+		.qth_min	= q->parms.qth_min >> q->parms.Wlog,
+		.qth_max	= q->parms.qth_max >> q->parms.Wlog,
+		.Wlog		= q->parms.Wlog,
+		.Plog		= q->parms.Plog,
+		.Scell_log	= q->parms.Scell_log,
+	};
+
+	opts = nla_nest_start(skb, TCA_OPTIONS);
+	if (opts == NULL)
+		goto nla_put_failure;
+
+	NLA_PUT(skb, TCA_RED_PARMS, sizeof(opt), &opt);
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	nla_nest_cancel(skb, opts);
+	return -EMSGSIZE;
+}
+
+static int choke_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct tc_red_xstats st = {
+		.early	= q->stats.prob_drop + q->stats.forced_drop,
+		.pdrop	= q->stats.pdrop,
+		.other	= q->stats.other,
+		.marked	= q->stats.prob_mark + q->stats.forced_mark,
+	};
+
+	return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static struct Qdisc_ops choke_qdisc_ops __read_mostly = {
+	.id		=	"choke",
+	.priv_size	=	sizeof(struct choke_sched_data),
+
+	.enqueue	=	choke_enqueue,
+	.dequeue	=	choke_dequeue,
+	.peek		=	qdisc_peek_head,
+	.drop		=	choke_drop,
+	.init		=	choke_init,
+	.reset		=	choke_reset,
+	.change		=	choke_change,
+	.dump		=	choke_dump,
+	.dump_stats	=	choke_dump_stats,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init choke_module_init(void)
+{
+	return register_qdisc(&choke_qdisc_ops);
+}
+
+static void __exit choke_module_exit(void)
+{
+	unregister_qdisc(&choke_qdisc_ops);
+}
+
+module_init(choke_module_init)
+module_exit(choke_module_exit)
+
+MODULE_LICENSE("GPL");
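
The random selection in skb_peek_random() above computes
net_random() % list->qlen, which needs a non-empty queue. A minimal
user-space sketch of the same idea with an explicit empty-queue check
could look like this. The demo_pkt / demo_queue types and
demo_peek_random() are hypothetical stand-ins for sk_buff and
sk_buff_head, and the caller passes in the random number.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued packet. */
struct demo_pkt {
	struct demo_pkt *next;
	int flow_id;
};

/* Hypothetical stand-in for the packet queue. */
struct demo_queue {
	struct demo_pkt *head;
	unsigned int qlen;
};

/*
 * Return the packet at position (rnd % qlen), or NULL for an empty queue.
 * The caller supplies the random number so the function is deterministic.
 */
static struct demo_pkt *demo_peek_random(const struct demo_queue *q,
					 unsigned int rnd)
{
	struct demo_pkt *pkt;
	unsigned int idx;

	assert(q != NULL);
	if (q->qlen == 0)
		return NULL;	/* avoids a modulo by zero */

	idx = rnd % q->qlen;
	for (pkt = q->head; pkt && idx > 0; idx--)
		pkt = pkt->next;
	return pkt;
}
```

The walk is linear in the queue length, as in the patch; an array-backed
queue would allow constant-time sampling at the cost of a fixed capacity.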


* [RFC] ECN and IP defragmentation
From: Eric Dumazet @ 2011-01-05  0:13 UTC
  To: David Miller; +Cc: netdev

Hi David

It seems ip_fragment.c doesn't comply with RFC3168, section 5.3

5.3.  Fragmentation

   ECN-capable packets MAY have the DF (Don't Fragment) bit set.
   Reassembly of a fragmented packet MUST NOT lose indications of
   congestion.  In other words, if any fragment of an IP packet to be
   reassembled has the CE codepoint set, then one of two actions MUST be
   taken:

      * Set the CE codepoint on the reassembled packet.  However, this
        MUST NOT occur if any of the other fragments contributing to
        this reassembly carries the Not-ECT codepoint.

      * The packet is dropped, instead of being reassembled, for any
        other reason.
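
The rule quoted above can be sketched as a function that combines the ECN
fields of all fragments. This is only an illustration and not the
ip_fragment.c code: reasm_ecn() and REASM_DROP are hypothetical names,
and the behaviour when no fragment carries CE (reuse the first fragment's
field) is an assumption, since the quoted text does not cover that case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ECN field values from RFC 3168 (low two bits of the TOS byte). */
#define ECN_NOT_ECT	0x0
#define ECN_ECT_1	0x1
#define ECN_ECT_0	0x2
#define ECN_CE		0x3

#define REASM_DROP	(-1)	/* hypothetical "drop the packet" result */

/*
 * Combine the ECN fields of n fragments (n > 0).
 * Returns the ECN field for the reassembled packet, or REASM_DROP.
 */
static int reasm_ecn(const uint8_t *ecn, size_t n)
{
	int seen_ce = 0, seen_not_ect = 0;
	size_t i;

	assert(ecn != NULL && n > 0);
	for (i = 0; i < n; i++) {
		if ((ecn[i] & 0x3) == ECN_CE)
			seen_ce = 1;
		else if ((ecn[i] & 0x3) == ECN_NOT_ECT)
			seen_not_ect = 1;
	}

	if (seen_ce && seen_not_ect)
		return REASM_DROP;	/* CE must not be set, so drop instead */
	if (seen_ce)
		return ECN_CE;		/* keep the congestion indication */
	/* No CE seen: assumption, reuse the first fragment's field. */
	return ecn[0] & 0x3;
}
```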

Should we fix this?

(I tried to use ECN on UDP messages and noticed this problem)

Thanks


* Re: [net-next-2.6 PATCH] ethtool: update get_rx_ntuple to correctly interpret string count
From: Ben Hutchings @ 2011-01-05  0:01 UTC
  To: Alexander Duyck; +Cc: davem, netdev
In-Reply-To: <20110104232947.3451.91153.stgit@gitlad.jf.intel.com>

On Tue, 2011-01-04 at 15:29 -0800, Alexander Duyck wrote:
> Currently any strings returned via the get_rx_ntuple call will just be
> dropped because the num_strings will be zero.  In order to correct this I
> am updating things so that the return value of get_rx_ntuple is the number
> of strings that were written, or a negative value if there was an error.
[...]

Nothing implements ethtool_ops::get_rx_ntuple, anyway.

The fallback implementation is totally bogus, too.  Maximum of 1024
filters?  Erm, sfc can handle more than that.  And doing complex string
formatting in the kernel, even though all the parsing is in ethtool?

Please, let's write off ETHTOOL_GRXNTUPLE as a failed experiment and
replace it with a command that behaves more like ETHTOOL_GRXCLSRLALL.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

* Re: [net-next-2.6 08/08] r8169: more 8168dp support.
From: Francois Romieu @ 2011-01-04 23:53 UTC (permalink / raw)
  To: hayeswang; +Cc: davem, netdev, 'Ben Hutchings'
In-Reply-To: <FFFEF1293F9B4E3B81D3777F91F21B01@realtek.com.tw>

hayeswang <hayeswang@realtek.com> :
[...]
> When doing a reset, the NIC has to notify the embedded system and wait for a
> response. However, the system may access the same register at the same time,
> so the embedded system and the driver implement the same software mutex
> method to prevent this situation.

Thanks for the information.

I have found no 8168dp in my collection. A volunteer to test this patch with
the relevant hardware would be welcome.

-- 
Ueimor

* [net-next-2.6 PATCH] ethtool: update get_rx_ntuple to correctly interpret string count
From: Alexander Duyck @ 2011-01-04 23:29 UTC (permalink / raw)
  To: davem, bhutchings; +Cc: netdev

Currently any strings returned via the get_rx_ntuple call will just be
dropped because the num_strings will be zero.  In order to correct this I
am updating things so that the return value of get_rx_ntuple is the number
of strings that were written, or a negative value if there was an error.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 net/core/ethtool.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 1774178..7ade13b 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -587,6 +587,9 @@ static int ethtool_get_rx_ntuple(struct net_device *dev, void __user *useraddr)
 	if (ops->get_rx_ntuple) {
 		/* driver-specific filter grab */
 		ret = ops->get_rx_ntuple(dev, gstrings.string_set, data);
+		if (ret < 0)
+			goto out;
+		num_strings = ret;
 		goto copy;
 	}
 


* Re: [net-next-2.6 PATCH v5 2/2] net_sched: implement a root container qdisc sch_mqprio
From: Jarek Poplawski @ 2011-01-04 22:59 UTC (permalink / raw)
  To: John Fastabend
  Cc: davem, hadi, shemminger, tgraf, eric.dumazet, bhutchings, nhorman,
	netdev
In-Reply-To: <20110104185646.13692.68146.stgit@jf-dev1-dcblab>

On Tue, Jan 04, 2011 at 10:56:46AM -0800, John Fastabend wrote:
> This implements a mqprio queueing discipline that by default creates
> a pfifo_fast qdisc per tx queue and provides the needed configuration
> interface.
> 
> Using the mqprio qdisc the number of tcs currently in use along
> with the range of queues allotted to each class can be configured. By
> default skbs are mapped to traffic classes using the skb priority.
> This mapping is configurable.
> 
> Configurable parameters,
> 
> struct tc_mqprio_qopt {
>         __u8    num_tc;
>         __u8    prio_tc_map[TC_BITMASK + 1];
>         __u8    hw;
>         __u16   count[TC_MAX_QUEUE];
>         __u16   offset[TC_MAX_QUEUE];
> };
> 
> Here the count/offset pairing gives the queue alignment and the
> prio_tc_map gives the mapping from skb->priority to tc.
> 
> The hw bit determines if the hardware should configure the count
> and offset values. If the hardware bit is set then the operation
> will fail if the hardware does not implement the ndo_setup_tc
> operation. This is to avoid undetermined states where the hardware
> may or may not control the queue mapping. Also minimal bounds
> checking is done on the count/offset to verify a queue does not
> exceed num_tx_queues and that queue ranges do not overlap. Otherwise
> it is left to user policy or hardware configuration to create
> useful mappings.
> 
> It is expected that hardware QOS schemes can be implemented by
> creating appropriate mappings of queues in ndo_setup_tc().
> 
> One expected use case is drivers will use the ndo_setup_tc to map
> queue ranges onto 802.1Q traffic classes. This provides a generic
> mechanism to map network traffic onto these traffic classes and
> removes the need for lower layer drivers to know specifics about
> traffic types.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
> 
>  include/linux/netdevice.h |    3 
>  include/linux/pkt_sched.h |   10 +
>  net/sched/Kconfig         |   12 +
>  net/sched/Makefile        |    1 
>  net/sched/sch_generic.c   |    4 
>  net/sched/sch_mqprio.c    |  413 +++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 443 insertions(+), 0 deletions(-)
>  create mode 100644 net/sched/sch_mqprio.c
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index ae51323..19a855b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -764,6 +764,8 @@ struct netdev_tc_txq {
>   * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
>   *			  struct nlattr *port[]);
>   * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
> + *
> + * int (*ndo_setup_tc)(struct net_device *dev, int tc);

 * int (*ndo_setup_tc)(struct net_device *dev, u8 tc);

>   */
>  #define HAVE_NET_DEVICE_OPS
>  struct net_device_ops {
> @@ -822,6 +824,7 @@ struct net_device_ops {
>  						   struct nlattr *port[]);
>  	int			(*ndo_get_vf_port)(struct net_device *dev,
>  						   int vf, struct sk_buff *skb);
> +	int			(*ndo_setup_tc)(struct net_device *dev, u8 tc);
>  #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
>  	int			(*ndo_fcoe_enable)(struct net_device *dev);
>  	int			(*ndo_fcoe_disable)(struct net_device *dev);
> diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
> index 2cfa4bc..1c5310a 100644
> --- a/include/linux/pkt_sched.h
> +++ b/include/linux/pkt_sched.h
> @@ -2,6 +2,7 @@
>  #define __LINUX_PKT_SCHED_H
>  
>  #include <linux/types.h>
> +#include <linux/netdevice.h>

This would be better discussed with Stephen with respect to the iproute patch.

>  
>  /* Logical priority bands not depending on specific packet scheduler.
>     Every scheduler will map them to real traffic classes, if it has
> @@ -481,4 +482,13 @@ struct tc_drr_stats {
>  	__u32	deficit;
>  };
>  
> +/* MQPRIO */
> +struct tc_mqprio_qopt {
> +	__u8	num_tc;
> +	__u8	prio_tc_map[TC_BITMASK + 1];
> +	__u8	hw;
> +	__u16	count[TC_MAX_QUEUE];
> +	__u16	offset[TC_MAX_QUEUE];
> +};
> +
>  #endif
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index a36270a..f52f5eb 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -205,6 +205,18 @@ config NET_SCH_DRR
>  
>  	  If unsure, say N.
>  
> +config NET_SCH_MQPRIO
> +	tristate "Multi-queue priority scheduler (MQPRIO)"
> +	help
> +	  Say Y here if you want to use the Multi-queue Priority scheduler.
> +	  This scheduler allows QOS to be offloaded on NICs that have support
> +	  for offloading QOS schedulers.
> +
> +	  To compile this driver as a module, choose M here: the module will
> +	  be called sch_mqprio.
> +
> +	  If unsure, say N.
> +
>  config NET_SCH_INGRESS
>  	tristate "Ingress Qdisc"
>  	depends on NET_CLS_ACT
> diff --git a/net/sched/Makefile b/net/sched/Makefile
> index 960f5db..26ce681 100644
> --- a/net/sched/Makefile
> +++ b/net/sched/Makefile
> @@ -32,6 +32,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ)	+= sch_multiq.o
>  obj-$(CONFIG_NET_SCH_ATM)	+= sch_atm.o
>  obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
>  obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
> +obj-$(CONFIG_NET_SCH_MQPRIO)	+= sch_mqprio.o
>  obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
>  obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
>  obj-$(CONFIG_NET_CLS_FW)	+= cls_fw.o
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 34dc598..723b278 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -540,6 +540,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
>  	.dump		=	pfifo_fast_dump,
>  	.owner		=	THIS_MODULE,
>  };
> +EXPORT_SYMBOL(pfifo_fast_ops);
>  
>  struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
>  			  struct Qdisc_ops *ops)
> @@ -674,6 +675,7 @@ struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
>  
>  	return oqdisc;
>  }
> +EXPORT_SYMBOL(dev_graft_qdisc);
>  
>  static void attach_one_default_qdisc(struct net_device *dev,
>  				     struct netdev_queue *dev_queue,
> @@ -761,6 +763,7 @@ void dev_activate(struct net_device *dev)
>  		dev_watchdog_up(dev);
>  	}
>  }
> +EXPORT_SYMBOL(dev_activate);
>  
>  static void dev_deactivate_queue(struct net_device *dev,
>  				 struct netdev_queue *dev_queue,
> @@ -840,6 +843,7 @@ void dev_deactivate(struct net_device *dev)
>  	list_add(&dev->unreg_list, &single);
>  	dev_deactivate_many(&single);
>  }
> +EXPORT_SYMBOL(dev_deactivate);
>  
>  static void dev_init_scheduler_queue(struct net_device *dev,
>  				     struct netdev_queue *dev_queue,
> diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
> new file mode 100644
> index 0000000..b16dc2c
> --- /dev/null
> +++ b/net/sched/sch_mqprio.c
> @@ -0,0 +1,413 @@
> +/*
> + * net/sched/sch_mqprio.c
> + *
> + * Copyright (c) 2010 John Fastabend <john.r.fastabend@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * version 2 as published by the Free Software Foundation.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/slab.h>
> +#include <linux/kernel.h>
> +#include <linux/string.h>
> +#include <linux/errno.h>
> +#include <linux/skbuff.h>
> +#include <net/netlink.h>
> +#include <net/pkt_sched.h>
> +#include <net/sch_generic.h>
> +
> +struct mqprio_sched {
> +	struct Qdisc		**qdiscs;
> +	int hw_owned;
> +};
> +
> +static void mqprio_destroy(struct Qdisc *sch)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct mqprio_sched *priv = qdisc_priv(sch);
> +	unsigned int ntx;
> +
> +	if (!priv->qdiscs)
> +		return;
> +
> +	for (ntx = 0; ntx < dev->num_tx_queues && priv->qdiscs[ntx]; ntx++)
> +		qdisc_destroy(priv->qdiscs[ntx]);
> +
> +	if (priv->hw_owned && dev->netdev_ops->ndo_setup_tc)
> +		dev->netdev_ops->ndo_setup_tc(dev, 0);
> +	else
> +		netdev_set_num_tc(dev, 0);
> +
> +	kfree(priv->qdiscs);
> +}
> +
> +static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
> +{
> +	int i, j;
> +
> +	/* Verify num_tc is not out of max range */
> +	if (qopt->num_tc > TC_MAX_QUEUE)
> +		return -EINVAL;
> +
> +	for (i = 0; i < qopt->num_tc; i++) {
> +		unsigned int last = qopt->offset[i] + qopt->count[i];

(empty line after declarations)

> +		/* Verify the queue offset is in the num tx range */
> +		if (qopt->offset[i] >= dev->num_tx_queues)
> +			return -EINVAL;
> +		/* Verify the queue count is in tx range being equal to the
> +		 * num_tx_queues indicates the last queue is in use.
> +		 */
> +		else if (last > dev->num_tx_queues)
> +			return -EINVAL;
> +
> +		/* Verify that the offset and counts do not overlap */
> +		for (j = i + 1; j < qopt->num_tc; j++) {
> +			if (last > qopt->offset[j])
> +				return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct mqprio_sched *priv = qdisc_priv(sch);
> +	struct netdev_queue *dev_queue;
> +	struct Qdisc *qdisc;
> +	int i, err = -EOPNOTSUPP;
> +	struct tc_mqprio_qopt *qopt = NULL;
> +
> +	/* Unwind attributes on failure */
> +	u8 unwnd_tc = dev->num_tc;
> +	u8 unwnd_map[TC_BITMASK + 1];
> +	struct netdev_tc_txq unwnd_txq[TC_MAX_QUEUE];
> +
> +	if (sch->parent != TC_H_ROOT)
> +		return -EOPNOTSUPP;
> +
> +	if (!netif_is_multiqueue(dev))
> +		return -EOPNOTSUPP;
> +
> +	if (nla_len(opt) < sizeof(*qopt))
> +		return -EINVAL;
> +	qopt = nla_data(opt);
> +
> +	memcpy(unwnd_map, dev->prio_tc_map, sizeof(unwnd_map));
> +	memcpy(unwnd_txq, dev->tc_to_txq, sizeof(unwnd_txq));
> +
> +	/* If the mqprio options indicate that hardware should own
> +	 * the queue mapping then run ndo_setup_tc if this can not
> +	 * be done fail immediately.
> +	 */
> +	if (qopt->hw && dev->netdev_ops->ndo_setup_tc) {
> +		priv->hw_owned = 1;
> +		err = dev->netdev_ops->ndo_setup_tc(dev, qopt->num_tc);
> +		if (err)
> +			return err;
> +	} else if (!qopt->hw) {
> +		if (mqprio_parse_opt(dev, qopt))
> +			return -EINVAL;
> +
> +		if (netdev_set_num_tc(dev, qopt->num_tc))
> +			return -EINVAL;
> +
> +		for (i = 0; i < qopt->num_tc; i++)
> +			netdev_set_tc_queue(dev, i,
> +					    qopt->count[i], qopt->offset[i]);
> +	} else {
> +		return -EINVAL;
> +	}
> +
> +	/* Always use supplied priority mappings */
> +	for (i = 0; i < TC_BITMASK + 1; i++) {
> +		if (netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i])) {
> +			err = -EINVAL;

This would probably trigger if we try qopt->num_tc == 0. Is it expected?

> +			goto tc_err;
> +		}
> +	}
> +
> +	/* pre-allocate qdisc, attachment can't fail */
> +	priv->qdiscs = kcalloc(dev->num_tx_queues, sizeof(priv->qdiscs[0]),
> +			       GFP_KERNEL);
> +	if (priv->qdiscs == NULL) {
> +		err = -ENOMEM;
> +		goto tc_err;
> +	}
> +
> +	for (i = 0; i < dev->num_tx_queues; i++) {
> +		dev_queue = netdev_get_tx_queue(dev, i);
> +		qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops,
> +					  TC_H_MAKE(TC_H_MAJ(sch->handle),
> +						    TC_H_MIN(i + 1)));
> +		if (qdisc == NULL) {
> +			err = -ENOMEM;
> +			goto err;
> +		}
> +		qdisc->flags |= TCQ_F_CAN_BYPASS;
> +		priv->qdiscs[i] = qdisc;
> +	}
> +
> +	sch->flags |= TCQ_F_MQROOT;
> +	return 0;
> +
> +err:
> +	mqprio_destroy(sch);
> +tc_err:
> +	if (priv->hw_owned)
> +		dev->netdev_ops->ndo_setup_tc(dev, unwnd_tc);

Setting here (again) to unwind a bit later looks strange.
Why not this 'else' only?

> +	else
> +		netdev_set_num_tc(dev, unwnd_tc);
> +
> +	memcpy(dev->prio_tc_map, unwnd_map, sizeof(unwnd_map));
> +	memcpy(dev->tc_to_txq, unwnd_txq, sizeof(unwnd_txq));
> +
> +	return err;
> +}
> +
> +static void mqprio_attach(struct Qdisc *sch)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct mqprio_sched *priv = qdisc_priv(sch);
> +	struct Qdisc *qdisc;
> +	unsigned int ntx;
> +
> +	/* Attach underlying qdisc */
> +	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
> +		qdisc = priv->qdiscs[ntx];
> +		qdisc = dev_graft_qdisc(qdisc->dev_queue, qdisc);
> +		if (qdisc)
> +			qdisc_destroy(qdisc);
> +	}
> +	kfree(priv->qdiscs);
> +	priv->qdiscs = NULL;
> +}
> +
> +static struct netdev_queue *mqprio_queue_get(struct Qdisc *sch,
> +					     unsigned long cl)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	unsigned long ntx = cl - 1 - netdev_get_num_tc(dev);
> +
> +	if (ntx >= dev->num_tx_queues)
> +		return NULL;
> +	return netdev_get_tx_queue(dev, ntx);
> +}
> +
> +static int mqprio_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
> +		    struct Qdisc **old)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
> +
> +	if (dev->flags & IFF_UP)
> +		dev_deactivate(dev);
> +
> +	*old = dev_graft_qdisc(dev_queue, new);
> +
> +	if (dev->flags & IFF_UP)
> +		dev_activate(dev);
> +
> +	return 0;
> +}
> +
> +static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct mqprio_sched *priv = qdisc_priv(sch);
> +	unsigned char *b = skb_tail_pointer(skb);
> +	struct tc_mqprio_qopt opt;
> +	struct Qdisc *qdisc;
> +	unsigned int i;
> +
> +	sch->q.qlen = 0;
> +	memset(&sch->bstats, 0, sizeof(sch->bstats));
> +	memset(&sch->qstats, 0, sizeof(sch->qstats));
> +
> +	for (i = 0; i < dev->num_tx_queues; i++) {
> +		qdisc = netdev_get_tx_queue(dev, i)->qdisc;
> +		spin_lock_bh(qdisc_lock(qdisc));
> +		sch->q.qlen		+= qdisc->q.qlen;
> +		sch->bstats.bytes	+= qdisc->bstats.bytes;
> +		sch->bstats.packets	+= qdisc->bstats.packets;
> +		sch->qstats.qlen	+= qdisc->qstats.qlen;
> +		sch->qstats.backlog	+= qdisc->qstats.backlog;
> +		sch->qstats.drops	+= qdisc->qstats.drops;
> +		sch->qstats.requeues	+= qdisc->qstats.requeues;
> +		sch->qstats.overlimits	+= qdisc->qstats.overlimits;
> +		spin_unlock_bh(qdisc_lock(qdisc));
> +	}
> +
> +	opt.num_tc = dev->num_tc;
> +	memcpy(opt.prio_tc_map, dev->prio_tc_map, sizeof(opt.prio_tc_map));
> +	opt.hw = priv->hw_owned;
> +
> +	for (i = 0; i < dev->num_tc; i++) {
> +		opt.count[i] = dev->tc_to_txq[i].count;
> +		opt.offset[i] = dev->tc_to_txq[i].offset;
> +	}
> +
> +	NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
> +
> +	return skb->len;
> +nla_put_failure:
> +	nlmsg_trim(skb, b);
> +	return -1;
> +}
> +
> +static struct Qdisc *mqprio_leaf(struct Qdisc *sch, unsigned long cl)
> +{
> +	struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
> +
> +	return dev_queue->qdisc_sleeping;
> +}
> +
> +static unsigned long mqprio_get(struct Qdisc *sch, u32 classid)
> +{
> +	unsigned int ntx = TC_H_MIN(classid);

We need to 'get' tc classes too, e.g. for individual dumps. Then we
should omit them in .leaf, .graft etc.

> +
> +	if (!mqprio_queue_get(sch, ntx))
> +		return 0;
> +	return ntx;
> +}
> +
> +static void mqprio_put(struct Qdisc *sch, unsigned long cl)
> +{
> +}
> +
> +static int mqprio_dump_class(struct Qdisc *sch, unsigned long cl,
> +			 struct sk_buff *skb, struct tcmsg *tcm)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +
> +	if (cl <= dev->num_tc) {
> +		tcm->tcm_parent = TC_H_ROOT;
> +		tcm->tcm_info = 0;
> +	} else {
> +		int i;
> +		struct netdev_queue *dev_queue;
> +		dev_queue = mqprio_queue_get(sch, cl);
> +
> +		tcm->tcm_parent = 0;
> +		for (i = 0; i < netdev_get_num_tc(dev); i++) {


Why dev->num_tc above, netdev_get_num_tc(dev) here, and dev->num_tc
below?

> +			struct netdev_tc_txq tc = dev->tc_to_txq[i];
> +			int q_idx = cl - dev->num_tc;

(empty line after declarations)

> +			if (q_idx >= tc.offset &&
> +			    q_idx < tc.offset + tc.count) {

With cl == 17, tc.offset == 0, tc.count == 1 and num_tc == 16, q_idx is 1,
so !(1 < 0 + 1): doesn't this queue belong to parent #1?
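
To make the arithmetic concrete, here is a small standalone sketch
(userspace C, not the patch code). struct tc_range, queue_index() and
parent_tc() are names made up for the example; queue_index() simply reuses
the "cl - 1 - num_tc" formula from mqprio_queue_get() in this patch.

```c
#include <stdint.h>

/* Stand-in for struct netdev_tc_txq: a tc owns queues [offset, offset + count). */
struct tc_range {
	uint16_t count;
	uint16_t offset;
};

/* Queue index for a queue class id, as computed in mqprio_queue_get(). */
static inline int queue_index(unsigned long cl, unsigned int num_tc)
{
	return (int)(cl - 1 - num_tc);
}

/* Return the 1-based tc whose range contains q_idx, or 0 if none does. */
static inline int parent_tc(const struct tc_range *map, unsigned int num_tc,
			    int q_idx)
{
	unsigned int i;

	for (i = 0; i < num_tc; i++) {
		if (q_idx >= map[i].offset &&
		    q_idx < map[i].offset + map[i].count)
			return (int)i + 1;
	}
	return 0;
}
```

With 16 traffic classes of one queue each, class 17 maps to queue 0 and so
to parent #1 when queue_index() is used, whereas "cl - num_tc" yields
queue 1 and lands on the next tc.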

> +				tcm->tcm_parent =
> +					TC_H_MAKE(TC_H_MAJ(sch->handle),
> +						  TC_H_MIN(i + 1));
> +				break;
> +			}
> +		}
> +		tcm->tcm_info = dev_queue->qdisc_sleeping->handle;
> +	}
> +	tcm->tcm_handle |= TC_H_MIN(cl);
> +	return 0;
> +}
> +
> +static int mqprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
> +			       struct gnet_dump *d)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +
> +	if (cl <= netdev_get_num_tc(dev)) {
> +		int i;
> +		struct Qdisc *qdisc;
> +		struct gnet_stats_queue qstats = {0};
> +		struct gnet_stats_basic_packed bstats = {0};
> +		struct netdev_tc_txq tc = dev->tc_to_txq[cl - 1];
> +
> +		/* Drop lock here it will be reclaimed before touching
> +		 * statistics this is required because the d->lock we
> +		 * hold here is the look on dev_queue->qdisc_sleeping
> +		 * also acquired below.
> +		 */
> +		spin_unlock_bh(d->lock);
> +
> +		for (i = tc.offset; i < tc.offset + tc.count; i++) {
> +			qdisc = netdev_get_tx_queue(dev, i)->qdisc;
> +			spin_lock_bh(qdisc_lock(qdisc));
> +			bstats.bytes      += qdisc->bstats.bytes;
> +			bstats.packets    += qdisc->bstats.packets;
> +			qstats.qlen       += qdisc->qstats.qlen;
> +			qstats.backlog    += qdisc->qstats.backlog;
> +			qstats.drops      += qdisc->qstats.drops;
> +			qstats.requeues   += qdisc->qstats.requeues;
> +			qstats.overlimits += qdisc->qstats.overlimits;
> +			spin_unlock_bh(qdisc_lock(qdisc));
> +		}
> +		/* Reclaim root sleeping lock before completing stats */
> +		spin_lock_bh(d->lock);
> +		if (gnet_stats_copy_basic(d, &bstats) < 0 ||
> +		    gnet_stats_copy_queue(d, &qstats) < 0)
> +			return -1;
> +	} else {
> +		struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);

(empty line after declarations)

> +		sch = dev_queue->qdisc_sleeping;
> +		sch->qstats.qlen = sch->q.qlen;
> +		if (gnet_stats_copy_basic(d, &sch->bstats) < 0 ||
> +		    gnet_stats_copy_queue(d, &sch->qstats) < 0)
> +			return -1;
> +	}
> +	return 0;
> +}
> +
> +static void mqprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	unsigned long ntx;
> +	u8 num_tc = netdev_get_num_tc(dev);
> +
> +	if (arg->stop)
> +		return;
> +
> +	/* Walk hierarchy with a virtual class per tc */
> +	arg->count = arg->skip;
> +	for (ntx = arg->skip; ntx < dev->num_tx_queues + num_tc; ntx++) {

Should we report possibly unused/unconfigured tx_queues?

Jarek P.

> +		if (arg->fn(sch, ntx + 1, arg) < 0) {
> +			arg->stop = 1;
> +			break;
> +		}
> +		arg->count++;
> +	}
> +}
> +
> +static const struct Qdisc_class_ops mqprio_class_ops = {
> +	.graft		= mqprio_graft,
> +	.leaf		= mqprio_leaf,
> +	.get		= mqprio_get,
> +	.put		= mqprio_put,
> +	.walk		= mqprio_walk,
> +	.dump		= mqprio_dump_class,
> +	.dump_stats	= mqprio_dump_class_stats,
> +};
> +
> +struct Qdisc_ops mqprio_qdisc_ops __read_mostly = {
> +	.cl_ops		= &mqprio_class_ops,
> +	.id		= "mqprio",
> +	.priv_size	= sizeof(struct mqprio_sched),
> +	.init		= mqprio_init,
> +	.destroy	= mqprio_destroy,
> +	.attach		= mqprio_attach,
> +	.dump		= mqprio_dump,
> +	.owner		= THIS_MODULE,
> +};
> +
> +static int __init mqprio_module_init(void)
> +{
> +	return register_qdisc(&mqprio_qdisc_ops);
> +}
> +
> +static void __exit mqprio_module_exit(void)
> +{
> +	unregister_qdisc(&mqprio_qdisc_ops);
> +}
> +
> +module_init(mqprio_module_init);
> +module_exit(mqprio_module_exit);
> +
> +MODULE_LICENSE("GPL");
> 

* Re: [PATCH] r8169: support control of advertising
From: Francois Romieu @ 2011-01-04 22:38 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Oliver Neukum, davem, netdev, machen, Hayes Wang
In-Reply-To: <1294161621.3636.17.camel@bwh-desktop>

Ben Hutchings <bhutchings@solarflare.com> :
[...]
> Should be ADVERTISE_10HALF.

Fixed.

[...]
> > +		if (adv & ADVERTISED_10baseT_Half)
> > +			auto_nego |= ADVERTISE_10HALF;
> > +		if (adv & ADVERTISED_10baseT_Full)
> > +			auto_nego |= ADVERTISE_10FULL;
> > +		if (adv & ADVERTISED_100baseT_Half)
> > +			auto_nego |= ADVERTISE_100HALF;
> > +		if (adv & ADVERTISED_100baseT_Full)
> > +			auto_nego |=  ADVERTISE_100FULL;
> > +
> >  		auto_nego |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM;
> [...]
> 
> Pause advertising should also be controllable through ethtool, if flow
> control can be altered in the MAC.  (It's not clear whether it can.)

It would be gluttony.

> This should also check for unsupported advertising flags (e.g.
> ADVERTISED_1000baseT_Full when the MAC doesn't support 1000M) and return
> -EINVAL if they're set.

Fixed.
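
As a rough standalone illustration of the check being discussed (not the
driver code itself), the mapping from ethtool advertising flags to MII
register bits could look like the sketch below. The SK_* constants are
local copies whose values are assumed to mirror linux/ethtool.h and
linux/mii.h, and sk_adv_to_mii() is a hypothetical helper name.

```c
#include <errno.h>
#include <stdint.h>

/* Local copies of the ethtool ADVERTISED_* link-mode bits (assumed values). */
#define SK_ADVERTISED_10baseT_Half	(1u << 0)
#define SK_ADVERTISED_10baseT_Full	(1u << 1)
#define SK_ADVERTISED_100baseT_Half	(1u << 2)
#define SK_ADVERTISED_100baseT_Full	(1u << 3)
#define SK_ADVERTISED_1000baseT_Half	(1u << 4)
#define SK_ADVERTISED_1000baseT_Full	(1u << 5)

/* Local copies of the MII register bits (assumed values). */
#define SK_ADVERTISE_10HALF	0x0020	/* MII_ADVERTISE */
#define SK_ADVERTISE_10FULL	0x0040	/* MII_ADVERTISE */
#define SK_ADVERTISE_100HALF	0x0080	/* MII_ADVERTISE */
#define SK_ADVERTISE_100FULL	0x0100	/* MII_ADVERTISE */
#define SK_ADVERTISE_1000HALF	0x0100	/* MII_CTRL1000 */
#define SK_ADVERTISE_1000FULL	0x0200	/* MII_CTRL1000 */

/*
 * Update the two register images from the requested advertising mask.
 * Returns 0 on success, or -EINVAL (leaving both images untouched) when
 * 1000 Mbps modes are requested on a PHY without gigabit support.
 */
static inline int sk_adv_to_mii(uint32_t adv, int supports_gmii,
				uint16_t *auto_nego, uint16_t *giga_ctrl)
{
	const uint32_t giga = SK_ADVERTISED_1000baseT_Half |
			      SK_ADVERTISED_1000baseT_Full;

	if (!supports_gmii && (adv & giga))
		return -EINVAL;

	/* Clear the speed/duplex bits first, then set only what was asked. */
	*auto_nego = (uint16_t)(*auto_nego &
		~(SK_ADVERTISE_10HALF | SK_ADVERTISE_10FULL |
		  SK_ADVERTISE_100HALF | SK_ADVERTISE_100FULL));
	*giga_ctrl = (uint16_t)(*giga_ctrl &
		~(SK_ADVERTISE_1000HALF | SK_ADVERTISE_1000FULL));

	if (adv & SK_ADVERTISED_10baseT_Half)
		*auto_nego |= SK_ADVERTISE_10HALF;
	if (adv & SK_ADVERTISED_10baseT_Full)
		*auto_nego |= SK_ADVERTISE_10FULL;
	if (adv & SK_ADVERTISED_100baseT_Half)
		*auto_nego |= SK_ADVERTISE_100HALF;
	if (adv & SK_ADVERTISED_100baseT_Full)
		*auto_nego |= SK_ADVERTISE_100FULL;
	if (adv & SK_ADVERTISED_1000baseT_Half)
		*giga_ctrl |= SK_ADVERTISE_1000HALF;
	if (adv & SK_ADVERTISED_1000baseT_Full)
		*giga_ctrl |= SK_ADVERTISE_1000FULL;

	return 0;
}
```

Rejecting the request before touching either register image keeps a failed
call free of side effects, which is one possible way to satisfy the -EINVAL
suggestion above.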

Oliver, are you ok with the patch below (against davem's net-next or you
will experience rejects) ?

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 27a7c20..4175a14 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -540,7 +540,7 @@ struct rtl8169_private {
 		void (*up)(struct rtl8169_private *);
 	} pll_power_ops;
 
-	int (*set_speed)(struct net_device *, u8 autoneg, u16 speed, u8 duplex);
+	int (*set_speed)(struct net_device *, u8 aneg, u16 sp, u8 dpx, u32 adv);
 	int (*get_settings)(struct net_device *, struct ethtool_cmd *);
 	void (*phy_reset_enable)(struct rtl8169_private *tp);
 	void (*hw_start)(struct net_device *);
@@ -1093,7 +1093,7 @@ static int rtl8169_get_regs_len(struct net_device *dev)
 }
 
 static int rtl8169_set_speed_tbi(struct net_device *dev,
-				 u8 autoneg, u16 speed, u8 duplex)
+				 u8 autoneg, u16 speed, u8 duplex, u32 ignored)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
 	void __iomem *ioaddr = tp->mmio_addr;
@@ -1116,17 +1116,28 @@ static int rtl8169_set_speed_tbi(struct net_device *dev,
 }
 
 static int rtl8169_set_speed_xmii(struct net_device *dev,
-				  u8 autoneg, u16 speed, u8 duplex)
+				  u8 autoneg, u16 speed, u8 duplex, u32 adv)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
 	int giga_ctrl, bmcr;
+	int rc = -EINVAL;
 
 	if (autoneg == AUTONEG_ENABLE) {
 		int auto_nego;
 
 		auto_nego = rtl_readphy(tp, MII_ADVERTISE);
-		auto_nego |= (ADVERTISE_10HALF | ADVERTISE_10FULL |
-			      ADVERTISE_100HALF | ADVERTISE_100FULL);
+		auto_nego &= ~(ADVERTISE_10HALF | ADVERTISE_10FULL |
+				ADVERTISE_100HALF | ADVERTISE_100FULL);
+
+		if (adv & ADVERTISED_10baseT_Half)
+			auto_nego |= ADVERTISE_10HALF;
+		if (adv & ADVERTISED_10baseT_Full)
+			auto_nego |= ADVERTISE_10FULL;
+		if (adv & ADVERTISED_100baseT_Half)
+			auto_nego |= ADVERTISE_100HALF;
+		if (adv & ADVERTISED_100baseT_Full)
+			auto_nego |= ADVERTISE_100FULL;
+
 		auto_nego |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM;
 
 		giga_ctrl = rtl_readphy(tp, MII_CTRL1000);
@@ -1141,10 +1152,15 @@ static int rtl8169_set_speed_xmii(struct net_device *dev,
 		    (tp->mac_version != RTL_GIGA_MAC_VER_14) &&
 		    (tp->mac_version != RTL_GIGA_MAC_VER_15) &&
 		    (tp->mac_version != RTL_GIGA_MAC_VER_16)) {
-			giga_ctrl |= ADVERTISE_1000FULL | ADVERTISE_1000HALF;
-		} else {
+			if (adv & ADVERTISED_1000baseT_Half)
+				giga_ctrl |= ADVERTISE_1000HALF;
+			if (adv & ADVERTISED_1000baseT_Full)
+				giga_ctrl |= ADVERTISE_1000FULL;
+		} else if (adv & (ADVERTISED_1000baseT_Half |
+				  ADVERTISED_1000baseT_Full)) {
 			netif_info(tp, link, dev,
 				   "PHY does not support 1000Mbps\n");
+			goto out;
 		}
 
 		bmcr = BMCR_ANENABLE | BMCR_ANRESTART;
@@ -1171,7 +1187,7 @@ static int rtl8169_set_speed_xmii(struct net_device *dev,
 		else if (speed == SPEED_100)
 			bmcr = BMCR_SPEED100;
 		else
-			return -EINVAL;
+			goto out;
 
 		if (duplex == DUPLEX_FULL)
 			bmcr |= BMCR_FULLDPLX;
@@ -1194,16 +1210,18 @@ static int rtl8169_set_speed_xmii(struct net_device *dev,
 		}
 	}
 
-	return 0;
+	rc = 0;
+out:
+	return rc;
 }
 
 static int rtl8169_set_speed(struct net_device *dev,
-			     u8 autoneg, u16 speed, u8 duplex)
+			     u8 autoneg, u16 speed, u8 duplex, u32 advertising)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
 	int ret;
 
-	ret = tp->set_speed(dev, autoneg, speed, duplex);
+	ret = tp->set_speed(dev, autoneg, speed, duplex, advertising);
 
 	if (netif_running(dev) && (tp->phy_1000_ctrl_reg & ADVERTISE_1000FULL))
 		mod_timer(&tp->timer, jiffies + RTL8169_PHY_TIMEOUT);
@@ -1218,7 +1236,8 @@ static int rtl8169_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 	int ret;
 
 	spin_lock_irqsave(&tp->lock, flags);
-	ret = rtl8169_set_speed(dev, cmd->autoneg, cmd->speed, cmd->duplex);
+	ret = rtl8169_set_speed(dev,
+		cmd->autoneg, cmd->speed, cmd->duplex, cmd->advertising);
 	spin_unlock_irqrestore(&tp->lock, flags);
 
 	return ret;
@@ -2517,11 +2536,11 @@ static void rtl8169_init_phy(struct net_device *dev, struct rtl8169_private *tp)
 
 	rtl8169_phy_reset(dev, tp);
 
-	/*
-	 * rtl8169_set_speed_xmii takes good care of the Fast Ethernet
-	 * only 8101. Don't panic.
-	 */
-	rtl8169_set_speed(dev, AUTONEG_ENABLE, SPEED_1000, DUPLEX_FULL);
+	rtl8169_set_speed(dev, AUTONEG_ENABLE, SPEED_1000, DUPLEX_FULL,
+		ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full |
+		ADVERTISED_100baseT_Half | ADVERTISED_100baseT_Full |
+		(tp->mii.supports_gmii ?
+		 ADVERTISED_1000baseT_Half | ADVERTISED_1000baseT_Full : 0));
 
 	if (RTL_R8(PHYstatus) & TBI_Enable)
 		netif_info(tp, link, dev, "TBI auto-negotiating\n");

* Re: [PATCH 2/2] lib: cpu_rmap: CPU affinity reverse-mapping
From: Eric Dumazet @ 2011-01-04 22:19 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Thomas Gleixner, David Miller, Tom Herbert, linux-kernel, netdev,
	linux-net-drivers
In-Reply-To: <1294178690.3636.49.camel@bwh-desktop>

Le mardi 04 janvier 2011 à 22:04 +0000, Ben Hutchings a écrit :

> get_rps_cpu() will need to read from an arbitrary entry in cpu_rmap (not
> the current CPU's entry) for each new flow and for each flow that went
> idle for a while.  That's not fast path but it is part of the data path,
> not the control path.
> 

Hmm, I call this fast path :(

> > Cache lines don't matter. I was not concerned about speed but memory
> > needs.
> > 
> > NR_CPUS can be 4096 on some distros, that means a 32Kbyte allocation.
> > 
> > Really, you'll have to have very strong arguments to introduce an
> > [NR_CPUS] array in the kernel today.
> 
> I could replace this with a pointer to an array of size
> num_possible_cpus().  But I think per_cpu is wrong here.

Yes, a dynamic array is acceptable.

You probably mean nr_cpu_ids 
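
A minimal userspace sketch of the dynamically sized variant, for
illustration only: calloc() stands in for a kernel allocator, the nr_cpus
argument stands in for nr_cpu_ids, and struct cpu_rmap_sketch and
cpu_rmap_sketch_alloc() are hypothetical names, not part of the patch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Same per-CPU entry layout as the near[] member in the patch. */
struct cpu_rmap_near {
	uint16_t index;
	uint16_t dist;
};

/* The per-CPU table becomes a flexible array sized at allocation time. */
struct cpu_rmap_sketch {
	unsigned int nr_cpus;
	struct cpu_rmap_near near[];
};

/* Allocate a zeroed map for nr_cpus CPUs; returns NULL on failure. */
static inline struct cpu_rmap_sketch *cpu_rmap_sketch_alloc(unsigned int nr_cpus)
{
	struct cpu_rmap_sketch *rmap;

	rmap = calloc(1, sizeof(*rmap) + nr_cpus * sizeof(rmap->near[0]));
	if (!rmap)
		return NULL;
	rmap->nr_cpus = nr_cpus;
	return rmap;
}
```

The table then costs memory in proportion to the CPU count passed in rather
than to the compile-time NR_CPUS, while staying one contiguous, read-mostly
array.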

* Re: [PATCH 2/2] lib: cpu_rmap: CPU affinity reverse-mapping
From: Ben Hutchings @ 2011-01-04 22:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Gleixner, David Miller, Tom Herbert, linux-kernel, netdev,
	linux-net-drivers
In-Reply-To: <1294177548.3420.11.camel@edumazet-laptop>

On Tue, 2011-01-04 at 22:45 +0100, Eric Dumazet wrote:
> Le mardi 04 janvier 2011 à 21:23 +0000, Ben Hutchings a écrit :
> > On Tue, 2011-01-04 at 22:17 +0100, Eric Dumazet wrote:
> > > Le mardi 04 janvier 2011 à 19:39 +0000, Ben Hutchings a écrit :
> > > > When initiating I/O on a multiqueue and multi-IRQ device, we may want
> > > > to select a queue for which the response will be handled on the same
> > > > or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add
> > > > library functions to support a generic reverse-mapping from CPUs to
> > > > objects with affinity and the specific case where the objects are
> > > > IRQs.
> > [...]
> > > > +/**
> > > > + * struct cpu_rmap - CPU affinity reverse-map
> > > > + * @near: For each CPU, the index and distance to the nearest object,
> > > > + *      based on affinity masks
> > > > + * @size: Number of objects to be reverse-mapped
> > > > + * @used: Number of objects added
> > > > + * @obj: Array of object pointers
> > > > + */
> > > > +struct cpu_rmap {
> > > > +	struct {
> > > > +		u16     index;
> > > > +		u16     dist;
> > > > +	} near[NR_CPUS];
> > > 
> > > This [NR_CPUS] is highly suspect.
> > > 
> > > > Are you sure you can't use a per_cpu allocation here ?
> > 
> > I think that would be a waste of space in shared caches, as this is
> > read-mostly.
> 
> This is slow path, unless I misunderstood the intent.

get_rps_cpu() will need to read from an arbitrary entry in cpu_rmap (not
the current CPU's entry) for each new flow and for each flow that went
idle for a while.  That's not fast path but it is part of the data path,
not the control path.

> Cache lines don't matter. I was not concerned about speed but memory
> needs.
> 
> NR_CPUS can be 4096 on some distros, that means a 32Kbyte allocation.
> 
> Really, you'll have to have very strong arguments to introduce an
> [NR_CPUS] array in the kernel today.

I could replace this with a pointer to an array of size
num_possible_cpus().  But I think per_cpu is wrong here.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

* Re: [PATCH 2/2] lib: cpu_rmap: CPU affinity reverse-mapping
From: Eric Dumazet @ 2011-01-04 21:45 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Thomas Gleixner, David Miller, Tom Herbert, linux-kernel, netdev,
	linux-net-drivers
In-Reply-To: <1294176216.3636.38.camel@bwh-desktop>

Le mardi 04 janvier 2011 à 21:23 +0000, Ben Hutchings a écrit :
> On Tue, 2011-01-04 at 22:17 +0100, Eric Dumazet wrote:
> > Le mardi 04 janvier 2011 à 19:39 +0000, Ben Hutchings a écrit :
> > > When initiating I/O on a multiqueue and multi-IRQ device, we may want
> > > to select a queue for which the response will be handled on the same
> > > or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add
> > > library functions to support a generic reverse-mapping from CPUs to
> > > objects with affinity and the specific case where the objects are
> > > IRQs.
> [...]
> > > +/**
> > > + * struct cpu_rmap - CPU affinity reverse-map
> > > + * @near: For each CPU, the index and distance to the nearest object,
> > > + *      based on affinity masks
> > > + * @size: Number of objects to be reverse-mapped
> > > + * @used: Number of objects added
> > > + * @obj: Array of object pointers
> > > + */
> > > +struct cpu_rmap {
> > > +	struct {
> > > +		u16     index;
> > > +		u16     dist;
> > > +	} near[NR_CPUS];
> > 
> > This [NR_CPUS] is highly suspect.
> > 
> > Are you sure you can't use a per_cpu allocation here?
> 
> I think that would be a waste of space in shared caches, as this is
> read-mostly.

This is the slow path, unless I misunderstood the intent.

Cache lines don't matter. I was not concerned about speed but memory
needs.

NR_CPUS can be 4096 on some distros, that means a 32Kbyte allocation.

Really, you'll have to have very strong arguments to introduce an
[NR_CPUS] array in the kernel today.
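
To put rough numbers on the memory concern, here is a small userspace sketch. It assumes each near[] entry is exactly the pair of 16-bit fields from the posted struct; near_entry and near_bytes() are names invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors one element of near[] in the posted struct cpu_rmap. */
struct near_entry {
	uint16_t index;
	uint16_t dist;
};

/* Bytes needed for a near[] array covering 'ncpus' CPUs. */
static size_t near_bytes(size_t ncpus)
{
	return ncpus * sizeof(struct near_entry);
}
```

By this arithmetic near_bytes(4096) is 16384 bytes per map, while a machine with 16 possible CPUs would need 64 bytes, so sizing the array at run time removes nearly all of the cost on typical hardware.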

^ permalink raw reply

* Re: [Bugme-new] [Bug 25062] New: Bonding packet deduplication doesn't work properly anymore
From: Andrew Morton @ 2011-01-04 21:39 UTC (permalink / raw)
  To: netdev
  Cc: bugzilla-daemon, bugme-new, bugme-daemon, Jay Vosburgh,
	kevin.lapagna
In-Reply-To: <bug-25062-10286@https.bugzilla.kernel.org/>


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Fri, 17 Dec 2010 11:45:18 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=25062
> 
>            Summary: Bonding packet deduplication doesn't work properly
>                     anymore
>            Product: Networking
>            Version: 2.5
>     Kernel Version: > 2.6.33
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Other
>         AssignedTo: acme@ghostprotocols.net
>         ReportedBy: kevin.lapagna@bigtag.ch
>         Regression: No
> 
> 
> Here's the setup:
> 
> switch: ordinary cisco switch
> eth0: NIC with kernel module tg3
> eth1: NIC with kernel module e1000e
> bond0: bond with slaves eth0,eth1 in mode 1 (or 5)
> bond0.100: vlan device created with vconfig
> bridge100: bridge created with brctl
> tap1: tap device created with tunctl
> vguest: qemu-kvm vguest with emulated e1000 NIC
> 
> 
>  ________                                                          ________
> |        |-- eth0 \                                               |        |
> | switch |          -- bond0 -- bond0.100 -- bridge100 -- tap1 -- | vguest |
> |________|-- eth1 /                                               |________|
> 
> When the vguest emits an ethernet broadcast (DHCP-request), it's forwarded all
> the way up to the switch, through eth0. The switch forwards the broadcast -
> also to eth1. The packet travels then all the way back to bridge100. So the
> last state known to bridge100 for the MAC address of the vguest is
> that it is behind bond0.100 (instead of tap1). If a DHCP server responds to the
> request, the packet travels to bridge100, which now has a faulty
> MAC address table, so the packet is rejected and never reaches tap1 and
> therefore not the vguest.
> 
> I witnessed this wrong behavior in kernel 2.6.37-rc5 (debian package), 2.6.36.2
> and 2.6.35.9 (self compiled -  vanilla). The setup has worked with kernels <=
> 2.6.33.7. I've never tried 2.6.34.
> 
> I assume the setup above is a common way for the separation of virtual guests
> on a network level. So this could become a major issue for a lot of people when
> upgrading their kernels.
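
The failure described in the report can be reproduced with a toy model of bridge source learning. This is a simplified userspace sketch, not the kernel's bridge code: the two-port bridge, the table size and all names (bridge_rx(), fdb_learn(), PORT_BOND_VLAN, ...) are assumptions made for the illustration.

```c
#include <stdint.h>
#include <string.h>

enum port { PORT_DROP = -2, PORT_NONE = -1, PORT_BOND_VLAN = 0, PORT_TAP = 1 };

#define FDB_MAX 8

struct fdb_entry {
	uint8_t mac[6];
	enum port port;
};

struct bridge {
	struct fdb_entry fdb[FDB_MAX];
	int used;
};

/* Return the port a MAC was last seen on, or PORT_NONE if unknown. */
static enum port fdb_lookup(const struct bridge *br, const uint8_t *mac)
{
	for (int i = 0; i < br->used; i++)
		if (memcmp(br->fdb[i].mac, mac, 6) == 0)
			return br->fdb[i].port;
	return PORT_NONE;
}

/* Source learning: record (or move) the port for a source MAC. */
static void fdb_learn(struct bridge *br, const uint8_t *src, enum port in)
{
	for (int i = 0; i < br->used; i++) {
		if (memcmp(br->fdb[i].mac, src, 6) == 0) {
			br->fdb[i].port = in;
			return;
		}
	}
	if (br->used < FDB_MAX) {
		memcpy(br->fdb[br->used].mac, src, 6);
		br->fdb[br->used].port = in;
		br->used++;
	}
}

/* Receive a frame on 'in' and return the egress port.  Unknown or
 * broadcast destinations are flooded to the other port; a destination
 * learned on the ingress port itself is dropped.
 */
static enum port bridge_rx(struct bridge *br, enum port in,
			   const uint8_t *src, const uint8_t *dst)
{
	enum port out;

	fdb_learn(br, src, in);
	out = fdb_lookup(br, dst);
	if (out == PORT_NONE)
		return in == PORT_TAP ? PORT_BOND_VLAN : PORT_TAP;
	if (out == in)
		return PORT_DROP;
	return out;
}
```

Feeding this model the guest's broadcast on the tap port, then the same frame again on the bond/VLAN port, moves the guest's entry to the bond/VLAN port. A reply addressed to the guest that arrives on that same port is then dropped, which matches the symptom reported.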
> 


^ permalink raw reply

* Re: [PATCH 2/2] lib: cpu_rmap: CPU affinity reverse-mapping
From: Ben Hutchings @ 2011-01-04 21:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Gleixner, David Miller, Tom Herbert, linux-kernel, netdev,
	linux-net-drivers
In-Reply-To: <1294175823.3420.7.camel@edumazet-laptop>

On Tue, 2011-01-04 at 22:17 +0100, Eric Dumazet wrote:
> On Tuesday 04 January 2011 at 19:39 +0000, Ben Hutchings wrote:
> > When initiating I/O on a multiqueue and multi-IRQ device, we may want
> > to select a queue for which the response will be handled on the same
> > or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add
> > library functions to support a generic reverse-mapping from CPUs to
> > objects with affinity and the specific case where the objects are
> > IRQs.
[...]
> > +/**
> > + * struct cpu_rmap - CPU affinity reverse-map
> > + * @near: For each CPU, the index and distance to the nearest object,
> > + *      based on affinity masks
> > + * @size: Number of objects to be reverse-mapped
> > + * @used: Number of objects added
> > + * @obj: Array of object pointers
> > + */
> > +struct cpu_rmap {
> > +	struct {
> > +		u16     index;
> > +		u16     dist;
> > +	} near[NR_CPUS];
> 
> This [NR_CPUS] is highly suspect.
> 
> Are you sure you can't use a per_cpu allocation here?

I think that would be a waste of space in shared caches, as this is
read-mostly.

> > +	u16		size, used;
> > +	void		*obj[0];
> > +};
> > +#define CPU_RMAP_DIST_INF 0xffff
> > +
> 
> 
> > +
> > +/**
> > + * alloc_cpu_rmap - allocate CPU affinity reverse-map
> > + * @size: Number of objects to be mapped
> > + * @flags: Allocation flags e.g. %GFP_KERNEL
> > + */
> 
> I really doubt you need anything other than GFP_KERNEL. (Especially if you
> switch to a per_cpu alloc ;) )
[...]

I agree, but this is consistent with ~all other allocation functions.

Ben.


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH 2/2] lib: cpu_rmap: CPU affinity reverse-mapping
From: Eric Dumazet @ 2011-01-04 21:17 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Thomas Gleixner, David Miller, Tom Herbert, linux-kernel, netdev,
	linux-net-drivers
In-Reply-To: <1294169967.3636.34.camel@bwh-desktop>

On Tuesday 04 January 2011 at 19:39 +0000, Ben Hutchings wrote:
> When initiating I/O on a multiqueue and multi-IRQ device, we may want
> to select a queue for which the response will be handled on the same
> or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add
> library functions to support a generic reverse-mapping from CPUs to
> objects with affinity and the specific case where the objects are
> IRQs.
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> ---
>  include/linux/cpu_rmap.h |   73 +++++++++++++
>  lib/Kconfig              |    4 +
>  lib/Makefile             |    2 +
>  lib/cpu_rmap.c           |  262 ++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 341 insertions(+), 0 deletions(-)
>  create mode 100644 include/linux/cpu_rmap.h
>  create mode 100644 lib/cpu_rmap.c
> 
> diff --git a/include/linux/cpu_rmap.h b/include/linux/cpu_rmap.h
> new file mode 100644
> index 0000000..6e2f5ff
> --- /dev/null
> +++ b/include/linux/cpu_rmap.h
> @@ -0,0 +1,73 @@
> +/*
> + * cpu_rmap.c: CPU affinity reverse-map support
> + * Copyright 2010 Solarflare Communications Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation, incorporated herein by reference.
> + */
> +
> +#include <linux/cpumask.h>
> +#include <linux/gfp.h>
> +#include <linux/slab.h>
> +
> +/**
> + * struct cpu_rmap - CPU affinity reverse-map
> + * @near: For each CPU, the index and distance to the nearest object,
> + *      based on affinity masks
> + * @size: Number of objects to be reverse-mapped
> + * @used: Number of objects added
> + * @obj: Array of object pointers
> + */
> +struct cpu_rmap {
> +	struct {
> +		u16     index;
> +		u16     dist;
> +	} near[NR_CPUS];

This [NR_CPUS] is highly suspect.

Are you sure you can't use a per_cpu allocation here?

> +	u16		size, used;
> +	void		*obj[0];
> +};
> +#define CPU_RMAP_DIST_INF 0xffff
> +


> +
> +/**
> + * alloc_cpu_rmap - allocate CPU affinity reverse-map
> + * @size: Number of objects to be mapped
> + * @flags: Allocation flags e.g. %GFP_KERNEL
> + */

I really doubt you need anything other than GFP_KERNEL. (Especially if you
switch to a per_cpu alloc ;) )

> +struct cpu_rmap *alloc_cpu_rmap(unsigned int size, gfp_t flags)
> +{
> +	struct cpu_rmap *rmap;
> +	unsigned int cpu;
> +
> +	/* This is a silly number of objects, and we use u16 indices. */
> +	if (size > 0xffff)
> +		return NULL;
> +
> +	rmap = kzalloc(sizeof(*rmap) + size * sizeof(rmap->obj[0]), flags);
> +	if (!rmap)
> +		return NULL;

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2011-01-04 19:56 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel

Some stragglers, the most important bit is the bridging multicast
crash cure.

1) MIPS build of starfire fails on some 64-bit platforms, fix from Ben
   Hutchings.

2) VLAN flags of ehea inadvertently changed when modifying other
   feature flags.  Fix from Breno Leitao.

3) Don't expose kernel addresses in CAN proc files, from Dan
   Rosenberg.

4) skfp driver probe checks wrong return value for error, from Dan
   Carpenter.

5) Bridging netfilter expects SKB mac header to be initialized
   properly, a simplification made to the STP code inadvertently broke
   that.  Fix from Florian Westphal.

6) tg3_read_vpd() checks return value incorrectly, fix from David
   Sterba.

7) Bridging code doesn't handle non-linear SKBs properly when parsing
   through ipv6 extension headers to get at the IGMP message bits.
   Fix from Tomas Winkler.

8) There's a rather pervasive CISCO ppp implementation bug regarding a
   corner case of compression and protocol IDs, add a sysctl to work
   around this so people can at least function while waiting for
   various CISCO kit to get updated.  From Stephen Hemminger.

9) Memory leaks in ISDN gigaset and broadcom CNIC drivers, from Jesper
   Juhl.

10) Changing ring parameters causes oops in atl1, fix from J. K. Cliburn.

11) In ipv4 we ignore preferred source address setting for local routes,
    fix from Joel Sing.

12) Device leak in atmtcp.c, fix from Julia Lawall.

Please pull, thanks a lot.

The following changes since commit 989d873fc5b6a96695b97738dea8d9f02a60f8ab:

  Merge master.kernel.org:/home/rmk/linux-2.6-arm (2011-01-03 16:37:01 -0800)

are available in the git repository at:

  master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master

Ben Hutchings (1):
      starfire: Fix dma_addr_t size test for MIPS

Breno Leitao (1):
      ehea: Avoid changing vlan flags

Dan Carpenter (1):
      skfp: testing the wrong variable in skfp_driver_init()

Dan Rosenberg (1):
      CAN: Use inode instead of kernel address for /proc file

Dan Williams (1):
      ueagle-atm: fix PHY signal initialization race

David Sterba (1):
      tg3: fix return value check in tg3_read_vpd()

Florian Westphal (1):
      bridge: stp: ensure mac header is set

J. K. Cliburn (1):
      atl1: fix oops when changing tx/rx ring params

Jesper Juhl (2):
      ISDN, Gigaset: Fix memory leak in do_disconnect_req()
      Broadcom CNIC core network driver: fix mem leak on allocation failures in cnic_alloc_uio_rings()

Joel Sing (1):
      ipv4/route.c: respect prefsrc for local routes

Julia Lawall (1):
      drivers/atm/atmtcp.c: add missing atm_dev_put

Tomas Winkler (1):
      bridge: fix br_multicast_ipv6_rcv for paged skbs

stephen hemminger (1):
      ppp: allow disabling multilink protocol ID compression

 drivers/atm/atmtcp.c            |    5 ++++-
 drivers/isdn/gigaset/capi.c     |    1 +
 drivers/net/atlx/atl1.c         |   10 ++++++++++
 drivers/net/cnic.c              |   10 ++++++++--
 drivers/net/ehea/ehea_ethtool.c |    7 +++++++
 drivers/net/ppp_generic.c       |    9 +++++++--
 drivers/net/skfp/skfddi.c       |    2 +-
 drivers/net/starfire.c          |    2 +-
 drivers/net/tg3.c               |    2 +-
 drivers/usb/atm/ueagle-atm.c    |   22 +++++++++++++++++++---
 net/bridge/br_multicast.c       |   28 ++++++++++++++++++----------
 net/bridge/br_stp_bpdu.c        |    2 ++
 net/can/bcm.c                   |    4 ++--
 net/ipv4/route.c                |    8 ++++++--
 14 files changed, 87 insertions(+), 25 deletions(-)

^ permalink raw reply

* [PATCH 2/2] lib: cpu_rmap: CPU affinity reverse-mapping
From: Ben Hutchings @ 2011-01-04 19:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Miller, Tom Herbert, linux-kernel, netdev,
	linux-net-drivers
In-Reply-To: <1294169842.3636.31.camel@bwh-desktop>

When initiating I/O on a multiqueue and multi-IRQ device, we may want
to select a queue for which the response will be handled on the same
or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add
library functions to support a generic reverse-mapping from CPUs to
objects with affinity and the specific case where the objects are
IRQs.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 include/linux/cpu_rmap.h |   73 +++++++++++++
 lib/Kconfig              |    4 +
 lib/Makefile             |    2 +
 lib/cpu_rmap.c           |  262 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 341 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/cpu_rmap.h
 create mode 100644 lib/cpu_rmap.c

diff --git a/include/linux/cpu_rmap.h b/include/linux/cpu_rmap.h
new file mode 100644
index 0000000..6e2f5ff
--- /dev/null
+++ b/include/linux/cpu_rmap.h
@@ -0,0 +1,73 @@
+/*
+ * cpu_rmap.c: CPU affinity reverse-map support
+ * Copyright 2010 Solarflare Communications Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation, incorporated herein by reference.
+ */
+
+#include <linux/cpumask.h>
+#include <linux/gfp.h>
+#include <linux/slab.h>
+
+/**
+ * struct cpu_rmap - CPU affinity reverse-map
+ * @near: For each CPU, the index and distance to the nearest object,
+ *      based on affinity masks
+ * @size: Number of objects to be reverse-mapped
+ * @used: Number of objects added
+ * @obj: Array of object pointers
+ */
+struct cpu_rmap {
+	struct {
+		u16     index;
+		u16     dist;
+	} near[NR_CPUS];
+	u16		size, used;
+	void		*obj[0];
+};
+#define CPU_RMAP_DIST_INF 0xffff
+
+extern struct cpu_rmap *alloc_cpu_rmap(unsigned int size, gfp_t flags);
+
+/**
+ * free_cpu_rmap - free CPU affinity reverse-map
+ * @rmap: Reverse-map allocated with alloc_cpu_rmap(), or %NULL
+ */
+static inline void free_cpu_rmap(struct cpu_rmap *rmap)
+{
+	kfree(rmap);
+}
+
+extern int cpu_rmap_add(struct cpu_rmap *rmap, void *obj);
+extern int cpu_rmap_update(struct cpu_rmap *rmap, u16 index,
+			   const struct cpumask *affinity);
+
+static inline u16 cpu_rmap_lookup_index(struct cpu_rmap *rmap, unsigned int cpu)
+{
+	return rmap->near[cpu].index;
+}
+
+static inline void *cpu_rmap_lookup_obj(struct cpu_rmap *rmap, unsigned int cpu)
+{
+	return rmap->obj[rmap->near[cpu].index];
+}
+
+#ifdef CONFIG_GENERIC_HARDIRQS
+
+/**
+ * alloc_irq_cpu_rmap - allocate CPU affinity reverse-map for IRQs
+ * @size: Number of objects to be mapped
+ *
+ * Must be called in process context.
+ */
+static inline struct cpu_rmap *alloc_irq_cpu_rmap(unsigned int size)
+{
+	return alloc_cpu_rmap(size, GFP_KERNEL);
+}
+extern void free_irq_cpu_rmap(struct cpu_rmap *rmap);
+
+extern int irq_cpu_rmap_add(struct cpu_rmap *rmap, int irq);
+
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index 3d498b2..f43cb2e 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -195,6 +195,10 @@ config DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
        bool "Disable obsolete cpumask functions" if DEBUG_PER_CPU_MAPS
        depends on EXPERIMENTAL && BROKEN
 
+config CPU_RMAP
+	bool
+	depends on SMP
+
 #
 # Netlink attribute parsing support is select'ed if needed
 #
diff --git a/lib/Makefile b/lib/Makefile
index 0248767..001b528 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -110,6 +110,8 @@ obj-$(CONFIG_ATOMIC64_SELFTEST) += atomic64_test.o
 
 obj-$(CONFIG_AVERAGE) += average.o
 
+obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o
+
 hostprogs-y	:= gen_crc32table
 clean-files	:= crc32table.h
 
diff --git a/lib/cpu_rmap.c b/lib/cpu_rmap.c
new file mode 100644
index 0000000..8f7f6c9
--- /dev/null
+++ b/lib/cpu_rmap.c
@@ -0,0 +1,262 @@
+/*
+ * cpu_rmap.c: CPU affinity reverse-map support
+ * Copyright 2010 Solarflare Communications Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation, incorporated herein by reference.
+ */
+
+#include <linux/cpu_rmap.h>
+#ifdef CONFIG_GENERIC_HARDIRQS
+#include <linux/interrupt.h>
+#endif
+#include <linux/module.h>
+
+/*
+ * These functions maintain a mapping from CPUs to some ordered set of
+ * objects with CPU affinities.  This can be seen as a reverse-map of
+ * CPU affinity.  However, we do not assume that the object affinities
+ * cover all CPUs in the system.  For those CPUs not directly covered
+ * by object affinities, we attempt to find a nearest object based on
+ * CPU topology.
+ */
+
+/**
+ * alloc_cpu_rmap - allocate CPU affinity reverse-map
+ * @size: Number of objects to be mapped
+ * @flags: Allocation flags e.g. %GFP_KERNEL
+ */
+struct cpu_rmap *alloc_cpu_rmap(unsigned int size, gfp_t flags)
+{
+	struct cpu_rmap *rmap;
+	unsigned int cpu;
+
+	/* This is a silly number of objects, and we use u16 indices. */
+	if (size > 0xffff)
+		return NULL;
+
+	rmap = kzalloc(sizeof(*rmap) + size * sizeof(rmap->obj[0]), flags);
+	if (!rmap)
+		return NULL;
+
+	/* Initially assign CPUs to objects on a rota, since we have
+	 * no idea where the objects are.  Use infinite distance, so
+	 * any object with known distance is preferable.  Include the
+	 * CPUs that are not present/online, since we definitely want
+	 * any newly-hotplugged CPUs to have some object assigned.
+	 */
+	for_each_possible_cpu(cpu) {
+		rmap->near[cpu].index = cpu % size;
+		rmap->near[cpu].dist = CPU_RMAP_DIST_INF;
+	}
+
+	rmap->size = size;
+	return rmap;
+}
+EXPORT_SYMBOL(alloc_cpu_rmap);
+
+/* Reevaluate nearest object for given CPU, comparing with the given
+ * neighbours at the given distance.
+ */
+static bool cpu_rmap_copy_neigh(struct cpu_rmap *rmap, unsigned int cpu,
+				const struct cpumask *mask, u16 dist)
+{
+	int neigh;
+
+	for_each_cpu(neigh, mask) {
+		if (rmap->near[cpu].dist > dist &&
+		    rmap->near[neigh].dist <= dist) {
+			rmap->near[cpu].index = rmap->near[neigh].index;
+			rmap->near[cpu].dist = dist;
+			return true;
+		}
+	}
+	return false;
+}
+
+#ifdef DEBUG
+static void debug_print_rmap(const struct cpu_rmap *rmap, const char *prefix)
+{
+	unsigned index;
+	unsigned int cpu;
+
+	pr_info("cpu_rmap %p, %s:\n", rmap, prefix);
+
+	for_each_possible_cpu(cpu) {
+		index = rmap->near[cpu].index;
+		pr_info("cpu %d -> obj %u (distance %u)\n",
+			cpu, index, rmap->near[cpu].dist);
+	}
+}
+#else
+static inline void
+debug_print_rmap(const struct cpu_rmap *rmap, const char *prefix)
+{
+}
+#endif
+
+/**
+ * cpu_rmap_add - add object to a rmap
+ * @rmap: CPU rmap allocated with alloc_cpu_rmap()
+ * @obj: Object to add to rmap
+ *
+ * Return index of object.
+ */
+int cpu_rmap_add(struct cpu_rmap *rmap, void *obj)
+{
+	u16 index;
+
+	BUG_ON(rmap->used >= rmap->size);
+	index = rmap->used++;
+	rmap->obj[index] = obj;
+	return index;
+}
+EXPORT_SYMBOL(cpu_rmap_add);
+
+/**
+ * cpu_rmap_update - update CPU rmap following a change of object affinity
+ * @rmap: CPU rmap to update
+ * @index: Index of object whose affinity changed
+ * @affinity: New CPU affinity of object
+ */
+int cpu_rmap_update(struct cpu_rmap *rmap, u16 index,
+		    const struct cpumask *affinity)
+{
+	cpumask_var_t update_mask;
+	unsigned int cpu;
+
+	if (unlikely(!zalloc_cpumask_var(&update_mask, GFP_KERNEL)))
+		return -ENOMEM;
+
+	/* Invalidate distance for all CPUs for which this used to be
+	 * the nearest object.  Mark those CPUs for update.
+	 */
+	for_each_online_cpu(cpu) {
+		if (rmap->near[cpu].index == index) {
+			rmap->near[cpu].dist = CPU_RMAP_DIST_INF;
+			cpumask_set_cpu(cpu, update_mask);
+		}
+	}
+
+	debug_print_rmap(rmap, "after invalidating old distances");
+
+	/* Set distance to 0 for all CPUs in the new affinity mask.
+	 * Mark all CPUs within their NUMA nodes for update.
+	 */
+	for_each_cpu(cpu, affinity) {
+		rmap->near[cpu].index = index;
+		rmap->near[cpu].dist = 0;
+		cpumask_or(update_mask, update_mask,
+			   cpumask_of_node(cpu_to_node(cpu)));
+	}
+
+	debug_print_rmap(rmap, "after updating neighbours");
+
+	/* Update distances based on topology */
+	for_each_cpu(cpu, update_mask) {
+		if (cpu_rmap_copy_neigh(rmap, cpu,
+					topology_thread_cpumask(cpu), 1))
+			continue;
+		if (cpu_rmap_copy_neigh(rmap, cpu,
+					topology_core_cpumask(cpu), 2))
+			continue;
+		if (cpu_rmap_copy_neigh(rmap, cpu,
+					cpumask_of_node(cpu_to_node(cpu)), 3))
+			continue;
+		/* We could continue into NUMA node distances, but for now
+		 * we give up.
+		 */
+	}
+
+	debug_print_rmap(rmap, "after copying neighbours");
+
+	free_cpumask_var(update_mask);
+	return 0;
+}
+EXPORT_SYMBOL(cpu_rmap_update);
+
+#ifdef CONFIG_GENERIC_HARDIRQS
+
+/* Glue between IRQ affinity notifiers and CPU rmaps */
+
+struct irq_glue {
+	struct irq_affinity_notify notify;
+	struct cpu_rmap *rmap;
+	u16 index;
+};
+
+/**
+ * free_irq_cpu_rmap - free a CPU affinity reverse-map used for IRQs
+ * @rmap: Reverse-map allocated with alloc_irq_cpu_rmap(), or %NULL
+ *
+ * Must be called in process context, before freeing the IRQs, and
+ * without holding any locks required by global workqueue items.
+ */
+void free_irq_cpu_rmap(struct cpu_rmap *rmap)
+{
+	struct irq_glue *glue;
+	u16 index;
+
+	if (!rmap)
+		return;
+
+	for (index = 0; index < rmap->used; index++) {
+		glue = rmap->obj[index];
+		irq_set_affinity_notifier(glue->notify.irq, NULL);
+	}
+	irq_run_affinity_notifiers();
+
+	kfree(rmap);
+}
+EXPORT_SYMBOL(free_irq_cpu_rmap);
+
+static void
+irq_cpu_rmap_notify(struct irq_affinity_notify *notify, const cpumask_t *mask)
+{
+	struct irq_glue *glue =
+		container_of(notify, struct irq_glue, notify);
+	int rc;
+
+	rc = cpu_rmap_update(glue->rmap, glue->index, mask);
+	if (rc)
+		pr_warning("irq_cpu_rmap_notify: update failed: %d\n", rc);
+}
+
+static void irq_cpu_rmap_release(struct kref *ref)
+{
+	struct irq_glue *glue =
+		container_of(ref, struct irq_glue, notify.kref);
+	kfree(glue);
+}
+
+/**
+ * irq_cpu_rmap_add - add an IRQ to a CPU affinity reverse-map
+ * @rmap: The reverse-map
+ * @irq: The IRQ number
+ *
+ * This adds an IRQ affinity notifier that will update the reverse-map
+ * automatically.
+ *
+ * Must be called in process context, after the IRQ is allocated but
+ * before it is bound with request_irq().
+ */
+int irq_cpu_rmap_add(struct cpu_rmap *rmap, int irq)
+{
+	struct irq_glue *glue = kzalloc(sizeof(*glue), GFP_KERNEL);
+	int rc;
+
+	if (!glue)
+		return -ENOMEM;
+	glue->notify.notify = irq_cpu_rmap_notify;
+	glue->notify.release = irq_cpu_rmap_release;
+	glue->rmap = rmap;
+	glue->index = cpu_rmap_add(rmap, glue);
+	rc = irq_set_affinity_notifier(irq, &glue->notify);
+	if (rc)
+		kfree(glue);
+	return rc;
+}
+EXPORT_SYMBOL(irq_cpu_rmap_add);
+
+#endif /* CONFIG_GENERIC_HARDIRQS */
-- 
1.7.3.4


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
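
The three passes of cpu_rmap_update() in the patch above (invalidate, set distance 0, copy from neighbours) can be followed in a small userspace model. This is a sketch under loud assumptions: CPU masks are plain 32-bit bitmasks, every CPU is treated as online, and the topology is a made-up 8-CPU machine where CPUs 2k and 2k+1 are thread siblings, CPUs 0-3 and 4-7 form two packages, and all eight share one NUMA node. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NCPUS 8
#define DIST_INF 0xffff

struct near {
	uint16_t index;
	uint16_t dist;
};

/* Toy topology, assumed for this sketch only. */
static uint32_t thread_mask(unsigned int cpu) { return 3u << (cpu & ~1u); }
static uint32_t core_mask(unsigned int cpu) { return 0xfu << (cpu & ~3u); }
static uint32_t node_mask(unsigned int cpu) { (void)cpu; return 0xffu; }

/* Same rule as cpu_rmap_copy_neigh(): adopt a neighbour's object if
 * that neighbour is within 'dist' and we currently have nothing closer.
 */
static bool copy_neigh(struct near *near, unsigned int cpu,
		       uint32_t mask, uint16_t dist)
{
	for (unsigned int n = 0; n < NCPUS; n++) {
		if (!(mask & (1u << n)))
			continue;
		if (near[cpu].dist > dist && near[n].dist <= dist) {
			near[cpu].index = near[n].index;
			near[cpu].dist = dist;
			return true;
		}
	}
	return false;
}

/* Initial rota assignment with infinite distance, as in alloc_cpu_rmap(). */
static void rmap_init(struct near *near, unsigned int size)
{
	for (unsigned int cpu = 0; cpu < NCPUS; cpu++) {
		near[cpu].index = cpu % size;
		near[cpu].dist = DIST_INF;
	}
}

static void rmap_update(struct near *near, uint16_t index, uint32_t affinity)
{
	uint32_t update = 0;
	unsigned int cpu;

	/* Pass 1: invalidate CPUs that pointed at this object. */
	for (cpu = 0; cpu < NCPUS; cpu++) {
		if (near[cpu].index == index) {
			near[cpu].dist = DIST_INF;
			update |= 1u << cpu;
		}
	}

	/* Pass 2: CPUs in the new affinity mask get distance 0. */
	for (cpu = 0; cpu < NCPUS; cpu++) {
		if (affinity & (1u << cpu)) {
			near[cpu].index = index;
			near[cpu].dist = 0;
			update |= node_mask(cpu);
		}
	}

	/* Pass 3: propagate via thread, package, then node siblings. */
	for (cpu = 0; cpu < NCPUS; cpu++) {
		if (!(update & (1u << cpu)))
			continue;
		if (copy_neigh(near, cpu, thread_mask(cpu), 1))
			continue;
		if (copy_neigh(near, cpu, core_mask(cpu), 2))
			continue;
		copy_neigh(near, cpu, node_mask(cpu), 3);
	}
}
```

With two objects, setting object 0's affinity to CPU 0 and then object 1's affinity to CPU 6 leaves CPUs 0-3 mapped to object 0 (distances 0, 1, 2, 2) and CPUs 4-7 mapped to object 1 (distances 2, 2, 0, 1) in this model.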

^ permalink raw reply related
