Netdev List

* Re: Buglet in net/pkt_cls.h pointer handling.
From: David Miller @ 2010-12-21 20:57 UTC (permalink / raw)
  To: suckfish; +Cc: netdev
In-Reply-To: <20101216215627.43f34977.suckfish@ihug.co.nz>

From: Ralph Loader <suckfish@ihug.co.nz>
Date: Thu, 16 Dec 2010 21:56:27 +1300

> tcf_valid_offset() in net/pkt_cls.h appears to have a couple of 
> problems (obvious patch below):
> 
> (a) there is no check for overflow in the pointer arithmetic.
> (b) the pointers are presumably likely to be normally valid, so the
>     hint should be 'likely()' not 'unlikely()'.
> 
> The offsets used to construct the arguments to that function, e.g., as
> called in net/sched/em_u32.c, I think come from user-space & in theory
> could be crafted to cause an invalid pointer deref if ptr+len overflows?
> 
> Possibly the '<' and '>' in that function should be '<=' and '>='
> also.  I'm not familiar enough with the data-structures to be sure.
> 
> Also a question:  in em_u32.c em_u32_match(), and in cls_u32.c
> u32_classify(), we dereference pointers that have had an offset
> (originally from user space) added to them.  I can't see anything that
> keeps those pointers aligned.  Is that a problem on architectures that
> don't support unaligned pointers, or am I missing something?

Your analysis is accurate, so I added the <= and >= test changes
and applied the following to the tree.

Please read Documentation/SubmittingPatches and
Documentation/email-clients.txt before submitting your
own patches in the future.

What you sent here was whitespace damaged by your email client
and you didn't provide a proper "Signed-off-by: " tag in your
commit message.

Thanks.

--------------------
net: Fix range checks in tcf_valid_offset().

This function has three bugs:

1) The offset should be valid most of the time, this is just
   a sanity check, therefore we should use "likely" not "unlikely"

2) This is the only place where we can check for arithmetic overflow
   of the pointer plus the length.

3) The existing range checks are off by one, the valid range is
   skb->head to skb_tail_pointer(), inclusive.

Based almost entirely upon a patch by Ralph Loader.

Reported-by: Ralph Loader <suckfish@ihug.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/pkt_cls.h |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index dd3031a..9fcc680 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -323,7 +323,9 @@ static inline unsigned char * tcf_get_base_ptr(struct sk_buff *skb, int layer)
 static inline int tcf_valid_offset(const struct sk_buff *skb,
 				   const unsigned char *ptr, const int len)
 {
-	return unlikely((ptr + len) < skb_tail_pointer(skb) && ptr > skb->head);
+	return likely((ptr + len) <= skb_tail_pointer(skb) &&
+		      ptr >= skb->head &&
+		      (ptr <= (ptr + len)));
 }
 
 #ifdef CONFIG_NET_CLS_IND
-- 
1.7.3.4
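
For readers following along, the three conditions in the commit message can be sketched in userspace. This is only an illustration, not the kernel code: struct buf_bounds and valid_offset() are hypothetical names, and instead of forming ptr + len the sketch compares len against the space left before the tail, which is one assumed way to avoid the wraparound altogether.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the skb bounds: [head, tail] is the valid
 * region, inclusive at both ends as described in the commit message.
 */
struct buf_bounds {
	const unsigned char *head;
	const unsigned char *tail;
};

/*
 * Return 1 when ptr and ptr + len both lie inside [head, tail], else 0.
 * len is compared against the remaining space, so ptr + len is never
 * computed and a huge or negative len cannot wrap the pointer.
 */
static int valid_offset(const struct buf_bounds *b,
			const unsigned char *ptr, int len)
{
	uintptr_t head = (uintptr_t)b->head;
	uintptr_t tail = (uintptr_t)b->tail;
	uintptr_t p = (uintptr_t)ptr;

	if (len < 0)
		return 0;
	if (p < head || p > tail)
		return 0;
	return (uintptr_t)len <= tail - p;
}
```

With a 16-byte buffer, an offset of 0 with len 16 and an offset of 16 with len 0 are both accepted, which is the inclusive behaviour item 3 asks for.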



* Re: [PATCH V7 1/8] ntp: add ADJ_SETOFFSET mode bit
From: Kuwahara,T. @ 2010-12-21 20:57 UTC (permalink / raw)
  To: Richard Cochran
  Cc: linux-kernel, linux-api, netdev,
	Alan Cox, Arnd Bergmann, Christoph Lameter, David Miller,
	John Stultz, Krzysztof Halasa, Peter Zijlstra, Rodolfo Giometti,
	Thomas Gleixner
In-Reply-To: <20101221075612.GA13626-7KxsofuKt4IfAd9E5cN8NEzG7cXyKsk/@public.gmane.org>

On Tue, Dec 21, 2010 at 4:56 PM, Richard Cochran
<richardcochran@gmail.com> wrote:
> Can you please elaborate?

The timex.constant is defined as the binary logarithm of the reciprocal
of the natural frequency, minus SHIFT_PLL.  In other words, the
following equation holds:

	log2(natural frequency) + time_constant + SHIFT_PLL = 0,

which means that decreasing time_constant increases the natural
frequency exponentially.  And since a larger natural frequency gives a
smaller settling time, a sufficiently large negative time_constant
results in an immediate time step, at least in theory.
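
To make the relation concrete, here is a minimal sketch that rearranges the equation into natural frequency = 2^-(time_constant + SHIFT_PLL). The function name is hypothetical, and shift_pll is passed as a parameter because the value 4 used in the note below is an assumption, not something stated in this thread.

```c
/*
 * natural frequency = 2^-(time_constant + shift_pll), rearranged from
 * log2(natural frequency) + time_constant + SHIFT_PLL = 0.
 * Intended for small exponents only; the loops run once per power of two.
 */
static double natural_frequency(int time_constant, int shift_pll)
{
	int exponent = -(time_constant + shift_pll);
	double f = 1.0;

	for (; exponent > 0; exponent--)
		f *= 2.0;
	for (; exponent < 0; exponent++)
		f /= 2.0;
	return f;
}
```

Taking shift_pll as 4 for illustration, a time_constant of 0 gives 1/16, -4 gives 1, and -6 gives 4: each decrement doubles the natural frequency.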

> I don't see any way to use timex.constant with ADJ_OFFSET in order to
> correct a time offset.

How about this?

if (txc->modes & ADJ_OFFSET) {
	if (txc->constant == INT32_MIN) {
		/* step time */
	} else {
		/* slew time */
	}
}

> The 'time_constant' in kernel/time/ntp.c is
> restricted to the interval [0..MAXTC], and MAXTC is 10 in timex.h.

Then let's just ignore the restriction.  (It's possible by setting
timex.constant without setting the ADJ_TIMECONST flag.)

That said, I'm somewhat against the idea of using the adjtimex syscall
for that purpose.


* Re: [RFC] ipv4: add ICMP socket kind
From: Colin Walters @ 2010-12-21 20:46 UTC (permalink / raw)
  To: Solar Designer; +Cc: Vasiliy Kulikov, netdev, linux-kernel, Pavel Kankovsky
In-Reply-To: <20101221194606.GA25359@openwall.com>

On Tue, Dec 21, 2010 at 2:46 PM, Solar Designer <solar@openwall.com> wrote:
> On Tue, Dec 21, 2010 at 01:46:41PM -0500, Colin Walters wrote:
>> On Tue, Dec 21, 2010 at 1:18 PM, Vasiliy Kulikov <segooon@gmail.com> wrote:
>> > A new ping socket is created with
>> >
>> >  socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)
>>
>> And the default is to allow any uid to do this (modulo LSM)?
>
> We intend to have this sysctl'able and to have it restricted to a group
> by default (the sysctl would set the GID) on our Linux distro,
> Openwall GNU/*/Linux.  However, we figured that it'd be tough for us to
> get this complication accepted into mainstream, so we opted to have the
> patch posted for comment without it.

Right, a sysctl was the obvious thing to have.

>> If you really have a burning desire to get rid of setuid /bin/ping,
>> why not just do it in userspace via message passing to/from a
>> privileged process, and avoid a lot of code in the kernel?
>
> Yes, we thought of that, and we don't like this solution.

...because?

> We similarly
> (but for different reasons) don't like using fscaps to grant CAP_NET_RAW
> to ping.

I am (learning now) about the fscaps drawbacks...

> We share your concern about the size of net/ipv4/ping.c introduced by
> this patch, yet this is our current proposal.

To be clear I have no personal stake in the size of net/; my concern
is more about the set of permissions granted by the default kernel
configuration.  Both from an OS developer standpoint, and also just so
I feel I can continue to explain to someone who's learning about Linux
all the interactions between the uid/gid, capabilities, SELinux, etc.
If the kernel starts adding extensions to things it historically
didn't, that gets even more complicated.

> We figured that there's little point behind such restrictions.  Just how
> is an ICMP echo request any worse than a UDP packet of the same size?
> Anyone can send the latter with current kernels.

Clearly if we could go back in time and make some changes to the
default Unix security model, one of those would probably be making
allocating TCP/UDP sockets a privileged operation in some way.  But
the ship has sailed there...

> Yet, as I have mentioned, we're in fact going to restrict this to a
> group by default and to have ping SGID - just not to expose the extra
> kernel code for direct attack by a local user.  That's in case there's a
> vulnerability in the added code.

So wait...this whole patch is to remove the setuid bit, but then
you're going to go back and add setgid?  How is that really
compellingly different and better?


* Re: [patch -next] pch_can: off by one bugs
From: Wolfgang Grandegger @ 2010-12-21 20:41 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: socketcan-core, netdev, kernel-janitors
In-Reply-To: <20101220092601.GS1936@bicker>

Hello,

On 12/20/2010 10:26 AM, Dan Carpenter wrote:
> priv->tx_enable[] has PCH_TX_OBJ_END elements so this code is
> reading and writing one past the end of the array.
> 
> Signed-off-by: Dan Carpenter <error27@gmail.com>
> 
> diff --git a/drivers/net/can/pch_can.c b/drivers/net/can/pch_can.c
> index 8d45fdd..b2c1292 100644
> --- a/drivers/net/can/pch_can.c
> +++ b/drivers/net/can/pch_can.c
> @@ -1077,7 +1077,7 @@ static int pch_can_suspend(struct pci_dev *pdev, pm_message_t state)
>  	pch_can_set_int_enables(priv, PCH_CAN_DISABLE);
>  
>  	/* Save Tx buffer enable state */
> -	for (i = PCH_TX_OBJ_START; i <= PCH_TX_OBJ_END; i++)
> +	for (i = PCH_TX_OBJ_START; i < PCH_TX_OBJ_END; i++)
>  		priv->tx_enable[i] = pch_can_get_rxtx_ir(priv, i, PCH_TX_IFREG);
>  
>  	/* Disable all Transmit buffers */
> @@ -1138,7 +1138,7 @@ static int pch_can_resume(struct pci_dev *pdev)
>  	pch_can_set_optmode(priv);
>  
>  	/* Enabling the transmit buffer. */
> -	for (i = PCH_TX_OBJ_START; i <= PCH_TX_OBJ_END; i++)
> +	for (i = PCH_TX_OBJ_START; i < PCH_TX_OBJ_END; i++)
>  		pch_can_set_rxtx(priv, i, priv->tx_enable[i], PCH_TX_IFREG);
>  
>  	/* Configuring the receive buffer and enabling them. */
> 

This fix does not look correct to me. There are many more loops using
"i <= PCH_TX_OBJ_END" and the message numbering is from 1..32. Therefore
using "priv->tx_enable[i - 1]" seems more appropriate to me. Tomaya,
could you please check?
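
The indexing alternative can be sketched in isolation. Everything here is hypothetical: OBJ_START, OBJ_END, save_state() and demo_read() are illustrative names, and the 1..32 range is simply taken from the numbering mentioned above rather than from the driver's real constants.

```c
/* Hypothetical values: objects are numbered 1..32, inclusive. */
#define OBJ_START 1
#define OBJ_END   32
#define OBJ_COUNT (OBJ_END - OBJ_START + 1)

/* Stand-in for a hardware register read; returns a value per object. */
static unsigned int demo_read(int obj)
{
	return (unsigned int)obj * 10;
}

/*
 * Keep the inclusive loop bounds and shift the array index instead, so
 * object OBJ_START lands in slot 0 and object OBJ_END in the last slot.
 */
static void save_state(unsigned int saved[OBJ_COUNT],
		       unsigned int (*read_obj)(int obj))
{
	int i;

	for (i = OBJ_START; i <= OBJ_END; i++)
		saved[i - OBJ_START] = read_obj(i);
}
```

With numbering that starts at 1, i - OBJ_START is the same as the i - 1 suggested in the reply, and the loop still visits every object, whereas changing "<=" to "<" would skip the last one.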

Thanks,

Wolfgang.


* Re: [PATCH net-next-2.6] filter: optimize accesses to ancillary data
From: David Miller @ 2010-12-21 20:30 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1292478328.2603.56.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 16 Dec 2010 06:45:28 +0100

> We can translate pseudo load instructions at filter check time to
> dedicated instructions to speed up filtering and avoid one switch().
> libpcap currently uses SKF_AD_PROTOCOL, but custom filters probably use
> other ancillary accesses.
> 
> Note : I made the assertion that ancillary data was always accessed with
> BPF_LD|BPF_?|BPF_ABS instructions, not with BPF_LD|BPF_?|BPF_IND ones
> (offset given by K constant, not by K + X register)
> 
> On x86_64, this saves a few bytes of text :
> 
> # size net/core/filter.o.*
>    text	   data	    bss	    dec	    hex	filename
>    4864	      0	      0	   4864	   1300	net/core/filter.o.new
>    4944	      0	      0	   4944	   1350	net/core/filter.o.old
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.
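
As a rough illustration of the idea in the quoted description, the sketch below rewrites a generic absolute load into a dedicated opcode once, at check time. The SKF_AD_* values are the ones from linux/filter.h; the demo_op enum, struct demo_insn and translate_ancillary() are hypothetical and do not reflect the actual patch.

```c
#include <stdint.h>

/* Ancillary offsets as in linux/filter.h. */
#define SKF_AD_OFF      (-0x1000)
#define SKF_AD_PROTOCOL 0
#define SKF_AD_PKTTYPE  4
#define SKF_AD_IFINDEX  8

/* Hypothetical opcodes: one generic load plus dedicated ancillary loads. */
enum demo_op {
	DEMO_LD_ABS,
	DEMO_ANC_PROTOCOL,
	DEMO_ANC_PKTTYPE,
	DEMO_ANC_IFINDEX
};

struct demo_insn {
	enum demo_op op;
	uint32_t k;
};

/*
 * Run once when the filter is checked: if K names an ancillary field,
 * replace the generic load with the dedicated opcode so the per-packet
 * path does not have to decode K again.
 */
static void translate_ancillary(struct demo_insn *insn)
{
	if (insn->op != DEMO_LD_ABS)
		return;

	switch (insn->k) {
	case (uint32_t)(SKF_AD_OFF + SKF_AD_PROTOCOL):
		insn->op = DEMO_ANC_PROTOCOL;
		break;
	case (uint32_t)(SKF_AD_OFF + SKF_AD_PKTTYPE):
		insn->op = DEMO_ANC_PKTTYPE;
		break;
	case (uint32_t)(SKF_AD_OFF + SKF_AD_IFINDEX):
		insn->op = DEMO_ANC_IFINDEX;
		break;
	default:
		break;	/* ordinary packet offset: leave the load as is */
	}
}
```

An ordinary load such as K = 12 is left untouched, while K = SKF_AD_OFF + SKF_AD_PROTOCOL becomes the dedicated protocol opcode.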


* Re: [PATCH net-next-2.6] bnx2: remove cancel_work_sync() from remove_one
From: David Miller @ 2010-12-21 20:20 UTC (permalink / raw)
  To: tj; +Cc: mchan, netdev, linux-kernel
In-Reply-To: <20101221105103.GA32744@htj.dyndns.org>

From: Tejun Heo <tj@kernel.org>
Date: Tue, 21 Dec 2010 11:51:04 +0100

> Yeah, I agree the synchronize_rcu() there would guarantee the actual
> timer completion but as it currently stands it looks a bit too subtle.
> Maybe it's a good idea to add a big fat comment explaining that the
> timer is guaranteed to stop after close() and how it's guaranteed
> through synchronize_rcu() at the moment?  Also, it might be better to
> use synchronize_sched() there as timer synchronization through
> synchronize_rcu() is more of a happy accident.

I'm not sure the synchronize_*() is even necessary to guarantee
watchdog timer completion.

Like I said, I think the netif_tx_lock() held around both the timer
function itself, and the del_timer() call, are sufficient.

So, this ensures that the watchdog timer either runs to completion or
sees the no-op scheduler attached and returns immediately without
rescheduling the timer.

In any event, I'm going to apply your bnx2 patch to net-next-2.6

Thanks.


* Re: pull request: sfc-next-2.6 2010-12-21
From: David Miller @ 2010-12-21 20:17 UTC (permalink / raw)
  To: bhutchings; +Cc: netdev, linux-net-drivers
In-Reply-To: <1292942817.3256.2.camel@bwh-desktop>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Tue, 21 Dec 2010 14:46:57 +0000

> The following changes since commit cf78f8ee3de7d8d5b47d371c95716d0e4facf1c4:
> 
>   Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next-2.6 (2010-12-10 10:20:43 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next-2.6.git for-davem
> 
> Some overdue cleanup of the TX path.

Pulled, thanks a lot Ben.


* Re: [RFC] ipv4: add ICMP socket kind
From: Solar Designer @ 2010-12-21 19:46 UTC (permalink / raw)
  To: Colin Walters; +Cc: Vasiliy Kulikov, netdev, linux-kernel, Pavel Kankovsky
In-Reply-To: <AANLkTikS6rs-AFGmtTfuRdELJ5+y=o4g1i0Xk9KDFRiv@mail.gmail.com>

On Tue, Dec 21, 2010 at 01:46:41PM -0500, Colin Walters wrote:
> On Tue, Dec 21, 2010 at 1:18 PM, Vasiliy Kulikov <segooon@gmail.com> wrote:
> > A new ping socket is created with
> >
> >  socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)
> 
> And the default is to allow any uid to do this (modulo LSM)?

We intend to have this sysctl'able and to have it restricted to a group
by default (the sysctl would set the GID) on our Linux distro,
Openwall GNU/*/Linux.  However, we figured that it'd be tough for us to
get this complication accepted into mainstream, so we opted to have the
patch posted for comment without it.
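
From the caller's side, the socket() call quoted above and the permission question can be sketched as follows. The helper name open_ping_socket() is hypothetical; the sketch only assumes that a kernel without the feature, or a caller who is not permitted to use it, makes socket() fail with errno set.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Try to open the datagram ICMP socket discussed in this thread.
 * Returns the descriptor on success, or -1 with errno preserved so the
 * caller can fall back to another mechanism.
 */
static int open_ping_socket(void)
{
	int fd = socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP);

	if (fd < 0) {
		int saved = errno;

		fprintf(stderr, "ping socket unavailable: %s\n",
			strerror(saved));
		errno = saved;
	}
	return fd;
}
```

A ping program written this way needs no setuid bit when the call succeeds, and can report the failure cleanly when the administrator has restricted the feature.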

> If you really have a burning desire to get rid of setuid /bin/ping,
> why not just do it in userspace via message passing to/from a
> privileged process, and avoid a lot of code in the kernel?

Yes, we thought of that, and we don't like this solution.  We similarly
(but for different reasons) don't like using fscaps to grant CAP_NET_RAW
to ping.

We share your concern about the size of net/ipv4/ping.c introduced by
this patch, yet this is our current proposal.

> It's much
> more flexible.  You could, for example, limit it to once a second by
> default, allow only one process doing this per uid, etc.

We figured that there's little point behind such restrictions.  Just how
is an ICMP echo request any worse than a UDP packet of the same size?
Anyone can send the latter with current kernels.

Additionally, Vasiliy found out that Mac OS X has a similar feature,
implemented in a riskier way than what we propose (they do no filtering
of incoming ICMP traffic):

http://www.manpagez.com/man/4/icmp/

So there's precedent, and our proposal is better.

Yet, as I have mentioned, we're in fact going to restrict this to a
group by default and to have ping SGID - just not to expose the extra
kernel code for direct attack by a local user.  That's in case there's a
vulnerability in the added code.

If a sysctl like this is what others want to have as well, we'd be happy
to provide a revision of the patch including that.  Then we won't have
to maintain it as a custom patch.

Thank you for your criticism.

Alexander Peslyak <solar at openwall.com>
GPG key ID: 5B341F15  fp: B3FB 63F4 D7A3 BCCC 6F6E  FC55 A2FC 027C 5B34 1F15
http://www.openwall.com - bringing security into open computing environments


* Re: [PATCH V7 1/8] ntp: add ADJ_SETOFFSET mode bit
From: john stultz @ 2010-12-21 19:37 UTC (permalink / raw)
  To: Kuwahara,T.
  Cc: Richard Cochran, linux-kernel, linux-api, netdev, Alan Cox,
	Arnd Bergmann, Christoph Lameter, David Miller, Krzysztof Halasa,
	Peter Zijlstra, Rodolfo Giometti, Thomas Gleixner
In-Reply-To: <AANLkTi=yGoFwYt4p_LeHtAQyYgmURspO-p57UdL0sUEZ@mail.gmail.com>

On Sat, 2010-12-18 at 05:16 +0900, Kuwahara,T. wrote:
> On 12/17/10, Richard Cochran <richardcochran@gmail.com> wrote:
> > This patch adds a new mode bit into the timex structure. When set, the bit
> > instructs the kernel to add the given time value to the current time.
> >
> 
> The proposed new control mode, ADJ_SETOFFSET, is logically the same as
> ADJ_OFFSET with timex.constant == -INFINITY. 

I'm not sure if this is correct. It's more like settimeofday, only giving
a relative offset to jump the clock, rather than an absolute time. It
does not slew the clock over time like ADJ_OFFSET does.
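
A relative step request of this kind could be prepared as in the sketch below. It assumes a <sys/timex.h> that defines ADJ_SETOFFSET and ADJ_NANO, which is not a given for the kernel under discussion since the bit is only being proposed here; make_step_request() is a hypothetical helper, and keeping the sub-second part non-negative is a convention assumed for the sketch. Nothing is applied to the clock: the struct would still have to be passed to adjtimex() by a privileged caller.

```c
#include <string.h>
#include <sys/timex.h>

/*
 * Build a struct timex asking for a relative clock step of offset_ns
 * nanoseconds. The offset is split into whole seconds plus a
 * non-negative sub-second part, so -0.5 s becomes -1 s + 0.5 s.
 */
static struct timex make_step_request(long long offset_ns)
{
	struct timex tx;
	long long sec = offset_ns / 1000000000LL;
	long long nsec = offset_ns % 1000000000LL;

	if (nsec < 0) {
		sec -= 1;
		nsec += 1000000000LL;
	}

	memset(&tx, 0, sizeof(tx));
	tx.modes = ADJ_SETOFFSET | ADJ_NANO;
	tx.time.tv_sec = sec;
	tx.time.tv_usec = nsec;	/* carries nanoseconds when ADJ_NANO is set */
	return tx;
}
```

Unlike an ADJ_OFFSET request, which the kernel works off gradually, a request built this way describes a single jump by the given amount.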

>  So it is possible to do
> the same thing without risking forward compatibility.  (I mean by "risking
> forward compatibility" that the mode bit 0x0040 may be defined differently
> by the upstream maintainer anytime in the future.)

adjtimex is a Linux-specific interface, which is compatible with but not
identical to the NTP-specified interfaces. The ntp client code already
has Linux-specific modifications, so I don't think we have to worry
about 0x40 specifically being reserved by the NTP client.

thanks
-john


* [PATCH net-next 3/3] bnx2x: adding dcbnl support
From: Shmulik Ravid @ 2010-12-21 19:33 UTC (permalink / raw)
  To: davem; +Cc: eilong, lucy.liu, netdev

Adding dcbnl implementation to bnx2x allowing users to manage the
embedded DCBX engine.

Signed-off-by: Shmulik Ravid <shmulikr@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x.h      |   23 ++-
 drivers/net/bnx2x/bnx2x_dcb.c  |  661 ++++++++++++++++++++++++++++++++++++++-
 drivers/net/bnx2x/bnx2x_dcb.h  |    9 +-
 drivers/net/bnx2x/bnx2x_main.c |    8 +-
 4 files changed, 676 insertions(+), 25 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x.h b/drivers/net/bnx2x/bnx2x.h
index f14c6ed..77d6c8d 100644
--- a/drivers/net/bnx2x/bnx2x.h
+++ b/drivers/net/bnx2x/bnx2x.h
@@ -22,15 +22,17 @@
  * (you will need to reboot afterwards) */
 /* #define BNX2X_STOP_ON_ERROR */
 
-#define DRV_MODULE_VERSION      "1.62.00-2"
-#define DRV_MODULE_RELDATE      "2010/12/13"
+#define DRV_MODULE_VERSION      "1.62.00-3"
+#define DRV_MODULE_RELDATE      "2010/12/21"
 #define BNX2X_BC_VER            0x040200
 
 #define BNX2X_MULTI_QUEUE
 
 #define BNX2X_NEW_NAPI
 
-
+#if defined(CONFIG_DCB)
+#define BCM_DCB
+#endif
 #if defined(CONFIG_CNIC) || defined(CONFIG_CNIC_MODULE)
 #define BCM_CNIC 1
 #include "../cnic_if.h"
@@ -1186,7 +1188,20 @@ struct bnx2x {
 	/* LLDP params */
 	struct bnx2x_config_lldp_params		lldp_config_params;
 
-	/* DCBX params */
+	/* DCB support on/off */
+	u16 dcb_state;
+#define BNX2X_DCB_STATE_OFF			0
+#define BNX2X_DCB_STATE_ON			1
+
+	/* DCBX engine mode */
+	int dcbx_enabled;
+#define BNX2X_DCBX_ENABLED_OFF			0
+#define BNX2X_DCBX_ENABLED_ON_NEG_OFF		1
+#define BNX2X_DCBX_ENABLED_ON_NEG_ON		2
+#define BNX2X_DCBX_ENABLED_INVALID		(-1)
+
+	bool dcbx_mode_uset;
+
 	struct bnx2x_config_dcbx_params		dcbx_config_params;
 
 	struct bnx2x_dcbx_port_params		dcbx_port_params;
diff --git a/drivers/net/bnx2x/bnx2x_dcb.c b/drivers/net/bnx2x/bnx2x_dcb.c
index 0b86480..9249aa1 100644
--- a/drivers/net/bnx2x/bnx2x_dcb.c
+++ b/drivers/net/bnx2x/bnx2x_dcb.c
@@ -619,13 +619,10 @@ static void bnx2x_dcbx_admin_mib_updated_params(struct bnx2x *bp,
 	for (i = 0; i < sizeof(struct lldp_admin_mib); i += 4, buff++)
 		*buff = REG_RD(bp, (offset + i));
 
-
-	if (BNX2X_DCBX_CONFIG_INV_VALUE != dp->admin_dcbx_enable) {
-		if (dp->admin_dcbx_enable)
-			SET_FLAGS(admin_mib.ver_cfg_flags, DCBX_DCBX_ENABLED);
-		else
-			RESET_FLAGS(admin_mib.ver_cfg_flags, DCBX_DCBX_ENABLED);
-	}
+	if (bp->dcbx_enabled == BNX2X_DCBX_ENABLED_ON_NEG_ON)
+		SET_FLAGS(admin_mib.ver_cfg_flags, DCBX_DCBX_ENABLED);
+	else
+		RESET_FLAGS(admin_mib.ver_cfg_flags, DCBX_DCBX_ENABLED);
 
 	if ((BNX2X_DCBX_OVERWRITE_SETTINGS_ENABLE ==
 				dp->overwrite_settings)) {
@@ -734,12 +731,26 @@ static void bnx2x_dcbx_admin_mib_updated_params(struct bnx2x *bp,
 		REG_WR(bp, (offset + i), *buff);
 }
 
-/* default */
+void bnx2x_dcbx_set_state(struct bnx2x *bp, bool dcb_on, u32 dcbx_enabled)
+{
+	if (CHIP_IS_E2(bp) && !CHIP_MODE_IS_4_PORT(bp)) {
+		bp->dcb_state = dcb_on;
+		bp->dcbx_enabled = dcbx_enabled;
+	} else {
+		bp->dcb_state = false;
+		bp->dcbx_enabled = BNX2X_DCBX_ENABLED_INVALID;
+	}
+	DP(NETIF_MSG_LINK, "DCB state [%s:%s]\n",
+	   dcb_on ? "ON" : "OFF",
+	   dcbx_enabled == BNX2X_DCBX_ENABLED_OFF ? "user-mode" :
+	   dcbx_enabled == BNX2X_DCBX_ENABLED_ON_NEG_OFF ? "on-chip static" :
+	   dcbx_enabled == BNX2X_DCBX_ENABLED_ON_NEG_ON ?
+	   "on-chip with negotiation" : "invalid");
+}
+
 void bnx2x_dcbx_init_params(struct bnx2x *bp)
 {
 	bp->dcbx_config_params.admin_dcbx_version = 0x0; /* 0 - CEE; 1 - IEEE */
-	bp->dcbx_config_params.dcb_enable = 1;
-	bp->dcbx_config_params.admin_dcbx_enable = 1;
 	bp->dcbx_config_params.admin_ets_willing = 1;
 	bp->dcbx_config_params.admin_pfc_willing = 1;
 	bp->dcbx_config_params.overwrite_settings = 1;
@@ -807,23 +818,27 @@ void bnx2x_dcbx_init_params(struct bnx2x *bp)
 void bnx2x_dcbx_init(struct bnx2x *bp)
 {
 	u32 dcbx_lldp_params_offset = SHMEM_LLDP_DCBX_PARAMS_NONE;
+
+	if (bp->dcbx_enabled <= 0)
+		return;
+
 	/* validate:
 	 * chip of good for dcbx version,
 	 * dcb is wanted
 	 * the function is pmf
 	 * shmem2 contains DCBX support fields
 	 */
-	DP(NETIF_MSG_LINK, "dcb_enable %d bp->port.pmf %d\n",
-	   bp->dcbx_config_params.dcb_enable, bp->port.pmf);
+	DP(NETIF_MSG_LINK, "dcb_state %d bp->port.pmf %d\n",
+	   bp->dcb_state, bp->port.pmf);
 
-	if (CHIP_IS_E2(bp) && !CHIP_MODE_IS_4_PORT(bp) &&
-	    bp->dcbx_config_params.dcb_enable &&
-	    bp->port.pmf &&
+	if (bp->dcb_state ==  BNX2X_DCB_STATE_ON && bp->port.pmf &&
 	    SHMEM2_HAS(bp, dcbx_lldp_params_offset)) {
-		dcbx_lldp_params_offset = SHMEM2_RD(bp,
-						    dcbx_lldp_params_offset);
+		dcbx_lldp_params_offset =
+			SHMEM2_RD(bp, dcbx_lldp_params_offset);
+
 		DP(NETIF_MSG_LINK, "dcbx_lldp_params_offset 0x%x\n",
 		   dcbx_lldp_params_offset);
+
 		if (SHMEM_LLDP_DCBX_PARAMS_NONE != dcbx_lldp_params_offset) {
 			bnx2x_dcbx_lldp_updated_params(bp,
 						       dcbx_lldp_params_offset);
@@ -1489,3 +1504,615 @@ static void bnx2x_pfc_fw_struct_e2(struct bnx2x *bp)
 	}
 	bnx2x_dcbx_print_cos_params(bp,	pfc_fw_cfg);
 }
+/* DCB netlink */
+#ifdef BCM_DCB
+#include <linux/dcbnl.h>
+
+#define BNX2X_DCBX_CAPS		(DCB_CAP_DCBX_HW | DCB_CAP_DCBX_VER_CEE | \
+				DCB_CAP_DCBX_STATIC)
+
+static inline bool bnx2x_dcbnl_set_valid(struct bnx2x *bp)
+{
+	/* validate dcbnl call that may change HW state:
+	 * DCB is on and DCBX mode was SUCCESSFULLY set by the user.
+	 */
+	return bp->dcb_state && bp->dcbx_mode_uset;
+}
+
+static u8 bnx2x_dcbnl_get_state(struct net_device *netdev)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "state = %d\n", bp->dcb_state);
+	return bp->dcb_state;
+}
+
+static u8 bnx2x_dcbnl_set_state(struct net_device *netdev, u8 state)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "state = %s\n", state ? "on" : "off");
+
+	bnx2x_dcbx_set_state(bp, (state ? true : false), bp->dcbx_enabled);
+	return 0;
+}
+
+static void bnx2x_dcbnl_get_perm_hw_addr(struct net_device *netdev,
+					 u8 *perm_addr)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "GET-PERM-ADDR\n");
+
+	/* first the HW mac address */
+	memcpy(perm_addr, netdev->dev_addr, netdev->addr_len);
+
+#ifdef BCM_CNIC
+	/* second SAN address */
+	memcpy(perm_addr+netdev->addr_len, bp->fip_mac, netdev->addr_len);
+#endif
+}
+
+static void bnx2x_dcbnl_set_pg_tccfg_tx(struct net_device *netdev, int prio,
+					u8 prio_type, u8 pgid, u8 bw_pct,
+					u8 up_map)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+
+	DP(NETIF_MSG_LINK, "prio[%d] = %d\n", prio, pgid);
+	if (!bnx2x_dcbnl_set_valid(bp) || prio >= DCBX_MAX_NUM_PRI_PG_ENTRIES)
+		return;
+
+	/**
+	 * bw_pct ignored -	bandwidth percentage division between user
+	 *			priorities within the same group is not
+	 *			standard and hence not supported
+	 *
+	 * prio_type ignored -	priority levels within the same group are not
+	 *			standard and hence are not supported. According
+	 *			to the standard pgid 15 is dedicated to strict
+	 *			priority traffic (on the port level).
+	 *
+	 * up_map ignored
+	 */
+
+	bp->dcbx_config_params.admin_configuration_ets_pg[prio] = pgid;
+	bp->dcbx_config_params.admin_ets_configuration_tx_enable = 1;
+}
+
+static void bnx2x_dcbnl_set_pg_bwgcfg_tx(struct net_device *netdev,
+					 int pgid, u8 bw_pct)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "pgid[%d] = %d\n", pgid, bw_pct);
+
+	if (!bnx2x_dcbnl_set_valid(bp) || pgid >= DCBX_MAX_NUM_PG_BW_ENTRIES)
+		return;
+
+	bp->dcbx_config_params.admin_configuration_bw_precentage[pgid] = bw_pct;
+	bp->dcbx_config_params.admin_ets_configuration_tx_enable = 1;
+}
+
+static void bnx2x_dcbnl_set_pg_tccfg_rx(struct net_device *netdev, int prio,
+					u8 prio_type, u8 pgid, u8 bw_pct,
+					u8 up_map)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "Nothing to set; No RX support\n");
+}
+
+static void bnx2x_dcbnl_set_pg_bwgcfg_rx(struct net_device *netdev,
+					 int pgid, u8 bw_pct)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "Nothing to set; No RX support\n");
+}
+
+static void bnx2x_dcbnl_get_pg_tccfg_tx(struct net_device *netdev, int prio,
+					u8 *prio_type, u8 *pgid, u8 *bw_pct,
+					u8 *up_map)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "prio = %d\n", prio);
+
+	/**
+	 * bw_pct ignored -	bandwidth percentage division between user
+	 *			priorities within the same group is not
+	 *			standard and hence not supported
+	 *
+	 * prio_type ignored -	priority levels within the same group are not
+	 *			standard and hence are not supported. According
+	 *			to the standard pgid 15 is dedicated to strict
+	 *			priority traffic (on the port level).
+	 *
+	 * up_map ignored
+	 */
+	*up_map = *bw_pct = *prio_type = *pgid = 0;
+
+	if (!bp->dcb_state || prio >= DCBX_MAX_NUM_PRI_PG_ENTRIES)
+		return;
+
+	*pgid = DCBX_PRI_PG_GET(bp->dcbx_local_feat.ets.pri_pg_tbl, prio);
+}
+
+static void bnx2x_dcbnl_get_pg_bwgcfg_tx(struct net_device *netdev,
+					 int pgid, u8 *bw_pct)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "pgid = %d\n", pgid);
+
+	*bw_pct = 0;
+
+	if (!bp->dcb_state || pgid >= DCBX_MAX_NUM_PG_BW_ENTRIES)
+		return;
+
+	*bw_pct = DCBX_PG_BW_GET(bp->dcbx_local_feat.ets.pg_bw_tbl, pgid);
+}
+
+static void bnx2x_dcbnl_get_pg_tccfg_rx(struct net_device *netdev, int prio,
+					u8 *prio_type, u8 *pgid, u8 *bw_pct,
+					u8 *up_map)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "Nothing to get; No RX support\n");
+
+	*prio_type = *pgid = *bw_pct = *up_map = 0;
+}
+
+static void bnx2x_dcbnl_get_pg_bwgcfg_rx(struct net_device *netdev,
+					 int pgid, u8 *bw_pct)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "Nothing to get; No RX support\n");
+
+	*bw_pct = 0;
+}
+
+static void bnx2x_dcbnl_set_pfc_cfg(struct net_device *netdev, int prio,
+				    u8 setting)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "prio[%d] = %d\n", prio, setting);
+
+	if (!bnx2x_dcbnl_set_valid(bp) || prio >= MAX_PFC_PRIORITIES)
+		return;
+
+	bp->dcbx_config_params.admin_pfc_bitmap |= ((setting ? 1 : 0) << prio);
+
+	if (setting)
+		bp->dcbx_config_params.admin_pfc_tx_enable = 1;
+}
+
+static void bnx2x_dcbnl_get_pfc_cfg(struct net_device *netdev, int prio,
+				    u8 *setting)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "prio = %d\n", prio);
+
+	*setting = 0;
+
+	if (!bp->dcb_state || prio >= MAX_PFC_PRIORITIES)
+		return;
+
+	*setting = (bp->dcbx_local_feat.pfc.pri_en_bitmap >> prio) & 0x1;
+}
+
+static u8 bnx2x_dcbnl_set_all(struct net_device *netdev)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	int rc = 0;
+
+	DP(NETIF_MSG_LINK, "SET-ALL\n");
+
+	if (!bnx2x_dcbnl_set_valid(bp))
+		return 1;
+
+	if (bp->recovery_state != BNX2X_RECOVERY_DONE) {
+		netdev_err(bp->dev, "Handling parity error recovery. "
+				"Try again later\n");
+		return 1;
+	}
+	if (netif_running(bp->dev)) {
+		bnx2x_nic_unload(bp, UNLOAD_NORMAL);
+		rc = bnx2x_nic_load(bp, LOAD_NORMAL);
+	}
+	DP(NETIF_MSG_LINK, "set_dcbx_params done (%d)\n", rc);
+	if (rc)
+		return 1;
+
+	return 0;
+}
+
+static u8 bnx2x_dcbnl_get_cap(struct net_device *netdev, int capid, u8 *cap)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	u8 rval = 0;
+
+	if (bp->dcb_state) {
+		switch (capid) {
+		case DCB_CAP_ATTR_PG:
+			*cap = true;
+			break;
+		case DCB_CAP_ATTR_PFC:
+			*cap = true;
+			break;
+		case DCB_CAP_ATTR_UP2TC:
+			*cap = false;
+			break;
+		case DCB_CAP_ATTR_PG_TCS:
+			*cap = 0x80;	/* 8 priorities for PGs */
+			break;
+		case DCB_CAP_ATTR_PFC_TCS:
+			*cap = 0x80;	/* 8 priorities for PFC */
+			break;
+		case DCB_CAP_ATTR_GSP:
+			*cap = true;
+			break;
+		case DCB_CAP_ATTR_BCN:
+			*cap = false;
+			break;
+		case DCB_CAP_ATTR_DCBX:
+			*cap = BNX2X_DCBX_CAPS;
+		default:
+			rval = -EINVAL;
+			break;
+		}
+	} else
+		rval = -EINVAL;
+
+	DP(NETIF_MSG_LINK, "capid %d:%x\n", capid, *cap);
+	return rval;
+}
+
+static u8 bnx2x_dcbnl_get_numtcs(struct net_device *netdev, int tcid, u8 *num)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	u8 rval = 0;
+
+	DP(NETIF_MSG_LINK, "tcid %d\n", tcid);
+
+	if (bp->dcb_state) {
+		switch (tcid) {
+		case DCB_NUMTCS_ATTR_PG:
+			*num = E2_NUM_OF_COS;
+			break;
+		case DCB_NUMTCS_ATTR_PFC:
+			*num = E2_NUM_OF_COS;
+			break;
+		default:
+			rval = -EINVAL;
+			break;
+		}
+	} else
+		rval = -EINVAL;
+
+	return rval;
+}
+
+static u8 bnx2x_dcbnl_set_numtcs(struct net_device *netdev, int tcid, u8 num)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "num tcs = %d; Not supported\n", num);
+	return -EINVAL;
+}
+
+static u8  bnx2x_dcbnl_get_pfc_state(struct net_device *netdev)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "state = %d\n", bp->dcbx_local_feat.pfc.enabled);
+
+	if (!bp->dcb_state)
+		return 0;
+
+	return bp->dcbx_local_feat.pfc.enabled;
+}
+
+static void bnx2x_dcbnl_set_pfc_state(struct net_device *netdev, u8 state)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "state = %s\n", state ? "on" : "off");
+
+	if (!bnx2x_dcbnl_set_valid(bp))
+		return;
+
+	bp->dcbx_config_params.admin_pfc_tx_enable =
+	bp->dcbx_config_params.admin_pfc_enable = (state ? 1 : 0);
+}
+
+static bool bnx2x_app_is_equal(struct dcbx_app_priority_entry *app_ent,
+			       u8 idtype, u16 idval)
+{
+	if (!(app_ent->appBitfield & DCBX_APP_ENTRY_VALID))
+		return false;
+
+	switch (idtype) {
+	case DCB_APP_IDTYPE_ETHTYPE:
+		if ((app_ent->appBitfield & DCBX_APP_ENTRY_SF_MASK) !=
+			DCBX_APP_SF_ETH_TYPE)
+			return false;
+		break;
+	case DCB_APP_IDTYPE_PORTNUM:
+		if ((app_ent->appBitfield & DCBX_APP_ENTRY_SF_MASK) !=
+			DCBX_APP_SF_PORT)
+			return false;
+		break;
+	default:
+		return false;
+	}
+	if (app_ent->app_id != idval)
+		return false;
+
+	return true;
+}
+
+static void bnx2x_admin_app_set_ent(
+	struct bnx2x_admin_priority_app_table *app_ent,
+	u8 idtype, u16 idval, u8 up)
+{
+	app_ent->valid = 1;
+
+	switch (idtype) {
+	case DCB_APP_IDTYPE_ETHTYPE:
+		app_ent->traffic_type = TRAFFIC_TYPE_ETH;
+		break;
+	case DCB_APP_IDTYPE_PORTNUM:
+		app_ent->traffic_type = TRAFFIC_TYPE_PORT;
+		break;
+	default:
+		break; /* never gets here */
+	}
+	app_ent->app_id = idval;
+	app_ent->priority = up;
+}
+
+static bool bnx2x_admin_app_is_equal(
+	struct bnx2x_admin_priority_app_table *app_ent,
+	u8 idtype, u16 idval)
+{
+	if (!app_ent->valid)
+		return false;
+
+	switch (idtype) {
+	case DCB_APP_IDTYPE_ETHTYPE:
+		if (app_ent->traffic_type != TRAFFIC_TYPE_ETH)
+			return false;
+		break;
+	case DCB_APP_IDTYPE_PORTNUM:
+		if (app_ent->traffic_type != TRAFFIC_TYPE_PORT)
+			return false;
+		break;
+	default:
+		return false;
+	}
+	if (app_ent->app_id != idval)
+		return false;
+
+	return true;
+}
+
+static int bnx2x_set_admin_app_up(struct bnx2x *bp, u8 idtype, u16 idval, u8 up)
+{
+	int i, ff;
+
+	/* iterate over the app entries looking for idtype and idval */
+	for (i = 0, ff = -1; i < 4; i++) {
+		struct bnx2x_admin_priority_app_table *app_ent =
+			&bp->dcbx_config_params.admin_priority_app_table[i];
+		if (bnx2x_admin_app_is_equal(app_ent, idtype, idval))
+			break;
+
+		if (ff < 0 && !app_ent->valid)
+			ff = i;
+	}
+	if (i < 4)
+		/* if found overwrite up */
+		bp->dcbx_config_params.
+			admin_priority_app_table[i].priority = up;
+	else if (ff >= 0)
+		/* not found use first-free */
+		bnx2x_admin_app_set_ent(
+			&bp->dcbx_config_params.admin_priority_app_table[ff],
+			idtype, idval, up);
+	else
+		/* app table is full */
+		return -EBUSY;
+
+	/* up configured, if not 0 make sure feature is enabled */
+	if (up)
+		bp->dcbx_config_params.admin_application_priority_tx_enable = 1;
+
+	return 0;
+}
+
+static u8 bnx2x_dcbnl_set_app_up(struct net_device *netdev, u8 idtype,
+				 u16 idval, u8 up)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+
+	DP(NETIF_MSG_LINK, "app_type %d, app_id %x, prio bitmap %d\n",
+	   idtype, idval, up);
+
+	if (!bnx2x_dcbnl_set_valid(bp))
+		return -EINVAL;
+
+	/* verify idtype */
+	switch (idtype) {
+	case DCB_APP_IDTYPE_ETHTYPE:
+	case DCB_APP_IDTYPE_PORTNUM:
+		break;
+	default:
+		return -EINVAL;
+	}
+	return bnx2x_set_admin_app_up(bp, idtype, idval, up);
+}
+
+static u8 bnx2x_dcbnl_get_app_up(struct net_device *netdev, u8 idtype,
+				 u16 idval)
+{
+	int i;
+	u8 up = 0;
+
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "app_type %d, app_id 0x%x\n", idtype, idval);
+
+	/* iterate over the app entries looking for idtype and idval */
+	for (i = 0; i < DCBX_MAX_APP_PROTOCOL; i++)
+		if (bnx2x_app_is_equal(&bp->dcbx_local_feat.app.app_pri_tbl[i],
+				       idtype, idval))
+			break;
+
+	if (i < DCBX_MAX_APP_PROTOCOL)
+		/* if found return up */
+		up = bp->dcbx_local_feat.app.app_pri_tbl[i].pri_bitmap;
+	else
+		DP(NETIF_MSG_LINK, "app not found\n");
+
+	return up;
+}
+
+static u8 bnx2x_dcbnl_get_dcbx(struct net_device *netdev)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	u8 state;
+
+	state = DCB_CAP_DCBX_HW | DCB_CAP_DCBX_VER_CEE;
+
+	if (bp->dcbx_enabled == BNX2X_DCBX_ENABLED_ON_NEG_OFF)
+		state |= DCB_CAP_DCBX_STATIC;
+
+	return state;
+}
+
+static u8 bnx2x_dcbnl_set_dcbx(struct net_device *netdev, u8 state)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	DP(NETIF_MSG_LINK, "state = %02x\n", state);
+
+	/* set dcbx mode */
+
+	if ((state & BNX2X_DCBX_CAPS) != state) {
+		BNX2X_ERR("Requested DCBX mode %x is beyond advertised "
+			  "capabilities\n", state);
+		return 1;
+	}
+
+	if (bp->dcb_state != BNX2X_DCB_STATE_ON) {
+		BNX2X_ERR("DCB turned off, DCBX configuration is invalid\n");
+		return 1;
+	}
+
+	if (state & DCB_CAP_DCBX_STATIC)
+		bp->dcbx_enabled = BNX2X_DCBX_ENABLED_ON_NEG_OFF;
+	else
+		bp->dcbx_enabled = BNX2X_DCBX_ENABLED_ON_NEG_ON;
+
+	bp->dcbx_mode_uset = true;
+	return 0;
+}
+
+
+static u8 bnx2x_dcbnl_get_featcfg(struct net_device *netdev, int featid,
+				  u8 *flags)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	u8 rval = 0;
+
+	DP(NETIF_MSG_LINK, "featid %d\n", featid);
+
+	if (bp->dcb_state) {
+		*flags = 0;
+		switch (featid) {
+		case DCB_FEATCFG_ATTR_PG:
+			if (bp->dcbx_local_feat.ets.enabled)
+				*flags |= DCB_FEATCFG_ENABLE;
+			if (bp->dcbx_error & DCBX_LOCAL_ETS_ERROR)
+				*flags |= DCB_FEATCFG_ERROR;
+			break;
+		case DCB_FEATCFG_ATTR_PFC:
+			if (bp->dcbx_local_feat.pfc.enabled)
+				*flags |= DCB_FEATCFG_ENABLE;
+			if (bp->dcbx_error & (DCBX_LOCAL_PFC_ERROR |
+			    DCBX_LOCAL_PFC_MISMATCH))
+				*flags |= DCB_FEATCFG_ERROR;
+			break;
+		case DCB_FEATCFG_ATTR_APP:
+			if (bp->dcbx_local_feat.app.enabled)
+				*flags |= DCB_FEATCFG_ENABLE;
+			if (bp->dcbx_error & (DCBX_LOCAL_APP_ERROR |
+			    DCBX_LOCAL_APP_MISMATCH))
+				*flags |= DCB_FEATCFG_ERROR;
+			break;
+		default:
+			rval = -EINVAL;
+			break;
+		}
+	} else
+		rval = -EINVAL;
+
+	return rval;
+}
+
+static u8 bnx2x_dcbnl_set_featcfg(struct net_device *netdev, int featid,
+				  u8 flags)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	u8 rval = 0;
+
+	DP(NETIF_MSG_LINK, "featid = %d flags = %02x\n", featid, flags);
+
+	/* ignore the 'advertise' flag */
+	if (bnx2x_dcbnl_set_valid(bp)) {
+		switch (featid) {
+		case DCB_FEATCFG_ATTR_PG:
+			bp->dcbx_config_params.admin_ets_enable =
+				flags & DCB_FEATCFG_ENABLE ? 1 : 0;
+			bp->dcbx_config_params.admin_ets_willing =
+				flags & DCB_FEATCFG_WILLING ? 1 : 0;
+			break;
+		case DCB_FEATCFG_ATTR_PFC:
+			bp->dcbx_config_params.admin_pfc_enable =
+				flags & DCB_FEATCFG_ENABLE ? 1 : 0;
+			bp->dcbx_config_params.admin_pfc_willing =
+				flags & DCB_FEATCFG_WILLING ? 1 : 0;
+			break;
+		case DCB_FEATCFG_ATTR_APP:
+			/* ignore enable, always enabled */
+			bp->dcbx_config_params.admin_app_priority_willing =
+				flags & DCB_FEATCFG_WILLING ? 1 : 0;
+			break;
+		default:
+			rval = -EINVAL;
+			break;
+		}
+	} else
+		rval = -EINVAL;
+
+	return rval;
+}
+
+const struct dcbnl_rtnl_ops bnx2x_dcbnl_ops = {
+	.getstate       = bnx2x_dcbnl_get_state,
+	.setstate       = bnx2x_dcbnl_set_state,
+	.getpermhwaddr  = bnx2x_dcbnl_get_perm_hw_addr,
+	.setpgtccfgtx   = bnx2x_dcbnl_set_pg_tccfg_tx,
+	.setpgbwgcfgtx  = bnx2x_dcbnl_set_pg_bwgcfg_tx,
+	.setpgtccfgrx   = bnx2x_dcbnl_set_pg_tccfg_rx,
+	.setpgbwgcfgrx  = bnx2x_dcbnl_set_pg_bwgcfg_rx,
+	.getpgtccfgtx   = bnx2x_dcbnl_get_pg_tccfg_tx,
+	.getpgbwgcfgtx  = bnx2x_dcbnl_get_pg_bwgcfg_tx,
+	.getpgtccfgrx   = bnx2x_dcbnl_get_pg_tccfg_rx,
+	.getpgbwgcfgrx  = bnx2x_dcbnl_get_pg_bwgcfg_rx,
+	.setpfccfg      = bnx2x_dcbnl_set_pfc_cfg,
+	.getpfccfg      = bnx2x_dcbnl_get_pfc_cfg,
+	.setall         = bnx2x_dcbnl_set_all,
+	.getcap         = bnx2x_dcbnl_get_cap,
+	.getnumtcs      = bnx2x_dcbnl_get_numtcs,
+	.setnumtcs      = bnx2x_dcbnl_set_numtcs,
+	.getpfcstate    = bnx2x_dcbnl_get_pfc_state,
+	.setpfcstate    = bnx2x_dcbnl_set_pfc_state,
+	.getapp         = bnx2x_dcbnl_get_app_up,
+	.setapp         = bnx2x_dcbnl_set_app_up,
+	.getdcbx        = bnx2x_dcbnl_get_dcbx,
+	.setdcbx        = bnx2x_dcbnl_set_dcbx,
+	.getfeatcfg     = bnx2x_dcbnl_get_featcfg,
+	.setfeatcfg     = bnx2x_dcbnl_set_featcfg,
+};
+
+#endif /* BCM_DCB */
diff --git a/drivers/net/bnx2x/bnx2x_dcb.h b/drivers/net/bnx2x/bnx2x_dcb.h
index 8dea56b..f650f98 100644
--- a/drivers/net/bnx2x/bnx2x_dcb.h
+++ b/drivers/net/bnx2x/bnx2x_dcb.h
@@ -51,7 +51,6 @@ struct bnx2x_dcbx_pfc_params {
 };
 
 struct bnx2x_dcbx_port_params {
-	u32 dcbx_enabled;
 	struct bnx2x_dcbx_pfc_params pfc;
 	struct bnx2x_dcbx_pg_params  ets;
 	struct bnx2x_dcbx_app_params app;
@@ -88,8 +87,6 @@ struct bnx2x_admin_priority_app_table {
  * DCBX protocol configuration parameters.
  ******************************************************************************/
 struct bnx2x_config_dcbx_params {
-	u32 dcb_enable;
-	u32 admin_dcbx_enable;
 	u32 overwrite_settings;
 	u32 admin_dcbx_version;
 	u32 admin_ets_enable;
@@ -182,6 +179,7 @@ struct bnx2x;
 void bnx2x_dcb_init_intmem_pfc(struct bnx2x *bp);
 void bnx2x_dcbx_update(struct work_struct *work);
 void bnx2x_dcbx_init_params(struct bnx2x *bp);
+void bnx2x_dcbx_set_state(struct bnx2x *bp, bool dcb_on, u32 dcbx_enabled);
 
 enum {
 	BNX2X_DCBX_STATE_NEG_RECEIVED = 0x1,
@@ -190,4 +188,9 @@ enum {
 };
 void bnx2x_dcbx_set_params(struct bnx2x *bp, u32 state);
 
+/* DCB netlink */
+#ifdef BCM_DCB
+extern const struct dcbnl_rtnl_ops bnx2x_dcbnl_ops;
+#endif /* BCM_DCB */
+
 #endif /* BNX2X_DCB_H */
diff --git a/drivers/net/bnx2x/bnx2x_main.c b/drivers/net/bnx2x/bnx2x_main.c
index bdc3fc2..c49f26c 100644
--- a/drivers/net/bnx2x/bnx2x_main.c
+++ b/drivers/net/bnx2x/bnx2x_main.c
@@ -3107,7 +3107,8 @@ static inline void bnx2x_attn_int_deasserted3(struct bnx2x *bp, u32 attn)
 				bnx2x_pmf_update(bp);
 
 			if (bp->port.pmf &&
-			    (val & DRV_STATUS_DCBX_NEGOTIATION_RESULTS))
+			    (val & DRV_STATUS_DCBX_NEGOTIATION_RESULTS) &&
+				bp->dcbx_enabled > 0)
 				/* start dcbx state machine */
 				bnx2x_dcbx_set_params(bp,
 					BNX2X_DCBX_STATE_NEG_RECEIVED);
@@ -8793,6 +8794,7 @@ static int __devinit bnx2x_init_bp(struct bnx2x *bp)
 	bp->timer.data = (unsigned long) bp;
 	bp->timer.function = bnx2x_timer;
 
+	bnx2x_dcbx_set_state(bp, true, BNX2X_DCBX_ENABLED_ON_NEG_ON);
 	bnx2x_dcbx_init_params(bp);
 
 	return rc;
@@ -9144,6 +9146,10 @@ static int __devinit bnx2x_init_dev(struct pci_dev *pdev,
 	dev->vlan_features |= (NETIF_F_TSO | NETIF_F_TSO_ECN);
 	dev->vlan_features |= NETIF_F_TSO6;
 
+#ifdef BCM_DCB
+	dev->dcbnl_ops = &bnx2x_dcbnl_ops;
+#endif
+
 	/* get_port_hwinfo() will set prtad and mmds properly */
 	bp->mdio.prtad = MDIO_PRTAD_NONE;
 	bp->mdio.mmds = 0;
-- 
1.7.1






* [PATCH net-next 1/3] dcbnl: adding DCBX engine capability
From: Shmulik Ravid @ 2010-12-21 19:32 UTC (permalink / raw)
  To: davem; +Cc: eilong, lucy.liu, netdev

Adding an optional DCBX capability and a pair of get-set routines for
setting the device DCBX mode. The DCBX capability is a bit field of
supported attributes. The user is expected to set the DCBX mode with a
subset of the advertised attributes.
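
To illustrate the subset rule, here is a minimal sketch (not part of the
patch). The helper name dcbx_mode_is_subset() is hypothetical; the
DCB_CAP_DCBX_* values are copied from the include/linux/dcbnl.h hunk in
this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the DCBX capability definitions in this patch. */
#define DCB_CAP_DCBX_HOST	0x01
#define DCB_CAP_DCBX_HW		0x02
#define DCB_CAP_DCBX_VER_CEE	0x04
#define DCB_CAP_DCBX_VER_IEEE	0x08
#define DCB_CAP_DCBX_STATIC	0x10

/*
 * Hypothetical helper: returns true when every bit set in the requested
 * mode is also set in the advertised capability bit field.
 */
static bool dcbx_mode_is_subset(uint8_t advertised, uint8_t requested)
{
	return (requested & advertised) == requested;
}
```

A driver's setdcbx routine could apply a test of this shape before
accepting a mode; the bnx2x patch in this series checks the requested
state against BNX2X_DCBX_CAPS in the same way.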


Signed-off-by: Shmulik Ravid <shmulikr@broadcom.com>
---
 include/linux/dcbnl.h |   17 +++++++++++++++++
 include/net/dcbnl.h   |    2 ++
 net/dcb/dcbnl.c       |   42 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 0 deletions(-)

diff --git a/include/linux/dcbnl.h b/include/linux/dcbnl.h
index 8723491..974fd1e 100644
--- a/include/linux/dcbnl.h
+++ b/include/linux/dcbnl.h
@@ -50,6 +50,8 @@ struct dcbmsg {
  * @DCB_CMD_SBCN: get backward congestion notification configration.
  * @DCB_CMD_GAPP: get application protocol configuration
  * @DCB_CMD_SAPP: set application protocol configuration
+ * @DCB_CMD_GDCBX: get DCBX engine configuration
+ * @DCB_CMD_SDCBX: set DCBX engine configuration
  */
 enum dcbnl_commands {
 	DCB_CMD_UNDEFINED,
@@ -83,6 +85,9 @@ enum dcbnl_commands {
 	DCB_CMD_GAPP,
 	DCB_CMD_SAPP,
 
+	DCB_CMD_GDCBX,
+	DCB_CMD_SDCBX,
+
 	__DCB_CMD_ENUM_MAX,
 	DCB_CMD_MAX = __DCB_CMD_ENUM_MAX - 1,
 };
@@ -102,6 +107,7 @@ enum dcbnl_commands {
  * @DCB_ATTR_CAP: DCB capabilities of the device (NLA_NESTED)
  * @DCB_ATTR_NUMTCS: number of traffic classes supported (NLA_NESTED)
  * @DCB_ATTR_BCN: backward congestion notification configuration (NLA_NESTED)
+ * @DCB_ATTR_DCBX: DCBX engine configuration in the device (NLA_U8)
  */
 enum dcbnl_attrs {
 	DCB_ATTR_UNDEFINED,
@@ -118,6 +124,7 @@ enum dcbnl_attrs {
 	DCB_ATTR_NUMTCS,
 	DCB_ATTR_BCN,
 	DCB_ATTR_APP,
+	DCB_ATTR_DCBX,
 
 	__DCB_ATTR_ENUM_MAX,
 	DCB_ATTR_MAX = __DCB_ATTR_ENUM_MAX - 1,
@@ -262,6 +269,8 @@ enum dcbnl_tc_attrs {
  * @DCB_CAP_ATTR_GSP: (NLA_U8) device supports group strict priority
  * @DCB_CAP_ATTR_BCN: (NLA_U8) device supports Backwards Congestion
  *                             Notification
+ * @DCB_CAP_ATTR_DCBX: (NLA_U8) device supports DCBX engine
+ *
  */
 enum dcbnl_cap_attrs {
 	DCB_CAP_ATTR_UNDEFINED,
@@ -273,11 +282,19 @@ enum dcbnl_cap_attrs {
 	DCB_CAP_ATTR_PFC_TCS,
 	DCB_CAP_ATTR_GSP,
 	DCB_CAP_ATTR_BCN,
+	DCB_CAP_ATTR_DCBX,
 
 	__DCB_CAP_ATTR_ENUM_MAX,
 	DCB_CAP_ATTR_MAX = __DCB_CAP_ATTR_ENUM_MAX - 1,
 };
 
+/* DCBX capabilities */
+#define DCB_CAP_DCBX_HOST	0x01 /* host based DCBX engine support */
+#define DCB_CAP_DCBX_HW		0x02 /* HW DCBX engine support */
+#define DCB_CAP_DCBX_VER_CEE	0x04 /* HW DCBX supports CEE protocol */
+#define DCB_CAP_DCBX_VER_IEEE	0x08 /* HW DCBX supports IEEE protocol */
+#define DCB_CAP_DCBX_STATIC	0x10 /* HW DCBX supports static config */
+
 /**
  * enum dcbnl_numtcs_attrs - number of traffic classes
  *
diff --git a/include/net/dcbnl.h b/include/net/dcbnl.h
index b36ac7e..f03079f 100644
--- a/include/net/dcbnl.h
+++ b/include/net/dcbnl.h
@@ -50,6 +50,8 @@ struct dcbnl_rtnl_ops {
 	void (*setbcnrp)(struct net_device *, int, u8);
 	u8   (*setapp)(struct net_device *, u8, u16, u8);
 	u8   (*getapp)(struct net_device *, u8, u16);
+	u8   (*getdcbx)(struct net_device *);
+	u8   (*setdcbx)(struct net_device *, u8);
 };
 
 #endif /* __NET_DCBNL_H__ */
diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index 19ac2b9..44e4237 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -66,6 +66,7 @@ static const struct nla_policy dcbnl_rtnl_policy[DCB_ATTR_MAX + 1] = {
 	[DCB_ATTR_PFC_STATE]   = {.type = NLA_U8},
 	[DCB_ATTR_BCN]         = {.type = NLA_NESTED},
 	[DCB_ATTR_APP]         = {.type = NLA_NESTED},
+	[DCB_ATTR_DCBX]        = {.type = NLA_U8},
 };
 
 /* DCB priority flow control to User Priority nested attributes */
@@ -122,6 +123,7 @@ static const struct nla_policy dcbnl_cap_nest[DCB_CAP_ATTR_MAX + 1] = {
 	[DCB_CAP_ATTR_PFC_TCS] = {.type = NLA_U8},
 	[DCB_CAP_ATTR_GSP]     = {.type = NLA_U8},
 	[DCB_CAP_ATTR_BCN]     = {.type = NLA_U8},
+	[DCB_CAP_ATTR_DCBX]    = {.type = NLA_U8},
 };
 
 /* DCB capabilities nested attributes. */
@@ -1118,6 +1120,38 @@ err:
 	return ret;
 }
 
+static int dcbnl_getdcbx(struct net_device *netdev, struct nlattr **tb,
+			 u32 pid, u32 seq, u16 flags)
+{
+	int ret = -EINVAL;
+
+	if (!netdev->dcbnl_ops->getdcbx)
+		return ret;
+
+	ret = dcbnl_reply(netdev->dcbnl_ops->getdcbx(netdev), RTM_GETDCB,
+			  DCB_CMD_GDCBX, DCB_ATTR_DCBX, pid, seq, flags);
+
+	return ret;
+}
+
+static int dcbnl_setdcbx(struct net_device *netdev, struct nlattr **tb,
+			 u32 pid, u32 seq, u16 flags)
+{
+	int ret = -EINVAL;
+	u8 value;
+
+	if (!tb[DCB_ATTR_DCBX] || !netdev->dcbnl_ops->setdcbx)
+		return ret;
+
+	value = nla_get_u8(tb[DCB_ATTR_DCBX]);
+
+	ret = dcbnl_reply(netdev->dcbnl_ops->setdcbx(netdev, value),
+			  RTM_SETDCB, DCB_CMD_SDCBX, DCB_ATTR_DCBX,
+			  pid, seq, flags);
+
+	return ret;
+}
+
 static int dcb_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 {
 	struct net *net = sock_net(skb->sk);
@@ -1223,6 +1257,14 @@ static int dcb_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 		ret = dcbnl_setapp(netdev, tb, pid, nlh->nlmsg_seq,
 		                   nlh->nlmsg_flags);
 		goto out;
+	case DCB_CMD_GDCBX:
+		ret = dcbnl_getdcbx(netdev, tb, pid, nlh->nlmsg_seq,
+				    nlh->nlmsg_flags);
+		goto out;
+	case DCB_CMD_SDCBX:
+		ret = dcbnl_setdcbx(netdev, tb, pid, nlh->nlmsg_seq,
+				    nlh->nlmsg_flags);
+		goto out;
 	default:
 		goto errout;
 	}
-- 
1.7.1






* [PATCH net-next 2/3] dcbnl: adding DCBX feature flags get-set
From: Shmulik Ravid @ 2010-12-21 19:32 UTC (permalink / raw)
  To: davem; +Cc: eilong, lucy.liu, netdev

Adding a pair of set-get functions to dcbnl for setting the negotiation
flags of the various DCB features. The user sets these flags (enable,
advertise, willing) for each feature to be used by the device DCBX
engine. The 'get' routine returns which of the features is enabled
after the negotiation.
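
As a rough illustration of how a flags byte might be interpreted (a
sketch only, not part of the patch): struct featcfg_view and
featcfg_decode() are hypothetical names, and the DCB_FEATCFG_* values
are copied from the include/linux/dcbnl.h hunk in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the feature flag definitions in this patch. */
#define DCB_FEATCFG_ENABLE	0x01	/* enable feature */
#define DCB_FEATCFG_ADVERTISE	0x02	/* advertise feature */
#define DCB_FEATCFG_WILLING	0x04	/* feature is willing */
#define DCB_FEATCFG_ERROR	0x08	/* error in feature resolution */

/* Hypothetical unpacked view of one feature's flags byte. */
struct featcfg_view {
	bool enable;
	bool advertise;
	bool willing;
	bool error;
};

/* Hypothetical helper: split the u8 flags value into its four bits. */
static struct featcfg_view featcfg_decode(uint8_t flags)
{
	struct featcfg_view v;

	v.enable    = (flags & DCB_FEATCFG_ENABLE) != 0;
	v.advertise = (flags & DCB_FEATCFG_ADVERTISE) != 0;
	v.willing   = (flags & DCB_FEATCFG_WILLING) != 0;
	v.error     = (flags & DCB_FEATCFG_ERROR) != 0;
	return v;
}
```

One such byte is carried per feature attribute (PG, PFC, APP) in the
nested DCB_ATTR_FEATCFG attribute.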

Signed-off-by: Shmulik Ravid <shmulikr@broadcom.com>
---
 include/linux/dcbnl.h |   33 ++++++++++++
 include/net/dcbnl.h   |    2 +
 net/dcb/dcbnl.c       |  133 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 168 insertions(+), 0 deletions(-)

diff --git a/include/linux/dcbnl.h b/include/linux/dcbnl.h
index 974fd1e..681c1f2 100644
--- a/include/linux/dcbnl.h
+++ b/include/linux/dcbnl.h
@@ -52,6 +52,8 @@ struct dcbmsg {
  * @DCB_CMD_SAPP: set application protocol configuration
  * @DCB_CMD_GDCBX: get DCBX engine configuration
  * @DCB_CMD_SDCBX: set DCBX engine configuration
+ * @DCB_CMD_GFEATCFG: get DCBX features flags
+ * @DCB_CMD_SFEATCFG: set DCBX features negotiation flags
  */
 enum dcbnl_commands {
 	DCB_CMD_UNDEFINED,
@@ -88,6 +90,9 @@ enum dcbnl_commands {
 	DCB_CMD_GDCBX,
 	DCB_CMD_SDCBX,
 
+	DCB_CMD_GFEATCFG,
+	DCB_CMD_SFEATCFG,
+
 	__DCB_CMD_ENUM_MAX,
 	DCB_CMD_MAX = __DCB_CMD_ENUM_MAX - 1,
 };
@@ -108,6 +113,7 @@ enum dcbnl_commands {
  * @DCB_ATTR_NUMTCS: number of traffic classes supported (NLA_NESTED)
  * @DCB_ATTR_BCN: backward congestion notification configuration (NLA_NESTED)
  * @DCB_ATTR_DCBX: DCBX engine configuration in the device (NLA_U8)
+ * @DCB_ATTR_FEATCFG: DCBX features flags (NLA_NESTED)
  */
 enum dcbnl_attrs {
 	DCB_ATTR_UNDEFINED,
@@ -125,6 +131,7 @@ enum dcbnl_attrs {
 	DCB_ATTR_BCN,
 	DCB_ATTR_APP,
 	DCB_ATTR_DCBX,
+	DCB_ATTR_FEATCFG,
 
 	__DCB_ATTR_ENUM_MAX,
 	DCB_ATTR_MAX = __DCB_ATTR_ENUM_MAX - 1,
@@ -372,4 +379,30 @@ enum dcbnl_app_attrs {
 	DCB_APP_ATTR_MAX = __DCB_APP_ATTR_ENUM_MAX - 1,
 };
 
+/**
+ * enum dcbnl_featcfg_attrs - features configuration flags
+ *
+ * @DCB_FEATCFG_ATTR_UNDEFINED: unspecified attribute to catch errors
+ * @DCB_FEATCFG_ATTR_ALL: (NLA_FLAG) all features configuration attributes
+ * @DCB_FEATCFG_ATTR_PG: (NLA_U8) configuration flags for priority groups
+ * @DCB_FEATCFG_ATTR_PFC: (NLA_U8) configuration flags for priority
+ *                                 flow control
+ * @DCB_FEATCFG_ATTR_APP: (NLA_U8) configuration flags for application TLV
+ *
+ */
+#define DCB_FEATCFG_ENABLE	0x01	/* enable feature */
+#define DCB_FEATCFG_ADVERTISE	0x02	/* advertise feature */
+#define DCB_FEATCFG_WILLING	0x04	/* feature is willing */
+#define DCB_FEATCFG_ERROR	0x08	/* error in feature resolution */
+enum dcbnl_featcfg_attrs {
+	DCB_FEATCFG_ATTR_UNDEFINED,
+	DCB_FEATCFG_ATTR_ALL,
+	DCB_FEATCFG_ATTR_PG,
+	DCB_FEATCFG_ATTR_PFC,
+	DCB_FEATCFG_ATTR_APP,
+
+	__DCB_FEATCFG_ATTR_ENUM_MAX,
+	DCB_FEATCFG_ATTR_MAX = __DCB_FEATCFG_ATTR_ENUM_MAX - 1,
+};
+
 #endif /* __LINUX_DCBNL_H__ */
diff --git a/include/net/dcbnl.h b/include/net/dcbnl.h
index f03079f..92735af 100644
--- a/include/net/dcbnl.h
+++ b/include/net/dcbnl.h
@@ -52,6 +52,8 @@ struct dcbnl_rtnl_ops {
 	u8   (*getapp)(struct net_device *, u8, u16);
 	u8   (*getdcbx)(struct net_device *);
 	u8   (*setdcbx)(struct net_device *, u8);
+	u8   (*getfeatcfg)(struct net_device *, int, u8 *);
+	u8   (*setfeatcfg)(struct net_device *, int, u8);
 };
 
 #endif /* __NET_DCBNL_H__ */
diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index 44e4237..d066c62 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -67,6 +67,7 @@ static const struct nla_policy dcbnl_rtnl_policy[DCB_ATTR_MAX + 1] = {
 	[DCB_ATTR_BCN]         = {.type = NLA_NESTED},
 	[DCB_ATTR_APP]         = {.type = NLA_NESTED},
 	[DCB_ATTR_DCBX]        = {.type = NLA_U8},
+	[DCB_ATTR_FEATCFG]     = {.type = NLA_NESTED},
 };
 
 /* DCB priority flow control to User Priority nested attributes */
@@ -169,6 +170,14 @@ static const struct nla_policy dcbnl_app_nest[DCB_APP_ATTR_MAX + 1] = {
 	[DCB_APP_ATTR_PRIORITY]     = {.type = NLA_U8},
 };
 
+/* DCB feature configuration nested attributes. */
+static const struct nla_policy dcbnl_featcfg_nest[DCB_FEATCFG_ATTR_MAX + 1] = {
+	[DCB_FEATCFG_ATTR_ALL]      = {.type = NLA_FLAG},
+	[DCB_FEATCFG_ATTR_PG]       = {.type = NLA_U8},
+	[DCB_FEATCFG_ATTR_PFC]      = {.type = NLA_U8},
+	[DCB_FEATCFG_ATTR_APP]      = {.type = NLA_U8},
+};
+
 /* standard netlink reply call */
 static int dcbnl_reply(u8 value, u8 event, u8 cmd, u8 attr, u32 pid,
                        u32 seq, u16 flags)
@@ -1152,6 +1161,122 @@ static int dcbnl_setdcbx(struct net_device *netdev, struct nlattr **tb,
 	return ret;
 }
 
+static int dcbnl_getfeatcfg(struct net_device *netdev, struct nlattr **tb,
+			    u32 pid, u32 seq, u16 flags)
+{
+	struct sk_buff *dcbnl_skb;
+	struct nlmsghdr *nlh;
+	struct dcbmsg *dcb;
+	struct nlattr *data[DCB_FEATCFG_ATTR_MAX + 1], *nest;
+	u8 value;
+	int ret = -EINVAL;
+	int i;
+	int getall = 0;
+
+	if (!tb[DCB_ATTR_FEATCFG] || !netdev->dcbnl_ops->getfeatcfg)
+		return ret;
+
+	ret = nla_parse_nested(data, DCB_FEATCFG_ATTR_MAX, tb[DCB_ATTR_FEATCFG],
+			       dcbnl_featcfg_nest);
+	if (ret) {
+		ret = -EINVAL;
+		goto err_out;
+	}
+
+	dcbnl_skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!dcbnl_skb) {
+		ret = -EINVAL;
+		goto err_out;
+	}
+
+	nlh = NLMSG_NEW(dcbnl_skb, pid, seq, RTM_GETDCB, sizeof(*dcb), flags);
+
+	dcb = NLMSG_DATA(nlh);
+	dcb->dcb_family = AF_UNSPEC;
+	dcb->cmd = DCB_CMD_GFEATCFG;
+
+	nest = nla_nest_start(dcbnl_skb, DCB_ATTR_FEATCFG);
+	if (!nest) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	if (data[DCB_FEATCFG_ATTR_ALL])
+		getall = 1;
+
+	for (i = DCB_FEATCFG_ATTR_ALL+1; i <= DCB_FEATCFG_ATTR_MAX; i++) {
+		if (!getall && !data[i])
+			continue;
+
+		ret = netdev->dcbnl_ops->getfeatcfg(netdev, i, &value);
+		if (!ret) {
+			ret = nla_put_u8(dcbnl_skb, i, value);
+
+			if (ret) {
+				nla_nest_cancel(dcbnl_skb, nest);
+				ret = -EINVAL;
+				goto err;
+			}
+		} else
+			goto err;
+	}
+	nla_nest_end(dcbnl_skb, nest);
+
+	nlmsg_end(dcbnl_skb, nlh);
+
+	ret = rtnl_unicast(dcbnl_skb, &init_net, pid);
+	if (ret) {
+		ret = -EINVAL;
+		goto err_out;
+	}
+
+	return 0;
+nlmsg_failure:
+err:
+	kfree_skb(dcbnl_skb);
+err_out:
+	return ret;
+}
+
+static int dcbnl_setfeatcfg(struct net_device *netdev, struct nlattr **tb,
+			    u32 pid, u32 seq, u16 flags)
+{
+	struct nlattr *data[DCB_FEATCFG_ATTR_MAX + 1];
+	int ret = -EINVAL;
+	u8 value;
+	int i;
+
+	if (!tb[DCB_ATTR_FEATCFG] || !netdev->dcbnl_ops->setfeatcfg)
+		return ret;
+
+	ret = nla_parse_nested(data, DCB_FEATCFG_ATTR_MAX, tb[DCB_ATTR_FEATCFG],
+			       dcbnl_featcfg_nest);
+
+	if (ret) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	for (i = DCB_FEATCFG_ATTR_ALL+1; i <= DCB_FEATCFG_ATTR_MAX; i++) {
+		if (data[i] == NULL)
+			continue;
+
+		value = nla_get_u8(data[i]);
+
+		ret = netdev->dcbnl_ops->setfeatcfg(netdev, i, value);
+
+		if (ret)
+			goto operr;
+	}
+
+operr:
+	ret = dcbnl_reply(!!ret, RTM_SETDCB, DCB_CMD_SFEATCFG,
+			  DCB_ATTR_FEATCFG, pid, seq, flags);
+
+err:
+	return ret;
+}
+
 static int dcb_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 {
 	struct net *net = sock_net(skb->sk);
@@ -1265,6 +1390,14 @@ static int dcb_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 		ret = dcbnl_setdcbx(netdev, tb, pid, nlh->nlmsg_seq,
 				    nlh->nlmsg_flags);
 		goto out;
+	case DCB_CMD_GFEATCFG:
+		ret = dcbnl_getfeatcfg(netdev, tb, pid, nlh->nlmsg_seq,
+				       nlh->nlmsg_flags);
+		goto out;
+	case DCB_CMD_SFEATCFG:
+		ret = dcbnl_setfeatcfg(netdev, tb, pid, nlh->nlmsg_seq,
+				       nlh->nlmsg_flags);
+		goto out;
 	default:
 		goto errout;
 	}
-- 
1.7.1






* [PATCH net-next 0/3] dcbnl: Extending dcbnl to support HW based DCBX
From: Shmulik Ravid @ 2010-12-21 19:32 UTC (permalink / raw)
  To: davem; +Cc: eilong, lucy.liu, netdev

DCBX is the exchange protocol for negotiating DCB parameters between a
host and a switch. Many converged network adapters support an embedded
DCBX engine that performs the negotiation and configures the device
with the negotiated parameters. The following patches extend the dcbnl
netlink interface so that, in addition to its current semantics, it
offers a standard mechanism for managing such embedded DCBX engines. In
this new mode 'set' operations are used to set the initial negotiation
configuration and 'get' operations are used to retrieve the
negotiated results.

The current definition of dcbnl allows a user to configure the device
with the negotiated parameters only after the negotiation itself is
performed by an external entity (such as lldpad). As DCBX runs on top
of LLDP, there can be only one entity responsible for the negotiation
for each physical port. Thus the current scheme does not allow the
coexistence of CNAs that have embedded DCBX engines and CNAs that do
not.

The last patch adds an implementation of the dcbnl operations in their
proposed new semantics to the bnx2x driver, allowing users to configure
and manage the embedded DCBX engine.


* [net-next-2.6 PATCH v2 3/3] net_sched: implement a root container qdisc sch_mclass
From: John Fastabend @ 2010-12-21 19:29 UTC (permalink / raw)
  To: davem
  Cc: john.r.fastabend, netdev, hadi, shemminger, tgraf, eric.dumazet,
	bhutchings, nhorman
In-Reply-To: <20101221192831.9703.56356.stgit@jf-dev1-dcblab>

This implements an mclass ('multi-class') queueing discipline that by
default creates multiple mq qdiscs, one for each traffic class. Each
mq qdisc then owns a range of queues per the netdev_tc_txq mappings.

Using the mclass qdisc, the number of tcs currently in use along
with the range of queues allotted to each class can be configured. By
default skbs are mapped to traffic classes using the skb priority.
This mapping is configurable.

Configurable parameters:

struct tc_mclass_qopt {
        __u8    num_tc;
        __u8    prio_tc_map[16];
        __u8    hw;
        __u16   count[16];
        __u16   offset[16];
};

Here the count/offset pairing gives the queue alignment and the
prio_tc_map gives the mapping from skb->priority to tc. The
hw bit determines if the hardware should configure the count
and offset values. If the hw bit is set then the operation
will fail if the hardware does not implement the ndo_setup_tc
operation. This is to avoid undetermined states where the hardware
may or may not control the queue mapping. Also, minimal bounds
checking is done on the count/offset to verify a queue does not
exceed num_tx_queues and that queue ranges do not overlap. Otherwise
it is left to user policy or hardware configuration to create
useful mappings.
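
The bounds and overlap rules above can be sketched in userspace C as
follows. This is an illustration under stated assumptions, not the code
in the patch: struct mclass_qopt_sketch mirrors the tc_mclass_qopt
layout quoted above, mclass_ranges_ok() and mclass_prio_to_tc() are
hypothetical names, the overlap test is a symmetric interval check
(the patch's mclass_parse_opt() only compares the end of range i with
the offsets of later classes), and masking the priority to the
16-entry table is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCLASS_MAX_TC 16

/* Userspace mirror of the tc_mclass_qopt layout (illustration only). */
struct mclass_qopt_sketch {
	uint8_t  num_tc;
	uint8_t  prio_tc_map[MCLASS_MAX_TC];
	uint8_t  hw;
	uint16_t count[MCLASS_MAX_TC];
	uint16_t offset[MCLASS_MAX_TC];
};

/*
 * Hypothetical check: every range [offset, offset + count) must fit in
 * num_tx_queues, and no two ranges may share a queue.
 */
static bool mclass_ranges_ok(const struct mclass_qopt_sketch *q,
			     unsigned int num_tx_queues)
{
	unsigned int i, j;

	if (q->num_tc > MCLASS_MAX_TC)
		return false;

	for (i = 0; i < q->num_tc; i++) {
		unsigned int start_i = q->offset[i];
		unsigned int end_i = start_i + q->count[i]; /* one past last */

		if (end_i > num_tx_queues)
			return false;

		for (j = i + 1; j < q->num_tc; j++) {
			unsigned int start_j = q->offset[j];
			unsigned int end_j = start_j + q->count[j];

			/* Symmetric interval overlap test. */
			if (start_i < end_j && start_j < end_i)
				return false;
		}
	}
	return true;
}

/* Hypothetical lookup: map a priority value to a traffic class index. */
static uint8_t mclass_prio_to_tc(const struct mclass_qopt_sketch *q,
				 uint32_t priority)
{
	/* Assumption: only the low 4 bits select a table entry. */
	return q->prio_tc_map[priority & (MCLASS_MAX_TC - 1)];
}
```

For example, two classes with count 4 at offsets 0 and 4 pass the check
on a device with 8 tx queues, while moving the second offset to 2 makes
the ranges overlap and the check fail.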

It is expected that hardware QOS schemes can be implemented by
creating appropriate mappings of queues in ndo_setup_tc(). This
scheme can be expanded as needed with additional qdiscs being grafted
onto the root qdisc to provide per-tc queuing disciplines, allowing
software and hardware queuing disciplines to be used together.

One expected use case is that drivers will use ndo_setup_tc to map
queue ranges onto 802.1Q traffic classes. This provides a generic
mechanism to map network traffic onto these traffic classes and
removes the need for lower layer drivers to know specifics about
traffic types.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/netdevice.h |    3 
 include/linux/pkt_sched.h |    9 +
 include/net/sch_generic.h |    2 
 net/sched/Makefile        |    2 
 net/sched/sch_api.c       |    1 
 net/sched/sch_generic.c   |   10 +
 net/sched/sch_mclass.c    |  376 +++++++++++++++++++++++++++++++++++++++++++++
 net/sched/sch_mq.c        |    3 
 8 files changed, 403 insertions(+), 3 deletions(-)
 create mode 100644 net/sched/sch_mclass.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 453b2d7..911185b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -764,6 +764,8 @@ struct netdev_tc_txq {
  * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
  *			  struct nlattr *port[]);
  * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
+ *
+ * int (*ndo_setup_tc)(struct net_device *dev, u8 tc);
  */
 #define HAVE_NET_DEVICE_OPS
 struct net_device_ops {
@@ -822,6 +824,7 @@ struct net_device_ops {
 						   struct nlattr *port[]);
 	int			(*ndo_get_vf_port)(struct net_device *dev,
 						   int vf, struct sk_buff *skb);
+	int			(*ndo_setup_tc)(struct net_device *dev, u8 tc);
 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
 	int			(*ndo_fcoe_enable)(struct net_device *dev);
 	int			(*ndo_fcoe_disable)(struct net_device *dev);
diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 2cfa4bc..0134ed4 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -481,4 +481,13 @@ struct tc_drr_stats {
 	__u32	deficit;
 };
 
+/* MCLASS */
+struct tc_mclass_qopt {
+	__u8	num_tc;
+	__u8	prio_tc_map[16];
+	__u8	hw;
+	__u16	count[16];
+	__u16	offset[16];
+};
+
 #endif
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 0af57eb..723ee52 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -50,6 +50,7 @@ struct Qdisc {
 #define TCQ_F_INGRESS		4
 #define TCQ_F_CAN_BYPASS	8
 #define TCQ_F_MQROOT		16
+#define TCQ_F_MQSAFE		32
 #define TCQ_F_WARN_NONWC	(1 << 16)
 	int			padded;
 	struct Qdisc_ops	*ops;
@@ -276,6 +277,7 @@ extern struct Qdisc noop_qdisc;
 extern struct Qdisc_ops noop_qdisc_ops;
 extern struct Qdisc_ops pfifo_fast_ops;
 extern struct Qdisc_ops mq_qdisc_ops;
+extern struct Qdisc_ops mclass_qdisc_ops;
 
 struct Qdisc_class_common {
 	u32			classid;
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 960f5db..76dcf5b 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -2,7 +2,7 @@
 # Makefile for the Linux Traffic Control Unit.
 #
 
-obj-y	:= sch_generic.o sch_mq.o
+obj-y	:= sch_generic.o sch_mq.o sch_mclass.o
 
 obj-$(CONFIG_NET_SCHED)		+= sch_api.o sch_blackhole.o
 obj-$(CONFIG_NET_CLS)		+= cls_api.o
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index b22ca2d..24f40e0 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1770,6 +1770,7 @@ static int __init pktsched_init(void)
 	register_qdisc(&bfifo_qdisc_ops);
 	register_qdisc(&pfifo_head_drop_qdisc_ops);
 	register_qdisc(&mq_qdisc_ops);
+	register_qdisc(&mclass_qdisc_ops);
 
 	rtnl_register(PF_UNSPEC, RTM_NEWQDISC, tc_modify_qdisc, NULL);
 	rtnl_register(PF_UNSPEC, RTM_DELQDISC, tc_get_qdisc, NULL);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 34dc598..1c86ea1 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -376,7 +376,7 @@ static struct netdev_queue noop_netdev_queue = {
 struct Qdisc noop_qdisc = {
 	.enqueue	=	noop_enqueue,
 	.dequeue	=	noop_dequeue,
-	.flags		=	TCQ_F_BUILTIN,
+	.flags		=	TCQ_F_BUILTIN | TCQ_F_MQSAFE,
 	.ops		=	&noop_qdisc_ops,
 	.list		=	LIST_HEAD_INIT(noop_qdisc.list),
 	.q.lock		=	__SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock),
@@ -709,7 +709,13 @@ static void attach_default_qdiscs(struct net_device *dev)
 		dev->qdisc = txq->qdisc_sleeping;
 		atomic_inc(&dev->qdisc->refcnt);
 	} else {
-		qdisc = qdisc_create_dflt(txq, &mq_qdisc_ops, TC_H_ROOT);
+		if (dev->num_tc)
+			qdisc = qdisc_create_dflt(txq, &mclass_qdisc_ops,
+						  TC_H_ROOT);
+		else
+			qdisc = qdisc_create_dflt(txq, &mq_qdisc_ops,
+						  TC_H_ROOT);
+
 		if (qdisc) {
 			qdisc->ops->attach(qdisc);
 			dev->qdisc = qdisc;
diff --git a/net/sched/sch_mclass.c b/net/sched/sch_mclass.c
new file mode 100644
index 0000000..444492a
--- /dev/null
+++ b/net/sched/sch_mclass.c
@@ -0,0 +1,376 @@
+/*
+ * net/sched/sch_mclass.c
+ *
+ * Copyright (c) 2010 John Fastabend <john.r.fastabend@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ */
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/sch_generic.h>
+
+struct mclass_sched {
+	struct Qdisc		**qdiscs;
+	int hw_owned;
+};
+
+static void mclass_destroy(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	unsigned int ntc;
+
+	if (!priv->qdiscs)
+		return;
+
+	for (ntc = 0; ntc < dev->num_tc && priv->qdiscs[ntc]; ntc++)
+		qdisc_destroy(priv->qdiscs[ntc]);
+
+	if (priv->hw_owned && dev->netdev_ops->ndo_setup_tc)
+		dev->netdev_ops->ndo_setup_tc(dev, 0);
+	else
+		netdev_set_num_tc(dev, 0);
+
+	kfree(priv->qdiscs);
+}
+
+static int mclass_parse_opt(struct net_device *dev, struct tc_mclass_qopt *qopt)
+{
+	int i, j;
+
+	/* Verify TC offset and count are sane */
+	for (i = 0; i < qopt->num_tc; i++) {
+		int last = qopt->offset[i] + qopt->count[i];
+		if (last > dev->num_tx_queues)
+			return -EINVAL;
+		for (j = i + 1; j < qopt->num_tc; j++) {
+			if (last > qopt->offset[j])
+				return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int mclass_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	struct netdev_queue *dev_queue;
+	struct Qdisc *qdisc;
+	int i, err = -EOPNOTSUPP;
+	struct tc_mclass_qopt *qopt = NULL;
+
+	/* Unwind attributes on failure */
+	u8 unwnd_tc = dev->num_tc;
+	u8 unwnd_map[16];
+	struct netdev_tc_txq unwnd_txq[16];
+
+	if (sch->parent != TC_H_ROOT)
+		return -EOPNOTSUPP;
+
+	if (!netif_is_multiqueue(dev))
+		return -EOPNOTSUPP;
+
+	if (nla_len(opt) < sizeof(*qopt))
+		return -EINVAL;
+	qopt = nla_data(opt);
+
+	memcpy(unwnd_map, dev->prio_tc_map, sizeof(unwnd_map));
+	memcpy(unwnd_txq, dev->tc_to_txq, sizeof(unwnd_txq));
+
+	/* If the mclass options indicate that hardware should own
+	 * the queue mapping then run ndo_setup_tc. If this cannot
+	 * be done, fail immediately.
+	 */
+	if (qopt->hw && dev->netdev_ops->ndo_setup_tc) {
+		priv->hw_owned = 1;
+		if (dev->netdev_ops->ndo_setup_tc(dev, qopt->num_tc))
+			return -EINVAL;
+	} else if (!qopt->hw) {
+		if (mclass_parse_opt(dev, qopt))
+			return -EINVAL;
+
+		if (netdev_set_num_tc(dev, qopt->num_tc))
+			return -ENOMEM;
+
+		for (i = 0; i < qopt->num_tc; i++)
+			netdev_set_tc_queue(dev, i,
+					    qopt->count[i], qopt->offset[i]);
+	} else {
+		return -EINVAL;
+	}
+
+	/* Always use supplied priority mappings */
+	for (i = 0; i < 16; i++) {
+		if (netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i])) {
+			err = -EINVAL;
+			goto tc_err;
+		}
+	}
+
+	/* pre-allocate qdisc, attachment can't fail */
+	priv->qdiscs = kcalloc(qopt->num_tc,
+			       sizeof(priv->qdiscs[0]), GFP_KERNEL);
+	if (priv->qdiscs == NULL) {
+		err = -ENOMEM;
+		goto tc_err;
+	}
+
+	for (i = 0; i < dev->num_tc; i++) {
+		dev_queue = netdev_get_tx_queue(dev, dev->tc_to_txq[i].offset);
+		qdisc = qdisc_create_dflt(dev_queue, &mq_qdisc_ops,
+					  TC_H_MAKE(TC_H_MAJ(sch->handle),
+						    TC_H_MIN(i + 1)));
+		if (qdisc == NULL) {
+			err = -ENOMEM;
+			goto err;
+		}
+		qdisc->flags |= TCQ_F_CAN_BYPASS;
+		priv->qdiscs[i] = qdisc;
+	}
+
+	sch->flags |= TCQ_F_MQROOT;
+	return 0;
+
+err:
+	mclass_destroy(sch);
+tc_err:
+	if (priv->hw_owned)
+		dev->netdev_ops->ndo_setup_tc(dev, unwnd_tc);
+	else
+		netdev_set_num_tc(dev, unwnd_tc);
+
+	memcpy(dev->prio_tc_map, unwnd_map, sizeof(unwnd_map));
+	memcpy(dev->tc_to_txq, unwnd_txq, sizeof(unwnd_txq));
+
+	return err;
+}
+
+static void mclass_attach(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	struct Qdisc *qdisc;
+	unsigned int ntc;
+
+	/* Attach underlying qdisc */
+	for (ntc = 0; ntc < dev->num_tc; ntc++) {
+		qdisc = priv->qdiscs[ntc];
+		if (qdisc->ops && qdisc->ops->attach)
+			qdisc->ops->attach(qdisc);
+	}
+}
+
+static int mclass_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
+		    struct Qdisc **old)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	unsigned long ntc = cl - 1;
+
+	if (ntc >= dev->num_tc || (new && !(new->flags & TCQ_F_MQSAFE)))
+		return -EINVAL;
+
+	if (dev->flags & IFF_UP)
+		dev_deactivate(dev);
+
+	if (new == NULL)
+		new = &noop_qdisc;
+
+	*old = priv->qdiscs[ntc];
+	priv->qdiscs[ntc] = new;
+	qdisc_reset(*old);
+
+	if (dev->flags & IFF_UP)
+		dev_activate(dev);
+
+	return 0;
+}
+
+static int mclass_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	unsigned char *b = skb_tail_pointer(skb);
+	struct tc_mclass_qopt opt;
+	struct Qdisc *qdisc;
+	unsigned int i;
+
+	sch->q.qlen = 0;
+	memset(&sch->bstats, 0, sizeof(sch->bstats));
+	memset(&sch->qstats, 0, sizeof(sch->qstats));
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		qdisc = netdev_get_tx_queue(dev, i)->qdisc;
+		spin_lock_bh(qdisc_lock(qdisc));
+		sch->q.qlen		+= qdisc->q.qlen;
+		sch->bstats.bytes	+= qdisc->bstats.bytes;
+		sch->bstats.packets	+= qdisc->bstats.packets;
+		sch->qstats.qlen	+= qdisc->qstats.qlen;
+		sch->qstats.backlog	+= qdisc->qstats.backlog;
+		sch->qstats.drops	+= qdisc->qstats.drops;
+		sch->qstats.requeues	+= qdisc->qstats.requeues;
+		sch->qstats.overlimits	+= qdisc->qstats.overlimits;
+		spin_unlock_bh(qdisc_lock(qdisc));
+	}
+
+	opt.num_tc = dev->num_tc;
+	memcpy(opt.prio_tc_map, dev->prio_tc_map, 16);
+	opt.hw = priv->hw_owned;
+
+	for (i = 0; i < dev->num_tc; i++) {
+		opt.count[i] = dev->tc_to_txq[i].count;
+		opt.offset[i] = dev->tc_to_txq[i].offset;
+	}
+
+	NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+
+	return skb->len;
+nla_put_failure:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static struct Qdisc *mclass_leaf(struct Qdisc *sch, unsigned long cl)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	unsigned long ntc = cl - 1;
+
+	if (ntc >= dev->num_tc)
+		return NULL;
+	return priv->qdiscs[ntc];
+}
+
+static unsigned long mclass_get(struct Qdisc *sch, u32 classid)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned int ntc = TC_H_MIN(classid);
+
+	if (ntc >= dev->num_tc)
+		return 0;
+	return ntc;
+}
+
+static void mclass_put(struct Qdisc *sch, unsigned long cl)
+{
+}
+
+static int mclass_dump_class(struct Qdisc *sch, unsigned long cl,
+			 struct sk_buff *skb, struct tcmsg *tcm)
+{
+	struct Qdisc *class;
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	unsigned long ntc = cl - 1;
+
+	if (ntc >= dev->num_tc)
+		return -EINVAL;
+
+	class = priv->qdiscs[ntc];
+
+	tcm->tcm_parent = TC_H_ROOT;
+	tcm->tcm_handle |= TC_H_MIN(cl);
+	tcm->tcm_info = class->handle;
+	return 0;
+}
+
+static int mclass_dump_class_stats(struct Qdisc *sch, unsigned long cl,
+			       struct gnet_dump *d)
+{
+	struct Qdisc *class, *qdisc;
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	unsigned long ntc = cl - 1;
+	unsigned int i;
+	u16 count, offset;
+
+	if (ntc >= dev->num_tc)
+		return -EINVAL;
+
+	class = priv->qdiscs[ntc];
+	count = dev->tc_to_txq[ntc].count;
+	offset = dev->tc_to_txq[ntc].offset;
+
+	memset(&class->bstats, 0, sizeof(class->bstats));
+	memset(&class->qstats, 0, sizeof(class->qstats));
+
+	/* Drop the lock here; it will be reclaimed before touching
+	 * statistics. This is required because the qdisc_root_sleeping_lock
+	 * we hold here is the lock on dev_queue->qdisc_sleeping, which is
+	 * also acquired below.
+	 */
+	spin_unlock_bh(d->lock);
+
+	for (i = offset; i < offset + count; i++) {
+		qdisc = netdev_get_tx_queue(dev, i)->qdisc;
+		spin_lock_bh(qdisc_lock(qdisc));
+		class->q.qlen		 += qdisc->q.qlen;
+		class->bstats.bytes	 += qdisc->bstats.bytes;
+		class->bstats.packets	 += qdisc->bstats.packets;
+		class->qstats.qlen	 += qdisc->qstats.qlen;
+		class->qstats.backlog	 += qdisc->qstats.backlog;
+		class->qstats.drops	 += qdisc->qstats.drops;
+		class->qstats.requeues	 += qdisc->qstats.requeues;
+		class->qstats.overlimits += qdisc->qstats.overlimits;
+		spin_unlock_bh(qdisc_lock(qdisc));
+	}
+
+	/* Reclaim root sleeping lock before completing stats */
+	spin_lock_bh(d->lock);
+
+	class->qstats.qlen = class->q.qlen;
+	if (gnet_stats_copy_basic(d, &class->bstats) < 0 ||
+	    gnet_stats_copy_queue(d, &class->qstats) < 0)
+		return -1;
+	return 0;
+}
+
+static void mclass_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned long ntc;
+
+	if (arg->stop)
+		return;
+
+	arg->count = arg->skip;
+	for (ntc = arg->skip; ntc < dev->num_tc; ntc++) {
+		if (arg->fn(sch, ntc + 1, arg) < 0) {
+			arg->stop = 1;
+			break;
+		}
+		arg->count++;
+	}
+}
+
+static const struct Qdisc_class_ops mclass_class_ops = {
+	.graft		= mclass_graft,
+	.leaf		= mclass_leaf,
+	.get		= mclass_get,
+	.put		= mclass_put,
+	.walk		= mclass_walk,
+	.dump		= mclass_dump_class,
+	.dump_stats	= mclass_dump_class_stats,
+};
+
+struct Qdisc_ops mclass_qdisc_ops __read_mostly = {
+	.cl_ops		= &mclass_class_ops,
+	.id		= "mclass",
+	.priv_size	= sizeof(struct mclass_sched),
+	.init		= mclass_init,
+	.destroy	= mclass_destroy,
+	.attach		= mclass_attach,
+	.dump		= mclass_dump,
+	.owner		= THIS_MODULE,
+};
diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index 86da74c..886cfac 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -86,6 +86,9 @@ static int mq_init(struct Qdisc *sch, struct nlattr *opt)
 
 	if (!priv->num_tc)
 		sch->flags |= TCQ_F_MQROOT;
+	else
+		sch->flags |= TCQ_F_MQSAFE;
+
 	return 0;
 
 err:
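
Note: the range validation done by mclass_parse_opt() in the patch above
can be sketched in userspace as follows. This is a minimal sketch, not
part of the posted patch: struct qopt_sketch and check_tc_ranges() are
hypothetical names, the field widths are assumed, and the extra bound
on num_tc is an added guard that the posted function does not contain.

```c
#include <errno.h>
#include <stdint.h>

#define TC_MAX_QUEUE 16

/* Hypothetical mirror of the option block; the real tc_mclass_qopt
 * layout is defined elsewhere in the series and is not reproduced here.
 */
struct qopt_sketch {
	uint8_t  num_tc;
	uint16_t count[TC_MAX_QUEUE];
	uint16_t offset[TC_MAX_QUEUE];
};

/* Return 0 if every class range fits the device and no class starts
 * before an earlier class ends, -EINVAL otherwise.
 */
static int check_tc_ranges(const struct qopt_sketch *q,
			   unsigned int num_tx_queues)
{
	unsigned int i, j;

	/* Added guard (assumption): never index past the arrays. */
	if (q->num_tc > TC_MAX_QUEUE)
		return -EINVAL;

	for (i = 0; i < q->num_tc; i++) {
		/* Explicit widening: two 16-bit values cannot wrap here. */
		unsigned int last = (unsigned int)q->offset[i] + q->count[i];

		if (last > num_tx_queues)
			return -EINVAL;

		/* Later classes must start at or after the end of this one. */
		for (j = i + 1; j < q->num_tc; j++) {
			if (last > q->offset[j])
				return -EINVAL;
		}
	}

	return 0;
}
```

With two classes of four queues each at offsets 0 and 4, the check
passes on an 8-queue device and fails on a 6-queue one; moving the
second class to offset 2 makes the ranges overlap and is also rejected.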



* [net-next-2.6 PATCH v2 2/3] net_sched: Allow multiple mq qdisc to be used as non-root
From: John Fastabend @ 2010-12-21 19:29 UTC (permalink / raw)
  To: davem
  Cc: john.r.fastabend, netdev, hadi, shemminger, tgraf, eric.dumazet,
	bhutchings, nhorman
In-Reply-To: <20101221192831.9703.56356.stgit@jf-dev1-dcblab>

This patch modifies the mq qdisc so that multiple mq qdiscs can be
used, which allows TX queues to be grouped for management.

This allows a root container qdisc to create multiple traffic
classes and use the mq qdisc as a default queueing discipline. It
is expected other queueing disciplines can then be grafted to the
container as needed.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 net/sched/sch_mq.c |   52 +++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index ecc302f..86da74c 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -19,17 +19,32 @@
 
 struct mq_sched {
 	struct Qdisc		**qdiscs;
+	struct netdev_tc_txq	tc_txq;
+	u8 num_tc;
 };
 
+static void mq_queues(struct net_device *dev, struct Qdisc *sch)
+{
+	struct mq_sched *priv = qdisc_priv(sch);
+	if (priv->num_tc) {
+		int queue = TC_H_MIN(sch->parent) - 1;
+		priv->tc_txq.count = dev->tc_to_txq[queue].count;
+		priv->tc_txq.offset = dev->tc_to_txq[queue].offset;
+	} else {
+		priv->tc_txq.count = dev->num_tx_queues;
+		priv->tc_txq.offset = 0;
+	}
+}
+
 static void mq_destroy(struct Qdisc *sch)
 {
-	struct net_device *dev = qdisc_dev(sch);
 	struct mq_sched *priv = qdisc_priv(sch);
 	unsigned int ntx;
 
 	if (!priv->qdiscs)
 		return;
-	for (ntx = 0; ntx < dev->num_tx_queues && priv->qdiscs[ntx]; ntx++)
+
+	for (ntx = 0; ntx < priv->tc_txq.count && priv->qdiscs[ntx]; ntx++)
 		qdisc_destroy(priv->qdiscs[ntx]);
 	kfree(priv->qdiscs);
 }
@@ -42,20 +57,24 @@ static int mq_init(struct Qdisc *sch, struct nlattr *opt)
 	struct Qdisc *qdisc;
 	unsigned int ntx;
 
-	if (sch->parent != TC_H_ROOT)
+	if (sch->parent != TC_H_ROOT && !dev->num_tc)
 		return -EOPNOTSUPP;
 
 	if (!netif_is_multiqueue(dev))
 		return -EOPNOTSUPP;
 
+	/* Record num tc info in priv so we can tear down cleanly */
+	priv->num_tc = dev->num_tc;
+	mq_queues(dev, sch);
+
 	/* pre-allocate qdiscs, attachment can't fail */
-	priv->qdiscs = kcalloc(dev->num_tx_queues, sizeof(priv->qdiscs[0]),
+	priv->qdiscs = kcalloc(priv->tc_txq.count, sizeof(priv->qdiscs[0]),
 			       GFP_KERNEL);
 	if (priv->qdiscs == NULL)
 		return -ENOMEM;
 
-	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
-		dev_queue = netdev_get_tx_queue(dev, ntx);
+	for (ntx = 0; ntx < priv->tc_txq.count; ntx++) {
+		dev_queue = netdev_get_tx_queue(dev, ntx + priv->tc_txq.offset);
 		qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops,
 					  TC_H_MAKE(TC_H_MAJ(sch->handle),
 						    TC_H_MIN(ntx + 1)));
@@ -65,7 +84,8 @@ static int mq_init(struct Qdisc *sch, struct nlattr *opt)
 		priv->qdiscs[ntx] = qdisc;
 	}
 
-	sch->flags |= TCQ_F_MQROOT;
+	if (!priv->num_tc)
+		sch->flags |= TCQ_F_MQROOT;
 	return 0;
 
 err:
@@ -75,12 +95,11 @@ err:
 
 static void mq_attach(struct Qdisc *sch)
 {
-	struct net_device *dev = qdisc_dev(sch);
 	struct mq_sched *priv = qdisc_priv(sch);
 	struct Qdisc *qdisc;
 	unsigned int ntx;
 
-	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+	for (ntx = 0; ntx < priv->tc_txq.count; ntx++) {
 		qdisc = priv->qdiscs[ntx];
 		qdisc = dev_graft_qdisc(qdisc->dev_queue, qdisc);
 		if (qdisc)
@@ -93,6 +112,7 @@ static void mq_attach(struct Qdisc *sch)
 static int mq_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
 	struct net_device *dev = qdisc_dev(sch);
+	struct mq_sched *priv = qdisc_priv(sch);
 	struct Qdisc *qdisc;
 	unsigned int ntx;
 
@@ -100,8 +120,9 @@ static int mq_dump(struct Qdisc *sch, struct sk_buff *skb)
 	memset(&sch->bstats, 0, sizeof(sch->bstats));
 	memset(&sch->qstats, 0, sizeof(sch->qstats));
 
-	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
-		qdisc = netdev_get_tx_queue(dev, ntx)->qdisc_sleeping;
+	for (ntx = 0; ntx < priv->tc_txq.count; ntx++) {
+		int txq = ntx + priv->tc_txq.offset;
+		qdisc = netdev_get_tx_queue(dev, txq)->qdisc_sleeping;
 		spin_lock_bh(qdisc_lock(qdisc));
 		sch->q.qlen		+= qdisc->q.qlen;
 		sch->bstats.bytes	+= qdisc->bstats.bytes;
@@ -119,11 +140,12 @@ static int mq_dump(struct Qdisc *sch, struct sk_buff *skb)
 static struct netdev_queue *mq_queue_get(struct Qdisc *sch, unsigned long cl)
 {
 	struct net_device *dev = qdisc_dev(sch);
+	struct mq_sched *priv = qdisc_priv(sch);
 	unsigned long ntx = cl - 1;
 
-	if (ntx >= dev->num_tx_queues)
+	if (ntx >= priv->tc_txq.count)
 		return NULL;
-	return netdev_get_tx_queue(dev, ntx);
+	return netdev_get_tx_queue(dev, priv->tc_txq.offset + ntx);
 }
 
 static struct netdev_queue *mq_select_queue(struct Qdisc *sch,
@@ -202,14 +224,14 @@ static int mq_dump_class_stats(struct Qdisc *sch, unsigned long cl,
 
 static void mq_walk(struct Qdisc *sch, struct qdisc_walker *arg)
 {
-	struct net_device *dev = qdisc_dev(sch);
+	struct mq_sched *priv = qdisc_priv(sch);
 	unsigned int ntx;
 
 	if (arg->stop)
 		return;
 
 	arg->count = arg->skip;
-	for (ntx = arg->skip; ntx < dev->num_tx_queues; ntx++) {
+	for (ntx = arg->skip; ntx < priv->tc_txq.count; ntx++) {
 		if (arg->fn(sch, ntx + 1, arg) < 0) {
 			arg->stop = 1;
 			break;
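
Note: the class-to-queue translation that this patch introduces in
mq_queue_get() can be sketched outside the kernel as below. It is a
minimal sketch, not part of the posted patch: struct txq_range and
class_to_txq() are hypothetical names, and the function returns an
index rather than a struct netdev_queue pointer.

```c
#include <stdint.h>

/* Hypothetical stand-in for the cached priv->tc_txq range. */
struct txq_range {
	uint16_t count;
	uint16_t offset;
};

/* Map a 1-based class id to an absolute tx queue index, or -1 if the
 * class id is outside this qdisc's range.
 */
static int class_to_txq(const struct txq_range *r, unsigned long cl)
{
	/* cl == 0 wraps to ULONG_MAX and is rejected by the check below. */
	unsigned long ntx = cl - 1;

	if (ntx >= r->count)
		return -1;

	return (int)(r->offset + ntx);
}
```

For a range of four queues starting at offset 8, class 1 maps to queue
8 and class 4 to queue 11, while classes 0 and 5 are rejected.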



* [net-next-2.6 PATCH v2 1/3] net: implement mechanism for HW based QOS
From: John Fastabend @ 2010-12-21 19:28 UTC (permalink / raw)
  To: davem
  Cc: john.r.fastabend, netdev, hadi, shemminger, tgraf, eric.dumazet,
	bhutchings, nhorman

This patch provides a mechanism for lower layer devices to
steer traffic to tx queues using skb->priority. This allows
hardware based QOS schemes to use the default qdisc without
incurring the penalties related to global state and the qdisc
lock, while reliably receiving skbs on the correct tx ring
to avoid head of line blocking resulting from shuffling in
the LLD. Finally, all the goodness from txq caching and xps/rps
can still be leveraged.

Many drivers and hardware exist with the ability to implement
QOS schemes in the hardware but currently these drivers tend
to rely on firmware to reroute specific traffic, a driver
specific select_queue or the queue_mapping action in the
qdisc.

By using select_queue for this, drivers need to be updated for
each and every traffic type and we lose the goodness of much
of the upstream work. Firmware solutions are inherently
inflexible. And finally, if admins are expected to build a
qdisc and filter rules to steer traffic, this requires knowledge
of how the hardware is currently configured. The number of tx
queues and the queue offsets may change depending on resources.
Also, this approach incurs all the overhead of a qdisc with filters.

With the mechanism in this patch, users can set skb priority using
expected methods, i.e. setsockopt(), or the stack can set the priority
directly. Then the skb will be steered to the correct tx queues
aligned with hardware QOS traffic classes. In the normal case, with
a single traffic class and all queues in this class, everything
works as is until the LLD enables multiple tcs.

To steer the skb we keep only the lower 4 bits of the priority
and allow the hardware to configure up to 15 distinct classes
of traffic. This is expected to be sufficient for most applications;
at any rate it is more than the 802.1Q spec designates and is
equal to the number of prio bands currently implemented in
the default qdisc.

This, in conjunction with a userspace application such as
lldpad, can be used to implement 802.1Q transmission selection
algorithms, one of these being the extended transmission
selection algorithm currently used for DCB.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/netdevice.h |   62 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c            |   10 +++++++
 2 files changed, 71 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cc916c5..453b2d7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -646,6 +646,14 @@ struct xps_dev_maps {
     (nr_cpu_ids * sizeof(struct xps_map *)))
 #endif /* CONFIG_XPS */

+#define TC_MAX_QUEUE	16
+#define TC_BITMASK	15
+/* HW offloaded queuing disciplines txq count and offset maps */
+struct netdev_tc_txq {
+	u16 count;
+	u16 offset;
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1146,6 +1154,9 @@ struct net_device {
 	/* Data Center Bridging netlink ops */
 	const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
+	u8 num_tc;
+	struct netdev_tc_txq tc_to_txq[TC_MAX_QUEUE];
+	u8 prio_tc_map[TC_MAX_QUEUE];

 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
 	/* max exchange id for FCoE LRO by ddp */
@@ -1162,6 +1173,57 @@ struct net_device {
 #define	NETDEV_ALIGN		32

 static inline
+int netdev_get_prio_tc_map(const struct net_device *dev, u32 prio)
+{
+	return dev->prio_tc_map[prio & TC_BITMASK];
+}
+
+static inline
+int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
+{
+	if (tc >= dev->num_tc)
+		return -EINVAL;
+
+	dev->prio_tc_map[prio & TC_BITMASK] = tc & TC_BITMASK;
+	return 0;
+}
+
+static inline
+void netdev_reset_tc(struct net_device *dev)
+{
+	dev->num_tc = 0;
+	memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
+	memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
+}
+
+static inline
+int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset)
+{
+	if (tc >= dev->num_tc)
+		return -EINVAL;
+
+	dev->tc_to_txq[tc].count = count;
+	dev->tc_to_txq[tc].offset = offset;
+	return 0;
+}
+
+static inline
+int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
+{
+	if (num_tc > TC_MAX_QUEUE)
+		return -EINVAL;
+
+	dev->num_tc = num_tc;
+	return 0;
+}
+
+static inline
+u8 netdev_get_num_tc(const struct net_device *dev)
+{
+	return dev->num_tc;
+}
+
+static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 					 unsigned int index)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 5987729..5e7832e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2170,6 +2170,8 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 		  unsigned int num_tx_queues)
 {
 	u32 hash;
+	u16 qoffset = 0;
+	u16 qcount = num_tx_queues;

 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
@@ -2178,13 +2180,19 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 		return hash;
 	}

+	if (dev->num_tc) {
+		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+		qoffset = dev->tc_to_txq[tc].offset;
+		qcount = dev->tc_to_txq[tc].count;
+	}
+
 	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
 	else
 		hash = (__force u16) skb->protocol ^ skb->rxhash;
 	hash = jhash_1word(hash, hashrnd);

-	return (u16) (((u64) hash * num_tx_queues) >> 32);
+	return (u16) (((u64) hash * qcount) >> 32) + qoffset;
 }
 EXPORT_SYMBOL(__skb_tx_hash);
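
Note: the queue selection that the __skb_tx_hash() hunk above performs
can be tried in userspace with the sketch below. It is a minimal
sketch, not part of the posted patch: struct fake_dev and pick_txq()
are hypothetical stand-ins for struct net_device and the kernel
function, and the hash is passed in directly instead of being derived
from the skb.

```c
#include <stdint.h>

#define TC_MAX_QUEUE 16
#define TC_BITMASK   15

struct tc_txq {
	uint16_t count;
	uint16_t offset;
};

/* Hypothetical userspace stand-in for the fields added to net_device. */
struct fake_dev {
	uint8_t num_tc;
	struct tc_txq tc_to_txq[TC_MAX_QUEUE];
	uint8_t prio_tc_map[TC_MAX_QUEUE];
	uint16_t num_tx_queues;
};

/* Scale a 32-bit hash into [offset, offset + count) for the traffic
 * class that the priority maps to; with no traffic classes configured
 * the whole [0, num_tx_queues) range is used.
 */
static uint16_t pick_txq(const struct fake_dev *dev, uint32_t priority,
			 uint32_t hash)
{
	uint16_t qoffset = 0;
	uint16_t qcount = dev->num_tx_queues;

	if (dev->num_tc) {
		uint8_t tc = dev->prio_tc_map[priority & TC_BITMASK];

		qoffset = dev->tc_to_txq[tc].offset;
		qcount = dev->tc_to_txq[tc].count;
	}

	return (uint16_t)((((uint64_t)hash * qcount) >> 32) + qoffset);
}
```

With two classes of four queues each (offsets 0 and 4) and priority 5
mapped to class 1, every hash value for a priority-5 packet lands in
queues 4 to 7, so traffic in one class never spills into the queues of
another.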


* Re: [PATCH net-next-2.6 v2 1/1] can: c_can: Added support for Bosch C_CAN controller
From: Wolfgang Grandegger @ 2010-12-21 19:27 UTC (permalink / raw)
  To: Bhupesh SHARMA
  Cc: Socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Marc Kleine-Budde
In-Reply-To: <D5ECB3C7A6F99444980976A8C6D896384DEAE481B6-8vAmw3ZAcdzhJTuQ9jeba9BPR1lH4CV8@public.gmane.org>

Hi Bhupesh,

On 12/21/2010 05:48 AM, Bhupesh SHARMA wrote:
> Hi Wolfgang,
...
>> In the meantime I compared the CAN chapter of the PCH manual with the
>> C_CAN manual. The paragraphs I checked are *identical*. This makes
>> clear that the "pch_can" is a clone of the C_CAN CAN controller, with
>> a few extensions, though. Therefore it would make sense to implement a
>> bus sensitive interface like for the SJA1000, allowing to handle both
>> CAN controllers with one driver sooner rather than later. Therefore,
>> could you please implement:
>>
>>   drivers/net/can/c_can/c_can.c
>>                        /c_can_platform.c
>>
>> Then an interface to the PCI based PCH CAN controller could be added
>> easily, e.g. as "pch_pci.c". You already had something similar in your
>> RFC version of the patch, IIRC.
> 
> This was the approach I initially proposed in my RFC V1 patch :)
> But unfortunately we could not agree to it.

I know. But at that time I was not aware of any other bus used for the
C_CAN controller.

> So, please let me reiterate what I understood and what was present
> in RFC version of the patch. Please add your comments/views:
> 
>         - drivers/net/can/c_can/c_can.c (similar on lines of sja1000.c)
>         i.e. a)no *probe* / *remove* functions here,
>              b)register read/write implemented here.
> 
>         - drivers/net/can/c_can/c_can_platform.c (similar on lines of sja1000_platform.c)
>         i.e. *probe* / *remove* implemented here,

Yes, that's what I'm thinking about.

> Marc and Tomoya can also add their suggestions so that I can finalize V3 a.s.a.p.

That would be nice, indeed. Also have a look at Tomoya's PCH driver,
which looks very good by now.

Wolfgang.


* Re: [net-next-2.6 PATCH 4/4] net_sched: add MQSAFE flag to qdisc to identify mq like qdiscs
From: John Fastabend @ 2010-12-21 19:21 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: davem@davemloft.net, netdev@vger.kernel.org, hadi@cyberus.ca,
	shemminger@vyatta.com, tgraf@infradead.org,
	eric.dumazet@gmail.com, nhorman@tuxdriver.com
In-Reply-To: <1292887321.3055.43.camel@bwh-desktop>

On 12/20/2010 3:22 PM, Ben Hutchings wrote:
> On Fri, 2010-12-17 at 07:34 -0800, John Fastabend wrote:
>> Add a MQSAFE flag to the qdisc schedulers that can be safely
>> managed by sch_mclass. Without this flag schedulers that are
>> not aware of multiple tx queues can be grafted under the
>> mclass qdisc. Allowing incorrect qdiscs to be grafted causes
>> an invalid mapping from qdisc's to netdevice queues.
> [...]
> 
> This should be defined before adding sch_mclass, or at the same time,
> not after.
> 
> Ben.
> 

Right. I'll roll it into the initial mclass patch. Thanks.


* Re: [net-next-2.6 PATCH 2/4] net_sched: Allow multiple mq qdisc to be used as non-root
From: John Fastabend @ 2010-12-21 19:21 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: davem@davemloft.net, netdev@vger.kernel.org, hadi@cyberus.ca,
	shemminger@vyatta.com, tgraf@infradead.org,
	eric.dumazet@gmail.com, nhorman@tuxdriver.com
In-Reply-To: <1292886762.3055.38.camel@bwh-desktop>

On 12/20/2010 3:12 PM, Ben Hutchings wrote:
> On Fri, 2010-12-17 at 07:34 -0800, John Fastabend wrote:
> [...]
>> diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
>> index ecc302f..35ed26d 100644
>> --- a/net/sched/sch_mq.c
>> +++ b/net/sched/sch_mq.c
>> @@ -19,17 +19,39 @@
>>  
>>  struct mq_sched {
>>  	struct Qdisc		**qdiscs;
>> +	u8 num_tc;
>>  };
>>  
>> +static void mq_queues(struct net_device *dev, struct Qdisc *sch,
>> +		      unsigned int *count, unsigned int *offset)
>> +{
>> +	struct mq_sched *priv = qdisc_priv(sch);
>> +	if (priv->num_tc) {
>> +		int queue = TC_H_MIN(sch->parent) - 1;
>> +		if (count)
>> +			*count = dev->tc_to_txq[queue].count;
>> +		if (offset)
>> +			*offset = dev->tc_to_txq[queue].offset;
>> +	} else {
>> +		if (count)
>> +			*count = dev->num_tx_queues;
>> +		if (offset)
>> +			*offset = 0;
>> +	}
>> +}
> [...]
> 
> It looks like num_tc will be set even for the root qdisc if the device
> is capable of QoS.  Would mq_queues() behave correctly then, i.e. is the
> queue range for priority 0 required to be [0, dev->num_tx_queues)?

If num_tc is set, the mclass qdisc is loaded by default and not the mq
qdisc. When mclass is destroyed it sets num_tc to zero. So I believe
mq_queues() will behave correctly, i.e. if mq is the root qdisc,
[0, dev->num_tx_queues) will be used.

> 
> Also it would be neater to return count and offset together as struct
> netdev_tc_txq, rather than through optional out-parameters.  Even better
> would be to cache these in struct mq_sched, if that's possible.
> 

Yes, it should be possible to embed this in mq_sched.

> Ben.
> 
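
Note: Ben's suggestion of returning count and offset together, rather
than through optional out-parameters, could look roughly like the
sketch below. It is only an illustration under stated assumptions:
struct dev_sketch and queue_range() are hypothetical names, and the
bounds check on the parent minor is an added guard that is not in the
quoted code.

```c
#include <stdint.h>

#define TC_MAX_QUEUE 16

struct tc_txq {
	uint16_t count;
	uint16_t offset;
};

/* Hypothetical stand-in for the parts of net_device used here. */
struct dev_sketch {
	uint16_t num_tx_queues;
	struct tc_txq tc_to_txq[TC_MAX_QUEUE];
};

/* Return the queue range by value. parent_minor is the 1-based traffic
 * class taken from the parent handle; with no traffic classes (or an
 * out-of-range minor, an added guard) the full device range is used.
 */
static struct tc_txq queue_range(const struct dev_sketch *dev,
				 uint8_t num_tc, unsigned int parent_minor)
{
	struct tc_txq r;

	if (num_tc && num_tc <= TC_MAX_QUEUE &&
	    parent_minor >= 1 && parent_minor <= num_tc) {
		r = dev->tc_to_txq[parent_minor - 1];
	} else {
		r.count = dev->num_tx_queues;
		r.offset = 0;
	}

	return r;
}
```

The caller can store the returned value once at init time, which is
the caching in struct mq_sched that the reply above agrees to.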



* Re: [net-next-2.6 PATCH v2] net: implement mechanism for HW based QOS
From: John Fastabend @ 2010-12-21 19:15 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: davem@davemloft.net, netdev@vger.kernel.org, hadi@cyberus.ca,
	shemminger@vyatta.com, tgraf@infradead.org,
	eric.dumazet@gmail.com, nhorman@tuxdriver.com
In-Reply-To: <1292885287.3055.22.camel@bwh-desktop>

On 12/20/2010 2:48 PM, Ben Hutchings wrote:
> On Fri, 2010-12-17 at 08:56 -0800, John Fastabend wrote:
> [...]
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index cc916c5..c5d7949 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -646,6 +646,12 @@ struct xps_dev_maps {
>>      (nr_cpu_ids * sizeof(struct xps_map *)))
>>  #endif /* CONFIG_XPS */
>>  
>> +/* HW offloaded queuing disciplines txq count and offset maps */
>> +struct netdev_tc_txq {
>> +	u16 count;
>> +	u16 offset;
>> +};
>> +
>>  /*
>>   * This structure defines the management hooks for network devices.
>>   * The following hooks can be defined; unless noted otherwise, they are
>> @@ -1146,6 +1152,9 @@ struct net_device {
>>  	/* Data Center Bridging netlink ops */
>>  	const struct dcbnl_rtnl_ops *dcbnl_ops;
>>  #endif
>> +	u8 num_tc;
>> +	struct netdev_tc_txq tc_to_txq[16];
>> +	u8 prio_tc_map[16];
> 
> That seems like a fair amount of data to add to every net device,
> considering that users may create e.g. a lot of VLAN devices and they
> won't use this state at all.  Have you considered putting these in a
> structure that is accessed indirectly?

This was Eric's suggestion; per his response, it saves an indirection
and avoids false sharing.

> 
>>  #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
>>  	/* max exchange id for FCoE LRO by ddp */
>> @@ -1162,6 +1171,57 @@ struct net_device {
>>  #define	NETDEV_ALIGN		32
>>  
>>  static inline
>> +int netdev_get_prio_tc_map(const struct net_device *dev, u32 prio)
>> +{
>> +	return dev->prio_tc_map[prio & 15];
>> +}
>> +
>> +static inline
>> +int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
>> +{
>> +	if (tc >= dev->num_tc)
>> +		return -EINVAL;
>> +
>> +	dev->prio_tc_map[prio & 15] = tc & 15;
>> +	return 0;
>> +}
> [...]
> 
> Please name the magic numbers 15 and 16.
> 

TC_MAX_QUEUES and TC_BITMASK should work. Thanks for looking over this.

> Ben.
> 
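
Note: both review points above (the size added to every net device and
the unnamed 15/16 constants) can be looked at with a small userspace
sketch. It is a minimal sketch under stated assumptions: struct
tc_state groups the three new fields only so they can be measured,
tc_state_payload() and prio_index() are hypothetical helpers, and the
mask is derived from the array size here rather than written as a
literal.

```c
#include <stddef.h>
#include <stdint.h>

#define TC_MAX_QUEUE 16                  /* size of both per-device tables */
#define TC_BITMASK   (TC_MAX_QUEUE - 1)  /* works because 16 is a power of two */

struct tc_txq {
	uint16_t count;
	uint16_t offset;
};

/* The three fields under discussion, grouped only for measurement. */
struct tc_state {
	uint8_t num_tc;
	struct tc_txq tc_to_txq[TC_MAX_QUEUE];
	uint8_t prio_tc_map[TC_MAX_QUEUE];
};

/* Bytes of actual data, ignoring any padding the compiler adds. */
static size_t tc_state_payload(void)
{
	return sizeof(uint8_t) +
	       sizeof(struct tc_txq) * TC_MAX_QUEUE +
	       sizeof(uint8_t) * TC_MAX_QUEUE;
}

/* Index into prio_tc_map for an arbitrary 32-bit priority. */
static unsigned int prio_index(uint32_t prio)
{
	return prio & TC_BITMASK;
}
```

On common ABIs, where struct tc_txq occupies 4 bytes, the payload is
1 + 64 + 16 = 81 bytes per device before padding, which is the amount
of data the review comment is weighing against the saved indirection.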



* Re: [patch -next] stmmac: unwind properly in stmmac_dvr_probe()
From: David Miller @ 2010-12-21 18:53 UTC (permalink / raw)
  To: error27; +Cc: peppe.cavallaro, netdev, kernel-janitors
In-Reply-To: <20101221073456.GG1936@bicker>

From: Dan Carpenter <error27@gmail.com>
Date: Tue, 21 Dec 2010 10:34:56 +0300

> The original code had several problems:
> *) It had potential null dereferences of "priv" and "res".
> *) It released the memory region before it was acquired.
> *) It didn't free "ndev" after it was allocated.
> *) It didn't call unregister_netdev() after calling stmmac_probe().
> 
> Signed-off-by: Dan Carpenter <error27@gmail.com>

Nice work, applied.

Thanks Dan.
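
Note: the kind of unwind ordering the fix is about can be illustrated
with the generic goto-ladder sketch below. It is not the stmmac code:
struct probe_state, probe_sketch() and the fail_at parameter are
hypothetical and exist only to show that each label releases exactly
what was acquired before the failing step, in reverse order.

```c
/* Hypothetical resources tracked as flags so the order can be checked. */
struct probe_state {
	int region;      /* memory region requested */
	int ndev;        /* net device allocated */
	int registered;  /* net device registered */
};

/* fail_at selects which step fails (0 means none). On failure,
 * everything acquired so far is released and -1 is returned.
 */
static int probe_sketch(struct probe_state *s, int fail_at)
{
	if (fail_at == 1)
		goto out;               /* nothing acquired yet */
	s->region = 1;

	if (fail_at == 2)
		goto out_release_region;
	s->ndev = 1;

	if (fail_at == 3)
		goto out_free_ndev;
	s->registered = 1;

	return 0;

out_free_ndev:
	s->ndev = 0;
out_release_region:
	s->region = 0;
out:
	return -1;
}
```

A failure at the first step jumps past every release, so nothing is
freed before it has been acquired; a failure at the last step frees
the device and then the region.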


* Re: [patch -next] bnx2x: remove bogus check
From: David Miller @ 2010-12-21 18:51 UTC (permalink / raw)
  To: eilong; +Cc: error27, netdev, kernel-janitors
In-Reply-To: <1292939316.21210.7084.camel@lb-tlvb-eilong.il.broadcom.com>

From: "Eilon Greenstein" <eilong@broadcom.com>
Date: Tue, 21 Dec 2010 15:48:36 +0200

> On Mon, 2010-12-20 at 23:04 -0800, Dan Carpenter wrote:
>> We dereferenced params on the line before so it's too late to check if
>> params is NULL.  In fact, params can never be NULL and strict_cos is
>> either 0 or 1 so that part of the check is bogus too.  Let's remove it.
>> 
>> Signed-off-by: Dan Carpenter <error27@gmail.com>
> 
> Thanks Dan!
> 
> Acked-by: Eilon Greenstein <eilong@broadcom.com>

Applied, thanks everyone.


* Re: [PATCH net-next-2.6] net: timestamp cloned packet in dev_queue_xmit_nit
From: David Miller @ 2010-12-21 18:50 UTC (permalink / raw)
  To: xiaosuo; +Cc: eric.dumazet, netdev, kaber, jarkao2
In-Reply-To: <AANLkTin6pzkUAcmH12ywTsRAABEKch9=iHziZ0oURZYd@mail.gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Tue, 21 Dec 2010 15:56:17 +0800

> On Tue, Dec 21, 2010 at 3:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Friday 17 December 2010 at 10:26 +0100, Eric Dumazet wrote:
>> [PATCH net-next-2.6] net: timestamp cloned packet in dev_queue_xmit_nit
>>
>> Now that we do one clone of the skb if at least one sniffer might take
>> the packet, we can also do the skb timestamping on the clone and leave
>> the original packet unchanged.
>>
>> This is a generalization of commit 8caf153974f2 (net: sch_netem: Fix an
>> inconsistency in ingress netem timestamps.)
>>
>> This way, we can have a good idea when packets are delivered to our
>> stack (tcpdump -i ifb0), while a tcpdump on original device gives
>> timestamps right before ingressing.
>>
>> This also speeds up our stack, avoiding taking timestamps if not needed.
>>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> Cc: Changli Gao <xiaosuo@gmail.com>
>> Cc: Patrick McHardy <kaber@trash.net>
>> Cc: Jarek Poplawski <jarkao2@gmail.com>
> 
> Acked-by: Changli Gao <xiaosuo@gmail.com>

Applied, thanks.


* Re: [PATCH 1/1] TCP: increase default initial receive window.
From: Eric Dumazet @ 2010-12-21 18:49 UTC (permalink / raw)
  To: John Heffner
  Cc: Nandita Dukkipati, David S. Miller, netdev, Tom Herbert,
	Laurent Chavey, Yuchung Cheng
In-Reply-To: <AANLkTikcS-U9yA05dTp_y4EciwtUsjSonjc+bkA=57=4@mail.gmail.com>

On Tuesday 21 December 2010 at 13:27 -0500, John Heffner wrote:
> I know this has already been applied, but one thing to think about:
> Linux announces a small initial window to prevent overflowing the
> receive buffer when receiving segments smaller than the link MTU.

Overflowing the receive buffer? Which one? Do you mean the NIC RX ring
buffer?

> Increasing this even to 10 segments might have some negative
> consequences.  I recall, for instance, some drivers when configured
> with a 9000 byte MTU, have a single pool of receive buffers all 16k
> (the next highest power of 2).  So each received segment will get 16k
> of allocated memory accounted to it, even if the incoming segments are
> <=1460 bytes long.  The default initial rcvbuf of 87380 bytes is less
> than the 160k of memory that the initial window might consume, so
> we're going to start hitting the very slow path of coalescing segments
> to get back under memory bounds.

The patch is not allowing 87380 bytes, but 10 segments, limited to 14600
bytes. It's very conservative IMHO.

> 
> Some drivers are smarter about having multiple pools of receive
> buffers with different sizes, so it might not be so easy to hit this
> condition.  I haven't looked at any of them for a while.  Is this
> still a real concern?

I don't think so. You would have a problem anyway, since the patch
changes only the _initial_ receive window. After some kbytes of data
exchanged, the window is probably larger.
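
Note: the figures traded in this exchange can be checked with a few
lines of C. This is only arithmetic on the numbers quoted above
(10 segments, 1460-byte segments, 16k receive buffers, an 87380-byte
default rcvbuf); the helper names are made up for illustration and
nothing here models the kernel's real memory accounting.

```c
#include <stddef.h>

/* Bytes of payload an initial window of `segments` segments can carry. */
static size_t window_bytes(unsigned int segments, size_t mss)
{
	return (size_t)segments * mss;
}

/* Bytes of buffer memory charged if each segment occupies one
 * fixed-size receive buffer, regardless of how much data it carries.
 */
static size_t accounted_bytes(unsigned int segments, size_t buf_size)
{
	return (size_t)segments * buf_size;
}
```

window_bytes(10, 1460) gives the 14600 bytes mentioned in the reply,
while accounted_bytes(10, 16384) gives 163840 bytes, the "160k" figure
that exceeds the 87380-byte default rcvbuf in the quoted concern.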




* Re: [RFC] ipv4: add ICMP socket kind
From: Colin Walters @ 2010-12-21 18:46 UTC (permalink / raw)
  To: Vasiliy Kulikov; +Cc: netdev, linux-kernel, Pavel Kankovsky, Solar Designer
In-Reply-To: <20101221181800.GA8166@albatros>

On Tue, Dec 21, 2010 at 1:18 PM, Vasiliy Kulikov <segooon@gmail.com> wrote:
> Hi,
>
> This patch adds IPPROTO_ICMP socket kind.  It makes it possible to send
> ICMP_ECHO messages and receive corresponding ICMP_ECHOREPLY messages
> without any special privileges.  In other words, the patch makes it
> possible to implement setuid-less /bin/ping.
>
> A new ping socket is created with
>
>    socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)

And the default is to allow any uid to do this (modulo LSM)?

If you really have a burning desire to get rid of setuid /bin/ping,
why not just do it in userspace via message passing to/from a
privileged process, and avoid a lot of code in the kernel?  It's much
more flexible.  You could, for example, limit it to once a second by
default, allow only one process doing this per uid, etc.
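
Note: from userspace, the socket call quoted from the RFC would be used
roughly as in the sketch below. It is a minimal sketch under stated
assumptions: open_ping_socket() and build_echo_header() are
hypothetical helpers, the call is expected to fail on kernels without
the proposed socket kind or where policy denies it, and the checksum
field is left zero because the quoted text does not say who fills it.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try to open the datagram ICMP socket described in the RFC.
 * Returns a file descriptor, or a negative errno value on failure.
 */
static int open_ping_socket(void)
{
	int fd = socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP);

	if (fd < 0)
		return -errno;
	return fd;
}

/* Fill the 8-byte ICMP echo request header: type 8, code 0, then
 * checksum, identifier and sequence number in network byte order.
 */
static void build_echo_header(unsigned char buf[8], uint16_t id,
			      uint16_t seq)
{
	buf[0] = 8;                      /* ICMP echo request */
	buf[1] = 0;                      /* code */
	buf[2] = 0;                      /* checksum, left zero in this sketch */
	buf[3] = 0;
	buf[4] = (unsigned char)(id >> 8);
	buf[5] = (unsigned char)(id & 0xff);
	buf[6] = (unsigned char)(seq >> 8);
	buf[7] = (unsigned char)(seq & 0xff);
}
```

A caller would check the return value of open_ping_socket() and fall
back to whatever privileged path it already has when the call fails,
which is also where the per-uid policy discussed above would show up.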

