Netdev List

Netdev List
 help / color / mirror / Atom feed

* [patch] IPVS: one-packet scheduling
From: Simon Horman @ 2010-05-19  3:14 UTC (permalink / raw)
  To: netdev, lvs-devel, netfilter-devel
  Cc: Wensong Zhang, Julian Anastasov, Patrick McHardy, Nick Chalk

From: Nick Chalk <nick@loadbalancer.org>

IPVS: one-packet scheduling

Allow one-packet scheduling for UDP connections. When the fwmark-based or
normal virtual service is marked with '-o' or '--ops' options all
connections are created only to schedule one packet. Useful to schedule UDP
packets from same client port to different real servers. Recommended with
RR or WRR schedulers (the connections are not visible with ipvsadm -L).

Signed-off-by: Nick Chalk <nick@loadbalancer.org>
Signed-off-by: Simon Horman <horms@verge.net.au>

--- 
U
Up-port of http://www.ssi.bg/~ja/tmp/ops-2.4.32-1.diff

Patrick, please consider merging this

Index: nf-next-2.6/include/linux/ip_vs.h
===================================================================
--- nf-next-2.6.orig/include/linux/ip_vs.h	2010-05-19 11:59:32.000000000 +0900
+++ nf-next-2.6/include/linux/ip_vs.h	2010-05-19 12:06:23.000000000 +0900
@@ -19,6 +19,7 @@
  */
 #define IP_VS_SVC_F_PERSISTENT	0x0001		/* persistent port */
 #define IP_VS_SVC_F_HASHED	0x0002		/* hashed entry */
+#define IP_VS_SVC_F_ONEPACKET	0x0004		/* one-packet scheduling */
 
 /*
  *      Destination Server Flags
@@ -85,6 +86,7 @@
 #define IP_VS_CONN_F_SEQ_MASK	0x0600		/* in/out sequence mask */
 #define IP_VS_CONN_F_NO_CPORT	0x0800		/* no client port set yet */
 #define IP_VS_CONN_F_TEMPLATE	0x1000		/* template, not connection */
+#define IP_VS_CONN_F_ONE_PACKET	0x2000		/* forward only one packet */
 
 #define IP_VS_SCHEDNAME_MAXLEN	16
 #define IP_VS_IFNAME_MAXLEN	16
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_conn.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_conn.c	2010-05-19 11:59:32.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_conn.c	2010-05-19 12:06:23.000000000 +0900
@@ -158,6 +158,9 @@ static inline int ip_vs_conn_hash(struct
 	unsigned hash;
 	int ret;
 
+	if (cp->flags & IP_VS_CONN_F_ONE_PACKET)
+		return 0;
+
 	/* Hash by protocol, client address and port */
 	hash = ip_vs_conn_hashkey(cp->af, cp->protocol, &cp->caddr, cp->cport);
 
@@ -355,8 +358,9 @@ struct ip_vs_conn *ip_vs_conn_out_get
  */
 void ip_vs_conn_put(struct ip_vs_conn *cp)
 {
-	/* reset it expire in its timeout */
-	mod_timer(&cp->timer, jiffies+cp->timeout);
+	unsigned long t = (cp->flags & IP_VS_CONN_F_ONE_PACKET) ?
+		0 : cp->timeout;
+	mod_timer(&cp->timer, jiffies+t);
 
 	__ip_vs_conn_put(cp);
 }
@@ -649,7 +653,7 @@ static void ip_vs_conn_expire(unsigned l
 	/*
 	 *	unhash it if it is hashed in the conn table
 	 */
-	if (!ip_vs_conn_unhash(cp))
+	if (!ip_vs_conn_unhash(cp) && !(cp->flags & IP_VS_CONN_F_ONE_PACKET))
 		goto expire_later;
 
 	/*
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_core.c	2010-05-19 11:59:32.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c	2010-05-19 12:08:08.000000000 +0900
@@ -194,6 +194,7 @@ ip_vs_sched_persist(struct ip_vs_service
 	struct ip_vs_dest *dest;
 	struct ip_vs_conn *ct;
 	__be16  dport;			/* destination port to forward */
+	__be16  flags;
 	union nf_inet_addr snet;	/* source network of the client,
 					   :wq
					   :wqafter masking */
 
@@ -340,6 +341,10 @@ ip_vs_sched_persist(struct ip_vs_service
 		dport = ports[1];
 	}
 
+	flags = (svc->flags & IP_VS_SVC_F_ONEPACKET
+		 && iph.protocol == IPPROTO_UDP)?
+		IP_VS_CONN_F_ONE_PACKET : 0;
+
 	/*
 	 *    Create a new connection according to the template
 	 */
@@ -347,7 +352,7 @@ ip_vs_sched_persist(struct ip_vs_service
 			    &iph.saddr, ports[0],
 			    &iph.daddr, ports[1],
 			    &dest->addr, dport,
-			    0,
+			    flags,
 			    dest);
 	if (cp == NULL) {
 		ip_vs_conn_put(ct);
@@ -377,7 +382,7 @@ ip_vs_schedule(struct ip_vs_service *svc
 	struct ip_vs_conn *cp = NULL;
 	struct ip_vs_iphdr iph;
 	struct ip_vs_dest *dest;
-	__be16 _ports[2], *pptr;
+	__be16 _ports[2], *pptr, flags;
 
 	ip_vs_fill_iphdr(svc->af, skb_network_header(skb), &iph);
 	pptr = skb_header_pointer(skb, iph.len, sizeof(_ports), _ports);
@@ -407,6 +412,10 @@ ip_vs_schedule(struct ip_vs_service *svc
 		return NULL;
 	}
 
+	flags = (svc->flags & IP_VS_SVC_F_ONEPACKET
+		 && iph.protocol == IPPROTO_UDP)?
+		IP_VS_CONN_F_ONE_PACKET : 0;
+
 	/*
 	 *    Create a connection entry.
 	 */
@@ -414,7 +423,7 @@ ip_vs_schedule(struct ip_vs_service *svc
 			    &iph.saddr, pptr[0],
 			    &iph.daddr, pptr[1],
 			    &dest->addr, dest->port ? dest->port : pptr[1],
-			    0,
+			    flags,
 			    dest);
 	if (cp == NULL)
 		return NULL;
@@ -464,6 +473,9 @@ int ip_vs_leave(struct ip_vs_service *sv
 	if (sysctl_ip_vs_cache_bypass && svc->fwmark && unicast) {
 		int ret, cs;
 		struct ip_vs_conn *cp;
+		__u16 flags = (svc->flags & IP_VS_SVC_F_ONEPACKET &&
+				iph.protocol == IPPROTO_UDP)?
+				IP_VS_CONN_F_ONE_PACKET : 0;
 		union nf_inet_addr daddr =  { .all = { 0, 0, 0, 0 } };
 
 		ip_vs_service_put(svc);
@@ -474,7 +486,7 @@ int ip_vs_leave(struct ip_vs_service *sv
 				    &iph.saddr, pptr[0],
 				    &iph.daddr, pptr[1],
 				    &daddr, 0,
-				    IP_VS_CONN_F_BYPASS,
+				    IP_VS_CONN_F_BYPASS | flags,
 				    NULL);
 		if (cp == NULL)
 			return NF_DROP;
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_ctl.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_ctl.c	2010-05-19 11:59:32.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_ctl.c	2010-05-19 12:06:23.000000000 +0900
@@ -1864,14 +1864,16 @@ static int ip_vs_info_seq_show(struct se
 					   svc->scheduler->name);
 			else
 #endif
-				seq_printf(seq, "%s  %08X:%04X %s ",
+				seq_printf(seq, "%s  %08X:%04X %s %s ",
 					   ip_vs_proto_name(svc->protocol),
 					   ntohl(svc->addr.ip),
 					   ntohs(svc->port),
-					   svc->scheduler->name);
+					   svc->scheduler->name,
+					   (svc->flags & IP_VS_SVC_F_ONEPACKET)?"ops ":"");
 		} else {
-			seq_printf(seq, "FWM  %08X %s ",
-				   svc->fwmark, svc->scheduler->name);
+			seq_printf(seq, "FWM  %08X %s %s",
+				   svc->fwmark, svc->scheduler->name,
+				   (svc->flags & IP_VS_SVC_F_ONEPACKET)?"ops ":"");
 		}
 
 		if (svc->flags & IP_VS_SVC_F_PERSISTENT)


^ permalink raw reply

* any change in socket systemcall or packet_mmap regarding multiqueue nic?
From: Jon Zhou @ 2010-05-19  2:55 UTC (permalink / raw)
  To: netdev@vger.kernel.org

hi
the multiqueue networking can utilize multi-core to process packets from multiqueue nic,
but any change in related userspace application part, such as socket system call, packet_mmap? these userspace API can also utilize multicore to process packets from kernel?
otherwise they have to read data in serialization

thanks
jon 

^ permalink raw reply

* Re: [net-next PATCH 7/7] ixgbe: add support for active DA cables
From: David Miller @ 2010-05-19  2:45 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, donald.c.skidmore
In-Reply-To: <20100519020013.18654.78651.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 18 May 2010 19:00:13 -0700

> From: Don Skidmore <donald.c.skidmore@intel.com>
> 
> This patch adds support of active DA cables.  This is
> renaming and adding some PHY type enumerations.
> 
> Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next PATCH 6/7] ixgbe: dcb, do not tag tc_prio_control frames
From: David Miller @ 2010-05-19  2:44 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, john.r.fastabend
In-Reply-To: <20100519020011.18654.3836.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 18 May 2010 19:00:11 -0700

> From: John Fastabend <john.r.fastabend@intel.com>
> 
> The network stack indicate packets should not be DCB
> tagged by setting the priority to TC_PRIO_CONTROL. One
> usage for this is lldp frames which are not suppossed
> to be tagged.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next PATCH 5/7] ixgbe: fix ixgbe_tx_is_paused logic
From: David Miller @ 2010-05-19  2:44 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, john.r.fastabend
In-Reply-To: <20100519020010.18654.48130.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 18 May 2010 19:00:10 -0700

> From: John Fastabend <john.r.fastabend@intel.com>
> 
> The TFCS bits show the current XON state.  Meaning that the
> device is paused if these bits are 0.  This fixes the logic
> to work as it was intended.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next PATCH 4/7] ixgbe: always enable vlan strip/insert when DCB is enabled
From: David Miller @ 2010-05-19  2:44 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, yi.zou
In-Reply-To: <20100519020008.18654.36543.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 18 May 2010 19:00:08 -0700

> From: Yi Zou <yi.zou@intel.com>
> 
> when DCB mode is on, we want the HW VLAN stripping to be always enabled.
> 
> Signed-off-by: Yi Zou <yi.zou@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next PATCH 3/7] ixgbe: remove some redundant code in setting FCoE FIP filter
From: David Miller @ 2010-05-19  2:44 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, yi.zou
In-Reply-To: <20100519020007.18654.62574.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 18 May 2010 19:00:07 -0700

> From: Yi Zou <yi.zou@intel.com>
> 
> The ETQS setup for FIP out side the if..else is enough for the ETQS
> setup for FIP, so remove redundant code.
> 
> Signed-off-by: Yi Zou <yi.zou@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next PATCH 2/7] ixgbe: fix wrong offset to fc_frame_header in ixgbe_fcoe_ddp
From: David Miller @ 2010-05-19  2:44 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, yi.zou
In-Reply-To: <20100519020005.18654.75760.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 18 May 2010 19:00:05 -0700

> From: Yi Zou <yi.zou@intel.com>
> 
> Make sure we point to the right offset of the fc_frame_header when
> VLAN header exists and HW has VLAN stripping disabled.
> 
> Signed-off-by: Yi Zou <yi.zou@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next PATCH 1/7] ixgbe: fix header len when unsplit packet overflows to data buffer
From: David Miller @ 2010-05-19  2:44 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, shannon.nelson
In-Reply-To: <20100519020003.18654.94899.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 18 May 2010 19:00:03 -0700

> From: Shannon Nelson <shannon.nelson@intel.com>
> 
> When in packet split mode, packet type is not recognized, and the packet is
> larger than the header size, the 82599 overflows the packet into the data
> area, but doesn't set the HDR_LEN field.  We can safely assume the length
> is the current header size.  This fixes an obscure corner case that can be
> triggered by non-ip packet headers or (more likely) by disabling the L2
> packet recognition.
> 
> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* [net-next PATCH 7/7] ixgbe: add support for active DA cables
From: Jeff Kirsher @ 2010-05-19  2:00 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Don Skidmore, Jeff Kirsher
In-Reply-To: <20100519020003.18654.94899.stgit@localhost.localdomain>

From: Don Skidmore <donald.c.skidmore@intel.com>

This patch adds support of active DA cables.  This is
renaming and adding some PHY type enumerations.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_82599.c   |    8 ++++++--
 drivers/net/ixgbe/ixgbe_ethtool.c |    4 ++--
 drivers/net/ixgbe/ixgbe_main.c    |    6 ++++--
 drivers/net/ixgbe/ixgbe_phy.c     |   38 +++++++++++++++++++++++++++++++------
 drivers/net/ixgbe/ixgbe_phy.h     |    3 +++
 drivers/net/ixgbe/ixgbe_type.h    |    9 +++++++--
 6 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_82599.c b/drivers/net/ixgbe/ixgbe_82599.c
index dc197a4..e9706eb 100644
--- a/drivers/net/ixgbe/ixgbe_82599.c
+++ b/drivers/net/ixgbe/ixgbe_82599.c
@@ -2155,10 +2155,14 @@ sfp_check:
 		goto out;
 
 	switch (hw->phy.type) {
-	case ixgbe_phy_tw_tyco:
-	case ixgbe_phy_tw_unknown:
+	case ixgbe_phy_sfp_passive_tyco:
+	case ixgbe_phy_sfp_passive_unknown:
 		physical_layer = IXGBE_PHYSICAL_LAYER_SFP_PLUS_CU;
 		break;
+	case ixgbe_phy_sfp_ftl_active:
+	case ixgbe_phy_sfp_active_unknown:
+		physical_layer = IXGBE_PHYSICAL_LAYER_SFP_ACTIVE_DA;
+		break;
 	case ixgbe_phy_sfp_avago:
 	case ixgbe_phy_sfp_ftl:
 	case ixgbe_phy_sfp_intel:
diff --git a/drivers/net/ixgbe/ixgbe_ethtool.c b/drivers/net/ixgbe/ixgbe_ethtool.c
index 251767d..c50a754 100644
--- a/drivers/net/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ixgbe/ixgbe_ethtool.c
@@ -212,8 +212,8 @@ static int ixgbe_get_settings(struct net_device *netdev,
 		ecmd->port = PORT_FIBRE;
 		break;
 	case ixgbe_phy_nl:
-	case ixgbe_phy_tw_tyco:
-	case ixgbe_phy_tw_unknown:
+	case ixgbe_phy_sfp_passive_tyco:
+	case ixgbe_phy_sfp_passive_unknown:
 	case ixgbe_phy_sfp_ftl:
 	case ixgbe_phy_sfp_avago:
 	case ixgbe_phy_sfp_intel:
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index caf1114..9551cbb 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -3118,8 +3118,10 @@ static inline bool ixgbe_is_sfp(struct ixgbe_hw *hw)
 	case ixgbe_phy_sfp_ftl:
 	case ixgbe_phy_sfp_intel:
 	case ixgbe_phy_sfp_unknown:
-	case ixgbe_phy_tw_tyco:
-	case ixgbe_phy_tw_unknown:
+	case ixgbe_phy_sfp_passive_tyco:
+	case ixgbe_phy_sfp_passive_unknown:
+	case ixgbe_phy_sfp_active_unknown:
+	case ixgbe_phy_sfp_ftl_active:
 		return true;
 	default:
 		return false;
diff --git a/drivers/net/ixgbe/ixgbe_phy.c b/drivers/net/ixgbe/ixgbe_phy.c
index d6d5b84..22d21af 100644
--- a/drivers/net/ixgbe/ixgbe_phy.c
+++ b/drivers/net/ixgbe/ixgbe_phy.c
@@ -531,6 +531,7 @@ s32 ixgbe_identify_sfp_module_generic(struct ixgbe_hw *hw)
 	u8 comp_codes_10g = 0;
 	u8 oui_bytes[3] = {0, 0, 0};
 	u8 cable_tech = 0;
+	u8 cable_spec = 0;
 	u16 enforce_sfp = 0;
 
 	if (hw->mac.ops.get_media_type(hw) != ixgbe_media_type_fiber) {
@@ -580,14 +581,30 @@ s32 ixgbe_identify_sfp_module_generic(struct ixgbe_hw *hw)
 			else
 				hw->phy.sfp_type = ixgbe_sfp_type_unknown;
 		} else if (hw->mac.type == ixgbe_mac_82599EB) {
-			if (cable_tech & IXGBE_SFF_DA_PASSIVE_CABLE)
+			if (cable_tech & IXGBE_SFF_DA_PASSIVE_CABLE) {
 				if (hw->bus.lan_id == 0)
 					hw->phy.sfp_type =
 					             ixgbe_sfp_type_da_cu_core0;
 				else
 					hw->phy.sfp_type =
 					             ixgbe_sfp_type_da_cu_core1;
-			else if (comp_codes_10g & IXGBE_SFF_10GBASESR_CAPABLE)
+			} else if (cable_tech & IXGBE_SFF_DA_ACTIVE_CABLE) {
+				hw->phy.ops.read_i2c_eeprom(
+						hw, IXGBE_SFF_CABLE_SPEC_COMP,
+						&cable_spec);
+				if (cable_spec &
+				    IXGBE_SFF_DA_SPEC_ACTIVE_LIMITING) {
+					if (hw->bus.lan_id == 0)
+						hw->phy.sfp_type =
+						ixgbe_sfp_type_da_act_lmt_core0;
+					else
+						hw->phy.sfp_type =
+						ixgbe_sfp_type_da_act_lmt_core1;
+				} else {
+					hw->phy.sfp_type =
+						ixgbe_sfp_type_unknown;
+				}
+			} else if (comp_codes_10g & IXGBE_SFF_10GBASESR_CAPABLE)
 				if (hw->bus.lan_id == 0)
 					hw->phy.sfp_type =
 					              ixgbe_sfp_type_srlr_core0;
@@ -637,10 +654,14 @@ s32 ixgbe_identify_sfp_module_generic(struct ixgbe_hw *hw)
 			switch (vendor_oui) {
 			case IXGBE_SFF_VENDOR_OUI_TYCO:
 				if (cable_tech & IXGBE_SFF_DA_PASSIVE_CABLE)
-					hw->phy.type = ixgbe_phy_tw_tyco;
+					hw->phy.type =
+						ixgbe_phy_sfp_passive_tyco;
 				break;
 			case IXGBE_SFF_VENDOR_OUI_FTL:
-				hw->phy.type = ixgbe_phy_sfp_ftl;
+				if (cable_tech & IXGBE_SFF_DA_ACTIVE_CABLE)
+					hw->phy.type = ixgbe_phy_sfp_ftl_active;
+				else
+					hw->phy.type = ixgbe_phy_sfp_ftl;
 				break;
 			case IXGBE_SFF_VENDOR_OUI_AVAGO:
 				hw->phy.type = ixgbe_phy_sfp_avago;
@@ -650,7 +671,11 @@ s32 ixgbe_identify_sfp_module_generic(struct ixgbe_hw *hw)
 				break;
 			default:
 				if (cable_tech & IXGBE_SFF_DA_PASSIVE_CABLE)
-					hw->phy.type = ixgbe_phy_tw_unknown;
+					hw->phy.type =
+						ixgbe_phy_sfp_passive_unknown;
+				else if (cable_tech & IXGBE_SFF_DA_ACTIVE_CABLE)
+					hw->phy.type =
+						ixgbe_phy_sfp_active_unknown;
 				else
 					hw->phy.type = ixgbe_phy_sfp_unknown;
 				break;
@@ -658,7 +683,8 @@ s32 ixgbe_identify_sfp_module_generic(struct ixgbe_hw *hw)
 		}
 
 		/* All passive DA cables are supported */
-		if (cable_tech & IXGBE_SFF_DA_PASSIVE_CABLE) {
+		if (cable_tech & (IXGBE_SFF_DA_PASSIVE_CABLE |
+		    IXGBE_SFF_DA_ACTIVE_CABLE)) {
 			status = 0;
 			goto out;
 		}
diff --git a/drivers/net/ixgbe/ixgbe_phy.h b/drivers/net/ixgbe/ixgbe_phy.h
index 9cf5f3b..c9c5459 100644
--- a/drivers/net/ixgbe/ixgbe_phy.h
+++ b/drivers/net/ixgbe/ixgbe_phy.h
@@ -40,9 +40,12 @@
 #define IXGBE_SFF_1GBE_COMP_CODES    0x6
 #define IXGBE_SFF_10GBE_COMP_CODES   0x3
 #define IXGBE_SFF_CABLE_TECHNOLOGY   0x8
+#define IXGBE_SFF_CABLE_SPEC_COMP    0x3C
 
 /* Bitmasks */
 #define IXGBE_SFF_DA_PASSIVE_CABLE           0x4
+#define IXGBE_SFF_DA_ACTIVE_CABLE            0x8
+#define IXGBE_SFF_DA_SPEC_ACTIVE_LIMITING    0x4
 #define IXGBE_SFF_1GBASESX_CAPABLE           0x1
 #define IXGBE_SFF_1GBASELX_CAPABLE           0x2
 #define IXGBE_SFF_10GBASESR_CAPABLE          0x10
diff --git a/drivers/net/ixgbe/ixgbe_type.h b/drivers/net/ixgbe/ixgbe_type.h
index bd69196..39b9be8 100644
--- a/drivers/net/ixgbe/ixgbe_type.h
+++ b/drivers/net/ixgbe/ixgbe_type.h
@@ -2108,6 +2108,7 @@ typedef u32 ixgbe_physical_layer;
 #define IXGBE_PHYSICAL_LAYER_1000BASE_BX  0x0400
 #define IXGBE_PHYSICAL_LAYER_10GBASE_KR   0x0800
 #define IXGBE_PHYSICAL_LAYER_10GBASE_XAUI 0x1000
+#define IXGBE_PHYSICAL_LAYER_SFP_ACTIVE_DA 0x2000
 
 /* Software ATR hash keys */
 #define IXGBE_ATR_BUCKET_HASH_KEY    0xE214AD3D
@@ -2177,10 +2178,12 @@ enum ixgbe_phy_type {
 	ixgbe_phy_qt,
 	ixgbe_phy_xaui,
 	ixgbe_phy_nl,
-	ixgbe_phy_tw_tyco,
-	ixgbe_phy_tw_unknown,
+	ixgbe_phy_sfp_passive_tyco,
+	ixgbe_phy_sfp_passive_unknown,
+	ixgbe_phy_sfp_active_unknown,
 	ixgbe_phy_sfp_avago,
 	ixgbe_phy_sfp_ftl,
+	ixgbe_phy_sfp_ftl_active,
 	ixgbe_phy_sfp_unknown,
 	ixgbe_phy_sfp_intel,
 	ixgbe_phy_sfp_unsupported,
@@ -2208,6 +2211,8 @@ enum ixgbe_sfp_type {
 	ixgbe_sfp_type_da_cu_core1 = 4,
 	ixgbe_sfp_type_srlr_core0 = 5,
 	ixgbe_sfp_type_srlr_core1 = 6,
+	ixgbe_sfp_type_da_act_lmt_core0 = 7,
+	ixgbe_sfp_type_da_act_lmt_core1 = 8,
 	ixgbe_sfp_type_not_present = 0xFFFE,
 	ixgbe_sfp_type_unknown = 0xFFFF
 };


^ permalink raw reply related

* [net-next PATCH 6/7] ixgbe: dcb, do not tag tc_prio_control frames
From: Jeff Kirsher @ 2010-05-19  2:00 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, John Fastabend, Jeff Kirsher
In-Reply-To: <20100519020003.18654.94899.stgit@localhost.localdomain>

From: John Fastabend <john.r.fastabend@intel.com>

The network stack indicate packets should not be DCB
tagged by setting the priority to TC_PRIO_CONTROL. One
usage for this is lldp frames which are not suppossed
to be tagged.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_main.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index d80bb1a..caf1114 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -6075,7 +6075,8 @@ static netdev_tx_t ixgbe_xmit_frame(struct sk_buff *skb,
 		}
 		tx_flags <<= IXGBE_TX_FLAGS_VLAN_SHIFT;
 		tx_flags |= IXGBE_TX_FLAGS_VLAN;
-	} else if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
+	} else if (adapter->flags & IXGBE_FLAG_DCB_ENABLED &&
+		   skb->priority != TC_PRIO_CONTROL) {
 		tx_flags |= ((skb->queue_mapping & 0x7) << 13);
 		tx_flags <<= IXGBE_TX_FLAGS_VLAN_SHIFT;
 		tx_flags |= IXGBE_TX_FLAGS_VLAN;


^ permalink raw reply related

* [net-next PATCH 5/7] ixgbe: fix ixgbe_tx_is_paused logic
From: Jeff Kirsher @ 2010-05-19  2:00 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, John Fastabend, Jeff Kirsher
In-Reply-To: <20100519020003.18654.94899.stgit@localhost.localdomain>

From: John Fastabend <john.r.fastabend@intel.com>

The TFCS bits show the current XON state.  Meaning that the
device is paused if these bits are 0.  This fixes the logic
to work as it was intended.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_main.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index f0f7329..d80bb1a 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -625,16 +625,16 @@ static void ixgbe_unmap_and_free_tx_resource(struct ixgbe_adapter *adapter,
 }
 
 /**
- * ixgbe_tx_is_paused - check if the tx ring is paused
+ * ixgbe_tx_xon_state - check the tx ring xon state
  * @adapter: the ixgbe adapter
  * @tx_ring: the corresponding tx_ring
  *
  * If not in DCB mode, checks TFCS.TXOFF, otherwise, find out the
  * corresponding TC of this tx_ring when checking TFCS.
  *
- * Returns : true if paused
+ * Returns : true if in xon state (currently not paused)
  */
-static inline bool ixgbe_tx_is_paused(struct ixgbe_adapter *adapter,
+static inline bool ixgbe_tx_xon_state(struct ixgbe_adapter *adapter,
                                       struct ixgbe_ring *tx_ring)
 {
 	u32 txoff = IXGBE_TFCS_TXOFF;
@@ -690,7 +690,7 @@ static inline bool ixgbe_check_tx_hang(struct ixgbe_adapter *adapter,
 	adapter->detect_tx_hung = false;
 	if (tx_ring->tx_buffer_info[eop].time_stamp &&
 	    time_after(jiffies, tx_ring->tx_buffer_info[eop].time_stamp + HZ) &&
-	    !ixgbe_tx_is_paused(adapter, tx_ring)) {
+	    ixgbe_tx_xon_state(adapter, tx_ring)) {
 		/* detected Tx unit hang */
 		union ixgbe_adv_tx_desc *tx_desc;
 		tx_desc = IXGBE_TX_DESC_ADV(*tx_ring, eop);


^ permalink raw reply related

* [net-next PATCH 4/7] ixgbe: always enable vlan strip/insert when DCB is enabled
From: Jeff Kirsher @ 2010-05-19  2:00 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Yi Zou, Jeff Kirsher
In-Reply-To: <20100519020003.18654.94899.stgit@localhost.localdomain>

From: Yi Zou <yi.zou@intel.com>

when DCB mode is on, we want the HW VLAN stripping to be always enabled.

Signed-off-by: Yi Zou <yi.zou@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_main.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index a9e091c..f0f7329 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -2845,7 +2845,11 @@ static void ixgbe_vlan_filter_disable(struct ixgbe_adapter *adapter)
 
 	switch (hw->mac.type) {
 	case ixgbe_mac_82598EB:
-		vlnctrl &= ~(IXGBE_VLNCTRL_VME | IXGBE_VLNCTRL_VFE);
+		vlnctrl &= ~IXGBE_VLNCTRL_VFE;
+#ifdef CONFIG_IXGBE_DCB
+		if (!(adapter->flags & IXGBE_FLAG_DCB_ENABLED))
+			vlnctrl &= ~IXGBE_VLNCTRL_VME;
+#endif
 		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
 		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
 		break;
@@ -2853,6 +2857,10 @@ static void ixgbe_vlan_filter_disable(struct ixgbe_adapter *adapter)
 		vlnctrl &= ~IXGBE_VLNCTRL_VFE;
 		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
 		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+#ifdef CONFIG_IXGBE_DCB
+		if (adapter->flags & IXGBE_FLAG_DCB_ENABLED)
+			break;
+#endif
 		for (i = 0; i < adapter->num_rx_queues; i++) {
 			j = adapter->rx_ring[i]->reg_idx;
 			vlnctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(j));


^ permalink raw reply related

* [net-next PATCH 3/7] ixgbe: remove some redundant code in setting FCoE FIP filter
From: Jeff Kirsher @ 2010-05-19  2:00 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Yi Zou, Jeff Kirsher
In-Reply-To: <20100519020003.18654.94899.stgit@localhost.localdomain>

From: Yi Zou <yi.zou@intel.com>

The ETQS setup for FIP out side the if..else is enough for the ETQS
setup for FIP, so remove redundant code.

Signed-off-by: Yi Zou <yi.zou@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_fcoe.c |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_fcoe.c b/drivers/net/ixgbe/ixgbe_fcoe.c
index a82d2fc..45182ab 100644
--- a/drivers/net/ixgbe/ixgbe_fcoe.c
+++ b/drivers/net/ixgbe/ixgbe_fcoe.c
@@ -539,12 +539,6 @@ void ixgbe_configure_fcoe(struct ixgbe_adapter *adapter)
 		}
 		IXGBE_WRITE_REG(hw, IXGBE_FCRECTL, IXGBE_FCRECTL_ENA);
 		IXGBE_WRITE_REG(hw, IXGBE_ETQS(IXGBE_ETQF_FILTER_FCOE), 0);
-		fcoe_i = f->mask;
-		fcoe_i &= IXGBE_FCRETA_ENTRY_MASK;
-		fcoe_q = adapter->rx_ring[fcoe_i]->reg_idx;
-		IXGBE_WRITE_REG(hw, IXGBE_ETQS(IXGBE_ETQF_FILTER_FIP),
-				IXGBE_ETQS_QUEUE_EN |
-				(fcoe_q << IXGBE_ETQS_RX_QUEUE_SHIFT));
 	} else  {
 		/* Use single rx queue for FCoE */
 		fcoe_i = f->mask;


^ permalink raw reply related

* [net-next PATCH 2/7] ixgbe: fix wrong offset to fc_frame_header in ixgbe_fcoe_ddp
From: Jeff Kirsher @ 2010-05-19  2:00 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Yi Zou, Jeff Kirsher
In-Reply-To: <20100519020003.18654.94899.stgit@localhost.localdomain>

From: Yi Zou <yi.zou@intel.com>

Make sure we point to the right offset of the fc_frame_header when
VLAN header exists and HW has VLAN stripping disabled.

Signed-off-by: Yi Zou <yi.zou@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_fcoe.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_fcoe.c b/drivers/net/ixgbe/ixgbe_fcoe.c
index 6493049..a82d2fc 100644
--- a/drivers/net/ixgbe/ixgbe_fcoe.c
+++ b/drivers/net/ixgbe/ixgbe_fcoe.c
@@ -32,6 +32,7 @@
 #endif /* CONFIG_IXGBE_DCB */
 #include <linux/if_ether.h>
 #include <linux/gfp.h>
+#include <linux/if_vlan.h>
 #include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_device.h>
 #include <scsi/fc/fc_fs.h>
@@ -312,10 +313,12 @@ int ixgbe_fcoe_ddp(struct ixgbe_adapter *adapter,
 	if (fcerr == IXGBE_FCERR_BADCRC)
 		skb->ip_summed = CHECKSUM_NONE;
 
-	skb_reset_network_header(skb);
-	skb_set_transport_header(skb, skb_network_offset(skb) +
-				 sizeof(struct fcoe_hdr));
-	fh = (struct fc_frame_header *)skb_transport_header(skb);
+	if (eth_hdr(skb)->h_proto == htons(ETH_P_8021Q))
+		fh = (struct fc_frame_header *)(skb->data +
+			sizeof(struct vlan_hdr) + sizeof(struct fcoe_hdr));
+	else
+		fh = (struct fc_frame_header *)(skb->data +
+			sizeof(struct fcoe_hdr));
 	fctl = ntoh24(fh->fh_f_ctl);
 	if (fctl & FC_FC_EX_CTX)
 		xid =  be16_to_cpu(fh->fh_ox_id);


^ permalink raw reply related

* [net-next PATCH 1/7] ixgbe: fix header len when unsplit packet overflows to data buffer
From: Jeff Kirsher @ 2010-05-19  2:00 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Shannon Nelson, Jeff Kirsher

From: Shannon Nelson <shannon.nelson@intel.com>

When in packet split mode, packet type is not recognized, and the packet is
larger than the header size, the 82599 overflows the packet into the data
area, but doesn't set the HDR_LEN field.  We can safely assume the length
is the current header size.  This fixes an obscure corner case that can be
triggered by non-ip packet headers or (more likely) by disabling the L2
packet recognition.

Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_main.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 926ad8c..a9e091c 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1201,9 +1201,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			hdr_info = le16_to_cpu(ixgbe_get_hdr_info(rx_desc));
 			len = (hdr_info & IXGBE_RXDADV_HDRBUFLEN_MASK) >>
 			       IXGBE_RXDADV_HDRBUFLEN_SHIFT;
-			if (len > IXGBE_RX_HDR_SIZE)
-				len = IXGBE_RX_HDR_SIZE;
 			upper_len = le16_to_cpu(rx_desc->wb.upper.length);
+			if ((len > IXGBE_RX_HDR_SIZE) ||
+			    (upper_len && !(hdr_info & IXGBE_RXDADV_SPH)))
+				len = IXGBE_RX_HDR_SIZE;
 		} else {
 			len = le16_to_cpu(rx_desc->wb.upper.length);
 		}


^ permalink raw reply related

* Re: Network QoS support in applications
From: Philip A. Prindeville @ 2010-05-19  0:04 UTC (permalink / raw)
  To: David Miller; +Cc: dunc, kalle.valo, kaber, netdev, linux-wireless
In-Reply-To: <4B9944A0.5080308@redfish-solutions.com>

On 03/11/2010 12:29 PM, Philip A. Prindeville wrote:
> On 03/11/2010 12:27 PM, David Miller wrote:
>   
>> From: "Philip A. Prindeville" <philipp_subx@redfish-solutions.com>
>> Date: Thu, 11 Mar 2010 12:21:11 -0700
>>
>>     
>>> And yes, there will always be misbehaving users.  They are a fact of
>>> life.  That doesn't mean we should lobotomize the network.  We don't
>>> have an authentication mechanism on ICMP Redirects or Source-Quench,
>>>       
>> Which is why most networks block those packets from the outside.
>>
>>     
>>> Nor is ARP authenticated.
>>>       
>> Which is why people control who can plug into their physical
>> network.
>>
>> None of the things you are saying support the idea of having
>> applications decide what the DSCP marking should be.
>>     
>
> Does "decide what the DSCP marking should be" include complying to the recommendations of RFC-4594?
>   

If anyone cares, here's an update:

I've submitted patches for QoS configuration for:

APR/Apache (stalled);
Proftpd (committed);
Openssh (pending review);
Firefox/Thunderbird (reviewed and on-track for commit);
Cyrus (committed);
Sendmail (submittted and acknowledged, but not yet reviewed);
Curl (stalled);

All, as per the request of the maintainers, default to either no QoS
markings or previous RFC-791 QoS markings if that's what they already
supported (Proftpd and Openssh).

If anyone can think of anything else that needs to be supported to
impact a significant portion of network (or enterprise intranet)
traffic, please call it out.

And if anyone wants to see if they can help get Apache unstalled (it's
mostly an autoconf issue with Solaris that's holding things up), please
give me a holler offline.

Thanks.

^ permalink raw reply

* Re: QoS weirdness : HTB accuracy
From: Philip A. Prindeville @ 2010-05-19  0:07 UTC (permalink / raw)
  To: Julien Vehent; +Cc: Netdev, netfilter
In-Reply-To: <067c83163988908ef546d7ff7f560a17@localhost>

On 03/25/2010 12:06 PM, Julien Vehent wrote:
> Hello folks,
>
> I observe unused bandwidth on my QoS policy that I cannot explain.
> Conditions are: I have a HTB tree with 8 classes and a total rate of
> 768kbits. I use the ATM option so I assume the real rate to be something
> close to 675kbits (about 88% of the ethernet rate).
>
> The sum of my 8 rates is exactly 768kbits. Some have ceil values up to
> 768kbits.
>
> When class 20 "tcp_acks" starts borrowing, TC reduces the total bandwidth
> down to 595kbits/S (minus 79kbits/s). And I can't explain why....
>
> The attached graph "tc_htb_weirdness.png" shows the result: there are
> 'holes' in the sending rate. 
>
> I tried to play with burst sizes, r2q value and hysteresis mode, but the
> results are the same.
>
> System is debian squeeze, kernel version is 2.6.26, iproute2 version is
> 2.6.26 - 07/25/2008.
>
> I have attached two files:
> - "tcrules.txt" : the traffic control rules
> - "tc_htb_weirdness.png" : the rrdtool graph, resolution is 1 second.
>
> And here: http://jve.linuxwall.info/ressources/code/tc_hole_analysis.html
> a sheet with some of the measures values. I used it to calculate the size
> of one of the hole. The last table (with green and red cells) shows that,
> when class 20 "tcp_acks" starts sending at unixtime 1265496813, there is a
> lot of bandwidth left over (last column is all green). During the 95
> seconds while class 20 is sending, 3880776 bits could be sent but are not.
> That's about 40kbits/s on average. 
>
> Does anybody observess the same behavior? Any logical explanation to this
> or is it a bug ?
>
>
> Julien

Sorry for the late response:  could this be an "aliasing" issue caused
by sampling intervals (granularity)?

-Philip


^ permalink raw reply

* Re: [PATCH net-next] ixgbe: return error in set_rar when index out of range
From: Jeff Kirsher @ 2010-05-19  0:19 UTC (permalink / raw)
  To: Shirley Ma; +Cc: davem, kvm, netdev, e1000-devel
In-Reply-To: <1274196888.8701.2.camel@localhost.localdomain>

On Tue, May 18, 2010 at 08:34, Shirley Ma <mashirle@us.ibm.com> wrote:
> Return -1 when set_rar index is out of range
>
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
>
>  drivers/net/ixgbe/ixgbe_common.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>


I think this should use IXGBE_ERR_<blah> instead and there is another
spot where this could be used.  Instead I propose this patch
instead...

ixgbe: return IXGBE_ERR_RAR_INDEX when out of range

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Based on patch from Shirley Ma <xma@us.ibm.com>
Return IXGBE_ERR_RAR_INDEX when RAR index is out of range, instead of
returning IXGBE_SUCCESS.

Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Don Skidmore <donald.c.skidmore@intel.com>
---

 drivers/net/ixgbe/ixgbe_common.c |    2 ++
 drivers/net/ixgbe/ixgbe_type.h   |    1 +
 2 files changed, 3 insertions(+), 0 deletions(-)


diff --git a/drivers/net/ixgbe/ixgbe_common.c b/drivers/net/ixgbe/ixgbe_common.c
index 1159d91..9595b1b 100644
--- a/drivers/net/ixgbe/ixgbe_common.c
+++ b/drivers/net/ixgbe/ixgbe_common.c
@@ -1188,6 +1188,7 @@ s32 ixgbe_set_rar_generic(struct ixgbe_hw *hw,
u32 index, u8 *addr, u32 vmdq,
 		IXGBE_WRITE_REG(hw, IXGBE_RAH(index), rar_high);
 	} else {
 		hw_dbg(hw, "RAR index %d is out of range.\n", index);
+		return IXGBE_ERR_RAR_INDEX;
 	}

 	return 0;
@@ -1219,6 +1220,7 @@ s32 ixgbe_clear_rar_generic(struct ixgbe_hw *hw,
u32 index)
 		IXGBE_WRITE_REG(hw, IXGBE_RAH(index), rar_high);
 	} else {
 		hw_dbg(hw, "RAR index %d is out of range.\n", index);
+		return IXGBE_ERR_RAR_INDEX;
 	}

 	/* clear VMDq pool/queue selection for this RAR */
diff --git a/drivers/net/ixgbe/ixgbe_type.h b/drivers/net/ixgbe/ixgbe_type.h
index bd69196..37d2807 100644
--- a/drivers/net/ixgbe/ixgbe_type.h
+++ b/drivers/net/ixgbe/ixgbe_type.h
@@ -2600,6 +2600,7 @@ struct ixgbe_info {
 #define IXGBE_ERR_FDIR_REINIT_FAILED            -23
 #define IXGBE_ERR_EEPROM_VERSION                -24
 #define IXGBE_ERR_NO_SPACE                      -25
+#define IXGBE_ERR_RAR_INDEX                     -26
 #define IXGBE_NOT_IMPLEMENTED                   0x7FFFFFFF

 #endif /* _IXGBE_TYPE_H_ */

^ permalink raw reply related

* [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
From: Sridhar Samudrala @ 2010-05-19  0:04 UTC (permalink / raw)
  To: Michael S. Tsirkin, netdev, lkml, kvm@vger.kernel.org

Add a new kernel API to create a singlethread workqueue and attach it's
task to current task's cgroup and cpumask.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 9466e86..6d6f301 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -211,6 +211,7 @@ __create_workqueue_key(const char *name, int singlethread,
 #define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1, 0)
 #define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0, 0)
 
+extern struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name); 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
 extern int queue_work(struct workqueue_struct *wq, struct work_struct *work);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5bfb213..6ba226e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -35,6 +35,7 @@
 #include <linux/lockdep.h>
 #define CREATE_TRACE_POINTS
 #include <trace/events/workqueue.h>
+#include <linux/cgroup.h>
 
 /*
  * The per-CPU workqueue (if single thread, we always use the first
@@ -193,6 +194,45 @@ static const struct cpumask *cpu_singlethread_map __read_mostly;
  */
 static cpumask_var_t cpu_populated_map __read_mostly;
 
+static struct task_struct *get_singlethread_wq_task(struct workqueue_struct *wq)
+{
+	return (per_cpu_ptr(wq->cpu_wq, singlethread_cpu))->thread;
+}
+
+/* Create a singlethread workqueue and attach it's task to the current task's
+ * cgroup and set it's cpumask to the current task's cpumask.
+ */
+struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name)
+{
+	struct workqueue_struct *wq;
+	struct task_struct *task;
+	cpumask_var_t mask;
+
+	wq = create_singlethread_workqueue(name);
+	if (!wq)
+		goto out;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		goto err;
+			
+	if (sched_getaffinity(current->pid, mask))
+		goto err;
+
+	task = get_singlethread_wq_task(wq);
+	if (sched_setaffinity(task->pid, mask))
+		goto err;
+
+	if (cgroup_attach_task_current_cg(task))
+		goto err;
+out:	
+	return wq;
+err:
+	destroy_workqueue(wq);
+	wq = NULL;
+	goto out;
+}
+EXPORT_SYMBOL_GPL(create_singlethread_workqueue_in_current_cg);
+
 /* If it's single threaded, it isn't in the list of workqueues. */
 static inline int is_wq_single_threaded(struct workqueue_struct *wq)
 {
	


^ permalink raw reply related

* [ PATCH 3/3] vhost: make it more scalable by creating a vhost thread per device
From: Sridhar Samudrala @ 2010-05-19  0:04 UTC (permalink / raw)
  To: Michael S. Tsirkin, netdev, lkml, kvm@vger.kernel.org

Make vhost more scalable by creating a separate vhost thread per
vhost device. This provides better scaling across virtio-net interfaces
in multiple guests.

Also attach each vhost thread to the cgroup and cpumask of the 
associated guest(qemu or libvirt).

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9777583..18bf5be 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -329,19 +329,22 @@ static void handle_rx_net(struct work_struct *work)
 static int vhost_net_open(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
+	struct vhost_dev *dev;
 	int r;
 	if (!n)
 		return -ENOMEM;
+
+	dev = &n->dev;
 	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
 	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(&n->dev, n->vqs, VHOST_NET_VQ_MAX);
+	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
 		return r;
 	}
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
 
 	f->private_data = n;
@@ -644,25 +647,14 @@ static struct miscdevice vhost_net_misc = {
 
 int vhost_net_init(void)
 {
-	int r = vhost_init();
-	if (r)
-		goto err_init;
-	r = misc_register(&vhost_net_misc);
-	if (r)
-		goto err_reg;
-	return 0;
-err_reg:
-	vhost_cleanup();
-err_init:
-	return r;
-
+	return misc_register(&vhost_net_misc);
 }
+
 module_init(vhost_net_init);
 
 void vhost_net_exit(void)
 {
 	misc_deregister(&vhost_net_misc);
-	vhost_cleanup();
 }
 module_exit(vhost_net_exit);
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 49fa953..b076b9b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -37,8 +37,6 @@ enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };
 
-static struct workqueue_struct *vhost_workqueue;
-
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
 {
@@ -57,18 +55,19 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 	if (!((unsigned long)key & poll->mask))
 		return 0;
 
-	queue_work(vhost_workqueue, &poll->work);
+	queue_work(poll->dev->wq, &poll->work);
 	return 0;
 }
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-		     unsigned long mask)
+		     unsigned long mask, struct vhost_dev *dev)
 {
 	INIT_WORK(&poll->work, func);
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
+	poll->dev = dev;
 }
 
 /* Start polling a file. We add ourselves to file's wait queue. The caller must
@@ -97,7 +96,7 @@ void vhost_poll_flush(struct vhost_poll *poll)
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	queue_work(vhost_workqueue, &poll->work);
+	queue_work(poll->dev->wq, &poll->work);
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -129,6 +128,13 @@ long vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue *vqs, int nvqs)
 {
 	int i;
+	char vhost_name[20];
+
+	snprintf(vhost_name, 20, "vhost-%d", current->pid);
+	dev->wq = create_singlethread_workqueue_in_current_cg(vhost_name);
+	if (!dev->wq)
+		return -ENOMEM;
+
 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
 	mutex_init(&dev->mutex);
@@ -144,7 +150,7 @@ long vhost_dev_init(struct vhost_dev *dev,
 		if (dev->vqs[i].handle_kick)
 			vhost_poll_init(&dev->vqs[i].poll,
 					dev->vqs[i].handle_kick,
-					POLLIN);
+					POLLIN, dev);
 	}
 	return 0;
 }
@@ -217,6 +223,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	if (dev->mm)
 		mmput(dev->mm);
 	dev->mm = NULL;
+
+	destroy_workqueue(dev->wq);
 }
 
 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -1105,16 +1113,3 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
 		vq_err(vq, "Failed to enable notification at %p: %d\n",
 		       &vq->used->flags, r);
 }
-
-int vhost_init(void)
-{
-	vhost_workqueue = create_singlethread_workqueue("vhost");
-	if (!vhost_workqueue)
-		return -ENOMEM;
-	return 0;
-}
-
-void vhost_cleanup(void)
-{
-	destroy_workqueue(vhost_workqueue);
-}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 44591ba..60fefd0 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -29,10 +29,11 @@ struct vhost_poll {
 	/* struct which will handle all actual work. */
 	struct work_struct        work;
 	unsigned long		  mask;
+	struct vhost_dev	 *dev;
 };
 
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-		     unsigned long mask);
+		     unsigned long mask, struct vhost_dev *dev);
 void vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -110,6 +111,7 @@ struct vhost_dev {
 	int nvqs;
 	struct file *log_file;
 	struct eventfd_ctx *log_ctx;
+	struct workqueue_struct *wq;
 };
 
 long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
@@ -136,9 +138,6 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
 
-int vhost_init(void);
-void vhost_cleanup(void);
-
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
 		if ((vq)->error_ctx)                               \

^ permalink raw reply related

* [PATCH 1/3] cgroups: Add an API to attach a task to current task's cgroup
From: Sridhar Samudrala @ 2010-05-19  0:04 UTC (permalink / raw)
  To: Michael S. Tsirkin, netdev, kvm@vger.kernel.org, lkml

Add a new kernel API to attach a task to current task's cgroup
in all the active hierarchies.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -570,6 +570,7 @@ struct task_struct *cgroup_iter_next(struct cgroup *cgrp,
 void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
+int cgroup_attach_task_current_cg(struct task_struct *);
 
 /*
  * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6d870f2..6cfeb06 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1788,6 +1788,29 @@ out:
 	return retval;
 }
 
+/**
+ * cgroup_attach_task_current_cg - attach task 'tsk' to current task's cgroup
+ * @tsk: the task to be attached
+ */
+int cgroup_attach_task_current_cg(struct task_struct *tsk)
+{
+	struct cgroupfs_root *root;
+	struct cgroup *cur_cg;
+	int retval = 0;
+
+	cgroup_lock();
+	for_each_active_root(root) {
+		cur_cg = task_cgroup_from_root(current, root);
+		retval = cgroup_attach_task(cur_cg, tsk);
+		if (retval)
+			break;
+	}
+	cgroup_unlock();
+
+	return retval;
+}
+EXPORT_SYMBOL_GPL(cgroup_attach_task_current_cg);
+
 /*
  * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
  * held. May take task_lock of task



^ permalink raw reply related

* [PATCH 0/3] Make vhost multi-threaded and associate each thread to its guest's cgroup/cpumask
From: Sridhar Samudrala @ 2010-05-19  0:04 UTC (permalink / raw)
  To: Michael S. Tsirkin, netdev, lkml, kvm@vger.kernel.org

The following set of patches create a new API to associate a
workqueue to the current thread's cgroup and cpumask.
This API is used by multi-threaded vhost to associate each
thread to the corresponding guest's cgroup and cpumask.

Thanks
Sridhar

^ permalink raw reply

* Re: [0/4] Fix addrconf race conditions
From: Herbert Xu @ 2010-05-18 22:35 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, jbohac, yoshfuji, netdev
In-Reply-To: <20100518.152759.218054760.davem@davemloft.net>

On Tue, May 18, 2010 at 03:27:59PM -0700, David Miller wrote:
> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Tue, 18 May 2010 10:25:50 -0700
> 
> > I wonder if so many fine grained locks are really necessary at
> > all.  Everything but timers looks like it is under RTNL mutex
> > already.
> 
> I can't see any reasonable alternative, and it took weeks to get
> a fix at all.

Right, the issue is with forcing an immediate effect after an
NDISC packet triggers an address addition/deletion.

>From what I can gather, the hard case was with NDISC address
addition with DAD disabled.  That is, you have an NDISC packet
that causes an addition followed immediately by a packet destined
(or otherwise relying on) that new address.  In that case, we
simply have no chance of using the RTNL.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [0/4] Fix addrconf race conditions
From: David Miller @ 2010-05-18 22:27 UTC (permalink / raw)
  To: shemminger; +Cc: herbert, jbohac, yoshfuji, netdev
In-Reply-To: <20100518102550.65ad3fdd@nehalam>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Tue, 18 May 2010 10:25:50 -0700

> I wonder if so many fine grained locks are really necessary at
> all.  Everything but timers looks like it is under RTNL mutex
> already.

I can't see any reasonable alternative, and it took weeks to get
a fix at all.

So I'm going to apply Herbert's fixes with the __KERNEL__ header
bit fixed up.

Thanks Herbert!

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox