Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH V4 1/5] cxgb4: Add T4 filter support
From: Vipul Pandya @ 2012-12-10  9:30 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-BHEL68pLQRGGvPXPguhicg, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	divy-ut6Up61K2wZBDgjK7y7TUQ, dm-ut6Up61K2wZBDgjK7y7TUQ,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	abhishek-ut6Up61K2wZBDgjK7y7TUQ, Vipul Pandya
In-Reply-To: <1355131856-29142-1-git-send-email-vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

The T4 architecture is capable of filtering ingress packets at line rate
using the rule in TCAM. If packet hits a rule in the TCAM then it can be either
dropped or passed to the receive queues based on a rule settings.

This patch adds framework for managing filters and to use T4's filter
capabilities. It constructs a Firmware Filter Work Request which writes the
filter at a specified index to get the work done. It hosts shadow copy of
ingress filter entry to check field size limitations and save memory in the
case where the filter table is large.

Signed-off-by: Vipul Pandya <vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
---
V2: Changed commenting style from kernel-doc format('/**') to normal
V3: Resending the whole patch series again with the missing patches in V2
V4: Changed comments for networking from '/*' to '/* Comment' format.

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |  131 ++++++++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  294 ++++++++++++++++++++++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |    2 +
 drivers/net/ethernet/chelsio/cxgb4/l2t.c        |   32 +++
 drivers/net/ethernet/chelsio/cxgb4/l2t.h        |    3 +
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c      |   22 ++-
 drivers/net/ethernet/chelsio/cxgb4/t4_msg.h     |    1 +
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   |  279 +++++++++++++++++++++
 8 files changed, 757 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 378988b..24ce797 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -545,6 +545,129 @@ struct adapter {
 	spinlock_t stats_lock;
 };
 
+/* Defined bit width of user definable filter tuples
+ */
+#define ETHTYPE_BITWIDTH 16
+#define FRAG_BITWIDTH 1
+#define MACIDX_BITWIDTH 9
+#define FCOE_BITWIDTH 1
+#define IPORT_BITWIDTH 3
+#define MATCHTYPE_BITWIDTH 3
+#define PROTO_BITWIDTH 8
+#define TOS_BITWIDTH 8
+#define PF_BITWIDTH 8
+#define VF_BITWIDTH 8
+#define IVLAN_BITWIDTH 16
+#define OVLAN_BITWIDTH 16
+
+/* Filter matching rules.  These consist of a set of ingress packet field
+ * (value, mask) tuples.  The associated ingress packet field matches the
+ * tuple when ((field & mask) == value).  (Thus a wildcard "don't care" field
+ * rule can be constructed by specifying a tuple of (0, 0).)  A filter rule
+ * matches an ingress packet when all of the individual individual field
+ * matching rules are true.
+ *
+ * Partial field masks are always valid, however, while it may be easy to
+ * understand their meanings for some fields (e.g. IP address to match a
+ * subnet), for others making sensible partial masks is less intuitive (e.g.
+ * MPS match type) ...
+ *
+ * Most of the following data structures are modeled on T4 capabilities.
+ * Drivers for earlier chips use the subsets which make sense for those chips.
+ * We really need to come up with a hardware-independent mechanism to
+ * represent hardware filter capabilities ...
+ */
+struct ch_filter_tuple {
+	/* Compressed header matching field rules.  The TP_VLAN_PRI_MAP
+	 * register selects which of these fields will participate in the
+	 * filter match rules -- up to a maximum of 36 bits.  Because
+	 * TP_VLAN_PRI_MAP is a global register, all filters must use the same
+	 * set of fields.
+	 */
+	uint32_t ethtype:ETHTYPE_BITWIDTH;      /* Ethernet type */
+	uint32_t frag:FRAG_BITWIDTH;            /* IP fragmentation header */
+	uint32_t ivlan_vld:1;                   /* inner VLAN valid */
+	uint32_t ovlan_vld:1;                   /* outer VLAN valid */
+	uint32_t pfvf_vld:1;                    /* PF/VF valid */
+	uint32_t macidx:MACIDX_BITWIDTH;        /* exact match MAC index */
+	uint32_t fcoe:FCOE_BITWIDTH;            /* FCoE packet */
+	uint32_t iport:IPORT_BITWIDTH;          /* ingress port */
+	uint32_t matchtype:MATCHTYPE_BITWIDTH;  /* MPS match type */
+	uint32_t proto:PROTO_BITWIDTH;          /* protocol type */
+	uint32_t tos:TOS_BITWIDTH;              /* TOS/Traffic Type */
+	uint32_t pf:PF_BITWIDTH;                /* PCI-E PF ID */
+	uint32_t vf:VF_BITWIDTH;                /* PCI-E VF ID */
+	uint32_t ivlan:IVLAN_BITWIDTH;          /* inner VLAN */
+	uint32_t ovlan:OVLAN_BITWIDTH;          /* outer VLAN */
+
+	/* Uncompressed header matching field rules.  These are always
+	 * available for field rules.
+	 */
+	uint8_t lip[16];        /* local IP address (IPv4 in [3:0]) */
+	uint8_t fip[16];        /* foreign IP address (IPv4 in [3:0]) */
+	uint16_t lport;         /* local port */
+	uint16_t fport;         /* foreign port */
+};
+
+/* A filter ioctl command.
+ */
+struct ch_filter_specification {
+	/* Administrative fields for filter.
+	 */
+	uint32_t hitcnts:1;     /* count filter hits in TCB */
+	uint32_t prio:1;        /* filter has priority over active/server */
+
+	/* Fundamental filter typing.  This is the one element of filter
+	 * matching that doesn't exist as a (value, mask) tuple.
+	 */
+	uint32_t type:1;        /* 0 => IPv4, 1 => IPv6 */
+
+	/* Packet dispatch information.  Ingress packets which match the
+	 * filter rules will be dropped, passed to the host or switched back
+	 * out as egress packets.
+	 */
+	uint32_t action:2;      /* drop, pass, switch */
+
+	uint32_t rpttid:1;      /* report TID in RSS hash field */
+
+	uint32_t dirsteer:1;    /* 0 => RSS, 1 => steer to iq */
+	uint32_t iq:10;         /* ingress queue */
+
+	uint32_t maskhash:1;    /* dirsteer=0: store RSS hash in TCB */
+	uint32_t dirsteerhash:1;/* dirsteer=1: 0 => TCB contains RSS hash */
+				/*             1 => TCB contains IQ ID */
+
+	/* Switch proxy/rewrite fields.  An ingress packet which matches a
+	 * filter with "switch" set will be looped back out as an egress
+	 * packet -- potentially with some Ethernet header rewriting.
+	 */
+	uint32_t eport:2;       /* egress port to switch packet out */
+	uint32_t newdmac:1;     /* rewrite destination MAC address */
+	uint32_t newsmac:1;     /* rewrite source MAC address */
+	uint32_t newvlan:2;     /* rewrite VLAN Tag */
+	uint8_t dmac[ETH_ALEN]; /* new destination MAC address */
+	uint8_t smac[ETH_ALEN]; /* new source MAC address */
+	uint16_t vlan;          /* VLAN Tag to insert */
+
+	/* Filter rule value/mask pairs.
+	 */
+	struct ch_filter_tuple val;
+	struct ch_filter_tuple mask;
+};
+
+enum {
+	FILTER_PASS = 0,        /* default */
+	FILTER_DROP,
+	FILTER_SWITCH
+};
+
+enum {
+	VLAN_NOCHANGE = 0,      /* default */
+	VLAN_REMOVE,
+	VLAN_INSERT,
+	VLAN_REWRITE
+};
+
 static inline u32 t4_read_reg(struct adapter *adap, u32 reg_addr)
 {
 	return readl(adap->regs + reg_addr);
@@ -701,6 +824,12 @@ static inline int t4_wr_mbox_ns(struct adapter *adap, int mbox, const void *cmd,
 void t4_write_indirect(struct adapter *adap, unsigned int addr_reg,
 		       unsigned int data_reg, const u32 *vals,
 		       unsigned int nregs, unsigned int start_idx);
+void t4_read_indirect(struct adapter *adap, unsigned int addr_reg,
+		      unsigned int data_reg, u32 *vals, unsigned int nregs,
+		      unsigned int start_idx);
+
+struct fw_filter_wr;
+
 void t4_intr_enable(struct adapter *adapter);
 void t4_intr_disable(struct adapter *adapter);
 int t4_slow_intr_handler(struct adapter *adapter);
@@ -737,6 +866,8 @@ void t4_tp_get_tcp_stats(struct adapter *adap, struct tp_tcp_stats *v4,
 void t4_load_mtus(struct adapter *adap, const unsigned short *mtus,
 		  const unsigned short *alpha, const unsigned short *beta);
 
+void t4_mk_filtdelwr(unsigned int ftid, struct fw_filter_wr *wr, int qid);
+
 void t4_wol_magic_enable(struct adapter *adap, unsigned int port,
 			 const u8 *addr);
 int t4_wol_pat_enable(struct adapter *adap, unsigned int port, unsigned int map,
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 0df1284..c2f1307 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -175,6 +175,30 @@ enum {
 	MIN_FL_ENTRIES       = 16
 };
 
+/* Host shadow copy of ingress filter entry.  This is in host native format
+ * and doesn't match the ordering or bit order, etc. of the hardware of the
+ * firmware command.  The use of bit-field structure elements is purely to
+ * remind ourselves of the field size limitations and save memory in the case
+ * where the filter table is large.
+ */
+struct filter_entry {
+	/* Administrative fields for filter.
+	 */
+	u32 valid:1;            /* filter allocated and valid */
+	u32 locked:1;           /* filter is administratively locked */
+
+	u32 pending:1;          /* filter action is pending firmware reply */
+	u32 smtidx:8;           /* Source MAC Table index for smac */
+	struct l2t_entry *l2t;  /* Layer Two Table entry for dmac */
+
+	/* The filter itself.  Most of this is a straight copy of information
+	 * provided by the extended ioctl().  Some fields are translated to
+	 * internal forms -- for instance the Ingress Queue ID passed in from
+	 * the ioctl() is translated into the Absolute Ingress Queue ID.
+	 */
+	struct ch_filter_specification fs;
+};
+
 #define DFLT_MSG_ENABLE (NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK | \
 			 NETIF_MSG_TIMER | NETIF_MSG_IFDOWN | NETIF_MSG_IFUP |\
 			 NETIF_MSG_RX_ERR | NETIF_MSG_TX_ERR)
@@ -325,6 +349,9 @@ enum {
 
 static unsigned int tp_vlan_pri_map = TP_VLAN_PRI_MAP_DEFAULT;
 
+module_param(tp_vlan_pri_map, uint, 0644);
+MODULE_PARM_DESC(tp_vlan_pri_map, "global compressed filter configuration");
+
 static struct dentry *cxgb4_debugfs_root;
 
 static LIST_HEAD(adapter_list);
@@ -506,8 +533,67 @@ static int link_start(struct net_device *dev)
 	return ret;
 }
 
-/*
- * Response queue handler for the FW event queue.
+/* Clear a filter and release any of its resources that we own.  This also
+ * clears the filter's "pending" status.
+ */
+static void clear_filter(struct adapter *adap, struct filter_entry *f)
+{
+	/* If the new or old filter have loopback rewriteing rules then we'll
+	 * need to free any existing Layer Two Table (L2T) entries of the old
+	 * filter rule.  The firmware will handle freeing up any Source MAC
+	 * Table (SMT) entries used for rewriting Source MAC Addresses in
+	 * loopback rules.
+	 */
+	if (f->l2t)
+		cxgb4_l2t_release(f->l2t);
+
+	/* The zeroing of the filter rule below clears the filter valid,
+	 * pending, locked flags, l2t pointer, etc. so it's all we need for
+	 * this operation.
+	 */
+	memset(f, 0, sizeof(*f));
+}
+
+/* Handle a filter write/deletion reply.
+ */
+static void filter_rpl(struct adapter *adap, const struct cpl_set_tcb_rpl *rpl)
+{
+	unsigned int idx = GET_TID(rpl);
+	unsigned int nidx = idx - adap->tids.ftid_base;
+	unsigned int ret;
+	struct filter_entry *f;
+
+	if (idx >= adap->tids.ftid_base && nidx <
+	   (adap->tids.nftids + adap->tids.nsftids)) {
+		idx = nidx;
+		ret = GET_TCB_COOKIE(rpl->cookie);
+		f = &adap->tids.ftid_tab[idx];
+
+		if (ret == FW_FILTER_WR_FLT_DELETED) {
+			/* Clear the filter when we get confirmation from the
+			 * hardware that the filter has been deleted.
+			 */
+			clear_filter(adap, f);
+		} else if (ret == FW_FILTER_WR_SMT_TBL_FULL) {
+			dev_err(adap->pdev_dev, "filter %u setup failed due to full SMT\n",
+				idx);
+			clear_filter(adap, f);
+		} else if (ret == FW_FILTER_WR_FLT_ADDED) {
+			f->smtidx = (be64_to_cpu(rpl->oldval) >> 24) & 0xff;
+			f->pending = 0;  /* asynchronous setup completed */
+			f->valid = 1;
+		} else {
+			/* Something went wrong.  Issue a warning about the
+			 * problem and clear everything out.
+			 */
+			dev_err(adap->pdev_dev, "filter %u setup failed with error %u\n",
+				idx, ret);
+			clear_filter(adap, f);
+		}
+	}
+}
+
+/* Response queue handler for the FW event queue.
  */
 static int fwevtq_handler(struct sge_rspq *q, const __be64 *rsp,
 			  const struct pkt_gl *gl)
@@ -542,6 +628,10 @@ static int fwevtq_handler(struct sge_rspq *q, const __be64 *rsp,
 		const struct cpl_l2t_write_rpl *p = (void *)rsp;
 
 		do_l2t_write_rpl(q->adap, p);
+	} else if (opcode == CPL_SET_TCB_RPL) {
+		const struct cpl_set_tcb_rpl *p = (void *)rsp;
+
+		filter_rpl(q->adap, p);
 	} else
 		dev_err(q->adap->pdev_dev,
 			"unexpected CPL %#x on FW event queue\n", opcode);
@@ -983,6 +1073,148 @@ static void t4_free_mem(void *addr)
 		kfree(addr);
 }
 
+/* Send a Work Request to write the filter at a specified index.  We construct
+ * a Firmware Filter Work Request to have the work done and put the indicated
+ * filter into "pending" mode which will prevent any further actions against
+ * it till we get a reply from the firmware on the completion status of the
+ * request.
+ */
+static int set_filter_wr(struct adapter *adapter, int fidx)
+{
+	struct filter_entry *f = &adapter->tids.ftid_tab[fidx];
+	struct sk_buff *skb;
+	struct fw_filter_wr *fwr;
+	unsigned int ftid;
+
+	/* If the new filter requires loopback Destination MAC and/or VLAN
+	 * rewriting then we need to allocate a Layer 2 Table (L2T) entry for
+	 * the filter.
+	 */
+	if (f->fs.newdmac || f->fs.newvlan) {
+		/* allocate L2T entry for new filter */
+		f->l2t = t4_l2t_alloc_switching(adapter->l2t);
+		if (f->l2t == NULL)
+			return -EAGAIN;
+		if (t4_l2t_set_switching(adapter, f->l2t, f->fs.vlan,
+					f->fs.eport, f->fs.dmac)) {
+			cxgb4_l2t_release(f->l2t);
+			f->l2t = NULL;
+			return -ENOMEM;
+		}
+	}
+
+	ftid = adapter->tids.ftid_base + fidx;
+
+	skb = alloc_skb(sizeof(*fwr), GFP_KERNEL | __GFP_NOFAIL);
+	fwr = (struct fw_filter_wr *)__skb_put(skb, sizeof(*fwr));
+	memset(fwr, 0, sizeof(*fwr));
+
+	/* It would be nice to put most of the following in t4_hw.c but most
+	 * of the work is translating the cxgbtool ch_filter_specification
+	 * into the Work Request and the definition of that structure is
+	 * currently in cxgbtool.h which isn't appropriate to pull into the
+	 * common code.  We may eventually try to come up with a more neutral
+	 * filter specification structure but for now it's easiest to simply
+	 * put this fairly direct code in line ...
+	 */
+	fwr->op_pkd = htonl(FW_WR_OP(FW_FILTER_WR));
+	fwr->len16_pkd = htonl(FW_WR_LEN16(sizeof(*fwr)/16));
+	fwr->tid_to_iq =
+		htonl(V_FW_FILTER_WR_TID(ftid) |
+		      V_FW_FILTER_WR_RQTYPE(f->fs.type) |
+		      V_FW_FILTER_WR_NOREPLY(0) |
+		      V_FW_FILTER_WR_IQ(f->fs.iq));
+	fwr->del_filter_to_l2tix =
+		htonl(V_FW_FILTER_WR_RPTTID(f->fs.rpttid) |
+		      V_FW_FILTER_WR_DROP(f->fs.action == FILTER_DROP) |
+		      V_FW_FILTER_WR_DIRSTEER(f->fs.dirsteer) |
+		      V_FW_FILTER_WR_MASKHASH(f->fs.maskhash) |
+		      V_FW_FILTER_WR_DIRSTEERHASH(f->fs.dirsteerhash) |
+		      V_FW_FILTER_WR_LPBK(f->fs.action == FILTER_SWITCH) |
+		      V_FW_FILTER_WR_DMAC(f->fs.newdmac) |
+		      V_FW_FILTER_WR_SMAC(f->fs.newsmac) |
+		      V_FW_FILTER_WR_INSVLAN(f->fs.newvlan == VLAN_INSERT ||
+					     f->fs.newvlan == VLAN_REWRITE) |
+		      V_FW_FILTER_WR_RMVLAN(f->fs.newvlan == VLAN_REMOVE ||
+					    f->fs.newvlan == VLAN_REWRITE) |
+		      V_FW_FILTER_WR_HITCNTS(f->fs.hitcnts) |
+		      V_FW_FILTER_WR_TXCHAN(f->fs.eport) |
+		      V_FW_FILTER_WR_PRIO(f->fs.prio) |
+		      V_FW_FILTER_WR_L2TIX(f->l2t ? f->l2t->idx : 0));
+	fwr->ethtype = htons(f->fs.val.ethtype);
+	fwr->ethtypem = htons(f->fs.mask.ethtype);
+	fwr->frag_to_ovlan_vldm =
+		(V_FW_FILTER_WR_FRAG(f->fs.val.frag) |
+		 V_FW_FILTER_WR_FRAGM(f->fs.mask.frag) |
+		 V_FW_FILTER_WR_IVLAN_VLD(f->fs.val.ivlan_vld) |
+		 V_FW_FILTER_WR_OVLAN_VLD(f->fs.val.ovlan_vld) |
+		 V_FW_FILTER_WR_IVLAN_VLDM(f->fs.mask.ivlan_vld) |
+		 V_FW_FILTER_WR_OVLAN_VLDM(f->fs.mask.ovlan_vld));
+	fwr->smac_sel = 0;
+	fwr->rx_chan_rx_rpl_iq =
+		htons(V_FW_FILTER_WR_RX_CHAN(0) |
+		      V_FW_FILTER_WR_RX_RPL_IQ(adapter->sge.fw_evtq.abs_id));
+	fwr->maci_to_matchtypem =
+		htonl(V_FW_FILTER_WR_MACI(f->fs.val.macidx) |
+		      V_FW_FILTER_WR_MACIM(f->fs.mask.macidx) |
+		      V_FW_FILTER_WR_FCOE(f->fs.val.fcoe) |
+		      V_FW_FILTER_WR_FCOEM(f->fs.mask.fcoe) |
+		      V_FW_FILTER_WR_PORT(f->fs.val.iport) |
+		      V_FW_FILTER_WR_PORTM(f->fs.mask.iport) |
+		      V_FW_FILTER_WR_MATCHTYPE(f->fs.val.matchtype) |
+		      V_FW_FILTER_WR_MATCHTYPEM(f->fs.mask.matchtype));
+	fwr->ptcl = f->fs.val.proto;
+	fwr->ptclm = f->fs.mask.proto;
+	fwr->ttyp = f->fs.val.tos;
+	fwr->ttypm = f->fs.mask.tos;
+	fwr->ivlan = htons(f->fs.val.ivlan);
+	fwr->ivlanm = htons(f->fs.mask.ivlan);
+	fwr->ovlan = htons(f->fs.val.ovlan);
+	fwr->ovlanm = htons(f->fs.mask.ovlan);
+	memcpy(fwr->lip, f->fs.val.lip, sizeof(fwr->lip));
+	memcpy(fwr->lipm, f->fs.mask.lip, sizeof(fwr->lipm));
+	memcpy(fwr->fip, f->fs.val.fip, sizeof(fwr->fip));
+	memcpy(fwr->fipm, f->fs.mask.fip, sizeof(fwr->fipm));
+	fwr->lp = htons(f->fs.val.lport);
+	fwr->lpm = htons(f->fs.mask.lport);
+	fwr->fp = htons(f->fs.val.fport);
+	fwr->fpm = htons(f->fs.mask.fport);
+	if (f->fs.newsmac)
+		memcpy(fwr->sma, f->fs.smac, sizeof(fwr->sma));
+
+	/* Mark the filter as "pending" and ship off the Filter Work Request.
+	 * When we get the Work Request Reply we'll clear the pending status.
+	 */
+	f->pending = 1;
+	set_wr_txq(skb, CPL_PRIORITY_CONTROL, f->fs.val.iport & 0x3);
+	t4_ofld_send(adapter, skb);
+	return 0;
+}
+
+/* Delete the filter at a specified index.
+ */
+static int del_filter_wr(struct adapter *adapter, int fidx)
+{
+	struct filter_entry *f = &adapter->tids.ftid_tab[fidx];
+	struct sk_buff *skb;
+	struct fw_filter_wr *fwr;
+	unsigned int len, ftid;
+
+	len = sizeof(*fwr);
+	ftid = adapter->tids.ftid_base + fidx;
+
+	skb = alloc_skb(len, GFP_KERNEL | __GFP_NOFAIL);
+	fwr = (struct fw_filter_wr *)__skb_put(skb, len);
+	t4_mk_filtdelwr(ftid, fwr, adapter->sge.fw_evtq.abs_id);
+
+	/* Mark the filter as "pending" and ship off the Filter Work Request.
+	 * When we get the Work Request Reply we'll clear the pending status.
+	 */
+	f->pending = 1;
+	t4_mgmt_tx(adapter, skb);
+	return 0;
+}
+
 static inline int is_offload(const struct adapter *adap)
 {
 	return adap->params.offload;
@@ -2195,7 +2427,7 @@ int cxgb4_alloc_atid(struct tid_info *t, void *data)
 	if (t->afree) {
 		union aopen_entry *p = t->afree;
 
-		atid = p - t->atid_tab;
+		atid = (p - t->atid_tab) + t->atid_base;
 		t->afree = p->next;
 		p->data = data;
 		t->atids_in_use++;
@@ -2210,7 +2442,7 @@ EXPORT_SYMBOL(cxgb4_alloc_atid);
  */
 void cxgb4_free_atid(struct tid_info *t, unsigned int atid)
 {
-	union aopen_entry *p = &t->atid_tab[atid];
+	union aopen_entry *p = &t->atid_tab[atid - t->atid_base];
 
 	spin_lock_bh(&t->atid_lock);
 	p->next = t->afree;
@@ -2362,11 +2594,16 @@ EXPORT_SYMBOL(cxgb4_remove_tid);
 static int tid_init(struct tid_info *t)
 {
 	size_t size;
+	unsigned int stid_bmap_size;
 	unsigned int natids = t->natids;
 
-	size = t->ntids * sizeof(*t->tid_tab) + natids * sizeof(*t->atid_tab) +
+	stid_bmap_size = BITS_TO_LONGS(t->nstids);
+	size = t->ntids * sizeof(*t->tid_tab) +
+	       natids * sizeof(*t->atid_tab) +
 	       t->nstids * sizeof(*t->stid_tab) +
-	       BITS_TO_LONGS(t->nstids) * sizeof(long);
+	       stid_bmap_size * sizeof(long) +
+	       t->nftids * sizeof(*t->ftid_tab);
+
 	t->tid_tab = t4_alloc_mem(size);
 	if (!t->tid_tab)
 		return -ENOMEM;
@@ -2374,6 +2611,7 @@ static int tid_init(struct tid_info *t)
 	t->atid_tab = (union aopen_entry *)&t->tid_tab[t->ntids];
 	t->stid_tab = (struct serv_entry *)&t->atid_tab[natids];
 	t->stid_bmap = (unsigned long *)&t->stid_tab[t->nstids];
+	t->ftid_tab = (struct filter_entry *)&t->stid_bmap[stid_bmap_size];
 	spin_lock_init(&t->stid_lock);
 	spin_lock_init(&t->atid_lock);
 
@@ -2999,6 +3237,40 @@ static int cxgb_close(struct net_device *dev)
 	return t4_enable_vi(adapter, adapter->fn, pi->viid, false, false);
 }
 
+/* Return an error number if the indicated filter isn't writable ...
+ */
+static int writable_filter(struct filter_entry *f)
+{
+	if (f->locked)
+		return -EPERM;
+	if (f->pending)
+		return -EBUSY;
+
+	return 0;
+}
+
+/* Delete the filter at the specified index (if valid).  The checks for all
+ * the common problems with doing this like the filter being locked, currently
+ * pending in another operation, etc.
+ */
+static int delete_filter(struct adapter *adapter, unsigned int fidx)
+{
+	struct filter_entry *f;
+	int ret;
+
+	if (fidx >= adapter->tids.nftids)
+		return -EINVAL;
+
+	f = &adapter->tids.ftid_tab[fidx];
+	ret = writable_filter(f);
+	if (ret)
+		return ret;
+	if (f->valid)
+		return del_filter_wr(adapter, fidx);
+
+	return 0;
+}
+
 static struct rtnl_link_stats64 *cxgb_get_stats(struct net_device *dev,
 						struct rtnl_link_stats64 *ns)
 {
@@ -4662,6 +4934,16 @@ static void __devexit remove_one(struct pci_dev *pdev)
 		if (adapter->debugfs_root)
 			debugfs_remove_recursive(adapter->debugfs_root);
 
+		/* If we allocated filters, free up state associated with any
+		 * valid filters ...
+		 */
+		if (adapter->tids.ftid_tab) {
+			struct filter_entry *f = &adapter->tids.ftid_tab[0];
+			for (i = 0; i < adapter->tids.nftids; i++, f++)
+				if (f->valid)
+					clear_filter(adapter, f);
+		}
+
 		if (adapter->flags & FULL_INIT_DONE)
 			cxgb_down(adapter);
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
index 39bec73..59a6133 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
@@ -97,7 +97,9 @@ struct tid_info {
 
 	union aopen_entry *atid_tab;
 	unsigned int natids;
+	unsigned int atid_base;
 
+	struct filter_entry *ftid_tab;
 	unsigned int nftids;
 	unsigned int ftid_base;
 	unsigned int aftid_base;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/l2t.c b/drivers/net/ethernet/chelsio/cxgb4/l2t.c
index 6ac77a6..2987809 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/l2t.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/l2t.c
@@ -484,6 +484,38 @@ void t4_l2t_update(struct adapter *adap, struct neighbour *neigh)
 		handle_failed_resolution(adap, arpq);
 }
 
+/* Allocate an L2T entry for use by a switching rule.  Such need to be
+ * explicitly freed and while busy they are not on any hash chain, so normal
+ * address resolution updates do not see them.
+ */
+struct l2t_entry *t4_l2t_alloc_switching(struct l2t_data *d)
+{
+	struct l2t_entry *e;
+
+	write_lock_bh(&d->lock);
+	e = alloc_l2e(d);
+	if (e) {
+		spin_lock(&e->lock);          /* avoid race with t4_l2t_free */
+		e->state = L2T_STATE_SWITCHING;
+		atomic_set(&e->refcnt, 1);
+		spin_unlock(&e->lock);
+	}
+	write_unlock_bh(&d->lock);
+	return e;
+}
+
+/* Sets/updates the contents of a switching L2T entry that has been allocated
+ * with an earlier call to @t4_l2t_alloc_switching.
+ */
+int t4_l2t_set_switching(struct adapter *adap, struct l2t_entry *e, u16 vlan,
+		u8 port, u8 *eth_addr)
+{
+	e->vlan = vlan;
+	e->lport = port;
+	memcpy(e->dmac, eth_addr, ETH_ALEN);
+	return write_l2e(adap, e, 0);
+}
+
 struct l2t_data *t4_init_l2t(void)
 {
 	int i;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/l2t.h b/drivers/net/ethernet/chelsio/cxgb4/l2t.h
index 02b31d0..108c0f1 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/l2t.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/l2t.h
@@ -100,6 +100,9 @@ struct l2t_entry *cxgb4_l2t_get(struct l2t_data *d, struct neighbour *neigh,
 				unsigned int priority);
 
 void t4_l2t_update(struct adapter *adap, struct neighbour *neigh);
+struct l2t_entry *t4_l2t_alloc_switching(struct l2t_data *d);
+int t4_l2t_set_switching(struct adapter *adap, struct l2t_entry *e, u16 vlan,
+			 u8 port, u8 *eth_addr);
 struct l2t_data *t4_init_l2t(void);
 void do_l2t_write_rpl(struct adapter *p, const struct cpl_l2t_write_rpl *rpl);
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index 730ae2c..54e9cba 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -109,7 +109,7 @@ void t4_set_reg_field(struct adapter *adapter, unsigned int addr, u32 mask,
  *	Reads registers that are accessed indirectly through an address/data
  *	register pair.
  */
-static void t4_read_indirect(struct adapter *adap, unsigned int addr_reg,
+void t4_read_indirect(struct adapter *adap, unsigned int addr_reg,
 			     unsigned int data_reg, u32 *vals,
 			     unsigned int nregs, unsigned int start_idx)
 {
@@ -2268,6 +2268,26 @@ int t4_wol_pat_enable(struct adapter *adap, unsigned int port, unsigned int map,
 	return 0;
 }
 
+/*     t4_mk_filtdelwr - create a delete filter WR
+ *     @ftid: the filter ID
+ *     @wr: the filter work request to populate
+ *     @qid: ingress queue to receive the delete notification
+ *
+ *     Creates a filter work request to delete the supplied filter.  If @qid is
+ *     negative the delete notification is suppressed.
+ */
+void t4_mk_filtdelwr(unsigned int ftid, struct fw_filter_wr *wr, int qid)
+{
+	memset(wr, 0, sizeof(*wr));
+	wr->op_pkd = htonl(FW_WR_OP(FW_FILTER_WR));
+	wr->len16_pkd = htonl(FW_WR_LEN16(sizeof(*wr) / 16));
+	wr->tid_to_iq = htonl(V_FW_FILTER_WR_TID(ftid) |
+			V_FW_FILTER_WR_NOREPLY(qid < 0));
+	wr->del_filter_to_l2tix = htonl(F_FW_FILTER_WR_DEL_FILTER);
+	if (qid >= 0)
+		wr->rx_chan_rx_rpl_iq = htons(V_FW_FILTER_WR_RX_RPL_IQ(qid));
+}
+
 #define INIT_CMD(var, cmd, rd_wr) do { \
 	(var).op_to_write = htonl(FW_CMD_OP(FW_##cmd##_CMD) | \
 				  FW_CMD_REQUEST | FW_CMD_##rd_wr); \
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
index eb71b82..480eb43 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
@@ -332,6 +332,7 @@ struct cpl_set_tcb_field {
 	__be16 word_cookie;
 #define TCB_WORD(x)   ((x) << 0)
 #define TCB_COOKIE(x) ((x) << 5)
+#define GET_TCB_COOKIE(x) (((x) >> 5) & 7)
 	__be64 mask;
 	__be64 val;
 };
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
index a636463..b4dc560 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
@@ -35,6 +35,10 @@
 #ifndef _T4FW_INTERFACE_H_
 #define _T4FW_INTERFACE_H_
 
+enum fw_ret_val {
+	FW_ENOEXEC      = 8,            /* Exec format error; inv microcode */
+};
+
 #define FW_T4VF_SGE_BASE_ADDR      0x0000
 #define FW_T4VF_MPS_BASE_ADDR      0x0100
 #define FW_T4VF_PL_BASE_ADDR       0x0200
@@ -81,6 +85,281 @@ struct fw_wr_hdr {
 
 #define HW_TPL_FR_MT_PR_IV_P_FC         0X32B
 
+/* filter wr reply code in cookie in CPL_SET_TCB_RPL */
+enum fw_filter_wr_cookie {
+	FW_FILTER_WR_SUCCESS,
+	FW_FILTER_WR_FLT_ADDED,
+	FW_FILTER_WR_FLT_DELETED,
+	FW_FILTER_WR_SMT_TBL_FULL,
+	FW_FILTER_WR_EINVAL,
+};
+
+struct fw_filter_wr {
+	__be32 op_pkd;
+	__be32 len16_pkd;
+	__be64 r3;
+	__be32 tid_to_iq;
+	__be32 del_filter_to_l2tix;
+	__be16 ethtype;
+	__be16 ethtypem;
+	__u8   frag_to_ovlan_vldm;
+	__u8   smac_sel;
+	__be16 rx_chan_rx_rpl_iq;
+	__be32 maci_to_matchtypem;
+	__u8   ptcl;
+	__u8   ptclm;
+	__u8   ttyp;
+	__u8   ttypm;
+	__be16 ivlan;
+	__be16 ivlanm;
+	__be16 ovlan;
+	__be16 ovlanm;
+	__u8   lip[16];
+	__u8   lipm[16];
+	__u8   fip[16];
+	__u8   fipm[16];
+	__be16 lp;
+	__be16 lpm;
+	__be16 fp;
+	__be16 fpm;
+	__be16 r7;
+	__u8   sma[6];
+};
+
+#define S_FW_FILTER_WR_TID      12
+#define M_FW_FILTER_WR_TID      0xfffff
+#define V_FW_FILTER_WR_TID(x)   ((x) << S_FW_FILTER_WR_TID)
+#define G_FW_FILTER_WR_TID(x)   \
+	(((x) >> S_FW_FILTER_WR_TID) & M_FW_FILTER_WR_TID)
+
+#define S_FW_FILTER_WR_RQTYPE           11
+#define M_FW_FILTER_WR_RQTYPE           0x1
+#define V_FW_FILTER_WR_RQTYPE(x)        ((x) << S_FW_FILTER_WR_RQTYPE)
+#define G_FW_FILTER_WR_RQTYPE(x)        \
+	(((x) >> S_FW_FILTER_WR_RQTYPE) & M_FW_FILTER_WR_RQTYPE)
+#define F_FW_FILTER_WR_RQTYPE   V_FW_FILTER_WR_RQTYPE(1U)
+
+#define S_FW_FILTER_WR_NOREPLY          10
+#define M_FW_FILTER_WR_NOREPLY          0x1
+#define V_FW_FILTER_WR_NOREPLY(x)       ((x) << S_FW_FILTER_WR_NOREPLY)
+#define G_FW_FILTER_WR_NOREPLY(x)       \
+	(((x) >> S_FW_FILTER_WR_NOREPLY) & M_FW_FILTER_WR_NOREPLY)
+#define F_FW_FILTER_WR_NOREPLY  V_FW_FILTER_WR_NOREPLY(1U)
+
+#define S_FW_FILTER_WR_IQ       0
+#define M_FW_FILTER_WR_IQ       0x3ff
+#define V_FW_FILTER_WR_IQ(x)    ((x) << S_FW_FILTER_WR_IQ)
+#define G_FW_FILTER_WR_IQ(x)    \
+	(((x) >> S_FW_FILTER_WR_IQ) & M_FW_FILTER_WR_IQ)
+
+#define S_FW_FILTER_WR_DEL_FILTER       31
+#define M_FW_FILTER_WR_DEL_FILTER       0x1
+#define V_FW_FILTER_WR_DEL_FILTER(x)    ((x) << S_FW_FILTER_WR_DEL_FILTER)
+#define G_FW_FILTER_WR_DEL_FILTER(x)    \
+	(((x) >> S_FW_FILTER_WR_DEL_FILTER) & M_FW_FILTER_WR_DEL_FILTER)
+#define F_FW_FILTER_WR_DEL_FILTER       V_FW_FILTER_WR_DEL_FILTER(1U)
+
+#define S_FW_FILTER_WR_RPTTID           25
+#define M_FW_FILTER_WR_RPTTID           0x1
+#define V_FW_FILTER_WR_RPTTID(x)        ((x) << S_FW_FILTER_WR_RPTTID)
+#define G_FW_FILTER_WR_RPTTID(x)        \
+	(((x) >> S_FW_FILTER_WR_RPTTID) & M_FW_FILTER_WR_RPTTID)
+#define F_FW_FILTER_WR_RPTTID   V_FW_FILTER_WR_RPTTID(1U)
+
+#define S_FW_FILTER_WR_DROP     24
+#define M_FW_FILTER_WR_DROP     0x1
+#define V_FW_FILTER_WR_DROP(x)  ((x) << S_FW_FILTER_WR_DROP)
+#define G_FW_FILTER_WR_DROP(x)  \
+	(((x) >> S_FW_FILTER_WR_DROP) & M_FW_FILTER_WR_DROP)
+#define F_FW_FILTER_WR_DROP     V_FW_FILTER_WR_DROP(1U)
+
+#define S_FW_FILTER_WR_DIRSTEER         23
+#define M_FW_FILTER_WR_DIRSTEER         0x1
+#define V_FW_FILTER_WR_DIRSTEER(x)      ((x) << S_FW_FILTER_WR_DIRSTEER)
+#define G_FW_FILTER_WR_DIRSTEER(x)      \
+	(((x) >> S_FW_FILTER_WR_DIRSTEER) & M_FW_FILTER_WR_DIRSTEER)
+#define F_FW_FILTER_WR_DIRSTEER V_FW_FILTER_WR_DIRSTEER(1U)
+
+#define S_FW_FILTER_WR_MASKHASH         22
+#define M_FW_FILTER_WR_MASKHASH         0x1
+#define V_FW_FILTER_WR_MASKHASH(x)      ((x) << S_FW_FILTER_WR_MASKHASH)
+#define G_FW_FILTER_WR_MASKHASH(x)      \
+	(((x) >> S_FW_FILTER_WR_MASKHASH) & M_FW_FILTER_WR_MASKHASH)
+#define F_FW_FILTER_WR_MASKHASH V_FW_FILTER_WR_MASKHASH(1U)
+
+#define S_FW_FILTER_WR_DIRSTEERHASH     21
+#define M_FW_FILTER_WR_DIRSTEERHASH     0x1
+#define V_FW_FILTER_WR_DIRSTEERHASH(x)  ((x) << S_FW_FILTER_WR_DIRSTEERHASH)
+#define G_FW_FILTER_WR_DIRSTEERHASH(x)  \
+	(((x) >> S_FW_FILTER_WR_DIRSTEERHASH) & M_FW_FILTER_WR_DIRSTEERHASH)
+#define F_FW_FILTER_WR_DIRSTEERHASH     V_FW_FILTER_WR_DIRSTEERHASH(1U)
+
+#define S_FW_FILTER_WR_LPBK     20
+#define M_FW_FILTER_WR_LPBK     0x1
+#define V_FW_FILTER_WR_LPBK(x)  ((x) << S_FW_FILTER_WR_LPBK)
+#define G_FW_FILTER_WR_LPBK(x)  \
+	(((x) >> S_FW_FILTER_WR_LPBK) & M_FW_FILTER_WR_LPBK)
+#define F_FW_FILTER_WR_LPBK     V_FW_FILTER_WR_LPBK(1U)
+
+#define S_FW_FILTER_WR_DMAC     19
+#define M_FW_FILTER_WR_DMAC     0x1
+#define V_FW_FILTER_WR_DMAC(x)  ((x) << S_FW_FILTER_WR_DMAC)
+#define G_FW_FILTER_WR_DMAC(x)  \
+	(((x) >> S_FW_FILTER_WR_DMAC) & M_FW_FILTER_WR_DMAC)
+#define F_FW_FILTER_WR_DMAC     V_FW_FILTER_WR_DMAC(1U)
+
+#define S_FW_FILTER_WR_SMAC     18
+#define M_FW_FILTER_WR_SMAC     0x1
+#define V_FW_FILTER_WR_SMAC(x)  ((x) << S_FW_FILTER_WR_SMAC)
+#define G_FW_FILTER_WR_SMAC(x)  \
+	(((x) >> S_FW_FILTER_WR_SMAC) & M_FW_FILTER_WR_SMAC)
+#define F_FW_FILTER_WR_SMAC     V_FW_FILTER_WR_SMAC(1U)
+
+#define S_FW_FILTER_WR_INSVLAN          17
+#define M_FW_FILTER_WR_INSVLAN          0x1
+#define V_FW_FILTER_WR_INSVLAN(x)       ((x) << S_FW_FILTER_WR_INSVLAN)
+#define G_FW_FILTER_WR_INSVLAN(x)       \
+	(((x) >> S_FW_FILTER_WR_INSVLAN) & M_FW_FILTER_WR_INSVLAN)
+#define F_FW_FILTER_WR_INSVLAN  V_FW_FILTER_WR_INSVLAN(1U)
+
+#define S_FW_FILTER_WR_RMVLAN           16
+#define M_FW_FILTER_WR_RMVLAN           0x1
+#define V_FW_FILTER_WR_RMVLAN(x)        ((x) << S_FW_FILTER_WR_RMVLAN)
+#define G_FW_FILTER_WR_RMVLAN(x)        \
+	(((x) >> S_FW_FILTER_WR_RMVLAN) & M_FW_FILTER_WR_RMVLAN)
+#define F_FW_FILTER_WR_RMVLAN   V_FW_FILTER_WR_RMVLAN(1U)
+
+#define S_FW_FILTER_WR_HITCNTS          15
+#define M_FW_FILTER_WR_HITCNTS          0x1
+#define V_FW_FILTER_WR_HITCNTS(x)       ((x) << S_FW_FILTER_WR_HITCNTS)
+#define G_FW_FILTER_WR_HITCNTS(x)       \
+	(((x) >> S_FW_FILTER_WR_HITCNTS) & M_FW_FILTER_WR_HITCNTS)
+#define F_FW_FILTER_WR_HITCNTS  V_FW_FILTER_WR_HITCNTS(1U)
+
+#define S_FW_FILTER_WR_TXCHAN           13
+#define M_FW_FILTER_WR_TXCHAN           0x3
+#define V_FW_FILTER_WR_TXCHAN(x)        ((x) << S_FW_FILTER_WR_TXCHAN)
+#define G_FW_FILTER_WR_TXCHAN(x)        \
+	(((x) >> S_FW_FILTER_WR_TXCHAN) & M_FW_FILTER_WR_TXCHAN)
+
+#define S_FW_FILTER_WR_PRIO     12
+#define M_FW_FILTER_WR_PRIO     0x1
+#define V_FW_FILTER_WR_PRIO(x)  ((x) << S_FW_FILTER_WR_PRIO)
+#define G_FW_FILTER_WR_PRIO(x)  \
+	(((x) >> S_FW_FILTER_WR_PRIO) & M_FW_FILTER_WR_PRIO)
+#define F_FW_FILTER_WR_PRIO     V_FW_FILTER_WR_PRIO(1U)
+
+#define S_FW_FILTER_WR_L2TIX    0
+#define M_FW_FILTER_WR_L2TIX    0xfff
+#define V_FW_FILTER_WR_L2TIX(x) ((x) << S_FW_FILTER_WR_L2TIX)
+#define G_FW_FILTER_WR_L2TIX(x) \
+	(((x) >> S_FW_FILTER_WR_L2TIX) & M_FW_FILTER_WR_L2TIX)
+
+#define S_FW_FILTER_WR_FRAG     7
+#define M_FW_FILTER_WR_FRAG     0x1
+#define V_FW_FILTER_WR_FRAG(x)  ((x) << S_FW_FILTER_WR_FRAG)
+#define G_FW_FILTER_WR_FRAG(x)  \
+	(((x) >> S_FW_FILTER_WR_FRAG) & M_FW_FILTER_WR_FRAG)
+#define F_FW_FILTER_WR_FRAG     V_FW_FILTER_WR_FRAG(1U)
+
+#define S_FW_FILTER_WR_FRAGM    6
+#define M_FW_FILTER_WR_FRAGM    0x1
+#define V_FW_FILTER_WR_FRAGM(x) ((x) << S_FW_FILTER_WR_FRAGM)
+#define G_FW_FILTER_WR_FRAGM(x) \
+	(((x) >> S_FW_FILTER_WR_FRAGM) & M_FW_FILTER_WR_FRAGM)
+#define F_FW_FILTER_WR_FRAGM    V_FW_FILTER_WR_FRAGM(1U)
+
+#define S_FW_FILTER_WR_IVLAN_VLD        5
+#define M_FW_FILTER_WR_IVLAN_VLD        0x1
+#define V_FW_FILTER_WR_IVLAN_VLD(x)     ((x) << S_FW_FILTER_WR_IVLAN_VLD)
+#define G_FW_FILTER_WR_IVLAN_VLD(x)     \
+	(((x) >> S_FW_FILTER_WR_IVLAN_VLD) & M_FW_FILTER_WR_IVLAN_VLD)
+#define F_FW_FILTER_WR_IVLAN_VLD        V_FW_FILTER_WR_IVLAN_VLD(1U)
+
+#define S_FW_FILTER_WR_OVLAN_VLD        4
+#define M_FW_FILTER_WR_OVLAN_VLD        0x1
+#define V_FW_FILTER_WR_OVLAN_VLD(x)     ((x) << S_FW_FILTER_WR_OVLAN_VLD)
+#define G_FW_FILTER_WR_OVLAN_VLD(x)     \
+	(((x) >> S_FW_FILTER_WR_OVLAN_VLD) & M_FW_FILTER_WR_OVLAN_VLD)
+#define F_FW_FILTER_WR_OVLAN_VLD        V_FW_FILTER_WR_OVLAN_VLD(1U)
+
+#define S_FW_FILTER_WR_IVLAN_VLDM       3
+#define M_FW_FILTER_WR_IVLAN_VLDM       0x1
+#define V_FW_FILTER_WR_IVLAN_VLDM(x)    ((x) << S_FW_FILTER_WR_IVLAN_VLDM)
+#define G_FW_FILTER_WR_IVLAN_VLDM(x)    \
+	(((x) >> S_FW_FILTER_WR_IVLAN_VLDM) & M_FW_FILTER_WR_IVLAN_VLDM)
+#define F_FW_FILTER_WR_IVLAN_VLDM       V_FW_FILTER_WR_IVLAN_VLDM(1U)
+
+#define S_FW_FILTER_WR_OVLAN_VLDM       2
+#define M_FW_FILTER_WR_OVLAN_VLDM       0x1
+#define V_FW_FILTER_WR_OVLAN_VLDM(x)    ((x) << S_FW_FILTER_WR_OVLAN_VLDM)
+#define G_FW_FILTER_WR_OVLAN_VLDM(x)    \
+	(((x) >> S_FW_FILTER_WR_OVLAN_VLDM) & M_FW_FILTER_WR_OVLAN_VLDM)
+#define F_FW_FILTER_WR_OVLAN_VLDM       V_FW_FILTER_WR_OVLAN_VLDM(1U)
+
+#define S_FW_FILTER_WR_RX_CHAN          15
+#define M_FW_FILTER_WR_RX_CHAN          0x1
+#define V_FW_FILTER_WR_RX_CHAN(x)       ((x) << S_FW_FILTER_WR_RX_CHAN)
+#define G_FW_FILTER_WR_RX_CHAN(x)       \
+	(((x) >> S_FW_FILTER_WR_RX_CHAN) & M_FW_FILTER_WR_RX_CHAN)
+#define F_FW_FILTER_WR_RX_CHAN  V_FW_FILTER_WR_RX_CHAN(1U)
+
+#define S_FW_FILTER_WR_RX_RPL_IQ        0
+#define M_FW_FILTER_WR_RX_RPL_IQ        0x3ff
+#define V_FW_FILTER_WR_RX_RPL_IQ(x)     ((x) << S_FW_FILTER_WR_RX_RPL_IQ)
+#define G_FW_FILTER_WR_RX_RPL_IQ(x)     \
+	(((x) >> S_FW_FILTER_WR_RX_RPL_IQ) & M_FW_FILTER_WR_RX_RPL_IQ)
+
+#define S_FW_FILTER_WR_MACI     23
+#define M_FW_FILTER_WR_MACI     0x1ff
+#define V_FW_FILTER_WR_MACI(x)  ((x) << S_FW_FILTER_WR_MACI)
+#define G_FW_FILTER_WR_MACI(x)  \
+	(((x) >> S_FW_FILTER_WR_MACI) & M_FW_FILTER_WR_MACI)
+
+#define S_FW_FILTER_WR_MACIM    14
+#define M_FW_FILTER_WR_MACIM    0x1ff
+#define V_FW_FILTER_WR_MACIM(x) ((x) << S_FW_FILTER_WR_MACIM)
+#define G_FW_FILTER_WR_MACIM(x) \
+	(((x) >> S_FW_FILTER_WR_MACIM) & M_FW_FILTER_WR_MACIM)
+
+#define S_FW_FILTER_WR_FCOE     13
+#define M_FW_FILTER_WR_FCOE     0x1
+#define V_FW_FILTER_WR_FCOE(x)  ((x) << S_FW_FILTER_WR_FCOE)
+#define G_FW_FILTER_WR_FCOE(x)  \
+	(((x) >> S_FW_FILTER_WR_FCOE) & M_FW_FILTER_WR_FCOE)
+#define F_FW_FILTER_WR_FCOE     V_FW_FILTER_WR_FCOE(1U)
+
+#define S_FW_FILTER_WR_FCOEM    12
+#define M_FW_FILTER_WR_FCOEM    0x1
+#define V_FW_FILTER_WR_FCOEM(x) ((x) << S_FW_FILTER_WR_FCOEM)
+#define G_FW_FILTER_WR_FCOEM(x) \
+	(((x) >> S_FW_FILTER_WR_FCOEM) & M_FW_FILTER_WR_FCOEM)
+#define F_FW_FILTER_WR_FCOEM    V_FW_FILTER_WR_FCOEM(1U)
+
+#define S_FW_FILTER_WR_PORT     9
+#define M_FW_FILTER_WR_PORT     0x7
+#define V_FW_FILTER_WR_PORT(x)  ((x) << S_FW_FILTER_WR_PORT)
+#define G_FW_FILTER_WR_PORT(x)  \
+	(((x) >> S_FW_FILTER_WR_PORT) & M_FW_FILTER_WR_PORT)
+
+#define S_FW_FILTER_WR_PORTM    6
+#define M_FW_FILTER_WR_PORTM    0x7
+#define V_FW_FILTER_WR_PORTM(x) ((x) << S_FW_FILTER_WR_PORTM)
+#define G_FW_FILTER_WR_PORTM(x) \
+	(((x) >> S_FW_FILTER_WR_PORTM) & M_FW_FILTER_WR_PORTM)
+
+#define S_FW_FILTER_WR_MATCHTYPE        3
+#define M_FW_FILTER_WR_MATCHTYPE        0x7
+#define V_FW_FILTER_WR_MATCHTYPE(x)     ((x) << S_FW_FILTER_WR_MATCHTYPE)
+#define G_FW_FILTER_WR_MATCHTYPE(x)     \
+	(((x) >> S_FW_FILTER_WR_MATCHTYPE) & M_FW_FILTER_WR_MATCHTYPE)
+
+#define S_FW_FILTER_WR_MATCHTYPEM       0
+#define M_FW_FILTER_WR_MATCHTYPEM       0x7
+#define V_FW_FILTER_WR_MATCHTYPEM(x)    ((x) << S_FW_FILTER_WR_MATCHTYPEM)
+#define G_FW_FILTER_WR_MATCHTYPEM(x)    \
+	(((x) >> S_FW_FILTER_WR_MATCHTYPEM) & M_FW_FILTER_WR_MATCHTYPEM)
+
 struct fw_ulptx_wr {
 	__be32 op_to_compl;
 	__be32 flowid_len16;
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH V4 0/5] Add LE hash collision bug fix for active and passive offloaded connections
From: Vipul Pandya @ 2012-12-10  9:30 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-BHEL68pLQRGGvPXPguhicg, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	divy-ut6Up61K2wZBDgjK7y7TUQ, dm-ut6Up61K2wZBDgjK7y7TUQ,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	abhishek-ut6Up61K2wZBDgjK7y7TUQ, Vipul Pandya

This patch series fixes the LE hash collision issue in cxgb4 and RDMA/cxgb4
drivers in kernel.org.

If the hash functionality is enabled in T4 then tuple information of active and
passive offloaded connections are stored in DDR3 memory. LE (Lookup Engine)
implements the interface to search this tuple entries using hash algorithm. To
avoid LE hash collision, firmware provides new mechanisms to setup active and
passive open connections.

In case of active open connection, firmware detects LE hash collision situation
and notifies driver. Driver uses fw_ofld_connection work request to offload
that connection. Its tuple information will be stored in TCAM memory array. 

In case of passive open connection, server filter region is created in TCAM.
This region stores the filter which will redirect the incoming SYN packet to
offload queues. After this driver tries to establish the connection using
firmware work request.

This patch series also adds framework for managing filters and to use T4's
filter capabilities.

Following testing has been carried out successfully on this patch series.
1. 1000 active open connections created and saw that tcam_full counter is
getting incremented through debugfs file.
2. Multiple passive open connections were created and ran the same test
repetatively to verify server filter is getting created and deleted properly.

The patch-series is based on Roland's infiniband tree (for-next branch),
and involves changes on cxgb4 and RDMA/cxgb4 drivers. Both linux-rdma and netdev
are included in this post for review.

We have some more bug fixes in RDMA/cxgb4 after this patch series. So, we would
like to request this series to get mereged through Roland's infiniband for-next
tree.

V2: Changed commenting style from kernel-doc format('/**') to normal
V3: Resending the whole patch series again with the missing patches in V2
V4: Changed comments for networking from '/*' to '/* Comment' format.

Thanks,
Vipul Pandya

Vipul Pandya (5):
  cxgb4: Add T4 filter support
  cxgb4: Add LE hash collision bug fix path in LLD driver
  RDMA/cxgb4: Fix LE hash collision bug for active open connection
  RDMA/cxgb4: Fix LE hash collision bug for passive open connection
  RDMA/cxgb4: Fix bug for active and passive LE hash collision path

 drivers/infiniband/hw/cxgb4/cm.c                |  789 +++++++++++++++++++---
 drivers/infiniband/hw/cxgb4/device.c            |  210 ++++++-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |   33 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |  146 +++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  473 +++++++++++++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |   23 +-
 drivers/net/ethernet/chelsio/cxgb4/l2t.c        |   34 +
 drivers/net/ethernet/chelsio/cxgb4/l2t.h        |    3 +
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c      |   23 +-
 drivers/net/ethernet/chelsio/cxgb4/t4_msg.h     |   66 ++
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |   37 ++
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   |  388 +++++++++++
 12 files changed, 2100 insertions(+), 125 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-10  9:29 UTC (permalink / raw)
  To: David Miller
  Cc: eric-EVVnsjFE0OfYtjvyW6yDsg, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	linville-2XuSBdqkA4R54TAoqtyWWQ, Eric Dumazet
In-Reply-To: <1354919017.9124.33.camel-8Nb76shvtaUJvtFkdXX2HixXY32XiHfO@public.gmane.org>

On Fri, 2012-12-07 at 23:23 +0100, Johannes Berg wrote:
> Oh I should say ... IP is like black magic to me ;-)
> 
> > -	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
> > +	if (!skb_copy_bits(skb, 0, &iph, sizeof(iph)))
> >  		return skb;
> >  
> > -	iph = ip_hdr(skb);
> > -	if (iph->ihl < 5 || iph->version != 4)
> > +	if (iph.ihl < 5 || iph.version != 4)
> >  		return skb;
> > -	if (!pskb_may_pull(skb, iph->ihl*4))
> > -		return skb;
> > -	iph = ip_hdr(skb);
> > -	len = ntohs(iph->tot_len);
> > -	if (skb->len < len || len < (iph->ihl * 4))
> > +
> > +	len = ntohs(iph.tot_len);
> > +	if (skb->len < len || len < (iph.ihl * 4))
> >  		return skb;
> >  
> > -	if (ip_is_fragment(ip_hdr(skb))) {
> > +	if (ip_is_fragment(&iph)) {
> >  		skb = skb_share_check(skb, GFP_ATOMIC);
> >  		if (skb) {
> > +			if (!pskb_may_pull(skb, iph.ihl*4))
> > +				return skb;
> 
> I moved this here but I have no idea what it does.

Well, ok, looking further it does seem kinda obvious -- ip_defrag()
assumes the IP header is (fully?) present in the skb header, so that's
what this does.

Eric (Leblond), could you test this patch?

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH net-next v3] tipc: sk_recv_queue size check only for connectionless sockets
From: Ying Xue @ 2012-12-10  9:23 UTC (permalink / raw)
  To: nhorman, Paul.Gortmaker, jon.maloy; +Cc: netdev, tipc-discussion

The sk_receive_queue limit control is currently performed for all
arriving messages, disregarding socket and message type. But for
connectionless sockets this check is redundant, since the protocol
flow already makes queue overflow impossible.

We move the sk_receive_queue limit control so that it's only performed
for connectionless sockets, i.e. SOCK_RDM and SOCK_DGRAM type sockets.

However, as Neil Horman specified, we cannot simply force the socket
receive queue limit against connectionless sockets as it may create a
DoS vulnerability. For example, if a sender floods a receiver with
messages containing an invalid set of message importance bits or
CRITICAL importance, we will queue messages indefinitely.

To avoid DoS attack, socket receive queue will be marked as overflow
if we receive messages with invalid message importances, meanwhile,
we also set one new threshold for CRITICAL importance messages.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
v3 changes:
 - set new threshold for CRITICAL message
 - defined an importance factor table to avoid multiplication and
   division operations in rx_queue_full().
 - changed return value of rx_queue_full() from integer to boolean.

 net/tipc/socket.c |   44 +++++++++++++++++++-------------------------
 1 files changed, 19 insertions(+), 25 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 9b4e483..a18a757 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -43,7 +43,7 @@
 #define SS_LISTENING	-1	/* socket is listening */
 #define SS_READY	-2	/* socket is connectionless */
 
-#define OVERLOAD_LIMIT_BASE	10000
+#define OVERLOAD_LIMIT_BASE	5000
 #define CONN_TIMEOUT_DEFAULT	8000	/* default connect timeout = 8s */
 
 struct tipc_sock {
@@ -73,6 +73,13 @@ static struct proto tipc_proto;
 
 static int sockets_enabled;
 
+static const u32 msg_importance_factor[] = {
+	OVERLOAD_LIMIT_BASE,		/* TIPC_LOW_IMPORTANCE limit */
+	OVERLOAD_LIMIT_BASE * 2,	/* TIPC_MEDIUM_IMPORTANCE limit */
+	OVERLOAD_LIMIT_BASE * 100,	/* TIPC_HIGH_IMPORTANCE limit */
+	OVERLOAD_LIMIT_BASE * 200	/* TIPC_CRITICAL_IMPORTANCE limit */
+	};
+
 /*
  * Revised TIPC socket locking policy:
  *
@@ -1158,28 +1165,17 @@ static void tipc_data_ready(struct sock *sk, int len)
  * rx_queue_full - determine if receive queue can accept another message
  * @msg: message to be added to queue
  * @queue_size: current size of queue
- * @base: nominal maximum size of queue
  *
- * Returns 1 if queue is unable to accept message, 0 otherwise
+ * Returns true if queue is unable to accept message, false otherwise
  */
-static int rx_queue_full(struct tipc_msg *msg, u32 queue_size, u32 base)
+static bool rx_queue_full(struct tipc_msg *msg, u32 queue_size)
 {
-	u32 threshold;
 	u32 imp = msg_importance(msg);
 
-	if (imp == TIPC_LOW_IMPORTANCE)
-		threshold = base;
-	else if (imp == TIPC_MEDIUM_IMPORTANCE)
-		threshold = base * 2;
-	else if (imp == TIPC_HIGH_IMPORTANCE)
-		threshold = base * 100;
-	else
-		return 0;
+	if (unlikely(imp > TIPC_CRITICAL_IMPORTANCE))
+		return true;
 
-	if (msg_connected(msg))
-		threshold *= 4;
-
-	return queue_size >= threshold;
+	return queue_size >= msg_importance_factor[imp];
 }
 
 /**
@@ -1275,7 +1271,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 {
 	struct socket *sock = sk->sk_socket;
 	struct tipc_msg *msg = buf_msg(buf);
-	u32 recv_q_len;
 	u32 res = TIPC_OK;
 
 	/* Reject message if it is wrong sort of message for socket */
@@ -1285,19 +1280,18 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 	if (sock->state == SS_READY) {
 		if (msg_connected(msg))
 			return TIPC_ERR_NO_PORT;
+		/* Reject SOCK_DGRAM and SOCK_RDM message if there isn't room
+		 * to queue it
+		 */
+		if (unlikely(rx_queue_full(msg,
+			     skb_queue_len(&sk->sk_receive_queue))))
+			return TIPC_ERR_OVERLOAD;
 	} else {
 		res = filter_connect(tipc_sk(sk), &buf);
 		if (res != TIPC_OK || buf == NULL)
 			return res;
 	}
 
-	/* Reject message if there isn't room to queue it */
-	recv_q_len = skb_queue_len(&sk->sk_receive_queue);
-	if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
-		if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
-			return TIPC_ERR_OVERLOAD;
-	}
-
 	/* Enqueue message (finally!) */
 	TIPC_SKB_CB(buf)->handle = 0;
 	__skb_queue_tail(&sk->sk_receive_queue, buf);
-- 
1.7.1


------------------------------------------------------------------------------
LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
Remotely access PCs and mobile devices and provide instant support
Improve your efficiency, and focus on delivering more value-add services
Discover what IT Professionals Know. Rescue delivers
http://p.sf.net/sfu/logmein_12329d2d

^ permalink raw reply related

* Re: [PATCH 5/5] smsc95xx: expand check_ macros
From: Steve Glendinning @ 2012-12-10  9:16 UTC (permalink / raw)
  To: David Laight; +Cc: netdev, David Miller
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B70D4@saturn3.aculab.com>

On 3 December 2012 10:41, David Laight <David.Laight@aculab.com> wrote:
>> -             check_warn_return(ret, "Error reading MII_ACCESS\n");
>> +             if (ret < 0) {
>> +                     netdev_warn(dev->net, "Error reading MII_ACCESS\n");
>> +                     return ret;
>> +             }
>> +
>
> It might be worth defining something like:
>
> #define check_warn(dev, ret, errmsg) \
>         (ret >= 0 ? 0 : (netdev_warn(dev->net, errmsg), ret))
>
> so the above code can be:
>         if (check_warn(dev, ret, "Error reading MII_ACCESS\n"))
>                 return ret;
>

Thanks David, I'd rather not define this kind of macro locally though.
 I'm happy to use something like this but only if it's something
that's centrally defined, blessed by davem, and if it's something that
all drivers are intended to use.  Otherwise we end up with subtly
different macros in different drivers, and subtly different behaviours
in each.

Steve

^ permalink raw reply

* [PATCH RESEND] net: remove obsolete simple_strto<foo>
From: Abhijit Pawar @ 2012-12-10  9:12 UTC (permalink / raw)
  To: David S. Miller, Pablo Neira Ayuso, Patrick McHardy,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	John W. Linville, Johannes Berg, Cong Wang, Eric Dumazet,
	Neil Horman, Joe Perches
  Cc: netdev, linux-kernel, netfilter-devel, netfilter, coreteam,
	linux-wireless, Abhijit Pawar

This patch replace the obsolete simple_strto<foo> with kstrto<foo>

Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
---
 net/core/netpoll.c                 |    5 ++++-
 net/ipv4/netfilter/ipt_CLUSTERIP.c |    9 +++++++--
 net/mac80211/debugfs_sta.c         |    3 +++
 net/netfilter/nf_conntrack_core.c  |    5 ++++-
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 77a0388..12c129f 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -674,7 +674,8 @@ int netpoll_parse_options(struct netpoll *np, char *opt)
 		if ((delim = strchr(cur, '@')) == NULL)
 			goto parse_failed;
 		*delim = 0;
-		np->local_port = simple_strtol(cur, NULL, 10);
+		if (kstrtou16(cur, 10, &np->local_port))
+			goto parse_failed;
 		cur = delim;
 	}
 	cur++;
@@ -706,6 +707,8 @@ int netpoll_parse_options(struct netpoll *np, char *opt)
 		if (*cur == ' ' || *cur == '\t')
 			np_info(np, "warning: whitespace is not allowed\n");
 		np->remote_port = simple_strtol(cur, NULL, 10);
+		if (kstrtou16(cur, 10, &np->remote_port))
+			goto parse_failed;
 		cur = delim;
 	}
 	cur++;
diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index fe5daea..75e33a7 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -661,6 +661,7 @@ static ssize_t clusterip_proc_write(struct file *file, const char __user *input,
 #define PROC_WRITELEN	10
 	char buffer[PROC_WRITELEN+1];
 	unsigned long nodenum;
+	int rc;
 
 	if (size > PROC_WRITELEN)
 		return -EIO;
@@ -669,11 +670,15 @@ static ssize_t clusterip_proc_write(struct file *file, const char __user *input,
 	buffer[size] = 0;
 
 	if (*buffer == '+') {
-		nodenum = simple_strtoul(buffer+1, NULL, 10);
+		rc = kstrtoul(buffer+1, 10, &nodenum);
+		if (rc)
+			return rc;
 		if (clusterip_add_node(c, nodenum))
 			return -ENOMEM;
 	} else if (*buffer == '-') {
-		nodenum = simple_strtoul(buffer+1, NULL,10);
+		rc = kstrtoul(buffer+1, 10, &nodenum);
+		if (rc)
+			return rc;
 		if (clusterip_del_node(c, nodenum))
 			return -ENOENT;
 	} else
diff --git a/net/mac80211/debugfs_sta.c b/net/mac80211/debugfs_sta.c
index 49a1c70..0dedb4b 100644
--- a/net/mac80211/debugfs_sta.c
+++ b/net/mac80211/debugfs_sta.c
@@ -221,6 +221,9 @@ static ssize_t sta_agg_status_write(struct file *file, const char __user *userbu
 		return -EINVAL;
 
 	tid = simple_strtoul(buf, NULL, 0);
+	ret = kstrtoul(buf, 0, &tid);
+	if (ret)
+		return ret;
 
 	if (tid >= IEEE80211_NUM_TIDS)
 		return -EINVAL;
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index af17516..37d9e62 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1409,7 +1409,7 @@ EXPORT_SYMBOL_GPL(nf_ct_alloc_hashtable);
 
 int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 {
-	int i, bucket;
+	int i, bucket, rc;
 	unsigned int hashsize, old_size;
 	struct hlist_nulls_head *hash, *old_hash;
 	struct nf_conntrack_tuple_hash *h;
@@ -1423,6 +1423,9 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 		return param_set_uint(val, kp);
 
 	hashsize = simple_strtoul(val, NULL, 0);
+	rc = kstrtouint(val, 0, &hashsize);
+	if (rc)
+		return rc;
 	if (!hashsize)
 		return -EINVAL;
 
-- 
1.7.7.6


^ permalink raw reply related

* Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
From: Jon Maloy @ 2012-12-10  8:46 UTC (permalink / raw)
  To: Ying Xue; +Cc: Neil Horman, Paul Gortmaker, David Miller, netdev
In-Reply-To: <50C580E6.7030905@windriver.com>

On 12/10/2012 01:27 AM, Ying Xue wrote:
> Neil Horman wrote:
>> On Fri, Dec 07, 2012 at 05:30:11PM -0500, Jon Maloy wrote:
>>   
>>> On 12/07/2012 02:20 PM, Neil Horman wrote:
>>>     
>>>> On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
>>>>       
>>>>> From: Ying Xue <ying.xue@windriver.com>
>>>>>
>>>>> The sk_receive_queue limit control is currently performed for
>>>>> all arriving messages, disregarding socket and message type.
>>>>> But for connected sockets this check is redundant, since the protocol
>>>>> flow control already makes queue overflow impossible.
>>>>>
>>>>>         
>>>> Can you explain where that occurs?  
>>>>       
>>> It happens in the functions port_dispatcher_sigh() and tipc_send(), 
>>> among other places. Both are to be found in the file port.c, which 
>>> was supposed to contain the 'generic' (i.e., API independent) part 
>>> of the send/receive code.
>>> Now that we have only one API left, the socket API, we are 
>>> planning to merge the code in socket.c and port.c, and get rid of 
>>> some code overhead.
>>>
>>> The flow control in TIPC is message based, where the sender
>>> requires to receive an explicit acknowledge message for each 
>>> 512 message the receiver reads to user space.
>>> If the sender has more than 1024 messages outstanding without having
>>> received an acknowledge he will be suspended or receive EAGAIN until 
>>> he does.
>>> The plan going forward is to replace this mechanism with a more 
>>> standard looking byte based flow control, while maintaining 
>>> backwards compatibility.
>>>
>>>     
>> Ok, That makes more sense, thank you.  Although I still don't think this is
>> safe (but the problem may not be solely introduced by this patch).  Using a
>> global limit that assumes the sender will block when the congestion window is
>> reached just doesn't seem sane to me.  It clearly works with the Linux
>> implementation, as it conforms to your expectations, but an alternate
>> implementation could create a DOS situation by simply ignoring the window limit,
>> and continuing to send.  I see that we drop frames over the global limit in
>> filter_rcv, but the check in rx_queue_full bumps up that limit based on the
>> value of msg_importance(msg), but that threshold is ignored if the value of
>> msg_importance is invalid.  All a sender needs to do is flood a receiver with
>> frames containing an invalid set of message importance bits, and you will queue
>> frames indefinately.  In fact that will also happen if you send message of
>> CRITICAL importance as well, so you don't even need to supply an invalid value
>> here.
>>
>>   
> 
> You are absolutely right. I will correct these drawbacks in next version.

I think we should rather just drop this patch. We introduce a major vulnerability,
as Neil correctly points out. We will anyway have to do a rework of this code.

> 
>>>> I see where the tipc dispatch function calls
>>>> sk_add_backlog, which checks the per socket recieve queue (regardless of weather
>>>> the receiving socket is connection oriented or connectionless), but if the
>>>> receiver doesn't call receive very often, This just adds a check against your
>>>> global limit, doing nothing for your per-socket limits. 
>>>>       
>>> OVERLOAD_LIMIT_BASE is tested against a per-socket message counter, so it _is_
>>> our per-socket limit. In fact, TIPC connectionless overflow control currently 
>>> is a kind of a hybrid, based on a message counter when the socket is not locked, 
>>> and based on sk_rcv_queue's byte limit when a message has to be added to the 
>>> backlog.
>>> We are planning to fix this inconsistency too.
>>>     
>> Good, thank you,  that was seeming quite wrong to me.
>>
>>   
>>>  In fact it seems to
>>>     
>>>> repeat the same check twice, as in the worst case of the incomming message being
>>>> TIPC_LOW_IMPORTANCE, its just going to check that the global limit is exactly
>>>> OVERLOAD_LIMIT_BASE/2 again.
>>>>       
>>> Yes, you are right. The intention is that only the first test, 
>>> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2)){..}
>>> will be run for the vast majority of messages, since we must assume
>>> that there is no overload most of the time.
>>> An inelegant optimization, perhaps, but not logically wrong.
>>>     
>> No, not logically wrong, but not an optimization either.  With this change,
>> your only use of rx_queue_full passes OVERLOAD_LIMIT_BASE/2 as the base value to
>> rx_queue_full, and then you do some multiplication based on that.  

It is still in the "unlikely" (in fact, very unlikely) branch. And the multiplication
is by two, i.e. just a left-shift operation. Our approach was rather to let the 
compiler decide about inlining, which in this case might be a sub-optimization.

If you really
>> want to optimize this, leave OVERLOAD_LIMIT_BASE where it is (rather than
>> doubling it like this patch series does), mark rx_queue_full as inline, and just
>> pass OVERLOAD_LIMIT_BASE as the argument, it will save you a division opration,
>> the conditional branch and a call instruction.  If you add a multiplication
>> factor table, you can eliminate the if/else clauses in rx_queue_full as well.
>>
>>   
> 
> Good suggestion with a factor table. Maybe it's unnecessary to 
> explicitly mark rx_queue_full as inline. Currently it sounds like we let 
> complier decide whether a function is defined as inline or not.

One approach I had in mind was to just left-shift OVERLOAD_LIMIT_BASE with
message priority, and compare that to the per-socket counter. This way,
we obtain the limit set [10000,20000,30000,40000] without having to read
data memory. The limits will not be the same as now, but probably good
enough. We don't even need a separate function for this check.
Something we should look into when we move on to make this mechanism 
byte-based.

> 
> Regards,
> Ying
> 
>> Neil
>>
>>   
>>> ///jon
>>>
>>>     
>>>> Neil
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>       
>>>     
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>   
> 

^ permalink raw reply

* Re: [PATCH 1/1] net: ethernet: davinci_cpdma: Add boundary for rx and tx descriptors
From: Christian Riesch @ 2012-12-10  8:24 UTC (permalink / raw)
  To: Mugunthan V N; +Cc: netdev, davem, linux-arm-kernel, linux-omap, s.hauer
In-Reply-To: <1355125052-28448-1-git-send-email-mugunthanvnm@ti.com>

Hi again,

On Mon, Dec 10, 2012 at 8:37 AM, Mugunthan V N <mugunthanvnm@ti.com> wrote:
> When there is heavy transmission traffic in the CPDMA, then Rx descriptors
> memory is also utilized as tx desc memory this leads to reduced rx desc memory
> which leads to poor performance.
>

"poor performance" is an understatement, see Sascha's description of
his patch. At initialization of the driver, half of the descriptors in
the pool are allocated for rx. When a packet arrives, one of the rx
descriptors is released and a new one is allocated. If tx allocates
this descriptor in the meantime, it is lost for rx forever! If tx
consumes all rx descriptors this way, the rx channel is dead!

Regards, Christian

> This patch adds boundary for tx and rx descriptors in bd ram dividing the
> descriptor memory to ensure that during heavy transmission tx doesn't use
> rx descriptors.
>
> This patch is already applied to davinci_emac driver, since CPSW and
> davici_dmac uses the same CPDMA, moving the boundry seperation from
> Davinci EMAC driver to CPDMA driver which was done in the following
> commit
>
> commit 86d8c07ff2448eb4e860e50f34ef6ee78e45c40c
> Author: Sascha Hauer <s.hauer@pengutronix.de>
> Date:   Tue Jan 3 05:27:47 2012 +0000
>
>     net/davinci: do not use all descriptors for tx packets
>
>     The driver uses a shared pool for both rx and tx descriptors.
>     During open it queues fixed number of 128 descriptors for receive
>     packets. For each received packet it tries to queue another
>     descriptor. If this fails the descriptor is lost for rx.
>     The driver has no limitation on tx descriptors to use, so it
>     can happen during a nmap / ping -f attack that the driver
>     allocates all descriptors for tx and looses all rx descriptors.
>     The driver stops working then.
>     To fix this limit the number of tx descriptors used to half of
>     the descriptors available, the rx path uses the other half.
>
>     Tested on a custom board using nmap / ping -f to the board from
>     two different hosts.
>
> Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>

^ permalink raw reply

* Re: [PATCH 1/1] net: ethernet: davinci_cpdma: Add boundary for rx and tx descriptors
From: Christian Riesch @ 2012-12-10  8:11 UTC (permalink / raw)
  To: Mugunthan V N; +Cc: netdev, s.hauer, linux-omap, davem, linux-arm-kernel
In-Reply-To: <1355125052-28448-1-git-send-email-mugunthanvnm@ti.com>

Hi,

On Mon, Dec 10, 2012 at 8:37 AM, Mugunthan V N <mugunthanvnm@ti.com> wrote:
> When there is heavy transmission traffic in the CPDMA, then Rx descriptors
> memory is also utilized as tx desc memory this leads to reduced rx desc memory
> which leads to poor performance.
>
> This patch adds boundary for tx and rx descriptors in bd ram dividing the
> descriptor memory to ensure that during heavy transmission tx doesn't use
> rx descriptors.
>
> This patch is already applied to davinci_emac driver, since CPSW and
> davici_dmac uses the same CPDMA, moving the boundry seperation from
> Davinci EMAC driver to CPDMA driver which was done in the following
> commit
>
> commit 86d8c07ff2448eb4e860e50f34ef6ee78e45c40c
> Author: Sascha Hauer <s.hauer@pengutronix.de>
> Date:   Tue Jan 3 05:27:47 2012 +0000
>
>     net/davinci: do not use all descriptors for tx packets
>
>     The driver uses a shared pool for both rx and tx descriptors.
>     During open it queues fixed number of 128 descriptors for receive
>     packets. For each received packet it tries to queue another
>     descriptor. If this fails the descriptor is lost for rx.
>     The driver has no limitation on tx descriptors to use, so it
>     can happen during a nmap / ping -f attack that the driver
>     allocates all descriptors for tx and looses all rx descriptors.
>     The driver stops working then.
>     To fix this limit the number of tx descriptors used to half of
>     the descriptors available, the rx path uses the other half.
>
>     Tested on a custom board using nmap / ping -f to the board from
>     two different hosts.
>
> Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
> ---
>  drivers/net/ethernet/ti/davinci_cpdma.c |   20 ++++++++++++++------
>  drivers/net/ethernet/ti/davinci_emac.c  |    8 --------
>  2 files changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
> index 4995673..d37f546 100644
> --- a/drivers/net/ethernet/ti/davinci_cpdma.c
> +++ b/drivers/net/ethernet/ti/davinci_cpdma.c
> @@ -105,13 +105,13 @@ struct cpdma_ctlr {
>  };
>
>  struct cpdma_chan {
> +       struct cpdma_desc __iomem       *head, *tail;
> +       void __iomem                    *hdp, *cp, *rxfree;
>         enum cpdma_state                state;
>         struct cpdma_ctlr               *ctlr;
>         int                             chan_num;
>         spinlock_t                      lock;
> -       struct cpdma_desc __iomem       *head, *tail;
>         int                             count;
> -       void __iomem                    *hdp, *cp, *rxfree;

Why?

>         u32                             mask;
>         cpdma_handler_fn                handler;
>         enum dma_data_direction         dir;
> @@ -217,7 +217,7 @@ desc_from_phys(struct cpdma_desc_pool *pool, dma_addr_t dma)
>  }
>
>  static struct cpdma_desc __iomem *
> -cpdma_desc_alloc(struct cpdma_desc_pool *pool, int num_desc)
> +cpdma_desc_alloc(struct cpdma_desc_pool *pool, int num_desc, bool is_rx)
>  {
>         unsigned long flags;
>         int index;
> @@ -225,8 +225,14 @@ cpdma_desc_alloc(struct cpdma_desc_pool *pool, int num_desc)
>
>         spin_lock_irqsave(&pool->lock, flags);
>
> -       index = bitmap_find_next_zero_area(pool->bitmap, pool->num_desc, 0,
> -                                          num_desc, 0);
> +       if (is_rx) {
> +               index = bitmap_find_next_zero_area(pool->bitmap,
> +                               pool->num_desc/2, 0, num_desc, 0);
> +        } else {
> +               index = bitmap_find_next_zero_area(pool->bitmap,
> +                               pool->num_desc, pool->num_desc/2, num_desc, 0);
> +       }

Would it make sense to use two separate pools for rx and tx instead,
struct cpdma_desc_pool *rxpool, *txpool? It would result in using
separate spinlocks for rx and tx, could this be an advantage? (I am a
newbie in this field...)

> +
>         if (index < pool->num_desc) {
>                 bitmap_set(pool->bitmap, index, num_desc);
>                 desc = pool->iomap + pool->desc_size * index;
> @@ -660,6 +666,7 @@ int cpdma_chan_submit(struct cpdma_chan *chan, void *token, void *data,
>         unsigned long                   flags;
>         u32                             mode;
>         int                             ret = 0;
> +       bool                            is_rx;
>
>         spin_lock_irqsave(&chan->lock, flags);
>
> @@ -668,7 +675,8 @@ int cpdma_chan_submit(struct cpdma_chan *chan, void *token, void *data,
>                 goto unlock_ret;
>         }
>
> -       desc = cpdma_desc_alloc(ctlr->pool, 1);
> +       is_rx = (chan->rxfree != 0);
> +       desc = cpdma_desc_alloc(ctlr->pool, 1, is_rx);
>         if (!desc) {
>                 chan->stats.desc_alloc_fail++;
>                 ret = -ENOMEM;
> diff --git a/drivers/net/ethernet/ti/davinci_emac.c b/drivers/net/ethernet/ti/davinci_emac.c
> index fce89a0..f349273 100644
> --- a/drivers/net/ethernet/ti/davinci_emac.c
> +++ b/drivers/net/ethernet/ti/davinci_emac.c
> @@ -120,7 +120,6 @@ static const char emac_version_string[] = "TI DaVinci EMAC Linux v6.1";
>  #define EMAC_DEF_TX_CH                 (0) /* Default 0th channel */
>  #define EMAC_DEF_RX_CH                 (0) /* Default 0th channel */
>  #define EMAC_DEF_RX_NUM_DESC           (128)
> -#define EMAC_DEF_TX_NUM_DESC           (128)
>  #define EMAC_DEF_MAX_TX_CH             (1) /* Max TX channels configured */
>  #define EMAC_DEF_MAX_RX_CH             (1) /* Max RX channels configured */
>  #define EMAC_POLL_WEIGHT               (64) /* Default NAPI poll weight */
> @@ -342,7 +341,6 @@ struct emac_priv {
>         u32 mac_hash2;
>         u32 multicast_hash_cnt[EMAC_NUM_MULTICAST_BITS];
>         u32 rx_addr_type;
> -       atomic_t cur_tx;
>         const char *phy_id;
>  #ifdef CONFIG_OF
>         struct device_node *phy_node;
> @@ -1050,9 +1048,6 @@ static void emac_tx_handler(void *token, int len, int status)
>  {
>         struct sk_buff          *skb = token;
>         struct net_device       *ndev = skb->dev;
> -       struct emac_priv        *priv = netdev_priv(ndev);
> -
> -       atomic_dec(&priv->cur_tx);
>
>         if (unlikely(netif_queue_stopped(ndev)))
>                 netif_start_queue(ndev);
> @@ -1101,9 +1096,6 @@ static int emac_dev_xmit(struct sk_buff *skb, struct net_device *ndev)
>                 goto fail_tx;
>         }
>
> -       if (atomic_inc_return(&priv->cur_tx) >= EMAC_DEF_TX_NUM_DESC)
> -               netif_stop_queue(ndev);
> -
>         return NETDEV_TX_OK;
>
>  fail_tx:
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 1/1] net: ethernet: davinci_cpdma: Add boundary for rx and tx descriptors
From: Mugunthan V N @ 2012-12-10  7:37 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-arm-kernel, linux-omap, s.hauer, Mugunthan V N

When there is heavy transmission traffic in the CPDMA, then Rx descriptors
memory is also utilized as tx desc memory this leads to reduced rx desc memory
which leads to poor performance.

This patch adds boundary for tx and rx descriptors in bd ram dividing the
descriptor memory to ensure that during heavy transmission tx doesn't use
rx descriptors.

This patch is already applied to davinci_emac driver, since CPSW and
davici_dmac uses the same CPDMA, moving the boundry seperation from
Davinci EMAC driver to CPDMA driver which was done in the following
commit

commit 86d8c07ff2448eb4e860e50f34ef6ee78e45c40c
Author: Sascha Hauer <s.hauer@pengutronix.de>
Date:   Tue Jan 3 05:27:47 2012 +0000

    net/davinci: do not use all descriptors for tx packets

    The driver uses a shared pool for both rx and tx descriptors.
    During open it queues fixed number of 128 descriptors for receive
    packets. For each received packet it tries to queue another
    descriptor. If this fails the descriptor is lost for rx.
    The driver has no limitation on tx descriptors to use, so it
    can happen during a nmap / ping -f attack that the driver
    allocates all descriptors for tx and looses all rx descriptors.
    The driver stops working then.
    To fix this limit the number of tx descriptors used to half of
    the descriptors available, the rx path uses the other half.

    Tested on a custom board using nmap / ping -f to the board from
    two different hosts.

Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
---
 drivers/net/ethernet/ti/davinci_cpdma.c |   20 ++++++++++++++------
 drivers/net/ethernet/ti/davinci_emac.c  |    8 --------
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index 4995673..d37f546 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -105,13 +105,13 @@ struct cpdma_ctlr {
 };
 
 struct cpdma_chan {
+	struct cpdma_desc __iomem	*head, *tail;
+	void __iomem			*hdp, *cp, *rxfree;
 	enum cpdma_state		state;
 	struct cpdma_ctlr		*ctlr;
 	int				chan_num;
 	spinlock_t			lock;
-	struct cpdma_desc __iomem	*head, *tail;
 	int				count;
-	void __iomem			*hdp, *cp, *rxfree;
 	u32				mask;
 	cpdma_handler_fn		handler;
 	enum dma_data_direction		dir;
@@ -217,7 +217,7 @@ desc_from_phys(struct cpdma_desc_pool *pool, dma_addr_t dma)
 }
 
 static struct cpdma_desc __iomem *
-cpdma_desc_alloc(struct cpdma_desc_pool *pool, int num_desc)
+cpdma_desc_alloc(struct cpdma_desc_pool *pool, int num_desc, bool is_rx)
 {
 	unsigned long flags;
 	int index;
@@ -225,8 +225,14 @@ cpdma_desc_alloc(struct cpdma_desc_pool *pool, int num_desc)
 
 	spin_lock_irqsave(&pool->lock, flags);
 
-	index = bitmap_find_next_zero_area(pool->bitmap, pool->num_desc, 0,
-					   num_desc, 0);
+	if (is_rx) {
+		index = bitmap_find_next_zero_area(pool->bitmap,
+				pool->num_desc/2, 0, num_desc, 0);
+	 } else {
+		index = bitmap_find_next_zero_area(pool->bitmap,
+				pool->num_desc, pool->num_desc/2, num_desc, 0);
+	}
+
 	if (index < pool->num_desc) {
 		bitmap_set(pool->bitmap, index, num_desc);
 		desc = pool->iomap + pool->desc_size * index;
@@ -660,6 +666,7 @@ int cpdma_chan_submit(struct cpdma_chan *chan, void *token, void *data,
 	unsigned long			flags;
 	u32				mode;
 	int				ret = 0;
+	bool                            is_rx;
 
 	spin_lock_irqsave(&chan->lock, flags);
 
@@ -668,7 +675,8 @@ int cpdma_chan_submit(struct cpdma_chan *chan, void *token, void *data,
 		goto unlock_ret;
 	}
 
-	desc = cpdma_desc_alloc(ctlr->pool, 1);
+	is_rx = (chan->rxfree != 0);
+	desc = cpdma_desc_alloc(ctlr->pool, 1, is_rx);
 	if (!desc) {
 		chan->stats.desc_alloc_fail++;
 		ret = -ENOMEM;
diff --git a/drivers/net/ethernet/ti/davinci_emac.c b/drivers/net/ethernet/ti/davinci_emac.c
index fce89a0..f349273 100644
--- a/drivers/net/ethernet/ti/davinci_emac.c
+++ b/drivers/net/ethernet/ti/davinci_emac.c
@@ -120,7 +120,6 @@ static const char emac_version_string[] = "TI DaVinci EMAC Linux v6.1";
 #define EMAC_DEF_TX_CH			(0) /* Default 0th channel */
 #define EMAC_DEF_RX_CH			(0) /* Default 0th channel */
 #define EMAC_DEF_RX_NUM_DESC		(128)
-#define EMAC_DEF_TX_NUM_DESC		(128)
 #define EMAC_DEF_MAX_TX_CH		(1) /* Max TX channels configured */
 #define EMAC_DEF_MAX_RX_CH		(1) /* Max RX channels configured */
 #define EMAC_POLL_WEIGHT		(64) /* Default NAPI poll weight */
@@ -342,7 +341,6 @@ struct emac_priv {
 	u32 mac_hash2;
 	u32 multicast_hash_cnt[EMAC_NUM_MULTICAST_BITS];
 	u32 rx_addr_type;
-	atomic_t cur_tx;
 	const char *phy_id;
 #ifdef CONFIG_OF
 	struct device_node *phy_node;
@@ -1050,9 +1048,6 @@ static void emac_tx_handler(void *token, int len, int status)
 {
 	struct sk_buff		*skb = token;
 	struct net_device	*ndev = skb->dev;
-	struct emac_priv	*priv = netdev_priv(ndev);
-
-	atomic_dec(&priv->cur_tx);
 
 	if (unlikely(netif_queue_stopped(ndev)))
 		netif_start_queue(ndev);
@@ -1101,9 +1096,6 @@ static int emac_dev_xmit(struct sk_buff *skb, struct net_device *ndev)
 		goto fail_tx;
 	}
 
-	if (atomic_inc_return(&priv->cur_tx) >= EMAC_DEF_TX_NUM_DESC)
-		netif_stop_queue(ndev);
-
 	return NETDEV_TX_OK;
 
 fail_tx:
-- 
1.7.9.5

^ permalink raw reply related

* Re: ixgbe:  pci_get_device() call without counterpart call of pci_dev_put()
From: Jeff Kirsher @ 2012-12-10  7:17 UTC (permalink / raw)
  To: Elena Gurevich; +Cc: netdev, e1000-devel, Greg Rose, Don Skidmore
In-Reply-To: <01b301cdd5f2$3a9b2360$afd16a20$@toganetworks.com>

[-- Attachment #1: Type: text/plain, Size: 2020 bytes --]

On 12/09/2012 01:47 AM, Elena Gurevich wrote:
> Hi all,
> I am pioneer in linux device drivers here and using Intel 82599 NIC as
> reference model, 
> During investigation to drivers sources I found the suspicious code:  
>
> Is code sequence (1)   and (2) the possible device reference count leakage
> ???
>
>
> Thanks a lot in advance
> Lena
>
> --------snipped from ixgbe_main.c  file function ixgbe_io_error_detected()
> -----------
>
> . . . 
> 		/* Find the pci device of the offending VF */
> 		vfdev = pci_get_device(PCI_VENDOR_ID_INTEL, device_id,
> NULL);
> 		while (vfdev) {
> 			if (vfdev->devfn == (req_id & 0xFF))
> 				break;
> <------------------------------      (1) leaves the loop with successful get
> call  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>
> 			vfdev = pci_get_device(PCI_VENDOR_ID_INTEL,
> 					       device_id, vfdev);
> 		}
> 		/*
> 		 * There's a slim chance the VF could have been hot plugged,
> 		 * so if it is no longer present we don't need to issue the
> 		 * VFLR.  Just clean up the AER in that case.
> 		 */
> 		if (vfdev) {
> 			e_dev_err("Issuing VFLR to VF %d\n", vf);
> 			pci_write_config_dword(vfdev, 0xA8, 0x00008000);
> 		}
>
> 		pci_cleanup_aer_uncorrect_error_status(pdev);
> 	}
>
> 	/*
> 	 * Even though the error may have occurred on the other port
> 	 * we still need to increment the vf error reference count for
> 	 * both ports because the I/O resume function will be called
> 	 * for both of them.
> 	 */
> 	adapter->vferr_refcount++;
>
> 	return PCI_ERS_RESULT_RECOVERED;
> <--------------------------------------------  (2) leaves the function
> without put call !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> ----------------------------------  snipped
> -----------------------------------
>
Added Greg Rose (ixgbevf maintainer) & Don Skidmore (ixgbe maintainer)

Since the code you have questions about is dealing with the vf, Greg
will most likely be able to shed some light to your questions.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 899 bytes --]

^ permalink raw reply

* [PATCH] net: Allow DCBnl to use other namespaces besides init_net
From: John Fastabend @ 2012-12-10  6:48 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm

Allow DCB and net namespace to work together. This is useful if you
have containers that are bound to 'phys' interfaces that want to
also manage their DCB attributes.

The net namespace is taken from sock_net(skb->sk) of the netlink skb.

CC: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/dcb/dcbnl.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index b07c75d..1b588e2 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -1665,9 +1665,6 @@ static int dcb_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	if ((nlh->nlmsg_type == RTM_SETDCB) && !capable(CAP_NET_ADMIN))
 		return -EPERM;
 
-	if (!net_eq(net, &init_net))
-		return -EINVAL;
-
 	ret = nlmsg_parse(nlh, sizeof(*dcb), tb, DCB_ATTR_MAX,
 			  dcbnl_rtnl_policy);
 	if (ret < 0)
@@ -1684,7 +1681,7 @@ static int dcb_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	if (!tb[DCB_ATTR_IFNAME])
 		return -EINVAL;
 
-	netdev = dev_get_by_name(&init_net, nla_data(tb[DCB_ATTR_IFNAME]));
+	netdev = dev_get_by_name(net, nla_data(tb[DCB_ATTR_IFNAME]));
 	if (!netdev)
 		return -ENODEV;
 
@@ -1708,7 +1705,7 @@ static int dcb_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 
 	nlmsg_end(reply_skb, reply_nlh);
 
-	ret = rtnl_unicast(reply_skb, &init_net, portid);
+	ret = rtnl_unicast(reply_skb, net, portid);
 out:
 	dev_put(netdev);
 	return ret;

^ permalink raw reply related

* RE: linux-next: manual merge of the net-next tree with the pci tree
From: Grumbach, Emmanuel @ 2012-12-10  6:44 UTC (permalink / raw)
  To: Stephen Rothwell, David Miller, netdev@vger.kernel.org
  Cc: linux-next@vger.kernel.org, linux-kernel@vger.kernel.org,
	Bjorn Helgaas, Berg, Johannes
In-Reply-To: <20121210120825.49096bafd378139e98becb8d@canb.auug.org.au>

> Today's linux-next merge of the net-next tree got a conflict in
> drivers/net/wireless/iwlwifi/pcie/trans.c between commit b9d146e30a2d
> ("iwlwifi: collapse wrapper for pcie_capability_read_word()") from the pci
> tree and commit 7afe3705cd4e ("iwlwifi: continue clean up -
> pcie/trans.c") from the net-next tree.
> 
> I fixed it up (see below) and can carry the fix as necessary (no action is
> required).
> 
Looks right - thanks Stephen!
---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

^ permalink raw reply

* Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
From: Ying Xue @ 2012-12-10  6:27 UTC (permalink / raw)
  To: Neil Horman; +Cc: Jon Maloy, Paul Gortmaker, David Miller, netdev, Ying Xue
In-Reply-To: <20121209165020.GA4362@neilslaptop.think-freely.org>

Neil Horman wrote:
> On Fri, Dec 07, 2012 at 05:30:11PM -0500, Jon Maloy wrote:
>   
>> On 12/07/2012 02:20 PM, Neil Horman wrote:
>>     
>>> On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
>>>       
>>>> From: Ying Xue <ying.xue@windriver.com>
>>>>
>>>> The sk_receive_queue limit control is currently performed for
>>>> all arriving messages, disregarding socket and message type.
>>>> But for connected sockets this check is redundant, since the protocol
>>>> flow control already makes queue overflow impossible.
>>>>
>>>>         
>>> Can you explain where that occurs?  
>>>       
>> It happens in the functions port_dispatcher_sigh() and tipc_send(), 
>> among other places. Both are to be found in the file port.c, which 
>> was supposed to contain the 'generic' (i.e., API independent) part 
>> of the send/receive code.
>> Now that we have only one API left, the socket API, we are 
>> planning to merge the code in socket.c and port.c, and get rid of 
>> some code overhead.
>>
>> The flow control in TIPC is message based, where the sender
>> requires to receive an explicit acknowledge message for each 
>> 512 message the receiver reads to user space.
>> If the sender has more than 1024 messages outstanding without having
>> received an acknowledge he will be suspended or receive EAGAIN until 
>> he does.
>> The plan going forward is to replace this mechanism with a more 
>> standard looking byte based flow control, while maintaining 
>> backwards compatibility.
>>
>>     
> Ok, That makes more sense, thank you.  Although I still don't think this is
> safe (but the problem may not be solely introduced by this patch).  Using a
> global limit that assumes the sender will block when the congestion window is
> reached just doesn't seem sane to me.  It clearly works with the Linux
> implementation, as it conforms to your expectations, but an alternate
> implementation could create a DOS situation by simply ignoring the window limit,
> and continuing to send.  I see that we drop frames over the global limit in
> filter_rcv, but the check in rx_queue_full bumps up that limit based on the
> value of msg_importance(msg), but that threshold is ignored if the value of
> msg_importance is invalid.  All a sender needs to do is flood a receiver with
> frames containing an invalid set of message importance bits, and you will queue
> frames indefinately.  In fact that will also happen if you send message of
> CRITICAL importance as well, so you don't even need to supply an invalid value
> here.
>
>   

You are absolutely right. I will correct these drawbacks in next version.

>>> I see where the tipc dispatch function calls
>>> sk_add_backlog, which checks the per socket recieve queue (regardless of weather
>>> the receiving socket is connection oriented or connectionless), but if the
>>> receiver doesn't call receive very often, This just adds a check against your
>>> global limit, doing nothing for your per-socket limits. 
>>>       
>> OVERLOAD_LIMIT_BASE is tested against a per-socket message counter, so it _is_
>> our per-socket limit. In fact, TIPC connectionless overflow control currently 
>> is a kind of a hybrid, based on a message counter when the socket is not locked, 
>> and based on sk_rcv_queue's byte limit when a message has to be added to the 
>> backlog.
>> We are planning to fix this inconsistency too.
>>     
> Good, thank you,  that was seeming quite wrong to me.
>
>   
>>  In fact it seems to
>>     
>>> repeat the same check twice, as in the worst case of the incomming message being
>>> TIPC_LOW_IMPORTANCE, its just going to check that the global limit is exactly
>>> OVERLOAD_LIMIT_BASE/2 again.
>>>       
>> Yes, you are right. The intention is that only the first test, 
>> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2)){..}
>> will be run for the vast majority of messages, since we must assume
>> that there is no overload most of the time.
>> An inelegant optimization, perhaps, but not logically wrong.
>>     
> No, not logically wrong, but not an optimization either.  With this change,
> your only use of rx_queue_full passes OVERLOAD_LIMIT_BASE/2 as the base value to
> rx_queue_full, and then you do some multiplication based on that.  If you really
> want to optimize this, leave OVERLOAD_LIMIT_BASE where it is (rather than
> doubling it like this patch series does), mark rx_queue_full as inline, and just
> pass OVERLOAD_LIMIT_BASE as the argument, it will save you a division opration,
> the conditional branch and a call instruction.  If you add a multiplication
> factor table, you can eliminate the if/else clauses in rx_queue_full as well.
>
>   

Good suggestion with a factor table. Maybe it's unnecessary to 
explicitly mark rx_queue_full as inline. Currently it sounds like we let 
complier decide whether a function is defined as inline or not.

Regards,
Ying

> Neil
>
>   
>> ///jon
>>
>>     
>>> Neil
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>       
>>     
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>   

^ permalink raw reply

* Re: [PATCHv6] virtio-spec: virtio network device multiqueue support
From: Jason Wang @ 2012-12-10  5:54 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: bhutchings, netdev, kvm, virtualization
In-Reply-To: <20121207141856.GA17901@redhat.com>

On Friday, December 07, 2012 04:18:56 PM Michael S. Tsirkin wrote:
> Add multiqueue support to virtio network device.
> Add a new feature flag VIRTIO_NET_F_MQ for this feature, a new
> configuration field max_virtqueue_pairs to detect supported number of
> virtqueues as well as a new command VIRTIO_NET_CTRL_MQ to program
> packet steering for unidirectional protocols.
> 
> ---
> 
> Changes in v6:
> - rename RFS -> multiqueue to avoid confusion with RFS in linux
>   mention automatic receive steering as Rusty suggested
> 
> Changes in v5:
> - Address Rusty's comments.
>   Changes are only in the text, not the ideas.
> - Some minor formatting changes.
> 
> Changes in v4:
> - address Jason's comments
> - have configuration specify the number of VQ pairs and not pairs - 1
> 
> Changes in v3:
> - rename multiqueue -> rfs this is what we support
> - Be more explicit about what driver should do.
> - Simplify layout making VQs functionality depend on feature.
> - Remove unused commands, only leave in programming # of queues
> 
> Changes in v2:
> Address Jason's comments on v2:
> - Changed STEERING_HOST to STEERING_RX_FOLLOWS_TX:
>    this is both clearer and easier to support.
>    It does not look like we need a separate steering command
>    since host can just watch tx packets as they go.
> - Moved RX and TX steering sections near each other.
> - Add motivation for other changes in v2
> 
> Changes in v1 (from Jason's rfc):
> - reserved vq 3: this makes all rx vqs even and tx vqs odd, which
>    looks nicer to me.
> - documented packet steering, added a generalized steering programming
>    command. Current modes are single queue and host driven multiqueue,
>    but I envision support for guest driven multiqueue in the future.
> - make default vqs unused when in mq mode - this wastes some memory
>    but makes it more efficient to switch between modes as
>    we can avoid this causing packet reordering.
> 
> Rusty, could you please take a look and comment soon?
> If this looks OK to everyone, we can proceed with finalizing the
> implementation. Would be nice to try and put it in 3.8.
> 
> diff --git a/virtio-spec.lyx b/virtio-spec.lyx
> index 83f2771..c5b32c4 100644
> --- a/virtio-spec.lyx
> +++ b/virtio-spec.lyx
> @@ -59,6 +59,7 @@
>  \author -608949062 "Rusty Russell,,,"
>  \author -385801441 "Cornelia Huck" cornelia.huck@de.ibm.com
>  \author 1531152142 "Paolo Bonzini,,,"
> +\author 1986246365 "Michael S. Tsirkin"
>  \end_header
[...]
> +
> +\begin_layout Plain Layout
> +
> +\change_inserted 1986246365 1353594263
> +
> +#define VIRTIO_NET_CTRL_MQ    1

Should be 4 here.
> +\end_layout
> +
> +\begin_layout Plain Layout
> +
> +\change_inserted 1986246365 1353594273
> +
> + #define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET        0
> +\end_layout
> +
> +\begin_layout Plain Layout
> +
> +\change_inserted 1986246365 1353594273
> +
> + #define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MIN        1
> +\end_layout
> +
[...]

^ permalink raw reply

* Re: [PATCH net] inet_diag: validate port comparison byte code to prevent unsafe reads
From: Neal Cardwell @ 2012-12-10  5:17 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, Netdev
In-Reply-To: <20121209.190223.1179297442063515404.davem@davemloft.net>

On Sun, Dec 9, 2012 at 7:02 PM, David Miller <davem@davemloft.net> wrote:
> I added the missing "return true" at the end of this new function.
>
> Applied, but please be more careful and at least look at the compiler
> warnings when you submit your changes.
>
> Thanks.

Roger that. Oof. Can't believe I missed that.

thanks,
neal

^ permalink raw reply

* [PATCH v5] iproute2: add mdb sub-command to bridge
From: Cong Wang @ 2012-12-10  3:47 UTC (permalink / raw)
  To: netdev
  Cc: bridge, Cong Wang, Herbert Xu, Stephen Hemminger, David S. Miller,
	Thomas Graf, Jesper Dangaard Brouer

From: Cong Wang <amwang@redhat.com>

V5: make the output pretty

V4: fix filter_dev
    remove some useless #include

V3: improve the output, display router info only for -d
    fix router parsing code

V2: sync with the kernel patch
    handle IPv6 addr
    a few cleanup

Sample output:

	# ./bridge/bridge mdb show dev br0
	bridge br0 port eth1 group 224.8.8.9
	bridge br0 port eth0 group 224.8.8.8

	# ./bridge/bridge -d mdb show dev br0
	bridge br0 port eth1 group 224.8.8.9
	bridge br0 port eth0 group 224.8.8.8
	router ports on br0: eth0 

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
---
 bridge/Makefile    |    2 +-
 bridge/br_common.h |    3 +-
 bridge/bridge.c    |    1 +
 bridge/mdb.c       |  172 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 176 insertions(+), 2 deletions(-)

diff --git a/bridge/Makefile b/bridge/Makefile
index 9a6743e..67aceb4 100644
--- a/bridge/Makefile
+++ b/bridge/Makefile
@@ -1,4 +1,4 @@
-BROBJ = bridge.o fdb.o monitor.o link.o
+BROBJ = bridge.o fdb.o monitor.o link.o mdb.o
 
 include ../Config
 
diff --git a/bridge/br_common.h b/bridge/br_common.h
index ce11a0b..71f7caf 100644
--- a/bridge/br_common.h
+++ b/bridge/br_common.h
@@ -5,11 +5,12 @@ extern int print_fdb(const struct sockaddr_nl *who,
 		     struct nlmsghdr *n, void *arg);
 
 extern int do_fdb(int argc, char **argv);
+extern int do_mdb(int argc, char **argv);
 extern int do_monitor(int argc, char **argv);
 
 extern int preferred_family;
 extern int show_stats;
-extern int show_detail;
+extern int show_details;
 extern int timestamp;
 extern struct rtnl_handle rth;
 
diff --git a/bridge/bridge.c b/bridge/bridge.c
index e2c33b0..1fcd365 100644
--- a/bridge/bridge.c
+++ b/bridge/bridge.c
@@ -43,6 +43,7 @@ static const struct cmd {
 	int (*func)(int argc, char **argv);
 } cmds[] = {
 	{ "fdb", 	do_fdb },
+	{ "mdb", 	do_mdb },
 	{ "monitor",	do_monitor },
 	{ "help",	do_help },
 	{ 0 }
diff --git a/bridge/mdb.c b/bridge/mdb.c
new file mode 100644
index 0000000..390d7f6
--- /dev/null
+++ b/bridge/mdb.c
@@ -0,0 +1,172 @@
+/*
+ * Get mdb table with netlink
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <linux/if_bridge.h>
+#include <linux/if_ether.h>
+#include <string.h>
+#include <arpa/inet.h>
+
+#include "libnetlink.h"
+#include "br_common.h"
+#include "rt_names.h"
+#include "utils.h"
+
+#ifndef MDBA_RTA
+#define MDBA_RTA(r) \
+	((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct br_port_msg))))
+#endif
+
+int filter_index;
+
+static void usage(void)
+{
+	fprintf(stderr, "       bridge mdb {show} [ dev DEV ]\n");
+	exit(-1);
+}
+
+static void br_print_router_ports(FILE *f, struct rtattr *attr)
+{
+	uint32_t *port_ifindex;
+	struct rtattr *i;
+	int rem;
+
+	rem = RTA_PAYLOAD(attr);
+	for (i = RTA_DATA(attr); RTA_OK(i, rem); i = RTA_NEXT(i, rem)) {
+		port_ifindex = RTA_DATA(i);
+		fprintf(f, "%s ", ll_index_to_name(*port_ifindex));
+	}
+
+	fprintf(f, "\n");
+}
+
+static void print_mdb_entry(FILE *f, int ifindex, struct br_mdb_entry *e)
+{
+	SPRINT_BUF(abuf);
+
+	if (e->addr.proto == htons(ETH_P_IP))
+		fprintf(f, "bridge %s port %s group %s\n", ll_index_to_name(ifindex),
+			ll_index_to_name(e->ifindex),
+			inet_ntop(AF_INET, &e->addr.u.ip4, abuf, sizeof(abuf)));
+	else
+		fprintf(f, "bridge %s port %s group %s\n", ll_index_to_name(ifindex),
+			ll_index_to_name(e->ifindex),
+			inet_ntop(AF_INET6, &e->addr.u.ip6, abuf, sizeof(abuf)));
+}
+
+static void br_print_mdb_entry(FILE *f, int ifindex, struct rtattr *attr)
+{
+	struct rtattr *i;
+	int rem;
+	struct br_mdb_entry *e;
+
+	rem = RTA_PAYLOAD(attr);
+	for (i = RTA_DATA(attr); RTA_OK(i, rem); i = RTA_NEXT(i, rem)) {
+		e = RTA_DATA(i);
+		print_mdb_entry(f, ifindex, e);
+	}
+}
+
+int print_mdb(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
+{
+	FILE *fp = arg;
+	struct br_port_msg *r = NLMSG_DATA(n);
+	int len = n->nlmsg_len;
+	struct rtattr * tb[MDBA_MAX+1];
+
+	if (n->nlmsg_type != RTM_GETMDB) {
+		fprintf(stderr, "Not RTM_GETMDB: %08x %08x %08x\n",
+			n->nlmsg_len, n->nlmsg_type, n->nlmsg_flags);
+
+		return 0;
+	}
+
+	len -= NLMSG_LENGTH(sizeof(*r));
+	if (len < 0) {
+		fprintf(stderr, "BUG: wrong nlmsg len %d\n", len);
+		return -1;
+	}
+
+	if (filter_index && filter_index != r->ifindex)
+		return 0;
+
+	parse_rtattr(tb, MDBA_MAX, MDBA_RTA(r), n->nlmsg_len - NLMSG_LENGTH(sizeof(*r)));
+
+	if (tb[MDBA_MDB]) {
+		struct rtattr *i;
+		int rem = RTA_PAYLOAD(tb[MDBA_MDB]);
+
+		for (i = RTA_DATA(tb[MDBA_MDB]); RTA_OK(i, rem); i = RTA_NEXT(i, rem))
+			br_print_mdb_entry(fp, r->ifindex, i);
+	}
+
+	if (tb[MDBA_ROUTER]) {
+		if (show_details) {
+			fprintf(fp, "router ports on %s: ", ll_index_to_name(r->ifindex));
+			br_print_router_ports(fp, tb[MDBA_ROUTER]);
+		}
+	}
+
+	return 0;
+}
+
+static int mdb_show(int argc, char **argv)
+{
+	char *filter_dev = NULL;
+
+	while (argc > 0) {
+		if (strcmp(*argv, "dev") == 0) {
+			NEXT_ARG();
+			if (filter_dev)
+				duparg("dev", *argv);
+			filter_dev = *argv;
+		}
+		argc--; argv++;
+	}
+
+	if (filter_dev) {
+		filter_index = if_nametoindex(filter_dev);
+		if (filter_index == 0) {
+			fprintf(stderr, "Cannot find device \"%s\"\n",
+				filter_dev);
+			return -1;
+		}
+	}
+
+	if (rtnl_wilddump_request(&rth, PF_BRIDGE, RTM_GETMDB) < 0) {
+		perror("Cannot send dump request");
+		exit(1);
+	}
+
+	if (rtnl_dump_filter(&rth, print_mdb, stdout) < 0) {
+		fprintf(stderr, "Dump terminated\n");
+		exit(1);
+	}
+
+	return 0;
+}
+
+int do_mdb(int argc, char **argv)
+{
+	ll_init_map(&rth);
+
+	if (argc > 0) {
+		if (matches(*argv, "show") == 0 ||
+		    matches(*argv, "lst") == 0 ||
+		    matches(*argv, "list") == 0)
+			return mdb_show(argc-1, argv+1);
+		if (matches(*argv, "help") == 0)
+			usage();
+	} else
+		return mdb_show(0, NULL);
+
+	fprintf(stderr, "Command \"%s\" is unknown, try \"bridge mdb help\".\n", *argv);
+	exit(-1);
+}

^ permalink raw reply related

* Re: [PATCH net 1/3] inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
From: Neal Cardwell @ 2012-12-10  3:40 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, Netdev
In-Reply-To: <20121209.190129.2104270281417362831.davem@davemloft.net>

On Sun, Dec 9, 2012 at 7:01 PM, David Miller <davem@davemloft.net> wrote:
> From: Neal Cardwell <ncardwell@google.com>
> Date: Sun,  9 Dec 2012 00:43:21 -0500
>
>> Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
>> instantiated for IPv4 traffic and in the SYN-RECV state were actually
>> created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
>> means that for such connections inet6_rsk(req) returns a pointer to a
>> random spot in memory up to roughly 64KB beyond the end of the
>> request_sock.
>>
>> With this bug, for a server using AF_INET6 TCP sockets and serving
>> IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
>> inet_diag_fill_req() causing an oops or the export to user space of 16
>> bytes of kernel memory as a garbage IPv6 address, depending on where
>> the garbage inet6_rsk(req) pointed.
>>
>> Signed-off-by: Neal Cardwell <ncardwell@google.com>
>
> Applied.

Thanks, David.

neal

^ permalink raw reply

* Re: [PATCH v4] iproute2: add mdb sub-command to bridge
From: Cong Wang @ 2012-12-10  3:13 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, bridge, Herbert Xu, Jesper Dangaard Brouer, Thomas Graf,
	David S. Miller
In-Reply-To: <1a884874-7123-401f-a53c-cc8e301b1ffe@tahiti.vyatta.com>

On Fri, 2012-12-07 at 09:07 -0800, Stephen Hemminger wrote:
> 
> > 
> > 	# ./bridge/bridge mdb
> > 	bridge br0:
> > 	port eth0, group 224.8.8.9
> > 	port eth1, group 224.8.8.8
> 
> I like the ability to see what is going on. It would
> be good if there was also a monitor like hook to see new entries
> created and destroyed (later version).

I am working on a patch to implement RTM_NEWMDB and RTM_DELMDB, so we
will have this.

> 
> The output format should be one line per entry for easier
> parsing by utilities. Something similar to 'ip neigh show'
> 

Right, I will update this in v5.

Thanks!

^ permalink raw reply

* linux-next: build failure after merge of the virtio tree
From: Stephen Rothwell @ 2012-12-10  2:37 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-next, linux-kernel, Jason Wang, David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 1583 bytes --]

Hi Rusty,

After merging the virtio tree, today's linux-next build (x86_64
allmodconfig) failed like this:

drivers/net/virtio_net.c: In function 'vq2txq':
drivers/net/virtio_net.c:150:2: error: implicit declaration of function 'virtqueue_get_queue_index' [-Werror=implicit-function-declaration]

Caused by commit 986a4f4d452d ("virtio_net: multiqueue support") from the
net-next tree interacting with commit 105e892960e1 ("virtio: move
queue_index and num_free fields into core struct virtqueue") from the
virtio tree.

I applied the patch below and can carry it as necessary.

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 10 Dec 2012 13:33:57 +1100
Subject: [PATCH] virtnet: for up for virtqueue_get_queue_index removal

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 drivers/net/virtio_net.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 33d6f6f..8afe32d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -147,7 +147,7 @@ struct padded_vnet_hdr {
  */
 static int vq2txq(struct virtqueue *vq)
 {
-	return (virtqueue_get_queue_index(vq) - 1) / 2;
+	return (vq->index - 1) / 2;
 }
 
 static int txq2vq(int txq)
@@ -157,7 +157,7 @@ static int txq2vq(int txq)
 
 static int vq2rxq(struct virtqueue *vq)
 {
-	return virtqueue_get_queue_index(vq) / 2;
+	return vq->index / 2;
 }
 
 static int rxq2vq(int rxq)
-- 
1.7.10.280.gaa39

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related

* linux-next: manual merge of the virtio tree with the net-next tree
From: Stephen Rothwell @ 2012-12-10  2:28 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-next, linux-kernel, Jason Wang, David Miller, netdev,
	Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 6360 bytes --]

Hi Rusty,

Today's linux-next merge of the virtio tree got a conflict in
drivers/net/virtio_net.c between commit e9d7417b97f4 ("virtio-net:
separate fields of sending/receiving queue from virtnet_info") and
986a4f4d452d ("virtio_net: multiqueue support") from the net-next tree
and commit a89f05573fa2 ("virtio-net: remove unused skb_vnet_hdr->num_sg
field"), 2c6d439a7316 ("virtio-net: correct capacity math on ring full"),
e794093a52cd ("virtio_net: don't rely on virtqueue_add_buf() returning
capacity") and 7dc5f95d9b6c ("virtio: net: make it clear that
virtqueue_add_buf() no longer returns > 0") from the virtio tree.

I fixed it up (I think - see below) and can carry the fix as necessary
(no action is required).

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc drivers/net/virtio_net.c
index a644eeb,6289891..0000000
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@@ -523,20 -464,21 +522,21 @@@ static bool try_fill_recv(struct receiv
  
  	do {
  		if (vi->mergeable_rx_bufs)
 -			err = add_recvbuf_mergeable(vi, gfp);
 +			err = add_recvbuf_mergeable(rq, gfp);
  		else if (vi->big_packets)
 -			err = add_recvbuf_big(vi, gfp);
 +			err = add_recvbuf_big(rq, gfp);
  		else
 -			err = add_recvbuf_small(vi, gfp);
 +			err = add_recvbuf_small(rq, gfp);
  
  		oom = err == -ENOMEM;
- 		if (err < 0)
+ 		if (err)
  			break;
 -		++vi->num;
 -	} while (vi->rvq->num_free);
 +		++rq->num;
- 	} while (err > 0);
++	} while (rq->vq->num_free);
+ 
 -	if (unlikely(vi->num > vi->max))
 -		vi->max = vi->num;
 -	virtqueue_kick(vi->rvq);
 +	if (unlikely(rq->num > rq->max))
 +		rq->max = rq->num;
 +	virtqueue_kick(rq->vq);
  	return !oom;
  }
  
@@@ -625,29 -557,13 +625,29 @@@ again
  	return received;
  }
  
 -static void free_old_xmit_skbs(struct virtnet_info *vi)
 +static int virtnet_open(struct net_device *dev)
 +{
 +	struct virtnet_info *vi = netdev_priv(dev);
 +	int i;
 +
 +	for (i = 0; i < vi->max_queue_pairs; i++) {
 +		/* Make sure we have some buffers: if oom use wq. */
 +		if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
 +			schedule_delayed_work(&vi->refill, 0);
 +		virtnet_napi_enable(&vi->rq[i]);
 +	}
 +
 +	return 0;
 +}
 +
- static unsigned int free_old_xmit_skbs(struct send_queue *sq)
++static void free_old_xmit_skbs(struct send_queue *sq)
  {
  	struct sk_buff *skb;
- 	unsigned int len, tot_sgs = 0;
+ 	unsigned int len;
 +	struct virtnet_info *vi = sq->vq->vdev->priv;
  	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
  
 -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 +	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
  		pr_debug("Sent skb %p\n", skb);
  
  		u64_stats_update_begin(&stats->tx_syncp);
@@@ -655,17 -571,15 +655,16 @@@
  		stats->tx_packets++;
  		u64_stats_update_end(&stats->tx_syncp);
  
- 		tot_sgs += skb_vnet_hdr(skb)->num_sg;
  		dev_kfree_skb_any(skb);
  	}
- 	return tot_sgs;
  }
  
 -static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 +static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
  {
  	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
  	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
+ 	unsigned num_sg;
 +	struct virtnet_info *vi = sq->vq->vdev->priv;
  
  	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
  
@@@ -700,42 -614,32 +699,35 @@@
  
  	/* Encode metadata header at front. */
  	if (vi->mergeable_rx_bufs)
 -		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
 +		sg_set_buf(sq->sg, &hdr->mhdr, sizeof hdr->mhdr);
  	else
 -		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
 +		sg_set_buf(sq->sg, &hdr->hdr, sizeof hdr->hdr);
  
- 	hdr->num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
- 	return virtqueue_add_buf(sq->vq, sq->sg, hdr->num_sg,
 -	num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
 -	return virtqueue_add_buf(vi->svq, vi->tx_sg, num_sg,
++	num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
++	return virtqueue_add_buf(sq->vq, sq->sg, num_sg,
  				 0, skb, GFP_ATOMIC);
  }
  
  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
  {
  	struct virtnet_info *vi = netdev_priv(dev);
 +	int qnum = skb_get_queue_mapping(skb);
 +	struct send_queue *sq = &vi->sq[qnum];
- 	int capacity;
+ 	int err;
  
  	/* Free up any pending old buffers before queueing new ones. */
 -	free_old_xmit_skbs(vi);
 +	free_old_xmit_skbs(sq);
  
  	/* Try to transmit */
- 	capacity = xmit_skb(sq, skb);
- 
- 	/* This can happen with OOM and indirect buffers. */
- 	if (unlikely(capacity < 0)) {
- 		if (likely(capacity == -ENOMEM)) {
- 			if (net_ratelimit())
- 				dev_warn(&dev->dev,
- 					 "TXQ (%d) failure: out of memory\n",
- 					 qnum);
- 		} else {
- 			dev->stats.tx_fifo_errors++;
- 			if (net_ratelimit())
- 				dev_warn(&dev->dev,
- 					 "Unexpected TXQ (%d) failure: %d\n",
- 					 qnum, capacity);
- 		}
 -	err = xmit_skb(vi, skb);
++	err = xmit_skb(sq, skb);
+ 
+ 	/* This should not happen! */
+ 	if (unlikely(err)) {
+ 		dev->stats.tx_fifo_errors++;
+ 		if (net_ratelimit())
+ 			dev_warn(&dev->dev,
 -				 "Unexpected TX queue failure: %d\n", err);
++				 "Unexpected TXQ (%d) failure: %d\n",
++				 qnum, err);
  		dev->stats.tx_dropped++;
  		kfree_skb(skb);
  		return NETDEV_TX_OK;
@@@ -748,14 -652,14 +740,13 @@@
  
  	/* Apparently nice girls don't return TX_BUSY; stop the queue
  	 * before it gets out of hand.  Naturally, this wastes entries. */
- 	if (capacity < 2+MAX_SKB_FRAGS) {
 -	if (vi->svq->num_free < 2+MAX_SKB_FRAGS) {
 -		netif_stop_queue(dev);
 -		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
++	if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
 +		netif_stop_subqueue(dev, qnum);
 +		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
  			/* More just got used, free them then recheck. */
- 			capacity += free_old_xmit_skbs(sq);
- 			if (capacity >= 2+MAX_SKB_FRAGS) {
 -			free_old_xmit_skbs(vi);
 -			if (vi->svq->num_free >= 2+MAX_SKB_FRAGS) {
 -				netif_start_queue(dev);
 -				virtqueue_disable_cb(vi->svq);
++			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
 +				netif_start_subqueue(dev, qnum);
 +				virtqueue_disable_cb(sq->vq);
  			}
  		}
  	}

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [Suggestion] net/atm : for sprintf, need check the total write length whether larger than a page.
From: Chen Gang @ 2012-12-10  1:39 UTC (permalink / raw)
  To: chas williams - CONTRACTOR; +Cc: David Miller, netdev
In-Reply-To: <50C1415C.2060104@asianux.com>

Hello Chas Williams:

  Excuse me:
    I am a reporter (not a reviewer), and not suitable to review another's patch.

  so:
    I suggest you to send your patch (as your willing) to the reviewers, directly.
    and need not cc to me (I am not a reviewer).

  thanks.

gchen.


于 2012年12月07日 09:07, Chen Gang 写道:
> 于 2012年12月06日 22:08, chas williams - CONTRACTOR 写道:
>> On Thu, 06 Dec 2012 09:15:10 +0800
>> Chen Gang <gang.chen@asianux.com> wrote:
>>
>>> 于 2012年12月05日 22:55, chas williams - CONTRACTOR 写道:
>>
>>>> did you mean '\0' instead of '\n'?  scnprintf() considers the trailing
>>>> '\0' when formatting.
>>>
>>>   no, originally, the end is "\n\0".
>>>
>>>   I prefer we still compatible "\n" when the contents are very large.
>>>   if count already == (PAGE_SIZE - 1), we have no chance to append "\n" to the end.
>>>
>>> -		pos += sprintf(pos, "\n");
>>> +		count += scnprintf(buf + count, PAGE_SIZE - count, "\n");
>>
>> i would make the code a bit messy to do this for not much gain.  again,
>> it isnt likely you would run into this in a normal situation.
> 
>   surely.
> 
>   thanks.
> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> 
> 


-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the pci tree
From: Stephen Rothwell @ 2012-12-10  1:08 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: linux-next, linux-kernel, Bjorn Helgaas, Emmanuel Grumbach,
	Johannes Berg

[-- Attachment #1: Type: text/plain, Size: 2120 bytes --]

Hi all,

Today's linux-next merge of the net-next tree got a conflict in
drivers/net/wireless/iwlwifi/pcie/trans.c between commit b9d146e30a2d
("iwlwifi: collapse wrapper for pcie_capability_read_word()") from the
pci tree and commit 7afe3705cd4e ("iwlwifi: continue clean up -
pcie/trans.c") from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc drivers/net/wireless/iwlwifi/pcie/trans.c
index 1dfa6be,f6c21e7..0000000
--- a/drivers/net/wireless/iwlwifi/pcie/trans.c
+++ b/drivers/net/wireless/iwlwifi/pcie/trans.c
@@@ -670,8 -94,10 +94,8 @@@ static void iwl_pcie_set_pwr_vmain(stru
  
  /* PCI registers */
  #define PCI_CFG_RETRY_TIMEOUT	0x041
 -#define PCI_CFG_LINK_CTRL_VAL_L0S_EN	0x01
 -#define PCI_CFG_LINK_CTRL_VAL_L1_EN	0x02
  
- static void iwl_apm_config(struct iwl_trans *trans)
+ static void iwl_pcie_apm_config(struct iwl_trans *trans)
  {
  	struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
  	u16 lctl;
@@@ -684,20 -110,19 +108,17 @@@
  	 * If not (unlikely), enable L0S, so there is at least some
  	 *    power savings, even without L1.
  	 */
- 
  	pcie_capability_read_word(trans_pcie->pci_dev, PCI_EXP_LNKCTL, &lctl);
 -
 -	if ((lctl & PCI_CFG_LINK_CTRL_VAL_L1_EN) ==
 -				PCI_CFG_LINK_CTRL_VAL_L1_EN) {
 +	if (lctl & PCI_EXP_LNKCTL_ASPM_L1) {
  		/* L1-ASPM enabled; disable(!) L0S */
  		iwl_set_bit(trans, CSR_GIO_REG, CSR_GIO_REG_VAL_L0S_ENABLED);
- 		dev_printk(KERN_INFO, trans->dev,
- 			   "L1 Enabled; Disabling L0S\n");
+ 		dev_info(trans->dev, "L1 Enabled; Disabling L0S\n");
  	} else {
  		/* L1-ASPM disabled; enable(!) L0S */
  		iwl_clear_bit(trans, CSR_GIO_REG, CSR_GIO_REG_VAL_L0S_ENABLED);
- 		dev_printk(KERN_INFO, trans->dev,
- 			   "L1 Disabled; Enabling L0S\n");
+ 		dev_info(trans->dev, "L1 Disabled; Enabling L0S\n");
  	}
 -	trans->pm_support = !(lctl & PCI_CFG_LINK_CTRL_VAL_L0S_EN);
 +	trans->pm_support = !(lctl & PCI_EXP_LNKCTL_ASPM_L0S);
  }
  
  /*

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH net] inet_diag: validate port comparison byte code to prevent unsafe reads
From: David Miller @ 2012-12-10  0:02 UTC (permalink / raw)
  To: ncardwell; +Cc: edumazet, netdev
In-Reply-To: <1355086980-30438-1-git-send-email-ncardwell@google.com>

From: Neal Cardwell <ncardwell@google.com>
Date: Sun,  9 Dec 2012 16:03:00 -0500

> Add logic to verify that a port comparison byte code operation
> actually has the second inet_diag_bc_op from which we read the port
> for such operations.
> 
> Previously the code blindly referenced op[1] without first checking
> whether a second inet_diag_bc_op struct could fit there. So a
> malicious user could make the kernel read 4 bytes beyond the end of
> the bytecode array by claiming to have a whole port comparison byte
> code (2 inet_diag_bc_op structs) when in fact the bytecode was not
> long enough to hold both.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Haste makes waste...

> +/* Validate a port comparison operator. */
> +static inline bool valid_port_comparison(const struct inet_diag_bc_op *op,
> +					 int len, int *min_len)
> +{
> +	/* Port comparisons put the port in a follow-on inet_diag_bc_op. */
> +	*min_len += sizeof(struct inet_diag_bc_op);
> +	if (len < *min_len)
> +		return false;
> +}
> +

I added the missing "return true" at the end of this new function.

Applied, but please be more careful and at least look at the compiler
warnings when you submit your changes.

Thanks.

^ permalink raw reply

* Re: [PATCH net 3/3] inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()
From: David Miller @ 2012-12-10  0:01 UTC (permalink / raw)
  To: ncardwell; +Cc: edumazet, netdev
In-Reply-To: <1355031803-14547-3-git-send-email-ncardwell@google.com>

From: Neal Cardwell <ncardwell@google.com>
Date: Sun,  9 Dec 2012 00:43:23 -0500

> Add logic to check the address family of the user-supplied conditional
> and the address family of the connection entry. We now do not do
> prefix matching of addresses from different address families (AF_INET
> vs AF_INET6), except for the previously existing support for having an
> IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
> maintains as-is).
> 
> This change is needed for two reasons:
> 
> (1) The addresses are different lengths, so comparing a 128-bit IPv6
> prefix match condition to a 32-bit IPv4 connection address can cause
> us to unwittingly walk off the end of the IPv4 address and read
> garbage or oops.
> 
> (2) The IPv4 and IPv6 address spaces are semantically distinct, so a
> simple bit-wise comparison of the prefixes is not meaningful, and
> would lead to bogus results (except for the IPv4-mapped IPv6 case,
> which this commit maintains).
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Applied.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox