Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [net v7 1/8] i40e: main driver core
From: Joe Perches @ 2013-09-11 12:16 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: davem, Jesse Brandeburg, netdev, gospo, sassmann, Shannon Nelson,
	PJ Waskiewicz, e1000-devel
In-Reply-To: <1378893056-4821-2-git-send-email-jeffrey.t.kirsher@intel.com>

On Wed, 2013-09-11 at 02:50 -0700, Jeff Kirsher wrote:
> From: Jesse Brandeburg <jesse.brandeburg@intel.com>
> 
> This is the driver for the Intel(R) Ethernet Controller XL710 Family.

trivial:

> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
[]
> +int i40e_allocate_dma_mem_d(struct i40e_hw *hw, struct i40e_dma_mem *mem,
> +			    u64 size, u32 alignment)
> +{
> +	struct i40e_pf *pf = (struct i40e_pf *)hw->back;
> +
> +	mem->size = ALIGN(size, alignment);
> +	mem->va = dma_zalloc_coherent(&pf->pdev->dev, mem->size,
> +				      &mem->pa, GFP_KERNEL);
> +	if (mem->va)
> +		return 0;
> +
> +	return -ENOMEM;
> +}

It's much more common to use

	foo = alloc(...)
	if (!foo)
		return -ENOMEM;

	...

	return 0;

[]

> +int i40e_allocate_virt_mem_d(struct i40e_hw *hw, struct i40e_virt_mem *mem,
> +			     u32 size)
> +{
> +	mem->size = size;
> +	mem->va = kzalloc(size, GFP_KERNEL);
> +
> +	if (mem->va)
> +		return 0;
> +
> +	return -ENOMEM;
> +}

here too.

> +static int i40e_get_lump(struct i40e_pf *pf, struct i40e_lump_tracking *pile,
> +			 u16 needed, u16 id)
> +{
> +	int ret = -ENOMEM;
> +	int i = 0;
> +	int j = 0;
> +
> +	if (!pile || needed == 0 || id >= I40E_PILE_VALID_BIT) {
> +		dev_info(&pf->pdev->dev,
> +			 "param err: pile=%p needed=%d id=0x%04x\n",
> +			 pile, needed, id);
> +		return -EINVAL;
> +	}
> +
> +	/* start the linear search with an imperfect hint */
> +	i = pile->search_hint;
> +	while (i < pile->num_entries && ret < 0) {
> +		/* skip already allocated entries */
> +		if (pile->list[i] & I40E_PILE_VALID_BIT) {
> +			i++;
> +			continue;
> +		}
> +
> +		/* do we have enough in this lump? */
> +		for (j = 0; (j < needed) && ((i+j) < pile->num_entries); j++) {
> +			if (pile->list[i+j] & I40E_PILE_VALID_BIT)
> +				break;
> +		}
> +
> +		if (j == needed) {
> +			/* there was enough, so assign it to the requestor */
> +			for (j = 0; j < needed; j++)
> +				pile->list[i+j] = id | I40E_PILE_VALID_BIT;
> +			ret = i;
> +			pile->search_hint = i + j;

I think it'd read better with a break;
removing the ret < 0 test from the while loop above

> +		} else {
> +			/* not enough, so skip over it and continue looking */
> +			i += j;
> +		}
> +	}
> +
> +	return ret;
> +}
> +

[]

> +int i40e_sync_vsi_filters(struct i40e_vsi *vsi)
> +{

[]

> +	while (test_and_set_bit(__I40E_CONFIG_BUSY, &vsi->state))
> +		usleep_range(1000, 2000);

Possible hardware fault sleeps forever?

[]

> +/**
> + * i40e_find_filter - Search VSI filter list for specific mac/vlan filter
> + * @vsi: the VSI to be searched
> + * @macaddr: the MAC address
> + * @vlan: the vlan
> + * @is_vf: make sure its a vf filter, else doesn't matter
> + * @is_netdev: make sure its a netdev filter, else doesn't matter
> + *
> + * Returns ptr to the filter object or NULL

kernel-doc uses a "* Return: " style keyword.
(there are lots of these)

from Documentation/kernel-doc-nano-HOWTO.ext

Example kernel-doc function comment:

/**
 * foobar() - short function description of foobar
 * @arg1:	Describe the first argument to foobar.
 * @arg2:	Describe the second argument to foobar.
 *		One can provide multiple line descriptions
 *		for arguments.
 *
 * A longer description, with more discussion of the function foobar()
 * that might be useful to those using or modifying it.  Begins with
 * empty comment line, and may include additional embedded empty
 * comment lines.
 *
 * The longer description can have multiple paragraphs.
 *
 * Return: Describe the return value of foobar.
 */

[]

> +/**
> + * i40e_vsi_kill_vlan - Remove vsi membership for given vlan
> + * @vsi: the vsi being configured
> + * @vid: vlan id to be removed (0 = untagged only , -1 = any)

* Return: 0 on success or negative error code otherwise

[]

> +/**
> + * i40e_vlan_rx_add_vid - Add a vlan id filter to HW offload
> + * @netdev: network interface to be adjusted
> + * @vid: vlan id to be added
> + **/
> +static int i40e_vlan_rx_add_vid(struct net_device *netdev,
> +				__always_unused __be16 proto, u16 vid)
> +{
> +	struct i40e_netdev_priv *np = netdev_priv(netdev);
> +	struct i40e_vsi *vsi = np->vsi;
> +	int ret;
> +
> +	if (vid > 4095)
> +		return 0;

return -err or...

> +
> +	netdev_info(vsi->netdev, "adding %pM vid=%d\n",
> +		    netdev->dev_addr, vid);
> +	/* If the network stack called us with vid = 0, we should
> +	 * indicate to i40e_vsi_add_vlan() that we want to receive
> +	 * any traffic (i.e. with any vlan tag, or untagged)
> +	 */
> +	ret = i40e_vsi_add_vlan(vsi, vid ? vid : I40E_VLAN_ANY);
> +
> +	if (!ret) {
> +		if (vid < VLAN_N_VID)
> +			set_bit(vid, vsi->active_vlans);
> +	}
> +
> +	return 0;

make it a void return instead?

> +}
> +
> +/**
> + * i40e_vlan_rx_kill_vid - Remove a vlan id filter from HW offload
> + * @netdev: network interface to be adjusted
> + * @vid: vlan id to be removed
> + **/
> +static int i40e_vlan_rx_kill_vid(struct net_device *netdev,
> +				 __always_unused __be16 proto, u16 vid)
> +{
> +	struct i40e_netdev_priv *np = netdev_priv(netdev);
> +	struct i40e_vsi *vsi = np->vsi;
> +
> +	netdev_info(vsi->netdev, "removing %pM vid=%d\n",
> +		    netdev->dev_addr, vid);
> +	/* return code is ignored as there is nothing a user
> +	 * can do about failure to remove and a log message was
> +	 * already printed from another function
> +	 */
> +	i40e_vsi_kill_vlan(vsi, vid);
> +
> +	clear_bit(vid, vsi->active_vlans);
> +	return 0;

here too?  There are others below.
Maybe change all the functions that have
a fixed return 0 to void?

> +/**
> + * i40e_dcb_get_num_tc -  Get the number of TCs from DCBx config
> + * @dcbcfg: the corresponding DCBx configuration structure
> + *
> + * Return the number of TCs from given DCBx configuration
> + **/
> +static u8 i40e_dcb_get_num_tc(struct i40e_dcbx_config *dcbcfg)
> +{
> +	int num_tc = 0, i;
> +
> +	/* Scan the ETS Config Priority Table to find
> +	 * traffic class enabled for a given priority
> +	 * and use the traffic class index to get the
> +	 * number of traffic classes enabled
> +	 */
> +	for (i = 0; i < I40E_MAX_USER_PRIORITY; i++) {
> +		if (dcbcfg->etscfg.prioritytable[i] > num_tc)
> +			num_tc = dcbcfg->etscfg.prioritytable[i];
> +	}
> +
> +	/* Traffic class index starts from zero so
> +	 * increment to return the actual count
> +	 */
> +	num_tc++;
> +
> +	return num_tc;

There's a possible loss of precision here.
why isn't num_tc u8 like the function below?

> +}
> +
> +/**
> + * i40e_dcb_get_enabled_tc - Get enabled traffic classes
> + * @dcbcfg: the corresponding DCBx configuration structure
> + *
> + * Query the current DCB configuration and return the number of
> + * traffic classes enabled from the given DCBX config
> + **/
> +static u8 i40e_dcb_get_enabled_tc(struct i40e_dcbx_config *dcbcfg)
> +{
> +	u8 num_tc = i40e_dcb_get_num_tc(dcbcfg);
> +	u8 enabled_tc = 1;
> +	u8 i;
> +
> +	for (i = 0; i < num_tc; i++)
> +		enabled_tc |= 1 << i;
> +
> +	return enabled_tc;
> +}

^ permalink raw reply

* Re: [net  8/8] i40e: include i40e in kernel proper
From: Joe Perches @ 2013-09-11 12:20 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: davem, Jesse Brandeburg, netdev, gospo, sassmann, Shannon Nelson,
	PJ Waskiewicz, e1000-devel
In-Reply-To: <1378893056-4821-9-git-send-email-jeffrey.t.kirsher@intel.com>

On Wed, 2013-09-11 at 02:50 -0700, Jeff Kirsher wrote:
> New driver build option is CONFIG_I40E

> diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
[]
> +i40e.txt
> +	- README for the Intel Ethernet Controller XL710 Driver (i40e).

Just curious but why the XL710 / i40e name mismatch?

^ permalink raw reply

* Re: 3.10.0 network trace
From: Josh Boyer @ 2013-09-11 12:49 UTC (permalink / raw)
  To: Michael Sterrett; +Cc: Yuchung Cheng, netdev
In-Reply-To: <CA+rTEUOXAH4TGMGvgBJ6FktA6E=hFhQmFHRf4Zaw5YJ6giOpOA@mail.gmail.com>

On Sun, Jul 21, 2013 at 9:36 PM, Michael Sterrett <michael@sterretts.net> wrote:
> Since upgraded to 3.10.1 which exhibits the same issue (probably not
> surprising).
>
> sysctl net.ipv4.tcp_frto=0 net.ipv4.tcp_early_retrans=3 seems to take
> care of it.

We're still getting reports of this with 3.10.10 in Fedora [1].  At
least one of the reporters has said those sysctl settings didn't help.

Is there anything else that can be gathered to help track this down?

josh

[1] https://bugzilla.redhat.com/show_bug.cgi?id=989251

^ permalink raw reply

* RE: usbnet transmit path problems
From: David Laight @ 2013-09-11 12:56 UTC (permalink / raw)
  To: Ming Lei, Oliver Neukum; +Cc: netdev
In-Reply-To: <20130911193430.0825ac95@tom-ThinkPad-T410>

> > > 2) If 'length % dev->maxpacket == 0' for a multi-fragment packet then
> > >    the extra byte isn't added correctly (the code probably falls off
> > >    the end of the scatter-gather list).
> >
> > Indeed. Ming Lei, should usbnet handle this in the sg case or better
> > leave it to the subdriver you introduced this for?

Is the ZLP issue a problem with the host or with the target?
If it is a host problem then the necessity comes from the host,
but the fix needs to be target dependant.
If it is a common target problem then generic code can apply
a common fix.

AFICT there are at least 3 fixes:
1) Extend the ethernet frame by one byte and hope the receiving
   system doesn't object to the padding.
   This is probably the only option if tx_fixup() doesn't
   add a header.
2) Put the ethernet frame length in the header and have the
   target discard the added pad byte (ax88179_178a.c).
3) Add a second zero-length frame in the same USB data block
   (ax88172a.c).

Only the third requires that tx_fixup() append to the packet.
For the other two actual pad can be added by usbnet.

> IMO, it should be handled by usbnet, could you comment on below patch?

Seems excessive to kmalloc() one byte!
If you can't assume that the 'dev' structure itself can be dma'd from
allocate the extra byte in the sg list.

	David

^ permalink raw reply

* RE: usbnet transmit path problems
From: David Laight @ 2013-09-11 13:00 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: netdev, Ming Lei
In-Reply-To: <1378894312.28160.15.camel@linux-fkkt.site>

> > 4) I read that USB3 has a different scheme for terminating bulk data
> >    that is a multiple of the packet size.
> >    Does this mean that the pad byte isn't needed for USB3?
> >    (Or are USB3 controllers/targets just as buggy?)
> 
> We don't have enough examples to tell.

Perhaps there should be a separate flag for USB3?
A device might have issues on USB2, but not USB3!

The flag is target specific - so I presume it isn't
a host/sending problem.

	David


^ permalink raw reply

* [PATCH 1/1] net: sched: Make netns available for ematch extensions
From: Jozsef Kadlecsik @ 2013-09-11 13:20 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, V. Lavrov
In-Reply-To: <1378905614-29691-1-git-send-email-kadlec@blackhole.kfki.hu>

Ematch API (change, destroy) doesn't pass netns data to ematch extensions.
This prevents adding netns support to ipset, which is an ematch too.

The patch adds the required pointer to "struct tcf_proto", thus
making it available for every ematch.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/net/sch_generic.h |    1 +
 net/sched/cls_api.c       |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f4eb365..38e5e4b 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -225,6 +225,7 @@ struct tcf_proto {
 	struct Qdisc		*q;
 	void			*data;
 	const struct tcf_proto_ops	*ops;
+	struct net		*net;
 };

 struct qdisc_skb_cb {
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 8e118af..1b1cb11 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -266,6 +266,7 @@ replay:
 		tp->q = q;
 		tp->classify = tp_ops->classify;
 		tp->classid = parent;
+		tp->net = net;

 		err = tp_ops->init(tp);
 		if (err != 0) {
-- 
1.7.0.4

^ permalink raw reply related

* [PATCH 0/1] Make netns available for ematch extensions
From: Jozsef Kadlecsik @ 2013-09-11 13:20 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, V. Lavrov

Hi Dave,

ipset needs netns support and ipset is supported as an ematch extension.
However netns data is not available for ematch extensions. In the next
patch I tried to find the less obtrusive way to pass the required data
to the extensions. Please consider applying it to net-next.

Best regards,
Jozsef

Jozsef Kadlecsik (1):
  net: sched: Make netns available for ematch extensions

 include/net/sch_generic.h |    1 +
 net/sched/cls_api.c       |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

^ permalink raw reply

* usbnet transmit scatter-gather
From: David Laight @ 2013-09-11 13:19 UTC (permalink / raw)
  To: netdev

usbnet and ax88179_178a.c where changed fairly recently to support
scatter-gather transmit.
(This would let TSO be re-enabled.)

The change removed the skb_serialize() from ax88179_178a.c.
This means that things will go horribly wrong if the hardware
doesn't support SG (my i7 box must support SG), I'm sure
I can find something that doesn't :-)

I think the interface to usbnet needs changing so that:
1) The driver reports whether it can process segmented skb.
2) If the driver can't process segmented skb, usbnet will
   call skb_serialize() prior to calling tx_fixup().
3) If the usb hardware can't do SG, usbnet will call
   skb_serialize() after calling tx_fixup().

Would it make sense for usbnet mask out TSO support
when SG dma is unavailable?

	David

^ permalink raw reply

* [PATCHv2 net] xen-netback: count number required slots for an skb more carefully
From: David Vrabel @ 2013-09-11 13:52 UTC (permalink / raw)
  To: xen-devel
  Cc: David Vrabel, Konrad Rzeszutek Wilk, Boris Ostrovsky, netdev,
	Ian Campbell

From: David Vrabel <david.vrabel@citrix.com>

When a VM is providing an iSCSI target and the LUN is used by the
backend domain, the generated skbs for direct I/O writes to the disk
have large, multi-page skb->data but no frags.

With some lengths and starting offsets, xen_netbk_count_skb_slots()
would be one short because the simple calculation of
DIV_ROUND_UP(skb_headlen(), PAGE_SIZE) was not accounting for the
decisions made by start_new_rx_buffer() which does not guarantee
responses are fully packed.

For example, a skb with length < 2 pages but which spans 3 pages would
be counted as requiring 2 slots but would actually use 3 slots.

skb->data:

    |        1111|222222222222|3333        |

Fully packed, this would need 2 slots:

    |111122222222|22223333    |

But because the 2nd page wholy fits into a slot it is not split across
slots and goes into a slot of its own:

    |1111        |222222222222|3333        |

Miscounting the number of slots means netback may push more responses
than the number of available requests.  This will cause the frontend
to get very confused and report "Too many frags/slots".  The frontend
never recovers and will eventually BUG.

Fix this by counting the number of required slots more carefully.  In
xen_netbk_count_skb_slots(), more closely follow the algorithm used by
xen_netbk_gop_skb() by introducing xen_netbk_count_frag_slots() which
is the dry-run equivalent of netbk_gop_frag_copy().

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
--
v2: xen_netbk_* -> xenvif_* to match new style
---
 drivers/net/xen-netback/netback.c |   94 +++++++++++++++++++++++++------------
 1 files changed, 64 insertions(+), 30 deletions(-)

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 956130c..f3e591c 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -212,6 +212,49 @@ static bool start_new_rx_buffer(int offset, unsigned long size, int head)
 	return false;
 }
 
+struct xenvif_count_slot_state {
+	unsigned long copy_off;
+	bool head;
+};
+
+unsigned int xenvif_count_frag_slots(struct xenvif *vif,
+				     unsigned long offset, unsigned long size,
+				     struct xenvif_count_slot_state *state)
+{
+	unsigned count = 0;
+
+	offset &= ~PAGE_MASK;
+
+	while (size > 0) {
+		unsigned long bytes;
+
+		bytes = PAGE_SIZE - offset;
+
+		if (bytes > size)
+			bytes = size;
+
+		if (start_new_rx_buffer(state->copy_off, bytes, state->head)) {
+			count++;
+			state->copy_off = 0;
+		}
+
+		if (state->copy_off + bytes > MAX_BUFFER_OFFSET)
+			bytes = MAX_BUFFER_OFFSET - state->copy_off;
+
+		state->copy_off += bytes;
+
+		offset += bytes;
+		size -= bytes;
+
+		if (offset == PAGE_SIZE)
+			offset = 0;
+
+		state->head = false;
+	}
+
+	return count;
+}
+
 /*
  * Figure out how many ring slots we're going to need to send @skb to
  * the guest. This function is essentially a dry run of
@@ -219,48 +262,39 @@ static bool start_new_rx_buffer(int offset, unsigned long size, int head)
  */
 unsigned int xenvif_count_skb_slots(struct xenvif *vif, struct sk_buff *skb)
 {
+	struct xenvif_count_slot_state state;
 	unsigned int count;
-	int i, copy_off;
+	unsigned char *data;
+	unsigned i;
 
-	count = DIV_ROUND_UP(skb_headlen(skb), PAGE_SIZE);
+	state.head = true;
+	state.copy_off = 0;
 
-	copy_off = skb_headlen(skb) % PAGE_SIZE;
+	/* Slot for the first (partial) page of data. */
+	count = 1;
 
+	/* Need a slot for the GSO prefix for GSO extra data? */
 	if (skb_shinfo(skb)->gso_size)
 		count++;
 
-	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-		unsigned long size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
-		unsigned long offset = skb_shinfo(skb)->frags[i].page_offset;
-		unsigned long bytes;
-
-		offset &= ~PAGE_MASK;
-
-		while (size > 0) {
-			BUG_ON(offset >= PAGE_SIZE);
-			BUG_ON(copy_off > MAX_BUFFER_OFFSET);
-
-			bytes = PAGE_SIZE - offset;
-
-			if (bytes > size)
-				bytes = size;
+	data = skb->data;
+	while (data < skb_tail_pointer(skb)) {
+		unsigned long offset = offset_in_page(data);
+		unsigned long size = PAGE_SIZE - offset;
 
-			if (start_new_rx_buffer(copy_off, bytes, 0)) {
-				count++;
-				copy_off = 0;
-			}
+		if (data + size > skb_tail_pointer(skb))
+			size = skb_tail_pointer(skb) - data;
 
-			if (copy_off + bytes > MAX_BUFFER_OFFSET)
-				bytes = MAX_BUFFER_OFFSET - copy_off;
+		count += xenvif_count_frag_slots(vif, offset, size, &state);
 
-			copy_off += bytes;
+		data += size;
+	}
 
-			offset += bytes;
-			size -= bytes;
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		unsigned long size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
+		unsigned long offset = skb_shinfo(skb)->frags[i].page_offset;
 
-			if (offset == PAGE_SIZE)
-				offset = 0;
-		}
+		count += xenvif_count_frag_slots(vif, offset, size, &state);
 	}
 	return count;
 }
-- 
1.7.2.5

^ permalink raw reply related

* Re: [PATCH 1/1] net: sched: Make netns available for ematch extensions
From: Eric Dumazet @ 2013-09-11 14:31 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: David Miller, netdev, V. Lavrov
In-Reply-To: <1378905614-29691-2-git-send-email-kadlec@blackhole.kfki.hu>

On Wed, 2013-09-11 at 15:20 +0200, Jozsef Kadlecsik wrote:
> Ematch API (change, destroy) doesn't pass netns data to ematch extensions.
> This prevents adding netns support to ipset, which is an ematch too.
> 
> The patch adds the required pointer to "struct tcf_proto", thus
> making it available for every ematch.
> 
> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> ---
>  include/net/sch_generic.h |    1 +
>  net/sched/cls_api.c       |    1 +
>  2 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index f4eb365..38e5e4b 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -225,6 +225,7 @@ struct tcf_proto {
>  	struct Qdisc		*q;
>  	void			*data;
>  	const struct tcf_proto_ops	*ops;
> +	struct net		*net;
>  };
>  

Note that qdisc_dev(tp->q) should give you the pointer to device

Then ->nd_net gives you the struct net pointer.

On management path, this should be enough ;)

^ permalink raw reply

* Re: [PATCH 1/1] net: sched: Make netns available for ematch extensions
From: Eric Dumazet @ 2013-09-11 14:32 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: David Miller, netdev, V. Lavrov
In-Reply-To: <1378909915.21474.2.camel@edumazet-glaptop>

On Wed, 2013-09-11 at 07:31 -0700, Eric Dumazet wrote:

> 
> Note that qdisc_dev(tp->q) should give you the pointer to device
> 
> Then ->nd_net gives you the struct net pointer.
> 
> On management path, this should be enough ;)

(A device can be moved from net xxxx to net yyyy)

^ permalink raw reply

* Re: [net  8/8] i40e: include i40e in kernel proper
From: Jeff Kirsher @ 2013-09-11 14:32 UTC (permalink / raw)
  To: Joe Perches
  Cc: davem, Jesse Brandeburg, netdev, gospo, sassmann, Shannon Nelson,
	PJ Waskiewicz, e1000-devel
In-Reply-To: <1378902002.606.42.camel@joe-AO722>

[-- Attachment #1: Type: text/plain, Size: 626 bytes --]

On Wed, 2013-09-11 at 05:20 -0700, Joe Perches wrote:
> On Wed, 2013-09-11 at 02:50 -0700, Jeff Kirsher wrote:
> > New driver build option is CONFIG_I40E
> 
> > diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
> []
> > +i40e.txt
> > +	- README for the Intel Ethernet Controller XL710 Driver (i40e).
> 
> Just curious but why the XL710 / i40e name mismatch?

i40e stands for Intel 40 GbE Ethernet which is more generic than naming
the driver the same as the first part (XL710).  That way when future
silicon is made, we are not stuck with a driver named after previous
silicon.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* How to enable extended Ethtool support for SOCs like the i.mx6
From: Christian Ege @ 2013-09-11 14:37 UTC (permalink / raw)
  To: netdev

Hello

I am currently developping on a freescale i.mx6 custom design with a
Micrel ksz9031 Phy. Both are supported by Linux. The adjustment of
speed and duplex modes can be done by ethtool without any problems.

Then our hardware department started a request to toggle mdi and mdi-x
modes,this is done through the vendor specific Register 0x1C, of the
PHY. The PHY itself supports this feature. But I can not find any
clear way to add this support in the Kernel.

The ethtool communicates over the freescale fec driver with the PHY.
The FEC Driver itself does not really know what PHY it is attached to.
The only examples about the mdi/mdi-x features in the kernel are
located in the intel drivers which I think do know which PHY
capabilities they can rely on.

What is the preferred way to make this settings available to the user-space?

Is extending the phy_device struct a way to go?

with kind regards,

Christian Ege

^ permalink raw reply

* Re: Use-after-free in TUNSETIFF
From: Ben Hutchings @ 2013-09-11 14:44 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Wannes Rombouts, davem, jasowang, mst, edumazet, nhorman, netdev,
	Kevin Soules
In-Reply-To: <20130910173257.47e2dc08@nehalam.linuxnetplumber.net>

On Tue, 2013-09-10 at 17:32 -0700, Stephen Hemminger wrote:
[...]
> [1] A user with CAP_NET_ADMIN can basically hose the system many other ways.
> Capabilities are a failed security model.
> Almost all distro's limit CAP_NET_ADMIN to root anyway.

tun uses ns_capable(), not capable().  If user namespaces are enabled
then I think any user can create their own user & net namespaces, be
'root' in those namespaces and then invoke TUNSETIFF successfully.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [net  8/8] i40e: include i40e in kernel proper
From: Joe Perches @ 2013-09-11 14:47 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: davem, Jesse Brandeburg, netdev, gospo, sassmann, Shannon Nelson,
	PJ Waskiewicz, e1000-devel
In-Reply-To: <1378909937.2026.8.camel@jtkirshe-mobl>

On Wed, 2013-09-11 at 07:32 -0700, Jeff Kirsher wrote:
> On Wed, 2013-09-11 at 05:20 -0700, Joe Perches wrote:
> > On Wed, 2013-09-11 at 02:50 -0700, Jeff Kirsher wrote:
> > > New driver build option is CONFIG_I40E
> > 
> > > diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
> > []
> > > +i40e.txt
> > > +	- README for the Intel Ethernet Controller XL710 Driver (i40e).
> > 
> > Just curious but why the XL710 / i40e name mismatch?
> 
> i40e stands for Intel 40 GbE Ethernet which is more generic than naming
> the driver the same as the first part (XL710).

Too bad if also means that when 100 comes out
this driver won't be expected to work.

^ permalink raw reply

* Re: [PATCH nf] netfilter: use RCU safe kfree for conntrack extensions
From: Phil Oester @ 2013-09-11 14:57 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: netfilter-devel, netdev, Pablo Neira Ayuso, Patrick McHardy,
	Jozsef Kadlecsik, coreteam
In-Reply-To: <20130911090900.0129AE8AFF@unicorn.suse.cz>

On Wed, Sep 11, 2013 at 10:17:27AM +0200, Michal Kubecek wrote:
> Commit 68b80f11 (netfilter: nf_nat: fix RCU races) introduced
> RCU protection for freeing extension data when reallocation
> moves them to a new location. We need the same protection when
> freeing them in nf_ct_ext_free() in order to prevent a
> use-after-free by other threads referencing a NAT extension data
> via bysource list.

Hi Michal - 

coincidentally I've been looking into this area this week due to another
bug report (https://bugzilla.kernel.org/show_bug.cgi?id=60853).  Looking at
your proposed fix, the NAT extension data should have been cleaned
from the bysource list in nf_nat_cleanup_conntrack (via __nf_ct_ext_destroy)
before reaching the kfree.  Would you agree?

The reporter of #60853 suggested adding a synchronize_rcu to the end of the
nf_nat_cleanup_conntrack function, which seems sane.  I have been trying to 
reproduce the crash to test that theory.

Are you able to reproduce an OOPS in your testing?  Or is there a bug report
you are working from?

Phil

^ permalink raw reply

* [PATCH net v2] net: sctp: fix ipv6 ipsec encryption bug in sctp_v6_xmit
From: Daniel Borkmann @ 2013-09-11 14:58 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-sctp, adobriyan, Steffen Klassert,
	Hannes Frederic Sowa

Alan Chester reported an issue with IPv6 on SCTP that IPsec traffic is not
being encrypted, whereas on IPv4 it is. Setting up an AH + ESP transport
does not seem to have the desired effect:

SCTP + IPv4:

  22:14:20.809645 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto AH (51), length 116)
    192.168.0.2 > 192.168.0.5: AH(spi=0x00000042,sumlen=16,seq=0x1): ESP(spi=0x00000044,seq=0x1), length 72
  22:14:20.813270 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto AH (51), length 340)
    192.168.0.5 > 192.168.0.2: AH(spi=0x00000043,sumlen=16,seq=0x1):

SCTP + IPv6:

  22:31:19.215029 IP6 (class 0x02, hlim 64, next-header SCTP (132) payload length: 364)
    fe80::222:15ff:fe87:7fc.3333 > fe80::92e6:baff:fe0d:5a54.36767: sctp
    1) [INIT ACK] [init tag: 747759530] [rwnd: 62464] [OS: 10] [MIS: 10]

Moreover, Alan says:

  This problem was seen with both Racoon and Racoon2. Other people have seen
  this with OpenSwan. When IPsec is configured to encrypt all upper layer
  protocols the SCTP connection does not initialize. After using Wireshark to
  follow packets, this is because the SCTP packet leaves Box A unencrypted and
  Box B believes all upper layer protocols are to be encrypted so it drops
  this packet, causing the SCTP connection to fail to initialize. When IPsec
  is configured to encrypt just SCTP, the SCTP packets are observed unencrypted.

In fact, using `socat sctp6-listen:3333 -` on one end and transferring "plaintext"
string on the other end, results in cleartext on the wire where SCTP eventually
does not report any errors, thus in the latter case that Alan reports, the
non-paranoid user might think he's communicating over an encrypted transport on
SCTP although he's not (tcpdump ... -X):

  ...
  0x0030: 5d70 8e1a 0003 001a 177d eb6c 0000 0000  ]p.......}.l....
  0x0040: 0000 0000 706c 6169 6e74 6578 740a 0000  ....plaintext...

Only in /proc/net/xfrm_stat we can see XfrmInTmplMismatch increasing on the
receiver side. Initial follow-up analysis from Alan's bug report was done by
Alexey Dobriyan. Also thanks to Vlad Yasevich for feedback on this.

SCTP has its own implementation of sctp_v6_xmit() not calling inet6_csk_xmit().
This has the implication that it probably never really got updated along with
changes in inet6_csk_xmit() and therefore does not seem to invoke xfrm handlers.

SCTP's IPv4 xmit however, properly calls ip_queue_xmit() to do the work. Since
a call to inet6_csk_xmit() would solve this problem, but result in unecessary
route lookups, let us just use the cached flowi6 instead that we got through
sctp_v6_get_dst(). Since all SCTP packets are being sent through sctp_packet_transmit(),
we do the route lookup / flow caching in sctp_transport_route(), hold it in
tp->dst and skb_dst_set() right after that. If we would alter fl6->daddr in
sctp_v6_xmit() to np->opt->srcrt, we possibly could run into the same effect
of not having xfrm layer pick it up, hence, use fl6_update_dst() in sctp_v6_get_dst()
instead to get the correct source routed dst entry, which we assign to the skb.

Also source address routing example from 625034113 ("sctp: fix sctp to work with
ipv6 source address routing") still works with this patch! Nevertheless, in RFC5095
it is actually 'recommended' to not use that anyway due to traffic amplification [1].
So it seems we're not supposed to do that anyway in sctp_v6_xmit(). Moreover, if
we overwrite the flow destination here, the lower IPv6 layer will be unable to
put the correct destination address into IP header, as routing header is added in
ipv6_push_nfrag_opts() but then probably with wrong final destination. Things aside,
result of this patch is that we do not have any XfrmInTmplMismatch increase plus on
the wire with this patch it now looks like:

SCTP + IPv6:

  08:17:47.074080 IP6 2620:52:0:102f:7a2b:cbff:fe27:1b0a > 2620:52:0:102f:213:72ff:fe32:7eba:
    AH(spi=0x00005fb4,seq=0x1): ESP(spi=0x00005fb5,seq=0x1), length 72
  08:17:47.074264 IP6 2620:52:0:102f:213:72ff:fe32:7eba > 2620:52:0:102f:7a2b:cbff:fe27:1b0a:
    AH(spi=0x00003d54,seq=0x1): ESP(spi=0x00003d55,seq=0x1), length 296

This fixes Kernel Bugzilla 24412. This security issue seems to be present since
2.6.18 kernels. Lets just hope some big passive adversary in the wild didn't have
its fun with that. lksctp-tools IPv6 regression test suite passes as well with
this patch.

 [1] http://www.secdev.org/conf/IPv6_RH_security-csw07.pdf

Reported-by: Alan Chester <alan.chester@tekelec.com>
Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 v1->v2:
  - use cached flow
  - improved commit msg

 net/sctp/ipv6.c | 42 +++++++++++++-----------------------------
 1 file changed, 13 insertions(+), 29 deletions(-)

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index da613ce..4f52e2c 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -204,44 +204,23 @@ out:
 		in6_dev_put(idev);
 }

-/* Based on tcp_v6_xmit() in tcp_ipv6.c. */
 static int sctp_v6_xmit(struct sk_buff *skb, struct sctp_transport *transport)
 {
 	struct sock *sk = skb->sk;
 	struct ipv6_pinfo *np = inet6_sk(sk);
-	struct flowi6 fl6;
-
-	memset(&fl6, 0, sizeof(fl6));
-
-	fl6.flowi6_proto = sk->sk_protocol;
-
-	/* Fill in the dest address from the route entry passed with the skb
-	 * and the source address from the transport.
-	 */
-	fl6.daddr = transport->ipaddr.v6.sin6_addr;
-	fl6.saddr = transport->saddr.v6.sin6_addr;
-
-	fl6.flowlabel = np->flow_label;
-	IP6_ECN_flow_xmit(sk, fl6.flowlabel);
-	if (ipv6_addr_type(&fl6.saddr) & IPV6_ADDR_LINKLOCAL)
-		fl6.flowi6_oif = transport->saddr.v6.sin6_scope_id;
-	else
-		fl6.flowi6_oif = sk->sk_bound_dev_if;
-
-	if (np->opt && np->opt->srcrt) {
-		struct rt0_hdr *rt0 = (struct rt0_hdr *) np->opt->srcrt;
-		fl6.daddr = *rt0->addr;
-	}
+	struct flowi6 *fl6 = &transport->fl.u.ip6;

 	pr_debug("%s: skb:%p, len:%d, src:%pI6 dst:%pI6\n", __func__, skb,
-		 skb->len, &fl6.saddr, &fl6.daddr);
+		 skb->len, &fl6->saddr, &fl6->daddr);

-	SCTP_INC_STATS(sock_net(sk), SCTP_MIB_OUTSCTPPACKS);
+	IP6_ECN_flow_xmit(sk, fl6->flowlabel);

 	if (!(transport->param_flags & SPP_PMTUD_ENABLE))
 		skb->local_df = 1;

-	return ip6_xmit(sk, skb, &fl6, np->opt, np->tclass);
+	SCTP_INC_STATS(sock_net(sk), SCTP_MIB_OUTSCTPPACKS);
+
+	return ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
 }

 /* Returns the dst cache entry for the given source and destination ip
@@ -254,10 +233,12 @@ static void sctp_v6_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 	struct dst_entry *dst = NULL;
 	struct flowi6 *fl6 = &fl->u.ip6;
 	struct sctp_bind_addr *bp;
+	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct sctp_sockaddr_entry *laddr;
 	union sctp_addr *baddr = NULL;
 	union sctp_addr *daddr = &t->ipaddr;
 	union sctp_addr dst_saddr;
+	struct in6_addr *final_p, final;
 	__u8 matchlen = 0;
 	__u8 bmatchlen;
 	sctp_scope_t scope;
@@ -281,7 +262,8 @@ static void sctp_v6_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 		pr_debug("src=%pI6 - ", &fl6->saddr);
 	}

-	dst = ip6_dst_lookup_flow(sk, fl6, NULL, false);
+	final_p = fl6_update_dst(fl6, np->opt, &final);
+	dst = ip6_dst_lookup_flow(sk, fl6, final_p, false);
 	if (!asoc || saddr)
 		goto out;

@@ -333,10 +315,12 @@ static void sctp_v6_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 		}
 	}
 	rcu_read_unlock();
+
 	if (baddr) {
 		fl6->saddr = baddr->v6.sin6_addr;
 		fl6->fl6_sport = baddr->v6.sin6_port;
-		dst = ip6_dst_lookup_flow(sk, fl6, NULL, false);
+		final_p = fl6_update_dst(fl6, np->opt, &final);
+		dst = ip6_dst_lookup_flow(sk, fl6, final_p, false);
 	}

 out:
-- 
1.7.11.7

^ permalink raw reply related

* Re: usbnet transmit path problems
From: Ming Lei @ 2013-09-11 15:11 UTC (permalink / raw)
  To: David Laight; +Cc: Oliver Neukum, Network Development, linux-usb
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B7326@saturn3.aculab.com>

On Wed, Sep 11, 2013 at 8:56 PM, David Laight <David.Laight@aculab.com> wrote:
>> > > 2) If 'length % dev->maxpacket == 0' for a multi-fragment packet then
>> > >    the extra byte isn't added correctly (the code probably falls off
>> > >    the end of the scatter-gather list).
>> >
>> > Indeed. Ming Lei, should usbnet handle this in the sg case or better
>> > leave it to the subdriver you introduced this for?
>
> Is the ZLP issue a problem with the host or with the target?

Sorry, what do you mean the ZLP issue here? I understand Oliver
thinks one commit from me may break ZLP handling, are you discussing
this problem? If not, could you explain it in a bit detail?

> If it is a host problem then the necessity comes from the host,
> but the fix needs to be target dependant.
> If it is a common target problem then generic code can apply
> a common fix.

All usbnet device should have sent one ZLP in case the size of
bulk out transfer can be divided by max packet size, but the one
byte transfer might be introduced for avoiding some target problem
(can't deal with zlp well), as said by David, see below discussion:

   http://marc.info/?l=linux-usb&m=127067487604112&w=2

>
> AFICT there are at least 3 fixes:
> 1) Extend the ethernet frame by one byte and hope the receiving
>    system doesn't object to the padding.
>    This is probably the only option if tx_fixup() doesn't
>    add a header.
> 2) Put the ethernet frame length in the header and have the
>    target discard the added pad byte (ax88179_178a.c).
> 3) Add a second zero-length frame in the same USB data block
>    (ax88172a.c).

Why do we need the above 3 fixes? The patch in my last email can
fix the problem which is introduced recently, can't it?

>
> Only the third requires that tx_fixup() append to the packet.
> For the other two actual pad can be added by usbnet.
>
>> IMO, it should be handled by usbnet, could you comment on below patch?
>
> Seems excessive to kmalloc() one byte!

It isn't strange, many usb drivers have to do that(maybe kmalloc
two or three, or four bytes) since the buffer is involved into DMA.

> If you can't assume that the 'dev' structure itself can be dma'd from
> allocate the extra byte in the sg list.

It is better to always obey rule of DMA-API, so don't do that since
one extra kmalloc() per device is needed, even though we can allocate
only one global buffer for this purpose.


Thanks,
--
Ming Lei

^ permalink raw reply

* Re: [net  8/8] i40e: include i40e in kernel proper
From: Waskiewicz Jr, Peter P @ 2013-09-11 15:24 UTC (permalink / raw)
  To: Joe Perches
  Cc: Kirsher, Jeffrey T, davem@davemloft.net, Brandeburg, Jesse,
	netdev@vger.kernel.org, gospo@redhat.com, sassmann@redhat.com,
	Nelson, Shannon, e1000-devel@lists.sourceforge.net
In-Reply-To: <1378902002.606.42.camel@joe-AO722>

On Wed, 2013-09-11 at 05:20 -0700, Joe Perches wrote:
> On Wed, 2013-09-11 at 02:50 -0700, Jeff Kirsher wrote:
> > New driver build option is CONFIG_I40E
> 
> > diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
> []
> > +i40e.txt
> > +	- README for the Intel Ethernet Controller XL710 Driver (i40e).
> 
> Just curious but why the XL710 / i40e name mismatch?

ixgbe is a good example.  ixgbe is the driver name, but it supports the
82598, 82599 / X520, and X540 chips.

Cheers,
-PJ

^ permalink raw reply

* Re: [net  8/8] i40e: include i40e in kernel proper
From: Jeff Kirsher @ 2013-09-11 15:25 UTC (permalink / raw)
  To: Joe Perches
  Cc: davem, Jesse Brandeburg, netdev, gospo, sassmann, Shannon Nelson,
	PJ Waskiewicz, e1000-devel
In-Reply-To: <1378910832.4714.2.camel@joe-AO722>

[-- Attachment #1: Type: text/plain, Size: 888 bytes --]

On Wed, 2013-09-11 at 07:47 -0700, Joe Perches wrote:
> On Wed, 2013-09-11 at 07:32 -0700, Jeff Kirsher wrote:
> > On Wed, 2013-09-11 at 05:20 -0700, Joe Perches wrote:
> > > On Wed, 2013-09-11 at 02:50 -0700, Jeff Kirsher wrote:
> > > > New driver build option is CONFIG_I40E
> > > 
> > > > diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
> > > []
> > > > +i40e.txt
> > > > +	- README for the Intel Ethernet Controller XL710 Driver (i40e).
> > > 
> > > Just curious but why the XL710 / i40e name mismatch?
> > 
> > i40e stands for Intel 40 GbE Ethernet which is more generic than naming
> > the driver the same as the first part (XL710).
> 
> Too bad if also means that when 100 comes out
> this driver won't be expected to work.

It will work for our 40GbE parts definitely, as far as future technology
and silicon, I cannot say.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH nf] netfilter: use RCU safe kfree for conntrack extensions
From: Michal Kubecek @ 2013-09-11 15:28 UTC (permalink / raw)
  To: Phil Oester
  Cc: netfilter-devel, netdev, Pablo Neira Ayuso, Patrick McHardy,
	Jozsef Kadlecsik, coreteam
In-Reply-To: <20130911145715.GA2882@linuxace.com>

On Wed, Sep 11, 2013 at 07:57:15AM -0700, Phil Oester wrote:
> On Wed, Sep 11, 2013 at 10:17:27AM +0200, Michal Kubecek wrote:
> > Commit 68b80f11 (netfilter: nf_nat: fix RCU races) introduced
> > RCU protection for freeing extension data when reallocation
> > moves them to a new location. We need the same protection when
> > freeing them in nf_ct_ext_free() in order to prevent a
> > use-after-free by other threads referencing a NAT extension data
> > via bysource list.
> 
> Hi Michal - 
> 
> coincidentally I've been looking into this area this week due to another
> bug report (https://bugzilla.kernel.org/show_bug.cgi?id=60853).

Looking at the initial command, I would say this bug report is actually
of the same origin as ours.

> Looking at
> your proposed fix, the NAT extension data should have been cleaned
> from the bysource list in nf_nat_cleanup_conntrack (via __nf_ct_ext_destroy)
> before reaching the kfree.  Would you agree?

It is cleaned from the list but as it is an RCU list, other readers can
still be holding pointers to it. We have to wait for the RCU grace
period before we can reuse it.

> The reporter of #60853 suggested adding a synchronize_rcu to the end of the
> nf_nat_cleanup_conntrack function, which seems sane.

That was also my first idea. However, nf_nat_cleanup_conntrack() is
called from __nf_ct_ext_destroy() inside a rcu_read_lock() /
rcu_read_unlock() block. Even if this block is for a different RCU list,
we still cannot call synchronize_rcu() while inside it.

We could call synchronize_rcu() in __nf_ct_ext_destroy() after
rcu_read_unlock() but this would IMHO add an unnecessary delay so it is
more efficient and more appropriate to wait before the actual kfree()
which is the operation that needs to wait for RCU grace period.

> I have been trying to reproduce the crash to test that theory.
> Are you able to reproduce an OOPS in your testing?  Or is there a bug
> report you are working from?

No, it is a bugreport from our customer. And even that customer
encountered it only once so far. Which is not very surprising as to
reproduce it, you have to be (un)lucky twice: first to have someone
overwrite the area soon enough and second to have someone access the
area after it is overwritten. This is IMHO the reason why the reporter
cleared the block with memset() for testing purposes.

                                                     Michal Kubecek

^ permalink raw reply

* Re: [PATCH net v2] net: sctp: fix ipv6 ipsec encryption bug in sctp_v6_xmit
From: Vlad Yasevich @ 2013-09-11 15:30 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: davem, netdev, linux-sctp, adobriyan, Steffen Klassert,
	Hannes Frederic Sowa
In-Reply-To: <1378911516-15942-1-git-send-email-dborkman@redhat.com>

On 09/11/2013 10:58 AM, Daniel Borkmann wrote:
> Alan Chester reported an issue with IPv6 on SCTP that IPsec traffic is not
> being encrypted, whereas on IPv4 it is. Setting up an AH + ESP transport
> does not seem to have the desired effect:
>
> SCTP + IPv4:
>
>    22:14:20.809645 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto AH (51), length 116)
>      192.168.0.2 > 192.168.0.5: AH(spi=0x00000042,sumlen=16,seq=0x1): ESP(spi=0x00000044,seq=0x1), length 72
>    22:14:20.813270 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto AH (51), length 340)
>      192.168.0.5 > 192.168.0.2: AH(spi=0x00000043,sumlen=16,seq=0x1):
>
> SCTP + IPv6:
>
>    22:31:19.215029 IP6 (class 0x02, hlim 64, next-header SCTP (132) payload length: 364)
>      fe80::222:15ff:fe87:7fc.3333 > fe80::92e6:baff:fe0d:5a54.36767: sctp
>      1) [INIT ACK] [init tag: 747759530] [rwnd: 62464] [OS: 10] [MIS: 10]
>
> Moreover, Alan says:
>
>    This problem was seen with both Racoon and Racoon2. Other people have seen
>    this with OpenSwan. When IPsec is configured to encrypt all upper layer
>    protocols the SCTP connection does not initialize. After using Wireshark to
>    follow packets, this is because the SCTP packet leaves Box A unencrypted and
>    Box B believes all upper layer protocols are to be encrypted so it drops
>    this packet, causing the SCTP connection to fail to initialize. When IPsec
>    is configured to encrypt just SCTP, the SCTP packets are observed unencrypted.
>
> In fact, using `socat sctp6-listen:3333 -` on one end and transferring "plaintext"
> string on the other end, results in cleartext on the wire where SCTP eventually
> does not report any errors, thus in the latter case that Alan reports, the
> non-paranoid user might think he's communicating over an encrypted transport on
> SCTP although he's not (tcpdump ... -X):
>
>    ...
>    0x0030: 5d70 8e1a 0003 001a 177d eb6c 0000 0000  ]p.......}.l....
>    0x0040: 0000 0000 706c 6169 6e74 6578 740a 0000  ....plaintext...
>
> Only in /proc/net/xfrm_stat we can see XfrmInTmplMismatch increasing on the
> receiver side. Initial follow-up analysis from Alan's bug report was done by
> Alexey Dobriyan. Also thanks to Vlad Yasevich for feedback on this.
>
> SCTP has its own implementation of sctp_v6_xmit() not calling inet6_csk_xmit().
> This has the implication that it probably never really got updated along with
> changes in inet6_csk_xmit() and therefore does not seem to invoke xfrm handlers.
>
> SCTP's IPv4 xmit however, properly calls ip_queue_xmit() to do the work. Since
> a call to inet6_csk_xmit() would solve this problem, but result in unecessary
> route lookups, let us just use the cached flowi6 instead that we got through
> sctp_v6_get_dst(). Since all SCTP packets are being sent through sctp_packet_transmit(),
> we do the route lookup / flow caching in sctp_transport_route(), hold it in
> tp->dst and skb_dst_set() right after that. If we would alter fl6->daddr in
> sctp_v6_xmit() to np->opt->srcrt, we possibly could run into the same effect
> of not having xfrm layer pick it up, hence, use fl6_update_dst() in sctp_v6_get_dst()
> instead to get the correct source routed dst entry, which we assign to the skb.
>
> Also source address routing example from 625034113 ("sctp: fix sctp to work with
> ipv6 source address routing") still works with this patch! Nevertheless, in RFC5095
> it is actually 'recommended' to not use that anyway due to traffic amplification [1].
> So it seems we're not supposed to do that anyway in sctp_v6_xmit(). Moreover, if
> we overwrite the flow destination here, the lower IPv6 layer will be unable to
> put the correct destination address into IP header, as routing header is added in
> ipv6_push_nfrag_opts() but then probably with wrong final destination. Things aside,
> result of this patch is that we do not have any XfrmInTmplMismatch increase plus on
> the wire with this patch it now looks like:
>
> SCTP + IPv6:
>
>    08:17:47.074080 IP6 2620:52:0:102f:7a2b:cbff:fe27:1b0a > 2620:52:0:102f:213:72ff:fe32:7eba:
>      AH(spi=0x00005fb4,seq=0x1): ESP(spi=0x00005fb5,seq=0x1), length 72
>    08:17:47.074264 IP6 2620:52:0:102f:213:72ff:fe32:7eba > 2620:52:0:102f:7a2b:cbff:fe27:1b0a:
>      AH(spi=0x00003d54,seq=0x1): ESP(spi=0x00003d55,seq=0x1), length 296
>
> This fixes Kernel Bugzilla 24412. This security issue seems to be present since
> 2.6.18 kernels. Lets just hope some big passive adversary in the wild didn't have
> its fun with that. lksctp-tools IPv6 regression test suite passes as well with
> this patch.
>
>   [1] http://www.secdev.org/conf/IPv6_RH_security-csw07.pdf
>
> Reported-by: Alan Chester <alan.chester@tekelec.com>
> Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Cc: Steffen Klassert <steffen.klassert@secunet.com>
> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>

Acked-by: Vlad Yasevich <vyasevich@gmail.com>

-vlad

> ---
>   v1->v2:
>    - use cached flow
>    - improved commit msg
>
>   net/sctp/ipv6.c | 42 +++++++++++++-----------------------------
>   1 file changed, 13 insertions(+), 29 deletions(-)
>
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index da613ce..4f52e2c 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -204,44 +204,23 @@ out:
>   		in6_dev_put(idev);
>   }
>
> -/* Based on tcp_v6_xmit() in tcp_ipv6.c. */
>   static int sctp_v6_xmit(struct sk_buff *skb, struct sctp_transport *transport)
>   {
>   	struct sock *sk = skb->sk;
>   	struct ipv6_pinfo *np = inet6_sk(sk);
> -	struct flowi6 fl6;
> -
> -	memset(&fl6, 0, sizeof(fl6));
> -
> -	fl6.flowi6_proto = sk->sk_protocol;
> -
> -	/* Fill in the dest address from the route entry passed with the skb
> -	 * and the source address from the transport.
> -	 */
> -	fl6.daddr = transport->ipaddr.v6.sin6_addr;
> -	fl6.saddr = transport->saddr.v6.sin6_addr;
> -
> -	fl6.flowlabel = np->flow_label;
> -	IP6_ECN_flow_xmit(sk, fl6.flowlabel);
> -	if (ipv6_addr_type(&fl6.saddr) & IPV6_ADDR_LINKLOCAL)
> -		fl6.flowi6_oif = transport->saddr.v6.sin6_scope_id;
> -	else
> -		fl6.flowi6_oif = sk->sk_bound_dev_if;
> -
> -	if (np->opt && np->opt->srcrt) {
> -		struct rt0_hdr *rt0 = (struct rt0_hdr *) np->opt->srcrt;
> -		fl6.daddr = *rt0->addr;
> -	}
> +	struct flowi6 *fl6 = &transport->fl.u.ip6;
>
>   	pr_debug("%s: skb:%p, len:%d, src:%pI6 dst:%pI6\n", __func__, skb,
> -		 skb->len, &fl6.saddr, &fl6.daddr);
> +		 skb->len, &fl6->saddr, &fl6->daddr);
>
> -	SCTP_INC_STATS(sock_net(sk), SCTP_MIB_OUTSCTPPACKS);
> +	IP6_ECN_flow_xmit(sk, fl6->flowlabel);
>
>   	if (!(transport->param_flags & SPP_PMTUD_ENABLE))
>   		skb->local_df = 1;
>
> -	return ip6_xmit(sk, skb, &fl6, np->opt, np->tclass);
> +	SCTP_INC_STATS(sock_net(sk), SCTP_MIB_OUTSCTPPACKS);
> +
> +	return ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
>   }
>
>   /* Returns the dst cache entry for the given source and destination ip
> @@ -254,10 +233,12 @@ static void sctp_v6_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
>   	struct dst_entry *dst = NULL;
>   	struct flowi6 *fl6 = &fl->u.ip6;
>   	struct sctp_bind_addr *bp;
> +	struct ipv6_pinfo *np = inet6_sk(sk);
>   	struct sctp_sockaddr_entry *laddr;
>   	union sctp_addr *baddr = NULL;
>   	union sctp_addr *daddr = &t->ipaddr;
>   	union sctp_addr dst_saddr;
> +	struct in6_addr *final_p, final;
>   	__u8 matchlen = 0;
>   	__u8 bmatchlen;
>   	sctp_scope_t scope;
> @@ -281,7 +262,8 @@ static void sctp_v6_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
>   		pr_debug("src=%pI6 - ", &fl6->saddr);
>   	}
>
> -	dst = ip6_dst_lookup_flow(sk, fl6, NULL, false);
> +	final_p = fl6_update_dst(fl6, np->opt, &final);
> +	dst = ip6_dst_lookup_flow(sk, fl6, final_p, false);
>   	if (!asoc || saddr)
>   		goto out;
>
> @@ -333,10 +315,12 @@ static void sctp_v6_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
>   		}
>   	}
>   	rcu_read_unlock();
> +
>   	if (baddr) {
>   		fl6->saddr = baddr->v6.sin6_addr;
>   		fl6->fl6_sport = baddr->v6.sin6_port;
> -		dst = ip6_dst_lookup_flow(sk, fl6, NULL, false);
> +		final_p = fl6_update_dst(fl6, np->opt, &final);
> +		dst = ip6_dst_lookup_flow(sk, fl6, final_p, false);
>   	}
>
>   out:
>

^ permalink raw reply

* Re: [net  8/8] i40e: include i40e in kernel proper
From: Joe Perches @ 2013-09-11 15:31 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: davem, Jesse Brandeburg, netdev, gospo, sassmann, Shannon Nelson,
	PJ Waskiewicz, e1000-devel
In-Reply-To: <1378909937.2026.8.camel@jtkirshe-mobl>

On Wed, 2013-09-11 at 07:32 -0700, Jeff Kirsher wrote:
> On Wed, 2013-09-11 at 05:20 -0700, Joe Perches wrote:
> > On Wed, 2013-09-11 at 02:50 -0700, Jeff Kirsher wrote:
> > > New driver build option is CONFIG_I40E
> > 
> > > diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
> > []
> > > +i40e.txt
> > > +	- README for the Intel Ethernet Controller XL710 Driver (i40e).
> > 
> > Just curious but why the XL710 / i40e name mismatch?
> 
> i40e stands for Intel 40 GbE Ethernet which is more generic than naming
> the driver the same as the first part (XL710).  That way when future
> silicon is made, we are not stuck with a driver named after previous
> silicon.

>From the intro and most of the rest of the patches:

----------------------------
This series implements the new i40e driver for Intel's upcoming
Intel(R) Ethernet Controller XL710 Family of devices.
----------------------------

If the xl710 is a specific instance of the i40e family,
I think this is incorrect.

^ permalink raw reply

* Re: [net  8/8] i40e: include i40e in kernel proper
From: Jeff Kirsher @ 2013-09-11 15:37 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P
  Cc: Joe Perches, davem@davemloft.net, Brandeburg, Jesse,
	netdev@vger.kernel.org, gospo@redhat.com, sassmann@redhat.com,
	Nelson, Shannon, e1000-devel@lists.sourceforge.net
In-Reply-To: <1378913071.3863.27.camel@ppwaskie-mobl2>

[-- Attachment #1: Type: text/plain, Size: 671 bytes --]

On Wed, 2013-09-11 at 08:24 -0700, Waskiewicz Jr, Peter P wrote:
> On Wed, 2013-09-11 at 05:20 -0700, Joe Perches wrote:
> > On Wed, 2013-09-11 at 02:50 -0700, Jeff Kirsher wrote:
> > > New driver build option is CONFIG_I40E
> > 
> > > diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
> > []
> > > +i40e.txt
> > > +	- README for the Intel Ethernet Controller XL710 Driver (i40e).
> > 
> > Just curious but why the XL710 / i40e name mismatch?
> 
> ixgbe is a good example.  ixgbe is the driver name, but it supports the
> 82598, 82599 / X520, and X540 chips.
> 

Another good example is e100, e1000, e1000e, and igb. :-)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* RE: usbnet transmit path problems
From: David Laight @ 2013-09-11 16:05 UTC (permalink / raw)
  To: Ming Lei; +Cc: Oliver Neukum, Network Development, linux-usb
In-Reply-To: <CACVXFVN6ZHLesrsMVNMWrikRs1mMbk=aZD9qZybNn1gB7aFTZQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

> On Wed, Sep 11, 2013 at 8:56 PM, David Laight <David.Laight-JxhZ9S5GRejQT0dZR+AlfA@public.gmane.org> wrote:
> >> > > 2) If 'length % dev->maxpacket == 0' for a multi-fragment packet then
> >> > >    the extra byte isn't added correctly (the code probably falls off
> >> > >    the end of the scatter-gather list).
> >> >
> >> > Indeed. Ming Lei, should usbnet handle this in the sg case or better
> >> > leave it to the subdriver you introduced this for?
> >
> > Is the ZLP issue a problem with the host or with the target?
> 
> Sorry, what do you mean the ZLP issue here? I understand Oliver
> thinks one commit from me may break ZLP handling, are you discussing
> this problem? If not, could you explain it in a bit detail?

I was thinking of the general ZLP problem.
 
> > If it is a host problem then the necessity comes from the host,
> > but the fix needs to be target dependant.
> > If it is a common target problem then generic code can apply
> > a common fix.
> 
> All usbnet device should have sent one ZLP in case the size of
> bulk out transfer can be divided by max packet size, but the one
> byte transfer might be introduced for avoiding some target problem
> (can't deal with zlp well), as said by David, see below discussion:
> 
>    http://marc.info/?l=linux-usb&m=127067487604112&w=2

AFAICT the code avoids sending a zero length packet (that would
terminate a USB bulk transfer packet) by increasing the length
of the bulk packet by (at least) one byte.

> > AFICT there are at least 3 fixes:
> > 1) Extend the ethernet frame by one byte and hope the receiving
> >    system doesn't object to the padding.
> >    This is probably the only option if tx_fixup() doesn't
> >    add a header.
> > 2) Put the ethernet frame length in the header and have the
> >    target discard the added pad byte (ax88179_178a.c).
> > 3) Add a second zero-length frame in the same USB data block
> >    (ax88172a.c).
> 
> Why do we need the above 3 fixes? The patch in my last email can
> fix the problem which is introduced recently, can't it?

I meant there are 3 ways of avoiding the ZLP, each driver will
pick one of them.

I've just looked at all the drivers in net/usb.
It doesn't look like they all handle fragmented skb, shared skb,
or ZLP properly.

A lot of common code could be removed if usbnet knew the size of the
header and allocated it before calling tx_fixup().

None of this is helping me sort out why netperf udp rr tests with
burst 19 are losing all the packets at once :-(

	David



--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox