Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next] ipvs: Avoid null-pointer deref in debug code
From: Simon Horman @ 2014-10-17  6:27 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Alex Gartrell, dan.carpenter, lvs-devel, netdev, kernel-team
In-Reply-To: <alpine.LFD.2.11.1410062159290.2258@ja.home.ssi.bg>

On Mon, Oct 06, 2014 at 10:01:08PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Mon, 6 Oct 2014, Alex Gartrell wrote:
> 
> > Use daddr instead of reaching into dest.
> > 
> > Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> > Signed-off-by: Alex Gartrell <agartrell@fb.com>
> 
> 	Thanks!
> 
> Acked-by: Julian Anastasov <ja@ssi.bg>

Thanks, I have applied this to the ipvs tree and will see about
both getting it included in v3.18 and v3.17-stable.

> > ---
> >  net/netfilter/ipvs/ip_vs_xmit.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> > index 91f17c1..437a366 100644
> > --- a/net/netfilter/ipvs/ip_vs_xmit.c
> > +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> > @@ -316,7 +316,7 @@ __ip_vs_get_out_rt(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
> >  	if (unlikely(crosses_local_route_boundary(skb_af, skb, rt_mode,
> >  						  local))) {
> >  		IP_VS_DBG_RL("We are crossing local and non-local addresses"
> > -			     " daddr=%pI4\n", &dest->addr.ip);
> > +			     " daddr=%pI4\n", &daddr);
> >  		goto err_put;
> >  	}
> >  
> > @@ -458,7 +458,7 @@ __ip_vs_get_out_rt_v6(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
> >  	if (unlikely(crosses_local_route_boundary(skb_af, skb, rt_mode,
> >  						  local))) {
> >  		IP_VS_DBG_RL("We are crossing local and non-local addresses"
> > -			     " daddr=%pI6\n", &dest->addr.in6);
> > +			     " daddr=%pI6\n", daddr);
> >  		goto err_put;
> >  	}
> >  
> > -- 
> > Alex Gartrell <agartrell@fb.com>
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>
> 

^ permalink raw reply

* [PATCH net-next] openvswitch: Create right mask with disabled megaflows
From: Pravin B Shelar @ 2014-10-17  4:55 UTC (permalink / raw)
  To: davem; +Cc: netdev, Pravin B Shelar, Daniele Di Proietto, Andy Zhou

If megaflows are disabled, the userspace does not send the netlink attribute
OVS_FLOW_ATTR_MASK, and the kernel must create an exact match mask.

sw_flow_mask_set() sets every bytes (in 'range') of the mask to 0xff, even the
bytes that represent padding for struct sw_flow, or the bytes that represent
fields that may not be set during ovs_flow_extract().
This is a problem, because when we extract a flow from a packet,
we do not memset() anymore the struct sw_flow to 0.

This commit gets rid of sw_flow_mask_set() and introduces mask_set_nlattr(),
which operates on the netlink attributes rather than on the mask key. Using
this approach we are sure that only the bytes that the user provided in the
flow are matched.

Also, if the parse_flow_mask_nlattrs() for the mask ENCAP attribute fails, we
now return with an error.

This bug is introduced by commit 0714812134d7dcadeb7ecfbfeb18788aa7e1eaac
("openvswitch: Eliminate memset() from flow_extract").

Reported-by: Alex Wang <alexw@nicira.com>
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
---
 net/openvswitch/flow_netlink.c |   93 +++++++++++++++++++++++++++++++---------
 1 files changed, 72 insertions(+), 21 deletions(-)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 368f233..939bcb3 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -103,10 +103,19 @@ static void update_range__(struct sw_flow_match *match,
 	SW_FLOW_KEY_MEMCPY_OFFSET(match, offsetof(struct sw_flow_key, field), \
 				  value_p, len, is_mask)
 
-static u16 range_n_bytes(const struct sw_flow_key_range *range)
-{
-	return range->end - range->start;
-}
+#define SW_FLOW_KEY_MEMSET_FIELD(match, field, value, is_mask) \
+	do { \
+		update_range__(match, offsetof(struct sw_flow_key, field),  \
+				     sizeof((match)->key->field), is_mask); \
+		if (is_mask) {						    \
+			if ((match)->mask)				    \
+				memset((u8 *)&(match)->mask->key.field, value,\
+				       sizeof((match)->mask->key.field));   \
+		} else {                                                    \
+			memset((u8 *)&(match)->key->field, value,           \
+			       sizeof((match)->key->field));                \
+		}                                                           \
+	} while (0)
 
 static bool match_validate(const struct sw_flow_match *match,
 			   u64 key_attrs, u64 mask_attrs)
@@ -809,13 +818,26 @@ static int ovs_key_from_nlattrs(struct sw_flow_match *match, u64 attrs,
 	return 0;
 }
 
-static void sw_flow_mask_set(struct sw_flow_mask *mask,
-			     struct sw_flow_key_range *range, u8 val)
+static void nlattr_set(struct nlattr *attr, u8 val, bool is_attr_mask_key)
 {
-	u8 *m = (u8 *)&mask->key + range->start;
+	struct nlattr *nla;
+	int rem;
+
+	/* The nlattr stream should already have been validated */
+	nla_for_each_nested(nla, attr, rem) {
+		/* We assume that ovs_key_lens[type] == -1 means that type is a
+		 * nested attribute
+		 */
+		if (is_attr_mask_key && ovs_key_lens[nla_type(nla)] == -1)
+			nlattr_set(nla, val, false);
+		else
+			memset(nla_data(nla), val, nla_len(nla));
+	}
+}
 
-	mask->range = *range;
-	memset(m, val, range_n_bytes(range));
+static void mask_set_nlattr(struct nlattr *attr, u8 val)
+{
+	nlattr_set(attr, val, true);
 }
 
 /**
@@ -836,6 +858,7 @@ int ovs_nla_get_match(struct sw_flow_match *match,
 {
 	const struct nlattr *a[OVS_KEY_ATTR_MAX + 1];
 	const struct nlattr *encap;
+	struct nlattr *newmask = NULL;
 	u64 key_attrs = 0;
 	u64 mask_attrs = 0;
 	bool encap_valid = false;
@@ -882,18 +905,44 @@ int ovs_nla_get_match(struct sw_flow_match *match,
 	if (err)
 		return err;
 
+	if (match->mask && !mask) {
+		/* Create an exact match mask. We need to set to 0xff all the
+		 * 'match->mask' fields that have been touched in 'match->key'.
+		 * We cannot simply memset 'match->mask', because padding bytes
+		 * and fields not specified in 'match->key' should be left to 0.
+		 * Instead, we use a stream of netlink attributes, copied from
+		 * 'key' and set to 0xff: ovs_key_from_nlattrs() will take care
+		 * of filling 'match->mask' appropriately.
+		 */
+		newmask = kmemdup(key, nla_total_size(nla_len(key)),
+				  GFP_KERNEL);
+		if (!newmask)
+			return -ENOMEM;
+
+		mask_set_nlattr(newmask, 0xff);
+
+		/* The userspace does not send tunnel attributes that are 0,
+		 * but we should not wildcard them nonetheless.
+		 */
+		if (match->key->tun_key.ipv4_dst)
+			SW_FLOW_KEY_MEMSET_FIELD(match, tun_key, 0xff, true);
+
+		mask = newmask;
+	}
+
 	if (mask) {
 		err = parse_flow_mask_nlattrs(mask, a, &mask_attrs);
 		if (err)
-			return err;
+			goto free_newmask;
 
-		if (mask_attrs & 1 << OVS_KEY_ATTR_ENCAP)  {
+		if (mask_attrs & 1 << OVS_KEY_ATTR_ENCAP) {
 			__be16 eth_type = 0;
 			__be16 tci = 0;
 
 			if (!encap_valid) {
 				OVS_NLERR("Encap mask attribute is set for non-VLAN frame.\n");
-				return  -EINVAL;
+				err = -EINVAL;
+				goto free_newmask;
 			}
 
 			mask_attrs &= ~(1 << OVS_KEY_ATTR_ENCAP);
@@ -904,10 +953,13 @@ int ovs_nla_get_match(struct sw_flow_match *match,
 				mask_attrs &= ~(1 << OVS_KEY_ATTR_ETHERTYPE);
 				encap = a[OVS_KEY_ATTR_ENCAP];
 				err = parse_flow_mask_nlattrs(encap, a, &mask_attrs);
+				if (err)
+					goto free_newmask;
 			} else {
 				OVS_NLERR("VLAN frames must have an exact match on the TPID (mask=%x).\n",
 						ntohs(eth_type));
-				return -EINVAL;
+				err = -EINVAL;
+				goto free_newmask;
 			}
 
 			if (a[OVS_KEY_ATTR_VLAN])
@@ -915,23 +967,22 @@ int ovs_nla_get_match(struct sw_flow_match *match,
 
 			if (!(tci & htons(VLAN_TAG_PRESENT))) {
 				OVS_NLERR("VLAN tag present bit must have an exact match (tci_mask=%x).\n", ntohs(tci));
-				return -EINVAL;
+				err = -EINVAL;
+				goto free_newmask;
 			}
 		}
 
 		err = ovs_key_from_nlattrs(match, mask_attrs, a, true);
 		if (err)
-			return err;
-	} else {
-		/* Populate exact match flow's key mask. */
-		if (match->mask)
-			sw_flow_mask_set(match->mask, &match->range, 0xff);
+			goto free_newmask;
 	}
 
 	if (!match_validate(match, key_attrs, mask_attrs))
-		return -EINVAL;
+		err = -EINVAL;
 
-	return 0;
+free_newmask:
+	kfree(newmask);
+	return err;
 }
 
 /**
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH net] r8152: use cancel_delayed_work for runtime suspend
From: Bjørn Mork @ 2014-10-17  7:42 UTC (permalink / raw)
  To: Hayes Wang; +Cc: netdev, nic_swsd, linux-kernel, linux-usb
In-Reply-To: <1394712342-15778-64-Taiwan-albertk@realtek.com>

Hayes Wang <hayeswang@realtek.com> writes:

> It would cause dead lock for runtime suspend, when the workqueue
> is running and runtime suspend occurs before the workqueue wakes
> up the device. The rtl8152_suspend() waits the workqueue to finish
> because of calling cancel_delayed_work_sync(). The workqueue waits
> the suspend function to finish for waking up the device because of
> calling usb_autopm_get_interface().
>
> Signed-off-by: Hayes Wang <hayeswang@realtek.com>
> ---
>  drivers/net/usb/r8152.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
> index 864159e..7d4e55a 100644
> --- a/drivers/net/usb/r8152.c
> +++ b/drivers/net/usb/r8152.c
> @@ -3200,12 +3200,13 @@ static int rtl8152_suspend(struct usb_interface *intf, pm_message_t message)
>  	if (netif_running(tp->netdev)) {
>  		clear_bit(WORK_ENABLE, &tp->flags);
>  		usb_kill_urb(tp->intr_urb);
> -		cancel_delayed_work_sync(&tp->schedule);
>  		tasklet_disable(&tp->tl);
>  		if (test_bit(SELECTIVE_SUSPEND, &tp->flags)) {
> +			cancel_delayed_work(&tp->schedule);
>  			rtl_stop_rx(tp);
>  			rtl_runtime_suspend_enable(tp, true);
>  		} else {
> +			cancel_delayed_work_sync(&tp->schedule);
>  			tp->rtl_ops.down(tp);
>  		}
>  		tasklet_enable(&tp->tl);

This looks strange to me. The delayed work will cause an immediate
resume due to the usb_autopm_get_interface() it starts with.  Wouldn't
it be better to just prevent runtime suspending by returning -EBUSY if
there is any delayed work scheduled?


Bjørn

^ permalink raw reply

* Re: [RFC PATCH net-next] net: ipv6: Add a sysctl to make optimistic addresses useful candidates
From: Lorenzo Colitti @ 2014-10-17  8:31 UTC (permalink / raw)
  To: Erik Kline; +Cc: netdev@vger.kernel.org, Hannes Frederic Sowa, Eric Dumazet
In-Reply-To: <1413520309-13814-1-git-send-email-ek@google.com>

On Fri, Oct 17, 2014 at 1:31 PM, Erik Kline <ek@google.com> wrote:
> Add a sysctl that causes an interface's optimistic addresses
> to be considered equivalent to other non-deprecated addresses
> for source address selection purposes.  Preferred addresses
> will still take precendence over optimistic addresses, subject
> to other ranking in the source address selection algorithm.
> [...]
> Signed-off-by: Erik Kline <ek@google.com>

Acked-by: Lorenzo Colitti <lorenzo@google.com>

^ permalink raw reply

* [PATCH] ipv4: fix a potential use after free
From: roy.qing.li @ 2014-10-17  8:53 UTC (permalink / raw)
  To: netdev; +Cc: pshelar

From: Li RongQing <roy.qing.li@gmail.com>

pskb_may_pull() maybe change skb->data and make eth pointer oboslete,
so set eth after pskb_may_pull()

Fixes:3d7b46cd("ip_tunnel: push generic protocol handling to ip_tunnel module")
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
 net/ipv4/ip_tunnel_core.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index f4c987b..88c386c 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -91,11 +91,12 @@ int iptunnel_pull_header(struct sk_buff *skb, int hdr_len, __be16 inner_proto)
 	skb_pull_rcsum(skb, hdr_len);
 
 	if (inner_proto == htons(ETH_P_TEB)) {
-		struct ethhdr *eh = (struct ethhdr *)skb->data;
+		struct ethhdr *eh;
 
 		if (unlikely(!pskb_may_pull(skb, ETH_HLEN)))
 			return -ENOMEM;
 
+		eh = (struct ethhdr *)skb->data;
 		if (likely(ntohs(eh->h_proto) >= ETH_P_802_3_MIN))
 			skb->protocol = eh->h_proto;
 		else
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH] ipv4: fix a potential use after free in fou.c
From: roy.qing.li @ 2014-10-17  8:53 UTC (permalink / raw)
  To: netdev; +Cc: therbert

From: Li RongQing <roy.qing.li@gmail.com>

pskb_may_pull() maybe change skb->data and make uh pointer oboslete,
so reload uh and guehdr

Fixes: 37dd0247 ("gue: Receive side for Generic UDP Encapsulation")
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
 net/ipv4/fou.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index efa70ad..32e7892 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -87,6 +87,9 @@ static int gue_udp_recv(struct sock *sk, struct sk_buff *skb)
 	if (!pskb_may_pull(skb, len))
 		goto drop;
 
+	uh = udp_hdr(skb);
+	guehdr = (struct guehdr *)&uh[1];
+
 	if (guehdr->version != 0)
 		goto drop;
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net] r8152: return -EBUSY for runtime suspend
From: Hayes Wang @ 2014-10-17  8:55 UTC (permalink / raw)
  To: netdev; +Cc: nic_swsd, linux-kernel, linux-usb, Hayes Wang
In-Reply-To: <87zjcvgn1x.fsf@nemi.mork.no>

Remove calling cancel_delayed_work_sync() for runtime suspend,
because it would cause dead lock. Instead, return -EBUSY to
avoid the device enters suspending if the net is running and
the delayed work is pending or running. The delayed work would
try to wake up the device later, so the suspending is not
necessary.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
 drivers/net/usb/r8152.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 864159e..e3d84c3 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -3189,31 +3189,39 @@ static void r8153_init(struct r8152 *tp)
 static int rtl8152_suspend(struct usb_interface *intf, pm_message_t message)
 {
 	struct r8152 *tp = usb_get_intfdata(intf);
+	struct net_device *netdev = tp->netdev;
+	int ret = 0;
 
 	mutex_lock(&tp->control);
 
-	if (PMSG_IS_AUTO(message))
+	if (PMSG_IS_AUTO(message)) {
+		if (netif_running(netdev) && work_busy(&tp->schedule.work)) {
+			ret = -EBUSY;
+			goto out1;
+		}
+
 		set_bit(SELECTIVE_SUSPEND, &tp->flags);
-	else
-		netif_device_detach(tp->netdev);
+	} else {
+		netif_device_detach(netdev);
+	}
 
-	if (netif_running(tp->netdev)) {
+	if (netif_running(netdev)) {
 		clear_bit(WORK_ENABLE, &tp->flags);
 		usb_kill_urb(tp->intr_urb);
-		cancel_delayed_work_sync(&tp->schedule);
 		tasklet_disable(&tp->tl);
 		if (test_bit(SELECTIVE_SUSPEND, &tp->flags)) {
 			rtl_stop_rx(tp);
 			rtl_runtime_suspend_enable(tp, true);
 		} else {
+			cancel_delayed_work_sync(&tp->schedule);
 			tp->rtl_ops.down(tp);
 		}
 		tasklet_enable(&tp->tl);
 	}
-
+out1:
 	mutex_unlock(&tp->control);
 
-	return 0;
+	return ret;
 }
 
 static int rtl8152_resume(struct usb_interface *intf)
-- 
1.9.3

^ permalink raw reply related

* RE: [PATCH] net: use hardware buffer pool to allocate skb
From: David Laight @ 2014-10-17  9:11 UTC (permalink / raw)
  To: 'Alexander Duyck', Eric Dumazet
  Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <54400C6C.7010405@redhat.com>

From: Alexander Duyck
...
> Actually the likelihood of anything holding onto the 4K page for very
> long doesn't seem to occur, at least from the drivers perspective.  It
> is one of the reasons why I went for the page reuse approach rather than
> just partitioning a single large page.  It allows us to avoid having to
> call IOMMU map/unmap for the pages since the entire page is usually back
> in the driver ownership before we need to reuse the portion given to the
> stack.

That is almost certainly true for most benchmarks, benchmark processes
consume receive data.
But what about real life situations?

There must be some 'normal' workloads where receive data doesn't get consumed.

	David


^ permalink raw reply

* Re: [PATCH] virtio_net: fix use after free
From: Michael S. Tsirkin @ 2014-10-17  9:20 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel, virtualization
In-Reply-To: <20141015.164727.298073723527552510.davem@davemloft.net>

On Wed, Oct 15, 2014 at 04:47:27PM -0400, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Wed, 15 Oct 2014 16:23:28 +0300
> 
> > You used __netif_subqueue_stopped but that seems to use
> > a slightly more expensive test_bit internally.
> 
> More expensive in what sense?  It should be roughly the same
> as "x & y" sans the volatile.

I really just meant volatile - this might prevent some compiler
optimizations. I have't actually checked the produced binary so
I don't know for sure.

> Anyways I'm ambivalent and I want to see this bug fixes, so I'll
> apply your patch.
> 
> Thanks!

^ permalink raw reply

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Eric Dumazet @ 2014-10-17 14:05 UTC (permalink / raw)
  To: Jiafei.Pan@freescale.com
  Cc: Alexander Duyck, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <92e45ba6015c4c48b21baeade3f9c929@BLUPR03MB517.namprd03.prod.outlook.com>

On Fri, 2014-10-17 at 02:35 +0000, Jiafei.Pan@freescale.com wrote:

> [Pan Jiafei] Hi, Alex, thanks for your comments. I don’t confirm that
> you have catch my concerns. For example, I want to add igb net card 
> into my bridge, and want to igb net driver allocate skb by using
> my specified memory address, but I don’t want to modify igb net driver
> directly, how to do this in my bridge drivers?

This is exactly our point : We do not want to modify all drivers so that
your bridge is happy with them.

You'll have to make your bridge using standard infra instead.

IGB has no way to know in advance that a particular frame should
eventually reach your bridge anyway.

^ permalink raw reply

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Alexander Duyck @ 2014-10-17 14:12 UTC (permalink / raw)
  To: Eric Dumazet, Jiafei.Pan@freescale.com
  Cc: Alexander Duyck, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <1413554709.24953.5.camel@edumazet-glaptop2.roam.corp.google.com>

On 10/17/2014 07:05 AM, Eric Dumazet wrote:
> On Fri, 2014-10-17 at 02:35 +0000, Jiafei.Pan@freescale.com wrote:
>
>> [Pan Jiafei] Hi, Alex, thanks for your comments. I don’t confirm that
>> you have catch my concerns. For example, I want to add igb net card 
>> into my bridge, and want to igb net driver allocate skb by using
>> my specified memory address, but I don’t want to modify igb net driver
>> directly, how to do this in my bridge drivers?
> This is exactly our point : We do not want to modify all drivers so that
> your bridge is happy with them.
>
> You'll have to make your bridge using standard infra instead.
>
> IGB has no way to know in advance that a particular frame should
> eventually reach your bridge anyway.

Also why is it igb should use your buffers, is there any reason why your
device cannot use the receive buffers that are handed off to the stack
from igb?  It isn't as if there is a copy in the routing or bridging
path.  The receive buffer is normally handed off to the stack from the
ingress device, a few headers might get tweaked, and then that buffer is
transmitted by the egress interface.  Only in the case of a buffer being
handed to multiple egress devices or user space should it ever be copied.

Thanks,

Alex

^ permalink raw reply

* [PATCH net-next] sfc: add support for skb->xmit_more
From: Edward Cree @ 2014-10-17 14:32 UTC (permalink / raw)
  To: Robert Stonehouse
  Cc: Edward Cree, Daniel Borkmann, davem, nikolay, netdev,
	Shradha Shah, Jon Cooper, linux-net-drivers
In-Reply-To: <543FF408.3000808@solarflare.com>

Don't ring the doorbell, and don't do PIO.  This will also prevent
 TX Push, because there will be more than one buffer waiting when
 the doorbell is rung.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/nic.h | 29 +++++++++++++++++++++++-----
 drivers/net/ethernet/sfc/tx.c  | 43 +++++++++++++++++++-----------------------
 2 files changed, 43 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/sfc/nic.h b/drivers/net/ethernet/sfc/nic.h
index 60f8514..f77cce0 100644
--- a/drivers/net/ethernet/sfc/nic.h
+++ b/drivers/net/ethernet/sfc/nic.h
@@ -71,9 +71,17 @@ efx_tx_desc(struct efx_tx_queue *tx_queue, unsigned int index)
 	return ((efx_qword_t *) (tx_queue->txd.buf.addr)) + index;
 }
 
-/* Report whether the NIC considers this TX queue empty, given the
- * write_count used for the last doorbell push.  May return false
- * negative.
+/* Get partner of a TX queue, seen as part of the same net core queue */
+static struct efx_tx_queue *efx_tx_queue_partner(struct efx_tx_queue *tx_queue)
+{
+	if (tx_queue->queue & EFX_TXQ_TYPE_OFFLOAD)
+		return tx_queue - EFX_TXQ_TYPE_OFFLOAD;
+	else
+		return tx_queue + EFX_TXQ_TYPE_OFFLOAD;
+}
+
+/* Report whether this TX queue would be empty for the given write_count.
+ * May return false negative.
  */
 static inline bool __efx_nic_tx_is_empty(struct efx_tx_queue *tx_queue,
 					 unsigned int write_count)
@@ -86,9 +94,18 @@ static inline bool __efx_nic_tx_is_empty(struct efx_tx_queue *tx_queue,
 	return ((empty_read_count ^ write_count) & ~EFX_EMPTY_COUNT_VALID) == 0;
 }
 
-static inline bool efx_nic_tx_is_empty(struct efx_tx_queue *tx_queue)
+/* Decide whether we can use TX PIO, ie. write packet data directly into
+ * a buffer on the device.  This can reduce latency at the expense of
+ * throughput, so we only do this if both hardware and software TX rings
+ * are empty.  This also ensures that only one packet at a time can be
+ * using the PIO buffer.
+ */
+static inline bool efx_nic_may_tx_pio(struct efx_tx_queue *tx_queue)
 {
-	return __efx_nic_tx_is_empty(tx_queue, tx_queue->write_count);
+	struct efx_tx_queue *partner = efx_tx_queue_partner(tx_queue);
+	return tx_queue->piobuf &&
+	       __efx_nic_tx_is_empty(tx_queue, tx_queue->insert_count) &&
+	       __efx_nic_tx_is_empty(partner, partner->insert_count);
 }
 
 /* Decide whether to push a TX descriptor to the NIC vs merely writing
@@ -96,6 +113,8 @@ static inline bool efx_nic_tx_is_empty(struct efx_tx_queue *tx_queue)
  * descriptor to an empty queue, but is otherwise pointless.  Further,
  * Falcon and Siena have hardware bugs (SF bug 33851) that may be
  * triggered if we don't check this.
+ * We use the write_count used for the last doorbell push, to get the
+ * NIC's view of the tx queue.
  */
 static inline bool efx_nic_may_push_tx_desc(struct efx_tx_queue *tx_queue,
 					    unsigned int write_count)
diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
index 3206098..ee84a90 100644
--- a/drivers/net/ethernet/sfc/tx.c
+++ b/drivers/net/ethernet/sfc/tx.c
@@ -132,15 +132,6 @@ unsigned int efx_tx_max_skb_descs(struct efx_nic *efx)
 	return max_descs;
 }
 
-/* Get partner of a TX queue, seen as part of the same net core queue */
-static struct efx_tx_queue *efx_tx_queue_partner(struct efx_tx_queue *tx_queue)
-{
-	if (tx_queue->queue & EFX_TXQ_TYPE_OFFLOAD)
-		return tx_queue - EFX_TXQ_TYPE_OFFLOAD;
-	else
-		return tx_queue + EFX_TXQ_TYPE_OFFLOAD;
-}
-
 static void efx_tx_maybe_stop_queue(struct efx_tx_queue *txq1)
 {
 	/* We need to consider both queues that the net core sees as one */
@@ -344,6 +335,7 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
 	struct efx_nic *efx = tx_queue->efx;
 	struct device *dma_dev = &efx->pci_dev->dev;
 	struct efx_tx_buffer *buffer;
+	unsigned int old_insert_count = tx_queue->insert_count;
 	skb_frag_t *fragment;
 	unsigned int len, unmap_len = 0;
 	dma_addr_t dma_addr, unmap_addr = 0;
@@ -351,7 +343,7 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
 	unsigned short dma_flags;
 	int i = 0;
 
-	EFX_BUG_ON_PARANOID(tx_queue->write_count != tx_queue->insert_count);
+	EFX_BUG_ON_PARANOID(tx_queue->write_count > tx_queue->insert_count);
 
 	if (skb_shinfo(skb)->gso_size)
 		return efx_enqueue_skb_tso(tx_queue, skb);
@@ -369,9 +361,8 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
 
 	/* Consider using PIO for short packets */
 #ifdef EFX_USE_PIO
-	if (skb->len <= efx_piobuf_size && tx_queue->piobuf &&
-	    efx_nic_tx_is_empty(tx_queue) &&
-	    efx_nic_tx_is_empty(efx_tx_queue_partner(tx_queue))) {
+	if (skb->len <= efx_piobuf_size && !skb->xmit_more &&
+	    efx_nic_may_tx_pio(tx_queue)) {
 		buffer = efx_enqueue_skb_pio(tx_queue, skb);
 		dma_flags = EFX_TX_BUF_OPTION;
 		goto finish_packet;
@@ -439,13 +430,14 @@ finish_packet:
 
 	netdev_tx_sent_queue(tx_queue->core_txq, skb->len);
 
+	efx_tx_maybe_stop_queue(tx_queue);
+
 	/* Pass off to hardware */
-	efx_nic_push_buffers(tx_queue);
+	if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq))
+		efx_nic_push_buffers(tx_queue);
 
 	tx_queue->tx_packets++;
 
-	efx_tx_maybe_stop_queue(tx_queue);
-
 	return NETDEV_TX_OK;
 
  dma_err:
@@ -458,7 +450,7 @@ finish_packet:
 	dev_kfree_skb_any(skb);
 
 	/* Work backwards until we hit the original insert pointer value */
-	while (tx_queue->insert_count != tx_queue->write_count) {
+	while (tx_queue->insert_count != old_insert_count) {
 		unsigned int pkts_compl = 0, bytes_compl = 0;
 		--tx_queue->insert_count;
 		buffer = __efx_tx_queue_get_insert_buffer(tx_queue);
@@ -989,12 +981,13 @@ static int efx_tso_put_header(struct efx_tx_queue *tx_queue,
 /* Remove buffers put into a tx_queue.  None of the buffers must have
  * an skb attached.
  */
-static void efx_enqueue_unwind(struct efx_tx_queue *tx_queue)
+static void efx_enqueue_unwind(struct efx_tx_queue *tx_queue,
+			       unsigned int insert_count)
 {
 	struct efx_tx_buffer *buffer;
 
 	/* Work backwards until we hit the original insert pointer value */
-	while (tx_queue->insert_count != tx_queue->write_count) {
+	while (tx_queue->insert_count != insert_count) {
 		--tx_queue->insert_count;
 		buffer = __efx_tx_queue_get_insert_buffer(tx_queue);
 		efx_dequeue_buffer(tx_queue, buffer, NULL, NULL);
@@ -1258,13 +1251,14 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
 			       struct sk_buff *skb)
 {
 	struct efx_nic *efx = tx_queue->efx;
+	unsigned int old_insert_count = tx_queue->insert_count;
 	int frag_i, rc;
 	struct tso_state state;
 
 	/* Find the packet protocol and sanity-check it */
 	state.protocol = efx_tso_check_protocol(skb);
 
-	EFX_BUG_ON_PARANOID(tx_queue->write_count != tx_queue->insert_count);
+	EFX_BUG_ON_PARANOID(tx_queue->write_count > tx_queue->insert_count);
 
 	rc = tso_start(&state, efx, skb);
 	if (rc)
@@ -1308,11 +1302,12 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
 
 	netdev_tx_sent_queue(tx_queue->core_txq, skb->len);
 
-	/* Pass off to hardware */
-	efx_nic_push_buffers(tx_queue);
-
 	efx_tx_maybe_stop_queue(tx_queue);
 
+	/* Pass off to hardware */
+	if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq))
+		efx_nic_push_buffers(tx_queue);
+
 	tx_queue->tso_bursts++;
 	return NETDEV_TX_OK;
 
@@ -1336,6 +1331,6 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
 		dma_unmap_single(&efx->pci_dev->dev, state.header_dma_addr,
 				 state.header_unmap_len, DMA_TO_DEVICE);
 
-	efx_enqueue_unwind(tx_queue);
+	efx_enqueue_unwind(tx_queue, old_insert_count);
 	return NETDEV_TX_OK;
 }
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Alexander Duyck @ 2014-10-17 14:40 UTC (permalink / raw)
  To: David Laight, Eric Dumazet
  Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D1C9D895B@AcuExch.aculab.com>


On 10/17/2014 02:11 AM, David Laight wrote:
> From: Alexander Duyck
> ...
>> Actually the likelihood of anything holding onto the 4K page for very
>> long doesn't seem to occur, at least from the drivers perspective.  It
>> is one of the reasons why I went for the page reuse approach rather than
>> just partitioning a single large page.  It allows us to avoid having to
>> call IOMMU map/unmap for the pages since the entire page is usually back
>> in the driver ownership before we need to reuse the portion given to the
>> stack.
> That is almost certainly true for most benchmarks, benchmark processes
> consume receive data.
> But what about real life situations?
>
> There must be some 'normal' workloads where receive data doesn't get consumed.
>
> 	David
>

Yes, but for workloads where receive data doesn't get consumed it is 
very unlikely that much receive data is generated.  As such from the 
device perspective the time the socket is holding the page is still not 
all that long as the descriptor ring is not being looped through that 
quickly.  The page has almost always been freed by the time we have 
processed our way through the full descriptor ring.

Thanks,

Alex

^ permalink raw reply

* Re: tcpdump's capture filter: "vlan" doesn't match
From: Daniel Borkmann @ 2014-10-17 15:48 UTC (permalink / raw)
  To: Lukas Tribus
  Cc: netdev@vger.kernel.org, John Fastabend, Michał Mirosław,
	Jiri Pirko, Ben Hutchings, Atzm Watanabe, Patrick McHardy,
	Jesse Gross, Michael Richardson, Ani Sinha
In-Reply-To: <DUB123-W25C5DA9BEDC92CD2524B68EDAB0@phx.gbl>

On 10/17/2014 01:25 AM, Lukas Tribus wrote:
>>> Isn't disabling rx-vlan-offloading supposed to remedy those problems?
>>
>> There were some discussions on this in the past e.g. [1]. We have
>> SKF_AD_VLAN_TAG and SKF_AD_VLAN_TAG_PRESENT for the BPF filter on
>> this, but libpcap is currently not making use of any of them.
>>
>> [1] http://thread.gmane.org/gmane.linux.network/247947
>
> Thanks for the link. I see the situation is unfortunate and although those
> new BPF filters in the kernel may fix the actual filtering problem, one
> thing seems to remain impossible: disabling all this kernel magic and
> passing the frame as-is to libpcap without interception (avoiding any
> kind of artificial header reconstruction).
>
> How is the situation with netsniff-ng anyway? Does it use vlan BPF filter
> in the kernel?

So in netsniff-ng we don't do obscure header reconstruction as it
hurts performance and it can result in incorrect reconstruction cases.

You however can define a bpf_asm program (e.g. tools/net/ in kernel
tree) and use instruction overloading for the vlan case from there.
Thus, you're not tied to the libpcap compiler which misses this.

For more details, I refer you to Documentation/networking/filter.txt .

^ permalink raw reply

* Re: [PATCH] atm: simplify lanai.c by using module_pci_driver
From: David Miller @ 2014-10-17 15:55 UTC (permalink / raw)
  To: michael.opdenacker; +Cc: chas, linux-atm-general, netdev, linux-kernel
In-Reply-To: <1413359150-13552-1-git-send-email-michael.opdenacker@free-electrons.com>

From: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Date: Wed, 15 Oct 2014 09:45:50 +0200

> This simplifies the lanai.c driver by using
> the module_pci_driver() macro, at the expense
> of losing only debugging messages.
> 
> Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] ipv4: dst_entry leak in ip_send_unicast_reply()
From: David Miller @ 2014-10-17 15:59 UTC (permalink / raw)
  To: vvs; +Cc: eric.dumazet, netdev, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <543E6762.9090807@parallels.com>

From: Vasily Averin <vvs@parallels.com>
Date: Wed, 15 Oct 2014 16:24:02 +0400

> Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
> 
> ip_setup_cork() called inside ip_append_data() steals dst entry from rt to cork
> and in case errors in __ip_append_data() nobody frees stolen dst entry
> 
> Signed-off-by: Vasily Averin <vvs@parallels.com>

Please, in the future, put the Fixes: tag first in the list of signoffs.

Eric, please review this change, it looks good to me.

^ permalink raw reply

* Re: [PATCH] openvswitch: fix a use after free
From: Jesse Gross @ 2014-10-17 15:59 UTC (permalink / raw)
  To: roy.qing.li; +Cc: netdev, Eric Dumazet
In-Reply-To: <1413525788-2272-1-git-send-email-roy.qing.li@gmail.com>

On Fri, Oct 17, 2014 at 8:03 AM,  <roy.qing.li@gmail.com> wrote:
> From: Li RongQing <roy.qing.li@gmail.com>
>
> pskb_may_pull() called by arphdr_ok can change skb->data, so put the arp
> setting after arphdr_ok to avoid the use the freed memory
>
> Fixes: 0714812134d7d ("openvswitch: Eliminate memset() from flow_extract.")
> Cc: Jesse Gross <jesse@nicira.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Li RongQing <roy.qing.li@gmail.com>

Acked-by: Jesse Gross <jesse@nicira.com>

^ permalink raw reply

* [PATCH net] ipv6: introduce tcp_v6_iif()
From: Eric Dumazet @ 2014-10-17 16:17 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

Commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line
misses") added a regression for SO_BINDTODEVICE on IPv6.

This is because we still use inet6_iif() which expects that IP6 control
block is still at the beginning of skb->cb[]

This patch adds tcp_v6_iif() helper and uses it where necessary.

Because __inet6_lookup_skb() is used by TCP and DCCP, we add an iif
parameter to it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
---
 include/net/inet6_hashtables.h |    5 +++--
 include/net/tcp.h              |    9 +++++++++
 net/dccp/ipv6.c                |    3 ++-
 net/ipv6/syncookies.c          |    2 +-
 net/ipv6/tcp_ipv6.c            |   26 +++++++++++++++-----------
 5 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index ae0613544308..d1d272843b3b 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -80,7 +80,8 @@ static inline struct sock *__inet6_lookup(struct net *net,
 static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
 					      struct sk_buff *skb,
 					      const __be16 sport,
-					      const __be16 dport)
+					      const __be16 dport,
+					      int iif)
 {
 	struct sock *sk = skb_steal_sock(skb);
 
@@ -90,7 +91,7 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
 	return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo,
 			      &ipv6_hdr(skb)->saddr, sport,
 			      &ipv6_hdr(skb)->daddr, ntohs(dport),
-			      inet6_iif(skb));
+			      iif);
 }
 
 struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 74efeda994b3..12e895c52139 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -730,6 +730,15 @@ struct tcp_skb_cb {
 
 #define TCP_SKB_CB(__skb)	((struct tcp_skb_cb *)&((__skb)->cb[0]))
 
+
+/* This is the variant of inet6_iif() that must be used by TCP,
+ * as TCP moves IP6CB into a different location in skb->cb[]
+ */
+static inline int tcp_v6_iif(const struct sk_buff *skb)
+{
+	return TCP_SKB_CB(skb)->header.h6.iif;
+}
+
 /* Due to TSO, an SKB can be composed of multiple actual
  * packets.  To keep these tracked properly, we use this.
  */
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index ad2acfe1ca61..6bcaa33cd804 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -757,7 +757,8 @@ static int dccp_v6_rcv(struct sk_buff *skb)
 	/* Step 2:
 	 *	Look up flow ID in table and get corresponding socket */
 	sk = __inet6_lookup_skb(&dccp_hashinfo, skb,
-			        dh->dccph_sport, dh->dccph_dport);
+			        dh->dccph_sport, dh->dccph_dport,
+				inet6_iif(skb));
 	/*
 	 * Step 2:
 	 *	If no socket ...
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 9a2838e93cc5..2a86a0f00f2b 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -214,7 +214,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	/* So that link locals have meaning */
 	if (!sk->sk_bound_dev_if &&
 	    ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
-		ireq->ir_iif = inet6_iif(skb);
+		ireq->ir_iif = tcp_v6_iif(skb);
 
 	ireq->ir_mark = inet_request_mark(sk, skb);
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index cf2e45ab2fa4..831495529b82 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -424,6 +424,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		if (sock_owned_by_user(sk))
 			goto out;
 
+		/* Note : We use inet6_iif() here, not tcp_v6_iif() */
 		req = inet6_csk_search_req(sk, &prev, th->dest, &hdr->daddr,
 					   &hdr->saddr, inet6_iif(skb));
 		if (!req)
@@ -738,7 +739,7 @@ static void tcp_v6_init_req(struct request_sock *req, struct sock *sk,
 	/* So that link locals have meaning */
 	if (!sk->sk_bound_dev_if &&
 	    ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
-		ireq->ir_iif = inet6_iif(skb);
+		ireq->ir_iif = tcp_v6_iif(skb);
 
 	if (!TCP_SKB_CB(skb)->tcp_tw_isn &&
 	    (ipv6_opt_accepted(sk, skb, &TCP_SKB_CB(skb)->header.h6) ||
@@ -860,7 +861,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
 
 	fl6.flowi6_proto = IPPROTO_TCP;
 	if (rt6_need_strict(&fl6.daddr) && !oif)
-		fl6.flowi6_oif = inet6_iif(skb);
+		fl6.flowi6_oif = tcp_v6_iif(skb);
 	else
 		fl6.flowi6_oif = oif;
 	fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark);
@@ -918,7 +919,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
 		sk1 = inet6_lookup_listener(dev_net(skb_dst(skb)->dev),
 					   &tcp_hashinfo, &ipv6h->saddr,
 					   th->source, &ipv6h->daddr,
-					   ntohs(th->source), inet6_iif(skb));
+					   ntohs(th->source), tcp_v6_iif(skb));
 		if (!sk1)
 			return;
 
@@ -1000,13 +1001,14 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
 	/* Find possible connection requests. */
 	req = inet6_csk_search_req(sk, &prev, th->source,
 				   &ipv6_hdr(skb)->saddr,
-				   &ipv6_hdr(skb)->daddr, inet6_iif(skb));
+				   &ipv6_hdr(skb)->daddr, tcp_v6_iif(skb));
 	if (req)
 		return tcp_check_req(sk, skb, req, prev, false);
 
 	nsk = __inet6_lookup_established(sock_net(sk), &tcp_hashinfo,
-			&ipv6_hdr(skb)->saddr, th->source,
-			&ipv6_hdr(skb)->daddr, ntohs(th->dest), inet6_iif(skb));
+					 &ipv6_hdr(skb)->saddr, th->source,
+					 &ipv6_hdr(skb)->daddr, ntohs(th->dest),
+					 tcp_v6_iif(skb));
 
 	if (nsk) {
 		if (nsk->sk_state != TCP_TIME_WAIT) {
@@ -1090,7 +1092,7 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 		newnp->ipv6_fl_list = NULL;
 		newnp->pktoptions  = NULL;
 		newnp->opt	   = NULL;
-		newnp->mcast_oif   = inet6_iif(skb);
+		newnp->mcast_oif   = tcp_v6_iif(skb);
 		newnp->mcast_hops  = ipv6_hdr(skb)->hop_limit;
 		newnp->rcv_flowinfo = ip6_flowinfo(ipv6_hdr(skb));
 		if (np->repflow)
@@ -1174,7 +1176,7 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 			skb_set_owner_r(newnp->pktoptions, newsk);
 	}
 	newnp->opt	  = NULL;
-	newnp->mcast_oif  = inet6_iif(skb);
+	newnp->mcast_oif  = tcp_v6_iif(skb);
 	newnp->mcast_hops = ipv6_hdr(skb)->hop_limit;
 	newnp->rcv_flowinfo = ip6_flowinfo(ipv6_hdr(skb));
 	if (np->repflow)
@@ -1360,7 +1362,7 @@ ipv6_pktoptions:
 	if (TCP_SKB_CB(opt_skb)->end_seq == tp->rcv_nxt &&
 	    !((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))) {
 		if (np->rxopt.bits.rxinfo || np->rxopt.bits.rxoinfo)
-			np->mcast_oif = inet6_iif(opt_skb);
+			np->mcast_oif = tcp_v6_iif(opt_skb);
 		if (np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim)
 			np->mcast_hops = ipv6_hdr(opt_skb)->hop_limit;
 		if (np->rxopt.bits.rxflow || np->rxopt.bits.rxtclass)
@@ -1427,7 +1429,8 @@ static int tcp_v6_rcv(struct sk_buff *skb)
 	TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
 	TCP_SKB_CB(skb)->sacked = 0;
 
-	sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
+	sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest,
+				tcp_v6_iif(skb));
 	if (!sk)
 		goto no_tcp_socket;
 
@@ -1514,7 +1517,7 @@ do_time_wait:
 		sk2 = inet6_lookup_listener(dev_net(skb->dev), &tcp_hashinfo,
 					    &ipv6_hdr(skb)->saddr, th->source,
 					    &ipv6_hdr(skb)->daddr,
-					    ntohs(th->dest), inet6_iif(skb));
+					    ntohs(th->dest), tcp_v6_iif(skb));
 		if (sk2 != NULL) {
 			struct inet_timewait_sock *tw = inet_twsk(sk);
 			inet_twsk_deschedule(tw, &tcp_death_row);
@@ -1553,6 +1556,7 @@ static void tcp_v6_early_demux(struct sk_buff *skb)
 	if (th->doff < sizeof(struct tcphdr) / 4)
 		return;
 
+	/* Note : We use inet6_iif() here, not tcp_v6_iif() */
 	sk = __inet6_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
 					&hdr->saddr, th->source,
 					&hdr->daddr, ntohs(th->dest),

^ permalink raw reply related

* Re: [Patch net 1/3] ipv4: call __ip_options_echo() in cookie_v4_check()
From: David Miller @ 2014-10-17 16:18 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, kkolasa, edumazet, cwang
In-Reply-To: <1413408802-21052-1-git-send-email-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Wed, 15 Oct 2014 14:33:20 -0700

> From: Cong Wang <cwang@twopensource.com>
> 
> commit 971f10eca186cab238c49da ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
> missed that cookie_v4_check() still calls ip_options_echo() which uses
> IPCB(). It should use TCPCB() at TCP layer, so call __ip_options_echo()
> instead.
> 
> Fixes: commit 971f10eca186cab238c49da ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
> Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
> Cc: Eric Dumazet <edumazet@google.com>
> Reported-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
> Tested-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
> Signed-off-by: Cong Wang <cwang@twopensource.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied.

^ permalink raw reply

* Re: [Patch net 2/3] ipv4: share tcp_v4_save_options() with cookie_v4_check()
From: David Miller @ 2014-10-17 16:18 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, kkolasa, edumazet, cwang
In-Reply-To: <1413408802-21052-2-git-send-email-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Wed, 15 Oct 2014 14:33:21 -0700

> From: Cong Wang <cwang@twopensource.com>
> 
> cookie_v4_check() allocates ip_options_rcu in the same way
> with tcp_v4_save_options(), we can just make it a helper function.
> 
> Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Cong Wang <cwang@twopensource.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied.

^ permalink raw reply

* Re: [Patch net 3/3] ipv4: clean up cookie_v4_check()
From: David Miller @ 2014-10-17 16:18 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, kkolasa, edumazet, cwang
In-Reply-To: <1413408802-21052-3-git-send-email-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Wed, 15 Oct 2014 14:33:22 -0700

> From: Cong Wang <cwang@twopensource.com>
> 
> We can retrieve opt from skb, no need to pass it as a parameter.
> And opt should always be non-NULL, no need to check.
> 
> Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
> Cc: Eric Dumazet <edumazet@google.com>
> Tested-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
> Signed-off-by: Cong Wang <cwang@twopensource.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied.

^ permalink raw reply

* RE: [PATCH net] ipv6: introduce tcp_v6_iif()
From: David Laight @ 2014-10-17 16:26 UTC (permalink / raw)
  To: 'Eric Dumazet', David Miller; +Cc: netdev
In-Reply-To: <1413562640.24953.26.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet
> Commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line
> misses") added a regression for SO_BINDTODEVICE on IPv6.
> 
> This is because we still use inet6_iif() which expects that IP6 control
> block is still at the beginning of skb->cb[]
> 
> This patch adds tcp_v6_iif() helper and uses it where necessary.
> 
> Because __inet6_lookup_skb() is used by TCP and DCCP, we add an iif
> parameter to it.
...
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index cf2e45ab2fa4..831495529b82 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -424,6 +424,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
>  		if (sock_owned_by_user(sk))
>  			goto out;
> 
> +		/* Note : We use inet6_iif() here, not tcp_v6_iif() */
>  		req = inet6_csk_search_req(sk, &prev, th->dest, &hdr->daddr,
>  					   &hdr->saddr, inet6_iif(skb));
>  		if (!req)

That comment isn't particularly informative....

	David


^ permalink raw reply

* Re: [PATCH net] ipv6: introduce tcp_v6_iif()
From: Eric Dumazet @ 2014-10-17 16:52 UTC (permalink / raw)
  To: David Laight; +Cc: David Miller, netdev
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D1C9D8C89@AcuExch.aculab.com>

On Fri, 2014-10-17 at 16:26 +0000, David Laight wrote:
> From: Eric Dumazet
> > Commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line
> > misses") added a regression for SO_BINDTODEVICE on IPv6.
> > 
> > This is because we still use inet6_iif() which expects that IP6 control
> > block is still at the beginning of skb->cb[]
> > 
> > This patch adds tcp_v6_iif() helper and uses it where necessary.
> > 
> > Because __inet6_lookup_skb() is used by TCP and DCCP, we add an iif
> > parameter to it.
> ...
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index cf2e45ab2fa4..831495529b82 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -424,6 +424,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
> >  		if (sock_owned_by_user(sk))
> >  			goto out;
> > 
> > +		/* Note : We use inet6_iif() here, not tcp_v6_iif() */
> >  		req = inet6_csk_search_req(sk, &prev, th->dest, &hdr->daddr,
> >  					   &hdr->saddr, inet6_iif(skb));
> >  		if (!req)
> 
> That comment isn't particularly informative....

It is showing I considered this spot, and the right thing here is to use
inet6_iif()

The commit changelog will give to curious people the reasons.

Adding fat comments is superseded with good changelog.

^ permalink raw reply

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Eric Dumazet @ 2014-10-17 16:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Laight, Jiafei.Pan@freescale.com, David Miller,
	jkosina@suse.cz, netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <54412A59.7070508@redhat.com>


On Fri, 2014-10-17 at 07:40 -0700, Alexander Duyck wrote:
> On 10/17/2014 02:11 AM, David Laight wrote:
> > From: Alexander Duyck
> > ...
> >> Actually the likelihood of anything holding onto the 4K page for very
> >> long doesn't seem to occur, at least from the drivers perspective.  It
> >> is one of the reasons why I went for the page reuse approach rather than
> >> just partitioning a single large page.  It allows us to avoid having to
> >> call IOMMU map/unmap for the pages since the entire page is usually back
> >> in the driver ownership before we need to reuse the portion given to the
> >> stack.
> > That is almost certainly true for most benchmarks, benchmark processes
> > consume receive data.
> > But what about real life situations?
> >
> > There must be some 'normal' workloads where receive data doesn't get consumed.
> >
> > 	David
> >
> 
> Yes, but for workloads where receive data doesn't get consumed it is 
> very unlikely that much receive data is generated. 

This is very optimistic.

Any kind of flood can generate 5 or 6 Million packets per second.

So in stress condition, we possibly consume twice more ram than alloted
in tcp_mem.  (About 3GBytes per second, think about it)

This is fine, if admins are aware of that and can adjust tcp_mem
accordingly to this.

Apparently none of your customers suffered from this, maybe they had
enough headroom to absorb the over commit or they trusted us and could not
find culprit if they had issues.

Open 50,000 tcp sockets, do not read data on 50% of them (pretend you
are busy on disk access or doing cpu intensive work). As traffic is interleaved
(between consumed data and non consumed data), you'll have the side
effect of consuming more ram than advertised.

Compare /proc/net/protocols (grep TCP /proc/net/protocols) and output of
'free', and you'll see that we are not good citizens.

I will work on TCP stack, to go beyond what I did in commit
b49960a05e3212 ("tcp: change tcp_adv_win_scale and tcp_rmem[2]")

So that TCP should not care if a driver chose to potentially use 4K per
MSS.

Right now, it seems we can drop few packets, and get a slight reduction in
throughput (TCP is very sensitive to losses, even if we drop 0.1 % of packets)



^ permalink raw reply

* Re: [PATCH] ipv4: dst_entry leak in ip_send_unicast_reply()
From: Eric Dumazet @ 2014-10-17 16:56 UTC (permalink / raw)
  To: David Miller; +Cc: vvs, netdev, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20141017.115932.740288542547693461.davem@davemloft.net>

On Fri, 2014-10-17 at 11:59 -0400, David Miller wrote:
> From: Vasily Averin <vvs@parallels.com>
> Date: Wed, 15 Oct 2014 16:24:02 +0400
> 
> > Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
> > 
> > ip_setup_cork() called inside ip_append_data() steals dst entry from rt to cork
> > and in case errors in __ip_append_data() nobody frees stolen dst entry
> > 
> > Signed-off-by: Vasily Averin <vvs@parallels.com>
> 
> Please, in the future, put the Fixes: tag first in the list of signoffs.
> 
> Eric, please review this change, it looks good to me.

It is good indeed, thanks !

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox