Netdev List

* [PATCH v2 0/2] Add ceph root filesystem functionality and documentation.
From: mark.doffman-4yDnlxn2s6sWdaTGBSpHTA @ 2014-01-15 17:26 UTC (permalink / raw)
  To: ceph-devel-u79uwXL29TY76Z2rM5mHXA
  Cc: Mark Doffman, sage-4GqslpFJ+cxBDgjK7y7TUQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	rob.taylor-4yDnlxn2s6sWdaTGBSpHTA
In-Reply-To: <1385000024-23463-1-git-send-email-mark.doffman-4yDnlxn2s6sWdaTGBSpHTA@public.gmane.org>

From: Mark Doffman <mark.doffman-4yDnlxn2s6sWdaTGBSpHTA@public.gmane.org>

Hi All,

The following is a second version of a patch series that adds the ability to use
a ceph distributed file system as the root device.

Changes from version 1

fs/ceph/root.c:

The parsing code that takes the DHCP option 17 and kernel command line
parameters has been extensively altered.

The parsing now accepts multiple monitor addresses and IPv6 addresses.

The monitors listed in DHCP option 17 are now concatenated with those
listed on the kernel command line.

The patch series applies to v3.13-rc8-7-g3539717

Thanks

Mark

Mark Doffman (1):
  init: Add a new root device option, the Ceph file system

Rob Taylor (1):
  Documentation: Document the cephroot functionality

 Documentation/filesystems/{ => ceph}/ceph.txt |   0
 Documentation/filesystems/ceph/cephroot.txt   |  86 +++++++++++++
 fs/ceph/Kconfig                               |  10 ++
 fs/ceph/Makefile                              |   1 +
 fs/ceph/root.c                                | 176 ++++++++++++++++++++++++++
 include/linux/ceph/ceph_root.h                |  10 ++
 include/linux/root_dev.h                      |   1 +
 init/do_mounts.c                              |  32 ++++-
 net/ipv4/ipconfig.c                           |  10 +-
 9 files changed, 323 insertions(+), 3 deletions(-)
 rename Documentation/filesystems/{ => ceph}/ceph.txt (100%)
 create mode 100644 Documentation/filesystems/ceph/cephroot.txt
 create mode 100644 fs/ceph/root.c
 create mode 100644 include/linux/ceph/ceph_root.h

-- 
1.8.4

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next v3 2/2] net:  Check skb->rxhash in gro_receive
From: Eric Dumazet @ 2014-01-15 17:25 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, Jerry Chu
In-Reply-To: <alpine.DEB.2.02.1401150853400.14933@tomh.mtv.corp.google.com>

On Wed, 2014-01-15 at 08:58 -0800, Tom Herbert wrote:
> When initializing a gro_list for a packet, first check the rxhash of
> the incoming skb against that of the skbs in the list. This should be
> a very strong indicator of whether the flow is going to be matched,
> and potentially allows a lot of other checks to be short-circuited.
> Use skb_get_hash_raw so that we don't force the hash to be calculated.
> 
> Tested by running netperf 200 TCP_STREAMs between two machines with
> GRO, HW rxhash, and 1G. Saw no performance degradation, slight reduction
> of time in dev_gro_receive.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  net/core/dev.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 20c834e..c063c7c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3818,10 +3818,18 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
>  {
>  	struct sk_buff *p;
>  	unsigned int maclen = skb->dev->hard_header_len;
> +	u32 hash = skb_get_hash_raw(skb);
>  
>  	for (p = napi->gro_list; p; p = p->next) {
>  		unsigned long diffs;
>  
> +		NAPI_GRO_CB(p)->flush = 0;
> +
> +		if (hash != skb_get_hash_raw(p)) {
> +			NAPI_GRO_CB(p)->same_flow = 0;
> +			continue;
> +		}
> +
>  		diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
>  		diffs |= p->vlan_tci ^ skb->vlan_tci;
>  		if (maclen == ETH_HLEN)
> @@ -3832,7 +3840,6 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
>  				       skb_gro_mac_header(skb),
>  				       maclen);
>  		NAPI_GRO_CB(p)->same_flow = !diffs;
> -		NAPI_GRO_CB(p)->flush = 0;
>  	}
>  }
>  

Acked-by: Eric Dumazet <edumazet@google.com>

Hmm, this looks like we should clear flush_id in the IPv6 handler;
otherwise we might reuse a flush_id set by a prior GRO invocation in
IPv4 (an skb can be reused in napi_reuse_skb()).

Jerry, what do you think of the following fix?

diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 1e8683b135bb..598acd76ca4a 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -256,6 +256,7 @@ static struct sk_buff **ipv6_gro_receive(struct sk_buff **head,
                /* flush if Traffic Class fields are different */
                NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF00000));
                NAPI_GRO_CB(p)->flush |= flush;
+               NAPI_GRO_CB(p)->flush_id = 0;
        }
 
        NAPI_GRO_CB(skb)->flush |= flush;

^ permalink raw reply related

* Re: [PATCH 2/4] Documentation: Document the cephroot functionality
From: Mark Doffman @ 2014-01-15 17:22 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, Rob Taylor,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.DEB.2.00.1312062154530.1560-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>

Hi Sage,

On 12/06/2013 11:57 PM, Sage Weil wrote:
> On Wed, 20 Nov 2013, mark.doffman-4yDnlxn2s6sWdaTGBSpHTA@public.gmane.org wrote:
>> From: Rob Taylor <rob.taylor-4yDnlxn2s6sWdaTGBSpHTA@public.gmane.org>
>>
>> Document using the cephfs as a root device, its purpose,
>> functionality and use.
>>
>> Signed-off-by: Mark Doffman <mark.doffman-4yDnlxn2s6sWdaTGBSpHTA@public.gmane.org>
>> Signed-off-by: Rob Taylor <rob.taylor-4yDnlxn2s6sWdaTGBSpHTA@public.gmane.org>
>> Reviewed-by: Ian Molton <ian.molton-4yDnlxn2s6sWdaTGBSpHTA@public.gmane.org>
>> ---
>>   Documentation/filesystems/{ => ceph}/ceph.txt |  0
>>   Documentation/filesystems/ceph/cephroot.txt   | 81 +++++++++++++++++++++++++++
>>   2 files changed, 81 insertions(+)
>>   rename Documentation/filesystems/{ => ceph}/ceph.txt (100%)
>>   create mode 100644 Documentation/filesystems/ceph/cephroot.txt
>>
>> diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph/ceph.txt
>> similarity index 100%
>> rename from Documentation/filesystems/ceph.txt
>> rename to Documentation/filesystems/ceph/ceph.txt
>> diff --git a/Documentation/filesystems/ceph/cephroot.txt b/Documentation/filesystems/ceph/cephroot.txt
>> new file mode 100644
>> index 0000000..ae0f5bb
>> --- /dev/null
>> +++ b/Documentation/filesystems/ceph/cephroot.txt
>> @@ -0,0 +1,81 @@
>> +Mounting the root filesystem via Ceph (cephroot)
>> +===============================================
>> +
>> +Written 2013 by Rob Taylor <rob.taylor-4yDnlxn2s6sWdaTGBSpHTA@public.gmane.org>
>> +
>> +derived from nfsroot.txt:
>> +
>> +Written 1996 by Gero Kuhlmann <gero-TuicA7gkpRym5h6znurzUg@public.gmane.org>
>> +Updated 1997 by Martin Mares <mj-jyMamyUUXNJG4ohzP4jBZS1Fcj925eT/@public.gmane.org>
>> +Updated 2006 by Nico Schottelius <nico-kernel-nfsroot-xuaVFQXs+5hIG4jRRZ66WA@public.gmane.org>
>> +Updated 2006 by Horms <horms-/R6kz+dDXgpPR4JQBCEnsQ@public.gmane.org>
>> +
>> +
>> +
>> +In order to use a diskless system, such as an X-terminal or printer server
>> +for example, it is necessary for the root filesystem to be present on a
>> +non-disk device. This may be an initramfs (see Documentation/filesystems/
>> +ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt), a
>> +filesystem mounted via NFS or a filesystem mounted via Ceph. The following
>> +text describes on how to use Ceph for the root filesystem.
>> +
>> +For the rest of this text 'client' means the diskless system, and 'server'
>> +means the Ceph server.
>> +
>> +
>> +1.) Enabling cephroot capabilities
>> +    -----------------------------
>> +
>> +In order to use cephroot, CEPH_FS needs to be selected as
>> +built-in during configuration. Once this has been selected, the cephroot
>> +option will become available, which should also be selected.
>> +
>> +In the networking options, kernel level autoconfiguration can be selected,
>> +along with the types of autoconfiguration to support. Selecting all of
>> +DHCP, BOOTP and RARP is safe.
>> +
>> +
>> +2.) Kernel command line
>> +    -------------------
>> +
>> +When the kernel has been loaded by a boot loader (see below) it needs to be
>> +told what root fs device to use. And in the case of cephroot, where to find
>> +both the server and the name of the directory on the server to mount as root.
>> +This can be established using the following kernel command line parameters:
>> +
>> +root=/dev/ceph
>> +
>> +This is necessary to enable the pseudo-Ceph-device. Note that it's not a
>> +real device but just a synonym to tell the kernel to use Ceph instead of
>> +a real device.
>> +
>> +cephroot=<monaddr>:/[<subdir>],<ceph-opts>
>> +
>> +  <monaddr>     Monitor address. Each takes the form host[:port]. If the port
>> +		is not specified, the Ceph default of 6789 is assumed.
>> +
>> +  <subdir>	A subdirectory subdir may be specified if a subset of the file
>> +		system is to be mounted
>> +
>> +  <ceph-opts>	Standard Ceph options. All options are separated by commas.
>> +		See Documentation/filesystems/ceph/ceph.txt for options and
>> +		their defaults.
>
> Maybe there is an existing convention here, but: it seems like it would be
> simpler to do something like
>
>   cephroot=<ip[:<port>][,...]>:/[<subdir>]
>
> i.e., the existing syntax used by mount, that (among other things) can
> also include a port, or be a list of mon ips, so that the parsing code
> can be re-used.  Then,
>
>   cephopts=<ceph-opts>
>
> Hopefully this would avoid the parsing in root.c and make things behave
> more consistently with respect to how mount(8) is used?

This would make things more consistent with mount, and easier! The
reason to keep the current syntax is consistency with NFS and DHCP
option 17.

NFS concatenates the options in DHCP root-path (option 17) with the ones 
placed on the kernel command line. We could separate out the device and 
path strings from the options, but they would still be merged together 
in the DHCP string. Some parsing would still be required to split the 
DHCP string and merge with command line options. I'd prefer to keep them 
together on the command line also, just to have things stay similar to NFS.

Thanks

Mark
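
For illustration only, the split-and-merge step discussed above can be
sketched in user-space C. This is not the code in fs/ceph/root.c. The names
cephroot_spec, cephroot_split and cephroot_merge_mons are hypothetical, and
two things are assumed rather than taken from the patch: that monitors are
comma-separated (as in the mount syntax Sage mentions) and that the DHCP
monitors come first in the merged list.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three parts of a cephroot string. */
struct cephroot_spec {
	char mons[128];   /* monitor list, e.g. "192.168.0.1:6789,[fe80::1]" */
	char subdir[64];  /* optional subdirectory, without the leading '/' */
	char opts[128];   /* remaining comma-separated Ceph options */
};

/*
 * Split "<monaddr>:/[<subdir>],<ceph-opts>" into its parts.
 * The first ":/" ends the monitor list, so bracketed IPv6 addresses
 * (which contain ':' but not ":/") pass through untouched.
 * Returns 0 on success, -1 if the string has no ":/" or a part is too long.
 */
static int cephroot_split(const char *arg, struct cephroot_spec *out)
{
	const char *sep, *path, *comma;
	size_t mon_len, dir_len;

	assert(arg != NULL && out != NULL);
	memset(out, 0, sizeof(*out));

	sep = strstr(arg, ":/");
	if (!sep)
		return -1;

	mon_len = (size_t)(sep - arg);
	path = sep + 2;                 /* skip ":/" */
	comma = strchr(path, ',');      /* first comma after the path starts the options */
	dir_len = comma ? (size_t)(comma - path) : strlen(path);

	if (mon_len >= sizeof(out->mons) || dir_len >= sizeof(out->subdir))
		return -1;
	if (comma && strlen(comma + 1) >= sizeof(out->opts))
		return -1;

	/* out was zeroed above, so the copies stay NUL-terminated. */
	memcpy(out->mons, arg, mon_len);
	memcpy(out->subdir, path, dir_len);
	if (comma)
		strcpy(out->opts, comma + 1);
	return 0;
}

/*
 * Concatenate the monitors from DHCP option 17 with those from the
 * command line. The DHCP-first ordering is an assumption of this sketch.
 */
static int cephroot_merge_mons(const char *dhcp_mons, const char *cmdline_mons,
			       char *buf, size_t len)
{
	int n;

	assert(dhcp_mons != NULL && cmdline_mons != NULL && buf != NULL);

	if (dhcp_mons[0] && cmdline_mons[0])
		n = snprintf(buf, len, "%s,%s", dhcp_mons, cmdline_mons);
	else
		n = snprintf(buf, len, "%s%s", dhcp_mons, cmdline_mons);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Even with the options kept in a separate cephopts= parameter, a step like
cephroot_split would still be needed for the DHCP root-path string, which is
the point made in the reply.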

>
> sage
>
>> +
>> +4.) References
>> +    ----------
>> +
>> +
>> +5.) Credits
>> +    -------
>> +
>> +  cephroot was derived from nfsroot by Rob Taylor <rob.taylor-4yDnlxn2s6sWdaTGBSpHTA@public.gmane.org>
>> +  and Mark Doffman <mark.doffman-4yDnlxn2s6sWdaTGBSpHTA@public.gmane.org>
>> +
>> +  The nfsroot code in the kernel and the RARP support have been written
>> +  by Gero Kuhlmann <gero-TuicA7gkpRym5h6znurzUg@public.gmane.org>.
>> +
>> +  The rest of the IP layer autoconfiguration code has been written
>> +  by Martin Mares <mj-jyMamyUUXNJG4ohzP4jBZS1Fcj925eT/@public.gmane.org>.
>> +
>> +  In order to write the initial version of nfsroot I would like to thank
>> +  Jens-Uwe Mager <jum-gG2S4stXkm6Shm5Tz/htGQ@public.gmane.org> for his help.
>> --
>> 1.8.4
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

^ permalink raw reply

* Re: [PATCH net-next v3 1/2] net: Add skb_get_hash_raw
From: Eric Dumazet @ 2014-01-15 17:15 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.2.02.1401150853090.14881@tomh.mtv.corp.google.com>

On Wed, 2014-01-15 at 08:57 -0800, Tom Herbert wrote:
> Function to just return skb->rxhash without checking to see if it needs
> to be recomputed.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  include/linux/skbuff.h | 5 +++++
>  1 file changed, 5 insertions(+)

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* [PATCH net-next v2] xen-netback: Rework rx_work_todo
From: Zoltan Kiss @ 2014-01-15 17:11 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, xen-devel, netdev, linux-kernel,
	jonathan.davies
  Cc: Zoltan Kiss

The recent patch to fix receive-side flow control (11b57f) solved the spinning
thread problem, but introduced another one. The receive side can stall if:
- [THREAD] xenvif_rx_action sets rx_queue_stopped to true
- [INTERRUPT] an interrupt happens and sets rx_event to true
- [THREAD] xenvif_kthread then sets rx_event to false
- [THREAD] rx_work_todo no longer returns true

Also, if an interrupt is sent but there is still no room in the ring, it takes
quite a long time until xenvif_rx_action realizes it. This patch drops those two
variables and reworks rx_work_todo. If the thread finds it can't fit more skbs
into the ring, it saves the last slot estimate in rx_last_skb_slots; otherwise
the field is kept at 0. rx_work_todo then checks that:
- there is something to send to the ring (as before)
- there is space for the topmost packet in the queue

I think that is a more natural and better thing to test than two bools which are
set somewhere else.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
---
 drivers/net/xen-netback/common.h    |    6 +-----
 drivers/net/xen-netback/interface.c |    1 -
 drivers/net/xen-netback/netback.c   |   16 ++++++----------
 3 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 4c76bcb..ae413a2 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -143,11 +143,7 @@ struct xenvif {
 	char rx_irq_name[IFNAMSIZ+4]; /* DEVNAME-rx */
 	struct xen_netif_rx_back_ring rx;
 	struct sk_buff_head rx_queue;
-	bool rx_queue_stopped;
-	/* Set when the RX interrupt is triggered by the frontend.
-	 * The worker thread may need to wake the queue.
-	 */
-	bool rx_event;
+	RING_IDX rx_last_skb_slots;
 
 	/* This array is allocated seperately as it is large */
 	struct gnttab_copy *grant_copy_op;
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index b9de31e..7669d49 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -100,7 +100,6 @@ static irqreturn_t xenvif_rx_interrupt(int irq, void *dev_id)
 {
 	struct xenvif *vif = dev_id;
 
-	vif->rx_event = true;
 	xenvif_kick_thread(vif);
 
 	return IRQ_HANDLED;
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 2738563..bb241d0 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -477,7 +477,6 @@ static void xenvif_rx_action(struct xenvif *vif)
 	unsigned long offset;
 	struct skb_cb_overlay *sco;
 	bool need_to_notify = false;
-	bool ring_full = false;
 
 	struct netrx_pending_operations npo = {
 		.copy  = vif->grant_copy_op,
@@ -487,7 +486,7 @@ static void xenvif_rx_action(struct xenvif *vif)
 	skb_queue_head_init(&rxq);
 
 	while ((skb = skb_dequeue(&vif->rx_queue)) != NULL) {
-		int max_slots_needed;
+		RING_IDX max_slots_needed;
 		int i;
 
 		/* We need a cheap worse case estimate for the number of
@@ -510,9 +509,10 @@ static void xenvif_rx_action(struct xenvif *vif)
 		if (!xenvif_rx_ring_slots_available(vif, max_slots_needed)) {
 			skb_queue_head(&vif->rx_queue, skb);
 			need_to_notify = true;
-			ring_full = true;
+			vif->rx_last_skb_slots = max_slots_needed;
 			break;
-		}
+		} else
+			vif->rx_last_skb_slots = 0;
 
 		sco = (struct skb_cb_overlay *)skb->cb;
 		sco->meta_slots_used = xenvif_gop_skb(skb, &npo);
@@ -523,8 +523,6 @@ static void xenvif_rx_action(struct xenvif *vif)
 
 	BUG_ON(npo.meta_prod > ARRAY_SIZE(vif->meta));
 
-	vif->rx_queue_stopped = !npo.copy_prod && ring_full;
-
 	if (!npo.copy_prod)
 		goto done;
 
@@ -1727,8 +1725,8 @@ static struct xen_netif_rx_response *make_rx_response(struct xenvif *vif,
 
 static inline int rx_work_todo(struct xenvif *vif)
 {
-	return (!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) ||
-		vif->rx_event;
+	return !skb_queue_empty(&vif->rx_queue) &&
+	       xenvif_rx_ring_slots_available(vif, vif->rx_last_skb_slots);
 }
 
 static inline int tx_work_todo(struct xenvif *vif)
@@ -1814,8 +1812,6 @@ int xenvif_kthread(void *data)
 		if (!skb_queue_empty(&vif->rx_queue))
 			xenvif_rx_action(vif);
 
-		vif->rx_event = false;
-
 		if (skb_queue_empty(&vif->rx_queue) &&
 		    netif_queue_stopped(vif->dev))
 			xenvif_start_queue(vif);
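
The reworked condition can be tried out in a small user-space model. This is
only a sketch under stated assumptions: struct vif_model, ring_slots_available
and rx_action_one are hypothetical stand-ins for struct xenvif,
xenvif_rx_ring_slots_available and one iteration of the loop in
xenvif_rx_action, and the ring is reduced to a free-slot counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields of struct xenvif used by the check. */
struct vif_model {
	unsigned int queued_skbs;       /* packets waiting in rx_queue */
	unsigned int ring_free_slots;   /* free slots in the shared rx ring */
	unsigned int rx_last_skb_slots; /* estimate saved when the ring was full, else 0 */
};

static bool ring_slots_available(const struct vif_model *vif, unsigned int needed)
{
	return vif->ring_free_slots >= needed;
}

/* Mirrors the new rx_work_todo: work exists only if the head packet can fit. */
static bool rx_work_todo(const struct vif_model *vif)
{
	return vif->queued_skbs > 0 &&
	       ring_slots_available(vif, vif->rx_last_skb_slots);
}

/*
 * Try to move the head packet, which needs `needed` slots, into the ring.
 * On failure the estimate is remembered; on success it is reset to 0.
 */
static bool rx_action_one(struct vif_model *vif, unsigned int needed)
{
	assert(vif->queued_skbs > 0);

	if (!ring_slots_available(vif, needed)) {
		vif->rx_last_skb_slots = needed;
		return false;
	}

	vif->rx_last_skb_slots = 0;
	vif->ring_free_slots -= needed;
	vif->queued_skbs--;
	return true;
}
```

In this model the thread goes idle while the ring is too full for the head
packet and becomes runnable again as soon as enough slots are freed, without
depending on a flag that another context may clear.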

^ permalink raw reply related

* RE: [Xen-devel] [PATCH net-next] xen-netfront: add support for IPv6 offloads
From: Paul Durrant @ 2014-01-15 17:08 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: David Vrabel, xen-devel@lists.xen.org, Boris Ostrovsky,
	netdev@vger.kernel.org
In-Reply-To: <52D6BF630200007800113EFB@nat28.tlf.novell.com>

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 15 January 2014 16:04
> To: Andrew Cooper; Paul Durrant
> Cc: David Vrabel; xen-devel@lists.xen.org; Boris Ostrovsky;
> netdev@vger.kernel.org
> Subject: Re: [Xen-devel] [PATCH net-next] xen-netfront: add support for
> IPv6 offloads
> 
> >>> On 15.01.14 at 16:54, Paul Durrant <Paul.Durrant@citrix.com> wrote:
> >> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> >> On 15/01/14 15:18, Paul Durrant wrote:
> >> > +	err = xenbus_printf(xbt, dev->nodename, "feature-gso-tcpv6",
> "%d", 1);
> >>
> >> "%d", 1 results in a constant string.  xenbus_write() would avoid a
> >> transitory memory allocation.
> >
> > This code is consistent with all the other xenbus_printf()s in the
> > neighbourhood and does it really matter?
> 
> I think we should always strive to have the simplest possible code
> that fulfills the purpose. And hence we shouldn't be setting further
> bad precedents. (In fact I have a patch queued to replace all the
> unnecessary xenbus_printf()s with xenbus_write()s on
> linux-2.6.18-xen.hg, and may look into porting this to the
> respective upstream components.)
> 

Ok. Personally I'd go for code consistency with this patch and then a full replacement... but I'll re-work it.

  Paul
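
The observation that "%d", 1 always yields the same string can be checked
with plain snprintf in user space. This is only a sketch: format_flag is a
hypothetical helper and none of the xenbus calls are used here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the constant 1 with "%d", as the xenbus_printf() call in the
 * patch does. Returns the number of characters written (excluding NUL).
 */
static int format_flag(char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%d", 1);
}
```

The result is always the one-character string "1", which is why passing that
literal to xenbus_write() directly is suggested in the review.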

^ permalink raw reply

* [PATCH net-next v3 2/2] net:  Check skb->rxhash in gro_receive
From: Tom Herbert @ 2014-01-15 16:58 UTC (permalink / raw)
  To: davem, netdev

When initializing a gro_list for a packet, first check the rxhash of
the incoming skb against that of the skbs in the list. This should be
a very strong indicator of whether the flow is going to be matched,
and potentially allows a lot of other checks to be short-circuited.
Use skb_get_hash_raw so that we don't force the hash to be calculated.

Tested by running netperf 200 TCP_STREAMs between two machines with
GRO, HW rxhash, and 1G. Saw no performance degradation, slight reduction
of time in dev_gro_receive.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 net/core/dev.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 20c834e..c063c7c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3818,10 +3818,18 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
 {
 	struct sk_buff *p;
 	unsigned int maclen = skb->dev->hard_header_len;
+	u32 hash = skb_get_hash_raw(skb);
 
 	for (p = napi->gro_list; p; p = p->next) {
 		unsigned long diffs;
 
+		NAPI_GRO_CB(p)->flush = 0;
+
+		if (hash != skb_get_hash_raw(p)) {
+			NAPI_GRO_CB(p)->same_flow = 0;
+			continue;
+		}
+
 		diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
 		diffs |= p->vlan_tci ^ skb->vlan_tci;
 		if (maclen == ETH_HLEN)
@@ -3832,7 +3840,6 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
 				       skb_gro_mac_header(skb),
 				       maclen);
 		NAPI_GRO_CB(p)->same_flow = !diffs;
-		NAPI_GRO_CB(p)->flush = 0;
 	}
 }
 
-- 
1.8.5.2
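
The short-circuit in gro_list_prepare can be modeled outside the kernel as
follows. This is a sketch under stated assumptions, not the kernel code:
struct pkt_model and gro_list_prepare_model are hypothetical, the hash field
stands in for skb->rxhash, and a single memcmp over a 14-byte header stands
in for the dev, vlan_tci and MAC header comparisons.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_HLEN 14 /* Ethernet header length, as ETH_HLEN in the kernel */

/* Hypothetical stand-in for an skb held on the GRO list. */
struct pkt_model {
	uint32_t hash;                  /* raw receive hash, 0 if never set */
	unsigned char mac[MODEL_HLEN];  /* link-layer header bytes */
	int same_flow;
	int flush;
	struct pkt_model *next;
};

/*
 * Mark each held packet as same_flow or not relative to `skb`.
 * Returns how many full header comparisons were needed, to show
 * what the hash check saves.
 */
static unsigned int gro_list_prepare_model(struct pkt_model *list,
					   const struct pkt_model *skb)
{
	unsigned int compares = 0;
	struct pkt_model *p;

	assert(skb != NULL);

	for (p = list; p; p = p->next) {
		p->flush = 0;

		/* Different raw hashes: cannot be the same flow, skip the rest. */
		if (p->hash != skb->hash) {
			p->same_flow = 0;
			continue;
		}

		compares++;
		p->same_flow = memcmp(p->mac, skb->mac, MODEL_HLEN) == 0;
	}

	return compares;
}
```

When no hash has been set on either packet, both raw values are 0, so the
check passes and the full comparison still runs. An equal hash is only a
hint; the header comparison remains the deciding test.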

^ permalink raw reply related

* [PATCH net-next v3 1/2] net: Add skb_get_hash_raw
From: Tom Herbert @ 2014-01-15 16:57 UTC (permalink / raw)
  To: davem, netdev

Function to just return skb->rxhash without checking to see if it needs
to be recomputed.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/linux/skbuff.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 48b7605..1f689e6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -771,6 +771,11 @@ static inline __u32 skb_get_hash(struct sk_buff *skb)
 	return skb->rxhash;
 }
 
+static inline __u32 skb_get_hash_raw(const struct sk_buff *skb)
+{
+	return skb->rxhash;
+}
+
 static inline void skb_clear_hash(struct sk_buff *skb)
 {
 	skb->rxhash = 0;
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH net-next] vxge: make local functions static
From: Stephen Hemminger @ 2014-01-15 16:28 UTC (permalink / raw)
  To: Jon Mason, David Miller; +Cc: netdev

Make functions that are only used within a single file static, and remove
the unused function vxge_hw_vpath_vid_get.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

---
 drivers/net/ethernet/neterion/vxge/vxge-config.c  |    2 -
 drivers/net/ethernet/neterion/vxge/vxge-main.c    |    1 
 drivers/net/ethernet/neterion/vxge/vxge-main.h    |    1 
 drivers/net/ethernet/neterion/vxge/vxge-traffic.c |   37 +---------------------
 drivers/net/ethernet/neterion/vxge/vxge-traffic.h |    8 ----
 5 files changed, 4 insertions(+), 45 deletions(-)

--- a/drivers/net/ethernet/neterion/vxge/vxge-config.c	2014-01-15 08:10:24.862145270 -0800
+++ b/drivers/net/ethernet/neterion/vxge/vxge-config.c	2014-01-15 08:17:28.164022998 -0800
@@ -2148,7 +2148,7 @@ __vxge_hw_ring_mempool_item_alloc(struct
  * __vxge_hw_ring_replenish - Initial replenish of RxDs
  * This function replenishes the RxDs from reserve array to work array
  */
-enum vxge_hw_status
+static enum vxge_hw_status
 vxge_hw_ring_replenish(struct __vxge_hw_ring *ring)
 {
 	void *rxd;
--- a/drivers/net/ethernet/neterion/vxge/vxge-main.c	2014-01-15 08:10:24.862145270 -0800
+++ b/drivers/net/ethernet/neterion/vxge/vxge-main.c	2014-01-15 08:17:28.164022998 -0800
@@ -87,6 +87,7 @@ static unsigned int bw_percentage[VXGE_H
 module_param_array(bw_percentage, uint, NULL, 0);
 
 static struct vxge_drv_config *driver_config;
+static enum vxge_hw_status vxge_reset_all_vpaths(struct vxgedev *vdev);
 
 static inline int is_vxge_card_up(struct vxgedev *vdev)
 {
@@ -1971,7 +1972,7 @@ static enum vxge_hw_status vxge_rth_conf
 }
 
 /* reset vpaths */
-enum vxge_hw_status vxge_reset_all_vpaths(struct vxgedev *vdev)
+static enum vxge_hw_status vxge_reset_all_vpaths(struct vxgedev *vdev)
 {
 	enum vxge_hw_status status = VXGE_HW_OK;
 	struct vxge_vpath *vpath;
--- a/drivers/net/ethernet/neterion/vxge/vxge-main.h	2014-01-15 08:10:24.862145270 -0800
+++ b/drivers/net/ethernet/neterion/vxge/vxge-main.h	2014-01-15 08:17:28.164022998 -0800
@@ -427,7 +427,6 @@ void vxge_os_timer(struct timer_list *ti
 }
 
 void vxge_initialize_ethtool_ops(struct net_device *ndev);
-enum vxge_hw_status vxge_reset_all_vpaths(struct vxgedev *vdev);
 int vxge_fw_upgrade(struct vxgedev *vdev, char *fw_name, int override);
 
 /* #define VXGE_DEBUG_INIT: debug for initialization functions
--- a/drivers/net/ethernet/neterion/vxge/vxge-traffic.c	2014-01-15 08:10:24.862145270 -0800
+++ b/drivers/net/ethernet/neterion/vxge/vxge-traffic.c	2014-01-15 08:17:28.168022940 -0800
@@ -1956,8 +1956,7 @@ exit:
  * @vid: vlan id to be added for this vpath into the list
  *
  * Adds the given vlan id into the list for this  vpath.
- * see also: vxge_hw_vpath_vid_delete, vxge_hw_vpath_vid_get and
- * vxge_hw_vpath_vid_get_next
+ * see also: vxge_hw_vpath_vid_delete
  *
  */
 enum vxge_hw_status
@@ -1979,45 +1978,13 @@ exit:
 }
 
 /**
- * vxge_hw_vpath_vid_get - Get the first vid entry for this vpath
- *               from vlan id table.
- * @vp: Vpath handle.
- * @vid: Buffer to return vlan id
- *
- * Returns the first vlan id in the list for this vpath.
- * see also: vxge_hw_vpath_vid_get_next
- *
- */
-enum vxge_hw_status
-vxge_hw_vpath_vid_get(struct __vxge_hw_vpath_handle *vp, u64 *vid)
-{
-	u64 data;
-	enum vxge_hw_status status = VXGE_HW_OK;
-
-	if (vp == NULL) {
-		status = VXGE_HW_ERR_INVALID_HANDLE;
-		goto exit;
-	}
-
-	status = __vxge_hw_vpath_rts_table_get(vp,
-			VXGE_HW_RTS_ACCESS_STEER_CTRL_ACTION_LIST_FIRST_ENTRY,
-			VXGE_HW_RTS_ACCESS_STEER_CTRL_DATA_STRUCT_SEL_VID,
-			0, vid, &data);
-
-	*vid = VXGE_HW_RTS_ACCESS_STEER_DATA0_GET_VLAN_ID(*vid);
-exit:
-	return status;
-}
-
-/**
  * vxge_hw_vpath_vid_delete - Delete the vlan id entry for this vpath
  *               to vlan id table.
  * @vp: Vpath handle.
  * @vid: vlan id to be added for this vpath into the list
  *
  * Adds the given vlan id into the list for this  vpath.
- * see also: vxge_hw_vpath_vid_add, vxge_hw_vpath_vid_get and
- * vxge_hw_vpath_vid_get_next
+ * see also: vxge_hw_vpath_vid_add
  *
  */
 enum vxge_hw_status
--- a/drivers/net/ethernet/neterion/vxge/vxge-traffic.h	2014-01-15 08:10:24.862145270 -0800
+++ b/drivers/net/ethernet/neterion/vxge/vxge-traffic.h	2014-01-15 08:17:28.168022940 -0800
@@ -1918,9 +1918,6 @@ vxge_hw_ring_rxd_post_post(
 	struct __vxge_hw_ring *ring_handle,
 	void *rxdh);
 
-enum vxge_hw_status
-vxge_hw_ring_replenish(struct __vxge_hw_ring *ring_handle);
-
 void
 vxge_hw_ring_rxd_post_post_wmb(
 	struct __vxge_hw_ring *ring_handle,
@@ -2186,11 +2183,6 @@ vxge_hw_vpath_vid_add(
 	u64			vid);
 
 enum vxge_hw_status
-vxge_hw_vpath_vid_get(
-	struct __vxge_hw_vpath_handle *vpath_handle,
-	u64			*vid);
-
-enum vxge_hw_status
 vxge_hw_vpath_vid_delete(
 	struct __vxge_hw_vpath_handle *vpath_handle,
 	u64			vid);

^ permalink raw reply

* Re: throughput problems with realtek
From: Dmitry Kasatkin @ 2014-01-15 16:25 UTC (permalink / raw)
  To: Rick Jones; +Cc: nic_swsd, romieu, netdev, l.moiseichuk
In-Reply-To: <52D6B5F4.1060201@hp.com>

On Wed, Jan 15, 2014 at 6:23 PM, Rick Jones <rick.jones2@hp.com> wrote:
> On 01/15/2014 03:56 AM, Dmitry Kasatkin wrote:
>>
>> Hi,
>>
>> We have several devices with this adapter:
>>
>> Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411
>> PCI Express Gigabit Ethernet Controller (rev 06)
>> See the output of lspci -vvv below...
>>
>> And I suddenly investigated throughput issues..
>>
>> After a couple of minutes of running 'iperf -c server', transmission speed
>> drops substantially...
>>
>> [  4]  0.0-10.0 sec  1.10 GBytes   948 Mbits/sec
>> [  5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port
>> 60508
>> [  5]  0.0-10.0 sec  1.10 GBytes   948 Mbits/sec
>> [  4] local 106.122.1.113 port 5001 connected with 106.122.1.121 port
>> 60509
>> [  4]  0.0-10.0 sec  1.10 GBytes   949 Mbits/sec
>> [  5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port
>> 60510
>> [  5]  0.0-10.0 sec  1.10 GBytes   948 Mbits/sec
>> [  4] local 106.122.1.113 port 5001 connected with 106.122.1.121 port
>> 60511
>> [  4]  0.0-10.0 sec   626 MBytes   525 Mbits/sec
>> [  5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port
>> 60512
>> [  5]  0.0-10.0 sec  84.4 MBytes  70.5 Mbits/sec
>> [  4] local 106.122.1.113 port 5001 connected with 106.122.1.121 port
>> 60513
>> [  4]  0.0-10.0 sec  87.4 MBytes  73.0 Mbits/sec
>> [  5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port
>> 60514
>>
>>
>> But it seems that after a certain time of inactivity (low load) the speed
>> goes back up...
>>
>> It happens almost the same way on desktop machines and also on Samsung
>> Series 7 laptop NP770Z5E...
>>
>> Does anyone have any ideas about it?
>>
>
> The card flipping back and forth between 1000 and 100 Mbit/s operation
> perhaps?
>
> rick jones


I do not see any link speed changes... it stays the same...
The same problem is visible on completely different computers.

-- 
Thanks,
Dmitry

^ permalink raw reply

* [PATCH net-next] bnad: code cleanup
From: Stephen Hemminger @ 2014-01-15 16:24 UTC (permalink / raw)
  To: Rasesh Mody, David Miller; +Cc: netdev

Use 'make namespacecheck' to find code that could be declared static.
After that remove code that is not being used.

Compile tested only.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

---
 drivers/net/ethernet/brocade/bna/bfa_ioc.c |   27 +--------------------------
 drivers/net/ethernet/brocade/bna/bnad.c    |    2 +-
 2 files changed, 2 insertions(+), 27 deletions(-)

--- a/drivers/net/ethernet/brocade/bna/bnad.c	2014-01-14 09:46:15.261710097 -0800
+++ b/drivers/net/ethernet/brocade/bna/bnad.c	2014-01-14 09:47:43.800754593 -0800
@@ -2108,7 +2108,7 @@ bnad_rx_ctrl_init(struct bnad *bnad, u32
 }
 
 /* Called with mutex_lock(&bnad->conf_mutex) held */
-u32
+static u32
 bnad_reinit_rx(struct bnad *bnad)
 {
 	struct net_device *netdev = bnad->netdev;
--- a/drivers/net/ethernet/brocade/bna/bfa_ioc.c	2014-01-14 09:46:15.261710097 -0800
+++ b/drivers/net/ethernet/brocade/bna/bfa_ioc.c	2014-01-14 09:47:43.800754593 -0800
@@ -1147,25 +1147,6 @@ bfa_nw_ioc_sem_release(void __iomem *sem
 	writel(1, sem_reg);
 }
 
-/* Invalidate fwver signature */
-enum bfa_status
-bfa_nw_ioc_fwsig_invalidate(struct bfa_ioc *ioc)
-{
-	u32	pgnum, pgoff;
-	u32	loff = 0;
-	enum bfi_ioc_state ioc_fwstate;
-
-	ioc_fwstate = bfa_ioc_get_cur_ioc_fwstate(ioc);
-	if (!bfa_ioc_state_disabled(ioc_fwstate))
-		return BFA_STATUS_ADAPTER_ENABLED;
-
-	pgnum = bfa_ioc_smem_pgnum(ioc, loff);
-	pgoff = PSS_SMEM_PGOFF(loff);
-	writel(pgnum, ioc->ioc_regs.host_page_num_fn);
-	writel(BFI_IOC_FW_INV_SIGN, ioc->ioc_regs.smem_page_start + loff);
-	return BFA_STATUS_OK;
-}
-
 /* Clear fwver hdr */
 static void
 bfa_ioc_fwver_clear(struct bfa_ioc *ioc)
@@ -1780,15 +1761,9 @@ bfa_flash_raw_read(void __iomem *pci_bar
 	return BFA_STATUS_OK;
 }
 
-u32
-bfa_nw_ioc_flash_img_get_size(struct bfa_ioc *ioc)
-{
-	return BFI_FLASH_IMAGE_SZ/sizeof(u32);
-}
-
 #define BFA_FLASH_PART_FWIMG_ADDR	0x100000 /* fw image address */
 
-enum bfa_status
+static enum bfa_status
 bfa_nw_ioc_flash_img_get_chnk(struct bfa_ioc *ioc, u32 off,
 			      u32 *fwimg)
 {

^ permalink raw reply

* Re: throughput problems with realtek
From: Rick Jones @ 2014-01-15 16:23 UTC (permalink / raw)
  To: Dmitry Kasatkin, nic_swsd, romieu, netdev; +Cc: l.moiseichuk
In-Reply-To: <CACE9dm_3jw08_dfXRJRMQ=r4X1NZ1kHF6TZopFSNy3k+DCKgTA@mail.gmail.com>

On 01/15/2014 03:56 AM, Dmitry Kasatkin wrote:
> Hi,
>
> We have several devices with this adapter:
>
> Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411
> PCI Express Gigabit Ethernet Controller (rev 06)
> See the output of lspci -vvv below...
>
> And I suddenly investigated throughput issues..
>
> After a couple of minutes of running 'iperf -c server', transmission speed
> drops substantially...
>
> [  4]  0.0-10.0 sec  1.10 GBytes   948 Mbits/sec
> [  5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60508
> [  5]  0.0-10.0 sec  1.10 GBytes   948 Mbits/sec
> [  4] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60509
> [  4]  0.0-10.0 sec  1.10 GBytes   949 Mbits/sec
> [  5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60510
> [  5]  0.0-10.0 sec  1.10 GBytes   948 Mbits/sec
> [  4] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60511
> [  4]  0.0-10.0 sec   626 MBytes   525 Mbits/sec
> [  5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60512
> [  5]  0.0-10.0 sec  84.4 MBytes  70.5 Mbits/sec
> [  4] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60513
> [  4]  0.0-10.0 sec  87.4 MBytes  73.0 Mbits/sec
> [  5] local 106.122.1.113 port 5001 connected with 106.122.1.121 port 60514
>
>
> But it seems after certain time of inactivity (low load) speed will be
> up again...
>
> It happens almost the same way on desktop machines and also on Samsung
> Series 7 laptop NP770Z5E...
>
> Does anyone have any ideas about it?
>

The card flipping back and forth between 1000 and 100 Mbit/s operation 
perhaps?

rick jones

^ permalink raw reply

* RE: i40e sym version file
From: Nelson, Shannon @ 2014-01-15 16:22 UTC (permalink / raw)
  To: Stephen Hemminger, Kirsher, Jeffrey T, David Miller
  Cc: netdev@vger.kernel.org, Brown, Aaron F
In-Reply-To: <20140115081645.484c124d@nehalam.linuxnetplumber.net>

Ooo, ick, that shouldn't be there.  Jeff is on sabbatical and Aaron is covering, I'll work with Aaron to find what happened and get it straightened out.

Thanks,
sln

________________________________________
From: Stephen Hemminger [stephen@networkplumber.org]
Sent: Wednesday, January 15, 2014 8:16 AM
To: Nelson, Shannon; Kirsher, Jeffrey T; David Miller
Cc: netdev@vger.kernel.org
Subject: i40e sym version file

The latest pull from Intel added a file to net-next git repo which is not
supposed to be part of build, it is a derived file:
 create mode 100644 drivers/net/ethernet/intel/i40e/Module.symvers


bad commit was
commit 9d8bf54723e9f21c502a410495840d8771f769ef
Author: Shannon Nelson <shannon.nelson@intel.com>
Date:   Tue Jan 14 00:49:50 2014 -0800

    i40e: associate VMDq queue with VM type

    Fix a bug where the queue was not associated with the right set-up
    within the hardware.  The fix is to use the right QTX_CTL VSI type
    when associating it to the VSI.

    Change-ID: I65ef6c5a8205601c640a6593e4b7e78d6ba45545
    Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
    Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
    Tested-by: Sibai Li <sibai.li@intel.com>
    Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* [PATCH-next v2] net/ipv4: don't use module_init in non-modular gre_offload
From: Paul Gortmaker @ 2014-01-15 16:19 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Paul Gortmaker, Eric Dumazet

Recent commit 438e38fadca2f6e57eeecc08326c8a95758594d4
("gre_offload: statically build GRE offloading support") added
new module_init/module_exit calls to the gre_offload.c file.

The file is obj-y and can't be anything other than built-in.
Currently it can never be built modular, so using module_init
as an alias for __initcall can be somewhat misleading.

Fix this up now, so that we can relocate module_init from
init.h into module.h in the future.  If we don't do this, we'd
have to add module.h to obviously non-modular code, and that
would be a worse thing.  We also make the inclusion explicit.

Note that direct use of __initcall is discouraged, vs. one
of the priority categorized subgroups.  As __initcall gets
mapped onto device_initcall, our use of device_initcall
directly in this change means that the runtime impact is
zero -- it will remain at level 6 in initcall ordering.

As for the module_exit, rather than replace it with __exitcall,
we simply remove it, since it appears only UML does anything
with those, and even for UML, there is no relevant cleanup
to be done here.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---

v2: dump gre_offload_exit entirely as suggested by Eric.

 net/ipv4/gre_offload.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index 29512e3e7e7c..f1d32280cb54 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -11,6 +11,7 @@
  */

 #include <linux/skbuff.h>
+#include <linux/init.h>
 #include <net/protocol.h>
 #include <net/gre.h>

@@ -283,11 +284,4 @@ static int __init gre_offload_init(void)
 {
 	return inet_add_offload(&gre_offload, IPPROTO_GRE);
 }
-
-static void __exit gre_offload_exit(void)
-{
-	inet_del_offload(&gre_offload, IPPROTO_GRE);
-}
-
-module_init(gre_offload_init);
-module_exit(gre_offload_exit);
+device_initcall(gre_offload_init);
-- 
1.8.5.2

^ permalink raw reply related

* Re: [net-next v4 3/7] ixgbe: Use static inlines instead of macros
From: Rustad, Mark D @ 2014-01-15 16:16 UTC (permalink / raw)
  To: Joe Perches
  Cc: Brown, Aaron F, David Miller, Netdev, gospo@redhat.com,
	sassmann@redhat.com
In-Reply-To: <1389755839.14001.6.camel@joe-AO722>

On Jan 14, 2014, at 7:17 PM, Joe Perches <joe@perches.com> wrote:

> On Tue, 2014-01-14 at 18:53 -0800, Aaron Brown wrote:
>> From: Mark Rustad <mark.d.rustad@intel.com>
> 
> trivia:
> 
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.h
> []
>> @@ -124,24 +124,40 @@ s32 ixgbe_reset_pipeline_82599(struct ixgbe_hw *hw);
> []
>> -#define IXGBE_WRITE_REG(a, reg, value) writel((value), ((a)->hw_addr + (reg)))
>> +static inline void ixgbe_write_reg(struct ixgbe_hw *hw, u32 reg, u32 value)
>> +{
>> +	writel(value, hw->hw_addr + reg);
>> +}
>> +#define IXGBE_WRITE_REG(a, reg, value) ixgbe_write_reg((a), (reg), (value))
> 
> There's no real value in adding parentheses to these macros.

I suppose that is true in this case. I have it so ingrained to always put parens around the macro parameter references, that I just automatically do it. Still, it makes it safer for any future changes, though the most likely next change here will be deletion anyway. :-)
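
A minimal userspace sketch of the distinction discussed here, assuming nothing about the ixgbe code itself: every name below (DOUBLE_BAD, DOUBLE_OK, scale, SCALE_PLAIN, SCALE_PARENS and the demo_* helpers) is invented for illustration. Parentheses around a macro parameter matter when the parameter is spliced into a larger expression; when the macro only forwards its parameters as function-call arguments, as the new IXGBE_WRITE_REG does, the commas already delimit each argument.

```c
#include <assert.h>

/* Hypothetical macros, not taken from the ixgbe driver. */
#define DOUBLE_BAD(x) (x * 2)    /* parameter spliced into an expression */
#define DOUBLE_OK(x)  ((x) * 2)  /* parentheses keep the argument together */

static inline int scale(int v, int factor)
{
	return v * factor;
}

/* Pass-through wrappers: each parameter ends up between the commas of a call. */
#define SCALE_PLAIN(v, f)  scale(v, f)
#define SCALE_PARENS(v, f) scale((v), (f))

/* Expands to (1 + 2 * 2): precedence changes the result. */
static int demo_bad(void)    { return DOUBLE_BAD(1 + 2); }

/* Expands to ((1 + 2) * 2). */
static int demo_ok(void)     { return DOUBLE_OK(1 + 2); }

/* Both wrappers evaluate scale(3, 7); the extra parentheses change nothing. */
static int demo_plain(void)  { return SCALE_PLAIN(1 + 2, 3 + 4); }
static int demo_parens(void) { return SCALE_PARENS(1 + 2, 3 + 4); }
```

The two pass-through wrappers behave identically, which is the sense in which the added parentheses carry no real value; the habit only pays off if the macro body is later changed to use a parameter inside an expression.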

-- 
Mark Rustad, Networking Division, Intel Corporation

^ permalink raw reply

* i40e sym version file
From: Stephen Hemminger @ 2014-01-15 16:16 UTC (permalink / raw)
  To: Shannon Nelson, Jeff Kirsher, David Miller; +Cc: netdev

The latest pull from Intel added a file to net-next git repo which is not
supposed to be part of build, it is a derived file:
 create mode 100644 drivers/net/ethernet/intel/i40e/Module.symvers


bad commit was
commit 9d8bf54723e9f21c502a410495840d8771f769ef
Author: Shannon Nelson <shannon.nelson@intel.com>
Date:   Tue Jan 14 00:49:50 2014 -0800

    i40e: associate VMDq queue with VM type
    
    Fix a bug where the queue was not associated with the right set-up
    within the hardware.  The fix is to use the right QTX_CTL VSI type
    when associating it to the VSI.
    
    Change-ID: I65ef6c5a8205601c640a6593e4b7e78d6ba45545
    Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
    Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
    Tested-by: Sibai Li <sibai.li@intel.com>
    Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* [PATCH net-next] netfilter: remove double colon
From: Stephen Hemminger @ 2014-01-15 16:12 UTC (permalink / raw)
  To: Pablo Neira Ayuso, David S. Miller; +Cc: netdev, netfilter-devel

This is C not shell script

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

--- a/net/ipv4/netfilter.c	2013-12-31 17:45:31.993942921 -0800
+++ b/net/ipv4/netfilter.c	2014-01-15 08:10:49.793785943 -0800
@@ -61,7 +61,7 @@ int ip_route_me_harder(struct sk_buff *s
 		skb_dst_set(skb, NULL);
 		dst = xfrm_lookup(net, dst, flowi4_to_flowi(&fl4), skb->sk, 0);
 		if (IS_ERR(dst))
-			return PTR_ERR(dst);;
+			return PTR_ERR(dst);
 		skb_dst_set(skb, dst);
 	}
 #endif

^ permalink raw reply

* Re: [PATCH net-next] xen-netback: Rework rx_work_todo
From: Wei Liu @ 2014-01-15 16:10 UTC (permalink / raw)
  To: Zoltan Kiss
  Cc: Wei Liu, ian.campbell, xen-devel, netdev, linux-kernel,
	jonathan.davies
In-Reply-To: <52D6A45C.1060705@citrix.com>

On Wed, Jan 15, 2014 at 03:08:12PM +0000, Zoltan Kiss wrote:
> On 15/01/14 14:59, Wei Liu wrote:
> >On Wed, Jan 15, 2014 at 02:52:41PM +0000, Zoltan Kiss wrote:
> >>On 15/01/14 14:45, Wei Liu wrote:
> >>>>>>The recent patch to fix receive side flow control (11b57f) solved the spinning
> >>>>>>>>>thread problem, but caused another one. The receive side can stall if:
> >>>>>>>>>- xenvif_rx_action sets rx_queue_stopped to false
> >>>>>>>>>- interrupt happens, and sets rx_event to true
> >>>>>>>>>- then xenvif_kthread sets rx_event to false
> >>>>>>>>>
> >>>>>>>
> >>>>>>>If you mean "rx_work_todo" returns false.
> >>>>>>>
> >>>>>>>In this case
> >>>>>>>
> >>>>>>>(!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;
> >>>>>>>
> >>>>>>>can still be true, can't it?
> >>>>>Sorry, I should have written rx_queue_stopped to true
> >>>>>
> >>>In this case, if rx_queue_stopped is true, then we're expecting frontend
> >>>to notify us, right?
> >>>
> >>>rx_queue_stopped is set to true if we cannot make any progress to queue
> >>>packet into the ring. In that situation we can expect frontend will send
> >>>notification to backend after it goes through the backlog in the ring.
> >>>That means rx_event is set to true, and rx_work_todo is true again. So
> >>>the ring is actually not stalled in this case as well. Did I miss
> >>>something?
> >>>
> >>
> >>Yes, we expect the guest to notify us, and it does, and we set
> >>rx_event to true (see second point), but then the thread set it to
> >>false (see third point). Talking with Paul, another solution could
> >>be to set rx_event false before calling xenvif_rx_action. But using
> >>rx_last_skb_slots makes it quicker for the thread to see if it
> >>doesn't have to do anything.
> >>
> >
> >OK, this is a better explanation. So actually there's no bug in the
> >original implementation and your patch is sort of an improvement.
> >
> >Could you send a new version of this patch with relevant information in
> >commit message? Talking to people offline is faster, but I would like to
> >have public discussion and relevant information archived in a searchable
> >form. Thanks.
> 
> No, there is a bug in the original implementation:
> - [THREAD] xenvif_rx_action sets rx_queue_stopped to true
> - [INTERRUPT] interrupt happens, and sets rx_event to true
> - [THREAD] then xenvif_kthread sets rx_event to false
> - [THREAD] rx_work_todo never returns true anymore
> 

I see what you mean. The interrupt is "lost", that's why it's stalled.
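
The sequence can be replayed in a small userspace model. This is only a sketch under stated assumptions: struct vif_model, replay_stall() and replay_clear_first() are hypothetical names, the three flags stand in for the real xenvif state, and the interleaving is fixed by hand rather than produced by a real thread and interrupt. The rx_work_todo() condition is the one quoted earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the xenvif state discussed in the thread. */
struct vif_model {
	bool rx_queue_empty;    /* models skb_queue_empty(&vif->rx_queue) */
	bool rx_queue_stopped;
	bool rx_event;
};

/* Same condition as quoted above. */
static bool rx_work_todo(const struct vif_model *vif)
{
	return (!vif->rx_queue_empty && !vif->rx_queue_stopped) ||
	       vif->rx_event;
}

/* Replays the four steps in the reported order. */
static bool replay_stall(void)
{
	struct vif_model vif = { .rx_queue_empty = false };

	vif.rx_queue_stopped = true;  /* [THREAD] rx action cannot make progress */
	vif.rx_event = true;          /* [INTERRUPT] frontend notification */
	vif.rx_event = false;         /* [THREAD] kthread clears the flag */

	return rx_work_todo(&vif);    /* false: the notification is lost */
}

/* The alternative mentioned in the thread: clear rx_event before the rx action. */
static bool replay_clear_first(void)
{
	struct vif_model vif = { .rx_queue_empty = false };

	vif.rx_event = false;         /* [THREAD] clear first */
	vif.rx_queue_stopped = true;  /* [THREAD] rx action cannot make progress */
	vif.rx_event = true;          /* [INTERRUPT] frontend notification */

	return rx_work_todo(&vif);    /* true: the notification survives */
}
```

In the first replay nothing is left to make rx_work_todo() true again, which matches the stall described above; in the second the flag set by the interrupt is still visible when the thread next checks.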

> I will update the explanation and send in the patch again.
> 

Thanks.

Wei.

> Zoli

^ permalink raw reply

* Re: [PATCH v2 net] bpf: do not use reciprocal divide
From: Eric Dumazet @ 2014-01-15 16:09 UTC (permalink / raw)
  To: Matt Evans
  Cc: Heiko Carstens, Martin Schwidefsky, Hannes Frederic Sowa, netdev,
	dborkman, darkjames-ws, Mircea Gherzan, Russell King
In-Reply-To: <927a6073e8f73b53027480e7609b4c53@ozlabs.org>

On Wed, 2014-01-15 at 15:10 +0000, Matt Evans wrote:
> Hi Eric,
...
> PPC looks fine; I had a look at the core/ARM parts which also look good.
> 
> I'd forgotten where the DIV0 checking occurred, so I also benefited from 
> your hint to Heiko. :)
> 

Thanks for reviewing !

^ permalink raw reply

* Re: [PATCH net] bpf: do not use reciprocal divide
From: Eric Dumazet @ 2014-01-15 16:07 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Heiko Carstens, Hannes Frederic Sowa, netdev, dborkman,
	darkjames-ws, Mircea Gherzan, Russell King, Matt Evans
In-Reply-To: <20140115162623.39472f8f@mschwide>

On Wed, 2014-01-15 at 16:26 +0100, Martin Schwidefsky wrote:

> The C compiler naturally can do a u32/u32 division, it either uses
> the "dlr" instruction which is unsigned, or uses a call to a function
> to do the u32/u32 math. See the lovely code in arch/s390/lib/div64.c
> for the kernel version of that code.

If the C code uses unsigned, then all arch jit code must use the same.

Behavior of the filter should not depend on jit being enabled or not ;)
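
A short sketch of why the signedness has to match, assuming a two's complement target: div_unsigned() mirrors the plain u32 division of the interpreter, while div_signed32() is a hypothetical model of a JIT that treated both operands as signed. Neither function is kernel code, and the caller is assumed to have rejected K == 0 already.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32-bit division, as the C interpreter performs it. */
static uint32_t div_unsigned(uint32_t a, uint32_t k)
{
	return a / k;
}

/*
 * Hypothetical JIT behaviour: both operands reinterpreted as signed.
 * The conversions assume two's complement; INT32_MIN / -1 is not handled.
 */
static uint32_t div_signed32(uint32_t a, uint32_t k)
{
	int32_t sa = (int32_t)a;
	int32_t sk = (int32_t)k;

	return (uint32_t)(sa / sk);
}
```

For A = 0x80000000 and K = 2 the unsigned division yields 0x40000000, whereas the signed model yields 0xC0000000, so the same filter would accept or reject different packets depending on which path ran. For values with the top bit clear the two agree.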

^ permalink raw reply

* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge)
From: Daniel Borkmann @ 2014-01-15 16:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, Michel Lespinasse, linux-mm,
	Jared Hulbert, netdev
In-Reply-To: <52D69AB4.6000309@suse.cz>

[keeping netdev in loop as well]

On 01/15/2014 03:27 PM, Vlastimil Babka wrote:
> On 01/13/2014 12:39 PM, Daniel Borkmann wrote:
>> On 01/13/2014 11:16 AM, Vlastimil Babka wrote:
>>> On 01/11/2014 02:32 PM, Daniel Borkmann wrote:
>>>> On 01/11/2014 07:22 AM, Andrew Morton wrote:
>>>>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote:
>>>>>
>>>>>> This is being reliably triggered for each mmaped() packet(7)
>>>>>> socket from user space, basically during unmapping resp.
>>>>>> closing the TX socket.
>>>>>>
>>>>>> I believe due to some change in transparent hugepages code ?
>>>>>>
>>>>>> When I disable transparent hugepages, everything works fine,
>>>>>> no BUG triggered.
>>>>>>
>>>>>> I'd be happy to test patches.
>>>>>
>>>>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug
>>>>> where THP tail page is encountered") in current mainline fix this?
>>>>
>>>> Thanks for your answer Andrew!
>>>>
>>>> Hm, I just cherry-picked that onto current net-next as I have some work
>>>> there, and this time I got ...
>>>>
>>>> (User space uses packet mmap() and mlockall(MCL_CURRENT | MCL_FUTURE)
>>>>       and on shutdown munlockall() ...)
>>>>
>>>> [   63.863672] ------------[ cut here ]------------
>>>> [   63.863702] kernel BUG at mm/mlock.c:507!
>>>> [   63.863721] invalid opcode: 0000 [#1] SMP
>>>> [   63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge ebtable_filter ebtables stp llc ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core soundcore joydev wmi videobuf2_vmalloc videobuf2_memops videobuf2_core i2c_i801 pcspkr videodev media uinput i915
>>>> [   63.864152]  i2c_algo_bit drm_kms_helper drm i2c_core video
>>>> [   63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted 3.13.0-rc6+ #15
>>>> [   63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
>>>> [   63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 task.ti: ffff8800b5954000
>>>> [   63.864274] RIP: 0010:[<ffffffff8116fa9a>]  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>>>> [   63.864318] RSP: 0018:ffff8800b5955e08  EFLAGS: 00010202
>>>> [   63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: 0000000000000034
>>>> [   63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: ffffea0002c3e700
>>>> [   63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: a8000b0f9c000000
>>>> [   63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: ffffea0002c3e700
>>>> [   63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: 00007f0708992000
>>>> [   63.864492] FS:  00007f0708b92740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000
>>>> [   63.864526] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [   63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: 00000000001407e0
>>>> [   63.864581] Stack:
>>>> [   63.864593]  ffff8800b5955ed0 00007f0708b91fff 00007f0708b92000 ffff8800b5955e48
>>>> [   63.864632]  000001ff810c864b ffff8801ee060000 0000000000000000 0000000000000000
>>>> [   63.864669]  ffff8800b5955e58 ffff8801ee060000 0000000700000086 ffff8801ee060000
>>>> [   63.864708] Call Trace:
>>>> [   63.864724]  [<ffffffff816956bc>] ? _raw_spin_unlock_irq+0x2c/0x30
>>>> [   63.864754]  [<ffffffff81171b52>] ? vma_merge+0xc2/0x330
>>>> [   63.864786]  [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190
>>>> [   63.864812]  [<ffffffff8116fde7>] do_mlockall+0x87/0xc0
>>>> [   63.864836]  [<ffffffff811702df>] sys_munlockall+0x2f/0x50
>>>> [   63.864873]  [<ffffffff8169e192>] system_call_fastpath+0x16/0x1b
>>>> [   63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55
>>>> [   63.865114] RIP  [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0
>>>> [   63.865148]  RSP <ffff8800b5955e08>
>>>> [   63.874968] ------------[ cut here ]------------
>>>>
>>>> ... when I find some time, I'll try with normal torvalds' tree, maybe some
>>>> other patches are missing as well, not sure right now.
>>>
>>> Uh so the triggered assertion is the one added by this very patch, and there are no more changes wrt this in mainline.
>>>
>>> If you can still try debug patches, please try this. Thanks.
>>
>> Yes, thanks, I'll come back to you some time by today.
>
> Daniel sent me (off-list) instructions to reproduce:
>
>> Then in the kernel source tree, you'll find:
>>
>>     tools/testing/selftests/net/
>>
>> There, just do a 'make' and run ./psock_tpacket
>
> It reproduces deterministically in mainline since 3.12, i.e. my munlock
> performance series. Based on the initial debug output, I've expanded the
> debug patch below a bit:
>
>>> From: Vlastimil Babka <vbabka@suse.cz>
>>> Date: Mon, 13 Jan 2014 11:13:53 +0100
>>> Subject: [PATCH] debug munlock_vma_pages_range
>>>
>>> ---
>>>     mm/mlock.c | 22 ++++++++++++++++++++--
>>>     1 file changed, 20 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>> index c59c420..7d0e29a 100644
>>> --- a/mm/mlock.c
>>> +++ b/mm/mlock.c
>>> @@ -448,12 +448,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
>>>     void munlock_vma_pages_range(struct vm_area_struct *vma,
>>>     			     unsigned long start, unsigned long end)
>>>     {
>>> +	unsigned long orig_start = start;
>>> +	unsigned long page_increm = 0;
>>> +
>>>     	vma->vm_flags &= ~VM_LOCKED;
>>>
>>>     	while (start < end) {
>>>     		struct page *page = NULL;
>>>     		unsigned int page_mask;
>>> -		unsigned long page_increm;
>>>     		struct pagevec pvec;
>>>     		struct zone *zone;
>>>     		int zoneid;
>>> @@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
>>>     			}
>>>     		}
>>>     		/* It's a bug to munlock in the middle of a THP page */
>>> -		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
>>> +		if ((start >> PAGE_SHIFT) & page_mask) {
>>> +			dump_page(page);
>>> +			printk("start=%lu pfn=%lu orig_start=%lu "
>>> +			       "prev_page_increm=%lu page_mask=%u "
>>> +			       "vm_start=%lu vm_end=%lu vm_flags=%lu\n",
>>> +				start, page_to_pfn(page), orig_start,
>>> +				page_increm, page_mask,
>>> +				vma->vm_start, vma->vm_end,
>>> +				vma->vm_flags);
> +                        printk("vm_ops=%pF, open=%pF, fault=%pF, remap_pages=%pF\n", vma->vm_ops,
> +                                vma->vm_ops->open, vma->vm_ops->fault, vma->vm_ops->remap_pages);
> +                        if (PageCompound(page)) {
> +                                printk("page is compound with order=%d\n", compound_order(page));
> +                        }
>>> +			if (PageTail(page)) {
>>> +				struct page *first_page = page->first_page;
>>> +				printk("first_page pfn=%lu\n",
>>> +						page_to_pfn(first_page));
>>> +				dump_page(first_page);
>>> +			}
>>> +			VM_BUG_ON(true);
>>> +		}
>>>     		page_increm = 1 + page_mask;
>>>     		start += page_increm * PAGE_SIZE;
>>>     next:
>>>
>
> And got output like this:
>
> page:ffffea0002474a40 count:5 mapcount:1 mapping:          (null) index:0x0
> page flags: 0x100000000004004(referenced|head)
> start=140242647736320 pfn=682616 orig_start=140242647736320 prev_page_increm=0 page_mask=511 vm_start=140242647736320 vm_end=140242651930624 vm_flags=268435707
> vm_ops=packet_mmap_ops+0x0/0xfffffffffffff8e0 [af_packet], open=packet_mm_open+0x0/0x30 [af_packet], fault=          (null), remap_pages=          (null)
> page is compound with order=2
>
> Observations:
> - address 140242647736320 is where the vma starts, and is not aligned to 512 pages
>    (so it cannot be a THP head which the munlock expects). Yet there is a head page
>    that triggers the PageTransHuge() and consequently hpage_nr_pages() in munlock_vma_page().
>    That's why page_mask is determined to be 511 and the code thinks it's in the
>    middle of a THP page.
> - in fact, the page is a compound page with order=2
> - the VM flags (except (may)read/write) are VM_SHARED and VM_MIXEDMAP
> - the vma was mmapped by packet_mmap() (net/packet/af_packet.c) which uses
>    vm_insert_page(), which adds the VM_MIXEDMAP flag
> - the buffers that are mapped were allocated by alloc_one_pg_vec_page()
>    where flags indeed include __GFP_COMP
>
> So clearly there is a way to have mlock/munlock operate on a vma that contains
> compound pages and confuse the checks for PageTransHuge().
>
> The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate munlock()
> treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. It however
> makes munlock_vma_pages_range() skip pages until the next 512-pages-aligned page,
> when it encounters a head page. If the head page is of smaller order and is followed
> by normal LRU pages (theoretically, I'm not sure if that's possible, or done anywhere),
> they wouldn't get munlocked.
>
> My commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
> munlock+putback using pagevec") (since 3.12) has added a new PageTransHuge() check
> that can trigger on tail pages of the compound page here. Commit c424be1cbbf852e46acc8
> ("mm: munlock: fix a bug where THP tail page is encountered") in current rc's removes
> one class of bugs here, but still non-THP compound pages are not expected in mlock/munlock,
> which leads to this assertion failing.
>
> The question is what is the correct fix, and I'm not that familiar with VM_MIXEDMAP
> to decide.
>
> Option 1: mlocking VM_MIXEDMAP vma's makes no sense. They should be treated like VM_PFNMAP
>            and added to VM_SPECIAL, which makes m(un)lock skip them completely.
>
> Option 2: if indeed VM_MIXEDMAP can contain PageLRU pages for which mlocking is useful,
>            VM_NO_THP should be checked in munlock before attempting PageTransHuge() and
>            friends. VM_NO_THP already contains VM_MIXEDMAP, so knowing that there can be
>            no THP means we don't try to optimize for it and no unexpected head pages trip us.
>
> Thoughts?
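
The alignment test that fires can be checked by hand with the values from the debug output. The sketch below is a userspace illustration only: thp_misalignment() is a hypothetical helper that mirrors the expression (start >> PAGE_SHIFT) & page_mask from the debug patch, and PAGE_SHIFT is assumed to be 12 (4 KiB pages, as on x86-64).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/*
 * Mirrors the check in the debug patch: a non-zero result means that
 * 'start' is not aligned to (page_mask + 1) pages.
 */
static uint64_t thp_misalignment(uint64_t start, unsigned int page_mask)
{
	return (start >> PAGE_SHIFT) & page_mask;
}
```

With start=140242647736320 and page_mask=511 the result is 310, i.e. the vma begins 310 pages past a 512-page boundary, so the head page found there cannot be the head of a THP even though page_mask was derived as if it were.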

^ permalink raw reply

* Re: [Xen-devel] [PATCH net-next] xen-netfront: add support for IPv6 offloads
From: Jan Beulich @ 2014-01-15 16:03 UTC (permalink / raw)
  To: Andrew Cooper, Paul Durrant
  Cc: David Vrabel, xen-devel@lists.xen.org, Boris Ostrovsky,
	netdev@vger.kernel.org
In-Reply-To: <9AAE0902D5BC7E449B7C8E4E778ABCD0208071@AMSPEX01CL01.citrite.net>

>>> On 15.01.14 at 16:54, Paul Durrant <Paul.Durrant@citrix.com> wrote:
>> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
>> On 15/01/14 15:18, Paul Durrant wrote:
>> > +	err = xenbus_printf(xbt, dev->nodename, "feature-gso-tcpv6", "%d", 1);
>> 
>> "%d", 1 results in a constant string.  xenbus_write() would avoid a
>> transitory memory allocation.
> 
> This code is consistent with all the other xenbus_printf()s in the 
> neighbourhood and does it really matter?

I think we should always strive to have the simplest possible code
that fulfills the purpose. And hence we shouldn't be setting further
bad precedents. (In fact I have a patch queued to replace all the
unnecessary xenbus_printf()s with xenbus_write()s on
linux-2.6.18-xen.hg, and may look into porting this to the
respective upstream components.)

Jan

^ permalink raw reply

* Re: [PATCH iproute2 2/2] netem: add 64bit rates support
From: Eric Dumazet @ 2014-01-15 15:56 UTC (permalink / raw)
  To: Yang Yingliang; +Cc: stephen, netdev
In-Reply-To: <1389778932-17404-3-git-send-email-yangyingliang@huawei.com>

On Wed, 2014-01-15 at 17:42 +0800, Yang Yingliang wrote:
> netem supports 64bit rates starting from linux-3.13.
> Add 64bit rates support in tc tools.
> 
> tc qdisc show dev eth0
> qdisc netem 1: dev eth4 root refcnt 2 limit 1000 rate 35Gbit
> 
> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
> ---
>  tc/q_netem.c | 29 ++++++++++++++++++++++++-----
>  1 file changed, 24 insertions(+), 5 deletions(-)
> 
> diff --git a/tc/q_netem.c b/tc/q_netem.c
> index 9dd8712..1312eb5 100644
> --- a/tc/q_netem.c
> +++ b/tc/q_netem.c
> @@ -183,6 +183,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
>  	__s16 *dist_data = NULL;
>  	__u16 loss_type = NETEM_LOSS_UNSPEC;
>  	int present[__TCA_NETEM_MAX];
> +	__u64 rate64 = 0;
>  
>  	memset(&cor, 0, sizeof(cor));
>  	memset(&reorder, 0, sizeof(reorder));
> @@ -391,7 +392,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
>  		} else if (matches(*argv, "rate") == 0) {
>  			++present[TCA_NETEM_RATE];
>  			NEXT_ARG();
> -			if (get_rate(&rate.rate, *argv)) {
> +			if (get_rate64(&rate64, *argv)) {
>  				explain1("rate");
>  				return -1;
>  			}
> @@ -496,9 +497,18 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
>  		addattr_nest_end(n, start);
>  	}
>  
> -	if (present[TCA_NETEM_RATE] &&
> -	    addattr_l(n, 1024, TCA_NETEM_RATE, &rate, sizeof(rate)) < 0)
> -		return -1;
> +	if (present[TCA_NETEM_RATE]) {
> +		if (rate64 >= (1ULL << 32)) {
> +			if (addattr_l(n, 1024,
> +				      TCA_NETEM_RATE64, &rate64, sizeof(rate64)) < 0)
> +				return -1;
> +			rate.rate = ~0U;
> +		} else {
> +			rate.rate = rate64;
> +		}
> +		if (addattr_l(n, 1024, TCA_NETEM_RATE, &rate, sizeof(rate)) < 0)
> +			return -1;
> +	}
>  
>  	if (dist_data) {
>  		if (addattr_l(n, MAX_DIST * sizeof(dist_data[0]),
> @@ -522,6 +532,7 @@ static int netem_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
>  	struct tc_netem_qopt qopt;
>  	const struct tc_netem_rate *rate = NULL;
>  	int len = RTA_PAYLOAD(opt) - sizeof(qopt);
> +	__u64 *rate64 = NULL;

__u64 rate64 = 0;

>  	SPRINT_BUF(b1);
>  
>  	if (opt == NULL)
> @@ -572,6 +583,11 @@ static int netem_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
>  				return -1;
>  			ecn = RTA_DATA(tb[TCA_NETEM_ECN]);
>  		}
> +		if (tb[TCA_NETEM_RATE64]) {
> +			if (RTA_PAYLOAD(tb[TCA_NETEM_RATE64]) < sizeof(*rate64))
> +				return -1;
> +			rate64 = RTA_DATA(tb[TCA_NETEM_RATE64]);

rate64 = rta_getattr_u64(tb[TCA_NETEM_RATE64]);

> +		}
>  	}
>  
>  	fprintf(f, "limit %d", qopt.limit);
> @@ -632,7 +648,10 @@ static int netem_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
>  	}
>  
>  	if (rate && rate->rate) {
> -		fprintf(f, " rate %s", sprint_rate(rate->rate, b1));
> +		if (rate64)
> +			fprintf(f, " rate %s", sprint_rate(rate64, b1));
> +		else
> +			fprintf(f, " rate %s", sprint_rate(rate->rate, b1));
>  		if (rate->packet_overhead)
>  			fprintf(f, " packetoverhead %d", rate->packet_overhead);
>  		if (rate->cell_size)
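
The encoding rule in the patch can be summarised in a standalone sketch. Everything here is hypothetical scaffolding: struct rate_attrs, split_rate() and effective_rate() are not iproute2 functions, they only model which value would go into the legacy 32-bit field and which into the TCA_NETEM_RATE64 attribute, and how a reader would pick between them, assuming a by-value u64 as suggested in the review.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two values tc would send. */
struct rate_attrs {
	uint32_t rate32;      /* legacy 32-bit rate field */
	uint64_t rate64;      /* 64-bit attribute payload, valid if has_rate64 */
	int      has_rate64;
};

/* Rates that do not fit in 32 bits go into the 64-bit attribute. */
static struct rate_attrs split_rate(uint64_t rate)
{
	struct rate_attrs out = { 0 };

	if (rate >= (1ULL << 32)) {
		out.rate64 = rate;
		out.has_rate64 = 1;
		out.rate32 = ~0U;  /* saturate the legacy field */
	} else {
		out.rate32 = (uint32_t)rate;
	}
	return out;
}

/* Reader side: prefer the 64-bit value when it was sent. */
static uint64_t effective_rate(const struct rate_attrs *a)
{
	return a->has_rate64 ? a->rate64 : a->rate32;
}
```

Assuming the rate is carried in bytes per second, the 35Gbit example corresponds to 4375000000, which exceeds 2^32 and therefore takes the 64-bit path while the legacy field saturates at 0xFFFFFFFF.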

^ permalink raw reply

* RE: [Xen-devel] [PATCH net-next] xen-netfront: add support for IPv6 offloads
From: Paul Durrant @ 2014-01-15 15:54 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: netdev@vger.kernel.org, xen-devel@lists.xen.org, Boris Ostrovsky,
	David Vrabel
In-Reply-To: <52D6A868.6040707@citrix.com>

> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: 15 January 2014 15:25
> To: Paul Durrant
> Cc: netdev@vger.kernel.org; xen-devel@lists.xen.org; Boris Ostrovsky; David
> Vrabel
> Subject: Re: [Xen-devel] [PATCH net-next] xen-netfront: add support for
> IPv6 offloads
> 
> On 15/01/14 15:18, Paul Durrant wrote:
> > This patch adds support for IPv6 checksum offload and GSO when those
> > features are available in the backend.
> >
> > Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> > Cc: David Vrabel <david.vrabel@citrix.com>
> > ---
> >  drivers/net/xen-netfront.c |   48
> +++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 43 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> > index c41537b..321759f 100644
> > --- a/drivers/net/xen-netfront.c
> > +++ b/drivers/net/xen-netfront.c
> > @@ -617,7 +617,9 @@ static int xennet_start_xmit(struct sk_buff *skb,
> struct net_device *dev)
> >  		tx->flags |= XEN_NETTXF_extra_info;
> >
> >  		gso->u.gso.size = skb_shinfo(skb)->gso_size;
> > -		gso->u.gso.type = XEN_NETIF_GSO_TYPE_TCPV4;
> > +		gso->u.gso.type = (skb_shinfo(skb)->gso_type &
> SKB_GSO_TCPV6) ?
> > +			XEN_NETIF_GSO_TYPE_TCPV6 :
> > +			XEN_NETIF_GSO_TYPE_TCPV4;
> >  		gso->u.gso.pad = 0;
> >  		gso->u.gso.features = 0;
> >
> > @@ -809,15 +811,18 @@ static int xennet_set_skb_gso(struct sk_buff
> *skb,
> >  		return -EINVAL;
> >  	}
> >
> > -	/* Currently only TCPv4 S.O. is supported. */
> > -	if (gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV4) {
> > +	if (gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV4 &&
> > +	    gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV6) {
> >  		if (net_ratelimit())
> >  			pr_warn("Bad GSO type %d\n", gso->u.gso.type);
> >  		return -EINVAL;
> >  	}
> >
> >  	skb_shinfo(skb)->gso_size = gso->u.gso.size;
> > -	skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
> > +	skb_shinfo(skb)->gso_type =
> > +		(gso->u.gso.type == XEN_NETIF_GSO_TYPE_TCPV4) ?
> > +		SKB_GSO_TCPV4 :
> > +		SKB_GSO_TCPV6;
> >
> >  	/* Header must be checked, and gso_segs computed. */
> >  	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> > @@ -1191,6 +1196,15 @@ static netdev_features_t
> xennet_fix_features(struct net_device *dev,
> >  			features &= ~NETIF_F_SG;
> >  	}
> >
> > +	if (features & NETIF_F_IPV6_CSUM) {
> > +		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
> > +				 "feature-ipv6-csum-offload", "%d", &val) <
> 0)
> > +			val = 0;
> > +
> > +		if (!val)
> > +			features &= ~NETIF_F_IPV6_CSUM;
> > +	}
> > +
> >  	if (features & NETIF_F_TSO) {
> >  		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
> >  				 "feature-gso-tcpv4", "%d", &val) < 0)
> > @@ -1200,6 +1214,15 @@ static netdev_features_t
> xennet_fix_features(struct net_device *dev,
> >  			features &= ~NETIF_F_TSO;
> >  	}
> >
> > +	if (features & NETIF_F_TSO6) {
> > +		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
> > +				 "feature-gso-tcpv6", "%d", &val) < 0)
> > +			val = 0;
> > +
> > +		if (!val)
> > +			features &= ~NETIF_F_TSO6;
> > +	}
> > +
> >  	return features;
> >  }
> >
> > @@ -1338,7 +1361,9 @@ static struct net_device
> *xennet_create_dev(struct xenbus_device *dev)
> >  	netif_napi_add(netdev, &np->napi, xennet_poll, 64);
> >  	netdev->features        = NETIF_F_IP_CSUM | NETIF_F_RXCSUM |
> >  				  NETIF_F_GSO_ROBUST;
> > -	netdev->hw_features	= NETIF_F_IP_CSUM | NETIF_F_SG |
> NETIF_F_TSO;
> > +	netdev->hw_features	= NETIF_F_SG |
> > +				  NETIF_F_IPV6_CSUM |
> > +				  NETIF_F_TSO | NETIF_F_TSO6;
> >
> >  	/*
> >           * Assume that all hw features are available for now. This set
> > @@ -1716,6 +1741,19 @@ again:
> >  		goto abort_transaction;
> >  	}
> >
> > +	err = xenbus_printf(xbt, dev->nodename, "feature-gso-tcpv6",
> "%d", 1);
> 
> "%d", 1 results in a constant string.  xenbus_write() would avoid a
> transitory memory allocation.
> 
> ~Andrew
> 

This code is consistent with all the other xenbus_printf()s in the neighbourhood and does it really matter?

  Paul

> > +	if (err) {
> > +		message = "writing feature-gso-tcpv6";
> > +		goto abort_transaction;
> > +	}
> > +
> > +	err = xenbus_printf(xbt, dev->nodename, "feature-ipv6-csum-
> offload",
> > +			    "%d", 1);
> > +	if (err) {
> > +		message = "writing feature-ipv6-csum-offload";
> > +		goto abort_transaction;
> > +	}
> > +
> >  	err = xenbus_transaction_end(xbt, 0);
> >  	if (err) {
> >  		if (err == -EAGAIN)

^ permalink raw reply

* Re: [PATCH net] bpf: do not use reciprocal divide
From: Martin Schwidefsky @ 2014-01-15 15:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Heiko Carstens, Hannes Frederic Sowa, netdev, dborkman,
	darkjames-ws, Mircea Gherzan, Russell King, Matt Evans
In-Reply-To: <1389795926.31367.334.camel@edumazet-glaptop2.roam.corp.google.com>

On Wed, 15 Jan 2014 06:25:26 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2014-01-15 at 06:21 -0800, Eric Dumazet wrote:
> > On Wed, 2014-01-15 at 11:51 +0100, Heiko Carstens wrote:
> > > On Wed, Jan 15, 2014 at 09:13:22AM +0100, Martin Schwidefsky wrote:
> > > > On Wed, 15 Jan 2014 09:00:07 +0100
> > > > Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > > > 
> > > > > On Tue, Jan 14, 2014 at 11:02:41PM -0800, Eric Dumazet wrote:
> > > > > > diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
> > > > > > index 16871da37371..e349dc7d0992 100644
> > > > > > --- a/arch/s390/net/bpf_jit_comp.c
> > > > > > +++ b/arch/s390/net/bpf_jit_comp.c
> > > > > > @@ -371,11 +371,11 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter,
> > > > > >  		/* dr %r4,%r12 */
> > > > > >  		EMIT2(0x1d4c);
> > > > > >  		break;
> > > > > > -	case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K) */
> > > > > > -		/* m %r4,<d(K)>(%r13) */
> > > > > > -		EMIT4_DISP(0x5c40d000, EMIT_CONST(K));
> > > > > > -		/* lr %r5,%r4 */
> > > > > > -		EMIT2(0x1854);
> > > > > > +	case BPF_S_ALU_DIV_K: /* A /= K */
> > > > > > +		/* lhi %r4,0 */
> > > > > > +		EMIT4(0xa7480000);
> > > > > > +		/* d %r4,<d(K)>(%r13) */
> > > > > > +		EMIT4_DISP(0x5d40d000, EMIT_CONST(K));
> > > > > >  		break;
> > > > > 
> > > > > The s390 part looks good.
> > > > 
> > > > Does it? The divide instruction is signed, for the special
> > > > case of K==1 this can now cause an exception if the quotient
> > > > gets too large. We should add a check for K==1 and do nothing
> > > > in this case. With a divisor of at least 2 the result will
> > > > stay in the limit.
> > > 
> > > Indeed. That's quite subtle.
> > 
> > net/core/filter.c does :
> > 
> > A /= K;
> > 
> > Why is this working in generic code (if K == 1), not in s390 one ?
> 
> Note that I copied code found in BPF_S_ALU_MOD_K, so this one would need
> a fix as well.

Hmm, that is true: BPF_S_ALU_DIV_X, BPF_S_ALU_MOD_K and BPF_S_ALU_MOD_X all
suffer from the same problem. As the BPF JIT is only used on 64-bit
kernels, the simplest solution is to replace the "dr" instruction with "dlr".
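
To illustrate the K==1 case, here is a minimal userspace model in C. It
does not emit or execute any s390 instructions; the helper names
model_signed_div() and model_unsigned_div() are made up for this sketch.
It assumes the high word of the dividend is zero (as after the
"lhi %r4,0" in the patch) and that a quotient which does not fit the
32-bit result of the respective divide raises an exception, modelled
here as returning false.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of a signed divide: the 32-bit value A (high word zero) is
 * divided by K, and the quotient must fit in a signed 32-bit integer.
 */
static bool model_signed_div(uint32_t a, int32_t k, int32_t *quot)
{
	int64_t q;

	if (k == 0)
		return false;		/* division by zero */

	q = (int64_t)a / k;
	if (q > INT32_MAX || q < INT32_MIN)
		return false;		/* quotient out of signed range */

	*quot = (int32_t)q;
	return true;
}

/*
 * Model of an unsigned ("logical") divide: the quotient only has to fit
 * in an unsigned 32-bit integer.
 */
static bool model_unsigned_div(uint32_t a, uint32_t k, uint32_t *quot)
{
	uint64_t q;

	if (k == 0)
		return false;		/* division by zero */

	q = (uint64_t)a / k;
	if (q > UINT32_MAX)
		return false;		/* cannot happen with a zero high word */

	*quot = (uint32_t)q;
	return true;
}
```

In this model, A = 0x80000000 with K = 1 fails in the signed variant but
succeeds in the unsigned one, while any K >= 2 keeps the signed quotient
below 2^31. The unsigned variant also corresponds to what the plain
"A /= K" on a 32-bit unsigned value does in net/core/filter.c.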

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
