Netdev List


* [PATCH] rtlwifi/rtl8192de: remove redundant else if check
From: Colin King @ 2015-01-13 14:07 UTC (permalink / raw)
  To: Larry Finger, Chaoming Li, Kalle Valo, John W. Linville,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA

From: Colin Ian King <colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

The else if check condition checks for the opposite of the
if check, hence the else if check is redundant and can be
replaced with a simple else:

if (rtlpriv->rtlhal.macphymode == SINGLEMAC_SINGLEPHY) {
	..
} else if (rtlpriv->rtlhal.macphymode != SINGLEMAC_SINGLEPHY) {
	..
}

replaced with:

if (rtlpriv->rtlhal.macphymode == SINGLEMAC_SINGLEPHY) {
	..
} else {
	..
}

Signed-off-by: Colin Ian King <colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 drivers/net/wireless/rtlwifi/rtl8192de/hw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/rtlwifi/rtl8192de/hw.c b/drivers/net/wireless/rtlwifi/rtl8192de/hw.c
index 280c3da..01bcc2d 100644
--- a/drivers/net/wireless/rtlwifi/rtl8192de/hw.c
+++ b/drivers/net/wireless/rtlwifi/rtl8192de/hw.c
@@ -546,7 +546,7 @@ static bool _rtl92de_llt_table_init(struct ieee80211_hw *hw)
 		txpktbuf_bndy = 246;
 		value8 = 0;
 		value32 = 0x80bf0d29;
-	} else if (rtlpriv->rtlhal.macphymode != SINGLEMAC_SINGLEPHY) {
+	} else {
 		maxPage = 127;
 		txpktbuf_bndy = 123;
 		value8 = 0;
-- 
2.1.4


^ permalink raw reply related

* [PATCH] bridge: only provide proxy ARP when CONFIG_INET is enabled
From: Arnd Bergmann @ 2015-01-13 14:10 UTC (permalink / raw)
  To: netdev; +Cc: davem, Kyeyoon Park, bridge, Stephen Hemminger

When IPV4 support is disabled, we cannot call arp_send from
the bridge code, which would result in a kernel link error:

net/built-in.o: In function `br_handle_frame_finish':
:(.text+0x59914): undefined reference to `arp_send'
:(.text+0x59a50): undefined reference to `arp_tbl'

This makes the newly added proxy ARP support in the bridge
code depend on the CONFIG_INET symbol and lets the compiler
optimize the code out to avoid the link error.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 958501163ddd ("bridge: Add support for IEEE 802.11 Proxy ARP")
Cc: Kyeyoon Park <kyeyoonp@codeaurora.org>

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 1f1de715197c..e2aa7be3a847 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -154,7 +154,8 @@ int br_handle_frame_finish(struct sk_buff *skb)
 	dst = NULL;
 
 	if (is_broadcast_ether_addr(dest)) {
-		if (p->flags & BR_PROXYARP &&
+		if (IS_ENABLED(CONFIG_INET) &&
+		    p->flags & BR_PROXYARP &&
 		    skb->protocol == htons(ETH_P_ARP))
 			br_do_proxy_arp(skb, br, vid);
 

^ permalink raw reply related
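
Editorial sketch for the bridge patch above: the fix relies on the config test expanding to a constant 0 or 1 inside an ordinary C condition, so the compiler can drop the guarded call as dead code. The macros below are simplified stand-ins written in the style of the kernel's kconfig helpers, and every DEMO_ name is hypothetical and not taken from the patch.

```c
/* Simplified stand-ins for the kernel's kconfig helpers.
 * All DEMO_ names are hypothetical and exist only for this sketch.
 */
#define DEMO_PLACEHOLDER_1 0,
#define DEMO_SECOND_ARG(ignored, val, ...) val
#define DEMO_IS_SET(option) DEMO_IS_SET_(option)
#define DEMO_IS_SET_(value) DEMO_IS_SET__(DEMO_PLACEHOLDER_##value)
#define DEMO_IS_SET__(arg1_or_junk) DEMO_SECOND_ARG(arg1_or_junk 1, 0, 0)

/* An option that is switched on; DEMO_CONFIG_IPX is deliberately undefined. */
#define DEMO_CONFIG_INET 1

#define DEMO_BR_PROXYARP 0x1u

/* Mirrors the shape of the patched condition: the first operand is a
 * compile-time constant, so with the option off the whole expression
 * is constant 0 and the guarded branch is unreachable.
 */
static int demo_should_proxy_arp(unsigned int flags, int is_arp)
{
	return DEMO_IS_SET(DEMO_CONFIG_INET) &&
	       (flags & DEMO_BR_PROXYARP) &&
	       is_arp;
}
```

Unlike an #ifdef block, the guarded code is still parsed and type-checked in every configuration, which is the usual argument for this form.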

* Re: [PATCH 8/8] ath10k: fix error return code
From: Kalle Valo @ 2015-01-13 14:16 UTC (permalink / raw)
  To: Julia Lawall
  Cc: linux-wireless, kernel-janitors, linux-kernel, ath10k, netdev
In-Reply-To: <1419872683-32709-9-git-send-email-Julia.Lawall@lip6.fr>

Julia Lawall <Julia.Lawall@lip6.fr> writes:

> Return a negative error code on failure.
>
> A simplified version of the semantic match that finds this problem is as
> follows: (http://coccinelle.lip6.fr/)
>
> // <smpl>
> @@
> identifier ret; expression e1,e2;
> @@
> (
> if (\(ret < 0\|ret != 0\))
>  { ... return ret; }
> |
> ret = 0
> )
> ... when != ret = e1
>     when != &ret
> *if(...)
> {
>   ... when != ret = e2
>       when forall
>  return ret;
> }
> // </smpl>
>
> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>

Thanks, applied to ath.git.

-- 
Kalle Valo

^ permalink raw reply

* Re: [PATCH v3] ath10k: fixup wait_for_completion_timeout return handling
From: Kalle Valo @ 2015-01-13 14:20 UTC (permalink / raw)
  To: Nicholas Mc Guire
  Cc: Chun-Yeow Yeoh, Sergei Shtylyov, netdev, linux-wireless,
	linux-kernel, ath10k, Michal Kazior, Yanbo Li, Ben Greear
In-Reply-To: <1420720054-27870-1-git-send-email-der.herr@hofr.at>

Nicholas Mc Guire <der.herr@hofr.at> writes:

> wait_for_completion_timeout does not return negative values so the tests
> for <= 0 are not needed and the case differentiation in the error handling
> path unnecessary.
>
> Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>

Thanks, applied to ath.git.

-- 
Kalle Valo

^ permalink raw reply
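
Editorial sketch for the ath10k patch above: wait_for_completion_timeout() returns an unsigned long that is 0 on timeout and the remaining jiffies (at least 1) otherwise, so a single comparison against 0 is enough. The code below is a userspace illustration only; demo_wait_for_completion_timeout() is a hypothetical stand-in for the kernel helper, not the driver's code.

```c
#include <errno.h>

/* Hypothetical stand-in for the kernel helper: returns 0 when the wait
 * timed out, otherwise the remaining time, which is never less than 1.
 */
static unsigned long demo_wait_for_completion_timeout(int completed,
						      unsigned long timeout)
{
	if (!completed)
		return 0;
	return timeout ? timeout : 1;
}

/* The return value is unsigned, so "<= 0" can only ever mean "== 0". */
static int demo_wait_result_to_errno(unsigned long time_left)
{
	return time_left == 0 ? -ETIMEDOUT : 0;
}
```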

* Re: [net-next 03/15] i40evf: Remove some scary log messages
From: Sergei Shtylyov @ 2015-01-13 14:22 UTC (permalink / raw)
  To: Jeff Kirsher, davem; +Cc: Mitch A Williams, netdev, nhorman, sassmann, jogreene
In-Reply-To: <1421148811-9763-4-git-send-email-jeffrey.t.kirsher@intel.com>

Hello.

On 1/13/2015 2:33 PM, Jeff Kirsher wrote:

> From: Mitch A Williams <mitch.a.williams@intel.com>

> These messages may be triggered during normal init of the driver if the
> PF or FW take a long time to respond. There's nothing really wrong, so
> don't freak people out logging messages.

> If the communication channel really is dead, then we'll retry a few
> times and give up. This will log a different more scary message that
> should cause consternation. This allows the user to more easily detect a
> genuine failure.

> Change-ID: I6e2b758d4234a3a09c1015c82c8f2442a697cbdb
> Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
> Acked-by: Shannon Nelson <shannon.nelson@intel.com>
> Tested-by: Jim Young <james.m.young@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
[...]

> diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> index ee0db59..f8f1d26 100644
> --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> @@ -2026,10 +2026,7 @@ static void i40evf_init_task(struct work_struct *work)
>   		/* aq msg sent, awaiting reply */
>   		err = i40evf_verify_api_ver(adapter);
>   		if (err) {
> -			dev_info(&pdev->dev, "Unable to verify API version (%d), retrying\n",
> -				 err);
>   			if (err == I40E_ERR_ADMIN_QUEUE_NO_WORK) {
> -				dev_info(&pdev->dev, "Resending request\n");
>   				err = i40evf_send_api_ver(adapter);
>   			}

    {} not needed anymore, should have removed them.

[...]

WBR, Sergei

^ permalink raw reply

* [PATCH] rocker: fix harmless warning on 32-bit machines
From: Arnd Bergmann @ 2015-01-13 14:23 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Jiri Pirko, Scott Feldman, linux-arm-kernel

The rocker driver tries to assign a pointer to a 64-bit integer
and then back to a pointer. This is safe on all architectures,
but causes a compiler warning when pointers are shorter than
64-bit:

rocker/rocker.c: In function 'rocker_desc_cookie_ptr_get':
rocker/rocker.c:809:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  return (void *) desc_info->desc->cookie;
         ^

This adds another cast to uintptr_t to tell the compiler
that it's safe.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
index 2f398fa4b9e6..cad8cf962cdf 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -806,13 +806,13 @@ static bool rocker_desc_gen(struct rocker_desc_info *desc_info)
 
 static void *rocker_desc_cookie_ptr_get(struct rocker_desc_info *desc_info)
 {
-	return (void *) desc_info->desc->cookie;
+	return (void *)(uintptr_t)desc_info->desc->cookie;
 }
 
 static void rocker_desc_cookie_ptr_set(struct rocker_desc_info *desc_info,
 				       void *ptr)
 {
-	desc_info->desc->cookie = (long) ptr;
+	desc_info->desc->cookie = (uintptr_t) ptr;
 }
 
 static struct rocker_desc_info *

^ permalink raw reply related
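
Editorial sketch for the rocker patch above: a pointer stored in a 64-bit field round-trips cleanly when both conversions go through uintptr_t, which has the same width as a pointer on the target. The struct and function names below are hypothetical and only mirror the shape of the patched helpers.

```c
#include <stdint.h>

/* Hypothetical descriptor with a fixed 64-bit cookie field. */
struct demo_desc {
	uint64_t cookie;
};

/* Pointer -> uintptr_t -> uint64_t: the widening step is an ordinary
 * integer conversion, so no size-mismatch warning is triggered.
 */
static void demo_cookie_ptr_set(struct demo_desc *desc, void *ptr)
{
	desc->cookie = (uintptr_t)ptr;
}

/* uint64_t -> uintptr_t -> pointer: the narrowing on 32-bit targets is
 * made explicit before the integer is converted back to a pointer.
 */
static void *demo_cookie_ptr_get(const struct demo_desc *desc)
{
	return (void *)(uintptr_t)desc->cookie;
}
```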

* Re: [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings
From: Daniel Borkmann @ 2015-01-13 14:26 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, danny.zhou, nhorman, john.ronciak, hannes, brouer
In-Reply-To: <20150113043542.29985.15658.stgit@nitbit.x32>

On 01/13/2015 05:35 AM, John Fastabend wrote:
...
> +static int ixgbe_ndo_split_queue_pairs(struct net_device *dev,
> +				       unsigned int start_from,
> +				       unsigned int qpairs_num,
> +				       struct sock *sk)
> +{
> +	struct ixgbe_adapter *adapter = netdev_priv(dev);
> +	unsigned int qpair_index;

We should probably return -EINVAL, still from within the setsockopt
call when qpairs_num is 0?

> +	/* allocate whatever available qpairs */
> +	if (start_from == -1) {

I guess we should define the notion of auto-select into a uapi
define instead of -1, which might not be overly obvious.

Anyway, extending Documentation/networking/packet_mmap.txt with
API details/examples at least for a non-RFC version is encouraged. ;)

> +		unsigned int count = 0;
> +
> +		for (qpair_index = adapter->num_rx_queues;
> +		     qpair_index < MAX_RX_QUEUES;
> +		     qpair_index++) {
> +			if (!adapter->user_queue_info[qpair_index].sk_handle) {
> +				count++;
> +				if (count == qpairs_num) {
> +					start_from = qpair_index - count + 1;
> +					break;
> +				}
> +			} else {
> +				count = 0;
> +			}
> +		}
> +	}
> +
> +	/* otherwise the caller specified exact queues */
> +	if ((start_from > MAX_TX_QUEUES) ||
> +	    (start_from > MAX_RX_QUEUES) ||
> +	    (start_from + qpairs_num > MAX_TX_QUEUES) ||
> +	    (start_from + qpairs_num > MAX_RX_QUEUES))
> +		return -EINVAL;

Shouldn't this be '>=' if I see this correctly?

> +	/* If the qpairs are being used by the driver do not let user space
> +	 * consume the queues. Also if the queue has already been allocated
> +	 * to a socket do fail the request.
> +	 */
> +	for (qpair_index = start_from;
> +	     qpair_index < start_from + qpairs_num;
> +	     qpair_index++) {
> +		if ((qpair_index < adapter->num_tx_queues) ||
> +		    (qpair_index < adapter->num_rx_queues))
> +			return -EINVAL;
> +
> +		if (adapter->user_queue_info[qpair_index].sk_handle)
> +			return -EBUSY;
> +	}
> +
> +	/* remember the sk handle for each queue pair */
> +	for (qpair_index = start_from;
> +	     qpair_index < start_from + qpairs_num;
> +	     qpair_index++) {
> +		adapter->user_queue_info[qpair_index].sk_handle = sk;
> +		adapter->user_queue_info[qpair_index].num_of_regions = 0;
> +	}
> +
> +	return 0;
> +}

I guess many drivers would need to implement similar code, do you see
a chance to move generic parts to the core, at least for some helper
functions?

Thanks,
Daniel

^ permalink raw reply
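
Editorial sketch for the review above: the comments about a zero qpairs_num, the '>=' bound, and moving generic parts into helpers could be served by two small functions along these lines. This is a userspace illustration under stated assumptions; the names, the flat in_use array, and the choice of -ENOSPC are hypothetical and not part of the RFC.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: accept only a non-empty range [start, start + num)
 * that lies entirely inside [0, max). Written so that start + num cannot
 * wrap around.
 */
static int demo_qpair_range_check(unsigned int start, unsigned int num,
				  unsigned int max)
{
	if (num == 0)
		return -EINVAL;
	if (start >= max || num > max - start)
		return -EINVAL;
	return 0;
}

/* Hypothetical helper: find the first run of num consecutive free slots
 * in [first, max). Returns the start index of the run, or a negative
 * errno value when no such run exists.
 */
static int demo_qpair_find_free_run(const bool *in_use, unsigned int first,
				    unsigned int max, unsigned int num)
{
	unsigned int i, count = 0;

	if (num == 0)
		return -EINVAL;

	for (i = first; i < max; i++) {
		if (in_use[i]) {
			count = 0;	/* run broken, start counting again */
			continue;
		}
		if (++count == num)
			return (int)(i - count + 1);
	}
	return -ENOSPC;
}
```

Returning an explicit error from the search avoids leaving the start index at the auto-select sentinel when no run is found.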

* Re: [RFC PATCHv1 net-next] xen-netback: always fully coalesce guest Rx packets
From: Wei Liu @ 2015-01-13 14:30 UTC (permalink / raw)
  To: David Vrabel; +Cc: netdev, xen-devel, Ian Campbell, Wei Liu
In-Reply-To: <1421157917-31333-1-git-send-email-david.vrabel@citrix.com>

On Tue, Jan 13, 2015 at 02:05:17PM +0000, David Vrabel wrote:
> Always fully coalesce guest Rx packets into the minimum number of ring
> slots.  Reducing the number of slots per packet has significant
> performance benefits (e.g., 7.2 Gbit/s to 11 Gbit/s in an off-host
> receive test).
> 

Good number.

> However, this does increase the number of grant ops per packet which
> decreases performance with some workloads (intrahost VM to VM)

Do you have figures before and after this change?

> /unless/ grant copy has been optimized for adjacent ops with the same
> source or destination (see "grant-table: defer releasing pages
> acquired in a grant copy"[1]).
> 
> Do we need to retain the existing path and make the always coalesce
> path conditional on a suitable version of Xen?
> 

If the new path improves off-host RX on all Xen versions and doesn't
degrade intrahost VM to VM RX that much, I think we should use it
unconditionally.  Is intrahost VM to VM RX important to XenServer?

I don't consider intrahost VM to VM RX a very important use case, at
least not as important as off-host RX. I would expect that in a cloud
environment users would not count on their VMs residing on the same host.
Plus, some cloud providers might deliberately route traffic off-host for
various reasons even if VMs are on the same host.  (Verizon, for one,
mentioned they do that during last year's Xen Summit IIRC).

Others might disagree. Let's wait for other people to chime in.

> [1] http://lists.xen.org/archives/html/xen-devel/2015-01/msg01118.html
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  drivers/net/xen-netback/common.h  |    1 -
>  drivers/net/xen-netback/netback.c |  106 ++-----------------------------------
>  2 files changed, 3 insertions(+), 104 deletions(-)

Love the diffstat!

Wei.

^ permalink raw reply

* Re: [PATCH] rocker: fix harmless warning on 32-bit machines
From: Jiri Pirko @ 2015-01-13 14:36 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: netdev, David Miller, Scott Feldman, linux-arm-kernel
In-Reply-To: <4047824.AYNhYQQ6UI@wuerfel>

Tue, Jan 13, 2015 at 03:23:52PM CET, arnd@arndb.de wrote:
>The rocker driver tries to assign a pointer to a 64-bit integer
>and then back to a pointer. This is safe on all architectures,
>but causes a compiler warning when pointers are shorter than
>64-bit:
>
>rocker/rocker.c: In function 'rocker_desc_cookie_ptr_get':
>rocker/rocker.c:809:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
>  return (void *) desc_info->desc->cookie;
>         ^
>
>This adds another cast to uintptr_t to tell the compiler
>that it's safe.
>
>Signed-off-by: Arnd Bergmann <arnd@arndb.de>


Acked-by: Jiri Pirko <jiri@resnulli.us>

>
>diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
>index 2f398fa4b9e6..cad8cf962cdf 100644
>--- a/drivers/net/ethernet/rocker/rocker.c
>+++ b/drivers/net/ethernet/rocker/rocker.c
>@@ -806,13 +806,13 @@ static bool rocker_desc_gen(struct rocker_desc_info *desc_info)
> 
> static void *rocker_desc_cookie_ptr_get(struct rocker_desc_info *desc_info)
> {
>-	return (void *) desc_info->desc->cookie;
>+	return (void *)(uintptr_t)desc_info->desc->cookie;
> }
> 
> static void rocker_desc_cookie_ptr_set(struct rocker_desc_info *desc_info,
> 				       void *ptr)
> {
>-	desc_info->desc->cookie = (long) ptr;
>+	desc_info->desc->cookie = (uintptr_t) ptr;
> }
> 
> static struct rocker_desc_info *
>

^ permalink raw reply

* [PATCH 1/6] virtio/9p: verify device has config space
From: Michael S. Tsirkin @ 2015-01-13 14:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Eric Van Hensbergen, Ron Minnich, Latchesar Ionkov,
	David S. Miller, v9fs-developer, netdev
In-Reply-To: <1421160167-18498-1-git-send-email-mst@redhat.com>

Some devices might not implement config space access
(e.g. remoteproc used not to - before 3.9).
virtio/9p needs config space access so make it
fail gracefully if not there.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 net/9p/trans_virtio.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index daa749c..d8e376a 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -524,6 +524,12 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 	int err;
 	struct virtio_chan *chan;
 
+	if (!vdev->config->get) {
+		dev_err(&vdev->dev, "%s failure: config access disabled\n",
+			__func__);
+		return -EINVAL;
+	}
+
 	chan = kmalloc(sizeof(struct virtio_chan), GFP_KERNEL);
 	if (!chan) {
 		pr_err("Failed to allocate virtio 9P channel\n");
-- 
MST

^ permalink raw reply related
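
Editorial sketch for the virtio patch above: the probe routine checks that an optional callback in the ops table is present before relying on it, and fails with -EINVAL otherwise. The types and names below are hypothetical userspace stand-ins, not the virtio structures themselves.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical ops table: get is NULL when the transport offers no
 * config space access.
 */
struct demo_config_ops {
	void (*get)(unsigned int offset, void *buf, unsigned int len);
};

struct demo_device {
	const struct demo_config_ops *config;
};

/* Fail early and cleanly instead of calling through a NULL pointer
 * later in the probe path.
 */
static int demo_probe(const struct demo_device *dev)
{
	if (!dev->config->get)
		return -EINVAL;
	return 0;
}

/* Trivial callback used to model a transport that has config access. */
static void demo_get(unsigned int offset, void *buf, unsigned int len)
{
	(void)offset;
	(void)buf;
	(void)len;
}
```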

* [PATCH 4/6] virtio/net: verify device has config space
From: Michael S. Tsirkin @ 2015-01-13 14:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, virtualization
In-Reply-To: <1421160167-18498-1-git-send-email-mst@redhat.com>

Some devices might not implement config space access
(e.g. remoteproc used not to - before 3.9).
virtio/net needs config space access so make it
fail gracefully if not there.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5ca9771..9bc1072 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1713,6 +1713,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 	struct virtnet_info *vi;
 	u16 max_queue_pairs;
 
+	if (!vdev->config->get) {
+		dev_err(&vdev->dev, "%s failure: config access disabled\n",
+			__func__);
+		return -EINVAL;
+	}
+
 	if (!virtnet_validate_features(vdev))
 		return -EINVAL;
 
-- 
MST

^ permalink raw reply related

* Re: [Xen-devel] [PATCH 08/14] xen-netback: use foreign page information from the pages themselves
From: David Vrabel @ 2015-01-13 14:43 UTC (permalink / raw)
  To: David Vrabel, xen-devel, David S. Miller
  Cc: Boris Ostrovsky, Jenny Herbert, netdev@vger.kernel.org
In-Reply-To: <1421077417-7162-9-git-send-email-david.vrabel@citrix.com>

On 12/01/15 15:43, David Vrabel wrote:
> From: Jenny Herbert <jenny.herbert@citrix.com>
> 
> Use the foreign page flag in netback to get the domid and grant ref
> needed for the grant copy.  This significantly simplifies the netback
> code and makes netback work with foreign pages from other backends
> (e.g., blkback).
> 
> This allows blkback to use iSCSI disks provided by domUs running on
> the same host.

Dave,

This depends on several xen changes.  It's been Acked-by: Ian Campbell
<ian.campbell@citrix.com>

Are you happy for me to merge this via the xen tree in 3.20?

David

> Signed-off-by: Jenny Herbert <jennifer.herbert@citrix.com>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  drivers/net/xen-netback/netback.c |  100 ++++---------------------------------
>  1 file changed, 9 insertions(+), 91 deletions(-)
> 
> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> index 6441318..ae3ab37 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -314,9 +314,7 @@ static struct xenvif_rx_meta *get_next_rx_buffer(struct xenvif_queue *queue,
>  static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb,
>  				 struct netrx_pending_operations *npo,
>  				 struct page *page, unsigned long size,
> -				 unsigned long offset, int *head,
> -				 struct xenvif_queue *foreign_queue,
> -				 grant_ref_t foreign_gref)
> +				 unsigned long offset, int *head)
>  {
>  	struct gnttab_copy *copy_gop;
>  	struct xenvif_rx_meta *meta;
> @@ -333,6 +331,8 @@ static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb
>  	offset &= ~PAGE_MASK;
>  
>  	while (size > 0) {
> +		struct xen_page_foreign *foreign;
> +
>  		BUG_ON(offset >= PAGE_SIZE);
>  		BUG_ON(npo->copy_off > MAX_BUFFER_OFFSET);
>  
> @@ -361,9 +361,10 @@ static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb
>  		copy_gop->flags = GNTCOPY_dest_gref;
>  		copy_gop->len = bytes;
>  
> -		if (foreign_queue) {
> -			copy_gop->source.domid = foreign_queue->vif->domid;
> -			copy_gop->source.u.ref = foreign_gref;
> +		foreign = xen_page_foreign(page);
> +		if (foreign) {
> +			copy_gop->source.domid = foreign->domid;
> +			copy_gop->source.u.ref = foreign->gref;
>  			copy_gop->flags |= GNTCOPY_source_gref;
>  		} else {
>  			copy_gop->source.domid = DOMID_SELF;
> @@ -406,35 +407,6 @@ static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb
>  }
>  
>  /*
> - * Find the grant ref for a given frag in a chain of struct ubuf_info's
> - * skb: the skb itself
> - * i: the frag's number
> - * ubuf: a pointer to an element in the chain. It should not be NULL
> - *
> - * Returns a pointer to the element in the chain where the page were found. If
> - * not found, returns NULL.
> - * See the definition of callback_struct in common.h for more details about
> - * the chain.
> - */
> -static const struct ubuf_info *xenvif_find_gref(const struct sk_buff *const skb,
> -						const int i,
> -						const struct ubuf_info *ubuf)
> -{
> -	struct xenvif_queue *foreign_queue = ubuf_to_queue(ubuf);
> -
> -	do {
> -		u16 pending_idx = ubuf->desc;
> -
> -		if (skb_shinfo(skb)->frags[i].page.p ==
> -		    foreign_queue->mmap_pages[pending_idx])
> -			break;
> -		ubuf = (struct ubuf_info *) ubuf->ctx;
> -	} while (ubuf);
> -
> -	return ubuf;
> -}
> -
> -/*
>   * Prepare an SKB to be transmitted to the frontend.
>   *
>   * This function is responsible for allocating grant operations, meta
> @@ -459,8 +431,6 @@ static int xenvif_gop_skb(struct sk_buff *skb,
>  	int head = 1;
>  	int old_meta_prod;
>  	int gso_type;
> -	const struct ubuf_info *ubuf = skb_shinfo(skb)->destructor_arg;
> -	const struct ubuf_info *const head_ubuf = ubuf;
>  
>  	old_meta_prod = npo->meta_prod;
>  
> @@ -507,68 +477,16 @@ static int xenvif_gop_skb(struct sk_buff *skb,
>  			len = skb_tail_pointer(skb) - data;
>  
>  		xenvif_gop_frag_copy(queue, skb, npo,
> -				     virt_to_page(data), len, offset, &head,
> -				     NULL,
> -				     0);
> +				     virt_to_page(data), len, offset, &head);
>  		data += len;
>  	}
>  
>  	for (i = 0; i < nr_frags; i++) {
> -		/* This variable also signals whether foreign_gref has a real
> -		 * value or not.
> -		 */
> -		struct xenvif_queue *foreign_queue = NULL;
> -		grant_ref_t foreign_gref;
> -
> -		if ((skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) &&
> -			(ubuf->callback == &xenvif_zerocopy_callback)) {
> -			const struct ubuf_info *const startpoint = ubuf;
> -
> -			/* Ideally ubuf points to the chain element which
> -			 * belongs to this frag. Or if frags were removed from
> -			 * the beginning, then shortly before it.
> -			 */
> -			ubuf = xenvif_find_gref(skb, i, ubuf);
> -
> -			/* Try again from the beginning of the list, if we
> -			 * haven't tried from there. This only makes sense in
> -			 * the unlikely event of reordering the original frags.
> -			 * For injected local pages it's an unnecessary second
> -			 * run.
> -			 */
> -			if (unlikely(!ubuf) && startpoint != head_ubuf)
> -				ubuf = xenvif_find_gref(skb, i, head_ubuf);
> -
> -			if (likely(ubuf)) {
> -				u16 pending_idx = ubuf->desc;
> -
> -				foreign_queue = ubuf_to_queue(ubuf);
> -				foreign_gref =
> -					foreign_queue->pending_tx_info[pending_idx].req.gref;
> -				/* Just a safety measure. If this was the last
> -				 * element on the list, the for loop will
> -				 * iterate again if a local page were added to
> -				 * the end. Using head_ubuf here prevents the
> -				 * second search on the chain. Or the original
> -				 * frags changed order, but that's less likely.
> -				 * In any way, ubuf shouldn't be NULL.
> -				 */
> -				ubuf = ubuf->ctx ?
> -					(struct ubuf_info *) ubuf->ctx :
> -					head_ubuf;
> -			} else
> -				/* This frag was a local page, added to the
> -				 * array after the skb left netback.
> -				 */
> -				ubuf = head_ubuf;
> -		}
>  		xenvif_gop_frag_copy(queue, skb, npo,
>  				     skb_frag_page(&skb_shinfo(skb)->frags[i]),
>  				     skb_frag_size(&skb_shinfo(skb)->frags[i]),
>  				     skb_shinfo(skb)->frags[i].page_offset,
> -				     &head,
> -				     foreign_queue,
> -				     foreign_queue ? foreign_gref : UINT_MAX);
> +				     &head);
>  	}
>  
>  	return npo->meta_prod - old_meta_prod;
> 

^ permalink raw reply

* Re: why are IPv6 addresses removed on link down
From: David Ahern @ 2015-01-13 14:53 UTC (permalink / raw)
  To: Hannes Frederic Sowa, YOSHIFUJI Hideaki
  Cc: Stephen Hemminger, netdev@vger.kernel.org
In-Reply-To: <1421152613.13626.24.camel@redhat.com>

On 1/13/15 5:36 AM, Hannes Frederic Sowa wrote:
> Hi,
>
> On Di, 2015-01-13 at 21:15 +0900, YOSHIFUJI Hideaki wrote:
>> YOSHIFUJI Hideaki wrote:
>>> Hi,
>>>
>>> Hannes Frederic Sowa wrote:
>>>> On Mo, 2015-01-12 at 23:10 -0800, Stephen Hemminger wrote:
>>>>> On Mon, 12 Jan 2015 22:06:44 -0700
>>>>> David Ahern <dsahern@gmail.com> wrote:
>>>>>
>>>>>> We noticed that IPv6 addresses are removed on a link down. e.g.,
>>>>>>      ip link set dev eth1
>>>>>>
>>>>>>
>>>>>> Looking at the code it appears to be this code path in addrconf.c:
>>>>>>
>>>>>>            case NETDEV_DOWN:
>>>>>>            case NETDEV_UNREGISTER:
>>>>>>                    /*
>>>>>>                     *      Remove all addresses from this interface.
>>>>>>                     */
>>>>>>                    addrconf_ifdown(dev, event != NETDEV_DOWN);
>>>>>>                    break;
>>>>>>
>>>>>> IPv4 addresses are NOT removed on a link down. Is there a particular
>>>>>> reason IPv6 addresses are?
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>
>>>>> See the RFCs which describe how IPv6 does Duplicate Address Detection.
>>>>> Address is not valid when link is down, since DAD is not possible.
>>>>
>>>> It should be no problem if the kernel would reacquire them on ifup and
>>>> do proper DAD. We simply must not use them while the interface is dead
>>>> (also making sure they don't get used for loopback routing).
>>>>
>>>> The problem the IPv6 addresses get removed is much more a historical
>>>> artifact nowadays, I think. It is part of user space API and scripts
>>>> deal with that already.
>>>
>>> We might have another "detached" state which essentially drops
>>> outgoing packets while link is down.  Just after recovering link,
>>> we could start receiving packet from the link and perform optimistic
>>> DAD. And then, after it succeeds, we may start sending packets.
>>>
>>> Since "detached" state is like the state just before completing
>>> Optimistic DAD, it is not so difficult to implement this extended
>>> behavior, I guess.
>>>
>>
>> Note that node is allowed to send packets to neighbours or default
>> routers if the node knows their link-layer addresses during Optimistic
>> DAD.
>>
>
> I don't think it should be a problem from internal state handling of the
> addresses.
>
> I am much more concerned with scripts expecting the addresses to be
> flushed on interface down/up and not reacting appropriately.

The current code seems inconsistent: I can put an IPv6 address on a link 
in the down state. On a link up the address is retained. Only on a 
subsequent link down is it removed. If DAD or anything else is the 
reason for the current logic then why allow an address to be assigned in 
the down state? Similarly that it currently seems to work ok then it 
suggests the right thing is done on a link up in which case a flush is 
not needed.

Bottom line is there a harm in removing the flush? If there is no harm 
will mainline kernel take a patch to do that or is your backward 
compatibility concern enough to block it?

David

^ permalink raw reply

* Re: why are IPv6 addresses removed on link down
From: Hannes Frederic Sowa @ 2015-01-13 15:00 UTC (permalink / raw)
  To: David Ahern; +Cc: YOSHIFUJI Hideaki, Stephen Hemminger, netdev@vger.kernel.org
In-Reply-To: <54B53187.7080306@gmail.com>

On Di, 2015-01-13 at 07:53 -0700, David Ahern wrote:
> On 1/13/15 5:36 AM, Hannes Frederic Sowa wrote:
> > Hi,
> >
> > On Di, 2015-01-13 at 21:15 +0900, YOSHIFUJI Hideaki wrote:
> >> YOSHIFUJI Hideaki wrote:
> >>> Hi,
> >>>
> >>> Hannes Frederic Sowa wrote:
> >>>> On Mo, 2015-01-12 at 23:10 -0800, Stephen Hemminger wrote:
> >>>>> On Mon, 12 Jan 2015 22:06:44 -0700
> >>>>> David Ahern <dsahern@gmail.com> wrote:
> >>>>>
> >>>>>> We noticed that IPv6 addresses are removed on a link down. e.g.,
> >>>>>>      ip link set dev eth1
> >>>>>>
> >>>>>>
> >>>>>> Looking at the code it appears to be this code path in addrconf.c:
> >>>>>>
> >>>>>>            case NETDEV_DOWN:
> >>>>>>            case NETDEV_UNREGISTER:
> >>>>>>                    /*
> >>>>>>                     *      Remove all addresses from this interface.
> >>>>>>                     */
> >>>>>>                    addrconf_ifdown(dev, event != NETDEV_DOWN);
> >>>>>>                    break;
> >>>>>>
> >>>>>> IPv4 addresses are NOT removed on a link down. Is there a particular
> >>>>>> reason IPv6 addresses are?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> David
> >>>>>
> >>>>> See the RFCs which describe how IPv6 does Duplicate Address Detection.
> >>>>> Address is not valid when link is down, since DAD is not possible.
> >>>>
> >>>> It should be no problem if the kernel would reacquire them on ifup and
> >>>> do proper DAD. We simply must not use them while the interface is dead
> >>>> (also making sure they don't get used for loopback routing).
> >>>>
> >>>> The problem the IPv6 addresses get removed is much more a historical
> >>>> artifact nowadays, I think. It is part of user space API and scripts
> >>>> deal with that already.
> >>>
> >>> We might have another "detached" state which essentially drops
> >>> outgoing packets while link is down.  Just after recovering link,
> >>> we could start receiving packet from the link and perform optimistic
> >>> DAD. And then, after it succeeds, we may start sending packets.
> >>>
> >>> Since "detached" state is like the state just before completing
> >>> Optimistic DAD, it is not so difficult to implement this extended
> >>> behavior, I guess.
> >>>
> >>
> >> Note that node is allowed to send packets to neighbours or default
> >> routers if the node knows their link-layer addresses during Optimistic
> >> DAD.
> >>
> >
> > I don't think it should be a problem from internal state handling of the
> > addresses.
> >
> > I am much more concerned with scripts expecting the addresses to be
> > flushed on interface down/up and not reacting appropriately.
> 
> The current code seems inconsistent: I can put an IPv6 address on a link 
> in the down state. On a link up the address is retained. Only on a 
> subsequent link down is it removed. If DAD or anything else is the 
> reason for the current logic then why allow an address to be assigned in 
> the down state? Similarly that it currently seems to work ok then it 
> suggests the right thing is done on a link up in which case a flush is 
> not needed.
> 
> Bottom line is there a harm in removing the flush? If there is no harm 
> will mainline kernel take a patch to do that or is your backward 
> compatibility concern enough to block it?

This was already discussed several times here, e.g. one patch I just 
found:

http://lists.openwall.net/netdev/2011/01/24/8
and
http://patchwork.ozlabs.org/patch/17558/

Although I hate sysctls for things like this, I tend to find it
acceptable because it solves a problem that has affected lots of people.
And I don't like the current behavior either.

I think this can work, but we should follow up all the old discussions
to not introduce any kind of new undesired behavior this time.

Thanks,
Hannes

^ permalink raw reply

* RE: [PATCH net-next] rhashtable: Lower/upper bucket may map to same lock while shrinking
From: David Laight @ 2015-01-13 15:00 UTC (permalink / raw)
  To: 'Thomas Graf'
  Cc: davem@davemloft.net, Fengguang Wu, LKP,
	linux-kernel@vger.kernel.org, netfilter-devel@vger.kernel.org,
	coreteam@netfilter.org, netdev@vger.kernel.org
In-Reply-To: <20150113112507.GH20387@casper.infradead.org>

From: Thomas Graf
...
> > Thought, could the shrunk table use the same locks as the lower half
> > of the old table?
> 
> No. A new bucket table and thus a new set of locks is allocated when the
> table is shrunk or grown. We only have to check for overlapping locks
> when holding multiple locks for the same table at the same time.

I was guessing that when locks are shared buckets k and 2^n+k use the
same lock.
Under those conditions if the 'grow' decided not to allocate extra
locks then it could save work by using exactly the same locks as the
old table.
Similarly 'shrink' could do the reverse.

It was only a thought.

	David

^ permalink raw reply
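
Editorial sketch for the guess above: if the lock for a bucket is chosen by masking the bucket index with (number of locks - 1), then buckets k and 2^n + k share a lock whenever there are at most 2^n locks. The mapping is an assumption made for illustration, and the function names are hypothetical.

```c
#include <stdbool.h>

/* Assumed mapping: locks_mask is (number of locks - 1), with the
 * number of locks a power of two.
 */
static unsigned int demo_lock_index(unsigned int bucket,
				    unsigned int locks_mask)
{
	return bucket & locks_mask;
}

/* True when bucket k in the lower half and bucket k + half in the upper
 * half of the table map to the same lock.
 */
static bool demo_halves_share_lock(unsigned int k, unsigned int half,
				   unsigned int locks_mask)
{
	return demo_lock_index(k, locks_mask) ==
	       demo_lock_index(k + half, locks_mask);
}
```

With a 16-bucket table (half = 8), 8 locks or fewer make every lower/upper pair share a lock, while 16 locks keep them distinct; the shared case is the one where a second lock acquisition on the same lock has to be skipped.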

* Re: why are IPv6 addresses removed on link down
From: Sowmini Varadhan @ 2015-01-13 15:00 UTC (permalink / raw)
  To: David Ahern
  Cc: Hannes Frederic Sowa, YOSHIFUJI Hideaki, Stephen Hemminger,
	netdev@vger.kernel.org
In-Reply-To: <54B53187.7080306@gmail.com>

On (01/13/15 07:53), David Ahern wrote:
> 
> The current code seems inconsistent: I can put an IPv6 address on a
> link in the down state. On a link up the address is retained. Only
> on a subsequent link down is it removed. If DAD or anything else is
> the reason for the current logic then why allow an address to be
> assigned in the down state? Similarly that it currently seems to
> work ok then it suggests the right thing is done on a link up in
> which case a flush is not needed.
> 
> Bottom line is there a harm in removing the flush? If there is no
> harm will mainline kernel take a patch to do that or is your
> backward compatibility concern enough to block it?

Does some of this have to do with the manner in which this interacts
with SLAAC? I recall that there were two schools of thought for doing
DAD when SLAAC is present: one says it is sufficient to just do DAD
on the interface-id, the other requires DAD on the whole 128-bit IPv6
address. I'm not sure which choice linux makes.

--Sowmini

^ permalink raw reply

* RE: [PATCH net-next] rhashtable: Lower/upper bucket may map to same lock while shrinking
From: David Laight @ 2015-01-13 15:06 UTC (permalink / raw)
  To: 'Thomas Graf'
  Cc: davem@davemloft.net, Fengguang Wu, LKP,
	linux-kernel@vger.kernel.org, netfilter-devel@vger.kernel.org,
	coreteam@netfilter.org, netdev@vger.kernel.org
In-Reply-To: <20150113112507.GH20387@casper.infradead.org>

From: Thomas Graf
...
> > >  		spin_lock_bh(old_bucket_lock1);
> > > -		spin_lock_bh_nested(old_bucket_lock2, RHT_LOCK_NESTED);
> > > -		spin_lock_bh_nested(new_bucket_lock, RHT_LOCK_NESTED2);
> > > +
> > > +		/* Depending on the lock per buckets mapping, the bucket in
> > > +		 * the lower and upper region may map to the same lock.
> > > +		 */
> > > +		if (old_bucket_lock1 != old_bucket_lock2) {
> > > +			spin_lock_bh_nested(old_bucket_lock2, RHT_LOCK_NESTED);
> > > +			spin_lock_bh_nested(new_bucket_lock, RHT_LOCK_NESTED2);
> > > +		} else {
> > > +			spin_lock_bh_nested(new_bucket_lock, RHT_LOCK_NESTED);
> > > +		}
> >
> > Acquiring 3 locks of much the same type looks like a locking hierarchy
> > violation just waiting to happen.
> 
> I'm not claiming it's extremely pretty, lockless lookup with deferred
> resizing doesn't come for free ;-) If you have a suggestion on how to
> implement this differently I'm all ears.

runs away....

> That said, it's well isolated
> and the user of rhashtable does not have to deal with it. All code paths
> which take multiple locks are mutually exclusive to each other (ht->mutex).

OK, ht->mutex saves the day.
Might be worth a comment to save people looking at the code in isolation
from worrying and doing a bit search.
OTOH it might be obvious from a slightly larger fragment than the diff.
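
For readers looking at the hunk in isolation, the pattern can be
sketched in userspace as below. This is only an illustration with
pthread mutexes standing in for the bucket spinlocks; lock_pair() and
unlock_pair() are hypothetical names, and the sketch assumes, as the
thread states, that all multi-lock paths are serialized by an outer
mutex.

```c
#include <pthread.h>

/* Take both locks, but only once when both pointers name the same lock.
 * Returns the number of distinct locks taken. No ordering between l1 and
 * l2 is enforced here: callers are assumed to be serialized by an outer
 * mutex, which is what makes the nested acquisition safe. */
static int lock_pair(pthread_mutex_t *l1, pthread_mutex_t *l2)
{
	pthread_mutex_lock(l1);
	if (l1 == l2)
		return 1;
	pthread_mutex_lock(l2);
	return 2;
}

/* Release in reverse order, again skipping the duplicate. */
static void unlock_pair(pthread_mutex_t *l1, pthread_mutex_t *l2)
{
	if (l1 != l2)
		pthread_mutex_unlock(l2);
	pthread_mutex_unlock(l1);
}
```

Without the pointer comparison, the second lock call on a shared lock
would self-deadlock, which is the case the patch guards against.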

	David

^ permalink raw reply

* Re: why are IPv6 addresses removed on link down
From: Hannes Frederic Sowa @ 2015-01-13 15:09 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: David Ahern, YOSHIFUJI Hideaki, Stephen Hemminger,
	netdev@vger.kernel.org
In-Reply-To: <20150113150048.GA28371@oracle.com>

On Di, 2015-01-13 at 10:00 -0500, Sowmini Varadhan wrote:
> On (01/13/15 07:53), David Ahern wrote:
> > 
> > The current code seems inconsistent: I can put an IPv6 address on a
> > link in the down state. On a link up the address is retained. Only
> > on a subsequent link down is it removed. If DAD or anything else is
> > the reason for the current logic then why allow an address to be
> > assigned in the down state? Similarly that it currently seems to
> > work ok then it suggests the right thing is done on a link up in
> > which case a flush is not needed.
> > 
> > Bottom line is there a harm in removing the flush? If there is no
> > harm will mainline kernel take a patch to do that or is your
> > backward compatibility concern enough to block it?
> 
> Does some of this have to do with the manner in which this interacts
> with SLAAC? I recall that there were two schools of thought for doing
> DAD when SLAAC is present: one says it is sufficient to just do DAD
> on the interface-id, the other requires DAD on the whole 128-bit IPv6
> address. I'm not sure which choice linux makes.

Yes, it does have something to do with it. But I didn't understand what
you meant by doing DAD on the interface-id.

If you look at the patches I just posted, only addresses which are in
link-local and not in permanent state will be flushed.

I also need to research how to approach this safely; I don't know
yet.

Bye,
Hannes

^ permalink raw reply

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
From: Daniel Borkmann @ 2015-01-13 15:12 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, danny.zhou, nhorman, john.ronciak, hannes, brouer
In-Reply-To: <20150113043509.29985.33515.stgit@nitbit.x32>

On 01/13/2015 05:35 AM, John Fastabend wrote:
...
>   struct net_device_ops {
>   	int			(*ndo_init)(struct net_device *dev);
> @@ -1190,6 +1240,35 @@ struct net_device_ops {
>   	int			(*ndo_switch_port_stp_update)(struct net_device *dev,
>   							      u8 state);
>   #endif
> +	int			(*ndo_split_queue_pairs)(struct net_device *dev,
> +					 unsigned int qpairs_start_from,
> +					 unsigned int qpairs_num,
> +					 struct sock *sk);
...
> +	int			(*ndo_get_dma_region_info)
> +					(struct net_device *dev,
> +					 struct tpacket_dma_mem_region *region,
> +					 struct sock *sk);
>   };

Any slight chance these 8 ndo ops could be further reduced? ;)

>   /**
> diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
> index da2d668..eb7a727 100644
> --- a/include/uapi/linux/if_packet.h
> +++ b/include/uapi/linux/if_packet.h
...
> +struct tpacket_dev_qpair_map_region_info {
> +	unsigned int tp_dev_bar_sz;		/* size of BAR */
> +	unsigned int tp_dev_sysm_sz;		/* size of system memory */
> +	/* number of contiguous memory on BAR mapping to user space */
> +	unsigned int tp_num_map_regions;
> +	/* number of contiguous memory on system mapping to user space */
> +	unsigned int tp_num_sysm_map_regions;
> +	struct map_page_region {
> +		unsigned page_offset;	/* offset to start of region */
> +		unsigned page_sz;	/* size of page */
> +		unsigned page_cnt;	/* number of pages */

Please use unsigned int et al, or preferably __u* variants consistently
in the uapi structs.
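
Roughly what I have in mind, as a userspace sketch: uint32_t stands in
for the kernel's __u32 here so the snippet builds outside the tree, and
the struct name is made up for illustration.

```c
#include <stdint.h>

/* Userspace stand-in for the nested region struct; in the uapi header
 * each uint32_t would be spelled __u32. */
struct map_page_region_sketch {
	uint32_t page_offset;	/* offset to start of region */
	uint32_t page_sz;	/* size of page */
	uint32_t page_cnt;	/* number of pages */
};
```

With explicit widths the layout no longer depends on what a bare
"unsigned" happens to mean for a given compiler and ABI.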

> +	} tp_regions[MAX_MAP_MEMORY_REGIONS];
> +};
...
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 6880f34..8cd17da 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
...
> @@ -2633,6 +2636,16 @@ static int packet_release(struct socket *sock)
>   	sock_prot_inuse_add(net, sk->sk_prot, -1);
>   	preempt_enable();
>
> +	if (po->tp_owns_queue_pairs) {
> +		struct net_device *dev;
> +
> +		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +		if (dev) {
> +			dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
> +			umem_release(dev, po);
> +		}
> +	}
> +
...
> +static int
>   packet_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen)
>   {
>   	struct sock *sk = sock->sk;
> @@ -3428,6 +3525,167 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
>   		po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
>   		return 0;
>   	}
> +	case PACKET_RXTX_QPAIRS_SPLIT:
> +	{
...
> +		/* This call only works after a bind call which calls a dev_hold
> +		 * operation so we do not need to increment dev ref counter
> +		 */
> +		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +		if (!dev)
> +			return -EINVAL;
> +		ops = dev->netdev_ops;
> +		if (!ops->ndo_split_queue_pairs)
> +			return -EOPNOTSUPP;
> +
> +		err =  ops->ndo_split_queue_pairs(dev,
> +						  qpairs.tp_qpairs_start_from,
> +						  qpairs.tp_qpairs_num, sk);
> +		if (!err)
> +			po->tp_owns_queue_pairs = true;

When this is being set here, above test in packet_release() and the chunk
quoted below in packet_mmap() are not guaranteed to work since we don't
test if some ndos are actually implemented by the driver. Seems a bit
fragile, I'm wondering if we should test this capability as a _whole_,
iow if all necessary functions to make this work are being provided by the
driver, e.g. flag the netdev as such and test for that instead.
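
A minimal sketch of such a whole-capability test, with simplified
hypothetical types (this is not the real net_device_ops, and the set of
members is only an example):

```c
#include <stdbool.h>
#include <stddef.h>

struct dev_sketch;	/* opaque stand-in for struct net_device */

/* Hypothetical subset of the callbacks the feature relies on. */
struct qpair_ops_sketch {
	int (*split_queue_pairs)(struct dev_sketch *dev,
				 unsigned int start, unsigned int num);
	int (*return_queue_pairs)(struct dev_sketch *dev);
	int (*direct_qpair_page_map)(struct dev_sketch *dev);
};

/* True only when every callback needed by the feature is provided, so
 * later call sites do not need their own NULL checks. */
static bool dev_supports_direct_qpairs(const struct qpair_ops_sketch *ops)
{
	return ops &&
	       ops->split_queue_pairs &&
	       ops->return_queue_pairs &&
	       ops->direct_qpair_page_map;
}
```

The result could be evaluated once, e.g. at registration time, and
cached in a flag on the netdev.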

> +		return err;
> +	}
> +	case PACKET_RXTX_QPAIRS_RETURN:
> +	{
...
> +		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +		if (!dev)
> +			return -EINVAL;
> +		ops = dev->netdev_ops;
> +		if (!ops->ndo_split_queue_pairs)
> +			return -EOPNOTSUPP;

Should test for ndo_return_queue_pairs.

> +		err =  dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
> +		if (!err)
> +			po->tp_owns_queue_pairs = false;
> +
...
> +	case PACKET_RXTX_QPAIRS_SPLIT:
> +	{
...
> +		/* This call only works after a bind call which calls a dev_hold
> +		 * operation so we do not need to increment dev ref counter
> +		 */
> +		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +		if (!dev)
> +			return -EINVAL;
> +		if (!dev->netdev_ops->ndo_split_queue_pairs)
> +			return -EOPNOTSUPP;

Copy-paste (although not quite, since here's no extra ops var). :)
Should be ndo_get_split_queue_pairs.

> +		err =  dev->netdev_ops->ndo_get_split_queue_pairs(dev,
> +					&qpairs_info.tp_qpairs_start_from,
> +					&qpairs_info.tp_qpairs_num, sk);
> +
...
> @@ -3927,8 +4309,20 @@ static int packet_mmap(struct file *file, struct socket *sock,
>   	if (vma->vm_pgoff)
>   		return -EINVAL;
>
> +	dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +	if (!dev)
> +		return -EINVAL;
> +
>   	mutex_lock(&po->pg_vec_lock);
>
> +	if (po->tp_owns_queue_pairs) {
> +		ops = dev->netdev_ops;
> +		err = ops->ndo_direct_qpair_page_map(vma, dev);
> +		if (err)
> +			goto out;
> +		goto done;
> +	}
> +

^ permalink raw reply

* Re: why are IPv6 addresses removed on link down
From: Sowmini Varadhan @ 2015-01-13 15:13 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Ahern, YOSHIFUJI Hideaki, Stephen Hemminger,
	netdev@vger.kernel.org
In-Reply-To: <1421161791.13626.33.camel@redhat.com>

On (01/13/15 16:09), Hannes Frederic Sowa wrote:
> 
> Yes, it does have something to do with it. But I didn't understand what
> you meant by doing DAD on the interface-id.

I have to dig up the RFCs for this, but I recall that, at one point,
the specs assert that it is sufficient to verify that the interface-id
(I think via DAD for the link-local address) is unique, and use
that to infer uniqueness of all the other non-link-local addresses
as well.

I think later specs may have changed that, asserting that the
correct, safe, proper thing to do is to separately DAD each address
by itself.
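
To be concrete about what I mean by "interface-id": assuming a /64
prefix, it is the low 64 bits of the address. A small userspace sketch
(same_interface_id() is a made-up helper, not a kernel function):

```c
#include <netinet/in.h>
#include <string.h>

/* Returns 1 when two IPv6 addresses carry the same interface identifier,
 * taken here to be the low 64 bits (bytes 8..15), which assumes a /64
 * prefix on both addresses. */
static int same_interface_id(const struct in6_addr *a,
			     const struct in6_addr *b)
{
	return memcmp(&a->s6_addr[8], &b->s6_addr[8], 8) == 0;
}
```

For example, fe80::211:22ff:fe33:4455 and 2001:db8::211:22ff:fe33:4455
compare equal under this test, which is the case where the first school
of thought would skip a second DAD.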

> If you look at the patches I just posted, only addresses which are in
> link-local and not in permanent state will be flushed.
> 
> I also need to research how to approach this safely; I don't know
> yet.
> 
> Bye,
> Hannes
> 
> 

^ permalink raw reply

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
From: John Fastabend @ 2015-01-13 15:24 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Hannes Frederic Sowa, netdev, danny.zhou, nhorman, john.ronciak,
	brouer
In-Reply-To: <54B51BEC.2040809@redhat.com>

On 01/13/2015 05:21 AM, Daniel Borkmann wrote:
> On 01/13/2015 01:35 PM, Hannes Frederic Sowa wrote:
>> On Mo, 2015-01-12 at 20:35 -0800, John Fastabend wrote:
> ...
>>> +/* setsockopt takes addr, size, direction parameters, getsockopt takes
>>> + * iova, size, direction.
>>> + * */
>>> +struct tpacket_dma_mem_region {
>>> +    void *addr;        /* userspace virtual address */
>>> +    __u64 phys_addr;    /* physical address */
>>> +    __u64 iova;        /* IO virtual address used for DMA */
>>> +    unsigned long size;    /* size of region */
>>> +    int direction;        /* dma data direction */
>>> +};
>>
>> Have you tested this with 32 bit user space and 32 bit kernel, too?
>> I don't have any problem with only supporting 64 bit kernels for this
>> feature, but looking through the code I wonder if we handle the __u64
>> addresses correctly in all situations.

We still need to test/implement this. I'm going to guess there is some
more work needed for this to work correctly.

>
> Given this is placed into uapi and transferred via setsockopt(2), this
> would also need some form of compat handling, also for the case of mixed
> environments (e.g. 64 bit kernel, 32 bit user space).

noted, thanks!

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings
From: John Fastabend @ 2015-01-13 15:46 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, danny.zhou, nhorman, john.ronciak, hannes, brouer
In-Reply-To: <54B52B25.1070809@redhat.com>

On 01/13/2015 06:26 AM, Daniel Borkmann wrote:
> On 01/13/2015 05:35 AM, John Fastabend wrote:
> ...
>> +static int ixgbe_ndo_split_queue_pairs(struct net_device *dev,
>> +                       unsigned int start_from,
>> +                       unsigned int qpairs_num,
>> +                       struct sock *sk)
>> +{
>> +    struct ixgbe_adapter *adapter = netdev_priv(dev);
>> +    unsigned int qpair_index;
>
> We should probably return -EINVAL, still from within the setsockopt
> call when qpairs_num is 0?

yep,

>
>> +    /* allocate whatever available qpairs */
>> +    if (start_from == -1) {
>
> I guess we should define the notion of auto-select into a uapi
> define instead of -1, which might not be overly obvious.

Certainly not obvious; it should be defined in the UAPI.
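
Something along these lines, as a sketch only. TP_QPAIRS_START_ANY and
qpairs_request_valid() are placeholder names, not part of the posted
patch, and the range check is written so that start + num cannot wrap.

```c
#include <limits.h>

/* Placeholder name for the "let the driver pick" start index. */
#define TP_QPAIRS_START_ANY	UINT_MAX

/* Returns 1 when the request is well formed for a device with
 * max_queues queue pairs, 0 otherwise. */
static int qpairs_request_valid(unsigned int start_from,
				unsigned int qpairs_num,
				unsigned int max_queues)
{
	if (qpairs_num == 0)
		return 0;	/* nothing requested: reject early */
	if (start_from == TP_QPAIRS_START_ANY)
		return qpairs_num <= max_queues;
	/* Compare against the remaining room instead of adding first, so
	 * start_from + qpairs_num can never overflow. */
	return start_from < max_queues &&
	       qpairs_num <= max_queues - start_from;
}
```

A check like this could sit in the setsockopt path, which would also
cover the qpairs_num == 0 case mentioned above.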

>
> Anyway, extending Documentation/networking/packet_mmap.txt with
> API details/examples at least for a non-RFC version is encouraged. ;)

Yep, for the non-RFC version I'll add an example to packet_mmap.txt.

>
>> +        unsigned int count = 0;
>> +
>> +        for (qpair_index = adapter->num_rx_queues;
>> +             qpair_index < MAX_RX_QUEUES;
>> +             qpair_index++) {
>> +            if (!adapter->user_queue_info[qpair_index].sk_handle) {
>> +                count++;
>> +                if (count == qpairs_num) {
>> +                    start_from = qpair_index - count + 1;
>> +                    break;
>> +                }
>> +            } else {
>> +                count = 0;
>> +            }
>> +        }
>> +    }
>> +
>> +    /* otherwise the caller specified exact queues */
>> +    if ((start_from > MAX_TX_QUEUES) ||
>> +        (start_from > MAX_RX_QUEUES) ||
>> +        (start_from + qpairs_num > MAX_TX_QUEUES) ||
>> +        (start_from + qpairs_num > MAX_RX_QUEUES))
>> +        return -EINVAL;
>
> Shouldn't this be '>=' if I see this correctly?

Hmm, I think this is correct: the device allocates MAX_TX_QUEUES, so the
queue space index is (0, MAX_TX_QUEUES - 1). So MAX_TX_QUEUES and
MAX_RX_QUEUES would be an invalid access below,

>
>> +    /* If the qpairs are being used by the driver do not let user space
>> +     * consume the queues. Also if the queue has already been allocated
>> +     * to a socket do fail the request.
>> +     */
>> +    for (qpair_index = start_from;
>> +         qpair_index < start_from + qpairs_num;
>> +         qpair_index++) {
>> +        if ((qpair_index < adapter->num_tx_queues) ||
>> +            (qpair_index < adapter->num_rx_queues))
>> +            return -EINVAL;
>> +
>> +        if (adapter->user_queue_info[qpair_index].sk_handle)
>> +            return -EBUSY;
>> +    }
>> +
>> +    /* remember the sk handle for each queue pair */
>> +    for (qpair_index = start_from;
>> +         qpair_index < start_from + qpairs_num;
>> +         qpair_index++) {
>> +        adapter->user_queue_info[qpair_index].sk_handle = sk;
>> +        adapter->user_queue_info[qpair_index].num_of_regions = 0;
                                     ^^^^^^^^^^^^^^
                                     (0, MAX_TX_QUEUES - 1)

@@ -673,6 +687,9 @@ Hunk #2, a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
struct ixgbe_adapter {

         struct ixgbe_q_vector *q_vector[MAX_Q_VECTORS];

+       /* Direct User Space Queues */
+       struct ixgbe_user_queue_info user_queue_info[MAX_RX_QUEUES];
+



>> +    }
>> +
>> +    return 0;
>> +}
>
> I guess many drivers would need to implement similar code, do you see
> a chance to move generic parts to the core, at least for some helper
> functions?

I'm not entirely sure about this. It depends on how the driver manages
its queue space. Many of the 10Gbps devices, it seems, could use similar
logic so it might make sense. Other drivers might conjure the queues
out of some other bank of queues. If I'm looking @ the devices I have
here, i40e at least would manage the queues slightly differently than
this, I think.

>
> Thanks,
> Daniel


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [patch net-next] net: sched: fix skb->protocol use in case of accelerated vlan path
From: Jiri Pirko @ 2015-01-13 15:48 UTC (permalink / raw)
  To: netdev; +Cc: davem, jhs
In-Reply-To: <1421057975-1754-1-git-send-email-jiri@resnulli.us>

Dave, I will send v2 with minor correction. Please drop this one for
now.

Mon, Jan 12, 2015 at 11:19:35AM CET, jiri@resnulli.us wrote:
>tc code implicitly considers skb->protocol even in case of accelerated
>vlan paths and expects vlan protocol type here. However, on rx path,
>if the vlan header was already stripped, skb->protocol contains value
>of next header. Similar situation is on tx path.
>
>So for skbs that use skb->vlan_tci for tagging, use skb->vlan_proto instead.
>
>Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
>Signed-off-by: Jiri Pirko <jiri@resnulli.us>
>Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
>---
>
>Note that this is present since vlan accel was introduced, pre-git times.
>Please consider this for stable.
>
> include/net/pkt_sched.h | 12 ++++++++++++
> net/sched/act_csum.c    |  2 +-
> net/sched/cls_flow.c    |  8 ++++----
> net/sched/em_ipset.c    |  2 +-
> net/sched/em_meta.c     |  2 +-
> net/sched/sch_api.c     |  2 +-
> net/sched/sch_dsmark.c  |  6 +++---
> net/sched/sch_teql.c    |  4 ++--
> 8 files changed, 25 insertions(+), 13 deletions(-)
>
>diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
>index 27a3383..cd590f7 100644
>--- a/include/net/pkt_sched.h
>+++ b/include/net/pkt_sched.h
>@@ -3,6 +3,7 @@
> 
> #include <linux/jiffies.h>
> #include <linux/ktime.h>
>+#include <linux/if_vlan.h>
> #include <net/sch_generic.h>
> 
> struct qdisc_walker {
>@@ -114,6 +115,17 @@ int tc_classify_compat(struct sk_buff *skb, const struct tcf_proto *tp,
> int tc_classify(struct sk_buff *skb, const struct tcf_proto *tp,
> 		struct tcf_result *res);
> 
>+static inline __be16 tc_skb_protocol(struct sk_buff *skb)
>+{
>+	/* We need to take extra care in case the skb came via
>+	 * vlan accelerated path. In that case, use skb->vlan_proto
>+	 * as the original vlan header was already stripped.
>+	 */
>+	if (vlan_tx_tag_present(skb))
>+		return skb->vlan_proto;
>+	return skb->protocol;
>+}
>+
> /* Calculate maximal size of packet seen by hard_start_xmit
>    routine of this device.
>  */
>diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
>index edbf40d..4cd5cf1 100644
>--- a/net/sched/act_csum.c
>+++ b/net/sched/act_csum.c
>@@ -509,7 +509,7 @@ static int tcf_csum(struct sk_buff *skb,
> 	if (unlikely(action == TC_ACT_SHOT))
> 		goto drop;
> 
>-	switch (skb->protocol) {
>+	switch (tc_skb_protocol(skb)) {
> 	case cpu_to_be16(ETH_P_IP):
> 		if (!tcf_csum_ipv4(skb, update_flags))
> 			goto drop;
>diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
>index 15d68f2..4614103 100644
>--- a/net/sched/cls_flow.c
>+++ b/net/sched/cls_flow.c
>@@ -77,7 +77,7 @@ static u32 flow_get_dst(const struct sk_buff *skb, const struct flow_keys *flow)
> {
> 	if (flow->dst)
> 		return ntohl(flow->dst);
>-	return addr_fold(skb_dst(skb)) ^ (__force u16)skb->protocol;
>+	return addr_fold(skb_dst(skb)) ^ (__force u16) tc_skb_protocol(skb);
> }
> 
> static u32 flow_get_proto(const struct sk_buff *skb, const struct flow_keys *flow)
>@@ -98,7 +98,7 @@ static u32 flow_get_proto_dst(const struct sk_buff *skb, const struct flow_keys
> 	if (flow->ports)
> 		return ntohs(flow->port16[1]);
> 
>-	return addr_fold(skb_dst(skb)) ^ (__force u16)skb->protocol;
>+	return addr_fold(skb_dst(skb)) ^ (__force u16) tc_skb_protocol(skb);
> }
> 
> static u32 flow_get_iif(const struct sk_buff *skb)
>@@ -144,7 +144,7 @@ static u32 flow_get_nfct(const struct sk_buff *skb)
> 
> static u32 flow_get_nfct_src(const struct sk_buff *skb, const struct flow_keys *flow)
> {
>-	switch (skb->protocol) {
>+	switch (tc_skb_protocol(skb)) {
> 	case htons(ETH_P_IP):
> 		return ntohl(CTTUPLE(skb, src.u3.ip));
> 	case htons(ETH_P_IPV6):
>@@ -156,7 +156,7 @@ fallback:
> 
> static u32 flow_get_nfct_dst(const struct sk_buff *skb, const struct flow_keys *flow)
> {
>-	switch (skb->protocol) {
>+	switch (tc_skb_protocol(skb)) {
> 	case htons(ETH_P_IP):
> 		return ntohl(CTTUPLE(skb, dst.u3.ip));
> 	case htons(ETH_P_IPV6):
>diff --git a/net/sched/em_ipset.c b/net/sched/em_ipset.c
>index 5b4a4ef..a3d79c8 100644
>--- a/net/sched/em_ipset.c
>+++ b/net/sched/em_ipset.c
>@@ -59,7 +59,7 @@ static int em_ipset_match(struct sk_buff *skb, struct tcf_ematch *em,
> 	struct net_device *dev, *indev = NULL;
> 	int ret, network_offset;
> 
>-	switch (skb->protocol) {
>+	switch (tc_skb_protocol(skb)) {
> 	case htons(ETH_P_IP):
> 		acpar.family = NFPROTO_IPV4;
> 		if (!pskb_network_may_pull(skb, sizeof(struct iphdr)))
>diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
>index c8f8c39..2159981 100644
>--- a/net/sched/em_meta.c
>+++ b/net/sched/em_meta.c
>@@ -197,7 +197,7 @@ META_COLLECTOR(int_priority)
> META_COLLECTOR(int_protocol)
> {
> 	/* Let userspace take care of the byte ordering */
>-	dst->value = skb->protocol;
>+	dst->value = tc_skb_protocol(skb);
> }
> 
> META_COLLECTOR(int_pkttype)
>diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>index 76f402e..243b7d1 100644
>--- a/net/sched/sch_api.c
>+++ b/net/sched/sch_api.c
>@@ -1807,7 +1807,7 @@ done:
> int tc_classify_compat(struct sk_buff *skb, const struct tcf_proto *tp,
> 		       struct tcf_result *res)
> {
>-	__be16 protocol = skb->protocol;
>+	__be16 protocol = tc_skb_protocol(skb);
> 	int err;
> 
> 	for (; tp; tp = rcu_dereference_bh(tp->next)) {
>diff --git a/net/sched/sch_dsmark.c b/net/sched/sch_dsmark.c
>index 227114f..66700a6 100644
>--- a/net/sched/sch_dsmark.c
>+++ b/net/sched/sch_dsmark.c
>@@ -203,7 +203,7 @@ static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch)
> 	pr_debug("%s(skb %p,sch %p,[qdisc %p])\n", __func__, skb, sch, p);
> 
> 	if (p->set_tc_index) {
>-		switch (skb->protocol) {
>+		switch (tc_skb_protocol(skb)) {
> 		case htons(ETH_P_IP):
> 			if (skb_cow_head(skb, sizeof(struct iphdr)))
> 				goto drop;
>@@ -289,7 +289,7 @@ static struct sk_buff *dsmark_dequeue(struct Qdisc *sch)
> 	index = skb->tc_index & (p->indices - 1);
> 	pr_debug("index %d->%d\n", skb->tc_index, index);
> 
>-	switch (skb->protocol) {
>+	switch (tc_skb_protocol(skb)) {
> 	case htons(ETH_P_IP):
> 		ipv4_change_dsfield(ip_hdr(skb), p->mask[index],
> 				    p->value[index]);
>@@ -306,7 +306,7 @@ static struct sk_buff *dsmark_dequeue(struct Qdisc *sch)
> 		 */
> 		if (p->mask[index] != 0xff || p->value[index])
> 			pr_warn("%s: unsupported protocol %d\n",
>-				__func__, ntohs(skb->protocol));
>+				__func__, ntohs(tc_skb_protocol(skb)));
> 		break;
> 	}
> 
>diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
>index 6ada423..2ad0c40 100644
>--- a/net/sched/sch_teql.c
>+++ b/net/sched/sch_teql.c
>@@ -249,8 +249,8 @@ __teql_resolve(struct sk_buff *skb, struct sk_buff *skb_res,
> 		char haddr[MAX_ADDR_LEN];
> 
> 		neigh_ha_snapshot(haddr, n, dev);
>-		err = dev_hard_header(skb, dev, ntohs(skb->protocol), haddr,
>-				      NULL, skb->len);
>+		err = dev_hard_header(skb, dev, ntohs(tc_skb_protocol(skb)),
>+				      haddr, NULL, skb->len);
> 
> 		if (err < 0)
> 			err = -EINVAL;
>-- 
>1.9.3
>

^ permalink raw reply

* Re: [PATCH net-next] rhashtable: Lower/upper bucket may map to same lock while shrinking
From: 'Thomas Graf' @ 2015-01-13 15:56 UTC (permalink / raw)
  To: David Laight
  Cc: davem@davemloft.net, Fengguang Wu, LKP,
	linux-kernel@vger.kernel.org, netfilter-devel@vger.kernel.org,
	coreteam@netfilter.org, netdev@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D1CAC6969@AcuExch.aculab.com>

On 01/13/15 at 03:06pm, David Laight wrote:
> OK, ht->mutex saves the day.
> Might be worth a comment to save people looking at the code in isolation
> from worrying and doing a bit search.
> OTOH it might be obvious from a slightly larger fragment than the diff.

Good idea. Will do this. Also, thanks for the suggestion in the other
email. I now understand what you meant.

^ permalink raw reply

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
From: John Fastabend @ 2015-01-13 15:58 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, danny.zhou, nhorman, john.ronciak, hannes, brouer
In-Reply-To: <54B535E2.7040600@redhat.com>

On 01/13/2015 07:12 AM, Daniel Borkmann wrote:
> On 01/13/2015 05:35 AM, John Fastabend wrote:
> ...
>>   struct net_device_ops {
>>       int            (*ndo_init)(struct net_device *dev);
>> @@ -1190,6 +1240,35 @@ struct net_device_ops {
>>       int            (*ndo_switch_port_stp_update)(struct net_device
>> *dev,
>>                                     u8 state);
>>   #endif
>> +    int            (*ndo_split_queue_pairs)(struct net_device *dev,
>> +                     unsigned int qpairs_start_from,
>> +                     unsigned int qpairs_num,
>> +                     struct sock *sk);
> ...
>> +    int            (*ndo_get_dma_region_info)
>> +                    (struct net_device *dev,
>> +                     struct tpacket_dma_mem_region *region,
>> +                     struct sock *sk);
>>   };
>
> Any slight chance these 8 ndo ops could be further reduced? ;)
>

It's possible we could collapse a few of these calls. I'll see if
we can get it a bit smaller. Another option would be to put a
pointer to the set of ops in the net_device struct. Something
like,

	struct net_device {
		...
		const struct af_packet_hw *afp_ops;
		...
	};

	struct af_packet_hw {
		int (*ndo_split_queue_pairs)(struct net_device *dev,
					     unsigned int qpairs_start_from,
					     unsigned int qpairs_num,
					     struct sock *sk);
		...
	};
		

>>   /**
>> diff --git a/include/uapi/linux/if_packet.h
>> b/include/uapi/linux/if_packet.h
>> index da2d668..eb7a727 100644
>> --- a/include/uapi/linux/if_packet.h
>> +++ b/include/uapi/linux/if_packet.h
> ...
>> +struct tpacket_dev_qpair_map_region_info {
>> +    unsigned int tp_dev_bar_sz;        /* size of BAR */
>> +    unsigned int tp_dev_sysm_sz;        /* size of system memory */
>> +    /* number of contiguous memory on BAR mapping to user space */
>> +    unsigned int tp_num_map_regions;
>> +    /* number of contiguous memory on system mapping to user space */
>> +    unsigned int tp_num_sysm_map_regions;
>> +    struct map_page_region {
>> +        unsigned page_offset;    /* offset to start of region */
>> +        unsigned page_sz;    /* size of page */
>> +        unsigned page_cnt;    /* number of pages */
>
> Please use unsigned int et al, or preferably __u* variants consistently
> in the uapi structs.

I'll turn this all into __u* variants.

[...]

> ...
>> +static int
>>   packet_setsockopt(struct socket *sock, int level, int optname, char
>> __user *optval, unsigned int optlen)
>>   {
>>       struct sock *sk = sock->sk;
>> @@ -3428,6 +3525,167 @@ packet_setsockopt(struct socket *sock, int
>> level, int optname, char __user *optv
>>           po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
>>           return 0;
>>       }
>> +    case PACKET_RXTX_QPAIRS_SPLIT:
>> +    {
> ...
>> +        /* This call only works after a bind call which calls a dev_hold
>> +         * operation so we do not need to increment dev ref counter
>> +         */
>> +        dev = __dev_get_by_index(sock_net(sk), po->ifindex);
>> +        if (!dev)
>> +            return -EINVAL;
>> +        ops = dev->netdev_ops;
>> +        if (!ops->ndo_split_queue_pairs)
>> +            return -EOPNOTSUPP;
>> +
>> +        err =  ops->ndo_split_queue_pairs(dev,
>> +                          qpairs.tp_qpairs_start_from,
>> +                          qpairs.tp_qpairs_num, sk);
>> +        if (!err)
>> +            po->tp_owns_queue_pairs = true;
>
> When this is being set here, above test in packet_release() and the chunk
> quoted below in packet_mmap() are not guaranteed to work since we don't
> test if some ndos are actually implemented by the driver. Seems a bit
> fragile, I'm wondering if we should test this capability as a _whole_,
> iow if all necessary functions to make this work are being provided by the
> driver, e.g. flag the netdev as such and test for that instead.

Sounds good to me, better than scattering ndo checks throughout. Also
with a feature flag administrators could disable it easily.

>
>> +        return err;
>> +    }
>> +    case PACKET_RXTX_QPAIRS_RETURN:
>> +    {
> ...
>> +        dev = __dev_get_by_index(sock_net(sk), po->ifindex);
>> +        if (!dev)
>> +            return -EINVAL;
>> +        ops = dev->netdev_ops;
>> +        if (!ops->ndo_split_queue_pairs)
>> +            return -EOPNOTSUPP;
>
> Should test for ndo_return_queue_pairs.

yep but I like the feature flag idea above.

>
>> +        err =  dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
>> +        if (!err)
>> +            po->tp_owns_queue_pairs = false;
>> +
> ...
>> +    case PACKET_RXTX_QPAIRS_SPLIT:
>> +    {
> ...
>> +        /* This call only works after a bind call which calls a dev_hold
>> +         * operation so we do not need to increment dev ref counter
>> +         */
>> +        dev = __dev_get_by_index(sock_net(sk), po->ifindex);
>> +        if (!dev)
>> +            return -EINVAL;
>> +        if (!dev->netdev_ops->ndo_split_queue_pairs)
>> +            return -EOPNOTSUPP;
>
> Copy-paste (although not quite, since here's no extra ops var). :)
> Should be ndo_get_split_queue_pairs.

yep.

[...]

Thanks for reviewing!

-- 
John Fastabend         Intel Corporation

^ permalink raw reply
