Netdev List


* Re: [PATCHv2 net] team: fix netconsole setup over team
From: David Miller @ 2018-04-24 13:48 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, jiri, stephen, xiyou.wangcong
In-Reply-To: <9c47c72ba2e0a16d61a6d150b5f85adce88de439.1524551617.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Tue, 24 Apr 2018 14:33:37 +0800

> The same fix in Commit dbe173079ab5 ("bridge: fix netconsole
> setup over bridge") is also needed for team driver.
> 
> While at it, remove the unnecessary parameter *team from
> team_port_enable_netpoll().
> 
> v1->v2:
>   - fix it in a better way, as does bridge.
> 
> Fixes: 0fb52a27a04a ("team: cleanup netpoll clode")
> Reported-by: João Avelino Bellomo Filho <jbellomo@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Applied and queued up for -stable, thanks Xin.


* Re: [PATCH bpf-next v4 1/2] bpf: allow map helpers access to map values directly
From: Paul Chaignon via iovisor-dev @ 2018-04-24 13:50 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov,
	netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy
In-Reply-To: <a71a2e2d-ca39-b351-a62d-315c034f1ea1-FeC+5ew28dpmcu3hnIyYJQ@public.gmane.org>

On 04/23/2018 11:18 PM +0200, Daniel Borkmann wrote:
> On 04/22/2018 11:52 PM, Paul Chaignon wrote:
> > Helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE can only
> > access stack and packet memory.  Allow these helpers to directly access
> > map values by passing registers of type PTR_TO_MAP_VALUE.
> > 
> > This change removes the need for an extra copy to the stack when using a
> > map value to perform a second map lookup, as in the following:
> > 
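> > struct bpf_map_def SEC("maps") infobyreq = {
> >     .type = BPF_MAP_TYPE_HASH,
> >     .key_size = sizeof(struct request *),
> >     .value_size = sizeof(struct info_t),
> >     .max_entries = 1024,
> > };
> > struct bpf_map_def SEC("maps") counts = {
> >     .type = BPF_MAP_TYPE_HASH,
> >     .key_size = sizeof(struct info_t),
> >     .value_size = sizeof(u64),
> >     .max_entries = 1024,
> > };
> > SEC("kprobe/blk_account_io_start")
> > int bpf_blk_account_io_start(struct pt_regs *ctx)
> > {
> >     /* Copy the key to the stack; ctx memory cannot be used as a map key. */
> >     struct request *req = (struct request *)ctx->di;
> >     struct info_t *info = bpf_map_lookup_elem(&infobyreq, &req);
> >     u64 *count;
> >
> >     if (!info)  /* lookups may return NULL and must be checked */
> >         return 0;
> >     /* The map value is passed directly as the key of the second lookup. */
> >     count = bpf_map_lookup_elem(&counts, info);
> >     if (count)
> >         (*count)++;
> >     return 0;
> > }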
> > 
> > Signed-off-by: Paul Chaignon <paul.chaignon-C0LM0jrOve7QT0dZR+AlfA@public.gmane.org>
> > ---
> >  kernel/bpf/verifier.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 5dd1dcb902bf..70e00beade03 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -1914,7 +1914,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
> >  	if (arg_type == ARG_PTR_TO_MAP_KEY ||
> >  	    arg_type == ARG_PTR_TO_MAP_VALUE) {
> >  		expected_type = PTR_TO_STACK;
> > -		if (!type_is_pkt_pointer(type) &&
> > +		if (!type_is_pkt_pointer(type) && type != PTR_TO_MAP_VALUE &&
> >  		    type != expected_type)
> >  			goto err_type;
> >  	} else if (arg_type == ARG_CONST_SIZE ||
> > @@ -1970,6 +1970,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
> >  			err = check_packet_access(env, regno, reg->off,
> >  						  meta->map_ptr->key_size,
> >  						  false);
> > +		else if (type == PTR_TO_MAP_VALUE)
> > +			err = check_map_access(env, regno, reg->off,
> > +					       meta->map_ptr->key_size, false);
> >  		else
> >  			err = check_stack_boundary(env, regno,
> >  						   meta->map_ptr->key_size,
> 
> We should reuse check_helper_mem_access() here which covers all three cases
> from above already and simplifies the code a bit.

Thanks for the review.
I've sent a refactored patchset that uses check_helper_mem_access().

> 
> > @@ -1987,6 +1990,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
> >  			err = check_packet_access(env, regno, reg->off,
> >  						  meta->map_ptr->value_size,
> >  						  false);
> > +		else if (type == PTR_TO_MAP_VALUE)
> > +			err = check_map_access(env, regno, reg->off,
> > +					       meta->map_ptr->value_size,
> > +					       false);
> >  		else
> >  			err = check_stack_boundary(env, regno,
> >  						   meta->map_ptr->value_size,
> > 
> 
> Ditto.
> 
> Thanks,
> Daniel


* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: David Miller @ 2018-04-24 13:50 UTC (permalink / raw)
  To: tglx
  Cc: jesus.sanchez-palencia, netdev, jhs, xiyou.wangcong, jiri,
	vinicius.gomes, richardcochran, intel-wired-lan, anna-maria,
	henrik, john.stultz, levi.pearson, edumazet, willemb, mlichvar
In-Reply-To: <alpine.DEB.2.21.1804241012020.5261@nanos.tec.linutronix.de>

From: Thomas Gleixner <tglx@linutronix.de>
Date: Tue, 24 Apr 2018 10:50:04 +0200 (CEST)

> So adding 8 bytes to spare duplicated code will not change the kmem_cache
> object size and I really doubt that anyone will notice.

It's about where the cache lines end up when each and every byte is added
to the structure, not just the slab object size.
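
To illustrate the cache line point, here is a minimal sketch. It assumes a 64-byte cache line and uses a made-up structure (`pkt_meta` and its field names are hypothetical and do not reflect any real kernel layout). Adding an 8-byte field ahead of two hot fields pushes them from the first cache line into the second, whatever happens to the allocation size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Hypothetical layout: the two hot fields close out the first cache line. */
struct pkt_meta {
	uint64_t head[7];  /* 56 bytes of frequently read fields */
	uint32_t hot_len;  /* offset 56 */
	uint32_t hot_flag; /* offset 60, last word of line 0 */
	uint64_t cold[8];  /* rarely touched */
};

/* The same structure with one 8-byte field added before the hot fields. */
struct pkt_meta_plus8 {
	uint64_t head[7];
	uint64_t txtime;   /* the added 8 bytes (hypothetical field) */
	uint32_t hot_len;  /* now at offset 64, i.e. in line 1 */
	uint32_t hot_flag;
	uint64_t cold[8];
};

static_assert(offsetof(struct pkt_meta, hot_flag) + sizeof(uint32_t) == CACHE_LINE,
	      "hot fields end exactly at the first cache line boundary");

/* Index of the cache line that a given byte offset falls into. */
static size_t cache_line_of(size_t offset)
{
	return offset / CACHE_LINE;
}
```

Calling `cache_line_of(offsetof(struct pkt_meta, hot_len))` and the same for `struct pkt_meta_plus8` shows the field moving from one line to the next; tools such as pahole report the same information for real structures.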


* Re: [PATCH] vhost_net: use packet weight for rx handler, too
From: David Miller @ 2018-04-24 14:02 UTC (permalink / raw)
  To: pabeni; +Cc: kvm, haibinzhang, mst, jasowang, virtualization, netdev
In-Reply-To: <11f2a27cee0c660a611af381ac1b68d9526095e3.1524556673.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 24 Apr 2018 10:34:36 +0200

> Similar to commit a2ac99905f1e ("vhost-net: set packet weight of
> tx polling to 2 * vq size"), we need a packet-based limit for
> handle_rx, too - otherwise, under rx flood with small packets,
> tx can be delayed for a very long time, even without busypolling.
> 
> The pkt limit applied to handle_rx must be the same applied by
> handle_tx, or we will get unfair scheduling between rx and tx.
> Tying such limit to the queue length makes it less effective for
> large queue length values and can introduce large process
> scheduler latencies, so a constant value is used - like
> the existing bytes limit.
> 
> The selected limit has been validated with PVP[1] performance
> test with different queue sizes:
> 
> queue size		256	512	1024
> 
> baseline		366	354	362
> weight 128		715	723	670
> weight 256		740	745	733
> weight 512		600	460	583
> weight 1024		423	427	418
> 
> A packet weight of 256 gives peak performance in all the
> tested scenarios.
> 
> No measurable regression in unidirectional performance tests has
> been detected.
> 
> [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/
> 
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Applied to net-next, thanks.
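
The combined limit described in the patch can be sketched in userspace as follows. This is a simulation under stated assumptions, not the vhost_net code: `budget_exceeded()` and `handle_run()` are hypothetical names, the byte budget of 0x80000 is an illustrative value, and only the packet budget of 256 comes from the measurements quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BYTE_WEIGHT 0x80000u /* illustrative byte budget per handler run */
#define PKT_WEIGHT  256u     /* packet budget, the value the tests above favour */

/* True once either budget is used up and the handler should yield. */
static bool budget_exceeded(size_t total_bytes, unsigned int pkts)
{
	return total_bytes >= BYTE_WEIGHT || pkts >= PKT_WEIGHT;
}

/*
 * Simulate one handler run over a queue of n packets of pkt_len bytes each.
 * Returns how many packets were processed before the handler yielded.
 */
static unsigned int handle_run(unsigned int n, size_t pkt_len)
{
	size_t total_bytes = 0;
	unsigned int pkts = 0;

	assert(pkt_len > 0); /* zero-length packets never consume the byte budget */

	while (pkts < n) {
		total_bytes += pkt_len;
		pkts++;
		if (budget_exceeded(total_bytes, pkts))
			break; /* requeue the work so the other direction can run */
	}
	return pkts;
}
```

With 64-byte packets the byte budget alone would allow 8192 packets per run, whereas `handle_run(10000, 64)` stops after 256. With 64 KiB packets the byte budget triggers first, after 8 packets.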


* [PATCH iproute2] ipaddress: strengthen check on 'label' input
From: Patrick Talbert @ 2018-04-24 14:08 UTC (permalink / raw)
  To: netdev; +Cc: stephen

As mentioned in the ip-address man page, an address label must
be equal to the device name or prefixed by the device name
followed by a colon. Currently the only check on this input is
to see if the device name appears at the beginning of the label
string.

This commit adds a check to ensure that the label either equals the
device name or continues with a colon right after it.

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
---
 ip/ipaddress.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index aecc9a1..edcf821 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -2168,9 +2168,14 @@ static int ipaddr_modify(int cmd, int flags, int argc, char **argv)
 		fprintf(stderr, "Not enough information: \"dev\" argument is required.\n");
 		return -1;
 	}
-	if (l && matches(d, l) != 0) {
-		fprintf(stderr, "\"dev\" (%s) must match \"label\" (%s).\n", d, l);
-		return -1;
+	if (l) {
+		size_t d_len = strlen(d);
+
+		if (!(matches(d, l) == 0 && (l[d_len] == '\0' || l[d_len] == ':'))) {
+			fprintf(stderr, "\"label\" (%s) must match \"dev\" (%s) or be prefixed by"
+				" \"dev\" with a colon.\n", l, d);
+			return -1;
+		}
 	}
 
 	if (peer_len == 0 && local_len) {
-- 
1.8.3.1
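
The rule enforced by this patch can be tried out in isolation with the sketch below. It is a standalone approximation: `label_valid()` is a hypothetical helper, and `strncmp()` stands in for the prefix test that iproute2's `matches()` performs in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if label is exactly dev, or dev followed by ':' and anything.
 * strncmp() stands in for iproute2's matches() prefix test here.
 */
static bool label_valid(const char *dev, const char *label)
{
	size_t d_len;

	assert(dev && label);
	d_len = strlen(dev);

	if (strncmp(dev, label, d_len) != 0)
		return false; /* dev is not a prefix of label */

	/* The prefix matched, so label[d_len] is a character or the final NUL. */
	return label[d_len] == '\0' || label[d_len] == ':';
}
```

For device `eth0`, the labels `eth0` and `eth0:1` pass, while `eth01` is rejected even though it starts with the device name, which is the case the old prefix-only check let through.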


* Re: [PATCH 20/39] afs: simplify procfs code
From: Christoph Hellwig @ 2018-04-24 14:12 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Andrew Morton, Alexander Viro, linux-kernel,
	linux-afs, netdev, Alexey Dobriyan
In-Reply-To: <22726.1524227374@warthog.procyon.org.uk>

On Fri, Apr 20, 2018 at 01:29:34PM +0100, David Howells wrote:
> David Howells <dhowells@redhat.com> wrote:
> 
> > > Use remove_proc_subtree to remove the whole subtree on cleanup, and
> > > unwind the registration loop into individual calls.  Switch to use
> > > proc_create_seq where applicable.
> > 
> > Note that this is likely going to clash with my patch to net-namespace all of
> > the afs proc files:
> > 
> > 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/commit/?h=mount-context&id=f60c26c827c073583107ebf19d87bc5c0e71c3d2
> > 
> > If it helps, I should be able to disentangle this from the mount-api changes
> > since the subsequent patch connects the dots to propagate the network
> > namespace over automount using the new fs_context to do it.
> 
> Okay, I'll follow up this mail with a pair of patches to just use network
> namespacing in AFS.  The first exports a function from core code only; the
> second is the actual modifications to AFS.

I don't think you should need any of these.  seq_file_net or
seq_file_single_net will return you the net_ns based on a struct
seq_file.  And even from your write routines you can reach the
seq_file in file->private pretty easily.


* [PATCH v2 net] sfc: ARFS filter IDs
From: Edward Cree @ 2018-04-24 14:14 UTC (permalink / raw)
  To: linux-net-drivers, David Miller; +Cc: netdev

Associate an arbitrary ID with each ARFS filter, so that expiry can be
 queried properly.  The association is maintained in a hash table, which is
 protected by a spinlock.

v2: fixed uninitialised variable (thanks davem and lkp-robot).

Fixes: 3af0f34290f6 ("sfc: replace asynchronous filter operations")
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/ef10.c       |  80 +++++++++++--------
 drivers/net/ethernet/sfc/efx.c        | 143 ++++++++++++++++++++++++++++++++++
 drivers/net/ethernet/sfc/efx.h        |  19 +++++
 drivers/net/ethernet/sfc/farch.c      |  41 ++++++++--
 drivers/net/ethernet/sfc/net_driver.h |  36 +++++++++
 drivers/net/ethernet/sfc/rx.c         |  62 +++++++++++++--
 6 files changed, 335 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 83ce229f4eb7..63036d9bf3e6 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -3999,29 +3999,6 @@ static void efx_ef10_prepare_flr(struct efx_nic *efx)
 	atomic_set(&efx->active_queues, 0);
 }
 
-static bool efx_ef10_filter_equal(const struct efx_filter_spec *left,
-				  const struct efx_filter_spec *right)
-{
-	if ((left->match_flags ^ right->match_flags) |
-	    ((left->flags ^ right->flags) &
-	     (EFX_FILTER_FLAG_RX | EFX_FILTER_FLAG_TX)))
-		return false;
-
-	return memcmp(&left->outer_vid, &right->outer_vid,
-		      sizeof(struct efx_filter_spec) -
-		      offsetof(struct efx_filter_spec, outer_vid)) == 0;
-}
-
-static unsigned int efx_ef10_filter_hash(const struct efx_filter_spec *spec)
-{
-	BUILD_BUG_ON(offsetof(struct efx_filter_spec, outer_vid) & 3);
-	return jhash2((const u32 *)&spec->outer_vid,
-		      (sizeof(struct efx_filter_spec) -
-		       offsetof(struct efx_filter_spec, outer_vid)) / 4,
-		      0);
-	/* XXX should we randomise the initval? */
-}
-
 /* Decide whether a filter should be exclusive or else should allow
  * delivery to additional recipients.  Currently we decide that
  * filters for specific local unicast MAC and IP addresses are
@@ -4346,7 +4323,7 @@ static s32 efx_ef10_filter_insert(struct efx_nic *efx,
 		goto out_unlock;
 	match_pri = rc;
 
-	hash = efx_ef10_filter_hash(spec);
+	hash = efx_filter_spec_hash(spec);
 	is_mc_recip = efx_filter_is_mc_recipient(spec);
 	if (is_mc_recip)
 		bitmap_zero(mc_rem_map, EFX_EF10_FILTER_SEARCH_LIMIT);
@@ -4378,7 +4355,7 @@ static s32 efx_ef10_filter_insert(struct efx_nic *efx,
 		if (!saved_spec) {
 			if (ins_index < 0)
 				ins_index = i;
-		} else if (efx_ef10_filter_equal(spec, saved_spec)) {
+		} else if (efx_filter_spec_equal(spec, saved_spec)) {
 			if (spec->priority < saved_spec->priority &&
 			    spec->priority != EFX_FILTER_PRI_AUTO) {
 				rc = -EPERM;
@@ -4762,27 +4739,62 @@ static s32 efx_ef10_filter_get_rx_ids(struct efx_nic *efx,
 static bool efx_ef10_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id,
 					   unsigned int filter_idx)
 {
+	struct efx_filter_spec *spec, saved_spec;
 	struct efx_ef10_filter_table *table;
-	struct efx_filter_spec *spec;
-	bool ret;
+	struct efx_arfs_rule *rule = NULL;
+	bool ret = true, force = false;
+	u16 arfs_id;
 
 	down_read(&efx->filter_sem);
 	table = efx->filter_state;
 	down_write(&table->lock);
 	spec = efx_ef10_filter_entry_spec(table, filter_idx);
 
-	if (!spec || spec->priority != EFX_FILTER_PRI_HINT) {
-		ret = true;
+	if (!spec || spec->priority != EFX_FILTER_PRI_HINT)
 		goto out_unlock;
-	}
 
-	if (!rps_may_expire_flow(efx->net_dev, spec->dmaq_id, flow_id, 0)) {
-		ret = false;
-		goto out_unlock;
+	spin_lock_bh(&efx->rps_hash_lock);
+	if (!efx->rps_hash_table) {
+		/* In the absence of the table, we always return 0 to ARFS. */
+		arfs_id = 0;
+	} else {
+		rule = efx_rps_hash_find(efx, spec);
+		if (!rule)
+			/* ARFS table doesn't know of this filter, so remove it */
+			goto expire;
+		arfs_id = rule->arfs_id;
+		ret = efx_rps_check_rule(rule, filter_idx, &force);
+		if (force)
+			goto expire;
+		if (!ret) {
+			spin_unlock_bh(&efx->rps_hash_lock);
+			goto out_unlock;
+		}
 	}
-
+	if (!rps_may_expire_flow(efx->net_dev, spec->dmaq_id, flow_id, arfs_id))
+		ret = false;
+	else if (rule)
+		rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING;
+expire:
+	saved_spec = *spec; /* remove operation will kfree spec */
+	spin_unlock_bh(&efx->rps_hash_lock);
+	/* At this point (since we dropped the lock), another thread might queue
+	 * up a fresh insertion request (but the actual insertion will be held
+	 * up by our possession of the filter table lock).  In that case, it
+	 * will set rule->filter_id to EFX_ARFS_FILTER_ID_PENDING, meaning that
+	 * the rule is not removed by efx_rps_hash_del() below.
+	 */
 	ret = efx_ef10_filter_remove_internal(efx, 1U << spec->priority,
 					      filter_idx, true) == 0;
+	/* While we can't safely dereference rule (we dropped the lock), we can
+	 * still test it for NULL.
+	 */
+	if (ret && rule) {
+		/* Expiring, so remove entry from ARFS table */
+		spin_lock_bh(&efx->rps_hash_lock);
+		efx_rps_hash_del(efx, &saved_spec);
+		spin_unlock_bh(&efx->rps_hash_lock);
+	}
 out_unlock:
 	up_write(&table->lock);
 	up_read(&efx->filter_sem);
diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 692dd729ee2a..a4ebd8715494 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -3027,6 +3027,10 @@ static int efx_init_struct(struct efx_nic *efx,
 	mutex_init(&efx->mac_lock);
 #ifdef CONFIG_RFS_ACCEL
 	mutex_init(&efx->rps_mutex);
+	spin_lock_init(&efx->rps_hash_lock);
+	/* Failure to allocate is not fatal, but may degrade ARFS performance */
+	efx->rps_hash_table = kcalloc(EFX_ARFS_HASH_TABLE_SIZE,
+				      sizeof(*efx->rps_hash_table), GFP_KERNEL);
 #endif
 	efx->phy_op = &efx_dummy_phy_operations;
 	efx->mdio.dev = net_dev;
@@ -3070,6 +3074,10 @@ static void efx_fini_struct(struct efx_nic *efx)
 {
 	int i;
 
+#ifdef CONFIG_RFS_ACCEL
+	kfree(efx->rps_hash_table);
+#endif
+
 	for (i = 0; i < EFX_MAX_CHANNELS; i++)
 		kfree(efx->channel[i]);
 
@@ -3092,6 +3100,141 @@ void efx_update_sw_stats(struct efx_nic *efx, u64 *stats)
 	stats[GENERIC_STAT_rx_noskb_drops] = atomic_read(&efx->n_rx_noskb_drops);
 }
 
+bool efx_filter_spec_equal(const struct efx_filter_spec *left,
+			   const struct efx_filter_spec *right)
+{
+	if ((left->match_flags ^ right->match_flags) |
+	    ((left->flags ^ right->flags) &
+	     (EFX_FILTER_FLAG_RX | EFX_FILTER_FLAG_TX)))
+		return false;
+
+	return memcmp(&left->outer_vid, &right->outer_vid,
+		      sizeof(struct efx_filter_spec) -
+		      offsetof(struct efx_filter_spec, outer_vid)) == 0;
+}
+
+u32 efx_filter_spec_hash(const struct efx_filter_spec *spec)
+{
+	BUILD_BUG_ON(offsetof(struct efx_filter_spec, outer_vid) & 3);
+	return jhash2((const u32 *)&spec->outer_vid,
+		      (sizeof(struct efx_filter_spec) -
+		       offsetof(struct efx_filter_spec, outer_vid)) / 4,
+		      0);
+}
+
+#ifdef CONFIG_RFS_ACCEL
+bool efx_rps_check_rule(struct efx_arfs_rule *rule, unsigned int filter_idx,
+			bool *force)
+{
+	if (rule->filter_id == EFX_ARFS_FILTER_ID_PENDING) {
+		/* ARFS is currently updating this entry, leave it */
+		return false;
+	}
+	if (rule->filter_id == EFX_ARFS_FILTER_ID_ERROR) {
+		/* ARFS tried and failed to update this, so it's probably out
+		 * of date.  Remove the filter and the ARFS rule entry.
+		 */
+		rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING;
+		*force = true;
+		return true;
+	} else if (WARN_ON(rule->filter_id != filter_idx)) { /* can't happen */
+		/* ARFS has moved on, so old filter is not needed.  Since we did
+		 * not mark the rule with EFX_ARFS_FILTER_ID_REMOVING, it will
+		 * not be removed by efx_rps_hash_del() subsequently.
+		 */
+		*force = true;
+		return true;
+	}
+	/* Remove it iff ARFS wants to. */
+	return true;
+}
+
+struct hlist_head *efx_rps_hash_bucket(struct efx_nic *efx,
+				       const struct efx_filter_spec *spec)
+{
+	u32 hash = efx_filter_spec_hash(spec);
+
+	WARN_ON(!spin_is_locked(&efx->rps_hash_lock));
+	if (!efx->rps_hash_table)
+		return NULL;
+	return &efx->rps_hash_table[hash % EFX_ARFS_HASH_TABLE_SIZE];
+}
+
+struct efx_arfs_rule *efx_rps_hash_find(struct efx_nic *efx,
+					const struct efx_filter_spec *spec)
+{
+	struct efx_arfs_rule *rule;
+	struct hlist_head *head;
+	struct hlist_node *node;
+
+	head = efx_rps_hash_bucket(efx, spec);
+	if (!head)
+		return NULL;
+	hlist_for_each(node, head) {
+		rule = container_of(node, struct efx_arfs_rule, node);
+		if (efx_filter_spec_equal(spec, &rule->spec))
+			return rule;
+	}
+	return NULL;
+}
+
+struct efx_arfs_rule *efx_rps_hash_add(struct efx_nic *efx,
+				       const struct efx_filter_spec *spec,
+				       bool *new)
+{
+	struct efx_arfs_rule *rule;
+	struct hlist_head *head;
+	struct hlist_node *node;
+
+	head = efx_rps_hash_bucket(efx, spec);
+	if (!head)
+		return NULL;
+	hlist_for_each(node, head) {
+		rule = container_of(node, struct efx_arfs_rule, node);
+		if (efx_filter_spec_equal(spec, &rule->spec)) {
+			*new = false;
+			return rule;
+		}
+	}
+	rule = kmalloc(sizeof(*rule), GFP_ATOMIC);
+	*new = true;
+	if (rule) {
+		memcpy(&rule->spec, spec, sizeof(rule->spec));
+		hlist_add_head(&rule->node, head);
+	}
+	return rule;
+}
+
+void efx_rps_hash_del(struct efx_nic *efx, const struct efx_filter_spec *spec)
+{
+	struct efx_arfs_rule *rule;
+	struct hlist_head *head;
+	struct hlist_node *node;
+
+	head = efx_rps_hash_bucket(efx, spec);
+	if (WARN_ON(!head))
+		return;
+	hlist_for_each(node, head) {
+		rule = container_of(node, struct efx_arfs_rule, node);
+		if (efx_filter_spec_equal(spec, &rule->spec)) {
+			/* Someone already reused the entry.  We know that if
+			 * this check doesn't fire (i.e. filter_id == REMOVING)
+			 * then the REMOVING mark was put there by our caller,
+			 * because caller is holding a lock on filter table and
+			 * only holders of that lock set REMOVING.
+			 */
+			if (rule->filter_id != EFX_ARFS_FILTER_ID_REMOVING)
+				return;
+			hlist_del(node);
+			kfree(rule);
+			return;
+		}
+	}
+	/* We didn't find it. */
+	WARN_ON(1);
+}
+#endif
+
 /* RSS contexts.  We're using linked lists and crappy O(n) algorithms, because
  * (a) this is an infrequent control-plane operation and (b) n is small (max 64)
  */
diff --git a/drivers/net/ethernet/sfc/efx.h b/drivers/net/ethernet/sfc/efx.h
index a3140e16fcef..6b4164b6d938 100644
--- a/drivers/net/ethernet/sfc/efx.h
+++ b/drivers/net/ethernet/sfc/efx.h
@@ -186,6 +186,25 @@ static inline void efx_filter_rfs_expire(struct work_struct *data) {}
 #endif
 bool efx_filter_is_mc_recipient(const struct efx_filter_spec *spec);
 
+bool efx_filter_spec_equal(const struct efx_filter_spec *left,
+			   const struct efx_filter_spec *right);
+u32 efx_filter_spec_hash(const struct efx_filter_spec *spec);
+
+bool efx_rps_check_rule(struct efx_arfs_rule *rule, unsigned int filter_idx,
+			bool *force);
+
+struct efx_arfs_rule *efx_rps_hash_find(struct efx_nic *efx,
+					const struct efx_filter_spec *spec);
+
+/* @new is written to indicate if entry was newly added (true) or if an old
+ * entry was found and returned (false).
+ */
+struct efx_arfs_rule *efx_rps_hash_add(struct efx_nic *efx,
+				       const struct efx_filter_spec *spec,
+				       bool *new);
+
+void efx_rps_hash_del(struct efx_nic *efx, const struct efx_filter_spec *spec);
+
 /* RSS contexts */
 struct efx_rss_context *efx_alloc_rss_context_entry(struct efx_nic *efx);
 struct efx_rss_context *efx_find_rss_context_entry(struct efx_nic *efx, u32 id);
diff --git a/drivers/net/ethernet/sfc/farch.c b/drivers/net/ethernet/sfc/farch.c
index 7174ef5e5c5e..c72adf8b52ea 100644
--- a/drivers/net/ethernet/sfc/farch.c
+++ b/drivers/net/ethernet/sfc/farch.c
@@ -2905,18 +2905,45 @@ bool efx_farch_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id,
 {
 	struct efx_farch_filter_state *state = efx->filter_state;
 	struct efx_farch_filter_table *table;
-	bool ret = false;
+	bool ret = false, force = false;
+	u16 arfs_id;
 
 	down_write(&state->lock);
+	spin_lock_bh(&efx->rps_hash_lock);
 	table = &state->table[EFX_FARCH_FILTER_TABLE_RX_IP];
 	if (test_bit(index, table->used_bitmap) &&
-	    table->spec[index].priority == EFX_FILTER_PRI_HINT &&
-	    rps_may_expire_flow(efx->net_dev, table->spec[index].dmaq_id,
-				flow_id, 0)) {
-		efx_farch_filter_table_clear_entry(efx, table, index);
-		ret = true;
+	    table->spec[index].priority == EFX_FILTER_PRI_HINT) {
+		struct efx_arfs_rule *rule = NULL;
+		struct efx_filter_spec spec;
+
+		efx_farch_filter_to_gen_spec(&spec, &table->spec[index]);
+		if (!efx->rps_hash_table) {
+			/* In the absence of the table, we always returned 0 to
+			 * ARFS, so use the same to query it.
+			 */
+			arfs_id = 0;
+		} else {
+			rule = efx_rps_hash_find(efx, &spec);
+			if (!rule) {
+				/* ARFS table doesn't know of this filter, remove it */
+				force = true;
+			} else {
+				arfs_id = rule->arfs_id;
+				if (!efx_rps_check_rule(rule, index, &force))
+					goto out_unlock;
+			}
+		}
+		if (force || rps_may_expire_flow(efx->net_dev, spec.dmaq_id,
+						 flow_id, arfs_id)) {
+			if (rule)
+				rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING;
+			efx_rps_hash_del(efx, &spec);
+			efx_farch_filter_table_clear_entry(efx, table, index);
+			ret = true;
+		}
 	}
-
+out_unlock:
+	spin_unlock_bh(&efx->rps_hash_lock);
 	up_write(&state->lock);
 	return ret;
 }
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index eea3808b3f25..65568925c3ef 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -734,6 +734,35 @@ struct efx_rss_context {
 };
 
 #ifdef CONFIG_RFS_ACCEL
+/* Order of these is important, since filter_id >= %EFX_ARFS_FILTER_ID_PENDING
+ * is used to test if filter does or will exist.
+ */
+#define EFX_ARFS_FILTER_ID_PENDING	-1
+#define EFX_ARFS_FILTER_ID_ERROR	-2
+#define EFX_ARFS_FILTER_ID_REMOVING	-3
+/**
+ * struct efx_arfs_rule - record of an ARFS filter and its IDs
+ * @node: linkage into hash table
+ * @spec: details of the filter (used as key for hash table).  Use efx->type to
+ *	determine which member to use.
+ * @rxq_index: channel to which the filter will steer traffic.
+ * @arfs_id: filter ID which was returned to ARFS
+ * @filter_id: index in software filter table.  May be
+ *	%EFX_ARFS_FILTER_ID_PENDING if filter was not inserted yet,
+ *	%EFX_ARFS_FILTER_ID_ERROR if filter insertion failed, or
+ *	%EFX_ARFS_FILTER_ID_REMOVING if expiry is currently removing the filter.
+ */
+struct efx_arfs_rule {
+	struct hlist_node node;
+	struct efx_filter_spec spec;
+	u16 rxq_index;
+	u16 arfs_id;
+	s32 filter_id;
+};
+
+/* Size chosen so that the table is one page (4kB) */
+#define EFX_ARFS_HASH_TABLE_SIZE	512
+
 /**
  * struct efx_async_filter_insertion - Request to asynchronously insert a filter
  * @net_dev: Reference to the netdevice
@@ -873,6 +902,10 @@ struct efx_async_filter_insertion {
  *	@rps_expire_channel's @rps_flow_id
  * @rps_slot_map: bitmap of in-flight entries in @rps_slot
  * @rps_slot: array of ARFS insertion requests for efx_filter_rfs_work()
+ * @rps_hash_lock: Protects ARFS filter mapping state (@rps_hash_table and
+ *	@rps_next_id).
+ * @rps_hash_table: Mapping between ARFS filters and their various IDs
+ * @rps_next_id: next arfs_id for an ARFS filter
  * @active_queues: Count of RX and TX queues that haven't been flushed and drained.
  * @rxq_flush_pending: Count of number of receive queues that need to be flushed.
  *	Decremented when the efx_flush_rx_queue() is called.
@@ -1029,6 +1062,9 @@ struct efx_nic {
 	unsigned int rps_expire_index;
 	unsigned long rps_slot_map;
 	struct efx_async_filter_insertion rps_slot[EFX_RPS_MAX_IN_FLIGHT];
+	spinlock_t rps_hash_lock;
+	struct hlist_head *rps_hash_table;
+	u32 rps_next_id;
 #endif
 
 	atomic_t active_queues;
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 9c593c661cbf..64a94f242027 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -834,9 +834,29 @@ static void efx_filter_rfs_work(struct work_struct *data)
 	struct efx_nic *efx = netdev_priv(req->net_dev);
 	struct efx_channel *channel = efx_get_channel(efx, req->rxq_index);
 	int slot_idx = req - efx->rps_slot;
+	struct efx_arfs_rule *rule;
+	u16 arfs_id = 0;
 	int rc;
 
 	rc = efx->type->filter_insert(efx, &req->spec, true);
+	if (efx->rps_hash_table) {
+		spin_lock_bh(&efx->rps_hash_lock);
+		rule = efx_rps_hash_find(efx, &req->spec);
+		/* The rule might have already gone, if someone else's request
+		 * for the same spec was already worked and then expired before
+		 * we got around to our work.  In that case we have nothing
+		 * tying us to an arfs_id, meaning that as soon as the filter
+		 * is considered for expiry it will be removed.
+		 */
+		if (rule) {
+			if (rc < 0)
+				rule->filter_id = EFX_ARFS_FILTER_ID_ERROR;
+			else
+				rule->filter_id = rc;
+			arfs_id = rule->arfs_id;
+		}
+		spin_unlock_bh(&efx->rps_hash_lock);
+	}
 	if (rc >= 0) {
 		/* Remember this so we can check whether to expire the filter
 		 * later.
@@ -848,18 +868,18 @@ static void efx_filter_rfs_work(struct work_struct *data)
 
 		if (req->spec.ether_type == htons(ETH_P_IP))
 			netif_info(efx, rx_status, efx->net_dev,
-				   "steering %s %pI4:%u:%pI4:%u to queue %u [flow %u filter %d]\n",
+				   "steering %s %pI4:%u:%pI4:%u to queue %u [flow %u filter %d id %u]\n",
 				   (req->spec.ip_proto == IPPROTO_TCP) ? "TCP" : "UDP",
 				   req->spec.rem_host, ntohs(req->spec.rem_port),
 				   req->spec.loc_host, ntohs(req->spec.loc_port),
-				   req->rxq_index, req->flow_id, rc);
+				   req->rxq_index, req->flow_id, rc, arfs_id);
 		else
 			netif_info(efx, rx_status, efx->net_dev,
-				   "steering %s [%pI6]:%u:[%pI6]:%u to queue %u [flow %u filter %d]\n",
+				   "steering %s [%pI6]:%u:[%pI6]:%u to queue %u [flow %u filter %d id %u]\n",
 				   (req->spec.ip_proto == IPPROTO_TCP) ? "TCP" : "UDP",
 				   req->spec.rem_host, ntohs(req->spec.rem_port),
 				   req->spec.loc_host, ntohs(req->spec.loc_port),
-				   req->rxq_index, req->flow_id, rc);
+				   req->rxq_index, req->flow_id, rc, arfs_id);
 	}
 
 	/* Release references */
@@ -872,8 +892,10 @@ int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
 	struct efx_async_filter_insertion *req;
+	struct efx_arfs_rule *rule;
 	struct flow_keys fk;
 	int slot_idx;
+	bool new;
 	int rc;
 
 	/* find a free slot */
@@ -926,12 +948,42 @@ int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
 	req->spec.rem_port = fk.ports.src;
 	req->spec.loc_port = fk.ports.dst;
 
+	if (efx->rps_hash_table) {
+		/* Add it to ARFS hash table */
+		spin_lock(&efx->rps_hash_lock);
+		rule = efx_rps_hash_add(efx, &req->spec, &new);
+		if (!rule) {
+			rc = -ENOMEM;
+			goto out_unlock;
+		}
+		if (new)
+			rule->arfs_id = efx->rps_next_id++ % RPS_NO_FILTER;
+		rc = rule->arfs_id;
+		/* Skip if existing or pending filter already does the right thing */
+		if (!new && rule->rxq_index == rxq_index &&
+		    rule->filter_id >= EFX_ARFS_FILTER_ID_PENDING)
+			goto out_unlock;
+		rule->rxq_index = rxq_index;
+		rule->filter_id = EFX_ARFS_FILTER_ID_PENDING;
+		spin_unlock(&efx->rps_hash_lock);
+	} else {
+		/* Without an ARFS hash table, we just use arfs_id 0 for all
+		 * filters.  This means if multiple flows hash to the same
+		 * flow_id, all but the most recently touched will be eligible
+		 * for expiry.
+		 */
+		rc = 0;
+	}
+
+	/* Queue the request */
 	dev_hold(req->net_dev = net_dev);
 	INIT_WORK(&req->work, efx_filter_rfs_work);
 	req->rxq_index = rxq_index;
 	req->flow_id = flow_id;
 	schedule_work(&req->work);
-	return 0;
+	return rc;
+out_unlock:
+	spin_unlock(&efx->rps_hash_lock);
 out_clear:
 	clear_bit(slot_idx, &efx->rps_slot_map);
 	return rc;
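
Two details of the patch above are easy to miss: the ID allocation never returns RPS_NO_FILTER, and the negative filter_id states are ordered so that a single comparison answers "does the filter exist, or will it?". The userspace sketch below isolates both. The names are shortened and hypothetical, locking and the hash table are left out, and RPS_NO_FILTER is assumed to be 0xffff as in include/linux/netdevice.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RPS_NO_FILTER      0xffff /* assumed value; never handed out as an ID */
#define FILTER_ID_PENDING  (-1)   /* insertion queued, not yet in the table */
#define FILTER_ID_ERROR    (-2)   /* insertion failed */
#define FILTER_ID_REMOVING (-3)   /* expiry is removing the filter */

struct arfs_rule {
	uint16_t rxq_index; /* queue the filter steers to */
	uint16_t arfs_id;   /* ID that was returned to ARFS */
	int32_t  filter_id; /* >= 0: table index; < 0: one of the states above */
};

/* Next ID to return to ARFS; the modulo keeps RPS_NO_FILTER out of the range. */
static uint16_t next_arfs_id(uint32_t *next_id)
{
	assert(next_id);
	return (uint16_t)((*next_id)++ % RPS_NO_FILTER);
}

/* True if the filter exists or is queued; relies on PENDING being the largest state. */
static bool filter_exists_or_pending(const struct arfs_rule *rule)
{
	return rule->filter_id >= FILTER_ID_PENDING;
}

/* True if a steering request duplicates what an existing rule already does. */
static bool can_skip_request(const struct arfs_rule *rule, bool is_new,
			     uint16_t rxq_index)
{
	return !is_new && rule->rxq_index == rxq_index &&
	       filter_exists_or_pending(rule);
}
```

A rule in the ERROR or REMOVING state compares below PENDING, so a repeated request for the same flow is queued again instead of being skipped.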


* Re: [PATCH 26/39] rtc/proc: switch to proc_create_single_data
From: Christoph Hellwig @ 2018-04-24 14:15 UTC (permalink / raw)
  To: Alexandre Belloni
  Cc: Christoph Hellwig, Andrew Morton, Alexander Viro, Alexey Dobriyan,
	Greg Kroah-Hartman, Jiri Slaby, Corey Minyard, Alessandro Zummo,
	linux-acpi, drbd-dev, linux-ide, netdev, linux-rtc,
	megaraidlinux.pdl, linux-scsi, devel, linux-afs, linux-ext4,
	jfs-discussion, netfilter-devel, linux-kernel
In-Reply-To: <20180419131027.GC7369@piout.net>

On Thu, Apr 19, 2018 at 03:10:27PM +0200, Alexandre Belloni wrote:
> On 19/04/2018 14:41:27+0200, Christoph Hellwig wrote:
> > And stop trying to get a reference on the submodule, procfs code deals
> > with release after an unloaded module and thus removed proc entry.
> > 
> 
> Are you sure about that? The rtc module is not the one adding the procfs
> file so I'm not sure how the procfs code can handle it.

The proc file is removed from this call chain:

  <driver>_exit (module_exit handler)
    -> rtc_device_unregister
      -> rtc_proc_del_device
        -> remove_proc_entry

remove_proc_entry takes care of waiting for currently active file
operation instances and makes sure every new operation never calls
into the actual proc file ops.  Same behavior as in RTC exists all
over the kernel.


* Re: [PATCH 16/39] ipmi: simplify procfs code
From: Christoph Hellwig @ 2018-04-24 14:16 UTC (permalink / raw)
  To: Corey Minyard
  Cc: Christoph Hellwig, Andrew Morton, Alexander Viro, Alexey Dobriyan,
	Greg Kroah-Hartman, Jiri Slaby, Alessandro Zummo,
	Alexandre Belloni, linux-acpi, drbd-dev, linux-ide, netdev,
	linux-rtc, megaraidlinux.pdl, linux-scsi, devel, linux-afs,
	linux-ext4, jfs-discussion, netfilter-devel, linux-kernel
In-Reply-To: <f322f243-9ab1-7e9f-00a4-9652cd288ca2@acm.org>

On Thu, Apr 19, 2018 at 10:29:29AM -0500, Corey Minyard wrote:
> On 04/19/2018 07:41 AM, Christoph Hellwig wrote:
>> Use remove_proc_subtree to remove the whole subtree on cleanup instead
>> of a hand rolled list of proc entries, unwind the registration loop into
>> individual calls.  Switch to use proc_create_single to further simplify
>> the code.
>
> I'm yanking all the proc code out of the IPMI driver in 3.18.  So this is
> probably not necessary.

Ok, I'll drop this patch.

^ permalink raw reply

* Re: [PATCH net-next V3 0/3] Introduce adaptive TX interrupt moderation to net DIM
From: David Miller @ 2018-04-24 14:18 UTC (permalink / raw)
  To: talgi; +Cc: netdev, tariqt, saeedm, f.fainelli, andrew.gospodarek
In-Reply-To: <1524566163-41563-1-git-send-email-talgi@mellanox.com>

From: Tal Gilboa <talgi@mellanox.com>
Date: Tue, 24 Apr 2018 13:36:00 +0300

> Net DIM is a library designed for dynamic interrupt moderation. It was
> implemented and optimized with receive side interrupts in mind, since these
> are usually the CPU expensive ones. This patch-set introduces adaptive transmit
> interrupt moderation to net DIM, complete with a usage in the mlx5e driver.
> Using adaptive TX behavior would reduce interrupt rate for multiple scenarios.
> Furthermore, it is essential for increasing bandwidth on cases where payload
> aggregation is required.
> 
> v3: Remove "inline" from functions in .c files (requested by DaveM). Revert
> adding "enabled" field from struct net_dim and applied mlx5e structural
> suggestions (suggested by SaeedM).
> 
> v2: Rebase over proper tree.
> 
> v1: Fix compilation issues due to missed function renaming.

I have no problem with this, series applied, thanks.

Although I have to say that I've always been suspicious of adaptive moderation
schemes, especially if implemented in software.

My thinking was that at these kinds of link speeds, the conditions of the link
change so fast that whatever state you've measured changes by the time you
commit new settings to the chip.

It obviously helps, so I must be missing some piece of the puzzle in my mental
analysis :-)

^ permalink raw reply

* Re: [PATCH 03/39] proc: introduce proc_create_seq_private
From: Christoph Hellwig @ 2018-04-24 14:19 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Christoph Hellwig, Andrew Morton, Alexander Viro, linux-rtc,
	Alessandro Zummo, Alexandre Belloni, devel, linux-kernel,
	linux-scsi, Corey Minyard, linux-ide, Greg Kroah-Hartman,
	jfs-discussion, linux-afs, linux-acpi, netdev, netfilter-devel,
	Jiri Slaby, linux-ext4, Alexey Dobriyan, megaraidlinux.pdl,
	drbd-dev
In-Reply-To: <20180419141818.pjys7at4xmz2h6ho@mwanda>

On Thu, Apr 19, 2018 at 05:18:18PM +0300, Dan Carpenter wrote:
> > -static const struct file_operations cio_ignore_proc_fops = {
> > -	.open    = cio_ignore_proc_open,
> > -	.read    = seq_read,
> > -	.llseek  = seq_lseek,
> > -	.release = seq_release_private,
> > -	.write   = cio_ignore_write,
>                    ^^^^^^^^^^^^^^^^
> The cio_ignore_write() function isn't used any more so compilers will
> complain.

No compiler in the buildbot farm complained, but nevertheless this

^ permalink raw reply

* Re: simplify procfs code for seq_file instances
From: Christoph Hellwig @ 2018-04-24 14:23 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-rtc, Alessandro Zummo, Alexandre Belloni, devel,
	linux-kernel, linux-scsi, Corey Minyard, linux-ide,
	Greg Kroah-Hartman, jfs-discussion, linux-afs, linux-acpi, netdev,
	netfilter-devel, Alexander Viro, Jiri Slaby, Andrew Morton,
	linux-ext4, Christoph Hellwig, megaraidlinux.pdl, drbd-dev
In-Reply-To: <20180419185750.GD2066@avx2>

On Thu, Apr 19, 2018 at 09:57:50PM +0300, Alexey Dobriyan wrote:
> >     git://git.infradead.org/users/hch/misc.git proc_create
> 
> 
> I want to ask if it is time to start using poor man's function overloading
> with _b_c_e(). There are millions of allocation functions for example,
> all slightly different, and people will add more. Seeing /proc interfaces
> doubled like this is painful.

Function overloading is totally unacceptable.

And I very much disagree with a tradeoff that keeps 5000 lines of 
code vs a few new helpers.

^ permalink raw reply

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: kbuild test robot @ 2018-04-24 14:27 UTC (permalink / raw)
  To: Björn Töpel
  Cc: kbuild-all, bjorn.topel, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev, Björn Töpel,
	michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang
In-Reply-To: <20180423135619.7179-3-bjorn.topel@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 939 bytes --]

Hi Björn,

I love your patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Bj-rn-T-pel/Introducing-AF_XDP-support/20180424-085240
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: m68k-allyesconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   net/xdp/xdp_umem.o: In function `xdp_umem_reg':
>> xdp_umem.c:(.text+0x200): undefined reference to `__udivdi3'
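
For context: `__udivdi3` is the libgcc helper that GCC emits for unsigned
64-bit division on 32-bit targets.  The kernel does not link against libgcc,
so an undefined reference like this usually points at a plain `/` applied to
a 64-bit value; kernel code normally uses helpers such as div_u64() from
<linux/math64.h> instead.  Which expression in xdp_umem_reg triggers it is
not shown in this report.  The sketch below is a hypothetical userspace
stand-in (sketch_div_u64_rem is not a kernel function) showing a 64-by-32
division that never applies `/` or `%` to a 64-bit operand.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a 64-by-32 division helper: binary long
 * division using only shifts, compares and subtraction.
 * The caller must not pass a zero divisor.
 */
static uint64_t sketch_div_u64_rem(uint64_t dividend, uint32_t divisor,
				   uint32_t *remainder)
{
	uint64_t quotient = 0;
	uint64_t rem = 0;
	int bit;

	for (bit = 63; bit >= 0; bit--) {
		/* Bring down the next bit of the dividend. */
		rem = (rem << 1) | ((dividend >> bit) & 1);
		if (rem >= divisor) {
			rem -= divisor;
			quotient |= (uint64_t)1 << bit;
		}
	}
	*remainder = (uint32_t)rem;
	return quotient;
}
```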

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 45403 bytes --]

^ permalink raw reply

* Re: [PATCH 03/39] proc: introduce proc_create_seq_private
From: Christoph Hellwig @ 2018-04-24 14:29 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-rtc, Alessandro Zummo, Alexandre Belloni, devel,
	linux-kernel, linux-scsi, Corey Minyard, linux-ide,
	Greg Kroah-Hartman, jfs-discussion, linux-afs, linux-acpi, netdev,
	netfilter-devel, Alexander Viro, Jiri Slaby, Andrew Morton,
	linux-ext4, Christoph Hellwig, megaraidlinux.pdl, drbd-dev
In-Reply-To: <20180419185027.GC2066@avx2>

On Thu, Apr 19, 2018 at 09:50:27PM +0300, Alexey Dobriyan wrote:
> On Thu, Apr 19, 2018 at 02:41:04PM +0200, Christoph Hellwig wrote:
> > Variant of proc_create_data that directly takes a struct seq_operations
> 
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -45,6 +45,7 @@ struct proc_dir_entry {
> >  	const struct inode_operations *proc_iops;
> >  	const struct file_operations *proc_fops;
> >  	const struct seq_operations *seq_ops;
> > +	size_t state_size;
> 
> "unsigned int" please.
> 
> Where have you seen 4GB priv states?

We're passing the result of sizeof, which happens to be a size_t.
But if it makes you happy I can switch to unsigned int.

^ permalink raw reply

* Re: [PATCH 02/39] proc: introduce proc_create_seq{,_data}
From: Christoph Hellwig @ 2018-04-24 14:29 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Christoph Hellwig, Andrew Morton, Alexander Viro,
	Greg Kroah-Hartman, Jiri Slaby, Corey Minyard, Alessandro Zummo,
	Alexandre Belloni, linux-acpi, drbd-dev, linux-ide, netdev,
	linux-rtc, megaraidlinux.pdl, linux-scsi, devel, linux-afs,
	linux-ext4, jfs-discussion, netfilter-devel, linux-kernel
In-Reply-To: <20180419184106.GA2066@avx2>

On Thu, Apr 19, 2018 at 09:41:06PM +0300, Alexey Dobriyan wrote:
> Should be oopsable.
> Once proc_create_data() returns, entry is live, ->open can be called.

Ok, switching to opencoding proc_create_data instead.

^ permalink raw reply

* [PATCH RFC 0/9] veth: Driver XDP
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This patch set introduces driver XDP for veth.
It is mainly intended to be used in conjunction with the redirect action of
another XDP program.

  NIC -----------> veth===veth
 (XDP) (redirect)        (XDP)

In this case xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than generic XDP.

The envisioned use cases are:

* Container managed XDP program
Container host redirects frames to containers by XDP redirect action, and
privileged containers can deploy their own XDP programs.

* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp frames to veth.

* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.

With a single core and simple XDP programs which only redirect and drop
packets, I got a 10.5 Mpps redirect/drop rate with an i40e 25G NIC + veth.

XXV710 (i40e) --- (XDP redirect) --> veth===veth (XDP drop)

This changeset uses NAPI to implement ndo_xdp_xmit and XDP_TX/REDIRECT,
mainly because I wanted to avoid stack inflation caused by recursive calls
of XDP programs.

As an RFC, this does not yet implement the recently introduced
xdp_adjust_tail, and it is based on top of Jesper's redirect memory return
API patch set (684009d4fdaf).
Any feedback is welcome. Thanks!

Toshiaki Makita (9):
  net: Export skb_headers_offset_update and skb_copy_header
  veth: Add driver XDP
  veth: Avoid drops by oversized packets when XDP is enabled
  veth: Use NAPI for XDP
  veth: Handle xdp_frame in xdp napi ring
  veth: Add ndo_xdp_xmit
  veth: Add XDP TX and REDIRECT
  veth: Avoid per-packet spinlock of XDP napi ring on dequeueing
  veth: Avoid per-packet spinlock of XDP napi ring on enqueueing

 drivers/net/veth.c     | 688 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/filter.h |  16 ++
 include/linux/skbuff.h |   2 +
 net/core/filter.c      |  11 +-
 net/core/skbuff.c      |  12 +-
 5 files changed, 699 insertions(+), 30 deletions(-)

-- 
2.14.3

^ permalink raw reply

* [PATCH RFC 1/9] net: Export skb_headers_offset_update and skb_copy_header
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 include/linux/skbuff.h |  2 ++
 net/core/skbuff.c      | 12 +++++++-----
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9065477ed255..fdf80a9d4582 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1030,6 +1030,8 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
+void skb_headers_offset_update(struct sk_buff *skb, int off);
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t priority);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 345b51837ca8..531354900177 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1290,7 +1290,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(skb_clone);
 
-static void skb_headers_offset_update(struct sk_buff *skb, int off)
+void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
 	/* Only adjust this if it actually is csum_start rather than csum */
 	if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -1304,8 +1304,9 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_network_header += off;
 	skb->inner_mac_header += off;
 }
+EXPORT_SYMBOL(skb_headers_offset_update);
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
@@ -1313,6 +1314,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(skb_copy_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
@@ -1355,7 +1357,7 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 
 	BUG_ON(skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 	return n;
 }
 EXPORT_SYMBOL(skb_copy);
@@ -1419,7 +1421,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 		skb_clone_fraglist(n);
 	}
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 out:
 	return n;
 }
@@ -1599,7 +1601,7 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	BUG_ON(skb_copy_bits(skb, -head_copy_len, n->head + head_copy_off,
 			     skb->len + head_copy_len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 
 	skb_headers_offset_update(n, newheadroom - oldheadroom);
 
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC 2/9] veth: Add driver XDP
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This is a basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of skbs,
so this generally does the same thing as generic XDP.

By itself this is not very useful, but it is a starting point for
implementing other useful veth XDP features like TX and REDIRECT.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 210 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 205 insertions(+), 5 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a69ad39ee57e..9c4197306716 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,10 +19,15 @@
 #include <net/xfrm.h>
 #include <linux/veth.h>
 #include <linux/module.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/bpf_trace.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+
 struct pcpu_vstats {
 	u64			packets;
 	u64			bytes;
@@ -30,9 +35,11 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+	struct bpf_prog __rcu	*xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
 	unsigned		requested_headroom;
+	struct xdp_rxq_info	xdp_rxq;
 };
 
 /*
@@ -98,6 +105,25 @@ static const struct ethtool_ops veth_ethtool_ops = {
 	.get_link_ksettings	= veth_get_link_ksettings,
 };
 
+/* general routines */
+
+static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
+					struct sk_buff *skb);
+
+static int veth_xdp_rx(struct net_device *dev, struct sk_buff *skb)
+{
+	skb = veth_xdp_rcv_skb(dev, skb);
+	if (!skb)
+		return NET_RX_DROP;
+
+	return netif_rx(skb);
+}
+
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb)
+{
+	return __dev_forward_skb(dev, skb) ?: veth_xdp_rx(dev, skb);
+}
+
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -111,7 +137,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
-	if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+	if (likely(veth_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -126,10 +152,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
-/*
- * general routines
- */
-
 static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -179,19 +201,152 @@ static void veth_set_multicast_list(struct net_device *dev)
 {
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+				      int buflen)
+{
+	struct sk_buff *skb;
+
+	if (!buflen) {
+		buflen = SKB_DATA_ALIGN(headroom + len) +
+			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	}
+	skb = build_skb(head, buflen);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, headroom);
+	skb_put(skb, len);
+
+	return skb;
+}
+
+static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
+					struct sk_buff *skb)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	u32 pktlen, headroom, act, metalen;
+	int size, mac_len, delta, off;
+	struct bpf_prog *xdp_prog;
+	struct xdp_buff xdp;
+	void *orig_data;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(priv->xdp_prog);
+	if (!xdp_prog) {
+		rcu_read_unlock();
+		goto out;
+	}
+
+	mac_len = skb->data - skb_mac_header(skb);
+	pktlen = skb->len + mac_len;
+	size = SKB_DATA_ALIGN(VETH_XDP_HEADROOM + pktlen) +
+	       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	if (size > PAGE_SIZE)
+		goto drop;
+
+	headroom = skb_headroom(skb) - mac_len;
+	if (skb_shared(skb) || skb_head_is_locked(skb) ||
+	    skb_is_nonlinear(skb) || headroom < XDP_PACKET_HEADROOM) {
+		struct sk_buff *nskb;
+		void *head, *start;
+		struct page *page;
+		int head_off;
+
+		page = alloc_page(GFP_ATOMIC);
+		if (!page)
+			goto drop;
+
+		head = page_address(page);
+		start = head + VETH_XDP_HEADROOM;
+		if (skb_copy_bits(skb, -mac_len, start, pktlen)) {
+			page_frag_free(head);
+			goto drop;
+		}
+
+		nskb = veth_build_skb(head,
+				      VETH_XDP_HEADROOM + mac_len, skb->len,
+				      PAGE_SIZE);
+		if (!nskb) {
+			page_frag_free(head);
+			goto drop;
+		}
+
+		skb_copy_header(nskb, skb);
+		head_off = skb_headroom(nskb) - skb_headroom(skb);
+		skb_headers_offset_update(nskb, head_off);
+		dev_consume_skb_any(skb);
+		skb = nskb;
+	}
+
+	xdp.data_hard_start = skb->head;
+	xdp.data = skb_mac_header(skb);
+	xdp.data_end = xdp.data + pktlen;
+	xdp.data_meta = xdp.data;
+	xdp.rxq = &priv->xdp_rxq;
+	orig_data = xdp.data;
+
+	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+	switch (act) {
+	case XDP_PASS:
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+	case XDP_ABORTED:
+		trace_xdp_exception(dev, xdp_prog, act);
+	case XDP_DROP:
+		goto drop;
+	}
+	rcu_read_unlock();
+
+	delta = orig_data - xdp.data;
+	off = mac_len + delta;
+	if (off > 0)
+		__skb_push(skb, off);
+	else if (off < 0)
+		__skb_pull(skb, -off);
+	skb->mac_header -= delta;
+	skb->protocol = eth_type_trans(skb, dev);
+
+	metalen = xdp.data - xdp.data_meta;
+	if (metalen)
+		skb_metadata_set(skb, metalen);
+out:
+	return skb;
+drop:
+	rcu_read_unlock();
+	dev_kfree_skb_any(skb);
+	return NULL;
+}
+
 static int veth_open(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
 	struct net_device *peer = rtnl_dereference(priv->peer);
+	int err;
 
 	if (!peer)
 		return -ENOTCONN;
 
+	err = xdp_rxq_info_reg(&priv->xdp_rxq, dev, 0);
+	if (err < 0)
+		return err;
+
+	err = xdp_rxq_info_reg_mem_model(&priv->xdp_rxq,
+					 MEM_TYPE_PAGE_SHARED, NULL);
+	if (err < 0)
+		goto err_reg_mem;
+
 	if (peer->flags & IFF_UP) {
 		netif_carrier_on(dev);
 		netif_carrier_on(peer);
 	}
+
 	return 0;
+err_reg_mem:
+	xdp_rxq_info_unreg(&priv->xdp_rxq);
+
+	return err;
 }
 
 static int veth_close(struct net_device *dev)
@@ -203,6 +358,8 @@ static int veth_close(struct net_device *dev)
 	if (peer)
 		netif_carrier_off(peer);
 
+	xdp_rxq_info_unreg(&priv->xdp_rxq);
+
 	return 0;
 }
 
@@ -276,6 +433,48 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 	rcu_read_unlock();
 }
 
+static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
+			struct netlink_ext_ack *extack)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct bpf_prog *old_prog;
+
+	old_prog = rtnl_dereference(priv->xdp_prog);
+
+	rcu_assign_pointer(priv->xdp_prog, prog);
+
+	if (old_prog)
+		bpf_prog_put(old_prog);
+
+	return 0;
+}
+
+static u32 veth_xdp_query(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	const struct bpf_prog *xdp_prog;
+
+	xdp_prog = rtnl_dereference(priv->xdp_prog);
+	if (xdp_prog)
+		return xdp_prog->aux->id;
+
+	return 0;
+}
+
+static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return veth_xdp_set(dev, xdp->prog, xdp->extack);
+	case XDP_QUERY_PROG:
+		xdp->prog_id = veth_xdp_query(dev);
+		xdp->prog_attached = !!xdp->prog_id;
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -290,6 +489,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_get_iflink		= veth_get_iflink,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
+	.ndo_bpf		= veth_xdp,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC 3/9] veth: Avoid drops by oversized packets when XDP is enabled
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

All oversized packets, including GSO packets, are dropped if XDP is
enabled on the receiver side, so don't send such packets from the peer.

Drop the TSO and SCTP fragmentation features so that veth devices
themselves segment packets when XDP is enabled. Also cap the MTU accordingly.
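
As a worked example of the MTU cap computed in veth_xdp_set() in this patch,
the userspace sketch below reproduces the arithmetic.  The helper names are
hypothetical and the concrete numbers are assumptions for illustration only:
a 4096-byte page, 256 bytes of XDP headroom with NET_IP_ALIGN taken as 0,
a 14-byte Ethernet header, and a 320-byte struct skb_shared_info aligned to
64-byte cache lines.  Real values depend on architecture and configuration.

```c
#include <assert.h>

/* Round x up to a multiple of align (align must be a power of two). */
static unsigned int sketch_align_up(unsigned int x, unsigned int align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Largest MTU such that headroom, link-layer header, payload and the
 * aligned shared-info area all fit in a single page.
 */
static unsigned int sketch_veth_xdp_max_mtu(unsigned int page_size,
					    unsigned int xdp_headroom,
					    unsigned int hard_header_len,
					    unsigned int shinfo_size,
					    unsigned int cache_line)
{
	return page_size - xdp_headroom - hard_header_len -
	       sketch_align_up(shinfo_size, cache_line);
}
```

With the assumed values, sketch_veth_xdp_max_mtu(4096, 256, 14, 320, 64)
evaluates to 4096 - 256 - 14 - 320 = 3506 bytes.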

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9c4197306716..7271d9582b4a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -410,6 +410,23 @@ static int veth_get_iflink(const struct net_device *dev)
 	return iflink;
 }
 
+static netdev_features_t veth_fix_features(struct net_device *dev,
+					   netdev_features_t features)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer;
+
+	peer = rtnl_dereference(priv->peer);
+	if (peer) {
+		struct veth_priv *peer_priv = netdev_priv(peer);
+
+		if (rtnl_dereference(peer_priv->xdp_prog))
+			features &= ~NETIF_F_GSO_SOFTWARE;
+	}
+
+	return features;
+}
+
 static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 {
 	struct veth_priv *peer_priv, *priv = netdev_priv(dev);
@@ -438,13 +455,32 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 {
 	struct veth_priv *priv = netdev_priv(dev);
 	struct bpf_prog *old_prog;
+	struct net_device *peer;
 
 	old_prog = rtnl_dereference(priv->xdp_prog);
+	peer = rtnl_dereference(priv->peer);
+
+	if (!old_prog && prog && peer) {
+		peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+		peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+			peer->hard_header_len -
+			SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+		if (peer->mtu > peer->max_mtu)
+			dev_set_mtu(peer, peer->max_mtu);
+	}
 
 	rcu_assign_pointer(priv->xdp_prog, prog);
 
-	if (old_prog)
+	if (old_prog) {
 		bpf_prog_put(old_prog);
+		if (!prog && peer) {
+			peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+			peer->max_mtu = ETH_MAX_MTU;
+		}
+	}
+
+	if ((!!old_prog ^ !!prog) && peer)
+		netdev_update_features(peer);
 
 	return 0;
 }
@@ -487,6 +523,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_poll_controller	= veth_poll_controller,
 #endif
 	.ndo_get_iflink		= veth_get_iflink,
+	.ndo_fix_features	= veth_fix_features,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC 4/9] veth: Use NAPI for XDP
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

In order to avoid stack inflation caused by recursive XDP program calls
from ndo_xdp_xmit, this change introduces NAPI in veth.

Add veth's own NAPI handler when XDP is enabled.
Use ptr_ring to emulate a NIC ring. The Tx function enqueues packets to
the ring and the peer's NAPI handler drains the ring.
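
The enqueue/drain split described above can be sketched in userspace as
follows.  This is a minimal single-threaded model, not the ptr_ring
implementation: the sketch_* names and the ring size of 4 are hypothetical
(the patch uses VETH_RING_SIZE 256), and locking is omitted.  Like ptr_ring,
it treats a NULL slot as empty.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_RING_SIZE 4	/* hypothetical size for illustration */

struct sketch_ring {
	void *slot[SKETCH_RING_SIZE];	/* NULL means the slot is empty */
	unsigned int head;		/* next slot to produce into */
	unsigned int tail;		/* next slot to consume from */
};

/* Producer side (transmit path): returns -1 when the ring is full. */
static int sketch_produce(struct sketch_ring *r, void *ptr)
{
	if (r->slot[r->head])
		return -1;
	r->slot[r->head] = ptr;
	r->head = (r->head + 1) % SKETCH_RING_SIZE;
	return 0;
}

/* Consumer side: returns NULL when the ring is empty. */
static void *sketch_consume(struct sketch_ring *r)
{
	void *ptr = r->slot[r->tail];

	if (ptr) {
		r->slot[r->tail] = NULL;
		r->tail = (r->tail + 1) % SKETCH_RING_SIZE;
	}
	return ptr;
}

/* Poll handler: process at most `budget` entries and report how many. */
static int sketch_ring_poll(struct sketch_ring *r, int budget)
{
	int done = 0;

	while (done < budget && sketch_consume(r))
		done++;
	return done;
}
```

Because the transmit path only enqueues and returns, the receive-side work
runs later from the poll handler on a fresh stack, which is what removes the
recursion.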

This approach also keeps the REDIRECT bulk interface simple. When
ndo_xdp_xmit is implemented later, ndo_xdp_flush will schedule NAPI on the
peer veth device, and NAPI will handle the xdp frames enqueued by previous
ndo_xdp_xmit calls. This is quite similar to a physical NIC's tx function
using DMA ring descriptors and an MMIO doorbell.
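
The "doorbell" handshake between the flush side and the poll side can be
modelled roughly as below.  This is a sketch under assumptions: it uses C11
sequentially consistent atomics in place of the kernel's memory barriers,
a counter stands in for the ring contents, and all sketch_* names are
hypothetical.  schedule_calls is plain bookkeeping for the demonstration and
is not thread-safe.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_rx {
	atomic_int pending;		/* stands in for entries in the ring */
	atomic_bool notify_masked;	/* true while a poll is scheduled */
	int schedule_calls;		/* how often the doorbell was rung */
};

static void sketch_rx_init(struct sketch_rx *rx)
{
	atomic_init(&rx->pending, 0);
	atomic_init(&rx->notify_masked, false);
	rx->schedule_calls = 0;
}

/* Producer: queue work first, then ring the doorbell only if unmasked. */
static void sketch_xmit_and_flush(struct sketch_rx *rx)
{
	atomic_fetch_add(&rx->pending, 1);
	if (!atomic_load(&rx->notify_masked)) {
		atomic_store(&rx->notify_masked, true);
		rx->schedule_calls++;
	}
}

/* Consumer: drain up to budget; when done early, unmask and re-check. */
static int sketch_rx_poll(struct sketch_rx *rx, int budget)
{
	int done = 0;

	while (done < budget && atomic_load(&rx->pending) > 0) {
		atomic_fetch_sub(&rx->pending, 1);
		done++;
	}
	if (done < budget) {
		/* Unmask before re-reading the queue so no entry is missed. */
		atomic_store(&rx->notify_masked, false);
		if (atomic_load(&rx->pending) > 0) {
			/* Work arrived after the drain: reschedule. */
			atomic_store(&rx->notify_masked, true);
			rx->schedule_calls++;
		}
	}
	return done;
}
```

The point of the ordering is that the producer writes the queue before
reading the flag, while the consumer writes the flag before re-reading the
queue, so at least one side always notices pending work.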

Currently only one ring is allocated for each veth device, so it does
not scale in multiqueue environments. This can be resolved in the future
by allocating rings on a per-queue basis.

Note that when XDP is not loaded, netif_rx is used instead of NAPI,
so this does not change the default behaviour.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 197 ++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 164 insertions(+), 33 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7271d9582b4a..452771f31c30 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -21,11 +21,13 @@
 #include <linux/module.h>
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/ptr_ring.h>
 #include <linux/bpf_trace.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
 struct pcpu_vstats {
@@ -35,10 +37,14 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+	struct napi_struct	xdp_napi;
+	struct net_device	*dev;
 	struct bpf_prog __rcu	*xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
 	unsigned		requested_headroom;
+	bool			rx_notify_masked;
+	struct ptr_ring		xdp_ring;
 	struct xdp_rxq_info	xdp_rxq;
 };
 
@@ -107,28 +113,56 @@ static const struct ethtool_ops veth_ethtool_ops = {
 
 /* general routines */
 
-static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
-					struct sk_buff *skb);
+static void veth_ptr_free(void *ptr)
+{
+	if (!ptr)
+		return;
+	dev_kfree_skb_any(ptr);
+}
 
-static int veth_xdp_rx(struct net_device *dev, struct sk_buff *skb)
+static void veth_xdp_flush(struct veth_priv *priv)
 {
-	skb = veth_xdp_rcv_skb(dev, skb);
-	if (!skb)
+	/* Write ptr_ring before reading rx_notify_masked */
+	smp_mb();
+	if (!priv->rx_notify_masked) {
+		priv->rx_notify_masked = true;
+		napi_schedule(&priv->xdp_napi);
+	}
+}
+
+static int veth_xdp_enqueue(struct veth_priv *priv, void *ptr)
+{
+	if (unlikely(ptr_ring_produce(&priv->xdp_ring, ptr)))
+		return -ENOSPC;
+
+	return 0;
+}
+
+static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+{
+	if (unlikely(veth_xdp_enqueue(priv, skb))) {
+		dev_kfree_skb_any(skb);
 		return NET_RX_DROP;
+	}
 
-	return netif_rx(skb);
+	return NET_RX_SUCCESS;
 }
 
-static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb)
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool xdp)
 {
-	return __dev_forward_skb(dev, skb) ?: veth_xdp_rx(dev, skb);
+	struct veth_priv *priv = netdev_priv(dev);
+
+	return __dev_forward_skb(dev, skb) ?: xdp ?
+		veth_xdp_rx(priv, skb) :
+		netif_rx(skb);
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-	struct veth_priv *priv = netdev_priv(dev);
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct net_device *rcv;
 	int length = skb->len;
+	bool rcv_xdp = false;
 
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
@@ -137,7 +171,10 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
-	if (likely(veth_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+	rcv_priv = netdev_priv(rcv);
+	rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+
+	if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -148,7 +185,13 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 drop:
 		atomic64_inc(&priv->dropped);
 	}
+
+	/* TODO: check xmit_more and tx_stopped */
+	if (rcv_xdp)
+		veth_xdp_flush(rcv_priv);
+
 	rcu_read_unlock();
+
 	return NETDEV_TX_OK;
 }
 
@@ -220,10 +263,9 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
-static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
+static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 					struct sk_buff *skb)
 {
-	struct veth_priv *priv = netdev_priv(dev);
 	u32 pktlen, headroom, act, metalen;
 	int size, mac_len, delta, off;
 	struct bpf_prog *xdp_prog;
@@ -293,7 +335,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
 	default:
 		bpf_warn_invalid_xdp_action(act);
 	case XDP_ABORTED:
-		trace_xdp_exception(dev, xdp_prog, act);
+		trace_xdp_exception(priv->dev, xdp_prog, act);
 	case XDP_DROP:
 		goto drop;
 	}
@@ -306,7 +348,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
 	else if (off < 0)
 		__skb_pull(skb, -off);
 	skb->mac_header -= delta;
-	skb->protocol = eth_type_trans(skb, dev);
+	skb->protocol = eth_type_trans(skb, priv->dev);
 
 	metalen = xdp.data - xdp.data_meta;
 	if (metalen)
@@ -319,6 +361,72 @@ static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
 	return NULL;
 }
 
+static int veth_xdp_rcv(struct veth_priv *priv, int budget)
+{
+	int i, done = 0;
+
+	for (i = 0; i < budget; i++) {
+		void *ptr = ptr_ring_consume(&priv->xdp_ring);
+		struct sk_buff *skb;
+
+		if (!ptr)
+			break;
+
+		skb = veth_xdp_rcv_skb(priv, ptr);
+
+		if (skb)
+			napi_gro_receive(&priv->xdp_napi, skb);
+
+		done++;
+	}
+
+	return done;
+}
+
+static int veth_poll(struct napi_struct *napi, int budget)
+{
+	struct veth_priv *priv =
+		container_of(napi, struct veth_priv, xdp_napi);
+	int done;
+
+	done = veth_xdp_rcv(priv, budget);
+
+	if (done < budget && napi_complete_done(napi, done)) {
+		/* Write rx_notify_masked before reading ptr_ring */
+		smp_store_mb(priv->rx_notify_masked, false);
+		if (unlikely(!ptr_ring_empty(&priv->xdp_ring))) {
+			priv->rx_notify_masked = true;
+			napi_schedule(&priv->xdp_napi);
+		}
+	}
+
+	return done;
+}
+
+static int veth_napi_add(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	int err;
+
+	err = ptr_ring_init(&priv->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
+	if (err)
+		return err;
+
+	netif_napi_add(dev, &priv->xdp_napi, veth_poll, NAPI_POLL_WEIGHT);
+	napi_enable(&priv->xdp_napi);
+
+	return 0;
+}
+
+static void veth_napi_del(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	napi_disable(&priv->xdp_napi);
+	netif_napi_del(&priv->xdp_napi);
+	ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
+}
+
 static int veth_open(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -337,6 +445,12 @@ static int veth_open(struct net_device *dev)
 	if (err < 0)
 		goto err_reg_mem;
 
+	if (rtnl_dereference(priv->xdp_prog)) {
+		err = veth_napi_add(dev);
+		if (err)
+			goto err_reg_mem;
+	}
+
 	if (peer->flags & IFF_UP) {
 		netif_carrier_on(dev);
 		netif_carrier_on(peer);
@@ -358,6 +472,9 @@ static int veth_close(struct net_device *dev)
 	if (peer)
 		netif_carrier_off(peer);
 
+	if (rtnl_dereference(priv->xdp_prog))
+		veth_napi_del(dev);
+
 	xdp_rxq_info_unreg(&priv->xdp_rxq);
 
 	return 0;
@@ -384,15 +501,12 @@ static void veth_dev_free(struct net_device *dev)
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void veth_poll_controller(struct net_device *dev)
 {
-	/* veth only receives frames when its peer sends one
-	 * Since it's a synchronous operation, we are guaranteed
-	 * never to have pending data when we poll for it so
-	 * there is nothing to do here.
-	 *
-	 * We need this though so netpoll recognizes us as an interface that
-	 * supports polling, which enables bridge devices in virt setups to
-	 * still use netconsole
-	 */
+	struct veth_priv *priv = netdev_priv(dev);
+
+	rcu_read_lock();
+	if (rcu_access_pointer(priv->xdp_prog))
+		veth_xdp_flush(priv);
+	rcu_read_unlock();
 }
 #endif	/* CONFIG_NET_POLL_CONTROLLER */
 
@@ -456,26 +570,40 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	struct veth_priv *priv = netdev_priv(dev);
 	struct bpf_prog *old_prog;
 	struct net_device *peer;
+	int err;
 
 	old_prog = rtnl_dereference(priv->xdp_prog);
 	peer = rtnl_dereference(priv->peer);
 
-	if (!old_prog && prog && peer) {
-		peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
-		peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
-			peer->hard_header_len -
-			SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-		if (peer->mtu > peer->max_mtu)
-			dev_set_mtu(peer, peer->max_mtu);
+	if (!old_prog && prog) {
+		if (dev->flags & IFF_UP) {
+			err = veth_napi_add(dev);
+			if (err)
+				return err;
+		}
+
+		if (peer) {
+			peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+			peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+				peer->hard_header_len -
+				SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+			if (peer->mtu > peer->max_mtu)
+				dev_set_mtu(peer, peer->max_mtu);
+		}
 	}
 
 	rcu_assign_pointer(priv->xdp_prog, prog);
 
 	if (old_prog) {
 		bpf_prog_put(old_prog);
-		if (!prog && peer) {
-			peer->hw_features |= NETIF_F_GSO_SOFTWARE;
-			peer->max_mtu = ETH_MAX_MTU;
+		if (!prog) {
+			if (dev->flags & IFF_UP)
+				veth_napi_del(dev);
+
+			if (peer) {
+				peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+				peer->max_mtu = ETH_MAX_MTU;
+			}
 		}
 	}
 
@@ -688,10 +816,13 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 	 */
 
 	priv = netdev_priv(dev);
+	priv->dev = dev;
 	rcu_assign_pointer(priv->peer, peer);
 
 	priv = netdev_priv(peer);
+	priv->dev = peer;
 	rcu_assign_pointer(priv->peer, dev);
+
 	return 0;
 
 err_register_dev:
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC 5/9] veth: Handle xdp_frame in xdp napi ring
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This is preparation for XDP TX and ndo_xdp_xmit.

The napi ring now accepts both skbs and xdp_frames. When an xdp_frame
is enqueued, no skb is allocated until the XDP program on veth returns
XDP_PASS. This will speed up XDP processing once ndo_xdp_xmit is
implemented and xdp_frames are enqueued by the peer device.
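
The ring can hold both kinds of pointer because xdp_frame entries are
marked with VETH_XDP_FLAG in the low bit. What follows is only a
minimal user-space sketch of that tagging, not the driver code:
fake_skb, fake_frame and all three function names are hypothetical,
and frame_to_ptr() shows the tagging direction, which this patch does
not itself add.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors VETH_XDP_FLAG; typed as uintptr_t so ~XDP_FLAG is full width. */
#define XDP_FLAG ((uintptr_t)0x1)

/* Hypothetical stand-ins for struct sk_buff and struct xdp_frame. */
struct fake_skb   { int len; };
struct fake_frame { int len; };

/* Mark a frame pointer before storing it in a shared ring. */
void *frame_to_ptr(struct fake_frame *frame)
{
	/* The trick relies on bit 0 being unused, i.e. alignment >= 2. */
	assert(((uintptr_t)frame & XDP_FLAG) == 0);
	return (void *)((uintptr_t)frame | XDP_FLAG);
}

/* True if the ring entry was stored via frame_to_ptr(). */
bool ptr_is_frame(void *ptr)
{
	return ((uintptr_t)ptr & XDP_FLAG) != 0;
}

/* Strip the flag to recover the original frame pointer. */
struct fake_frame *ptr_to_frame(void *ptr)
{
	return (struct fake_frame *)((uintptr_t)ptr & ~XDP_FLAG);
}
```

An untagged skb pointer goes through the ring unchanged, so the
consumer only has to test bit 0 to choose the right handler.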

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 77 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 452771f31c30..89c91c1c9935 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -27,6 +27,7 @@
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_XDP_FLAG		0x1UL
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -48,6 +49,16 @@ struct veth_priv {
 	struct xdp_rxq_info	xdp_rxq;
 };
 
+static bool veth_is_xdp_frame(void *ptr)
+{
+	return (unsigned long)ptr & VETH_XDP_FLAG;
+}
+
+static void *veth_ptr_to_xdp(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
+}
+
 /*
  * ethtool interface
  */
@@ -117,7 +128,14 @@ static void veth_ptr_free(void *ptr)
 {
 	if (!ptr)
 		return;
-	dev_kfree_skb_any(ptr);
+
+	if (veth_is_xdp_frame(ptr)) {
+		struct xdp_frame *frame = veth_ptr_to_xdp(ptr);
+
+		xdp_return_frame(frame);
+	} else {
+		dev_kfree_skb_any(ptr);
+	}
 }
 
 static void veth_xdp_flush(struct veth_priv *priv)
@@ -263,6 +281,60 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+					struct xdp_frame *frame)
+{
+	struct bpf_prog *xdp_prog;
+	unsigned int headroom;
+	struct sk_buff *skb;
+	int len, delta = 0;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(priv->xdp_prog);
+	if (xdp_prog) {
+		struct xdp_buff xdp;
+		u32 act;
+
+		xdp.data_hard_start = frame->data - frame->headroom;
+		xdp.data = frame->data;
+		xdp.data_end = frame->data + frame->len;
+		xdp.data_meta = frame->data - frame->metasize;
+		xdp.rxq = &priv->xdp_rxq;
+
+		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+		switch (act) {
+		case XDP_PASS:
+			delta = frame->data - xdp.data;
+			break;
+		default:
+			bpf_warn_invalid_xdp_action(act);
+		case XDP_ABORTED:
+			trace_xdp_exception(priv->dev, xdp_prog, act);
+		case XDP_DROP:
+			goto err_xdp;
+		}
+	}
+	rcu_read_unlock();
+
+	headroom = frame->data - delta - (void *)frame;
+	len = frame->len + delta;
+	skb = veth_build_skb(frame, headroom, len, 0);
+	if (!skb) {
+		xdp_return_frame(frame);
+		goto err;
+	}
+
+	skb->protocol = eth_type_trans(skb, priv->dev);
+err:
+	return skb;
+err_xdp:
+	rcu_read_unlock();
+	xdp_return_frame(frame);
+
+	return NULL;
+}
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 					struct sk_buff *skb)
 {
@@ -372,7 +444,10 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget)
 		if (!ptr)
 			break;
 
-		skb = veth_xdp_rcv_skb(priv, ptr);
+		if (veth_is_xdp_frame(ptr))
+			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
+		else
+			skb = veth_xdp_rcv_skb(priv, ptr);
 
 		if (skb)
 			napi_gro_receive(&priv->xdp_napi, skb);
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This allows a NIC's XDP program to redirect packets to veth. The
destination veth device enqueues redirected packets to the napi ring of
its peer, and they are then processed by XDP on the peer veth device.
This can be thought of as one XDP program calling another via
REDIRECT, when the peer enables driver XDP.

Note that whether an XDP program is loaded on the redirect target veth
device does not affect how xdp_frames sent by ndo_xdp_xmit are handled,
since the ring sits on the rx (peer) side. What matters is whether an
XDP program is loaded on the peer veth.

When the peer veth device has driver XDP, ndo_xdp_xmit forwards
xdp_frames to the peer without modification.
If not, ndo_xdp_xmit converts xdp_frames to skbs on the sender side and
invokes netif_rx rather than dropping them. Although this will not
perform well, I think dropping redirected packets when XDP is not
loaded on the peer device is too restrictive, so I added this fallback.
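
On the XDP path the frame is checked against the peer with
xdp_ok_fwd_dev() before it is enqueued. As a rough illustration, the
check can be reproduced in user space as below. The sketch_ names and
struct sketch_dev are hypothetical; only the logic follows the helper
in this patch.

```c
#include <errno.h>

#define SKETCH_IFF_UP    0x1	/* hypothetical stand-in for IFF_UP */
#define SKETCH_VLAN_HLEN 4	/* length of one VLAN tag in bytes */

/* Hypothetical subset of struct net_device used by the check. */
struct sketch_dev {
	unsigned int flags;
	unsigned int mtu;
	unsigned int hard_header_len;
};

/* Return 0 if a packet of pktlen bytes may be forwarded to fwd. */
int sketch_ok_fwd_dev(const struct sketch_dev *fwd, unsigned int pktlen)
{
	unsigned int len;

	if (!(fwd->flags & SKETCH_IFF_UP))
		return -ENETDOWN;

	/* Allow the MTU plus link-layer header plus one VLAN tag. */
	len = fwd->mtu + fwd->hard_header_len + SKETCH_VLAN_HLEN;
	if (pktlen > len)
		return -EMSGSIZE;

	return 0;
}
```

With mtu = 1500 and hard_header_len = 14 the limit is 1518 bytes, so a
1518-byte frame is accepted and a 1519-byte frame gets -EMSGSIZE.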

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c     | 73 +++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/filter.h | 16 +++++++++++
 net/core/filter.c      | 11 +-------
 3 files changed, 87 insertions(+), 13 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 89c91c1c9935..b1d591be0eba 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -54,6 +54,11 @@ static bool veth_is_xdp_frame(void *ptr)
 	return (unsigned long)ptr & VETH_XDP_FLAG;
 }
 
+static void *veth_xdp_to_ptr(void *ptr)
+{
+	return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+}
+
 static void *veth_ptr_to_xdp(void *ptr)
 {
 	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
@@ -138,7 +143,7 @@ static void veth_ptr_free(void *ptr)
 	}
 }
 
-static void veth_xdp_flush(struct veth_priv *priv)
+static void __veth_xdp_flush(struct veth_priv *priv)
 {
 	/* Write ptr_ring before reading rx_notify_masked */
 	smp_mb();
@@ -206,7 +211,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	/* TODO: check xmit_more and tx_stopped */
 	if (rcv_xdp)
-		veth_xdp_flush(rcv_priv);
+		__veth_xdp_flush(rcv_priv);
 
 	rcu_read_unlock();
 
@@ -281,6 +286,66 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static int veth_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	int headroom = frame->data - (void *)frame;
+	struct net_device *rcv;
+	int err = 0;
+
+	rcv = rcu_dereference(priv->peer);
+	if (unlikely(!rcv))
+		return -ENXIO;
+
+	rcv_priv = netdev_priv(rcv);
+	/* xdp_ring is initialized on receive side? */
+	if (rcu_access_pointer(rcv_priv->xdp_prog)) {
+		err = xdp_ok_fwd_dev(rcv, frame->len);
+		if (unlikely(err))
+			return err;
+
+		err = veth_xdp_enqueue(rcv_priv, veth_xdp_to_ptr(frame));
+	} else {
+		struct sk_buff *skb;
+
+		skb = veth_build_skb(frame, headroom, frame->len, 0);
+		if (unlikely(!skb))
+			return -ENOMEM;
+
+		/* Get page ref in case skb is dropped in netif_rx.
+		 * The caller is responsible for freeing the page on error.
+		 */
+		get_page(virt_to_page(frame->data));
+		if (unlikely(veth_forward_skb(rcv, skb, false) != NET_RX_SUCCESS))
+			return -ENXIO;
+
+		/* Put page ref on success */
+		page_frag_free(frame->data);
+	}
+
+	return err;
+}
+
+static void veth_xdp_flush(struct net_device *dev)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct net_device *rcv;
+
+	rcu_read_lock();
+	rcv = rcu_dereference(priv->peer);
+	if (unlikely(!rcv))
+		goto out;
+
+	rcv_priv = netdev_priv(rcv);
+	/* xdp_ring is initialized on receive side? */
+	if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+		goto out;
+
+	__veth_xdp_flush(rcv_priv);
+out:
+	rcu_read_unlock();
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 					struct xdp_frame *frame)
 {
@@ -580,7 +645,7 @@ static void veth_poll_controller(struct net_device *dev)
 
 	rcu_read_lock();
 	if (rcu_access_pointer(priv->xdp_prog))
-		veth_xdp_flush(priv);
+		__veth_xdp_flush(priv);
 	rcu_read_unlock();
 }
 #endif	/* CONFIG_NET_POLL_CONTROLLER */
@@ -730,6 +795,8 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
+	.ndo_xdp_xmit		= veth_xdp_xmit,
+	.ndo_xdp_flush		= veth_xdp_flush,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4da8b2308174..7d043f51d1d7 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -19,6 +19,7 @@
 #include <linux/cryptohash.h>
 #include <linux/set_memory.h>
 #include <linux/kallsyms.h>
+#include <linux/if_vlan.h>
 
 #include <net/sch_generic.h>
 
@@ -752,6 +753,21 @@ static inline bool bpf_dump_raw_ok(void)
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 
+static __always_inline int
+xdp_ok_fwd_dev(const struct net_device *fwd, unsigned int pktlen)
+{
+	unsigned int len;
+
+	if (unlikely(!(fwd->flags & IFF_UP)))
+		return -ENETDOWN;
+
+	len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
+	if (pktlen > len)
+		return -EMSGSIZE;
+
+	return 0;
+}
+
 /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
  * same cpu context. Further for best results no more than a single map
  * for the do_redirect/do_flush pair should be used. This limitation is
diff --git a/net/core/filter.c b/net/core/filter.c
index a374b8560bc4..25ae8ffaa968 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2923,16 +2923,7 @@ EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
 static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, struct net_device *fwd)
 {
-	unsigned int len;
-
-	if (unlikely(!(fwd->flags & IFF_UP)))
-		return -ENETDOWN;
-
-	len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
-	if (skb->len > len)
-		return -EMSGSIZE;
-
-	return 0;
+	return xdp_ok_fwd_dev(fwd, skb->len);
 }
 
 static int xdp_do_generic_redirect_map(struct net_device *dev,
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC 7/9] veth: Add XDP TX and REDIRECT
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This allows further redirection of xdp_frames, for example:

 NIC   -> veth--veth -> veth--veth
 (XDP)          (XDP)         (XDP)

The intermediate XDP program, which redirects packets from the NIC to
the other veth, reuses the xdp_mem info from the NIC so that the NIC's
page recycling still works in the destination veth's XDP.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 85 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index b1d591be0eba..98fc91a64e29 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -43,6 +43,7 @@ struct veth_priv {
 	struct bpf_prog __rcu	*xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
+	struct xdp_mem_info	xdp_mem;
 	unsigned		requested_headroom;
 	bool			rx_notify_masked;
 	struct ptr_ring		xdp_ring;
@@ -346,9 +347,21 @@ static void veth_xdp_flush(struct net_device *dev)
 	rcu_read_unlock();
 }
 
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!frame))
+		return -EOVERFLOW;
+
+	return veth_xdp_xmit(dev, frame);
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
-					struct xdp_frame *frame)
+					struct xdp_frame *frame, bool *xdp_xmit,
+					bool *xdp_redir)
 {
+	struct xdp_frame orig_frame;
 	struct bpf_prog *xdp_prog;
 	unsigned int headroom;
 	struct sk_buff *skb;
@@ -372,6 +385,29 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 		case XDP_PASS:
 			delta = frame->data - xdp.data;
 			break;
+		case XDP_TX:
+			orig_frame = *frame;
+			xdp.data_hard_start = frame;
+			xdp.rxq->mem = frame->mem;
+			if (unlikely(veth_xdp_tx(priv->dev, &xdp))) {
+				trace_xdp_exception(priv->dev, xdp_prog, act);
+				frame = &orig_frame;
+				goto err_xdp;
+			}
+			*xdp_xmit = true;
+			rcu_read_unlock();
+			goto xdp_xmit;
+		case XDP_REDIRECT:
+			orig_frame = *frame;
+			xdp.data_hard_start = frame;
+			xdp.rxq->mem = frame->mem;
+			if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+				frame = &orig_frame;
+				goto err_xdp;
+			}
+			*xdp_redir = true;
+			rcu_read_unlock();
+			goto xdp_xmit;
 		default:
 			bpf_warn_invalid_xdp_action(act);
 		case XDP_ABORTED:
@@ -396,12 +432,13 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 err_xdp:
 	rcu_read_unlock();
 	xdp_return_frame(frame);
-
+xdp_xmit:
 	return NULL;
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-					struct sk_buff *skb)
+					struct sk_buff *skb, bool *xdp_xmit,
+					bool *xdp_redir)
 {
 	u32 pktlen, headroom, act, metalen;
 	int size, mac_len, delta, off;
@@ -469,6 +506,26 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	switch (act) {
 	case XDP_PASS:
 		break;
+	case XDP_TX:
+		get_page(virt_to_page(xdp.data));
+		dev_consume_skb_any(skb);
+		xdp.rxq->mem = priv->xdp_mem;
+		if (unlikely(veth_xdp_tx(priv->dev, &xdp))) {
+			trace_xdp_exception(priv->dev, xdp_prog, act);
+			goto err_xdp;
+		}
+		*xdp_xmit = true;
+		rcu_read_unlock();
+		goto xdp_xmit;
+	case XDP_REDIRECT:
+		get_page(virt_to_page(xdp.data));
+		dev_consume_skb_any(skb);
+		xdp.rxq->mem = priv->xdp_mem;
+		if (xdp_do_redirect(priv->dev, &xdp, xdp_prog))
+			goto err_xdp;
+		*xdp_redir = true;
+		rcu_read_unlock();
+		goto xdp_xmit;
 	default:
 		bpf_warn_invalid_xdp_action(act);
 	case XDP_ABORTED:
@@ -496,9 +553,15 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	rcu_read_unlock();
 	dev_kfree_skb_any(skb);
 	return NULL;
+err_xdp:
+	rcu_read_unlock();
+	page_frag_free(xdp.data);
+xdp_xmit:
+	return NULL;
 }
 
-static int veth_xdp_rcv(struct veth_priv *priv, int budget)
+static int veth_xdp_rcv(struct veth_priv *priv, int budget, bool *xdp_xmit,
+			bool *xdp_redir)
 {
 	int i, done = 0;
 
@@ -509,10 +572,12 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget)
 		if (!ptr)
 			break;
 
-		if (veth_is_xdp_frame(ptr))
-			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
-		else
-			skb = veth_xdp_rcv_skb(priv, ptr);
+		if (veth_is_xdp_frame(ptr)) {
+			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr),
+					       xdp_xmit, xdp_redir);
+		} else {
+			skb = veth_xdp_rcv_skb(priv, ptr, xdp_xmit, xdp_redir);
+		}
 
 		if (skb)
 			napi_gro_receive(&priv->xdp_napi, skb);
@@ -527,9 +592,11 @@ static int veth_poll(struct napi_struct *napi, int budget)
 {
 	struct veth_priv *priv =
 		container_of(napi, struct veth_priv, xdp_napi);
+	bool xdp_xmit = false;
+	bool xdp_redir = false;
 	int done;
 
-	done = veth_xdp_rcv(priv, budget);
+	done = veth_xdp_rcv(priv, budget, &xdp_xmit, &xdp_redir);
 
 	if (done < budget && napi_complete_done(napi, done)) {
 		/* Write rx_notify_masked before reading ptr_ring */
@@ -540,6 +607,11 @@ static int veth_poll(struct napi_struct *napi, int budget)
 		}
 	}
 
+	if (xdp_xmit)
+		veth_xdp_flush(priv->dev);
+	if (xdp_redir)
+		xdp_do_flush_map();
+
 	return done;
 }
 
@@ -585,6 +657,9 @@ static int veth_open(struct net_device *dev)
 	if (err < 0)
 		goto err_reg_mem;
 
+	/* Save original mem info as it can be overwritten */
+	priv->xdp_mem = priv->xdp_rxq.mem;
+
 	if (rtnl_dereference(priv->xdp_prog)) {
 		err = veth_napi_add(dev);
 		if (err)
@@ -615,6 +690,7 @@ static int veth_close(struct net_device *dev)
 	if (rtnl_dereference(priv->xdp_prog))
 		veth_napi_del(dev);
 
+	priv->xdp_rxq.mem = priv->xdp_mem;
 	xdp_rxq_info_unreg(&priv->xdp_rxq);
 
 	return 0;
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC 8/9] veth: Avoid per-packet spinlock of XDP napi ring on dequeueing
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Dequeue from the ring in batches into percpu temporary storage, so the
consumer spinlock is taken once per batch instead of once per packet.
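
The shape of the batched loop can be sketched in user space as below.
This is an illustration under assumptions, not the driver code:
sketch_ring is a hypothetical stand-in for ptr_ring, its
lock_acquisitions counter stands in for taking the consumer lock, and
a stack array replaces the percpu storage.

```c
#define SKETCH_QUEUE_SIZE 64	/* NAPI_POLL_WEIGHT in the kernel */
#define SKETCH_RING_SIZE  256	/* same value as VETH_RING_SIZE */

struct sketch_ring {
	void *slots[SKETCH_RING_SIZE];
	unsigned int head;		/* next slot to consume */
	unsigned int tail;		/* next slot to produce */
	unsigned int lock_acquisitions;	/* counts simulated lock takes */
};

/* Append one entry; returns -1 when the ring is full. */
int sketch_produce(struct sketch_ring *r, void *ptr)
{
	if (r->tail - r->head >= SKETCH_RING_SIZE)
		return -1;
	r->slots[r->tail++ % SKETCH_RING_SIZE] = ptr;
	return 0;
}

/* Take up to n entries under one simulated lock acquisition. */
int sketch_consume_batched(struct sketch_ring *r, void **out, int n)
{
	int i;

	r->lock_acquisitions++;
	for (i = 0; i < n && r->head != r->tail; i++)
		out[i] = r->slots[r->head++ % SKETCH_RING_SIZE];
	return i;
}

/* Same loop structure as veth_xdp_rcv() after this patch. */
int sketch_rcv(struct sketch_ring *r, int budget)
{
	void *q[SKETCH_QUEUE_SIZE];
	int num, lim, done = 0;

	do {
		lim = budget - done < SKETCH_QUEUE_SIZE ?
		      budget - done : SKETCH_QUEUE_SIZE;
		num = sketch_consume_batched(r, q, lim);
		/* Per-packet processing of q[0..num) would go here. */
		done += num;
	} while (num == lim && done < budget);

	return done;
}
```

With 100 entries queued and a budget of 64, one call drains 64 entries
with a single lock acquisition; a second call drains the remaining 36.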

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 46 +++++++++++++++++++++++++++-------------------
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 98fc91a64e29..1592119e3873 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -30,6 +30,7 @@
 #define VETH_XDP_FLAG		0x1UL
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+#define VETH_XDP_QUEUE_SIZE	NAPI_POLL_WEIGHT
 
 struct pcpu_vstats {
 	u64			packets;
@@ -50,6 +51,8 @@ struct veth_priv {
 	struct xdp_rxq_info	xdp_rxq;
 };
 
+static DEFINE_PER_CPU(void *[VETH_XDP_QUEUE_SIZE], xdp_consume_q);
+
 static bool veth_is_xdp_frame(void *ptr)
 {
 	return (unsigned long)ptr & VETH_XDP_FLAG;
@@ -563,27 +566,32 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 static int veth_xdp_rcv(struct veth_priv *priv, int budget, bool *xdp_xmit,
 			bool *xdp_redir)
 {
-	int i, done = 0;
-
-	for (i = 0; i < budget; i++) {
-		void *ptr = ptr_ring_consume(&priv->xdp_ring);
-		struct sk_buff *skb;
-
-		if (!ptr)
-			break;
+	void **q = this_cpu_ptr(xdp_consume_q);
+	int num, lim, done = 0;
+
+	do {
+		int i;
+
+		lim = min(budget - done, VETH_XDP_QUEUE_SIZE);
+		num = ptr_ring_consume_batched(&priv->xdp_ring, q, lim);
+		for (i = 0; i < num; i++) {
+			struct sk_buff *skb;
+			void *ptr = q[i];
+
+			if (veth_is_xdp_frame(ptr)) {
+				skb = veth_xdp_rcv_one(priv,
+						       veth_ptr_to_xdp(ptr),
+						       xdp_xmit, xdp_redir);
+			} else {
+				skb = veth_xdp_rcv_skb(priv, ptr, xdp_xmit,
+						       xdp_redir);
+			}
 
-		if (veth_is_xdp_frame(ptr)) {
-			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr),
-					       xdp_xmit, xdp_redir);
-		} else {
-			skb = veth_xdp_rcv_skb(priv, ptr, xdp_xmit, xdp_redir);
+			if (skb)
+				napi_gro_receive(&priv->xdp_napi, skb);
 		}
-
-		if (skb)
-			napi_gro_receive(&priv->xdp_napi, skb);
-
-		done++;
-	}
+		done += num;
+	} while (unlikely(num == lim && done < budget));
 
 	return done;
 }
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC 9/9] veth: Avoid per-packet spinlock of XDP napi ring on enqueueing
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Use percpu temporary storage to avoid taking the spinlock per packet.
This differs from the dequeue side in that multiple veth devices can be
redirect targets in one napi loop, so the percpu storage is allocated
in the veth private structure.
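
The staging scheme is roughly as follows, again as a user-space sketch
under assumptions rather than the driver code. sketch_queue mirrors
struct xdp_queue, sketch_ring is a hypothetical stand-in for ptr_ring,
and the lock_acquisitions and dropped counters stand in for taking the
producer lock and for freeing entries that do not fit.

```c
#include <errno.h>

#define SKETCH_QUEUE_SIZE 64	/* NAPI_POLL_WEIGHT in the kernel */
#define SKETCH_RING_SIZE  256	/* same value as VETH_RING_SIZE */

/* Per-CPU staging area; one instance per CPU and device in the patch. */
struct sketch_queue {
	void *q[SKETCH_QUEUE_SIZE];
	unsigned int len;
};

struct sketch_ring {
	void *slots[SKETCH_RING_SIZE];
	unsigned int count;
	unsigned int lock_acquisitions;	/* counts simulated lock takes */
	unsigned int dropped;		/* entries that did not fit */
};

/* Stage one entry without touching the ring or its lock. */
int sketch_enqueue(struct sketch_queue *q, void *ptr)
{
	if (q->len >= SKETCH_QUEUE_SIZE)
		return -ENOSPC;
	q->q[q->len++] = ptr;
	return 0;
}

/* Move all staged entries to the ring under one simulated lock. */
unsigned int sketch_flush(struct sketch_queue *q, struct sketch_ring *r)
{
	unsigned int i, moved = 0;

	if (!q->len)
		return 0;

	r->lock_acquisitions++;
	for (i = 0; i < q->len; i++) {
		if (r->count < SKETCH_RING_SIZE) {
			r->slots[r->count++] = q->q[i];
			moved++;
		} else {
			r->dropped++;	/* the patch frees the entry here */
		}
	}
	q->len = 0;

	return moved;
}
```

Staging is cheap because only the current CPU touches its own queue;
the shared ring is locked only once per flush.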

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 1592119e3873..5978d76f2c00 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -38,12 +38,18 @@ struct pcpu_vstats {
 	struct u64_stats_sync	syncp;
 };
 
+struct xdp_queue {
+	void *q[VETH_XDP_QUEUE_SIZE];
+	unsigned int len;
+};
+
 struct veth_priv {
 	struct napi_struct	xdp_napi;
 	struct net_device	*dev;
 	struct bpf_prog __rcu	*xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
+	struct xdp_queue __percpu *xdp_produce_q;
 	struct xdp_mem_info	xdp_mem;
 	unsigned		requested_headroom;
 	bool			rx_notify_masked;
@@ -147,8 +153,48 @@ static void veth_ptr_free(void *ptr)
 	}
 }
 
+static void veth_xdp_cleanup_queues(struct veth_priv *priv)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct xdp_queue *q = per_cpu_ptr(priv->xdp_produce_q, cpu);
+		int i;
+
+		for (i = 0; i < q->len; i++)
+			veth_ptr_free(q->q[i]);
+
+		q->len = 0;
+	}
+}
+
+static bool veth_xdp_flush_queue(struct veth_priv *priv)
+{
+	struct xdp_queue *q = this_cpu_ptr(priv->xdp_produce_q);
+	int i;
+
+	if (unlikely(!q->len))
+		return false;
+
+	spin_lock(&priv->xdp_ring.producer_lock);
+	for (i = 0; i < q->len; i++) {
+		void *ptr = q->q[i];
+
+		if (unlikely(__ptr_ring_produce(&priv->xdp_ring, ptr)))
+			veth_ptr_free(ptr);
+	}
+	spin_unlock(&priv->xdp_ring.producer_lock);
+
+	q->len = 0;
+
+	return true;
+}
+
 static void __veth_xdp_flush(struct veth_priv *priv)
 {
+	if (unlikely(!veth_xdp_flush_queue(priv)))
+		return;
+
 	/* Write ptr_ring before reading rx_notify_masked */
 	smp_mb();
 	if (!priv->rx_notify_masked) {
@@ -159,9 +205,13 @@ static void __veth_xdp_flush(struct veth_priv *priv)
 
 static int veth_xdp_enqueue(struct veth_priv *priv, void *ptr)
 {
-	if (unlikely(ptr_ring_produce(&priv->xdp_ring, ptr)))
+	struct xdp_queue *q = this_cpu_ptr(priv->xdp_produce_q);
+
+	if (unlikely(q->len >= VETH_XDP_QUEUE_SIZE))
 		return -ENOSPC;
 
+	q->q[q->len++] = ptr;
+
 	return 0;
 }
 
@@ -644,6 +694,7 @@ static void veth_napi_del(struct net_device *dev)
 
 	napi_disable(&priv->xdp_napi);
 	netif_napi_del(&priv->xdp_napi);
+	veth_xdp_cleanup_queues(priv);
 	ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
 }
 
@@ -711,15 +762,28 @@ static int is_valid_veth_mtu(int mtu)
 
 static int veth_dev_init(struct net_device *dev)
 {
+	struct veth_priv *priv = netdev_priv(dev);
+
 	dev->vstats = netdev_alloc_pcpu_stats(struct pcpu_vstats);
 	if (!dev->vstats)
 		return -ENOMEM;
+
+	priv->xdp_produce_q = __alloc_percpu(sizeof(*priv->xdp_produce_q),
+					     sizeof (void *));
+	if (!priv->xdp_produce_q) {
+		free_percpu(dev->vstats);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
 static void veth_dev_free(struct net_device *dev)
 {
+	struct veth_priv *priv = netdev_priv(dev);
+
 	free_percpu(dev->vstats);
+	free_percpu(priv->xdp_produce_q);
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
-- 
2.14.3

^ permalink raw reply related
