Netdev List
 help / color / mirror / Atom feed
* [PATCH net 0/7] rxrpc: Fix use of skb_cow_data()
From: David Howells @ 2019-08-29 13:07 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel


Here's a series of patches that replaces the use of skb_cow_data() in rxrpc
with skb_unshare() early on in the input process.  The problem that is
being seen is that skb_cow_data() indirectly requires that the maximum
usage count on an sk_buff be 1, and it may generate an assertion failure in
pskb_expand_head() if not.

This can occur because rxrpc_input_data() may be still holding a ref when
it has just attached the sk_buff to the rx ring and given that attachment
its own ref.  If recvmsg happens fast enough, skb_cow_data() can see the
ref still held by the softirq handler.

Further, a packet may contain multiple subpackets, each of which gets its
own attachment to the ring and its own ref - also making skb_cow_data() go
bang.

Fix this by:

 (1) The DATA packet is currently parsed for subpackets twice by the input
     routines.  Parse it just once instead and make notes in the sk_buff
     private data.

 (2) Use the notes from (1) when attaching the packet to the ring multiple
     times.  Once the packet is attached to the ring, recvmsg can see it
     and start modifying it, so the softirq handler is not permitted to
     look inside it from that point.

 (3) Pass the ref from the input code to the ring rather than getting an
     extra ref.  rxrpc_input_data() uses a ref on the second refcount to
     prevent the packet from evaporating under it.

 (4) Call skb_unshare() on secured DATA packets in rxrpc_input_packet()
     before we take call->input_lock.  Other sorts of packets don't get
     modified and so can be left.

     A trace is emitted if skb_unshare() eats the skb.  Note that
     skb_share() for our accounting in this regard as we can't see the
     parameters in the packet to log in a trace line if it releases it.

 (5) Remove the calls to skb_cow_data().  These are then no longer
     necessary.

There are also patches to improve the rxrpc_skb tracepoint to make sure
that Tx-derived buffers are identified separately from Rx-derived buffers
in the trace.

The patches are tagged here:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-fixes-20190827

and can also be found on the following branch:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-fixes

David
---
David Howells (7):
      rxrpc: Improve jumbo packet counting
      rxrpc: Use info in skbuff instead of reparsing a jumbo packet
      rxrpc: Pass the input handler's data skb reference to the Rx ring
      rxrpc: Abstract out rxtx ring cleanup
      rxrpc: Add a private skb flag to indicate transmission-phase skbs
      rxrpc: Use the tx-phase skb flag to simplify tracing
      rxrpc: Use skb_unshare() rather than skb_cow_data()


 include/trace/events/rxrpc.h |   59 ++++----
 net/rxrpc/ar-internal.h      |   16 ++
 net/rxrpc/call_event.c       |    8 +
 net/rxrpc/call_object.c      |   33 ++---
 net/rxrpc/conn_event.c       |    6 -
 net/rxrpc/input.c            |  304 +++++++++++++++++++++++-------------------
 net/rxrpc/local_event.c      |    4 -
 net/rxrpc/output.c           |    6 -
 net/rxrpc/peer_event.c       |   10 +
 net/rxrpc/protocol.h         |    9 +
 net/rxrpc/recvmsg.c          |   47 ++++--
 net/rxrpc/rxkad.c            |   32 +---
 net/rxrpc/sendmsg.c          |   13 +-
 net/rxrpc/skbuff.c           |   40 ++++--
 14 files changed, 319 insertions(+), 268 deletions(-)


^ permalink raw reply

* [PATCH net 5/5] rxrpc: Use skb_unshare() rather than skb_cow_data()
From: David Howells @ 2019-08-29 13:06 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <156708397261.25861.2158085372781699276.stgit@warthog.procyon.org.uk>

The in-place decryption routines in AF_RXRPC's rxkad security module
currently call skb_cow_data() to make sure the data isn't shared and that
the skb can be written over.  This has a problem, however, as the softirq
handler may be still holding a ref or the Rx ring may be holding multiple
refs when skb_cow_data() is called in rxkad_verify_packet() - and so
skb_shared() returns true and __pskb_pull_tail() dislikes that.  If this
occurs, something like the following report will be generated.

	kernel BUG at net/core/skbuff.c:1463!
	...
	RIP: 0010:pskb_expand_head+0x253/0x2b0
	...
	Call Trace:
	 __pskb_pull_tail+0x49/0x460
	 skb_cow_data+0x6f/0x300
	 rxkad_verify_packet+0x18b/0xb10 [rxrpc]
	 rxrpc_recvmsg_data.isra.11+0x4a8/0xa10 [rxrpc]
	 rxrpc_kernel_recv_data+0x126/0x240 [rxrpc]
	 afs_extract_data+0x51/0x2d0 [kafs]
	 afs_deliver_fs_fetch_data+0x188/0x400 [kafs]
	 afs_deliver_to_call+0xac/0x430 [kafs]
	 afs_wait_for_call_to_complete+0x22f/0x3d0 [kafs]
	 afs_make_call+0x282/0x3f0 [kafs]
	 afs_fs_fetch_data+0x164/0x300 [kafs]
	 afs_fetch_data+0x54/0x130 [kafs]
	 afs_readpages+0x20d/0x340 [kafs]
	 read_pages+0x66/0x180
	 __do_page_cache_readahead+0x188/0x1a0
	 ondemand_readahead+0x17d/0x2e0
	 generic_file_read_iter+0x740/0xc10
	 __vfs_read+0x145/0x1a0
	 vfs_read+0x8c/0x140
	 ksys_read+0x4a/0xb0
	 do_syscall_64+0x43/0xf0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by using skb_unshare() instead in the input path for DATA packets
that have a security index != 0.  Non-DATA packets don't need in-place
encryption and neither do unencrypted DATA packets.

Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
Reported-by: Julian Wollrath <jwollrath@web.de>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/trace/events/rxrpc.h |   12 ++++++++----
 net/rxrpc/ar-internal.h      |    1 +
 net/rxrpc/input.c            |   18 ++++++++++++++++++
 net/rxrpc/rxkad.c            |   32 +++++++++-----------------------
 net/rxrpc/skbuff.c           |   25 ++++++++++++++++++++-----
 5 files changed, 56 insertions(+), 32 deletions(-)

diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h
index e2356c51883b..a13a62db3565 100644
--- a/include/trace/events/rxrpc.h
+++ b/include/trace/events/rxrpc.h
@@ -32,6 +32,8 @@ enum rxrpc_skb_trace {
 	rxrpc_skb_received,
 	rxrpc_skb_rotated,
 	rxrpc_skb_seen,
+	rxrpc_skb_unshared,
+	rxrpc_skb_unshared_nomem,
 };
 
 enum rxrpc_local_trace {
@@ -231,7 +233,9 @@ enum rxrpc_tx_point {
 	EM(rxrpc_skb_purged,			"PUR") \
 	EM(rxrpc_skb_received,			"RCV") \
 	EM(rxrpc_skb_rotated,			"ROT") \
-	E_(rxrpc_skb_seen,			"SEE")
+	EM(rxrpc_skb_seen,			"SEE") \
+	EM(rxrpc_skb_unshared,			"UNS") \
+	E_(rxrpc_skb_unshared_nomem,		"US0")
 
 #define rxrpc_local_traces \
 	EM(rxrpc_local_got,			"GOT") \
@@ -633,9 +637,9 @@ TRACE_EVENT(rxrpc_call,
 
 TRACE_EVENT(rxrpc_skb,
 	    TP_PROTO(struct sk_buff *skb, enum rxrpc_skb_trace op,
-		     int usage, int mod_count, const void *where),
+		     int usage, int mod_count, u8 flags,    const void *where),
 
-	    TP_ARGS(skb, op, usage, mod_count, where),
+	    TP_ARGS(skb, op, usage, mod_count, flags, where),
 
 	    TP_STRUCT__entry(
 		    __field(struct sk_buff *,		skb		)
@@ -648,7 +652,7 @@ TRACE_EVENT(rxrpc_skb,
 
 	    TP_fast_assign(
 		    __entry->skb = skb;
-		    __entry->flags = rxrpc_skb(skb)->rx_flags;
+		    __entry->flags = flags;
 		    __entry->op = op;
 		    __entry->usage = usage;
 		    __entry->mod_count = mod_count;
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 2d5294f3e62f..852e58781fda 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -1110,6 +1110,7 @@ void rxrpc_kernel_data_consumed(struct rxrpc_call *, struct sk_buff *);
 void rxrpc_packet_destructor(struct sk_buff *);
 void rxrpc_new_skb(struct sk_buff *, enum rxrpc_skb_trace);
 void rxrpc_see_skb(struct sk_buff *, enum rxrpc_skb_trace);
+void rxrpc_eaten_skb(struct sk_buff *, enum rxrpc_skb_trace);
 void rxrpc_get_skb(struct sk_buff *, enum rxrpc_skb_trace);
 void rxrpc_free_skb(struct sk_buff *, enum rxrpc_skb_trace);
 void rxrpc_purge_queue(struct sk_buff_head *);
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 31090bdf1fae..d122c53c8697 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -1249,6 +1249,24 @@ int rxrpc_input_packet(struct sock *udp_sk, struct sk_buff *skb)
 			goto bad_message;
 		if (!rxrpc_validate_data(skb))
 			goto bad_message;
+
+		/* Unshare the packet so that it can be modified for in-place
+		 * decryption.
+		 */
+		if (sp->hdr.securityIndex != 0) {
+			struct sk_buff *nskb = skb_unshare(skb, GFP_ATOMIC);
+			if (!nskb) {
+				rxrpc_eaten_skb(skb, rxrpc_skb_unshared_nomem);
+				goto out;
+			}
+
+			if (nskb != skb) {
+				rxrpc_eaten_skb(skb, rxrpc_skb_received);
+				rxrpc_new_skb(skb, rxrpc_skb_unshared);
+				skb = nskb;
+				sp = rxrpc_skb(skb);
+			}
+		}
 		break;
 
 	case RXRPC_PACKET_TYPE_CHALLENGE:
diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index ae8cd8926456..c60c520fde7c 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -187,10 +187,8 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 	struct rxrpc_skb_priv *sp;
 	struct rxrpc_crypt iv;
 	struct scatterlist sg[16];
-	struct sk_buff *trailer;
 	unsigned int len;
 	u16 check;
-	int nsg;
 	int err;
 
 	sp = rxrpc_skb(skb);
@@ -214,15 +212,14 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 	crypto_skcipher_encrypt(req);
 
 	/* we want to encrypt the skbuff in-place */
-	nsg = skb_cow_data(skb, 0, &trailer);
-	err = -ENOMEM;
-	if (nsg < 0 || nsg > 16)
+	err = -EMSGSIZE;
+	if (skb_shinfo(skb)->nr_frags > 16)
 		goto out;
 
 	len = data_size + call->conn->size_align - 1;
 	len &= ~(call->conn->size_align - 1);
 
-	sg_init_table(sg, nsg);
+	sg_init_table(sg, ARRAY_SIZE(sg));
 	err = skb_to_sgvec(skb, sg, 0, len);
 	if (unlikely(err < 0))
 		goto out;
@@ -319,11 +316,10 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, struct sk_buff *skb,
 	struct rxkad_level1_hdr sechdr;
 	struct rxrpc_crypt iv;
 	struct scatterlist sg[16];
-	struct sk_buff *trailer;
 	bool aborted;
 	u32 data_size, buf;
 	u16 check;
-	int nsg, ret;
+	int ret;
 
 	_enter("");
 
@@ -336,11 +332,7 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, struct sk_buff *skb,
 	/* Decrypt the skbuff in-place.  TODO: We really want to decrypt
 	 * directly into the target buffer.
 	 */
-	nsg = skb_cow_data(skb, 0, &trailer);
-	if (nsg < 0 || nsg > 16)
-		goto nomem;
-
-	sg_init_table(sg, nsg);
+	sg_init_table(sg, ARRAY_SIZE(sg));
 	ret = skb_to_sgvec(skb, sg, offset, 8);
 	if (unlikely(ret < 0))
 		return ret;
@@ -388,10 +380,6 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, struct sk_buff *skb,
 	if (aborted)
 		rxrpc_send_abort_packet(call);
 	return -EPROTO;
-
-nomem:
-	_leave(" = -ENOMEM");
-	return -ENOMEM;
 }
 
 /*
@@ -406,7 +394,6 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, struct sk_buff *skb,
 	struct rxkad_level2_hdr sechdr;
 	struct rxrpc_crypt iv;
 	struct scatterlist _sg[4], *sg;
-	struct sk_buff *trailer;
 	bool aborted;
 	u32 data_size, buf;
 	u16 check;
@@ -423,12 +410,11 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, struct sk_buff *skb,
 	/* Decrypt the skbuff in-place.  TODO: We really want to decrypt
 	 * directly into the target buffer.
 	 */
-	nsg = skb_cow_data(skb, 0, &trailer);
-	if (nsg < 0)
-		goto nomem;
-
 	sg = _sg;
-	if (unlikely(nsg > 4)) {
+	nsg = skb_shinfo(skb)->nr_frags;
+	if (nsg <= 4) {
+		nsg = 4;
+	} else {
 		sg = kmalloc_array(nsg, sizeof(*sg), GFP_NOIO);
 		if (!sg)
 			goto nomem;
diff --git a/net/rxrpc/skbuff.c b/net/rxrpc/skbuff.c
index 8e6f45f84b9b..0348d2bf6f7d 100644
--- a/net/rxrpc/skbuff.c
+++ b/net/rxrpc/skbuff.c
@@ -24,7 +24,8 @@ void rxrpc_new_skb(struct sk_buff *skb, enum rxrpc_skb_trace op)
 {
 	const void *here = __builtin_return_address(0);
 	int n = atomic_inc_return(select_skb_count(skb));
-	trace_rxrpc_skb(skb, op, refcount_read(&skb->users), n, here);
+	trace_rxrpc_skb(skb, op, refcount_read(&skb->users), n,
+			rxrpc_skb(skb)->rx_flags, here);
 }
 
 /*
@@ -35,7 +36,8 @@ void rxrpc_see_skb(struct sk_buff *skb, enum rxrpc_skb_trace op)
 	const void *here = __builtin_return_address(0);
 	if (skb) {
 		int n = atomic_read(select_skb_count(skb));
-		trace_rxrpc_skb(skb, op, refcount_read(&skb->users), n, here);
+		trace_rxrpc_skb(skb, op, refcount_read(&skb->users), n,
+				rxrpc_skb(skb)->rx_flags, here);
 	}
 }
 
@@ -46,10 +48,21 @@ void rxrpc_get_skb(struct sk_buff *skb, enum rxrpc_skb_trace op)
 {
 	const void *here = __builtin_return_address(0);
 	int n = atomic_inc_return(select_skb_count(skb));
-	trace_rxrpc_skb(skb, op, refcount_read(&skb->users), n, here);
+	trace_rxrpc_skb(skb, op, refcount_read(&skb->users), n,
+			rxrpc_skb(skb)->rx_flags, here);
 	skb_get(skb);
 }
 
+/*
+ * Note the dropping of a ref on a socket buffer by the core.
+ */
+void rxrpc_eaten_skb(struct sk_buff *skb, enum rxrpc_skb_trace op)
+{
+	const void *here = __builtin_return_address(0);
+	int n = atomic_inc_return(&rxrpc_n_rx_skbs);
+	trace_rxrpc_skb(skb, op, 0, n, 0, here);
+}
+
 /*
  * Note the destruction of a socket buffer.
  */
@@ -60,7 +73,8 @@ void rxrpc_free_skb(struct sk_buff *skb, enum rxrpc_skb_trace op)
 		int n;
 		CHECK_SLAB_OKAY(&skb->users);
 		n = atomic_dec_return(select_skb_count(skb));
-		trace_rxrpc_skb(skb, op, refcount_read(&skb->users), n, here);
+		trace_rxrpc_skb(skb, op, refcount_read(&skb->users), n,
+				rxrpc_skb(skb)->rx_flags, here);
 		kfree_skb(skb);
 	}
 }
@@ -75,7 +89,8 @@ void rxrpc_purge_queue(struct sk_buff_head *list)
 	while ((skb = skb_dequeue((list))) != NULL) {
 		int n = atomic_dec_return(select_skb_count(skb));
 		trace_rxrpc_skb(skb, rxrpc_skb_purged,
-				refcount_read(&skb->users), n, here);
+				refcount_read(&skb->users), n,
+				rxrpc_skb(skb)->rx_flags, here);
 		kfree_skb(skb);
 	}
 }


^ permalink raw reply related

* [PATCH net 4/5] rxrpc: Use the tx-phase skb flag to simplify tracing
From: David Howells @ 2019-08-29 13:06 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <156708397261.25861.2158085372781699276.stgit@warthog.procyon.org.uk>

Use the previously-added transmit-phase skbuff private flag to simplify the
socket buffer tracing a bit.  Which phase the skbuff comes from can now be
divined from the skb rather than having to be guessed from the call state.

We can also reduce the number of rxrpc_skb_trace values by eliminating the
difference between Tx and Rx in the symbols.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/trace/events/rxrpc.h |   51 ++++++++++++++++++------------------------
 net/rxrpc/ar-internal.h      |    1 +
 net/rxrpc/call_event.c       |    8 +++----
 net/rxrpc/call_object.c      |    6 ++---
 net/rxrpc/conn_event.c       |    6 ++---
 net/rxrpc/input.c            |   22 +++++++++---------
 net/rxrpc/local_event.c      |    4 ++-
 net/rxrpc/output.c           |    6 ++---
 net/rxrpc/peer_event.c       |   10 ++++----
 net/rxrpc/recvmsg.c          |    6 ++---
 net/rxrpc/sendmsg.c          |   10 ++++----
 net/rxrpc/skbuff.c           |   15 +++++++-----
 12 files changed, 69 insertions(+), 76 deletions(-)

diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h
index fa06b528c73c..e2356c51883b 100644
--- a/include/trace/events/rxrpc.h
+++ b/include/trace/events/rxrpc.h
@@ -23,20 +23,15 @@
 #define __RXRPC_DECLARE_TRACE_ENUMS_ONCE_ONLY
 
 enum rxrpc_skb_trace {
-	rxrpc_skb_rx_cleaned,
-	rxrpc_skb_rx_freed,
-	rxrpc_skb_rx_got,
-	rxrpc_skb_rx_lost,
-	rxrpc_skb_rx_purged,
-	rxrpc_skb_rx_received,
-	rxrpc_skb_rx_rotated,
-	rxrpc_skb_rx_seen,
-	rxrpc_skb_tx_cleaned,
-	rxrpc_skb_tx_freed,
-	rxrpc_skb_tx_got,
-	rxrpc_skb_tx_new,
-	rxrpc_skb_tx_rotated,
-	rxrpc_skb_tx_seen,
+	rxrpc_skb_cleaned,
+	rxrpc_skb_freed,
+	rxrpc_skb_got,
+	rxrpc_skb_lost,
+	rxrpc_skb_new,
+	rxrpc_skb_purged,
+	rxrpc_skb_received,
+	rxrpc_skb_rotated,
+	rxrpc_skb_seen,
 };
 
 enum rxrpc_local_trace {
@@ -228,20 +223,15 @@ enum rxrpc_tx_point {
  * Declare tracing information enums and their string mappings for display.
  */
 #define rxrpc_skb_traces \
-	EM(rxrpc_skb_rx_cleaned,		"Rx CLN") \
-	EM(rxrpc_skb_rx_freed,			"Rx FRE") \
-	EM(rxrpc_skb_rx_got,			"Rx GOT") \
-	EM(rxrpc_skb_rx_lost,			"Rx *L*") \
-	EM(rxrpc_skb_rx_purged,			"Rx PUR") \
-	EM(rxrpc_skb_rx_received,		"Rx RCV") \
-	EM(rxrpc_skb_rx_rotated,		"Rx ROT") \
-	EM(rxrpc_skb_rx_seen,			"Rx SEE") \
-	EM(rxrpc_skb_tx_cleaned,		"Tx CLN") \
-	EM(rxrpc_skb_tx_freed,			"Tx FRE") \
-	EM(rxrpc_skb_tx_got,			"Tx GOT") \
-	EM(rxrpc_skb_tx_new,			"Tx NEW") \
-	EM(rxrpc_skb_tx_rotated,		"Tx ROT") \
-	E_(rxrpc_skb_tx_seen,			"Tx SEE")
+	EM(rxrpc_skb_cleaned,			"CLN") \
+	EM(rxrpc_skb_freed,			"FRE") \
+	EM(rxrpc_skb_got,			"GOT") \
+	EM(rxrpc_skb_lost,			"*L*") \
+	EM(rxrpc_skb_new,			"NEW") \
+	EM(rxrpc_skb_purged,			"PUR") \
+	EM(rxrpc_skb_received,			"RCV") \
+	EM(rxrpc_skb_rotated,			"ROT") \
+	E_(rxrpc_skb_seen,			"SEE")
 
 #define rxrpc_local_traces \
 	EM(rxrpc_local_got,			"GOT") \
@@ -650,6 +640,7 @@ TRACE_EVENT(rxrpc_skb,
 	    TP_STRUCT__entry(
 		    __field(struct sk_buff *,		skb		)
 		    __field(enum rxrpc_skb_trace,	op		)
+		    __field(u8,				flags		)
 		    __field(int,			usage		)
 		    __field(int,			mod_count	)
 		    __field(const void *,		where		)
@@ -657,14 +648,16 @@ TRACE_EVENT(rxrpc_skb,
 
 	    TP_fast_assign(
 		    __entry->skb = skb;
+		    __entry->flags = rxrpc_skb(skb)->rx_flags;
 		    __entry->op = op;
 		    __entry->usage = usage;
 		    __entry->mod_count = mod_count;
 		    __entry->where = where;
 			   ),
 
-	    TP_printk("s=%p %s u=%d m=%d p=%pSR",
+	    TP_printk("s=%p %cx %s u=%d m=%d p=%pSR",
 		      __entry->skb,
+		      __entry->flags & RXRPC_SKB_TX_BUFFER ? 'T' : 'R',
 		      __print_symbolic(__entry->op, rxrpc_skb_traces),
 		      __entry->usage,
 		      __entry->mod_count,
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 63d3a91ce5e9..2d5294f3e62f 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -185,6 +185,7 @@ struct rxrpc_host_header {
  * - max 48 bytes (struct sk_buff::cb)
  */
 struct rxrpc_skb_priv {
+	atomic_t	nr_ring_pins;		/* Number of rxtx ring pins */
 	u8		nr_subpackets;		/* Number of subpackets */
 	u8		rx_flags;		/* Received packet flags */
 #define RXRPC_SKB_INCL_LAST	0x01		/* - Includes last packet */
diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index c767679bfa5d..cedbbb3a7c2e 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -199,7 +199,7 @@ static void rxrpc_resend(struct rxrpc_call *call, unsigned long now_j)
 			continue;
 
 		skb = call->rxtx_buffer[ix];
-		rxrpc_see_skb(skb, rxrpc_skb_tx_seen);
+		rxrpc_see_skb(skb, rxrpc_skb_seen);
 
 		if (anno_type == RXRPC_TX_ANNO_UNACK) {
 			if (ktime_after(skb->tstamp, max_age)) {
@@ -255,18 +255,18 @@ static void rxrpc_resend(struct rxrpc_call *call, unsigned long now_j)
 			continue;
 
 		skb = call->rxtx_buffer[ix];
-		rxrpc_get_skb(skb, rxrpc_skb_tx_got);
+		rxrpc_get_skb(skb, rxrpc_skb_got);
 		spin_unlock_bh(&call->lock);
 
 		if (rxrpc_send_data_packet(call, skb, true) < 0) {
-			rxrpc_free_skb(skb, rxrpc_skb_tx_freed);
+			rxrpc_free_skb(skb, rxrpc_skb_freed);
 			return;
 		}
 
 		if (rxrpc_is_client_call(call))
 			rxrpc_expose_client_call(call);
 
-		rxrpc_free_skb(skb, rxrpc_skb_tx_freed);
+		rxrpc_free_skb(skb, rxrpc_skb_freed);
 		spin_lock_bh(&call->lock);
 
 		/* We need to clear the retransmit state, but there are two
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index c9ab2da957fe..014548c259ce 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -429,9 +429,7 @@ static void rxrpc_cleanup_ring(struct rxrpc_call *call)
 	int i;
 
 	for (i = 0; i < RXRPC_RXTX_BUFF_SIZE; i++) {
-		rxrpc_free_skb(call->rxtx_buffer[i],
-			       (call->tx_phase ? rxrpc_skb_tx_cleaned :
-				rxrpc_skb_rx_cleaned));
+		rxrpc_free_skb(call->rxtx_buffer[i], rxrpc_skb_cleaned);
 		call->rxtx_buffer[i] = NULL;
 	}
 }
@@ -587,7 +585,7 @@ void rxrpc_cleanup_call(struct rxrpc_call *call)
 	ASSERTCMP(call->conn, ==, NULL);
 
 	rxrpc_cleanup_ring(call);
-	rxrpc_free_skb(call->tx_pending, rxrpc_skb_tx_cleaned);
+	rxrpc_free_skb(call->tx_pending, rxrpc_skb_cleaned);
 
 	call_rcu(&call->rcu, rxrpc_rcu_destroy_call);
 }
diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
index df6624c140be..a1ceef4f5cd0 100644
--- a/net/rxrpc/conn_event.c
+++ b/net/rxrpc/conn_event.c
@@ -472,7 +472,7 @@ void rxrpc_process_connection(struct work_struct *work)
 	/* go through the conn-level event packets, releasing the ref on this
 	 * connection that each one has when we've finished with it */
 	while ((skb = skb_dequeue(&conn->rx_queue))) {
-		rxrpc_see_skb(skb, rxrpc_skb_rx_seen);
+		rxrpc_see_skb(skb, rxrpc_skb_seen);
 		ret = rxrpc_process_event(conn, skb, &abort_code);
 		switch (ret) {
 		case -EPROTO:
@@ -484,7 +484,7 @@ void rxrpc_process_connection(struct work_struct *work)
 			goto requeue_and_leave;
 		case -ECONNABORTED:
 		default:
-			rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+			rxrpc_free_skb(skb, rxrpc_skb_freed);
 			break;
 		}
 	}
@@ -501,6 +501,6 @@ void rxrpc_process_connection(struct work_struct *work)
 protocol_error:
 	if (rxrpc_abort_connection(conn, ret, abort_code) < 0)
 		goto requeue_and_leave;
-	rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+	rxrpc_free_skb(skb, rxrpc_skb_freed);
 	goto out;
 }
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 140cede77655..31090bdf1fae 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -233,7 +233,7 @@ static bool rxrpc_rotate_tx_window(struct rxrpc_call *call, rxrpc_seq_t to,
 		ix = call->tx_hard_ack & RXRPC_RXTX_BUFF_MASK;
 		skb = call->rxtx_buffer[ix];
 		annotation = call->rxtx_annotations[ix];
-		rxrpc_see_skb(skb, rxrpc_skb_tx_rotated);
+		rxrpc_see_skb(skb, rxrpc_skb_rotated);
 		call->rxtx_buffer[ix] = NULL;
 		call->rxtx_annotations[ix] = 0;
 		skb->next = list;
@@ -258,7 +258,7 @@ static bool rxrpc_rotate_tx_window(struct rxrpc_call *call, rxrpc_seq_t to,
 		skb = list;
 		list = skb->next;
 		skb_mark_not_on_list(skb);
-		rxrpc_free_skb(skb, rxrpc_skb_tx_freed);
+		rxrpc_free_skb(skb, rxrpc_skb_freed);
 	}
 
 	return rot_last;
@@ -443,7 +443,7 @@ static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb)
 
 	state = READ_ONCE(call->state);
 	if (state >= RXRPC_CALL_COMPLETE) {
-		rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+		rxrpc_free_skb(skb, rxrpc_skb_freed);
 		return;
 	}
 
@@ -559,7 +559,7 @@ static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb)
 		 * and also rxrpc_fill_out_ack().
 		 */
 		if (!terminal)
-			rxrpc_get_skb(skb, rxrpc_skb_rx_got);
+			rxrpc_get_skb(skb, rxrpc_skb_got);
 		call->rxtx_annotations[ix] = annotation;
 		smp_wmb();
 		call->rxtx_buffer[ix] = skb;
@@ -620,7 +620,7 @@ static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb)
 
 unlock:
 	spin_unlock(&call->input_lock);
-	rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+	rxrpc_free_skb(skb, rxrpc_skb_freed);
 	_leave(" [queued]");
 }
 
@@ -1056,7 +1056,7 @@ static void rxrpc_input_call_packet(struct rxrpc_call *call,
 		break;
 	}
 
-	rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+	rxrpc_free_skb(skb, rxrpc_skb_freed);
 no_free:
 	_leave("");
 }
@@ -1119,7 +1119,7 @@ static void rxrpc_post_packet_to_local(struct rxrpc_local *local,
 		skb_queue_tail(&local->event_queue, skb);
 		rxrpc_queue_local(local);
 	} else {
-		rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+		rxrpc_free_skb(skb, rxrpc_skb_freed);
 	}
 }
 
@@ -1134,7 +1134,7 @@ static void rxrpc_reject_packet(struct rxrpc_local *local, struct sk_buff *skb)
 		skb_queue_tail(&local->reject_queue, skb);
 		rxrpc_queue_local(local);
 	} else {
-		rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+		rxrpc_free_skb(skb, rxrpc_skb_freed);
 	}
 }
 
@@ -1198,7 +1198,7 @@ int rxrpc_input_packet(struct sock *udp_sk, struct sk_buff *skb)
 	if (skb->tstamp == 0)
 		skb->tstamp = ktime_get_real();
 
-	rxrpc_new_skb(skb, rxrpc_skb_rx_received);
+	rxrpc_new_skb(skb, rxrpc_skb_received);
 
 	skb_pull(skb, sizeof(struct udphdr));
 
@@ -1215,7 +1215,7 @@ int rxrpc_input_packet(struct sock *udp_sk, struct sk_buff *skb)
 		static int lose;
 		if ((lose++ & 7) == 7) {
 			trace_rxrpc_rx_lose(sp);
-			rxrpc_free_skb(skb, rxrpc_skb_rx_lost);
+			rxrpc_free_skb(skb, rxrpc_skb_lost);
 			return 0;
 		}
 	}
@@ -1389,7 +1389,7 @@ int rxrpc_input_packet(struct sock *udp_sk, struct sk_buff *skb)
 	goto out;
 
 discard:
-	rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+	rxrpc_free_skb(skb, rxrpc_skb_freed);
 out:
 	trace_rxrpc_rx_done(0, 0);
 	return 0;
diff --git a/net/rxrpc/local_event.c b/net/rxrpc/local_event.c
index e93a78f7c05e..3ce6d628cd75 100644
--- a/net/rxrpc/local_event.c
+++ b/net/rxrpc/local_event.c
@@ -90,7 +90,7 @@ void rxrpc_process_local_events(struct rxrpc_local *local)
 	if (skb) {
 		struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
 
-		rxrpc_see_skb(skb, rxrpc_skb_rx_seen);
+		rxrpc_see_skb(skb, rxrpc_skb_seen);
 		_debug("{%d},{%u}", local->debug_id, sp->hdr.type);
 
 		switch (sp->hdr.type) {
@@ -108,7 +108,7 @@ void rxrpc_process_local_events(struct rxrpc_local *local)
 			break;
 		}
 
-		rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+		rxrpc_free_skb(skb, rxrpc_skb_freed);
 	}
 
 	_leave("");
diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index 369e516c4bdf..935bb60fff56 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -565,7 +565,7 @@ void rxrpc_reject_packets(struct rxrpc_local *local)
 	memset(&whdr, 0, sizeof(whdr));
 
 	while ((skb = skb_dequeue(&local->reject_queue))) {
-		rxrpc_see_skb(skb, rxrpc_skb_rx_seen);
+		rxrpc_see_skb(skb, rxrpc_skb_seen);
 		sp = rxrpc_skb(skb);
 
 		switch (skb->mark) {
@@ -581,7 +581,7 @@ void rxrpc_reject_packets(struct rxrpc_local *local)
 			ioc = 2;
 			break;
 		default:
-			rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+			rxrpc_free_skb(skb, rxrpc_skb_freed);
 			continue;
 		}
 
@@ -606,7 +606,7 @@ void rxrpc_reject_packets(struct rxrpc_local *local)
 						      rxrpc_tx_point_reject);
 		}
 
-		rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+		rxrpc_free_skb(skb, rxrpc_skb_freed);
 	}
 
 	_leave("");
diff --git a/net/rxrpc/peer_event.c b/net/rxrpc/peer_event.c
index 7666ec72d37e..c97ebdc043e4 100644
--- a/net/rxrpc/peer_event.c
+++ b/net/rxrpc/peer_event.c
@@ -163,11 +163,11 @@ void rxrpc_error_report(struct sock *sk)
 		_leave("UDP socket errqueue empty");
 		return;
 	}
-	rxrpc_new_skb(skb, rxrpc_skb_rx_received);
+	rxrpc_new_skb(skb, rxrpc_skb_received);
 	serr = SKB_EXT_ERR(skb);
 	if (!skb->len && serr->ee.ee_origin == SO_EE_ORIGIN_TIMESTAMPING) {
 		_leave("UDP empty message");
-		rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+		rxrpc_free_skb(skb, rxrpc_skb_freed);
 		return;
 	}
 
@@ -177,7 +177,7 @@ void rxrpc_error_report(struct sock *sk)
 		peer = NULL;
 	if (!peer) {
 		rcu_read_unlock();
-		rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+		rxrpc_free_skb(skb, rxrpc_skb_freed);
 		_leave(" [no peer]");
 		return;
 	}
@@ -189,7 +189,7 @@ void rxrpc_error_report(struct sock *sk)
 	     serr->ee.ee_code == ICMP_FRAG_NEEDED)) {
 		rxrpc_adjust_mtu(peer, serr);
 		rcu_read_unlock();
-		rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+		rxrpc_free_skb(skb, rxrpc_skb_freed);
 		rxrpc_put_peer(peer);
 		_leave(" [MTU update]");
 		return;
@@ -197,7 +197,7 @@ void rxrpc_error_report(struct sock *sk)
 
 	rxrpc_store_error(peer, serr);
 	rcu_read_unlock();
-	rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+	rxrpc_free_skb(skb, rxrpc_skb_freed);
 	rxrpc_put_peer(peer);
 
 	_leave("");
diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index e49eacfaf4d6..3b0becb12041 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -190,7 +190,7 @@ static void rxrpc_rotate_rx_window(struct rxrpc_call *call)
 	hard_ack++;
 	ix = hard_ack & RXRPC_RXTX_BUFF_MASK;
 	skb = call->rxtx_buffer[ix];
-	rxrpc_see_skb(skb, rxrpc_skb_rx_rotated);
+	rxrpc_see_skb(skb, rxrpc_skb_rotated);
 	sp = rxrpc_skb(skb);
 
 	subpacket = call->rxtx_annotations[ix] & RXRPC_RX_ANNO_SUBPACKET;
@@ -205,7 +205,7 @@ static void rxrpc_rotate_rx_window(struct rxrpc_call *call)
 	/* Barrier against rxrpc_input_data(). */
 	smp_store_release(&call->rx_hard_ack, hard_ack);
 
-	rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+	rxrpc_free_skb(skb, rxrpc_skb_freed);
 
 	trace_rxrpc_receive(call, rxrpc_receive_rotate, serial, hard_ack);
 	if (last) {
@@ -340,7 +340,7 @@ static int rxrpc_recvmsg_data(struct socket *sock, struct rxrpc_call *call,
 			break;
 		}
 		smp_rmb();
-		rxrpc_see_skb(skb, rxrpc_skb_rx_seen);
+		rxrpc_see_skb(skb, rxrpc_skb_seen);
 		sp = rxrpc_skb(skb);
 
 		if (!(flags & MSG_PEEK)) {
diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c
index 472dc3b7d91f..6a1547b270fe 100644
--- a/net/rxrpc/sendmsg.c
+++ b/net/rxrpc/sendmsg.c
@@ -176,7 +176,7 @@ static int rxrpc_queue_packet(struct rxrpc_sock *rx, struct rxrpc_call *call,
 	skb->tstamp = ktime_get_real();
 
 	ix = seq & RXRPC_RXTX_BUFF_MASK;
-	rxrpc_get_skb(skb, rxrpc_skb_tx_got);
+	rxrpc_get_skb(skb, rxrpc_skb_got);
 	call->rxtx_annotations[ix] = annotation;
 	smp_wmb();
 	call->rxtx_buffer[ix] = skb;
@@ -248,7 +248,7 @@ static int rxrpc_queue_packet(struct rxrpc_sock *rx, struct rxrpc_call *call,
 	}
 
 out:
-	rxrpc_free_skb(skb, rxrpc_skb_tx_freed);
+	rxrpc_free_skb(skb, rxrpc_skb_freed);
 	_leave(" = %d", ret);
 	return ret;
 }
@@ -289,7 +289,7 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
 
 	skb = call->tx_pending;
 	call->tx_pending = NULL;
-	rxrpc_see_skb(skb, rxrpc_skb_tx_seen);
+	rxrpc_see_skb(skb, rxrpc_skb_seen);
 
 	copied = 0;
 	do {
@@ -338,7 +338,7 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
 
 			sp = rxrpc_skb(skb);
 			sp->rx_flags |= RXRPC_SKB_TX_BUFFER;
-			rxrpc_new_skb(skb, rxrpc_skb_tx_new);
+			rxrpc_new_skb(skb, rxrpc_skb_new);
 
 			_debug("ALLOC SEND %p", skb);
 
@@ -440,7 +440,7 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
 	return ret;
 
 call_terminated:
-	rxrpc_free_skb(skb, rxrpc_skb_tx_freed);
+	rxrpc_free_skb(skb, rxrpc_skb_freed);
 	_leave(" = %d", call->error);
 	return call->error;
 
diff --git a/net/rxrpc/skbuff.c b/net/rxrpc/skbuff.c
index 9ad5045b7c2f..8e6f45f84b9b 100644
--- a/net/rxrpc/skbuff.c
+++ b/net/rxrpc/skbuff.c
@@ -14,7 +14,8 @@
 #include <net/af_rxrpc.h>
 #include "ar-internal.h"
 
-#define select_skb_count(op) (op >= rxrpc_skb_tx_cleaned ? &rxrpc_n_tx_skbs : &rxrpc_n_rx_skbs)
+#define is_tx_skb(skb) (rxrpc_skb(skb)->rx_flags & RXRPC_SKB_TX_BUFFER)
+#define select_skb_count(skb) (is_tx_skb(skb) ? &rxrpc_n_tx_skbs : &rxrpc_n_rx_skbs)
 
 /*
  * Note the allocation or reception of a socket buffer.
@@ -22,7 +23,7 @@
 void rxrpc_new_skb(struct sk_buff *skb, enum rxrpc_skb_trace op)
 {
 	const void *here = __builtin_return_address(0);
-	int n = atomic_inc_return(select_skb_count(op));
+	int n = atomic_inc_return(select_skb_count(skb));
 	trace_rxrpc_skb(skb, op, refcount_read(&skb->users), n, here);
 }
 
@@ -33,7 +34,7 @@ void rxrpc_see_skb(struct sk_buff *skb, enum rxrpc_skb_trace op)
 {
 	const void *here = __builtin_return_address(0);
 	if (skb) {
-		int n = atomic_read(select_skb_count(op));
+		int n = atomic_read(select_skb_count(skb));
 		trace_rxrpc_skb(skb, op, refcount_read(&skb->users), n, here);
 	}
 }
@@ -44,7 +45,7 @@ void rxrpc_see_skb(struct sk_buff *skb, enum rxrpc_skb_trace op)
 void rxrpc_get_skb(struct sk_buff *skb, enum rxrpc_skb_trace op)
 {
 	const void *here = __builtin_return_address(0);
-	int n = atomic_inc_return(select_skb_count(op));
+	int n = atomic_inc_return(select_skb_count(skb));
 	trace_rxrpc_skb(skb, op, refcount_read(&skb->users), n, here);
 	skb_get(skb);
 }
@@ -58,7 +59,7 @@ void rxrpc_free_skb(struct sk_buff *skb, enum rxrpc_skb_trace op)
 	if (skb) {
 		int n;
 		CHECK_SLAB_OKAY(&skb->users);
-		n = atomic_dec_return(select_skb_count(op));
+		n = atomic_dec_return(select_skb_count(skb));
 		trace_rxrpc_skb(skb, op, refcount_read(&skb->users), n, here);
 		kfree_skb(skb);
 	}
@@ -72,8 +73,8 @@ void rxrpc_purge_queue(struct sk_buff_head *list)
 	const void *here = __builtin_return_address(0);
 	struct sk_buff *skb;
 	while ((skb = skb_dequeue((list))) != NULL) {
-		int n = atomic_dec_return(select_skb_count(rxrpc_skb_rx_purged));
-		trace_rxrpc_skb(skb, rxrpc_skb_rx_purged,
+		int n = atomic_dec_return(select_skb_count(skb));
+		trace_rxrpc_skb(skb, rxrpc_skb_purged,
 				refcount_read(&skb->users), n, here);
 		kfree_skb(skb);
 	}


^ permalink raw reply related

* [PATCH net 3/5] rxrpc: Add a private skb flag to indicate transmission-phase skbs
From: David Howells @ 2019-08-29 13:06 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <156708397261.25861.2158085372781699276.stgit@warthog.procyon.org.uk>

Add a flag in the private data on an skbuff to indicate that this is a
transmission-phase buffer rather than a receive-phase buffer.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/ar-internal.h |    1 +
 net/rxrpc/sendmsg.c     |    3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 20d7907a5bc6..63d3a91ce5e9 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -188,6 +188,7 @@ struct rxrpc_skb_priv {
 	u8		nr_subpackets;		/* Number of subpackets */
 	u8		rx_flags;		/* Received packet flags */
 #define RXRPC_SKB_INCL_LAST	0x01		/* - Includes last packet */
+#define RXRPC_SKB_TX_BUFFER	0x02		/* - Is transmit buffer */
 	union {
 		int		remain;		/* amount of space remaining for next write */
 
diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c
index bae14438f869..472dc3b7d91f 100644
--- a/net/rxrpc/sendmsg.c
+++ b/net/rxrpc/sendmsg.c
@@ -336,6 +336,8 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
 			if (!skb)
 				goto maybe_error;
 
+			sp = rxrpc_skb(skb);
+			sp->rx_flags |= RXRPC_SKB_TX_BUFFER;
 			rxrpc_new_skb(skb, rxrpc_skb_tx_new);
 
 			_debug("ALLOC SEND %p", skb);
@@ -346,7 +348,6 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
 			skb_reserve(skb, call->conn->security_size);
 			skb->len += call->conn->security_size;
 
-			sp = rxrpc_skb(skb);
 			sp->remain = chunk;
 			if (sp->remain > skb_tailroom(skb))
 				sp->remain = skb_tailroom(skb);


^ permalink raw reply related

* [PATCH net 2/5] rxrpc: Abstract out rxtx ring cleanup
From: David Howells @ 2019-08-29 13:06 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <156708397261.25861.2158085372781699276.stgit@warthog.procyon.org.uk>

Abstract out rxtx ring cleanup into its own function from its two callers.
This makes it easier to apply the same changes to both.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/call_object.c |   33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index 217b12be9e08..c9ab2da957fe 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -421,6 +421,21 @@ void rxrpc_get_call(struct rxrpc_call *call, enum rxrpc_call_trace op)
 	trace_rxrpc_call(call, op, n, here, NULL);
 }
 
+/*
+ * Clean up the RxTx skb ring.
+ */
+static void rxrpc_cleanup_ring(struct rxrpc_call *call)
+{
+	int i;
+
+	for (i = 0; i < RXRPC_RXTX_BUFF_SIZE; i++) {
+		rxrpc_free_skb(call->rxtx_buffer[i],
+			       (call->tx_phase ? rxrpc_skb_tx_cleaned :
+				rxrpc_skb_rx_cleaned));
+		call->rxtx_buffer[i] = NULL;
+	}
+}
+
 /*
  * Detach a call from its owning socket.
  */
@@ -429,7 +444,6 @@ void rxrpc_release_call(struct rxrpc_sock *rx, struct rxrpc_call *call)
 	const void *here = __builtin_return_address(0);
 	struct rxrpc_connection *conn = call->conn;
 	bool put = false;
-	int i;
 
 	_enter("{%d,%d}", call->debug_id, atomic_read(&call->usage));
 
@@ -479,13 +493,7 @@ void rxrpc_release_call(struct rxrpc_sock *rx, struct rxrpc_call *call)
 	if (conn)
 		rxrpc_disconnect_call(call);
 
-	for (i = 0; i < RXRPC_RXTX_BUFF_SIZE; i++) {
-		rxrpc_free_skb(call->rxtx_buffer[i],
-			       (call->tx_phase ? rxrpc_skb_tx_cleaned :
-				rxrpc_skb_rx_cleaned));
-		call->rxtx_buffer[i] = NULL;
-	}
-
+	rxrpc_cleanup_ring(call);
 	_leave("");
 }
 
@@ -568,8 +576,6 @@ static void rxrpc_rcu_destroy_call(struct rcu_head *rcu)
  */
 void rxrpc_cleanup_call(struct rxrpc_call *call)
 {
-	int i;
-
 	_net("DESTROY CALL %d", call->debug_id);
 
 	memset(&call->sock_node, 0xcd, sizeof(call->sock_node));
@@ -580,12 +586,7 @@ void rxrpc_cleanup_call(struct rxrpc_call *call)
 	ASSERT(test_bit(RXRPC_CALL_RELEASED, &call->flags));
 	ASSERTCMP(call->conn, ==, NULL);
 
-	/* Clean up the Rx/Tx buffer */
-	for (i = 0; i < RXRPC_RXTX_BUFF_SIZE; i++)
-		rxrpc_free_skb(call->rxtx_buffer[i],
-			       (call->tx_phase ? rxrpc_skb_tx_cleaned :
-				rxrpc_skb_rx_cleaned));
-
+	rxrpc_cleanup_ring(call);
 	rxrpc_free_skb(call->tx_pending, rxrpc_skb_tx_cleaned);
 
 	call_rcu(&call->rcu, rxrpc_rcu_destroy_call);


^ permalink raw reply related

* [PATCH net 1/5] rxrpc: Pass the input handler's data skb reference to the Rx ring
From: David Howells @ 2019-08-29 13:06 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <156708397261.25861.2158085372781699276.stgit@warthog.procyon.org.uk>

Pass the reference held on a DATA skb in the rxrpc input handler into the
Rx ring rather than getting an additional ref for this and then dropping
the original ref at the end.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/input.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 35b1a9368d80..140cede77655 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -422,7 +422,8 @@ static void rxrpc_input_dup_data(struct rxrpc_call *call, rxrpc_seq_t seq,
 }
 
 /*
- * Process a DATA packet, adding the packet to the Rx ring.
+ * Process a DATA packet, adding the packet to the Rx ring.  The caller's
+ * packet ref must be passed on or discarded.
  */
 static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb)
 {
@@ -441,8 +442,10 @@ static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb)
 	       sp->hdr.serial, seq0, sp->hdr.flags, sp->nr_subpackets);
 
 	state = READ_ONCE(call->state);
-	if (state >= RXRPC_CALL_COMPLETE)
+	if (state >= RXRPC_CALL_COMPLETE) {
+		rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
 		return;
+	}
 
 	if (call->state == RXRPC_CALL_SERVER_RECV_REQUEST) {
 		unsigned long timo = READ_ONCE(call->next_req_timo);
@@ -555,7 +558,8 @@ static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb)
 		 * Barriers against rxrpc_recvmsg_data() and rxrpc_rotate_rx_window()
 		 * and also rxrpc_fill_out_ack().
 		 */
-		rxrpc_get_skb(skb, rxrpc_skb_rx_got);
+		if (!terminal)
+			rxrpc_get_skb(skb, rxrpc_skb_rx_got);
 		call->rxtx_annotations[ix] = annotation;
 		smp_wmb();
 		call->rxtx_buffer[ix] = skb;
@@ -616,6 +620,7 @@ static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb)
 
 unlock:
 	spin_unlock(&call->input_lock);
+	rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
 	_leave(" [queued]");
 }
 
@@ -1024,7 +1029,7 @@ static void rxrpc_input_call_packet(struct rxrpc_call *call,
 	switch (sp->hdr.type) {
 	case RXRPC_PACKET_TYPE_DATA:
 		rxrpc_input_data(call, skb);
-		break;
+		goto no_free;
 
 	case RXRPC_PACKET_TYPE_ACK:
 		rxrpc_input_ack(call, skb);
@@ -1051,6 +1056,8 @@ static void rxrpc_input_call_packet(struct rxrpc_call *call,
 		break;
 	}
 
+	rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+no_free:
 	_leave("");
 }
 
@@ -1375,8 +1382,11 @@ int rxrpc_input_packet(struct sock *udp_sk, struct sk_buff *skb)
 		mutex_unlock(&call->user_mutex);
 	}
 
+	/* Process a call packet; this either discards or passes on the ref
+	 * elsewhere.
+	 */
 	rxrpc_input_call_packet(call, skb);
-	goto discard;
+	goto out;
 
 discard:
 	rxrpc_free_skb(skb, rxrpc_skb_rx_freed);


^ permalink raw reply related

* [PATCH net 0/5] rxrpc: Fix use of skb_cow_data()
From: David Howells @ 2019-08-29 13:06 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel


Here's a series of patches that replaces the use of skb_cow_data() in rxrpc
with skb_unshare() early on in the input process.  The problem that is
being seen is that skb_cow_data() indirectly requires that the maximum
usage count on an sk_buff be 1, and it may generate an assertion failure in
pskb_expand_head() if not.

This can occur because rxrpc_input_data() may be still holding a ref when
it has just attached the sk_buff to the rx ring and given that attachment
its own ref.  If recvmsg happens fast enough, skb_cow_data() can see the
ref still held by the softirq handler.

Further, a packet may contain multiple subpackets, each of which gets its
own attachment to the ring and its own ref - also making skb_cow_data() go
bang.

Fix this by:

 (1) The DATA packet is currently parsed for subpackets twice by the input
     routines.  Parse it just once instead and make notes in the sk_buff
     private data.

 (2) Use the notes from (1) when attaching the packet to the ring multiple
     times.  Once the packet is attached to the ring, recvmsg can see it
     and start modifying it, so the softirq handler is not permitted to
     look inside it from that point.

 (3) Pass the ref from the input code to the ring rather than getting an
     extra ref.  rxrpc_input_data() uses a ref on the second refcount to
     prevent the packet from evaporating under it.

 (4) Call skb_unshare() on secured DATA packets in rxrpc_input_packet()
     before we take call->input_lock.  Other sorts of packets don't get
     modified and so can be left.

     A trace is emitted if skb_unshare() eats the skb.  Note that
     skb_share() for our accounting in this regard as we can't see the
     parameters in the packet to log in a trace line if it releases it.

 (5) Remove the calls to skb_cow_data().  These are then no longer
     necessary.

There are also patches to improve the rxrpc_skb tracepoint to make sure
that Tx-derived buffers are identified separately from Rx-derived buffers
in the trace.

The patches are tagged here:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-fixes-20190827

and can also be found on the following branch:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-fixes

David
---
David Howells (5):
      rxrpc: Pass the input handler's data skb reference to the Rx ring
      rxrpc: Abstract out rxtx ring cleanup
      rxrpc: Add a private skb flag to indicate transmission-phase skbs
      rxrpc: Use the tx-phase skb flag to simplify tracing
      rxrpc: Use skb_unshare() rather than skb_cow_data()


 include/trace/events/rxrpc.h |   59 ++++++++++++++++++++----------------------
 net/rxrpc/ar-internal.h      |    3 ++
 net/rxrpc/call_event.c       |    8 +++---
 net/rxrpc/call_object.c      |   33 +++++++++++------------
 net/rxrpc/conn_event.c       |    6 ++--
 net/rxrpc/input.c            |   52 ++++++++++++++++++++++++++++---------
 net/rxrpc/local_event.c      |    4 +--
 net/rxrpc/output.c           |    6 ++--
 net/rxrpc/peer_event.c       |   10 ++++---
 net/rxrpc/recvmsg.c          |    6 ++--
 net/rxrpc/rxkad.c            |   32 ++++++-----------------
 net/rxrpc/sendmsg.c          |   13 +++++----
 net/rxrpc/skbuff.c           |   40 ++++++++++++++++++++--------
 13 files changed, 151 insertions(+), 121 deletions(-)


^ permalink raw reply

* Re: [PATCH v3 1/2] net: core: Notify on changes to dev->promiscuity.
From: Ivan Vecera @ 2019-08-29 12:55 UTC (permalink / raw)
  To: Horatiu Vultur
  Cc: Jiri Pirko, alexandre.belloni, UNGLinuxDriver, davem, andrew,
	allan.nielsen, f.fainelli, netdev, linux-kernel
In-Reply-To: <20190829124412.nrlpz5tzx3fkdoiw@soft-dev3.microsemi.net>

On Thu, 29 Aug 2019 14:44:14 +0200
Horatiu Vultur <horatiu.vultur@microchip.com> wrote:

> When a port is added to a bridge, then the port gets in promisc mode.
> But in our case the HW has bridge capabilities so it is not required
> to set the port in promisc mode.
> But if someone else requires the port to be in promisc mode (tcpdump
> or any other application) then I would like to set the port in promisc
> mode.
> In previous emails Andrew came with the suggestion to look at
> dev->promiscuity and check if the port is a bridge port. Using this
> information I could see when to add the port in promisc mode. He also
> suggested to add a new switchdev call(maybe I missunderstood him, or I
> have done it at the wrong place) in case there are no callbacks in the
> driver to get this information.

I would use the 1st suggestion.

for/in your driver:
if (dev->promiscuity > 0) {
	if (dev->promiscuity == 1 && netif_is_bridge_port(dev)) {
		/* no need to set promisc mode because promiscuity
		 * was requested by bridge
		 */
		...
	} else {
		/* need to set promisc mode as someone else requested
		 * promiscuity
		 */
	}
}

Thanks,
Ivan

^ permalink raw reply

* Re: [PATCH net] netdevsim: Restore per-network namespace accounting for fib entries
From: David Ahern @ 2019-08-29 12:54 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: David Ahern, davem, netdev
In-Reply-To: <20190829062850.GG2312@nanopsycho>

On 8/29/19 12:28 AM, Jiri Pirko wrote:
> Wed, Aug 28, 2019 at 11:26:03PM CEST, dsahern@gmail.com wrote:
>> On 8/28/19 4:37 AM, Jiri Pirko wrote:
>>> Tue, Aug 06, 2019 at 09:15:17PM CEST, dsahern@kernel.org wrote:
>>>> From: David Ahern <dsahern@gmail.com>
>>>>
>>>> Prior to the commit in the fixes tag, the resource controller in netdevsim
>>>> tracked fib entries and rules per network namespace. Restore that behavior.
>>>
>>> David, please help me understand. If the counters are per-device, not
>>> per-netns, they are both the same. If we have device (devlink instance)
>>> is in a netns and take only things happening in this netns into account,
>>> it should count exactly the same amount of fib entries, doesn't it?
>>
>> if you are only changing where the counters are stored - net_generic vs
>> devlink private - then yes, they should be equivalent.
> 
> Okay.
> 
>>
>>>
>>> I re-thinked the devlink netns patchset and currently I'm going in
>>> slightly different direction. I'm having netns as an attribute of
>>> devlink reload. So all the port netdevices and everything gets
>>> re-instantiated into new netns. Works fine with mlxsw. There we also
>>> re-register the fib notifier.
>>>
>>> I think that this can work for your usecase in netdevsim too:
>>> 1) devlink instance is registering a fib notifier to track all fib
>>>    entries in a namespace it belongs to. The counters are per-device -
>>>    counting fib entries in a namespace the device is in.
>>> 2) another devlink instance can do the same tracking in the same
>>>    namespace. No problem, it's a separate counter, but the numbers are
>>>    the same. One can set different limits to different devlink
>>>    instances, but you can have only one. That is the bahaviour you have
>>>    now.
>>> 3) on devlink reload, netdevsim re-instantiates ports and re-registers
>>>    fib notifier
>>> 4) on devlink reload with netns change, all should be fine as the
>>>    re-registered fib nofitier replays the entries. The ports are
>>>    re-instatiated in new netns.
>>>
>>> This way, we would get consistent behaviour between netdevsim and real
>>> devices (mlxsw), correct devlink-netns implementation (you also
>>> suggested to move ports to the namespace). Everyone should be happy.
>>>
>>> What do you think?
>>>
>>
>> Right now, registering the fib notifier walks all namespaces. That is
>> not a scalable solution. Are you changing that to replay only a given
>> netns? Are you changing the notifiers to be per-namespace?
> 
> Eventually, that seems like good idea. Currently I want to do
> if (net==nsim_dev->mynet)
> 	done
> check at the beginning of the notifier.
> 

The per-namespace replay should be done as part of this re-work. It
should not be that big of a change. Add 'struct net' arg to
register_fib_notifier. If set, call fib_net_dump only for that
namespace. The seq check should be made per-namespace.

You mentioned mlxsw works fine with moving ports to a new network
namespace, so that will be a 'real' example with a known scalability
problem that should be addressed now.

^ permalink raw reply

* Re: [PATCH v3 1/2] net: core: Notify on changes to dev->promiscuity.
From: Horatiu Vultur @ 2019-08-29 12:44 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexandre.belloni, UNGLinuxDriver, davem, andrew, allan.nielsen,
	ivecera, f.fainelli, netdev, linux-kernel
In-Reply-To: <20190829121811.GI2312@nanopsycho>

The 08/29/2019 14:18, Jiri Pirko wrote:
> External E-Mail
> 
> 
> Thu, Aug 29, 2019 at 12:56:54PM CEST, horatiu.vultur@microchip.com wrote:
> >The 08/29/2019 11:51, Jiri Pirko wrote:
> >> External E-Mail
> >> 
> >> 
> >> Thu, Aug 29, 2019 at 11:22:28AM CEST, horatiu.vultur@microchip.com wrote:
> >> >Add the SWITCHDEV_ATTR_ID_PORT_PROMISCUITY switchdev notification type,
> >> >used to indicate whenever the dev promiscuity counter is changed.
> >> >
> >> >The notification doesn't use any switchdev_attr attribute because in the
> >> >notifier callbacks is it possible to get the dev and read directly
> >> >the promiscuity value.
> >> >
> >> >Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
> >> >---
> >> > include/net/switchdev.h | 1 +
> >> > net/core/dev.c          | 9 +++++++++
> >> > 2 files changed, 10 insertions(+)
> >> >
> >> >diff --git a/include/net/switchdev.h b/include/net/switchdev.h
> >> >index aee86a1..14b1617 100644
> >> >--- a/include/net/switchdev.h
> >> >+++ b/include/net/switchdev.h
> >> >@@ -40,6 +40,7 @@ enum switchdev_attr_id {
> >> > 	SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING,
> >> > 	SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED,
> >> > 	SWITCHDEV_ATTR_ID_BRIDGE_MROUTER,
> >> >+	SWITCHDEV_ATTR_ID_PORT_PROMISCUITY,
> >> > };
> >> > 
> >> > struct switchdev_attr {
> >> >diff --git a/net/core/dev.c b/net/core/dev.c
> >> >index 49589ed..40c74f2 100644
> >> >--- a/net/core/dev.c
> >> >+++ b/net/core/dev.c
> >> >@@ -142,6 +142,7 @@
> >> > #include <linux/net_namespace.h>
> >> > #include <linux/indirect_call_wrapper.h>
> >> > #include <net/devlink.h>
> >> >+#include <net/switchdev.h>
> >> > 
> >> > #include "net-sysfs.h"
> >> > 
> >> >@@ -7377,6 +7378,11 @@ static void dev_change_rx_flags(struct net_device *dev, int flags)
> >> > static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
> >> > {
> >> > 	unsigned int old_flags = dev->flags;
> >> >+	struct switchdev_attr attr = {
> >> >+		.orig_dev = dev,
> >> >+		.id = SWITCHDEV_ATTR_ID_PORT_PROMISCUITY,
> >> >+		.flags = SWITCHDEV_F_DEFER,
> >> 
> >
> >Hi Jiri,
> >
> >> NACK
> >> 
> >> This is invalid usecase for switchdev infra. Switchdev is there for
> >> bridge offload purposes only.
> >> 
> >> For promiscuity changes, the infrastructure is already present in the
> >> code. See __dev_notify_flags(). it calls:
> >> call_netdevice_notifiers_info(NETDEV_CHANGE, &change_info.info)
> >> and you can actually see the changed flag in ".flags_changed".
> >Yes, you are right. But in case the port is part of a bridge and then
> >enable promisc mode by a user space application(tpcdump) then the drivers
> 
> If the promisc is on, it is on. Why do you need to know who enabled it?
When a port is added to a bridge, then the port gets in promisc mode.
But in our case the HW has bridge capabilities so it is not required to
set the port in promisc mode.
But if someone else requires the port to be in promisc mode (tcpdump or
any other application) then I would like to set the port in promisc
mode.
In previous emails Andrew came with the suggestion to look at
dev->promiscuity and check if the port is a bridge port. Using this
information I could see when to add the port in promisc mode. He also
suggested to add a new switchdev call(maybe I missunderstood him, or I
have done it at the wrong place) in case there are no callbacks in the
driver to get this information.
> 
> Or do you want to use this to ask driver to ask hw to trap packets to
> kernel? If yes, I don't think it is correct. If you want to "steal" some
> packets from the hw forwarding datapath, use TC action "trap".
No, I just wanted to know when to update the HW to set the port in
promisc mode.

> 
> 
> >will not be notified. The reason is that the dev->flags will still be the
> >same(because IFF_PROMISC was already set) and only promiscuity will be
> >changed.
> >
> >One fix could be to call __dev_notify_flags() no matter when the
> >promisc is enable or disabled.
> >
> >> 
> >> You just have to register netdev notifier block in your driver. Grep
> >> for: register_netdevice_notifier
> >> 
> >> 
> >> >+	};
> >> > 	kuid_t uid;
> >> > 	kgid_t gid;
> >> > 
> >> >@@ -7419,6 +7425,9 @@ static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
> >> > 	}
> >> > 	if (notify)
> >> > 		__dev_notify_flags(dev, old_flags, IFF_PROMISC);
> >> >+
> >> >+	switchdev_port_attr_set(dev, &attr);
> >> >+
> >> > 	return 0;
> >> > }
> >> > 
> >> >-- 
> >> >2.7.4
> >> >
> >> 
> >
> >-- 
> >/Horatiu
> 

-- 
/Horatiu

^ permalink raw reply

* Re: [PATCH] wil6210: Delete an unnecessary kfree() call in wil_tid_ampdu_rx_alloc()
From: merez @ 2019-08-28  7:44 UTC (permalink / raw)
  To: Markus Elfring
  Cc: linux-wireless, netdev, wil6210, David S. Miller, Kalle Valo,
	LKML, kernel-janitors, linux-wireless-owner
In-Reply-To: <b9620e49-618d-b392-6456-17de5807df75@web.de>

On 2019-08-27 17:44, Markus Elfring wrote:
> From: Markus Elfring <elfring@users.sourceforge.net>
> Date: Tue, 27 Aug 2019 16:39:02 +0200
> 
> A null pointer would be passed to a call of the function “kfree”
> directly after a call of the function “kcalloc” failed at one place.
> Remove this superfluous function call.
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
> ---

Reviewed-by: Maya Erez <merez@codeaurora.org>

-- 
Maya Erez
Qualcomm Israel, Inc. on behalf of Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project

^ permalink raw reply

* Re: [PATCH ipsec-next 7/7] xfrm: add espintcp (RFC 8229)
From: Sabrina Dubroca @ 2019-08-29 12:34 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: netdev, Herbert Xu
In-Reply-To: <20190829070431.GA2879@gauss3.secunet.de>

2019-08-29, 09:04:31 +0200, Steffen Klassert wrote:
> On Wed, Aug 21, 2019 at 11:46:25PM +0200, Sabrina Dubroca wrote:
> > +static struct sock *esp_find_tcp_sk(struct xfrm_state *x)
> > +{
> > +	struct xfrm_encap_tmpl *encap = x->encap;
> > +	struct esp_tcp_sk *esk;
> > +	__be16 sport, dport;
> > +	struct sock *nsk;
> > +	struct sock *sk;
> > +
> > +	sk = rcu_dereference(x->encap_sk);
> > +	if (sk && sk->sk_state == TCP_ESTABLISHED)
> > +		return sk;
> > +
> > +	spin_lock_bh(&x->lock);
> > +	sport = encap->encap_sport;
> > +	dport = encap->encap_dport;
> > +	nsk = rcu_dereference_protected(x->encap_sk,
> > +					lockdep_is_held(&x->lock));
> > +	if (sk && sk == nsk) {
> > +		esk = kmalloc(sizeof(*esk), GFP_ATOMIC);
> > +		if (!esk) {
> > +			spin_unlock_bh(&x->lock);
> > +			return ERR_PTR(-ENOMEM);
> > +		}
> > +		RCU_INIT_POINTER(x->encap_sk, NULL);
> > +		esk->sk = sk;
> > +		call_rcu(&esk->rcu, esp_free_tcp_sk);
> 
> I don't understand this, can you please explain what you are doing here?

If we get to this block, the current encap_sk is not in ESTABLISHED
state and cannot be used to send data. We want to get rid of it (reset
x->encap_sk) and find a new usable socket. The weird kmalloc dance is
just here so that we can do the sock_put after an RCU grace period,
which I think is needed so that we don't destruct a socket that's
still used by packets in flight.

That's code I copied from Herbert's first submission, I may be missing
some details here.

> > +	}
> > +	spin_unlock_bh(&x->lock);
> > +
> > +	sk = inet_lookup_established(xs_net(x), &tcp_hashinfo, x->id.daddr.a4,
> > +				     dport, x->props.saddr.a4, sport, 0);
> > +	if (!sk)
> > +		return ERR_PTR(-ENOENT);
> > +
> > +	if (!tcp_is_ulp_esp(sk)) {
> > +		sock_put(sk);
> > +		return ERR_PTR(-EINVAL);
> > +	}
> > +
> > +	spin_lock_bh(&x->lock);
> > +	nsk = rcu_dereference_protected(x->encap_sk,
> > +					lockdep_is_held(&x->lock));
> > +	if (encap->encap_sport != sport ||
> > +	    encap->encap_dport != dport) {
> > +		sock_put(sk);
> > +		sk = nsk ?: ERR_PTR(-EREMCHG);
> > +	} else if (sk == nsk) {
> > +		sock_put(sk);
> > +	} else {
> > +		rcu_assign_pointer(x->encap_sk, sk);
> > +	}
> > +	spin_unlock_bh(&x->lock);
> > +
> > +	return sk;
> > +}
> > +
> > +static int esp_output_tcp_finish(struct xfrm_state *x, struct sk_buff *skb)
> > +{
> > +	struct sock *sk;
> > +	int err;
> > +
> > +	rcu_read_lock();
> > +
> > +	sk = esp_find_tcp_sk(x);
> > +	err = PTR_ERR(sk);
> > +	if (IS_ERR(sk))
> 
> Maybe better 'if (err)'?

That will work if I replace PTR_ERR with PTR_ERR_OR_ZERO.

> > +		goto out;
> > +
> > +	bh_lock_sock(sk);
> > +	if (sock_owned_by_user(sk)) {
> > +		err = espintcp_queue_out(sk, skb);
> > +		if (err < 0)
> > +			goto unlock_sock;
> 
> This goto is not needed.

Will fix.

> > +	} else {
> > +		err = espintcp_push_skb(sk, skb);
> > +	}
> > +
> > +unlock_sock:
> > +	bh_unlock_sock(sk);
> > +out:
> > +	rcu_read_unlock();
> > +	return err;
> > +}
> > +
> > +static int esp_output_tcp_encap_cb(struct net *net, struct sock *sk,
> > +				   struct sk_buff *skb)
> > +{
> > +	struct dst_entry *dst = skb_dst(skb);
> > +	struct xfrm_state *x = dst->xfrm;
> > +
> > +	return esp_output_tcp_finish(x, skb);
> > +}
> > +
> > +static int esp_output_tail_tcp(struct xfrm_state *x, struct sk_buff *skb)
> > +{
> > +	int err;
> > +
> > +	local_bh_disable();
> > +	err = xfrm_trans_queue_net(xs_net(x), skb, esp_output_tcp_encap_cb);
> > +	local_bh_enable();
> > +
> > +	/* EINPROGRESS just happens to do the right thing.  It
> > +	 * actually means that the skb has been consumed and
> > +	 * isn't coming back.
> > +	 */
> > +	return err ?: -EINPROGRESS;
> > +}
> > +#endif
> > +
> >  static void esp_output_done(struct crypto_async_request *base, int err)
> >  {
> >  	struct sk_buff *skb = base->data;
> > @@ -147,7 +272,13 @@ static void esp_output_done(struct crypto_async_request *base, int err)
> >  		secpath_reset(skb);
> >  		xfrm_dev_resume(skb);
> >  	} else {
> > -		xfrm_output_resume(skb, err);
> > +#ifdef CONFIG_XFRM_ESPINTCP
> 
> Do we really need all these ifdef? I guess most of them could
> be avoided with some code refactorization.

If we compile espintcp out, esp_output_tail_tcp will be dead code so
it might as well not be compiled in either.

The more trivial operations (like in esp_input_done2) could be
compiled in anyway. That might be considered a bit inconsistent,
because I need to keep the ifdef in esp_init_state (so that we land in
the default case and error out when espintcp is disabled).

Then I can provide variants of some functions (esp_output_tcp_encap,
esp_output_tail_tcp) that always fail for !XFRM_ESPINTCP.

> > +		if (!err &&
> > +		    x->encap && x->encap->encap_type == TCP_ENCAP_ESPINTCP)
> > +			esp_output_tail_tcp(x, skb);
> > +		else
> > +#endif
> > +			xfrm_output_resume(skb, err);
> >  	}
> >  }
> 
> ...
> 
> > @@ -296,7 +460,7 @@ int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
> >  	struct sk_buff *trailer;
> >  	int tailen = esp->tailen;
> >  
> > -	/* this is non-NULL only with UDP Encapsulation */
> > +	/* this is non-NULL only with TCP/UDP Encapsulation */
> >  	if (x->encap) {
> >  		int err = esp_output_encap(x, skb, esp);
> >  
> > @@ -491,6 +655,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
> >  	if (sg != dsg)
> >  		esp_ssg_unref(x, tmp);
> >  
> > +#ifdef CONFIG_XFRM_ESPINTCP
> > +	if (!err && x->encap && x->encap->encap_type == TCP_ENCAP_ESPINTCP)
> > +		err = esp_output_tail_tcp(x, skb);
> > +#endif
> > +
> >  error_free:
> >  	kfree(tmp);
> >  error:
> > @@ -617,10 +786,16 @@ int esp_input_done2(struct sk_buff *skb, int err)
> >  
> >  	if (x->encap) {
> >  		struct xfrm_encap_tmpl *encap = x->encap;
> > +		struct tcphdr *th = (void *)(skb_network_header(skb) + ihl);
> 
> This gives a unused variable warning if CONFIG_XFRM_ESPINTCP
> is not set.

Oops, will fix. Or that will take care of itself if I just remove the ifdef below.

> >  		struct udphdr *uh = (void *)(skb_network_header(skb) + ihl);
> >  		__be16 source;
> >  
> >  		switch (x->encap->encap_type) {
> > +#ifdef CONFIG_XFRM_ESPINTCP

(this one)

> > +		case TCP_ENCAP_ESPINTCP:
> > +			source = th->source;
> > +			break;
> > +#endif
> >  		case UDP_ENCAP_ESPINUDP:
> >  		case UDP_ENCAP_ESPINUDP_NON_IKE:
> >  			source = uh->source;
> > @@ -1017,6 +1192,14 @@ static int esp_init_state(struct xfrm_state *x)
> >  		case UDP_ENCAP_ESPINUDP_NON_IKE:
> >  			x->props.header_len += sizeof(struct udphdr) + 2 * sizeof(u32);
> >  			break;
> > +#ifdef CONFIG_XFRM_ESPINTCP
> > +		case TCP_ENCAP_ESPINTCP:
> > +			/* only the length field, TCP encap is done by
> > +			 * the socket
> > +			 */
> > +			x->props.header_len += 2;
> > +			break;
> > +#endif
> >  		}
> >  	}
> >  
> > diff --git a/net/xfrm/Kconfig b/net/xfrm/Kconfig
> > index c967fc3c38c8..ccc012b3ea10 100644
> > --- a/net/xfrm/Kconfig
> > +++ b/net/xfrm/Kconfig
> > @@ -71,6 +71,15 @@ config XFRM_IPCOMP
> >  	select CRYPTO
> >  	select CRYPTO_DEFLATE
> >  
> > +config XFRM_ESPINTCP
> > +	bool "ESP in TCP encapsulation (RFC 8229)"
> > +	depends on XFRM && INET_ESP
> > +	select STREAM_PARSER
> 
> I'm getting these compile errors:
> 
> espintcp.o: In function `espintcp_close':
> /home/klassert/git/linux-stk/net/xfrm/espintcp.c:469: undefined reference to `sk_msg_free'
> net/xfrm/espintcp.o: In function `espintcp_sendmsg':
> /home/klassert/git/linux-stk/net/xfrm/espintcp.c:302: undefined reference to `sk_msg_alloc'
> /home/klassert/git/linux-stk/net/xfrm/espintcp.c:316: undefined reference to `sk_msg_memcopy_from_iter'
> /home/klassert/git/linux-stk/net/xfrm/espintcp.c:341: undefined reference to `sk_msg_free'
> /home/klassert/git/linux-stk/net/xfrm/espintcp.c:321: undefined reference to `sk_msg_memcopy_from_iter'
> /home/klassert/git/linux-stk/Makefile:1067: recipe for target 'vmlinux' failed
> make[1]: *** [vmlinux] Error 1
> 
> I guess you need to select NET_SOCK_MSG.

Right, I'll go fix the Kconfig entry.

> Btw. I've updated the ipsec-next tree, so patch 1 is not needed anymore.

Cool, I'll drop it.

> Everything else looks good!

Thanks for the review.

-- 
Sabrina

^ permalink raw reply

* Re: [PATCH v2 2/6] mdev: Make mdev alias unique among all mdevs
From: Yunsheng Lin @ 2019-08-29 12:31 UTC (permalink / raw)
  To: Parav Pandit, alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev
In-Reply-To: <20190829111904.16042-3-parav@mellanox.com>

On 2019/8/29 19:19, Parav Pandit wrote:
> Mdev alias should be unique among all the mdevs, so that when such alias
> is used by the mdev users to derive other objects, there is no
> collision in a given system.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> 
> ---
> Changelog:
> v1->v2:
>  - Moved alias NULL check at beginning
> v0->v1:
>  - Fixed inclusiong of alias for NULL check
>  - Added ratelimited debug print for sha1 hash collision error
> ---
>  drivers/vfio/mdev/mdev_core.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 3bdff0469607..c9bf2ac362b9 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -388,6 +388,13 @@ int mdev_device_create(struct kobject *kobj, struct device *dev,
>  			ret = -EEXIST;
>  			goto mdev_fail;
>  		}
> +		if (alias && tmp->alias && strcmp(alias, tmp->alias) == 0) {

!strcmp(alias, tmp->alias) is a more common kernel pattern.

Also if we limit max len of the alias in patch 1, maybe use that to limit the
string compare too.

> +			mutex_unlock(&mdev_list_lock);
> +			ret = -EEXIST;
> +			dev_dbg_ratelimited(dev, "Hash collision in alias creation for UUID %pUl\n",
> +					    uuid);
> +			goto mdev_fail;
> +		}
>  	}
>  
>  	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> 


^ permalink raw reply

* Re: [PATCH v2 1/6] mdev: Introduce sha1 based mdev alias
From: Yunsheng Lin @ 2019-08-29 12:26 UTC (permalink / raw)
  To: Parav Pandit, alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev
In-Reply-To: <20190829111904.16042-2-parav@mellanox.com>

On 2019/8/29 19:18, Parav Pandit wrote:
> Some vendor drivers want an identifier for an mdev device that is
> shorter than the UUID, due to length restrictions in the consumers of
> that identifier.
> 
> Add a callback that allows a vendor driver to request an alias of a
> specified length to be generated for an mdev device. If generated,
> that alias is checked for collisions.
> 
> It is an optional attribute.
> mdev alias is generated using sha1 from the mdev name.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> 
> ---
> Changelog:
> v1->v2:
>  - Kept mdev_device naturally aligned
>  - Added error checking for crypt_*() calls
>  - Corrected a typo from 'and' to 'an'
>  - Changed return type of generate_alias() from int to char*
> v0->v1:
>  - Moved alias length check outside of the parent lock
>  - Moved alias and digest allocation from kvzalloc to kzalloc
>  - &alias[0] changed to alias
>  - alias_length check is nested under get_alias_length callback check
>  - Changed comments to start with an empty line
>  - Fixed cleaunup of hash if mdev_bus_register() fails
>  - Added comment where alias memory ownership is handed over to mdev device
>  - Updated commit log to indicate motivation for this feature
> ---
>  drivers/vfio/mdev/mdev_core.c    | 123 ++++++++++++++++++++++++++++++-
>  drivers/vfio/mdev/mdev_private.h |   5 +-
>  drivers/vfio/mdev/mdev_sysfs.c   |  13 ++--
>  include/linux/mdev.h             |   4 +
>  4 files changed, 135 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index b558d4cfd082..3bdff0469607 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -10,9 +10,11 @@
>  #include <linux/module.h>
>  #include <linux/device.h>
>  #include <linux/slab.h>
> +#include <linux/mm.h>
>  #include <linux/uuid.h>
>  #include <linux/sysfs.h>
>  #include <linux/mdev.h>
> +#include <crypto/hash.h>
>  
>  #include "mdev_private.h"
>  
> @@ -27,6 +29,8 @@ static struct class_compat *mdev_bus_compat_class;
>  static LIST_HEAD(mdev_list);
>  static DEFINE_MUTEX(mdev_list_lock);
>  
> +static struct crypto_shash *alias_hash;
> +
>  struct device *mdev_parent_dev(struct mdev_device *mdev)
>  {
>  	return mdev->parent->dev;
> @@ -150,6 +154,16 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
>  		return -EINVAL;
>  
> +	if (ops->get_alias_length) {
> +		unsigned int digest_size;
> +		unsigned int aligned_len;
> +
> +		aligned_len = roundup(ops->get_alias_length(), 2);
> +		digest_size = crypto_shash_digestsize(alias_hash);
> +		if (aligned_len / 2 > digest_size)
> +			return -EINVAL;
> +	}
> +
>  	dev = get_device(dev);
>  	if (!dev)
>  		return -EINVAL;
> @@ -259,6 +273,7 @@ static void mdev_device_free(struct mdev_device *mdev)
>  	mutex_unlock(&mdev_list_lock);
>  
>  	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev->alias);
>  	kfree(mdev);
>  }
>  
> @@ -269,18 +284,101 @@ static void mdev_device_release(struct device *dev)
>  	mdev_device_free(mdev);
>  }
>  
> -int mdev_device_create(struct kobject *kobj,
> -		       struct device *dev, const guid_t *uuid)
> +static const char *
> +generate_alias(const char *uuid, unsigned int max_alias_len)
> +{
> +	struct shash_desc *hash_desc;
> +	unsigned int digest_size;
> +	unsigned char *digest;
> +	unsigned int alias_len;
> +	char *alias;
> +	int ret;
> +
> +	/*
> +	 * Align to multiple of 2 as bin2hex will generate
> +	 * even number of bytes.
> +	 */
> +	alias_len = roundup(max_alias_len, 2);
> +	alias = kzalloc(alias_len + 1, GFP_KERNEL);

It seems the mtty_alias_length in mtty.c can be set from module
parameter, and user can set a very large number, maybe limit the
max of the alias_len before calling kzalloc?

> +	if (!alias)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* Allocate and init descriptor */
> +	hash_desc = kvzalloc(sizeof(*hash_desc) +
> +			     crypto_shash_descsize(alias_hash),
> +			     GFP_KERNEL);
> +	if (!hash_desc) {
> +		ret = -ENOMEM;
> +		goto desc_err;
> +	}
> +
> +	hash_desc->tfm = alias_hash;
> +
> +	digest_size = crypto_shash_digestsize(alias_hash);
> +
> +	digest = kzalloc(digest_size, GFP_KERNEL);
> +	if (!digest) {
> +		ret = -ENOMEM;
> +		goto digest_err;
> +	}
> +	ret = crypto_shash_init(hash_desc);
> +	if (ret)
> +		goto hash_err;
> +
> +	ret = crypto_shash_update(hash_desc, uuid, UUID_STRING_LEN);
> +	if (ret)
> +		goto hash_err;
> +
> +	ret = crypto_shash_final(hash_desc, digest);
> +	if (ret)
> +		goto hash_err;
> +
> +	bin2hex(alias, digest, min_t(unsigned int, digest_size, alias_len / 2));
> +	/*
> +	 * When alias length is odd, zero out an additional last byte
> +	 * that bin2hex has copied.
> +	 */
> +	if (max_alias_len % 2)
> +		alias[max_alias_len] = 0;
> +
> +	kfree(digest);
> +	kvfree(hash_desc);
> +	return alias;
> +
> +hash_err:
> +	kfree(digest);
> +digest_err:
> +	kvfree(hash_desc);
> +desc_err:
> +	kfree(alias);
> +	return ERR_PTR(ret);
> +}
> +
> +int mdev_device_create(struct kobject *kobj, struct device *dev,
> +		       const char *uuid_str, const guid_t *uuid)
>  {
>  	int ret;
>  	struct mdev_device *mdev, *tmp;
>  	struct mdev_parent *parent;
>  	struct mdev_type *type = to_mdev_type(kobj);
> +	const char *alias = NULL;
>  
>  	parent = mdev_get_parent(type->parent);
>  	if (!parent)
>  		return -EINVAL;
>  
> +	if (parent->ops->get_alias_length) {
> +		unsigned int alias_len;
> +
> +		alias_len = parent->ops->get_alias_length();
> +		if (alias_len) {
> +			alias = generate_alias(uuid_str, alias_len);
> +			if (IS_ERR(alias)) {
> +				ret = PTR_ERR(alias);
> +				goto alias_fail;
> +			}
> +		}
> +	}
>  	mutex_lock(&mdev_list_lock);
>  
>  	/* Check for duplicate */
> @@ -300,6 +398,12 @@ int mdev_device_create(struct kobject *kobj,
>  	}
>  
>  	guid_copy(&mdev->uuid, uuid);
> +	mdev->alias = alias;
> +	/*
> +	 * At this point alias memory is owned by the mdev.
> +	 * Mark it NULL, so that only mdev can free it.
> +	 */
> +	alias = NULL;
>  	list_add(&mdev->next, &mdev_list);
>  	mutex_unlock(&mdev_list_lock);
>  
> @@ -346,6 +450,8 @@ int mdev_device_create(struct kobject *kobj,
>  	up_read(&parent->unreg_sem);
>  	put_device(&mdev->dev);
>  mdev_fail:
> +	kfree(alias);
> +alias_fail:
>  	mdev_put_parent(parent);
>  	return ret;
>  }
> @@ -406,7 +512,17 @@ EXPORT_SYMBOL(mdev_get_iommu_device);
>  
>  static int __init mdev_init(void)
>  {
> -	return mdev_bus_register();
> +	int ret;
> +
> +	alias_hash = crypto_alloc_shash("sha1", 0, 0);
> +	if (!alias_hash)
> +		return -ENOMEM;
> +
> +	ret = mdev_bus_register();
> +	if (ret)
> +		crypto_free_shash(alias_hash);
> +
> +	return ret;
>  }
>  
>  static void __exit mdev_exit(void)
> @@ -415,6 +531,7 @@ static void __exit mdev_exit(void)
>  		class_compat_unregister(mdev_bus_compat_class);
>  
>  	mdev_bus_unregister();
> +	crypto_free_shash(alias_hash);
>  }
>  
>  module_init(mdev_init)
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> index 7d922950caaf..078fdaf7836e 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -32,6 +32,7 @@ struct mdev_device {
>  	struct list_head next;
>  	struct kobject *type_kobj;
>  	struct device *iommu_device;
> +	const char *alias;
>  	bool active;
>  };
>  
> @@ -57,8 +58,8 @@ void parent_remove_sysfs_files(struct mdev_parent *parent);
>  int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type);
>  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
>  
> -int  mdev_device_create(struct kobject *kobj,
> -			struct device *dev, const guid_t *uuid);
> +int mdev_device_create(struct kobject *kobj, struct device *dev,
> +		       const char *uuid_str, const guid_t *uuid);
>  int  mdev_device_remove(struct device *dev);
>  
>  #endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index 7570c7602ab4..43afe0e80b76 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -63,15 +63,18 @@ static ssize_t create_store(struct kobject *kobj, struct device *dev,
>  		return -ENOMEM;
>  
>  	ret = guid_parse(str, &uuid);
> -	kfree(str);
>  	if (ret)
> -		return ret;
> +		goto err;
>  
> -	ret = mdev_device_create(kobj, dev, &uuid);
> +	ret = mdev_device_create(kobj, dev, str, &uuid);
>  	if (ret)
> -		return ret;
> +		goto err;
>  
> -	return count;
> +	ret = count;
> +
> +err:
> +	kfree(str);
> +	return ret;
>  }
>  
>  MDEV_TYPE_ATTR_WO(create);
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> index 0ce30ca78db0..f036fe9854ee 100644
> --- a/include/linux/mdev.h
> +++ b/include/linux/mdev.h
> @@ -72,6 +72,9 @@ struct device *mdev_get_iommu_device(struct device *dev);
>   * @mmap:		mmap callback
>   *			@mdev: mediated device structure
>   *			@vma: vma structure
> + * @get_alias_length:	Generate alias for the mdevs of this parent based on the
> + *			mdev device name when it returns non zero alias length.
> + *			It is optional.
>   * Parent device that support mediated device should be registered with mdev
>   * module with mdev_parent_ops structure.
>   **/
> @@ -92,6 +95,7 @@ struct mdev_parent_ops {
>  	long	(*ioctl)(struct mdev_device *mdev, unsigned int cmd,
>  			 unsigned long arg);
>  	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
> +	unsigned int (*get_alias_length)(void);
>  };
>  
>  /* interface for exporting mdev supported type attributes */
> 


^ permalink raw reply

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
From: Shmulik Ladkani @ 2019-08-29 12:22 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov,
	Yonghong Song, Steffen Klassert, shmulik, eyal
In-Reply-To: <88a3da53-fecc-0d8c-56dc-a4c3b0e11dfd@iogearbox.net>

On Tue, 27 Aug 2019 14:10:35 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> Given first point above wrt hitting rarely, it would be good to first get a
> better understanding for writing a reproducer. Back then Yonghong added one
> to the BPF kernel test suite [0], so it would be desirable to extend it for
> the case you're hitting. Given NAT64 use-case is needed and used by multiple
> parties, we should try to (fully) fix it generically.
> 

Thanks Daniel.

Managed to write a reproducer which mimics the skb we see on prodction,
that hits the exact same BUG_ON.

Submitted as a separate RFC PATCH to bpf-next.
Tested on v5.0.y (and fwd ported to net-next for submission).

Daniel, please use this reproducer.

Do note that the test assigns:

+	skb_shinfo(skb[0])->gso_size = 1288;

which is the *mangled* gso_size value, to mimic the works of
bpf_skb_proto_4_to_6().

When setting 'gso_size = 1288 + 20' (the *original* gso_size of the
GROed skb prior bpf_skb_proto_4_to_6), the test passes successfully and
we don't hit the mentioned BUG_ON.

Best,
Shmulik

^ permalink raw reply

* Re: [PATCH v3 1/2] net: core: Notify on changes to dev->promiscuity.
From: Jiri Pirko @ 2019-08-29 12:18 UTC (permalink / raw)
  To: Horatiu Vultur
  Cc: alexandre.belloni, UNGLinuxDriver, davem, andrew, allan.nielsen,
	ivecera, f.fainelli, netdev, linux-kernel
In-Reply-To: <20190829105650.btgvytgja63sx6wx@soft-dev3.microsemi.net>

Thu, Aug 29, 2019 at 12:56:54PM CEST, horatiu.vultur@microchip.com wrote:
>The 08/29/2019 11:51, Jiri Pirko wrote:
>> External E-Mail
>> 
>> 
>> Thu, Aug 29, 2019 at 11:22:28AM CEST, horatiu.vultur@microchip.com wrote:
>> >Add the SWITCHDEV_ATTR_ID_PORT_PROMISCUITY switchdev notification type,
>> >used to indicate whenever the dev promiscuity counter is changed.
>> >
>> >The notification doesn't use any switchdev_attr attribute because in the
>> >notifier callbacks is it possible to get the dev and read directly
>> >the promiscuity value.
>> >
>> >Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
>> >---
>> > include/net/switchdev.h | 1 +
>> > net/core/dev.c          | 9 +++++++++
>> > 2 files changed, 10 insertions(+)
>> >
>> >diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>> >index aee86a1..14b1617 100644
>> >--- a/include/net/switchdev.h
>> >+++ b/include/net/switchdev.h
>> >@@ -40,6 +40,7 @@ enum switchdev_attr_id {
>> > 	SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING,
>> > 	SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED,
>> > 	SWITCHDEV_ATTR_ID_BRIDGE_MROUTER,
>> >+	SWITCHDEV_ATTR_ID_PORT_PROMISCUITY,
>> > };
>> > 
>> > struct switchdev_attr {
>> >diff --git a/net/core/dev.c b/net/core/dev.c
>> >index 49589ed..40c74f2 100644
>> >--- a/net/core/dev.c
>> >+++ b/net/core/dev.c
>> >@@ -142,6 +142,7 @@
>> > #include <linux/net_namespace.h>
>> > #include <linux/indirect_call_wrapper.h>
>> > #include <net/devlink.h>
>> >+#include <net/switchdev.h>
>> > 
>> > #include "net-sysfs.h"
>> > 
>> >@@ -7377,6 +7378,11 @@ static void dev_change_rx_flags(struct net_device *dev, int flags)
>> > static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
>> > {
>> > 	unsigned int old_flags = dev->flags;
>> >+	struct switchdev_attr attr = {
>> >+		.orig_dev = dev,
>> >+		.id = SWITCHDEV_ATTR_ID_PORT_PROMISCUITY,
>> >+		.flags = SWITCHDEV_F_DEFER,
>> 
>
>Hi Jiri,
>
>> NACK
>> 
>> This is invalid usecase for switchdev infra. Switchdev is there for
>> bridge offload purposes only.
>> 
>> For promiscuity changes, the infrastructure is already present in the
>> code. See __dev_notify_flags(). it calls:
>> call_netdevice_notifiers_info(NETDEV_CHANGE, &change_info.info)
>> and you can actually see the changed flag in ".flags_changed".
>Yes, you are right. But in case the port is part of a bridge and then
>enable promisc mode by a user space application(tpcdump) then the drivers

If the promisc is on, it is on. Why do you need to know who enabled it?

Or do you want to use this to ask driver to ask hw to trap packets to
kernel? If yes, I don't think it is correct. If you want to "steal" some
packets from the hw forwarding datapath, use TC action "trap".


>will not be notified. The reason is that the dev->flags will still be the
>same(because IFF_PROMISC was already set) and only promiscuity will be
>changed.
>
>One fix could be to call __dev_notify_flags() no matter when the
>promisc is enable or disabled.
>
>> 
>> You just have to register netdev notifier block in your driver. Grep
>> for: register_netdevice_notifier
>> 
>> 
>> >+	};
>> > 	kuid_t uid;
>> > 	kgid_t gid;
>> > 
>> >@@ -7419,6 +7425,9 @@ static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
>> > 	}
>> > 	if (notify)
>> > 		__dev_notify_flags(dev, old_flags, IFF_PROMISC);
>> >+
>> >+	switchdev_port_attr_set(dev, &attr);
>> >+
>> > 	return 0;
>> > }
>> > 
>> >-- 
>> >2.7.4
>> >
>> 
>
>-- 
>/Horatiu

^ permalink raw reply

* Re: KASAN: use-after-free Read in rxrpc_put_peer
From: David Howells @ 2019-08-29 12:17 UTC (permalink / raw)
  To: syzbot; +Cc: dhowells, davem, linux-afs, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <000000000000f6a13b059132aa6c@google.com>

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git 48b9e92aeb3c2b0df3454faf9024f6ca611d65a3

^ permalink raw reply

* [RFC PATCH bpf-next 2/2] test_bpf: Introduce 'gso_linear_no_head_frag' skb_segment test
From: Shmulik Ladkani @ 2019-08-29 12:13 UTC (permalink / raw)
  To: Daniel Borkmann, Yonghong Song
  Cc: Eric Dumazet, Alexei Starovoitov, netdev, bpf, Shmulik Ladkani
In-Reply-To: <20190829121344.20675-1-shmulik.ladkani@gmail.com>

Following reports of skb_segment() hitting a BUG_ON when working on
GROed skbs which have their gso_size mangled (e.g. after a
bpf_skb_change_proto call), add a reproducer test that mimics the
input skbs that lead to the mentioned BUG_ON as in [1].

[1] https://lists.openwall.net/netdev/2019/08/26/110

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
---
 lib/test_bpf.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 5e80cb3d3ca0..2fe1e3ab3c89 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -6859,6 +6859,60 @@ static __init struct sk_buff *build_test_skb(void)
 	return NULL;
 }
 
+static __init struct sk_buff *build_test_skb_linear_no_head_frag(void)
+{
+	unsigned int alloc_size = 2000;
+	unsigned int headroom = 102, doffset = 72, data_size = 1308;
+	struct sk_buff *skb[2];
+	int i;
+
+	/* skbs linked in a frag_list, both with linear data, with head_frag=0
+	 * (data allocated by kmalloc), both have tcp data of 1308 bytes
+	 * (total payload is 2616 bytes).
+	 * Data offset is 72 bytes (40 ipv6 hdr, 32 tcp hdr). Some headroom.
+	 */
+	for (i = 0; i < 2; i++) {
+		skb[i] = alloc_skb(alloc_size, GFP_KERNEL);
+		if (!skb[i]) {
+			if (i == 0)
+				goto err_skb0;
+			else
+				goto err_skb1;
+		}
+
+		skb[i]->protocol = htons(ETH_P_IPV6);
+		skb_reserve(skb[i], headroom);
+		skb_put(skb[i], doffset + data_size);
+		skb_reset_network_header(skb[i]);
+		if (i == 0)
+			skb_reset_mac_header(skb[i]);
+		else
+			skb_set_mac_header(skb[i], -ETH_HLEN);
+		__skb_pull(skb[i], doffset);
+	}
+
+	/* setup shinfo.
+	 * mimic bpf_skb_proto_4_to_6, which resets gso_segs and assigns a
+	 * reduced gso_size.
+	 */
+	skb_shinfo(skb[0])->gso_size = 1288;
+	skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV6 | SKB_GSO_DODGY;
+	skb_shinfo(skb[0])->gso_segs = 0;
+	skb_shinfo(skb[0])->frag_list = skb[1];
+
+	/* adjust skb[0]'s len */
+	skb[0]->len += skb[1]->len;
+	skb[0]->data_len += skb[1]->len;
+	skb[0]->truesize += skb[1]->truesize;
+
+	return skb[0];
+
+err_skb1:
+	kfree_skb(skb[0]);
+err_skb0:
+	return NULL;
+}
+
 struct skb_segment_test {
 	const char *descr;
 	struct sk_buff *(*build_skb)(void);
@@ -6871,6 +6925,15 @@ static struct skb_segment_test skb_segment_tests[] __initconst = {
 		.build_skb = build_test_skb,
 		.features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
 			    NETIF_F_IPV6_CSUM | NETIF_F_RXCSUM
+	},
+	{
+		.descr = "gso_linear_no_head_frag",
+		.build_skb = build_test_skb_linear_no_head_frag,
+		.features = NETIF_F_SG | NETIF_F_FRAGLIST |
+			    NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_GSO |
+			    NETIF_F_LLTX_BIT | NETIF_F_GRO |
+			    NETIF_F_IPV6_CSUM | NETIF_F_RXCSUM |
+			    NETIF_F_HW_VLAN_STAG_TX_BIT
 	}
 };
 
-- 
2.19.1


^ permalink raw reply related

* [PATCH bpf-next 1/2] test_bpf: Refactor test_skb_segment() to allow testing skb_segment() on numerous different skbs
From: Shmulik Ladkani @ 2019-08-29 12:13 UTC (permalink / raw)
  To: Daniel Borkmann, Yonghong Song
  Cc: Eric Dumazet, Alexei Starovoitov, netdev, bpf, Shmulik Ladkani
In-Reply-To: <20190829121344.20675-1-shmulik.ladkani@gmail.com>

Currently, test_skb_segment() builds a single test skb and runs
skb_segment() on it.

Extend test_skb_segment() so it processes an array of numerous
skb/feature pairs to test.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
---
 lib/test_bpf.c | 51 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 41 insertions(+), 10 deletions(-)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index c41705835cba..5e80cb3d3ca0 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -6859,34 +6859,65 @@ static __init struct sk_buff *build_test_skb(void)
 	return NULL;
 }
 
-static __init int test_skb_segment(void)
-{
+struct skb_segment_test {
+	const char *descr;
+	struct sk_buff *(*build_skb)(void);
 	netdev_features_t features;
+};
+
+static struct skb_segment_test skb_segment_tests[] __initconst = {
+	{
+		.descr = "gso_with_rx_frags",
+		.build_skb = build_test_skb,
+		.features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
+			    NETIF_F_IPV6_CSUM | NETIF_F_RXCSUM
+	}
+};
+
+static __init int test_skb_segment_single(const struct skb_segment_test *test)
+{
 	struct sk_buff *skb, *segs;
 	int ret = -1;
 
-	features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
-		   NETIF_F_IPV6_CSUM;
-	features |= NETIF_F_RXCSUM;
-	skb = build_test_skb();
+	skb = test->build_skb();
 	if (!skb) {
 		pr_info("%s: failed to build_test_skb", __func__);
 		goto done;
 	}
 
-	segs = skb_segment(skb, features);
+	segs = skb_segment(skb, test->features);
 	if (!IS_ERR(segs)) {
 		kfree_skb_list(segs);
 		ret = 0;
-		pr_info("%s: success in skb_segment!", __func__);
-	} else {
-		pr_info("%s: failed in skb_segment!", __func__);
 	}
 	kfree_skb(skb);
 done:
 	return ret;
 }
 
+static __init int test_skb_segment(void)
+{
+	int i, err_cnt = 0, pass_cnt = 0;
+
+	for (i = 0; i < ARRAY_SIZE(skb_segment_tests); i++) {
+		const struct skb_segment_test *test = &skb_segment_tests[i];
+
+		pr_info("#%d %s ", i, test->descr);
+
+		if (test_skb_segment_single(test)) {
+			pr_cont("FAIL\n");
+			err_cnt++;
+		} else {
+			pr_cont("PASS\n");
+			pass_cnt++;
+		}
+	}
+
+	pr_info("%s: Summary: %d PASSED, %d FAILED\n", __func__,
+		pass_cnt, err_cnt);
+	return err_cnt ? -EINVAL : 0;
+}
+
 static __init int test_bpf(void)
 {
 	int i, err_cnt = 0, pass_cnt = 0;
-- 
2.19.1


^ permalink raw reply related

* [RFC PATCH bpf-next 0/2] test_bpf: Add an skb_segment test for a linear head_frag=0 skb whose gso_size was mangled
From: Shmulik Ladkani @ 2019-08-29 12:13 UTC (permalink / raw)
  To: Daniel Borkmann, Yonghong Song
  Cc: Eric Dumazet, Alexei Starovoitov, netdev, bpf, Shmulik Ladkani

Following reports of skb_segment() hitting a BUG_ON when working on
GROed skbs which have their gso_size mangled (e.g. after a
bpf_skb_change_proto call), add a reproducer test that mimics the
input skbs that lead to the mentioned BUG_ON as in [1].

Currently, this new test *INDEED* hits the BUG_ON, therefore the final
patch is to be applied only after skb_segment is fixed.

When changing the gso_size of the test skb to its original value (+20
bytes) then the test passes successfully.

[1] https://lists.openwall.net/netdev/2019/08/26/110

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Yonghong Song <yhs@fb.com> 

Shmulik Ladkani (2):
  test_bpf: Refactor test_skb_segment() to allow testing skb_segment()
    on numerous different skbs
  test_bpf: Introduce 'gso_linear_no_head_frag' skb_segment test

 lib/test_bpf.c | 112 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 103 insertions(+), 9 deletions(-)

-- 
2.19.1


^ permalink raw reply

* Re: KASAN: use-after-free Read in rxrpc_send_keepalive
From: syzbot @ 2019-08-29 12:10 UTC (permalink / raw)
  To: davem, dhowells, linux-afs, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <000000000000e695c1058fb26925@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    ed2393ca Add linux-next specific files for 20190827
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=156adb1e600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=2ef5940a07ed45f4
dashboard link: https://syzkaller.appspot.com/bug?extid=d850c266e3df14da1d31
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=167ab582600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+d850c266e3df14da1d31@syzkaller.appspotmail.com

IPv6: ADDRCONF(NETDEV_CHANGE): hsr0: link becomes ready
==================================================================
BUG: KASAN: use-after-free in rxrpc_send_keepalive+0x8a2/0x940  
net/rxrpc/output.c:634
Read of size 8 at addr ffff888086b01218 by task kworker/0:1/12

CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.3.0-rc6-next-20190827 #74
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: krxrpcd rxrpc_peer_keepalive_worker
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
  __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
  kasan_report+0x12/0x20 mm/kasan/common.c:634
  __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
  rxrpc_send_keepalive+0x8a2/0x940 net/rxrpc/output.c:634
  rxrpc_peer_keepalive_dispatch net/rxrpc/peer_event.c:369 [inline]
  rxrpc_peer_keepalive_worker+0x7be/0xd02 net/rxrpc/peer_event.c:430
  process_one_work+0x9af/0x1740 kernel/workqueue.c:2269
  worker_thread+0x98/0xe40 kernel/workqueue.c:2415
  kthread+0x361/0x430 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

Allocated by task 8741:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  __kasan_kmalloc mm/kasan/common.c:510 [inline]
  __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:483
  kasan_kmalloc+0x9/0x10 mm/kasan/common.c:524
  __do_kmalloc mm/slab.c:3655 [inline]
  __kmalloc+0x163/0x770 mm/slab.c:3664
  kmalloc_array include/linux/slab.h:614 [inline]
  kcalloc include/linux/slab.h:625 [inline]
  alloc_pipe_info+0x199/0x420 fs/pipe.c:676
  get_pipe_inode fs/pipe.c:738 [inline]
  create_pipe_files+0x8e/0x730 fs/pipe.c:770
  __do_pipe_flags+0x48/0x250 fs/pipe.c:807
  do_pipe2+0x84/0x160 fs/pipe.c:855
  __do_sys_pipe2 fs/pipe.c:873 [inline]
  __se_sys_pipe2 fs/pipe.c:871 [inline]
  __x64_sys_pipe2+0x54/0x80 fs/pipe.c:871
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 8741:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  kasan_set_free_info mm/kasan/common.c:332 [inline]
  __kasan_slab_free+0x102/0x150 mm/kasan/common.c:471
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:480
  __cache_free mm/slab.c:3425 [inline]
  kfree+0x10a/0x2c0 mm/slab.c:3756
  free_pipe_info+0x243/0x300 fs/pipe.c:709
  put_pipe_info+0xd0/0xf0 fs/pipe.c:582
  pipe_release+0x1e6/0x280 fs/pipe.c:603
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:188 [inline]
  exit_to_usermode_loop+0x316/0x380 arch/x86/entry/common.c:163
  prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
  do_syscall_64+0x65f/0x760 arch/x86/entry/common.c:300
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff888086b01200
  which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 24 bytes inside of
  1024-byte region [ffff888086b01200, ffff888086b01600)
The buggy address belongs to the page:
page:ffffea00021ac000 refcount:1 mapcount:0 mapping:ffff8880aa400c40  
index:0xffff888086b00480 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea00027b5588 ffffea00028e3808 ffff8880aa400c40
raw: ffff888086b00480 ffff888086b00000 0000000100000003 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff888086b01100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff888086b01180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ffff888086b01200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
  ffff888086b01280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff888086b01300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


^ permalink raw reply

* Re: [PATCH v2 net-next 2/2] net: dsa: tag_8021q: Restore bridge VLANs when enabling vlan_filtering
From: Vladimir Oltean @ 2019-08-29 11:50 UTC (permalink / raw)
  To: Florian Fainelli, Vivien Didelot, Andrew Lunn, Ido Schimmel,
	Roopa Prabhu, nikolay, David S. Miller
  Cc: netdev
In-Reply-To: <20190825184454.14678-3-olteanv@gmail.com>

Vivien,

On Sun, 25 Aug 2019 at 21:46, Vladimir Oltean <olteanv@gmail.com> wrote:
>
> The bridge core assumes that enabling/disabling vlan_filtering will
> translate into the simple toggling of a flag for switchdev drivers.
>
> That is clearly not the case for sja1105, which alters the VLAN table
> and the pvids in order to obtain port separation in standalone mode.
>
> There are 2 parts to the issue.
>
> First, tag_8021q changes the pvid to a unique per-port rx_vid for frame
> identification. But we need to disable tag_8021q when vlan_filtering
> kicks in, and at that point, the VLAN configured as pvid will have to be
> removed from the filtering table of the ports. With an invalid pvid, the
> ports will drop all traffic.  Since the bridge will not call any vlan
> operation through switchdev after enabling vlan_filtering, we need to
> ensure we're in a functional state ourselves. Hence read the pvid that
> the bridge is aware of, and program that into our ports.
>
> Secondly, tag_8021q uses the 1024-3071 range privately in
> vlan_filtering=0 mode. Had the user installed one of these VLANs during
> a previous vlan_filtering=1 session, then upon the next tag_8021q
> cleanup for vlan_filtering to kick in again, VLANs in that range will
> get deleted unconditionally, hence breaking user expectation. So when
> deleting the VLANs, check if the bridge had knowledge about them, and if
> it did, re-apply the settings. Wrap this logic inside a
> dsa_8021q_vid_apply helper function to reduce code duplication.
>
> Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
> ---
>  net/dsa/tag_8021q.c | 91 +++++++++++++++++++++++++++++++++++----------
>  1 file changed, 71 insertions(+), 20 deletions(-)
>
> diff --git a/net/dsa/tag_8021q.c b/net/dsa/tag_8021q.c
> index 67a1bc635a7b..81f943e365b9 100644
> --- a/net/dsa/tag_8021q.c
> +++ b/net/dsa/tag_8021q.c
> @@ -93,6 +93,68 @@ int dsa_8021q_rx_source_port(u16 vid)
>  }
>  EXPORT_SYMBOL_GPL(dsa_8021q_rx_source_port);
>
> +static int dsa_8021q_restore_pvid(struct dsa_switch *ds, int port)
> +{
> +       struct bridge_vlan_info vinfo;
> +       struct net_device *slave;
> +       u16 pvid;
> +       int err;
> +
> +       if (!dsa_is_user_port(ds, port))
> +               return 0;
> +
> +       slave = ds->ports[port].slave;
> +
> +       err = br_vlan_get_pvid(slave, &pvid);
> +       if (err < 0) {
> +               dev_err(ds->dev, "Couldn't determine bridge PVID\n");
> +               return err;
> +       }
> +
> +       err = br_vlan_get_info(slave, pvid, &vinfo);
> +       if (err < 0) {
> +               dev_err(ds->dev, "Couldn't determine PVID attributes\n");
> +               return err;
> +       }
> +
> +       return dsa_port_vid_add(&ds->ports[port], pvid, vinfo.flags);

If the bridge had installed a dsa_8021q VLAN here, I need to use the
dsa_slave_vid_add logic to restore it. The dsa_8021q flags on the CPU
port are "ingress tagged", but that may not be the case for the bridge
VLAN.
Should I expose dsa_slave_vlan_add in dsa_priv.h, or should I just
open-code another dsa_port_vid_add for dp->cpu_dp, duplicating a bit
of code from dsa_slave_vlan_add?

> +}
> +
> +/* If @enabled is true, installs @vid with @flags into the switch port's HW
> + * filter.
> + * If @enabled is false, deletes @vid (ignores @flags) from the port. Had the
> + * user explicitly configured this @vid through the bridge core, then the @vid
> + * is installed again, but this time with the flags from the bridge layer.
> + */
> +static int dsa_8021q_vid_apply(struct dsa_switch *ds, int port, u16 vid,
> +                              u16 flags, bool enabled)
> +{
> +       struct dsa_port *dp = &ds->ports[port];
> +       struct bridge_vlan_info vinfo;
> +       int err;
> +
> +       if (enabled)
> +               return dsa_port_vid_add(dp, vid, flags);
> +
> +       err = dsa_port_vid_del(dp, vid);
> +       if (err < 0)
> +               return err;
> +
> +       /* Nothing to restore from the bridge for a non-user port */
> +       if (!dsa_is_user_port(ds, port))
> +               return 0;
> +
> +       err = br_vlan_get_info(dp->slave, vid, &vinfo);
> +       /* Couldn't determine bridge attributes for this vid,
> +        * it means the bridge had not configured it.
> +        */
> +       if (err < 0)
> +               return 0;
> +
> +       /* Restore the VID from the bridge */
> +       return dsa_port_vid_add(dp, vid, vinfo.flags);
> +}
> +
>  /* RX VLAN tagging (left) and TX VLAN tagging (right) setup shown for a single
>   * front-panel switch port (here swp0).
>   *
> @@ -148,8 +210,6 @@ EXPORT_SYMBOL_GPL(dsa_8021q_rx_source_port);
>  int dsa_port_setup_8021q_tagging(struct dsa_switch *ds, int port, bool enabled)
>  {
>         int upstream = dsa_upstream_port(ds, port);
> -       struct dsa_port *dp = &ds->ports[port];
> -       struct dsa_port *upstream_dp = &ds->ports[upstream];
>         u16 rx_vid = dsa_8021q_rx_vid(ds, port);
>         u16 tx_vid = dsa_8021q_tx_vid(ds, port);
>         int i, err;
> @@ -166,7 +226,6 @@ int dsa_port_setup_8021q_tagging(struct dsa_switch *ds, int port, bool enabled)
>          * restrictions, so there are no concerns about leaking traffic.
>          */
>         for (i = 0; i < ds->num_ports; i++) {
> -               struct dsa_port *other_dp = &ds->ports[i];
>                 u16 flags;
>
>                 if (i == upstream)
> @@ -179,10 +238,7 @@ int dsa_port_setup_8021q_tagging(struct dsa_switch *ds, int port, bool enabled)
>                         /* The RX VID is a regular VLAN on all others */
>                         flags = BRIDGE_VLAN_INFO_UNTAGGED;
>
> -               if (enabled)
> -                       err = dsa_port_vid_add(other_dp, rx_vid, flags);
> -               else
> -                       err = dsa_port_vid_del(other_dp, rx_vid);
> +               err = dsa_8021q_vid_apply(ds, i, rx_vid, flags, enabled);
>                 if (err) {
>                         dev_err(ds->dev, "Failed to apply RX VID %d to port %d: %d\n",
>                                 rx_vid, port, err);
> @@ -193,10 +249,7 @@ int dsa_port_setup_8021q_tagging(struct dsa_switch *ds, int port, bool enabled)
>         /* CPU port needs to see this port's RX VID
>          * as tagged egress.
>          */
> -       if (enabled)
> -               err = dsa_port_vid_add(upstream_dp, rx_vid, 0);
> -       else
> -               err = dsa_port_vid_del(upstream_dp, rx_vid);
> +       err = dsa_8021q_vid_apply(ds, upstream, rx_vid, 0, enabled);
>         if (err) {
>                 dev_err(ds->dev, "Failed to apply RX VID %d to port %d: %d\n",
>                         rx_vid, port, err);
> @@ -204,26 +257,24 @@ int dsa_port_setup_8021q_tagging(struct dsa_switch *ds, int port, bool enabled)
>         }
>
>         /* Finally apply the TX VID on this port and on the CPU port */
> -       if (enabled)
> -               err = dsa_port_vid_add(dp, tx_vid, BRIDGE_VLAN_INFO_UNTAGGED);
> -       else
> -               err = dsa_port_vid_del(dp, tx_vid);
> +       err = dsa_8021q_vid_apply(ds, port, tx_vid, BRIDGE_VLAN_INFO_UNTAGGED,
> +                                 enabled);
>         if (err) {
>                 dev_err(ds->dev, "Failed to apply TX VID %d on port %d: %d\n",
>                         tx_vid, port, err);
>                 return err;
>         }
> -       if (enabled)
> -               err = dsa_port_vid_add(upstream_dp, tx_vid, 0);
> -       else
> -               err = dsa_port_vid_del(upstream_dp, tx_vid);
> +       err = dsa_8021q_vid_apply(ds, upstream, tx_vid, 0, enabled);
>         if (err) {
>                 dev_err(ds->dev, "Failed to apply TX VID %d on port %d: %d\n",
>                         tx_vid, upstream, err);
>                 return err;
>         }
>
> -       return 0;
> +       if (!enabled)
> +               err = dsa_8021q_restore_pvid(ds, port);
> +
> +       return err;
>  }
>  EXPORT_SYMBOL_GPL(dsa_port_setup_8021q_tagging);
>
> --
> 2.17.1
>

Thanks,
-Vladimir

^ permalink raw reply

* RE: [PATCH v2] bonding: force enable lacp port after link state recovery for 802.3ad
From: zhangsha (A) @ 2019-08-29 11:33 UTC (permalink / raw)
  To: David Miller
  Cc: j.vosburgh@gmail.com, vfalico@gmail.com, andy@greyhouse.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org, yuehaibing,
	hunongda, Chenzhendong (alex)
In-Reply-To: <20190827.150456.509211205582645335.davem@davemloft.net>



> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: 2019年8月28日 6:05
> To: zhangsha (A) <zhangsha.zhang@huawei.com>
> Cc: j.vosburgh@gmail.com; vfalico@gmail.com; andy@greyhouse.net;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; yuehaibing
> <yuehaibing@huawei.com>; hunongda <hunongda@huawei.com>;
> Chenzhendong (alex) <alex.chen@huawei.com>
> Subject: Re: [PATCH v2] bonding: force enable lacp port after link state
> recovery for 802.3ad
> 
> From: <zhangsha.zhang@huawei.com>
> Date: Fri, 23 Aug 2019 11:42:09 +0800
> 
> > - If speed/duplex getting failed here, the link status
> >   will be changed to BOND_LINK_FAIL;
> 
> How does it fail at this step?  I suspect this is a driver specific problem.

Hi, David,
I'm really sorry for the delayed email and appreciated for your feedback.

I was testing in kernel 4.19 with a Huawei hinic card when the problem occurred.
I checked the dmesg and got the logs in the following order:
1) link status definitely down for interface eth6, disabling it
2) link status up again after 0 ms for interface eth6
3) the paterner's system mac becomes to "00:00:00:00:00:00".
By  reading the codes, I think that the link status of the slave should be changed
to BOND_LINK_FAIL from BOND_LINK_DOWN. 

As this problem has only occurred once only, I am not very sure about whether this is a
driver specific problem or not at the moment. But I find the commit 4d2c0cda, 
its log says " Some NIC drivers don't have correct speed/duplex settings at the
time they send NETDEV_UP notification ...",  so I prefer to believe it's not.

To my problem I think  it is not enough that link-monitoring (miimon) only set
SPEED/DUPLEX right, the lacp port should be enabled too at the same time.
 



^ permalink raw reply

* WARNING in __mark_chain_precision (2)
From: syzbot @ 2019-08-29 11:28 UTC (permalink / raw)
  To: ast, bpf, clang-built-linux, daniel, davem, hawk, jakub.kicinski,
	john.fastabend, kafai, linux-kernel, netdev, songliubraving,
	syzkaller-bugs, yhs

Hello,

syzbot found the following crash on:

HEAD commit:    47ee6e86 selftests/bpf: remove wrong nhoff in flow dissect..
git tree:       bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=16227fa6600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=d4cf1ffb87d590d7
dashboard link: https://syzkaller.appspot.com/bug?extid=c8d66267fd2b5955287e
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10d26ebc600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=127805ca600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+c8d66267fd2b5955287e@syzkaller.appspotmail.com

------------[ cut here ]------------
verifier backtracking bug
WARNING: CPU: 0 PID: 8795 at kernel/bpf/verifier.c:1782  
__mark_chain_precision+0x197a/0x1ea0 kernel/bpf/verifier.c:1782
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 8795 Comm: syz-executor439 Not tainted 5.3.0-rc3+ #101
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2dc/0x755 kernel/panic.c:219
  __warn.cold+0x20/0x4c kernel/panic.c:576
  report_bug+0x263/0x2b0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028
RIP: 0010:__mark_chain_precision+0x197a/0x1ea0 kernel/bpf/verifier.c:1782
Code: 08 31 ff 89 de e8 a6 9e f2 ff 84 db 0f 85 07 ff ff ff e8 59 9d f2 ff  
48 c7 c7 a0 a7 91 87 c6 05 3d c6 21 08 01 e8 1e 10 c4 ff <0f> 0b 41 bc f2  
ff ff ff e9 e8 fe ff ff 48 8b bd d8 fe ff ff e8 cd
RSP: 0018:ffff88808609f380 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815c3ba6 RDI: ffffed1010c13e62
RBP: ffff88808609f4d0 R08: ffff8880a97940c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: ffff8880985b6cc0 R14: ffff88808e538e80 R15: ffff88808609f468
  mark_chain_precision kernel/bpf/verifier.c:1819 [inline]
  check_cond_jmp_op+0xcd8/0x3c30 kernel/bpf/verifier.c:5841
  do_check+0x63cd/0x8a30 kernel/bpf/verifier.c:7783
  bpf_check+0x6fff/0x99b2 kernel/bpf/verifier.c:9297
  bpf_prog_load+0xe68/0x1670 kernel/bpf/syscall.c:1709
  __do_sys_bpf+0xa44/0x3350 kernel/bpf/syscall.c:2860
  __se_sys_bpf kernel/bpf/syscall.c:2819 [inline]
  __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:2819
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4402a9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc52f10618 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004402a9
RDX: 0000000000000048 RSI: 0000000020000200 RDI: 0000000000000005
RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401b30
R13: 0000000000401bc0 R14: 0000000000000000 R15: 0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* KASAN: use-after-free Read in nr_release (2)
From: syzbot @ 2019-08-29 11:28 UTC (permalink / raw)
  To: davem, linux-hams, linux-kernel, netdev, ralf, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    6525771f Merge tag 'arc-5.3-rc7' of git://git.kernel.org/p..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=11f44d82600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=2a6a2b9826fdadf9
dashboard link: https://syzkaller.appspot.com/bug?extid=fd05016a0b263a41eb33
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16c627ca600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+fd05016a0b263a41eb33@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in atomic_read  
include/asm-generic/atomic-instrumented.h:26 [inline]
BUG: KASAN: use-after-free in refcount_inc_not_zero_checked+0x81/0x200  
lib/refcount.c:123
Read of size 4 at addr ffff888094849540 by task syz-executor.2/9817

CPU: 1 PID: 9817 Comm: syz-executor.2 Not tainted 5.3.0-rc6+ #128
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351
  __kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482
  kasan_report+0x12/0x17 mm/kasan/common.c:618
  check_memory_region_inline mm/kasan/generic.c:185 [inline]
  check_memory_region+0x134/0x1a0 mm/kasan/generic.c:192
  __kasan_check_read+0x11/0x20 mm/kasan/common.c:92
  atomic_read include/asm-generic/atomic-instrumented.h:26 [inline]
  refcount_inc_not_zero_checked+0x81/0x200 lib/refcount.c:123
  refcount_inc_checked+0x17/0x70 lib/refcount.c:156
  sock_hold include/net/sock.h:649 [inline]
  nr_release+0x62/0x3e0 net/netrom/af_netrom.c:520
  __sock_release+0xce/0x280 net/socket.c:590
  sock_close+0x1e/0x30 net/socket.c:1268
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:188 [inline]
  exit_to_usermode_loop+0x316/0x380 arch/x86/entry/common.c:163
  prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
  do_syscall_64+0x5a9/0x6a0 arch/x86/entry/common.c:299
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x413561
Code: 75 14 b8 03 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 04 1b 00 00 c3 48  
83 ec 08 e8 0a fc ff ff 48 89 04 24 b8 03 00 00 00 0f 05 <48> 8b 3c 24 48  
89 c2 e8 53 fc ff ff 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00007ffcd6d6ea60 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000413561
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
RBP: 0000000000000001 R08: ffffffffffffffff R09: ffffffffffffffff
R10: 00007ffcd6d6eb40 R11: 0000000000000293 R12: 000000000075c9a0
R13: 000000000075c9a0 R14: 0000000000761aa0 R15: ffffffffffffffff

Allocated by task 9818:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  __kasan_kmalloc mm/kasan/common.c:493 [inline]
  __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:466
  kasan_kmalloc+0x9/0x10 mm/kasan/common.c:507
  __do_kmalloc mm/slab.c:3655 [inline]
  __kmalloc+0x163/0x770 mm/slab.c:3664
  kmalloc include/linux/slab.h:557 [inline]
  sk_prot_alloc+0x23a/0x310 net/core/sock.c:1603
  sk_alloc+0x39/0xf70 net/core/sock.c:1657
  nr_create+0xb9/0x5e0 net/netrom/af_netrom.c:433
  __sock_create+0x3d8/0x730 net/socket.c:1418
  sock_create net/socket.c:1469 [inline]
  __sys_socket+0x103/0x220 net/socket.c:1511
  __do_sys_socket net/socket.c:1520 [inline]
  __se_sys_socket net/socket.c:1518 [inline]
  __x64_sys_socket+0x73/0xb0 net/socket.c:1518
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 9817:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  __kasan_slab_free+0x102/0x150 mm/kasan/common.c:455
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:463
  __cache_free mm/slab.c:3425 [inline]
  kfree+0x10a/0x2c0 mm/slab.c:3756
  sk_prot_free net/core/sock.c:1640 [inline]
  __sk_destruct+0x4f7/0x6e0 net/core/sock.c:1726
  sk_destruct+0x86/0xa0 net/core/sock.c:1734
  __sk_free+0xfb/0x360 net/core/sock.c:1745
  sk_free+0x42/0x50 net/core/sock.c:1756
  sock_put include/net/sock.h:1725 [inline]
  nr_release+0x356/0x3e0 net/netrom/af_netrom.c:554
  __sock_release+0xce/0x280 net/socket.c:590
  sock_close+0x1e/0x30 net/socket.c:1268
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:188 [inline]
  exit_to_usermode_loop+0x316/0x380 arch/x86/entry/common.c:163
  prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
  do_syscall_64+0x5a9/0x6a0 arch/x86/entry/common.c:299
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8880948494c0
  which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 128 bytes inside of
  2048-byte region [ffff8880948494c0, ffff888094849cc0)
The buggy address belongs to the page:
page:ffffea0002521200 refcount:1 mapcount:0 mapping:ffff8880aa400e00  
index:0x0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea00024ff508 ffffea0002560088 ffff8880aa400e00
raw: 0000000000000000 ffff8880948483c0 0000000100000003 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff888094849400: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
  ffff888094849480: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> ffff888094849500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                            ^
  ffff888094849580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff888094849600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox