Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net v3] tcp: ensure to use the most recently sent skb when filling the rate sample
From: Eric Dumazet @ 2022-04-22 21:06 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Pengcheng Yang, Neal Cardwell, netdev, David S. Miller,
	Hideaki YOSHIFUJI, David Ahern, Paolo Abeni
In-Reply-To: <20220422133712.17eebbcb@kernel.org>

On Fri, Apr 22, 2022 at 1:37 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 20 Apr 2022 10:34:41 +0800 Pengcheng Yang wrote:
> > If an ACK (s)acks multiple skbs, we favor the information
> > from the most recently sent skb by choosing the skb with
> > the highest prior_delivered count. But in the interval
> > between receiving ACKs, we send multiple skbs with the same
> > prior_delivered, because the tp->delivered only changes
> > when we receive an ACK.
> >
> > We used RACK's solution, copying tcp_rack_sent_after() as
> > tcp_skb_sent_after() helper to determine "which packet was
> > sent last?". Later, we will use tcp_skb_sent_after() instead
> > in RACK.
> >
> > Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection")
> > Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Cc: Paolo Abeni <pabeni@redhat.com>
>
> Somehow this patch got marked as archived in patchwork. Reviving it now.
>
> Eric, Neal, ack?

Oops, right, thanks !

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: 5.10.4+ hang with 'rmmod nf_conntrack'
From: Ben Greear @ 2022-04-22 21:09 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev
In-Reply-To: <cdddd527-c25d-c581-cea7-d1f13a0d9948@candelatech.com>

On 4/22/22 9:32 AM, Ben Greear wrote:
> On 1/8/21 5:07 AM, Ben Greear wrote:
>> On 1/7/21 10:16 PM, Florian Westphal wrote:
>>> Ben Greear <greearb@candelatech.com> wrote:
>>>> I noticed my system has a hung process trying to 'rmmod nf_conntrack'.
>>>>
>>>> I've generally been doing the script that calls rmmod forever,
>>>> but only extensively tested on 5.4 kernel and earlier.
>>>>
>>>> If anyone has any ideas, please let me know.  This is from 'sysrq t'.  I don't see
>>>> any hung-task splats in dmesg.
>>>
>>> rmmod on conntrack loops forever until the active conntrack object count reaches 0.
>>> (plus a walk of the conntrack table to evict/put all entries).
> 
> Hello Florian,
> 
> I keep hitting this bug in a particular test case in 5.17.4+, so I added some debug to
> try to learn more.
> 
> My debugging patch looks like this:
> 
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 7552e1e9fd62..29724114caef 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -2543,6 +2543,7 @@ void nf_conntrack_cleanup_net_list(struct list_head *net_exit_list)
>   {
>          int busy;
>          struct net *net;
> +       unsigned long loops = 0;
> 
>          /*
>           * This makes sure all current packets have passed through
> @@ -2556,12 +2557,30 @@ void nf_conntrack_cleanup_net_list(struct list_head *net_exit_list)
>                  struct nf_conntrack_net *cnet = nf_ct_pernet(net);
> 
>                  nf_ct_iterate_cleanup(kill_all, net, 0, 0);
> -               if (atomic_read(&cnet->count) != 0)
> +               if (atomic_read(&cnet->count) != 0) {
> +                       if (loops > 50010)
> +                               pr_err("nf-conntrack-cleanup-net-list, loops: %ld  cnet-count: %d, expect-count: %d users4: %d users6: %d  users_bridge: %d\n",
> +                                      loops, atomic_read(&cnet->count), cnet->expect_count,
> +                                      cnet->users4, cnet->users6, cnet->users_bridge);
>                          busy = 1;
> +               }
>          }
>          if (busy) {
> +               loops++;
> +               if (loops > 50000) {
> +                       msleep(500);
> +               }
>                  schedule();
> -               goto i_see_dead_people;
> +               if (loops > 50020) {
> +                       /* This thing is wedged, going to require a reboot to recover, so attempt
> +                        * to just ignore the bad count and see if system works OK.
> +                        */
> +                       WARN_ON_ONCE(1);
> +                       pr_err("ERROR:  nf_conntrack_cleanup_net cannot make progress.  Ignoring stale reference count and will continue.\n");
> +               }
> +               else {
> +                       goto i_see_dead_people;
> +               }
>          }
> 
>          list_for_each_entry(net, net_exit_list, exit_list) {
> 
> 
> Do you (or anyone else), have some ideas for how to debug this further to help find where the reference
> is leaked (or not released)?

I am now quite sure that the problem I was seeing was caused by an skb leak in the mt76 driver
(for which Felix just found a solution).  After that fix, then I no longer see the nf_conntrack
rmmod hangs.  I will keep testing in case I am just geting (un)lucky.

I do plan to keep my hack/patch in my kernel though, I'd rather it continue and leak some more
memory instead of busy hang forever when skb leaks are hit...

Thanks,
Ben


^ permalink raw reply

* Re: [PATCH net-next 4/5] net: dt-bindings: Introduce the Qualcomm IPQESS Ethernet controller
From: Rob Herring @ 2022-04-22 21:10 UTC (permalink / raw)
  To: Maxime Chevallier
  Cc: Rob Herring, netdev, linux-arm-kernel, Vladimir Oltean,
	Robert Marko, Luka Perkov, Florian Fainelli, linux-kernel,
	Heiner Kallweit, Russell King, davem, Andrew Lunn,
	thomas.petazzoni, devicetree
In-Reply-To: <20220422180305.301882-5-maxime.chevallier@bootlin.com>

On Fri, 22 Apr 2022 20:03:04 +0200, Maxime Chevallier wrote:
> Add the DT binding for the IPQESS Ethernet Controller. This is a simple
> controller, only requiring the phy-mode, interrupts, clocks, and
> possibly a MAC address setting.
> 
> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
> ---
>  .../devicetree/bindings/net/qcom,ipqess.yaml  | 94 +++++++++++++++++++
>  1 file changed, 94 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/qcom,ipqess.yaml
> 

My bot found errors running 'make DT_CHECKER_FLAGS=-m dt_binding_check'
on your patch (DT_CHECKER_FLAGS is new in v5.13):

yamllint warnings/errors:

dtschema/dtc warnings/errors:
Error: Documentation/devicetree/bindings/net/qcom,ipqess.example.dts:26.27-28 syntax error
FATAL ERROR: Unable to parse input tree
make[1]: *** [scripts/Makefile.lib:364: Documentation/devicetree/bindings/net/qcom,ipqess.example.dtb] Error 1
make[1]: *** Waiting for unfinished jobs....
make: *** [Makefile:1401: dt_binding_check] Error 2

doc reference errors (make refcheckdocs):

See https://patchwork.ozlabs.org/patch/

This check can fail if there are any dependencies. The base for a patch
series is generally the most recent rc1.

If you already ran 'make dt_binding_check' and didn't see the above
error(s), then make sure 'yamllint' is installed and dt-schema is up to
date:

pip3 install dtschema --upgrade

Please check and re-submit.

^ permalink raw reply

* Re: [PATCH net-next RESEND v5 0/3] Add reset deassertion for Aspeed MDIO
From: Jakub Kicinski @ 2022-04-22 19:25 UTC (permalink / raw)
  To: Dylan Hung
  Cc: robh+dt, joel, andrew, andrew, hkallweit1, linux, davem, pabeni,
	p.zabel, devicetree, linux-arm-kernel, linux-aspeed, linux-kernel,
	netdev, krzk+dt, BMC-SW
In-Reply-To: <20220418014059.3054-1-dylan_hung@aspeedtech.com>

On Mon, 18 Apr 2022 09:40:56 +0800 Dylan Hung wrote:
> Add missing reset deassertion for Aspeed MDIO bus controller. The reset
> is asserted by the hardware when power-on so the driver only needs to
> deassert it. To be able to work with the old DT blobs, the reset is
> optional since it may be deasserted by the bootloader or the previous
> kernel.

still doesn't apply cleanly to net-next:

$ git checkout net-next/master 
HEAD is now at c78c5a660439 dt-bindings: net: mediatek,net: convert to the json-schema

$ git pw series apply 632891
Applying: dt-bindings: net: add reset property for aspeed, ast2600-mdio binding
Applying: net: mdio: add reset control for Aspeed MDIO
Using index info to reconstruct a base tree...
M	drivers/net/mdio/mdio-aspeed.c
Falling back to patching base and 3-way merge...
Auto-merging drivers/net/mdio/mdio-aspeed.c
Applying: ARM: dts: aspeed: add reset properties into MDIO nodes

^ permalink raw reply

* Re: [PATCH net-next 0/5] net: ipqess: introduce Qualcomm IPQESS driver
From: Andrew Lunn @ 2022-04-22 20:29 UTC (permalink / raw)
  To: Maxime Chevallier
  Cc: davem, Rob Herring, netdev, linux-kernel, devicetree,
	thomas.petazzoni, Florian Fainelli, Heiner Kallweit, Russell King,
	linux-arm-kernel, Vladimir Oltean, Luka Perkov, Robert Marko
In-Reply-To: <20220422180305.301882-1-maxime.chevallier@bootlin.com>

On Fri, Apr 22, 2022 at 08:03:00PM +0200, Maxime Chevallier wrote:
> Hello everyone,
> 
> This series introduces a new driver, for the Qualcomm IPQESS Ethernet
> Controller, found on the IPQ4019.
> 
> The driver itself is pretty straightforward, but has lived out-of-tree
> for a while. I've done my best to clean-up some outdated API calls, but
> some might remain.
> 
> This controller is somewhat special, since it's part of the IPQ4019 SoC
> which also includes an QCA8K switch, and uses the IPQESS controller for
> the CPU port.

Does it exist in a form where it is not connected to a switch?

As Florian has suggested, if we assume frames are always going
to/coming from a switch, we can play around with the frame format a
little. A dummy tag could be added to the head or tail of the frame,
which the MAC driver then uses. That gives us a more normal structure.

      Andrew

^ permalink raw reply

* Re: [RFC PATCH] 9p: case-insensitive host filesystems
From: Dominique Martinet @ 2022-04-22 19:57 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: qemu-devel, Will Cohen, Greg Kurz, Michael Roitzsch, Keno Fischer,
	Eric Van Hensbergen, linux-kernel, netdev, v9fs-developer,
	Latchesar Ionkov
In-Reply-To: <1757498.AyhHxzoH2B@silver>

Christian Schoenebeck wrote on Fri, Apr 22, 2022 at 08:02:46PM +0200:
> So maybe it's better to handle case-insensitivity entirely on client side? 
> I've read that some generic "case fold" code has landed in the Linux kernel 
> recently that might do the trick?

I haven't tried, but settings S_CASEFOLD on every inodes i_flags might do
what you want client-side.
That's easy enough to test and could be a mount option

Even with that it's possible to do a direct open without readdir first
if one knows the path and I that would only be case-insensitive if the
backing server is case insensitive though, so just setting the option
and expecting it to work all the time might be a little bit
optimistic... I believe guess that should be an optimization at best.

Ideally the server should tell the client they are casefolded somehow,
but 9p doesn't have any capability/mount time negotiation besides msize
so that's difficult with the current protocol.

-- 
Dominique | Asmadeus

^ permalink raw reply

* [PATCH net-next 7/8] mptcp: dump infinite_map field in mptcp_dump_mpext
From: Mat Martineau @ 2022-04-22 21:55 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220422215543.545732-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

In trace event class mptcp_dump_mpext, dump the newly added infinite_map
field of struct mptcp_dump_mpext too.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/trace/events/mptcp.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/mptcp.h b/include/trace/events/mptcp.h
index f8e28e686c65..563e48617374 100644
--- a/include/trace/events/mptcp.h
+++ b/include/trace/events/mptcp.h
@@ -84,6 +84,7 @@ DECLARE_EVENT_CLASS(mptcp_dump_mpext,
 		__field(u8, reset_transient)
 		__field(u8, reset_reason)
 		__field(u8, csum_reqd)
+		__field(u8, infinite_map)
 	),
 
 	TP_fast_assign(
@@ -102,9 +103,10 @@ DECLARE_EVENT_CLASS(mptcp_dump_mpext,
 		__entry->reset_transient = mpext->reset_transient;
 		__entry->reset_reason = mpext->reset_reason;
 		__entry->csum_reqd = mpext->csum_reqd;
+		__entry->infinite_map = mpext->infinite_map;
 	),
 
-	TP_printk("data_ack=%llu data_seq=%llu subflow_seq=%u data_len=%u csum=%x use_map=%u dsn64=%u data_fin=%u use_ack=%u ack64=%u mpc_map=%u frozen=%u reset_transient=%u reset_reason=%u csum_reqd=%u",
+	TP_printk("data_ack=%llu data_seq=%llu subflow_seq=%u data_len=%u csum=%x use_map=%u dsn64=%u data_fin=%u use_ack=%u ack64=%u mpc_map=%u frozen=%u reset_transient=%u reset_reason=%u csum_reqd=%u infinite_map=%u",
 		  __entry->data_ack, __entry->data_seq,
 		  __entry->subflow_seq, __entry->data_len,
 		  __entry->csum, __entry->use_map,
@@ -112,7 +114,7 @@ DECLARE_EVENT_CLASS(mptcp_dump_mpext,
 		  __entry->use_ack, __entry->ack64,
 		  __entry->mpc_map, __entry->frozen,
 		  __entry->reset_transient, __entry->reset_reason,
-		  __entry->csum_reqd)
+		  __entry->csum_reqd, __entry->infinite_map)
 );
 
 DEFINE_EVENT(mptcp_dump_mpext, mptcp_sendmsg_frag,
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 8/8] selftests: mptcp: add infinite map mibs check
From: Mat Martineau @ 2022-04-22 21:55 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220422215543.545732-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch adds a function chk_infi_nr() to check the mibs for the
infinite mapping. Invoke it in chk_join_nr() when validate_checksum
is set.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 .../testing/selftests/net/mptcp/mptcp_join.sh | 36 ++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index 7314257d248a..9eb4d889a24a 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -1106,6 +1106,38 @@ chk_rst_nr()
 	echo "$extra_msg"
 }
 
+chk_infi_nr()
+{
+	local infi_tx=$1
+	local infi_rx=$2
+	local count
+	local dump_stats
+
+	printf "%-${nr_blank}s %s" " " "itx"
+	count=$(ip netns exec $ns2 nstat -as | grep InfiniteMapTx | awk '{print $2}')
+	[ -z "$count" ] && count=0
+	if [ "$count" != "$infi_tx" ]; then
+		echo "[fail] got $count infinite map[s] TX expected $infi_tx"
+		fail_test
+		dump_stats=1
+	else
+		echo -n "[ ok ]"
+	fi
+
+	echo -n " - infirx"
+	count=$(ip netns exec $ns1 nstat -as | grep InfiniteMapRx | awk '{print $2}')
+	[ -z "$count" ] && count=0
+	if [ "$count" != "$infi_rx" ]; then
+		echo "[fail] got $count infinite map[s] RX expected $infi_rx"
+		fail_test
+		dump_stats=1
+	else
+		echo "[ ok ]"
+	fi
+
+	[ "${dump_stats}" = 1 ] && dump_stats
+}
+
 chk_join_nr()
 {
 	local syn_nr=$1
@@ -1115,7 +1147,8 @@ chk_join_nr()
 	local csum_ns2=${5:-0}
 	local fail_nr=${6:-0}
 	local rst_nr=${7:-0}
-	local corrupted_pkts=${8:-0}
+	local infi_nr=${8:-0}
+	local corrupted_pkts=${9:-0}
 	local count
 	local dump_stats
 	local with_cookie
@@ -1170,6 +1203,7 @@ chk_join_nr()
 		chk_csum_nr $csum_ns1 $csum_ns2
 		chk_fail_nr $fail_nr $fail_nr
 		chk_rst_nr $rst_nr $rst_nr
+		chk_infi_nr $infi_nr $infi_nr
 	fi
 }
 
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 5/8] mptcp: infinite mapping receiving
From: Mat Martineau @ 2022-04-22 21:55 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220422215543.545732-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch adds the infinite mapping receiving logic. When the infinite
mapping is received, set the map_data_len of the subflow to 0.

In subflow_check_data_avail(), only reset the subflow when the map_data_len
of the subflow is non-zero.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 net/mptcp/subflow.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 31dcb550316f..30ffb00661bb 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1006,7 +1006,9 @@ static enum mapping_status get_mapping_status(struct sock *ssk,
 
 	data_len = mpext->data_len;
 	if (data_len == 0) {
+		pr_debug("infinite mapping received");
 		MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_INFINITEMAPRX);
+		subflow->map_data_len = 0;
 		return MAPPING_INVALID;
 	}
 
@@ -1220,7 +1222,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
 			return true;
 		}
 
-		if (subflow->mp_join || subflow->fully_established) {
+		if ((subflow->mp_join || subflow->fully_established) && subflow->map_data_len) {
 			/* fatal protocol error, close the socket.
 			 * subflow_error_report() will introduce the appropriate barriers
 			 */
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 2/8] mptcp: add the fallback check
From: Mat Martineau @ 2022-04-22 21:55 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220422215543.545732-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch adds the fallback check in subflow_check_data_avail(). Only
do the fallback when the msk hasn't fallen back yet.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 net/mptcp/subflow.c | 45 ++++++++++++++++++++++++---------------------
 1 file changed, 24 insertions(+), 21 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index f217926f6a9c..7f26a5b04ad3 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1203,35 +1203,38 @@ static bool subflow_check_data_avail(struct sock *ssk)
 	return false;
 
 fallback:
-	/* RFC 8684 section 3.7. */
-	if (subflow->send_mp_fail) {
-		if (mptcp_has_another_subflow(ssk)) {
+	if (!__mptcp_check_fallback(msk)) {
+		/* RFC 8684 section 3.7. */
+		if (subflow->send_mp_fail) {
+			if (mptcp_has_another_subflow(ssk)) {
+				ssk->sk_err = EBADMSG;
+				tcp_set_state(ssk, TCP_CLOSE);
+				subflow->reset_transient = 0;
+				subflow->reset_reason = MPTCP_RST_EMIDDLEBOX;
+				tcp_send_active_reset(ssk, GFP_ATOMIC);
+				while ((skb = skb_peek(&ssk->sk_receive_queue)))
+					sk_eat_skb(ssk, skb);
+			}
+			WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
+			return true;
+		}
+
+		if (subflow->mp_join || subflow->fully_established) {
+			/* fatal protocol error, close the socket.
+			 * subflow_error_report() will introduce the appropriate barriers
+			 */
 			ssk->sk_err = EBADMSG;
 			tcp_set_state(ssk, TCP_CLOSE);
 			subflow->reset_transient = 0;
-			subflow->reset_reason = MPTCP_RST_EMIDDLEBOX;
+			subflow->reset_reason = MPTCP_RST_EMPTCP;
 			tcp_send_active_reset(ssk, GFP_ATOMIC);
-			while ((skb = skb_peek(&ssk->sk_receive_queue)))
-				sk_eat_skb(ssk, skb);
+			WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
+			return false;
 		}
-		WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
-		return true;
-	}
 
-	if (subflow->mp_join || subflow->fully_established) {
-		/* fatal protocol error, close the socket.
-		 * subflow_error_report() will introduce the appropriate barriers
-		 */
-		ssk->sk_err = EBADMSG;
-		tcp_set_state(ssk, TCP_CLOSE);
-		subflow->reset_transient = 0;
-		subflow->reset_reason = MPTCP_RST_EMPTCP;
-		tcp_send_active_reset(ssk, GFP_ATOMIC);
-		WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
-		return false;
+		__mptcp_do_fallback(msk);
 	}
 
-	__mptcp_do_fallback(msk);
 	skb = skb_peek(&ssk->sk_receive_queue);
 	subflow->map_valid = 1;
 	subflow->map_seq = READ_ONCE(msk->ack_seq);
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 1/8] mptcp: don't send RST for single subflow
From: Mat Martineau @ 2022-04-22 21:55 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220422215543.545732-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

When a bad checksum is detected and a single subflow is in use, don't
send RST + MP_FAIL, send data_ack + MP_FAIL instead.

So invoke tcp_send_active_reset() only when mptcp_has_another_subflow()
is true.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 net/mptcp/subflow.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index aba260f547da..f217926f6a9c 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1206,14 +1206,14 @@ static bool subflow_check_data_avail(struct sock *ssk)
 	/* RFC 8684 section 3.7. */
 	if (subflow->send_mp_fail) {
 		if (mptcp_has_another_subflow(ssk)) {
+			ssk->sk_err = EBADMSG;
+			tcp_set_state(ssk, TCP_CLOSE);
+			subflow->reset_transient = 0;
+			subflow->reset_reason = MPTCP_RST_EMIDDLEBOX;
+			tcp_send_active_reset(ssk, GFP_ATOMIC);
 			while ((skb = skb_peek(&ssk->sk_receive_queue)))
 				sk_eat_skb(ssk, skb);
 		}
-		ssk->sk_err = EBADMSG;
-		tcp_set_state(ssk, TCP_CLOSE);
-		subflow->reset_transient = 0;
-		subflow->reset_reason = MPTCP_RST_EMIDDLEBOX;
-		tcp_send_active_reset(ssk, GFP_ATOMIC);
 		WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
 		return true;
 	}
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 6/8] mptcp: add mib for infinite map sending
From: Mat Martineau @ 2022-04-22 21:55 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220422215543.545732-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch adds a new mib named MPTCP_MIB_INFINITEMAPTX, increase it
when a infinite mapping has been sent out.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 net/mptcp/mib.c      | 1 +
 net/mptcp/mib.h      | 1 +
 net/mptcp/protocol.c | 1 +
 3 files changed, 3 insertions(+)

diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index e55d3dfbee0c..d93a8c9996fd 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -24,6 +24,7 @@ static const struct snmp_mib mptcp_snmp_list[] = {
 	SNMP_MIB_ITEM("MPJoinAckRx", MPTCP_MIB_JOINACKRX),
 	SNMP_MIB_ITEM("MPJoinAckHMacFailure", MPTCP_MIB_JOINACKMAC),
 	SNMP_MIB_ITEM("DSSNotMatching", MPTCP_MIB_DSSNOMATCH),
+	SNMP_MIB_ITEM("InfiniteMapTx", MPTCP_MIB_INFINITEMAPTX),
 	SNMP_MIB_ITEM("InfiniteMapRx", MPTCP_MIB_INFINITEMAPRX),
 	SNMP_MIB_ITEM("DSSNoMatchTCP", MPTCP_MIB_DSSTCPMISMATCH),
 	SNMP_MIB_ITEM("DataCsumErr", MPTCP_MIB_DATACSUMERR),
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index 00576179a619..529d07af9e14 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -17,6 +17,7 @@ enum linux_mptcp_mib_field {
 	MPTCP_MIB_JOINACKRX,		/* Received an ACK + MP_JOIN */
 	MPTCP_MIB_JOINACKMAC,		/* HMAC was wrong on ACK + MP_JOIN */
 	MPTCP_MIB_DSSNOMATCH,		/* Received a new mapping that did not match the previous one */
+	MPTCP_MIB_INFINITEMAPTX,	/* Sent an infinite mapping */
 	MPTCP_MIB_INFINITEMAPRX,	/* Received an infinite mapping */
 	MPTCP_MIB_DSSTCPMISMATCH,	/* DSS-mapping did not map with TCP's sequence numbers */
 	MPTCP_MIB_DATACSUMERR,		/* The data checksum fail */
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 161c07f49db6..4581c570ef68 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1239,6 +1239,7 @@ static void mptcp_update_infinite_map(struct mptcp_sock *msk,
 	mpext->infinite_map = 1;
 	mpext->data_len = 0;
 
+	MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_INFINITEMAPTX);
 	mptcp_subflow_ctx(ssk)->send_infinite_map = 0;
 	pr_fallback(msk);
 	__mptcp_do_fallback(msk);
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 3/8] mptcp: track and update contiguous data status
From: Mat Martineau @ 2022-04-22 21:55 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220422215543.545732-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch adds a new member allow_infinite_fallback in mptcp_sock,
which is initialized to 'true' when the connection begins and is set
to 'false' on any retransmit or successful MP_JOIN. Only do infinite
mapping fallback if there is a single subflow AND there have been no
retransmissions AND there have never been any MP_JOINs.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 net/mptcp/protocol.c | 3 +++
 net/mptcp/protocol.h | 1 +
 net/mptcp/subflow.c  | 4 +++-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 0492aa9308c7..6d653914e9fe 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2465,6 +2465,7 @@ static void __mptcp_retrans(struct sock *sk)
 		dfrag->already_sent = max(dfrag->already_sent, info.sent);
 		tcp_push(ssk, 0, info.mss_now, tcp_sk(ssk)->nonagle,
 			 info.size_goal);
+		WRITE_ONCE(msk->allow_infinite_fallback, false);
 	}
 
 	release_sock(ssk);
@@ -2539,6 +2540,7 @@ static int __mptcp_init_sock(struct sock *sk)
 	msk->first = NULL;
 	inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;
 	WRITE_ONCE(msk->csum_enabled, mptcp_is_checksum_enabled(sock_net(sk)));
+	WRITE_ONCE(msk->allow_infinite_fallback, true);
 	msk->recovery = false;
 
 	mptcp_pm_data_init(msk);
@@ -3275,6 +3277,7 @@ bool mptcp_finish_join(struct sock *ssk)
 	}
 
 	subflow->map_seq = READ_ONCE(msk->ack_seq);
+	WRITE_ONCE(msk->allow_infinite_fallback, false);
 
 out:
 	mptcp_event(MPTCP_EVENT_SUB_ESTABLISHED, msk, ssk, GFP_ATOMIC);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index aca1fb56523f..88d292374599 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -263,6 +263,7 @@ struct mptcp_sock {
 	bool		rcv_fastclose;
 	bool		use_64bit_ack; /* Set when we received a 64-bit DSN */
 	bool		csum_enabled;
+	bool		allow_infinite_fallback;
 	u8		recvmsg_inq:1,
 			cork:1,
 			nodelay:1;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 7f26a5b04ad3..31dcb550316f 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1206,7 +1206,8 @@ static bool subflow_check_data_avail(struct sock *ssk)
 	if (!__mptcp_check_fallback(msk)) {
 		/* RFC 8684 section 3.7. */
 		if (subflow->send_mp_fail) {
-			if (mptcp_has_another_subflow(ssk)) {
+			if (mptcp_has_another_subflow(ssk) ||
+			    !READ_ONCE(msk->allow_infinite_fallback)) {
 				ssk->sk_err = EBADMSG;
 				tcp_set_state(ssk, TCP_CLOSE);
 				subflow->reset_transient = 0;
@@ -1486,6 +1487,7 @@ int __mptcp_subflow_connect(struct sock *sk, const struct mptcp_addr_info *loc,
 	/* discard the subflow socket */
 	mptcp_sock_graft(ssk, sk->sk_socket);
 	iput(SOCK_INODE(sf));
+	WRITE_ONCE(msk->allow_infinite_fallback, false);
 	return err;
 
 failed_unlink:
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 4/8] mptcp: infinite mapping sending
From: Mat Martineau @ 2022-04-22 21:55 UTC (permalink / raw)
  To: netdev
  Cc: Geliang Tang, davem, kuba, pabeni, matthieu.baerts, mptcp,
	Mat Martineau
In-Reply-To: <20220422215543.545732-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch adds the infinite mapping sending logic.

Add a new flag send_infinite_map in struct mptcp_subflow_context. Set
it true when a single contiguous subflow is in use and the
allow_infinite_fallback flag is true in mptcp_pm_mp_fail_received().

In mptcp_sendmsg_frag(), if this flag is true, call the new function
mptcp_update_infinite_map() to set the infinite mapping.

Add a new flag infinite_map in struct mptcp_ext, set it true in
mptcp_update_infinite_map(), and check this flag in a new helper
mptcp_check_infinite_map().

In mptcp_update_infinite_map(), set data_len to 0, and clear the
send_infinite_map flag, then do fallback.

In mptcp_established_options(), use the helper mptcp_check_infinite_map()
to let the infinite mapping DSS can be sent out in the fallback mode.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/net/mptcp.h  |  3 ++-
 net/mptcp/options.c  |  8 ++++++--
 net/mptcp/pm.c       |  6 ++++++
 net/mptcp/protocol.c | 17 +++++++++++++++++
 net/mptcp/protocol.h | 12 ++++++++++++
 5 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 0a3b0fb04a3b..8b1afd6f5cc4 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -35,7 +35,8 @@ struct mptcp_ext {
 			frozen:1,
 			reset_transient:1;
 	u8		reset_reason:4,
-			csum_reqd:1;
+			csum_reqd:1,
+			infinite_map:1;
 };
 
 #define MPTCP_RM_IDS_MAX	8
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 325383646f5c..88f4ebbd6515 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -825,7 +825,7 @@ bool mptcp_established_options(struct sock *sk, struct sk_buff *skb,
 
 	opts->suboptions = 0;
 
-	if (unlikely(__mptcp_check_fallback(msk)))
+	if (unlikely(__mptcp_check_fallback(msk) && !mptcp_check_infinite_map(skb)))
 		return false;
 
 	if (unlikely(skb && TCP_SKB_CB(skb)->tcp_flags & TCPHDR_RST)) {
@@ -1340,8 +1340,12 @@ void mptcp_write_options(__be32 *ptr, const struct tcp_sock *tp,
 			put_unaligned_be32(mpext->subflow_seq, ptr);
 			ptr += 1;
 			if (opts->csum_reqd) {
+				/* data_len == 0 is reserved for the infinite mapping,
+				 * the checksum will also be set to 0.
+				 */
 				put_unaligned_be32(mpext->data_len << 16 |
-						   mptcp_make_csum(mpext), ptr);
+						   (mpext->data_len ? mptcp_make_csum(mpext) : 0),
+						   ptr);
 			} else {
 				put_unaligned_be32(mpext->data_len << 16 |
 						   TCPOPT_NOP << 8 | TCPOPT_NOP, ptr);
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 8aa0cdb7ad46..5c36870d3420 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -285,7 +285,13 @@ void mptcp_pm_mp_prio_received(struct sock *ssk, u8 bkup)
 
 void mptcp_pm_mp_fail_received(struct sock *sk, u64 fail_seq)
 {
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+	struct mptcp_sock *msk = mptcp_sk(subflow->conn);
+
 	pr_debug("fail_seq=%llu", fail_seq);
+
+	if (!mptcp_has_another_subflow(sk) && READ_ONCE(msk->allow_infinite_fallback))
+		subflow->send_infinite_map = 1;
 }
 
 /* path manager helpers */
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 6d653914e9fe..161c07f49db6 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1229,6 +1229,21 @@ static void mptcp_update_data_checksum(struct sk_buff *skb, int added)
 	mpext->csum = csum_fold(csum_block_add(csum, skb_checksum(skb, offset, added, 0), offset));
 }
 
+static void mptcp_update_infinite_map(struct mptcp_sock *msk,
+				      struct sock *ssk,
+				      struct mptcp_ext *mpext)
+{
+	if (!mpext)
+		return;
+
+	mpext->infinite_map = 1;
+	mpext->data_len = 0;
+
+	mptcp_subflow_ctx(ssk)->send_infinite_map = 0;
+	pr_fallback(msk);
+	__mptcp_do_fallback(msk);
+}
+
 static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 			      struct mptcp_data_frag *dfrag,
 			      struct mptcp_sendmsg_info *info)
@@ -1360,6 +1375,8 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 out:
 	if (READ_ONCE(msk->csum_enabled))
 		mptcp_update_data_checksum(skb, copy);
+	if (mptcp_subflow_ctx(ssk)->send_infinite_map)
+		mptcp_update_infinite_map(msk, ssk, mpext);
 	trace_mptcp_sendmsg_frag(mpext);
 	mptcp_subflow_ctx(ssk)->rel_write_seq += copy;
 	return copy;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 88d292374599..61d600693ffd 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -441,6 +441,7 @@ struct mptcp_subflow_context {
 		send_mp_prio : 1,
 		send_mp_fail : 1,
 		send_fastclose : 1,
+		send_infinite_map : 1,
 		rx_eof : 1,
 		can_ack : 1,        /* only after processing the remote a key */
 		disposable : 1,	    /* ctx can be free at ulp release time */
@@ -877,6 +878,17 @@ static inline void mptcp_do_fallback(struct sock *sk)
 
 #define pr_fallback(a) pr_debug("%s:fallback to TCP (msk=%p)", __func__, a)
 
+static inline bool mptcp_check_infinite_map(struct sk_buff *skb)
+{
+	struct mptcp_ext *mpext;
+
+	mpext = skb ? mptcp_get_ext(skb) : NULL;
+	if (mpext && mpext->infinite_map)
+		return true;
+
+	return false;
+}
+
 static inline bool subflow_simultaneous_connect(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
-- 
2.36.0


^ permalink raw reply related

* [PATCH net-next 0/8] mptcp: TCP fallback for established connections
From: Mat Martineau @ 2022-04-22 21:55 UTC (permalink / raw)
  To: netdev; +Cc: Mat Martineau, davem, kuba, pabeni, matthieu.baerts, mptcp

RFC 8684 allows some MPTCP connections to fall back to regular TCP when
the MPTCP DSS checksum detects middlebox interference, there is only a
single subflow, and there is no unacknowledged out-of-sequence
data. When this condition is detected, the stack sends a MPTCP DSS
option with an "infinite mapping" to signal that a fallback is
happening, and the peers will stop sending MPTCP options in their TCP
headers. The Linux MPTCP stack has not yet supported this type of
fallback, instead closing the connection when the MPTCP checksum fails.

This series adds support for fallback to regular TCP in a more limited
scenario, for only MPTCP connections that have never connected
additional subflows or transmitted out-of-sequence data. The selftests
are also updated to check new MIBs that track infinite mappings.

Geliang Tang (8):
  mptcp: don't send RST for single subflow
  mptcp: add the fallback check
  mptcp: track and update contiguous data status
  mptcp: infinite mapping sending
  mptcp: infinite mapping receiving
  mptcp: add mib for infinite map sending
  mptcp: dump infinite_map field in mptcp_dump_mpext
  selftests: mptcp: add infinite map mibs check

 include/net/mptcp.h                           |  3 +-
 include/trace/events/mptcp.h                  |  6 +-
 net/mptcp/mib.c                               |  1 +
 net/mptcp/mib.h                               |  1 +
 net/mptcp/options.c                           |  8 ++-
 net/mptcp/pm.c                                |  6 ++
 net/mptcp/protocol.c                          | 21 +++++++
 net/mptcp/protocol.h                          | 13 +++++
 net/mptcp/subflow.c                           | 57 +++++++++++--------
 .../testing/selftests/net/mptcp/mptcp_join.sh | 36 +++++++++++-
 10 files changed, 121 insertions(+), 31 deletions(-)

base-commit: c78c5a660439d4d341a03b651541fda3ebe76160
-- 
2.36.0

^ permalink raw reply

* Re: [net-next v1] net: Add a second bind table hashed by port and address
From: Eric Dumazet @ 2022-04-22 21:25 UTC (permalink / raw)
  To: Joanne Koong; +Cc: netdev, Martin KaFai Lau, David Miller, Jakub Kicinski
In-Reply-To: <CAJnrk1az3efAkZz9uY7U99xnWGU89Qxv4XPv0RJzWrQkB3_4_w@mail.gmail.com>

On Fri, Apr 22, 2022 at 2:07 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Apr 21, 2022 at 3:50 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Apr 21, 2022 at 3:16 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > We currently have one tcp bind table (bhash) which hashes by port
> > > number only. In the socket bind path, we check for bind conflicts by
> > > traversing the specified port's inet_bind2_bucket while holding the
> > > bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).
> > >
> > > In instances where there are tons of sockets hashed to the same port
> > > at different addresses, checking for a bind conflict is time-intensive
> > > and can cause softirq cpu lockups, as well as stops new tcp connections
> > > since __inet_inherit_port() also contests for the spinlock.
> > >
> > > This patch proposes adding a second bind table, bhash2, that hashes by
> > > port and ip address. Searching the bhash2 table leads to significantly
> > > faster conflict resolution and less time holding the spinlock.
> > > When experimentally testing this on a local server, the results for how
> > > long a bind request takes were as follows:
> > >
> > > when there are ~24k sockets already bound to the port -
> > >
> > > ipv4:
> > > before - 0.002317 seconds
> > > with bhash2 - 0.000018 seconds
> > >
> > > ipv6:
> > > before - 0.002431 seconds
> > > with bhash2 - 0.000021 seconds
> >
> >
> > Hi Joanne
> >
> > Do you have a test for this ? Are you using 24k IPv6 addresses on the host ?
> >
> > I fear we add some extra code and cost for quite an unusual configuration.
> >
> > Thanks.
> >
> Hi Eric,
>
> I have a test on my local server that populates the bhash table entry
> with 24k sockets for a given port and address, and then times how long
> a bind request on that port takes.

OK, but why 24k ? Why not 24 M then ?

In this case, will a 64K hash table be big enough ?

 When populating the table entry, I
> use the same IPv6 address on the host (with SO_REUSEADDR set). At
> Facebook, there are some internal teams that submit bind requests for
> 400 vips on the same port on concurrent threads that run into softirq
> lockup issues due to the bhash table entry spinlock contention, which
> is the main motivation behind this patch.



I am pretty sure the IPv6 stack does not scale well if we have
thousands of IPv6 addresses on one netdev.
Some O(N) behavior will also trigger latency violations.

Can you share the test, in a form that can be added in linux tree ?

I mean, before today nobody was trying to have 24k listeners on a host,
so it would be nice to have a regression test for future changes in the stack.

If the goal is to deal with 400 vips, why using 24k in your changelog ?
I would rather stick to the reality, and not pretend TCP stack should
scale to 24k listeners.

I have not looked at the patch yet, I choked on the changelog for
being exaggerated.

^ permalink raw reply

* RE: [PATCH net v3] ice: Fix race during aux device (un)plugging
From: Ertman, David M @ 2022-04-22 20:55 UTC (permalink / raw)
  To: ivecera, netdev@vger.kernel.org
  Cc: poros, mschmidt, Leon Romanovsky, Brandeburg, Jesse,
	Nguyen, Anthony L, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Saleem, Shiraz, moderated list:INTEL ETHERNET DRIVERS, open list
In-Reply-To: <MW5PR11MB581100DBD307763A92012BEADDF79@MW5PR11MB5811.namprd11.prod.outlook.com>

> -----Original Message-----
> From: Ertman, David M
> Sent: Friday, April 22, 2022 10:42 AM
> To: Ivan Vecera <ivecera@redhat.com>; netdev@vger.kernel.org
> Cc: poros <poros@redhat.com>; mschmidt <mschmidt@redhat.com>; Leon
> Romanovsky <leon@kernel.org>; Brandeburg, Jesse
> <jesse.brandeburg@intel.com>; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; David S. Miller <davem@davemloft.net>;
> Jakub Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>;
> Saleem, Shiraz <shiraz.saleem@intel.com>; moderated list:INTEL ETHERNET
> DRIVERS <intel-wired-lan@lists.osuosl.org>; open list <linux-
> kernel@vger.kernel.org>
> Subject: RE: [PATCH net v3] ice: Fix race during aux device (un)plugging
> 
> > -----Original Message-----
> > From: Ivan Vecera <ivecera@redhat.com>
> > Sent: Wednesday, April 20, 2022 11:09 PM
> > To: netdev@vger.kernel.org
> > Cc: poros <poros@redhat.com>; mschmidt <mschmidt@redhat.com>;
> Leon
> > Romanovsky <leon@kernel.org>; Brandeburg, Jesse
> > <jesse.brandeburg@intel.com>; Nguyen, Anthony L
> > <anthony.l.nguyen@intel.com>; David S. Miller <davem@davemloft.net>;
> > Jakub Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>;
> > Ertman, David M <david.m.ertman@intel.com>; Saleem, Shiraz
> > <shiraz.saleem@intel.com>; moderated list:INTEL ETHERNET DRIVERS
> <intel-
> > wired-lan@lists.osuosl.org>; open list <linux-kernel@vger.kernel.org>
> > Subject: [PATCH net v3] ice: Fix race during aux device (un)plugging
> >
> > Function ice_plug_aux_dev() assigns pf->adev field too early prior
> > aux device initialization and on other side ice_unplug_aux_dev()
> > starts aux device deinit and at the end assigns NULL to pf->adev.
> > This is wrong because pf->adev should always be non-NULL only when
> > aux device is fully initialized and ready. This wrong order causes
> > a crash when ice_send_event_to_aux() call occurs because that function
> > depends on non-NULL value of pf->adev and does not assume that
> > aux device is half-initialized or half-destroyed.
> > After order correction the race window is tiny but it is still there,
> > as Leon mentioned and manipulation with pf->adev needs to be protected
> > by mutex.
> >
> > Fix (un-)plugging functions so pf->adev field is set after aux device
> > init and prior aux device destroy and protect pf->adev assignment by
> > new mutex. This mutex is also held during ice_send_event_to_aux()
> > call to ensure that aux device is valid during that call. Device
> > lock used ice_send_event_to_aux() to avoid its concurrent run can
> > be removed as this is secured by that mutex.
> >
> > Reproducer:
> > cycle=1
> > while :;do
> >         echo "#### Cycle: $cycle"
> >
> >         ip link set ens7f0 mtu 9000
> >         ip link add bond0 type bond mode 1 miimon 100
> >         ip link set bond0 up
> >         ifenslave bond0 ens7f0
> >         ip link set bond0 mtu 9000
> >         ethtool -L ens7f0 combined 1
> >         ip link del bond0
> >         ip link set ens7f0 mtu 1500
> >         sleep 1
> >
> >         let cycle++
> > done
> >
> > In short when the device is added/removed to/from bond the aux device
> > is unplugged/plugged. When MTU of the device is changed an event is
> > sent to aux device asynchronously. This can race with (un)plugging
> > operation and because pf->adev is set too early (plug) or too late
> > (unplug) the function ice_send_event_to_aux() can touch uninitialized
> > or destroyed fields. In the case of crash below pf->adev->dev.mutex.
> >
> > Crash:
> > [   53.372066] bond0: (slave ens7f0): making interface the new active one
> > [   53.378622] bond0: (slave ens7f0): Enslaving as an active interface with an
> u
> > p link
> > [   53.386294] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes
> > ready
> > [   53.549104] bond0: (slave ens7f1): Enslaving as a backup interface with an
> > up
> >  link
> > [   54.118906] ice 0000:ca:00.0 ens7f0: Number of in use tx queues changed
> > inval
> > idating tc mappings. Priority traffic classification disabled!
> > [   54.233374] ice 0000:ca:00.1 ens7f1: Number of in use tx queues changed
> > inval
> > idating tc mappings. Priority traffic classification disabled!
> > [   54.248204] bond0: (slave ens7f0): Releasing backup interface
> > [   54.253955] bond0: (slave ens7f1): making interface the new active one
> > [   54.274875] bond0: (slave ens7f1): Releasing backup interface
> > [   54.289153] bond0 (unregistering): Released all slaves
> > [   55.383179] MII link monitoring set to 100 ms
> > [   55.398696] bond0: (slave ens7f0): making interface the new active one
> > [   55.405241] BUG: kernel NULL pointer dereference, address:
> > 0000000000000080
> > [   55.405289] bond0: (slave ens7f0): Enslaving as an active interface with an
> u
> > p link
> > [   55.412198] #PF: supervisor write access in kernel mode
> > [   55.412200] #PF: error_code(0x0002) - not-present page
> > [   55.412201] PGD 25d2ad067 P4D 0
> > [   55.412204] Oops: 0002 [#1] PREEMPT SMP NOPTI
> > [   55.412207] CPU: 0 PID: 403 Comm: kworker/0:2 Kdump: loaded Tainted:
> G
> > S
> >            5.17.0-13579-g57f2d6540f03 #1
> > [   55.429094] bond0: (slave ens7f1): Enslaving as a backup interface with an
> > up
> >  link
> > [   55.430224] Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS
> 1.4.4
> > 10/07/
> > 2021
> > [   55.430226] Workqueue: ice ice_service_task [ice]
> > [   55.468169] RIP: 0010:mutex_unlock+0x10/0x20
> > [   55.472439] Code: 0f b1 13 74 96 eb e0 4c 89 ee eb d8 e8 79 54 ff ff 66 0f 1f
> 84
> > 00 00 00 00 00 0f 1f 44 00 00 65 48 8b 04 25 40 ef 01 00 31 d2 <f0> 48 0f b1 17
> 75
> > 01 c3 e9 e3 fe ff ff 0f 1f 00 0f 1f 44 00 00 48
> > [   55.491186] RSP: 0018:ff4454230d7d7e28 EFLAGS: 00010246
> > [   55.496413] RAX: ff1a79b208b08000 RBX: ff1a79b2182e8880 RCX:
> > 0000000000000001
> > [   55.503545] RDX: 0000000000000000 RSI: ff4454230d7d7db0 RDI:
> > 0000000000000080
> > [   55.510678] RBP: ff1a79d1c7e48b68 R08: ff4454230d7d7db0 R09:
> > 0000000000000041
> > [   55.517812] R10: 00000000000000a5 R11: 00000000000006e6 R12:
> > ff1a79d1c7e48bc0
> > [   55.524945] R13: 0000000000000000 R14: ff1a79d0ffc305c0 R15:
> > 0000000000000000
> > [   55.532076] FS:  0000000000000000(0000) GS:ff1a79d0ffc00000(0000)
> > knlGS:0000000000000000
> > [   55.540163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   55.545908] CR2: 0000000000000080 CR3: 00000003487ae003 CR4:
> > 0000000000771ef0
> > [   55.553041] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [   55.560173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > [   55.567305] PKRU: 55555554
> > [   55.570018] Call Trace:
> > [   55.572474]  <TASK>
> > [   55.574579]  ice_service_task+0xaab/0xef0 [ice]
> > [   55.579130]  process_one_work+0x1c5/0x390
> > [   55.583141]  ? process_one_work+0x390/0x390
> > [   55.587326]  worker_thread+0x30/0x360
> > [   55.590994]  ? process_one_work+0x390/0x390
> > [   55.595180]  kthread+0xe6/0x110
> > [   55.598325]  ? kthread_complete_and_exit+0x20/0x20
> > [   55.603116]  ret_from_fork+0x1f/0x30
> > [   55.606698]  </TASK>
> >
> > Fixes: f9f5301e7e2d ("ice: Register auxiliary device to provide RDMA")
> > Cc: Leon Romanovsky <leon@kernel.org>
> > Signed-off-by: Ivan Vecera <ivecera@redhat.com>
> 
> Sorry for previous mis-reply - hit the wrong button.
> 
> LGTM
> Acked-by: Dave Ertman <david.m.ertman@intel.com>

After thinking about this for a bit longer, I did think of one issue.

With the removal of the device_lock in ice_send_event_to_aux(), there is no guarantee that the
function pointer will not become NULL by the auxiliary_driver unloading.  It is a very small window,
but it could happen.

I think the device_lock should probably stay also.

DaveE

^ permalink raw reply

* Re: [PATCH -next] libbpf: Add additional null-pointer checking in make_parent_dir
From: Andrii Nakryiko @ 2022-04-22 21:46 UTC (permalink / raw)
  To: cuigaosheng
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh, Networking,
	bpf, open list, gongruiqi1, wangweiyang2
In-Reply-To: <60b4e208-efed-c2fb-d1e0-125e5409c861@huawei.com>

On Thu, Apr 21, 2022 at 7:55 PM cuigaosheng <cuigaosheng1@huawei.com> wrote:
>
> This email adjusts the code format.
>
> I don't understand why we don't check path for NULL, bpf_link__pin is an
> external
> interface, It will be called by external functions and provide input
> parameters,

that external interface expects non-NULL string as input argument,
which is a default throughout libbpf's API. You will get SIGSEGV in
lots of cases if you pass NULL where you are not supposed to, e.g.,
bpf_object__open_file() and many others. It doesn't mean that libbpf
should check any pointer argument for NULL.

You can argue that strdup(NULL) shouldn't crash but it doesn't. It's
because strdup() has a contract that it shouldn't be passed NULL. So
is the case here.


> for example in samples/bpf/hbm.c:
>
> > 201         link = bpf_program__attach_cgroup(bpf_prog, cg1);
> > 202         if (libbpf_get_error(link)) {
> > 203                 fprintf(stderr, "ERROR: bpf_program__attach_cgroup
> > failed\n");
> > 204                 goto err;
> > 205         }
> > 206
> > 207         sprintf(cg_pin_path, "/sys/fs/bpf/hbm%d", cg_id);
> > 208         rc = bpf_link__pin(link, cg_pin_path);
> > 209         if (rc < 0) {
> > 210                 printf("ERROR: bpf_link__pin failed: %d\n", rc);
> > 211                 goto err;
> > 212         }
>
> if cg_pin_path is NULL, strdup(NULL) will trigger a segmentation fault in
> make_parent_dir, I think we should avoid this and add null-pointer checking
> for path, just like check_path:
> >  7673 static int check_path(const char *path)
> >  7674 {
> >  7675         char *cp, errmsg[STRERR_BUFSIZE];
> >  7676         struct statfs st_fs;
> >  7677         char *dname, *dir;
> >  7678         int err = 0;
> >  7679
> >  7680         if (path == NULL)
> >  7681                 return -EINVAL;
> >  7682
> >  7683         dname = strdup(path);
> >  7684         if (dname == NULL)
> >  7685                 return -ENOMEM;
> >  7686
> >  7687         dir = dirname(dname);
> >  7688         if (statfs(dir, &st_fs)) {
> >  7689                 cp = libbpf_strerror_r(errno, errmsg,
> > sizeof(errmsg));
> >  7690                 pr_warn("failed to statfs %s: %s\n", dir, cp);
> >  7691                 err = -errno;
> >  7692         }
> >  7693         free(dname);
> >  7694
> >  7695         if (!err && st_fs.f_type != BPF_FS_MAGIC) {
> >  7696                 pr_warn("specified path %s is not on BPF FS\n",
> > path);
> >  7697                 err = -EINVAL;
> >  7698         }
> >  7699
> >  7700         return err;
> >  7701 }
>
> Thanks.
>
>
> 在 2022/4/22 0:55, Andrii Nakryiko 写道:
> > On Thu, Apr 21, 2022 at 6:01 AM Gaosheng Cui <cuigaosheng1@huawei.com> wrote:
> >> The make_parent_dir is called without null-pointer checking for path,
> >> such as bpf_link__pin. To ensure there is no null-pointer dereference
> >> in make_parent_dir, so make_parent_dir requires additional null-pointer
> >> checking for path.
> >>
> >> Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
> >> ---
> >>   tools/lib/bpf/libbpf.c | 3 +++
> >>   1 file changed, 3 insertions(+)
> >>
> >> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> >> index b53e51884f9e..5786e6184bf5 100644
> >> --- a/tools/lib/bpf/libbpf.c
> >> +++ b/tools/lib/bpf/libbpf.c
> >> @@ -7634,6 +7634,9 @@ static int make_parent_dir(const char *path)
> >>          char *dname, *dir;
> >>          int err = 0;
> >>
> >> +       if (path == NULL)
> >> +               return -EINVAL;
> >> +
> > API contract is that path shouldn't be NULL. Just like we don't check
> > link, obj, prog for NULL in every single API, I don't think we need to
> > do it here, unless I'm missing something?
> >
> >>          dname = strdup(path);
> >>          if (dname == NULL)
> >>                  return -ENOMEM;
> >> --
> >> 2.25.1
> >>
> > .

^ permalink raw reply

* Re: [PATCH net-next] net/af_packet: add VLAN support for AF_PACKET SOCK_RAW GSO
From: Willem de Bruijn @ 2022-04-22 21:39 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: Willem de Bruijn, netdev, Michael S . Tsirkin, Jason Wang,
	David S . Miller, Jakub Kicinski, Paolo Abeni, Maxim Mikityanskiy,
	virtualization, Balazs Nemeth, Mike Pattrick, Eric Dumazet
In-Reply-To: <YmIOLBihyeLy+PCS@Laptop-X1>

On Thu, Apr 21, 2022 at 10:10 PM Hangbin Liu <liuhangbin@gmail.com> wrote:
>
> Hi Willem,
> On Thu, Apr 21, 2022 at 10:15:16AM -0400, Willem de Bruijn wrote:
> > Your approach does sound simpler than the above. Thanks for looking
> > into that alternative, though.
>
> Sorry I have to bring this topic back. I just remembered that
> tpacket_snd() already called skb_probe_transport_header() before calling
> virtio_net_hdr_* functions. e.g.
>
> - tpacket_snd()
>   - tpacket_fill_skb()
>     - packet_parse_headers()
>       - skb_probe_transport_header()
>   - if (po->has_vnet_hdr)
>     - virtio_net_hdr_to_skb()
>     - virtio_net_hdr_set_proto()
>
> While for packet_snd()
>
> - packet_snd()
>   - if (has_vnet_hdr)
>     - virtio_net_hdr_to_skb()
>     - virtio_net_hdr_set_proto()
>   - packet_parse_headers()
>     - skb_probe_transport_header()
>
> If we split skb_probe_transport_header() from packet_parse_headers() and
> move it before calling virtio_net_hdr_* function in packet_snd(). Should
> we do the same for tpacket_snd(), i.e. move skb_probe_transport_header()
> after the virtio_net_hdr_* function?

That sounds like the inverse: "move after" instead of "move before"?

But I thought the plan was to go back to your last patch which brings
packet_snd in line with tpacket_snd by moving packet_parse_headers in
its entirety before virtio_net_hdr_*?

> I think it really doesn't matter whether calls skb_probe_transport_header()
> before or after virtio_net_hdr_* functions if we could set the skb->protocol
> and network header correctly. Because
>
> - skb_probe_transport_header()
>   - skb_flow_dissect_flow_keys_basic()
>     - __skb_flow_dissect()
>
> In __skb_flow_dissect()
> ```
>  * @data: raw buffer pointer to the packet, if NULL use skb->data
>  * @proto: protocol for which to get the flow, if @data is NULL use skb->protocol
>  * @nhoff: network header offset, if @data is NULL use skb_network_offset(skb)
>  * @hlen: packet header length, if @data is NULL use skb_headlen(skb)
> ```
>
> So when data is NULL, we need to make sure the protocol, network header offset,
> packet header length are correct.
>
> Before this patch, the VLAN packet network header offset is incorrect when calls
> skb_probe_transport_header(). After the fix, this issue should be gone
> and we can call skb_probe_transport_header() safely.
>
> So my conclusion is. There is no need to split packet_parse_headers(). Move
> packet_parse_headers() before calling virtio_net_hdr_* function in packet_snd()
> should be safe.

Ack. Sorry if my last response was not entirely clear on this point.

> Please pardon me if I didn't make something clear.
> Let's me know what do you think.
>
> Thanks
> Hangbin

^ permalink raw reply

* Re: [patch iproute2-next] devlink: introduce -h[ex] cmdline option to allow dumping numbers in hex format
From: Shannon Nelson @ 2022-04-22 21:36 UTC (permalink / raw)
  To: Jiri Pirko, netdev; +Cc: sthemmin, dsahern
In-Reply-To: <20220419171637.1147925-1-jiri@resnulli.us>

On 4/19/22 10:16 AM, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@nvidia.com>
> 
> For health reporter dumps it is quite convenient to have the numbers in
> hexadecimal format. Introduce a command line option to allow user to
> achieve that output.
> 
> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
> ---
>   devlink/devlink.c  | 19 +++++++++++++------
>   man/man8/devlink.8 |  4 ++++
>   2 files changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/devlink/devlink.c b/devlink/devlink.c
> index aab739f7f437..bc681737ec8a 100644
> --- a/devlink/devlink.c
> +++ b/devlink/devlink.c
> @@ -367,6 +367,7 @@ struct dl {
>   	bool pretty_output;
>   	bool verbose;
>   	bool stats;
> +	bool hex;
>   	struct {
>   		bool present;
>   		char *bus_name;
> @@ -8044,6 +8045,8 @@ static int cmd_health_dump_clear(struct dl *dl)
>   
>   static int fmsg_value_show(struct dl *dl, int type, struct nlattr *nl_data)
>   {
> +	const char *num_fmt = dl->hex ? "%x" : "%u";
> +	const char *num64_fmt = dl->hex ? "%"PRIx64 : "%"PRIu64;

Can we get a leading "0x" on these to help identify that they are hex 
digits?

>   	uint8_t *data;
>   	uint32_t len;
>   
> @@ -8053,16 +8056,16 @@ static int fmsg_value_show(struct dl *dl, int type, struct nlattr *nl_data)
>   		print_bool(PRINT_ANY, NULL, "%s", mnl_attr_get_u8(nl_data));
>   		break;
>   	case MNL_TYPE_U8:
> -		print_uint(PRINT_ANY, NULL, "%u", mnl_attr_get_u8(nl_data));
> +		print_uint(PRINT_ANY, NULL, num_fmt, mnl_attr_get_u8(nl_data));
>   		break;
>   	case MNL_TYPE_U16:
> -		print_uint(PRINT_ANY, NULL, "%u", mnl_attr_get_u16(nl_data));
> +		print_uint(PRINT_ANY, NULL, num_fmt, mnl_attr_get_u16(nl_data));
>   		break;
>   	case MNL_TYPE_U32:
> -		print_uint(PRINT_ANY, NULL, "%u", mnl_attr_get_u32(nl_data));
> +		print_uint(PRINT_ANY, NULL, num_fmt, mnl_attr_get_u32(nl_data));
>   		break;
>   	case MNL_TYPE_U64:
> -		print_u64(PRINT_ANY, NULL, "%"PRIu64, mnl_attr_get_u64(nl_data));
> +		print_u64(PRINT_ANY, NULL, num64_fmt, mnl_attr_get_u64(nl_data));
>   		break;
>   	case MNL_TYPE_NUL_STRING:
>   		print_string(PRINT_ANY, NULL, "%s", mnl_attr_get_str(nl_data));
> @@ -8939,7 +8942,7 @@ static void help(void)
>   	pr_err("Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help }\n"
>   	       "       devlink [ -f[orce] ] -b[atch] filename -N[etns] netnsname\n"
>   	       "where  OBJECT := { dev | port | sb | monitor | dpipe | resource | region | health | trap }\n"
> -	       "       OPTIONS := { -V[ersion] | -n[o-nice-names] | -j[son] | -p[retty] | -v[erbose] -s[tatistics] }\n");
> +	       "       OPTIONS := { -V[ersion] | -n[o-nice-names] | -j[son] | -p[retty] | -v[erbose] -s[tatistics] -h[ex] }\n");
>   }
>   
>   static int dl_cmd(struct dl *dl, int argc, char **argv)
> @@ -9053,6 +9056,7 @@ int main(int argc, char **argv)
>   		{ "statistics",		no_argument,		NULL, 's' },
>   		{ "Netns",		required_argument,	NULL, 'N' },
>   		{ "iec",		no_argument,		NULL, 'i' },
> +		{ "hex",		no_argument,		NULL, 'h' },

Can we use 'x' instead of 'h' here?  Most times '-h' means 'help', and 
might surprise unsuspecting users when it isn't a help flag.

Thanks,
sln

>   		{ NULL, 0, NULL, 0 }
>   	};
>   	const char *batch_file = NULL;
> @@ -9068,7 +9072,7 @@ int main(int argc, char **argv)
>   		return EXIT_FAILURE;
>   	}
>   
> -	while ((opt = getopt_long(argc, argv, "Vfb:njpvsN:i",
> +	while ((opt = getopt_long(argc, argv, "Vfb:njpvsN:ih",
>   				  long_options, NULL)) >= 0) {
>   
>   		switch (opt) {
> @@ -9106,6 +9110,9 @@ int main(int argc, char **argv)
>   		case 'i':
>   			use_iec = true;
>   			break;
> +		case 'h':
> +			dl->hex = true;
> +			break;
>   		default:
>   			pr_err("Unknown option.\n");
>   			help();
> diff --git a/man/man8/devlink.8 b/man/man8/devlink.8
> index 840cf44cf97b..3797a27cefc5 100644
> --- a/man/man8/devlink.8
> +++ b/man/man8/devlink.8
> @@ -63,6 +63,10 @@ Switches to the specified network namespace.
>   .BR "\-i", " --iec"
>   Print human readable rates in IEC units (e.g. 1Ki = 1024).
>   
> +.TP
> +.BR "\-h", " --hex"
> +Print dump numbers in hexadecimal format.
> +
>   .SS
>   .I OBJECT
>   

^ permalink raw reply

* Re: [PATCH net-next] net: bridge: switchdev: check br_vlan_group() return value
From: Jakub Kicinski @ 2022-04-22 22:14 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Clément Léger, Roopa Prabhu, David S. Miller,
	Paolo Abeni, Tobias Waldekranz, bridge, netdev, linux-kernel
In-Reply-To: <c945ebff-02fe-f2d5-656f-6bdfc46416f1@blackwall.org>

On Thu, 21 Apr 2022 13:17:51 +0300 Nikolay Aleksandrov wrote:
> On 21/04/2022 13:12, Clément Léger wrote:
> > br_vlan_group() can return NULL and thus return value must be checked
> > to avoid dereferencing a NULL pointer.
> > 
> > Fixes: 6284c723d9b9 ("net: bridge: mst: Notify switchdev drivers of VLAN MSTI migrations")
> > Signed-off-by: Clément Léger <clement.leger@bootlin.com>
> > ---
> >  net/bridge/br_switchdev.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
> > index 81400e0b26ac..8f3d76c751dd 100644
> > --- a/net/bridge/br_switchdev.c
> > +++ b/net/bridge/br_switchdev.c
> > @@ -354,6 +354,8 @@ static int br_switchdev_vlan_attr_replay(struct net_device *br_dev,
> >  	attr.orig_dev = br_dev;
> >  
> >  	vg = br_vlan_group(br);
> > +	if (!vg)
> > +		return 0;
> >  
> >  	list_for_each_entry(v, &vg->vlan_list, vlist) {
> >  		if (v->msti) {  
> 
> Acked-by: Nikolay Aleksandrov <razor@blackwall.org>

Thanks! Applying to net tho, the patch in question is already 
in Linus's tree.

^ permalink raw reply

* [RFC PATCH] 9p: case-insensitive host filesystems
From: Christian Schoenebeck @ 2022-04-22 18:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: Will Cohen, Greg Kurz, Michael Roitzsch, Keno Fischer,
	Dominique Martinet, Eric Van Hensbergen, linux-kernel, netdev,
	v9fs-developer, Latchesar Ionkov

Now that 9p support for macOS hosts just landed in QEMU 7.0 and with support 
for Windows hosts on the horizon [1], the question is how to deal with case-
insensitive host filesystems, which are very common on those two systems?

I made some tests, e.g. trying to setup a 9p root fs Linux installation on a 
macOS host as described in the QEMU HOWTO [2], which at a certain point causes 
the debootstrap script to fail when trying to unpack the 'libpam-runtime' 
package. That's because it would try to create this symlink:

  /usr/share/man/man7/PAM.7.gz -> /usr/share/man/man7/pam.7.gz

which fails with EEXIST on a case-insensitive APFS. Unfortunately you can't 
easily switch an existing APFS partition to case-sensitivity. It requires to 
reformat the entire partition, loosing all your data, etc.

So I did a quick test with QEMU as outlined below, trying to simply let 9p 
server "eat" EEXIST errors in such cases, but then I realized that most of the 
time it would not even come that far, as Linux client would first send a 
'Twalk' request to check whether target symlink entry already exists, and as 
it gets a positive response from 9p server (again, due to case-insensitivity) 
client would stop right there without even trying to send a 'Tsymlink' 
request.

So maybe it's better to handle case-insensitivity entirely on client side? 
I've read that some generic "case fold" code has landed in the Linux kernel 
recently that might do the trick?

Should 9p server give a hint to 9p client that it's a case-insensitive fs? And 
if yes, once per entire exported fs or rather for each directory (as there 
might be submounts on host)?

[1] https://lore.kernel.org/all/20220408171013.912436-1-bmeng.cn@gmail.com/
[2] https://wiki.qemu.org/Documentation/9p_root_fs

---
 hw/9pfs/9p-local.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/hw/9pfs/9p-local.c b/hw/9pfs/9p-local.c
index d42ce6d8b8..d6cb45c758 100644
--- a/hw/9pfs/9p-local.c
+++ b/hw/9pfs/9p-local.c
@@ -39,6 +39,10 @@
 #endif
 #endif
 #include <sys/ioctl.h>
+#ifdef CONFIG_DARWIN
+#include <glib.h>
+#include <glib/gprintf.h>
+#endif

 #ifndef XFS_SUPER_MAGIC
 #define XFS_SUPER_MAGIC  0x58465342
@@ -57,6 +61,18 @@ typedef struct {
     int mountfd;
 } LocalData;

+#ifdef CONFIG_DARWIN
+
+/* Compare strings case-insensitive (assuming UTF-8 encoding). */
+static int p9_stricmp(const char *a, const char *b)
+{
+    g_autofree gchar *cia = g_utf8_casefold(a, -1);
+    g_autofree gchar *cib = g_utf8_casefold(b, -1);
+    return g_utf8_collate(cia, cib);
+}
+
+#endif
+
 int local_open_nofollow(FsContext *fs_ctx, const char *path, int flags,
                         mode_t mode)
 {
@@ -931,6 +947,25 @@ static int local_symlink(FsContext *fs_ctx, const char 
*oldpath,
                fs_ctx->export_flags & V9FS_SM_NONE) {
         err = symlinkat(oldpath, dirfd, name);
         if (err) {
+#if CONFIG_DARWIN
+            if (errno == EEXIST) {
+                printf("  -> symlinkat(oldpath='%s', dirfd=%d, name='%s') = 
EEXIST\n", oldpath, dirfd, name);
+            }
+            if (errno == EEXIST &&
+                strcmp(oldpath, name) && !p9_stricmp(oldpath, name))
+            {
+                struct stat st1, st2;
+                const int cur_errno = errno;
+                if (!fstatat(dirfd, oldpath, &st1, AT_SYMLINK_NOFOLLOW) &&
+                    !fstatat(dirfd, name, &st2, AT_SYMLINK_NOFOLLOW) &&
+                    st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino)
+                {
+                    printf("  -> iCASE SAME\n");
+                    err = 0;
+                }
+                errno = cur_errno;
+            }
+#endif
             goto out;
         }
         err = fchownat(dirfd, name, credp->fc_uid, credp->fc_gid,
@@ -983,6 +1018,25 @@ static int local_link(FsContext *ctx, V9fsPath *oldpath,

     ret = linkat(odirfd, oname, ndirfd, name, 0);
     if (ret < 0) {
+#if CONFIG_DARWIN
+        if (errno == EEXIST) {
+            printf("  -> linkat(odirfd=%d, oname='%s', ndirfd=%d, name='%s') 
= EEXIST\n", odirfd, oname, ndirfd, name);
+        }
+        if (errno == EEXIST &&
+            strcmp(oname, name) && !p9_stricmp(oname, name))
+        {
+            struct stat st1, st2;
+            const int cur_errno = errno;
+            if (!fstatat(odirfd, oname, &st1, AT_SYMLINK_NOFOLLOW) &&
+                !fstatat(ndirfd, name, &st2, AT_SYMLINK_NOFOLLOW) &&
+                st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino)
+            {
+                printf("  -> iCASE SAME\n");
+                ret = 0;
+            }
+            errno = cur_errno;
+        }
+#endif
         goto out_close;
     }

-- 
2.32.0 (Apple Git-132)

^ permalink raw reply related

* Re: [PATCH net-next v1 09/17] net/mlx5: Simplify HW context interfaces by using SA entry
From: Saeed Mahameed @ 2022-04-22 22:19 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Paolo Abeni, Jakub Kicinski, David S . Miller, Leon Romanovsky,
	Jason Gunthorpe, linux-netdev, Raed Salem
In-Reply-To: <3ad7b80c6f58d938550dd3259c5eaaecd8833af4.1650363043.git.leonro@nvidia.com>

On 19 Apr 13:13, Leon Romanovsky wrote:
>From: Leon Romanovsky <leonro@nvidia.com>
>
>SA context logic used multiple structures to store same data
>over and over. By simplifying the SA context interfaces, we
>can remove extra structs.
>
>Reviewed-by: Raed Salem <raeds@nvidia.com>
>Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>---
> .../mellanox/mlx5/core/en_accel/ipsec.c       |  50 ++---
> .../mellanox/mlx5/core/en_accel/ipsec.h       |  27 ++-
> .../mlx5/core/en_accel/ipsec_offload.c        | 182 ++++--------------
> 3 files changed, 62 insertions(+), 197 deletions(-)
>
>diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
>index 0daf9350471f..537311a74bfb 100644
>--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
>+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
>@@ -63,9 +63,9 @@ struct xfrm_state *mlx5e_ipsec_sadb_rx_lookup(struct mlx5e_ipsec *ipsec,
> 	return ret;
> }
>
>-static int  mlx5e_ipsec_sadb_rx_add(struct mlx5e_ipsec_sa_entry *sa_entry,
>-				    unsigned int handle)
>+static int mlx5e_ipsec_sadb_rx_add(struct mlx5e_ipsec_sa_entry *sa_entry)
> {
>+	unsigned int handle = sa_entry->ipsec_obj_id;
> 	struct mlx5e_ipsec *ipsec = sa_entry->ipsec;
> 	struct mlx5e_ipsec_sa_entry *_sa_entry;
> 	unsigned long flags;
>@@ -277,16 +277,14 @@ static void _update_xfrm_state(struct work_struct *work)
> 	struct mlx5e_ipsec_sa_entry *sa_entry = container_of(
> 		modify_work, struct mlx5e_ipsec_sa_entry, modify_work);
>
>-	mlx5_accel_esp_modify_xfrm(sa_entry->xfrm, &modify_work->attrs);
>+	mlx5_accel_esp_modify_xfrm(sa_entry, &modify_work->attrs);
> }
>
> static int mlx5e_xfrm_add_state(struct xfrm_state *x)
> {
> 	struct mlx5e_ipsec_sa_entry *sa_entry = NULL;
> 	struct net_device *netdev = x->xso.real_dev;
>-	struct mlx5_accel_esp_xfrm_attrs attrs;
> 	struct mlx5e_priv *priv;
>-	unsigned int sa_handle;
> 	int err;
>
> 	priv = netdev_priv(netdev);
>@@ -309,33 +307,20 @@ static int mlx5e_xfrm_add_state(struct xfrm_state *x)
> 	/* check esn */
> 	mlx5e_ipsec_update_esn_state(sa_entry);
>
>-	/* create xfrm */
>-	mlx5e_ipsec_build_accel_xfrm_attrs(sa_entry, &attrs);
>-	sa_entry->xfrm = mlx5_accel_esp_create_xfrm(priv->mdev, &attrs);
>-	if (IS_ERR(sa_entry->xfrm)) {
>-		err = PTR_ERR(sa_entry->xfrm);
>-		goto err_sa_entry;
>-	}
>-
>+	mlx5e_ipsec_build_accel_xfrm_attrs(sa_entry, &sa_entry->attrs);
> 	/* create hw context */
>-	sa_entry->hw_context =
>-			mlx5_accel_esp_create_hw_context(priv->mdev,
>-							 sa_entry->xfrm,
>-							 &sa_handle);
>-	if (IS_ERR(sa_entry->hw_context)) {
>-		err = PTR_ERR(sa_entry->hw_context);
>+	err = mlx5_ipsec_create_sa_ctx(sa_entry);
>+	if (err)
> 		goto err_xfrm;
>-	}
>
>-	sa_entry->ipsec_obj_id = sa_handle;
>-	err = mlx5e_accel_ipsec_fs_add_rule(priv, &sa_entry->xfrm->attrs,
>+	err = mlx5e_accel_ipsec_fs_add_rule(priv, &sa_entry->attrs,
> 					    sa_entry->ipsec_obj_id,
> 					    &sa_entry->ipsec_rule);
> 	if (err)
> 		goto err_hw_ctx;
>
> 	if (x->xso.flags & XFRM_OFFLOAD_INBOUND) {
>-		err = mlx5e_ipsec_sadb_rx_add(sa_entry, sa_handle);
>+		err = mlx5e_ipsec_sadb_rx_add(sa_entry);
> 		if (err)
> 			goto err_add_rule;
> 	} else {
>@@ -348,15 +333,12 @@ static int mlx5e_xfrm_add_state(struct xfrm_state *x)
> 	goto out;
>
> err_add_rule:
>-	mlx5e_accel_ipsec_fs_del_rule(priv, &sa_entry->xfrm->attrs,
>+	mlx5e_accel_ipsec_fs_del_rule(priv, &sa_entry->attrs,
> 				      &sa_entry->ipsec_rule);
> err_hw_ctx:
>-	mlx5_accel_esp_free_hw_context(priv->mdev, sa_entry->hw_context);
>+	mlx5_ipsec_free_sa_ctx(sa_entry);
> err_xfrm:
>-	mlx5_accel_esp_destroy_xfrm(sa_entry->xfrm);
>-err_sa_entry:
> 	kfree(sa_entry);
>-
> out:
> 	return err;
> }
>@@ -374,14 +356,10 @@ static void mlx5e_xfrm_free_state(struct xfrm_state *x)
> 	struct mlx5e_ipsec_sa_entry *sa_entry = to_ipsec_sa_entry(x);
> 	struct mlx5e_priv *priv = netdev_priv(x->xso.dev);
>
>-	if (sa_entry->hw_context) {
>-		cancel_work_sync(&sa_entry->modify_work.work);
>-		mlx5e_accel_ipsec_fs_del_rule(priv, &sa_entry->xfrm->attrs,
>-					      &sa_entry->ipsec_rule);
>-		mlx5_accel_esp_free_hw_context(sa_entry->xfrm->mdev, sa_entry->hw_context);
>-		mlx5_accel_esp_destroy_xfrm(sa_entry->xfrm);
>-	}
>-
>+	cancel_work_sync(&sa_entry->modify_work.work);
>+	mlx5e_accel_ipsec_fs_del_rule(priv, &sa_entry->attrs,
>+				      &sa_entry->ipsec_rule);
>+	mlx5_ipsec_free_sa_ctx(sa_entry);
> 	kfree(sa_entry);
> }
>
>diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
>index b438b0358c36..cdcb95f90623 100644
>--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
>+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
>@@ -102,11 +102,6 @@ struct mlx5_accel_esp_xfrm_attrs {
> 	u8 is_ipv6;
> };
>
>-struct mlx5_accel_esp_xfrm {
>-	struct mlx5_core_dev  *mdev;
>-	struct mlx5_accel_esp_xfrm_attrs attrs;
>-};
>-
> enum mlx5_accel_ipsec_cap {
> 	MLX5_ACCEL_IPSEC_CAP_DEVICE		= 1 << 0,
> 	MLX5_ACCEL_IPSEC_CAP_ESP		= 1 << 1,
>@@ -162,11 +157,11 @@ struct mlx5e_ipsec_sa_entry {
> 	unsigned int handle; /* Handle in SADB_RX */
> 	struct xfrm_state *x;
> 	struct mlx5e_ipsec *ipsec;
>-	struct mlx5_accel_esp_xfrm *xfrm;
>-	void *hw_context;
>+	struct mlx5_accel_esp_xfrm_attrs attrs;
> 	void (*set_iv_op)(struct sk_buff *skb, struct xfrm_state *x,
> 			  struct xfrm_offload *xo);
> 	u32 ipsec_obj_id;
>+	u32 enc_key_id;
> 	struct mlx5e_ipsec_rule ipsec_rule;
> 	struct mlx5e_ipsec_modify_state_work modify_work;
> };
>@@ -188,19 +183,19 @@ void mlx5e_accel_ipsec_fs_del_rule(struct mlx5e_priv *priv,
> 				   struct mlx5_accel_esp_xfrm_attrs *attrs,
> 				   struct mlx5e_ipsec_rule *ipsec_rule);
>
>-void *mlx5_accel_esp_create_hw_context(struct mlx5_core_dev *mdev,
>-				       struct mlx5_accel_esp_xfrm *xfrm,
>-				       u32 *sa_handle);
>-void mlx5_accel_esp_free_hw_context(struct mlx5_core_dev *mdev, void *context);
>+int mlx5_ipsec_create_sa_ctx(struct mlx5e_ipsec_sa_entry *sa_entry);
>+void mlx5_ipsec_free_sa_ctx(struct mlx5e_ipsec_sa_entry *sa_entry);
>
> u32 mlx5_ipsec_device_caps(struct mlx5_core_dev *mdev);
>
>-struct mlx5_accel_esp_xfrm *
>-mlx5_accel_esp_create_xfrm(struct mlx5_core_dev *mdev,
>-			   const struct mlx5_accel_esp_xfrm_attrs *attrs);
>-void mlx5_accel_esp_destroy_xfrm(struct mlx5_accel_esp_xfrm *xfrm);
>-void mlx5_accel_esp_modify_xfrm(struct mlx5_accel_esp_xfrm *xfrm,
>+void mlx5_accel_esp_modify_xfrm(struct mlx5e_ipsec_sa_entry *sa_entry,
> 				const struct mlx5_accel_esp_xfrm_attrs *attrs);
>+
>+static inline struct mlx5_core_dev *
>+mlx5e_ipsec_sa2dev(struct mlx5e_ipsec_sa_entry *sa_entry)
>+{
>+	return sa_entry->ipsec->mdev;
>+}
> #else
> static inline int mlx5e_ipsec_init(struct mlx5e_priv *priv)
> {
>diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_offload.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_offload.c
>index a7bd31d10bd4..817747d5229e 100644
>--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_offload.c
>+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_offload.c
>@@ -5,21 +5,6 @@
> #include "ipsec.h"
> #include "lib/mlx5.h"
>
>-struct mlx5_ipsec_sa_ctx {
>-	struct rhash_head hash;
>-	u32 enc_key_id;
>-	u32 ipsec_obj_id;
>-	/* hw ctx */
>-	struct mlx5_core_dev *dev;
>-	struct mlx5_ipsec_esp_xfrm *mxfrm;
>-};
>-
>-struct mlx5_ipsec_esp_xfrm {
>-	/* reference counter of SA ctx */
>-	struct mlx5_ipsec_sa_ctx *sa_ctx;
>-	struct mlx5_accel_esp_xfrm accel_xfrm;
>-};
>-
> u32 mlx5_ipsec_device_caps(struct mlx5_core_dev *mdev)
> {
> 	u32 caps;
>@@ -61,43 +46,11 @@ u32 mlx5_ipsec_device_caps(struct mlx5_core_dev *mdev)
> }
> EXPORT_SYMBOL_GPL(mlx5_ipsec_device_caps);
>
>-struct mlx5_accel_esp_xfrm *
>-mlx5_accel_esp_create_xfrm(struct mlx5_core_dev *mdev,
>-			   const struct mlx5_accel_esp_xfrm_attrs *attrs)
>-{
>-	struct mlx5_ipsec_esp_xfrm *mxfrm;
>-
>-	mxfrm = kzalloc(sizeof(*mxfrm), GFP_KERNEL);
>-	if (!mxfrm)
>-		return ERR_PTR(-ENOMEM);
>-
>-	memcpy(&mxfrm->accel_xfrm.attrs, attrs,
>-	       sizeof(mxfrm->accel_xfrm.attrs));
>-
>-	mxfrm->accel_xfrm.mdev = mdev;
>-	return &mxfrm->accel_xfrm;
>-}
>-
>-void mlx5_accel_esp_destroy_xfrm(struct mlx5_accel_esp_xfrm *xfrm)
>+static int mlx5_create_ipsec_obj(struct mlx5e_ipsec_sa_entry *sa_entry)
> {
>-	struct mlx5_ipsec_esp_xfrm *mxfrm = container_of(xfrm, struct mlx5_ipsec_esp_xfrm,
>-							 accel_xfrm);
>-
>-	kfree(mxfrm);
>-}
>-
>-struct mlx5_ipsec_obj_attrs {
>-	const struct aes_gcm_keymat *aes_gcm;
>-	u32 accel_flags;
>-	u32 esn_msb;
>-	u32 enc_key_id;
>-};
>-
>-static int mlx5_create_ipsec_obj(struct mlx5_core_dev *mdev,
>-				 struct mlx5_ipsec_obj_attrs *attrs,
>-				 u32 *ipsec_id)

I don't see the point of this change, the function used to receive two
primitives, now it receives a god object, just to grab the two primitives,
this breaks the bottom up design, and contaminates the code with the
sa_entry container, that only should be visible by high-level ipsec module and
the SA DB, all service and low level functions should remain as
primitive and simple as possible to avoid future abuse and reduce the scope
and visibility of god objects. The effect of this change is more severe in
the next patch.

Even within the same file, i still recommend a monotonic bottom up
design and keep the complex objects usage to as few hight level functions
as possible.

>-{
>-	const struct aes_gcm_keymat *aes_gcm = attrs->aes_gcm;
>+	struct mlx5_accel_esp_xfrm_attrs *attrs = &sa_entry->attrs;
>+	struct mlx5_core_dev *mdev = mlx5e_ipsec_sa2dev(sa_entry);
>+	struct aes_gcm_keymat *aes_gcm = &attrs->keymat.aes_gcm;
> 	u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)];
> 	u32 in[MLX5_ST_SZ_DW(create_ipsec_obj_in)] = {};
> 	void *obj, *salt_p, *salt_iv_p;
>@@ -128,14 +81,14 @@ static int mlx5_create_ipsec_obj(struct mlx5_core_dev *mdev,
> 	salt_iv_p = MLX5_ADDR_OF(ipsec_obj, obj, implicit_iv);
> 	memcpy(salt_iv_p, &aes_gcm->seq_iv, sizeof(aes_gcm->seq_iv));
> 	/* esn */
>-	if (attrs->accel_flags & MLX5_ACCEL_ESP_FLAGS_ESN_TRIGGERED) {
>+	if (attrs->flags & MLX5_ACCEL_ESP_FLAGS_ESN_TRIGGERED) {
> 		MLX5_SET(ipsec_obj, obj, esn_en, 1);
>-		MLX5_SET(ipsec_obj, obj, esn_msb, attrs->esn_msb);
>-		if (attrs->accel_flags & MLX5_ACCEL_ESP_FLAGS_ESN_STATE_OVERLAP)
>+		MLX5_SET(ipsec_obj, obj, esn_msb, attrs->esn);
>+		if (attrs->flags & MLX5_ACCEL_ESP_FLAGS_ESN_STATE_OVERLAP)
> 			MLX5_SET(ipsec_obj, obj, esn_overlap, 1);
> 	}
>
>-	MLX5_SET(ipsec_obj, obj, dekn, attrs->enc_key_id);
>+	MLX5_SET(ipsec_obj, obj, dekn, sa_entry->enc_key_id);
>
> 	/* general object fields set */
> 	MLX5_SET(general_obj_in_cmd_hdr, in, opcode,
>@@ -145,13 +98,15 @@ static int mlx5_create_ipsec_obj(struct mlx5_core_dev *mdev,
>
> 	err = mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
> 	if (!err)
>-		*ipsec_id = MLX5_GET(general_obj_out_cmd_hdr, out, obj_id);
>+		sa_entry->ipsec_obj_id =
>+			MLX5_GET(general_obj_out_cmd_hdr, out, obj_id);
>
> 	return err;
> }
>
>-static void mlx5_destroy_ipsec_obj(struct mlx5_core_dev *mdev, u32 ipsec_id)
>+static void mlx5_destroy_ipsec_obj(struct mlx5e_ipsec_sa_entry *sa_entry)
> {
>+	struct mlx5_core_dev *mdev = mlx5e_ipsec_sa2dev(sa_entry);
> 	u32 in[MLX5_ST_SZ_DW(general_obj_in_cmd_hdr)] = {};
> 	u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)];
>
>@@ -159,79 +114,52 @@ static void mlx5_destroy_ipsec_obj(struct mlx5_core_dev *mdev, u32 ipsec_id)
> 		 MLX5_CMD_OP_DESTROY_GENERAL_OBJECT);
> 	MLX5_SET(general_obj_in_cmd_hdr, in, obj_type,
> 		 MLX5_GENERAL_OBJECT_TYPES_IPSEC);
>-	MLX5_SET(general_obj_in_cmd_hdr, in, obj_id, ipsec_id);
>+	MLX5_SET(general_obj_in_cmd_hdr, in, obj_id, sa_entry->ipsec_obj_id);
>
> 	mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
> }
>
>-static void *mlx5_ipsec_offload_create_sa_ctx(struct mlx5_core_dev *mdev,
>-					      struct mlx5_accel_esp_xfrm *accel_xfrm,
>-					      const __be32 saddr[4], const __be32 daddr[4],
>-					      const __be32 spi, bool is_ipv6, u32 *hw_handle)
>+int mlx5_ipsec_create_sa_ctx(struct mlx5e_ipsec_sa_entry *sa_entry)
> {
>-	struct mlx5_accel_esp_xfrm_attrs *xfrm_attrs = &accel_xfrm->attrs;
>-	struct aes_gcm_keymat *aes_gcm = &xfrm_attrs->keymat.aes_gcm;
>-	struct mlx5_ipsec_obj_attrs ipsec_attrs = {};
>-	struct mlx5_ipsec_esp_xfrm *mxfrm;
>-	struct mlx5_ipsec_sa_ctx *sa_ctx;
>+	struct aes_gcm_keymat *aes_gcm = &sa_entry->attrs.keymat.aes_gcm;
>+	struct mlx5_core_dev *mdev = mlx5e_ipsec_sa2dev(sa_entry);
> 	int err;
>
>-	/* alloc SA context */
>-	sa_ctx = kzalloc(sizeof(*sa_ctx), GFP_KERNEL);
>-	if (!sa_ctx)
>-		return ERR_PTR(-ENOMEM);
>-
>-	sa_ctx->dev = mdev;
>-
>-	mxfrm = container_of(accel_xfrm, struct mlx5_ipsec_esp_xfrm, accel_xfrm);
>-	sa_ctx->mxfrm = mxfrm;
>-
> 	/* key */
> 	err = mlx5_create_encryption_key(mdev, aes_gcm->aes_key,
> 					 aes_gcm->key_len / BITS_PER_BYTE,
> 					 MLX5_ACCEL_OBJ_IPSEC_KEY,
>-					 &sa_ctx->enc_key_id);
>+					 &sa_entry->enc_key_id);
> 	if (err) {
> 		mlx5_core_dbg(mdev, "Failed to create encryption key (err = %d)\n", err);
>-		goto err_sa_ctx;
>+		return err;
> 	}
>
>-	ipsec_attrs.aes_gcm = aes_gcm;
>-	ipsec_attrs.accel_flags = accel_xfrm->attrs.flags;
>-	ipsec_attrs.esn_msb = accel_xfrm->attrs.esn;
>-	ipsec_attrs.enc_key_id = sa_ctx->enc_key_id;
>-	err = mlx5_create_ipsec_obj(mdev, &ipsec_attrs,
>-				    &sa_ctx->ipsec_obj_id);
>+	err = mlx5_create_ipsec_obj(sa_entry);
> 	if (err) {
> 		mlx5_core_dbg(mdev, "Failed to create IPsec object (err = %d)\n", err);
> 		goto err_enc_key;
> 	}
>
>-	*hw_handle = sa_ctx->ipsec_obj_id;
>-	mxfrm->sa_ctx = sa_ctx;
>-
>-	return sa_ctx;
>+	return 0;
>
> err_enc_key:
>-	mlx5_destroy_encryption_key(mdev, sa_ctx->enc_key_id);
>-err_sa_ctx:
>-	kfree(sa_ctx);
>-	return ERR_PTR(err);
>+	mlx5_destroy_encryption_key(mdev, sa_entry->enc_key_id);
>+	return err;
> }
>
>-static void mlx5_ipsec_offload_delete_sa_ctx(void *context)
>+void mlx5_ipsec_free_sa_ctx(struct mlx5e_ipsec_sa_entry *sa_entry)
> {
>-	struct mlx5_ipsec_sa_ctx *sa_ctx = (struct mlx5_ipsec_sa_ctx *)context;
>+	struct mlx5_core_dev *mdev = mlx5e_ipsec_sa2dev(sa_entry);
>
>-	mlx5_destroy_ipsec_obj(sa_ctx->dev, sa_ctx->ipsec_obj_id);
>-	mlx5_destroy_encryption_key(sa_ctx->dev, sa_ctx->enc_key_id);
>-	kfree(sa_ctx);
>+	mlx5_destroy_ipsec_obj(sa_entry);
>+	mlx5_destroy_encryption_key(mdev, sa_entry->enc_key_id);
> }
>
>-static int mlx5_modify_ipsec_obj(struct mlx5_core_dev *mdev,
>-				 struct mlx5_ipsec_obj_attrs *attrs,
>-				 u32 ipsec_id)
>+static int mlx5_modify_ipsec_obj(struct mlx5e_ipsec_sa_entry *sa_entry,
>+				 const struct mlx5_accel_esp_xfrm_attrs *attrs)
> {
>+	struct mlx5_core_dev *mdev = mlx5e_ipsec_sa2dev(sa_entry);
> 	u32 in[MLX5_ST_SZ_DW(modify_ipsec_obj_in)] = {};
> 	u32 out[MLX5_ST_SZ_DW(query_ipsec_obj_out)];
> 	u64 modify_field_select = 0;
>@@ -239,7 +167,7 @@ static int mlx5_modify_ipsec_obj(struct mlx5_core_dev *mdev,
> 	void *obj;
> 	int err;
>
>-	if (!(attrs->accel_flags & MLX5_ACCEL_ESP_FLAGS_ESN_TRIGGERED))
>+	if (!(attrs->flags & MLX5_ACCEL_ESP_FLAGS_ESN_TRIGGERED))
> 		return 0;
>
> 	general_obj_types = MLX5_CAP_GEN_64(mdev, general_obj_types);
>@@ -249,11 +177,11 @@ static int mlx5_modify_ipsec_obj(struct mlx5_core_dev *mdev,
> 	/* general object fields set */
> 	MLX5_SET(general_obj_in_cmd_hdr, in, opcode, MLX5_CMD_OP_QUERY_GENERAL_OBJECT);
> 	MLX5_SET(general_obj_in_cmd_hdr, in, obj_type, MLX5_GENERAL_OBJECT_TYPES_IPSEC);
>-	MLX5_SET(general_obj_in_cmd_hdr, in, obj_id, ipsec_id);
>+	MLX5_SET(general_obj_in_cmd_hdr, in, obj_id, sa_entry->ipsec_obj_id);
> 	err = mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
> 	if (err) {
> 		mlx5_core_err(mdev, "Query IPsec object failed (Object id %d), err = %d\n",
>-			      ipsec_id, err);
>+			      sa_entry->ipsec_obj_id, err);
> 		return err;
> 	}
>
>@@ -266,8 +194,8 @@ static int mlx5_modify_ipsec_obj(struct mlx5_core_dev *mdev,
> 		return -EOPNOTSUPP;
>
> 	obj = MLX5_ADDR_OF(modify_ipsec_obj_in, in, ipsec_object);
>-	MLX5_SET(ipsec_obj, obj, esn_msb, attrs->esn_msb);
>-	if (attrs->accel_flags & MLX5_ACCEL_ESP_FLAGS_ESN_STATE_OVERLAP)
>+	MLX5_SET(ipsec_obj, obj, esn_msb, attrs->esn);
>+	if (attrs->flags & MLX5_ACCEL_ESP_FLAGS_ESN_STATE_OVERLAP)
> 		MLX5_SET(ipsec_obj, obj, esn_overlap, 1);
>
> 	/* general object fields set */
>@@ -276,50 +204,14 @@ static int mlx5_modify_ipsec_obj(struct mlx5_core_dev *mdev,
> 	return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
> }
>
>-void mlx5_accel_esp_modify_xfrm(struct mlx5_accel_esp_xfrm *xfrm,
>+void mlx5_accel_esp_modify_xfrm(struct mlx5e_ipsec_sa_entry *sa_entry,
> 				const struct mlx5_accel_esp_xfrm_attrs *attrs)
> {
>-	struct mlx5_ipsec_obj_attrs ipsec_attrs = {};
>-	struct mlx5_core_dev *mdev = xfrm->mdev;
>-	struct mlx5_ipsec_esp_xfrm *mxfrm;
> 	int err;
>
>-	mxfrm = container_of(xfrm, struct mlx5_ipsec_esp_xfrm, accel_xfrm);
>-
>-	/* need to add find and replace in ipsec_rhash_sa the sa_ctx */
>-	/* modify device with new hw_sa */
>-	ipsec_attrs.accel_flags = attrs->flags;
>-	ipsec_attrs.esn_msb = attrs->esn;
>-	err = mlx5_modify_ipsec_obj(mdev,
>-				    &ipsec_attrs,
>-				    mxfrm->sa_ctx->ipsec_obj_id);
>-
>+	err = mlx5_modify_ipsec_obj(sa_entry, attrs);
> 	if (err)
> 		return;
>
>-	memcpy(&xfrm->attrs, attrs, sizeof(xfrm->attrs));
>-}
>-
>-void *mlx5_accel_esp_create_hw_context(struct mlx5_core_dev *mdev,
>-				       struct mlx5_accel_esp_xfrm *xfrm,
>-				       u32 *sa_handle)
>-{
>-	__be32 saddr[4] = {}, daddr[4] = {};
>-
>-	if (!xfrm->attrs.is_ipv6) {
>-		saddr[3] = xfrm->attrs.saddr.a4;
>-		daddr[3] = xfrm->attrs.daddr.a4;
>-	} else {
>-		memcpy(saddr, xfrm->attrs.saddr.a6, sizeof(saddr));
>-		memcpy(daddr, xfrm->attrs.daddr.a6, sizeof(daddr));
>-	}
>-
>-	return mlx5_ipsec_offload_create_sa_ctx(mdev, xfrm, saddr, daddr,
>-						xfrm->attrs.spi,
>-						xfrm->attrs.is_ipv6, sa_handle);
>-}
>-
>-void mlx5_accel_esp_free_hw_context(struct mlx5_core_dev *mdev, void *context)
>-{
>-	mlx5_ipsec_offload_delete_sa_ctx(context);
>+	memcpy(&sa_entry->attrs, attrs, sizeof(sa_entry->attrs));
> }
>-- 
>2.35.1
>

^ permalink raw reply

* Re: [PATCH v2 net] tcp: md5: incorrect tcp_header_len for incoming connections
From: patchwork-bot+netdevbpf @ 2022-04-22 22:20 UTC (permalink / raw)
  To: Francesco Ruggeri
  Cc: pabeni, kuba, dsahern, yoshfuji, davem, edumazet, linux-kernel,
	netdev
In-Reply-To: <20220421005026.686A45EC01F2@us226.sjc.aristanetworks.com>

Hello:

This patch was applied to netdev/net.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 20 Apr 2022 17:50:26 -0700 you wrote:
> In tcp_create_openreq_child we adjust tcp_header_len for md5 using the
> remote address in newsk. But that address is still 0 in newsk at this
> point, and it is only set later by the callers (tcp_v[46]_syn_recv_sock).
> Use the address from the request socket instead.
> 
> v2: Added "Fixes:" line.
> 
> [...]

Here is the summary with links:
  - [v2,net] tcp: md5: incorrect tcp_header_len for incoming connections
    https://git.kernel.org/netdev/net/c/5b0b9e4c2c89

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net-next] net: bridge: switchdev: check br_vlan_group() return value
From: patchwork-bot+netdevbpf @ 2022-04-22 22:20 UTC (permalink / raw)
  To: =?utf-8?b?Q2zDqW1lbnQgTMOpZ2VyIDxjbGVtZW50LmxlZ2VyQGJvb3RsaW4uY29tPg==?=
  Cc: roopa, razor, davem, kuba, pabeni, tobias, bridge, netdev,
	linux-kernel
In-Reply-To: <20220421101247.121896-1-clement.leger@bootlin.com>

Hello:

This patch was applied to netdev/net.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 21 Apr 2022 12:12:47 +0200 you wrote:
> br_vlan_group() can return NULL and thus return value must be checked
> to avoid dereferencing a NULL pointer.
> 
> Fixes: 6284c723d9b9 ("net: bridge: mst: Notify switchdev drivers of VLAN MSTI migrations")
> Signed-off-by: Clément Léger <clement.leger@bootlin.com>
> ---
>  net/bridge/br_switchdev.c | 2 ++
>  1 file changed, 2 insertions(+)

Here is the summary with links:
  - [net-next] net: bridge: switchdev: check br_vlan_group() return value
    https://git.kernel.org/netdev/net/c/7f40ea2145d9

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox