* Re: [PATCH net] net/mlx5: Fix a NULL vs IS_ERR() check
From: Wojciech Drewek @ 2023-11-03 9:51 UTC (permalink / raw)
To: Dan Carpenter, Saeed Mahameed
Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Eli Cohen, netdev, linux-rdma, kernel-janitors
In-Reply-To: <4ee5fbea-7807-42dd-a9b8-738ac23249d0@moroto.mountain>
On 03.11.2023 07:36, Dan Carpenter wrote:
> The mlx5_esw_offloads_devlink_port() function returns error pointers, not
> NULL.
>
> Fixes: 7bef147a6ab6 ("net/mlx5: Don't skip vport check")
> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
> ---
> I *think* these normally go through the mellanox tree and not net.
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
>
> drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> index 693e55b010d9..5c569d4bfd00 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> @@ -1493,7 +1493,7 @@ mlx5e_vport_vf_rep_load(struct mlx5_core_dev *dev, struct mlx5_eswitch_rep *rep)
>
> dl_port = mlx5_esw_offloads_devlink_port(dev->priv.eswitch,
> rpriv->rep->vport);
> - if (dl_port) {
> + if (!IS_ERR(dl_port)) {
> SET_NETDEV_DEVLINK_PORT(netdev, dl_port);
> mlx5e_rep_vnic_reporter_create(priv, dl_port);
> }
^ permalink raw reply
* Re: [PATCH net] netfilter: nf_tables: fix pointer math issue in nft_byteorder_eval()
From: Pablo Neira Ayuso @ 2023-11-03 9:45 UTC (permalink / raw)
To: Florian Westphal
Cc: Dan Carpenter, Jozsef Kadlecsik, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, netfilter-devel, coreteam, netdev,
linux-kernel, kernel-janitors
In-Reply-To: <20231103091801.GA8035@breakpoint.cc>
On Fri, Nov 03, 2023 at 10:18:01AM +0100, Florian Westphal wrote:
> Dan Carpenter <dan.carpenter@linaro.org> wrote:
> > The problem is in nft_byteorder_eval() where we are iterating through a
> > loop and writing to dst[0], dst[1], dst[2] and so on... On each
> > iteration we are writing 8 bytes. But dst[] is an array of u32 so each
> > element only has space for 4 bytes. That means that every iteration
> > overwrites part of the previous element.
> >
> > I spotted this bug while reviewing commit caf3ef7468f7 ("netfilter:
> > nf_tables: prevent OOB access in nft_byteorder_eval") which is a related
> > issue. I think that the reason we have not detected this bug in testing
> > is that most of time we only write one element.
>
> LGTM, thanks Dan. We will route this via nf.git.
Thanks for your patch.
One question, is this update really required?
diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 3bbd13ab1ecf..b157c5cafd14 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -178,9 +178,9 @@ static inline __be32 nft_reg_load_be32(const u32 *sreg)
return *(__force __be32 *)sreg;
}
-static inline void nft_reg_store64(u32 *dreg, u64 val)
+static inline void nft_reg_store64(u64 *dreg, u64 val)
{
- put_unaligned(val, (u64 *)dreg);
+ put_unaligned(val, dreg);
}
static inline u64 nft_reg_load64(const u32 *sreg)
because one of the goals of nft_reg_store64() is to avoid that caller
casts the register to 64-bits.
^ permalink raw reply related
* Re: [PATCH net-next] tcp: get rid of sysctl_tcp_adv_win_scale
From: Michal Kubecek @ 2023-11-03 9:27 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jiri Slaby, David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng, eric.dumazet
In-Reply-To: <CANn89iJOwQUwAVcofW+X_8srFcPnaWKyqOoM005L6Zgh8=OvpA@mail.gmail.com>
On Fri, Nov 03, 2023 at 09:17:27AM +0100, Eric Dumazet wrote:
>
> It seems the test had some expectations.
>
> Setting a small (1 byte) RCVBUF/SNDBUF, and yet expecting to send
> 46080 bytes fast enough was not reasonable.
> It might have relied on the fact that tcp sendmsg() can cook large GSO
> packets, even if sk->sk_sndbuf is small.
>
> With tight memory settings, it is possible TCP has to resort on RTO
> timers (200ms by default) to recover from dropped packets.
There seems to be one drop but somehow the sender does not recover from
it, even if the retransmit and following packets are acked quickly:
09:15:29.424017 IP 127.0.0.1.40386 > 127.0.0.1.42483: Flags [S], seq 104649613, win 33280, options [mss 65495,sackOK,TS val 1319295278 ecr 0,nop,wscale 7], length 0
09:15:29.424024 IP 127.0.0.1.42483 > 127.0.0.1.40386: Flags [S.], seq 1343383818, ack 104649614, win 585, options [mss 65495,sackOK,TS val 1319295278 ecr 1319295278,nop,wscale 0], length 0
09:15:29.424031 IP 127.0.0.1.40386 > 127.0.0.1.42483: Flags [.], ack 1, win 260, options [nop,nop,TS val 1319295278 ecr 1319295278], length 0
09:15:29.424155 IP 127.0.0.1.42483 > 127.0.0.1.40386: Flags [.], seq 1:16641, ack 1, win 585, options [nop,nop,TS val 1319295279 ecr 1319295278], length 16640
09:15:29.424160 IP 127.0.0.1.40386 > 127.0.0.1.42483: Flags [.], ack 16641, win 130, options [nop,nop,TS val 1319295279 ecr 1319295279], length 0
09:15:29.424179 IP 127.0.0.1.42483 > 127.0.0.1.40386: Flags [P.], seq 16641:33281, ack 1, win 585, options [nop,nop,TS val 1319295279 ecr 1319295279], length 16640
09:15:29.424183 IP 127.0.0.1.40386 > 127.0.0.1.42483: Flags [.], ack 16641, win 0, options [nop,nop,TS val 1319295279 ecr 1319295279], length 0
09:15:29.424280 IP 127.0.0.1.40386 > 127.0.0.1.42483: Flags [P.], seq 1:12, ack 16641, win 16640, options [nop,nop,TS val 1319295279 ecr 1319295279], length 11
09:15:29.424284 IP 127.0.0.1.42483 > 127.0.0.1.40386: Flags [.], ack 12, win 574, options [nop,nop,TS val 1319295279 ecr 1319295279], length 0
09:15:29.630272 IP 127.0.0.1.42483 > 127.0.0.1.40386: Flags [P.], seq 16641:33281, ack 12, win 574, options [nop,nop,TS val 1319295485 ecr 1319295279], length 16640
09:15:29.630334 IP 127.0.0.1.40386 > 127.0.0.1.42483: Flags [.], ack 33281, win 2304, options [nop,nop,TS val 1319295485 ecr 1319295485], length 0
09:15:29.836938 IP 127.0.0.1.42483 > 127.0.0.1.40386: Flags [P.], seq 33281:35585, ack 12, win 574, options [nop,nop,TS val 1319295691 ecr 1319295485], length 2304
09:15:29.836984 IP 127.0.0.1.40386 > 127.0.0.1.42483: Flags [.], ack 35585, win 2304, options [nop,nop,TS val 1319295691 ecr 1319295691], length 0
09:15:30.043606 IP 127.0.0.1.42483 > 127.0.0.1.40386: Flags [P.], seq 35585:37889, ack 12, win 574, options [nop,nop,TS val 1319295898 ecr 1319295691], length 2304
09:15:30.043653 IP 127.0.0.1.40386 > 127.0.0.1.42483: Flags [.], ack 37889, win 2304, options [nop,nop,TS val 1319295898 ecr 1319295898], length 0
09:15:30.250270 IP 127.0.0.1.42483 > 127.0.0.1.40386: Flags [P.], seq 37889:40193, ack 12, win 574, options [nop,nop,TS val 1319296105 ecr 1319295898], length 2304
09:15:30.250316 IP 127.0.0.1.40386 > 127.0.0.1.42483: Flags [.], ack 40193, win 2304, options [nop,nop,TS val 1319296105 ecr 1319296105], length 0
09:15:30.456932 IP 127.0.0.1.42483 > 127.0.0.1.40386: Flags [P.], seq 40193:42497, ack 12, win 574, options [nop,nop,TS val 1319296311 ecr 1319296105], length 2304
09:15:30.456975 IP 127.0.0.1.40386 > 127.0.0.1.42483: Flags [.], ack 42497, win 2304, options [nop,nop,TS val 1319296311 ecr 1319296311], length 0
09:15:30.663598 IP 127.0.0.1.42483 > 127.0.0.1.40386: Flags [P.], seq 42497:44801, ack 12, win 574, options [nop,nop,TS val 1319296518 ecr 1319296311], length 2304
09:15:30.663638 IP 127.0.0.1.40386 > 127.0.0.1.42483: Flags [.], ack 44801, win 2304, options [nop,nop,TS val 1319296518 ecr 1319296518], length 0
09:15:30.663646 IP 127.0.0.1.42483 > 127.0.0.1.40386: Flags [FP.], seq 44801:46081, ack 12, win 574, options [nop,nop,TS val 1319296518 ecr 1319296518], length 1280
09:15:30.663712 IP 127.0.0.1.40386 > 127.0.0.1.42483: Flags [F.], seq 12, ack 46082, win 2304, options [nop,nop,TS val 1319296518 ecr 1319296518], length 0
09:15:30.663724 IP 127.0.0.1.42483 > 127.0.0.1.40386: Flags [.], ack 13, win 573, options [nop,nop,TS val 1319296518 ecr 1319296518], length 0
(window size values are scaled here). Part of the problem is that the
receiver side sets SO_RCVBUF after connect() so that the window shrinks
after sender already sent more data; when I move the bufsized() calls
in the python script before listen() and connect(), the test runs
quickly.
Michal
^ permalink raw reply
* Re: [PATCH] Fix termination state for idr_for_each_entry_ul()
From: patchwork-bot+netdevbpf @ 2023-11-03 9:20 UTC (permalink / raw)
To: NeilBrown; +Cc: netdev, davem, linux-kernel, willy, chrism, xiyou.wangcong
In-Reply-To: <169810161336.20306.1410058490199370047@noble.neil.brown.name>
Hello:
This patch was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:
On Tue, 24 Oct 2023 09:53:33 +1100 you wrote:
> The comment for idr_for_each_entry_ul() states
>
> after normal termination @entry is left with the value NULL
>
> This is not correct in the case where UINT_MAX has an entry in the idr.
> In that case @entry will be non-NULL after termination.
> No current code depends on the documentation being correct, but to
> save future code we should fix it.
>
> [...]
Here is the summary with links:
- Fix termination state for idr_for_each_entry_ul()
https://git.kernel.org/netdev/net/c/e8ae8ad479e2
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net] tcp: fix fastopen code vs usec TS
From: patchwork-bot+netdevbpf @ 2023-11-03 9:20 UTC (permalink / raw)
To: Eric Dumazet
Cc: davem, kuba, pabeni, netdev, eric.dumazet, oliver.sang, ncardwell,
morleyd
In-Reply-To: <20231031061945.2801972-1-edumazet@google.com>
Hello:
This patch was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:
On Tue, 31 Oct 2023 06:19:45 +0000 you wrote:
> After blamed commit, TFO client-ack-dropped-then-recovery-ms-timestamps
> packetdrill test failed.
>
> David Morley and Neal Cardwell started investigating and Neal pointed
> that we had :
>
> tcp_conn_request()
> tcp_try_fastopen()
> -> tcp_fastopen_create_child
> -> child = inet_csk(sk)->icsk_af_ops->syn_recv_sock()
> -> tcp_create_openreq_child()
> -> copy req_usec_ts from req:
> newtp->tcp_usec_ts = treq->req_usec_ts;
> // now the new TFO server socket always does usec TS, no matter
> // what the route options are...
> send_synack()
> -> tcp_make_synack()
> // disable tcp_rsk(req)->req_usec_ts if route option is not present:
> if (tcp_rsk(req)->req_usec_ts < 0)
> tcp_rsk(req)->req_usec_ts = dst_tcp_usec_ts(dst);
>
> [...]
Here is the summary with links:
- [net] tcp: fix fastopen code vs usec TS
https://git.kernel.org/netdev/net/c/cdbab6236605
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCHv2 net] selftests: pmtu.sh: fix result checking
From: patchwork-bot+netdevbpf @ 2023-11-03 9:20 UTC (permalink / raw)
To: Hangbin Liu
Cc: netdev, davem, kuba, edumazet, pabeni, shuah, dsahern,
linux-kselftest, po-hsu.lin
In-Reply-To: <20231031034732.3531008-1-liuhangbin@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:
On Tue, 31 Oct 2023 11:47:32 +0800 you wrote:
> In the PMTU test, when all previous tests are skipped and the new test
> passes, the exit code is set to 0. However, the current check mistakenly
> treats this as an assignment, causing the check to pass every time.
>
> Consequently, regardless of how many tests have failed, if the latest test
> passes, the PMTU test will report a pass.
>
> [...]
Here is the summary with links:
- [PATCHv2,net] selftests: pmtu.sh: fix result checking
https://git.kernel.org/netdev/net/c/63e201916b27
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net v3] net: stmmac: xgmac: Enable support for multiple Flexible PPS outputs
From: patchwork-bot+netdevbpf @ 2023-11-03 9:20 UTC (permalink / raw)
To: Furong Xu
Cc: davem, alexandre.torgue, joabreu, edumazet, kuba, pabeni,
mcoquelin.stm32, jpinto, horms, fancer.lancer, jacob.e.keller,
netdev, linux-stm32, linux-arm-kernel, linux-kernel, xfr, rock.xu
In-Reply-To: <20231031022729.2347871-1-0x1207@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:
On Tue, 31 Oct 2023 10:27:29 +0800 you wrote:
> From XGMAC Core 3.20 and later, each Flexible PPS has individual PPSEN bit
> to select Fixed mode or Flexible mode. The PPSEN must be set, or it stays
> in Fixed PPS mode by default.
> XGMAC Core prior 3.20, only PPSEN0(bit 4) is writable. PPSEN{1,2,3} are
> read-only reserved, and they are already in Flexible mode by default, our
> new code always set PPSEN{1,2,3} do not make things worse ;-)
>
> [...]
Here is the summary with links:
- [net,v3] net: stmmac: xgmac: Enable support for multiple Flexible PPS outputs
https://git.kernel.org/netdev/net/c/db456d90a4c1
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net] netfilter: nf_tables: fix pointer math issue in nft_byteorder_eval()
From: Florian Westphal @ 2023-11-03 9:18 UTC (permalink / raw)
To: Dan Carpenter
Cc: Florian Westphal, Pablo Neira Ayuso, Jozsef Kadlecsik,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
netfilter-devel, coreteam, netdev, linux-kernel, kernel-janitors
In-Reply-To: <15fdceb5-2de5-4453-98b3-cfa9d486e8da@moroto.mountain>
Dan Carpenter <dan.carpenter@linaro.org> wrote:
> The problem is in nft_byteorder_eval() where we are iterating through a
> loop and writing to dst[0], dst[1], dst[2] and so on... On each
> iteration we are writing 8 bytes. But dst[] is an array of u32 so each
> element only has space for 4 bytes. That means that every iteration
> overwrites part of the previous element.
>
> I spotted this bug while reviewing commit caf3ef7468f7 ("netfilter:
> nf_tables: prevent OOB access in nft_byteorder_eval") which is a related
> issue. I think that the reason we have not detected this bug in testing
> is that most of time we only write one element.
LGTM, thanks Dan. We will route this via nf.git.
^ permalink raw reply
* [PATCH] net: ti: icssg-prueth: Fix error cleanup on failing pruss_request_mem_region
From: Jan Kiszka @ 2023-11-03 9:13 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
MD Danish Anwar
Cc: netdev, linux-kernel, Lopes Ivo, Diogo Miguel (T CED IFD-PT),
Nishanth Menon, Su, Bao Cheng (RC-CN DF FA R&D)
From: Jan Kiszka <jan.kiszka@siemens.com>
We were just continuing in this case, surely not desired.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
drivers/net/ethernet/ti/icssg/icssg_prueth.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.c b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
index 0242e123fc05..70c696ef0177 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_prueth.c
+++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
@@ -2064,7 +2064,7 @@ static int prueth_probe(struct platform_device *pdev)
&prueth->shram);
if (ret) {
dev_err(dev, "unable to get PRUSS SHRD RAM2: %d\n", ret);
- pruss_put(prueth->pruss);
+ goto put_pruss;
}
prueth->sram_pool = of_gen_pool_get(np, "sram", 0);
@@ -2209,6 +2209,8 @@ static int prueth_probe(struct platform_device *pdev)
put_mem:
pruss_release_mem_region(prueth->pruss, &prueth->shram);
+
+put_pruss:
pruss_put(prueth->pruss);
put_cores:
--
2.35.3
^ permalink raw reply related
* Re: [PATCH net] tcp: Fix -Wc23-extensions in tcp_options_write()
From: Christoph Hellwig @ 2023-11-03 8:22 UTC (permalink / raw)
To: Nathan Chancellor
Cc: edumazet, davem, dsahern, kuba, pabeni, ndesaulniers, trix,
0x7f454c46, fruggeri, noureddine, netdev, linux-kernel, llvm,
patches
In-Reply-To: <20231031-tcp-ao-fix-label-in-compound-statement-warning-v1-1-c9731d115f17@kernel.org>
On Tue, Oct 31, 2023 at 01:23:35PM -0700, Nathan Chancellor wrote:
> Clang warns (or errors with CONFIG_WERROR=y) when CONFIG_TCP_AO is set:
>
> net/ipv4/tcp_output.c:663:2: error: label at end of compound statement is a C23 extension [-Werror,-Wc23-extensions]
> 663 | }
> | ^
> 1 error generated.
>
> On earlier releases (such as clang-11, the current minimum supported
> version for building the kernel) that do not support C23, this was a
> hard error unconditionally:
>
> net/ipv4/tcp_output.c:663:2: error: expected statement
> }
> ^
> 1 error generated.
>
> Add a semicolon after the label to create an empty statement, which
> resolves the warning or error for all compilers.
Can you please just split the A0 handlig into a separate helper, which
shuld make the whole thing a lot cleaner?
^ permalink raw reply
* Re: [PATCH net-next] tcp: get rid of sysctl_tcp_adv_win_scale
From: Eric Dumazet @ 2023-11-03 8:17 UTC (permalink / raw)
To: Jiri Slaby
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng, eric.dumazet
In-Reply-To: <875400ba-58d8-44a0-8fe9-334e322bd1db@kernel.org>
On Fri, Nov 3, 2023 at 8:07 AM Jiri Slaby <jirislaby@kernel.org> wrote:
>
> On 03. 11. 23, 7:56, Jiri Slaby wrote:
> > On 03. 11. 23, 7:10, Jiri Slaby wrote:
> >> On 17. 07. 23, 17:29, Eric Dumazet wrote:
> >>> With modern NIC drivers shifting to full page allocations per
> >>> received frame, we face the following issue:
> >>>
> >>> TCP has one per-netns sysctl used to tweak how to translate
> >>> a memory use into an expected payload (RWIN), in RX path.
> >>>
> >>> tcp_win_from_space() implementation is limited to few cases.
> >>>
> >>> For hosts dealing with various MSS, we either under estimate
> >>> or over estimate the RWIN we send to the remote peers.
> >>>
> >>> For instance with the default sysctl_tcp_adv_win_scale value,
> >>> we expect to store 50% of payload per allocated chunk of memory.
> >>>
> >>> For the typical use of MTU=1500 traffic, and order-0 pages allocations
> >>> by NIC drivers, we are sending too big RWIN, leading to potential
> >>> tcp collapse operations, which are extremely expensive and source
> >>> of latency spikes.
> >>>
> >>> This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
> >>> uses a per socket scaling factor, so that we can precisely
> >>> adjust the RWIN based on effective skb->len/skb->truesize ratio.
> >>>
> >>> This patch alone can double TCP receive performance when receivers
> >>> are too slow to drain their receive queue, or by allowing
> >>> a bigger RWIN when MSS is close to PAGE_SIZE.
> >>
> >> Hi,
> >>
> >> I bisected a python-eventlet test failure:
> >> > =================================== FAILURES
> >> ===================================
> >> > _______________________ TestGreenSocket.test_full_duplex
> >> _______________________
> >> >
> >> > self = <tests.greenio_test.TestGreenSocket
> >> testMethod=test_full_duplex>
> >> >
> >> > def test_full_duplex(self):
> >> > ...
> >> > > large_evt.wait()
> >> >
> >> > tests/greenio_test.py:424:
> >> > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >> _ _ _ _ _ _
> >> > eventlet/greenthread.py:181: in wait
> >> > return self._exit_event.wait()
> >> > eventlet/event.py:125: in wait
> >> > result = hub.switch()
> >> ...
> >> > E tests.TestIsTakingTooLong: 1
> >> >
> >> > eventlet/hubs/hub.py:313: TestIsTakingTooLong
> >>
> >> to this commit. With the commit, the test takes > 1.5 s. Without the
> >> commit it takes only < 300 ms. And they set timeout to 1 s.
> >>
> >> The reduced self-stadning test case:
> >> #!/usr/bin/python3
> >> import eventlet
> >> from eventlet.green import select, socket, time, ssl
> >>
> >> def bufsized(sock, size=1):
> >> """ Resize both send and receive buffers on a socket.
> >> Useful for testing trampoline. Returns the socket.
> >>
> >> >>> import socket
> >> >>> sock = bufsized(socket.socket(socket.AF_INET,
> >> socket.SOCK_STREAM))
> >> """
> >> sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
> >> sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
> >> return sock
> >>
> >> def min_buf_size():
> >> """Return the minimum buffer size that the platform supports."""
> >> test_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> >> test_sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1)
> >> return test_sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
> >>
> >> def test_full_duplex():
> >> large_data = b'*' * 10 * min_buf_size()
> >> listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> >> listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
> >> listener.bind(('127.0.0.1', 0))
> >> listener.listen(50)
> >> bufsized(listener)
> >>
> >> def send_large(sock):
> >> sock.sendall(large_data)
> >>
> >> def read_large(sock):
> >> result = sock.recv(len(large_data))
> >> while len(result) < len(large_data):
> >> result += sock.recv(len(large_data))
> >> assert result == large_data
> >>
> >> def server():
> >> (sock, addr) = listener.accept()
> >> sock = bufsized(sock)
> >> send_large_coro = eventlet.spawn(send_large, sock)
> >> eventlet.sleep(0)
> >> result = sock.recv(10)
> >> expected = b'hello world'
> >> while len(result) < len(expected):
> >> result += sock.recv(10)
> >> assert result == expected
> >> send_large_coro.wait()
> >>
> >> server_evt = eventlet.spawn(server)
> >> client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> >> client.connect(('127.0.0.1', listener.getsockname()[1]))
> >> bufsized(client)
> >> large_evt = eventlet.spawn(read_large, client)
> >> eventlet.sleep(0)
> >> client.sendall(b'hello world')
> >> server_evt.wait()
> >> large_evt.wait()
> >> client.close()
> >>
> >> test_full_duplex()
> >>
> >> =====================================
> >>
> >> I speak neither python nor networking, so any ideas :)? Is the test
> >> simply wrong?
> >
> > strace -rT -e trace=network:
> >
> > GOOD:
> > > 0.000000 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> > <0.000063>
> > > 0.000406 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000042>
> > > 0.000097 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [4608], [4]) = 0
> > <0.000012>
> > > 0.000101 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> > <0.000015>
> > > 0.000058 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 <0.000009>
> > > 0.000035 bind(3, {sa_family=AF_INET, sin_port=htons(0),
> > sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000027>
> > > 0.000058 listen(3, 50) = 0 <0.000014>
> > > 0.000029 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
> > > 0.000023 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000008>
> > > 0.000052 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 5
> > <0.000014>
> > > 0.000050 getsockname(3, {sa_family=AF_INET, sin_port=htons(44313),
> > sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000011>
> > > 0.000037 connect(5, {sa_family=AF_INET, sin_port=htons(44313),
> > sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in
> > progress) <0.000070>
> > > 0.000210 accept4(3, {sa_family=AF_INET, sin_port=htons(56062),
> > sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 6 <0.000012>
> > > 0.000040 getsockname(6, {sa_family=AF_INET, sin_port=htons(44313),
> > sin_addr=inet_addr("127.0.0.1")}, [128 => 16]) = 0 <0.000007>
> > > 0.000062 setsockopt(6, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000007>
> > > 0.000020 setsockopt(6, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000007>
> > > 0.000082 getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000007>
> > > 0.000023 connect(5, {sa_family=AF_INET, sin_port=htons(44313),
> > sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000008>
> > > 0.000022 setsockopt(5, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000020>
> > > 0.000036 setsockopt(5, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000007>
> > > 0.000061 sendto(6, "********************************"..., 46080, 0,
> > NULL, 0) = 32768 <0.000049>
> > > 0.000135 sendto(6, "********************************"..., 13312, 0,
> > NULL, 0) = 13312 <0.000017>
> > > 0.000087 recvfrom(6, 0x7f78e58af890, 10, 0, NULL, NULL) = -1 EAGAIN
> > (Resource temporarily unavailable) <0.000010>
> > > 0.000125 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 32768 <0.000032>
> > > 0.000066 recvfrom(5, 0x55fdb41bb880, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000011>
> > > 0.000075 sendto(5, "hello world", 11, 0, NULL, 0) = 11 <0.000023>
> > > 0.000117 recvfrom(6, "hello worl", 10, 0, NULL, NULL) = 10 <0.000015>
> > > 0.000050 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000011>
> > > 0.000212 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 13312 <0.000019>
> > > 0.050676 +++ exited with 0 +++
> >
> >
> > BAD:
> > > 0.000000 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> > <0.000045>
> > > 0.000244 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000015>
> > > 0.000057 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [4608], [4]) = 0
> > <0.000013>
> > > 0.000104 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> > <0.000016>
> > > 0.000065 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 <0.000010>
> > > 0.000038 bind(3, {sa_family=AF_INET, sin_port=htons(0),
> > sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000031>
> > > 0.000068 listen(3, 50) = 0 <0.000014>
> > > 0.000032 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000010>
> > > 0.000030 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000018>
> > > 0.000060 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 5
> > <0.000023>
> > > 0.000071 getsockname(3, {sa_family=AF_INET, sin_port=htons(45901),
> > sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000019>
> > > 0.000068 connect(5, {sa_family=AF_INET, sin_port=htons(45901),
> > sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in
> > progress) <0.000074>
> > > 0.000259 accept4(3, {sa_family=AF_INET, sin_port=htons(35002),
> > sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 6 <0.000014>
> > > 0.000051 getsockname(6, {sa_family=AF_INET, sin_port=htons(45901),
> > sin_addr=inet_addr("127.0.0.1")}, [128 => 16]) = 0 <0.000010>
> > > 0.000082 setsockopt(6, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
> > > 0.000040 setsockopt(6, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000009>
> > > 0.000104 getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000009>
> > > 0.000028 connect(5, {sa_family=AF_INET, sin_port=htons(45901),
> > sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000009>
> > > 0.000026 setsockopt(5, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
> > > 0.000024 setsockopt(5, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000008>
> > > 0.000071 sendto(6, "********************************"..., 46080, 0,
> > NULL, 0) = 16640 <0.000026>
> > > 0.000117 sendto(6, "********************************"..., 29440, 0,
> > NULL, 0) = 16640 <0.000017>
> > > 0.000041 sendto(6, "********************************"..., 12800, 0,
> > NULL, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000009>
> > > 0.000075 recvfrom(6, 0x7f4db88a38c0, 10, 0, NULL, NULL) = -1 EAGAIN
> > (Resource temporarily unavailable) <0.000010>
> > > 0.000086 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 16640 <0.000018>
> > > 0.000044 recvfrom(5, 0x55a64d59b2a0, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000009>
> > > 0.000059 sendto(5, "hello world", 11, 0, NULL, 0) = 11 <0.000018>
> > > 0.000093 recvfrom(6, "hello worl", 10, 0, NULL, NULL) = 10 <0.000009>
> > > 0.000029 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000009>
> > > 0.206685 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 16640 <0.000116>
> > > 0.000306 recvfrom(5, 0x55a64d5a7600, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000013>
> > > 0.000208 sendto(6, "********************************"..., 12800, 0,
> > NULL, 0) = 12800 <0.000025>
> > > 0.206317 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 2304 <0.000171>
> > > 0.000304 recvfrom(5, 0x55a64d597170, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000029>
> > > 0.206161 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 2304 <0.000082>
> > > 0.000212 recvfrom(5, 0x55a64d5a0ed0, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000034>
> > > 0.206572 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 2304 <0.000146>
> > > 0.000274 recvfrom(5, 0x55a64d597170, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000029>
> > > 0.206604 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 2304 <0.000162>
> > > 0.000270 recvfrom(5, 0x55a64d5a20d0, 46080, 0, NULL, NULL) = -1
> > EAGAIN (Resource temporarily unavailable) <0.000016>
> > > 0.206164 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 2304 <0.000116>
> > > 0.000291 recvfrom(5, "********************************"..., 46080, 0,
> > NULL, NULL) = 1280 <0.000038>
> > > 0.052224 +++ exited with 0 +++
> >
> > I.e. recvfrom() returns -EAGAIN and takes 200 ms.
>
> Ah, no, those 200 ms are not spent in recvfrom() but in poll:
> > 0.000029 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000009>
> > 0.000040 epoll_wait(4, [{events=EPOLLIN, data={u32=5,
> u64=139942919405573}}], 1023, 60000) = 1 <0.204038>
> > 0.204258 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b4049c) = 0 <0.000045>
> > 0.000104 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 16640 <0.000078>
> > 0.000264 recvfrom(5, 0x5603425f3550, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000025>
> > 0.000127 epoll_ctl(4, EPOLL_CTL_ADD, 5,
> {events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5,
> u64=94570884890629}}) = 0 <0.000031>
> > 0.000112 epoll_wait(4, [{events=EPOLLOUT, data={u32=6,
> u64=139942919405574}}], 1023, 60000) = 1 <0.000026>
> > 0.000083 epoll_ctl(4, EPOLL_CTL_DEL, 6, 0x7ffd49b404fc) = 0 <0.000025>
> > 0.000063 sendto(6, "********************************"..., 12800, 0,
> NULL, 0) = 12800 <0.000028>
> > 0.000226 epoll_wait(4, [], 1023, 0) = 0 <0.000007>
> > 0.000029 epoll_wait(4, [{events=EPOLLIN, data={u32=5,
> u64=94570884890629}}], 1023, 60000) = 1 <0.205476>
> > 0.205728 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000077>
> > 0.000157 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 2304 <0.000066>
> > 0.000180 recvfrom(5, 0x5603425e30c0, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000026>
> > 0.000139 epoll_ctl(4, EPOLL_CTL_ADD, 5,
> {events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5, u64=5}}) = 0
> <0.000030>
> > 0.000104 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}], 1023,
> 60000) = 1 <0.205881>
> > 0.206222 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000086>
> > 0.000189 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 2304 <0.000027>
> > 0.000153 recvfrom(5, 0x5603425ece20, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000017>
> > 0.000074 epoll_ctl(4, EPOLL_CTL_ADD, 5,
> {events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5, u64=5}}) = 0
> <0.000018>
> > 0.000067 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}], 1023,
> 60000) = 1 <0.205749>
> > 0.206088 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000197>
> > 0.000391 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 2304 <0.000080>
> > 0.000210 recvfrom(5, 0x5603425e30c0, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000027>
> > 0.000174 epoll_ctl(4, EPOLL_CTL_ADD, 5,
> {events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5, u64=5}}) = 0
> <0.000031>
> > 0.000127 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}], 1023,
> 60000) = 1 <0.205471>
> > 0.205783 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000205>
>
It seems the test had some expectations.
Setting a small (1 byte) RCVBUF/SNDBUF, and yet expecting to send
46080 bytes fast enough was not reasonable.
It might have relied on the fact that tcp sendmsg() can cook large GSO
packets, even if sk->sk_sndbuf is small.
With tight memory settings, it is possible TCP has to resort on RTO
timers (200ms by default) to recover from dropped packets.
What happens if you double /proc/sys/net/ipv4/tcp_wmem and/or
/proc/sys/net/ipv4/tcp_rmem first value ?
(4096 -> 8192)
^ permalink raw reply
* Re: [PATCH net v2 0/3] bugfixs for smc
From: Wenjia Zhang @ 2023-11-03 8:13 UTC (permalink / raw)
To: D. Wythe, kgraul, jaka, wintera
Cc: kuba, davem, netdev, linux-s390, linux-rdma
In-Reply-To: <1698991660-82957-1-git-send-email-alibuda@linux.alibaba.com>
On 03.11.23 07:07, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
>
> This patches includes bugfix following:
>
> 1. hung state
> 2. sock leak
> 3. potential panic
>
> We have been testing these patches for some time, but
> if you have any questions, please let us know.
>
> --
> v1:
> Fix spelling errors and incorrect function names in descriptions
>
> v2->v1:
> Add fix tags for bugfix patch
>
> D. Wythe (3):
> net/smc: fix dangling sock under state SMC_APPFINCLOSEWAIT
> net/smc: allow cdc msg send rather than drop it with NULL sndbuf_desc
> net/smc: put sk reference if close work was canceled
>
> net/smc/af_smc.c | 4 ++--
> net/smc/smc.h | 5 +++++
> net/smc/smc_cdc.c | 11 +++++------
> net/smc/smc_close.c | 5 +++--
> 4 files changed, 15 insertions(+), 10 deletions(-)
>
Thank you for the fixes, LGTM! For all of the 3 pathces:
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
^ permalink raw reply
* [PATCH bpf-next] bpftool: fix prog object type in manpage
From: Artem Savkov @ 2023-11-03 8:11 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf, netdev
Cc: Artem Savkov, Jerry Snitselaar
bpftool's man page lists "program" as one of possible values for OBJECT,
while in fact bpftool accepts "prog" instead.
Reported-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
---
tools/bpf/bpftool/Documentation/bpftool.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst b/tools/bpf/bpftool/Documentation/bpftool.rst
index 6965c94dfdafe..09e4f2ff5658b 100644
--- a/tools/bpf/bpftool/Documentation/bpftool.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool.rst
@@ -20,7 +20,7 @@ SYNOPSIS
**bpftool** **version**
- *OBJECT* := { **map** | **program** | **link** | **cgroup** | **perf** | **net** | **feature** |
+ *OBJECT* := { **map** | **prog** | **link** | **cgroup** | **perf** | **net** | **feature** |
**btf** | **gen** | **struct_ops** | **iter** }
*OPTIONS* := { { **-V** | **--version** } | |COMMON_OPTIONS| }
--
2.41.0
^ permalink raw reply related
* [PATCH] netlink: introduce netlink poll to resolve fast return issue
From: Jong eon Park @ 2023-11-03 7:22 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: netdev, linux-kernel, Jong eon Park, Dong ha Kang
In-Reply-To: <CGME20231103072245epcas1p4471a31e9f579e38501c8c856d3ca2a77@epcas1p4.samsung.com>
In very rare cases, there was an issue where a user's poll function
waiting for a uevent would continuously return very quickly, causing
excessive CPU usage due to the following scenario.
Once sk_rcvbuf becomes full netlink_broadcast_deliver returns an error and
netlink_overrun is called. However, if netlink_overrun was called in a
context just before a another context returns from the poll and recv is
invoked, emptying the rcvbuf, sk->sk_err = ENOBUF is written to the
netlink socket belatedly and it enters the NETLINK_S_CONGESTED state.
If the user does not check for POLLERR, they cannot consume and clean
sk_err and repeatedly enter the situation where they call poll again but
return immediately.
To address this issue, I would like to introduce the following netlink
poll.
After calling the datagram_poll, netlink poll checks the
NETLINK_S_CONGESTED status and rcv queue, and this make the user to be
readable once more even if the user has already emptied rcv queue. This
allows the user to be able to consume sk->sk_err value through
netlink_recvmsg, thus the situation described above can be avoided
Co-developed-by: Dong ha Kang <dongha7.kang@samsung.com>
Signed-off-by: Jong eon Park <jongeon.park@samsung.com>
---
net/netlink/af_netlink.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index eb086b06d60d..f08c10220041 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2002,6 +2002,20 @@ static int netlink_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
return err ? : copied;
}
+static __poll_t netlink_poll(struct file *file, struct socket *sock,
+ poll_table *wait)
+{
+ __poll_t mask = datagram_poll(file, sock, wait);
+ struct sock *sk = sock->sk;
+ struct netlink_sock *nlk = nlk_sk(sk);
+
+ if (test_bit(NETLINK_S_CONGESTED, &nlk->state) &&
+ skb_queue_empty_lockless(&sk->sk_receive_queue))
+ mask |= EPOLLIN | EPOLLRDNORM;
+
+ return mask;
+}
+
static void netlink_data_ready(struct sock *sk)
{
BUG();
@@ -2803,7 +2817,7 @@ static const struct proto_ops netlink_ops = {
.socketpair = sock_no_socketpair,
.accept = sock_no_accept,
.getname = netlink_getname,
- .poll = datagram_poll,
+ .poll = netlink_poll,
.ioctl = netlink_ioctl,
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
--
2.25.1
^ permalink raw reply related
* Re: [PATCH net-next] tcp: get rid of sysctl_tcp_adv_win_scale
From: Jiri Slaby @ 2023-11-03 7:07 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng,
eric.dumazet
In-Reply-To: <c798f412-ac14-4997-9431-c98d1b8e16d8@kernel.org>
On 03. 11. 23, 7:56, Jiri Slaby wrote:
> On 03. 11. 23, 7:10, Jiri Slaby wrote:
>> On 17. 07. 23, 17:29, Eric Dumazet wrote:
>>> With modern NIC drivers shifting to full page allocations per
>>> received frame, we face the following issue:
>>>
>>> TCP has one per-netns sysctl used to tweak how to translate
>>> a memory use into an expected payload (RWIN), in RX path.
>>>
>>> tcp_win_from_space() implementation is limited to few cases.
>>>
>>> For hosts dealing with various MSS, we either under estimate
>>> or over estimate the RWIN we send to the remote peers.
>>>
>>> For instance with the default sysctl_tcp_adv_win_scale value,
>>> we expect to store 50% of payload per allocated chunk of memory.
>>>
>>> For the typical use of MTU=1500 traffic, and order-0 pages allocations
>>> by NIC drivers, we are sending too big RWIN, leading to potential
>>> tcp collapse operations, which are extremely expensive and source
>>> of latency spikes.
>>>
>>> This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
>>> uses a per socket scaling factor, so that we can precisely
>>> adjust the RWIN based on effective skb->len/skb->truesize ratio.
>>>
>>> This patch alone can double TCP receive performance when receivers
>>> are too slow to drain their receive queue, or by allowing
>>> a bigger RWIN when MSS is close to PAGE_SIZE.
>>
>> Hi,
>>
>> I bisected a python-eventlet test failure:
>> > =================================== FAILURES
>> ===================================
>> > _______________________ TestGreenSocket.test_full_duplex
>> _______________________
>> >
>> > self = <tests.greenio_test.TestGreenSocket
>> testMethod=test_full_duplex>
>> >
>> > def test_full_duplex(self):
>> > ...
>> > > large_evt.wait()
>> >
>> > tests/greenio_test.py:424:
>> > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>> _ _ _ _ _ _
>> > eventlet/greenthread.py:181: in wait
>> > return self._exit_event.wait()
>> > eventlet/event.py:125: in wait
>> > result = hub.switch()
>> ...
>> > E tests.TestIsTakingTooLong: 1
>> >
>> > eventlet/hubs/hub.py:313: TestIsTakingTooLong
>>
>> to this commit. With the commit, the test takes > 1.5 s. Without the
>> commit it takes only < 300 ms. And they set timeout to 1 s.
>>
>> The reduced self-stadning test case:
>> #!/usr/bin/python3
>> import eventlet
>> from eventlet.green import select, socket, time, ssl
>>
>> def bufsized(sock, size=1):
>> """ Resize both send and receive buffers on a socket.
>> Useful for testing trampoline. Returns the socket.
>>
>> >>> import socket
>> >>> sock = bufsized(socket.socket(socket.AF_INET,
>> socket.SOCK_STREAM))
>> """
>> sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
>> sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
>> return sock
>>
>> def min_buf_size():
>> """Return the minimum buffer size that the platform supports."""
>> test_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>> test_sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1)
>> return test_sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
>>
>> def test_full_duplex():
>> large_data = b'*' * 10 * min_buf_size()
>> listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>> listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
>> listener.bind(('127.0.0.1', 0))
>> listener.listen(50)
>> bufsized(listener)
>>
>> def send_large(sock):
>> sock.sendall(large_data)
>>
>> def read_large(sock):
>> result = sock.recv(len(large_data))
>> while len(result) < len(large_data):
>> result += sock.recv(len(large_data))
>> assert result == large_data
>>
>> def server():
>> (sock, addr) = listener.accept()
>> sock = bufsized(sock)
>> send_large_coro = eventlet.spawn(send_large, sock)
>> eventlet.sleep(0)
>> result = sock.recv(10)
>> expected = b'hello world'
>> while len(result) < len(expected):
>> result += sock.recv(10)
>> assert result == expected
>> send_large_coro.wait()
>>
>> server_evt = eventlet.spawn(server)
>> client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>> client.connect(('127.0.0.1', listener.getsockname()[1]))
>> bufsized(client)
>> large_evt = eventlet.spawn(read_large, client)
>> eventlet.sleep(0)
>> client.sendall(b'hello world')
>> server_evt.wait()
>> large_evt.wait()
>> client.close()
>>
>> test_full_duplex()
>>
>> =====================================
>>
>> I speak neither python nor networking, so any ideas :)? Is the test
>> simply wrong?
>
> strace -rT -e trace=network:
>
> GOOD:
> > 0.000000 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> <0.000063>
> > 0.000406 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000042>
> > 0.000097 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [4608], [4]) = 0
> <0.000012>
> > 0.000101 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> <0.000015>
> > 0.000058 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 <0.000009>
> > 0.000035 bind(3, {sa_family=AF_INET, sin_port=htons(0),
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000027>
> > 0.000058 listen(3, 50) = 0 <0.000014>
> > 0.000029 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
> > 0.000023 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000008>
> > 0.000052 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 5
> <0.000014>
> > 0.000050 getsockname(3, {sa_family=AF_INET, sin_port=htons(44313),
> sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000011>
> > 0.000037 connect(5, {sa_family=AF_INET, sin_port=htons(44313),
> sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in
> progress) <0.000070>
> > 0.000210 accept4(3, {sa_family=AF_INET, sin_port=htons(56062),
> sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 6 <0.000012>
> > 0.000040 getsockname(6, {sa_family=AF_INET, sin_port=htons(44313),
> sin_addr=inet_addr("127.0.0.1")}, [128 => 16]) = 0 <0.000007>
> > 0.000062 setsockopt(6, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000007>
> > 0.000020 setsockopt(6, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000007>
> > 0.000082 getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000007>
> > 0.000023 connect(5, {sa_family=AF_INET, sin_port=htons(44313),
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000008>
> > 0.000022 setsockopt(5, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000020>
> > 0.000036 setsockopt(5, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000007>
> > 0.000061 sendto(6, "********************************"..., 46080, 0,
> NULL, 0) = 32768 <0.000049>
> > 0.000135 sendto(6, "********************************"..., 13312, 0,
> NULL, 0) = 13312 <0.000017>
> > 0.000087 recvfrom(6, 0x7f78e58af890, 10, 0, NULL, NULL) = -1 EAGAIN
> (Resource temporarily unavailable) <0.000010>
> > 0.000125 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 32768 <0.000032>
> > 0.000066 recvfrom(5, 0x55fdb41bb880, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000011>
> > 0.000075 sendto(5, "hello world", 11, 0, NULL, 0) = 11 <0.000023>
> > 0.000117 recvfrom(6, "hello worl", 10, 0, NULL, NULL) = 10 <0.000015>
> > 0.000050 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000011>
> > 0.000212 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 13312 <0.000019>
> > 0.050676 +++ exited with 0 +++
>
>
> BAD:
> > 0.000000 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> <0.000045>
> > 0.000244 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000015>
> > 0.000057 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [4608], [4]) = 0
> <0.000013>
> > 0.000104 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> <0.000016>
> > 0.000065 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 <0.000010>
> > 0.000038 bind(3, {sa_family=AF_INET, sin_port=htons(0),
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000031>
> > 0.000068 listen(3, 50) = 0 <0.000014>
> > 0.000032 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000010>
> > 0.000030 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000018>
> > 0.000060 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 5
> <0.000023>
> > 0.000071 getsockname(3, {sa_family=AF_INET, sin_port=htons(45901),
> sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000019>
> > 0.000068 connect(5, {sa_family=AF_INET, sin_port=htons(45901),
> sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in
> progress) <0.000074>
> > 0.000259 accept4(3, {sa_family=AF_INET, sin_port=htons(35002),
> sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 6 <0.000014>
> > 0.000051 getsockname(6, {sa_family=AF_INET, sin_port=htons(45901),
> sin_addr=inet_addr("127.0.0.1")}, [128 => 16]) = 0 <0.000010>
> > 0.000082 setsockopt(6, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
> > 0.000040 setsockopt(6, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000009>
> > 0.000104 getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000009>
> > 0.000028 connect(5, {sa_family=AF_INET, sin_port=htons(45901),
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000009>
> > 0.000026 setsockopt(5, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
> > 0.000024 setsockopt(5, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000008>
> > 0.000071 sendto(6, "********************************"..., 46080, 0,
> NULL, 0) = 16640 <0.000026>
> > 0.000117 sendto(6, "********************************"..., 29440, 0,
> NULL, 0) = 16640 <0.000017>
> > 0.000041 sendto(6, "********************************"..., 12800, 0,
> NULL, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000009>
> > 0.000075 recvfrom(6, 0x7f4db88a38c0, 10, 0, NULL, NULL) = -1 EAGAIN
> (Resource temporarily unavailable) <0.000010>
> > 0.000086 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 16640 <0.000018>
> > 0.000044 recvfrom(5, 0x55a64d59b2a0, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000009>
> > 0.000059 sendto(5, "hello world", 11, 0, NULL, 0) = 11 <0.000018>
> > 0.000093 recvfrom(6, "hello worl", 10, 0, NULL, NULL) = 10 <0.000009>
> > 0.000029 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000009>
> > 0.206685 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 16640 <0.000116>
> > 0.000306 recvfrom(5, 0x55a64d5a7600, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000013>
> > 0.000208 sendto(6, "********************************"..., 12800, 0,
> NULL, 0) = 12800 <0.000025>
> > 0.206317 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 2304 <0.000171>
> > 0.000304 recvfrom(5, 0x55a64d597170, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000029>
> > 0.206161 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 2304 <0.000082>
> > 0.000212 recvfrom(5, 0x55a64d5a0ed0, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000034>
> > 0.206572 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 2304 <0.000146>
> > 0.000274 recvfrom(5, 0x55a64d597170, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000029>
> > 0.206604 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 2304 <0.000162>
> > 0.000270 recvfrom(5, 0x55a64d5a20d0, 46080, 0, NULL, NULL) = -1
> EAGAIN (Resource temporarily unavailable) <0.000016>
> > 0.206164 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 2304 <0.000116>
> > 0.000291 recvfrom(5, "********************************"..., 46080, 0,
> NULL, NULL) = 1280 <0.000038>
> > 0.052224 +++ exited with 0 +++
>
> I.e. recvfrom() returns -EAGAIN and takes 200 ms.
Ah, no, those 200 ms are not spent in recvfrom() but in poll:
> 0.000029 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000009>
> 0.000040 epoll_wait(4, [{events=EPOLLIN, data={u32=5,
u64=139942919405573}}], 1023, 60000) = 1 <0.204038>
> 0.204258 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b4049c) = 0 <0.000045>
> 0.000104 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 16640 <0.000078>
> 0.000264 recvfrom(5, 0x5603425f3550, 46080, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable) <0.000025>
> 0.000127 epoll_ctl(4, EPOLL_CTL_ADD, 5,
{events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5,
u64=94570884890629}}) = 0 <0.000031>
> 0.000112 epoll_wait(4, [{events=EPOLLOUT, data={u32=6,
u64=139942919405574}}], 1023, 60000) = 1 <0.000026>
> 0.000083 epoll_ctl(4, EPOLL_CTL_DEL, 6, 0x7ffd49b404fc) = 0 <0.000025>
> 0.000063 sendto(6, "********************************"..., 12800, 0,
NULL, 0) = 12800 <0.000028>
> 0.000226 epoll_wait(4, [], 1023, 0) = 0 <0.000007>
> 0.000029 epoll_wait(4, [{events=EPOLLIN, data={u32=5,
u64=94570884890629}}], 1023, 60000) = 1 <0.205476>
> 0.205728 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000077>
> 0.000157 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 2304 <0.000066>
> 0.000180 recvfrom(5, 0x5603425e30c0, 46080, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable) <0.000026>
> 0.000139 epoll_ctl(4, EPOLL_CTL_ADD, 5,
{events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5, u64=5}}) = 0
<0.000030>
> 0.000104 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}], 1023,
60000) = 1 <0.205881>
> 0.206222 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000086>
> 0.000189 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 2304 <0.000027>
> 0.000153 recvfrom(5, 0x5603425ece20, 46080, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable) <0.000017>
> 0.000074 epoll_ctl(4, EPOLL_CTL_ADD, 5,
{events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5, u64=5}}) = 0
<0.000018>
> 0.000067 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}], 1023,
60000) = 1 <0.205749>
> 0.206088 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000197>
> 0.000391 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 2304 <0.000080>
> 0.000210 recvfrom(5, 0x5603425e30c0, 46080, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable) <0.000027>
> 0.000174 epoll_ctl(4, EPOLL_CTL_ADD, 5,
{events=EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, data={u32=5, u64=5}}) = 0
<0.000031>
> 0.000127 epoll_wait(4, [{events=EPOLLIN, data={u32=5, u64=5}}], 1023,
60000) = 1 <0.205471>
> 0.205783 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0x7ffd49b404fc) = 0 <0.000205>
--
js
suse labs
^ permalink raw reply
* Re: [PATCH net-next] tcp: get rid of sysctl_tcp_adv_win_scale
From: Jiri Slaby @ 2023-11-03 6:56 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng,
eric.dumazet
In-Reply-To: <738bb6a1-4e99-4113-9345-48eea11e2108@kernel.org>
On 03. 11. 23, 7:10, Jiri Slaby wrote:
> On 17. 07. 23, 17:29, Eric Dumazet wrote:
>> With modern NIC drivers shifting to full page allocations per
>> received frame, we face the following issue:
>>
>> TCP has one per-netns sysctl used to tweak how to translate
>> a memory use into an expected payload (RWIN), in RX path.
>>
>> tcp_win_from_space() implementation is limited to few cases.
>>
>> For hosts dealing with various MSS, we either under estimate
>> or over estimate the RWIN we send to the remote peers.
>>
>> For instance with the default sysctl_tcp_adv_win_scale value,
>> we expect to store 50% of payload per allocated chunk of memory.
>>
>> For the typical use of MTU=1500 traffic, and order-0 pages allocations
>> by NIC drivers, we are sending too big RWIN, leading to potential
>> tcp collapse operations, which are extremely expensive and source
>> of latency spikes.
>>
>> This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
>> uses a per socket scaling factor, so that we can precisely
>> adjust the RWIN based on effective skb->len/skb->truesize ratio.
>>
>> This patch alone can double TCP receive performance when receivers
>> are too slow to drain their receive queue, or by allowing
>> a bigger RWIN when MSS is close to PAGE_SIZE.
>
> Hi,
>
> I bisected a python-eventlet test failure:
> > =================================== FAILURES
> ===================================
> > _______________________ TestGreenSocket.test_full_duplex
> _______________________
> >
> > self = <tests.greenio_test.TestGreenSocket testMethod=test_full_duplex>
> >
> > def test_full_duplex(self):
> > ...
> > > large_evt.wait()
> >
> > tests/greenio_test.py:424:
> > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> _ _ _ _ _
> > eventlet/greenthread.py:181: in wait
> > return self._exit_event.wait()
> > eventlet/event.py:125: in wait
> > result = hub.switch()
> ...
> > E tests.TestIsTakingTooLong: 1
> >
> > eventlet/hubs/hub.py:313: TestIsTakingTooLong
>
> to this commit. With the commit, the test takes > 1.5 s. Without the
> commit it takes only < 300 ms. And they set timeout to 1 s.
>
> The reduced self-stadning test case:
> #!/usr/bin/python3
> import eventlet
> from eventlet.green import select, socket, time, ssl
>
> def bufsized(sock, size=1):
> """ Resize both send and receive buffers on a socket.
> Useful for testing trampoline. Returns the socket.
>
> >>> import socket
> >>> sock = bufsized(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
> """
> sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
> sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
> return sock
>
> def min_buf_size():
> """Return the minimum buffer size that the platform supports."""
> test_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> test_sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1)
> return test_sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
>
> def test_full_duplex():
> large_data = b'*' * 10 * min_buf_size()
> listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
> listener.bind(('127.0.0.1', 0))
> listener.listen(50)
> bufsized(listener)
>
> def send_large(sock):
> sock.sendall(large_data)
>
> def read_large(sock):
> result = sock.recv(len(large_data))
> while len(result) < len(large_data):
> result += sock.recv(len(large_data))
> assert result == large_data
>
> def server():
> (sock, addr) = listener.accept()
> sock = bufsized(sock)
> send_large_coro = eventlet.spawn(send_large, sock)
> eventlet.sleep(0)
> result = sock.recv(10)
> expected = b'hello world'
> while len(result) < len(expected):
> result += sock.recv(10)
> assert result == expected
> send_large_coro.wait()
>
> server_evt = eventlet.spawn(server)
> client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> client.connect(('127.0.0.1', listener.getsockname()[1]))
> bufsized(client)
> large_evt = eventlet.spawn(read_large, client)
> eventlet.sleep(0)
> client.sendall(b'hello world')
> server_evt.wait()
> large_evt.wait()
> client.close()
>
> test_full_duplex()
>
> =====================================
>
> I speak neither python nor networking, so any ideas :)? Is the test
> simply wrong?
strace -rT -e trace=network:
GOOD:
> 0.000000 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
<0.000063>
> 0.000406 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000042>
> 0.000097 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [4608], [4]) = 0
<0.000012>
> 0.000101 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
<0.000015>
> 0.000058 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 <0.000009>
> 0.000035 bind(3, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000027>
> 0.000058 listen(3, 50) = 0 <0.000014>
> 0.000029 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
> 0.000023 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000008>
> 0.000052 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 5
<0.000014>
> 0.000050 getsockname(3, {sa_family=AF_INET, sin_port=htons(44313),
sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000011>
> 0.000037 connect(5, {sa_family=AF_INET, sin_port=htons(44313),
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in
progress) <0.000070>
> 0.000210 accept4(3, {sa_family=AF_INET, sin_port=htons(56062),
sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 6 <0.000012>
> 0.000040 getsockname(6, {sa_family=AF_INET, sin_port=htons(44313),
sin_addr=inet_addr("127.0.0.1")}, [128 => 16]) = 0 <0.000007>
> 0.000062 setsockopt(6, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000007>
> 0.000020 setsockopt(6, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000007>
> 0.000082 getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000007>
> 0.000023 connect(5, {sa_family=AF_INET, sin_port=htons(44313),
sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000008>
> 0.000022 setsockopt(5, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000020>
> 0.000036 setsockopt(5, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000007>
> 0.000061 sendto(6, "********************************"..., 46080, 0,
NULL, 0) = 32768 <0.000049>
> 0.000135 sendto(6, "********************************"..., 13312, 0,
NULL, 0) = 13312 <0.000017>
> 0.000087 recvfrom(6, 0x7f78e58af890, 10, 0, NULL, NULL) = -1 EAGAIN
(Resource temporarily unavailable) <0.000010>
> 0.000125 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 32768 <0.000032>
> 0.000066 recvfrom(5, 0x55fdb41bb880, 46080, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable) <0.000011>
> 0.000075 sendto(5, "hello world", 11, 0, NULL, 0) = 11 <0.000023>
> 0.000117 recvfrom(6, "hello worl", 10, 0, NULL, NULL) = 10 <0.000015>
> 0.000050 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000011>
> 0.000212 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 13312 <0.000019>
> 0.050676 +++ exited with 0 +++
BAD:
> 0.000000 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
<0.000045>
> 0.000244 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000015>
> 0.000057 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [4608], [4]) = 0 <0.000013>
> 0.000104 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
<0.000016>
> 0.000065 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 <0.000010>
> 0.000038 bind(3, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000031>
> 0.000068 listen(3, 50) = 0 <0.000014>
> 0.000032 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000010>
> 0.000030 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000018>
> 0.000060 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 5
<0.000023>
> 0.000071 getsockname(3, {sa_family=AF_INET, sin_port=htons(45901),
sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000019>
> 0.000068 connect(5, {sa_family=AF_INET, sin_port=htons(45901),
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in
progress) <0.000074>
> 0.000259 accept4(3, {sa_family=AF_INET, sin_port=htons(35002),
sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 6 <0.000014>
> 0.000051 getsockname(6, {sa_family=AF_INET, sin_port=htons(45901),
sin_addr=inet_addr("127.0.0.1")}, [128 => 16]) = 0 <0.000010>
> 0.000082 setsockopt(6, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
> 0.000040 setsockopt(6, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000009>
> 0.000104 getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000009>
> 0.000028 connect(5, {sa_family=AF_INET, sin_port=htons(45901),
sin_addr=inet_addr("127.0.0.1")}, 16) = 0 <0.000009>
> 0.000026 setsockopt(5, SOL_SOCKET, SO_SNDBUF, [1], 4) = 0 <0.000009>
> 0.000024 setsockopt(5, SOL_SOCKET, SO_RCVBUF, [1], 4) = 0 <0.000008>
> 0.000071 sendto(6, "********************************"..., 46080, 0,
NULL, 0) = 16640 <0.000026>
> 0.000117 sendto(6, "********************************"..., 29440, 0,
NULL, 0) = 16640 <0.000017>
> 0.000041 sendto(6, "********************************"..., 12800, 0,
NULL, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000009>
> 0.000075 recvfrom(6, 0x7f4db88a38c0, 10, 0, NULL, NULL) = -1 EAGAIN
(Resource temporarily unavailable) <0.000010>
> 0.000086 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 16640 <0.000018>
> 0.000044 recvfrom(5, 0x55a64d59b2a0, 46080, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable) <0.000009>
> 0.000059 sendto(5, "hello world", 11, 0, NULL, 0) = 11 <0.000018>
> 0.000093 recvfrom(6, "hello worl", 10, 0, NULL, NULL) = 10 <0.000009>
> 0.000029 recvfrom(6, "d", 10, 0, NULL, NULL) = 1 <0.000009>
> 0.206685 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 16640 <0.000116>
> 0.000306 recvfrom(5, 0x55a64d5a7600, 46080, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable) <0.000013>
> 0.000208 sendto(6, "********************************"..., 12800, 0,
NULL, 0) = 12800 <0.000025>
> 0.206317 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 2304 <0.000171>
> 0.000304 recvfrom(5, 0x55a64d597170, 46080, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable) <0.000029>
> 0.206161 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 2304 <0.000082>
> 0.000212 recvfrom(5, 0x55a64d5a0ed0, 46080, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable) <0.000034>
> 0.206572 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 2304 <0.000146>
> 0.000274 recvfrom(5, 0x55a64d597170, 46080, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable) <0.000029>
> 0.206604 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 2304 <0.000162>
> 0.000270 recvfrom(5, 0x55a64d5a20d0, 46080, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable) <0.000016>
> 0.206164 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 2304 <0.000116>
> 0.000291 recvfrom(5, "********************************"..., 46080, 0,
NULL, NULL) = 1280 <0.000038>
> 0.052224 +++ exited with 0 +++
I.e. recvfrom() returns -EAGAIN and takes 200 ms.
--
js
suse labs
^ permalink raw reply
* [PATCH net] netfilter: nf_tables: fix pointer math issue in nft_byteorder_eval()
From: Dan Carpenter @ 2023-11-03 6:42 UTC (permalink / raw)
To: Florian Westphal
Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, netfilter-devel,
coreteam, netdev, linux-kernel, kernel-janitors
The problem is in nft_byteorder_eval() where we are iterating through a
loop and writing to dst[0], dst[1], dst[2] and so on... On each
iteration we are writing 8 bytes. But dst[] is an array of u32 so each
element only has space for 4 bytes. That means that every iteration
overwrites part of the previous element.
I spotted this bug while reviewing commit caf3ef7468f7 ("netfilter:
nf_tables: prevent OOB access in nft_byteorder_eval") which is a related
issue. I think that the reason we have not detected this bug in testing
is that most of time we only write one element.
Fixes: ce1e7989d989 ("netfilter: nft_byteorder: provide 64bit le/be conversion")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
---
include/net/netfilter/nf_tables.h | 4 ++--
net/netfilter/nft_byteorder.c | 5 +++--
net/netfilter/nft_meta.c | 2 +-
3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 3bbd13ab1ecf..b157c5cafd14 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -178,9 +178,9 @@ static inline __be32 nft_reg_load_be32(const u32 *sreg)
return *(__force __be32 *)sreg;
}
-static inline void nft_reg_store64(u32 *dreg, u64 val)
+static inline void nft_reg_store64(u64 *dreg, u64 val)
{
- put_unaligned(val, (u64 *)dreg);
+ put_unaligned(val, dreg);
}
static inline u64 nft_reg_load64(const u32 *sreg)
diff --git a/net/netfilter/nft_byteorder.c b/net/netfilter/nft_byteorder.c
index e596d1a842f7..f6e791a68101 100644
--- a/net/netfilter/nft_byteorder.c
+++ b/net/netfilter/nft_byteorder.c
@@ -38,13 +38,14 @@ void nft_byteorder_eval(const struct nft_expr *expr,
switch (priv->size) {
case 8: {
+ u64 *dst64 = (void *)dst;
u64 src64;
switch (priv->op) {
case NFT_BYTEORDER_NTOH:
for (i = 0; i < priv->len / 8; i++) {
src64 = nft_reg_load64(&src[i]);
- nft_reg_store64(&dst[i],
+ nft_reg_store64(&dst64[i],
be64_to_cpu((__force __be64)src64));
}
break;
@@ -52,7 +53,7 @@ void nft_byteorder_eval(const struct nft_expr *expr,
for (i = 0; i < priv->len / 8; i++) {
src64 = (__force __u64)
cpu_to_be64(nft_reg_load64(&src[i]));
- nft_reg_store64(&dst[i], src64);
+ nft_reg_store64(&dst64[i], src64);
}
break;
}
diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index f7da7c43333b..ba0d3683a45d 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -63,7 +63,7 @@ nft_meta_get_eval_time(enum nft_meta_keys key,
{
switch (key) {
case NFT_META_TIME_NS:
- nft_reg_store64(dest, ktime_get_real_ns());
+ nft_reg_store64((u64 *)dest, ktime_get_real_ns());
break;
case NFT_META_TIME_DAY:
nft_reg_store8(dest, nft_meta_weekday());
--
2.42.0
^ permalink raw reply related
* [PATCH net] net/mlx5: Fix a NULL vs IS_ERR() check
From: Dan Carpenter @ 2023-11-03 6:36 UTC (permalink / raw)
To: Saeed Mahameed
Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Eli Cohen, netdev, linux-rdma, kernel-janitors
The mlx5_esw_offloads_devlink_port() function returns error pointers, not
NULL.
Fixes: 7bef147a6ab6 ("net/mlx5: Don't skip vport check")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
---
I *think* these normally go through the mellanox tree and not net.
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 693e55b010d9..5c569d4bfd00 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -1493,7 +1493,7 @@ mlx5e_vport_vf_rep_load(struct mlx5_core_dev *dev, struct mlx5_eswitch_rep *rep)
dl_port = mlx5_esw_offloads_devlink_port(dev->priv.eswitch,
rpriv->rep->vport);
- if (dl_port) {
+ if (!IS_ERR(dl_port)) {
SET_NETDEV_DEVLINK_PORT(netdev, dl_port);
mlx5e_rep_vnic_reporter_create(priv, dl_port);
}
--
2.42.0
^ permalink raw reply related
* RE: [net-next v4 0/5] net: wwan: t7xx: fw flashing & coredump support
From: Jinjian Song @ 2023-11-03 6:32 UTC (permalink / raw)
To: Loic Poulain, Jiri Pirko
Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, corbet@lwn.net, ryazanov.s.a@gmail.com,
johannes@sipsolutions.net, chandrashekar.devegowda@intel.com,
linuxwwan@intel.com, chiranjeevi.rapolu@linux.intel.com,
haijun.liu@mediatek.com, m.chetan.kumar@linux.intel.com,
ricardo.martinez@linux.intel.com, netdev@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
nmarupaka@google.com, vsankar@lenovo.com,
danielwinkler@google.com, Lei Zhang(Terence), Qifeng Liu(Qifeng),
Fuqiang Yan(Felix)
>>On Wed, 13 Sept 2023 at 11:17, Jiri Pirko <jiri@resnulli.us> wrote:
>>>
>>> Tue, Sep 12, 2023 at 11:48:40AM CEST, songjinjian@hotmail.com wrote:
>>> >Adds support for t7xx wwan device firmware flashing & coredump
>>> >collection using devlink.
>>>
>>> I don't believe that use of devlink is correct here. It seems like a
>>> misfit. IIUC, what you need is to communicate with the modem.
>>> Basically a communication channel to modem. The other wwan drivers
>>> implement these channels in _ctrl.c files, using multiple protocols.
>>> Why can't you do something similar and let devlink out of this please?
>>>
>>> Until you put in arguments why you really need devlink and why is it
>>> a good fit, I'm against this. Please don't send any other versions of
>>> this patchset that use devlink.
>The t7xx driver already has regular wwan data and control interfaces registered with the wwan framework, making it functional. Here the exposed low level resources are not really wwan/class specific as it is for firmware upgrade >and coredump, so I think that is why Jinjian chose the 'feature agnostic' devlink framework. IMHO I think it makes sense to rely on such a framework, or maybe on the devcoredump class.
>That said, I see the protocol for flashing and doing the coreboot is fastboot, which is already supported on the user side with the fastboot tool, so I'm not sure abstracting it here makes sense. If the protocol is really fasboot compliant, Wouldn't it be simpler to directly expose it as a new device/channel? and rely on a userspace tool for regular fastboot operations (flash, boot, dump). This may require slightly modifying the fastboot tool to detect and support that new transport (in addition to the existing usb and ethernet support).
>I think this is the advantages of using devlink to implement flash and dump collect:
>1.Devlink framework provide the interface of flash and dump, and no need to develop corresponding tools anymore. From another perspective, using devlnik can directly reduce the complexity of PC manufacturer's customer production lines, helping to reduce their costs(time/production).
>2.Currently, the platform architecture of each WWAN module manufacturer is not compatible, Qualcomm implement communicate channels in host driver and using fastboot tool to flash, using their coredump tool to collect dump. Intel implement their host driver in devlink framework, using devlink tool to flash and collect dump. MTK design to use devlink doing flash and dump collection. Devlink can ignore difference in platform architecture, abstracting these commonly used interfaces, I understand that Intel and MTK are preparing to use this plan in the future.
>3.Fastboot tool relies on manual installation by user and it does not have the advantages of the above two features(currently, seemed can't support collect core dump directly). It seems that devlink does not need to be installed separately, _ctrl.c files used for QualComm, but Intel and MTK don't use the same design in module, so cannot directly reference.
>4.It seemed that Intel WWAN product using iosm driver withing devlink framework has be allowned in upstream, it provides some guidance and direction.
>So hope to get your help about the suggestions for the next steps, thanks.
Hope to get your help about the suggestion, if devlink framework here is not suitable, I hope to discuss other alternative solutions.
Thanks.
Regards,
Jinjian
^ permalink raw reply
* Re: [PATCH net-next] tcp: get rid of sysctl_tcp_adv_win_scale
From: Jiri Slaby @ 2023-11-03 6:10 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng,
eric.dumazet
In-Reply-To: <20230717152917.751987-1-edumazet@google.com>
On 17. 07. 23, 17:29, Eric Dumazet wrote:
> With modern NIC drivers shifting to full page allocations per
> received frame, we face the following issue:
>
> TCP has one per-netns sysctl used to tweak how to translate
> a memory use into an expected payload (RWIN), in RX path.
>
> tcp_win_from_space() implementation is limited to few cases.
>
> For hosts dealing with various MSS, we either under estimate
> or over estimate the RWIN we send to the remote peers.
>
> For instance with the default sysctl_tcp_adv_win_scale value,
> we expect to store 50% of payload per allocated chunk of memory.
>
> For the typical use of MTU=1500 traffic, and order-0 pages allocations
> by NIC drivers, we are sending too big RWIN, leading to potential
> tcp collapse operations, which are extremely expensive and source
> of latency spikes.
>
> This patch makes sysctl_tcp_adv_win_scale obsolete, and instead
> uses a per socket scaling factor, so that we can precisely
> adjust the RWIN based on effective skb->len/skb->truesize ratio.
>
> This patch alone can double TCP receive performance when receivers
> are too slow to drain their receive queue, or by allowing
> a bigger RWIN when MSS is close to PAGE_SIZE.
Hi,
I bisected a python-eventlet test failure:
> =================================== FAILURES
===================================
> _______________________ TestGreenSocket.test_full_duplex
_______________________
>
> self = <tests.greenio_test.TestGreenSocket testMethod=test_full_duplex>
>
> def test_full_duplex(self):
> ...
> > large_evt.wait()
>
> tests/greenio_test.py:424:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _
> eventlet/greenthread.py:181: in wait
> return self._exit_event.wait()
> eventlet/event.py:125: in wait
> result = hub.switch()
...
> E tests.TestIsTakingTooLong: 1
>
> eventlet/hubs/hub.py:313: TestIsTakingTooLong
to this commit. With the commit, the test takes > 1.5 s. Without the
commit it takes only < 300 ms. And they set timeout to 1 s.
The reduced self-stadning test case:
#!/usr/bin/python3
import eventlet
from eventlet.green import select, socket, time, ssl
def bufsized(sock, size=1):
""" Resize both send and receive buffers on a socket.
Useful for testing trampoline. Returns the socket.
>>> import socket
>>> sock = bufsized(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
"""
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
return sock
def min_buf_size():
"""Return the minimum buffer size that the platform supports."""
test_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
test_sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1)
return test_sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
def test_full_duplex():
large_data = b'*' * 10 * min_buf_size()
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(('127.0.0.1', 0))
listener.listen(50)
bufsized(listener)
def send_large(sock):
sock.sendall(large_data)
def read_large(sock):
result = sock.recv(len(large_data))
while len(result) < len(large_data):
result += sock.recv(len(large_data))
assert result == large_data
def server():
(sock, addr) = listener.accept()
sock = bufsized(sock)
send_large_coro = eventlet.spawn(send_large, sock)
eventlet.sleep(0)
result = sock.recv(10)
expected = b'hello world'
while len(result) < len(expected):
result += sock.recv(10)
assert result == expected
send_large_coro.wait()
server_evt = eventlet.spawn(server)
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', listener.getsockname()[1]))
bufsized(client)
large_evt = eventlet.spawn(read_large, client)
eventlet.sleep(0)
client.sendall(b'hello world')
server_evt.wait()
large_evt.wait()
client.close()
test_full_duplex()
=====================================
I speak neither python nor networking, so any ideas :)? Is the test
simply wrong?
Thanks.
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> Documentation/networking/ip-sysctl.rst | 1 +
> include/linux/tcp.h | 4 +++-
> include/net/netns/ipv4.h | 2 +-
> include/net/tcp.h | 24 ++++++++++++++++++++----
> net/ipv4/tcp.c | 11 ++++++-----
> net/ipv4/tcp_input.c | 19 ++++++++++++-------
> 6 files changed, 43 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 4a010a7cde7f8085db5ba6f1b9af53e9e5223cd5..82f2117cf2b36a834e5e391feda0210d916bff8b 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -321,6 +321,7 @@ tcp_abort_on_overflow - BOOLEAN
> option can harm clients of your server.
>
> tcp_adv_win_scale - INTEGER
> + Obsolete since linux-6.6
> Count buffering overhead as bytes/2^tcp_adv_win_scale
> (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale),
> if it is <= 0.
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index b4c08ac86983568a9511258708724da15d0b999e..fbcb0ce13171d46aa3697abcd48482b08e78e5e0 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -172,6 +172,8 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
> return (struct tcp_request_sock *)req;
> }
>
> +#define TCP_RMEM_TO_WIN_SCALE 8
> +
> struct tcp_sock {
> /* inet_connection_sock has to be the first member of tcp_sock */
> struct inet_connection_sock inet_conn;
> @@ -238,7 +240,7 @@ struct tcp_sock {
>
> u32 window_clamp; /* Maximal window to advertise */
> u32 rcv_ssthresh; /* Current window clamp */
> -
> + u8 scaling_ratio; /* see tcp_win_from_space() */
> /* Information of the most recently (s)acked skb */
> struct tcp_rack {
> u64 mstamp; /* (Re)sent time of the skb */
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index f003747181593559a4efe1838be719d445417041..7a41c4791536732005cedbb80c223b86aa43249e 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -152,7 +152,7 @@ struct netns_ipv4 {
> u8 sysctl_tcp_abort_on_overflow;
> u8 sysctl_tcp_fack; /* obsolete */
> int sysctl_tcp_max_reordering;
> - int sysctl_tcp_adv_win_scale;
> + int sysctl_tcp_adv_win_scale; /* obsolete */
> u8 sysctl_tcp_dsack;
> u8 sysctl_tcp_app_win;
> u8 sysctl_tcp_frto;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 226bce6d1e8c30185260baadec449b67323db91c..2104a71c75ba7eee40612395be4103ae370b3c03 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1434,11 +1434,27 @@ void tcp_select_initial_window(const struct sock *sk, int __space,
>
> static inline int tcp_win_from_space(const struct sock *sk, int space)
> {
> - int tcp_adv_win_scale = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_adv_win_scale);
> + s64 scaled_space = (s64)space * tcp_sk(sk)->scaling_ratio;
>
> - return tcp_adv_win_scale <= 0 ?
> - (space>>(-tcp_adv_win_scale)) :
> - space - (space>>tcp_adv_win_scale);
> + return scaled_space >> TCP_RMEM_TO_WIN_SCALE;
> +}
> +
> +/* inverse of tcp_win_from_space() */
> +static inline int tcp_space_from_win(const struct sock *sk, int win)
> +{
> + u64 val = (u64)win << TCP_RMEM_TO_WIN_SCALE;
> +
> + do_div(val, tcp_sk(sk)->scaling_ratio);
> + return val;
> +}
> +
> +static inline void tcp_scaling_ratio_init(struct sock *sk)
> +{
> + /* Assume a conservative default of 1200 bytes of payload per 4K page.
> + * This may be adjusted later in tcp_measure_rcv_mss().
> + */
> + tcp_sk(sk)->scaling_ratio = (1200 << TCP_RMEM_TO_WIN_SCALE) /
> + SKB_TRUESIZE(4096);
> }
>
> /* Note: caller must be prepared to deal with negative returns */
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e03e08745308189c9d64509c2cff94da56c86a0c..88f4ebab12acc11d5f3feb6b13974a0b8e565671 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -457,6 +457,7 @@ void tcp_init_sock(struct sock *sk)
>
> WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_wmem[1]));
> WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[1]));
> + tcp_scaling_ratio_init(sk);
>
> set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
> sk_sockets_allocated_inc(sk);
> @@ -1700,7 +1701,7 @@ EXPORT_SYMBOL(tcp_peek_len);
> /* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
> int tcp_set_rcvlowat(struct sock *sk, int val)
> {
> - int cap;
> + int space, cap;
>
> if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
> cap = sk->sk_rcvbuf >> 1;
> @@ -1715,10 +1716,10 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
> if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
> return 0;
>
> - val <<= 1;
> - if (val > sk->sk_rcvbuf) {
> - WRITE_ONCE(sk->sk_rcvbuf, val);
> - tcp_sk(sk)->window_clamp = tcp_win_from_space(sk, val);
> + space = tcp_space_from_win(sk, val);
> + if (space > sk->sk_rcvbuf) {
> + WRITE_ONCE(sk->sk_rcvbuf, space);
> + tcp_sk(sk)->window_clamp = val;
> }
> return 0;
> }
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 57c8af1859c16eba5e952a23ea959b628006f9c1..3cd92035e0902298baa8afd89ae5edcbfce300e5 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -237,6 +237,16 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
> */
> len = skb_shinfo(skb)->gso_size ? : skb->len;
> if (len >= icsk->icsk_ack.rcv_mss) {
> + /* Note: divides are still a bit expensive.
> + * For the moment, only adjust scaling_ratio
> + * when we update icsk_ack.rcv_mss.
> + */
> + if (unlikely(len != icsk->icsk_ack.rcv_mss)) {
> + u64 val = (u64)skb->len << TCP_RMEM_TO_WIN_SCALE;
> +
> + do_div(val, skb->truesize);
> + tcp_sk(sk)->scaling_ratio = val ? val : 1;
> + }
> icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
> tcp_sk(sk)->advmss);
> /* Account for possibly-removed options */
> @@ -727,8 +737,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
>
> if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf) &&
> !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
> - int rcvmem, rcvbuf;
> u64 rcvwin, grow;
> + int rcvbuf;
>
> /* minimal window to cope with packet losses, assuming
> * steady state. Add some cushion because of small variations.
> @@ -740,12 +750,7 @@ void tcp_rcv_space_adjust(struct sock *sk)
> do_div(grow, tp->rcvq_space.space);
> rcvwin += (grow << 1);
>
> - rcvmem = SKB_TRUESIZE(tp->advmss + MAX_TCP_HEADER);
> - while (tcp_win_from_space(sk, rcvmem) < tp->advmss)
> - rcvmem += 128;
> -
> - do_div(rcvwin, tp->advmss);
> - rcvbuf = min_t(u64, rcvwin * rcvmem,
> + rcvbuf = min_t(u64, tcp_space_from_win(sk, rcvwin),
> READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2]));
> if (rcvbuf > sk->sk_rcvbuf) {
> WRITE_ONCE(sk->sk_rcvbuf, rcvbuf);
--
js
suse labs
^ permalink raw reply
* [PATCH net v2 1/3] net/smc: fix dangling sock under state SMC_APPFINCLOSEWAIT
From: D. Wythe @ 2023-11-03 6:07 UTC (permalink / raw)
To: kgraul, wenjia, jaka, wintera; +Cc: kuba, davem, netdev, linux-s390, linux-rdma
In-Reply-To: <1698991660-82957-1-git-send-email-alibuda@linux.alibaba.com>
From: "D. Wythe" <alibuda@linux.alibaba.com>
Considering scenario:
smc_cdc_rx_handler
__smc_release
sock_set_flag
smc_close_active()
sock_set_flag
__set_bit(DEAD) __set_bit(DONE)
Dues to __set_bit is not atomic, the DEAD or DONE might be lost.
if the DEAD flag lost, the state SMC_CLOSED will be never be reached
in smc_close_passive_work:
if (sock_flag(sk, SOCK_DEAD) &&
smc_close_sent_any_close(conn)) {
sk->sk_state = SMC_CLOSED;
} else {
/* just shutdown, but not yet closed locally */
sk->sk_state = SMC_APPFINCLOSEWAIT;
}
Replace sock_set_flags or __set_bit to set_bit will fix this problem.
Since set_bit is atomic.
Fixes: b38d732477e4 ("smc: socket closing and linkgroup cleanup")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
net/smc/af_smc.c | 4 ++--
net/smc/smc.h | 5 +++++
net/smc/smc_cdc.c | 2 +-
net/smc/smc_close.c | 2 +-
4 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index abd2667..da97f94 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -275,7 +275,7 @@ static int __smc_release(struct smc_sock *smc)
if (!smc->use_fallback) {
rc = smc_close_active(smc);
- sock_set_flag(sk, SOCK_DEAD);
+ smc_sock_set_flag(sk, SOCK_DEAD);
sk->sk_shutdown |= SHUTDOWN_MASK;
} else {
if (sk->sk_state != SMC_CLOSED) {
@@ -1743,7 +1743,7 @@ static int smc_clcsock_accept(struct smc_sock *lsmc, struct smc_sock **new_smc)
if (new_clcsock)
sock_release(new_clcsock);
new_sk->sk_state = SMC_CLOSED;
- sock_set_flag(new_sk, SOCK_DEAD);
+ smc_sock_set_flag(new_sk, SOCK_DEAD);
sock_put(new_sk); /* final */
*new_smc = NULL;
goto out;
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 24745fd..e377980 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -377,4 +377,9 @@ void smc_fill_gid_list(struct smc_link_group *lgr,
int smc_nl_enable_hs_limitation(struct sk_buff *skb, struct genl_info *info);
int smc_nl_disable_hs_limitation(struct sk_buff *skb, struct genl_info *info);
+static inline void smc_sock_set_flag(struct sock *sk, enum sock_flags flag)
+{
+ set_bit(flag, &sk->sk_flags);
+}
+
#endif /* __SMC_H */
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 89105e9..01bdb79 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -385,7 +385,7 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
smc->sk.sk_shutdown |= RCV_SHUTDOWN;
if (smc->clcsock && smc->clcsock->sk)
smc->clcsock->sk->sk_shutdown |= RCV_SHUTDOWN;
- sock_set_flag(&smc->sk, SOCK_DONE);
+ smc_sock_set_flag(&smc->sk, SOCK_DONE);
sock_hold(&smc->sk); /* sock_put in close_work */
if (!queue_work(smc_close_wq, &conn->close_work))
sock_put(&smc->sk);
diff --git a/net/smc/smc_close.c b/net/smc/smc_close.c
index dbdf03e..449ef45 100644
--- a/net/smc/smc_close.c
+++ b/net/smc/smc_close.c
@@ -173,7 +173,7 @@ void smc_close_active_abort(struct smc_sock *smc)
break;
}
- sock_set_flag(sk, SOCK_DEAD);
+ smc_sock_set_flag(sk, SOCK_DEAD);
sk->sk_state_change(sk);
if (release_clcsock) {
--
1.8.3.1
^ permalink raw reply related
* [PATCH net v2 3/3] net/smc: put sk reference if close work was canceled
From: D. Wythe @ 2023-11-03 6:07 UTC (permalink / raw)
To: kgraul, wenjia, jaka, wintera; +Cc: kuba, davem, netdev, linux-s390, linux-rdma
In-Reply-To: <1698991660-82957-1-git-send-email-alibuda@linux.alibaba.com>
From: "D. Wythe" <alibuda@linux.alibaba.com>
Note that we always hold a reference to sock when attempting
to submit close_work. Therefore, if we have successfully
canceled close_work from pending, we MUST release that reference
to avoid potential leaks.
Fixes: 42bfba9eaa33 ("net/smc: immediate termination for SMCD link groups")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
net/smc/smc_close.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/smc/smc_close.c b/net/smc/smc_close.c
index 449ef45..10219f5 100644
--- a/net/smc/smc_close.c
+++ b/net/smc/smc_close.c
@@ -116,7 +116,8 @@ static void smc_close_cancel_work(struct smc_sock *smc)
struct sock *sk = &smc->sk;
release_sock(sk);
- cancel_work_sync(&smc->conn.close_work);
+ if (cancel_work_sync(&smc->conn.close_work))
+ sock_put(sk);
cancel_delayed_work_sync(&smc->conn.tx_work);
lock_sock(sk);
}
--
1.8.3.1
^ permalink raw reply related
* [PATCH net v2 0/3] bugfixs for smc
From: D. Wythe @ 2023-11-03 6:07 UTC (permalink / raw)
To: kgraul, wenjia, jaka, wintera; +Cc: kuba, davem, netdev, linux-s390, linux-rdma
From: "D. Wythe" <alibuda@linux.alibaba.com>
This patches includes bugfix following:
1. hung state
2. sock leak
3. potential panic
We have been testing these patches for some time, but
if you have any questions, please let us know.
--
v1:
Fix spelling errors and incorrect function names in descriptions
v2->v1:
Add fix tags for bugfix patch
D. Wythe (3):
net/smc: fix dangling sock under state SMC_APPFINCLOSEWAIT
net/smc: allow cdc msg send rather than drop it with NULL sndbuf_desc
net/smc: put sk reference if close work was canceled
net/smc/af_smc.c | 4 ++--
net/smc/smc.h | 5 +++++
net/smc/smc_cdc.c | 11 +++++------
net/smc/smc_close.c | 5 +++--
4 files changed, 15 insertions(+), 10 deletions(-)
--
1.8.3.1
^ permalink raw reply
* [PATCH net v2 2/3] net/smc: allow cdc msg send rather than drop it with NULL sndbuf_desc
From: D. Wythe @ 2023-11-03 6:07 UTC (permalink / raw)
To: kgraul, wenjia, jaka, wintera; +Cc: kuba, davem, netdev, linux-s390, linux-rdma
In-Reply-To: <1698991660-82957-1-git-send-email-alibuda@linux.alibaba.com>
From: "D. Wythe" <alibuda@linux.alibaba.com>
This patch re-fix the issues mentioned by commit 22a825c541d7
("net/smc: fix NULL sndbuf_desc in smc_cdc_tx_handler()").
Blocking sending message do solve the issues though, but it also
prevents the peer to receive the final message. Besides, in logic,
whether the sndbuf_desc is NULL or not have no impact on the processing
of cdc message sending.
Hence that, this patch allows the cdc message sending but to check the
sndbuf_desc with care in smc_cdc_tx_handler().
Fixes: 22a825c541d7 ("net/smc: fix NULL sndbuf_desc in smc_cdc_tx_handler()")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
net/smc/smc_cdc.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 01bdb79..3c06625 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -28,13 +28,15 @@ static void smc_cdc_tx_handler(struct smc_wr_tx_pend_priv *pnd_snd,
{
struct smc_cdc_tx_pend *cdcpend = (struct smc_cdc_tx_pend *)pnd_snd;
struct smc_connection *conn = cdcpend->conn;
+ struct smc_buf_desc *sndbuf_desc;
struct smc_sock *smc;
int diff;
+ sndbuf_desc = conn->sndbuf_desc;
smc = container_of(conn, struct smc_sock, conn);
bh_lock_sock(&smc->sk);
- if (!wc_status) {
- diff = smc_curs_diff(cdcpend->conn->sndbuf_desc->len,
+ if (!wc_status && sndbuf_desc) {
+ diff = smc_curs_diff(sndbuf_desc->len,
&cdcpend->conn->tx_curs_fin,
&cdcpend->cursor);
/* sndbuf_space is decreased in smc_sendmsg */
@@ -114,9 +116,6 @@ int smc_cdc_msg_send(struct smc_connection *conn,
union smc_host_cursor cfed;
int rc;
- if (unlikely(!READ_ONCE(conn->sndbuf_desc)))
- return -ENOBUFS;
-
smc_cdc_add_pending_send(conn, pend);
conn->tx_cdc_seq++;
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH bpf-next v9 09/12] bpf, net: switch to dynamic registration
From: kernel test robot @ 2023-11-03 5:57 UTC (permalink / raw)
To: thinker.li, bpf, ast, martin.lau, song, kernel-team, andrii,
drosen
Cc: oe-kbuild-all, sinquersw, kuifeng, Kui-Feng Lee, netdev
In-Reply-To: <20231101204519.677870-10-thinker.li@gmail.com>
Hi,
kernel test robot noticed the following build errors:
[auto build test ERROR on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/thinker-li-gmail-com/bpf-refactory-struct_ops-type-initialization-to-a-function/20231102-044820
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20231101204519.677870-10-thinker.li%40gmail.com
patch subject: [PATCH bpf-next v9 09/12] bpf, net: switch to dynamic registration
config: xtensa-randconfig-002-20231102 (https://download.01.org/0day-ci/archive/20231103/202311031305.1bFdKyKl-lkp@intel.com/config)
compiler: xtensa-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231103/202311031305.1bFdKyKl-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311031305.1bFdKyKl-lkp@intel.com/
All errors (new ones prefixed by >>):
xtensa-linux-ld: kernel/bpf/btf.o: in function `btf_int_show':
btf.c:(.text+0xb0c4): undefined reference to `bpf_struct_ops_desc_init'
>> xtensa-linux-ld: btf.c:(.text+0xb330): undefined reference to `bpf_struct_ops_desc_init'
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox