[PATCH net] net: fix propagation of EPERM from tcp

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net] net: fix propagation of EPERM from tcp_connect()
@ 2025-11-21  1:59 Maciej Żenczykowski
  2025-11-21  2:05 ` Maciej Żenczykowski
  0 siblings, 1 reply; 6+ messages in thread
From: Maciej Żenczykowski @ 2025-11-21  1:59 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Linux Network Development Mailing List, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maciej Żenczykowski, Lorenzo Colitti, Neal Cardwell, bpf

bpf CGROUP_INET_EGRESS hook can fail packet transmit resulting
in -EPERM, however as this is not -ECONNREFUSED it results in tcp
simply treating it as a lost packet resulting in a need to wait
for retransmits and timeout before an error is signaled back
to userspace.

Android implements a lot of security/power savings policy
in this hook, so these failures are common and more or less
permanent (at least until something significant happens).

We cannot currently call bpf_set_retval() from that hook point
and while this could be trivially fixed with a one line deletion,
it's not clear if that's truly a good idea (would we want to
be able to set arbitrary error values??).

If the hook *truly* wants to drop the packet without signaling
an error, it should IMHO return '2' for congestion caused drop
instead of '0' for drop.

Another possibility would be to teach the hook to treat (a new)
return value of '4' as meaning 'drop and return ECONNREFUSED',
but this seems easier... furthermore EPERM seems like a better
return to userspace for 'policy denied your transmit', while
ECONNREFUSED seems to suggest the remote server refused it.

Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: bpf@vger.kernel.org
Signed-off-by: Maciej Żenczykowski <maze@google.com>
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 479afb714bdf..3ab21249e196 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4336,7 +4336,7 @@ int tcp_connect(struct sock *sk)
 	/* Send off SYN; include data in Fast Open. */
 	err = tp->fastopen_req ? tcp_send_syn_data(sk, buff) :
 	      tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
-	if (err == -ECONNREFUSED)
+	if (err == -ECONNREFUSED || err == -EPERM)
 		return err;

 	/* We change tp->snd_nxt after the tcp_transmit_skb() call
-- 
2.52.0.rc2.455.g230fcf2819-goog

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH net] net: fix propagation of EPERM from tcp_connect()
  2025-11-21  1:59 [PATCH net] net: fix propagation of EPERM from tcp_connect() Maciej Żenczykowski
@ 2025-11-21  2:05 ` Maciej Żenczykowski
  2025-11-21 14:43   ` Jakub Kicinski
  0 siblings, 1 reply; 6+ messages in thread
From: Maciej Żenczykowski @ 2025-11-21  2:05 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Linux Network Development Mailing List, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Lorenzo Colitti,
	Neal Cardwell, bpf

On Fri, Nov 21, 2025 at 10:59 AM Maciej Żenczykowski <maze@google.com> wrote:
> bpf CGROUP_INET_EGRESS hook can fail packet transmit resulting
> in -EPERM, however as this is not -ECONNREFUSED it results in tcp
> simply treating it as a lost packet resulting in a need to wait
> for retransmits and timeout before an error is signaled back
> to userspace.
>
> Android implements a lot of security/power savings policy
> in this hook, so these failures are common and more or less
> permanent (at least until something significant happens).
>
> We cannot currently call bpf_set_retval() from that hook point
> and while this could be trivially fixed with a one line deletion,
> it's not clear if that's truly a good idea (would we want to
> be able to set arbitrary error values??).
>
> If the hook *truly* wants to drop the packet without signaling
> an error, it should IMHO return '2' for congestion caused drop
> instead of '0' for drop.
>
> Another possibility would be to teach the hook to treat (a new)
> return value of '4' as meaning 'drop and return ECONNREFUSED',
> but this seems easier... furthermore EPERM seems like a better
> return to userspace for 'policy denied your transmit', while
> ECONNREFUSED seems to suggest the remote server refused it.
>
> Cc: Lorenzo Colitti <lorenzo@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: bpf@vger.kernel.org
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> ---
>  net/ipv4/tcp_output.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 479afb714bdf..3ab21249e196 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -4336,7 +4336,7 @@ int tcp_connect(struct sock *sk)
>         /* Send off SYN; include data in Fast Open. */
>         err = tp->fastopen_req ? tcp_send_syn_data(sk, buff) :
>               tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
> -       if (err == -ECONNREFUSED)
> +       if (err == -ECONNREFUSED || err == -EPERM)
>                 return err;
>
>         /* We change tp->snd_nxt after the tcp_transmit_skb() call
> --
> 2.52.0.rc2.455.g230fcf2819-goog

Perhaps I should have sent this as an RFC, as I haven't had the
opportunity to test this fix in the full blown environment where we're
running into problems.

I'm hoping to at least get feedback on whether this is acceptable
and/or even the right approach -- or if someone has a better idea or
if there's some fundamental reason we cannot return EPERM here.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net] net: fix propagation of EPERM from tcp_connect()
  2025-11-21  2:05 ` Maciej Żenczykowski
@ 2025-11-21 14:43   ` Jakub Kicinski
  2025-11-26  1:08     ` Maciej Żenczykowski
  0 siblings, 1 reply; 6+ messages in thread
From: Jakub Kicinski @ 2025-11-21 14:43 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Linux Network Development Mailing List, David S . Miller,
	Eric Dumazet, Paolo Abeni, Lorenzo Colitti, Neal Cardwell, bpf

On Fri, 21 Nov 2025 11:05:39 +0900 Maciej Żenczykowski wrote:
> Perhaps I should have sent this as an RFC, as I haven't had the
> opportunity to test this fix in the full blown environment where we're
> running into problems.
> 
> I'm hoping to at least get feedback on whether this is acceptable
> and/or even the right approach -- or if someone has a better idea or
> if there's some fundamental reason we cannot return EPERM here.

FWIW this breaks the mptcp_join.sh test, too:

https://netdev-3.bots.linux.dev/vmksft-mptcp/results/394900/1-mptcp-join-sh/stdout

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net] net: fix propagation of EPERM from tcp_connect()
  2025-11-21 14:43   ` Jakub Kicinski
@ 2025-11-26  1:08     ` Maciej Żenczykowski
  2025-11-26  1:21       ` Jakub Kicinski
  2025-11-26 11:15       ` Matthieu Baerts
  0 siblings, 2 replies; 6+ messages in thread
From: Maciej Żenczykowski @ 2025-11-26  1:08 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Linux Network Development Mailing List, David S . Miller,
	Eric Dumazet, Paolo Abeni, Lorenzo Colitti, Neal Cardwell, bpf

> FWIW this breaks the mptcp_join.sh test, too:

What do you mean by 'too', does it break something else as well, or
just the quoted mptcp_join?

> https://netdev-3.bots.linux.dev/vmksft-mptcp/results/394900/1-mptcp-join-sh/stdout

My still very preliminary investigation is that this is actually
correct (though obviously the tests need to be adjusted).

See tools/testing/selftests/net/mptcp/mptcp_join.sh:89

# generated using "nfbpf_compile '(ip && (ip[54] & 0xf0) == 0x30) ||
#                                (ip6 && (ip6[74] & 0xf0) == 0x30)'"
CBPF_MPTCP_SUBOPTION_ADD_ADDR=...

mptcp_join.sh:365
      if ! ip netns exec $ns2 $tables -A OUTPUT -p tcp \
                      -m tcp --tcp-option 30 \
                      -m bpf --bytecode \
                      "$CBPF_MPTCP_SUBOPTION_ADD_ADDR" \
                      -j DROP

So basically this is using iptables -j DROP which presumably
propagates to EPERM and thus results in a faster local failure...

Although this is probably trying to replicate packet loss rather than
a local error...

So I'm not sure if I should:
(a) fix the asserts with new values (presumably easiest by far),
or
(b) change how it does DROP to make it more like network packet loss
(maybe an extra namespace, so the drop is in a diff netns, during
forwarding??? not even sure if that would help though, or maybe add
drop on other netns INPUT instead of OUTPUT).
or
(c) introduce some iptables -j DROP_CN type return... (seems like that
might be worthwhile anyway)

I'll think about it.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net] net: fix propagation of EPERM from tcp_connect()
  2025-11-26  1:08     ` Maciej Żenczykowski
@ 2025-11-26  1:21       ` Jakub Kicinski
  2025-11-26 11:15       ` Matthieu Baerts
  1 sibling, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2025-11-26  1:21 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Linux Network Development Mailing List, David S . Miller,
	Eric Dumazet, Paolo Abeni, Lorenzo Colitti, Neal Cardwell, bpf

On Tue, 25 Nov 2025 17:08:17 -0800 Maciej Żenczykowski wrote:
> > FWIW this breaks the mptcp_join.sh test, too:  
> 
> What do you mean by 'too',

I meant - in addition to your self-reply saying that it should have
been an RFC. I haven't see any other failures.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net] net: fix propagation of EPERM from tcp_connect()
  2025-11-26  1:08     ` Maciej Żenczykowski
  2025-11-26  1:21       ` Jakub Kicinski
@ 2025-11-26 11:15       ` Matthieu Baerts
  1 sibling, 0 replies; 6+ messages in thread
From: Matthieu Baerts @ 2025-11-26 11:15 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Linux Network Development Mailing List, David S . Miller,
	Eric Dumazet, Paolo Abeni, Lorenzo Colitti, Neal Cardwell, bpf,
	MPTCP Linux, Jakub Kicinski

Hi Maciej,

(+cc MPTCP list)

On 26/11/2025 02:08, Maciej Żenczykowski wrote:
>> FWIW this breaks the mptcp_join.sh test, too:
> 
> What do you mean by 'too', does it break something else as well, or
> just the quoted mptcp_join?
> 
>> https://netdev-3.bots.linux.dev/vmksft-mptcp/results/394900/1-mptcp-join-sh/stdout
> 
> My still very preliminary investigation is that this is actually
> correct (though obviously the tests need to be adjusted).
> 
> See tools/testing/selftests/net/mptcp/mptcp_join.sh:89
> 
> # generated using "nfbpf_compile '(ip && (ip[54] & 0xf0) == 0x30) ||
> #                                (ip6 && (ip6[74] & 0xf0) == 0x30)'"
> CBPF_MPTCP_SUBOPTION_ADD_ADDR=...
> 
> mptcp_join.sh:365
>       if ! ip netns exec $ns2 $tables -A OUTPUT -p tcp \
>                       -m tcp --tcp-option 30 \
>                       -m bpf --bytecode \
>                       "$CBPF_MPTCP_SUBOPTION_ADD_ADDR" \
>                       -j DROP
> 
> So basically this is using iptables -j DROP which presumably
> propagates to EPERM and thus results in a faster local failure...

I don't think that's what caused the issue: according to the logs, two
tests have failed: "delete and re-add" and "flush re-add" and they don't
use the mentioned snippet. Still it might be caused by a Netfilter rule,
because they both call:

  reset_with_tcp_filter "..." ns2 10.0.3.2 REJECT OUTPUT

This helper will call:

  iptables -A OUTPUT -s 10.0.3.2 -j REJECT

Note that you can easily reproduce the issue by only launching the
problematic tests with './mptcp_join.sh "<test name or id>"', e.g.

  cd tools/testing/selftests/net/mptcp
  ./mptcp_join.sh "delete and re-add"

> Although this is probably trying to replicate packet loss rather than
> a local error...
> 
> So I'm not sure if I should:
> (a) fix the asserts with new values (presumably easiest by far),
> or
> (b) change how it does DROP to make it more like network packet loss
> (maybe an extra namespace, so the drop is in a diff netns, during
> forwarding??? not even sure if that would help though, or maybe add
> drop on other netns INPUT instead of OUTPUT).
> or
> (c) introduce some iptables -j DROP_CN type return... (seems like that
> might be worthwhile anyway)

From what I understand, with your RFC patch, a "connect()" (or
"kernel_connect()") will get an error because of the Netfilter rule. If
that's normal, then probably the expected results can be adapted, e.g.
from ...

  join_syn_tx=3 join_connect_err=1 \
          chk_join_nr 2 2 2

... to ...

  join_connect_err=2 \
          chk_join_nr 2 2 2

At least that would show the effect of your patch.

An important note: the selftests can be executed on older kernels: if
your patch is changing the behaviour, and it is a fix that is going to
be backported to stable, that's fine. If that's not a fix, the selftests
should continue to work with and without the kernel patch. Then, the
Netfilter rule should probably be adapted instead, maybe by moving it to
the other side (ns1) in INPUT if that still makes the tests valid.

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-11-26 11:15 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-21  1:59 [PATCH net] net: fix propagation of EPERM from tcp_connect() Maciej Żenczykowski
2025-11-21  2:05 ` Maciej Żenczykowski
2025-11-21 14:43   ` Jakub Kicinski
2025-11-26  1:08     ` Maciej Żenczykowski
2025-11-26  1:21       ` Jakub Kicinski
2025-11-26 11:15       ` Matthieu Baerts

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).