The sk_err mechanism is infuriating in userspace

linux-api.vger.kernel.org archive mirror

* The sk_err mechanism is infuriating in userspace
@ 2024-02-05 23:03 Andy Lutomirski
  2024-02-05 23:22 ` Andy Lutomirski
  2024-02-06  8:43 ` Paolo Abeni
  0 siblings, 2 replies; 5+ messages in thread
From: Andy Lutomirski @ 2024-02-05 23:03 UTC (permalink / raw)
  To: Network Development; +Cc: Linux API

Hi all-

I encounter this issue every couple of years, and it still seems to be
an issue, and it drives me nuts every time I see it.

I write software that uses unconnected datagram-style sockets.  Errors
happen for all kinds of reasons, and my software knows it.  My
software even handles the errors and moves on with its life.  I use
MSG_ERRQUEUE to understand the errors.  But the kernel fights back:

struct sk_buff *__skb_try_recv_datagram(struct sock *sk,
                                        struct sk_buff_head *queue,
                                        unsigned int flags, int *off, int *err,
                                        struct sk_buff **last)
{
        struct sk_buff *skb;
        unsigned long cpu_flags;
        /*
         * Caller is allowed not to check sk->sk_err before skb_recv_datagram()
         */
        int error = sock_error(sk);

        if (error)
                goto no_packet;
        ^^^^^^^^^^ <----- EXCUSE ME?

The kernel even fights back on the *send* path?!?

static long sock_wait_for_wmem(struct sock *sk, long timeo)
{
        DEFINE_WAIT(wait);

        sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
        for (;;) {
                if (!timeo)
                        break;
                if (signal_pending(current))
                        break;
                set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
                ...
                if (READ_ONCE(sk->sk_err))
                        break;  <-- KERNEL HATES UNCONNECTED SOCKETS!

This is IMO just broken.  I realize it's legacy behavior, but it's
BROKEN legacy behavior.  sk_err does not (at least for an unconnected
socket) indicate that anything is wrong with the socket.  It indicates
that something is worthy of notice, and it wants to tell me.

So:

1. sock_wait_for_wmem should IMO just not do that on an unconnected
socket.  AFAICS it's simply a bug.

2. How, exactly, am I supposed to call recvmsg() and, unambiguously,
find out whether recvmsg() actually failed?  There are actual errors
(something that indicates that the kernel malfunctioned or the socket
is broken), errors indicating that the packet being received is busted
(skb_copy_datagram_msg, for example), and also errors indicating that
there's an error queued up.

I would like to know that there's an error queued up.  That's what
poll and epoll are for, right?  Or a hint from recvmsg() that I should
also call it with MSG_ERRQUEUE.  Or it could have a mode where it
returns a normal datagram *or* an error as appropriate.  But the
current state of affairs is just brittle and racy.

Are there any reasonably implementable, non-breaking ways to improve
the API so that programs that understand socket errors can actually
function fully correctly without gnarly retry loops in userspace and
silly heuristics about what errors are actually errors?

Grumpily,
Andy

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: The sk_err mechanism is infuriating in userspace
  2024-02-05 23:03 The sk_err mechanism is infuriating in userspace Andy Lutomirski
@ 2024-02-05 23:22 ` Andy Lutomirski
  2024-02-06  8:43 ` Paolo Abeni
  1 sibling, 0 replies; 5+ messages in thread
From: Andy Lutomirski @ 2024-02-05 23:22 UTC (permalink / raw)
  To: Network Development; +Cc: Linux API



> On Feb 5, 2024, at 3:03 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> Hi all-
> 
> I encounter this issue every couple of years, and it still seems to be
> an issue, and it drives me nuts every time I see it.
> 
> I write software that uses unconnected datagram-style sockets.  Errors
> happen for all kinds of reasons, and my software knows it.  My
> software even handles the errors and moves on with its life.  I use
> MSG_ERRQUEUE to understand the errors.  But the kernel fights back:
> 
> struct sk_buff *__skb_try_recv_datagram(struct sock *sk,
>                                        struct sk_buff_head *queue,
>                                        unsigned int flags, int *off, int *err,
>                                        struct sk_buff **last)
> {
>        struct sk_buff *skb;
>        unsigned long cpu_flags;
>        /*
>         * Caller is allowed not to check sk->sk_err before skb_recv_datagram()
>         */
>        int error = sock_error(sk);
> 
>        if (error)
>                goto no_packet;
>        ^^^^^^^^^^ <----- EXCUSE ME?
> 
> The kernel even fights back on the *send* path?!?
> 
> static long sock_wait_for_wmem(struct sock *sk, long timeo)
> {
>        DEFINE_WAIT(wait);
> 
>        sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
>        for (;;) {
>                if (!timeo)
>                        break;
>                if (signal_pending(current))
>                        break;
>                set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>                ...
>                if (READ_ONCE(sk->sk_err))
>                        break;  <-- KERNEL HATES UNCONNECTED SOCKETS!
> 
> This is IMO just broken.  I realize it's legacy behavior, but it's
> BROKEN legacy behavior.  sk_err does not (at least for an unconnected
> socket) indicate that anything is wrong with the socket.  It indicates
> that something is worthy of notice, and it wants to tell me.
> 
> So:
> 
> 1. sock_wait_for_wmem should IMO just not do that on an unconnected
> socket.  AFAICS it's simply a bug.
> 
> 2. How, exactly, am I supposed to call recvmsg() and, unambiguously,
> find out whether recvmsg() actually failed?  There are actual errors
> (something that indicates that the kernel malfunctioned or the socket
> is broken), errors indicating that the packet being received is busted
> (skb_copy_datagram_msg, for example), and also errors indicating that
> there's an error queued up.
> 
> I would like to know that there's an error queued up.  That's what
> poll and epoll are for, right?  Or a hint from recvmsg() that I should
> also call it with MSG_ERRQUEUE.  Or it could have a mode where it
> returns a normal datagram *or* an error as appropriate.  But the
> current state of affairs is just brittle and racy.
> 
> Are there any reasonably implementable, non-breaking ways to improve
> the API so that programs that understand socket errors can actually
> function fully correctly without gnarly retry loops in userspace and
> silly heuristics about what errors are actually errors?

Contemplating this, recvmsg() can send status information back via msg_flags.  Maybe we could characterize a recvmsg() call as doing one of the following things:

1. Actually fails, via -EFAULT or otherwise.  Userspace can get an errno but doesn’t know beyond that what actually went wrong. Should never happen in a correct program. ENOMEM is not in this category.

2. There is nothing to receive. This is -EAGAIN.

3. Received an sk_err error. This is a *success*, and it comes with an error code. Users of RECVERR can’t reliably correlate this with an ERRQUEUE message.  Maybe they don’t care.

4. Received a datagram.

5. Received a queued error message a la ERRQUEUE.

6. Dequeued a datagram (or ERRQUEUE) but did *not* receive it due to a checksum error or other error. (And there should be a clear indication of whether the call succeeded but something was wrong with the message or whether the call *failed* for an unexpected reason but the offending message was nonetheless removed from the socket’s queue).

Maybe 7: Received a message (or ERRQUEUE), and the checksum was wrong, but the data is being returned anyway.

I suppose that a flag could enable this mode and then all but #1 would return a *success* code from the syscall.  And msg_flags would contain an indication as to what actually happened.
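
To make the categories concrete, here is a rough sketch of how much of
this classification userspace can do with the API as it exists today.
The enum and classify_recvmsg() are hypothetical names introduced only
for illustration, and the set of errnos treated as hard failures is an
assumption.  The sketch relies only on the documented behaviour that
recvmsg() sets MSG_ERRQUEUE in msg_flags when it returns a message from
the error queue.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical outcome categories, loosely following the numbered list. */
enum recv_outcome {
    RECV_HARD_FAILURE,  /* 1: EFAULT and similar; a bug in the caller */
    RECV_NOTHING,       /* 2: nothing to receive (EAGAIN) */
    RECV_DATAGRAM,      /* 4: a normal datagram */
    RECV_QUEUED_ERROR,  /* 5: a message from the error queue */
    RECV_AMBIGUOUS      /* 3 or 6: pending sk_err or a dropped packet */
};

/*
 * ret and err are the return value and errno of a recvmsg() call;
 * msg_flags is the msg_flags field of the msghdr after the call.
 */
static enum recv_outcome classify_recvmsg(ssize_t ret, int err, int msg_flags)
{
    if (ret >= 0)
        return (msg_flags & MSG_ERRQUEUE) ? RECV_QUEUED_ERROR : RECV_DATAGRAM;

    switch (err) {
    case EAGAIN:
#if EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:
#endif
        return RECV_NOTHING;
    case EFAULT:
    case EBADF:
    case EINVAL:
    case ENOTSOCK:
        /* Assumption: these never come from sk_err on a datagram socket. */
        return RECV_HARD_FAILURE;
    default:
        /* Could be a pending sk_err, a bad checksum, ENOMEM, ... */
        return RECV_AMBIGUOUS;
    }
}
```

The RECV_AMBIGUOUS bucket is the part the proposed mode would remove:
with a distinct indication in msg_flags, cases 3 and 6 would no longer
have to be inferred from errno.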

Thoughts?  Does io_uring affect any of this?

> 
> Grumpily,
> Andy


* Re: The sk_err mechanism is infuriating in userspace
  2024-02-05 23:03 The sk_err mechanism is infuriating in userspace Andy Lutomirski
  2024-02-05 23:22 ` Andy Lutomirski
@ 2024-02-06  8:43 ` Paolo Abeni
  2024-02-06 17:24   ` Andy Lutomirski
  1 sibling, 1 reply; 5+ messages in thread
From: Paolo Abeni @ 2024-02-06  8:43 UTC (permalink / raw)
  To: Andy Lutomirski, Network Development; +Cc: Linux API

On Mon, 2024-02-05 at 15:03 -0800, Andy Lutomirski wrote:
> Hi all-
> 
> I encounter this issue every couple of years, and it still seems to be
> an issue, and it drives me nuts every time I see it.
> 
> I write software that uses unconnected datagram-style sockets.  Errors
> happen for all kinds of reasons, and my software knows it.  My
> software even handles the errors and moves on with its life.  I use
> MSG_ERRQUEUE to understand the errors.  But the kernel fights back:
> 
> struct sk_buff *__skb_try_recv_datagram(struct sock *sk,
>                                         struct sk_buff_head *queue,
>                                         unsigned int flags, int *off, int *err,
>                                         struct sk_buff **last)
> {
>         struct sk_buff *skb;
>         unsigned long cpu_flags;
>         /*
>          * Caller is allowed not to check sk->sk_err before skb_recv_datagram()
>          */
>         int error = sock_error(sk);
> 
>         if (error)
>                 goto no_packet;
>         ^^^^^^^^^^ <----- EXCUSE ME?
> 
> The kernel even fights back on the *send* path?!?
> 
> static long sock_wait_for_wmem(struct sock *sk, long timeo)
> {
>         DEFINE_WAIT(wait);
> 
>         sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
>         for (;;) {
>                 if (!timeo)
>                         break;
>                 if (signal_pending(current))
>                         break;
>                 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>                 ...
>                 if (READ_ONCE(sk->sk_err))
>                         break;  <-- KERNEL HATES UNCONNECTED SOCKETS!
> 
> This is IMO just broken.  I realize it's legacy behavior, but it's
> BROKEN legacy behavior. 

As you noted, this is an established behaviour exposed to user-space,
and we can't simply change it, regardless of its own (eventual lack
of) merit.

>  sk_err does not (at least for an unconnected
> socket) indicate that anything is wrong with the socket. 

What about 'destination/port unreachable' and many other similar errors
reported by sk_err? Which specific errors reported by sk_err do not
indicate that anything is wrong with the socket?

I guess that if you really want to ignore socket error for datagram
sockets at recvmsg()/sendmsg() time you could implement some new socket
option to conditionally enable such behaviour on a per socket basis.

Cheers,

Paolo



* Re: The sk_err mechanism is infuriating in userspace
  2024-02-06  8:43 ` Paolo Abeni
@ 2024-02-06 17:24   ` Andy Lutomirski
  2024-02-28 20:00     ` Andy Lutomirski
  0 siblings, 1 reply; 5+ messages in thread
From: Andy Lutomirski @ 2024-02-06 17:24 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Network Development, Linux API

On Tue, Feb 6, 2024 at 12:43 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> What about 'destination/port unreachable' and many other similar errors
> reported by sk_err? Which specific errors reported by sk_err do not
> indicate that anything is wrong with the socket?

Destination/port unreachable are *exactly* the primary offenders.  Consider:

1. TCP socket.  If the peer becomes unreachable, the connection is
unusable.  Maybe reading previously queued data is reasonable; maybe
it's not, but one way or another the connection isn't working any
more.  The current API seems okay.

2. UDP peer-to-peer connection.  I have a socket and it's connected to
a peer.  The peer sends an ICMP error or a route changes and the
kernel can't route to the peer.  The connection is at least
temporarily dead.  If we accept that temporarily dead equals
permanently dead, then returning error codes makes sense.  Even if we
expect the application to try to recover without making a new socket,
telling the application seems fine.  The application will understand
that an error occurred communicating with its peer and can do
something about it.

3. UDP *server* with multiple clients.  (Or unconnected UDP socket
communicating with multiple peers, etc.)  Imagine a DNS server or a
QUIC server -- I hear QUIC is cool lately.  A userspace server has a
socket, and it does sendto() or sendmsg() to a whole bunch of
addresses.  One of them sends an ICMP error.  There are multiple
things the server might do.  It might ignore the error entirely and
treat it just like a timeout, because it probably already has
perfectly nice timeout handling.  Or it might want to know that there
was an error communicating with a *specific* peer and release
resources sooner than it would for a timeout.  Or it might want to
collect the entire ICMP error (via RECVERR) and do something useful
with it.  But it gets no value whatsoever from knowing that an
unspecified peer sent an ICMP error, and it gets negative value from
having a call to recvfrom() or recvmsg() fail and needing to look up
in some hopefully-correct table whether the failure indicates an
actual problem (EFAULT, for example) or a completely useless return
value that should be ignored (EHOSTUNREACH).

(#3 is probably worse if the application uses one-shot notifications
-- the application needs to make a decision as to whether to call
recvfrom/recvmsg again or go back to polling.)
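
The "hopefully-correct table" mentioned in case 3 tends to end up
looking something like the sketch below.  The function name is
hypothetical and the errno list is an assumption: it covers errnos that
Linux commonly derives from ICMP errors and is not exhaustive.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical heuristic for an unconnected UDP socket: does this errno
 * look like an asynchronous ICMP-derived error about some peer rather
 * than a real failure of the call?  The list is an assumption.
 * EMSGSIZE is left out on purpose because sendmsg() also returns it
 * synchronously for oversized datagrams.
 */
static bool errno_looks_like_async_icmp(int err)
{
    switch (err) {
    case ECONNREFUSED:  /* port unreachable */
    case EHOSTUNREACH:  /* host unreachable, time exceeded */
    case ENETUNREACH:   /* network unreachable */
    case EHOSTDOWN:
    case ENOPROTOOPT:   /* protocol unreachable */
    case EPROTO:        /* parameter problem */
        return true;
    default:
        return false;
    }
}
```

A server would call recvmsg() again when this returns true and treat
anything else (other than EAGAIN) as a real problem.  That is exactly
the kind of heuristic that is easy to get subtly wrong.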

--Andy


* Re: The sk_err mechanism is infuriating in userspace
  2024-02-06 17:24   ` Andy Lutomirski
@ 2024-02-28 20:00     ` Andy Lutomirski
  0 siblings, 0 replies; 5+ messages in thread
From: Andy Lutomirski @ 2024-02-28 20:00 UTC (permalink / raw)
  To: Paolo Abeni, David S. Miller, Jakub Kicinski
  Cc: Network Development, Linux API

On Tue, Feb 6, 2024 at 9:24 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
> On Tue, Feb 6, 2024 at 12:43 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > What about 'destination/port unreachable' and many other similar errors
> > reported by sk_err? Which specific errors reported by sk_err do not
> > indicate that anything is wrong with the socket?

I started writing a series to improve this in a backwards-compatible
way, but now I'm wondering whether the current behavior may be
partially a regression and not actually something well-enshrined in
history.

The nasty behavior in question is that, if a UDP or ping (or
presumably TCP, but that case is not necessarily a problem) socket
enables IP_RECVERR, then an ICMP error will asynchronously cause the
next sendmsg() to fail.  The code that causes this seems to be ancient
(I think it's sock_wait_for_wmem, which predates git, but I won't
swear to that).
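
For readers unfamiliar with the option, enabling extended error
reporting on a socket looks roughly like this.  enable_recverr() is a
hypothetical helper name; the socket options themselves are the
standard Linux ones.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: turn on extended error reporting for a socket of
 * the given address family (AF_INET or AF_INET6).
 * Returns 0 on success, -1 with errno set on failure.
 */
static int enable_recverr(int fd, int family)
{
    int one = 1;

    if (family == AF_INET6)
        return setsockopt(fd, IPPROTO_IPV6, IPV6_RECVERR, &one, sizeof(one));
    return setsockopt(fd, IPPROTO_IP, IP_RECVERR, &one, sizeof(one));
}
```

Once this is set, ICMP errors are queued on the socket's error queue
and can be read with recvmsg() and MSG_ERRQUEUE.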

Looking at my own logs, though, a Linux 4.5.2 kernel did not seem to
regularly trigger this, and I'm getting it on a regular basis on 6.2
and some newer kernels.  And, somewhat damningly (with IP addresses
redacted):

$ traceroute -I 10.1.2.3
traceroute to 10.1.2.3 (10.1.2.3), 30 hops max, 60 byte packets
 1  * * *
 2  10.5.6.7 (10.5.6.7)  0.593 ms  0.793 ms  0.988 ms
 3  10.8.9.10 (10.8.9.10)  1.247 ms  1.547 ms  1.881 ms
 4  10.11.12.13 (10.11.12.13)  1.032 ms  1.333 ms  1.679 ms
send: No route to host

Whoops, traceroute is getting a bogus return when it sends a packet,
causing it to give up.  The real trace should be longer.
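
One userspace workaround is to retry the send once when it fails with
an errno that may belong to an earlier packet, as sketched below.  The
helper name is hypothetical, and the sketch assumes that a pending
socket error is cleared once it has been reported to the caller, which
appears to be how sock_error() behaves.  The retry does not tell the
caller which peer the error was about.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical wrapper for an unconnected datagram socket: if sendto()
 * fails with an errno that may have been caused by an earlier ICMP error
 * rather than by this datagram, try exactly once more.
 */
static ssize_t sendto_retry_once(int fd, const void *buf, size_t len,
                                 const struct sockaddr *to, socklen_t tolen)
{
    ssize_t n = sendto(fd, buf, len, 0, to, tolen);

    if (n >= 0)
        return n;

    switch (errno) {
    case ECONNREFUSED:
    case EHOSTUNREACH:
    case ENETUNREACH:
        /*
         * Possibly a stale asynchronous error.  If the destination really
         * is unroutable, the second attempt fails the same way and that
         * failure is returned to the caller.
         */
        return sendto(fd, buf, len, 0, to, tolen);
    default:
        return n;   /* a genuine failure; errno is left as sendto() set it */
    }
}
```

A tool like traceroute could apply the same idea to its probe sends
instead of giving up on the first failure.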

So I'm wondering if maybe this behavior should be seen as a bug to be
fixed and not a weird old API that needs to be preserved.

--Andy

