* [PATCH] 2.6 UDP recvmsg vs POSIX
@ 2004-02-17 12:17 Olaf Kirch
2004-02-18 4:30 ` Andi Kleen
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Olaf Kirch @ 2004-02-17 12:17 UTC (permalink / raw)
To: netdev
[-- Attachment #1: Type: text/plain, Size: 1454 bytes --]
Hi,
I'm currently investigating a problem in udp recvmsg. In short, a test
program selects on the socket, gets woken up as a packet arrives, but
recvmsg return EAGAIN because the UDP checksum was wrong.
This broke one real-life application, and reportedly this violates POSIX,
as operations on a blocking socket must never return EAGAIN.
The quick and dirty fix would be to always do the checksum in
udp_queue_rcv_skb, but I assume this "checksum-outside-the-bottom-half"
approach is there because it improves performance, so I've tried to
come up with a different solution.
Please find attached a patch that tries to pluck bad udp packets from the
receive queue as part of poll/select handling, so that we never mark a
udp socket as POLLIN when there are only bad packets in the queue. This
is admittedly not very elegant, and it may have side effects I'm not
seeing yet. But it seems to fix things for me here, without slowing
the bottom half handlers down.
The second half of the patch deals with bad checksums in recvmsg. If an
application calls recvmsg on a blocking socket, and we find a packet
with a bad checksum (this may happen if the application didn't select
first), we start over at the top of the function, so that we block in
skb_datagram_receive rather than returning EAGAIN.
Any comments welcome.
Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
okir@suse.de | tempfile names today!
---------------+
[-- Attachment #2: linux-2.6.2-udpfix.patch --]
[-- Type: text/plain, Size: 3335 bytes --]
--- linux-2.6.2/include/net/udp.h.udpfix 2004-02-17 10:10:03.000000000 +0100
+++ linux-2.6.2/include/net/udp.h 2004-02-17 10:10:18.000000000 +0100
@@ -74,6 +74,10 @@
extern int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
extern int udp_disconnect(struct sock *sk, int flags);
+extern unsigned int udp_poll(struct file *file,
+ struct socket *sock,
+ struct poll_table_struct *wait);
+
DECLARE_SNMP_STAT(struct udp_mib, udp_statistics);
#define UDP_INC_STATS(field) SNMP_INC_STATS(udp_statistics, field)
#define UDP_INC_STATS_BH(field) SNMP_INC_STATS_BH(udp_statistics, field)
--- linux-2.6.2/net/ipv4/af_inet.c.udpfix 2004-02-04 04:43:12.000000000 +0100
+++ linux-2.6.2/net/ipv4/af_inet.c 2004-02-17 10:36:00.000000000 +0100
@@ -912,7 +912,7 @@
.socketpair = sock_no_socketpair,
.accept = sock_no_accept,
.getname = inet_getname,
- .poll = datagram_poll,
+ .poll = udp_poll,
.ioctl = inet_ioctl,
.listen = sock_no_listen,
.shutdown = inet_shutdown,
--- linux-2.6.2/net/ipv4/udp.c.udpfix 2004-02-13 16:07:30.000000000 +0100
+++ linux-2.6.2/net/ipv4/udp.c 2004-02-17 10:35:41.000000000 +0100
@@ -766,6 +766,64 @@
}
/*
+ * Packets queued on the UDP socket may have a bad checksum,
+ * and need to be discarded. On the other hand, recvmsg is not
+ * allowed to return EAGAIN on a blocking socket, especially
+ * if select() indicated that data is available.
+ *
+ * To solve this, udp_poll makes sure that the next packet in
+ * the queue has a valid checksum. This fixes the select/recvmsg
+ * case.
+ * The general case of recvmsg on a blocking socket is solved by
+ * simply looping over in udp_recvmsg when the checksum is bad. --okir
+ */
+unsigned int
+udp_poll(struct file *file, struct socket *sock, struct poll_table_struct *wait)
+{
+ struct sock *sk = sock->sk;
+ struct sk_buff *skb;
+ int error;
+
+ do {
+ spin_lock_irq(&sk->sk_receive_queue.lock);
+ skb = skb_peek(&sk->sk_receive_queue);
+ if (skb != NULL)
+ atomic_inc(&skb->users);
+ spin_unlock_irq(&sk->sk_receive_queue.lock);
+
+ if (skb == NULL)
+ break;
+
+ error = 0;
+ if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
+ /* All is well */
+ } else if (!(error = __udp_checksum_complete(skb))) {
+ skb->ip_summed = CHECKSUM_UNNECESSARY;
+ } else {
+ /* Bad checksum. Pluck it from there queue */
+ UDP_INC_STATS_BH(UdpInErrors);
+
+ spin_lock_irq(&sk->sk_receive_queue.lock);
+ if (skb == skb_peek(&sk->sk_receive_queue)) {
+ __skb_unlink(skb, &sk->sk_receive_queue);
+ spin_unlock_irq(&sk->sk_receive_queue.lock);
+ kfree_skb(skb);
+ } else {
+ spin_unlock_irq(&sk->sk_receive_queue.lock);
+ }
+ }
+
+ skb_free_datagram(sk, skb);
+ } while (error);
+
+ /* We have made sure there's at least one valid
+ * UDP packet in the queue, so do the normal datagram
+ * poll now */
+ return datagram_poll(file, sock, wait);
+}
+
+
+/*
* This should be easy, if there is something there we
* return it, otherwise we block.
*/
@@ -787,6 +845,7 @@
if (flags & MSG_ERRQUEUE)
return ip_recv_error(sk, msg, len);
+try_again:
skb = skb_recv_datagram(sk, flags, noblock, &err);
if (!skb)
goto out;
@@ -852,7 +911,7 @@
skb_free_datagram(sk, skb);
- return -EAGAIN;
+ goto try_again;
}
int udp_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] 2.6 UDP recvmsg vs POSIX
2004-02-17 12:17 [PATCH] 2.6 UDP recvmsg vs POSIX Olaf Kirch
@ 2004-02-18 4:30 ` Andi Kleen
2004-02-18 6:17 ` David S. Miller
2004-02-21 21:08 ` David S. Miller
2 siblings, 0 replies; 8+ messages in thread
From: Andi Kleen @ 2004-02-18 4:30 UTC (permalink / raw)
To: Olaf Kirch; +Cc: netdev
On Tue, 17 Feb 2004 13:17:26 +0100
Olaf Kirch <okir@suse.de> wrote:
> Any comments welcome.
Looks good to me.
One small optimization would be to quickly check the receive queue without taking the
receive queue lock. This would avoid the additional lock in many cases.
-Andi
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] 2.6 UDP recvmsg vs POSIX
2004-02-17 12:17 [PATCH] 2.6 UDP recvmsg vs POSIX Olaf Kirch
2004-02-18 4:30 ` Andi Kleen
@ 2004-02-18 6:17 ` David S. Miller
2004-02-19 7:07 ` Andi Kleen
2004-02-21 21:08 ` David S. Miller
2 siblings, 1 reply; 8+ messages in thread
From: David S. Miller @ 2004-02-18 6:17 UTC (permalink / raw)
To: Olaf Kirch; +Cc: netdev
On Tue, 17 Feb 2004 13:17:26 +0100
Olaf Kirch <okir@suse.de> wrote:
> I'm currently investigating a problem in udp recvmsg. In short, a test
> program selects on the socket, gets woken up as a packet arrives, but
> recvmsg return EAGAIN because the UDP checksum was wrong.
Thanks Olaf, I'm not ignoring your patch I just didn't get to it today.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] 2.6 UDP recvmsg vs POSIX
2004-02-18 6:17 ` David S. Miller
@ 2004-02-19 7:07 ` Andi Kleen
0 siblings, 0 replies; 8+ messages in thread
From: Andi Kleen @ 2004-02-19 7:07 UTC (permalink / raw)
To: David S. Miller; +Cc: okir, netdev
On Tue, 17 Feb 2004 22:17:05 -0800
"David S. Miller" <davem@redhat.com> wrote:
> On Tue, 17 Feb 2004 13:17:26 +0100
> Olaf Kirch <okir@suse.de> wrote:
>
> > I'm currently investigating a problem in udp recvmsg. In short, a test
> > program selects on the socket, gets woken up as a packet arrives, but
> > recvmsg return EAGAIN because the UDP checksum was wrong.
>
> Thanks Olaf, I'm not ignoring your patch I just didn't get to it today.
BTW I think TCP could hit the same problem. But it did already csum-copy
in user context in 2.4 and nobody reported it.
-Andi
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] 2.6 UDP recvmsg vs POSIX
2004-02-17 12:17 [PATCH] 2.6 UDP recvmsg vs POSIX Olaf Kirch
2004-02-18 4:30 ` Andi Kleen
2004-02-18 6:17 ` David S. Miller
@ 2004-02-21 21:08 ` David S. Miller
2004-02-23 9:42 ` Olaf Kirch
2 siblings, 1 reply; 8+ messages in thread
From: David S. Miller @ 2004-02-21 21:08 UTC (permalink / raw)
To: Olaf Kirch; +Cc: netdev
On Tue, 17 Feb 2004 13:17:26 +0100
Olaf Kirch <okir@suse.de> wrote:
> I'm currently investigating a problem in udp recvmsg. In short, a test
> program selects on the socket, gets woken up as a packet arrives, but
> recvmsg return EAGAIN because the UDP checksum was wrong.
>
> This broke one real-life application, and reportedly this violates POSIX,
> as operations on a blocking socket must never return EAGAIN.
Well, first things first, using blocking sockets with select/poll is kind
of stupid programming.
Nonetheless, udp_recvmsg() can't return -EAGAIN on a blocking socket.
Andi, TCP does not have this issue, it simply goes back to sleep waiting
for more data to arrive if the prequeue delayed checksumming fails.
I think we should fix it like the following, which is basically the last
part of Olaf's original patch in this thread. Just wait for another packet
if a blocking socket and checksum fails, else if non-blocking -EAGAIN is OK.
To be honest, I believe this was Alexey's original intention when he wrote this
code.
===== net/ipv4/udp.c 1.56 vs edited =====
--- 1.56/net/ipv4/udp.c Fri Jan 9 01:50:23 2004
+++ edited/net/ipv4/udp.c Sat Feb 21 13:03:05 2004
@@ -787,6 +787,7 @@
if (flags & MSG_ERRQUEUE)
return ip_recv_error(sk, msg, len);
+try_again:
skb = skb_recv_datagram(sk, flags, noblock, &err);
if (!skb)
goto out;
@@ -852,7 +853,9 @@
skb_free_datagram(sk, skb);
- return -EAGAIN;
+ if (noblock)
+ return -EAGAIN;
+ goto try_again;
}
int udp_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] 2.6 UDP recvmsg vs POSIX
2004-02-21 21:08 ` David S. Miller
@ 2004-02-23 9:42 ` Olaf Kirch
2004-02-23 17:42 ` David S. Miller
0 siblings, 1 reply; 8+ messages in thread
From: Olaf Kirch @ 2004-02-23 9:42 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev
On Sat, Feb 21, 2004 at 01:08:53PM -0800, David S. Miller wrote:
> Well, first things first, using blocking sockets with select/poll is kind
> of stupid programming.
But that's the way many real world applications do it, including
for instance the glibc sunrpc code.
> I think we should fix it like the following, which is basically the last
> part of Olaf's original patch in this thread. Just wait for another packet
> if a blocking socket and checksum fails, else if non-blocking -EAGAIN is OK.
But that breaks poll/recvmsg on a blocking socket, because poll with
assert POLLIN, but the recvmsg call with block.
I agree that plucking packets off the queue inside poll() is ugly.
The alternative would be to just walk the queue to see if there is at
least one packet with valid checksum, and clear POLLIN if there isn't,
but otherwise leave the queue untouched. This would work in all except
the most extreme scenarios where your recv queue is filled to the limit
with bad packets, and any new packets with potentially good checksums
get dropped. This would make it easy to DoS any UDP based service
by spraying it with bad udp packets.
Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
okir@suse.de | tempfile names today!
---------------+
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] 2.6 UDP recvmsg vs POSIX
2004-02-23 9:42 ` Olaf Kirch
@ 2004-02-23 17:42 ` David S. Miller
2004-02-24 9:32 ` Olaf Kirch
0 siblings, 1 reply; 8+ messages in thread
From: David S. Miller @ 2004-02-23 17:42 UTC (permalink / raw)
To: Olaf Kirch; +Cc: netdev
On Mon, 23 Feb 2004 10:42:10 +0100
Olaf Kirch <okir@suse.de> wrote:
> But that breaks poll/recvmsg on a blocking socket, because poll with
> assert POLLIN, but the recvmsg call with block.
There is nothing wrong with that, I even pinged Linus about this and he
agreed with me.
What stops another thread from pulling a packet from the
receive queue and thus emptying it? Nothing. POLLIN does not guarentee
a read will give you data right now, it never has and it never will.
Therefore, the only bug was returning -EAGAIN unconditionally and that's
what is fixed by the correct half of your changes.
TCP does the same thing Olaf, and there is no way I'm adding the gross poll
hacks there :-)
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] 2.6 UDP recvmsg vs POSIX
2004-02-23 17:42 ` David S. Miller
@ 2004-02-24 9:32 ` Olaf Kirch
0 siblings, 0 replies; 8+ messages in thread
From: Olaf Kirch @ 2004-02-24 9:32 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev
On Mon, Feb 23, 2004 at 09:42:16AM -0800, David S. Miller wrote:
> Therefore, the only bug was returning -EAGAIN unconditionally and that's
> what is fixed by the correct half of your changes.
>
> TCP does the same thing Olaf, and there is no way I'm adding the gross poll
> hacks there :-)
Okay, I can live with that. I will push this info to our posix
guys and let them chew on it.
UDPv6 needs the exact same fix BTW.
Cheers,
Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
okir@suse.de | tempfile names today!
---------------+
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2004-02-24 9:32 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-02-17 12:17 [PATCH] 2.6 UDP recvmsg vs POSIX Olaf Kirch
2004-02-18 4:30 ` Andi Kleen
2004-02-18 6:17 ` David S. Miller
2004-02-19 7:07 ` Andi Kleen
2004-02-21 21:08 ` David S. Miller
2004-02-23 9:42 ` Olaf Kirch
2004-02-23 17:42 ` David S. Miller
2004-02-24 9:32 ` Olaf Kirch
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).