From: Willy Tarreau <w@1wt.eu>
To: Eric Dumazet <dada1@cosmosbay.com>
Cc: David Miller <davem@davemloft.net>,
ben@zeus.com, jarkao2@gmail.com, mingo@elte.hu,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
jens.axboe@oracle.com
Subject: Re: [PATCH] tcp: splice as many packets as possible at once
Date: Fri, 9 Jan 2009 23:09:46 +0100 [thread overview]
Message-ID: <20090109220946.GA4787@1wt.eu> (raw)
In-Reply-To: <4967C95E.3070602@cosmosbay.com>
On Fri, Jan 09, 2009 at 11:02:06PM +0100, Eric Dumazet wrote:
> Willy Tarreau a écrit :
> > On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
> > (...)
> >>> Also, in your second mail, you're saying that your change
> >>> might return more data than requested by the user. I can't
> >>> find why, could you please explain to me, as I'm still quite
> >>> ignorant in this area ?
> >> Well, I just tested various user programs and indeed got this
> >> strange result :
> >>
> >> Here I call splice() with len=1000 (0x3e8), and you can see
> >> it gives a result of 1460 at the second call.
> >
> > huh, not nice indeed!
> >
> > While looking at the code to see how this could be possible, I
> > came across this minor thing (unrelated IMHO) :
> >
> > if (__skb_splice_bits(skb, &offset, &tlen, &spd))
> > goto done;
> >>>>>>> else if (!tlen) <<<<<<
> > goto done;
> >
> > /*
> > * now see if we have a frag_list to map
> > */
> > if (skb_shinfo(skb)->frag_list) {
> > struct sk_buff *list = skb_shinfo(skb)->frag_list;
> >
> > for (; list && tlen; list = list->next) {
> > if (__skb_splice_bits(list, &offset, &tlen, &spd))
> > break;
> > }
> > }
> >
> > done:
> >
> > Above on the enlighted line, we'd better remove the else and leave a plain
> > "if (!tlen)". Otherwise, when the first call to __skb_splice_bits() zeroes
> > tlen, we still enter the if and evaluate the for condition for nothing. But
> > let's leave that for later.
> >
> >> I suspect a bug in splice code, that my patch just exposed.
> >
> > I've checked in skb_splice_bits() and below and can't see how we can move
> > more than the requested len.
> >
> > However, with your change, I don't clearly see how we break out of
> > the loop in tcp_read_sock(). Maybe we first read 1000 then loop again
> > and read remaining data ? I suspect that we should at least exit when
> > ((struct tcp_splice_state *)desc->arg.data)->len = 0.
> >
> > At least that's something easy to add just before or after !desc->count
> > for a test.
> >
>
> I believe the bug is in tcp_splice_data_recv()
yes, see my other mail.
> I am going to test a new patch, but here is the thing I found:
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index bd6ff90..fbbddf4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -523,7 +523,7 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
> {
> struct tcp_splice_state *tss = rd_desc->arg.data;
>
> - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
> + return skb_splice_bits(skb, offset, tss->pipe, len, tss->flags);
> }
it will not work, len is always 1000 in your case, for every call,
as I had in my logs :
kernel :
tcp_splice_data_recv: skb=ed3d3480 offset=0 len=1460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_sendpage: going through do_tcp_sendpages
tcp_splice_data_recv: skb=ed3d3480 offset=1000 len=460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_splice_data_recv: skb=ed3d3540 offset=0 len=1460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_sendpage: going through do_tcp_sendpages
tcp_sendpage: going through do_tcp_sendpages
tcp_splice_data_recv: skb=ed3d3540 offset=1000 len=460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_splice_data_recv: skb=ed3d3600 offset=0 len=1176 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_sendpage: going through do_tcp_sendpages
tcp_sendpage: going through do_tcp_sendpages
tcp_splice_data_recv: skb=ed3d3600 offset=1000 len=176 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_sendpage: going through do_tcp_sendpages
program :
accept(3, 0, NULL) = 4
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(4001), sin_addr=inet_addr("10.0.3.1")}, 16) = 0
fcntl64(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl64(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
pipe([6, 7]) = 0
select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x5b4, 0x3) = 1460
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x5b4, 0x3) = 1460
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 176
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0xb0, 0x3) = 176
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = -1 EAGAIN (Resource temporarily unavailable)
close(5) = 0
close(4) = 0
exit_group(0) = ?
Now with the fix :
accept(3, 0, NULL) = 4
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(4001), sin_addr=inet_addr("10.0.3.1")}, 16) = 0
fcntl64(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl64(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
pipe([6, 7]) = 0
select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 96
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x60, 0x3) = 96
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = -1 EAGAIN (Resource temporarily unavailable)
close(5) = 0
close(4) = 0
exit_group(0) = ?
Regards,
Willy
next prev parent reply other threads:[~2009-01-09 22:13 UTC|newest]
Thread overview: 191+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-01-08 17:30 [PATCH] tcp: splice as many packets as possible at once Willy Tarreau
2009-01-08 19:44 ` Jens Axboe
2009-01-08 22:03 ` Willy Tarreau
2009-01-08 20:16 ` Willy Tarreau
2009-01-08 21:50 ` Ben Mansell
2009-01-08 21:55 ` David Miller
2009-01-08 22:20 ` Willy Tarreau
2009-01-13 23:08 ` David Miller
2009-01-09 6:47 ` Eric Dumazet
2009-01-09 7:04 ` Willy Tarreau
2009-01-09 7:28 ` Eric Dumazet
2009-01-09 7:42 ` Willy Tarreau
2009-01-13 23:27 ` David Miller
2009-01-13 23:35 ` Eric Dumazet
2009-01-09 15:42 ` Eric Dumazet
2009-01-09 17:57 ` Eric Dumazet
2009-01-09 18:54 ` Willy Tarreau
2009-01-09 20:51 ` Eric Dumazet
2009-01-09 21:24 ` Willy Tarreau
2009-01-09 22:02 ` Eric Dumazet
2009-01-09 22:09 ` Willy Tarreau [this message]
2009-01-09 22:07 ` Willy Tarreau
2009-01-09 22:12 ` Eric Dumazet
2009-01-09 22:17 ` Willy Tarreau
2009-01-09 22:42 ` Evgeniy Polyakov
2009-01-09 22:50 ` Willy Tarreau
2009-01-09 23:01 ` Evgeniy Polyakov
2009-01-09 23:06 ` Willy Tarreau
2009-01-10 7:40 ` Eric Dumazet
2009-01-11 12:58 ` Evgeniy Polyakov
2009-01-11 13:14 ` Eric Dumazet
2009-01-11 13:35 ` Evgeniy Polyakov
2009-01-11 16:00 ` Eric Dumazet
2009-01-11 16:05 ` Evgeniy Polyakov
2009-01-14 0:07 ` David Miller
2009-01-14 0:13 ` Evgeniy Polyakov
2009-01-14 0:16 ` David Miller
2009-01-14 0:22 ` Evgeniy Polyakov
2009-01-14 0:37 ` David Miller
2009-01-14 3:51 ` Herbert Xu
2009-01-14 4:25 ` David Miller
2009-01-14 7:27 ` David Miller
2009-01-14 8:26 ` Herbert Xu
2009-01-14 8:53 ` Jarek Poplawski
2009-01-14 9:29 ` David Miller
2009-01-14 9:42 ` Jarek Poplawski
2009-01-14 10:06 ` David Miller
2009-01-14 10:47 ` Jarek Poplawski
2009-01-14 11:29 ` Herbert Xu
2009-01-14 11:40 ` Jarek Poplawski
2009-01-14 11:45 ` Jarek Poplawski
2009-01-14 9:54 ` Jarek Poplawski
2009-01-14 10:01 ` Willy Tarreau
2009-01-14 12:06 ` Jarek Poplawski
2009-01-14 12:15 ` Jarek Poplawski
2009-01-14 11:28 ` Herbert Xu
2009-01-15 23:03 ` Willy Tarreau
2009-01-15 23:19 ` David Miller
2009-01-15 23:19 ` Herbert Xu
2009-01-15 23:26 ` David Miller
2009-01-15 23:32 ` Herbert Xu
2009-01-15 23:34 ` David Miller
2009-01-15 23:42 ` Willy Tarreau
2009-01-15 23:44 ` Willy Tarreau
2009-01-15 23:54 ` David Miller
2009-01-19 0:42 ` Willy Tarreau
2009-01-19 3:08 ` Herbert Xu
2009-01-19 3:27 ` David Miller
2009-01-19 6:14 ` Willy Tarreau
2009-01-19 6:19 ` David Miller
2009-01-19 6:45 ` Willy Tarreau
2009-01-19 10:19 ` Herbert Xu
2009-01-19 20:59 ` David Miller
2009-01-19 21:24 ` Herbert Xu
2009-01-25 21:03 ` Willy Tarreau
2009-01-26 7:59 ` Jarek Poplawski
2009-01-26 8:12 ` Willy Tarreau
2009-01-19 8:40 ` Jarek Poplawski
2009-01-19 3:28 ` David Miller
2009-01-19 6:11 ` Willy Tarreau
2009-01-24 21:23 ` Willy Tarreau
2009-01-20 12:01 ` Ben Mansell
2009-01-20 12:11 ` Evgeniy Polyakov
2009-01-20 13:43 ` Ben Mansell
2009-01-20 14:06 ` Jarek Poplawski
2009-01-16 6:51 ` Jarek Poplawski
2009-01-19 6:08 ` David Miller
2009-01-19 6:16 ` David Miller
2009-01-19 10:20 ` Herbert Xu
2009-01-20 8:37 ` Jarek Poplawski
2009-01-20 9:33 ` [PATCH v2] " Jarek Poplawski
2009-01-20 10:00 ` Evgeniy Polyakov
2009-01-20 10:20 ` Jarek Poplawski
2009-01-20 10:31 ` Evgeniy Polyakov
2009-01-20 11:01 ` Jarek Poplawski
2009-01-20 17:16 ` David Miller
2009-01-21 9:54 ` Jarek Poplawski
2009-01-22 9:04 ` [PATCH v3] " Jarek Poplawski
2009-01-26 5:22 ` David Miller
2009-01-27 7:11 ` Herbert Xu
2009-01-27 7:54 ` Jarek Poplawski
2009-01-27 10:09 ` Herbert Xu
2009-01-27 10:35 ` Jarek Poplawski
2009-01-27 10:57 ` Jarek Poplawski
2009-01-27 11:48 ` Herbert Xu
2009-01-27 12:16 ` Jarek Poplawski
2009-01-27 12:31 ` Jarek Poplawski
2009-01-27 17:06 ` David Miller
2009-01-28 8:10 ` Jarek Poplawski
2009-02-01 8:41 ` David Miller
2009-01-26 8:20 ` [PATCH v2] " Jarek Poplawski
2009-01-26 21:21 ` Evgeniy Polyakov
2009-01-27 6:10 ` David Miller
2009-01-27 7:40 ` Jarek Poplawski
2009-01-30 21:42 ` David Miller
2009-01-30 21:59 ` Willy Tarreau
2009-01-30 22:03 ` David Miller
2009-01-30 22:13 ` Willy Tarreau
2009-01-30 22:15 ` David Miller
2009-01-30 22:16 ` Herbert Xu
2009-02-02 8:08 ` Jarek Poplawski
2009-02-02 8:18 ` David Miller
2009-02-02 8:43 ` Jarek Poplawski
2009-02-03 7:50 ` David Miller
2009-02-03 9:41 ` Jarek Poplawski
2009-02-03 11:10 ` Evgeniy Polyakov
2009-02-03 11:24 ` Herbert Xu
2009-02-03 11:49 ` Evgeniy Polyakov
2009-02-03 11:53 ` Herbert Xu
2009-02-03 12:07 ` Evgeniy Polyakov
2009-02-03 12:12 ` Herbert Xu
2009-02-03 12:18 ` Evgeniy Polyakov
2009-02-03 12:25 ` Willy Tarreau
2009-02-03 12:28 ` Herbert Xu
2009-02-04 0:47 ` David Miller
2009-02-04 6:19 ` Willy Tarreau
2009-02-04 8:12 ` Evgeniy Polyakov
2009-02-04 8:54 ` Willy Tarreau
2009-02-04 8:59 ` Herbert Xu
2009-02-04 9:01 ` David Miller
2009-02-04 9:12 ` Willy Tarreau
2009-02-04 9:15 ` David Miller
2009-02-04 19:19 ` Roland Dreier
2009-02-04 19:28 ` Willy Tarreau
2009-02-04 19:48 ` Jarek Poplawski
2009-02-05 8:32 ` Bill Fink
2009-02-04 9:12 ` David Miller
2009-02-03 12:27 ` Herbert Xu
2009-02-03 13:05 ` david
2009-02-03 12:12 ` Evgeniy Polyakov
2009-02-03 12:18 ` Herbert Xu
2009-02-03 12:30 ` Evgeniy Polyakov
2009-02-03 12:33 ` Herbert Xu
2009-02-03 12:33 ` Nick Piggin
2009-02-04 0:46 ` David Miller
2009-02-04 9:41 ` Benny Amorsen
2009-02-04 12:01 ` Herbert Xu
2009-02-03 12:36 ` Jarek Poplawski
2009-02-03 13:06 ` Evgeniy Polyakov
2009-02-03 13:25 ` Jarek Poplawski
2009-02-03 14:20 ` Evgeniy Polyakov
2009-02-04 0:46 ` David Miller
2009-02-04 8:08 ` Evgeniy Polyakov
2009-02-04 9:23 ` Nick Piggin
2009-02-04 7:56 ` Jarek Poplawski
2009-02-06 7:52 ` David Miller
2009-02-06 8:09 ` Herbert Xu
2009-02-06 9:10 ` Jarek Poplawski
2009-02-06 9:17 ` David Miller
2009-02-06 9:42 ` Jarek Poplawski
2009-02-06 9:49 ` David Miller
2009-02-06 9:23 ` Herbert Xu
2009-02-06 9:51 ` Jarek Poplawski
2009-02-06 10:28 ` Herbert Xu
2009-02-06 10:58 ` Jarek Poplawski
2009-02-06 11:10 ` Willy Tarreau
2009-02-06 11:47 ` Jarek Poplawski
2009-02-06 18:59 ` Jarek Poplawski
2009-02-03 11:38 ` Nick Piggin
2009-01-27 18:42 ` David Miller
2009-01-15 23:32 ` [PATCH] " Willy Tarreau
2009-01-15 23:35 ` David Miller
2009-01-14 0:51 ` Herbert Xu
2009-01-14 1:24 ` David Miller
2009-01-09 22:45 ` Eric Dumazet
2009-01-09 22:53 ` Willy Tarreau
2009-01-09 23:34 ` Eric Dumazet
2009-01-13 5:45 ` David Miller
2009-01-14 0:05 ` David Miller
2009-01-13 23:31 ` David Miller
2009-01-13 23:26 ` David Miller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090109220946.GA4787@1wt.eu \
--to=w@1wt.eu \
--cc=ben@zeus.com \
--cc=dada1@cosmosbay.com \
--cc=davem@davemloft.net \
--cc=jarkao2@gmail.com \
--cc=jens.axboe@oracle.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).