From: Jarek Poplawski <jarkao2@gmail.com>
To: David Miller <davem@davemloft.net>
Cc: herbert@gondor.apana.org.au, zbr@ioremap.net,
dada1@cosmosbay.com, w@1wt.eu, ben@zeus.com, mingo@elte.hu,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
jens.axboe@oracle.com
Subject: Re: [PATCH] tcp: splice as many packets as possible at once
Date: Wed, 14 Jan 2009 09:54:54 +0000 [thread overview]
Message-ID: <20090114095454.GD4234@ff.dom.local> (raw)
In-Reply-To: <20090114.012919.117682429.davem@davemloft.net>
On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
...
> Therefore what I'll likely do is push Jarek's copy based cure,
> and meanwhile we can brainstorm some more on how to fix this
> properly in the long term.
>
> So, I've put together a full commit message and Jarek's patch
> below. One thing I notice is that the silly skb_clone() done
> by SKB splicing is no longer necessary.
>
> We could get rid of that to offset (some) of the cost we are
> adding with this bug fix.
>
> Comments?
Yes, this should lessen the overhead a bit.
>
> net: Fix data corruption when splicing from sockets.
>
> From: Jarek Poplawski <jarkao2@gmail.com>
>
> The trick in socket splicing where we try to convert the skb->data
> into a page based reference using virt_to_page() does not work so
> well.
>
> The idea is to pass the virt_to_page() reference via the pipe
> buffer, and refcount the buffer using a SKB reference.
>
> But if we are splicing from a socket to a socket (via sendpage)
> this doesn't work.
>
> The from side processing will grab the page (and SKB) references.
> The sendpage() calls will grab page references only, return, and
> then the from side processing completes and drops the SKB ref.
>
> The page based reference to skb->data is not enough to keep the
> kmalloc() buffer backing it from being reused. Yet, that is
> all that the socket send side has at this point.
>
> This leads to data corruption if the skb->data buffer is reused
> by SLAB before the send side socket actually gets the TX packet
> out to the device.
>
> The fix employed here is to simply allocate a page and copy the
> skb->data bytes into that page.
>
> This will hurt performance, but there is no clear way to fix this
> properly without a copy at the present time, and it is important
> to get rid of the data corruption.
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
You are very brave! I'd prefer to wait for at least minimal testing
by Willy...
Thanks,
Jarek P.
BTW, an skb parameter could be removed from spd_fill_page() to make it
even faster...
...
> static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> unsigned int len, unsigned int offset,
> - struct sk_buff *skb)
> + struct sk_buff *skb, int linear)
> {
> if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> return 1;
>
> + if (linear) {
> + page = linear_to_page(page, len, offset);
> + if (!page)
> + return 1;
> + }
> +
> spd->pages[spd->nr_pages] = page;
> spd->partial[spd->nr_pages].len = len;
> spd->partial[spd->nr_pages].offset = offset;
> - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> spd->nr_pages++;
> + get_page(page);
> +
> return 0;
> }
>
> @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
> static inline int __splice_segment(struct page *page, unsigned int poff,
> unsigned int plen, unsigned int *off,
> unsigned int *len, struct sk_buff *skb,
> - struct splice_pipe_desc *spd)
> + struct splice_pipe_desc *spd, int linear)
> {
> if (!*len)
> return 1;
> @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
> /* the linear region may spread across several pages */
> flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
>
> - if (spd_fill_page(spd, page, flen, poff, skb))
> + if (spd_fill_page(spd, page, flen, poff, skb, linear))
> return 1;
>
> __segment_seek(&page, &poff, &plen, flen);
> @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> if (__splice_segment(virt_to_page(skb->data),
> (unsigned long) skb->data & (PAGE_SIZE - 1),
> skb_headlen(skb),
> - offset, len, skb, spd))
> + offset, len, skb, spd, 1))
> return 1;
>
> /*
> @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
>
> if (__splice_segment(f->page, f->page_offset, f->size,
> - offset, len, skb, spd))
> + offset, len, skb, spd, 0))
> return 1;
> }
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2009-01-14 9:55 UTC|newest]
Thread overview: 191+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-01-08 17:30 [PATCH] tcp: splice as many packets as possible at once Willy Tarreau
2009-01-08 19:44 ` Jens Axboe
2009-01-08 22:03 ` Willy Tarreau
2009-01-08 20:16 ` Willy Tarreau
2009-01-08 21:50 ` Ben Mansell
2009-01-08 21:55 ` David Miller
2009-01-08 22:20 ` Willy Tarreau
2009-01-13 23:08 ` David Miller
2009-01-09 6:47 ` Eric Dumazet
2009-01-09 7:04 ` Willy Tarreau
2009-01-09 7:28 ` Eric Dumazet
2009-01-09 7:42 ` Willy Tarreau
2009-01-13 23:27 ` David Miller
2009-01-13 23:35 ` Eric Dumazet
2009-01-09 15:42 ` Eric Dumazet
2009-01-09 17:57 ` Eric Dumazet
2009-01-09 18:54 ` Willy Tarreau
2009-01-09 20:51 ` Eric Dumazet
2009-01-09 21:24 ` Willy Tarreau
2009-01-09 22:02 ` Eric Dumazet
2009-01-09 22:09 ` Willy Tarreau
2009-01-09 22:07 ` Willy Tarreau
2009-01-09 22:12 ` Eric Dumazet
2009-01-09 22:17 ` Willy Tarreau
2009-01-09 22:42 ` Evgeniy Polyakov
2009-01-09 22:50 ` Willy Tarreau
2009-01-09 23:01 ` Evgeniy Polyakov
2009-01-09 23:06 ` Willy Tarreau
2009-01-10 7:40 ` Eric Dumazet
2009-01-11 12:58 ` Evgeniy Polyakov
2009-01-11 13:14 ` Eric Dumazet
2009-01-11 13:35 ` Evgeniy Polyakov
2009-01-11 16:00 ` Eric Dumazet
2009-01-11 16:05 ` Evgeniy Polyakov
2009-01-14 0:07 ` David Miller
2009-01-14 0:13 ` Evgeniy Polyakov
2009-01-14 0:16 ` David Miller
2009-01-14 0:22 ` Evgeniy Polyakov
2009-01-14 0:37 ` David Miller
2009-01-14 3:51 ` Herbert Xu
2009-01-14 4:25 ` David Miller
2009-01-14 7:27 ` David Miller
2009-01-14 8:26 ` Herbert Xu
2009-01-14 8:53 ` Jarek Poplawski
2009-01-14 9:29 ` David Miller
2009-01-14 9:42 ` Jarek Poplawski
2009-01-14 10:06 ` David Miller
2009-01-14 10:47 ` Jarek Poplawski
2009-01-14 11:29 ` Herbert Xu
2009-01-14 11:40 ` Jarek Poplawski
2009-01-14 11:45 ` Jarek Poplawski
2009-01-14 9:54 ` Jarek Poplawski [this message]
2009-01-14 10:01 ` Willy Tarreau
2009-01-14 12:06 ` Jarek Poplawski
2009-01-14 12:15 ` Jarek Poplawski
2009-01-14 11:28 ` Herbert Xu
2009-01-15 23:03 ` Willy Tarreau
2009-01-15 23:19 ` David Miller
2009-01-15 23:19 ` Herbert Xu
2009-01-15 23:26 ` David Miller
2009-01-15 23:32 ` Herbert Xu
2009-01-15 23:34 ` David Miller
2009-01-15 23:42 ` Willy Tarreau
2009-01-15 23:44 ` Willy Tarreau
2009-01-15 23:54 ` David Miller
2009-01-19 0:42 ` Willy Tarreau
2009-01-19 3:08 ` Herbert Xu
2009-01-19 3:27 ` David Miller
2009-01-19 6:14 ` Willy Tarreau
2009-01-19 6:19 ` David Miller
2009-01-19 6:45 ` Willy Tarreau
2009-01-19 10:19 ` Herbert Xu
2009-01-19 20:59 ` David Miller
2009-01-19 21:24 ` Herbert Xu
2009-01-25 21:03 ` Willy Tarreau
2009-01-26 7:59 ` Jarek Poplawski
2009-01-26 8:12 ` Willy Tarreau
2009-01-19 8:40 ` Jarek Poplawski
2009-01-19 3:28 ` David Miller
2009-01-19 6:11 ` Willy Tarreau
2009-01-24 21:23 ` Willy Tarreau
2009-01-20 12:01 ` Ben Mansell
2009-01-20 12:11 ` Evgeniy Polyakov
2009-01-20 13:43 ` Ben Mansell
2009-01-20 14:06 ` Jarek Poplawski
2009-01-16 6:51 ` Jarek Poplawski
2009-01-19 6:08 ` David Miller
2009-01-19 6:16 ` David Miller
2009-01-19 10:20 ` Herbert Xu
2009-01-20 8:37 ` Jarek Poplawski
2009-01-20 9:33 ` [PATCH v2] " Jarek Poplawski
2009-01-20 10:00 ` Evgeniy Polyakov
2009-01-20 10:20 ` Jarek Poplawski
2009-01-20 10:31 ` Evgeniy Polyakov
2009-01-20 11:01 ` Jarek Poplawski
2009-01-20 17:16 ` David Miller
2009-01-21 9:54 ` Jarek Poplawski
2009-01-22 9:04 ` [PATCH v3] " Jarek Poplawski
2009-01-26 5:22 ` David Miller
2009-01-27 7:11 ` Herbert Xu
2009-01-27 7:54 ` Jarek Poplawski
2009-01-27 10:09 ` Herbert Xu
2009-01-27 10:35 ` Jarek Poplawski
2009-01-27 10:57 ` Jarek Poplawski
2009-01-27 11:48 ` Herbert Xu
2009-01-27 12:16 ` Jarek Poplawski
2009-01-27 12:31 ` Jarek Poplawski
2009-01-27 17:06 ` David Miller
2009-01-28 8:10 ` Jarek Poplawski
2009-02-01 8:41 ` David Miller
2009-01-26 8:20 ` [PATCH v2] " Jarek Poplawski
2009-01-26 21:21 ` Evgeniy Polyakov
2009-01-27 6:10 ` David Miller
2009-01-27 7:40 ` Jarek Poplawski
2009-01-30 21:42 ` David Miller
2009-01-30 21:59 ` Willy Tarreau
2009-01-30 22:03 ` David Miller
2009-01-30 22:13 ` Willy Tarreau
2009-01-30 22:15 ` David Miller
2009-01-30 22:16 ` Herbert Xu
2009-02-02 8:08 ` Jarek Poplawski
2009-02-02 8:18 ` David Miller
2009-02-02 8:43 ` Jarek Poplawski
2009-02-03 7:50 ` David Miller
2009-02-03 9:41 ` Jarek Poplawski
2009-02-03 11:10 ` Evgeniy Polyakov
2009-02-03 11:24 ` Herbert Xu
2009-02-03 11:49 ` Evgeniy Polyakov
2009-02-03 11:53 ` Herbert Xu
2009-02-03 12:07 ` Evgeniy Polyakov
2009-02-03 12:12 ` Herbert Xu
2009-02-03 12:18 ` Evgeniy Polyakov
2009-02-03 12:25 ` Willy Tarreau
2009-02-03 12:28 ` Herbert Xu
2009-02-04 0:47 ` David Miller
2009-02-04 6:19 ` Willy Tarreau
2009-02-04 8:12 ` Evgeniy Polyakov
2009-02-04 8:54 ` Willy Tarreau
2009-02-04 8:59 ` Herbert Xu
2009-02-04 9:01 ` David Miller
2009-02-04 9:12 ` Willy Tarreau
2009-02-04 9:15 ` David Miller
2009-02-04 19:19 ` Roland Dreier
2009-02-04 19:28 ` Willy Tarreau
2009-02-04 19:48 ` Jarek Poplawski
2009-02-05 8:32 ` Bill Fink
2009-02-04 9:12 ` David Miller
2009-02-03 12:27 ` Herbert Xu
2009-02-03 13:05 ` david
2009-02-03 12:12 ` Evgeniy Polyakov
2009-02-03 12:18 ` Herbert Xu
2009-02-03 12:30 ` Evgeniy Polyakov
2009-02-03 12:33 ` Herbert Xu
2009-02-03 12:33 ` Nick Piggin
2009-02-04 0:46 ` David Miller
2009-02-04 9:41 ` Benny Amorsen
2009-02-04 12:01 ` Herbert Xu
2009-02-03 12:36 ` Jarek Poplawski
2009-02-03 13:06 ` Evgeniy Polyakov
2009-02-03 13:25 ` Jarek Poplawski
2009-02-03 14:20 ` Evgeniy Polyakov
2009-02-04 0:46 ` David Miller
2009-02-04 8:08 ` Evgeniy Polyakov
2009-02-04 9:23 ` Nick Piggin
2009-02-04 7:56 ` Jarek Poplawski
2009-02-06 7:52 ` David Miller
2009-02-06 8:09 ` Herbert Xu
2009-02-06 9:10 ` Jarek Poplawski
2009-02-06 9:17 ` David Miller
2009-02-06 9:42 ` Jarek Poplawski
2009-02-06 9:49 ` David Miller
2009-02-06 9:23 ` Herbert Xu
2009-02-06 9:51 ` Jarek Poplawski
2009-02-06 10:28 ` Herbert Xu
2009-02-06 10:58 ` Jarek Poplawski
2009-02-06 11:10 ` Willy Tarreau
2009-02-06 11:47 ` Jarek Poplawski
2009-02-06 18:59 ` Jarek Poplawski
2009-02-03 11:38 ` Nick Piggin
2009-01-27 18:42 ` David Miller
2009-01-15 23:32 ` [PATCH] " Willy Tarreau
2009-01-15 23:35 ` David Miller
2009-01-14 0:51 ` Herbert Xu
2009-01-14 1:24 ` David Miller
2009-01-09 22:45 ` Eric Dumazet
2009-01-09 22:53 ` Willy Tarreau
2009-01-09 23:34 ` Eric Dumazet
2009-01-13 5:45 ` David Miller
2009-01-14 0:05 ` David Miller
2009-01-13 23:31 ` David Miller
2009-01-13 23:26 ` David Miller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090114095454.GD4234@ff.dom.local \
--to=jarkao2@gmail.com \
--cc=ben@zeus.com \
--cc=dada1@cosmosbay.com \
--cc=davem@davemloft.net \
--cc=herbert@gondor.apana.org.au \
--cc=jens.axboe@oracle.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=netdev@vger.kernel.org \
--cc=w@1wt.eu \
--cc=zbr@ioremap.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).