From: David Miller <davem@davemloft.net>
To: jarkao2@gmail.com
Cc: herbert@gondor.apana.org.au, zbr@ioremap.net,
dada1@cosmosbay.com, w@1wt.eu, ben@zeus.com, mingo@elte.hu,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
jens.axboe@oracle.com
Subject: Re: [PATCH] tcp: splice as many packets as possible at once
Date: Wed, 14 Jan 2009 01:29:19 -0800 (PST) [thread overview]
Message-ID: <20090114.012919.117682429.davem@davemloft.net> (raw)
In-Reply-To: <20090114085308.GB4234@ff.dom.local>
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Wed, 14 Jan 2009 08:53:08 +0000
> Actually, I still think my second approach (the PageSlab) is probably
> (if tested) the easiest for now, because it should fix the reported
> (Willy's) problem, without any change or copy overhead for splice to
> file (which could be still wrong, but not obviously wrong).
It's a simple fix, but as Herbert stated it leaves other ->sendpage()
implementations exposed to data corruption when the from side of the
pipe buffer is a socket.
That, to me, is almost worse than a bad fix.
It's definitely worse than a slower but full fix, which the copy
patch is.
Therefore what I'll likely do is push Jarek's copy based cure,
and meanwhile we can brainstorm some more on how to fix this
properly in the long term.
So, I've put together a full commit message and Jarek's patch
below. One thing I notice is that the silly skb_clone() done
by SKB splicing is no longer necessary.
We could get rid of that to offset (some) of the cost we are
adding with this bug fix.
Comments?
net: Fix data corruption when splicing from sockets.
From: Jarek Poplawski <jarkao2@gmail.com>
The trick in socket splicing where we try to convert the skb->data
into a page based reference using virt_to_page() does not work so
well.
The idea is to pass the virt_to_page() reference via the pipe
buffer, and refcount the buffer using a SKB reference.
But if we are splicing from a socket to a socket (via sendpage)
this doesn't work.
The from side processing will grab the page (and SKB) references.
The sendpage() calls will grab page references only, return, and
then the from side processing completes and drops the SKB ref.
The page based reference to skb->data is not enough to keep the
kmalloc() buffer backing it from being reused. Yet, that is
all that the socket send side has at this point.
This leads to data corruption if the skb->data buffer is reused
by SLAB before the send side socket actually gets the TX packet
out to the device.
The fix employed here is to simply allocate a page and copy the
skb->data bytes into that page.
This will hurt performance, but there is no clear way to fix this
properly without a copy at the present time, and it is important
to get rid of the data corruption.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..6e43d52 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}
static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}
static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
*/
static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
- struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+ put_page(spd->pages[i]);
+}
- kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy(page_address(p) + offset, page_address(page) + offset, len);
+
+ return p;
}
/*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;
+ if (linear) {
+ page = linear_to_page(page, len, offset);
+ if (!page)
+ return 1;
+ }
+
spd->pages[spd->nr_pages] = page;
spd->partial[spd->nr_pages].len = len;
spd->partial[spd->nr_pages].offset = offset;
- spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
spd->nr_pages++;
+ get_page(page);
+
return 0;
}
@@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
static inline int __splice_segment(struct page *page, unsigned int poff,
unsigned int plen, unsigned int *off,
unsigned int *len, struct sk_buff *skb,
- struct splice_pipe_desc *spd)
+ struct splice_pipe_desc *spd, int linear)
{
if (!*len)
return 1;
@@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
/* the linear region may spread across several pages */
flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
- if (spd_fill_page(spd, page, flen, poff, skb))
+ if (spd_fill_page(spd, page, flen, poff, skb, linear))
return 1;
__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
if (__splice_segment(virt_to_page(skb->data),
(unsigned long) skb->data & (PAGE_SIZE - 1),
skb_headlen(skb),
- offset, len, skb, spd))
+ offset, len, skb, spd, 1))
return 1;
/*
@@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
if (__splice_segment(f->page, f->page_offset, f->size,
- offset, len, skb, spd))
+ offset, len, skb, spd, 0))
return 1;
}
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2009-01-14 9:29 UTC|newest]
Thread overview: 191+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-01-08 17:30 [PATCH] tcp: splice as many packets as possible at once Willy Tarreau
2009-01-08 19:44 ` Jens Axboe
2009-01-08 22:03 ` Willy Tarreau
2009-01-08 20:16 ` Willy Tarreau
2009-01-08 21:50 ` Ben Mansell
2009-01-08 21:55 ` David Miller
2009-01-08 22:20 ` Willy Tarreau
2009-01-13 23:08 ` David Miller
2009-01-09 6:47 ` Eric Dumazet
2009-01-09 7:04 ` Willy Tarreau
2009-01-09 7:28 ` Eric Dumazet
2009-01-09 7:42 ` Willy Tarreau
2009-01-13 23:27 ` David Miller
2009-01-13 23:35 ` Eric Dumazet
2009-01-09 15:42 ` Eric Dumazet
2009-01-09 17:57 ` Eric Dumazet
2009-01-09 18:54 ` Willy Tarreau
2009-01-09 20:51 ` Eric Dumazet
2009-01-09 21:24 ` Willy Tarreau
2009-01-09 22:02 ` Eric Dumazet
2009-01-09 22:09 ` Willy Tarreau
2009-01-09 22:07 ` Willy Tarreau
2009-01-09 22:12 ` Eric Dumazet
2009-01-09 22:17 ` Willy Tarreau
2009-01-09 22:42 ` Evgeniy Polyakov
2009-01-09 22:50 ` Willy Tarreau
2009-01-09 23:01 ` Evgeniy Polyakov
2009-01-09 23:06 ` Willy Tarreau
2009-01-10 7:40 ` Eric Dumazet
2009-01-11 12:58 ` Evgeniy Polyakov
2009-01-11 13:14 ` Eric Dumazet
2009-01-11 13:35 ` Evgeniy Polyakov
2009-01-11 16:00 ` Eric Dumazet
2009-01-11 16:05 ` Evgeniy Polyakov
2009-01-14 0:07 ` David Miller
2009-01-14 0:13 ` Evgeniy Polyakov
2009-01-14 0:16 ` David Miller
2009-01-14 0:22 ` Evgeniy Polyakov
2009-01-14 0:37 ` David Miller
2009-01-14 3:51 ` Herbert Xu
2009-01-14 4:25 ` David Miller
2009-01-14 7:27 ` David Miller
2009-01-14 8:26 ` Herbert Xu
2009-01-14 8:53 ` Jarek Poplawski
2009-01-14 9:29 ` David Miller [this message]
2009-01-14 9:42 ` Jarek Poplawski
2009-01-14 10:06 ` David Miller
2009-01-14 10:47 ` Jarek Poplawski
2009-01-14 11:29 ` Herbert Xu
2009-01-14 11:40 ` Jarek Poplawski
2009-01-14 11:45 ` Jarek Poplawski
2009-01-14 9:54 ` Jarek Poplawski
2009-01-14 10:01 ` Willy Tarreau
2009-01-14 12:06 ` Jarek Poplawski
2009-01-14 12:15 ` Jarek Poplawski
2009-01-14 11:28 ` Herbert Xu
2009-01-15 23:03 ` Willy Tarreau
2009-01-15 23:19 ` David Miller
2009-01-15 23:19 ` Herbert Xu
2009-01-15 23:26 ` David Miller
2009-01-15 23:32 ` Herbert Xu
2009-01-15 23:34 ` David Miller
2009-01-15 23:42 ` Willy Tarreau
2009-01-15 23:44 ` Willy Tarreau
2009-01-15 23:54 ` David Miller
2009-01-19 0:42 ` Willy Tarreau
2009-01-19 3:08 ` Herbert Xu
2009-01-19 3:27 ` David Miller
2009-01-19 6:14 ` Willy Tarreau
2009-01-19 6:19 ` David Miller
2009-01-19 6:45 ` Willy Tarreau
2009-01-19 10:19 ` Herbert Xu
2009-01-19 20:59 ` David Miller
2009-01-19 21:24 ` Herbert Xu
2009-01-25 21:03 ` Willy Tarreau
2009-01-26 7:59 ` Jarek Poplawski
2009-01-26 8:12 ` Willy Tarreau
2009-01-19 8:40 ` Jarek Poplawski
2009-01-19 3:28 ` David Miller
2009-01-19 6:11 ` Willy Tarreau
2009-01-24 21:23 ` Willy Tarreau
2009-01-20 12:01 ` Ben Mansell
2009-01-20 12:11 ` Evgeniy Polyakov
2009-01-20 13:43 ` Ben Mansell
2009-01-20 14:06 ` Jarek Poplawski
2009-01-16 6:51 ` Jarek Poplawski
2009-01-19 6:08 ` David Miller
2009-01-19 6:16 ` David Miller
2009-01-19 10:20 ` Herbert Xu
2009-01-20 8:37 ` Jarek Poplawski
2009-01-20 9:33 ` [PATCH v2] " Jarek Poplawski
2009-01-20 10:00 ` Evgeniy Polyakov
2009-01-20 10:20 ` Jarek Poplawski
2009-01-20 10:31 ` Evgeniy Polyakov
2009-01-20 11:01 ` Jarek Poplawski
2009-01-20 17:16 ` David Miller
2009-01-21 9:54 ` Jarek Poplawski
2009-01-22 9:04 ` [PATCH v3] " Jarek Poplawski
2009-01-26 5:22 ` David Miller
2009-01-27 7:11 ` Herbert Xu
2009-01-27 7:54 ` Jarek Poplawski
2009-01-27 10:09 ` Herbert Xu
2009-01-27 10:35 ` Jarek Poplawski
2009-01-27 10:57 ` Jarek Poplawski
2009-01-27 11:48 ` Herbert Xu
2009-01-27 12:16 ` Jarek Poplawski
2009-01-27 12:31 ` Jarek Poplawski
2009-01-27 17:06 ` David Miller
2009-01-28 8:10 ` Jarek Poplawski
2009-02-01 8:41 ` David Miller
2009-01-26 8:20 ` [PATCH v2] " Jarek Poplawski
2009-01-26 21:21 ` Evgeniy Polyakov
2009-01-27 6:10 ` David Miller
2009-01-27 7:40 ` Jarek Poplawski
2009-01-30 21:42 ` David Miller
2009-01-30 21:59 ` Willy Tarreau
2009-01-30 22:03 ` David Miller
2009-01-30 22:13 ` Willy Tarreau
2009-01-30 22:15 ` David Miller
2009-01-30 22:16 ` Herbert Xu
2009-02-02 8:08 ` Jarek Poplawski
2009-02-02 8:18 ` David Miller
2009-02-02 8:43 ` Jarek Poplawski
2009-02-03 7:50 ` David Miller
2009-02-03 9:41 ` Jarek Poplawski
2009-02-03 11:10 ` Evgeniy Polyakov
2009-02-03 11:24 ` Herbert Xu
2009-02-03 11:49 ` Evgeniy Polyakov
2009-02-03 11:53 ` Herbert Xu
2009-02-03 12:07 ` Evgeniy Polyakov
2009-02-03 12:12 ` Herbert Xu
2009-02-03 12:18 ` Evgeniy Polyakov
2009-02-03 12:25 ` Willy Tarreau
2009-02-03 12:28 ` Herbert Xu
2009-02-04 0:47 ` David Miller
2009-02-04 6:19 ` Willy Tarreau
2009-02-04 8:12 ` Evgeniy Polyakov
2009-02-04 8:54 ` Willy Tarreau
2009-02-04 8:59 ` Herbert Xu
2009-02-04 9:01 ` David Miller
2009-02-04 9:12 ` Willy Tarreau
2009-02-04 9:15 ` David Miller
2009-02-04 19:19 ` Roland Dreier
2009-02-04 19:28 ` Willy Tarreau
2009-02-04 19:48 ` Jarek Poplawski
2009-02-05 8:32 ` Bill Fink
2009-02-04 9:12 ` David Miller
2009-02-03 12:27 ` Herbert Xu
2009-02-03 13:05 ` david
2009-02-03 12:12 ` Evgeniy Polyakov
2009-02-03 12:18 ` Herbert Xu
2009-02-03 12:30 ` Evgeniy Polyakov
2009-02-03 12:33 ` Herbert Xu
2009-02-03 12:33 ` Nick Piggin
2009-02-04 0:46 ` David Miller
2009-02-04 9:41 ` Benny Amorsen
2009-02-04 12:01 ` Herbert Xu
2009-02-03 12:36 ` Jarek Poplawski
2009-02-03 13:06 ` Evgeniy Polyakov
2009-02-03 13:25 ` Jarek Poplawski
2009-02-03 14:20 ` Evgeniy Polyakov
2009-02-04 0:46 ` David Miller
2009-02-04 8:08 ` Evgeniy Polyakov
2009-02-04 9:23 ` Nick Piggin
2009-02-04 7:56 ` Jarek Poplawski
2009-02-06 7:52 ` David Miller
2009-02-06 8:09 ` Herbert Xu
2009-02-06 9:10 ` Jarek Poplawski
2009-02-06 9:17 ` David Miller
2009-02-06 9:42 ` Jarek Poplawski
2009-02-06 9:49 ` David Miller
2009-02-06 9:23 ` Herbert Xu
2009-02-06 9:51 ` Jarek Poplawski
2009-02-06 10:28 ` Herbert Xu
2009-02-06 10:58 ` Jarek Poplawski
2009-02-06 11:10 ` Willy Tarreau
2009-02-06 11:47 ` Jarek Poplawski
2009-02-06 18:59 ` Jarek Poplawski
2009-02-03 11:38 ` Nick Piggin
2009-01-27 18:42 ` David Miller
2009-01-15 23:32 ` [PATCH] " Willy Tarreau
2009-01-15 23:35 ` David Miller
2009-01-14 0:51 ` Herbert Xu
2009-01-14 1:24 ` David Miller
2009-01-09 22:45 ` Eric Dumazet
2009-01-09 22:53 ` Willy Tarreau
2009-01-09 23:34 ` Eric Dumazet
2009-01-13 5:45 ` David Miller
2009-01-14 0:05 ` David Miller
2009-01-13 23:31 ` David Miller
2009-01-13 23:26 ` David Miller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090114.012919.117682429.davem@davemloft.net \
--to=davem@davemloft.net \
--cc=ben@zeus.com \
--cc=dada1@cosmosbay.com \
--cc=herbert@gondor.apana.org.au \
--cc=jarkao2@gmail.com \
--cc=jens.axboe@oracle.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=netdev@vger.kernel.org \
--cc=w@1wt.eu \
--cc=zbr@ioremap.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).