From: Rusty Russell <rusty@rustcorp.com.au>
To: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Evgeniy Polyakov <zbr@ioremap.net>,
Herbert Xu <herbert@gondor.apana.org.au>,
Vladislav Bolkhovitin <vst@vlnb.net>,
linux-scsi@vger.kernel.org,
James Bottomley <James.Bottomley@hansenpartnership.com>,
Andrew Morton <akpm@linux-foundation.org>,
FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>,
Mike Christie <michaelc@cs.wisc.edu>,
Jeff Garzik <jeff@garzik.org>,
Boaz Harrosh <bharrosh@panasas.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
linux-kernel@vger.kernel.org, scst-devel@lists.sourceforge.net,
Bart Van Assche <bart.vanassche@gmail.com>,
"Nicholas A. Bellinger" <nab@linux-iscsi.org>,
netdev@vger.kernel.org, David Miller <davem@davemloft.net>
Subject: Re: [PATCH][RFC 23/23]: Support for zero-copy TCP transmit of user space data
Date: Mon, 22 Dec 2008 11:13:22 +1030 [thread overview]
Message-ID: <200812221113.24044.rusty@rustcorp.com.au> (raw)
In-Reply-To: <494D49E6.9000703@goop.org>
On Sunday 21 December 2008 06:09:18 Jeremy Fitzhardinge wrote:
> Evgeniy Polyakov wrote:
> > Things should work fine, since pskb_expand_head() copies whole shared
> > info structure (and thus will copy destructor), get all pages and then
> > copy all pointers into the new skb, and then release old skb's data.
> >
> > So destructor for the pages should not rely on which skb it is called on
> > and check if pages are about to be really freed (i.e. check theirs
> > reference counter).
> >
>
> OK.
>
> > __pskb_pull_tail() is tricky, it just puts some pages it does not want
> > to be present in the skb, but it could be possible to add there
> > destructor callback from the original skb with partial flag (or just
> > having destructor with two parameters: skb and page, and if page is not
> > NULL, then actually only given page is freed, otherwise the whole skb).
> >
>
> Yes, that doesn't sound too bad.
That would be one approach. Actually, my patch solved this by keeping a
parent ref in various cases if the parent had a destructor: we only destroy
the parent when all the clones are gone.
Here's the patch for reference:
net: add destructor for skb data.
If we want to notify something when an skb is truly finished (such as
for tun vringfd support), we need a destructor on the data.
This turns out to be slightly non-trivial as fragments from one skb
get copied to another skb: if the first skb has a destructor (or its
parent does) we need to keep a reference to it and destroy it only
when (all the) children are destroyed. We add an 'orig' pointer to
the skb_shared_info to do this.
But there's currently no way to get from the shinfo to the head (to
kfree it), so we add a 'len' field. A better alternative to this
might be to move the skb_shared_info to before the head of the skb data.
Note that the destructor is responsible for calling kfree: for the tun
device, this is critical since the destructor can be called from any
context and it has to do a copy_to_user, so it queues the skb.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
include/linux/skbuff.h | 9 ++++++
net/core/skbuff.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 64 insertions(+), 11 deletions(-)
diff -r 3948a0050f81 include/linux/skbuff.h
--- a/include/linux/skbuff.h Tue May 06 21:18:01 2008 +1000
+++ b/include/linux/skbuff.h Thu May 08 13:12:47 2008 +1000
@@ -146,8 +146,12 @@ struct skb_shared_info {
unsigned short gso_segs;
unsigned short gso_type;
__be32 ip6_frag_id;
+ unsigned int len; /* Subtract from this shinfo to find skb->head */
struct sk_buff *frag_list;
skb_frag_t frags[MAX_SKB_FRAGS];
+ struct skb_shared_info *orig;
+ /* This is responsible for kfree() of header. */
+ void (*destructor)(struct skb_shared_info *);
};
/* We divide dataref into two halves. The higher 16 bits hold references
@@ -827,6 +831,11 @@ static inline void skb_fill_page_desc(st
#define SKB_PAGE_ASSERT(skb) BUG_ON(skb_shinfo(skb)->nr_frags)
#define SKB_FRAG_ASSERT(skb) BUG_ON(skb_shinfo(skb)->frag_list)
#define SKB_LINEAR_ASSERT(skb) BUG_ON(skb_is_nonlinear(skb))
+
+static inline unsigned char *skb_shinfo_to_head(struct skb_shared_info *shinfo)
+{
+ return (unsigned char *)shinfo - shinfo->len;
+}
#ifdef NET_SKBUFF_DATA_USES_OFFSET
static inline unsigned char *skb_tail_pointer(const struct sk_buff *skb)
diff -r 3948a0050f81 net/core/skbuff.c
--- a/net/core/skbuff.c Tue May 06 21:18:01 2008 +1000
+++ b/net/core/skbuff.c Thu May 08 13:12:47 2008 +1000
@@ -218,6 +218,9 @@ struct sk_buff *__alloc_skb(unsigned int
shinfo->gso_type = 0;
shinfo->ip6_frag_id = 0;
shinfo->frag_list = NULL;
+ shinfo->destructor = NULL;
+ shinfo->orig = NULL;
+ shinfo->len = skb_end_pointer(skb) - skb->head;
if (fclone) {
struct sk_buff *child = skb + 1;
@@ -311,21 +314,53 @@ static void skb_clone_fraglist(struct sk
skb_get(list);
}
+static void shinfo_put(struct skb_shared_info *shinfo, bool nohdr, bool clone)
+{
+ struct skb_shared_info *orig;
+
+ do {
+ if (clone &&
+ atomic_sub_return(nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
+ &shinfo->dataref)) {
+ return;
+ }
+
+ if (shinfo->nr_frags) {
+ int i;
+ for (i = 0; i < shinfo->nr_frags; i++)
+ put_page(shinfo->frags[i].page);
+ }
+
+ if (shinfo->frag_list)
+ skb_drop_list(&shinfo->frag_list);
+
+ orig = shinfo->orig;
+ if (shinfo->destructor)
+ shinfo->destructor(shinfo);
+ else
+ kfree(skb_shinfo_to_head(shinfo));
+
+ /* We hold a payload reference to our parent. */
+ nohdr = true;
+ clone = true;
+ } while ((shinfo = orig) != NULL);
+}
+
static void skb_release_data(struct sk_buff *skb)
{
- if (!skb->cloned ||
- !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
- &skb_shinfo(skb)->dataref)) {
- if (skb_shinfo(skb)->nr_frags) {
- int i;
- for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
- put_page(skb_shinfo(skb)->frags[i].page);
- }
+ shinfo_put(skb_shinfo(skb), skb->nohdr, skb->cloned);
+}
- if (skb_shinfo(skb)->frag_list)
- skb_drop_fraglist(skb);
+/* Now hold reference to older data, if has a destructor (recursively). */
+static void skb_ref_data_parent(struct sk_buff *parent,
+ struct skb_shared_info *shinfo)
+{
+ struct skb_shared_info *pshinfo = skb_shinfo(parent);
- kfree(skb->head);
+ if (pshinfo->destructor || pshinfo->orig) {
+ shinfo->orig = pshinfo;
+ atomic_add((1 << SKB_DATAREF_SHIFT) + 1, &pshinfo->dataref);
+ parent->cloned = 1;
}
}
@@ -656,6 +691,7 @@ struct sk_buff *pskb_copy(struct sk_buff
get_page(skb_shinfo(n)->frags[i].page);
}
skb_shinfo(n)->nr_frags = i;
+ skb_ref_data_parent(skb, skb_shinfo(n));
}
if (skb_shinfo(skb)->frag_list) {
@@ -721,6 +757,8 @@ int pskb_expand_head(struct sk_buff *skb
if (skb_shinfo(skb)->frag_list)
skb_clone_fraglist(skb);
+ skb_ref_data_parent(skb, (void *)(data + size));
+
skb_release_data(skb);
off = (data + nhead) - skb->head;
@@ -743,6 +781,8 @@ int pskb_expand_head(struct sk_buff *skb
skb->hdr_len = 0;
skb->nohdr = 0;
atomic_set(&skb_shinfo(skb)->dataref, 1);
+ skb_shinfo(skb)->len = skb_end_pointer(skb) - skb->head;
+ skb_shinfo(skb)->destructor = NULL;
return 0;
nodata:
@@ -1962,6 +2002,8 @@ void skb_split(struct sk_buff *skb, stru
skb_split_inside_header(skb, skb1, len, pos);
else /* Second chunk has no header, nothing to copy. */
skb_split_no_header(skb, skb1, len, pos);
+
+ skb_ref_data_parent(skb, skb_shinfo(skb1));
}
/**
@@ -2334,6 +2376,8 @@ struct sk_buff *skb_segment(struct sk_bu
nskb->data_len = len - hsize;
nskb->len += nskb->data_len;
nskb->truesize += nskb->data_len;
+
+ skb_ref_data_parent(skb, skb_shinfo(nskb));
} while ((offset += len) < skb->len);
return segs;
next prev parent reply other threads:[~2008-12-22 0:43 UTC|newest]
Thread overview: 128+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-12-10 18:26 [PATCH][RFC 0/23] New SCSI target framework (SCST) and 4 target drivers Vladislav Bolkhovitin
2008-12-10 18:28 ` [PATCH][RFC 1/23]: SCST public headers Vladislav Bolkhovitin
2008-12-10 18:30 ` [PATCH][RFC 2/23]: SCST core Vladislav Bolkhovitin
2008-12-10 19:12 ` Sam Ravnborg
2008-12-11 17:28 ` Vladislav Bolkhovitin
2008-12-11 21:09 ` Sam Ravnborg
2008-12-12 19:24 ` Vladislav Bolkhovitin
2008-12-12 21:50 ` Steven Rostedt
[not found] ` <20081212230523.GB4775@ghostprotocols.net>
2008-12-13 1:25 ` Frédéric Weisbecker
2008-12-13 1:25 ` Frédéric Weisbecker
2008-12-13 1:27 ` Frédéric Weisbecker
2008-12-13 1:27 ` Frédéric Weisbecker
2008-12-13 14:46 ` Vladislav Bolkhovitin
2008-12-14 0:35 ` Frédéric Weisbecker
2008-12-16 21:49 ` Ingo Molnar
2008-12-16 21:49 ` Ingo Molnar
2008-12-16 22:13 ` Frédéric Weisbecker
2008-12-16 22:13 ` Frédéric Weisbecker
2008-12-16 22:22 ` Ingo Molnar
2008-12-16 22:22 ` Ingo Molnar
2008-12-16 23:46 ` Frédéric Weisbecker
2008-12-16 23:46 ` Frédéric Weisbecker
2008-12-18 11:45 ` Vladislav Bolkhovitin
2008-12-20 13:06 ` Frédéric Weisbecker
2008-12-20 13:06 ` Frédéric Weisbecker
2008-12-23 19:11 ` Vladislav Bolkhovitin
2008-12-23 19:11 ` Vladislav Bolkhovitin
2008-12-27 11:20 ` Ingo Molnar
2008-12-30 17:13 ` Vladislav Bolkhovitin
2008-12-30 21:03 ` Frederic Weisbecker
2008-12-30 21:35 ` Steven Rostedt
2008-12-10 18:34 ` [PATCH][RFC 3/23]: SCST core docs Vladislav Bolkhovitin
2008-12-10 18:34 ` Vladislav Bolkhovitin
2008-12-10 18:36 ` [PATCH][RFC 4/23]: SCST debug support Vladislav Bolkhovitin
2008-12-10 18:37 ` [PATCH][RFC 5/23]: SCST /proc interface Vladislav Bolkhovitin
2008-12-11 20:23 ` Nicholas A. Bellinger
2008-12-12 19:23 ` Vladislav Bolkhovitin
2008-12-10 18:39 ` [PATCH][RFC 6/23]: SCST SGV cache Vladislav Bolkhovitin
2008-12-10 18:40 ` [PATCH][RFC 7/23]: SCST integration into the kernel Vladislav Bolkhovitin
2008-12-10 18:42 ` [PATCH][RFC 8/23]: SCST pass-through backend handlers Vladislav Bolkhovitin
2008-12-10 18:43 ` [PATCH][RFC 9/23]: SCST virtual disk backend handler Vladislav Bolkhovitin
2008-12-10 18:44 ` [PATCH][RFC 10/23]: SCST user space " Vladislav Bolkhovitin
2008-12-10 18:46 ` [PATCH][RFC 11/23]: Makefile for SCST backend handlers Vladislav Bolkhovitin
2008-12-10 18:47 ` [PATCH][RFC 12/23]: Patch to add necessary support for SCST pass-through Vladislav Bolkhovitin
2008-12-10 18:49 ` [PATCH][RFC 13/23]: Export of alloc_io_context() function Vladislav Bolkhovitin
2008-12-11 13:34 ` Jens Axboe
2008-12-11 18:17 ` Vladislav Bolkhovitin
2008-12-11 18:41 ` Jens Axboe
2008-12-11 19:00 ` Vladislav Bolkhovitin
2008-12-11 19:06 ` Jens Axboe
2008-12-12 19:16 ` Vladislav Bolkhovitin
2008-12-10 18:50 ` [PATCH][RFC 14/23]: Necessary functionality in qla2xxx driver to support target mode Vladislav Bolkhovitin
2008-12-10 18:51 ` [PATCH][RFC 15/23]: QLogic target driver Vladislav Bolkhovitin
2008-12-10 18:54 ` [PATCH][RFC 16/23]: Documentation for " Vladislav Bolkhovitin
2008-12-10 18:55 ` [PATCH][RFC 17/23]: InfiniBand SRP " Vladislav Bolkhovitin
2008-12-10 18:57 ` [PATCH][RFC 18/23]: Documentation for " Vladislav Bolkhovitin
2008-12-10 18:58 ` [PATCH][RFC 19/23]: scst_local " Vladislav Bolkhovitin
2008-12-10 19:00 ` [PATCH][RFC 20/23]: Documentation for scst_local driver Vladislav Bolkhovitin
2008-12-10 19:01 ` [PATCH][RFC 21/23]: iSCSI target driver Vladislav Bolkhovitin
2008-12-11 22:55 ` Nicholas A. Bellinger
2008-12-11 22:59 ` Nicholas A. Bellinger
2008-12-12 19:26 ` Vladislav Bolkhovitin
2008-12-13 10:03 ` Nicholas A. Bellinger
2008-12-13 10:11 ` Bart Van Assche
2008-12-13 10:16 ` Nicholas A. Bellinger
2008-12-13 10:27 ` Bart Van Assche
2008-12-13 15:01 ` Vladislav Bolkhovitin
2008-12-13 15:01 ` Vladislav Bolkhovitin
2008-12-13 14:57 ` Vladislav Bolkhovitin
2008-12-10 19:02 ` [PATCH][RFC 22/23]: Documentation for iSCSI-SCST Vladislav Bolkhovitin
2008-12-10 19:04 ` [PATCH][RFC 23/23]: Support for zero-copy TCP transmit of user space data Vladislav Bolkhovitin
2008-12-10 21:45 ` Evgeniy Polyakov
2008-12-11 18:16 ` Vladislav Bolkhovitin
2008-12-11 19:12 ` James Bottomley
2008-12-12 19:25 ` Vladislav Bolkhovitin
2008-12-12 19:37 ` James Bottomley
2008-12-15 17:58 ` Vladislav Bolkhovitin
2008-12-15 23:18 ` Christoph Hellwig
2008-12-16 18:57 ` Vladislav Bolkhovitin
2008-12-18 18:35 ` [RFC]: " Vladislav Bolkhovitin
2008-12-18 18:35 ` Vladislav Bolkhovitin
2008-12-18 18:43 ` David M. Lloyd
2008-12-18 18:43 ` David M. Lloyd
2008-12-19 17:37 ` Vladislav Bolkhovitin
2008-12-19 17:37 ` Vladislav Bolkhovitin
2008-12-19 19:07 ` Jens Axboe
2008-12-19 19:07 ` Jens Axboe
2008-12-19 19:17 ` Vladislav Bolkhovitin
2008-12-19 19:17 ` Vladislav Bolkhovitin
2008-12-19 19:27 ` Jens Axboe
2008-12-19 19:27 ` Jens Axboe
2008-12-19 21:58 ` Evgeniy Polyakov
2008-12-19 21:58 ` Evgeniy Polyakov
2008-12-23 19:11 ` Vladislav Bolkhovitin
2008-12-23 19:11 ` Vladislav Bolkhovitin
2008-12-19 11:27 ` Andi Kleen
2008-12-19 11:27 ` Andi Kleen
2008-12-19 17:38 ` Vladislav Bolkhovitin
2008-12-19 17:38 ` Vladislav Bolkhovitin
2008-12-19 18:00 ` Andi Kleen
2008-12-19 18:00 ` Andi Kleen
2008-12-19 17:57 ` Vladislav Bolkhovitin
2008-12-19 17:57 ` Vladislav Bolkhovitin
2008-12-16 16:00 ` [PATCH][RFC 23/23]: " Bart Van Assche
2008-12-16 17:41 ` Evgeniy Polyakov
2008-12-19 20:21 ` Jeremy Fitzhardinge
2008-12-19 22:04 ` Evgeniy Polyakov
2008-12-19 22:21 ` Jeremy Fitzhardinge
2008-12-19 22:33 ` Evgeniy Polyakov
2008-12-20 1:56 ` Jeremy Fitzhardinge
2008-12-20 2:02 ` Herbert Xu
2008-12-20 6:14 ` Jeremy Fitzhardinge
2008-12-20 6:51 ` Herbert Xu
2008-12-20 7:43 ` Jeremy Fitzhardinge
2008-12-20 8:10 ` Herbert Xu
2008-12-20 10:32 ` Evgeniy Polyakov
2008-12-20 19:39 ` Jeremy Fitzhardinge
2008-12-22 0:43 ` Rusty Russell [this message]
2008-12-23 19:14 ` Vladislav Bolkhovitin
2008-12-23 19:16 ` Vladislav Bolkhovitin
2008-12-23 21:38 ` Evgeniy Polyakov
2008-12-24 14:37 ` Vladislav Bolkhovitin
2008-12-24 14:44 ` Evgeniy Polyakov
2008-12-24 17:46 ` Vladislav Bolkhovitin
2008-12-24 18:08 ` Evgeniy Polyakov
2008-12-30 17:37 ` Vladislav Bolkhovitin
2008-12-30 21:35 ` Evgeniy Polyakov
2008-12-23 19:13 ` Vladislav Bolkhovitin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=200812221113.24044.rusty@rustcorp.com.au \
--to=rusty@rustcorp.com.au \
--cc=James.Bottomley@hansenpartnership.com \
--cc=akpm@linux-foundation.org \
--cc=bart.vanassche@gmail.com \
--cc=bharrosh@panasas.com \
--cc=davem@davemloft.net \
--cc=fujita.tomonori@lab.ntt.co.jp \
--cc=herbert@gondor.apana.org.au \
--cc=jeff@garzik.org \
--cc=jeremy@goop.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-scsi@vger.kernel.org \
--cc=michaelc@cs.wisc.edu \
--cc=nab@linux-iscsi.org \
--cc=netdev@vger.kernel.org \
--cc=scst-devel@lists.sourceforge.net \
--cc=torvalds@linux-foundation.org \
--cc=vst@vlnb.net \
--cc=zbr@ioremap.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.