Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next] net: reserve ooo_okay when copying skb header
From: Changli Gao @ 2011-08-19 14:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: Eric Dumazet, Tom Herbert, netdev, Changli Gao

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
 net/core/skbuff.c |    1 +
 1 file changed, 1 insertion(+)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index edb66f3..e27334e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -529,6 +529,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->mac_header		= old->mac_header;
 	skb_dst_copy(new, old);
 	new->rxhash		= old->rxhash;
+	new->ooo_okay		= old->ooo_okay;
 	new->l4_rxhash		= old->l4_rxhash;
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);

^ permalink raw reply related

* Re: [RFC 0/0] Introducing a generic socket offload framework
From: San Mehat @ 2011-08-19 14:44 UTC (permalink / raw)
  To: jhs
  Cc: mst, netdev, miche, linux-kernel, virtualization, digitaleric,
	mikew, maccarro, davem
In-Reply-To: <1313758185.16299.43.camel@mojatatu>


[-- Attachment #1.1: Type: text/plain, Size: 2605 bytes --]

On Fri, Aug 19, 2011 at 5:49 AM, jamal <hadi@cyberus.ca> wrote:

> On Thu, 2011-08-18 at 15:07 -0700, San Mehat wrote:
> > TL;DR
> > -----
> > In this RFC we propose the introduction of the concept of hardware socket
> > offload to the Linux kernel. Patches will accompany this RFC in a few
> days,
> > but we felt we had enough on the design to solicit constructive
> discussion
> > from the community at-large.
> >
>
> [..]
>
> > ALTERNATIVE STRATEGIES
> > ----------------------
> >
> > An alternative strategy for providing similar functionality involves
> either
> > modifying glibc or using LD_PRELOAD tricks to intercept socket calls. We
> were
> > forced to rule this out due to the complexity (and fragility) involved
> with
> > attempting to maintain a general solution compatible across various
> > distributions where platform-libraries differ.
>
> Above should have been in your TL;DR;->
>
> LD_PRELOAD is also horrible because of the granularity of the socket
> calls;
> Having things in the kernel and specifically tagging socket as needing
> this feature is much much more manageable.
>
> Tying things to virtualization may miss the big picture because there
> are many other use cases for intercepting socket calls, example:
> Samir Bellabes <sam@synack.fr> has been trying to get what he refers to
> as "personal firewall" (equivalent to the silly windows firewall) which
> prompts the user "ping from blah, do you want to allow a response?"
> That requires intercepting send/recv calls, prompt the user in possibly
> some pop-up and act based on response. It requires looking at content.
> He is trying to use selinux for that interface,
> but i think this would be the right abstraction.
>

I agree; there's no reason this needs to be tied to virtualization - it was
just the
driving force behind the design. I will generalize the backend interface
types


> I hope you plan to support send/recv.
>

yes


> I also hope you add support for SOCK_RAW (and maybe SOCK_PACKET).
>
>
 Can you explain a good use-case for SOCK_RAW in this type of environment?
We were noodling it around locally and couldn't come up with one that we
needed to support.

Q: If you want this to be transparent to the apps, who/what is doing
> the tagging of SOCK_HWASSIST? clearly not the app if you dont want to
> change it.
>
>
 The decision of whether to tag a socket or not is made by the 'hardware'

I like the uri abstraction if it doesnt come at the expense of the
> app transparency.
>
>
Thank you,

-san


> cheers,
> jamal
>
>


-- 
San Mehat | Staff Software Engineer | san@google.com | 415-366-6172

[-- Attachment #1.2: Type: text/html, Size: 6533 bytes --]

[-- Attachment #2: Type: text/plain, Size: 184 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: Move interface across network namespaces
From: Renato Westphal @ 2011-08-19 14:38 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev, kaber, David Lamparter
In-Reply-To: <m14o1e71qe.fsf@fess.ebiederm.org>

>> Well, this regression was introduced by commit a2835763e130c343ac,
>> which was merged into v2.6.34. Reverting parts of this commit makes
>> the problem go away but breaks the support of "specifying device flags
>> during device creation". I don't know the best way to fix this... any
>> ideas?
>
> Everything going through dev_change_net_namespace already needs to be
> in the initialized state.  So it looks like we just need to do:
>
> Does the patch below work for you?
>
> Eric
>
> ---
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 17d67b5..bfbde69 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6108,6 +6108,8 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
>        call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>        call_netdevice_notifiers(NETDEV_UNREGISTER_BATCH, dev);
>
> +       rtmsg_ifinfo(RTM_DELLINK, dev, ~0U);
> +
>        /*
>         *      Flush the unicast and multicast chains
>         */
>

 This works pretty fine. Thanks! :)

-- 
Renato Westphal

^ permalink raw reply

* Re: [PATCH/RFC v3 0/75] enable SKB paged fragment lifetime visibility
From: Ian Campbell @ 2011-08-19 14:30 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-nfs
In-Reply-To: <20110819.070451.1651299907098952592.davem@davemloft.net>

On Fri, 2011-08-19 at 07:04 -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Fri, 19 Aug 2011 06:34:10 -0700 (PDT)
> 
> > From: David Miller <davem@davemloft.net>
> > Date: Fri, 19 Aug 2011 06:29:58 -0700 (PDT)
> > 
> >> From: Ian Campbell <Ian.Campbell@citrix.com>
> >> Date: Fri, 19 Aug 2011 14:26:33 +0100
> >> 
> >>> This is v3 of my series to enable visibility into SKB paged fragment's
> >> 
> >> Please tone down the patch count :-/  I'm not going to review anything more
> >> than ~20 or so patches at a time from any one person.
> > 
> > Also none of your patches will even apply to net-next GIT since nearly
> > all the ethernet drivers have been moved under drivers/net/ethernet
> 
> Ian, please acknowledge my grievances here.  I see you replying to other
> people, but not to me and I'm the one who has to process all of this
> stuff eventually.

Sorry, I thought I had replied to you first, did it go missing?

> If you want this series to be taken seriously:
> 
> 1) Make your patches against net-next GIT, none of your driver patches will
>    apply because they have all been moved around to different locations
>    under drivers/net

> 2) Submit this in a _SANE_ way.  This means, get the first patch that adds
>    the new interfaces approved and merged.  Then slowly and carefuly submit
>    small, reasonably sized, sets of patches that convert the drivers over.

That all makes sense, I'll do it that way.

> Otherwise there is no way I'm even looking at this stuff, let alone actually
> apply it.

> Realize that every time you patch bomb rediculiously like this I have
> to click and classify every single patch in your bomb at
> http://patchwork.ozlabs.org/project/netdev/list/

Ah, right, I hadn't considered the patchwork thing, I'm very sorry about
that!

Ian.



^ permalink raw reply

* Re: [PATCH] net: add the comment for skb->l4_hash
From: David Miller @ 2011-08-19 14:27 UTC (permalink / raw)
  To: xiaosuo; +Cc: eric.dumazet, therbert, netdev
In-Reply-To: <20110819.072547.515057198170181830.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Fri, 19 Aug 2011 07:25:47 -0700 (PDT)

> From: Changli Gao <xiaosuo@gmail.com>
> Date: Fri, 19 Aug 2011 22:22:24 +0800
> 
>> On Fri, Aug 19, 2011 at 10:19 PM, David Miller <davem@davemloft.net> wrote:
>>> From: Changli Gao <xiaosuo@gmail.com>
>>> Date: Fri, 19 Aug 2011 22:14:12 +0800
>>>
>>>> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>>>
>>> It was obvious to me this time, but please in the future indicate in
>>> "[PATCH]" that a change is targetted for net-next as this one was.
>>>
>>> Applied, thanks.
>>>
>> 
>> Sorry again. Please apply the new one instead, as I made a typo
>> mistake in the previous two patches. Thanks.
> 
> Ok, done.

BTW for next time, "net-next" indication belongs inside of "[]"
brackets, it should look something like:

	Subject: [PATCH net-next] net: blah blah blah

Thanks.

^ permalink raw reply

* Re: [PATCH] net: add the comment for skb->l4_hash
From: David Miller @ 2011-08-19 14:25 UTC (permalink / raw)
  To: xiaosuo; +Cc: eric.dumazet, therbert, netdev
In-Reply-To: <CABa6K_G+cv3z-WoFSu3s017uVa54aHcn_D=R6WX3=-YA2o2j8g@mail.gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Fri, 19 Aug 2011 22:22:24 +0800

> On Fri, Aug 19, 2011 at 10:19 PM, David Miller <davem@davemloft.net> wrote:
>> From: Changli Gao <xiaosuo@gmail.com>
>> Date: Fri, 19 Aug 2011 22:14:12 +0800
>>
>>> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>>
>> It was obvious to me this time, but please in the future indicate in
>> "[PATCH]" that a change is targetted for net-next as this one was.
>>
>> Applied, thanks.
>>
> 
> Sorry again. Please apply the new one instead, as I made a typo
> mistake in the previous two patches. Thanks.

Ok, done.

^ permalink raw reply

* Re: [PATCH] net: add the comment for skb->l4_hash
From: Changli Gao @ 2011-08-19 14:22 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, therbert, netdev
In-Reply-To: <20110819.071918.2148980329707419757.davem@davemloft.net>

On Fri, Aug 19, 2011 at 10:19 PM, David Miller <davem@davemloft.net> wrote:
> From: Changli Gao <xiaosuo@gmail.com>
> Date: Fri, 19 Aug 2011 22:14:12 +0800
>
>> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>
> It was obvious to me this time, but please in the future indicate in
> "[PATCH]" that a change is targetted for net-next as this one was.
>
> Applied, thanks.
>

Sorry again. Please apply the new one instead, as I made a typo
mistake in the previous two patches. Thanks.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH net-next v4 af-packet 2/2] Enhance af-packet to provide (near zero)lossless packet capture functionality.
From: chetan loke @ 2011-08-19 14:22 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110819.033613.1959565749346388734.davem@davemloft.net>

On Fri, Aug 19, 2011 at 6:36 AM, David Miller <davem@davemloft.net> wrote:
> From: Chetan Loke <loke.chetan@gmail.com>
> Date: Thu, 11 Aug 2011 22:31:00 -0400
>
>> +struct kbdq_ft_ops {
>> +     int num_ops;
>> +     void (*ft_ops[2])(void *, void *);
>> +};
> ...
>> +     struct kbdq_ft_ops kfops;
>  ...
>> +static void prb_init_ft_ops(struct kbdq_core *p1,
>> +                     union tpacket_req_u *req_u)
>> +{
>> +     p1->kfops.ft_ops[p1->kfops.num_ops++] = prb_fill_vlan_info;
>> +
>> +     if (req_u->req3.tp_feature_req_word) {
>> +             if (req_u->req3.tp_feature_req_word & TP_FT_REQ_FILL_RXHASH)
>> +                     p1->kfops.ft_ops[p1->kfops.num_ops++] = prb_fill_rxhash;
>> +             else
>> +                     p1->kfops.ft_ops[p1->kfops.num_ops++] =
>> +                     prb_clear_rxhash;
>> +     }
>> +}
>
> It is a lot cheaper to just test the flags in-line than do indirect calls.
> Indirect calls are very expensive on many cpus.
>
> In fact, since the first op (prb_fill_vlan_info) is unconditional, we eat
> the indirect call cost for absolutely no reason at all.

Oki. Will test the flags in-line in prb_run_all_ft_ops().


>
> This kfops stuff was not present in your previous changes.  And I'm
> going to tell you that if you keep adding things each revision instead
> of just fixing the specific items you've received feedback about, a
> set of changes this invasive and of this size will never get merged.
>

First, thanks(to you and others) for reviewing the huge patchset. I
did think about breaking it into multiple patches but since it
involves changes in a single file it was a little difficult. And I
didn't want to create a separate file for this.

> Please resist the urge to further tinker with the code, and just
> address the feedback we give you.
>
Sorry won't happen. These changes should have been present in v3 when
I added the feature-request word because that was the intent of that
word. It was a screw up on my part.


Chetan Loke

^ permalink raw reply

* [PATCH v2] net-next: add the comment for skb->l4_rxhash
From: Changli Gao @ 2011-08-19 14:20 UTC (permalink / raw)
  To: David S. Miller; +Cc: Eric Dumazet, Tom Herbert, netdev, Changli Gao

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
 include/linux/skbuff.h |    2 ++
 1 file changed, 2 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f902c33..10109a0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -322,6 +322,8 @@ typedef unsigned char *sk_buff_data_t;
  *	@queue_mapping: Queue mapping for multiqueue devices
  *	@ndisc_nodetype: router type (from link layer)
  *	@ooo_okay: allow the mapping of a socket to a queue to be changed
+ *	@l4_rxhash: indicate rxhash is a canonical 4-tuple hash over transport
+ *		ports.
  *	@dma_cookie: a cookie to one of several possible DMA operations
  *		done by skb DMA functions
  *	@secmark: security marking

^ permalink raw reply related

* Re: [PATCH] net: add the comment for skb->l4_hash
From: David Miller @ 2011-08-19 14:19 UTC (permalink / raw)
  To: xiaosuo; +Cc: eric.dumazet, therbert, netdev
In-Reply-To: <1313763252-3235-1-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Fri, 19 Aug 2011 22:14:12 +0800

> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

It was obvious to me this time, but please in the future indicate in
"[PATCH]" that a change is targetted for net-next as this one was.

Applied, thanks.

^ permalink raw reply

* [PATCH] net: add the comment for skb->l4_hash
From: Changli Gao @ 2011-08-19 14:14 UTC (permalink / raw)
  To: David S. Miller; +Cc: Eric Dumazet, Tom Herbert, netdev, Changli Gao

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
 include/linux/skbuff.h |    2 ++
 1 file changed, 2 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f902c33..10109a0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -322,6 +322,8 @@ typedef unsigned char *sk_buff_data_t;
  *	@queue_mapping: Queue mapping for multiqueue devices
  *	@ndisc_nodetype: router type (from link layer)
  *	@ooo_okay: allow the mapping of a socket to a queue to be changed
+ *	@l4_hash: indicate rxhash is a canonical 4-tuple hash over transport
+ *		ports.
  *	@dma_cookie: a cookie to one of several possible DMA operations
  *		done by skb DMA functions
  *	@secmark: security marking

^ permalink raw reply related

* Re: [PATCH net-next v4 af-packet 0/2] Enhance af-packet to provide (near zero)lossless packet capture functionality.
From: chetan loke @ 2011-08-19 14:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110819.033311.1204611705865122620.davem@davemloft.net>

On Fri, Aug 19, 2011 at 6:33 AM, David Miller <davem@davemloft.net> wrote:
> From: Chetan Loke <loke.chetan@gmail.com>
> Date: Thu, 11 Aug 2011 22:30:58 -0400
>
>> This patch attempts to:
>> 1)Improve network capture visibility by increasing packet density
>> 2)Assist in analyzing multiple(aggregated) capture ports.
>
> Just some general comments before I say something about the actual
> code.
>
> Claiming loss-less, or almost loss-less, packet capture is something
> which is highly dependent upon the hardware and traffic load.

Totally agree. I have added the text in [Patch 2/2], bullet 1.3 -
capture 99% traffic as seen by the kernel. So lossless is relative to
'the traffic processed by the kernel' and not to the (inline)hardware.
But yes, the patch subject itself could be misleading. Not trying to
argue but just wanted to make sure we are all on the same page.


> So I want you to describe this as what it is, an improvement.  More
> specifically it's just a more flexible mmap ring buffer scheme that
> reduces the wastage incurred by the fixed buffer size scheme we
> have currently.
>
Oki. I will change the cover-subject to 'Enhance af-packet to provide
a flexible mmap ring buffer scheme eliminating fixed frame-size
requirement'.

> Next, you need to use a different subject line for the two patches.
> Right now you say the same thing in both, and someone scanning the
> changesets headers will have no idea what is different about one
> change vs. the other.
>
Done.


Chetan Loke

^ permalink raw reply

* Re: [PATCH/RFC v3 0/75] enable SKB paged fragment lifetime visibility
From: David Miller @ 2011-08-19 14:04 UTC (permalink / raw)
  To: Ian.Campbell; +Cc: netdev, linux-nfs
In-Reply-To: <20110819.063410.475175569086272526.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Fri, 19 Aug 2011 06:34:10 -0700 (PDT)

> From: David Miller <davem@davemloft.net>
> Date: Fri, 19 Aug 2011 06:29:58 -0700 (PDT)
> 
>> From: Ian Campbell <Ian.Campbell@citrix.com>
>> Date: Fri, 19 Aug 2011 14:26:33 +0100
>> 
>>> This is v3 of my series to enable visibility into SKB paged fragment's
>> 
>> Please tone down the patch count :-/  I'm not going to review anything more
>> than ~20 or so patches at a time from any one person.
> 
> Also none of your patches will even apply to net-next GIT since nearly
> all the ethernet drivers have been moved under drivers/net/ethernet

Ian, please acknowledge my grievances here.  I see you replying to other
people, but not to me and I'm the one who has to process all of this
stuff eventually.

If you want this series to be taken seriously:

1) Make your patches against net-next GIT, none of your driver patches will
   apply because they have all been moved around to different locations
   under drivers/net

2) Submit this in a _SANE_ way.  This means, get the first patch that adds
   the new interfaces approved and merged.  Then slowly and carefuly submit
   small, reasonably sized, sets of patches that convert the drivers over.

Otherwise there is no way I'm even looking at this stuff, let alone actually
apply it.

Realize that every time you patch bomb rediculiously like this I have
to click and classify every single patch in your bomb at
http://patchwork.ozlabs.org/project/netdev/list/

^ permalink raw reply

* Re: Traceroute and "ping" sockets: some questions
From: David Miller @ 2011-08-19 14:00 UTC (permalink / raw)
  To: buc; +Cc: netdev
In-Reply-To: <4E4E6B8A.1030707@odu.neva.ru>

From: Dmitry Butskoy <buc@odusz.so-cdu.ru>
Date: Fri, 19 Aug 2011 17:56:26 +0400

> Why then there is "net/ipv4/ping_group_range" restrictions, with
> default values (low=1, high=0) which denies this way even for root?

While we shake out the bugs and better understand the security
implications of this feature.  One this is resolved we can deprecate
this.

^ permalink raw reply

* Re: [PATCH] net: add the comment for skb->l4_hash
From: Changli Gao @ 2011-08-19 14:00 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, therbert, netdev
In-Reply-To: <20110819.065119.890658681263456287.davem@davemloft.net>

On Fri, Aug 19, 2011 at 9:51 PM, David Miller <davem@davemloft.net> wrote:
>
>
> Please fully resubmit with a proper signoff.
>

Oh, sorry. when editing this patch, my poor vim showed an empty line
instead, so I removed it by mistake. I'll resubmit it.


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: Traceroute and "ping" sockets: some questions
From: Dmitry Butskoy @ 2011-08-19 13:56 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20110819.062249.2234872106788628654.davem@davemloft.net>

David Miller wrote:
>>> Why limit?  He can spam with UDP socket just as easily at any rate
>>> he pleases,
>>>        
>> Yes, but most cases such UDP is "one-way" spam (until services like
>> "echo 7/udp" are enabled).
>> For icmp echo, we normally receive icmp replies, hence it is
>> bidirectional crap. Which was not present before.
>>      
> Well, replace UDP with TCP syn flood.
>    

Well.

Why then there is "net/ipv4/ping_group_range" restrictions, with default 
values (low=1, high=0) which denies this way even for root?


Regards,
Dmitry

^ permalink raw reply

* Re: [PATCH] net: add the comment for skb->l4_hash
From: David Miller @ 2011-08-19 13:51 UTC (permalink / raw)
  To: xiaosuo; +Cc: eric.dumazet, therbert, netdev
In-Reply-To: <1313761541-1109-1-git-send-email-xiaosuo@gmail.com>



Please fully resubmit with a proper signoff.

Thank you.


^ permalink raw reply

* Re: [PATCH 68/75] hv: netvsc: convert to SKB paged frag API.
From: Ian Campbell @ 2011-08-19 13:50 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, Hank Janssen, Haiyang Zhang, Greg Kroah-Hartman,
	K. Y. Srinivasan, Abhishek Kane, devel
In-Reply-To: <1313760467-8598-68-git-send-email-ian.campbell@citrix.com>

On Fri, 2011-08-19 at 14:27 +0100, Ian Campbell wrote:
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: Hank Janssen <hjanssen@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Greg Kroah-Hartman <gregkh@suse.de>
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Abhishek Kane <v-abkane@microsoft.com>
> Cc: devel@driverdev.osuosl.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  drivers/net/cxgb4/sge.c         |   14 +++++-----
>  drivers/net/cxgb4vf/sge.c       |   18 ++++++------
>  drivers/net/mlx4/en_rx.c        |    2 +-
>  drivers/scsi/cxgbi/libcxgbi.c   |    2 +-
>  drivers/staging/hv/netvsc_drv.c |    2 +-
>  include/linux/skbuff.h          |   54 ++++++++++++++++++++++++++++++++++++---
>  net/core/skbuff.c               |   26 +++++++++++++++++-
>  7 files changed, 93 insertions(+), 25 deletions(-)

Somewhere along the line I've managed to rebase/merge the "net: add
support for per-paged-fragment destructors" patch into the preceding
patch :-( Everything here apart from the drivers/staging/hv/netvsc_drv.c
change should be a changeset with the following comment. I'll fix for
the next post.

    net: add support for per-paged-fragment destructors
    
    Entities which care about the complete lifecycle of pages which they inject
    into the network stack via an skb paged fragment can choose to set this
    destructor in order to receive a callback when the stack is really finished
    with a page (including all clones, retransmits, pull-ups etc etc).
    
    This destructor will always be propagated alongside the struct page when
    copying skb_frag_t->page. This is the reason I chose to embed the destructor in
    a "struct { } page" within the skb_frag_t, rather than as a separate field,
    since it allows existing code which propagates ->frags[N].page to Just
    Work(tm).
    
    When the destructor is present the page reference counting is done slightly
    differently. No references are held by the network stack on the struct page (it
    is up to the caller to manage this as necessary) instead the network stack will
    track references via the count embedded in the destructor structure. When this
    reference count reaches zero then the destructor will be called and the caller
    can take the necesary steps to release the page (i.e. release the struct page
    reference itself).
    
    The intention is that callers can use this callback to delay completion to
    _their_ callers until the network stack has completely released the page, in
    order to prevent use-after-free or modification of data pages which are still
    in use by the stack.
    
    It is allowable (indeed expected) for a caller to share a single destructor
    instance between multiple pages injected into the stack e.g. a group of pages
    included in a single higher level operation might share a destructor which is
    used to complete that higher level operation.
    
    NB: a small number of drivers use skb_frag_t independently of struct sk_buff so
    this patch is slightly larger than necessary. I did consider leaving skb_frag_t
    alone and defining a new (but similar) structure to be used in the struct
    sk_buff itself. This would also have the advantage of more clearly separating
    the two uses, which is useful since there are now special reference counting
    accessors for skb_frag_t within a struct sk_buff but not (necessarily) for
    those used outside of an skb.
    
    Signed-off-by: Ian Campbell <ian.campbell@citrix.com>


> 
> diff --git a/drivers/net/cxgb4/sge.c b/drivers/net/cxgb4/sge.c
> index f1813b5..3e7c4b3 100644
> --- a/drivers/net/cxgb4/sge.c
> +++ b/drivers/net/cxgb4/sge.c
> @@ -1416,7 +1416,7 @@ static inline void copy_frags(struct sk_buff *skb,
>  	unsigned int n;
>  
>  	/* usually there's just one frag */
> -	skb_frag_set_page(skb, 0, gl->frags[0].page);
> +	skb_frag_set_page(skb, 0, gl->frags[0].page.p);	/* XXX */
>  	ssi->frags[0].page_offset = gl->frags[0].page_offset + offset;
>  	ssi->frags[0].size = gl->frags[0].size - offset;
>  	ssi->nr_frags = gl->nfrags;
> @@ -1425,7 +1425,7 @@ static inline void copy_frags(struct sk_buff *skb,
>  		memcpy(&ssi->frags[1], &gl->frags[1], n * sizeof(skb_frag_t));
>  
>  	/* get a reference to the last page, we don't own it */
> -	get_page(gl->frags[n].page);
> +	get_page(gl->frags[n].page.p);	/* XXX */
>  }
>  
>  /**
> @@ -1482,7 +1482,7 @@ static void t4_pktgl_free(const struct pkt_gl *gl)
>  	const skb_frag_t *p;
>  
>  	for (p = gl->frags, n = gl->nfrags - 1; n--; p++)
> -		put_page(p->page);
> +		put_page(p->page.p); /* XXX */
>  }
>  
>  /*
> @@ -1635,7 +1635,7 @@ static void restore_rx_bufs(const struct pkt_gl *si, struct sge_fl *q,
>  		else
>  			q->cidx--;
>  		d = &q->sdesc[q->cidx];
> -		d->page = si->frags[frags].page;
> +		d->page = si->frags[frags].page.p; /* XXX */
>  		d->dma_addr |= RX_UNMAPPED_BUF;
>  		q->avail++;
>  	}
> @@ -1717,7 +1717,7 @@ static int process_responses(struct sge_rspq *q, int budget)
>  			for (frags = 0, fp = si.frags; ; frags++, fp++) {
>  				rsd = &rxq->fl.sdesc[rxq->fl.cidx];
>  				bufsz = get_buf_size(rsd);
> -				fp->page = rsd->page;
> +				fp->page.p = rsd->page; /* XXX */
>  				fp->page_offset = q->offset;
>  				fp->size = min(bufsz, len);
>  				len -= fp->size;
> @@ -1734,8 +1734,8 @@ static int process_responses(struct sge_rspq *q, int budget)
>  						get_buf_addr(rsd),
>  						fp->size, DMA_FROM_DEVICE);
>  
> -			si.va = page_address(si.frags[0].page) +
> -				si.frags[0].page_offset;
> +			si.va = page_address(si.frags[0].page.p) +
> +				si.frags[0].page_offset; /* XXX */
>  
>  			prefetch(si.va);
>  
> diff --git a/drivers/net/cxgb4vf/sge.c b/drivers/net/cxgb4vf/sge.c
> index 6d6060e..3688423 100644
> --- a/drivers/net/cxgb4vf/sge.c
> +++ b/drivers/net/cxgb4vf/sge.c
> @@ -1397,7 +1397,7 @@ struct sk_buff *t4vf_pktgl_to_skb(const struct pkt_gl *gl,
>  		skb_copy_to_linear_data(skb, gl->va, pull_len);
>  
>  		ssi = skb_shinfo(skb);
> -		skb_frag_set_page(skb, 0, gl->frags[0].page);
> +		skb_frag_set_page(skb, 0, gl->frags[0].page.p); /* XXX */
>  		ssi->frags[0].page_offset = gl->frags[0].page_offset + pull_len;
>  		ssi->frags[0].size = gl->frags[0].size - pull_len;
>  		if (gl->nfrags > 1)
> @@ -1410,7 +1410,7 @@ struct sk_buff *t4vf_pktgl_to_skb(const struct pkt_gl *gl,
>  		skb->truesize += skb->data_len;
>  
>  		/* Get a reference for the last page, we don't own it */
> -		get_page(gl->frags[gl->nfrags - 1].page);
> +		get_page(gl->frags[gl->nfrags - 1].page.p); /* XXX */
>  	}
>  
>  out:
> @@ -1430,7 +1430,7 @@ void t4vf_pktgl_free(const struct pkt_gl *gl)
>  
>  	frag = gl->nfrags - 1;
>  	while (frag--)
> -		put_page(gl->frags[frag].page);
> +		put_page(gl->frags[frag].page.p); /* XXX */
>  }
>  
>  /**
> @@ -1450,7 +1450,7 @@ static inline void copy_frags(struct sk_buff *skb,
>  	unsigned int n;
>  
>  	/* usually there's just one frag */
> -	skb_frag_set_page(skb, 0, gl->frags[0].page);
> +	skb_frag_set_page(skb, 0, gl->frags[0].page.p);	/* XXX */
>  	si->frags[0].page_offset = gl->frags[0].page_offset + offset;
>  	si->frags[0].size = gl->frags[0].size - offset;
>  	si->nr_frags = gl->nfrags;
> @@ -1460,7 +1460,7 @@ static inline void copy_frags(struct sk_buff *skb,
>  		memcpy(&si->frags[1], &gl->frags[1], n * sizeof(skb_frag_t));
>  
>  	/* get a reference to the last page, we don't own it */
> -	get_page(gl->frags[n].page);
> +	get_page(gl->frags[n].page.p); /* XXX */
>  }
>  
>  /**
> @@ -1613,7 +1613,7 @@ static void restore_rx_bufs(const struct pkt_gl *gl, struct sge_fl *fl,
>  		else
>  			fl->cidx--;
>  		sdesc = &fl->sdesc[fl->cidx];
> -		sdesc->page = gl->frags[frags].page;
> +		sdesc->page = gl->frags[frags].page.p; /* XXX */
>  		sdesc->dma_addr |= RX_UNMAPPED_BUF;
>  		fl->avail++;
>  	}
> @@ -1701,7 +1701,7 @@ int process_responses(struct sge_rspq *rspq, int budget)
>  				BUG_ON(rxq->fl.avail == 0);
>  				sdesc = &rxq->fl.sdesc[rxq->fl.cidx];
>  				bufsz = get_buf_size(sdesc);
> -				fp->page = sdesc->page;
> +				fp->page.p = sdesc->page; /* XXX */
>  				fp->page_offset = rspq->offset;
>  				fp->size = min(bufsz, len);
>  				len -= fp->size;
> @@ -1719,8 +1719,8 @@ int process_responses(struct sge_rspq *rspq, int budget)
>  			dma_sync_single_for_cpu(rspq->adapter->pdev_dev,
>  						get_buf_addr(sdesc),
>  						fp->size, DMA_FROM_DEVICE);
> -			gl.va = (page_address(gl.frags[0].page) +
> -				 gl.frags[0].page_offset);
> +			gl.va = (page_address(gl.frags[0].page.p) +
> +				 gl.frags[0].page_offset); /* XXX */
>  			prefetch(gl.va);
>  
>  			/*
> diff --git a/drivers/net/mlx4/en_rx.c b/drivers/net/mlx4/en_rx.c
> index 0c26ee7..bd4e86e 100644
> --- a/drivers/net/mlx4/en_rx.c
> +++ b/drivers/net/mlx4/en_rx.c
> @@ -418,7 +418,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
>  			break;
>  
>  		/* Save page reference in skb */
> -		__skb_frag_set_page(&skb_frags_rx[nr], skb_frags[nr].page);
> +		__skb_frag_set_page(&skb_frags_rx[nr], skb_frags[nr].page.p); /* XXX */
>  		skb_frags_rx[nr].size = skb_frags[nr].size;
>  		skb_frags_rx[nr].page_offset = skb_frags[nr].page_offset;
>  		dma = be64_to_cpu(rx_desc->data[nr].addr);
> diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
> index 0debb06..43cc6c6 100644
> --- a/drivers/scsi/cxgbi/libcxgbi.c
> +++ b/drivers/scsi/cxgbi/libcxgbi.c
> @@ -1823,7 +1823,7 @@ static int sgl_read_to_frags(struct scatterlist *sg, unsigned int sgoffset,
>  				return -EINVAL;
>  			}
>  
> -			frags[i].page = page;
> +			frags[i].page.p = page;
>  			frags[i].page_offset = sg->offset + sgoffset;
>  			frags[i].size = copy;
>  			i++;
> diff --git a/drivers/staging/hv/netvsc_drv.c b/drivers/staging/hv/netvsc_drv.c
> index 61989f0..4fc7b3e 100644
> --- a/drivers/staging/hv/netvsc_drv.c
> +++ b/drivers/staging/hv/netvsc_drv.c
> @@ -171,7 +171,7 @@ static int netvsc_start_xmit(struct sk_buff *skb, struct net_device *net)
>  	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
>  		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
>  
> -		packet->page_buf[i+2].pfn = page_to_pfn(f->page);
> +		packet->page_buf[i+2].pfn = page_to_pfn(skb_frag_page(f));
>  		packet->page_buf[i+2].offset = f->page_offset;
>  		packet->page_buf[i+2].len = f->size;
>  	}
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index c015e61..c6a3d4b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -134,8 +134,17 @@ struct sk_buff;
>  
>  typedef struct skb_frag_struct skb_frag_t;
>  
> +struct skb_frag_destructor {
> +	atomic_t ref;
> +	int (*destroy)(void *data);
> +	void *data;
> +};
> +
>  struct skb_frag_struct {
> -	struct page *page;
> +	struct {
> +		struct page *p;
> +		struct skb_frag_destructor *destructor;
> +	} page;
>  #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
>  	__u32 page_offset;
>  	__u32 size;
> @@ -1128,6 +1137,31 @@ static inline int skb_pagelen(const struct sk_buff *skb)
>  }
>  
>  /**
> + * skb_frag_set_destructor - set destructor for a paged fragment
> + * @skb: buffer containing fragment to be initialised
> + * @i: paged fragment index to initialise
> + * @destroy: the destructor to use for this fragment
> + *
> + * Sets @destroy as the destructor to be called when all references to
> + * the frag @i in @skb (tracked over skb_clone, retransmit, pull-ups,
> + * etc) are released.
> + *
> + * When a destructor is set then reference counting is performed on
> + * @destroy->ref. When the ref reaches zero then @destroy->destroy
> + * will be called. The caller is responsible for holding and managing
> + * any other references (such a the struct page reference count).
> + *
> + * This function must be called before any use of skb_frag_ref() or
> + * skb_frag_unref().
> + */
> +static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
> +					   struct skb_frag_destructor *destroy)
> +{
> +	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> +	frag->page.destructor = destroy;
> +}
> +
> +/**
>   * __skb_fill_page_desc - initialise a paged fragment in an skb
>   * @skb: buffer containing fragment to be initialised
>   * @i: paged fragment index to initialise
> @@ -1145,7 +1179,8 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
>  {
>  	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
>  
> -	frag->page		  = page;
> +	frag->page.p		  = page;
> +	frag->page.destructor     = NULL;
>  	frag->page_offset	  = off;
>  	frag->size		  = size;
>  }
> @@ -1669,9 +1704,12 @@ static inline void netdev_free_page(struct net_device *dev, struct page *page)
>   */
>  static inline struct page *skb_frag_page(const skb_frag_t *frag)
>  {
> -	return frag->page;
> +	return frag->page.p;
>  }
>  
> +extern void skb_frag_destructor_ref(struct skb_frag_destructor *destroy);
> +extern void skb_frag_destructor_unref(struct skb_frag_destructor *destroy);
> +
>  /**
>   * __skb_frag_ref - take an addition reference on a paged fragment.
>   * @frag: the paged fragment
> @@ -1680,6 +1718,10 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
>   */
>  static inline void __skb_frag_ref(skb_frag_t *frag)
>  {
> +	if (unlikely(frag->page.destructor)) {
> +		skb_frag_destructor_ref(frag->page.destructor);
> +		return;
> +	}
>  	get_page(skb_frag_page(frag));
>  }
>  
> @@ -1703,6 +1745,10 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
>   */
>  static inline void __skb_frag_unref(skb_frag_t *frag)
>  {
> +	if (unlikely(frag->page.destructor)) {
> +		skb_frag_destructor_unref(frag->page.destructor);
> +		return;
> +	}
>  	put_page(skb_frag_page(frag));
>  }
>  
> @@ -1755,7 +1801,7 @@ static inline void *skb_frag_address_safe(const skb_frag_t *frag)
>   */
>  static inline void __skb_frag_set_page(skb_frag_t *frag, struct page *page)
>  {
> -	frag->page = page;
> +	frag->page.p = page;
>  	__skb_frag_ref(frag);
>  }
>  
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 418f3435..906a3e7 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -292,6 +292,23 @@ struct sk_buff *dev_alloc_skb(unsigned int length)
>  }
>  EXPORT_SYMBOL(dev_alloc_skb);
>  
> +void skb_frag_destructor_ref(struct skb_frag_destructor *destroy)
> +{
> +	BUG_ON(destroy == NULL);
> +	atomic_inc(&destroy->ref);
> +}
> +EXPORT_SYMBOL(skb_frag_destructor_ref);
> +
> +void skb_frag_destructor_unref(struct skb_frag_destructor *destroy)
> +{
> +	if (destroy == NULL)
> +		return;
> +
> +	if (atomic_dec_and_test(&destroy->ref))
> +		destroy->destroy(destroy->data);
> +}
> +EXPORT_SYMBOL(skb_frag_destructor_unref);
> +
>  static void skb_drop_list(struct sk_buff **listp)
>  {
>  	struct sk_buff *list = *listp;
> @@ -642,14 +659,19 @@ static int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
>  
>  	/* skb frags release userspace buffers */
>  	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
> -		put_page(skb_shinfo(skb)->frags[i].page);
> +		skb_frag_unref(skb, i);
>  
>  	uarg->callback(uarg);
>  
>  	/* skb frags point to kernel buffers */
>  	for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
>  		skb_shinfo(skb)->frags[i - 1].page_offset = 0;
> -		skb_shinfo(skb)->frags[i - 1].page = head;
> +		skb_shinfo(skb)->frags[i - 1].page.p = head;
> +		/*
> +		 * The original destructor is called as necessary when
> +		 * releasing userspace buffers.
> +		 */
> +		skb_shinfo(skb)->frags[i - 1].page.destructor = NULL;
>  		head = (struct page *)head->private;
>  	}
>  	return 0;



^ permalink raw reply

* [PATCH] net: add the comment for skb->l4_hash
From: Changli Gao @ 2011-08-19 13:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: Eric Dumazet, Tom Herbert, netdev

---
 include/linux/skbuff.h |    2 ++
 1 file changed, 2 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f902c33..10109a0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -322,6 +322,8 @@ typedef unsigned char *sk_buff_data_t;
  *	@queue_mapping: Queue mapping for multiqueue devices
  *	@ndisc_nodetype: router type (from link layer)
  *	@ooo_okay: allow the mapping of a socket to a queue to be changed
+ *	@l4_hash: indicate rxhash is a canonical 4-tuple hash over transport
+ *		ports.
  *	@dma_cookie: a cookie to one of several possible DMA operations
  *		done by skb DMA functions
  *	@secmark: security marking

^ permalink raw reply related

* Re: [PATCH 68/75] hv: netvsc: convert to SKB paged frag API.
From: Ian Campbell @ 2011-08-19 13:46 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: devel@driverdev.osuosl.org, Abhishek Kane, netdev@vger.kernel.org,
	Haiyang Zhang, Greg Kroah-Hartman, linux-kernel@vger.kernel.org
In-Reply-To: <20110819133739.GA3715@shale.localdomain>

On Fri, 2011-08-19 at 14:37 +0100, Dan Carpenter wrote:
> On Fri, Aug 19, 2011 at 02:27:40PM +0100, Ian Campbell wrote:
> >  	/* usually there's just one frag */
> > -	skb_frag_set_page(skb, 0, gl->frags[0].page);
> > +	skb_frag_set_page(skb, 0, gl->frags[0].page.p);	/* XXX */
> 
> There are bunch of new /* XXX */ comments.  What are they for?

They are a reminder referring to the dilemma I mentioned in the intro
mail. Basically that ".p" is ugly because gl->frags[0].page isn't
actually a paged fragment, it's just that this driver uses skb_frag_t in
its internal datastructures.

Ian.

^ permalink raw reply

* Re: [PATCH/RFC v3 0/75] enable SKB paged fragment lifetime visibility
From: Ian Campbell @ 2011-08-19 13:45 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org, linux-nfs@vger.kernel.org
In-Reply-To: <20110819.063410.475175569086272526.davem@davemloft.net>

On Fri, 2011-08-19 at 14:34 +0100, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Fri, 19 Aug 2011 06:29:58 -0700 (PDT)
> 
> > From: Ian Campbell <Ian.Campbell@citrix.com>
> > Date: Fri, 19 Aug 2011 14:26:33 +0100
> > 
> >> This is v3 of my series to enable visibility into SKB paged fragment's
> > 
> > Please tone down the patch count :-/  I'm not going to review anything more
> > than ~20 or so patches at a time from any one person.

Unfortunately I didn't see any middle ground between one massive patch
munging drivers with lots of different maintainers and one patch per
maintainer (which is loads).

I'll resend in batches of ~20. After the first patch all the "convert to
SKB paged frag API" patches are independent.

Would you be willing to just review patches 1-20 of this series? In
which case I could avoid spamming the list with those again now.

> Also none of your patches will even apply to net-next GIT since nearly
> all the ethernet drivers have been moved under drivers/net/ethernet

Yes, I noted that in my intro mail and will rebase for v4. Since it's
mostly just been code motion I think review in the old place will still
be largely valid.

Ian.


^ permalink raw reply

* Re: [PATCH 68/75] hv: netvsc: convert to SKB paged frag API.
From: Dan Carpenter @ 2011-08-19 13:37 UTC (permalink / raw)
  To: Ian Campbell
  Cc: devel, Abhishek Kane, netdev, Haiyang Zhang, Greg Kroah-Hartman,
	linux-kernel
In-Reply-To: <1313760467-8598-68-git-send-email-ian.campbell@citrix.com>

On Fri, Aug 19, 2011 at 02:27:40PM +0100, Ian Campbell wrote:
>  	/* usually there's just one frag */
> -	skb_frag_set_page(skb, 0, gl->frags[0].page);
> +	skb_frag_set_page(skb, 0, gl->frags[0].page.p);	/* XXX */

There are bunch of new /* XXX */ comments.  What are they for?

regards,
dan carpenter

^ permalink raw reply

* Re: [PATCH/RFC v3 0/75] enable SKB paged fragment lifetime visibility
From: David Miller @ 2011-08-19 13:34 UTC (permalink / raw)
  To: Ian.Campbell; +Cc: netdev, linux-nfs
In-Reply-To: <20110819.062958.1560313540665582388.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Fri, 19 Aug 2011 06:29:58 -0700 (PDT)

> From: Ian Campbell <Ian.Campbell@citrix.com>
> Date: Fri, 19 Aug 2011 14:26:33 +0100
> 
>> This is v3 of my series to enable visibility into SKB paged fragment's
> 
> Please tone down the patch count :-/  I'm not going to review anything more
> than ~20 or so patches at a time from any one person.

Also none of your patches will even apply to net-next GIT since nearly
all the ethernet drivers have been moved under drivers/net/ethernet

^ permalink raw reply

* [PATCH 74/75] net: add skb_frag_k(un)map convenience functions.
From: Ian Campbell @ 2011-08-19 13:27 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, Ian Campbell, David S. Miller, Eric Dumazet,
	Michał Mirosław, Tom Herbert, Neil Horman, Koki Sanagi
In-Reply-To: <1313760393.5010.356.camel@zakaz.uk.xensource.com>

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Tom Herbert <therbert@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h |   22 ++++++++++++++++++++++
 net/core/datagram.c    |   20 ++++++++------------
 2 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 73d8f7a..0328428 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1839,6 +1839,28 @@ static inline void skb_frag_kunmap_atomic(void *vaddr)
 }
 
 /**
+ * skb_frag_kmap - kmaps a paged fragment
+ * @frag: the paged fragment
+ *
+ * kmap()s the paged fragment @frag and returns the virtual address.
+ */
+static inline void *skb_frag_kmap(skb_frag_t *frag)
+{
+	return kmap(skb_frag_page(frag));
+}
+
+/**
+ * skb_frag_kunmap - kunmaps a paged fragment
+ * @frag: the paged fragment
+ *
+ * kunmap()s the paged fragment @frag.
+ */
+static inline void skb_frag_kunmap(skb_frag_t *frag)
+{
+	kunmap(skb_frag_page(frag));
+}
+
+/**
  * skb_frag_dma_map - maps a paged fragment via the DMA API
  * @device: the device to map the fragment to
  * @frag: the paged fragment to map
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 6449bed..f0dcaa2 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -332,14 +332,13 @@ int skb_copy_datagram_iovec(const struct sk_buff *skb, int offset,
 			int err;
 			u8  *vaddr;
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-			struct page *page = skb_frag_page(frag);
 
 			if (copy > len)
 				copy = len;
-			vaddr = kmap(page);
+			vaddr = skb_frag_kmap(frag);
 			err = memcpy_toiovec(to, vaddr + frag->page_offset +
 					     offset - start, copy);
-			kunmap(page);
+			skb_frag_kunmap(frag);
 			if (err)
 				goto fault;
 			if (!(len -= copy))
@@ -418,14 +417,13 @@ int skb_copy_datagram_const_iovec(const struct sk_buff *skb, int offset,
 			int err;
 			u8  *vaddr;
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-			struct page *page = skb_frag_page(frag);
 
 			if (copy > len)
 				copy = len;
-			vaddr = kmap(page);
+			vaddr = skb_frag_kmap(frag);
 			err = memcpy_toiovecend(to, vaddr + frag->page_offset +
 						offset - start, to_offset, copy);
-			kunmap(page);
+			skb_frag_kunmap(frag);
 			if (err)
 				goto fault;
 			if (!(len -= copy))
@@ -508,15 +506,14 @@ int skb_copy_datagram_from_iovec(struct sk_buff *skb, int offset,
 			int err;
 			u8  *vaddr;
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-			struct page *page = skb_frag_page(frag);
 
 			if (copy > len)
 				copy = len;
-			vaddr = kmap(page);
+			vaddr = skb_frag_kmap(frag);
 			err = memcpy_fromiovecend(vaddr + frag->page_offset +
 						  offset - start,
 						  from, from_offset, copy);
-			kunmap(page);
+			skb_frag_kunmap(frag);
 			if (err)
 				goto fault;
 
@@ -594,16 +591,15 @@ static int skb_copy_and_csum_datagram(const struct sk_buff *skb, int offset,
 			int err = 0;
 			u8  *vaddr;
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-			struct page *page = skb_frag_page(frag);
 
 			if (copy > len)
 				copy = len;
-			vaddr = kmap(page);
+			vaddr = skb_frag_kmap(frag);
 			csum2 = csum_and_copy_to_user(vaddr +
 							frag->page_offset +
 							offset - start,
 						      to, copy, 0, &err);
-			kunmap(page);
+			skb_frag_kunmap(frag);
 			if (err)
 				goto fault;
 			*csump = csum_block_add(*csump, csum2, pos);
-- 
1.7.2.5


^ permalink raw reply related

* [PATCH 71/75] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
From: Ian Campbell @ 2011-08-19 13:27 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, Ian Campbell, David S. Miller, Neil Brown,
	J. Bruce Fields, linux-nfs
In-Reply-To: <1313760393.5010.356.camel@zakaz.uk.xensource.com>

This prevents an issue where an ACK is delayed, a retransmit is queued (either
at the RPC or TCP level) and the ACK arrives before the retransmission hits the
wire. If this happens to an NFS WRITE RPC then the write() system call
completes and the userspace process can continue, potentially modifying data
referenced by the retransmission before the retransmission occurs.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
[since v1:
  Push down from NFS layer into RPM layer
]
---
 include/linux/sunrpc/xdr.h  |    2 ++
 include/linux/sunrpc/xprt.h |    5 ++++-
 net/sunrpc/clnt.c           |   27 ++++++++++++++++++++++-----
 net/sunrpc/svcsock.c        |    3 ++-
 net/sunrpc/xprt.c           |   13 +++++++++++++
 net/sunrpc/xprtsock.c       |    3 ++-
 6 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index a20970e..172f81e 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -16,6 +16,7 @@
 #include <asm/byteorder.h>
 #include <asm/unaligned.h>
 #include <linux/scatterlist.h>
+#include <linux/skbuff.h>
 
 /*
  * Buffer adjustment
@@ -57,6 +58,7 @@ struct xdr_buf {
 			tail[1];	/* Appended after page data */
 
 	struct page **	pages;		/* Array of contiguous pages */
+	struct skb_frag_destructor *destructor;
 	unsigned int	page_base,	/* Start of page data */
 			page_len,	/* Length of page data */
 			flags;		/* Flags for data disposition */
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 15518a1..75131eb 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -92,7 +92,10 @@ struct rpc_rqst {
 						/* A cookie used to track the
 						   state of the transport
 						   connection */
-	
+	struct skb_frag_destructor destructor;	/* SKB paged fragment
+						 * destructor for
+						 * transmitted pages*/
+
 	/*
 	 * Partial send handling
 	 */
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index c5347d2..919538d 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -61,6 +61,7 @@ static void	call_reserve(struct rpc_task *task);
 static void	call_reserveresult(struct rpc_task *task);
 static void	call_allocate(struct rpc_task *task);
 static void	call_decode(struct rpc_task *task);
+static void	call_complete(struct rpc_task *task);
 static void	call_bind(struct rpc_task *task);
 static void	call_bind_status(struct rpc_task *task);
 static void	call_transmit(struct rpc_task *task);
@@ -1113,6 +1114,8 @@ rpc_xdr_encode(struct rpc_task *task)
 			 (char *)req->rq_buffer + req->rq_callsize,
 			 req->rq_rcvsize);
 
+	req->rq_snd_buf.destructor = &req->destructor;
+
 	p = rpc_encode_header(task);
 	if (p == NULL) {
 		printk(KERN_INFO "RPC: couldn't encode RPC header, exit EIO\n");
@@ -1276,6 +1279,7 @@ call_connect_status(struct rpc_task *task)
 static void
 call_transmit(struct rpc_task *task)
 {
+	struct rpc_rqst *req = task->tk_rqstp;
 	dprint_status(task);
 
 	task->tk_action = call_status;
@@ -1309,8 +1313,8 @@ call_transmit(struct rpc_task *task)
 	call_transmit_status(task);
 	if (rpc_reply_expected(task))
 		return;
-	task->tk_action = rpc_exit_task;
-	rpc_wake_up_queued_task(&task->tk_xprt->pending, task);
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 }
 
 /*
@@ -1383,7 +1387,8 @@ call_bc_transmit(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 	if (task->tk_status < 0) {
 		printk(KERN_NOTICE "RPC: Could not send backchannel reply "
 			"error: %d\n", task->tk_status);
@@ -1423,7 +1428,6 @@ call_bc_transmit(struct rpc_task *task)
 			"error: %d\n", task->tk_status);
 		break;
 	}
-	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
 }
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */
 
@@ -1589,12 +1593,14 @@ call_decode(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
 
 	if (decode) {
 		task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
 						      task->tk_msg.rpc_resp);
 	}
+	rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
+	skb_frag_destructor_unref(&req->destructor);
 	dprintk("RPC: %5u call_decode result %d\n", task->tk_pid,
 			task->tk_status);
 	return;
@@ -1609,6 +1615,17 @@ out_retry:
 	}
 }
 
+/*
+ * 8.	Wait for pages to be released by the network stack.
+ */
+static void
+call_complete(struct rpc_task *task)
+{
+	dprintk("RPC: %5u call_complete result %d\n",
+		task->tk_pid, task->tk_status);
+	task->tk_action = rpc_exit_task;
+}
+
 static __be32 *
 rpc_encode_header(struct rpc_task *task)
 {
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 852a258..3685cad 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -196,7 +196,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, xdr->destructor,
+					 base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index f4385e4..925aa0c 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1103,6 +1103,16 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt)
 	xprt->xid = net_random();
 }
 
+static int xprt_complete_skb_pages(void *calldata)
+{
+	struct rpc_task *task = calldata;
+	struct rpc_rqst	*req = task->tk_rqstp;
+
+	dprintk("RPC: %5u completing skb pages\n", task->tk_pid);
+	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
+	return 0;
+}
+
 static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 {
 	struct rpc_rqst	*req = task->tk_rqstp;
@@ -1115,6 +1125,9 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 	req->rq_xid     = xprt_alloc_xid(xprt);
 	req->rq_release_snd_buf = NULL;
 	xprt_reset_majortimeo(req);
+	atomic_set(&req->destructor.ref, 1);
+	req->destructor.destroy = &xprt_complete_skb_pages;
+	req->destructor.data = task;
 	dprintk("RPC: %5u reserved req %p xid %08x\n", task->tk_pid,
 			req, ntohl(req->rq_xid));
 }
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index f79e40e..af3a106 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -408,7 +408,8 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 		remainder -= len;
 		if (remainder != 0 || more)
 			flags |= MSG_MORE;
-		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
+		err = sock->ops->sendpage(sock, *ppage, xdr->destructor,
+					  base, len, flags);
 		if (remainder == 0 || err != len)
 			break;
 		sent += err;
-- 
1.7.2.5


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox