* [PATCH v5 0/9] skb paged fragment destructors
@ 2012-05-03 14:55 Ian Campbell
  2012-05-03 14:56 ` [PATCH 1/9] net: add and use SKB_ALLOCSIZE Ian Campbell
                   ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: Ian Campbell @ 2012-05-03 14:55 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, David VomLehn,
      Bart Van Assche, xen-devel, Ian Campbell, Alexander Duyck

The following series makes use of the skb fragment API (which is in
3.2+) to add a per-paged-fragment destructor callback. This can be used
by creators of skbs who are interested in the lifecycle of the pages
included in that skb after they have handed it off to the network
stack.

The mail at [0] contains some more background and rationale. Basically,
the completed series will allow entities which inject pages into the
networking stack to receive a notification when the stack has really
finished with those pages (i.e. including retransmissions, clones,
pull-ups etc) and not just when the original skb is finished with. This
is beneficial to many subsystems which wish to inject pages into the
network stack without giving up full ownership of those pages'
lifecycle. It implements something broadly along the lines of what was
described in [1].

I have also included a patch to the RPC subsystem which uses this API
to fix the bug which I describe at [2].

I've also had some interest from David VomLehn and Bart Van Assche
regarding using this functionality in the context of vmsplice and iSCSI
targets respectively (I think).

Changes since last time:

* The big change is that the patches now explicitly align the
  "nr_frags" member of the shinfo, as suggested by Alexander Duyck.
  This ensures that the placement is optimal irrespective of page size
  (in particular the variation of MAX_SKB_FRAGS). It is still the case
  that, for 4k pages, a maximum MTU frame + SKB_PAD + shinfo fit within
  2048 bytes.
* As part of the preceding I squashed the patches manipulating the
  shinfo layout and alignment into a single patch (which is far more
  coherent than the piecemeal approach used previously).
* I crushed "net: only allow paged fragments with the same destructor
  to be coalesced." into the baseline patch (Ben Hutchings).
* Added and used skb_shinfo_init to centralise several copies of that
  code.
* Reduced CC list on "net: add paged frag destructor support to
  kernel_sendpage"; it was rather long and seemed a bit overly spammy
  on the non-netdev recipients.

Changes since the time before:

* Added skb_orphan_frags API for the use of recipients of SKBs who may
  hold onto the SKB for a long time (this is analogous to skb_orphan).
  This was pointed out by Michael. The TUN driver is currently the only
  user.
* I can't for the life of me get anything to actually hit this code
  path. I've been trying with an NFS server running in a Xen HVM domain
  with emulated (e.g. tap) networking and a client in domain 0, using
  the NFS fix in this series which generates SKBs with destructors set;
  so far -- nothing. I suspect that lack of TSO/GSO etc on the TAP
  interface is causing the frags to be copied to normal pages during
  skb_segment().
* Various fixups related to the change of alignment/padding in shinfo,
  in particular to build_skb as pointed out by Eric.
* Tweaked ordering of shinfo members to ensure that all hotpath
  variables up to and including the first frag fit within (and are
  aligned to) a single 64 byte cache line. (Eric again)

I ran a monothread UDP benchmark (similar to that described by Eric in
e52fcb2462ac) and don't see any difference in pps throughput; it was
~810,000 pps both before and after.

Cheers,
Ian.

[0] http://marc.info/?l=linux-netdev&m=131072801125521&w=2
[1] http://marc.info/?l=linux-netdev&m=130925719513084&w=2
[2] http://marc.info/?l=linux-nfs&m=122424132729720&w=2

^ permalink raw reply	[flat|nested] 23+ messages in thread
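As an illustration of the problem being solved (a sketch only, not part of the series): a clone shares the original skb's shinfo and hence its paged fragments, so an skb-level destructor alone cannot tell the page's creator when the stack is truly finished with it.

#include <linux/skbuff.h>

/* Sketch: fragment pages outlive the original skb whenever it has
 * been cloned, e.g. by TCP for a possible retransmission. */
static void example_pages_outlive_skb(struct sk_buff *skb)
{
	struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);

	kfree_skb(skb);		/* skb->destructor (if any) fires here... */

	/* ...but the pages in skb_shinfo(clone)->frags are still in
	 * use; the per-fragment destructor added by this series fires
	 * only once the last such reference is dropped. */
	kfree_skb(clone);
}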
* [PATCH 1/9] net: add and use SKB_ALLOCSIZE 2012-05-03 14:55 [PATCH v5 0/9] skb paged fragment destructors Ian Campbell @ 2012-05-03 14:56 ` Ian Campbell 2012-05-03 14:56 ` [PATCH 2/9] net: Use SKB_WITH_OVERHEAD in build_skb Ian Campbell ` (7 subsequent siblings) 8 siblings, 0 replies; 23+ messages in thread From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw) To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell This gives the allocation size required for an skb containing X bytes of data Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> --- drivers/net/ethernet/broadcom/bnx2.c | 7 +++---- drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 3 +-- drivers/net/ethernet/broadcom/tg3.c | 3 +-- include/linux/skbuff.h | 12 ++++++++++++ net/core/skbuff.c | 8 +------- 5 files changed, 18 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c index ac7b744..62eb000 100644 --- a/drivers/net/ethernet/broadcom/bnx2.c +++ b/drivers/net/ethernet/broadcom/bnx2.c @@ -5321,8 +5321,7 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size) /* 8 for CRC and VLAN */ rx_size = bp->dev->mtu + ETH_HLEN + BNX2_RX_OFFSET + 8; - rx_space = SKB_DATA_ALIGN(rx_size + BNX2_RX_ALIGN) + NET_SKB_PAD + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + rx_space = SKB_ALLOCSIZE(rx_size + BNX2_RX_ALIGN) + NET_SKB_PAD; bp->rx_copy_thresh = BNX2_RX_COPY_THRESH; bp->rx_pg_ring_size = 0; @@ -5345,8 +5344,8 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size) bp->rx_buf_use_size = rx_size; /* hw alignment + build_skb() overhead*/ - bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) + - NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + bp->rx_buf_size = SKB_ALLOCSIZE(bp->rx_buf_use_size + BNX2_RX_ALIGN) + + NET_SKB_PAD; bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET; bp->rx_ring_size = size; bp->rx_max_ring = bnx2_find_max_ring(size, MAX_RX_RINGS); diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h index e30e2a2..3586879 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h @@ -1252,8 +1252,7 @@ struct bnx2x { #define BNX2X_FW_RX_ALIGN_START (1UL << BNX2X_RX_ALIGN_SHIFT) #define BNX2X_FW_RX_ALIGN_END \ - max(1UL << BNX2X_RX_ALIGN_SHIFT, \ - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))) + max(1UL << BNX2X_RX_ALIGN_SHIFT, SKB_ALLOCSIZE(0)) #define BNX2X_PXP_DRAM_ALIGN (BNX2X_RX_ALIGN_SHIFT - 5) diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c index 482138e..6869f17 100644 --- a/drivers/net/ethernet/broadcom/tg3.c +++ b/drivers/net/ethernet/broadcom/tg3.c @@ -5714,8 +5714,7 @@ static int tg3_alloc_rx_data(struct tg3 *tp, struct tg3_rx_prodring_set *tpr, * Callers depend upon this behavior and assume that * we leave everything unchanged if we fail. 
*/ - skb_size = SKB_DATA_ALIGN(data_size + TG3_RX_OFFSET(tp)) + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + skb_size = SKB_ALLOCSIZE(data_size + TG3_RX_OFFSET(tp)); if (skb_size <= TG3_FRAGSIZE) { data = tg3_frag_alloc(tpr); *frag_size = TG3_FRAGSIZE; diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 988fc49..19e348f 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -41,8 +41,20 @@ #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \ ~(SMP_CACHE_BYTES - 1)) +/* maximum data size which can fit into an allocation of X bytes */ #define SKB_WITH_OVERHEAD(X) \ ((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))) +/* + * minimum allocation size required for an skb containing X bytes of data + * + * We do our best to align skb_shared_info on a separate cache + * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives + * aligned memory blocks, unless SLUB/SLAB debug is enabled. Both + * skb->head and skb_shared_info are cache line aligned. + */ +#define SKB_ALLOCSIZE(X) \ + (SKB_DATA_ALIGN((X)) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info))) + #define SKB_MAX_ORDER(X, ORDER) \ SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X)) #define SKB_MAX_HEAD(X) (SKB_MAX_ORDER((X), 0)) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 52ba2b5..a056d7c 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -182,13 +182,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, goto out; prefetchw(skb); - /* We do our best to align skb_shared_info on a separate cache - * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives - * aligned memory blocks, unless SLUB/SLAB debug is enabled. - * Both skb->head and skb_shared_info are cache line aligned. - */ - size = SKB_DATA_ALIGN(size); - size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + size = SKB_ALLOCSIZE(size); data = kmalloc_node_track_caller(size, gfp_mask, node); if (!data) goto nodata; -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 23+ messages in thread
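For reference, the two macros are designed to be (approximate) inverses of each other. A hypothetical driver-style sizing sketch, along the lines of the bnx2 hunk above (example_rx_space is an invented name):

#include <linux/skbuff.h>
#include <linux/if_ether.h>

/* Hypothetical sketch: size an rx buffer so that both the frame data
 * and the shared info fit. SKB_WITH_OVERHEAD(SKB_ALLOCSIZE(X)) >= X
 * always holds, since SKB_DATA_ALIGN() only ever rounds sizes up. */
static unsigned int example_rx_space(unsigned int mtu)
{
	unsigned int rx_size = mtu + ETH_HLEN + 8;	/* 8 for CRC and VLAN */

	return SKB_ALLOCSIZE(rx_size) + NET_SKB_PAD;
}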
* [PATCH 2/9] net: Use SKB_WITH_OVERHEAD in build_skb 2012-05-03 14:55 [PATCH v5 0/9] skb paged fragment destructors Ian Campbell 2012-05-03 14:56 ` [PATCH 1/9] net: add and use SKB_ALLOCSIZE Ian Campbell @ 2012-05-03 14:56 ` Ian Campbell 2012-05-03 14:56 ` [PATCH 3/9] chelsio: use SKB_WITH_OVERHEAD Ian Campbell ` (6 subsequent siblings) 8 siblings, 0 replies; 23+ messages in thread From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw) To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> --- net/core/skbuff.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a056d7c..c60b603 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -263,7 +263,7 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size) if (!skb) return NULL; - size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + size = SKB_WITH_OVERHEAD(size); memset(skb, 0, offsetof(struct sk_buff, tail)); skb->truesize = SKB_TRUESIZE(size); -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 3/9] chelsio: use SKB_WITH_OVERHEAD 2012-05-03 14:55 [PATCH v5 0/9] skb paged fragment destructors Ian Campbell 2012-05-03 14:56 ` [PATCH 1/9] net: add and use SKB_ALLOCSIZE Ian Campbell 2012-05-03 14:56 ` [PATCH 2/9] net: Use SKB_WITH_OVERHEAD in build_skb Ian Campbell @ 2012-05-03 14:56 ` Ian Campbell 2012-05-03 14:56 ` [PATCH 4/9] skb: add skb_shinfo_init and use for both alloc_skb, build_skb and skb_recycle Ian Campbell ` (5 subsequent siblings) 8 siblings, 0 replies; 23+ messages in thread From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw) To: netdev Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell, Divy Le Ray Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Divy Le Ray <divy@chelsio.com> --- drivers/net/ethernet/chelsio/cxgb/sge.c | 3 +-- drivers/net/ethernet/chelsio/cxgb3/sge.c | 6 +++--- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/chelsio/cxgb/sge.c b/drivers/net/ethernet/chelsio/cxgb/sge.c index 47a8435..52373db 100644 --- a/drivers/net/ethernet/chelsio/cxgb/sge.c +++ b/drivers/net/ethernet/chelsio/cxgb/sge.c @@ -599,8 +599,7 @@ static int alloc_rx_resources(struct sge *sge, struct sge_params *p) sizeof(struct cpl_rx_data) + sge->freelQ[!sge->jumbo_fl].dma_offset; - size = (16 * 1024) - - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + size = SKB_WITH_OVERHEAD(16 * 1024); sge->freelQ[sge->jumbo_fl].rx_buffer_size = size; diff --git a/drivers/net/ethernet/chelsio/cxgb3/sge.c b/drivers/net/ethernet/chelsio/cxgb3/sge.c index cfb60e1..b804470 100644 --- a/drivers/net/ethernet/chelsio/cxgb3/sge.c +++ b/drivers/net/ethernet/chelsio/cxgb3/sge.c @@ -3043,7 +3043,7 @@ int t3_sge_alloc_qset(struct adapter *adapter, unsigned int id, int nports, q->fl[1].buf_size = FL1_PG_CHUNK_SIZE; #else q->fl[1].buf_size = is_offload(adapter) ? - (16 * 1024) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) : + SKB_WITH_OVERHEAD(16 * 1024) : MAX_FRAME_SIZE + 2 + sizeof(struct cpl_rx_pkt); #endif @@ -3282,8 +3282,8 @@ void t3_sge_prep(struct adapter *adap, struct sge_params *p) { int i; - p->max_pkt_size = (16 * 1024) - sizeof(struct cpl_rx_data) - - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + p->max_pkt_size = + SKB_WITH_OVERHEAD((16*1024) - sizeof(struct cpl_rx_data)); for (i = 0; i < SGE_QSETS; ++i) { struct qset_params *q = p->qset + i; -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 4/9] skb: add skb_shinfo_init and use for both alloc_skb, build_skb and skb_recycle
  2012-05-03 14:55 [PATCH v5 0/9] skb paged fragment destructors Ian Campbell
                   ` (2 preceding siblings ...)
  2012-05-03 14:56 ` [PATCH 3/9] chelsio: use SKB_WITH_OVERHEAD Ian Campbell
@ 2012-05-03 14:56 ` Ian Campbell
  2012-05-03 14:56 ` [PATCH 5/9] net: pad skb data and shinfo as a whole rather than individually Ian Campbell
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell

There is only one semantic change here, which is that skb_recycle now
does:

	kmemcheck_annotate_variable(shinfo->destructor_arg)

I don't think it was erroneously missing before (since in the
skb_recycle case it will have happened previously) but I believe it is
harmless to do it again, and this saves having a different copy of the
same code for the recycle case.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 net/core/skbuff.c |   30 +++++++++++++-----------------
 1 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c60b603..e96f68b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -145,6 +145,16 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
 	BUG();
 }
 
+static void skb_shinfo_init(struct sk_buff *skb)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+	/* make sure we initialize shinfo sequentially */
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	atomic_set(&shinfo->dataref, 1);
+	kmemcheck_annotate_variable(shinfo->destructor_arg);
+}
+
 /* Allocate a new skbuff. We do this ourselves so we can fill in a few
  * 'private' fields and also do memory statistics to find all the
  * [BEEP] leaks.
@@ -170,7 +180,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 			    int fclone, int node)
 {
 	struct kmem_cache *cache;
-	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
 
@@ -210,11 +219,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	skb->mac_header = ~0U;
 #endif
 
-	/* make sure we initialize shinfo sequentially */
-	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
-	atomic_set(&shinfo->dataref, 1);
-	kmemcheck_annotate_variable(shinfo->destructor_arg);
+	skb_shinfo_init(skb);
 
 	if (fclone) {
 		struct sk_buff *child = skb + 1;
@@ -255,7 +260,6 @@ EXPORT_SYMBOL(__alloc_skb);
  */
 struct sk_buff *build_skb(void *data, unsigned int frag_size)
 {
-	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	unsigned int size = frag_size ? : ksize(data);
 
@@ -277,11 +281,7 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 	skb->mac_header = ~0U;
 #endif
 
-	/* make sure we initialize shinfo sequentially */
-	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
-	atomic_set(&shinfo->dataref, 1);
-	kmemcheck_annotate_variable(shinfo->destructor_arg);
+	skb_shinfo_init(skb);
 
 	return skb;
 }
@@ -546,13 +546,9 @@ EXPORT_SYMBOL(consume_skb);
  */
 void skb_recycle(struct sk_buff *skb)
 {
-	struct skb_shared_info *shinfo;
-
 	skb_release_head_state(skb);
 
-	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
-	atomic_set(&shinfo->dataref, 1);
+	skb_shinfo_init(skb);
 
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->data = skb->head + NET_SKB_PAD;
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread
* [PATCH 5/9] net: pad skb data and shinfo as a whole rather than individually
  2012-05-03 14:55 [PATCH v5 0/9] skb paged fragment destructors Ian Campbell
                   ` (3 preceding siblings ...)
  2012-05-03 14:56 ` [PATCH 4/9] skb: add skb_shinfo_init and use for both alloc_skb, build_skb and skb_recycle Ian Campbell
@ 2012-05-03 14:56 ` Ian Campbell
  2012-05-03 14:56 ` [PATCH 6/9] net: add support for per-paged-fragment destructors Ian Campbell
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell

This reduces the minimum overhead required for this allocation such
that the shinfo can be grown in the following patch without overflowing
2048 bytes for a 1500 byte frame.

Reducing this overhead while also growing the shinfo means that
sometimes the tail end of the data can end up in the same cache line as
the beginning of the shinfo. Specifically, in the case of 64 byte cache
lines on a 64 bit system, the first 8 bytes of the shinfo can overlap
the tail cache line of the data. In many cases the allocation slop
means that there is no overlap.

In order to ensure that the hot struct members remain on the same 64
byte cache line, move the "destructor_arg" member to the front; this
member is not used on any hot path so it is a good choice to
potentially be on a separate cache line (one which additionally may be
shared with skb->data).

Also, rather than relying on knowledge about the size and layout of the
rest of the shinfo to ensure that the right parts of the shinfo are
aligned, decree that nr_frags will be cache aligned and therefore that
the 64 bytes starting at nr_frags should contain the hot struct
members.

All this avoids hitting an extra cache line on hot operations such as
kfree_skb.

On 4k pages this motion and alignment strategy (along with the
following frag size increase) results in the shinfo abutting the very
end of the allocation. On larger pages (where MAX_SKB_FRAGS can be
smaller) it means that we still correctly align the hot data without
needing to make assumptions about the data layout outside of the hot
64-bytes of the shinfo.

Explicitly aligning nr_frags, rather than relying on analysis of the
shinfo layout, was suggested by Alexander Duyck <alexander.h.duyck@intel.com>

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h |   50 +++++++++++++++++++++++++++++------------------
 net/core/skbuff.c      |    9 +++++++-
 2 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 19e348f..3698625 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -41,19 +41,24 @@
 #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
 				 ~(SMP_CACHE_BYTES - 1))
-/* maximum data size which can fit into an allocation of X bytes */
-#define SKB_WITH_OVERHEAD(X)	\
-	((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+
 /*
- * minimum allocation size required for an skb containing X bytes of data
- *
- * We do our best to align skb_shared_info on a separate cache
- * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
- * aligned memory blocks, unless SLUB/SLAB debug is enabled. Both
- * skb->head and skb_shared_info are cache line aligned.
+ * We do our best to align the hot members of skb_shared_info on a
+ * separate cache line. We explicitly align the nr_frags field and
+ * arrange that the order of the fields in skb_shared_info is such
+ * that the interesting fields are nr_frags onwards and are therefore
+ * cache line aligned.
  */
-#define SKB_ALLOCSIZE(X) \
-	(SKB_DATA_ALIGN((X)) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+#define SKB_SHINFO_SIZE						\
+	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info)		\
+			- offsetof(struct skb_shared_info, nr_frags)) \
+	 + offsetof(struct skb_shared_info, nr_frags))
+
+/* maximum data size which can fit into an allocation of X bytes */
+#define SKB_WITH_OVERHEAD(X)	((X) - SKB_SHINFO_SIZE)
+
+/* minimum allocation size required for an skb containing X bytes of data */
+#define SKB_ALLOCSIZE(X)	(SKB_DATA_ALIGN((X) + SKB_SHINFO_SIZE))
 
 #define SKB_MAX_ORDER(X, ORDER) \
 	SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
@@ -63,7 +68,7 @@
 /* return minimum truesize of one skb containing X bytes of data */
 #define SKB_TRUESIZE(X) ((X) +						\
 			 SKB_DATA_ALIGN(sizeof(struct sk_buff)) +	\
-			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+			 SKB_SHINFO_SIZE)
 
 /* A. Checksumming of received packets by device.
  *
@@ -263,6 +268,19 @@ struct ubuf_info {
  * the end of the header data, ie. at skb->end.
  */
 struct skb_shared_info {
+	/* Intermediate layers must ensure that destructor_arg
+	 * remains valid until skb destructor */
+	void *destructor_arg;
+
+	/* Warning: all fields from here until dataref are cleared in
+	 * skb_shinfo_init() (called from __alloc_skb, build_skb,
+	 * skb_recycle, etc).
+	 *
+	 * nr_frags will always be aligned to the start of a cache
+	 * line. It is intended that everything from nr_frags until at
+	 * least frags[0] (inclusive) should fit into the same 64-byte
+	 * cache line.
+	 */
 	unsigned char	nr_frags;
 	__u8		tx_flags;
 	unsigned short	gso_size;
@@ -273,15 +291,9 @@ struct skb_shared_info {
 	struct skb_shared_hwtstamps hwtstamps;
 	__be32          ip6_frag_id;
 
-	/*
-	 * Warning : all fields before dataref are cleared in __alloc_skb()
-	 */
+	/* fields from nr_frags until dataref are cleared in skb_shinfo_init */
 	atomic_t	dataref;
 
-	/* Intermediate layers must ensure that destructor_arg
-	 * remains valid until skb destructor */
-	void *		destructor_arg;
-
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e96f68b..fab6de0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -149,8 +149,15 @@ static void skb_shinfo_init(struct sk_buff *skb)
 {
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 
+	/* Ensure that nr_frags->frags[0] (at least) fits into a
+	 * single cache line. */
+	BUILD_BUG_ON((offsetof(struct skb_shared_info, frags[1])
+		      - offsetof(struct skb_shared_info, nr_frags)) > 64);
+
 	/* make sure we initialize shinfo sequentially */
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	memset(&shinfo->nr_frags, 0,
+	       offsetof(struct skb_shared_info, dataref)
+	       - offsetof(struct skb_shared_info, nr_frags));
 	atomic_set(&shinfo->dataref, 1);
 	kmemcheck_annotate_variable(shinfo->destructor_arg);
 }
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread
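A sketch of why this arithmetic gives the desired alignment (illustration only, assuming kmalloc returns SMP_CACHE_BYTES aligned blocks whose size is a multiple of SMP_CACHE_BYTES, the usual case per the old comment removed above):

/*
 * Illustration only (not part of the patch):
 *
 *   SKB_SHINFO_SIZE = offsetof(shinfo, nr_frags)
 *                     + SKB_DATA_ALIGN(sizeof(shinfo)
 *                                      - offsetof(shinfo, nr_frags))
 *
 * The shinfo is placed SKB_SHINFO_SIZE bytes before the end of the
 * allocation, which puts nr_frags a SKB_DATA_ALIGN()ed distance,
 * i.e. a whole number of cache lines, back from the cache aligned
 * end of the buffer; nr_frags is therefore itself cache line
 * aligned. The BUILD_BUG_ON in the skbuff.c hunk above then
 * guarantees that nr_frags through frags[0] fit in that one 64 byte
 * line.
 */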
* [PATCH 6/9] net: add support for per-paged-fragment destructors
  2012-05-03 14:55 [PATCH v5 0/9] skb paged fragment destructors Ian Campbell
                   ` (4 preceding siblings ...)
  2012-05-03 14:56 ` [PATCH 5/9] net: pad skb data and shinfo as a whole rather than individually Ian Campbell
@ 2012-05-03 14:56 ` Ian Campbell
  2012-05-03 14:56 ` [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors Ian Campbell
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell,
      Michał Mirosław

Entities which care about the complete lifecycle of pages which they
inject into the network stack via an skb paged fragment can choose to
set this destructor in order to receive a callback when the stack is
really finished with a page (including all clones, retransmits,
pull-ups etc etc).

This destructor will always be propagated alongside the struct page
when copying skb_frag_t->page. This is the reason I chose to embed the
destructor in a "struct { } page" within the skb_frag_t, rather than as
a separate field, since it allows existing code which propagates
->frags[N].page to Just Work(tm).

When the destructor is present the page reference counting is done
slightly differently. No references are held by the network stack on
the struct page (it is up to the caller to manage this as necessary);
instead the network stack will track references via the count embedded
in the destructor structure. When this reference count reaches zero
then the destructor will be called and the caller can take the
necessary steps to release the page (i.e. release the struct page
reference itself).

The intention is that callers can use this callback to delay completion
to _their_ callers until the network stack has completely released the
page, in order to prevent use-after-free or modification of data pages
which are still in use by the stack.

It is allowable (indeed expected) for a caller to share a single
destructor instance between multiple pages injected into the stack,
e.g. a group of pages included in a single higher level operation might
share a destructor which is used to complete that higher level
operation.

Previous changes have ensured that, even with the increase in frag
size, the hot fields (nr_frags through to at least frags[0]) fit within
and are aligned to a 64 byte cache line.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h |   50 ++++++++++++++++++++++++++++++++++++++++++++++-
 net/core/skbuff.c      |   18 +++++++++++++++++
 net/ipv4/ip_output.c   |    2 +-
 net/ipv4/tcp.c         |    4 +-
 4 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3698625..ccc7d93 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -168,9 +168,15 @@ struct sk_buff;
 
 typedef struct skb_frag_struct skb_frag_t;
 
+struct skb_frag_destructor {
+	atomic_t ref;
+	int (*destroy)(struct skb_frag_destructor *destructor);
+};
+
 struct skb_frag_struct {
 	struct {
 		struct page *p;
+		struct skb_frag_destructor *destructor;
 	} page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
 	__u32 page_offset;
@@ -1232,6 +1238,31 @@ static inline int skb_pagelen(const struct sk_buff *skb)
 }
 
 /**
+ * skb_frag_set_destructor - set destructor for a paged fragment
+ * @skb: buffer containing fragment to be initialised
+ * @i: paged fragment index to initialise
+ * @destroy: the destructor to use for this fragment
+ *
+ * Sets @destroy as the destructor to be called when all references to
+ * the frag @i in @skb (tracked over skb_clone, retransmit, pull-ups,
+ * etc) are released.
+ *
+ * When a destructor is set then reference counting is performed on
+ * @destroy->ref. When the ref reaches zero then @destroy->destroy
+ * will be called. The caller is responsible for holding and managing
+ * any other references (such as the struct page reference count).
+ *
+ * This function must be called before any use of skb_frag_ref() or
+ * skb_frag_unref().
+ */
+static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
+					   struct skb_frag_destructor *destroy)
+{
+	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+	frag->page.destructor = destroy;
+}
+
+/**
  * __skb_fill_page_desc - initialise a paged fragment in an skb
  * @skb: buffer containing fragment to be initialised
  * @i: paged fragment index to initialise
@@ -1250,6 +1281,7 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
 	frag->page.p = page;
+	frag->page.destructor = NULL;
 	frag->page_offset = off;
 	skb_frag_size_set(frag, size);
 }
@@ -1766,6 +1798,9 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
 	return frag->page.p;
 }
 
+extern void skb_frag_destructor_ref(struct skb_frag_destructor *destroy);
+extern void skb_frag_destructor_unref(struct skb_frag_destructor *destroy);
+
 /**
  * __skb_frag_ref - take an addition reference on a paged fragment.
* @frag: the paged fragment @@ -1774,6 +1809,10 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag) */ static inline void __skb_frag_ref(skb_frag_t *frag) { + if (unlikely(frag->page.destructor)) { + skb_frag_destructor_ref(frag->page.destructor); + return; + } get_page(skb_frag_page(frag)); } @@ -1797,6 +1836,10 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f) */ static inline void __skb_frag_unref(skb_frag_t *frag) { + if (unlikely(frag->page.destructor)) { + skb_frag_destructor_unref(frag->page.destructor); + return; + } put_page(skb_frag_page(frag)); } @@ -1994,13 +2037,16 @@ static inline int skb_add_data(struct sk_buff *skb, } static inline bool skb_can_coalesce(struct sk_buff *skb, int i, - const struct page *page, int off) + const struct page *page, + const struct skb_frag_destructor *destroy, + int off) { if (i) { const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1]; return page == skb_frag_page(frag) && - off == frag->page_offset + skb_frag_size(frag); + off == frag->page_offset + skb_frag_size(frag) && + frag->page.destructor == destroy; } return false; } diff --git a/net/core/skbuff.c b/net/core/skbuff.c index fab6de0..945b807 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -353,6 +353,23 @@ struct sk_buff *dev_alloc_skb(unsigned int length) } EXPORT_SYMBOL(dev_alloc_skb); +void skb_frag_destructor_ref(struct skb_frag_destructor *destroy) +{ + BUG_ON(destroy == NULL); + atomic_inc(&destroy->ref); +} +EXPORT_SYMBOL(skb_frag_destructor_ref); + +void skb_frag_destructor_unref(struct skb_frag_destructor *destroy) +{ + if (destroy == NULL) + return; + + if (atomic_dec_and_test(&destroy->ref)) + destroy->destroy(destroy); +} +EXPORT_SYMBOL(skb_frag_destructor_unref); + static void skb_drop_list(struct sk_buff **listp) { struct sk_buff *list = *listp; @@ -2334,6 +2351,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen) */ if (!to || !skb_can_coalesce(tgt, to, skb_frag_page(fragfrom), + fragfrom->page.destructor, fragfrom->page_offset)) { merge = -1; } else { diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 4910176..7652751 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1242,7 +1242,7 @@ ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page, i = skb_shinfo(skb)->nr_frags; if (len > size) len = size; - if (skb_can_coalesce(skb, i, page, offset)) { + if (skb_can_coalesce(skb, i, page, NULL, offset)) { skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len); } else if (i < MAX_SKB_FRAGS) { get_page(page); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 9670af3..2d590ca 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -870,7 +870,7 @@ new_segment: copy = size; i = skb_shinfo(skb)->nr_frags; - can_coalesce = skb_can_coalesce(skb, i, page, offset); + can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset); if (!can_coalesce && i >= MAX_SKB_FRAGS) { tcp_mark_push(tp, skb); goto new_segment; @@ -1124,7 +1124,7 @@ new_segment: off = sk->sk_sndmsg_off; - if (skb_can_coalesce(skb, i, page, off) && + if (skb_can_coalesce(skb, i, page, NULL, off) && off != PAGE_SIZE) { /* We can extend the last page * fragment. */ -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 23+ messages in thread
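To make the calling convention concrete, a usage sketch (illustration only: my_request, my_request_complete and the init/teardown discipline shown are hypothetical, chosen to be consistent with the kernel-doc above, with the caller holding its own initial reference and dropping it once it has finished injecting pages):

#include <linux/skbuff.h>

struct my_request {				/* hypothetical */
	struct skb_frag_destructor destroy;
	/* ... caller-private completion state ... */
};

void my_request_complete(struct my_request *req);	/* hypothetical */

static int my_request_frags_done(struct skb_frag_destructor *d)
{
	struct my_request *req = container_of(d, struct my_request, destroy);

	/* The stack holds no further references to any page injected
	 * under this destructor; it is now safe to complete upwards
	 * and to drop our own struct page references. */
	my_request_complete(req);
	return 0;
}

static void my_inject_page(struct sk_buff *skb, struct my_request *req,
			   struct page *page, int off, int len)
{
	int i = skb_shinfo(skb)->nr_frags;

	skb_fill_page_desc(skb, i, page, off, len);
	/* must be set before skb_frag_ref(), per the kernel-doc above */
	skb_frag_set_destructor(skb, i, &req->destroy);
	skb_frag_ref(skb, i);	/* tracked in destroy.ref, not the page */
}

Before injecting, the caller would do atomic_set(&req->destroy.ref, 1) and set req->destroy.destroy = my_request_frags_done; after handing the skb(s) off to the stack it drops that initial reference with skb_frag_destructor_unref(&req->destroy), so the callback runs exactly when the stack releases its last fragment reference.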
* [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors 2012-05-03 14:55 [PATCH v5 0/9] skb paged fragment destructors Ian Campbell ` (5 preceding siblings ...) 2012-05-03 14:56 ` [PATCH 6/9] net: add support for per-paged-fragment destructors Ian Campbell @ 2012-05-03 14:56 ` Ian Campbell 2012-05-03 15:41 ` Michael S. Tsirkin ` (2 more replies) 2012-05-03 14:56 ` [PATCH 8/9] net: add paged frag destructor support to kernel_sendpage Ian Campbell 2012-05-03 14:56 ` [PATCH 9/9] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack Ian Campbell 8 siblings, 3 replies; 23+ messages in thread From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw) To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell This should be used by drivers which need to hold on to an skb for an extended (perhaps unbounded) period of time. e.g. the tun driver which relies on userspace consuming the skb. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: mst@redhat.com --- drivers/net/tun.c | 1 + include/linux/skbuff.h | 11 ++++++++ net/core/skbuff.c | 68 ++++++++++++++++++++++++++++++++++------------- 3 files changed, 61 insertions(+), 19 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index bb8c72c..b53e04e 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -415,6 +415,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) /* Orphan the skb - required as we might hang on to it * for indefinite time. */ skb_orphan(skb); + skb_orphan_frags(skb, GFP_KERNEL); /* Enqueue packet */ skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb); diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index ccc7d93..9145f83 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1711,6 +1711,17 @@ static inline void skb_orphan(struct sk_buff *skb) } /** + * skb_orphan_frags - orphan the frags contained in a buffer + * @skb: buffer to orphan frags from + * @gfp_mask: allocation mask for replacement pages + * + * For each frag in the SKB which has a destructor (i.e. has an + * owner) create a copy of that frag and release the original + * page by calling the destructor. + */ +extern int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask); + +/** * __skb_queue_purge - empty a list * @list: list to empty * diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 945b807..f009abb 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -697,31 +697,25 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src) } EXPORT_SYMBOL_GPL(skb_morph); -/* skb_copy_ubufs - copy userspace skb frags buffers to kernel - * @skb: the skb to modify - * @gfp_mask: allocation priority - * - * This must be called on SKBTX_DEV_ZEROCOPY skb. - * It will copy all frags into kernel and drop the reference - * to userspace pages. - * - * If this function is called from an interrupt gfp_mask() must be - * %GFP_ATOMIC. - * - * Returns 0 on success or a negative error code on failure - * to allocate kernel memory to copy to. +/* + * If uarg != NULL copy and replace all frags. + * If uarg == NULL then only copy and replace those which have a destructor + * pointer. 
*/ -int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) +static int skb_copy_frags(struct sk_buff *skb, gfp_t gfp_mask, + struct ubuf_info *uarg) { int i; int num_frags = skb_shinfo(skb)->nr_frags; struct page *page, *head = NULL; - struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg; for (i = 0; i < num_frags; i++) { u8 *vaddr; skb_frag_t *f = &skb_shinfo(skb)->frags[i]; + if (!uarg && !f->page.destructor) + continue; + page = alloc_page(GFP_ATOMIC); if (!page) { while (head) { @@ -739,11 +733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) head = page; } - /* skb frags release userspace buffers */ - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) + /* skb frags release buffers */ + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *f = &skb_shinfo(skb)->frags[i]; + if (!uarg && !f->page.destructor) + continue; skb_frag_unref(skb, i); + } - uarg->callback(uarg); + if (uarg) + uarg->callback(uarg); /* skb frags point to kernel buffers */ for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) { @@ -752,10 +751,41 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) head = (struct page *)head->private; } - skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; return 0; } +/* skb_copy_ubufs - copy userspace skb frags buffers to kernel + * @skb: the skb to modify + * @gfp_mask: allocation priority + * + * This must be called on SKBTX_DEV_ZEROCOPY skb. + * It will copy all frags into kernel and drop the reference + * to userspace pages. + * + * If this function is called from an interrupt gfp_mask() must be + * %GFP_ATOMIC. + * + * Returns 0 on success or a negative error code on failure + * to allocate kernel memory to copy to. + */ +int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) +{ + struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg; + int rc; + + rc = skb_copy_frags(skb, gfp_mask, uarg); + + if (rc == 0) + skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; + + return rc; +} + +int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask) +{ + return skb_copy_frags(skb, gfp_mask, NULL); +} +EXPORT_SYMBOL(skb_orphan_frags); /** * skb_clone - duplicate an sk_buff -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors
  2012-05-03 14:56 ` [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors Ian Campbell
@ 2012-05-03 15:41 ` Michael S. Tsirkin
  2012-05-03 17:55   ` David Miller
  2012-05-06 13:54 ` Michael S. Tsirkin
  2012-05-07 10:24 ` Michael S. Tsirkin
  2 siblings, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2012-05-03 15:41 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David Miller, Eric Dumazet

On Thu, May 03, 2012 at 03:56:09PM +0100, Ian Campbell wrote:
> This should be used by drivers which need to hold on to an skb for an extended
> (perhaps unbounded) period of time. e.g. the tun driver which relies on
> userspace consuming the skb.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: mst@redhat.com

Right. But local sockets queue at socket forever as well.
I think this should be called in skb_set_owner_r?

This might somewhat penalize speed for local clients in the name
of correctness but these are rare so being correct is
more important I think.

Also, macvtap can get this when running in bridge mode, right?

> ---
>  drivers/net/tun.c      |    1 +
>  include/linux/skbuff.h |   11 ++++++++
>  net/core/skbuff.c      |   68 ++++++++++++++++++++++++++++++++++-------------------
>  3 files changed, 61 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index bb8c72c..b53e04e 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -415,6 +415,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  	/* Orphan the skb - required as we might hang on to it
>  	 * for indefinite time. */
>  	skb_orphan(skb);
> +	skb_orphan_frags(skb, GFP_KERNEL);
> 
>  	/* Enqueue packet */
>  	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index ccc7d93..9145f83 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1711,6 +1711,17 @@ static inline void skb_orphan(struct sk_buff *skb)
>  }
> 
>  /**
> + * skb_orphan_frags - orphan the frags contained in a buffer
> + * @skb: buffer to orphan frags from
> + * @gfp_mask: allocation mask for replacement pages
> + *
> + * For each frag in the SKB which has a destructor (i.e. has an
> + * owner) create a copy of that frag and release the original
> + * page by calling the destructor.
> + */
> +extern int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask);
> +
> +/**
>  * __skb_queue_purge - empty a list
>  * @list: list to empty
>  *
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 945b807..f009abb 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -697,31 +697,25 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
>  }
>  EXPORT_SYMBOL_GPL(skb_morph);
> 
> -/* skb_copy_ubufs - copy userspace skb frags buffers to kernel
> - * @skb: the skb to modify
> - * @gfp_mask: allocation priority
> - *
> - * This must be called on SKBTX_DEV_ZEROCOPY skb.
> - * It will copy all frags into kernel and drop the reference
> - * to userspace pages.
> - *
> - * If this function is called from an interrupt gfp_mask() must be
> - * %GFP_ATOMIC.
> - *
> - * Returns 0 on success or a negative error code on failure
> - * to allocate kernel memory to copy to.
> +/*
> + * If uarg != NULL copy and replace all frags.
> + * If uarg == NULL then only copy and replace those which have a destructor
> + * pointer.
> */ > -int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) > +static int skb_copy_frags(struct sk_buff *skb, gfp_t gfp_mask, > + struct ubuf_info *uarg) > { > int i; > int num_frags = skb_shinfo(skb)->nr_frags; > struct page *page, *head = NULL; > - struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg; > > for (i = 0; i < num_frags; i++) { > u8 *vaddr; > skb_frag_t *f = &skb_shinfo(skb)->frags[i]; > > + if (!uarg && !f->page.destructor) > + continue; > + > page = alloc_page(GFP_ATOMIC); > if (!page) { > while (head) { > @@ -739,11 +733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) > head = page; > } > > - /* skb frags release userspace buffers */ > - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) > + /* skb frags release buffers */ > + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { > + skb_frag_t *f = &skb_shinfo(skb)->frags[i]; > + if (!uarg && !f->page.destructor) > + continue; > skb_frag_unref(skb, i); > + } > > - uarg->callback(uarg); > + if (uarg) > + uarg->callback(uarg); > > /* skb frags point to kernel buffers */ > for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) { > @@ -752,10 +751,41 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) > head = (struct page *)head->private; > } > > - skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; > return 0; > } > > +/* skb_copy_ubufs - copy userspace skb frags buffers to kernel > + * @skb: the skb to modify > + * @gfp_mask: allocation priority > + * > + * This must be called on SKBTX_DEV_ZEROCOPY skb. > + * It will copy all frags into kernel and drop the reference > + * to userspace pages. > + * > + * If this function is called from an interrupt gfp_mask() must be > + * %GFP_ATOMIC. > + * > + * Returns 0 on success or a negative error code on failure > + * to allocate kernel memory to copy to. > + */ > +int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) > +{ > + struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg; > + int rc; > + > + rc = skb_copy_frags(skb, gfp_mask, uarg); > + > + if (rc == 0) > + skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; > + > + return rc; > +} > + > +int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask) > +{ > + return skb_copy_frags(skb, gfp_mask, NULL); > +} > +EXPORT_SYMBOL(skb_orphan_frags); > > /** > * skb_clone - duplicate an sk_buff > -- > 1.7.2.5 ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors 2012-05-03 15:41 ` Michael S. Tsirkin @ 2012-05-03 17:55 ` David Miller 2012-05-03 21:10 ` Michael S. Tsirkin 0 siblings, 1 reply; 23+ messages in thread From: David Miller @ 2012-05-03 17:55 UTC (permalink / raw) To: mst; +Cc: ian.campbell, netdev, eric.dumazet From: "Michael S. Tsirkin" <mst@redhat.com> Date: Thu, 3 May 2012 18:41:43 +0300 > On Thu, May 03, 2012 at 03:56:09PM +0100, Ian Campbell wrote: >> This should be used by drivers which need to hold on to an skb for an extended >> (perhaps unbounded) period of time. e.g. the tun driver which relies on >> userspace consuming the skb. >> >> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> >> Cc: mst@redhat.com > > > Right. But local sockets queue at socket forever as well. > I think this should be called in skb_set_owner_r? > > This might somewhat penalize speed for local clients in the name > of correctness but these are rare so being correct is > more important I think. But, on the other hand, putting the check into skb_set_owner_r() is a not so nice test to have in the fast path of every socket receive. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors 2012-05-03 17:55 ` David Miller @ 2012-05-03 21:10 ` Michael S. Tsirkin 2012-05-04 6:54 ` David Miller 0 siblings, 1 reply; 23+ messages in thread From: Michael S. Tsirkin @ 2012-05-03 21:10 UTC (permalink / raw) To: David Miller; +Cc: ian.campbell, netdev, eric.dumazet On Thu, May 03, 2012 at 01:55:32PM -0400, David Miller wrote: > From: "Michael S. Tsirkin" <mst@redhat.com> > Date: Thu, 3 May 2012 18:41:43 +0300 > > > On Thu, May 03, 2012 at 03:56:09PM +0100, Ian Campbell wrote: > >> This should be used by drivers which need to hold on to an skb for an extended > >> (perhaps unbounded) period of time. e.g. the tun driver which relies on > >> userspace consuming the skb. > >> > >> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> > >> Cc: mst@redhat.com > > > > > > Right. But local sockets queue at socket forever as well. > > I think this should be called in skb_set_owner_r? > > > > This might somewhat penalize speed for local clients in the name > > of correctness but these are rare so being correct is > > more important I think. > > But, on the other hand, putting the check into skb_set_owner_r() is a > not so nice test to have in the fast path of every socket receive. True. Hmm we orphan skbs when we loop them back so how about reusing the skb->destructor for this? We could teach pskb_copy pskb_expand_head etc that when skb with this flag is cloned (expand head etc) destructor would be set to a function that copies frags. (clone is less of a fast path so I think adding a branch there is less of an issue). Of course destructor is also called from kfree_skb but we could clear this flag before the call in kfree_skb so that destructor can distinguish. -- MST ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors 2012-05-03 21:10 ` Michael S. Tsirkin @ 2012-05-04 6:54 ` David Miller 2012-05-04 10:03 ` Michael S. Tsirkin 2012-05-06 17:01 ` Michael S. Tsirkin 0 siblings, 2 replies; 23+ messages in thread From: David Miller @ 2012-05-04 6:54 UTC (permalink / raw) To: mst; +Cc: ian.campbell, netdev, eric.dumazet From: "Michael S. Tsirkin" <mst@redhat.com> Date: Fri, 4 May 2012 00:10:24 +0300 > Hmm we orphan skbs when we loop them back so how about reusing the > skb->destructor for this? That's one possibility. But I fear we're about to toss Ian into yet another rabbit hole. :-) Let's try to converge on something quickly as I think integration of his work has been delayed enough as-is. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors 2012-05-04 6:54 ` David Miller @ 2012-05-04 10:03 ` Michael S. Tsirkin 2012-05-04 10:51 ` Ian Campbell 2012-05-06 17:01 ` Michael S. Tsirkin 1 sibling, 1 reply; 23+ messages in thread From: Michael S. Tsirkin @ 2012-05-04 10:03 UTC (permalink / raw) To: David Miller; +Cc: ian.campbell, netdev, eric.dumazet On Fri, May 04, 2012 at 02:54:33AM -0400, David Miller wrote: > From: "Michael S. Tsirkin" <mst@redhat.com> > Date: Fri, 4 May 2012 00:10:24 +0300 > > > Hmm we orphan skbs when we loop them back so how about reusing the > > skb->destructor for this? > > That's one possibility. > > But I fear we're about to toss Ian into yet another rabbit hole. :-) > > Let's try to converge on something quickly as I think integration of > his work has been delayed enough as-is. This is more intrusive than I'd like because on error we need to free the skb, so need to return error code from orphan, but it does not add an extra branch since skb_orphan is inline. So intrusive but should not hurt fast path. It's weekend here, I'll work on a patch like this Sunday. -- MST ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors 2012-05-04 10:03 ` Michael S. Tsirkin @ 2012-05-04 10:51 ` Ian Campbell 0 siblings, 0 replies; 23+ messages in thread From: Ian Campbell @ 2012-05-04 10:51 UTC (permalink / raw) To: Michael S. Tsirkin Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com On Fri, 2012-05-04 at 11:03 +0100, Michael S. Tsirkin wrote: > On Fri, May 04, 2012 at 02:54:33AM -0400, David Miller wrote: > > From: "Michael S. Tsirkin" <mst@redhat.com> > > Date: Fri, 4 May 2012 00:10:24 +0300 > > > > > Hmm we orphan skbs when we loop them back so how about reusing the > > > skb->destructor for this? > > > > That's one possibility. > > > > But I fear we're about to toss Ian into yet another rabbit hole. :-) > > > > Let's try to converge on something quickly as I think integration of > > his work has been delayed enough as-is. > > This is more intrusive than I'd like because on > error we need to free the skb, so need > to return error code from orphan, but it does not add > an extra branch since skb_orphan is inline. > So intrusive but should not hurt fast path. > > It's weekend here, I'll work on a patch like this > Sunday. Thanks, I was starting to feel my nose twitching and my ears beginning to elongate ;-) Ian. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors
  2012-05-04  6:54 ` David Miller
  2012-05-04 10:03   ` Michael S. Tsirkin
@ 2012-05-06 17:01   ` Michael S. Tsirkin
  1 sibling, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2012-05-06 17:01 UTC (permalink / raw)
  To: David Miller; +Cc: ian.campbell, netdev, eric.dumazet

On Fri, May 04, 2012 at 02:54:33AM -0400, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Fri, 4 May 2012 00:10:24 +0300
> 
> > Hmm we orphan skbs when we loop them back so how about reusing the
> > skb->destructor for this?
> 
> That's one possibility.
> 
> But I fear we're about to toss Ian into yet another rabbit hole. :-)
> 
> Let's try to converge on something quickly as I think integration of
> his work has been delayed enough as-is.

OK, I tried doing this and I recalled why we do the copy with ubufs
before clone: the problem is that shinfo is shared between skbs, so
modifying frags like skb_orphan_frags does is racy. Stuck for now.

So I have a question: how about reusing the SKBTX_DEV_ZEROCOPY
machinery for this, instead of frag destructors?

Thanks,

-- 
MST

^ permalink raw reply	[flat|nested] 23+ messages in thread
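A sketch of the sharing being referred to here (illustration only): a clone and its original share a single skb_shared_info, so rewriting frags[] through one of them is visible, possibly mid-update, through the other.

#include <linux/skbuff.h>

static void example_shared_shinfo(struct sk_buff *skb)
{
	struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);

	if (clone) {
		/* Both skbs point at the same shared info... */
		BUG_ON(skb_shinfo(skb) != skb_shinfo(clone));
		/* ...so skb_orphan_frags() on one of them would
		 * rewrite frags[] under the feet of users of the
		 * other. */
		kfree_skb(clone);
	}
}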
* Re: [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors
  2012-05-03 14:56 ` [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors Ian Campbell
  2012-05-03 15:41 ` Michael S. Tsirkin
@ 2012-05-06 13:54 ` Michael S. Tsirkin
  2012-05-07 10:24 ` Michael S. Tsirkin
  2 siblings, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2012-05-06 13:54 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, David Miller, Eric Dumazet

On Thu, May 03, 2012 at 03:56:09PM +0100, Ian Campbell wrote:
> This should be used by drivers which need to hold on to an skb for an extended
> (perhaps unbounded) period of time. e.g. the tun driver which relies on
> userspace consuming the skb.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: mst@redhat.com
> ---
>  drivers/net/tun.c      |    1 +
>  include/linux/skbuff.h |   11 ++++++++
>  net/core/skbuff.c      |   68 ++++++++++++++++++++++++++++++++++-------------------
>  3 files changed, 61 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index bb8c72c..b53e04e 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -415,6 +415,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  	/* Orphan the skb - required as we might hang on to it
>  	 * for indefinite time. */
>  	skb_orphan(skb);
> +	skb_orphan_frags(skb, GFP_KERNEL);
> 
>  	/* Enqueue packet */
>  	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);

BTW, I didn't notice this at the time, but there seem to be a couple of
other problems:

1. This is using GFP_KERNEL on the xmit path.
2. And it's not just a question of passing in GFP_ATOMIC: the return
   status needs to be checked.
3. Scanning all frags just in case one of them has a destructor also
   won't help performance :(

-- 
MST

^ permalink raw reply	[flat|nested] 23+ messages in thread
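A sketch of what the tun call site would need to look like to address points 1 and 2 (illustration only, not a posted patch; it assumes tun_net_xmit's existing drop label, which frees the skb and counts it as tx_dropped):

	/* Orphan the skb - required as we might hang on to it
	 * for indefinite time. */
	skb_orphan(skb);
	if (skb_orphan_frags(skb, GFP_ATOMIC))
		goto drop;	/* allocation failed; cannot park the skb */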
* Re: [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors 2012-05-03 14:56 ` [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors Ian Campbell 2012-05-03 15:41 ` Michael S. Tsirkin 2012-05-06 13:54 ` Michael S. Tsirkin @ 2012-05-07 10:24 ` Michael S. Tsirkin 2012-05-09 9:36 ` Ian Campbell 2 siblings, 1 reply; 23+ messages in thread From: Michael S. Tsirkin @ 2012-05-07 10:24 UTC (permalink / raw) To: Ian Campbell; +Cc: netdev, David Miller, Eric Dumazet On Thu, May 03, 2012 at 03:56:09PM +0100, Ian Campbell wrote: > This should be used by drivers which need to hold on to an skb for an extended > (perhaps unbounded) period of time. e.g. the tun driver which relies on > userspace consuming the skb. > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com> > Cc: mst@redhat.com > --- > drivers/net/tun.c | 1 + > include/linux/skbuff.h | 11 ++++++++ > net/core/skbuff.c | 68 ++++++++++++++++++++++++++++++++++------------- > 3 files changed, 61 insertions(+), 19 deletions(-) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index bb8c72c..b53e04e 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -415,6 +415,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) > /* Orphan the skb - required as we might hang on to it > * for indefinite time. */ > skb_orphan(skb); > + skb_orphan_frags(skb, GFP_KERNEL); > > /* Enqueue packet */ > skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb); > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index ccc7d93..9145f83 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -1711,6 +1711,17 @@ static inline void skb_orphan(struct sk_buff *skb) > } > > /** > + * skb_orphan_frags - orphan the frags contained in a buffer > + * @skb: buffer to orphan frags from > + * @gfp_mask: allocation mask for replacement pages > + * > + * For each frag in the SKB which has a destructor (i.e. has an > + * owner) create a copy of that frag and release the original > + * page by calling the destructor. > + */ > +extern int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask); > + > +/** > * __skb_queue_purge - empty a list > * @list: list to empty > * > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 945b807..f009abb 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -697,31 +697,25 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src) > } > EXPORT_SYMBOL_GPL(skb_morph); > > -/* skb_copy_ubufs - copy userspace skb frags buffers to kernel > - * @skb: the skb to modify > - * @gfp_mask: allocation priority > - * > - * This must be called on SKBTX_DEV_ZEROCOPY skb. > - * It will copy all frags into kernel and drop the reference > - * to userspace pages. > - * > - * If this function is called from an interrupt gfp_mask() must be > - * %GFP_ATOMIC. > - * > - * Returns 0 on success or a negative error code on failure > - * to allocate kernel memory to copy to. > +/* > + * If uarg != NULL copy and replace all frags. > + * If uarg == NULL then only copy and replace those which have a destructor > + * pointer. 
> */ > -int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) > +static int skb_copy_frags(struct sk_buff *skb, gfp_t gfp_mask, > + struct ubuf_info *uarg) > { > int i; > int num_frags = skb_shinfo(skb)->nr_frags; > struct page *page, *head = NULL; > - struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg; > > for (i = 0; i < num_frags; i++) { > u8 *vaddr; > skb_frag_t *f = &skb_shinfo(skb)->frags[i]; > > + if (!uarg && !f->page.destructor) > + continue; > + > page = alloc_page(GFP_ATOMIC); > if (!page) { > while (head) { > @@ -739,11 +733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) > head = page; > } > > - /* skb frags release userspace buffers */ > - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) > + /* skb frags release buffers */ > + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { > + skb_frag_t *f = &skb_shinfo(skb)->frags[i]; > + if (!uarg && !f->page.destructor) > + continue; > skb_frag_unref(skb, i); > + } > > - uarg->callback(uarg); > + if (uarg) > + uarg->callback(uarg); > So above we only linked up copied pages, but below we try to use the list for all frags. Looks like a bug, I think it needs to check destructor and uarg too. > /* skb frags point to kernel buffers */ > for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) { > @@ -752,10 +751,41 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) > head = (struct page *)head->private; > } > > - skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; > return 0; > } > > +/* skb_copy_ubufs - copy userspace skb frags buffers to kernel > + * @skb: the skb to modify > + * @gfp_mask: allocation priority > + * > + * This must be called on SKBTX_DEV_ZEROCOPY skb. > + * It will copy all frags into kernel and drop the reference > + * to userspace pages. > + * > + * If this function is called from an interrupt gfp_mask() must be > + * %GFP_ATOMIC. > + * > + * Returns 0 on success or a negative error code on failure > + * to allocate kernel memory to copy to. > + */ > +int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) > +{ > + struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg; > + int rc; > + > + rc = skb_copy_frags(skb, gfp_mask, uarg); > + > + if (rc == 0) > + skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; > + > + return rc; > +} > + > +int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask) > +{ > + return skb_copy_frags(skb, gfp_mask, NULL); > +} > +EXPORT_SYMBOL(skb_orphan_frags); > > /** > * skb_clone - duplicate an sk_buff > -- > 1.7.2.5 ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors 2012-05-07 10:24 ` Michael S. Tsirkin @ 2012-05-09 9:36 ` Ian Campbell 0 siblings, 0 replies; 23+ messages in thread From: Ian Campbell @ 2012-05-09 9:36 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: netdev@vger.kernel.org, David Miller, Eric Dumazet > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 945b807..f009abb 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -697,31 +697,25 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src) > > } > > EXPORT_SYMBOL_GPL(skb_morph); > > > > -/* skb_copy_ubufs - copy userspace skb frags buffers to kernel > > - * @skb: the skb to modify > > - * @gfp_mask: allocation priority > > - * > > - * This must be called on SKBTX_DEV_ZEROCOPY skb. > > - * It will copy all frags into kernel and drop the reference > > - * to userspace pages. > > - * > > - * If this function is called from an interrupt gfp_mask() must be > > - * %GFP_ATOMIC. > > - * > > - * Returns 0 on success or a negative error code on failure > > - * to allocate kernel memory to copy to. > > +/* > > + * If uarg != NULL copy and replace all frags. > > + * If uarg == NULL then only copy and replace those which have a destructor > > + * pointer. > > */ > > -int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) > > +static int skb_copy_frags(struct sk_buff *skb, gfp_t gfp_mask, > > + struct ubuf_info *uarg) > > { > > int i; > > int num_frags = skb_shinfo(skb)->nr_frags; > > struct page *page, *head = NULL; > > - struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg; > > > > for (i = 0; i < num_frags; i++) { > > u8 *vaddr; > > skb_frag_t *f = &skb_shinfo(skb)->frags[i]; > > > > + if (!uarg && !f->page.destructor) > > + continue; > > + > > page = alloc_page(GFP_ATOMIC); > > if (!page) { > > while (head) { > > @@ -739,11 +733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) > > head = page; > > } > > > > - /* skb frags release userspace buffers */ > > - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) > > + /* skb frags release buffers */ > > + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { > > + skb_frag_t *f = &skb_shinfo(skb)->frags[i]; > > + if (!uarg && !f->page.destructor) > > + continue; > > skb_frag_unref(skb, i); > > + } > > > > - uarg->callback(uarg); > > + if (uarg) > > + uarg->callback(uarg); > > > > So above we only linked up copied pages, but below we > try to use the list for all frags. Looks like a bug, > I think it needs to check destructor and uarg too. You are absolutely correct. We need the same check in all three loops. > > > > /* skb frags point to kernel buffers */ > > for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) { > > @@ -752,10 +751,41 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) > > head = (struct page *)head->private; > > } > > > > - skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; > > return 0; > > } ^ permalink raw reply [flat|nested] 23+ messages in thread
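[Editorial note: to make the fix agreed above concrete, here is a minimal sketch (not from the series, untested) of the third loop in skb_copy_frags() with the same predicate the first two loops use; the offset-0 fill for the replacement page follows the pre-patch skb_copy_ubufs() and is an assumption here:

	/* skb frags point to kernel buffers */
	for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
		skb_frag_t *f = &skb_shinfo(skb)->frags[i - 1];

		/* skip frags the first loop skipped: they were never
		 * copied and still own their original page */
		if (!uarg && !f->page.destructor)
			continue;

		__skb_fill_page_desc(skb, i - 1, head, 0, skb_frag_size(f));
		head = (struct page *)head->private;
	}

Because all three loops now skip exactly the same frags, the LIFO list built from head stays in step with the frags that consume it.]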
* [PATCH 8/9] net: add paged frag destructor support to kernel_sendpage. 2012-05-03 14:55 [PATCH v5 0/9] skb paged fragment destructors Ian Campbell ` (6 preceding siblings ...) 2012-05-03 14:56 ` [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors Ian Campbell @ 2012-05-03 14:56 ` Ian Campbell 2012-05-10 11:48 ` Michael S. Tsirkin 2012-05-03 14:56 ` [PATCH 9/9] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack Ian Campbell 8 siblings, 1 reply; 23+ messages in thread From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw) To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell This requires adding a new argument to various sendpage hooks up and down the stack. At the moment this parameter is always NULL. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org --- drivers/block/drbd/drbd_main.c | 1 + drivers/scsi/iscsi_tcp.c | 4 ++-- drivers/scsi/iscsi_tcp.h | 3 ++- drivers/target/iscsi/iscsi_target_util.c | 3 ++- fs/dlm/lowcomms.c | 4 ++-- fs/ocfs2/cluster/tcp.c | 1 + include/linux/net.h | 6 +++++- include/net/inet_common.h | 4 +++- include/net/ip.h | 4 +++- include/net/sock.h | 8 +++++--- include/net/tcp.h | 4 +++- net/ceph/messenger.c | 2 +- net/core/sock.c | 6 +++++- net/ipv4/af_inet.c | 9 ++++++--- net/ipv4/ip_output.c | 6 ++++-- net/ipv4/tcp.c | 24 +++++++++++++++--------- net/ipv4/udp.c | 11 ++++++----- net/ipv4/udp_impl.h | 5 +++-- net/rds/tcp_send.c | 1 + net/socket.c | 11 +++++++---- net/sunrpc/svcsock.c | 6 +++--- net/sunrpc/xprtsock.c | 2 +- 22 files changed, 81 insertions(+), 44 deletions(-) diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c index 211fc44..e70ba0c 100644 --- a/drivers/block/drbd/drbd_main.c +++ b/drivers/block/drbd/drbd_main.c @@ -2584,6 +2584,7 @@ static int _drbd_send_page(struct drbd_conf *mdev, struct page *page, set_fs(KERNEL_DS); do { sent = mdev->data.socket->ops->sendpage(mdev->data.socket, page, + NULL, offset, len, msg_flags); if (sent == -EAGAIN) { diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c index 9220861..724d32538 100644 --- a/drivers/scsi/iscsi_tcp.c +++ b/drivers/scsi/iscsi_tcp.c @@ -284,8 +284,8 @@ static int iscsi_sw_tcp_xmit_segment(struct iscsi_tcp_conn *tcp_conn, if (!segment->data) { sg = segment->sg; offset += segment->sg_offset + sg->offset; - r = tcp_sw_conn->sendpage(sk, sg_page(sg), offset, - copy, flags); + r = tcp_sw_conn->sendpage(sk, sg_page(sg), NULL, + offset, copy, flags); } else { struct msghdr msg = { .msg_flags = flags }; struct kvec iov = { diff --git a/drivers/scsi/iscsi_tcp.h b/drivers/scsi/iscsi_tcp.h index 666fe09..1e23265 100644 --- a/drivers/scsi/iscsi_tcp.h +++ b/drivers/scsi/iscsi_tcp.h @@ -52,7 +52,8 @@ struct iscsi_sw_tcp_conn { uint32_t sendpage_failures_cnt; uint32_t discontiguous_hdr_cnt; - ssize_t (*sendpage)(struct socket *, struct page *, int, size_t, int); + ssize_t (*sendpage)(struct socket *, struct page *, + struct skb_frag_destructor *, int, size_t, int); }; struct iscsi_sw_tcp_host { diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c index 4eba86d..d876dae 100644 --- a/drivers/target/iscsi/iscsi_target_util.c +++ b/drivers/target/iscsi/iscsi_target_util.c @@ -1323,7 +1323,8 @@ send_hdr: u32 sub_len = min_t(u32, data_len, space); send_pg: tx_sent = conn->sock->ops->sendpage(conn->sock, - sg_page(sg), sg->offset + offset, sub_len, 0); + sg_page(sg), 
NULL, + sg->offset + offset, sub_len, 0); if (tx_sent != sub_len) { if (tx_sent == -EAGAIN) { pr_err("tcp_sendpage() returned" diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c index 133ef6d..0673cea 100644 --- a/fs/dlm/lowcomms.c +++ b/fs/dlm/lowcomms.c @@ -1336,8 +1336,8 @@ static void send_to_sock(struct connection *con) ret = 0; if (len) { - ret = kernel_sendpage(con->sock, e->page, offset, len, - msg_flags); + ret = kernel_sendpage(con->sock, e->page, NULL, + offset, len, msg_flags); if (ret == -EAGAIN || ret == 0) { if (ret == -EAGAIN && test_bit(SOCK_ASYNC_NOSPACE, &con->sock->flags) && diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c index 1bfe880..c82a711 100644 --- a/fs/ocfs2/cluster/tcp.c +++ b/fs/ocfs2/cluster/tcp.c @@ -983,6 +983,7 @@ static void o2net_sendpage(struct o2net_sock_container *sc, mutex_lock(&sc->sc_send_lock); ret = sc->sc_sock->ops->sendpage(sc->sc_sock, virt_to_page(kmalloced_virt), + NULL, (long)kmalloced_virt & ~PAGE_MASK, size, MSG_DONTWAIT); mutex_unlock(&sc->sc_send_lock); diff --git a/include/linux/net.h b/include/linux/net.h index be60c7f..d9b0d648 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -157,6 +157,7 @@ struct kiocb; struct sockaddr; struct msghdr; struct module; +struct skb_frag_destructor; struct proto_ops { int family; @@ -203,6 +204,7 @@ struct proto_ops { int (*mmap) (struct file *file, struct socket *sock, struct vm_area_struct * vma); ssize_t (*sendpage) (struct socket *sock, struct page *page, + struct skb_frag_destructor *destroy, int offset, size_t size, int flags); ssize_t (*splice_read)(struct socket *sock, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags); @@ -274,7 +276,9 @@ extern int kernel_getsockopt(struct socket *sock, int level, int optname, char *optval, int *optlen); extern int kernel_setsockopt(struct socket *sock, int level, int optname, char *optval, unsigned int optlen); -extern int kernel_sendpage(struct socket *sock, struct page *page, int offset, +extern int kernel_sendpage(struct socket *sock, struct page *page, + struct skb_frag_destructor *destroy, + int offset, size_t size, int flags); extern int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg); extern int kernel_sock_shutdown(struct socket *sock, diff --git a/include/net/inet_common.h b/include/net/inet_common.h index 22fac98..91cd8d0 100644 --- a/include/net/inet_common.h +++ b/include/net/inet_common.h @@ -21,7 +21,9 @@ extern int inet_dgram_connect(struct socket *sock, struct sockaddr * uaddr, extern int inet_accept(struct socket *sock, struct socket *newsock, int flags); extern int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, size_t size); -extern ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset, +extern ssize_t inet_sendpage(struct socket *sock, struct page *page, + struct skb_frag_destructor *frag, + int offset, size_t size, int flags); extern int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, size_t size, int flags); diff --git a/include/net/ip.h b/include/net/ip.h index 94ddb69c..dbd7ecb 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -114,7 +114,9 @@ extern int ip_append_data(struct sock *sk, struct flowi4 *fl4, struct rtable **rt, unsigned int flags); extern int ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb); -extern ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page, +extern ssize_t ip_append_page(struct sock *sk, struct 
flowi4 *fl4, + struct page *page, + struct skb_frag_destructor *destroy, int offset, size_t size, int flags); extern struct sk_buff *__ip_make_skb(struct sock *sk, struct flowi4 *fl4, diff --git a/include/net/sock.h b/include/net/sock.h index 68a2834..c999f48 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -848,6 +848,7 @@ struct proto { size_t len, int noblock, int flags, int *addr_len); int (*sendpage)(struct sock *sk, struct page *page, + struct skb_frag_destructor *destroy, int offset, size_t size, int flags); int (*bind)(struct sock *sk, struct sockaddr *uaddr, int addr_len); @@ -1466,9 +1467,10 @@ extern int sock_no_mmap(struct file *file, struct socket *sock, struct vm_area_struct *vma); extern ssize_t sock_no_sendpage(struct socket *sock, - struct page *page, - int offset, size_t size, - int flags); + struct page *page, + struct skb_frag_destructor *destroy, + int offset, size_t size, + int flags); /* * Functions to fill in entries in struct proto_ops when a protocol diff --git a/include/net/tcp.h b/include/net/tcp.h index 0fb84de..81dbfde8 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -331,7 +331,9 @@ extern void *tcp_v4_tw_get_peer(struct sock *sk); extern int tcp_v4_tw_remember_stamp(struct inet_timewait_sock *tw); extern int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, size_t size); -extern int tcp_sendpage(struct sock *sk, struct page *page, int offset, +extern int tcp_sendpage(struct sock *sk, struct page *page, + struct skb_frag_destructor *destroy, + int offset, size_t size, int flags); extern int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg); extern int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb, diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 36fa6bf..b355be1 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -320,7 +320,7 @@ static int ceph_tcp_sendpage(struct socket *sock, struct page *page, int flags = MSG_DONTWAIT | MSG_NOSIGNAL | (more ? 
MSG_MORE : MSG_EOR); int ret; - ret = kernel_sendpage(sock, page, offset, size, flags); + ret = kernel_sendpage(sock, page, NULL, offset, size, flags); if (ret == -EAGAIN) ret = 0; diff --git a/net/core/sock.c b/net/core/sock.c index 1a88351..cffff5f 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1953,7 +1953,9 @@ int sock_no_mmap(struct file *file, struct socket *sock, struct vm_area_struct * } EXPORT_SYMBOL(sock_no_mmap); -ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags) +ssize_t sock_no_sendpage(struct socket *sock, struct page *page, + struct skb_frag_destructor *destroy, + int offset, size_t size, int flags) { ssize_t res; struct msghdr msg = {.msg_flags = flags}; @@ -1963,6 +1965,8 @@ ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, siz iov.iov_len = size; res = kernel_sendmsg(sock, &msg, &iov, 1, size); kunmap(page); + /* kernel_sendmsg copies so we can destroy immediately */ + skb_frag_destructor_unref(destroy); return res; } EXPORT_SYMBOL(sock_no_sendpage); diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index c8f7aee..b1caf89 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -747,7 +747,9 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, } EXPORT_SYMBOL(inet_sendmsg); -ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset, +ssize_t inet_sendpage(struct socket *sock, struct page *page, + struct skb_frag_destructor *destroy, + int offset, size_t size, int flags) { struct sock *sk = sock->sk; @@ -760,8 +762,9 @@ ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset, return -EAGAIN; if (sk->sk_prot->sendpage) - return sk->sk_prot->sendpage(sk, page, offset, size, flags); - return sock_no_sendpage(sock, page, offset, size, flags); + return sk->sk_prot->sendpage(sk, page, destroy, + offset, size, flags); + return sock_no_sendpage(sock, page, destroy, offset, size, flags); } EXPORT_SYMBOL(inet_sendpage); diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 7652751..877ff62 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1129,6 +1129,7 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4, } ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page, + struct skb_frag_destructor *destroy, int offset, size_t size, int flags) { struct inet_sock *inet = inet_sk(sk); @@ -1242,11 +1243,12 @@ ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page, i = skb_shinfo(skb)->nr_frags; if (len > size) len = size; - if (skb_can_coalesce(skb, i, page, NULL, offset)) { + if (skb_can_coalesce(skb, i, page, destroy, offset)) { skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len); } else if (i < MAX_SKB_FRAGS) { - get_page(page); skb_fill_page_desc(skb, i, page, offset, len); + skb_frag_set_destructor(skb, i, destroy); + skb_frag_ref(skb, i); } else { err = -EMSGSIZE; goto error; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 2d590ca..bee7864 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -822,8 +822,11 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags) return mss_now; } -static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset, - size_t psize, int flags) +static ssize_t do_tcp_sendpages(struct sock *sk, + struct page **pages, + struct skb_frag_destructor *destroy, + int poffset, + size_t psize, int flags) { struct tcp_sock *tp = tcp_sk(sk); int mss_now, size_goal; @@ -870,7 +873,7 @@ new_segment: copy = size; i = 
skb_shinfo(skb)->nr_frags; - can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset); + can_coalesce = skb_can_coalesce(skb, i, page, destroy, offset); if (!can_coalesce && i >= MAX_SKB_FRAGS) { tcp_mark_push(tp, skb); goto new_segment; @@ -881,8 +884,9 @@ new_segment: if (can_coalesce) { skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy); } else { - get_page(page); skb_fill_page_desc(skb, i, page, offset, copy); + skb_frag_set_destructor(skb, i, destroy); + skb_frag_ref(skb, i); } skb->len += copy; @@ -937,18 +941,20 @@ out_err: return sk_stream_error(sk, flags, err); } -int tcp_sendpage(struct sock *sk, struct page *page, int offset, - size_t size, int flags) +int tcp_sendpage(struct sock *sk, struct page *page, + struct skb_frag_destructor *destroy, + int offset, size_t size, int flags) { ssize_t res; if (!(sk->sk_route_caps & NETIF_F_SG) || !(sk->sk_route_caps & NETIF_F_ALL_CSUM)) - return sock_no_sendpage(sk->sk_socket, page, offset, size, - flags); + return sock_no_sendpage(sk->sk_socket, page, destroy, + offset, size, flags); lock_sock(sk); - res = do_tcp_sendpages(sk, &page, offset, size, flags); + res = do_tcp_sendpages(sk, &page, destroy, + offset, size, flags); release_sock(sk); return res; } diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 279fd08..c69aa65 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1032,8 +1032,9 @@ do_confirm: } EXPORT_SYMBOL(udp_sendmsg); -int udp_sendpage(struct sock *sk, struct page *page, int offset, - size_t size, int flags) +int udp_sendpage(struct sock *sk, struct page *page, + struct skb_frag_destructor *destroy, + int offset, size_t size, int flags) { struct inet_sock *inet = inet_sk(sk); struct udp_sock *up = udp_sk(sk); @@ -1061,11 +1062,11 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset, } ret = ip_append_page(sk, &inet->cork.fl.u.ip4, - page, offset, size, flags); + page, destroy, offset, size, flags); if (ret == -EOPNOTSUPP) { release_sock(sk); - return sock_no_sendpage(sk->sk_socket, page, offset, - size, flags); + return sock_no_sendpage(sk->sk_socket, page, destroy, + offset, size, flags); } if (ret < 0) { udp_flush_pending_frames(sk); diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h index 5a681e2..aa8eca2 100644 --- a/net/ipv4/udp_impl.h +++ b/net/ipv4/udp_impl.h @@ -23,8 +23,9 @@ extern int compat_udp_getsockopt(struct sock *sk, int level, int optname, #endif extern int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, size_t len, int noblock, int flags, int *addr_len); -extern int udp_sendpage(struct sock *sk, struct page *page, int offset, - size_t size, int flags); +extern int udp_sendpage(struct sock *sk, struct page *page, + struct skb_frag_destructor *destroy, + int offset, size_t size, int flags); extern int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb); extern void udp_destroy_sock(struct sock *sk); diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c index 1b4fd68..71503ad 100644 --- a/net/rds/tcp_send.c +++ b/net/rds/tcp_send.c @@ -119,6 +119,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm, while (sg < rm->data.op_nents) { ret = tc->t_sock->ops->sendpage(tc->t_sock, sg_page(&rm->data.op_sg[sg]), + NULL, rm->data.op_sg[sg].offset + off, rm->data.op_sg[sg].length - off, MSG_DONTWAIT|MSG_NOSIGNAL); diff --git a/net/socket.c b/net/socket.c index d3aaa4f..f92c9c2 100644 --- a/net/socket.c +++ b/net/socket.c @@ -815,7 +815,7 @@ static ssize_t sock_sendpage(struct file *file, struct page *page, /* more is a combination of MSG_MORE and 
MSG_SENDPAGE_NOTLAST */ flags |= more; - return kernel_sendpage(sock, page, offset, size, flags); + return kernel_sendpage(sock, page, NULL, offset, size, flags); } static ssize_t sock_splice_read(struct file *file, loff_t *ppos, @@ -3349,15 +3349,18 @@ int kernel_setsockopt(struct socket *sock, int level, int optname, } EXPORT_SYMBOL(kernel_setsockopt); -int kernel_sendpage(struct socket *sock, struct page *page, int offset, +int kernel_sendpage(struct socket *sock, struct page *page, + struct skb_frag_destructor *destroy, + int offset, size_t size, int flags) { sock_update_classid(sock->sk); if (sock->ops->sendpage) - return sock->ops->sendpage(sock, page, offset, size, flags); + return sock->ops->sendpage(sock, page, destroy, + offset, size, flags); - return sock_no_sendpage(sock, page, offset, size, flags); + return sock_no_sendpage(sock, page, destroy, offset, size, flags); } EXPORT_SYMBOL(kernel_sendpage); diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index f0132b2..f6d8c73 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -185,7 +185,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr, /* send head */ if (slen == xdr->head[0].iov_len) flags = 0; - len = kernel_sendpage(sock, headpage, headoffset, + len = kernel_sendpage(sock, headpage, NULL, headoffset, xdr->head[0].iov_len, flags); if (len != xdr->head[0].iov_len) goto out; @@ -198,7 +198,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr, while (pglen > 0) { if (slen == size) flags = 0; - result = kernel_sendpage(sock, *ppage, base, size, flags); + result = kernel_sendpage(sock, *ppage, NULL, base, size, flags); if (result > 0) len += result; if (result != size) @@ -212,7 +212,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr, /* send tail */ if (xdr->tail[0].iov_len) { - result = kernel_sendpage(sock, tailpage, tailoffset, + result = kernel_sendpage(sock, tailpage, NULL, tailoffset, xdr->tail[0].iov_len, 0); if (result > 0) len += result; diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c index 890b03f..f1995dc 100644 --- a/net/sunrpc/xprtsock.c +++ b/net/sunrpc/xprtsock.c @@ -408,7 +408,7 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i remainder -= len; if (remainder != 0 || more) flags |= MSG_MORE; - err = sock->ops->sendpage(sock, *ppage, base, len, flags); + err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags); if (remainder == 0 || err != len) break; sent += err; -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 23+ messages in thread
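[Editorial note: for readers tracing the new parameter, below is a minimal caller-side sketch modelled on how patch 9/9 wires this up for sunrpc. The struct and helper names (skb_frag_destructor, its ref/destroy fields, skb_frag_destructor_unref) come from the series; my_tx, my_pages_released and my_send are hypothetical, and the sketch assumes the frag-attach path (NETIF_F_SG) — note that the sock_no_sendpage fallback above drops a reference itself when it copies, so real callers must follow the series' refcount convention exactly:

	#include <linux/completion.h>	/* kernel-context sketch */
	#include <linux/net.h>
	#include <linux/skbuff.h>

	struct my_tx {
		struct skb_frag_destructor destructor;
		struct completion done;
	};

	static int my_pages_released(struct skb_frag_destructor *d)
	{
		struct my_tx *tx = container_of(d, struct my_tx, destructor);

		/* the stack holds no more references to the pages */
		complete(&tx->done);
		return 0;
	}

	static int my_send(struct socket *sock, struct page *page,
			   int offset, size_t len)
	{
		struct my_tx tx;
		int ret;

		init_completion(&tx.done);
		atomic_set(&tx.destructor.ref, 1);	/* caller's reference */
		tx.destructor.destroy = my_pages_released;

		ret = kernel_sendpage(sock, page, &tx.destructor,
				      offset, len, 0);

		/* drop the caller's ref; a frag attached by the stack holds
		 * its own, so the callback fires once the stack lets go */
		skb_frag_destructor_unref(&tx.destructor);
		wait_for_completion(&tx.done);
		return ret;
	}

This mirrors the rpc_rqst flow in patch 9/9: initialise ref to 1 and set the destroy callback up front, hand the destructor to kernel_sendpage(), then drop the initial reference and wait to be notified.]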
* Re: [PATCH 8/9] net: add paged frag destructor support to kernel_sendpage. 2012-05-03 14:56 ` [PATCH 8/9] net: add paged frag destructor support to kernel_sendpage Ian Campbell @ 2012-05-10 11:48 ` Michael S. Tsirkin 0 siblings, 0 replies; 23+ messages in thread From: Michael S. Tsirkin @ 2012-05-10 11:48 UTC (permalink / raw) To: Ian Campbell; +Cc: netdev, David Miller, Eric Dumazet On Thu, May 03, 2012 at 03:56:10PM +0100, Ian Campbell wrote: > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 2d590ca..bee7864 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -822,8 +822,11 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags) > return mss_now; > } > > -static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset, > - size_t psize, int flags) > +static ssize_t do_tcp_sendpages(struct sock *sk, > + struct page **pages, > + struct skb_frag_destructor *destroy, > + int poffset, > + size_t psize, int flags) > { > struct tcp_sock *tp = tcp_sk(sk); > int mss_now, size_goal; > @@ -870,7 +873,7 @@ new_segment: > copy = size; > > i = skb_shinfo(skb)->nr_frags; > - can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset); > + can_coalesce = skb_can_coalesce(skb, i, page, destroy, offset); > if (!can_coalesce && i >= MAX_SKB_FRAGS) { > tcp_mark_push(tp, skb); > goto new_segment; > @@ -881,8 +884,9 @@ new_segment: > if (can_coalesce) { > skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy); > } else { > - get_page(page); > skb_fill_page_desc(skb, i, page, offset, copy); > + skb_frag_set_destructor(skb, i, destroy); > + skb_frag_ref(skb, i); > } > > skb->len += copy; > @@ -937,18 +941,20 @@ out_err: > return sk_stream_error(sk, flags, err); > } > > -int tcp_sendpage(struct sock *sk, struct page *page, int offset, > - size_t size, int flags) > +int tcp_sendpage(struct sock *sk, struct page *page, > + struct skb_frag_destructor *destroy, > + int offset, size_t size, int flags) > { > ssize_t res; > > if (!(sk->sk_route_caps & NETIF_F_SG) || > !(sk->sk_route_caps & NETIF_F_ALL_CSUM)) > - return sock_no_sendpage(sk->sk_socket, page, offset, size, > - flags); > + return sock_no_sendpage(sk->sk_socket, page, destroy, > + offset, size, flags); > > lock_sock(sk); > - res = do_tcp_sendpages(sk, &page, offset, size, flags); > + res = do_tcp_sendpages(sk, &page, destroy, > + offset, size, flags); > release_sock(sk); > return res; > } Sorry about making more noise but I realized there's something I don't understand here. Is it true that all this does is stick the destructor in the frag list? If so, could this deadlock (or delay application significantly) if tcp has queued the skb on the write queue but is not transmitting it, while the application is waiting for pages to complete? -- MST ^ permalink raw reply [flat|nested] 23+ messages in thread
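[Editorial note: the body of skb_frag_destructor_unref() is not quoted in this excerpt; a presumed-minimal reconstruction is shown below to make the concern above concrete. The callback runs only when the last reference is dropped, and for TCP a frag keeps its reference until the skb carrying it is freed, i.e. until the data is ACKed — so a sender waiting on the callback can indeed stall for as long as the skb sits on the write queue:

	static inline void
	skb_frag_destructor_unref(struct skb_frag_destructor *destroy)
	{
		if (destroy == NULL)
			return;
		/* run the destroy callback only on the final put */
		if (atomic_dec_and_test(&destroy->ref))
			destroy->destroy(destroy);
	}

The ref and destroy fields match their uses in patch 9/9 below (atomic_set(&req->destructor.ref, 1) and req->destructor.destroy = &xprt_complete_skb_pages).]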
* [PATCH 9/9] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack. 2012-05-03 14:55 [PATCH v5 0/9] skb paged fragment destructors Ian Campbell ` (7 preceding siblings ...) 2012-05-03 14:56 ` [PATCH 8/9] net: add paged frag destructor support to kernel_sendpage Ian Campbell @ 2012-05-03 14:56 ` Ian Campbell [not found] ` <1336056971-7839-9-git-send-email-ian.campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org> 8 siblings, 1 reply; 23+ messages in thread From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw) To: netdev Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell, Neil Brown, J. Bruce Fields, linux-nfs This prevents an issue where an ACK is delayed, a retransmit is queued (either at the RPC or TCP level) and the ACK arrives before the retransmission hits the wire. If this happens to an NFS WRITE RPC then the write() system call completes and the userspace process can continue, potentially modifying data referenced by the retransmission before the retransmission occurs. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: linux-nfs@vger.kernel.org Cc: netdev@vger.kernel.org --- include/linux/sunrpc/xdr.h | 2 ++ include/linux/sunrpc/xprt.h | 5 ++++- net/sunrpc/clnt.c | 27 ++++++++++++++++++++++----- net/sunrpc/svcsock.c | 3 ++- net/sunrpc/xprt.c | 12 ++++++++++++ net/sunrpc/xprtsock.c | 3 ++- 6 files changed, 44 insertions(+), 8 deletions(-) diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h index af70af3..ff1b121 100644 --- a/include/linux/sunrpc/xdr.h +++ b/include/linux/sunrpc/xdr.h @@ -16,6 +16,7 @@ #include <asm/byteorder.h> #include <asm/unaligned.h> #include <linux/scatterlist.h> +#include <linux/skbuff.h> /* * Buffer adjustment @@ -57,6 +58,7 @@ struct xdr_buf { tail[1]; /* Appended after page data */ struct page ** pages; /* Array of contiguous pages */ + struct skb_frag_destructor *destructor; unsigned int page_base, /* Start of page data */ page_len, /* Length of page data */ flags; /* Flags for data disposition */ diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h index 77d278d..e8d3f18 100644 --- a/include/linux/sunrpc/xprt.h +++ b/include/linux/sunrpc/xprt.h @@ -92,7 +92,10 @@ struct rpc_rqst { /* A cookie used to track the state of the transport connection */ - + struct skb_frag_destructor destructor; /* SKB paged fragment + * destructor for + * transmitted pages*/ + /* * Partial send handling */ diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c index 6797246..351bf3d 100644 --- a/net/sunrpc/clnt.c +++ b/net/sunrpc/clnt.c @@ -61,6 +61,7 @@ static void call_reserve(struct rpc_task *task); static void call_reserveresult(struct rpc_task *task); static void call_allocate(struct rpc_task *task); static void call_decode(struct rpc_task *task); +static void call_complete(struct rpc_task *task); static void call_bind(struct rpc_task *task); static void call_bind_status(struct rpc_task *task); static void call_transmit(struct rpc_task *task); @@ -1416,6 +1417,8 @@ rpc_xdr_encode(struct rpc_task *task) (char *)req->rq_buffer + req->rq_callsize, req->rq_rcvsize); + req->rq_snd_buf.destructor = &req->destructor; + p = rpc_encode_header(task); if (p == NULL) { printk(KERN_INFO "RPC: couldn't encode RPC header, exit EIO\n"); @@ -1581,6 +1584,7 @@ call_connect_status(struct rpc_task *task) static void 
call_transmit(struct rpc_task *task) { + struct rpc_rqst *req = task->tk_rqstp; dprint_status(task); task->tk_action = call_status; @@ -1614,8 +1618,8 @@ call_transmit(struct rpc_task *task) call_transmit_status(task); if (rpc_reply_expected(task)) return; - task->tk_action = rpc_exit_task; - rpc_wake_up_queued_task(&task->tk_xprt->pending, task); + task->tk_action = call_complete; + skb_frag_destructor_unref(&req->destructor); } /* @@ -1688,7 +1692,8 @@ call_bc_transmit(struct rpc_task *task) return; } - task->tk_action = rpc_exit_task; + task->tk_action = call_complete; + skb_frag_destructor_unref(&req->destructor); if (task->tk_status < 0) { printk(KERN_NOTICE "RPC: Could not send backchannel reply " "error: %d\n", task->tk_status); @@ -1728,7 +1733,6 @@ call_bc_transmit(struct rpc_task *task) "error: %d\n", task->tk_status); break; } - rpc_wake_up_queued_task(&req->rq_xprt->pending, task); } #endif /* CONFIG_SUNRPC_BACKCHANNEL */ @@ -1906,12 +1910,14 @@ call_decode(struct rpc_task *task) return; } - task->tk_action = rpc_exit_task; + task->tk_action = call_complete; if (decode) { task->tk_status = rpcauth_unwrap_resp(task, decode, req, p, task->tk_msg.rpc_resp); } + rpc_sleep_on(&req->rq_xprt->pending, task, NULL); + skb_frag_destructor_unref(&req->destructor); dprintk("RPC: %5u call_decode result %d\n", task->tk_pid, task->tk_status); return; @@ -1926,6 +1932,17 @@ out_retry: } } +/* + * 8. Wait for pages to be released by the network stack. + */ +static void +call_complete(struct rpc_task *task) +{ + dprintk("RPC: %5u call_complete result %d\n", + task->tk_pid, task->tk_status); + task->tk_action = rpc_exit_task; +} + static __be32 * rpc_encode_header(struct rpc_task *task) { diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index f6d8c73..1145929 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -198,7 +198,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr, while (pglen > 0) { if (slen == size) flags = 0; - result = kernel_sendpage(sock, *ppage, NULL, base, size, flags); + result = kernel_sendpage(sock, *ppage, xdr->destructor, + base, size, flags); if (result > 0) len += result; if (result != size) diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c index 6fe2dce..f8418a0 100644 --- a/net/sunrpc/xprt.c +++ b/net/sunrpc/xprt.c @@ -1108,6 +1108,16 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt) xprt->xid = net_random(); } +static int xprt_complete_skb_pages(struct skb_frag_destructor *destroy) +{ + struct rpc_rqst *req = + container_of(destroy, struct rpc_rqst, destructor); + + dprintk("RPC: %5u completing skb pages\n", req->rq_task->tk_pid); + rpc_wake_up_queued_task(&req->rq_xprt->pending, req->rq_task); + return 0; +} + static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt) { struct rpc_rqst *req = task->tk_rqstp; @@ -1120,6 +1130,8 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt) req->rq_xid = xprt_alloc_xid(xprt); req->rq_release_snd_buf = NULL; xprt_reset_majortimeo(req); + atomic_set(&req->destructor.ref, 1); + req->destructor.destroy = &xprt_complete_skb_pages; dprintk("RPC: %5u reserved req %p xid %08x\n", task->tk_pid, req, ntohl(req->rq_xid)); } diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c index f1995dc..44e07f3 100644 --- a/net/sunrpc/xprtsock.c +++ b/net/sunrpc/xprtsock.c @@ -408,7 +408,8 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i remainder -= len; if (remainder != 0 || more) flags |= MSG_MORE; - err = 
sock->ops->sendpage(sock, *ppage, NULL, base, len, flags); + err = sock->ops->sendpage(sock, *ppage, xdr->destructor, + base, len, flags); if (remainder == 0 || err != len) break; sent += err; -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 23+ messages in thread
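[Editorial note: one subtlety worth calling out in call_decode() above is the ordering of these two lines:

	rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
	skb_frag_destructor_unref(&req->destructor);

The task queues itself on the transport's pending queue before dropping its destructor reference, presumably because xprt_complete_skb_pages() wakes it via rpc_wake_up_queued_task(): if the unref came first, the callback could fire before the task is on the queue and the wakeup would be lost, leaving the task asleep in call_complete() forever.]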
* Re: [PATCH 9/9] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack. [not found] ` <1336056971-7839-9-git-send-email-ian.campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org> @ 2012-05-10 11:19 ` Michael S. Tsirkin [not found] ` <20120510111948.GA9609-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Michael S. Tsirkin @ 2012-05-10 11:19 UTC (permalink / raw) To: Ian Campbell Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David Miller, Eric Dumazet, Neil Brown, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA On Thu, May 03, 2012 at 03:56:11PM +0100, Ian Campbell wrote: > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c > index f6d8c73..1145929 100644 > --- a/net/sunrpc/svcsock.c > +++ b/net/sunrpc/svcsock.c > @@ -198,7 +198,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr, > while (pglen > 0) { > if (slen == size) > flags = 0; > - result = kernel_sendpage(sock, *ppage, NULL, base, size, flags); > + result = kernel_sendpage(sock, *ppage, xdr->destructor, > + base, size, flags); > if (result > 0) > len += result; > if (result != size) So I tried triggering this by simply creating an nfs export on localhost and copying a large file out with dd, but this never seems to trigger this code. Any idea how to test? -- MST ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 9/9] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack. [not found] ` <20120510111948.GA9609-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-05-10 13:26 ` Ian Campbell 0 siblings, 0 replies; 23+ messages in thread From: Ian Campbell @ 2012-05-10 13:26 UTC (permalink / raw) To: Michael S. Tsirkin Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, David Miller, Eric Dumazet, Neil Brown, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

[-- Attachment #1: Type: text/plain, Size: 1578 bytes --]

On Thu, 2012-05-10 at 12:19 +0100, Michael S. Tsirkin wrote:
> On Thu, May 03, 2012 at 03:56:11PM +0100, Ian Campbell wrote:
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index f6d8c73..1145929 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -198,7 +198,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
> >  	while (pglen > 0) {
> >  		if (slen == size)
> >  			flags = 0;
> > -		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
> > +		result = kernel_sendpage(sock, *ppage, xdr->destructor,
> > +					 base, size, flags);
> >  		if (result > 0)
> >  			len += result;
> >  		if (result != size)
>
> So I tried triggering this by simply creating an nfs export on localhost
> and copying a large file out with dd, but this never seems to trigger
> this code.
>
> Any idea how to test?

My test code, which is a bit overly complex for this because it also tries to demonstrate corruption on the wire, is attached. Using dd I suspect you probably need to increase the block size, and possibly enable O_DIRECT (oflag=direct).

My typical scenario has been to mount a remote NFS and run
	tcpdump -s 4096 -x -ne -v -i eth0 'host $client and ip[184:4] == 0x55555555'
to watch for on-wire corruption.

FYI I've triggered a BUG_ON in my local debug patches with your series applied; I'm just investigating whether it's my debugging or something in the series which causes it. After that I'll try it with local NFS and VMs with bridging etc. to test the extra aspects which your series is exercising.

Ian.

[-- Attachment #2: blktest3.c --]
[-- Type: text/x-csrc, Size: 1090 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>	/* for write() and lseek() */
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <err.h>
#include <errno.h>

#define NR 256
#define SIZE 4096

int main(int argc, char **argv)
{
	int fd, rc, iter = 0;
	const char *path;
	static unsigned char __attribute__ ((aligned (4096))) buf[NR][SIZE];

	if (argc != 2) {
		fprintf(stderr, "usage: blktest3 [PATH]\n");
		exit(1);
	}
	path = argv[1];

	printf("opening %s for O_DIRECT access\n", path);
	fd = open(path, O_CREAT/*|O_DIRECT*/|O_RDWR, 0666);
	if (fd < 0)
		err(1, "unable to open file");

	while (1) {
		if ((iter % 10) == 0) {
			printf("iteration %d ...", iter);
			fflush(stdout);
		}
		if (lseek(fd, (iter % NR) * SIZE, SEEK_SET) < 0)
			err(1, "seek for write, iteration %d", iter);
		/* fill the page with 0xaa, write it, then immediately
		 * scribble 0x55 over it: any later (re)transmission from
		 * the same page shows up as 0x55s on the wire. */
		memset(buf[iter % NR], 0xaa, SIZE);
		rc = write(fd, buf[iter % NR], SIZE);
		memset(buf[iter % NR], 0x55, SIZE);
		if (rc == -1)
			//warn("write failed");
			err(1, "write failed");
		else if (rc != SIZE)
			err(1, "only wrote %d/%d bytes", rc, SIZE);
		if ((iter % 10) == 0)
			printf("\n");
		iter++;
	}
}

^ permalink raw reply [flat|nested] 23+ messages in thread
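[Editorial note: for anyone reproducing the test, the attachment builds with a plain "cc -o blktest3 blktest3.c" and takes the path of a file on the NFS mount as its only argument, e.g. "./blktest3 /mnt/nfs/scratch", while the tcpdump filter quoted in the mail watches for the 0x55 fill pattern on the wire. These invocations are illustrative, not taken from the thread.]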