* [PATCH RFC 0/3] NFS41: optimize layoutget
@ 2012-08-08  2:03 Peng Tao
  2012-08-08  2:03 ` [PATCH RFC 1/3] NFS: track direct IO left bytes Peng Tao
  ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Peng Tao @ 2012-08-08 2:03 UTC (permalink / raw)
  To: bharrosh; +Cc: linux-nfs

The patches optimize layoutget by sending the real IO size instead of
just a small length in each layoutget. This helps the server make better
decisions about layout segment size and thus greatly reduces the number
of layoutget operations.

Boaz, please see if you like the behavior in the generic layout code. If
not, I can make it block layout only.

Thanks,
Tao

Peng Tao (3):
  NFS: track direct IO left bytes
  NFS41: send real write size in layoutget
  NFS41: send real read size in layoutget for DIO

 fs/nfs/direct.c   |   12 +++++++++
 fs/nfs/internal.h |    1 +
 fs/nfs/pnfs.c     |   66 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 76 insertions(+), 3 deletions(-)
* [PATCH RFC 1/3] NFS: track direct IO left bytes 2012-08-08 2:03 [PATCH RFC 0/3] NFS41: optimize layoutget Peng Tao @ 2012-08-08 2:03 ` Peng Tao 2012-08-08 2:03 ` [PATCH RFC 2/3] NFS41: send real write size in layoutget Peng Tao 2012-08-08 2:03 ` [PATCH RFC 3/3] NFS41: send real read size in layoutget for DIO Peng Tao 2 siblings, 0 replies; 15+ messages in thread From: Peng Tao @ 2012-08-08 2:03 UTC (permalink / raw) To: bharrosh; +Cc: linux-nfs, Peng Tao From: Peng Tao <tao.peng@emc.com> Signed-off-by: Peng Tao <tao.peng@emc.com> --- fs/nfs/direct.c | 5 +++++ 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c index 34a6282..c39f775 100644 --- a/fs/nfs/direct.c +++ b/fs/nfs/direct.c @@ -78,6 +78,7 @@ struct nfs_direct_req { atomic_t io_count; /* i/os we're waiting for */ spinlock_t lock; /* protect completion state */ ssize_t count, /* bytes actually processed */ + bytes_left, /* bytes left to be sent */ error; /* any reported error */ struct completion completion; /* wait for i/o completion */ @@ -390,6 +391,7 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_pageio_descriptor *de user_addr += req_len; pos += req_len; count -= req_len; + dreq->bytes_left -= req_len; } /* The nfs_page now hold references to these pages */ nfs_direct_release_pages(pagevec, npages); @@ -456,6 +458,7 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov, goto out; dreq->inode = inode; + dreq->bytes_left = iov_length(iov, nr_segs); dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp)); dreq->l_ctx = nfs_get_lock_context(dreq->ctx); if (dreq->l_ctx == NULL) @@ -706,6 +709,7 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_pageio_descriptor *d user_addr += req_len; pos += req_len; count -= req_len; + dreq->bytes_left -= req_len; } /* The nfs_page now hold references to these pages */ nfs_direct_release_pages(pagevec, npages); @@ -855,6 +859,7 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov, goto out; dreq->inode = inode; + dreq->bytes_left = count; dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp)); dreq->l_ctx = nfs_get_lock_context(dreq->ctx); if (dreq->l_ctx == NULL) -- 1.7.1.262.g5ef3d ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH RFC 2/3] NFS41: send real write size in layoutget 2012-08-08 2:03 [PATCH RFC 0/3] NFS41: optimize layoutget Peng Tao 2012-08-08 2:03 ` [PATCH RFC 1/3] NFS: track direct IO left bytes Peng Tao @ 2012-08-08 2:03 ` Peng Tao 2012-08-08 18:50 ` Myklebust, Trond 2012-08-12 18:30 ` Boaz Harrosh 2012-08-08 2:03 ` [PATCH RFC 3/3] NFS41: send real read size in layoutget for DIO Peng Tao 2 siblings, 2 replies; 15+ messages in thread From: Peng Tao @ 2012-08-08 2:03 UTC (permalink / raw) To: bharrosh; +Cc: linux-nfs, Peng Tao From: Peng Tao <tao.peng@emc.com> For bufferred write, scan dirty pages to find out longest continuous dirty pages. In this case, also allow layout driver to specify a maximum layoutget size which is useful to avoid busy scanning dirty pages for block layout client. For direct write, just use dreq->bytes_left. Signed-off-by: Peng Tao <tao.peng@emc.com> --- fs/nfs/direct.c | 7 ++++++ fs/nfs/internal.h | 1 + fs/nfs/pnfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 64 insertions(+), 2 deletions(-) diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c index c39f775..c1899dd 100644 --- a/fs/nfs/direct.c +++ b/fs/nfs/direct.c @@ -46,6 +46,7 @@ #include <linux/kref.h> #include <linux/slab.h> #include <linux/task_io_accounting_ops.h> +#include <linux/module.h> #include <linux/nfs_fs.h> #include <linux/nfs_page.h> @@ -191,6 +192,12 @@ static void nfs_direct_req_release(struct nfs_direct_req *dreq) kref_put(&dreq->kref, nfs_direct_req_free); } +ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq) +{ + return dreq->bytes_left; +} +EXPORT_SYMBOL_GPL(nfs_dreq_bytes_left); + /* * Collects and returns the final error value/byte-count. */ diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 31fdb03..e68d329 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -464,6 +464,7 @@ static inline void nfs_inode_dio_wait(struct inode *inode) { inode_dio_wait(inode); } +extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq); /* nfs4proc.c */ extern void __nfs4_read_done_cb(struct nfs_read_data *); diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c index 2e00fea..e61a373 100644 --- a/fs/nfs/pnfs.c +++ b/fs/nfs/pnfs.c @@ -29,6 +29,7 @@ #include <linux/nfs_fs.h> #include <linux/nfs_page.h> +#include <linux/pagevec.h> #include <linux/module.h> #include "internal.h" #include "pnfs.h" @@ -1172,19 +1173,72 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r } EXPORT_SYMBOL_GPL(pnfs_generic_pg_init_read); +/* + * Return the number of contiguous bytes in dirty pages for a given inode + * starting at page frame idx. 
+ */ +static u64 pnfs_num_dirty_bytes(struct inode *inode, pgoff_t idx) +{ + struct address_space *mapping = inode->i_mapping; + pgoff_t index; + struct pagevec pvec; + pgoff_t num = 1; /* self */ + int i, done = 0; + + pagevec_init(&pvec, 0); + idx++; /* self */ + while (!done) { + index = idx; + pagevec_lookup_tag(&pvec, mapping, &index, + PAGECACHE_TAG_DIRTY, (pgoff_t)PAGEVEC_SIZE); + if (pagevec_count(&pvec) == 0) + break; + + for (i = 0; i < pagevec_count(&pvec); i++) { + struct page *page = pvec.pages[i]; + + lock_page(page); + if (unlikely(page->mapping != mapping) || + !PageDirty(page) || + PageWriteback(page) || + page->index != idx) { + done = 1; + unlock_page(page); + break; + } + unlock_page(page); + if (done) + break; + idx++; + num++; + } + pagevec_release(&pvec); + } + return num << PAGE_CACHE_SHIFT; +} + void -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req) +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, + struct nfs_page *req) { + u64 wb_size; + BUG_ON(pgio->pg_lseg != NULL); if (req->wb_offset != req->wb_pgbase) { nfs_pageio_reset_write_mds(pgio); return; } + + if (pgio->pg_dreq == NULL) + wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index); + else + wb_size = nfs_dreq_bytes_left(pgio->pg_dreq); + pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode, req->wb_context, req_offset(req), - req->wb_bytes, + wb_size?:req->wb_bytes, IOMODE_RW, GFP_NOFS); /* If no lseg, fall back to write through mds */ -- 1.7.1.262.g5ef3d ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget
  2012-08-08  2:03 ` [PATCH RFC 2/3] NFS41: send real write size in layoutget Peng Tao
@ 2012-08-08 18:50   ` Myklebust, Trond
  2012-08-09  2:24     ` Peng Tao
  0 siblings, 1 reply; 15+ messages in thread
From: Myklebust, Trond @ 2012-08-08 18:50 UTC (permalink / raw)
  To: Peng Tao; +Cc: bharrosh@panasas.com, linux-nfs@vger.kernel.org, Peng Tao

On Wed, 2012-08-08 at 10:03 +0800, Peng Tao wrote:
> From: Peng Tao <tao.peng@emc.com>
>
> For bufferred write, scan dirty pages to find out longest continuous
> dirty pages. In this case, also allow layout driver to specify a
> maximum layoutget size which is useful to avoid busy scanning dirty pages
> for block layout client.

This is _really_ ugly, and is a source of potential deadlocks if
multiple threads are scanning+locking pages at the same time.

Why not simplify this by dropping the complicated tests for whether or
not the pages are dirty? That gets rid of the deadlock-prone lock_page()
calls, and would enable you to just use radix_tree_next_hole().

> For direct write, just use dreq->bytes_left.

This makes sense.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget
  2012-08-08 18:50   ` Myklebust, Trond
@ 2012-08-09  2:24     ` Peng Tao
  0 siblings, 0 replies; 15+ messages in thread
From: Peng Tao @ 2012-08-09 2:24 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: bharrosh@panasas.com, linux-nfs@vger.kernel.org

On Thu, Aug 9, 2012 at 2:50 AM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Wed, 2012-08-08 at 10:03 +0800, Peng Tao wrote:
>> From: Peng Tao <tao.peng@emc.com>
>>
>> For bufferred write, scan dirty pages to find out longest continuous
>> dirty pages. In this case, also allow layout driver to specify a
>> maximum layoutget size which is useful to avoid busy scanning dirty pages
>> for block layout client.
>
> This is _really_ ugly, and is a source of potential deadlocks if
> multiple threads are scanning+locking pages at the same time.
Multiple threads won't be able to scan into the same region, as the
borders are protected by the page writeback bit.
>
> Why not simplify this by dropping the complicated tests for whether or
> not the pages are dirty? That gets rid of the deadlock-prone lock_page()
> calls, and would enable you to just use radix_tree_next_hole().
>
OK. I will cook a patch to drop the complicated page dirty/writeback
checks and just use radix_tree_next_hole().

--
Thanks,
Tao
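For reference, a minimal sketch of the simplified helper Trond is suggesting
and Peng Tao agrees to implement: walk the page-cache radix tree with
radix_tree_next_hole() instead of locking and testing each page, and
short-circuit the common case of a file populated contiguously from offset 0.
The function name pnfs_num_cont_bytes and the guard for requests past i_size
are illustrative assumptions, not the code that was eventually posted:

	/*
	 * Best-effort hint only: how many contiguous bytes are cached
	 * starting at page frame idx. No page locking, no dirty checks.
	 */
	static u64 pnfs_num_cont_bytes(struct inode *inode, pgoff_t idx)
	{
		struct address_space *mapping = inode->i_mapping;
		pgoff_t end;

		/* Optimize the common case of a dense file written from 0 */
		end = DIV_ROUND_UP(i_size_read(inode), PAGE_CACHE_SIZE);
		if (end != NFS_I(inode)->npages) {
			rcu_read_lock();
			end = radix_tree_next_hole(&mapping->page_tree,
						   idx + 1, ULONG_MAX);
			rcu_read_unlock();
		}

		if (end <= idx)		/* e.g. request extends past i_size */
			return PAGE_CACHE_SIZE;
		return (u64)(end - idx) << PAGE_CACHE_SHIFT;
	}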
* Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget 2012-08-08 2:03 ` [PATCH RFC 2/3] NFS41: send real write size in layoutget Peng Tao 2012-08-08 18:50 ` Myklebust, Trond @ 2012-08-12 18:30 ` Boaz Harrosh 2012-08-12 18:40 ` Boaz Harrosh ` (2 more replies) 1 sibling, 3 replies; 15+ messages in thread From: Boaz Harrosh @ 2012-08-12 18:30 UTC (permalink / raw) To: Peng Tao; +Cc: linux-nfs, Peng Tao On 08/08/2012 05:03 AM, Peng Tao wrote: > From: Peng Tao <tao.peng@emc.com> > > For bufferred write, scan dirty pages to find out longest continuous > dirty pages. In this case, also allow layout driver to specify a > maximum layoutget size which is useful to avoid busy scanning dirty pages > for block layout client. > > For direct write, just use dreq->bytes_left. > > Signed-off-by: Peng Tao <tao.peng@emc.com> > --- > fs/nfs/direct.c | 7 ++++++ > fs/nfs/internal.h | 1 + > fs/nfs/pnfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++- > 3 files changed, 64 insertions(+), 2 deletions(-) > > diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c > index c39f775..c1899dd 100644 > --- a/fs/nfs/direct.c > +++ b/fs/nfs/direct.c > @@ -46,6 +46,7 @@ > #include <linux/kref.h> > #include <linux/slab.h> > #include <linux/task_io_accounting_ops.h> > +#include <linux/module.h> > > #include <linux/nfs_fs.h> > #include <linux/nfs_page.h> > @@ -191,6 +192,12 @@ static void nfs_direct_req_release(struct nfs_direct_req *dreq) > kref_put(&dreq->kref, nfs_direct_req_free); > } > > +ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq) > +{ > + return dreq->bytes_left; > +} > +EXPORT_SYMBOL_GPL(nfs_dreq_bytes_left); > + > /* > * Collects and returns the final error value/byte-count. > */ > diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h > index 31fdb03..e68d329 100644 > --- a/fs/nfs/internal.h > +++ b/fs/nfs/internal.h > @@ -464,6 +464,7 @@ static inline void nfs_inode_dio_wait(struct inode *inode) > { > inode_dio_wait(inode); > } > +extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq); > Why is this an EXPORT_SYMBOL_GPL at .c file. Why not just an inline here ? > /* nfs4proc.c */ > extern void __nfs4_read_done_cb(struct nfs_read_data *); > diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c > index 2e00fea..e61a373 100644 > --- a/fs/nfs/pnfs.c > +++ b/fs/nfs/pnfs.c > @@ -29,6 +29,7 @@ > > #include <linux/nfs_fs.h> > #include <linux/nfs_page.h> > +#include <linux/pagevec.h> > #include <linux/module.h> > #include "internal.h" > #include "pnfs.h" > @@ -1172,19 +1173,72 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r > } > EXPORT_SYMBOL_GPL(pnfs_generic_pg_init_read); > > +/* > + * Return the number of contiguous bytes in dirty pages for a given inode > + * starting at page frame idx. 
> + */ > +static u64 pnfs_num_dirty_bytes(struct inode *inode, pgoff_t idx) > +{ > + struct address_space *mapping = inode->i_mapping; > + pgoff_t index; > + struct pagevec pvec; > + pgoff_t num = 1; /* self */ > + int i, done = 0; > + > + pagevec_init(&pvec, 0); > + idx++; /* self */ > + while (!done) { > + index = idx; > + pagevec_lookup_tag(&pvec, mapping, &index, > + PAGECACHE_TAG_DIRTY, (pgoff_t)PAGEVEC_SIZE); > + if (pagevec_count(&pvec) == 0) > + break; > + > + for (i = 0; i < pagevec_count(&pvec); i++) { > + struct page *page = pvec.pages[i]; > + > + lock_page(page); > + if (unlikely(page->mapping != mapping) || > + !PageDirty(page) || > + PageWriteback(page) || > + page->index != idx) { > + done = 1; > + unlock_page(page); > + break; > + } > + unlock_page(page); > + if (done) > + break; > + idx++; > + num++; > + } > + pagevec_release(&pvec); > + } > + return num << PAGE_CACHE_SHIFT; > +} > + Same as what Trond said. radix_tree_next_hole() should be nicer. We need never do any locking this is just an hint, and not a transaction guaranty. Best first guess approximation is all we need. Also you might want to optimize for the most common case of a linear write from zero. This you can do by comparing i_size / PAGE_SIZE and the number of dirty pages, if they are the same you know there are no holes, and need not scan. > void > -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req) > +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, > + struct nfs_page *req) Nothing changed here, please don't do this > { > + u64 wb_size; > + > BUG_ON(pgio->pg_lseg != NULL); > > if (req->wb_offset != req->wb_pgbase) { > nfs_pageio_reset_write_mds(pgio); > return; > } > + > + if (pgio->pg_dreq == NULL) > + wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index); > + else > + wb_size = nfs_dreq_bytes_left(pgio->pg_dreq); > + > pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode, > req->wb_context, > req_offset(req), > - req->wb_bytes, > + wb_size?:req->wb_bytes, wb_size is always set above in the if() or else. No need to check here. > IOMODE_RW, > GFP_NOFS); > /* If no lseg, fall back to write through mds */ But No! much much better then last time, thank you for working on this but it is not optimum for objects and files (when "files" supports segments) For blocks, Yes radix_tree_next_hole() is the perfect fit. But for objects (and files) it is i_size_read(). The objects/files server usually determines it's topology according to file-size. And it does not have any bigger resources because of holes or no holes. (The files example I think of is CEPH) So for objects the wasting factor is the actual i_size extending as a cause of layout_get, and not the number of pages served. So for us the gain is if client, that has a much newer information about i_size, sends it on first layout_get. Though extending file size only once on first layout_get and not on every layout_get. So the small change I want is: +enum pnfs_layout_get_strategy { + PLGS_USE_ISIZE, + PLGS_SEARCH_FIRST_HOLE, + PLGS_ALL_FILE, +}; -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req) +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req, + enum pnfs_layout_get_strategy plgs) ... 
+ if (pgio->pg_dreq == NULL) { + switch (plgs) { + case PLGS_SEARCH_FIRST_HOLE: + wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index); + break; + case PLGS_USE_ISIZE: + wb_size = i_size_read() - req_offset(req); + break; + case PLGS_ALL_FILE: + wb_size = NFS4_MAX_UINT64; + break; + default: + WARN_ON(1); + wb_size = PAGE_SIZE; + } + else + wb_size = nfs_dreq_bytes_left(pgio->pg_dreq); The call to pnfs_generic_pg_init_write is always made by the LD's .pg_init and each LD should pass the proper pnfs_layout_get_strategy flag. But yes, this is much more like it. Thanks again, nice progress. Thanks Boaz ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget
  2012-08-12 18:30   ` Boaz Harrosh
@ 2012-08-12 18:40     ` Boaz Harrosh
  2012-08-13  6:15     ` Peng Tao
  2012-08-13  9:44     ` Peng Tao
  2 siblings, 0 replies; 15+ messages in thread
From: Boaz Harrosh @ 2012-08-12 18:40 UTC (permalink / raw)
  To: Peng Tao; +Cc: linux-nfs, Peng Tao

On 08/12/2012 09:30 PM, Boaz Harrosh wrote:
> So for objects the wasting factor is the actual i_size extending as a cause
> of layout_get, and not the number of pages served. So for us the gain is if
> client, that has a much newer information about i_size, sends it on first
> layout_get. Though extending file size only once on first layout_get and
> not on every layout_get.
>

I want to clarify here. The i_size does not and must not grow as part of
a layout_get. Only a layout_commit might extend i_size.

The "file-size" I meant above is the current maximum size that can be
described by the inode's layout device-map. The device map does grow on
layout_get, both for objects and, for example, for a CEPH cluster.

If we send i_size from the client, then we only need to extend the
device-map once during the complete writeout (if it needs extending at all).

Thanks
Boaz
* Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget 2012-08-12 18:30 ` Boaz Harrosh 2012-08-12 18:40 ` Boaz Harrosh @ 2012-08-13 6:15 ` Peng Tao 2012-08-13 9:44 ` Peng Tao 2 siblings, 0 replies; 15+ messages in thread From: Peng Tao @ 2012-08-13 6:15 UTC (permalink / raw) To: Boaz Harrosh; +Cc: linux-nfs On Mon, Aug 13, 2012 at 2:30 AM, Boaz Harrosh <bharrosh@panasas.com> wrote: > On 08/08/2012 05:03 AM, Peng Tao wrote: > >> From: Peng Tao <tao.peng@emc.com> >> >> For bufferred write, scan dirty pages to find out longest continuous >> dirty pages. In this case, also allow layout driver to specify a >> maximum layoutget size which is useful to avoid busy scanning dirty pages >> for block layout client. >> >> For direct write, just use dreq->bytes_left. >> >> Signed-off-by: Peng Tao <tao.peng@emc.com> >> --- >> fs/nfs/direct.c | 7 ++++++ >> fs/nfs/internal.h | 1 + >> fs/nfs/pnfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++- >> 3 files changed, 64 insertions(+), 2 deletions(-) >> >> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c >> index c39f775..c1899dd 100644 >> --- a/fs/nfs/direct.c >> +++ b/fs/nfs/direct.c >> @@ -46,6 +46,7 @@ >> #include <linux/kref.h> >> #include <linux/slab.h> >> #include <linux/task_io_accounting_ops.h> >> +#include <linux/module.h> >> >> #include <linux/nfs_fs.h> >> #include <linux/nfs_page.h> >> @@ -191,6 +192,12 @@ static void nfs_direct_req_release(struct nfs_direct_req *dreq) >> kref_put(&dreq->kref, nfs_direct_req_free); >> } >> >> +ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq) >> +{ >> + return dreq->bytes_left; >> +} >> +EXPORT_SYMBOL_GPL(nfs_dreq_bytes_left); >> + >> /* >> * Collects and returns the final error value/byte-count. >> */ >> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h >> index 31fdb03..e68d329 100644 >> --- a/fs/nfs/internal.h >> +++ b/fs/nfs/internal.h >> @@ -464,6 +464,7 @@ static inline void nfs_inode_dio_wait(struct inode *inode) >> { >> inode_dio_wait(inode); >> } >> +extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq); >> > > > Why is this an EXPORT_SYMBOL_GPL at .c file. Why not just an inline > here ? > struct nfs_direct_req is internal in direct.c. Cannot access its member structure outside without exported APIs. >> /* nfs4proc.c */ >> extern void __nfs4_read_done_cb(struct nfs_read_data *); >> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c >> index 2e00fea..e61a373 100644 >> --- a/fs/nfs/pnfs.c >> +++ b/fs/nfs/pnfs.c >> @@ -29,6 +29,7 @@ >> >> #include <linux/nfs_fs.h> >> #include <linux/nfs_page.h> >> +#include <linux/pagevec.h> >> #include <linux/module.h> >> #include "internal.h" >> #include "pnfs.h" >> @@ -1172,19 +1173,72 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r >> } >> EXPORT_SYMBOL_GPL(pnfs_generic_pg_init_read); >> >> +/* >> + * Return the number of contiguous bytes in dirty pages for a given inode >> + * starting at page frame idx. 
>> + */ >> +static u64 pnfs_num_dirty_bytes(struct inode *inode, pgoff_t idx) >> +{ >> + struct address_space *mapping = inode->i_mapping; >> + pgoff_t index; >> + struct pagevec pvec; >> + pgoff_t num = 1; /* self */ >> + int i, done = 0; >> + >> + pagevec_init(&pvec, 0); >> + idx++; /* self */ >> + while (!done) { >> + index = idx; >> + pagevec_lookup_tag(&pvec, mapping, &index, >> + PAGECACHE_TAG_DIRTY, (pgoff_t)PAGEVEC_SIZE); >> + if (pagevec_count(&pvec) == 0) >> + break; >> + >> + for (i = 0; i < pagevec_count(&pvec); i++) { >> + struct page *page = pvec.pages[i]; >> + >> + lock_page(page); >> + if (unlikely(page->mapping != mapping) || >> + !PageDirty(page) || >> + PageWriteback(page) || >> + page->index != idx) { >> + done = 1; >> + unlock_page(page); >> + break; >> + } >> + unlock_page(page); >> + if (done) >> + break; >> + idx++; >> + num++; >> + } >> + pagevec_release(&pvec); >> + } >> + return num << PAGE_CACHE_SHIFT; >> +} >> + > > > Same as what Trond said. radix_tree_next_hole() should be nicer. We need never > do any locking this is just an hint, and not a transaction guaranty. Best first > guess approximation is all we need. > > Also you might want to optimize for the most common case of a linear write from > zero. This you can do by comparing i_size / PAGE_SIZE and the number of dirty > pages, if they are the same you know there are no holes, and need not scan. > Sounds reasonable. Will do it in v2. >> void >> -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req) >> +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, >> + struct nfs_page *req) > > > Nothing changed here, please don't do this > >> { >> + u64 wb_size; >> + >> BUG_ON(pgio->pg_lseg != NULL); >> >> if (req->wb_offset != req->wb_pgbase) { >> nfs_pageio_reset_write_mds(pgio); >> return; >> } >> + >> + if (pgio->pg_dreq == NULL) >> + wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index); >> + else >> + wb_size = nfs_dreq_bytes_left(pgio->pg_dreq); >> + >> pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode, >> req->wb_context, >> req_offset(req), >> - req->wb_bytes, >> + wb_size?:req->wb_bytes, > > > wb_size is always set above in the if() or else. No need to check here. > >> IOMODE_RW, >> GFP_NOFS); >> /* If no lseg, fall back to write through mds */ > > > > But No! > > much much better then last time, thank you for working on this > but it is not optimum for objects and files > (when "files" supports segments) > > For blocks, Yes radix_tree_next_hole() is the perfect fit. But for > objects (and files) it is i_size_read(). The objects/files server usually > determines it's topology according to file-size. And it does not have any > bigger resources because of holes or no holes. (The files example I think of > is CEPH) > > So for objects the wasting factor is the actual i_size extending as a cause > of layout_get, and not the number of pages served. So for us the gain is if > client, that has a much newer information about i_size, sends it on first > layout_get. Though extending file size only once on first layout_get and > not on every layout_get. > > So the small change I want is: > > +enum pnfs_layout_get_strategy { > + PLGS_USE_ISIZE, > + PLGS_SEARCH_FIRST_HOLE, > + PLGS_ALL_FILE, > +}; > > -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req) > +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req, > + enum pnfs_layout_get_strategy plgs) > > > ... 
> > + if (pgio->pg_dreq == NULL) { > + switch (plgs) { > + case PLGS_SEARCH_FIRST_HOLE: > + wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index); > + break; > + case PLGS_USE_ISIZE: > + wb_size = i_size_read() - req_offset(req); > + break; > + case PLGS_ALL_FILE: > + wb_size = NFS4_MAX_UINT64; > + break; > + default: > + WARN_ON(1); > + wb_size = PAGE_SIZE; > + } > + else > + wb_size = nfs_dreq_bytes_left(pgio->pg_dreq); > > The call to pnfs_generic_pg_init_write is always made by the LD's .pg_init > and each LD should pass the proper pnfs_layout_get_strategy flag. > Will add it in v2. Thanks for reviewing. Cheers, Tao ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget 2012-08-12 18:30 ` Boaz Harrosh 2012-08-12 18:40 ` Boaz Harrosh 2012-08-13 6:15 ` Peng Tao @ 2012-08-13 9:44 ` Peng Tao 2012-08-13 20:13 ` Boaz Harrosh 2 siblings, 1 reply; 15+ messages in thread From: Peng Tao @ 2012-08-13 9:44 UTC (permalink / raw) To: Boaz Harrosh; +Cc: linux-nfs On Mon, Aug 13, 2012 at 2:30 AM, Boaz Harrosh <bharrosh@panasas.com> wrote: > On 08/08/2012 05:03 AM, Peng Tao wrote: > >> From: Peng Tao <tao.peng@emc.com> >> >> For bufferred write, scan dirty pages to find out longest continuous >> dirty pages. In this case, also allow layout driver to specify a >> maximum layoutget size which is useful to avoid busy scanning dirty pages >> for block layout client. >> >> For direct write, just use dreq->bytes_left. >> >> Signed-off-by: Peng Tao <tao.peng@emc.com> >> --- >> fs/nfs/direct.c | 7 ++++++ >> fs/nfs/internal.h | 1 + >> fs/nfs/pnfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++- >> 3 files changed, 64 insertions(+), 2 deletions(-) >> >> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c >> index c39f775..c1899dd 100644 >> --- a/fs/nfs/direct.c >> +++ b/fs/nfs/direct.c >> @@ -46,6 +46,7 @@ >> #include <linux/kref.h> >> #include <linux/slab.h> >> #include <linux/task_io_accounting_ops.h> >> +#include <linux/module.h> >> >> #include <linux/nfs_fs.h> >> #include <linux/nfs_page.h> >> @@ -191,6 +192,12 @@ static void nfs_direct_req_release(struct nfs_direct_req *dreq) >> kref_put(&dreq->kref, nfs_direct_req_free); >> } >> >> +ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq) >> +{ >> + return dreq->bytes_left; >> +} >> +EXPORT_SYMBOL_GPL(nfs_dreq_bytes_left); >> + >> /* >> * Collects and returns the final error value/byte-count. >> */ >> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h >> index 31fdb03..e68d329 100644 >> --- a/fs/nfs/internal.h >> +++ b/fs/nfs/internal.h >> @@ -464,6 +464,7 @@ static inline void nfs_inode_dio_wait(struct inode *inode) >> { >> inode_dio_wait(inode); >> } >> +extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq); >> > > > Why is this an EXPORT_SYMBOL_GPL at .c file. Why not just an inline > here ? > >> /* nfs4proc.c */ >> extern void __nfs4_read_done_cb(struct nfs_read_data *); >> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c >> index 2e00fea..e61a373 100644 >> --- a/fs/nfs/pnfs.c >> +++ b/fs/nfs/pnfs.c >> @@ -29,6 +29,7 @@ >> >> #include <linux/nfs_fs.h> >> #include <linux/nfs_page.h> >> +#include <linux/pagevec.h> >> #include <linux/module.h> >> #include "internal.h" >> #include "pnfs.h" >> @@ -1172,19 +1173,72 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r >> } >> EXPORT_SYMBOL_GPL(pnfs_generic_pg_init_read); >> >> +/* >> + * Return the number of contiguous bytes in dirty pages for a given inode >> + * starting at page frame idx. 
>> + */ >> +static u64 pnfs_num_dirty_bytes(struct inode *inode, pgoff_t idx) >> +{ >> + struct address_space *mapping = inode->i_mapping; >> + pgoff_t index; >> + struct pagevec pvec; >> + pgoff_t num = 1; /* self */ >> + int i, done = 0; >> + >> + pagevec_init(&pvec, 0); >> + idx++; /* self */ >> + while (!done) { >> + index = idx; >> + pagevec_lookup_tag(&pvec, mapping, &index, >> + PAGECACHE_TAG_DIRTY, (pgoff_t)PAGEVEC_SIZE); >> + if (pagevec_count(&pvec) == 0) >> + break; >> + >> + for (i = 0; i < pagevec_count(&pvec); i++) { >> + struct page *page = pvec.pages[i]; >> + >> + lock_page(page); >> + if (unlikely(page->mapping != mapping) || >> + !PageDirty(page) || >> + PageWriteback(page) || >> + page->index != idx) { >> + done = 1; >> + unlock_page(page); >> + break; >> + } >> + unlock_page(page); >> + if (done) >> + break; >> + idx++; >> + num++; >> + } >> + pagevec_release(&pvec); >> + } >> + return num << PAGE_CACHE_SHIFT; >> +} >> + > > > Same as what Trond said. radix_tree_next_hole() should be nicer. We need never > do any locking this is just an hint, and not a transaction guaranty. Best first > guess approximation is all we need. > > Also you might want to optimize for the most common case of a linear write from > zero. This you can do by comparing i_size / PAGE_SIZE and the number of dirty > pages, if they are the same you know there are no holes, and need not scan. > >> void >> -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req) >> +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, >> + struct nfs_page *req) > > > Nothing changed here, please don't do this > >> { >> + u64 wb_size; >> + >> BUG_ON(pgio->pg_lseg != NULL); >> >> if (req->wb_offset != req->wb_pgbase) { >> nfs_pageio_reset_write_mds(pgio); >> return; >> } >> + >> + if (pgio->pg_dreq == NULL) >> + wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index); >> + else >> + wb_size = nfs_dreq_bytes_left(pgio->pg_dreq); >> + >> pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode, >> req->wb_context, >> req_offset(req), >> - req->wb_bytes, >> + wb_size?:req->wb_bytes, > > > wb_size is always set above in the if() or else. No need to check here. > >> IOMODE_RW, >> GFP_NOFS); >> /* If no lseg, fall back to write through mds */ > > > > But No! > > much much better then last time, thank you for working on this > but it is not optimum for objects and files > (when "files" supports segments) > > For blocks, Yes radix_tree_next_hole() is the perfect fit. But for > objects (and files) it is i_size_read(). The objects/files server usually > determines it's topology according to file-size. And it does not have any > bigger resources because of holes or no holes. (The files example I think of > is CEPH) > > So for objects the wasting factor is the actual i_size extending as a cause > of layout_get, and not the number of pages served. So for us the gain is if > client, that has a much newer information about i_size, sends it on first > layout_get. Though extending file size only once on first layout_get and > not on every layout_get. > > So the small change I want is: > > +enum pnfs_layout_get_strategy { > + PLGS_USE_ISIZE, > + PLGS_SEARCH_FIRST_HOLE, > + PLGS_ALL_FILE, > +}; > Just a second thought, since each layout driver would use one strategy, it is more reasonable to set the strategy in pnfs_curr_ld->flags instead of changing pg_init API to pass it in. I will do it this way. -- Thanks, Tao ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget 2012-08-13 9:44 ` Peng Tao @ 2012-08-13 20:13 ` Boaz Harrosh 2012-08-13 20:21 ` Myklebust, Trond 0 siblings, 1 reply; 15+ messages in thread From: Boaz Harrosh @ 2012-08-13 20:13 UTC (permalink / raw) To: Peng Tao; +Cc: linux-nfs On 08/13/2012 12:44 PM, Peng Tao wrote: > On Mon, Aug 13, 2012 at 2:30 AM, Boaz Harrosh <bharrosh@panasas.com> wrote: >> On 08/08/2012 05:03 AM, Peng Tao wrote: >> >>> From: Peng Tao <tao.peng@emc.com> >>> >>> For bufferred write, scan dirty pages to find out longest continuous >>> dirty pages. In this case, also allow layout driver to specify a >>> maximum layoutget size which is useful to avoid busy scanning dirty pages >>> for block layout client. >>> >>> For direct write, just use dreq->bytes_left. >>> >>> Signed-off-by: Peng Tao <tao.peng@emc.com> >>> --- >>> fs/nfs/direct.c | 7 ++++++ >>> fs/nfs/internal.h | 1 + >>> fs/nfs/pnfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++- >>> 3 files changed, 64 insertions(+), 2 deletions(-) >>> >>> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c >>> index c39f775..c1899dd 100644 >>> --- a/fs/nfs/direct.c >>> +++ b/fs/nfs/direct.c >>> @@ -46,6 +46,7 @@ >>> #include <linux/kref.h> >>> #include <linux/slab.h> >>> #include <linux/task_io_accounting_ops.h> >>> +#include <linux/module.h> >>> >>> #include <linux/nfs_fs.h> >>> #include <linux/nfs_page.h> >>> @@ -191,6 +192,12 @@ static void nfs_direct_req_release(struct nfs_direct_req *dreq) >>> kref_put(&dreq->kref, nfs_direct_req_free); >>> } >>> >>> +ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq) >>> +{ >>> + return dreq->bytes_left; >>> +} >>> +EXPORT_SYMBOL_GPL(nfs_dreq_bytes_left); >>> + >>> /* >>> * Collects and returns the final error value/byte-count. >>> */ >>> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h >>> index 31fdb03..e68d329 100644 >>> --- a/fs/nfs/internal.h >>> +++ b/fs/nfs/internal.h >>> @@ -464,6 +464,7 @@ static inline void nfs_inode_dio_wait(struct inode *inode) >>> { >>> inode_dio_wait(inode); >>> } >>> +extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq); >>> >> >> >> Why is this an EXPORT_SYMBOL_GPL at .c file. Why not just an inline >> here ? >> >>> /* nfs4proc.c */ >>> extern void __nfs4_read_done_cb(struct nfs_read_data *); >>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c >>> index 2e00fea..e61a373 100644 >>> --- a/fs/nfs/pnfs.c >>> +++ b/fs/nfs/pnfs.c >>> @@ -29,6 +29,7 @@ >>> >>> #include <linux/nfs_fs.h> >>> #include <linux/nfs_page.h> >>> +#include <linux/pagevec.h> >>> #include <linux/module.h> >>> #include "internal.h" >>> #include "pnfs.h" >>> @@ -1172,19 +1173,72 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r >>> } >>> EXPORT_SYMBOL_GPL(pnfs_generic_pg_init_read); >>> >>> +/* >>> + * Return the number of contiguous bytes in dirty pages for a given inode >>> + * starting at page frame idx. 
>>> + */ >>> +static u64 pnfs_num_dirty_bytes(struct inode *inode, pgoff_t idx) >>> +{ >>> + struct address_space *mapping = inode->i_mapping; >>> + pgoff_t index; >>> + struct pagevec pvec; >>> + pgoff_t num = 1; /* self */ >>> + int i, done = 0; >>> + >>> + pagevec_init(&pvec, 0); >>> + idx++; /* self */ >>> + while (!done) { >>> + index = idx; >>> + pagevec_lookup_tag(&pvec, mapping, &index, >>> + PAGECACHE_TAG_DIRTY, (pgoff_t)PAGEVEC_SIZE); >>> + if (pagevec_count(&pvec) == 0) >>> + break; >>> + >>> + for (i = 0; i < pagevec_count(&pvec); i++) { >>> + struct page *page = pvec.pages[i]; >>> + >>> + lock_page(page); >>> + if (unlikely(page->mapping != mapping) || >>> + !PageDirty(page) || >>> + PageWriteback(page) || >>> + page->index != idx) { >>> + done = 1; >>> + unlock_page(page); >>> + break; >>> + } >>> + unlock_page(page); >>> + if (done) >>> + break; >>> + idx++; >>> + num++; >>> + } >>> + pagevec_release(&pvec); >>> + } >>> + return num << PAGE_CACHE_SHIFT; >>> +} >>> + >> >> >> Same as what Trond said. radix_tree_next_hole() should be nicer. We need never >> do any locking this is just an hint, and not a transaction guaranty. Best first >> guess approximation is all we need. >> >> Also you might want to optimize for the most common case of a linear write from >> zero. This you can do by comparing i_size / PAGE_SIZE and the number of dirty >> pages, if they are the same you know there are no holes, and need not scan. >> >>> void >>> -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req) >>> +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, >>> + struct nfs_page *req) >> >> >> Nothing changed here, please don't do this >> >>> { >>> + u64 wb_size; >>> + >>> BUG_ON(pgio->pg_lseg != NULL); >>> >>> if (req->wb_offset != req->wb_pgbase) { >>> nfs_pageio_reset_write_mds(pgio); >>> return; >>> } >>> + >>> + if (pgio->pg_dreq == NULL) >>> + wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index); >>> + else >>> + wb_size = nfs_dreq_bytes_left(pgio->pg_dreq); >>> + >>> pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode, >>> req->wb_context, >>> req_offset(req), >>> - req->wb_bytes, >>> + wb_size?:req->wb_bytes, >> >> >> wb_size is always set above in the if() or else. No need to check here. >> >>> IOMODE_RW, >>> GFP_NOFS); >>> /* If no lseg, fall back to write through mds */ >> >> >> >> But No! >> >> much much better then last time, thank you for working on this >> but it is not optimum for objects and files >> (when "files" supports segments) >> >> For blocks, Yes radix_tree_next_hole() is the perfect fit. But for >> objects (and files) it is i_size_read(). The objects/files server usually >> determines it's topology according to file-size. And it does not have any >> bigger resources because of holes or no holes. (The files example I think of >> is CEPH) >> >> So for objects the wasting factor is the actual i_size extending as a cause >> of layout_get, and not the number of pages served. So for us the gain is if >> client, that has a much newer information about i_size, sends it on first >> layout_get. Though extending file size only once on first layout_get and >> not on every layout_get. >> >> So the small change I want is: >> >> +enum pnfs_layout_get_strategy { >> + PLGS_USE_ISIZE, >> + PLGS_SEARCH_FIRST_HOLE, >> + PLGS_ALL_FILE, >> +}; >> > Just a second thought, since each layout driver would use one > strategy, it is more reasonable to set the strategy in > pnfs_curr_ld->flags instead of changing pg_init API to pass it in. 
I > will do it this way. > It's fine, as you see fit. I think it's more flexible this way but both ways will work for now. Please note that for files, once it will support segments, it would want to use i_size like objects. Thanks Boaz ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget
  2012-08-13 20:13       ` Boaz Harrosh
@ 2012-08-13 20:21         ` Myklebust, Trond
  0 siblings, 0 replies; 15+ messages in thread
From: Myklebust, Trond @ 2012-08-13 20:21 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Peng Tao, linux-nfs@vger.kernel.org

On Mon, 2012-08-13 at 23:13 +0300, Boaz Harrosh wrote:
> On 08/13/2012 12:44 PM, Peng Tao wrote:
>
> > On Mon, Aug 13, 2012 at 2:30 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
> >> So the small change I want is:
> >>
> >> +enum pnfs_layout_get_strategy {
> >> +       PLGS_USE_ISIZE,
> >> +       PLGS_SEARCH_FIRST_HOLE,
> >> +       PLGS_ALL_FILE,
> >> +};
> >>
> > Just a second thought, since each layout driver would use one
> > strategy, it is more reasonable to set the strategy in
> > pnfs_curr_ld->flags instead of changing pg_init API to pass it in. I
> > will do it this way.
> >
>
>
> It's fine, as you see fit. I think it's more flexible this way but
> both ways will work for now.
>
> Please note that for files, once it will support segments, it would
> want to use i_size like objects.

Layout-specific flags in the generic code are not acceptable. If the
strategy is layout specific, then there should be no need to pass it
around the generic layers at all. Just do the right thing in the driver
(i.e. pg_init) and we're done.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
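Trond's point above — keep the per-layout strategy inside each driver's own
.pg_init — could look roughly like this for the objects layout driver,
assuming pnfs_generic_pg_init_write() grows a wb_size argument and using
nfs_dreq_bytes_left() from patch 1/3. The function name and the extra
parameter are illustrative, not necessarily the eventual upstream code:

	/* objlayout .pg_init (sketch): pick the i_size strategy in the LD */
	static void objio_init_write(struct nfs_pageio_descriptor *pgio,
				     struct nfs_page *req)
	{
		u64 wb_size;

		if (pgio->pg_dreq == NULL)
			/* objects/files style: everything up to current i_size */
			wb_size = i_size_read(pgio->pg_inode) - req_offset(req);
		else
			wb_size = nfs_dreq_bytes_left(pgio->pg_dreq);

		pnfs_generic_pg_init_write(pgio, req, wb_size);
	}

The block layout driver would instead pass the result of its dirty-range
scan, so no strategy flag ever leaks into the generic layers.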
* [PATCH RFC 3/3] NFS41: send real read size in layoutget for DIO 2012-08-08 2:03 [PATCH RFC 0/3] NFS41: optimize layoutget Peng Tao 2012-08-08 2:03 ` [PATCH RFC 1/3] NFS: track direct IO left bytes Peng Tao 2012-08-08 2:03 ` [PATCH RFC 2/3] NFS41: send real write size in layoutget Peng Tao @ 2012-08-08 2:03 ` Peng Tao 2012-08-08 18:57 ` Myklebust, Trond 2 siblings, 1 reply; 15+ messages in thread From: Peng Tao @ 2012-08-08 2:03 UTC (permalink / raw) To: bharrosh; +Cc: linux-nfs, Peng Tao From: Peng Tao <tao.peng@emc.com> We don't have the real IO size information in buffer read, but for direct read, we can get it from dreq->bytes_left. Let's make use of it. Signed-off-by: Peng Tao <tao.peng@emc.com> --- fs/nfs/pnfs.c | 8 +++++++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c index e61a373..e4cfd1e 100644 --- a/fs/nfs/pnfs.c +++ b/fs/nfs/pnfs.c @@ -1154,16 +1154,22 @@ out_forget_reply: void pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *req) { + u64 rd_size = req->wb_bytes; + BUG_ON(pgio->pg_lseg != NULL); if (req->wb_offset != req->wb_pgbase) { nfs_pageio_reset_read_mds(pgio); return; } + + if (pgio->pg_dreq != NULL) + rd_size = nfs_dreq_bytes_left(pgio->pg_dreq); + pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode, req->wb_context, req_offset(req), - req->wb_bytes, + rd_size, IOMODE_READ, GFP_KERNEL); /* If no lseg, fall back to read through mds */ -- 1.7.1.262.g5ef3d ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH RFC 3/3] NFS41: send real read size in layoutget for DIO
  2012-08-08  2:03 ` [PATCH RFC 3/3] NFS41: send real read size in layoutget for DIO Peng Tao
@ 2012-08-08 18:57   ` Myklebust, Trond
  2012-08-09  2:30     ` Peng Tao
  0 siblings, 1 reply; 15+ messages in thread
From: Myklebust, Trond @ 2012-08-08 18:57 UTC (permalink / raw)
  To: Peng Tao; +Cc: bharrosh@panasas.com, linux-nfs@vger.kernel.org, Peng Tao

On Wed, 2012-08-08 at 10:03 +0800, Peng Tao wrote:
> From: Peng Tao <tao.peng@emc.com>
>
> We don't have the real IO size information in buffer read, but for direct
> read, we can get it from dreq->bytes_left. Let's make use of it.

What's wrong with using the nfs_readpages arguments? The readahead code
will tell you exactly how many pages it is trying to read.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: [PATCH RFC 3/3] NFS41: send real read size in layoutget for DIO 2012-08-08 18:57 ` Myklebust, Trond @ 2012-08-09 2:30 ` Peng Tao 2012-08-12 17:39 ` Boaz Harrosh 0 siblings, 1 reply; 15+ messages in thread From: Peng Tao @ 2012-08-09 2:30 UTC (permalink / raw) To: Myklebust, Trond; +Cc: bharrosh@panasas.com, linux-nfs@vger.kernel.org On Thu, Aug 9, 2012 at 2:57 AM, Myklebust, Trond <Trond.Myklebust@netapp.com> wrote: > On Wed, 2012-08-08 at 10:03 +0800, Peng Tao wrote: >> From: Peng Tao <tao.peng@emc.com> >> >> We don't have the real IO size information in buffer read, but for direct >> read, we can get it from dreq->bytes_left. Let's make use of it. > > What's wrong with using the nfs_readpages arguments? The readahead code > will tell you exactly how many pages it is trying to read. > The nr_pages information gets dropped before calling into pnfs code... How about something like bellow for buffer reads? (untested patch and needs to integrate with DIO size for sure). diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c index 2e00fea..cec79c2 100644 --- a/fs/nfs/pnfs.c +++ b/fs/nfs/pnfs.c @@ -1162,7 +1162,7 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode, req->wb_context, req_offset(req), - req->wb_bytes, + (unsigned)pgio->pg_layout_private, IOMODE_READ, GFP_KERNEL); /* If no lseg, fall back to read through mds */ diff --git a/fs/nfs/read.c b/fs/nfs/read.c index b6bdb18..64fb0e2 100644 --- a/fs/nfs/read.c +++ b/fs/nfs/read.c @@ -649,6 +649,7 @@ int nfs_readpages(struct file *filp, struct address_space *mapping, NFS_PROTO(inode)->read_pageio_init(&pgio, inode, &nfs_async_read_completion_ops); + pgio.pg_layout_private = (void *)nr_pages; ret = read_cache_pages(mapping, pages, readpage_async_filler, &desc); nfs_pageio_complete(&pgio); -- Thanks, Tao ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH RFC 3/3] NFS41: send real read size in layoutget for DIO
  2012-08-09  2:30     ` Peng Tao
@ 2012-08-12 17:39       ` Boaz Harrosh
  0 siblings, 0 replies; 15+ messages in thread
From: Boaz Harrosh @ 2012-08-12 17:39 UTC (permalink / raw)
  To: Peng Tao; +Cc: Myklebust, Trond, linux-nfs@vger.kernel.org

On 08/09/2012 05:30 AM, Peng Tao wrote:
> On Thu, Aug 9, 2012 at 2:57 AM, Myklebust, Trond
> <Trond.Myklebust@netapp.com> wrote:
>> On Wed, 2012-08-08 at 10:03 +0800, Peng Tao wrote:
>>> From: Peng Tao <tao.peng@emc.com>
>>>
>>> We don't have the real IO size information in buffer read, but for direct
>>> read, we can get it from dreq->bytes_left. Let's make use of it.
>>
>> What's wrong with using the nfs_readpages arguments? The readahead code
>> will tell you exactly how many pages it is trying to read.
>>
> The nr_pages information gets dropped before calling into pnfs code...
>
> How about something like bellow for buffer reads? (untested patch and
> needs to integrate with DIO size for sure).
>
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 2e00fea..cec79c2 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1162,7 +1162,7 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r
>  	pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
>  					   req->wb_context,
>  					   req_offset(req),
> -					   req->wb_bytes,
> +					   (unsigned)pgio->pg_layout_private,
>  					   IOMODE_READ,
>  					   GFP_KERNEL);
>  	/* If no lseg, fall back to read through mds */
> diff --git a/fs/nfs/read.c b/fs/nfs/read.c
> index b6bdb18..64fb0e2 100644
> --- a/fs/nfs/read.c
> +++ b/fs/nfs/read.c
> @@ -649,6 +649,7 @@ int nfs_readpages(struct file *filp, struct address_space *mapping,
>
>  	NFS_PROTO(inode)->read_pageio_init(&pgio, inode, &nfs_async_read_completion_ops);
>
> +	pgio.pg_layout_private = (void *)nr_pages;

NACK!!! pgio.pg_layout_private is for use by the LD only. Generic code
should never touch it. If you actually need it and this info is really
lost, please use a new member.

>  	ret = read_cache_pages(mapping, pages, readpage_async_filler, &desc);
>
>  	nfs_pageio_complete(&pgio);
>

For reads, for both files and objects, but I think also for blocks, the
optimum size is just offset-to-i_size. I do not see any resource
constraint at the server. If the extent list gets too big to be sent as
one RPC for the block layout (as can be the case for objects when
spanning groups), the server can just send the biggest chunk it wants
that fits, or consider some other optimum max size according to internal
topology knowledge.

The server may always send a smaller layout as long as it includes
"offset". If so, the client will return for the remainder. I test this
all the time and it works perfectly. Actually, for exofs, regardless of
the requested size I always return a full group_size for read-layouts of
the group that includes "offset". Only write-layouts need be sliced thin
in exofs.

Boaz
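Following Boaz's offset-to-i_size suggestion for reads, a sketch of how the
generic read-side pg_init could pick a layoutget length without threading
nr_pages through pg_layout_private. The buffered-read branch and the fallback
to req->wb_bytes are assumptions drawn from this thread, not necessarily what
ended up upstream:

	void
	pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio,
				  struct nfs_page *req)
	{
		u64 rd_size;

		BUG_ON(pgio->pg_lseg != NULL);

		if (req->wb_offset != req->wb_pgbase) {
			nfs_pageio_reset_read_mds(pgio);
			return;
		}

		if (pgio->pg_dreq != NULL)
			rd_size = nfs_dreq_bytes_left(pgio->pg_dreq);
		else if (req_offset(req) < i_size_read(pgio->pg_inode))
			/* buffered read: ask for everything up to i_size */
			rd_size = i_size_read(pgio->pg_inode) - req_offset(req);
		else
			rd_size = req->wb_bytes;

		pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
						   req->wb_context,
						   req_offset(req),
						   rd_size,
						   IOMODE_READ,
						   GFP_KERNEL);
		/* If no lseg, fall back to read through mds */
		if (pgio->pg_lseg == NULL)
			nfs_pageio_reset_read_mds(pgio);
	}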
end of thread, other threads:[~2012-08-13 20:21 UTC | newest]

Thread overview: 15+ messages
2012-08-08  2:03 [PATCH RFC 0/3] NFS41: optimize layoutget Peng Tao
2012-08-08  2:03 ` [PATCH RFC 1/3] NFS: track direct IO left bytes Peng Tao
2012-08-08  2:03 ` [PATCH RFC 2/3] NFS41: send real write size in layoutget Peng Tao
2012-08-08 18:50   ` Myklebust, Trond
2012-08-09  2:24     ` Peng Tao
2012-08-12 18:30   ` Boaz Harrosh
2012-08-12 18:40     ` Boaz Harrosh
2012-08-13  6:15     ` Peng Tao
2012-08-13  9:44     ` Peng Tao
2012-08-13 20:13       ` Boaz Harrosh
2012-08-13 20:21         ` Myklebust, Trond
2012-08-08  2:03 ` [PATCH RFC 3/3] NFS41: send real read size in layoutget for DIO Peng Tao
2012-08-08 18:57   ` Myklebust, Trond
2012-08-09  2:30     ` Peng Tao
2012-08-12 17:39       ` Boaz Harrosh