* [RFC PATCH 0/3] O_DIRECT locking rework
From: Chris Mason @ 2006-10-27 18:22 UTC
To: linux-fsdevel; +Cc: akpm, zach.brown
Hello everyone,
Here is a new spin of my DIO locking patches. I've fixed a few bugs in
patch 2 and added a new patch that extends the file and fills holes
without falling back to buffered I/O.
This was tested on top of Zach's DIO patches, but rediffed against
Linus' tree for sending (minor changes were required). I also tested on
top of 2.6.19-rc2-mm2, but there my code for inserting and removing
placeholders stopped working correctly. That is either a bug in my code
or a bug in Nick's RCU work; I'm still tracking it down.
-chris
* [RFC PATCH 1/3] Introduce a place holder page for the pagecache.
From: Chris Mason @ 2006-10-27 18:23 UTC
To: linux-fsdevel; +Cc: akpm, zach.brown

Introduce a place holder page for the pagecache.

mm/filemap.c is changed to wait on these before adding a page into the
page cache, and truncates are changed to wait for all of the place
holder pages to disappear.

Place holder pages can only be tested or looked at with the mapping
lock held, and only PagePlaceHolder(page) can be trusted.  They cannot
be locked, and cannot have their reference counts increased or
decreased.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 18a9e9f5c707 include/linux/mm.h
--- a/include/linux/mm.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/mm.h	Tue Oct 24 15:10:48 2006 -0400
@@ -276,6 +276,7 @@ static inline void get_page(struct page
 	if (unlikely(PageCompound(page)))
 		page = (struct page *)page_private(page);
 	VM_BUG_ON(atomic_read(&page->_count) == 0);
+	VM_BUG_ON(PagePlaceHolder(page));
 	atomic_inc(&page->_count);
 }

diff -r 18a9e9f5c707 include/linux/page-flags.h
--- a/include/linux/page-flags.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/page-flags.h	Tue Oct 24 15:10:48 2006 -0400
@@ -267,4 +266,6 @@ static inline void set_page_writeback(st
 	test_set_page_writeback(page);
 }

+void set_page_placeholder(struct page *page);
+
 #endif	/* PAGE_FLAGS_H */
diff -r 18a9e9f5c707 include/linux/pagemap.h
--- a/include/linux/pagemap.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/pagemap.h	Tue Oct 24 15:10:48 2006 -0400
@@ -72,6 +72,12 @@ extern struct page * find_get_page(struc
 				unsigned long index);
 extern struct page * find_lock_page(struct address_space *mapping,
 				unsigned long index);
+int find_or_insert_placeholders(struct address_space *mapping,
+				struct page **pages, unsigned long start,
+				unsigned long end, unsigned long nr,
+				gfp_t gfp_mask,
+				struct page *insert,
+				int wait);
 extern __deprecated_for_modules struct page * find_trylock_page(
 				struct address_space *mapping, unsigned long index);
 extern struct page * find_or_create_page(struct address_space *mapping,
@@ -82,6 +88,16 @@ unsigned find_get_pages_contig(struct ad
 			unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
+void remove_placeholder_pages(struct address_space *mapping,
+			struct page **pages,
+			struct page *placeholder,
+			unsigned long offset,
+			unsigned long end,
+			unsigned long nr);
+void wake_up_placeholder_page(struct page *page);
+void wait_on_placeholder_pages_range(struct address_space *mapping,
+			pgoff_t start, pgoff_t end);
+

 /*
  * Returns locked page at given index in given cache, creating it if needed.
diff -r 18a9e9f5c707 include/linux/radix-tree.h
--- a/include/linux/radix-tree.h	Thu Oct 19 08:30:00 2006 +0700
+++ b/include/linux/radix-tree.h	Tue Oct 24 15:10:48 2006 -0400
@@ -56,6 +56,7 @@ radix_tree_gang_lookup(struct radix_tree
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(gfp_t gfp_mask);
+int radix_tree_needs_preload(void);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff -r 18a9e9f5c707 lib/radix-tree.c
--- a/lib/radix-tree.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/lib/radix-tree.c	Tue Oct 24 15:10:48 2006 -0400
@@ -139,6 +139,19 @@ out:
 out:
 	return ret;
 }
+
+/*
+ * Must be called inside the preload pair.  This tells you if another
+ * run of radix_tree_preload is required, allowing the caller to decide
+ * if a repeated insert will succeed.
+ */
+int radix_tree_needs_preload(void)
+{
+	struct radix_tree_preload *rtp;
+	rtp = &__get_cpu_var(radix_tree_preloads);
+	return rtp->nr < ARRAY_SIZE(rtp->nodes);
+}
+EXPORT_SYMBOL_GPL(radix_tree_needs_preload);

 static inline void tag_set(struct radix_tree_node *node, unsigned int tag,
 		int offset)
diff -r 18a9e9f5c707 mm/filemap.c
--- a/mm/filemap.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/mm/filemap.c	Tue Oct 24 15:10:48 2006 -0400
@@ -44,6 +44,25 @@ generic_file_direct_IO(int rw, struct ki
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	loff_t offset, unsigned long nr_segs);

+static wait_queue_head_t *page_waitqueue(struct page *page);
+static void wait_on_placeholder_page(struct address_space *mapping,
+				     struct page *page,
+				     int write_lock);
+
+static struct address_space placeholder_address_space;
+#define PagePlaceHolder(page) ((page)->mapping == &placeholder_address_space)
+
+/*
+ * Turns the page struct into a marker for a placeholder page.
+ * This should not be called with a real page, only a struct page stub.
+ */
+void set_page_placeholder(struct page *page)
+{
+	memset(page, 0, sizeof(*page));
+	page->mapping = &placeholder_address_space;
+}
+EXPORT_SYMBOL_GPL(set_page_placeholder);
+
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -437,12 +456,24 @@ int add_to_page_cache(struct page *page,
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
-	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	int error;
+again:
+	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);

 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
-		if (!error) {
+		if (error == -EEXIST && (gfp_mask & __GFP_WAIT)) {
+			struct page *tmp;
+			tmp = radix_tree_lookup(&mapping->page_tree, offset);
+			if (tmp && PagePlaceHolder(tmp)) {
+				radix_tree_preload_end();
+				/* drops the lock */
+				wait_on_placeholder_page(mapping, tmp, 1);
+				goto again;
+			}
+		}
+		if (!error && !PagePlaceHolder(page)) {
 			page_cache_get(page);
 			SetPageLocked(page);
 			page->mapping = mapping;
@@ -526,6 +557,79 @@ void fastcall wait_on_page_bit(struct pa
 }
 EXPORT_SYMBOL(wait_on_page_bit);

+/*
+ * Call with either a read lock or a write lock on the mapping tree.
+ * The lock is dropped just before waiting for the place holder.
+ */
+static void wait_on_placeholder_page(struct address_space *mapping,
+				     struct page *page,
+				     int write_lock)
+{
+	DEFINE_WAIT(wait);
+	wait_queue_head_t *wqh = page_waitqueue(page);
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	if (write_lock)
+		write_unlock_irq(&mapping->tree_lock);
+	else
+		read_unlock_irq(&mapping->tree_lock);
+	io_schedule();
+	finish_wait(wqh, &wait);
+}
+
+void wake_up_placeholder_page(struct page *page)
+{
+	__wake_up_bit(page_waitqueue(page), &page->flags, PG_locked);
+}
+EXPORT_SYMBOL_GPL(wake_up_placeholder_page);
+
+/**
+ * wait_on_placeholder_pages_range - gang placeholder page waiter
+ * @mapping:	The address_space to search
+ * @start:	The starting page index
+ * @end:	The max page index
+ *
+ * wait_on_placeholder_pages_range() will search for and wait on a range
+ * of pages in the mapping.
+ *
+ * On return, the range has no placeholder pages sitting in it.
+ */
+void wait_on_placeholder_pages_range(struct address_space *mapping,
+				     pgoff_t start, pgoff_t end)
+{
+	unsigned int i;
+	unsigned int ret;
+	struct page *pages[8];
+	pgoff_t cur = start;
+	pgoff_t highest = start;
+
+	/*
+	 * we expect a very small number of place holder pages, so
+	 * this code isn't trying to be very fast.
+	 */
+again:
+	read_lock_irq(&mapping->tree_lock);
+	ret = radix_tree_gang_lookup(&mapping->page_tree,
+				     (void **)pages, cur, ARRAY_SIZE(pages));
+	for (i = 0; i < ret; i++) {
+		if (PagePlaceHolder(pages[i])) {
+			/* drops the lock */
+			wait_on_placeholder_page(mapping, pages[i], 0);
+			goto again;
+		} else if (pages[i]->index > highest)
+			highest = pages[i]->index;
+	}
+	if (highest < end && ret == ARRAY_SIZE(pages)) {
+		cur = highest;
+		read_unlock_irq(&mapping->tree_lock);
+		if (need_resched())
+			cond_resched();
+		goto again;
+	}
+	read_unlock_irq(&mapping->tree_lock);
+}
+EXPORT_SYMBOL_GPL(wait_on_placeholder_pages_range);
+
 /**
  * unlock_page - unlock a locked page
  * @page: the page
@@ -542,6 +646,7 @@ EXPORT_SYMBOL(wait_on_page_bit);
  */
 void fastcall unlock_page(struct page *page)
 {
+	BUG_ON(PagePlaceHolder(page));
 	smp_mb__before_clear_bit();
 	if (!TestClearPageLocked(page))
 		BUG();
@@ -578,6 +683,7 @@ void fastcall __lock_page(struct page *p
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+	BUG_ON(PagePlaceHolder(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -590,6 +696,7 @@ void fastcall __lock_page_nosync(struct
 void fastcall __lock_page_nosync(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+	BUG_ON(PagePlaceHolder(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, __sleep_on_page_lock,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -608,12 +715,262 @@ struct page * find_get_page(struct addre

 	read_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page)
-		page_cache_get(page);
+	if (page) {
+		if (PagePlaceHolder(page))
+			page = NULL;
+		else
+			page_cache_get(page);
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return page;
 }
 EXPORT_SYMBOL(find_get_page);
+
+/**
+ * remove_placeholder_pages - remove a range of placeholder or locked pages
+ * @mapping: the page's address_space
+ * @pages: an array of page pointers to use for gang lookups
+ * @placeholder: the placeholder page previously inserted (for verification)
+ * @start: the search starting point
+ * @end: the search end point (offsets >= end are not touched)
+ * @nr: the size of the pages array.
+ *
+ * Any placeholder pages in the range specified are removed.  Any real
+ * pages are unlocked and released.
+ */
+void remove_placeholder_pages(struct address_space *mapping,
+			      struct page **pages,
+			      struct page *placeholder,
+			      unsigned long start,
+			      unsigned long end,
+			      unsigned long nr)
+{
+	struct page *page;
+	int ret;
+	int i;
+	unsigned long num;
+
+	write_lock_irq(&mapping->tree_lock);
+	while (start < end) {
+		num = min(nr, end - start);
+		ret = radix_tree_gang_lookup(&mapping->page_tree,
+					     (void **)pages, start, num);
+		BUG_ON(ret != num);
+		for (i = 0; i < ret; i++) {
+			page = pages[i];
+			if (PagePlaceHolder(page)) {
+				BUG_ON(page != placeholder);
+				radix_tree_delete(&mapping->page_tree,
+						  start + i);
+			} else {
+				BUG_ON(page->index >= end);
+				unlock_page(page);
+				page_cache_release(page);
+			}
+		}
+		start += ret;
+		if (need_resched()) {
+			write_unlock_irq(&mapping->tree_lock);
+			cond_resched();
+			write_lock_irq(&mapping->tree_lock);
+		}
+	}
+	write_unlock_irq(&mapping->tree_lock);
+	if (placeholder)
+		wake_up_placeholder_page(placeholder);
+}
+EXPORT_SYMBOL_GPL(remove_placeholder_pages);
+
+/*
+ * a helper function to insert a placeholder into multiple slots
+ * in the radix tree.  This could probably use an optimized version
+ * in the radix code.  It may insert fewer than the requested number
+ * of placeholders if we need to reschedule or the radix tree needs to
+ * be preloaded again.
+ *
+ * returns zero on error or the number actually inserted.
+ */
+static unsigned long insert_placeholders(struct address_space *mapping,
+					 struct page *insert,
+					 unsigned long start, unsigned long nr)
+{
+	int err;
+	unsigned long i;
+	for (i = 0; i < nr; i++) {
+		err = radix_tree_insert(&mapping->page_tree, start + i,
+					insert);
+		if (err)
+			return i;
+		if (radix_tree_needs_preload() || (i && need_resched()))
+			return i + 1;
+	}
+	return i;
+}
+
+
+/**
+ * find_or_insert_placeholders - locate a group of pagecache pages or insert one
+ * @mapping: the page's address_space
+ * @pages: an array of page pointers to use for gang lookups
+ * @start: the search starting point
+ * @end: the search end point (offsets >= end are not touched)
+ * @nr: the size of the pages array.
+ * @gfp_mask: page allocation mode
+ * @insert: the page to insert if none is found
+ * @iowait: 1 if you want to wait for dirty or writeback pages.
+ *
+ * This locks down a range of offsets in the address space.  Any pages
+ * already present are locked and a placeholder page is inserted into
+ * the radix tree for any offsets without pages.
+ */
+int find_or_insert_placeholders(struct address_space *mapping,
+				struct page **pages, unsigned long start,
+				unsigned long end, unsigned long nr,
+				gfp_t gfp_mask,
+				struct page *insert,
+				int iowait)
+{
+	int err = 0;
+	int i, ret;
+	unsigned long cur = start;
+	unsigned long ins;
+	unsigned long want;
+	struct page *page;
+	int restart;
+
+	/*
+	 * this gets complicated.  Placeholders and page locks need to
+	 * be taken in order.  We use gang lookup to cut down on the cpu
+	 * cost, but we need to keep track of holes in the results and
+	 * insert placeholders as appropriate.
+	 *
+	 * If a locked page or a placeholder is found, we wait for it and
+	 * pick up where we left off.  If a dirty or PG_writeback page is
+	 * found and iowait==1, we have to drop all of our locks, kick/wait
+	 * for the io and start over.
+	 */
+repeat:
+	if (cur != start)
+		cond_resched();
+	err = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	if (err)
+		goto fail;
+	write_lock_irq(&mapping->tree_lock);
+repeat_lock:
+	ret = radix_tree_gang_lookup(&mapping->page_tree,
+				     (void **)pages, cur,
+				     min(nr, end - cur));
+	for (i = 0; i < ret; i++) {
+		restart = 0;
+		page = pages[i];
+
+		if (PagePlaceHolder(page)) {
+			radix_tree_preload_end();
+			/* drops the lock */
+			wait_on_placeholder_page(mapping, page, 1);
+			goto repeat;
+		}
+
+		if (page->index > cur) {
+			unsigned long top = min(end, page->index);
+			want = top - cur;
+			ins = insert_placeholders(mapping, insert, cur, want);
+			cur += ins;
+			if (radix_tree_needs_preload() || ins != want) {
+				write_unlock_irq(&mapping->tree_lock);
+				radix_tree_preload_end();
+				if (ins == 0) {
+					err = -1;
+					goto fail;
+				}
+				goto repeat;
+			}
+		}
+		if (page->index >= end) {
+			ret = 0;
+			break;
+		}
+		page_cache_get(page);
+		if (TestSetPageLocked(page)) {
+			unsigned long tmpoff = page->index;
+			page_cache_get(page);
+			write_unlock_irq(&mapping->tree_lock);
+			radix_tree_preload_end();
+			__lock_page(page);
+			/* Has the page been truncated while we slept? */
+			if (unlikely(page->mapping != mapping ||
+				     page->index != tmpoff)) {
+				unlock_page(page);
+				page_cache_release(page);
+				goto repeat;
+			} else {
+				/* we've locked the page, but we need to
+				 * check it for dirty/writeback
+				 */
+				restart = 1;
+			}
+		}
+		if (iowait && (PageDirty(page) || PageWriteback(page))) {
+			/*
+			 * mpage_writepages collects a number of pages
+			 * into a bio, marking them PG_writeback as it goes.
+			 * It also locks pages while deciding what to do
+			 * with them.  What we want to do is have a bunch
+			 * of pages locked, and then wait for any io to
+			 * finish.  But if mpage_writepages is waiting for
+			 * one of our locked pages, it may never start the
+			 * io on the bio it has created.
+			 *
+			 * so, we have to drop all of our locks and
+			 * start over
+			 */
+			unlock_page(page);
+			page_cache_release(page);
+			if (!restart) {
+				write_unlock_irq(&mapping->tree_lock);
+				radix_tree_preload_end();
+			}
+			remove_placeholder_pages(mapping, pages, insert,
+						 start, cur, nr);
+			filemap_write_and_wait_range(mapping,
+						start << PAGE_CACHE_SHIFT,
+						end << PAGE_CACHE_SHIFT);
+			cur = start;
+			goto repeat;
+		}
+		cur++;
+		if (restart)
+			goto repeat;
+		if (cur >= end)
+			break;
+	}
+
+	/* we haven't yet filled the range */
+	if (cur < end) {
+		/* if the search filled our array, there is more to do */
+		if (ret && ret == nr)
+			goto repeat_lock;
+
+		/* otherwise insert placeholders for the remaining offsets */
+		want = end - cur;
+		ins = insert_placeholders(mapping, insert, cur, want);
+		cur += ins;
+		if (ins != want) {
+			write_unlock_irq(&mapping->tree_lock);
+			radix_tree_preload_end();
+			if (ins == 0) {
+				err = -1;
+				goto fail;
+			}
+			goto repeat;
+		}
+	}
+	write_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+	return err;
+fail:
+	remove_placeholder_pages(mapping, pages, insert, start, cur, nr);
+	return err;
+}
+EXPORT_SYMBOL_GPL(find_or_insert_placeholders);

 /**
  * find_trylock_page - find and lock a page
@@ -628,7 +985,7 @@ struct page *find_trylock_page(struct ad

 	read_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page && TestSetPageLocked(page))
+	if (page && (PagePlaceHolder(page) || TestSetPageLocked(page)))
 		page = NULL;
 	read_unlock_irq(&mapping->tree_lock);
 	return page;
@@ -654,6 +1011,12 @@ repeat:
 repeat:
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page) {
+		if (PagePlaceHolder(page)) {
+			/* drops the lock */
+			wait_on_placeholder_page(mapping, page, 0);
+			read_lock_irq(&mapping->tree_lock);
+			goto repeat;
+		}
 		page_cache_get(page);
 		if (TestSetPageLocked(page)) {
 			read_unlock_irq(&mapping->tree_lock);
@@ -737,14 +1100,25 @@ unsigned find_get_pages(struct address_s
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			unsigned int nr_pages, struct page **pages)
 {
-	unsigned int i;
+	unsigned int i = 0;
 	unsigned int ret;

 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
+	while (i < ret) {
+		if (PagePlaceHolder(pages[i])) {
+			/* we can't return a place holder, shift it away */
+			if (i + 1 < ret) {
+				memmove(&pages[i], &pages[i + 1],
+					(ret - i - 1) * sizeof(struct page *));
+			}
+			ret--;
+			continue;
+		} else
+			page_cache_get(pages[i]);
+		i++;
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
@@ -771,6 +1145,8 @@ unsigned find_get_pages_contig(struct ad
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, index, nr_pages);
 	for (i = 0; i < ret; i++) {
+		if (PagePlaceHolder(pages[i]))
+			break;
 		if (pages[i]->mapping == NULL || pages[i]->index != index)
 			break;

@@ -795,14 +1171,25 @@ unsigned find_get_pages_tag(struct addre
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages)
 {
-	unsigned int i;
+	unsigned int i = 0;
 	unsigned int ret;

 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
 				(void **)pages, *index, nr_pages, tag);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
+	while (i < ret) {
+		if (PagePlaceHolder(pages[i])) {
+			/* we can't return a place holder, shift it away */
+			if (i + 1 < ret) {
+				memmove(&pages[i], &pages[i + 1],
+					(ret - i - 1) * sizeof(struct page *));
+			}
+			ret--;
+			continue;
+		} else
+			page_cache_get(pages[i]);
+		i++;
+	}
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
 	read_unlock_irq(&mapping->tree_lock);
diff -r 18a9e9f5c707 mm/truncate.c
--- a/mm/truncate.c	Thu Oct 19 08:30:00 2006 +0700
+++ b/mm/truncate.c	Tue Oct 24 15:10:48 2006 -0400
@@ -207,6 +207,7 @@ void truncate_inode_pages_range(struct a
 		}
 		pagevec_release(&pvec);
 	}
+	wait_on_placeholder_pages_range(mapping, start, end);
 }
 EXPORT_SYMBOL(truncate_inode_pages_range);
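[ Editor's note: to make the calling convention concrete, here is a
minimal sketch of how a caller is expected to drive the placeholder API
above.  It mirrors what patch 2/3 does from fs/direct-io.c; the helper
name pin_file_range and the NR constant are invented for illustration,
and error handling is reduced to the bare minimum. ]

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

#define NR 64	/* size of the gang-lookup scratch array */

static int pin_file_range(struct inode *inode, loff_t pos, size_t len)
{
	struct address_space *mapping = inode->i_mapping;
	struct page *scratch[NR];
	struct page fake;	/* a struct page stub, never a real page */
	unsigned long start = pos >> PAGE_CACHE_SHIFT;
	unsigned long end = ((pos + len - 1) >> PAGE_CACHE_SHIFT) + 1;
	int ret;

	/* turn the stub into a marker the pagecache code recognizes */
	set_page_placeholder(&fake);

	/*
	 * Lock every page already present in [start, end) and insert
	 * the placeholder into every empty slot.  Passing iowait == 1
	 * also waits for dirty and writeback pages, as a reader would.
	 */
	ret = find_or_insert_placeholders(mapping, scratch, start, end,
					  NR, GFP_KERNEL, &fake, 1);
	if (ret)
		return ret;

	/* ... the range is now stable against buffered IO and truncate ... */

	/* unlock the real pages, delete placeholders, wake any waiters */
	remove_placeholder_pages(mapping, scratch, &fake, start, end, NR);
	return 0;
}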
* Re: [RFC PATCH 1/3] Introduce a place holder page for the pagecache.
From: Suparna Bhattacharya @ 2006-11-01 10:23 UTC
To: Chris Mason; +Cc: linux-fsdevel, akpm, zach.brown

Really appreciate your biting the bullet on this one.  I have a few
minor comments/questions as I'm trying to compare this to what we have
discussed in the past and the last attempt approach outlined in
http://www.kernel.org/pub/linux/kernel/people/suparna/DIO-simplify.txt
back in Feb this year.

I recall that the first time you experimented with dummy struct pages
(placeholder pages, as they are now called), you ended up putting it on
hold due to regressions in performance, which we attributed to radix
tree insertion overheads.  I think you mentioned in an earlier note
that you've managed to optimize some of that -- I'm just wondering what
helped there: radix_tree_preload, or something more than that?

This is sort of why we had discussed the alternative of tagged locking
in the radix tree:
	Lookup and lock pages in the range by tagging the radix tree
	slots as "TAG_LOCKED", and locking pages that were present.
	Modify find_get_page et al to return a dummy locked cache page
	if TAG_LOCKED.  (Alternatively, put the check in
	add_to_page_cache instead.)
Putting the check in add_to_page_cache is more consistent with what you
do now.

Where I think we were stuck at that point was that even this helps only
up to a point in saving on leaf nodes, but did not avoid intermediate
levels of nodes from being created.  With Nick's changes to the radix
tree I have been thinking that some kind of sliding height scheme could
become possible.

Anyway, the other reason for bringing this up is that I am wondering if
some of the complexity in find_or_insert_placeholders and a few other
places could be reduced with a tagged locking approach, where we do not
actually stuff placeholder pages explicitly in the page cache and hence
can possibly avoid some of the checks you have below.  I need to take a
closer look to be completely sure, but just a thought ...

I guess the important thing at this point is to get the interfaces
right in a way that allows for future optimization.

Regards
Suparna

On Fri, Oct 27, 2006 at 02:23:58PM -0400, Chris Mason wrote:
> Introduce a place holder page for the pagecache.
>
> [ full patch quoted; snipped -- see 1/3 above ]
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India
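[ Editor's note: a rough sketch of the TAG_LOCKED alternative Suparna
describes above -- it was never implemented in this series.
RADIX_TREE_TAG_LOCKED is hypothetical (the radix tree of this era has
only the DIRTY and WRITEBACK tags, so a real version would have to grow
RADIX_TREE_MAX_TAGS, the space cost Chris raises in his reply), and the
unresolved part is untouched here: a tag can only be set on a slot that
already holds an entry. ]

#define RADIX_TREE_TAG_LOCKED	2	/* hypothetical third tag */

int add_to_page_cache_tagged(struct page *page,
			     struct address_space *mapping,
			     pgoff_t offset, gfp_t gfp_mask)
{
	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);

	if (error == 0) {
		write_lock_irq(&mapping->tree_lock);
		/*
		 * A slot tagged TAG_LOCKED belongs to an in-flight
		 * O_DIRECT request; report it like a collision with an
		 * existing page instead of consulting a placeholder.
		 */
		if (radix_tree_tag_get(&mapping->page_tree, offset,
				       RADIX_TREE_TAG_LOCKED))
			error = -EEXIST;
		else
			error = radix_tree_insert(&mapping->page_tree,
						  offset, page);
		if (!error) {
			page_cache_get(page);
			SetPageLocked(page);
			page->mapping = mapping;
			page->index = offset;
			mapping->nrpages++;
		}
		write_unlock_irq(&mapping->tree_lock);
		radix_tree_preload_end();
	}
	return error;
}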
* Re: [RFC PATCH 1/3] Introduce a place holder page for the pagecache.
From: Chris Mason @ 2006-11-01 12:41 UTC
To: Suparna Bhattacharya; +Cc: linux-fsdevel, akpm, zach.brown

On Wed, Nov 01, 2006 at 03:53:50PM +0530, Suparna Bhattacharya wrote:
>
> Really appreciate your biting the bullet on this one.  I have a few
> minor comments/questions as I'm trying to compare this to what we
> have discussed in the past and the last attempt approach outlined in
> http://www.kernel.org/pub/linux/kernel/people/suparna/DIO-simplify.txt
> back in Feb this year.

The main difference from DIO-simplify.txt is that I'm making truncate
wait on the placeholder pages instead of using i_alloc_sem.  I haven't
coded the TAG_LOCKED part, but more on that below.  As you noticed, I'm
using expanding truncates to update i_size for extending the file,
which has pluses and minuses.

> I recall that the first time you experimented with dummy struct pages
> (placeholder pages, as they are now called), you ended up putting it
> on hold due to regressions in performance, which we attributed to
> radix tree insertion overheads.  I think you mentioned in an earlier
> note that you've managed to optimize some of that -- I'm just
> wondering what helped there: radix_tree_preload, or something more
> than that?

My original code was fairly dumb and didn't use gang lookups.  It also
dropped the radix tree lock after every insert/delete.  The code posted
here is about 2x as fast (or 1/2 the regression, depending on how you
look at it).  For smaller read I/Os, the performance hit is about zero.
The cost of radix tree insertion is offset by less mutex and semaphore
locking.

> This is sort of why we had discussed the alternative of tagged locking
> in the radix tree:
> 	Lookup and lock pages in the range by tagging the radix tree
> 	slots as "TAG_LOCKED", and locking pages that were present.
> 	Modify find_get_page et al to return a dummy locked cache page
> 	if TAG_LOCKED.  (Alternatively, put the check in
> 	add_to_page_cache instead.)
> Putting the check in add_to_page_cache is more consistent with what
> you do now.
>
> Where I think we were stuck at that point was that even this helps
> only up to a point in saving on leaf nodes, but did not avoid
> intermediate levels of nodes from being created.  With Nick's changes
> to the radix tree I have been thinking that some kind of sliding
> height scheme could become possible.
>
> Anyway, the other reason for bringing this up is that I am wondering
> if some of the complexity in find_or_insert_placeholders and a few
> other places could be reduced with a tagged locking approach, where
> we do not actually stuff placeholder pages explicitly in the page
> cache and hence can possibly avoid some of the checks you have below.
> I need to take a closer look to be completely sure, but just a
> thought ...
>
> I guess the important thing at this point is to get the interfaces
> right in a way that allows for future optimization.

Yes, my goal was to get the code fast enough for most machines, so that
we could test the result and see what impact it would have on the rest
of the page cache.  From here we can decide if additional radix tuning
is a good idea, or if we just want to stick with the mutex/semaphore
combo in use now.

I suspect that we won't be able to tune the radix tree enough to make
it suitable for the huge SGI 512 cpu test cases.  So we might want to
stop tuning now (letting XFS keep their current code) or switch away
from radix to something that can lock down an extent.

Radix tree gang lookups, insertions and deletions all have room for
optimization though (even without the LOCKED tag).  A new tag in the
radix tree isn't free; Andrew might prefer not to increase the space
usage across the board just for O_DIRECT.

-chris
* Re: [RFC PATCH 1/3] Introduce a place holder page for the pagecache.
From: Suparna Bhattacharya @ 2006-11-03 7:12 UTC
To: Chris Mason; +Cc: linux-fsdevel, akpm, zach.brown

On Wed, Nov 01, 2006 at 07:41:53AM -0500, Chris Mason wrote:
> On Wed, Nov 01, 2006 at 03:53:50PM +0530, Suparna Bhattacharya wrote:
> >
> > Really appreciate your biting the bullet on this one.  I have a few
> > minor comments/questions as I'm trying to compare this to what we
> > have discussed in the past and the last attempt approach outlined in
> > http://www.kernel.org/pub/linux/kernel/people/suparna/DIO-simplify.txt
> > back in Feb this year.
>
> The main difference from DIO-simplify.txt is that I'm making truncate
> wait on the placeholder pages instead of using i_alloc_sem.  I haven't

Ah OK, that is true.

> coded the TAG_LOCKED part, but more on that below.  As you noticed,
> I'm using expanding truncates to update i_size for extending the
> file, which has pluses and minuses.
>
> > I recall that the first time you experimented with dummy struct
> > pages (placeholder pages, as they are now called), you ended up
> > putting it on hold due to regressions in performance, which we
> > attributed to radix tree insertion overheads.  I think you mentioned
> > in an earlier note that you've managed to optimize some of that --
> > I'm just wondering what helped there: radix_tree_preload, or
> > something more than that?
>
> My original code was fairly dumb and didn't use gang lookups.  It also
> dropped the radix tree lock after every insert/delete.  The code
> posted here is about 2x as fast (or 1/2 the regression, depending on
> how you look at it).  For smaller read I/Os, the performance hit is
> about zero.  The cost of radix tree insertion is offset by less mutex
> and semaphore locking.

OK, so you think the main overheads had to do with radix-tree locking,
rather than with the additional node insertions?

> > This is sort of why we had discussed the alternative of tagged
> > locking in the radix tree:
> > 	Lookup and lock pages in the range by tagging the radix tree
> > 	slots as "TAG_LOCKED", and locking pages that were present.
> > 	Modify find_get_page et al to return a dummy locked cache page
> > 	if TAG_LOCKED.  (Alternatively, put the check in
> > 	add_to_page_cache instead.)
> > Putting the check in add_to_page_cache is more consistent with what
> > you do now.
> >
> > Where I think we were stuck at that point was that even this helps
> > only up to a point in saving on leaf nodes, but did not avoid
> > intermediate levels of nodes from being created.  With Nick's
> > changes to the radix tree I have been thinking that some kind of
> > sliding height scheme could become possible.
> >
> > Anyway, the other reason for bringing this up is that I am wondering
> > if some of the complexity in find_or_insert_placeholders and a few
> > other places could be reduced with a tagged locking approach, where
> > we do not actually stuff placeholder pages explicitly in the page
> > cache and hence can possibly avoid some of the checks you have
> > below.  I need to take a closer look to be completely sure, but just
> > a thought ...
> >
> > I guess the important thing at this point is to get the interfaces
> > right in a way that allows for future optimization.
>
> Yes, my goal was to get the code fast enough for most machines, so
> that we could test the result and see what impact it would have on the
> rest of the page cache.  From here we can decide if additional radix
> tuning is a good idea, or if we just want to stick with the
> mutex/semaphore combo in use now.

Yup, that makes sense.

The complexity in find_or_insert_placeholders does worry me though; I
wouldn't want to end up debugging tricky problems in that code after
all this.  Do you have any ideas on that?

> I suspect that we won't be able to tune the radix tree enough to make
> it suitable for the huge SGI 512 cpu test cases.  So we might want to
> stop tuning now (letting XFS keep their current code) or switch away
> from radix to something that can lock down an extent.
>
> Radix tree gang lookups, insertions and deletions all have room for
> optimization though (even without the LOCKED tag).  A new tag in the
> radix tree isn't free; Andrew might prefer not to increase the space
> usage across the board just for O_DIRECT.

Yes, that was a consideration on my mind as well.  There are costs both
ways.

Regards
Suparna

> -chris

--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India
* Re: [RFC PATCH 1/3] Introduce a place holder page for the pagecache.
From: Chris Mason @ 2006-11-03 14:10 UTC
To: Suparna Bhattacharya; +Cc: linux-fsdevel, akpm, zach.brown

On Fri, Nov 03, 2006 at 12:42:57PM +0530, Suparna Bhattacharya wrote:
> >
> > My original code was fairly dumb and didn't use gang lookups.  It
> > also dropped the radix tree lock after every insert/delete.  The
> > code posted here is about 2x as fast (or 1/2 the regression,
> > depending on how you look at it).  For smaller read I/Os, the
> > performance hit is about zero.  The cost of radix tree insertion is
> > offset by less mutex and semaphore locking.
>
> OK, so you think the main overheads had to do with radix-tree locking,
> rather than with the additional node insertions?

It depends on the machine, I suppose.  If you have a lot of cpus trying
to work concurrently on the same file, bouncing the lock between them
(especially on numa) may be the most expensive part.  Hopefully we can
talk the XFS guys into some benchmarking, since they have a bunch of
numbers for comparison already.  But there are other costs, including
slab churn as we allocate/free radix nodes etc.  I think (well, hope)
that on most hardware this is a very small cost in comparison with the
actual IO.

[ ... ]

> > Yes, my goal was to get the code fast enough for most machines, so
> > that we could test the result and see what impact it would have on
> > the rest of the page cache.  From here we can decide if additional
> > radix tuning is a good idea, or if we just want to stick with the
> > mutex/semaphore combo in use now.
>
> Yup, that makes sense.
>
> The complexity in find_or_insert_placeholders does worry me though; I
> wouldn't want to end up debugging tricky problems in that code after
> all this.  Do you have any ideas on that?

If we push bulk insertion down into the radix code,
find_or_insert_placeholders will become less complex.  We could even
preallocate a node that is already filled with place holders and pop
that into multiple spots in the radix tree.

-chris
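[ Editor's note: a sketch of the API shape of the bulk insertion Chris
suggests here; radix_tree_insert_range is hypothetical and no such
helper exists in lib/radix-tree.c at this point.  A real version would
descend the tree once per leaf node and fill that node's slots directly
-- or graft in a preallocated node whose slots all point at the
placeholder -- rather than re-descending per index as this naive loop
does. ]

static unsigned long radix_tree_insert_range(struct radix_tree_root *root,
					     unsigned long start,
					     unsigned long nr, void *item)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		if (radix_tree_insert(root, start + i, item))
			break;
	}
	return i;	/* number of slots actually filled */
}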
* [RFC PATCH 2/3] Change O_DIRECT to use placeholders 2006-10-27 18:22 [RFC PATCH 0/3] O_DIRECT locking rework Chris Mason 2006-10-27 18:23 ` [RFC PATCH 1/3] Introduce a place holder page for the pagecache Chris Mason @ 2006-10-27 18:25 ` Chris Mason 2006-10-27 18:26 ` [RFC PATCH 3/3] DIO: don't fall back to buffered writes Chris Mason 2 siblings, 0 replies; 11+ messages in thread From: Chris Mason @ 2006-10-27 18:25 UTC (permalink / raw) To: linux-fsdevel; +Cc: akpm, zach.brown Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking Signed-off-by: Chris Mason <chris.mason@oracle.com> diff -r 6b363967df4b fs/direct-io.c --- a/fs/direct-io.c Fri Oct 27 13:54:01 2006 -0400 +++ b/fs/direct-io.c Fri Oct 27 13:55:18 2006 -0400 @@ -35,6 +35,7 @@ #include <linux/rwsem.h> #include <linux/uio.h> #include <asm/atomic.h> +#include <linux/writeback.h> /* * How many user pages to map in one call to get_user_pages(). This determines @@ -94,6 +95,14 @@ struct dio { struct buffer_head map_bh; /* last get_block() result */ /* + * kernel page pinning + */ + struct page fake; + struct page *tmppages[DIO_PAGES]; + unsigned long fspages_start_off; + unsigned long fspages_end_off; + + /* * Deferred addition of a page to the dio. These variables are * private to dio_send_cur_page(), submit_page_section() and * dio_bio_add_page(). @@ -190,6 +199,28 @@ out: return ret; } +static void unlock_page_range(struct dio *dio, unsigned long start, + unsigned long nr) +{ + remove_placeholder_pages(dio->inode->i_mapping, dio->tmppages, + &dio->fake, + start, start + nr, + ARRAY_SIZE(dio->tmppages)); +} + +static int lock_page_range(struct dio *dio, unsigned long start, + unsigned long nr) +{ + struct address_space *mapping = dio->inode->i_mapping; + struct page *fake = &dio->fake; + unsigned long end = start + nr; + return find_or_insert_placeholders(mapping, dio->tmppages, start, end, + ARRAY_SIZE(dio->tmppages), + GFP_KERNEL, fake, + dio->rw == READ); +} + + /* * Get another userspace page. Returns an ERR_PTR on error. Pages are * buffered inside the dio so that we can call get_user_pages() against a @@ -219,9 +250,9 @@ static void dio_complete(struct dio *dio { if (dio->end_io && dio->result) dio->end_io(dio->iocb, offset, bytes, dio->map_bh.b_private); - if (dio->lock_type == DIO_LOCKING) - /* lockdep: non-owner release */ - up_read_non_owner(&dio->inode->i_alloc_sem); + unlock_page_range(dio, dio->fspages_start_off, + dio->fspages_end_off - dio->fspages_start_off); + dio->fspages_end_off = dio->fspages_start_off; } /* @@ -517,6 +548,7 @@ static int get_more_blocks(struct dio *d unsigned long fs_count; /* Number of filesystem-sized blocks */ unsigned long dio_count;/* Number of dio_block-sized blocks */ unsigned long blkmask; + unsigned long index; int create; /* @@ -544,7 +576,21 @@ static int get_more_blocks(struct dio *d } else if (dio->lock_type == DIO_NO_LOCKING) { create = 0; } - + index = fs_startblk >> (PAGE_CACHE_SHIFT - + dio->inode->i_blkbits); + if (index >= dio->fspages_end_off) { + unsigned long end; + unsigned long nr; + end = (dio->final_block_in_request >> + dio->blkfactor) >> + (PAGE_CACHE_SHIFT - dio->inode->i_blkbits); + nr = min(end - index + 1, (unsigned long)DIO_PAGES); + ret = lock_page_range(dio, dio->fspages_end_off, nr); + if (ret) + goto error; + dio->fspages_end_off += nr; + BUG_ON(index >= dio->fspages_end_off); + } /* * For writes inside i_size we forbid block creations: only * overwrites are permitted. 
@@ -554,6 +600,7 @@ static int get_more_blocks(struct dio *d
 		ret = (*dio->get_block)(dio->inode, fs_startblk,
 						map_bh, create);
 	}
+error:
 	return ret;
 }
 
@@ -943,9 +990,6 @@ out:
 	return ret;
 }
 
-/*
- * Releases both i_mutex and i_alloc_sem
- */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
 	const struct iovec *iov, loff_t offset, unsigned long nr_segs,
@@ -1074,14 +1118,6 @@ direct_io_worker(int rw, struct kiocb *i
 	 * In that case, we need to release all the pages we got hold on.
 	 */
 	dio_cleanup(dio);
-
-	/*
-	 * All block lookups have been performed. For READ requests
-	 * we can let i_mutex go now that its achieved its purpose
-	 * of protecting us from looking up uninitialized blocks.
-	 */
-	if ((rw == READ) && (dio->lock_type == DIO_LOCKING))
-		mutex_unlock(&dio->inode->i_mutex);
 
 	/*
 	 * OK, all BIOs are submitted, so we can decrement bio_count to truly
@@ -1165,8 +1201,6 @@ direct_io_worker(int rw, struct kiocb *i
 * DIO_LOCKING (simple locking for regular files)
 * For writes we are called under i_mutex and return with i_mutex held, even
 * though it is internally dropped.
-* For reads, i_mutex is not held on entry, but it is taken and dropped before
-* returning.
 *
 * DIO_OWN_LOCKING (filesystem provides synchronisation and handling of
 * uninitialised data, allowing parallel direct readers and writers)
@@ -1191,8 +1225,6 @@ __blockdev_direct_IO(int rw, struct kioc
 	ssize_t retval = -EINVAL;
 	loff_t end = offset;
 	struct dio *dio;
-	int release_i_mutex = 0;
-	int acquire_i_mutex = 0;
 
 	if (rw & WRITE)
 		rw = WRITE_SYNC;
@@ -1221,51 +1253,24 @@ __blockdev_direct_IO(int rw, struct kioc
 			goto out;
 		}
 	}
-
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	retval = -ENOMEM;
 	if (!dio)
 		goto out;
+	set_page_placeholder(&dio->fake);
+	dio->fspages_start_off = offset >> PAGE_CACHE_SHIFT;
+	dio->fspages_end_off = dio->fspages_start_off;
+
 	/*
 	 * For block device access DIO_NO_LOCKING is used,
 	 *	neither readers nor writers do any locking at all
 	 * For regular files using DIO_LOCKING,
-	 *	readers need to grab i_mutex and i_alloc_sem
-	 *	writers need to grab i_alloc_sem only (i_mutex is already held)
+	 *	No locks are taken
 	 * For regular files using DIO_OWN_LOCKING,
 	 *	neither readers nor writers take any locks here
 	 */
 	dio->lock_type = dio_lock_type;
-	if (dio_lock_type != DIO_NO_LOCKING) {
-		/* watch out for a 0 len io from a tricksy fs */
-		if (rw == READ && end > offset) {
-			struct address_space *mapping;
-
-			mapping = iocb->ki_filp->f_mapping;
-			if (dio_lock_type != DIO_OWN_LOCKING) {
-				mutex_lock(&inode->i_mutex);
-				release_i_mutex = 1;
-			}
-
-			retval = filemap_write_and_wait_range(mapping, offset,
-							      end - 1);
-			if (retval) {
-				kfree(dio);
-				goto out;
-			}
-
-			if (dio_lock_type == DIO_OWN_LOCKING) {
-				mutex_unlock(&inode->i_mutex);
-				acquire_i_mutex = 1;
-			}
-		}
-
-		if (dio_lock_type == DIO_LOCKING)
-			/* lockdep: not the owner will release it */
-			down_read_non_owner(&inode->i_alloc_sem);
-	}
-
 	/*
 	 * For file extending writes updating i_size before data
 	 * writeouts complete can expose uninitialized blocks. So
@@ -1277,15 +1282,7 @@ __blockdev_direct_IO(int rw, struct kioc
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
 				nr_segs, blkbits, get_block, end_io, dio);
-
-	if (rw == READ && dio_lock_type == DIO_LOCKING)
-		release_i_mutex = 0;
-
 out:
-	if (release_i_mutex)
-		mutex_unlock(&inode->i_mutex);
-	else if (acquire_i_mutex)
-		mutex_lock(&inode->i_mutex);
 	return retval;
 }
 EXPORT_SYMBOL(__blockdev_direct_IO);
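An aside on the arithmetic above: the shift in get_more_blocks() turns a
filesystem block number into a pagecache index, and that index is what
drives how far the placeholder window (fspages_start_off/fspages_end_off)
advances.  A standalone userspace illustration of just that math; the
constants are assumptions (4096-byte pages, so PAGE_CACHE_SHIFT == 12,
and 1024-byte filesystem blocks, so i_blkbits == 10, meaning each page
covers four fs blocks):

#include <stdio.h>

int main(void)
{
	unsigned long page_cache_shift = 12;	/* 4096-byte pages */
	unsigned long i_blkbits = 10;		/* 1024-byte fs blocks */
	unsigned long fs_startblk;

	for (fs_startblk = 0; fs_startblk < 12; fs_startblk++) {
		unsigned long index =
			fs_startblk >> (page_cache_shift - i_blkbits);
		printf("fs block %2lu -> page index %lu\n",
		       fs_startblk, index);
	}
	return 0;
}

Blocks 0-3 land in page 0, blocks 4-7 in page 1, and so on; once index
crosses fspages_end_off, another batch of placeholders is locked.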
* [RFC PATCH 3/3] DIO: don't fall back to buffered writes
From: Chris Mason @ 2006-10-27 18:26 UTC
To: linux-fsdevel; +Cc: akpm, zach.brown

DIO: don't fall back to buffered writes

Placeholder pages allow DIO to use locking rules similar to those of
writepage.  DIO can now fill holes, and it can extend the file via
expanding truncates.  This creates holes that are filled during the DIO
write.

i_mutex can be dropped during writes once DIO decides if the file needs
to be extended.

The expanding truncate may cause some pagecache pages to be dirtied.
The call to find_or_insert_placeholders is changed to wait on dirty
or writeback pages in the case of writes as well as reads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 30dcee85429c fs/direct-io.c
--- a/fs/direct-io.c	Fri Oct 27 13:49:19 2006 -0400
+++ b/fs/direct-io.c	Fri Oct 27 13:50:48 2006 -0400
@@ -69,6 +69,7 @@ struct dio {
 	int rw;
 	loff_t i_size;		/* i_size when submitted */
 	int lock_type;		/* doesn't change */
+	int reacquire_i_mutex;	/* should we get i_mutex when done? */
 	unsigned blkbits;	/* doesn't change */
 	unsigned blkfactor;	/* When we're using an alignment which
 				   is finer than the filesystem's soft
@@ -216,8 +217,7 @@ static int lock_page_range(struct dio *d
 	unsigned long end = start + nr;
 	return find_or_insert_placeholders(mapping, dio->tmppages, start, end,
 					   ARRAY_SIZE(dio->tmppages),
-					   GFP_KERNEL, fake,
-					   dio->rw == READ);
+					   GFP_KERNEL, fake, 1);
 }
 
 
@@ -253,6 +253,8 @@ static void dio_complete(struct dio *dio
 	unlock_page_range(dio, dio->fspages_start_off,
 			  dio->fspages_end_off - dio->fspages_start_off);
 	dio->fspages_end_off = dio->fspages_start_off;
+	if (dio->reacquire_i_mutex)
+		mutex_lock(&dio->inode->i_mutex);
 }
 
 /*
@@ -569,13 +571,8 @@ static int get_more_blocks(struct dio *d
 	map_bh->b_size = fs_count << dio->inode->i_blkbits;
 
 	create = dio->rw & WRITE;
-	if (dio->lock_type == DIO_LOCKING) {
-		if (dio->block_in_file < (i_size_read(dio->inode) >>
-					  dio->blkbits))
-			create = 0;
-	} else if (dio->lock_type == DIO_NO_LOCKING) {
+	if (dio->lock_type == DIO_NO_LOCKING)
 		create = 0;
-	}
 	index = fs_startblk >> (PAGE_CACHE_SHIFT -
 				dio->inode->i_blkbits);
 	if (index >= dio->fspages_end_off) {
@@ -1283,6 +1280,33 @@ __blockdev_direct_IO(int rw, struct kioc
 	dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
 		(end > i_size_read(inode)));
 
+	/*
+	 * extend the file if needed, and drop i_mutex for non-aio writes.
+	 * We extend the file by creating a hole that is later filled in
+	 * during the O_DIRECT.  Because pages are locked or placeholders
+	 * are inserted, the locking rules end up being the same as
+	 * mmap'd writes using writepage to fill holes
+	 */
+	dio->reacquire_i_mutex = 0;
+	if (dio_lock_type == DIO_LOCKING && (rw & WRITE)) {
+		/* if our write goes past i_size, do an expanding
+		 * truncate to fill it before dropping i_mutex
+		 */
+		if (end > i_size_read(inode) && iocb->ki_filp) {
+			struct iattr newattrs;
+			newattrs.ia_size = end;
+			newattrs.ia_file = iocb->ki_filp;
+			newattrs.ia_valid = ATTR_SIZE | ATTR_FILE;
+			retval = notify_change(iocb->ki_filp->f_dentry,
+					       &newattrs);
+			if (retval)
+				goto out;
+		}
+		if (is_sync_kiocb(iocb)) {
+			dio->reacquire_i_mutex = 1;
+			mutex_unlock(&inode->i_mutex);
+		}
+	}
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
 				nr_segs, blkbits, get_block, end_io, dio);
 out:
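To see what the patch means from userspace: with patch 3 applied, writes
like the two below (one extending the file past EOF, one filling the
hole that creates) should stay O_DIRECT rather than taking the old
fallback to buffered IO.  A minimal test sketch; the file name, offsets,
and sizes are arbitrary, and all are kept 4k-aligned as O_DIRECT
requires:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN	4096
#define IOSIZE	(64 * 1024)

int main(int argc, char **argv)
{
	void *buf;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <testfile>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_CREAT | O_TRUNC | O_WRONLY | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign(&buf, ALIGN, IOSIZE)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}
	memset(buf, 0xab, IOSIZE);

	/* extend the file: write far past EOF, leaving a hole below */
	if (pwrite(fd, buf, IOSIZE, 1024 * 1024) < 0)
		perror("pwrite past EOF");

	/* fill part of the hole just created */
	if (pwrite(fd, buf, IOSIZE, 512 * 1024) < 0)
		perror("pwrite into hole");

	close(fd);
	free(buf);
	return 0;
}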
* Re: [RFC PATCH 3/3] DIO: don't fall back to buffered writes
From: Suparna Bhattacharya @ 2006-11-01 10:51 UTC
To: Chris Mason; +Cc: linux-fsdevel, akpm, zach.brown

On Fri, Oct 27, 2006 at 02:26:12PM -0400, Chris Mason wrote:
> DIO: don't fall back to buffered writes
>
> Placeholder pages allow DIO to use locking rules similar to those of
> writepage.  DIO can now fill holes, and it can extend the file via
> expanding truncates.  This creates holes that are filled during the
> DIO write.
>
> i_mutex can be dropped during writes once DIO decides if the file
> needs to be extended.
>
> The expanding truncate may cause some pagecache pages to be dirtied.
> The call to find_or_insert_placeholders is changed to wait on dirty
> or writeback pages in the case of writes as well as reads.
>
> Signed-off-by: Chris Mason <chris.mason@oracle.com>

[ ... ]

> @@ -1283,6 +1280,33 @@ __blockdev_direct_IO(int rw, struct kioc
>  	dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
>  		(end > i_size_read(inode)));
>
> +	/*
> +	 * extend the file if needed, and drop i_mutex for non-aio writes.
> +	 * We extend the file by creating a hole that is later filled in
> +	 * during the O_DIRECT.  Because pages are locked or placeholders
> +	 * are inserted, the locking rules end up being the same as
> +	 * mmap'd writes using writepage to fill holes

I like the idea of extending the size in the beginning to avoid having
to hold i_sem for the extended writes case.  The only question is
whether there is a semantics issue here - e.g. if there isn't enough
space, would the app which did an append still see the size change even
though the blocks themselves are not populated ?  Of course we already
have this happen in the -EIO case for AIO-DIO writes, so I am not sure
if this is an issue.
> +	 */
> +	dio->reacquire_i_mutex = 0;
> +	if (dio_lock_type == DIO_LOCKING && (rw & WRITE)) {

Hmm, we really do want to be able to avoid all these various locking
mode checks inside the DIO code to simplify things.  I thought you had
mostly managed to do away with these in your previous patch.
Unfortunately this change, while it should help concurrency, brings
that check back in.  One of the plans I had earlier was to pull this up
to a higher level - i.e. into the locking version of
blockdev_direct_IO(), thus doing away with the flag and related
confusion.  Do you think that would make sense ?

> +		/* if our write goes past i_size, do an expanding
> +		 * truncate to fill it before dropping i_mutex
> +		 */
> +		if (end > i_size_read(inode) && iocb->ki_filp) {
> +			struct iattr newattrs;
> +			newattrs.ia_size = end;
> +			newattrs.ia_file = iocb->ki_filp;
> +			newattrs.ia_valid = ATTR_SIZE | ATTR_FILE;
> +			retval = notify_change(iocb->ki_filp->f_dentry,
> +					       &newattrs);
> +			if (retval)
> +				goto out;
> +		}
> +		if (is_sync_kiocb(iocb)) {
> +			dio->reacquire_i_mutex = 1;
> +			mutex_unlock(&inode->i_mutex);

BTW, don't the page locks cover the AIO case as well ?  I'm wondering
why we need this special case here.

Regards
Suparna

> +		}
> +	}
> 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
> 				nr_segs, blkbits, get_block, end_io, dio);
> out:

--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India
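For reference, lock_type reaches __blockdev_direct_IO() through thin
wrappers, roughly as sketched below from memory of 2.6.19-era
include/linux/fs.h (the exact flag values and comment text may differ,
so treat this as approximate).  Suparna's suggestion amounts to doing
the expanding truncate and i_mutex handling inside the DIO_LOCKING
wrapper, so the core never needs to test the flag:

/* legal values for the dio_lock_type argument -- approximate */
#define DIO_LOCKING	1	/* locking between buffered and direct access */
#define DIO_NO_LOCKING	2	/* bdev; no locking at all */
#define DIO_OWN_LOCKING	3	/* filesystem handles its own locking */

static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
	struct inode *inode, struct block_device *bdev, const struct iovec *iov,
	loff_t offset, unsigned long nr_segs, get_block_t get_block,
	dio_iodone_t end_io)
{
	return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
				nr_segs, get_block, end_io, DIO_LOCKING);
}

static inline ssize_t blockdev_direct_IO_no_locking(int rw,
	struct kiocb *iocb, struct inode *inode, struct block_device *bdev,
	const struct iovec *iov, loff_t offset, unsigned long nr_segs,
	get_block_t get_block, dio_iodone_t end_io)
{
	return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
				nr_segs, get_block, end_io, DIO_NO_LOCKING);
}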
* Re: [RFC PATCH 3/3] DIO: don't fall back to buffered writes
From: Chris Mason @ 2006-11-01 12:47 UTC
To: Suparna Bhattacharya; +Cc: linux-fsdevel, akpm, zach.brown

On Wed, Nov 01, 2006 at 04:21:37PM +0530, Suparna Bhattacharya wrote:
> > +	/*
> > +	 * extend the file if needed, and drop i_mutex for non-aio writes.
> > +	 * We extend the file by creating a hole that is later filled in
> > +	 * during the O_DIRECT.  Because pages are locked or placeholders
> > +	 * are inserted, the locking rules end up being the same as
> > +	 * mmap'd writes using writepage to fill holes
>
> I like the idea of extending the size in the beginning to avoid having
> to hold i_sem for the extended writes case.  The only question is
> whether there is a semantics issue here - e.g. if there isn't enough
> space, would the app which did an append still see the size change
> even though the blocks themselves are not populated ?  Of course we
> already have this happen in the -EIO case for AIO-DIO writes, so I am
> not sure if this is an issue.

You're right, there is a semantic issue; I'm being a little loose with
the traditional write() expectations.  The current code creates a hole
and does not remove it in the face of errors.

> > +	 */
> > +	dio->reacquire_i_mutex = 0;
> > +	if (dio_lock_type == DIO_LOCKING && (rw & WRITE)) {
>
> Hmm, we really do want to be able to avoid all these various locking
> mode checks inside the DIO code to simplify things.  I thought you had
> mostly managed to do away with these in your previous patch.
> Unfortunately this change, while it should help concurrency, brings
> that check back in.  One of the plans I had earlier was to pull this
> up to a higher level - i.e. into the locking version of
> blockdev_direct_IO(), thus doing away with the flag and related
> confusion.  Do you think that would make sense ?

I went the other direction, making the flags a little more formal
(patch will go out today).  I don't have strong feelings about this,
but it seems easiest to me to keep the flags inside
blockdev_direct_IO.

> > +		/* if our write goes past i_size, do an expanding
> > +		 * truncate to fill it before dropping i_mutex
> > +		 */
> > +		if (end > i_size_read(inode) && iocb->ki_filp) {
> > +			struct iattr newattrs;
> > +			newattrs.ia_size = end;
> > +			newattrs.ia_file = iocb->ki_filp;
> > +			newattrs.ia_valid = ATTR_SIZE | ATTR_FILE;
> > +			retval = notify_change(iocb->ki_filp->f_dentry,
> > +					       &newattrs);
> > +			if (retval)
> > +				goto out;
> > +		}
> > +		if (is_sync_kiocb(iocb)) {
> > +			dio->reacquire_i_mutex = 1;
> > +			mutex_unlock(&inode->i_mutex);
>
> BTW, don't the page locks cover the AIO case as well ?  I'm wondering
> why we need this special case here.

Only because i_mutex is already dropped implicitly by the aio code when
we return.  Ideally we could drop it sooner inside fs/direct-io.c, but
I'll leave that tuning for a later patch.

-chris
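The semantic point is easy to observe: because the expanding truncate
runs before the IO, a failed O_DIRECT append can leave i_size already
extended over an unpopulated hole.  A hypothetical userspace
illustration (it needs a nearly-full filesystem to actually trigger the
failure, and assumes the file's existing size is aligned suitably for
O_DIRECT):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define ALIGN	4096
#define IOSIZE	(1024 * 1024)

int main(int argc, char **argv)
{
	struct stat st;
	void *buf;
	ssize_t ret;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file on nearly-full fs>\n",
			argv[0]);
		return 1;
	}
	fd = open(argv[1], O_CREAT | O_WRONLY | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign(&buf, ALIGN, IOSIZE)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}
	memset(buf, 0, IOSIZE);

	fstat(fd, &st);
	printf("size before append: %lld\n", (long long)st.st_size);

	/* append one aligned chunk at EOF */
	ret = pwrite(fd, buf, IOSIZE, st.st_size);
	if (ret < 0)
		perror("pwrite");

	/*
	 * With the expanding-truncate-first scheme, st_size can show
	 * the extended size here even when the pwrite above failed.
	 */
	fstat(fd, &st);
	printf("size after append:  %lld\n", (long long)st.st_size);

	close(fd);
	free(buf);
	return 0;
}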
* Re: [RFC PATCH 3/3] DIO: don't fall back to buffered writes
From: Suparna Bhattacharya @ 2006-11-03 7:25 UTC
To: Chris Mason; +Cc: linux-fsdevel, akpm, zach.brown

On Wed, Nov 01, 2006 at 07:47:19AM -0500, Chris Mason wrote:
> On Wed, Nov 01, 2006 at 04:21:37PM +0530, Suparna Bhattacharya wrote:

[ ... ]

> > I like the idea of extending the size in the beginning to avoid having
> > to hold i_sem for the extended writes case.  The only question is
> > whether there is a semantics issue here - e.g. if there isn't enough
> > space, would the app which did an append still see the size change
> > even though the blocks themselves are not populated ?
>
> You're right, there is a semantic issue; I'm being a little loose with
> the traditional write() expectations.  The current code creates a hole
> and does not remove it in the face of errors.
>
> > Hmm, we really do want to be able to avoid all these various locking
> > mode checks inside the DIO code to simplify things.
> >
> > [ ... ]
>
> I went the other direction, making the flags a little more formal
> (patch will go out today).  I don't have strong feelings about this,
> but it seems easiest to me to keep the flags inside
> blockdev_direct_IO.

Do we have some kind of description of the conditions under which each
flag would be used, and by whom (not just what they do) ?  For example,
I have forgotten why DIO_CREATE was actually introduced for XFS.  Is it
still needed now that we no longer need to do the buffered IO fallback
for overwriting holes ?  That would help the discussion.  And then there
used to be special flags for cluster filesystems, introduced by GFS ...
all of it got to be quite confusing.  The formality helps, but only to
an extent.

> > BTW, don't the page locks cover the AIO case as well ?  I'm wondering
> > why we need this special case here.
>
> Only because i_mutex is already dropped implicitly by the aio code when
> we return.  Ideally we could drop it sooner inside fs/direct-io.c, but
> I'll leave that tuning for a later patch.

OK, I see.

There is one other thing that I'm not sure we have covered yet:
ensuring ordered-mode guarantees with AIO-DIO.  As long as we were
falling back to buffered IO in cases involving block allocation, or
forcing extends to be synchronous, we probably didn't need to worry
about this.  How do we make sure we address that now ?

Regards
Suparna

--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India