* [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback
@ 2010-06-04 18:40 Jan Kara
  2010-06-04 18:40 ` [PATCH 1/2] radix-tree: Implement function radix_tree_gang_tag_if_tagged Jan Kara
  2010-06-04 18:40 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
  0 siblings, 2 replies; 13+ messages in thread
From: Jan Kara @ 2010-06-04 18:40 UTC (permalink / raw)
To: linux-fsdevel; +Cc: Andrew Morton, npiggin, david, linux-mm

  Hi,

  I've revived my patches to implement livelock avoidance for data integrity
writes. Due to concerns that tagging pages before writeout might be too
costly for WB_SYNC_NONE mode (where we stop after nr_to_write pages), I've
changed the patch to use page tagging only in WB_SYNC_ALL mode, where we are
sure that we write out all the tagged pages. Later, we can think about using
tagging for livelock avoidance for WB_SYNC_NONE mode as well...

  As always, comments are welcome.

								Honza
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [PATCH 1/2] radix-tree: Implement function radix_tree_gang_tag_if_tagged
  2010-06-04 18:40 [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback Jan Kara
@ 2010-06-04 18:40 ` Jan Kara
  2010-06-09 23:30   ` Andrew Morton
  2010-06-04 18:40 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Kara @ 2010-06-04 18:40 UTC (permalink / raw)
To: linux-fsdevel; +Cc: Andrew Morton, npiggin, david, linux-mm, Jan Kara

Implement a function for setting one tag, if another tag is set,
for each item in a given range.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/radix-tree.h |    3 ++
 lib/radix-tree.c           |   82 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 55ca73c..efdfb07 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -192,6 +192,9 @@ unsigned int radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root,
 		void ***results, unsigned long first_index,
 		unsigned int max_items, unsigned int tag);
+unsigned long radix_tree_gang_tag_if_tagged(struct radix_tree_root *root,
+		unsigned long first_index, unsigned long last_index,
+		unsigned int fromtag, unsigned int totag);
 int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);

 static inline void radix_tree_preload_end(void)
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 05da38b..c4595b2 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -609,6 +609,88 @@ int radix_tree_tag_get(struct radix_tree_root *root,
 EXPORT_SYMBOL(radix_tree_tag_get);

 /**
+ * radix_tree_gang_tag_if_tagged - for each item in given range set given
+ *				   tag if item has another tag set
+ * @root:	radix tree root
+ * @first_index: starting index of a range to scan
+ * @last_index:	last index of a range to scan
+ * @iftag:	tag index to test
+ * @settag:	tag index to set if tested tag is set
+ *
+ * This function scans range of radix tree from first_index to last_index.
+ * For each item in the range if iftag is set, the function sets also
+ * settag.
+ *
+ * The function returns number of leaves where the tag was set.
+ */
+unsigned long radix_tree_gang_tag_if_tagged(struct radix_tree_root *root,
+		unsigned long first_index, unsigned long last_index,
+		unsigned int iftag, unsigned int settag)
+{
+	unsigned int height = root->height, shift;
+	unsigned long tagged = 0, index = first_index;
+	struct radix_tree_node *open_slots[height], *slot;
+
+	last_index = min(last_index, radix_tree_maxindex(height));
+	if (first_index > last_index)
+		return 0;
+	if (!root_tag_get(root, iftag))
+		return 0;
+	if (height == 0) {
+		root_tag_set(root, settag);
+		return 1;
+	}
+
+	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+	slot = radix_tree_indirect_to_ptr(root->rnode);
+
+	for (;;) {
+		int offset;
+
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		if (!slot->slots[offset])
+			goto next;
+		if (!tag_get(slot, iftag, offset))
+			goto next;
+		tag_set(slot, settag, offset);
+		if (height == 1) {
+			tagged++;
+			goto next;
+		}
+		/* Go down one level */
+		height--;
+		shift -= RADIX_TREE_MAP_SHIFT;
+		open_slots[height] = slot;
+		slot = slot->slots[offset];
+		continue;
+next:
+		/* Go to next item at level determined by 'shift' */
+		index = ((index >> shift) + 1) << shift;
+		if (index > last_index)
+			break;
+		while (((index >> shift) & RADIX_TREE_MAP_MASK) == 0) {
+			/*
+			 * We've fully scanned this node. Go up. Because
+			 * last_index is guaranteed to be in the tree, what
+			 * we do below cannot wander astray.
+			 */
+			slot = open_slots[height];
+			height++;
+			shift += RADIX_TREE_MAP_SHIFT;
+		}
+	}
+	/*
+	 * The iftag must have been set somewhere because otherwise
+	 * we would have returned immediately at the beginning of the function.
+	 */
+	root_tag_set(root, settag);
+
+	return tagged;
+}
+EXPORT_SYMBOL(radix_tree_gang_tag_if_tagged);
+
+
 /**
  * radix_tree_next_hole    -    find the next hole (not-present entry)
  * @root:	tree root
  * @index:	index key
--
1.6.4.2
* Re: [PATCH 1/2] radix-tree: Implement function radix_tree_gang_tag_if_tagged
  2010-06-04 18:40 ` [PATCH 1/2] radix-tree: Implement function radix_tree_gang_tag_if_tagged Jan Kara
@ 2010-06-09 23:30   ` Andrew Morton
  2010-06-10 12:28     ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2010-06-09 23:30 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, npiggin, david, linux-mm

On Fri, 4 Jun 2010 20:40:53 +0200
Jan Kara <jack@suse.cz> wrote:

> Implement function for setting one tag if another tag is set
> for each item in given range.
>
> ...
>
>  /**
> + * radix_tree_gang_tag_if_tagged - for each item in given range set given
> + *				    tag if item has another tag set
> + * @root:	radix tree root
> + * @first_index: starting index of a range to scan
> + * @last_index:	last index of a range to scan
> + * @iftag:	tag index to test
> + * @settag:	tag index to set if tested tag is set
> + *
> + * This function scans range of radix tree from first_index to last_index.
> + * For each item in the range if iftag is set, the function sets also
> + * settag.
> + *
> + * The function returns number of leaves where the tag was set.
> + */
> +unsigned long radix_tree_gang_tag_if_tagged(struct radix_tree_root *root,
> +		unsigned long first_index, unsigned long last_index,
> +		unsigned int iftag, unsigned int settag)

This is kind of a misuse of the term "gang".

First we had radix_tree_lookup(), which returned a single page.

That was a bit inefficient, so then we added radix_tree_gang_lookup(),
which returned a "gang" of pages.

But radix_tree_gang_tag_if_tagged() doesn't return a gang of anything
(it has no `void **results' argument).

radix_tree_range_tag_if_tagged()?
> +{ > + unsigned int height = root->height, shift; > + unsigned long tagged = 0, index = first_index; > + struct radix_tree_node *open_slots[height], *slot; > + > + last_index = min(last_index, radix_tree_maxindex(height)); > + if (first_index > last_index) > + return 0; > + if (!root_tag_get(root, iftag)) > + return 0; > + if (height == 0) { > + root_tag_set(root, settag); > + return 1; > + } > + > + shift = (height - 1) * RADIX_TREE_MAP_SHIFT; > + slot = radix_tree_indirect_to_ptr(root->rnode); > + > + for (;;) { > + int offset; > + > + offset = (index >> shift) & RADIX_TREE_MAP_MASK; > + if (!slot->slots[offset]) > + goto next; > + if (!tag_get(slot, iftag, offset)) > + goto next; > + tag_set(slot, settag, offset); > + if (height == 1) { > + tagged++; > + goto next; > + } > + /* Go down one level */ > + height--; > + shift -= RADIX_TREE_MAP_SHIFT; > + open_slots[height] = slot; > + slot = slot->slots[offset]; > + continue; > +next: > + /* Go to next item at level determined by 'shift' */ > + index = ((index >> shift) + 1) << shift; > + if (index > last_index) > + break; > + while (((index >> shift) & RADIX_TREE_MAP_MASK) == 0) { > + /* > + * We've fully scanned this node. Go up. Because > + * last_index is guaranteed to be in the tree, what > + * we do below cannot wander astray. > + */ > + slot = open_slots[height]; > + height++; > + shift += RADIX_TREE_MAP_SHIFT; > + } > + } > + /* > + * The iftag must have been set somewhere because otherwise > + * we would return immediated at the beginning of the function > + */ > + root_tag_set(root, settag); > + > + return tagged; > +} > +EXPORT_SYMBOL(radix_tree_gang_tag_if_tagged); Wouldn't this be a lot simpler if it used __lookup_tag()? Along the lines of do { slot *slots[N]; n = __lookup_tag(.., slots, ...); for (i = 0; i < n; i++) tag_set(slots[i], ...); } while (something); ? That's still one cache miss per slot and misses on the slots will preponderate, so the performance won't be much different. 
* Re: [PATCH 1/2] radix-tree: Implement function radix_tree_gang_tag_if_tagged 2010-06-09 23:30 ` Andrew Morton @ 2010-06-10 12:28 ` Jan Kara 0 siblings, 0 replies; 13+ messages in thread From: Jan Kara @ 2010-06-10 12:28 UTC (permalink / raw) To: Andrew Morton; +Cc: Jan Kara, linux-fsdevel, npiggin, david, linux-mm On Wed 09-06-10 16:30:45, Andrew Morton wrote: > On Fri, 4 Jun 2010 20:40:53 +0200 > Jan Kara <jack@suse.cz> wrote: > > > Implement function for setting one tag if another tag is set > > for each item in given range. > > > > > > ... > > > > /** > > + * radix_tree_gang_tag_if_tagged - for each item in given range set given > > + * tag if item has another tag set > > + * @root: radix tree root > > + * @first_index: starting index of a range to scan > > + * @last_index: last index of a range to scan > > + * @iftag: tag index to test > > + * @settag: tag index to set if tested tag is set > > + * > > + * This function scans range of radix tree from first_index to last_index. > > + * For each item in the range if iftag is set, the function sets also > > + * settag. > > + * > > + * The function returns number of leaves where the tag was set. > > + */ > > +unsigned long radix_tree_gang_tag_if_tagged(struct radix_tree_root *root, > > + unsigned long first_index, unsigned long last_index, > > + unsigned int iftag, unsigned int settag) > > This is kind of a misuse of the term "gang". > > First we had radix_tree_lookup(), which returned a single page. > > That was a bit inefficient, so then we added radix_tree_gang_lookup(), > which retuned a "gang" of pages. > > But radix_tree_gang_tag_if_tagged() doesn't return a gang of anything > (it has no `void **results' argument). > > radix_tree_range_tag_if_tagged()? Good point and your name looks better. Changed. 
> > +{ > > + unsigned int height = root->height, shift; > > + unsigned long tagged = 0, index = first_index; > > + struct radix_tree_node *open_slots[height], *slot; > > + > > + last_index = min(last_index, radix_tree_maxindex(height)); > > + if (first_index > last_index) > > + return 0; > > + if (!root_tag_get(root, iftag)) > > + return 0; > > + if (height == 0) { > > + root_tag_set(root, settag); > > + return 1; > > + } > > + > > + shift = (height - 1) * RADIX_TREE_MAP_SHIFT; > > + slot = radix_tree_indirect_to_ptr(root->rnode); > > + > > + for (;;) { > > + int offset; > > + > > + offset = (index >> shift) & RADIX_TREE_MAP_MASK; > > + if (!slot->slots[offset]) > > + goto next; > > + if (!tag_get(slot, iftag, offset)) > > + goto next; > > + tag_set(slot, settag, offset); > > + if (height == 1) { > > + tagged++; > > + goto next; > > + } > > + /* Go down one level */ > > + height--; > > + shift -= RADIX_TREE_MAP_SHIFT; > > + open_slots[height] = slot; > > + slot = slot->slots[offset]; > > + continue; > > +next: > > + /* Go to next item at level determined by 'shift' */ > > + index = ((index >> shift) + 1) << shift; > > + if (index > last_index) > > + break; > > + while (((index >> shift) & RADIX_TREE_MAP_MASK) == 0) { > > + /* > > + * We've fully scanned this node. Go up. Because > > + * last_index is guaranteed to be in the tree, what > > + * we do below cannot wander astray. > > + */ > > + slot = open_slots[height]; > > + height++; > > + shift += RADIX_TREE_MAP_SHIFT; > > + } > > + } > > + /* > > + * The iftag must have been set somewhere because otherwise > > + * we would return immediated at the beginning of the function > > + */ > > + root_tag_set(root, settag); > > + > > + return tagged; > > +} > > +EXPORT_SYMBOL(radix_tree_gang_tag_if_tagged); > > Wouldn't this be a lot simpler if it used __lookup_tag()? 
Along the > lines of > > do { > slot *slots[N]; > > n = __lookup_tag(.., slots, ...); > for (i = 0; i < n; i++) > tag_set(slots[i], ...); > } while (something); > > ? > > That's still one cache miss per slot and misses on the slots will > preponderate, so the performance won't be much different. But __lookup_tag returns only leaf nodes so we'd have to reimplement something like what radix_tree_tag_set does (tag_set sets tag only for that one node) - in particular, we'd have to reconstruct the radix tree path. Also __lookup_tag cares about RCU which my function doesn't have to because it is guaranteed to be called under spin lock. I'm not sure whether the cost is noticeable or not though... So do you still think it's worthwhile to go the simpler way? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-04 18:40 [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback Jan Kara
  2010-06-04 18:40 ` [PATCH 1/2] radix-tree: Implement function radix_tree_gang_tag_if_tagged Jan Kara
@ 2010-06-04 18:40 ` Jan Kara
  2010-06-09 23:41   ` Andrew Morton
  2010-06-09 23:45   ` Andrew Morton
  1 sibling, 2 replies; 13+ messages in thread
From: Jan Kara @ 2010-06-04 18:40 UTC (permalink / raw)
To: linux-fsdevel; +Cc: Andrew Morton, npiggin, david, linux-mm, Jan Kara

We try to avoid livelocks of writeback when someone steadily creates
dirty pages in a mapping we are writing out. For memory-cleaning
writeback, using nr_to_write works reasonably well but we cannot
really use it for data integrity writeback. This patch tries to
solve the problem.

The idea is simple: Tag all pages that should be written back
with a special tag (TOWRITE) in the radix tree. This can be done
rather quickly and thus livelocks should not happen in practice.
Then we start doing the hard work of locking pages and sending
them to disk only for those pages that have TOWRITE tag set.
Signed-off-by: Jan Kara <jack@suse.cz> --- include/linux/fs.h | 1 + include/linux/radix-tree.h | 2 +- mm/page-writeback.c | 44 ++++++++++++++++++++++++++++++++++++++++++-- 3 files changed, 44 insertions(+), 3 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 3428393..fe308f0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -685,6 +685,7 @@ struct block_device { */ #define PAGECACHE_TAG_DIRTY 0 #define PAGECACHE_TAG_WRITEBACK 1 +#define PAGECACHE_TAG_TOWRITE 2 int mapping_tagged(struct address_space *mapping, int tag); diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index efdfb07..f7ebff8 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr) /*** radix-tree API starts here ***/ -#define RADIX_TREE_MAX_TAGS 2 +#define RADIX_TREE_MAX_TAGS 3 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */ struct radix_tree_root { diff --git a/mm/page-writeback.c b/mm/page-writeback.c index b289310..f590a12 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -807,6 +807,30 @@ void __init page_writeback_init(void) } /** + * tag_pages_for_writeback - tag pages to be written by write_cache_pages + * @mapping: address space structure to write + * @start: starting page index + * @end: ending page index (inclusive) + * + * This function scans the page range from @start to @end and tags all pages + * that have DIRTY tag set with a special TOWRITE tag. The idea is that + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag + * to identify pages eligible for writeback. This mechanism is used to avoid + * livelocking of writeback by a process steadily creating new dirty pages in + * the file (thus it is important for this function to be damn quick so that it + * can tag pages faster than a dirtying process can create them). 
+ */ +void tag_pages_for_writeback(struct address_space *mapping, + pgoff_t start, pgoff_t end) +{ + spin_lock_irq(&mapping->tree_lock); + radix_tree_gang_tag_if_tagged(&mapping->page_tree, start, end, + PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE); + spin_unlock_irq(&mapping->tree_lock); +} +EXPORT_SYMBOL(tag_pages_for_writeback); + +/** * write_cache_pages - walk the list of dirty pages of the given address space and write all of them. * @mapping: address space structure to write * @wbc: subtract the number of written pages from *@wbc->nr_to_write @@ -820,6 +844,13 @@ void __init page_writeback_init(void) * the call was made get new I/O started against them. If wbc->sync_mode is * WB_SYNC_ALL then we were called for data integrity and we must wait for * existing IO to complete. + * + * To avoid livelocks (when other process dirties new pages), we first tag + * pages which should be written back with TOWRITE tag and only then start + * writing them. For data-integrity sync we have to be careful so that we do + * not miss some pages (e.g., because some other process has cleared TOWRITE + * tag we set). The rule we follow is that TOWRITE tag can be cleared only + * by the process clearing the DIRTY tag (and submitting the page for IO). 
*/ int write_cache_pages(struct address_space *mapping, struct writeback_control *wbc, writepage_t writepage, @@ -836,6 +867,7 @@ int write_cache_pages(struct address_space *mapping, int cycled; int range_whole = 0; long nr_to_write = wbc->nr_to_write; + int tag; pagevec_init(&pvec, 0); if (wbc->range_cyclic) { @@ -853,13 +885,18 @@ int write_cache_pages(struct address_space *mapping, range_whole = 1; cycled = 1; /* ignore range_cyclic tests */ } + if (wbc->sync_mode == WB_SYNC_ALL) + tag = PAGECACHE_TAG_TOWRITE; + else + tag = PAGECACHE_TAG_DIRTY; retry: + if (wbc->sync_mode == WB_SYNC_ALL) + tag_pages_for_writeback(mapping, index, end); done_index = index; while (!done && (index <= end)) { int i; - nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, - PAGECACHE_TAG_DIRTY, + nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag, min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); if (nr_pages == 0) break; @@ -1319,6 +1356,9 @@ int test_set_page_writeback(struct page *page) radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); + radix_tree_tag_clear(&mapping->page_tree, + page_index(page), + PAGECACHE_TAG_TOWRITE); spin_unlock_irqrestore(&mapping->tree_lock, flags); } else { ret = TestSetPageWriteback(page); -- 1.6.4.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging 2010-06-04 18:40 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara @ 2010-06-09 23:41 ` Andrew Morton 2010-06-10 12:31 ` Jan Kara 2010-06-09 23:45 ` Andrew Morton 1 sibling, 1 reply; 13+ messages in thread From: Andrew Morton @ 2010-06-09 23:41 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, npiggin, david, linux-mm On Fri, 4 Jun 2010 20:40:54 +0200 Jan Kara <jack@suse.cz> wrote: > We try to avoid livelocks of writeback when some steadily creates > dirty pages in a mapping we are writing out. For memory-cleaning > writeback, using nr_to_write works reasonably well but we cannot > really use it for data integrity writeback. This patch tries to > solve the problem. > > The idea is simple: Tag all pages that should be written back > with a special tag (TOWRITE) in the radix tree. This can be done > rather quickly and thus livelocks should not happen in practice. > Then we start doing the hard work of locking pages and sending > them to disk only for those pages that have TOWRITE tag set. 
> > Signed-off-by: Jan Kara <jack@suse.cz> > --- > include/linux/fs.h | 1 + > include/linux/radix-tree.h | 2 +- > mm/page-writeback.c | 44 ++++++++++++++++++++++++++++++++++++++++++-- > 3 files changed, 44 insertions(+), 3 deletions(-) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3428393..fe308f0 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -685,6 +685,7 @@ struct block_device { > */ > #define PAGECACHE_TAG_DIRTY 0 > #define PAGECACHE_TAG_WRITEBACK 1 > +#define PAGECACHE_TAG_TOWRITE 2 > > int mapping_tagged(struct address_space *mapping, int tag); > > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h > index efdfb07..f7ebff8 100644 > --- a/include/linux/radix-tree.h > +++ b/include/linux/radix-tree.h > @@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr) > > /*** radix-tree API starts here ***/ > > -#define RADIX_TREE_MAX_TAGS 2 > +#define RADIX_TREE_MAX_TAGS 3 > > /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */ > struct radix_tree_root { > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index b289310..f590a12 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -807,6 +807,30 @@ void __init page_writeback_init(void) > } > > /** > + * tag_pages_for_writeback - tag pages to be written by write_cache_pages > + * @mapping: address space structure to write > + * @start: starting page index > + * @end: ending page index (inclusive) > + * > + * This function scans the page range from @start to @end I'd add "inclusive" here as well. Add it everywhere, very carefully (or "exclusive"). it really is a minefield and we've had off-by-ones from this before > and tags all pages > + * that have DIRTY tag set with a special TOWRITE tag. The idea is that > + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag > + * to identify pages eligible for writeback. 
This mechanism is used to avoid > + * livelocking of writeback by a process steadily creating new dirty pages in > + * the file (thus it is important for this function to be damn quick so that it > + * can tag pages faster than a dirtying process can create them). > + */ > +void tag_pages_for_writeback(struct address_space *mapping, > + pgoff_t start, pgoff_t end) > +{ > + spin_lock_irq(&mapping->tree_lock); > + radix_tree_gang_tag_if_tagged(&mapping->page_tree, start, end, > + PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE); > + spin_unlock_irq(&mapping->tree_lock); > +} > +EXPORT_SYMBOL(tag_pages_for_writeback); Problem. For how long can this disable interrupts? Maybe 1TB of dirty pagecache before the NMI watchdog starts getting involved? Could be a problem in some situations. Easy enough to fix - just walk the range in 1000(?) page hunks, dropping the lock and doing cond_resched() each time. radix_tree_gang_tag_if_tagged() will need to return next_index to make that efficient with sparse files. > +/** > * write_cache_pages - walk the list of dirty pages of the given address space and write all of them. > * @mapping: address space structure to write > * @wbc: subtract the number of written pages from *@wbc->nr_to_write > @@ -820,6 +844,13 @@ void __init page_writeback_init(void) > * the call was made get new I/O started against them. If wbc->sync_mode is > * WB_SYNC_ALL then we were called for data integrity and we must wait for > * existing IO to complete. > + * > + * To avoid livelocks (when other process dirties new pages), we first tag > + * pages which should be written back with TOWRITE tag and only then start > + * writing them. For data-integrity sync we have to be careful so that we do > + * not miss some pages (e.g., because some other process has cleared TOWRITE > + * tag we set). The rule we follow is that TOWRITE tag can be cleared only > + * by the process clearing the DIRTY tag (and submitting the page for IO). 
> */

Seems simple enough and I can't think of any bugs which the obvious
races will cause.
* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging 2010-06-09 23:41 ` Andrew Morton @ 2010-06-10 12:31 ` Jan Kara 0 siblings, 0 replies; 13+ messages in thread From: Jan Kara @ 2010-06-10 12:31 UTC (permalink / raw) To: Andrew Morton; +Cc: Jan Kara, linux-fsdevel, npiggin, david, linux-mm On Wed 09-06-10 16:41:15, Andrew Morton wrote: > On Fri, 4 Jun 2010 20:40:54 +0200 > Jan Kara <jack@suse.cz> wrote: > > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h > > index efdfb07..f7ebff8 100644 > > --- a/include/linux/radix-tree.h > > +++ b/include/linux/radix-tree.h > > @@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr) > > > > /*** radix-tree API starts here ***/ > > > > -#define RADIX_TREE_MAX_TAGS 2 > > +#define RADIX_TREE_MAX_TAGS 3 > > > > /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */ > > struct radix_tree_root { > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > > index b289310..f590a12 100644 > > --- a/mm/page-writeback.c > > +++ b/mm/page-writeback.c > > @@ -807,6 +807,30 @@ void __init page_writeback_init(void) > > } > > > > /** > > + * tag_pages_for_writeback - tag pages to be written by write_cache_pages > > + * @mapping: address space structure to write > > + * @start: starting page index > > + * @end: ending page index (inclusive) > > + * > > + * This function scans the page range from @start to @end > > I'd add "inclusive" here as well. Add it everywhere, very carefully > (or "exclusive"). it really is a minefield and we've had off-by-ones > from this before Done. > > and tags all pages > > + * that have DIRTY tag set with a special TOWRITE tag. The idea is that > > + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag > > + * to identify pages eligible for writeback. 
This mechanism is used to avoid > > + * livelocking of writeback by a process steadily creating new dirty pages in > > + * the file (thus it is important for this function to be damn quick so that it > > + * can tag pages faster than a dirtying process can create them). > > + */ > > +void tag_pages_for_writeback(struct address_space *mapping, > > + pgoff_t start, pgoff_t end) > > +{ > > + spin_lock_irq(&mapping->tree_lock); > > + radix_tree_gang_tag_if_tagged(&mapping->page_tree, start, end, > > + PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE); > > + spin_unlock_irq(&mapping->tree_lock); > > +} > > +EXPORT_SYMBOL(tag_pages_for_writeback); > > Problem. For how long can this disable interrupts? Maybe 1TB of dirty > pagecache before the NMI watchdog starts getting involved? Could be a > problem in some situations. > > Easy enough to fix - just walk the range in 1000(?) page hunks, > dropping the lock and doing cond_resched() each time. > radix_tree_gang_tag_if_tagged() will need to return next_index to make > that efficient with sparse files. Yes, Nick had a similar objection. I've already fixed it in my tree the way you suggest. > > +/** > > * write_cache_pages - walk the list of dirty pages of the given address space and write all of them. > > * @mapping: address space structure to write > > * @wbc: subtract the number of written pages from *@wbc->nr_to_write > > @@ -820,6 +844,13 @@ void __init page_writeback_init(void) > > * the call was made get new I/O started against them. If wbc->sync_mode is > > * WB_SYNC_ALL then we were called for data integrity and we must wait for > > * existing IO to complete. > > + * > > + * To avoid livelocks (when other process dirties new pages), we first tag > > + * pages which should be written back with TOWRITE tag and only then start > > + * writing them. For data-integrity sync we have to be careful so that we do > > + * not miss some pages (e.g., because some other process has cleared TOWRITE > > + * tag we set). 
The rule we follow is that TOWRITE tag can be cleared only
> > + * by the process clearing the DIRTY tag (and submitting the page for IO).
> > */
>
> Seems simple enough and I can't think of any bugs which the obvious
> races will cause.

Thanks for looking into the patches!

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-04 18:40 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
  2010-06-09 23:41 ` Andrew Morton
@ 2010-06-09 23:45 ` Andrew Morton
  2010-06-10 12:42   ` Jan Kara
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2010-06-09 23:45 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, npiggin, david, linux-mm

On Fri, 4 Jun 2010 20:40:54 +0200
Jan Kara <jack@suse.cz> wrote:

> -#define RADIX_TREE_MAX_TAGS 2
> +#define RADIX_TREE_MAX_TAGS 3

Adds another eight bytes to the radix_tree_node, I think. What effect
does this have upon the radix_tree_node_cachep packing for sl[aeiou]b?
Please add to changelog if you can work it out ;).
* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-09 23:45 ` Andrew Morton
@ 2010-06-10 12:42   ` Jan Kara
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2010-06-10 12:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jan Kara, linux-fsdevel, npiggin, david, linux-mm

On Wed 09-06-10 16:45:33, Andrew Morton wrote:
> On Fri, 4 Jun 2010 20:40:54 +0200
> Jan Kara <jack@suse.cz> wrote:
>
> > -#define RADIX_TREE_MAX_TAGS 2
> > +#define RADIX_TREE_MAX_TAGS 3
>
> Adds another eight bytes to the radix_tree_node, I think. What effect
> does this have upon the radix_tree_node_cachep packing for sl[aeiou]b?
> Please add to changelog if you can work it out ;).

The sizes of the structure are:
  32-bit: 288 vs 296
  64-bit: 552 vs 560

I have now checked (running different kernels because I wasn't sure the
computations I do are right) and that gives 7 objects per page with SLAB
and SLUB on a 64-bit kernel. I'll try to get also SLOB numbers for 64-bit
and possibly numbers for 32-bit archs (although it gets a bit tiring to
try all the kernels ;).

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback
@ 2010-06-04 18:47 Jan Kara
  2010-06-05  1:14 ` Nick Piggin
  0 siblings, 1 reply; 13+ messages in thread

From: Jan Kara @ 2010-06-04 18:47 UTC (permalink / raw)
To: linux-fsdevel; +Cc: Andrew Morton, npiggin, david, linux-mm

  Hi,

  I've revived my patches to implement livelock avoidance for data
integrity writes. Because of concerns that tagging pages before writeout
may be too costly for WB_SYNC_NONE mode (where we stop after nr_to_write
pages), I've changed the patch to use page tagging only in WB_SYNC_ALL
mode, where we are sure that we write out all the tagged pages. Later, we
can think about using tagging for livelock avoidance in WB_SYNC_NONE mode
as well...

  As always, comments are welcome.

								Honza

PS: I'm sorry for sending this twice. I've screwed up the list address in
the first posting.
* Re: [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback
  2010-06-04 18:47 [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback Jan Kara
@ 2010-06-05  1:14 ` Nick Piggin
  2010-06-06  4:08   ` Wu Fengguang
  0 siblings, 1 reply; 13+ messages in thread

From: Nick Piggin @ 2010-06-05 1:14 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, Andrew Morton, david, linux-mm

On Fri, Jun 04, 2010 at 08:47:09PM +0200, Jan Kara wrote:
>
>   Hi,
>
>   I've revived my patches to implement livelock avoidance for data integrity
> writes. Due to some concerns whether tagging of pages before writeout cannot
> be too costly to use for WB_SYNC_NONE mode (where we stop after nr_to_write
> pages) I've changed the patch to use page tagging only in WB_SYNC_ALL mode
> where we are sure that we write out all the tagged pages. Later, we can think
> about using tagging for livelock avoidance for WB_SYNC_NONE mode as well...

Hmm what concerns? Do you have any numbers?
* Re: [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback
  2010-06-05  1:14 ` Nick Piggin
@ 2010-06-06  4:08   ` Wu Fengguang
  2010-06-06  7:52     ` Nick Piggin
  0 siblings, 1 reply; 13+ messages in thread

From: Wu Fengguang @ 2010-06-06 4:08 UTC (permalink / raw)
To: Nick Piggin; +Cc: Jan Kara, linux-fsdevel, Andrew Morton, david, linux-mm

On Sat, Jun 05, 2010 at 11:14:47AM +1000, Nick Piggin wrote:
> On Fri, Jun 04, 2010 at 08:47:09PM +0200, Jan Kara wrote:
> >
> >   Hi,
> >
> >   I've revived my patches to implement livelock avoidance for data integrity
> > writes. Due to some concerns whether tagging of pages before writeout cannot
> > be too costly to use for WB_SYNC_NONE mode (where we stop after nr_to_write
> > pages) I've changed the patch to use page tagging only in WB_SYNC_ALL mode
> > where we are sure that we write out all the tagged pages. Later, we can think
> > about using tagging for livelock avoidance for WB_SYNC_NONE mode as well...
>
> Hmm what concerns? Do you have any numbers?

sync() is performed in two stages: the WB_SYNC_NONE run and the
WB_SYNC_ALL run. The WB_SYNC_NONE stage can still be livelocked.

We may switch to wbc.for_sync (instead of testing WB_SYNC_ALL only),
provided by the following patch. Jan, would you add it to your series
(with necessary improvements)?

Thanks,
Fengguang
---
introduce wbc.for_sync to cover the two sync() stages

sync() is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. It is necessary to tag both stages with wbc.for_sync,
so as to prevent either of them being livelocked. The next patch will
utilize this flag to do the livelock prevention.

btw, the nr_pages param in bdi_start_writeback() is dropped:

1) Better livelock prevention will be used. Limiting nr_to_write was
   inferior in that it may miss some pages that should be synced. And it
   does not really avoid livelock for a single huge file:
   write_cache_pages() ignores nr_to_write for WB_SYNC_NONE anyway.

2) It is very unlikely to impact the existing not-for-sync users. For
   example, ubifs calls writeback_inodes_sb() to write back some pages,
   and ext4 intends to write roughly enough pages with
   writeback_inodes_sb_if_idle(). However, neither seems to really care
   about writing more pages than that.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |   28 ++++++++--------------------
 include/linux/backing-dev.h |    2 +-
 include/linux/writeback.h   |    8 ++++++++
 mm/page-writeback.c         |    2 +-
 4 files changed, 18 insertions(+), 22 deletions(-)

--- linux.orig/fs/fs-writeback.c	2010-06-06 11:16:42.000000000 +0800
+++ linux/fs/fs-writeback.c	2010-06-06 11:21:13.000000000 +0800
@@ -42,6 +42,7 @@ struct wb_writeback_args {
 	long nr_pages;
 	struct super_block *sb;
 	enum writeback_sync_modes sync_mode;
+	int for_sync:1;
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
@@ -228,6 +229,7 @@ static void bdi_sync_writeback(struct ba
 	struct wb_writeback_args args = {
 		.sb		= sb,
 		.sync_mode	= WB_SYNC_ALL,
+		.for_sync	= 1,
 		.nr_pages	= LONG_MAX,
 		.range_cyclic	= 0,
 	};
@@ -244,7 +246,6 @@ static void bdi_sync_writeback(struct ba
  * bdi_start_writeback - start writeback
  * @bdi: the backing device to write from
  * @sb: write inodes from this super_block
- * @nr_pages: the number of pages to write
  *
  * Description:
  *   This does WB_SYNC_NONE opportunistic writeback. The IO is only
@@ -253,24 +254,17 @@ static void bdi_sync_writeback(struct ba
  *
  */
 void bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
-			 long nr_pages)
+			 long mission)
 {
 	struct wb_writeback_args args = {
 		.sb		= sb,
 		.sync_mode	= WB_SYNC_NONE,
-		.nr_pages	= nr_pages,
+		.nr_pages	= LONG_MAX,
+		.for_background	= mission == WB_FOR_BACKGROUND,
+		.for_sync	= mission == WB_FOR_SYNC,
 		.range_cyclic	= 1,
 	};

-	/*
-	 * We treat @nr_pages=0 as the special case to do background writeback,
-	 * ie. to sync pages until the background dirty threshold is reached.
-	 */
-	if (!nr_pages) {
-		args.nr_pages = LONG_MAX;
-		args.for_background = 1;
-	}
-
 	bdi_alloc_queue_work(bdi, &args);
 }

@@ -757,6 +751,7 @@ static long wb_writeback(struct bdi_writ
 		.older_than_this	= NULL,
 		.for_kupdate		= args->for_kupdate,
 		.for_background		= args->for_background,
+		.for_sync		= args->for_sync,
 		.range_cyclic		= args->range_cyclic,
 	};
 	unsigned long oldest_jif;
@@ -1216,14 +1211,7 @@ static void wait_sb_inodes(struct super_
  */
 void writeback_inodes_sb(struct super_block *sb)
 {
-	unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
-	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
-	long nr_to_write;
-
-	nr_to_write = nr_dirty + nr_unstable +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
-
-	bdi_start_writeback(sb->s_bdi, sb, nr_to_write);
+	bdi_start_writeback(sb->s_bdi, sb, WB_FOR_SYNC);
 }
 EXPORT_SYMBOL(writeback_inodes_sb);

--- linux.orig/include/linux/backing-dev.h	2010-06-06 11:16:42.000000000 +0800
+++ linux/include/linux/backing-dev.h	2010-06-06 11:21:13.000000000 +0800
@@ -106,7 +106,7 @@ int bdi_register_dev(struct backing_dev_
 void bdi_unregister(struct backing_dev_info *bdi);
 int bdi_setup_and_register(struct backing_dev_info *, char *, unsigned int);
 void bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
-			 long nr_pages);
+			 long mission);
 int bdi_writeback_task(struct bdi_writeback *wb);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_arm_supers_timer(void);
--- linux.orig/include/linux/writeback.h	2010-06-06 11:16:42.000000000 +0800
+++ linux/include/linux/writeback.h	2010-06-06 11:58:58.000000000 +0800
@@ -21,6 +21,13 @@ enum writeback_sync_modes {
 	WB_SYNC_ALL,	/* Wait on every mapping */
 };

+enum writeback_mission {
+	WB_FOR_BACKGROUND,	/* stop on hitting background threshold */
+	WB_FOR_SYNC,		/* write all now-dirty inodes/pages,
+				 * but take care not to live lock
+				 */
+};
+
 /*
  * A control structure which tells the writeback code what to do.  These are
  * always on the stack, and hence need no locking.  They are always initialised
@@ -53,6 +60,7 @@ struct writeback_control {
 	unsigned encountered_congestion:1; /* An output: a queue is full */
 	unsigned for_kupdate:1;		/* A kupdate writeback */
 	unsigned for_background:1;	/* A background writeback */
+	unsigned for_sync:1;		/* A writeback for sync */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned more_io:1;		/* more io to be dispatched */
--- linux.orig/mm/page-writeback.c	2010-06-06 11:16:42.000000000 +0800
+++ linux/mm/page-writeback.c	2010-06-06 11:21:13.000000000 +0800
@@ -597,7 +597,7 @@ static void balance_dirty_pages(struct a
 	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
 			       + global_page_state(NR_UNSTABLE_NFS))
 					  > background_thresh)))
-		bdi_start_writeback(bdi, NULL, 0);
+		bdi_start_writeback(bdi, NULL, WB_FOR_BACKGROUND);
 }

 void set_page_dirty_balance(struct page *page, int page_mkwrite)
* Re: [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback
  2010-06-06  4:08 ` Wu Fengguang
@ 2010-06-06  7:52   ` Nick Piggin
  0 siblings, 0 replies; 13+ messages in thread

From: Nick Piggin @ 2010-06-06 7:52 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Jan Kara, linux-fsdevel, Andrew Morton, david, linux-mm

On Sun, Jun 06, 2010 at 12:08:19PM +0800, Wu Fengguang wrote:
> On Sat, Jun 05, 2010 at 11:14:47AM +1000, Nick Piggin wrote:
> > On Fri, Jun 04, 2010 at 08:47:09PM +0200, Jan Kara wrote:
> > >
> > >   Hi,
> > >
> > >   I've revived my patches to implement livelock avoidance for data integrity
> > > writes. Due to some concerns whether tagging of pages before writeout cannot
> > > be too costly to use for WB_SYNC_NONE mode (where we stop after nr_to_write
> > > pages) I've changed the patch to use page tagging only in WB_SYNC_ALL mode
> > > where we are sure that we write out all the tagged pages. Later, we can think
> > > about using tagging for livelock avoidance for WB_SYNC_NONE mode as well...
> >
> > Hmm what concerns? Do you have any numbers?
>
> sync() is performed in two stages: the WB_SYNC_NONE run and the
> WB_SYNC_ALL run. The WB_SYNC_NONE stage can still be livelocked.

By concerns, I mean Jan's _performance_ concerns. I would prefer to
minimise them, and then try to get an idea of the performance impact
of doing tagging unconditionally.
end of thread, other threads:[~2010-06-10 12:43 UTC | newest]

Thread overview: 13+ messages -- links below jump to the message on this page:
2010-06-04 18:40 [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback Jan Kara
2010-06-04 18:40 ` [PATCH 1/2] radix-tree: Implement function radix_tree_gang_tag_if_tagged Jan Kara
2010-06-09 23:30   ` Andrew Morton
2010-06-10 12:28     ` Jan Kara
2010-06-04 18:40 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
2010-06-09 23:41   ` Andrew Morton
2010-06-10 12:31     ` Jan Kara
2010-06-09 23:45   ` Andrew Morton
2010-06-10 12:42     ` Jan Kara
-- strict thread matches above, loose matches on Subject: below --
2010-06-04 18:47 [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback Jan Kara
2010-06-05  1:14 ` Nick Piggin
2010-06-06  4:08   ` Wu Fengguang
2010-06-06  7:52     ` Nick Piggin