* [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write @ 2014-02-16 2:54 Roman Pen 2014-02-18 23:59 ` Andrew Morton 2014-03-13 20:21 ` Jan Kara 0 siblings, 2 replies; 20+ messages in thread From: Roman Pen @ 2014-02-16 2:54 UTC (permalink / raw) Cc: Roman Pen, Alexander Viro, linux-fsdevel, linux-kernel In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity write, thus mark request as WRITE_SYNC. Signed-off-by: Roman Pen <r.peniaev@gmail.com> CC: Alexander Viro <viro@zeniv.linux.org.uk> CC: linux-fsdevel@vger.kernel.org CC: linux-kernel@vger.kernel.org --- fs/mpage.c | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/fs/mpage.c b/fs/mpage.c index 4979ffa..4e0af5a 100644 --- a/fs/mpage.c +++ b/fs/mpage.c @@ -462,6 +462,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc, struct buffer_head map_bh; loff_t i_size = i_size_read(inode); int ret = 0; + int wr = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); if (page_has_buffers(page)) { struct buffer_head *head = page_buffers(page); @@ -570,7 +571,7 @@ page_is_mapped: * This page will go to BIO. Do we need to send this BIO off first? 
*/ if (bio && mpd->last_block_in_bio != blocks[0] - 1) - bio = mpage_bio_submit(WRITE, bio); + bio = mpage_bio_submit(wr, bio); alloc_new: if (bio == NULL) { @@ -587,7 +588,7 @@ alloc_new: */ length = first_unmapped << blkbits; if (bio_add_page(bio, page, length, 0) < length) { - bio = mpage_bio_submit(WRITE, bio); + bio = mpage_bio_submit(wr, bio); goto alloc_new; } @@ -620,7 +621,7 @@ alloc_new: set_page_writeback(page); unlock_page(page); if (boundary || (first_unmapped != blocks_per_page)) { - bio = mpage_bio_submit(WRITE, bio); + bio = mpage_bio_submit(wr, bio); if (boundary_block) { write_boundary_block(boundary_bdev, boundary_block, 1 << blkbits); @@ -632,7 +633,7 @@ alloc_new: confused: if (bio) - bio = mpage_bio_submit(WRITE, bio); + bio = mpage_bio_submit(wr, bio); if (mpd->use_writepage) { ret = mapping->a_ops->writepage(page, wbc); @@ -688,8 +689,11 @@ mpage_writepages(struct address_space *mapping, }; ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd); - if (mpd.bio) - mpage_bio_submit(WRITE, mpd.bio); + if (mpd.bio) { + int wr = (wbc->sync_mode == WB_SYNC_ALL ? + WRITE_SYNC : WRITE); + mpage_bio_submit(wr, mpd.bio); + } } blk_finish_plug(&plug); return ret; @@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, get_block_t get_block, .use_writepage = 0, }; int ret = __mpage_writepage(page, wbc, &mpd); - if (mpd.bio) - mpage_bio_submit(WRITE, mpd.bio); + if (mpd.bio) { + int wr = (wbc->sync_mode == WB_SYNC_ALL ? + WRITE_SYNC : WRITE); + mpage_bio_submit(wr, mpd.bio); + } return ret; } EXPORT_SYMBOL(mpage_writepage); -- 1.8.5.2 ^ permalink raw reply related [flat|nested] 20+ messages in thread
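For reference, the whole patch boils down to one flag selection. The standalone sketch below mirrors the ~3.14-era flag composites (REQ_WRITE, REQ_SYNC, REQ_NOIDLE and WRITE_SYNC are real kernel names, but the bit positions here are illustrative, and mpage_write_flags() is a made-up name for the patch's ternary — this is not kernel code):

```c
#include <assert.h>

/* Simplified mirrors of the 3.14-era request flag composites.
 * Bit positions are illustrative, not the real enum rq_flag_bits values. */
#define REQ_WRITE   (1U << 0)
#define REQ_SYNC    (1U << 4)
#define REQ_NOIDLE  (1U << 10)

#define WRITE       REQ_WRITE
#define WRITE_SYNC  (WRITE | REQ_SYNC | REQ_NOIDLE)

enum writeback_sync_modes { WB_SYNC_NONE, WB_SYNC_ALL };

/* The gist of the patch: pick WRITE_SYNC only for data-integrity
 * (WB_SYNC_ALL) writeback, plain WRITE for background writeback. */
static unsigned int mpage_write_flags(enum writeback_sync_modes sync_mode)
{
    return sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE;
}
```

The patch simply threads this selection through every mpage_bio_submit() call site instead of the hard-coded WRITE.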
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-02-16 2:54 [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write Roman Pen @ 2014-02-18 23:59 ` Andrew Morton 2014-02-19 1:38 ` Roman Peniaev 2014-03-13 20:21 ` Jan Kara 1 sibling, 1 reply; 20+ messages in thread From: Andrew Morton @ 2014-02-18 23:59 UTC (permalink / raw) To: Roman Pen; +Cc: Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe On Sun, 16 Feb 2014 11:54:28 +0900 Roman Pen <r.peniaev@gmail.com> wrote: > In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity write, > thus mark request as WRITE_SYNC. gargh, the documentation for this stuff is useless. What do REQ_SYNC and REQ_NOIDLE actually *do*? For mpage writes, REQ_NOIDLE appears to be incorrect - we very much expect that there will be more writes and that they will be contiguous with this one. But we won't be waiting on this write before submitting more writes, so perhaps REQ_NOIDLE is at least harmless. I dunno about REQ_SYNC - it requires delving into the bowels of CFQ and we shouldn't need to do that. Jens. Help. How is a poor kernel reader supposed to work this out? > --- a/fs/mpage.c > +++ b/fs/mpage.c > @@ -462,6 +462,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc, > struct buffer_head map_bh; > loff_t i_size = i_size_read(inode); > int ret = 0; > + int wr = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); > > if (page_has_buffers(page)) { > struct buffer_head *head = page_buffers(page); > @@ -570,7 +571,7 @@ page_is_mapped: > * This page will go to BIO. Do we need to send this BIO off first? 
> */ > if (bio && mpd->last_block_in_bio != blocks[0] - 1) > - bio = mpage_bio_submit(WRITE, bio); > + bio = mpage_bio_submit(wr, bio); > > alloc_new: > if (bio == NULL) { > @@ -587,7 +588,7 @@ alloc_new: > */ > length = first_unmapped << blkbits; > if (bio_add_page(bio, page, length, 0) < length) { > - bio = mpage_bio_submit(WRITE, bio); > + bio = mpage_bio_submit(wr, bio); > goto alloc_new; > } > > @@ -620,7 +621,7 @@ alloc_new: > set_page_writeback(page); > unlock_page(page); > if (boundary || (first_unmapped != blocks_per_page)) { > - bio = mpage_bio_submit(WRITE, bio); > + bio = mpage_bio_submit(wr, bio); > if (boundary_block) { > write_boundary_block(boundary_bdev, > boundary_block, 1 << blkbits); > @@ -632,7 +633,7 @@ alloc_new: > > confused: > if (bio) > - bio = mpage_bio_submit(WRITE, bio); > + bio = mpage_bio_submit(wr, bio); > > if (mpd->use_writepage) { > ret = mapping->a_ops->writepage(page, wbc); > @@ -688,8 +689,11 @@ mpage_writepages(struct address_space *mapping, > }; > > ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd); > - if (mpd.bio) > - mpage_bio_submit(WRITE, mpd.bio); > + if (mpd.bio) { > + int wr = (wbc->sync_mode == WB_SYNC_ALL ? > + WRITE_SYNC : WRITE); > + mpage_bio_submit(wr, mpd.bio); > + } > } > blk_finish_plug(&plug); > return ret; > @@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, get_block_t get_block, > .use_writepage = 0, > }; > int ret = __mpage_writepage(page, wbc, &mpd); > - if (mpd.bio) > - mpage_bio_submit(WRITE, mpd.bio); > + if (mpd.bio) { > + int wr = (wbc->sync_mode == WB_SYNC_ALL ? > + WRITE_SYNC : WRITE); > + mpage_bio_submit(wr, mpd.bio); > + } > return ret; > } > EXPORT_SYMBOL(mpage_writepage); ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-02-18 23:59 ` Andrew Morton @ 2014-02-19 1:38 ` Roman Peniaev 2014-03-12 14:29 ` Roman Peniaev 0 siblings, 1 reply; 20+ messages in thread From: Roman Peniaev @ 2014-02-19 1:38 UTC (permalink / raw) To: Andrew Morton; +Cc: Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe (my previous email was rejected by vger.kernel.org because google web sent it as html. will resend the same one in plain text mode) > What do REQ_SYNC and REQ_NOIDLE actually *do*? Yep, this REQ_SYNC is very confusing to me. First of all according to the sources of old school block buffer filesystems (e.g. ext2) we can get this stack in case of fsync call: __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode) do_writepages(mapping, &wbc) mapping->a_ops->writepages(page, wbc) (ext2_writepages) mpage_writepages(mapping, wbc, fat_get_block); write_cache_pages(mapping, wbc, __mpage_writepage, &mpd) __mpage_writepage(page, wbc, data) >>>>> mpage_bio_submit(WRITE, bio) >>>> why WRITE? not WRITE_SYNC in case of WB_SYNC_ALL? <or in case of not contiguous buffers> mapping->a_ops->writepage(page, wbc) (ext2_writepage) block_write_full_page(page, fat_get_block, wbc) block_write_full_page_endio(page, get_block, wbc, end_buffer_async_write) __block_write_full_page(inode, page, get_block, wbc, handler); submit_bh(WRITE_SYNC) So, it turns out to be that some bios for the same dirty range can be submitted with REQ_WRITE|REQ_SYNC|REQ_NOIDLE and some of the bios only with REQ_WRITE. (according to the comment of __mpage_writepage: * If all blocks are found to be contiguous then the page can go into the * BIO. Otherwise fall back to the mapping's writepage(). ) Also, it seems to me that all over the kernel WRITE_SYNC has meaning of: 1. try to get the block on-disk faster 2. if I have to do flush - mark my bio with WRITE_SYNC and wait for result My patch is an attempt to make some unification in case of fsync call. 
-- Roman On Wed, Feb 19, 2014 at 8:59 AM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Sun, 16 Feb 2014 11:54:28 +0900 Roman Pen <r.peniaev@gmail.com> wrote: > >> In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity write, >> thus mark request as WRITE_SYNC. > > gargh, the documentation for this stuff is useless. > > What do REQ_SYNC and REQ_NOIDLE actually *do*? > > For mpage writes, REQ_NOIDLE appears to be incorrect - we very much > expect that there will be more writes and that they will be contiguous > with this one. But we won't be waiting on this write before submitting > more writes, so perhaps REQ_NOIDLE is at least harmless. > > I dunno about REQ_SYNC - it requires delving into the bowels of CFQ > and we shouldn't need to do that. > > Jens. Help. How is a poor kernel reader supposed to work this out? > >> --- a/fs/mpage.c >> +++ b/fs/mpage.c >> @@ -462,6 +462,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc, >> struct buffer_head map_bh; >> loff_t i_size = i_size_read(inode); >> int ret = 0; >> + int wr = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); >> >> if (page_has_buffers(page)) { >> struct buffer_head *head = page_buffers(page); >> @@ -570,7 +571,7 @@ page_is_mapped: >> * This page will go to BIO. Do we need to send this BIO off first? 
>> */ >> if (bio && mpd->last_block_in_bio != blocks[0] - 1) >> - bio = mpage_bio_submit(WRITE, bio); >> + bio = mpage_bio_submit(wr, bio); >> >> alloc_new: >> if (bio == NULL) { >> @@ -587,7 +588,7 @@ alloc_new: >> */ >> length = first_unmapped << blkbits; >> if (bio_add_page(bio, page, length, 0) < length) { >> - bio = mpage_bio_submit(WRITE, bio); >> + bio = mpage_bio_submit(wr, bio); >> goto alloc_new; >> } >> >> @@ -620,7 +621,7 @@ alloc_new: >> set_page_writeback(page); >> unlock_page(page); >> if (boundary || (first_unmapped != blocks_per_page)) { >> - bio = mpage_bio_submit(WRITE, bio); >> + bio = mpage_bio_submit(wr, bio); >> if (boundary_block) { >> write_boundary_block(boundary_bdev, >> boundary_block, 1 << blkbits); >> @@ -632,7 +633,7 @@ alloc_new: >> >> confused: >> if (bio) >> - bio = mpage_bio_submit(WRITE, bio); >> + bio = mpage_bio_submit(wr, bio); >> >> if (mpd->use_writepage) { >> ret = mapping->a_ops->writepage(page, wbc); >> @@ -688,8 +689,11 @@ mpage_writepages(struct address_space *mapping, >> }; >> >> ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd); >> - if (mpd.bio) >> - mpage_bio_submit(WRITE, mpd.bio); >> + if (mpd.bio) { >> + int wr = (wbc->sync_mode == WB_SYNC_ALL ? >> + WRITE_SYNC : WRITE); >> + mpage_bio_submit(wr, mpd.bio); >> + } >> } >> blk_finish_plug(&plug); >> return ret; >> @@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, get_block_t get_block, >> .use_writepage = 0, >> }; >> int ret = __mpage_writepage(page, wbc, &mpd); >> - if (mpd.bio) >> - mpage_bio_submit(WRITE, mpd.bio); >> + if (mpd.bio) { >> + int wr = (wbc->sync_mode == WB_SYNC_ALL ? >> + WRITE_SYNC : WRITE); >> + mpage_bio_submit(wr, mpd.bio); >> + } >> return ret; >> } >> EXPORT_SYMBOL(mpage_writepage); > ^ permalink raw reply [flat|nested] 20+ messages in thread
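Roman's point above is that, before the patch, a single fsync() could submit part of a file's dirty range through __mpage_writepage() with plain WRITE while the fall-back ->writepage() path tagged the same range's buffers WRITE_SYNC. A standalone sketch of that pre-patch asymmetry (flag bit positions are illustrative, and the two helpers are hypothetical stand-ins for the real submission sites):

```c
#include <assert.h>

#define REQ_WRITE   (1U << 0)
#define REQ_SYNC    (1U << 4)
#define REQ_NOIDLE  (1U << 10)

#define WRITE       REQ_WRITE
#define WRITE_SYNC  (WRITE | REQ_SYNC | REQ_NOIDLE)

/* Pre-patch: the contiguous-buffers path in __mpage_writepage()
 * submitted with mpage_bio_submit(WRITE, bio)... */
static unsigned int mpage_path_flags_prepatch(void)
{
    return WRITE;
}

/* ...while the fall-back path through ->writepage() ends in
 * __block_write_full_page() calling submit_bh(WRITE_SYNC, bh). */
static unsigned int writepage_fallback_flags(void)
{
    return WRITE_SYNC;
}
```

So whether a given bio of an fsync'd range was prioritized depended only on whether its blocks happened to be contiguous — the inconsistency the patch removes.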
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-02-19 1:38 ` Roman Peniaev @ 2014-03-12 14:29 ` Roman Peniaev 2014-03-13 20:01 ` Jan Kara 0 siblings, 1 reply; 20+ messages in thread From: Roman Peniaev @ 2014-03-12 14:29 UTC (permalink / raw) To: Andrew Morton; +Cc: Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe Jens, could you please explain the real purpose of WRITE_SYNC in the case of wbc->sync_mode == WB_SYNC_ALL? My current understanding is that if writeback control has WB_SYNC_ALL, everything should be submitted with WRITE_SYNC. -- Roman On Wed, Feb 19, 2014 at 10:38 AM, Roman Peniaev <r.peniaev@gmail.com> wrote: > (my previous email was rejected by vger.kernel.org because google web > sent it as html. > will resend the same one in plain text mode) > >> What do REQ_SYNC and REQ_NOIDLE actually *do*? > > Yep, this REQ_SYNC is very confusing to me. > First of all according to the sources of old school block buffer filesystems > (e.g. ext2) we can get this stack in case of fsync call: > > __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode) > do_writepages(mapping, &wbc) > mapping->a_ops->writepages(page, wbc) > (ext2_writepages) > mpage_writepages(mapping, wbc, fat_get_block); > write_cache_pages(mapping, wbc, __mpage_writepage, &mpd) > __mpage_writepage(page, wbc, data) >>>>>> mpage_bio_submit(WRITE, bio) >>>> why WRITE? not WRITE_SYNC in case of WB_SYNC_ALL? > <or in case of not contiguous buffers> > mapping->a_ops->writepage(page, wbc) > (ext2_writepage) > block_write_full_page(page, fat_get_block, wbc) > block_write_full_page_endio(page, get_block, wbc, > end_buffer_async_write) > __block_write_full_page(inode, page, get_block, wbc, > handler); > submit_bh(WRITE_SYNC) > > So, it turns out to be that some bios for the same dirty range > can be submitted with REQ_WRITE|REQ_SYNC|REQ_NOIDLE and some of > the bios only with REQ_WRITE.
> (according to the comment of __mpage_writepage: > * If all blocks are found to be contiguous then the page can go into the > * BIO. Otherwise fall back to the mapping's writepage(). > ) > > Also, it seems to me that all over the kernel WRITE_SYNC has meaning of: > 1. try to get the block on-disk faster > 2. if I have to do flush - mark my bio with WRITE_SYNC and wait for result > > My patch is an attempt to make some unification in case of fsync call. > > -- > Roman > > > On Wed, Feb 19, 2014 at 8:59 AM, Andrew Morton > <akpm@linux-foundation.org> wrote: >> On Sun, 16 Feb 2014 11:54:28 +0900 Roman Pen <r.peniaev@gmail.com> wrote: >> >>> In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity write, >>> thus mark request as WRITE_SYNC. >> >> gargh, the documentation for this stuff is useless. >> >> What do REQ_SYNC and REQ_NOIDLE actually *do*? >> >> For mpage writes, REQ_NOIDLE appears to be incorrect - we very much >> expect that there will be more writes and that they will be contiguous >> with this one. But we won't be waiting on this write before submitting >> more writes, so perhaps REQ_NOIDLE is at least harmless. >> >> I dunno about REQ_SYNC - it requires delving into the bowels of CFQ >> and we shouldn't need to do that. >> >> Jens. Help. How is a poor kernel reader supposed to work this out? >> >>> --- a/fs/mpage.c >>> +++ b/fs/mpage.c >>> @@ -462,6 +462,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc, >>> struct buffer_head map_bh; >>> loff_t i_size = i_size_read(inode); >>> int ret = 0; >>> + int wr = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); >>> >>> if (page_has_buffers(page)) { >>> struct buffer_head *head = page_buffers(page); >>> @@ -570,7 +571,7 @@ page_is_mapped: >>> * This page will go to BIO. Do we need to send this BIO off first? 
>>> */ >>> if (bio && mpd->last_block_in_bio != blocks[0] - 1) >>> - bio = mpage_bio_submit(WRITE, bio); >>> + bio = mpage_bio_submit(wr, bio); >>> >>> alloc_new: >>> if (bio == NULL) { >>> @@ -587,7 +588,7 @@ alloc_new: >>> */ >>> length = first_unmapped << blkbits; >>> if (bio_add_page(bio, page, length, 0) < length) { >>> - bio = mpage_bio_submit(WRITE, bio); >>> + bio = mpage_bio_submit(wr, bio); >>> goto alloc_new; >>> } >>> >>> @@ -620,7 +621,7 @@ alloc_new: >>> set_page_writeback(page); >>> unlock_page(page); >>> if (boundary || (first_unmapped != blocks_per_page)) { >>> - bio = mpage_bio_submit(WRITE, bio); >>> + bio = mpage_bio_submit(wr, bio); >>> if (boundary_block) { >>> write_boundary_block(boundary_bdev, >>> boundary_block, 1 << blkbits); >>> @@ -632,7 +633,7 @@ alloc_new: >>> >>> confused: >>> if (bio) >>> - bio = mpage_bio_submit(WRITE, bio); >>> + bio = mpage_bio_submit(wr, bio); >>> >>> if (mpd->use_writepage) { >>> ret = mapping->a_ops->writepage(page, wbc); >>> @@ -688,8 +689,11 @@ mpage_writepages(struct address_space *mapping, >>> }; >>> >>> ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd); >>> - if (mpd.bio) >>> - mpage_bio_submit(WRITE, mpd.bio); >>> + if (mpd.bio) { >>> + int wr = (wbc->sync_mode == WB_SYNC_ALL ? >>> + WRITE_SYNC : WRITE); >>> + mpage_bio_submit(wr, mpd.bio); >>> + } >>> } >>> blk_finish_plug(&plug); >>> return ret; >>> @@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, get_block_t get_block, >>> .use_writepage = 0, >>> }; >>> int ret = __mpage_writepage(page, wbc, &mpd); >>> - if (mpd.bio) >>> - mpage_bio_submit(WRITE, mpd.bio); >>> + if (mpd.bio) { >>> + int wr = (wbc->sync_mode == WB_SYNC_ALL ? >>> + WRITE_SYNC : WRITE); >>> + mpage_bio_submit(wr, mpd.bio); >>> + } >>> return ret; >>> } >>> EXPORT_SYMBOL(mpage_writepage); >> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-12 14:29 ` Roman Peniaev @ 2014-03-13 20:01 ` Jan Kara 2014-03-13 21:34 ` Andrew Morton 0 siblings, 1 reply; 20+ messages in thread From: Jan Kara @ 2014-03-13 20:01 UTC (permalink / raw) To: Roman Peniaev Cc: Andrew Morton, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe Hello, On Wed 12-03-14 23:29:04, Roman Peniaev wrote: > could you please explain the real purpose of WRITE_SYNC? > In case of wbc->sync_mode == WB_SYNC_ALL. > Because my current understanding is if writeback control has > WB_SYNC_ALL everything > should be submitted with WRITE_SYNC. So AFAIK the idea is that the REQ_SYNC flag should indicate the IO is synchronous - i.e., someone is waiting for it to complete. This is opposed to asynchronous writeback done by flusher threads where no one waits for a particular write to complete. Subsequently, the IO scheduler is expected (but not required - only CFQ honors REQ_SYNC AFAIK) to treat sync requests with higher priority than async ones. When to set REQ_SYNC is not an obvious question. If we set it for too much IO, it has no effect. If we don't set it for some IO, we risk that someone waiting for that IO to complete will be starved by others setting REQ_SYNC. So all in all I think that using WRITE_SYNC iff we are doing WB_SYNC_ALL writeback is a reasonable choice. Honza > On Wed, Feb 19, 2014 at 10:38 AM, Roman Peniaev <r.peniaev@gmail.com> wrote: > > (my previous email was rejected by vger.kernel.org because google web > > sent it as html. > > will resend the same one in plain text mode) > > > >> What do REQ_SYNC and REQ_NOIDLE actually *do*? > > > > Yep, this REQ_SYNC is very confusing to me. > > First of all according to the sources of old school block buffer filesystems > > (e.g.
ext2) we can get this stack in case of fsync call: > > > > __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode) > > do_writepages(mapping, &wbc) > > mapping->a_ops->writepages(page, wbc) > > (ext2_writepages) > > mpage_writepages(mapping, wbc, fat_get_block); > > write_cache_pages(mapping, wbc, __mpage_writepage, &mpd) > > __mpage_writepage(page, wbc, data) > >>>>>> mpage_bio_submit(WRITE, bio) >>>> why WRITE? not WRITE_SYNC in case of WB_SYNC_ALL? > > <or in case of not contiguous buffers> > > mapping->a_ops->writepage(page, wbc) > > (ext2_writepage) > > block_write_full_page(page, fat_get_block, wbc) > > block_write_full_page_endio(page, get_block, wbc, > > end_buffer_async_write) > > __block_write_full_page(inode, page, get_block, wbc, > > handler); > > submit_bh(WRITE_SYNC) > > > > So, it turns out to be that some bios for the same dirty range > > can be submitted with REQ_WRITE|REQ_SYNC|REQ_NOIDLE and some of > > the bios only with REQ_WRITE. > > (according to the comment of __mpage_writepage: > > * If all blocks are found to be contiguous then the page can go into the > > * BIO. Otherwise fall back to the mapping's writepage(). > > ) > > > > Also, it seems to me that all over the kernel WRITE_SYNC has meaning of: > > 1. try to get the block on-disk faster > > 2. if I have to do flush - mark my bio with WRITE_SYNC and wait for result > > > > My patch is an attempt to make some unification in case of fsync call. > > > > -- > > Roman > > > > > > On Wed, Feb 19, 2014 at 8:59 AM, Andrew Morton > > <akpm@linux-foundation.org> wrote: > >> On Sun, 16 Feb 2014 11:54:28 +0900 Roman Pen <r.peniaev@gmail.com> wrote: > >> > >>> In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity write, > >>> thus mark request as WRITE_SYNC. > >> > >> gargh, the documentation for this stuff is useless. > >> > >> What do REQ_SYNC and REQ_NOIDLE actually *do*? 
> >> > >> For mpage writes, REQ_NOIDLE appears to be incorrect - we very much > >> expect that there will be more writes and that they will be contiguous > >> with this one. But we won't be waiting on this write before submitting > >> more writes, so perhaps REQ_NOIDLE is at least harmless. > >> > >> I dunno about REQ_SYNC - it requires delving into the bowels of CFQ > >> and we shouldn't need to do that. > >> > >> Jens. Help. How is a poor kernel reader supposed to work this out? > >> > >>> --- a/fs/mpage.c > >>> +++ b/fs/mpage.c > >>> @@ -462,6 +462,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc, > >>> struct buffer_head map_bh; > >>> loff_t i_size = i_size_read(inode); > >>> int ret = 0; > >>> + int wr = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); > >>> > >>> if (page_has_buffers(page)) { > >>> struct buffer_head *head = page_buffers(page); > >>> @@ -570,7 +571,7 @@ page_is_mapped: > >>> * This page will go to BIO. Do we need to send this BIO off first? 
> >>> */ > >>> if (bio && mpd->last_block_in_bio != blocks[0] - 1) > >>> - bio = mpage_bio_submit(WRITE, bio); > >>> + bio = mpage_bio_submit(wr, bio); > >>> > >>> alloc_new: > >>> if (bio == NULL) { > >>> @@ -587,7 +588,7 @@ alloc_new: > >>> */ > >>> length = first_unmapped << blkbits; > >>> if (bio_add_page(bio, page, length, 0) < length) { > >>> - bio = mpage_bio_submit(WRITE, bio); > >>> + bio = mpage_bio_submit(wr, bio); > >>> goto alloc_new; > >>> } > >>> > >>> @@ -620,7 +621,7 @@ alloc_new: > >>> set_page_writeback(page); > >>> unlock_page(page); > >>> if (boundary || (first_unmapped != blocks_per_page)) { > >>> - bio = mpage_bio_submit(WRITE, bio); > >>> + bio = mpage_bio_submit(wr, bio); > >>> if (boundary_block) { > >>> write_boundary_block(boundary_bdev, > >>> boundary_block, 1 << blkbits); > >>> @@ -632,7 +633,7 @@ alloc_new: > >>> > >>> confused: > >>> if (bio) > >>> - bio = mpage_bio_submit(WRITE, bio); > >>> + bio = mpage_bio_submit(wr, bio); > >>> > >>> if (mpd->use_writepage) { > >>> ret = mapping->a_ops->writepage(page, wbc); > >>> @@ -688,8 +689,11 @@ mpage_writepages(struct address_space *mapping, > >>> }; > >>> > >>> ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd); > >>> - if (mpd.bio) > >>> - mpage_bio_submit(WRITE, mpd.bio); > >>> + if (mpd.bio) { > >>> + int wr = (wbc->sync_mode == WB_SYNC_ALL ? > >>> + WRITE_SYNC : WRITE); > >>> + mpage_bio_submit(wr, mpd.bio); > >>> + } > >>> } > >>> blk_finish_plug(&plug); > >>> return ret; > >>> @@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, get_block_t get_block, > >>> .use_writepage = 0, > >>> }; > >>> int ret = __mpage_writepage(page, wbc, &mpd); > >>> - if (mpd.bio) > >>> - mpage_bio_submit(WRITE, mpd.bio); > >>> + if (mpd.bio) { > >>> + int wr = (wbc->sync_mode == WB_SYNC_ALL ? 
> >>> + WRITE_SYNC : WRITE); > >>> + mpage_bio_submit(wr, mpd.bio); > >>> + } > >>> return ret; > >>> } > >>> EXPORT_SYMBOL(mpage_writepage); > >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
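Jan's "someone is waiting" definition is exactly what the block layer's sync/async classification encodes. The sketch below mirrors the rw_is_sync() helper from that era's include/linux/blkdev.h (flag bit positions illustrative, not the real values): READs are always classed as synchronous, WRITEs only when REQ_SYNC is set, and CFQ uses this classification to route a request to its sync or async service queues.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit positions, not the real enum rq_flag_bits values. */
#define REQ_WRITE   (1U << 0)
#define REQ_SYNC    (1U << 4)
#define REQ_NOIDLE  (1U << 10)

#define WRITE       REQ_WRITE
#define WRITE_SYNC  (WRITE | REQ_SYNC | REQ_NOIDLE)

/* Mirrors the kernel's rw_is_sync(): a request is "sync" if it is a
 * read, or a write explicitly marked REQ_SYNC by its submitter. */
static bool rw_is_sync(unsigned int rw)
{
    return !(rw & REQ_WRITE) || (rw & REQ_SYNC);
}
```

This is why a WB_SYNC_ALL write submitted as plain WRITE lands in the async class and can be starved by everyone else's sync IO — the risk Jan describes.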
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-13 20:01 ` Jan Kara @ 2014-03-13 21:34 ` Andrew Morton 2014-03-14 13:07 ` Tejun Heo 0 siblings, 1 reply; 20+ messages in thread From: Andrew Morton @ 2014-03-13 21:34 UTC (permalink / raw) To: Jan Kara Cc: Roman Peniaev, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe, Tejun Heo On Thu, 13 Mar 2014 21:01:19 +0100 Jan Kara <jack@suse.cz> wrote: > Hello, > > On Wed 12-03-14 23:29:04, Roman Peniaev wrote: > > could you please explain the real purpose of WAIT_SYNC? > > In case of wbc->sync_mode == WB_SYNC_ALL. > > Because my current understanding is if writeback control has > > WB_SYNC_ALL everything > > should be submitted with WAIT_SYNC. > So AFAIK the idea is that REQ_SYNC flag should indicate the IO is > synchronous - i.e., someone is waiting for it to complete. This is opposed > to asynchronous writeback done by flusher threads where noone waits for > particular write to complete. Subsequently, IO scheduler is expected (but > not required to - only CFQ honors REQ_SYNC AFAIK) to treat sync requests > with higher priority than async onces. > > When to set REQ_SYNC is not an obvious question. If we set it for too much > IO, it has no effect. If we don't set it for some IO we risk that someone > waiting for that IO to complete will be starved by others setting REQ_SYNC. > > So all in all I think that using WRITE_SYNC iff we are doing WB_SYNC_ALL > writeback is a reasonable choice. I added this to the changelog: : akpm: afaict this change will cause the data integrity write bios to be : placed onto the second queue in cfq_io_cq.cfqq[], which presumably results : in special treatment. The documentation for REQ_SYNC is horrid. Which is pretty pathetic. Jens isn't talking to us. Tejun, are you able explain REQ_SYNC? I just don't know about this patch. It will presumably have some effect on data-integrity writes. But is that a good effect or a bad one? 
Will it result in a sys_sync() starving out other IO for ages in an undesirable fashion? It might well do! Will it prioritize those writes, resulting in overall less efficient IO patterns? It might well do! I don't see how we can proceed without understanding the tradeoffs and deciding that the overall effect is a desirable one. From: Roman Pen <r.peniaev@gmail.com> Subject: fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity write, thus mark request as WRITE_SYNC. akpm: afaict this change will cause the data integrity write bios to be placed onto the second queue in cfq_io_cq.cfqq[], which presumably results in special treatment. The documentation for REQ_SYNC is horrid. Signed-off-by: Roman Pen <r.peniaev@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- fs/mpage.c | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff -puN fs/mpage.c~fs-mpagec-forgotten-write_sync-in-case-of-data-integrity-write fs/mpage.c --- a/fs/mpage.c~fs-mpagec-forgotten-write_sync-in-case-of-data-integrity-write +++ a/fs/mpage.c @@ -462,6 +462,7 @@ static int __mpage_writepage(struct page struct buffer_head map_bh; loff_t i_size = i_size_read(inode); int ret = 0; + int wr = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); if (page_has_buffers(page)) { struct buffer_head *head = page_buffers(page); @@ -570,7 +571,7 @@ page_is_mapped: * This page will go to BIO. Do we need to send this BIO off first? 
*/ if (bio && mpd->last_block_in_bio != blocks[0] - 1) - bio = mpage_bio_submit(WRITE, bio); + bio = mpage_bio_submit(wr, bio); alloc_new: if (bio == NULL) { @@ -587,7 +588,7 @@ alloc_new: */ length = first_unmapped << blkbits; if (bio_add_page(bio, page, length, 0) < length) { - bio = mpage_bio_submit(WRITE, bio); + bio = mpage_bio_submit(wr, bio); goto alloc_new; } @@ -620,7 +621,7 @@ alloc_new: set_page_writeback(page); unlock_page(page); if (boundary || (first_unmapped != blocks_per_page)) { - bio = mpage_bio_submit(WRITE, bio); + bio = mpage_bio_submit(wr, bio); if (boundary_block) { write_boundary_block(boundary_bdev, boundary_block, 1 << blkbits); @@ -632,7 +633,7 @@ alloc_new: confused: if (bio) - bio = mpage_bio_submit(WRITE, bio); + bio = mpage_bio_submit(wr, bio); if (mpd->use_writepage) { ret = mapping->a_ops->writepage(page, wbc); @@ -688,8 +689,11 @@ mpage_writepages(struct address_space *m }; ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd); - if (mpd.bio) - mpage_bio_submit(WRITE, mpd.bio); + if (mpd.bio) { + int wr = (wbc->sync_mode == WB_SYNC_ALL ? + WRITE_SYNC : WRITE); + mpage_bio_submit(wr, mpd.bio); + } } blk_finish_plug(&plug); return ret; @@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, g .use_writepage = 0, }; int ret = __mpage_writepage(page, wbc, &mpd); - if (mpd.bio) - mpage_bio_submit(WRITE, mpd.bio); + if (mpd.bio) { + int wr = (wbc->sync_mode == WB_SYNC_ALL ? + WRITE_SYNC : WRITE); + mpage_bio_submit(wr, mpd.bio); + } return ret; } EXPORT_SYMBOL(mpage_writepage); _ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-13 21:34 ` Andrew Morton @ 2014-03-14 13:07 ` Tejun Heo 2014-03-14 14:07 ` Roman Peniaev 0 siblings, 1 reply; 20+ messages in thread From: Tejun Heo @ 2014-03-14 13:07 UTC (permalink / raw) To: Andrew Morton Cc: Jan Kara, Roman Peniaev, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe Hello, Andrew. On Thu, Mar 13, 2014 at 02:34:56PM -0700, Andrew Morton wrote: > Jens isn't talking to us. Tejun, are you able explain REQ_SYNC? It has nothing to do with data integrity. It's just a hint telling the block layer that someone is waiting for the IO, so it'd be a good idea to prioritize it. For example, nothing visible to userland really waits for periodic writebacks, so we can delay their processing to prioritize, for example, READs triggered from a page fault, which is obviously causing userland-visible latency. The block layer treats all READs as REQ_SYNC and also allows upper layers to mark some writes REQ_SYNC for cases where somebody is waiting for the write to complete, as with fsync(2). > From: Roman Pen <r.peniaev@gmail.com> > Subject: fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write > > In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity > write, thus mark request as WRITE_SYNC. So, at least, this patch description is very misleading. WRITE_SYNC has *NOTHING* to do with data integrity. The only thing that matters is whether somebody is waiting for its completion or not. Thanks. -- tejun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-14 13:07 ` Tejun Heo @ 2014-03-14 14:07 ` Roman Peniaev 2014-03-14 14:11 ` Tejun Heo 0 siblings, 1 reply; 20+ messages in thread From: Roman Peniaev @ 2014-03-14 14:07 UTC (permalink / raw) To: Tejun Heo Cc: Andrew Morton, Jan Kara, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe Hello, Tejun. On Fri, Mar 14, 2014 at 10:07 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, Andrew. > > On Thu, Mar 13, 2014 at 02:34:56PM -0700, Andrew Morton wrote: >> Jens isn't talking to us. Tejun, are you able explain REQ_SYNC? > > It has nothing to do with data integrity. It's just a hint telling > the block layer that someone is waiting for the IO so it'd be a good > idea to prioritize it. For example, nothing visible to userland > really waits for periodic writebacks, so we can delay their processing > to prioritize, for example, READs triggered from a page fault, which > is obviously causing userland visible latency. > > Block layer treats all READs as REQ_SYNC and also allows upper layers > to mark some writes REQ_SYNC for cases where somebody is waiting for > the write to complete for cases like flush(2). > >> From: Roman Pen <r.peniaev@gmail.com> >> Subject: fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write >> >> In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity >> write, thus mark request as WRITE_SYNC. > > So, at least this patch description is very misleading. WRITE_SYNC > has *NOTHING* to do with data integrity. The only thing matters is > whether somebody is waiting for its completion or not. The thing is that I started investigating WB_SYNC_ALL and WRITE_SYNC from the __filemap_fdatawrite_range call which has explicit comment: * If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as * opposed to a regular memory cleansing writeback. 
So, in the commit msg I wanted to say that when we are doing an integrity write (sync, fsync calls) we have to mark requests as WRITE_SYNC, and there is only one path (the mpage writeback path) where WRITE_SYNC is missed. That was the idea. (Not that WRITE_SYNC is responsible for the integrity write, but that it applies in the case of an integrity write.) It seems the following message would be better: When a data integrity operation (a sync, fsync or fdatasync call) happens, writeback control is set to WB_SYNC_ALL. In that case all write requests are marked with WRITE_SYNC, but on the mpage writeback path WRITE_SYNC is missed. This patch fixes this. Is that ok, what do you think? Also, could you please help me understand how I can guarantee integrity in the case of a block device with a big volatile cache and a filesystem which does not support REQ_FLUSH/FUA? Even if I am doing fsync, the block device will never receive any indication that these requests should really be stored on the device, or that the cache should be flushed right now (like the REQ_FLUSH/FUA behaviour). It seems impossible to do an explicit block device cache flush if the filesystem does not care about it. Thanks. -- Roman > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-14 14:07 ` Roman Peniaev @ 2014-03-14 14:11 ` Tejun Heo 2014-03-14 14:15 ` Jan Kara ` (2 more replies) 0 siblings, 3 replies; 20+ messages in thread From: Tejun Heo @ 2014-03-14 14:11 UTC (permalink / raw) To: Roman Peniaev Cc: Andrew Morton, Jan Kara, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe Hello, On Fri, Mar 14, 2014 at 11:07:04PM +0900, Roman Peniaev wrote: > Seems the following message should be better: > When data inegrity operation (sync, fsync, fdatasync calls) happens > writeback control is set to WB_SYNC_ALL. > In that case all write requests are marked with WRITE_SYNC, but on > mpage writeback path > WRITE_SYNC is missed. This patch fixes this. > > Is it ok, what do you think? I think the description should make it clear that WRITE_SYNC is about latency, not about integrity and we probably should add comments explaining why we're using WRITE_SYNC for WB_SYNC_ALL (because there probably is someone waiting). > Also, could you please help me do understand how can I guarantee > integrity in case of block device with big volatile > cache and filesystem, which does not support REQ_FLUSH/FUA? If a device has a volatile cache but doesn't support flush, it can't guarantee integrity. There's no way for its user to determine or force whether certain data is on non-volatile media. It's an inherently broken device. Thanks. -- tejun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-14 14:11 ` Tejun Heo @ 2014-03-14 14:15 ` Jan Kara 2014-03-14 14:23 ` Roman Peniaev 2014-03-14 14:17 ` Roman Peniaev 2014-03-14 15:36 ` Roman Peniaev 2 siblings, 1 reply; 20+ messages in thread From: Jan Kara @ 2014-03-14 14:15 UTC (permalink / raw) To: Tejun Heo Cc: Roman Peniaev, Andrew Morton, Jan Kara, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe On Fri 14-03-14 10:11:43, Tejun Heo wrote: > > Also, could you please help me do understand how can I guarantee > > integrity in case of block device with big volatile > > cache and filesystem, which does not support REQ_FLUSH/FUA? > > If a device has a volatile cache but doesn't support flush, it can't > guarantee integrity. There's no way for its user to determine or > force whether certain data is on non-volatile media. It's an > inherently broken device. I think his problem was that the device does support REQ_FLUSH/FUA but the filesystem on top of it doesn't issue it properly. That's a filesystem problem so fix the filesystem... :) Which one is it? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-14 14:15 ` Jan Kara @ 2014-03-14 14:23 ` Roman Peniaev 2014-03-14 14:52 ` Jan Kara 0 siblings, 1 reply; 20+ messages in thread From: Roman Peniaev @ 2014-03-14 14:23 UTC (permalink / raw) To: Jan Kara Cc: Tejun Heo, Andrew Morton, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe On Fri, Mar 14, 2014 at 11:15 PM, Jan Kara <jack@suse.cz> wrote: > On Fri 14-03-14 10:11:43, Tejun Heo wrote: >> > Also, could you please help me do understand how can I guarantee >> > integrity in case of block device with big volatile >> > cache and filesystem, which does not support REQ_FLUSH/FUA? >> >> If a device has a volatile cache but doesn't support flush, it can't >> guarantee integrity. There's no way for its user to determine or >> force whether certain data is on non-volatile media. It's an >> inherently broken device. > I think his problem was that the device does support REQ_FLUSH/FUA but > the filesystem on top of it doesn't issue it properly. That's a filesystem > problem so fix the filesystem... :) Which one is it? take any old school, e.g. ext2 or even better: fat :) -- Roman ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-14 14:23 ` Roman Peniaev @ 2014-03-14 14:52 ` Jan Kara 2014-03-14 14:54 ` Tejun Heo 0 siblings, 1 reply; 20+ messages in thread From: Jan Kara @ 2014-03-14 14:52 UTC (permalink / raw) To: Roman Peniaev Cc: Jan Kara, Tejun Heo, Andrew Morton, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe On Fri 14-03-14 23:23:45, Roman Peniaev wrote: > On Fri, Mar 14, 2014 at 11:15 PM, Jan Kara <jack@suse.cz> wrote: > > On Fri 14-03-14 10:11:43, Tejun Heo wrote: > >> > Also, could you please help me do understand how can I guarantee > >> > integrity in case of block device with big volatile > >> > cache and filesystem, which does not support REQ_FLUSH/FUA? > >> > >> If a device has a volatile cache but doesn't support flush, it can't > >> guarantee integrity. There's no way for its user to determine or > >> force whether certain data is on non-volatile media. It's an > >> inherently broken device. > > I think his problem was that the device does support REQ_FLUSH/FUA but > > the filesystem on top of it doesn't issue it properly. That's a filesystem > > problem so fix the filesystem... :) Which one is it? > > take any old school, e.g. ext2 or even better: fat :) Well, for ext2, you can use ext4 kernel driver which takes care of REQ_FLUSH properly. For fat, you'll need to fix the fs... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-14 14:52 ` Jan Kara @ 2014-03-14 14:54 ` Tejun Heo 2014-03-14 15:08 ` Jan Kara 2014-03-15 9:09 ` Christoph Hellwig 0 siblings, 2 replies; 20+ messages in thread From: Tejun Heo @ 2014-03-14 14:54 UTC (permalink / raw) To: Jan Kara Cc: Roman Peniaev, Andrew Morton, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe On Fri, Mar 14, 2014 at 03:52:15PM +0100, Jan Kara wrote: > Well, for ext2, you can use ext4 kernel driver which takes care of > REQ_FLUSH properly. For fat, you'll need to fix the fs... This is a bit surprising tho. Were we always like this? We never had even stupid "flush down everything and sync"? Or is this something we broke while morphing flush implementation several times in the past years? Thanks. -- tejun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-14 14:54 ` Tejun Heo @ 2014-03-14 15:08 ` Jan Kara 2014-03-15 9:09 ` Christoph Hellwig 0 siblings, 0 replies; 20+ messages in thread From: Jan Kara @ 2014-03-14 15:08 UTC (permalink / raw) To: Tejun Heo Cc: Jan Kara, Roman Peniaev, Andrew Morton, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe On Fri 14-03-14 10:54:30, Tejun Heo wrote: > On Fri, Mar 14, 2014 at 03:52:15PM +0100, Jan Kara wrote: > > Well, for ext2, you can use ext4 kernel driver which takes care of > > REQ_FLUSH properly. For fat, you'll need to fix the fs... > > This is a bit surprising tho. Were we always like this? We never had > even stupid "flush down everything and sync"? Or is this something we > broke while morphing flush implementation several times in the past > years? It has always been like this. We could add some sending of REQ_FLUSH into generic code - like generic_file_fsync() and some similar helper for ->sync_fs(). Just someone has to do it... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
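Jan's suggestion of sending REQ_FLUSH from generic code would look roughly like the sketch below. This is a kernel-internal sketch against 3.x-era APIs, not a compilable or authoritative implementation: generic_file_fsync() and blkdev_issue_flush() are real kernel functions with these signatures at the time, but the wrapper name and wiring are hypothetical.

```c
/*
 * Sketch only: after the generic fsync path has written out and
 * waited on the pages, explicitly flush the device's volatile
 * write cache so the data actually reaches stable media.
 */
static int flushing_file_fsync(struct file *file, loff_t start, loff_t end,
			       int datasync)
{
	struct inode *inode = file->f_mapping->host;
	int err;

	err = generic_file_fsync(file, start, end, datasync);
	if (err)
		return err;

	/* Issue REQ_FLUSH so writes parked in the disk cache hit media. */
	return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
}
```

A filesystem like fat could then point its ->fsync at such a helper instead of the bare generic_file_fsync(), closing the volatile-cache hole Roman describes.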
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-14 14:54 ` Tejun Heo 2014-03-14 15:08 ` Jan Kara @ 2014-03-15 9:09 ` Christoph Hellwig 1 sibling, 0 replies; 20+ messages in thread From: Christoph Hellwig @ 2014-03-15 9:09 UTC (permalink / raw) To: Tejun Heo Cc: Jan Kara, Roman Peniaev, Andrew Morton, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe On Fri, Mar 14, 2014 at 10:54:30AM -0400, Tejun Heo wrote: > This is a bit surprising tho. Were we always like this? We never had > even stupid "flush down everything and sync"? Or is this something we > broke while morphing flush implementation several times in the past > years? It's something that never worked in Linux. Until XFS went ahead and enabled barriers by default no filesystem in Linux ever flushed the volatile cache by default, and most couldn't (and still can't) even if you asked them to. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-14 14:11 ` Tejun Heo 2014-03-14 14:15 ` Jan Kara @ 2014-03-14 14:17 ` Roman Peniaev 2014-03-14 14:20 ` Tejun Heo 2014-03-14 15:36 ` Roman Peniaev 2 siblings, 1 reply; 20+ messages in thread From: Roman Peniaev @ 2014-03-14 14:17 UTC (permalink / raw) To: Tejun Heo Cc: Andrew Morton, Jan Kara, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe On Fri, Mar 14, 2014 at 11:11 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Fri, Mar 14, 2014 at 11:07:04PM +0900, Roman Peniaev wrote: >> Seems the following message should be better: >> When data integrity operation (sync, fsync, fdatasync calls) happens >> writeback control is set to WB_SYNC_ALL. >> In that case all write requests are marked with WRITE_SYNC, but on >> mpage writeback path >> WRITE_SYNC is missed. This patch fixes this. >> >> Is it ok, what do you think? > > I think the description should make it clear that WRITE_SYNC is about > latency, not about integrity and we probably should add comments > explaining why we're using WRITE_SYNC for WB_SYNC_ALL (because there > probably is someone waiting). > >> Also, could you please help me understand how I can guarantee >> integrity in case of block device with big volatile >> cache and filesystem, which does not support REQ_FLUSH/FUA? > > If a device has a volatile cache but doesn't support flush, it can't > guarantee integrity. There's no way for its user to determine or > force whether certain data is on non-volatile media. It's an > inherently broken device. No, no. It is not that the device does not support flush; it is that the filesystem does not care about flush. (Take any old school one, e.g. ext2.) We did some write, and then we did fsync. But the filesystem does not issue REQ_FLUSH/FUA, so the block device will never get the checkpoint where it should really flush everything. So, fsync will not guarantee any integrity in that case. > > Thanks. 
> > -- > tejun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-14 14:17 ` Roman Peniaev @ 2014-03-14 14:20 ` Tejun Heo 2014-03-14 14:29 ` Roman Peniaev 0 siblings, 1 reply; 20+ messages in thread From: Tejun Heo @ 2014-03-14 14:20 UTC (permalink / raw) To: Roman Peniaev Cc: Andrew Morton, Jan Kara, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe On Fri, Mar 14, 2014 at 11:17:56PM +0900, Roman Peniaev wrote: > No, no. Not device does not support flush, filesystem does not care about flush. > (take any old school, e.g. ext2) > > We did some write, and then we did fsync. > But filesystem does not support REQ_FLUSH/FUA so block device will never > get the checkpoint where it should really flush everything. > > So, fsync will not guarantee any integrity in that case. Oh, no idea. Fix the filesystem to support REQ_FLUSH? :) -- tejun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-14 14:20 ` Tejun Heo @ 2014-03-14 14:29 ` Roman Peniaev 0 siblings, 0 replies; 20+ messages in thread From: Roman Peniaev @ 2014-03-14 14:29 UTC (permalink / raw) To: Tejun Heo Cc: Andrew Morton, Jan Kara, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe On Fri, Mar 14, 2014 at 11:20 PM, Tejun Heo <tj@kernel.org> wrote: > On Fri, Mar 14, 2014 at 11:17:56PM +0900, Roman Peniaev wrote: >> No, no. Not device does not support flush, filesystem does not care about flush. >> (take any old school, e.g. ext2) >> >> We did some write, and then we did fsync. >> But filesystem does not support REQ_FLUSH/FUA so block device will never >> get the checkpoint where it should really flush everything. >> >> So, fsync will not guarantee any integrity in that case. > > Oh, no idea. Fix the filesystem to support REQ_FLUSH? :) Yep, best variant. But I was thinking that there should be some guarantees from fsync (and friends) calls, which will issue flush after completion of every writeback request. But no way > > -- > tejun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-03-14 14:11 ` Tejun Heo 2014-03-14 14:15 ` Jan Kara 2014-03-14 14:17 ` Roman Peniaev @ 2014-03-14 15:36 ` Roman Peniaev 2 siblings, 0 replies; 20+ messages in thread From: Roman Peniaev @ 2014-03-14 15:36 UTC (permalink / raw) To: Tejun Heo Cc: Andrew Morton, Jan Kara, Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe On Fri, Mar 14, 2014 at 11:11 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Fri, Mar 14, 2014 at 11:07:04PM +0900, Roman Peniaev wrote: >> Seems the following message should be better: >> When data inegrity operation (sync, fsync, fdatasync calls) happens >> writeback control is set to WB_SYNC_ALL. >> In that case all write requests are marked with WRITE_SYNC, but on >> mpage writeback path >> WRITE_SYNC is missed. This patch fixes this. >> >> Is it ok, what do you think? > > I think the description should make it clear that WRITE_SYNC is about > latency, not about integrity and we probably should add comments > explaining why we're using WRITE_SYNC for WB_SYNC_ALL (because there > probably is someone waiting). I sent v2 of the patch where I tried to be more explicit in details. Please, check. -- Roman ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write 2014-02-16 2:54 [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write Roman Pen 2014-02-18 23:59 ` Andrew Morton @ 2014-03-13 20:21 ` Jan Kara 1 sibling, 0 replies; 20+ messages in thread From: Jan Kara @ 2014-03-13 20:21 UTC (permalink / raw) To: Roman Pen Cc: Alexander Viro, linux-fsdevel, linux-kernel, Andrew Morton, Jens Axboe On Sun 16-02-14 11:54:28, Roman Pen wrote: > In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity write, > thus mark request as WRITE_SYNC. The patch looks good to me and also makes generic mpage code more in sync with what e.g. XFS or ext4 do in their writepage functions. You can add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > Signed-off-by: Roman Pen <r.peniaev@gmail.com> > CC: Alexander Viro <viro@zeniv.linux.org.uk> > CC: linux-fsdevel@vger.kernel.org > CC: linux-kernel@vger.kernel.org > --- > fs/mpage.c | 23 +++++++++++++++-------- > 1 file changed, 15 insertions(+), 8 deletions(-) > > diff --git a/fs/mpage.c b/fs/mpage.c > index 4979ffa..4e0af5a 100644 > --- a/fs/mpage.c > +++ b/fs/mpage.c > @@ -462,6 +462,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc, > struct buffer_head map_bh; > loff_t i_size = i_size_read(inode); > int ret = 0; > + int wr = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); > > if (page_has_buffers(page)) { > struct buffer_head *head = page_buffers(page); > @@ -570,7 +571,7 @@ page_is_mapped: > * This page will go to BIO. Do we need to send this BIO off first? 
> */ > if (bio && mpd->last_block_in_bio != blocks[0] - 1) > - bio = mpage_bio_submit(WRITE, bio); > + bio = mpage_bio_submit(wr, bio); > > alloc_new: > if (bio == NULL) { > @@ -587,7 +588,7 @@ alloc_new: > */ > length = first_unmapped << blkbits; > if (bio_add_page(bio, page, length, 0) < length) { > - bio = mpage_bio_submit(WRITE, bio); > + bio = mpage_bio_submit(wr, bio); > goto alloc_new; > } > > @@ -620,7 +621,7 @@ alloc_new: > set_page_writeback(page); > unlock_page(page); > if (boundary || (first_unmapped != blocks_per_page)) { > - bio = mpage_bio_submit(WRITE, bio); > + bio = mpage_bio_submit(wr, bio); > if (boundary_block) { > write_boundary_block(boundary_bdev, > boundary_block, 1 << blkbits); > @@ -632,7 +633,7 @@ alloc_new: > > confused: > if (bio) > - bio = mpage_bio_submit(WRITE, bio); > + bio = mpage_bio_submit(wr, bio); > > if (mpd->use_writepage) { > ret = mapping->a_ops->writepage(page, wbc); > @@ -688,8 +689,11 @@ mpage_writepages(struct address_space *mapping, > }; > > ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd); > - if (mpd.bio) > - mpage_bio_submit(WRITE, mpd.bio); > + if (mpd.bio) { > + int wr = (wbc->sync_mode == WB_SYNC_ALL ? > + WRITE_SYNC : WRITE); > + mpage_bio_submit(wr, mpd.bio); > + } > } > blk_finish_plug(&plug); > return ret; > @@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, get_block_t get_block, > .use_writepage = 0, > }; > int ret = __mpage_writepage(page, wbc, &mpd); > - if (mpd.bio) > - mpage_bio_submit(WRITE, mpd.bio); > + if (mpd.bio) { > + int wr = (wbc->sync_mode == WB_SYNC_ALL ? 
> + WRITE_SYNC : WRITE); > + mpage_bio_submit(wr, mpd.bio); > + } > return ret; > } > EXPORT_SYMBOL(mpage_writepage); > -- > 1.8.5.2 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2014-03-15 9:09 UTC | newest] Thread overview: 20+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-02-16 2:54 [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write Roman Pen 2014-02-18 23:59 ` Andrew Morton 2014-02-19 1:38 ` Roman Peniaev 2014-03-12 14:29 ` Roman Peniaev 2014-03-13 20:01 ` Jan Kara 2014-03-13 21:34 ` Andrew Morton 2014-03-14 13:07 ` Tejun Heo 2014-03-14 14:07 ` Roman Peniaev 2014-03-14 14:11 ` Tejun Heo 2014-03-14 14:15 ` Jan Kara 2014-03-14 14:23 ` Roman Peniaev 2014-03-14 14:52 ` Jan Kara 2014-03-14 14:54 ` Tejun Heo 2014-03-14 15:08 ` Jan Kara 2014-03-15 9:09 ` Christoph Hellwig 2014-03-14 14:17 ` Roman Peniaev 2014-03-14 14:20 ` Tejun Heo 2014-03-14 14:29 ` Roman Peniaev 2014-03-14 15:36 ` Roman Peniaev 2014-03-13 20:21 ` Jan Kara