[PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
@ 2014-02-16  2:54 Roman Pen
  2014-02-18 23:59 ` Andrew Morton
  2014-03-13 20:21 ` Jan Kara
  0 siblings, 2 replies; 20+ messages in thread
From: Roman Pen @ 2014-02-16  2:54 UTC (permalink / raw)
  Cc: Roman Pen, Alexander Viro, linux-fsdevel, linux-kernel

In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity write,
thus mark request as WRITE_SYNC.

Signed-off-by: Roman Pen <r.peniaev@gmail.com>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
CC: linux-kernel@vger.kernel.org
---
 fs/mpage.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index 4979ffa..4e0af5a 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -462,6 +462,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	struct buffer_head map_bh;
 	loff_t i_size = i_size_read(inode);
 	int ret = 0;
+	int wr = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
 
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -570,7 +571,7 @@ page_is_mapped:
 	 * This page will go to BIO.  Do we need to send this BIO off first?
 	 */
 	if (bio && mpd->last_block_in_bio != blocks[0] - 1)
-		bio = mpage_bio_submit(WRITE, bio);
+		bio = mpage_bio_submit(wr, bio);
 
 alloc_new:
 	if (bio == NULL) {
@@ -587,7 +588,7 @@ alloc_new:
 	 */
 	length = first_unmapped << blkbits;
 	if (bio_add_page(bio, page, length, 0) < length) {
-		bio = mpage_bio_submit(WRITE, bio);
+		bio = mpage_bio_submit(wr, bio);
 		goto alloc_new;
 	}
 
@@ -620,7 +621,7 @@ alloc_new:
 	set_page_writeback(page);
 	unlock_page(page);
 	if (boundary || (first_unmapped != blocks_per_page)) {
-		bio = mpage_bio_submit(WRITE, bio);
+		bio = mpage_bio_submit(wr, bio);
 		if (boundary_block) {
 			write_boundary_block(boundary_bdev,
 					boundary_block, 1 << blkbits);
@@ -632,7 +633,7 @@ alloc_new:
 
 confused:
 	if (bio)
-		bio = mpage_bio_submit(WRITE, bio);
+		bio = mpage_bio_submit(wr, bio);
 
 	if (mpd->use_writepage) {
 		ret = mapping->a_ops->writepage(page, wbc);
@@ -688,8 +689,11 @@ mpage_writepages(struct address_space *mapping,
 		};
 
 		ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd);
-		if (mpd.bio)
-			mpage_bio_submit(WRITE, mpd.bio);
+		if (mpd.bio) {
+			int wr = (wbc->sync_mode == WB_SYNC_ALL ?
+				  WRITE_SYNC : WRITE);
+			mpage_bio_submit(wr, mpd.bio);
+		}
 	}
 	blk_finish_plug(&plug);
 	return ret;
@@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, get_block_t get_block,
 		.use_writepage = 0,
 	};
 	int ret = __mpage_writepage(page, wbc, &mpd);
-	if (mpd.bio)
-		mpage_bio_submit(WRITE, mpd.bio);
+	if (mpd.bio) {
+		int wr = (wbc->sync_mode == WB_SYNC_ALL ?
+			  WRITE_SYNC : WRITE);
+		mpage_bio_submit(wr, mpd.bio);
+	}
 	return ret;
 }
 EXPORT_SYMBOL(mpage_writepage);
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-02-16  2:54 [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write Roman Pen
@ 2014-02-18 23:59 ` Andrew Morton
  2014-02-19  1:38   ` Roman Peniaev
  2014-03-13 20:21 ` Jan Kara
  1 sibling, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2014-02-18 23:59 UTC (permalink / raw)
  To: Roman Pen; +Cc: Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe

On Sun, 16 Feb 2014 11:54:28 +0900 Roman Pen <r.peniaev@gmail.com> wrote:

> In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity write,
> thus mark request as WRITE_SYNC.

gargh, the documentation for this stuff is useless.

What do REQ_SYNC and REQ_NOIDLE actually *do*?

For mpage writes, REQ_NOIDLE appears to be incorrect - we very much
expect that there will be more writes and that they will be contiguous
with this one.  But we won't be waiting on this write before submitting
more writes, so perhaps REQ_NOIDLE is at least harmless.

I dunno about REQ_SYNC - it requires delving into the bowels of CFQ
and we shouldn't need to do that.

Jens.  Help.  How is a poor kernel reader supposed to work this out?

> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -462,6 +462,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
>  	struct buffer_head map_bh;
>  	loff_t i_size = i_size_read(inode);
>  	int ret = 0;
> +	int wr = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
>  
>  	if (page_has_buffers(page)) {
>  		struct buffer_head *head = page_buffers(page);
> @@ -570,7 +571,7 @@ page_is_mapped:
>  	 * This page will go to BIO.  Do we need to send this BIO off first?
>  	 */
>  	if (bio && mpd->last_block_in_bio != blocks[0] - 1)
> -		bio = mpage_bio_submit(WRITE, bio);
> +		bio = mpage_bio_submit(wr, bio);
>  
>  alloc_new:
>  	if (bio == NULL) {
> @@ -587,7 +588,7 @@ alloc_new:
>  	 */
>  	length = first_unmapped << blkbits;
>  	if (bio_add_page(bio, page, length, 0) < length) {
> -		bio = mpage_bio_submit(WRITE, bio);
> +		bio = mpage_bio_submit(wr, bio);
>  		goto alloc_new;
>  	}
>  
> @@ -620,7 +621,7 @@ alloc_new:
>  	set_page_writeback(page);
>  	unlock_page(page);
>  	if (boundary || (first_unmapped != blocks_per_page)) {
> -		bio = mpage_bio_submit(WRITE, bio);
> +		bio = mpage_bio_submit(wr, bio);
>  		if (boundary_block) {
>  			write_boundary_block(boundary_bdev,
>  					boundary_block, 1 << blkbits);
> @@ -632,7 +633,7 @@ alloc_new:
>  
>  confused:
>  	if (bio)
> -		bio = mpage_bio_submit(WRITE, bio);
> +		bio = mpage_bio_submit(wr, bio);
>  
>  	if (mpd->use_writepage) {
>  		ret = mapping->a_ops->writepage(page, wbc);
> @@ -688,8 +689,11 @@ mpage_writepages(struct address_space *mapping,
>  		};
>  
>  		ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd);
> -		if (mpd.bio)
> -			mpage_bio_submit(WRITE, mpd.bio);
> +		if (mpd.bio) {
> +			int wr = (wbc->sync_mode == WB_SYNC_ALL ?
> +				  WRITE_SYNC : WRITE);
> +			mpage_bio_submit(wr, mpd.bio);
> +		}
>  	}
>  	blk_finish_plug(&plug);
>  	return ret;
> @@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, get_block_t get_block,
>  		.use_writepage = 0,
>  	};
>  	int ret = __mpage_writepage(page, wbc, &mpd);
> -	if (mpd.bio)
> -		mpage_bio_submit(WRITE, mpd.bio);
> +	if (mpd.bio) {
> +		int wr = (wbc->sync_mode == WB_SYNC_ALL ?
> +			  WRITE_SYNC : WRITE);
> +		mpage_bio_submit(wr, mpd.bio);
> +	}
>  	return ret;
>  }
>  EXPORT_SYMBOL(mpage_writepage);

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-02-18 23:59 ` Andrew Morton
@ 2014-02-19  1:38   ` Roman Peniaev
  2014-03-12 14:29     ` Roman Peniaev
  0 siblings, 1 reply; 20+ messages in thread
From: Roman Peniaev @ 2014-02-19  1:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe

(my previous email was rejected by vger.kernel.org because google web
sent it as html.
 will resend the same one in plain text mode)

> What do REQ_SYNC and REQ_NOIDLE actually *do*?

Yep, this REQ_SYNC is very confusing to me.
First of all according to the sources of old school block buffer filesystems
(e.g. ext2) we can get this stack in case of fsync call:

     __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode)
      do_writepages(mapping, &wbc)
         mapping->a_ops->writepages(page, wbc)
         (ext2_writepages)
            mpage_writepages(mapping, wbc, fat_get_block);
              write_cache_pages(mapping, wbc, __mpage_writepage, &mpd)
                __mpage_writepage(page, wbc, data)
>>>>>             mpage_bio_submit(WRITE, bio) >>>> why WRITE? not WRITE_SYNC in case of WB_SYNC_ALL?
                    <or in case of not contiguous buffers>
                  mapping->a_ops->writepage(page, wbc)
                  (ext2_writepage)
                    block_write_full_page(page, fat_get_block, wbc)
                      block_write_full_page_endio(page, get_block, wbc,
                                                  end_buffer_async_write)
                        __block_write_full_page(inode, page, get_block, wbc,
                                                handler);
                          submit_bh(WRITE_SYNC)

So, it turns out to be that some bios for the same dirty range
can be submitted with REQ_WRITE|REQ_SYNC|REQ_NOIDLE and some of
the bios only with REQ_WRITE.
(according to the comment of __mpage_writepage:
 * If all blocks are found to be contiguous then the page can go into the
 * BIO.  Otherwise fall back to the mapping's writepage().
)

Also, it seems to me that all over the kernel WRITE_SYNC has meaning of:
1. try to get the block on-disk faster
2. if I have to do flush - mark my bio with WRITE_SYNC and wait for result

My patch is an attempt to make some unification in case of fsync call.

--
Roman


On Wed, Feb 19, 2014 at 8:59 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Sun, 16 Feb 2014 11:54:28 +0900 Roman Pen <r.peniaev@gmail.com> wrote:
>
>> In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity write,
>> thus mark request as WRITE_SYNC.
>
> gargh, the documentation for this stuff is useless.
>
> What do REQ_SYNC and REQ_NOIDLE actually *do*?
>
> For mpage writes, REQ_NOIDLE appears to be incorrect - we very much
> expect that there will be more writes and that they will be contiguous
> with this one.  But we won't be waiting on this write before submitting
> more writes, so perhaps REQ_NOIDLE is at least harmless.
>
> I dunno about REQ_SYNC - it requires delving into the bowels of CFQ
> and we shouldn't need to do that.
>
> Jens.  Help.  How is a poor kernel reader supposed to work this out?
>
>> --- a/fs/mpage.c
>> +++ b/fs/mpage.c
>> @@ -462,6 +462,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
>>       struct buffer_head map_bh;
>>       loff_t i_size = i_size_read(inode);
>>       int ret = 0;
>> +     int wr = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
>>
>>       if (page_has_buffers(page)) {
>>               struct buffer_head *head = page_buffers(page);
>> @@ -570,7 +571,7 @@ page_is_mapped:
>>        * This page will go to BIO.  Do we need to send this BIO off first?
>>        */
>>       if (bio && mpd->last_block_in_bio != blocks[0] - 1)
>> -             bio = mpage_bio_submit(WRITE, bio);
>> +             bio = mpage_bio_submit(wr, bio);
>>
>>  alloc_new:
>>       if (bio == NULL) {
>> @@ -587,7 +588,7 @@ alloc_new:
>>        */
>>       length = first_unmapped << blkbits;
>>       if (bio_add_page(bio, page, length, 0) < length) {
>> -             bio = mpage_bio_submit(WRITE, bio);
>> +             bio = mpage_bio_submit(wr, bio);
>>               goto alloc_new;
>>       }
>>
>> @@ -620,7 +621,7 @@ alloc_new:
>>       set_page_writeback(page);
>>       unlock_page(page);
>>       if (boundary || (first_unmapped != blocks_per_page)) {
>> -             bio = mpage_bio_submit(WRITE, bio);
>> +             bio = mpage_bio_submit(wr, bio);
>>               if (boundary_block) {
>>                       write_boundary_block(boundary_bdev,
>>                                       boundary_block, 1 << blkbits);
>> @@ -632,7 +633,7 @@ alloc_new:
>>
>>  confused:
>>       if (bio)
>> -             bio = mpage_bio_submit(WRITE, bio);
>> +             bio = mpage_bio_submit(wr, bio);
>>
>>       if (mpd->use_writepage) {
>>               ret = mapping->a_ops->writepage(page, wbc);
>> @@ -688,8 +689,11 @@ mpage_writepages(struct address_space *mapping,
>>               };
>>
>>               ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd);
>> -             if (mpd.bio)
>> -                     mpage_bio_submit(WRITE, mpd.bio);
>> +             if (mpd.bio) {
>> +                     int wr = (wbc->sync_mode == WB_SYNC_ALL ?
>> +                               WRITE_SYNC : WRITE);
>> +                     mpage_bio_submit(wr, mpd.bio);
>> +             }
>>       }
>>       blk_finish_plug(&plug);
>>       return ret;
>> @@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, get_block_t get_block,
>>               .use_writepage = 0,
>>       };
>>       int ret = __mpage_writepage(page, wbc, &mpd);
>> -     if (mpd.bio)
>> -             mpage_bio_submit(WRITE, mpd.bio);
>> +     if (mpd.bio) {
>> +             int wr = (wbc->sync_mode == WB_SYNC_ALL ?
>> +                       WRITE_SYNC : WRITE);
>> +             mpage_bio_submit(wr, mpd.bio);
>> +     }
>>       return ret;
>>  }
>>  EXPORT_SYMBOL(mpage_writepage);
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-02-19  1:38   ` Roman Peniaev
@ 2014-03-12 14:29     ` Roman Peniaev
  2014-03-13 20:01       ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread
From: Roman Peniaev @ 2014-03-12 14:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alexander Viro, linux-fsdevel, linux-kernel, Jens Axboe

Jens,

could you please explain the real purpose of WAIT_SYNC?
In case of wbc->sync_mode == WB_SYNC_ALL.
Because my current understanding is if writeback control has
WB_SYNC_ALL everything
should be submitted with WAIT_SYNC.

--
Roman


On Wed, Feb 19, 2014 at 10:38 AM, Roman Peniaev <r.peniaev@gmail.com> wrote:
> (my previous email was rejected by vger.kernel.org because google web
> sent it as html.
>  will resend the same one in plain text mode)
>
>> What do REQ_SYNC and REQ_NOIDLE actually *do*?
>
> Yep, this REQ_SYNC is very confusing to me.
> First of all according to the sources of old school block buffer filesystems
> (e.g. ext2) we can get this stack in case of fsync call:
>
>      __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode)
>       do_writepages(mapping, &wbc)
>          mapping->a_ops->writepages(page, wbc)
>          (ext2_writepages)
>             mpage_writepages(mapping, wbc, fat_get_block);
>               write_cache_pages(mapping, wbc, __mpage_writepage, &mpd)
>                 __mpage_writepage(page, wbc, data)
>>>>>>             mpage_bio_submit(WRITE, bio) >>>> why WRITE? not WRITE_SYNC in case of WB_SYNC_ALL?
>                     <or in case of not contiguous buffers>
>                   mapping->a_ops->writepage(page, wbc)
>                   (ext2_writepage)
>                     block_write_full_page(page, fat_get_block, wbc)
>                       block_write_full_page_endio(page, get_block, wbc,
>                                                   end_buffer_async_write)
>                         __block_write_full_page(inode, page, get_block, wbc,
>                                                 handler);
>                           submit_bh(WRITE_SYNC)
>
> So, it turns out to be that some bios for the same dirty range
> can be submitted with REQ_WRITE|REQ_SYNC|REQ_NOIDLE and some of
> the bios only with REQ_WRITE.
> (according to the comment of __mpage_writepage:
>  * If all blocks are found to be contiguous then the page can go into the
>  * BIO.  Otherwise fall back to the mapping's writepage().
> )
>
> Also, it seems to me that all over the kernel WRITE_SYNC has meaning of:
> 1. try to get the block on-disk faster
> 2. if I have to do flush - mark my bio with WRITE_SYNC and wait for result
>
> My patch is an attempt to make some unification in case of fsync call.
>
> --
> Roman
>
>
> On Wed, Feb 19, 2014 at 8:59 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>> On Sun, 16 Feb 2014 11:54:28 +0900 Roman Pen <r.peniaev@gmail.com> wrote:
>>
>>> In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity write,
>>> thus mark request as WRITE_SYNC.
>>
>> gargh, the documentation for this stuff is useless.
>>
>> What do REQ_SYNC and REQ_NOIDLE actually *do*?
>>
>> For mpage writes, REQ_NOIDLE appears to be incorrect - we very much
>> expect that there will be more writes and that they will be contiguous
>> with this one.  But we won't be waiting on this write before submitting
>> more writes, so perhaps REQ_NOIDLE is at least harmless.
>>
>> I dunno about REQ_SYNC - it requires delving into the bowels of CFQ
>> and we shouldn't need to do that.
>>
>> Jens.  Help.  How is a poor kernel reader supposed to work this out?
>>
>>> --- a/fs/mpage.c
>>> +++ b/fs/mpage.c
>>> @@ -462,6 +462,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
>>>       struct buffer_head map_bh;
>>>       loff_t i_size = i_size_read(inode);
>>>       int ret = 0;
>>> +     int wr = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
>>>
>>>       if (page_has_buffers(page)) {
>>>               struct buffer_head *head = page_buffers(page);
>>> @@ -570,7 +571,7 @@ page_is_mapped:
>>>        * This page will go to BIO.  Do we need to send this BIO off first?
>>>        */
>>>       if (bio && mpd->last_block_in_bio != blocks[0] - 1)
>>> -             bio = mpage_bio_submit(WRITE, bio);
>>> +             bio = mpage_bio_submit(wr, bio);
>>>
>>>  alloc_new:
>>>       if (bio == NULL) {
>>> @@ -587,7 +588,7 @@ alloc_new:
>>>        */
>>>       length = first_unmapped << blkbits;
>>>       if (bio_add_page(bio, page, length, 0) < length) {
>>> -             bio = mpage_bio_submit(WRITE, bio);
>>> +             bio = mpage_bio_submit(wr, bio);
>>>               goto alloc_new;
>>>       }
>>>
>>> @@ -620,7 +621,7 @@ alloc_new:
>>>       set_page_writeback(page);
>>>       unlock_page(page);
>>>       if (boundary || (first_unmapped != blocks_per_page)) {
>>> -             bio = mpage_bio_submit(WRITE, bio);
>>> +             bio = mpage_bio_submit(wr, bio);
>>>               if (boundary_block) {
>>>                       write_boundary_block(boundary_bdev,
>>>                                       boundary_block, 1 << blkbits);
>>> @@ -632,7 +633,7 @@ alloc_new:
>>>
>>>  confused:
>>>       if (bio)
>>> -             bio = mpage_bio_submit(WRITE, bio);
>>> +             bio = mpage_bio_submit(wr, bio);
>>>
>>>       if (mpd->use_writepage) {
>>>               ret = mapping->a_ops->writepage(page, wbc);
>>> @@ -688,8 +689,11 @@ mpage_writepages(struct address_space *mapping,
>>>               };
>>>
>>>               ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd);
>>> -             if (mpd.bio)
>>> -                     mpage_bio_submit(WRITE, mpd.bio);
>>> +             if (mpd.bio) {
>>> +                     int wr = (wbc->sync_mode == WB_SYNC_ALL ?
>>> +                               WRITE_SYNC : WRITE);
>>> +                     mpage_bio_submit(wr, mpd.bio);
>>> +             }
>>>       }
>>>       blk_finish_plug(&plug);
>>>       return ret;
>>> @@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, get_block_t get_block,
>>>               .use_writepage = 0,
>>>       };
>>>       int ret = __mpage_writepage(page, wbc, &mpd);
>>> -     if (mpd.bio)
>>> -             mpage_bio_submit(WRITE, mpd.bio);
>>> +     if (mpd.bio) {
>>> +             int wr = (wbc->sync_mode == WB_SYNC_ALL ?
>>> +                       WRITE_SYNC : WRITE);
>>> +             mpage_bio_submit(wr, mpd.bio);
>>> +     }
>>>       return ret;
>>>  }
>>>  EXPORT_SYMBOL(mpage_writepage);
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-12 14:29     ` Roman Peniaev
@ 2014-03-13 20:01       ` Jan Kara
  2014-03-13 21:34         ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2014-03-13 20:01 UTC (permalink / raw)
  To: Roman Peniaev
  Cc: Andrew Morton, Alexander Viro, linux-fsdevel, linux-kernel,
	Jens Axboe

  Hello,

On Wed 12-03-14 23:29:04, Roman Peniaev wrote:
> could you please explain the real purpose of WAIT_SYNC?
> In case of wbc->sync_mode == WB_SYNC_ALL.
> Because my current understanding is if writeback control has
> WB_SYNC_ALL everything
> should be submitted with WAIT_SYNC.
  So AFAIK the idea is that REQ_SYNC flag should indicate the IO is
synchronous - i.e., someone is waiting for it to complete. This is opposed
to asynchronous writeback done by flusher threads where noone waits for
particular write to complete. Subsequently, IO scheduler is expected (but
not required to - only CFQ honors REQ_SYNC AFAIK) to treat sync requests
with higher priority than async onces.

When to set REQ_SYNC is not an obvious question. If we set it for too much
IO, it has no effect. If we don't set it for some IO we risk that someone
waiting for that IO to complete will be starved by others setting REQ_SYNC.

So all in all I think that using WRITE_SYNC iff we are doing WB_SYNC_ALL
writeback is a reasonable choice.

								Honza

> On Wed, Feb 19, 2014 at 10:38 AM, Roman Peniaev <r.peniaev@gmail.com> wrote:
> > (my previous email was rejected by vger.kernel.org because google web
> > sent it as html.
> >  will resend the same one in plain text mode)
> >
> >> What do REQ_SYNC and REQ_NOIDLE actually *do*?
> >
> > Yep, this REQ_SYNC is very confusing to me.
> > First of all according to the sources of old school block buffer filesystems
> > (e.g. ext2) we can get this stack in case of fsync call:
> >
> >      __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode)
> >       do_writepages(mapping, &wbc)
> >          mapping->a_ops->writepages(page, wbc)
> >          (ext2_writepages)
> >             mpage_writepages(mapping, wbc, fat_get_block);
> >               write_cache_pages(mapping, wbc, __mpage_writepage, &mpd)
> >                 __mpage_writepage(page, wbc, data)
> >>>>>>             mpage_bio_submit(WRITE, bio) >>>> why WRITE? not WRITE_SYNC in case of WB_SYNC_ALL?
> >                     <or in case of not contiguous buffers>
> >                   mapping->a_ops->writepage(page, wbc)
> >                   (ext2_writepage)
> >                     block_write_full_page(page, fat_get_block, wbc)
> >                       block_write_full_page_endio(page, get_block, wbc,
> >                                                   end_buffer_async_write)
> >                         __block_write_full_page(inode, page, get_block, wbc,
> >                                                 handler);
> >                           submit_bh(WRITE_SYNC)
> >
> > So, it turns out to be that some bios for the same dirty range
> > can be submitted with REQ_WRITE|REQ_SYNC|REQ_NOIDLE and some of
> > the bios only with REQ_WRITE.
> > (according to the comment of __mpage_writepage:
> >  * If all blocks are found to be contiguous then the page can go into the
> >  * BIO.  Otherwise fall back to the mapping's writepage().
> > )
> >
> > Also, it seems to me that all over the kernel WRITE_SYNC has meaning of:
> > 1. try to get the block on-disk faster
> > 2. if I have to do flush - mark my bio with WRITE_SYNC and wait for result
> >
> > My patch is an attempt to make some unification in case of fsync call.
> >
> > --
> > Roman
> >
> >
> > On Wed, Feb 19, 2014 at 8:59 AM, Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> >> On Sun, 16 Feb 2014 11:54:28 +0900 Roman Pen <r.peniaev@gmail.com> wrote:
> >>
> >>> In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity write,
> >>> thus mark request as WRITE_SYNC.
> >>
> >> gargh, the documentation for this stuff is useless.
> >>
> >> What do REQ_SYNC and REQ_NOIDLE actually *do*?
> >>
> >> For mpage writes, REQ_NOIDLE appears to be incorrect - we very much
> >> expect that there will be more writes and that they will be contiguous
> >> with this one.  But we won't be waiting on this write before submitting
> >> more writes, so perhaps REQ_NOIDLE is at least harmless.
> >>
> >> I dunno about REQ_SYNC - it requires delving into the bowels of CFQ
> >> and we shouldn't need to do that.
> >>
> >> Jens.  Help.  How is a poor kernel reader supposed to work this out?
> >>
> >>> --- a/fs/mpage.c
> >>> +++ b/fs/mpage.c
> >>> @@ -462,6 +462,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
> >>>       struct buffer_head map_bh;
> >>>       loff_t i_size = i_size_read(inode);
> >>>       int ret = 0;
> >>> +     int wr = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
> >>>
> >>>       if (page_has_buffers(page)) {
> >>>               struct buffer_head *head = page_buffers(page);
> >>> @@ -570,7 +571,7 @@ page_is_mapped:
> >>>        * This page will go to BIO.  Do we need to send this BIO off first?
> >>>        */
> >>>       if (bio && mpd->last_block_in_bio != blocks[0] - 1)
> >>> -             bio = mpage_bio_submit(WRITE, bio);
> >>> +             bio = mpage_bio_submit(wr, bio);
> >>>
> >>>  alloc_new:
> >>>       if (bio == NULL) {
> >>> @@ -587,7 +588,7 @@ alloc_new:
> >>>        */
> >>>       length = first_unmapped << blkbits;
> >>>       if (bio_add_page(bio, page, length, 0) < length) {
> >>> -             bio = mpage_bio_submit(WRITE, bio);
> >>> +             bio = mpage_bio_submit(wr, bio);
> >>>               goto alloc_new;
> >>>       }
> >>>
> >>> @@ -620,7 +621,7 @@ alloc_new:
> >>>       set_page_writeback(page);
> >>>       unlock_page(page);
> >>>       if (boundary || (first_unmapped != blocks_per_page)) {
> >>> -             bio = mpage_bio_submit(WRITE, bio);
> >>> +             bio = mpage_bio_submit(wr, bio);
> >>>               if (boundary_block) {
> >>>                       write_boundary_block(boundary_bdev,
> >>>                                       boundary_block, 1 << blkbits);
> >>> @@ -632,7 +633,7 @@ alloc_new:
> >>>
> >>>  confused:
> >>>       if (bio)
> >>> -             bio = mpage_bio_submit(WRITE, bio);
> >>> +             bio = mpage_bio_submit(wr, bio);
> >>>
> >>>       if (mpd->use_writepage) {
> >>>               ret = mapping->a_ops->writepage(page, wbc);
> >>> @@ -688,8 +689,11 @@ mpage_writepages(struct address_space *mapping,
> >>>               };
> >>>
> >>>               ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd);
> >>> -             if (mpd.bio)
> >>> -                     mpage_bio_submit(WRITE, mpd.bio);
> >>> +             if (mpd.bio) {
> >>> +                     int wr = (wbc->sync_mode == WB_SYNC_ALL ?
> >>> +                               WRITE_SYNC : WRITE);
> >>> +                     mpage_bio_submit(wr, mpd.bio);
> >>> +             }
> >>>       }
> >>>       blk_finish_plug(&plug);
> >>>       return ret;
> >>> @@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, get_block_t get_block,
> >>>               .use_writepage = 0,
> >>>       };
> >>>       int ret = __mpage_writepage(page, wbc, &mpd);
> >>> -     if (mpd.bio)
> >>> -             mpage_bio_submit(WRITE, mpd.bio);
> >>> +     if (mpd.bio) {
> >>> +             int wr = (wbc->sync_mode == WB_SYNC_ALL ?
> >>> +                       WRITE_SYNC : WRITE);
> >>> +             mpage_bio_submit(wr, mpd.bio);
> >>> +     }
> >>>       return ret;
> >>>  }
> >>>  EXPORT_SYMBOL(mpage_writepage);
> >>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-02-16  2:54 [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write Roman Pen
  2014-02-18 23:59 ` Andrew Morton
@ 2014-03-13 20:21 ` Jan Kara
  1 sibling, 0 replies; 20+ messages in thread
From: Jan Kara @ 2014-03-13 20:21 UTC (permalink / raw)
  To: Roman Pen
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, Andrew Morton,
	Jens Axboe

On Sun 16-02-14 11:54:28, Roman Pen wrote:
> In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity write,
> thus mark request as WRITE_SYNC.
  The patch looks good to me and also makes generic mpage code more in sync
with what e.g. XFS or ext4 do in their writepage functions. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> Signed-off-by: Roman Pen <r.peniaev@gmail.com>
> CC: Alexander Viro <viro@zeniv.linux.org.uk>
> CC: linux-fsdevel@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>  fs/mpage.c | 23 +++++++++++++++--------
>  1 file changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 4979ffa..4e0af5a 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -462,6 +462,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
>  	struct buffer_head map_bh;
>  	loff_t i_size = i_size_read(inode);
>  	int ret = 0;
> +	int wr = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
>  
>  	if (page_has_buffers(page)) {
>  		struct buffer_head *head = page_buffers(page);
> @@ -570,7 +571,7 @@ page_is_mapped:
>  	 * This page will go to BIO.  Do we need to send this BIO off first?
>  	 */
>  	if (bio && mpd->last_block_in_bio != blocks[0] - 1)
> -		bio = mpage_bio_submit(WRITE, bio);
> +		bio = mpage_bio_submit(wr, bio);
>  
>  alloc_new:
>  	if (bio == NULL) {
> @@ -587,7 +588,7 @@ alloc_new:
>  	 */
>  	length = first_unmapped << blkbits;
>  	if (bio_add_page(bio, page, length, 0) < length) {
> -		bio = mpage_bio_submit(WRITE, bio);
> +		bio = mpage_bio_submit(wr, bio);
>  		goto alloc_new;
>  	}
>  
> @@ -620,7 +621,7 @@ alloc_new:
>  	set_page_writeback(page);
>  	unlock_page(page);
>  	if (boundary || (first_unmapped != blocks_per_page)) {
> -		bio = mpage_bio_submit(WRITE, bio);
> +		bio = mpage_bio_submit(wr, bio);
>  		if (boundary_block) {
>  			write_boundary_block(boundary_bdev,
>  					boundary_block, 1 << blkbits);
> @@ -632,7 +633,7 @@ alloc_new:
>  
>  confused:
>  	if (bio)
> -		bio = mpage_bio_submit(WRITE, bio);
> +		bio = mpage_bio_submit(wr, bio);
>  
>  	if (mpd->use_writepage) {
>  		ret = mapping->a_ops->writepage(page, wbc);
> @@ -688,8 +689,11 @@ mpage_writepages(struct address_space *mapping,
>  		};
>  
>  		ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd);
> -		if (mpd.bio)
> -			mpage_bio_submit(WRITE, mpd.bio);
> +		if (mpd.bio) {
> +			int wr = (wbc->sync_mode == WB_SYNC_ALL ?
> +				  WRITE_SYNC : WRITE);
> +			mpage_bio_submit(wr, mpd.bio);
> +		}
>  	}
>  	blk_finish_plug(&plug);
>  	return ret;
> @@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, get_block_t get_block,
>  		.use_writepage = 0,
>  	};
>  	int ret = __mpage_writepage(page, wbc, &mpd);
> -	if (mpd.bio)
> -		mpage_bio_submit(WRITE, mpd.bio);
> +	if (mpd.bio) {
> +		int wr = (wbc->sync_mode == WB_SYNC_ALL ?
> +			  WRITE_SYNC : WRITE);
> +		mpage_bio_submit(wr, mpd.bio);
> +	}
>  	return ret;
>  }
>  EXPORT_SYMBOL(mpage_writepage);
> -- 
> 1.8.5.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-13 20:01       ` Jan Kara
@ 2014-03-13 21:34         ` Andrew Morton
  2014-03-14 13:07           ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2014-03-13 21:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Roman Peniaev, Alexander Viro, linux-fsdevel, linux-kernel,
	Jens Axboe, Tejun Heo

On Thu, 13 Mar 2014 21:01:19 +0100 Jan Kara <jack@suse.cz> wrote:

>   Hello,
> 
> On Wed 12-03-14 23:29:04, Roman Peniaev wrote:
> > could you please explain the real purpose of WAIT_SYNC?
> > In case of wbc->sync_mode == WB_SYNC_ALL.
> > Because my current understanding is if writeback control has
> > WB_SYNC_ALL everything
> > should be submitted with WAIT_SYNC.
>   So AFAIK the idea is that REQ_SYNC flag should indicate the IO is
> synchronous - i.e., someone is waiting for it to complete. This is opposed
> to asynchronous writeback done by flusher threads where noone waits for
> particular write to complete. Subsequently, IO scheduler is expected (but
> not required to - only CFQ honors REQ_SYNC AFAIK) to treat sync requests
> with higher priority than async onces.
> 
> When to set REQ_SYNC is not an obvious question. If we set it for too much
> IO, it has no effect. If we don't set it for some IO we risk that someone
> waiting for that IO to complete will be starved by others setting REQ_SYNC.
> 
> So all in all I think that using WRITE_SYNC iff we are doing WB_SYNC_ALL
> writeback is a reasonable choice.

I added this to the changelog:

: akpm: afaict this change will cause the data integrity write bios to be
: placed onto the second queue in cfq_io_cq.cfqq[], which presumably results
: in special treatment.  The documentation for REQ_SYNC is horrid.

Which is pretty pathetic.

Jens isn't talking to us.  Tejun, are you able explain REQ_SYNC?

I just don't know about this patch.  It will presumably have some
effect on data-integrity writes.  But is that a good effect or a bad
one?  Will it result in a sys_sync() starving out other IO for ages in
an undesirable fashion?  It might well do!  Will it prioritize those
writes, resulting in overall less efficient IO patterns?  It might well
do!

I don't see how we can proceed without understanding the tradeoffs and
deciding that the overall effect is a desirable one.




From: Roman Pen <r.peniaev@gmail.com>
Subject: fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write

In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity
write, thus mark request as WRITE_SYNC.

akpm: afaict this change will cause the data integrity write bios to be
placed onto the second queue in cfq_io_cq.cfqq[], which presumably results
in special treatment.  The documentation for REQ_SYNC is horrid.

Signed-off-by: Roman Pen <r.peniaev@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/mpage.c |   23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff -puN fs/mpage.c~fs-mpagec-forgotten-write_sync-in-case-of-data-integrity-write fs/mpage.c
--- a/fs/mpage.c~fs-mpagec-forgotten-write_sync-in-case-of-data-integrity-write
+++ a/fs/mpage.c
@@ -462,6 +462,7 @@ static int __mpage_writepage(struct page
 	struct buffer_head map_bh;
 	loff_t i_size = i_size_read(inode);
 	int ret = 0;
+	int wr = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
 
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -570,7 +571,7 @@ page_is_mapped:
 	 * This page will go to BIO.  Do we need to send this BIO off first?
 	 */
 	if (bio && mpd->last_block_in_bio != blocks[0] - 1)
-		bio = mpage_bio_submit(WRITE, bio);
+		bio = mpage_bio_submit(wr, bio);
 
 alloc_new:
 	if (bio == NULL) {
@@ -587,7 +588,7 @@ alloc_new:
 	 */
 	length = first_unmapped << blkbits;
 	if (bio_add_page(bio, page, length, 0) < length) {
-		bio = mpage_bio_submit(WRITE, bio);
+		bio = mpage_bio_submit(wr, bio);
 		goto alloc_new;
 	}
 
@@ -620,7 +621,7 @@ alloc_new:
 	set_page_writeback(page);
 	unlock_page(page);
 	if (boundary || (first_unmapped != blocks_per_page)) {
-		bio = mpage_bio_submit(WRITE, bio);
+		bio = mpage_bio_submit(wr, bio);
 		if (boundary_block) {
 			write_boundary_block(boundary_bdev,
 					boundary_block, 1 << blkbits);
@@ -632,7 +633,7 @@ alloc_new:
 
 confused:
 	if (bio)
-		bio = mpage_bio_submit(WRITE, bio);
+		bio = mpage_bio_submit(wr, bio);
 
 	if (mpd->use_writepage) {
 		ret = mapping->a_ops->writepage(page, wbc);
@@ -688,8 +689,11 @@ mpage_writepages(struct address_space *m
 		};
 
 		ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd);
-		if (mpd.bio)
-			mpage_bio_submit(WRITE, mpd.bio);
+		if (mpd.bio) {
+			int wr = (wbc->sync_mode == WB_SYNC_ALL ?
+				  WRITE_SYNC : WRITE);
+			mpage_bio_submit(wr, mpd.bio);
+		}
 	}
 	blk_finish_plug(&plug);
 	return ret;
@@ -706,8 +710,11 @@ int mpage_writepage(struct page *page, g
 		.use_writepage = 0,
 	};
 	int ret = __mpage_writepage(page, wbc, &mpd);
-	if (mpd.bio)
-		mpage_bio_submit(WRITE, mpd.bio);
+	if (mpd.bio) {
+		int wr = (wbc->sync_mode == WB_SYNC_ALL ?
+			  WRITE_SYNC : WRITE);
+		mpage_bio_submit(wr, mpd.bio);
+	}
 	return ret;
 }
 EXPORT_SYMBOL(mpage_writepage);
_

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-13 21:34         ` Andrew Morton
@ 2014-03-14 13:07           ` Tejun Heo
  2014-03-14 14:07             ` Roman Peniaev
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2014-03-14 13:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Roman Peniaev, Alexander Viro, linux-fsdevel,
	linux-kernel, Jens Axboe

Hello, Andrew.

On Thu, Mar 13, 2014 at 02:34:56PM -0700, Andrew Morton wrote:
> Jens isn't talking to us.  Tejun, are you able explain REQ_SYNC?

It has nothing to do with data integrity.  It's just a hint telling
the block layer that someone is waiting for the IO so it'd be a good
idea to prioritize it.  For example, nothing visible to userland
really waits for periodic writebacks, so we can delay their processing
to prioritize, for example, READs triggered from a page fault, which
is obviously causing userland visible latency.

Block layer treats all READs as REQ_SYNC and also allows upper layers
to mark some writes REQ_SYNC for cases where somebody is waiting for
the write to complete for cases like flush(2).

> From: Roman Pen <r.peniaev@gmail.com>
> Subject: fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
> 
> In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity
> write, thus mark request as WRITE_SYNC.

So, at least this patch description is very misleading.  WRITE_SYNC
has *NOTHING* to do with data integrity.  The only thing matters is
whether somebody is waiting for its completion or not.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-14 13:07           ` Tejun Heo
@ 2014-03-14 14:07             ` Roman Peniaev
  2014-03-14 14:11               ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Roman Peniaev @ 2014-03-14 14:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Jan Kara, Alexander Viro, linux-fsdevel,
	linux-kernel, Jens Axboe

Hello, Tejun.

On Fri, Mar 14, 2014 at 10:07 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Andrew.
>
> On Thu, Mar 13, 2014 at 02:34:56PM -0700, Andrew Morton wrote:
>> Jens isn't talking to us.  Tejun, are you able explain REQ_SYNC?
>
> It has nothing to do with data integrity.  It's just a hint telling
> the block layer that someone is waiting for the IO so it'd be a good
> idea to prioritize it.  For example, nothing visible to userland
> really waits for periodic writebacks, so we can delay their processing
> to prioritize, for example, READs triggered from a page fault, which
> is obviously causing userland visible latency.
>
> Block layer treats all READs as REQ_SYNC and also allows upper layers
> to mark some writes REQ_SYNC for cases where somebody is waiting for
> the write to complete for cases like flush(2).
>
>> From: Roman Pen <r.peniaev@gmail.com>
>> Subject: fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
>>
>> In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity
>> write, thus mark request as WRITE_SYNC.
>
> So, at least this patch description is very misleading.  WRITE_SYNC
> has *NOTHING* to do with data integrity.  The only thing matters is
> whether somebody is waiting for its completion or not.

The thing is that I started investigating WB_SYNC_ALL and WRITE_SYNC
from the __filemap_fdatawrite_range call which has explicit comment:

 * If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
 * opposed to a regular memory cleansing writeback.

So, in the commit msg I want to say that when we are doing integrity write
(sync, fsync calls) we have to mark requests as WRITE_SYNC, and there
is only one path
(mpage writeback path)  where WRITE_SYNC is missed. That was the idea.
(not WRITE_SYNC is responsible for integrity write, but in case of
integrity write)

Seems the following message should be better:
When data inegrity operation (sync, fsync, fdatasync calls) happens
writeback control is set to WB_SYNC_ALL.
In that case all write requests are marked with WRITE_SYNC, but on
mpage writeback path
WRITE_SYNC is missed. This patch fixes this.

Is it ok, what do you think?

Also, could you please help me do understand how can I guarantee
integrity in case of block device with big volatile
cache and filesystem, which does not support REQ_FLUSH/FUA?

Even if I am doing fsync, block device will never receive any
indication, that these requests should really be stored on device,
or cache should be flushed right now (like REQ_FLUSH/FUA behaviour).

Seems it is impossible to do explicit block device cache flush if
filesystem does not care about it.

Thanks.

--
Roman

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-14 14:07             ` Roman Peniaev
@ 2014-03-14 14:11               ` Tejun Heo
  2014-03-14 14:15                 ` Jan Kara
                                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Tejun Heo @ 2014-03-14 14:11 UTC (permalink / raw)
  To: Roman Peniaev
  Cc: Andrew Morton, Jan Kara, Alexander Viro, linux-fsdevel,
	linux-kernel, Jens Axboe

Hello,

On Fri, Mar 14, 2014 at 11:07:04PM +0900, Roman Peniaev wrote:
> Seems the following message should be better:
> When data inegrity operation (sync, fsync, fdatasync calls) happens
> writeback control is set to WB_SYNC_ALL.
> In that case all write requests are marked with WRITE_SYNC, but on
> mpage writeback path
> WRITE_SYNC is missed. This patch fixes this.
> 
> Is it ok, what do you think?

I think the description should make it clear that WRITE_SYNC is about
latency, not about integrity and we probably should add comments
explaining why we're using WRITE_SYNC for WB_SYNC_ALL (because there
probably is someone waiting).

> Also, could you please help me do understand how can I guarantee
> integrity in case of block device with big volatile
> cache and filesystem, which does not support REQ_FLUSH/FUA?

If a device has a volatile cache but doesn't support flush, it can't
guarantee integrity.  There's no way for its user to determine or
force whether certain data is on non-volatile media.  It's an
inherently broken device.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-14 14:11               ` Tejun Heo
@ 2014-03-14 14:15                 ` Jan Kara
  2014-03-14 14:23                   ` Roman Peniaev
  2014-03-14 14:17                 ` Roman Peniaev
  2014-03-14 15:36                 ` Roman Peniaev
  2 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2014-03-14 14:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Roman Peniaev, Andrew Morton, Jan Kara, Alexander Viro,
	linux-fsdevel, linux-kernel, Jens Axboe

On Fri 14-03-14 10:11:43, Tejun Heo wrote:
> > Also, could you please help me do understand how can I guarantee
> > integrity in case of block device with big volatile
> > cache and filesystem, which does not support REQ_FLUSH/FUA?
> 
> If a device has a volatile cache but doesn't support flush, it can't
> guarantee integrity.  There's no way for its user to determine or
> force whether certain data is on non-volatile media.  It's an
> inherently broken device.
  I think his problem was that the device does support REQ_FLUSH/FUA but
the filesystem on top of it doesn't issue it properly. That's a filesystem
problem so fix the filesystem... :) Which one is it?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-14 14:11               ` Tejun Heo
  2014-03-14 14:15                 ` Jan Kara
@ 2014-03-14 14:17                 ` Roman Peniaev
  2014-03-14 14:20                   ` Tejun Heo
  2014-03-14 15:36                 ` Roman Peniaev
  2 siblings, 1 reply; 20+ messages in thread
From: Roman Peniaev @ 2014-03-14 14:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Jan Kara, Alexander Viro, linux-fsdevel,
	linux-kernel, Jens Axboe

On Fri, Mar 14, 2014 at 11:11 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Fri, Mar 14, 2014 at 11:07:04PM +0900, Roman Peniaev wrote:
>> Seems the following message should be better:
>> When data inegrity operation (sync, fsync, fdatasync calls) happens
>> writeback control is set to WB_SYNC_ALL.
>> In that case all write requests are marked with WRITE_SYNC, but on
>> mpage writeback path
>> WRITE_SYNC is missed. This patch fixes this.
>>
>> Is it ok, what do you think?
>
> I think the description should make it clear that WRITE_SYNC is about
> latency, not about integrity and we probably should add comments
> explaining why we're using WRITE_SYNC for WB_SYNC_ALL (because there
> probably is someone waiting).
>
>> Also, could you please help me do understand how can I guarantee
>> integrity in case of block device with big volatile
>> cache and filesystem, which does not support REQ_FLUSH/FUA?
>
> If a device has a volatile cache but doesn't support flush, it can't
> guarantee integrity.  There's no way for its user to determine or
> force whether certain data is on non-volatile media.  It's an
> inherently broken device.

No, no. Not device does not support flush, filesystem does not care about flush.
(take any old school, e.g. ext2)

We did some write, and then we did fsync.
But filesystem does not support REQ_FLUSH/FUA so block device will never
get the checkpoint where it should really flush everything.

So, fsync will not guarantee any integrity in that case.


>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-14 14:17                 ` Roman Peniaev
@ 2014-03-14 14:20                   ` Tejun Heo
  2014-03-14 14:29                     ` Roman Peniaev
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2014-03-14 14:20 UTC (permalink / raw)
  To: Roman Peniaev
  Cc: Andrew Morton, Jan Kara, Alexander Viro, linux-fsdevel,
	linux-kernel, Jens Axboe

On Fri, Mar 14, 2014 at 11:17:56PM +0900, Roman Peniaev wrote:
> No, no. Not device does not support flush, filesystem does not care about flush.
> (take any old school, e.g. ext2)
> 
> We did some write, and then we did fsync.
> But filesystem does not support REQ_FLUSH/FUA so block device will never
> get the checkpoint where it should really flush everything.
> 
> So, fsync will not guarantee any integrity in that case.

Oh, no idea.  Fix the filesystem to support REQ_FLUSH?  :)

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-14 14:15                 ` Jan Kara
@ 2014-03-14 14:23                   ` Roman Peniaev
  2014-03-14 14:52                     ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread
From: Roman Peniaev @ 2014-03-14 14:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, Andrew Morton, Alexander Viro, linux-fsdevel,
	linux-kernel, Jens Axboe

On Fri, Mar 14, 2014 at 11:15 PM, Jan Kara <jack@suse.cz> wrote:
> On Fri 14-03-14 10:11:43, Tejun Heo wrote:
>> > Also, could you please help me do understand how can I guarantee
>> > integrity in case of block device with big volatile
>> > cache and filesystem, which does not support REQ_FLUSH/FUA?
>>
>> If a device has a volatile cache but doesn't support flush, it can't
>> guarantee integrity.  There's no way for its user to determine or
>> force whether certain data is on non-volatile media.  It's an
>> inherently broken device.
>   I think his problem was that the device does support REQ_FLUSH/FUA but
> the filesystem on top of it doesn't issue it properly. That's a filesystem
> problem so fix the filesystem... :) Which one is it?

take any old school, e.g. ext2 or even better: fat :)

--
Roman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-14 14:20                   ` Tejun Heo
@ 2014-03-14 14:29                     ` Roman Peniaev
  0 siblings, 0 replies; 20+ messages in thread
From: Roman Peniaev @ 2014-03-14 14:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Jan Kara, Alexander Viro, linux-fsdevel,
	linux-kernel, Jens Axboe

On Fri, Mar 14, 2014 at 11:20 PM, Tejun Heo <tj@kernel.org> wrote:
> On Fri, Mar 14, 2014 at 11:17:56PM +0900, Roman Peniaev wrote:
>> No, no. Not device does not support flush, filesystem does not care about flush.
>> (take any old school, e.g. ext2)
>>
>> We did some write, and then we did fsync.
>> But filesystem does not support REQ_FLUSH/FUA so block device will never
>> get the checkpoint where it should really flush everything.
>>
>> So, fsync will not guarantee any integrity in that case.
>
> Oh, no idea.  Fix the filesystem to support REQ_FLUSH?  :)

Yep, best variant. But I was thinking that there should be some guarantees
from fsync (and friends) calls, which will issue flush after completion of every
writeback request. But no way

>
> --
> tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-14 14:23                   ` Roman Peniaev
@ 2014-03-14 14:52                     ` Jan Kara
  2014-03-14 14:54                       ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2014-03-14 14:52 UTC (permalink / raw)
  To: Roman Peniaev
  Cc: Jan Kara, Tejun Heo, Andrew Morton, Alexander Viro, linux-fsdevel,
	linux-kernel, Jens Axboe

On Fri 14-03-14 23:23:45, Roman Peniaev wrote:
> On Fri, Mar 14, 2014 at 11:15 PM, Jan Kara <jack@suse.cz> wrote:
> > On Fri 14-03-14 10:11:43, Tejun Heo wrote:
> >> > Also, could you please help me do understand how can I guarantee
> >> > integrity in case of block device with big volatile
> >> > cache and filesystem, which does not support REQ_FLUSH/FUA?
> >>
> >> If a device has a volatile cache but doesn't support flush, it can't
> >> guarantee integrity.  There's no way for its user to determine or
> >> force whether certain data is on non-volatile media.  It's an
> >> inherently broken device.
> >   I think his problem was that the device does support REQ_FLUSH/FUA but
> > the filesystem on top of it doesn't issue it properly. That's a filesystem
> > problem so fix the filesystem... :) Which one is it?
> 
> take any old school, e.g. ext2 or even better: fat :)
  Well, for ext2, you can use ext4 kernel driver which takes care of
REQ_FLUSH properly. For fat, you'll need to fix the fs...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-14 14:52                     ` Jan Kara
@ 2014-03-14 14:54                       ` Tejun Heo
  2014-03-14 15:08                         ` Jan Kara
  2014-03-15  9:09                         ` Christoph Hellwig
  0 siblings, 2 replies; 20+ messages in thread
From: Tejun Heo @ 2014-03-14 14:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Roman Peniaev, Andrew Morton, Alexander Viro, linux-fsdevel,
	linux-kernel, Jens Axboe

On Fri, Mar 14, 2014 at 03:52:15PM +0100, Jan Kara wrote:
>   Well, for ext2, you can use ext4 kernel driver which takes care of
> REQ_FLUSH properly. For fat, you'll need to fix the fs...

This is a bit surprising tho.  Were we always like this?  We never had
even stupid "flush down everything and sync"?  Or is this something we
broke while morphing flush implementation several times in the past
years?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-14 14:54                       ` Tejun Heo
@ 2014-03-14 15:08                         ` Jan Kara
  2014-03-15  9:09                         ` Christoph Hellwig
  1 sibling, 0 replies; 20+ messages in thread
From: Jan Kara @ 2014-03-14 15:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Roman Peniaev, Andrew Morton, Alexander Viro,
	linux-fsdevel, linux-kernel, Jens Axboe

On Fri 14-03-14 10:54:30, Tejun Heo wrote:
> On Fri, Mar 14, 2014 at 03:52:15PM +0100, Jan Kara wrote:
> >   Well, for ext2, you can use ext4 kernel driver which takes care of
> > REQ_FLUSH properly. For fat, you'll need to fix the fs...
> 
> This is a bit surprising tho.  Were we always like this?  We never had
> even stupid "flush down everything and sync"?  Or is this something we
> broke while morphing flush implementation several times in the past
> years?
  It has been always like this. We could add some sending of REQ_FLUSH into
generic code - like generic_file_fsync() and some similar helper for
->sync_fs(). Just someone has to do it...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-14 14:11               ` Tejun Heo
  2014-03-14 14:15                 ` Jan Kara
  2014-03-14 14:17                 ` Roman Peniaev
@ 2014-03-14 15:36                 ` Roman Peniaev
  2 siblings, 0 replies; 20+ messages in thread
From: Roman Peniaev @ 2014-03-14 15:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Jan Kara, Alexander Viro, linux-fsdevel,
	linux-kernel, Jens Axboe

On Fri, Mar 14, 2014 at 11:11 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Fri, Mar 14, 2014 at 11:07:04PM +0900, Roman Peniaev wrote:
>> Seems the following message should be better:
>> When data inegrity operation (sync, fsync, fdatasync calls) happens
>> writeback control is set to WB_SYNC_ALL.
>> In that case all write requests are marked with WRITE_SYNC, but on
>> mpage writeback path
>> WRITE_SYNC is missed. This patch fixes this.
>>
>> Is it ok, what do you think?
>
> I think the description should make it clear that WRITE_SYNC is about
> latency, not about integrity and we probably should add comments
> explaining why we're using WRITE_SYNC for WB_SYNC_ALL (because there
> probably is someone waiting).

I sent v2 of the patch where I tried to be more explicit in details.
Please, check.

--
Roman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  2014-03-14 14:54                       ` Tejun Heo
  2014-03-14 15:08                         ` Jan Kara
@ 2014-03-15  9:09                         ` Christoph Hellwig
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2014-03-15  9:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Roman Peniaev, Andrew Morton, Alexander Viro,
	linux-fsdevel, linux-kernel, Jens Axboe

On Fri, Mar 14, 2014 at 10:54:30AM -0400, Tejun Heo wrote:
> This is a bit surprising tho.  Were we always like this?  We never had
> even stupid "flush down everything and sync"?  Or is this something we
> broke while morphing flush implementation several times in the past
> years?

It's something that never worked in Linux.  Until XFS went ahead and
enabled barriers by default no filesystem in Linux ever flushed the
volatile cache by default, and most couldn't (and still can't) even if
you asked them to.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2014-03-15  9:09 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-02-16  2:54 [PATCH 1/1] fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write Roman Pen
2014-02-18 23:59 ` Andrew Morton
2014-02-19  1:38   ` Roman Peniaev
2014-03-12 14:29     ` Roman Peniaev
2014-03-13 20:01       ` Jan Kara
2014-03-13 21:34         ` Andrew Morton
2014-03-14 13:07           ` Tejun Heo
2014-03-14 14:07             ` Roman Peniaev
2014-03-14 14:11               ` Tejun Heo
2014-03-14 14:15                 ` Jan Kara
2014-03-14 14:23                   ` Roman Peniaev
2014-03-14 14:52                     ` Jan Kara
2014-03-14 14:54                       ` Tejun Heo
2014-03-14 15:08                         ` Jan Kara
2014-03-15  9:09                         ` Christoph Hellwig
2014-03-14 14:17                 ` Roman Peniaev
2014-03-14 14:20                   ` Tejun Heo
2014-03-14 14:29                     ` Roman Peniaev
2014-03-14 15:36                 ` Roman Peniaev
2014-03-13 20:21 ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).