[PATCH] nilfs2: Fix race condition that causes file system corruption

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] nilfs2: Fix race condition that causes file system corruption
@ 2017-07-11  6:19 Andreas Rohner
       [not found] ` <20170711061907.19259-1-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Andreas Rohner @ 2017-07-11  6:19 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

There is a race condition between the function nilfs_dirty_inode() and
nilfs_set_file_dirty().

When a file is opened, nilfs_dirty_inode() is called to update the
access timestamp in the inode. It calls __nilfs_mark_inode_dirty() in a
separate transaction. __nilfs_mark_inode_dirty() caches the ifile
buffer_head in the i_bh field of the inode info structure and marks it
as dirty.

After some data was written to the file in another transaction, the
function nilfs_set_file_dirty() is called, which adds the inode to
the ns_dirty_files list.

Then the segment construction calls nilfs_segctor_collect_dirty_files(),
which goes through the ns_dirty_files list and checks the i_bh field. If
there is a cached buffer_head in i_bh it is not marked as dirty again.

Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
transactions, it is possible that a segment construction that
writes out the ifile occurs in-between the two. If this happens the
inode is not on the ns_dirty_files list, but its ifile block is still
marked as dirty and written out.

In the next segment construction, the data for the file is written out
and nilfs_bmap_propagate() updates the b-tree. Eventually the bmap root
is written into the i_bh block, which is not dirty, because it was
written out in another segment construction.

As a result the bmap update can be lost, which leads to file system
corruption. Either the virtual block address points to an unallocated
DAT block, or the DAT entry will be reused for something different.

The error can remain undetected for a long time. A typical error message
would be one of the "bad btree" errors or a warning that a DAT entry
could not be found.

This bug can be reproduced reliably by a simple benchmark that creates
and overwrites millions of 4k files.

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/segment.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 70ded52dc1dd..50e12956c737 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1958,8 +1958,6 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
 					  err, ii->vfs_inode.i_ino);
 				return err;
 			}
-			mark_buffer_dirty(ibh);
-			nilfs_mdt_mark_dirty(ifile);
 			spin_lock(&nilfs->ns_inode_lock);
 			if (likely(!ii->i_bh))
 				ii->i_bh = ibh;
@@ -1968,6 +1966,10 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
 			goto retry;
 		}

+		// Always redirty the buffer to avoid race condition
+		mark_buffer_dirty(ii->i_bh);
+		nilfs_mdt_mark_dirty(ifile);
+
 		clear_bit(NILFS_I_QUEUED, &ii->i_state);
 		set_bit(NILFS_I_BUSY, &ii->i_state);
 		list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
-- 
2.13.2

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
       [not found] ` <20170711061907.19259-1-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2017-07-12  0:08   ` Ryusuke Konishi
       [not found]     ` <CAKFNMonzhYBKkjotNiNY0x0DnWxzHYF8WvO7f7xjGwr-8a=aRQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Ryusuke Konishi @ 2017-07-12  0:08 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs

Hi,

2017-07-11 15:19 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>
> There is a race condition between the function nilfs_dirty_inode() and
> nilfs_set_file_dirty().
>
> When a file is opened, nilfs_dirty_inode() is called to update the
> access timestamp in the inode. It calls __nilfs_mark_inode_dirty() in a
> separate transaction. __nilfs_mark_inode_dirty() caches the ifile
> buffer_head in the i_bh field of the inode info structure and marks it
> as dirty.
>
> After some data was written to the file in another transaction, the
> function nilfs_set_file_dirty() is called, which adds the inode to
> the ns_dirty_files list.
>
> Then the segment construction calls nilfs_segctor_collect_dirty_files(),
> which goes through the ns_dirty_files list and checks the i_bh field. If
> there is a cached buffer_head in i_bh it is not marked as dirty again.
>
> Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
> transactions, it is possible that a segment construction that
> writes out the ifile occurs in-between the two. If this happens the
> inode is not on the ns_dirty_files list, but its ifile block is still
> marked as dirty and written out.
>
> In the next segment construction, the data for the file is written out
> and nilfs_bmap_propagate() updates the b-tree. Eventually the bmap root
> is written into the i_bh block, which is not dirty, because it was
> written out in another segment construction.
>
> As a result the bmap update can be lost, which leads to file system
> corruption. Either the virtual block address points to an unallocated
> DAT block, or the DAT entry will be reused for something different.
>
> The error can remain undetected for a long time. A typical error message
> would be one of the "bad btree" errors or a warning that a DAT entry
> could not be found.
>
> This bug can be reproduced reliably by a simple benchmark that creates
> and overwrites millions of 4k files.
>
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>

Will review, please wait for a while.

Regards,
Ryusuke Konishi

>
> ---
>  fs/nilfs2/segment.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index 70ded52dc1dd..50e12956c737 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1958,8 +1958,6 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>                                           err, ii->vfs_inode.i_ino);
>                                 return err;
>                         }
> -                       mark_buffer_dirty(ibh);
> -                       nilfs_mdt_mark_dirty(ifile);
>                         spin_lock(&nilfs->ns_inode_lock);
>                         if (likely(!ii->i_bh))
>                                 ii->i_bh = ibh;
> @@ -1968,6 +1966,10 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>                         goto retry;
>                 }
>
> +               // Always redirty the buffer to avoid race condition
> +               mark_buffer_dirty(ii->i_bh);
> +               nilfs_mdt_mark_dirty(ifile);
> +
>                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
>                 set_bit(NILFS_I_BUSY, &ii->i_state);
>                 list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
> --
> 2.13.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
       [not found]     ` <CAKFNMonzhYBKkjotNiNY0x0DnWxzHYF8WvO7f7xjGwr-8a=aRQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-07-13  6:44       ` Andreas Rohner
  2017-10-28 17:16         ` Ryusuke Konishi
  2017-10-24  3:18       ` Clemens Eisserer
  1 sibling, 1 reply; 17+ messages in thread
From: Andreas Rohner @ 2017-07-13  6:44 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi,

Am Wed, 12 Jul 2017 09:08:57 +0900
schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:

> Hi,
> 
> 2017-07-11 15:19 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
> >
> > There is a race condition between the function nilfs_dirty_inode()
> > and nilfs_set_file_dirty().
> >
> > When a file is opened, nilfs_dirty_inode() is called to update the
> > access timestamp in the inode. It calls __nilfs_mark_inode_dirty()
> > in a separate transaction. __nilfs_mark_inode_dirty() caches the
> > ifile buffer_head in the i_bh field of the inode info structure and
> > marks it as dirty.
> >
> > After some data was written to the file in another transaction, the
> > function nilfs_set_file_dirty() is called, which adds the inode to
> > the ns_dirty_files list.
> >
> > Then the segment construction calls
> > nilfs_segctor_collect_dirty_files(), which goes through the
> > ns_dirty_files list and checks the i_bh field. If there is a cached
> > buffer_head in i_bh it is not marked as dirty again.
> >
> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
> > transactions, it is possible that a segment construction that
> > writes out the ifile occurs in-between the two. If this happens the
> > inode is not on the ns_dirty_files list, but its ifile block is
> > still marked as dirty and written out.
> >
> > In the next segment construction, the data for the file is written
> > out and nilfs_bmap_propagate() updates the b-tree. Eventually the
> > bmap root is written into the i_bh block, which is not dirty,
> > because it was written out in another segment construction.
> >
> > As a result the bmap update can be lost, which leads to file system
> > corruption. Either the virtual block address points to an
> > unallocated DAT block, or the DAT entry will be reused for
> > something different.
> >
> > The error can remain undetected for a long time. A typical error
> > message would be one of the "bad btree" errors or a warning that a
> > DAT entry could not be found.
> >
> > This bug can be reproduced reliably by a simple benchmark that
> > creates and overwrites millions of 4k files.
> >
> > Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>  
> 
> Will review, please wait for a while.

Thank you, I have uploaded my benchmark on GitHub:

https://github.com/zeitgeist87/bugsimulator

The benchmark is very simple. It writes out 13107200 4k files in a two
layer directory tree, so that every directory contains about 1000 files.
Then it overwrites these files in an infinite loop.

You need quite aggressive cleaner settings for it to keep up with the
bugsimulator:

mc_nsegments_per_clean	16
nsegments_per_clean	16
max_clean_segments		30%
cleaning_interval	0.1
mc_cleaning_interval	0.1

If you run this benchmark on a fresh 100GB NILFS2 partition with the
above cleaner settings, the file system will be corrupt in about half an
hour. At least on my machine.

With the patch for the race condition, it can run indefinetly.

Regards,
Andreas

> Regards,
> Ryusuke Konishi
> 
> >
> > ---
> >  fs/nilfs2/segment.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> > index 70ded52dc1dd..50e12956c737 100644
> > --- a/fs/nilfs2/segment.c
> > +++ b/fs/nilfs2/segment.c
> > @@ -1958,8 +1958,6 @@ static int
> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci, err,
> > ii->vfs_inode.i_ino); return err;
> >                         }
> > -                       mark_buffer_dirty(ibh);
> > -                       nilfs_mdt_mark_dirty(ifile);
> >                         spin_lock(&nilfs->ns_inode_lock);
> >                         if (likely(!ii->i_bh))
> >                                 ii->i_bh = ibh;
> > @@ -1968,6 +1966,10 @@ static int
> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci, goto
> > retry; }
> >
> > +               // Always redirty the buffer to avoid race condition
> > +               mark_buffer_dirty(ii->i_bh);
> > +               nilfs_mdt_mark_dirty(ifile);
> > +
> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
> >                 list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
> > --
> > 2.13.2
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe
> > linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html  
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
       [not found]     ` <CAKFNMonzhYBKkjotNiNY0x0DnWxzHYF8WvO7f7xjGwr-8a=aRQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-07-13  6:44       ` Andreas Rohner
@ 2017-10-24  3:18       ` Clemens Eisserer
       [not found]         ` <CAFvQSYR6vXuyGzGjeogKdg1Xzg=4QDE4TM7r4G-UnZ7hLniDkw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 17+ messages in thread
From: Clemens Eisserer @ 2017-10-24  3:18 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi,

>> As a result the bmap update can be lost, which leads to file system
>> corruption.
> Will review, please wait for a while.

Any chance to get this fix integrated any time soon?
As far as I understand it, nilfs currently ships with a reproduceable
corruption bug, although a fix is available.

Best regards, Clemens
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
       [not found]         ` <CAFvQSYR6vXuyGzGjeogKdg1Xzg=4QDE4TM7r4G-UnZ7hLniDkw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-24 17:06           ` Viacheslav Dubeyko
       [not found]             ` <1508864769.501.5.camel-yeENwD64cLxBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Viacheslav Dubeyko @ 2017-10-24 17:06 UTC (permalink / raw)
  To: Clemens Eisserer, linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Tue, 2017-10-24 at 05:18 +0200, Clemens Eisserer wrote:
> Hi,
> 
> > 
> > > 
> > > As a result the bmap update can be lost, which leads to file
> > > system
> > > corruption.
> > Will review, please wait for a while.
> Any chance to get this fix integrated any time soon?
> As far as I understand it, nilfs currently ships with a reproduceable
> corruption bug, although a fix is available.


What the patch do you mean? It's pretty impossible to follow your
message.

Thanks,
Vyacheslav Dubeyko.

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
       [not found]             ` <1508864769.501.5.camel-yeENwD64cLxBDgjK7y7TUQ@public.gmane.org>
@ 2017-10-24 19:29               ` Clemens Eisserer
  0 siblings, 0 replies; 17+ messages in thread
From: Clemens Eisserer @ 2017-10-24 19:29 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

HI Vyacheslav,

>> > > As a result the bmap update can be lost, which leads to file
>> > > system
>> > > corruption.
>> > Will review, please wait for a while.
>> Any chance to get this fix integrated any time soon?
>> As far as I understand it, nilfs currently ships with a reproduceable
>> corruption bug, although a fix is available.
>
>
> What the patch do you mean? It's pretty impossible to follow your
> message.

Sorry for the confusion, I guess I am mollycoddled by gmail's thread view.

I was referring to a patch sent by Andreas Rohner on 11.7. which fixes
a race that can lead to filesystem corruption.

Best regards, Clemens
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
  2017-07-13  6:44       ` Andreas Rohner
@ 2017-10-28 17:16         ` Ryusuke Konishi
       [not found]           ` <CAKFNMomD0qE3Q-bSfH=6-A82LaEVh2fK8Asp6emgYMa9BstQKw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Ryusuke Konishi @ 2017-10-28 17:16 UTC (permalink / raw)
  To: Andreas Rohner, Clemens Eisserer; +Cc: linux-nilfs, Viacheslav Dubeyko

Hi,

Sorry for my late reponse.
I could reproduce a corruption issue with the bug simulator.

-----
[88805.039545] NILFS (sda1): segctord starting. Construction interval = 5 second
s, CP frequency < 30 seconds
[92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting data buff
er: ino=10846, cno=92, offset=2, blocknr=11425709, vblocknr=11227714
[92817.941936] NILFS (sda1): error -17 preparing GC: cannot read source blocks
-----

Now I'm testing the patch.   I will send it to upstream after it turned
out to suppress the issue.

The patch itself looks good to me though I guess there may be similar bugs.
What makes things complicated seems that inode block is not marked as dirty
just when "propagate" function carries a dirty state from a dirty
node/data block.

Anyway, I hasten to apply this patch.

Thanks,
Ryusuke Konishi


2017-07-13 15:44 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
> Hi,
>
> Am Wed, 12 Jul 2017 09:08:57 +0900
> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>
>> Hi,
>>
>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>> >
>> > There is a race condition between the function nilfs_dirty_inode()
>> > and nilfs_set_file_dirty().
>> >
>> > When a file is opened, nilfs_dirty_inode() is called to update the
>> > access timestamp in the inode. It calls __nilfs_mark_inode_dirty()
>> > in a separate transaction. __nilfs_mark_inode_dirty() caches the
>> > ifile buffer_head in the i_bh field of the inode info structure and
>> > marks it as dirty.
>> >
>> > After some data was written to the file in another transaction, the
>> > function nilfs_set_file_dirty() is called, which adds the inode to
>> > the ns_dirty_files list.
>> >
>> > Then the segment construction calls
>> > nilfs_segctor_collect_dirty_files(), which goes through the
>> > ns_dirty_files list and checks the i_bh field. If there is a cached
>> > buffer_head in i_bh it is not marked as dirty again.
>> >
>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
>> > transactions, it is possible that a segment construction that
>> > writes out the ifile occurs in-between the two. If this happens the
>> > inode is not on the ns_dirty_files list, but its ifile block is
>> > still marked as dirty and written out.
>> >
>> > In the next segment construction, the data for the file is written
>> > out and nilfs_bmap_propagate() updates the b-tree. Eventually the
>> > bmap root is written into the i_bh block, which is not dirty,
>> > because it was written out in another segment construction.
>> >
>> > As a result the bmap update can be lost, which leads to file system
>> > corruption. Either the virtual block address points to an
>> > unallocated DAT block, or the DAT entry will be reused for
>> > something different.
>> >
>> > The error can remain undetected for a long time. A typical error
>> > message would be one of the "bad btree" errors or a warning that a
>> > DAT entry could not be found.
>> >
>> > This bug can be reproduced reliably by a simple benchmark that
>> > creates and overwrites millions of 4k files.
>> >
>> > Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>>
>> Will review, please wait for a while.
>
> Thank you, I have uploaded my benchmark on GitHub:
>
> https://github.com/zeitgeist87/bugsimulator
>
> The benchmark is very simple. It writes out 13107200 4k files in a two
> layer directory tree, so that every directory contains about 1000 files.
> Then it overwrites these files in an infinite loop.
>
> You need quite aggressive cleaner settings for it to keep up with the
> bugsimulator:
>
> mc_nsegments_per_clean  16
> nsegments_per_clean     16
> max_clean_segments              30%
> cleaning_interval       0.1
> mc_cleaning_interval    0.1
>
> If you run this benchmark on a fresh 100GB NILFS2 partition with the
> above cleaner settings, the file system will be corrupt in about half an
> hour. At least on my machine.
>
> With the patch for the race condition, it can run indefinetly.
>
> Regards,
> Andreas
>
>> Regards,
>> Ryusuke Konishi
>>
>> >
>> > ---
>> >  fs/nilfs2/segment.c | 6 ++++--
>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> > index 70ded52dc1dd..50e12956c737 100644
>> > --- a/fs/nilfs2/segment.c
>> > +++ b/fs/nilfs2/segment.c
>> > @@ -1958,8 +1958,6 @@ static int
>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci, err,
>> > ii->vfs_inode.i_ino); return err;
>> >                         }
>> > -                       mark_buffer_dirty(ibh);
>> > -                       nilfs_mdt_mark_dirty(ifile);
>> >                         spin_lock(&nilfs->ns_inode_lock);
>> >                         if (likely(!ii->i_bh))
>> >                                 ii->i_bh = ibh;
>> > @@ -1968,6 +1966,10 @@ static int
>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci, goto
>> > retry; }
>> >
>> > +               // Always redirty the buffer to avoid race condition
>> > +               mark_buffer_dirty(ii->i_bh);
>> > +               nilfs_mdt_mark_dirty(ifile);
>> > +
>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
>> >                 list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
>> > --
>> > 2.13.2
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe
>> > linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe
>> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
       [not found]           ` <CAKFNMomD0qE3Q-bSfH=6-A82LaEVh2fK8Asp6emgYMa9BstQKw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-29  5:09             ` Ryusuke Konishi
       [not found]               ` <CAKFNMomWEWyPZ4jSvJLk9AdzuDxn1T4-gxsOpbgRoZ-=6jH1Tg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Ryusuke Konishi @ 2017-10-29  5:09 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs, Clemens Eisserer, Viacheslav Dubeyko

Unfortunately the patch didn't fix this issue:

[  612.614615] NILFS (sda1): segctord starting. Construction interval
= 5 seconds, CP frequency < 30 seconds
[ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting
data buffer: ino=1155, cno=11, offset=1, blocknr=1216036,
vblocknr=1198436
[ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read source blocks

Look like the bugsimulator hit other bug.

When this happened, the disk usage was 100%:

$ df /dev/sda1
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/sda1      104849404 99655676         0 100% /test

And, the error killed cleanerd.   However, I could cleanly unmount the
partition, and
I didn't see any errors after remount the partition.

As you commented, I have changed some parameters in /etc/nilfs_cleanerd.conf:
-------------------------------------------
mc_nsegments_per_clean  16
nsegments_per_clean     16
max_clean_segments              30%
cleaning_interval       0.1
mc_cleaning_interval    0.1
-------------------------------------------

I will look into this issue with bugsimulator.

Anyone can reproduce the original corruption issue in other way?

Regards,
Ryusuke Konishi


2017-10-29 2:16 GMT+09:00 Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> Hi,
>
> Sorry for my late reponse.
> I could reproduce a corruption issue with the bug simulator.
>
> -----
> [88805.039545] NILFS (sda1): segctord starting. Construction interval = 5 second
> s, CP frequency < 30 seconds
> [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting data buff
> er: ino=10846, cno=92, offset=2, blocknr=11425709, vblocknr=11227714
> [92817.941936] NILFS (sda1): error -17 preparing GC: cannot read source blocks
> -----
>
> Now I'm testing the patch.   I will send it to upstream after it turned
> out to suppress the issue.
>
> The patch itself looks good to me though I guess there may be similar bugs.
> What makes things complicated seems that inode block is not marked as dirty
> just when "propagate" function carries a dirty state from a dirty
> node/data block.
>
> Anyway, I hasten to apply this patch.
>
> Thanks,
> Ryusuke Konishi
>
>
> 2017-07-13 15:44 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>> Hi,
>>
>> Am Wed, 12 Jul 2017 09:08:57 +0900
>> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>>
>>> Hi,
>>>
>>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>>> >
>>> > There is a race condition between the function nilfs_dirty_inode()
>>> > and nilfs_set_file_dirty().
>>> >
>>> > When a file is opened, nilfs_dirty_inode() is called to update the
>>> > access timestamp in the inode. It calls __nilfs_mark_inode_dirty()
>>> > in a separate transaction. __nilfs_mark_inode_dirty() caches the
>>> > ifile buffer_head in the i_bh field of the inode info structure and
>>> > marks it as dirty.
>>> >
>>> > After some data was written to the file in another transaction, the
>>> > function nilfs_set_file_dirty() is called, which adds the inode to
>>> > the ns_dirty_files list.
>>> >
>>> > Then the segment construction calls
>>> > nilfs_segctor_collect_dirty_files(), which goes through the
>>> > ns_dirty_files list and checks the i_bh field. If there is a cached
>>> > buffer_head in i_bh it is not marked as dirty again.
>>> >
>>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
>>> > transactions, it is possible that a segment construction that
>>> > writes out the ifile occurs in-between the two. If this happens the
>>> > inode is not on the ns_dirty_files list, but its ifile block is
>>> > still marked as dirty and written out.
>>> >
>>> > In the next segment construction, the data for the file is written
>>> > out and nilfs_bmap_propagate() updates the b-tree. Eventually the
>>> > bmap root is written into the i_bh block, which is not dirty,
>>> > because it was written out in another segment construction.
>>> >
>>> > As a result the bmap update can be lost, which leads to file system
>>> > corruption. Either the virtual block address points to an
>>> > unallocated DAT block, or the DAT entry will be reused for
>>> > something different.
>>> >
>>> > The error can remain undetected for a long time. A typical error
>>> > message would be one of the "bad btree" errors or a warning that a
>>> > DAT entry could not be found.
>>> >
>>> > This bug can be reproduced reliably by a simple benchmark that
>>> > creates and overwrites millions of 4k files.
>>> >
>>> > Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>>>
>>> Will review, please wait for a while.
>>
>> Thank you, I have uploaded my benchmark on GitHub:
>>
>> https://github.com/zeitgeist87/bugsimulator
>>
>> The benchmark is very simple. It writes out 13107200 4k files in a two
>> layer directory tree, so that every directory contains about 1000 files.
>> Then it overwrites these files in an infinite loop.
>>
>> You need quite aggressive cleaner settings for it to keep up with the
>> bugsimulator:
>>
>> mc_nsegments_per_clean  16
>> nsegments_per_clean     16
>> max_clean_segments              30%
>> cleaning_interval       0.1
>> mc_cleaning_interval    0.1
>>
>> If you run this benchmark on a fresh 100GB NILFS2 partition with the
>> above cleaner settings, the file system will be corrupt in about half an
>> hour. At least on my machine.
>>
>> With the patch for the race condition, it can run indefinetly.
>>
>> Regards,
>> Andreas
>>
>>> Regards,
>>> Ryusuke Konishi
>>>
>>> >
>>> > ---
>>> >  fs/nilfs2/segment.c | 6 ++++--
>>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>>> >
>>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>>> > index 70ded52dc1dd..50e12956c737 100644
>>> > --- a/fs/nilfs2/segment.c
>>> > +++ b/fs/nilfs2/segment.c
>>> > @@ -1958,8 +1958,6 @@ static int
>>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci, err,
>>> > ii->vfs_inode.i_ino); return err;
>>> >                         }
>>> > -                       mark_buffer_dirty(ibh);
>>> > -                       nilfs_mdt_mark_dirty(ifile);
>>> >                         spin_lock(&nilfs->ns_inode_lock);
>>> >                         if (likely(!ii->i_bh))
>>> >                                 ii->i_bh = ibh;
>>> > @@ -1968,6 +1966,10 @@ static int
>>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci, goto
>>> > retry; }
>>> >
>>> > +               // Always redirty the buffer to avoid race condition
>>> > +               mark_buffer_dirty(ii->i_bh);
>>> > +               nilfs_mdt_mark_dirty(ifile);
>>> > +
>>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
>>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
>>> >                 list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
>>> > --
>>> > 2.13.2
>>> >
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe
>>> > linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe
>>> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
       [not found]               ` <CAKFNMomWEWyPZ4jSvJLk9AdzuDxn1T4-gxsOpbgRoZ-=6jH1Tg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-29  5:24                 ` Ryusuke Konishi
  2017-10-29  8:07                 ` Andreas Rohner
  2017-10-29  9:46                 ` Andreas Rohner
  2 siblings, 0 replies; 17+ messages in thread
From: Ryusuke Konishi @ 2017-10-29  5:24 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs, Clemens Eisserer, Viacheslav Dubeyko

If someone could confirm the patch by Andreas fixed the original corruption
issue,  I will send the patch to upstream anyway.

Verification (a "Tested-by" tag) is needed to let it go to stable trees.

Ryusuke Konishi


2017-10-29 14:09 GMT+09:00 Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> Unfortunately the patch didn't fix this issue:
>
> [  612.614615] NILFS (sda1): segctord starting. Construction interval
> = 5 seconds, CP frequency < 30 seconds
> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting
> data buffer: ino=1155, cno=11, offset=1, blocknr=1216036,
> vblocknr=1198436
> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read source blocks
>
> Look like the bugsimulator hit other bug.
>
> When this happened, the disk usage was 100%:
>
> $ df /dev/sda1
> Filesystem     1K-blocks     Used Available Use% Mounted on
> /dev/sda1      104849404 99655676         0 100% /test
>
> And, the error killed cleanerd.   However, I could cleanly unmount the
> partition, and
> I didn't see any errors after remount the partition.
>
> As you commented, I have changed some parameters in /etc/nilfs_cleanerd.conf:
> -------------------------------------------
> mc_nsegments_per_clean  16
> nsegments_per_clean     16
> max_clean_segments              30%
> cleaning_interval       0.1
> mc_cleaning_interval    0.1
> -------------------------------------------
>
> I will look into this issue with bugsimulator.
>
> Anyone can reproduce the original corruption issue in other way?
>
> Regards,
> Ryusuke Konishi
>
>
> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>> Hi,
>>
>> Sorry for my late reponse.
>> I could reproduce a corruption issue with the bug simulator.
>>
>> -----
>> [88805.039545] NILFS (sda1): segctord starting. Construction interval = 5 second
>> s, CP frequency < 30 seconds
>> [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting data buff
>> er: ino=10846, cno=92, offset=2, blocknr=11425709, vblocknr=11227714
>> [92817.941936] NILFS (sda1): error -17 preparing GC: cannot read source blocks
>> -----
>>
>> Now I'm testing the patch.   I will send it to upstream after it turned
>> out to suppress the issue.
>>
>> The patch itself looks good to me though I guess there may be similar bugs.
>> What makes things complicated seems that inode block is not marked as dirty
>> just when "propagate" function carries a dirty state from a dirty
>> node/data block.
>>
>> Anyway, I hasten to apply this patch.
>>
>> Thanks,
>> Ryusuke Konishi
>>
>>
>> 2017-07-13 15:44 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>>> Hi,
>>>
>>> Am Wed, 12 Jul 2017 09:08:57 +0900
>>> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>>>
>>>> Hi,
>>>>
>>>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>>>> >
>>>> > There is a race condition between the function nilfs_dirty_inode()
>>>> > and nilfs_set_file_dirty().
>>>> >
>>>> > When a file is opened, nilfs_dirty_inode() is called to update the
>>>> > access timestamp in the inode. It calls __nilfs_mark_inode_dirty()
>>>> > in a separate transaction. __nilfs_mark_inode_dirty() caches the
>>>> > ifile buffer_head in the i_bh field of the inode info structure and
>>>> > marks it as dirty.
>>>> >
>>>> > After some data was written to the file in another transaction, the
>>>> > function nilfs_set_file_dirty() is called, which adds the inode to
>>>> > the ns_dirty_files list.
>>>> >
>>>> > Then the segment construction calls
>>>> > nilfs_segctor_collect_dirty_files(), which goes through the
>>>> > ns_dirty_files list and checks the i_bh field. If there is a cached
>>>> > buffer_head in i_bh it is not marked as dirty again.
>>>> >
>>>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
>>>> > transactions, it is possible that a segment construction that
>>>> > writes out the ifile occurs in-between the two. If this happens the
>>>> > inode is not on the ns_dirty_files list, but its ifile block is
>>>> > still marked as dirty and written out.
>>>> >
>>>> > In the next segment construction, the data for the file is written
>>>> > out and nilfs_bmap_propagate() updates the b-tree. Eventually the
>>>> > bmap root is written into the i_bh block, which is not dirty,
>>>> > because it was written out in another segment construction.
>>>> >
>>>> > As a result the bmap update can be lost, which leads to file system
>>>> > corruption. Either the virtual block address points to an
>>>> > unallocated DAT block, or the DAT entry will be reused for
>>>> > something different.
>>>> >
>>>> > The error can remain undetected for a long time. A typical error
>>>> > message would be one of the "bad btree" errors or a warning that a
>>>> > DAT entry could not be found.
>>>> >
>>>> > This bug can be reproduced reliably by a simple benchmark that
>>>> > creates and overwrites millions of 4k files.
>>>> >
>>>> > Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>>>>
>>>> Will review, please wait for a while.
>>>
>>> Thank you, I have uploaded my benchmark on GitHub:
>>>
>>> https://github.com/zeitgeist87/bugsimulator
>>>
>>> The benchmark is very simple. It writes out 13107200 4k files in a two
>>> layer directory tree, so that every directory contains about 1000 files.
>>> Then it overwrites these files in an infinite loop.
>>>
>>> You need quite aggressive cleaner settings for it to keep up with the
>>> bugsimulator:
>>>
>>> mc_nsegments_per_clean  16
>>> nsegments_per_clean     16
>>> max_clean_segments              30%
>>> cleaning_interval       0.1
>>> mc_cleaning_interval    0.1
>>>
>>> If you run this benchmark on a fresh 100GB NILFS2 partition with the
>>> above cleaner settings, the file system will be corrupt in about half an
>>> hour. At least on my machine.
>>>
>>> With the patch for the race condition, it can run indefinetly.
>>>
>>> Regards,
>>> Andreas
>>>
>>>> Regards,
>>>> Ryusuke Konishi
>>>>
>>>> >
>>>> > ---
>>>> >  fs/nilfs2/segment.c | 6 ++++--
>>>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>>>> >
>>>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>>>> > index 70ded52dc1dd..50e12956c737 100644
>>>> > --- a/fs/nilfs2/segment.c
>>>> > +++ b/fs/nilfs2/segment.c
>>>> > @@ -1958,8 +1958,6 @@ static int
>>>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci, err,
>>>> > ii->vfs_inode.i_ino); return err;
>>>> >                         }
>>>> > -                       mark_buffer_dirty(ibh);
>>>> > -                       nilfs_mdt_mark_dirty(ifile);
>>>> >                         spin_lock(&nilfs->ns_inode_lock);
>>>> >                         if (likely(!ii->i_bh))
>>>> >                                 ii->i_bh = ibh;
>>>> > @@ -1968,6 +1966,10 @@ static int
>>>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci, goto
>>>> > retry; }
>>>> >
>>>> > +               // Always redirty the buffer to avoid race condition
>>>> > +               mark_buffer_dirty(ii->i_bh);
>>>> > +               nilfs_mdt_mark_dirty(ifile);
>>>> > +
>>>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
>>>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
>>>> >                 list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
>>>> > --
>>>> > 2.13.2
>>>> >
>>>> > --
>>>> > To unsubscribe from this list: send the line "unsubscribe
>>>> > linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
       [not found]               ` <CAKFNMomWEWyPZ4jSvJLk9AdzuDxn1T4-gxsOpbgRoZ-=6jH1Tg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-10-29  5:24                 ` Ryusuke Konishi
@ 2017-10-29  8:07                 ` Andreas Rohner
  2017-10-29 12:53                   ` Ryusuke Konishi
  2017-10-29  9:46                 ` Andreas Rohner
  2 siblings, 1 reply; 17+ messages in thread
From: Andreas Rohner @ 2017-10-29  8:07 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs

Hi Ryusuke,

Thank you for reviewing my patch. I will rerun my tests to make sure
my conclusions were correct.

The log output suggests to me, that you are hitting a totally unrelated
bug. I have seen that message before on one of my experimental branches,
but I didn't think it could happen with stock NILFS2.

My solution was to add the following piece of code to
nilfs_vdesc_is_live() in nilfs-utils:

if (vdesc->vd_period.p_end == vdesc->vd_cno) {
	/*
	 * This block was overwritten in the same logical segment, but
	 * in a different partial segment. Probably because of
	 * fdatasync() or a flush to disk.
	 * Without this check, gc will cause buffer confliction error
	 * if both partial segments are cleaned at the same time.
	 * In that case there will be two vdesc with the same ino,
	 * cno and offset.
	 */
	return 0;
}

I can't tell for sure if that will fix your problem. I am just guessing.
It isn't necessary with my setup, to reproduce the race condition bug.

I am rerunning my tests now to make sure I didn't make a mistake.

Best regards,

Andreas

Am Sun, 29 Oct 2017 14:09:40 +0900
schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:

> Unfortunately the patch didn't fix this issue:
> 
> [  612.614615] NILFS (sda1): segctord starting. Construction interval
> = 5 seconds, CP frequency < 30 seconds
> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting
> data buffer: ino=1155, cno=11, offset=1, blocknr=1216036,
> vblocknr=1198436
> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read
> source blocks
> 
> Look like the bugsimulator hit other bug.
> 
> When this happened, the disk usage was 100%:
> 
> $ df /dev/sda1
> Filesystem     1K-blocks     Used Available Use% Mounted on
> /dev/sda1      104849404 99655676         0 100% /test
> 
> And, the error killed cleanerd.   However, I could cleanly unmount the
> partition, and
> I didn't see any errors after remount the partition.
> 
> As you commented, I have changed some parameters
> in /etc/nilfs_cleanerd.conf:
> ------------------------------------------- mc_nsegments_per_clean  16
> nsegments_per_clean     16
> max_clean_segments              30%
> cleaning_interval       0.1
> mc_cleaning_interval    0.1
> -------------------------------------------
> 
> I will look into this issue with bugsimulator.
> 
> Anyone can reproduce the original corruption issue in other way?
> 
> Regards,
> Ryusuke Konishi
> 
> 
> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> > Hi,
> >
> > Sorry for my late reponse.
> > I could reproduce a corruption issue with the bug simulator.
> >
> > -----
> > [88805.039545] NILFS (sda1): segctord starting. Construction
> > interval = 5 second s, CP frequency < 30 seconds
> > [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block:
> > conflicting data buff er: ino=10846, cno=92, offset=2,
> > blocknr=11425709, vblocknr=11227714 [92817.941936] NILFS (sda1):
> > error -17 preparing GC: cannot read source blocks -----
> >
> > Now I'm testing the patch.   I will send it to upstream after it
> > turned out to suppress the issue.
> >
> > The patch itself looks good to me though I guess there may be
> > similar bugs. What makes things complicated seems that inode block
> > is not marked as dirty just when "propagate" function carries a
> > dirty state from a dirty node/data block.
> >
> > Anyway, I hasten to apply this patch.
> >
> > Thanks,
> > Ryusuke Konishi
> >
> >
> > 2017-07-13 15:44 GMT+09:00 Andreas Rohner
> > <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:  
> >> Hi,
> >>
> >> Am Wed, 12 Jul 2017 09:08:57 +0900
> >> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> >>  
> >>> Hi,
> >>>
> >>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner
> >>> <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:  
> >>> >
> >>> > There is a race condition between the function
> >>> > nilfs_dirty_inode() and nilfs_set_file_dirty().
> >>> >
> >>> > When a file is opened, nilfs_dirty_inode() is called to update
> >>> > the access timestamp in the inode. It calls
> >>> > __nilfs_mark_inode_dirty() in a separate transaction.
> >>> > __nilfs_mark_inode_dirty() caches the ifile buffer_head in the
> >>> > i_bh field of the inode info structure and marks it as dirty.
> >>> >
> >>> > After some data was written to the file in another transaction,
> >>> > the function nilfs_set_file_dirty() is called, which adds the
> >>> > inode to the ns_dirty_files list.
> >>> >
> >>> > Then the segment construction calls
> >>> > nilfs_segctor_collect_dirty_files(), which goes through the
> >>> > ns_dirty_files list and checks the i_bh field. If there is a
> >>> > cached buffer_head in i_bh it is not marked as dirty again.
> >>> >
> >>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use
> >>> > separate transactions, it is possible that a segment
> >>> > construction that writes out the ifile occurs in-between the
> >>> > two. If this happens the inode is not on the ns_dirty_files
> >>> > list, but its ifile block is still marked as dirty and written
> >>> > out.
> >>> >
> >>> > In the next segment construction, the data for the file is
> >>> > written out and nilfs_bmap_propagate() updates the b-tree.
> >>> > Eventually the bmap root is written into the i_bh block, which
> >>> > is not dirty, because it was written out in another segment
> >>> > construction.
> >>> >
> >>> > As a result the bmap update can be lost, which leads to file
> >>> > system corruption. Either the virtual block address points to an
> >>> > unallocated DAT block, or the DAT entry will be reused for
> >>> > something different.
> >>> >
> >>> > The error can remain undetected for a long time. A typical error
> >>> > message would be one of the "bad btree" errors or a warning
> >>> > that a DAT entry could not be found.
> >>> >
> >>> > This bug can be reproduced reliably by a simple benchmark that
> >>> > creates and overwrites millions of 4k files.
> >>> >
> >>> > Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>  
> >>>
> >>> Will review, please wait for a while.  
> >>
> >> Thank you, I have uploaded my benchmark on GitHub:
> >>
> >> https://github.com/zeitgeist87/bugsimulator
> >>
> >> The benchmark is very simple. It writes out 13107200 4k files in a
> >> two layer directory tree, so that every directory contains about
> >> 1000 files. Then it overwrites these files in an infinite loop.
> >>
> >> You need quite aggressive cleaner settings for it to keep up with
> >> the bugsimulator:
> >>
> >> mc_nsegments_per_clean  16
> >> nsegments_per_clean     16
> >> max_clean_segments              30%
> >> cleaning_interval       0.1
> >> mc_cleaning_interval    0.1
> >>
> >> If you run this benchmark on a fresh 100GB NILFS2 partition with
> >> the above cleaner settings, the file system will be corrupt in
> >> about half an hour. At least on my machine.
> >>
> >> With the patch for the race condition, it can run indefinetly.
> >>
> >> Regards,
> >> Andreas
> >>  
> >>> Regards,
> >>> Ryusuke Konishi
> >>>  
> >>> >
> >>> > ---
> >>> >  fs/nilfs2/segment.c | 6 ++++--
> >>> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >>> >
> >>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> >>> > index 70ded52dc1dd..50e12956c737 100644
> >>> > --- a/fs/nilfs2/segment.c
> >>> > +++ b/fs/nilfs2/segment.c
> >>> > @@ -1958,8 +1958,6 @@ static int
> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
> >>> > err, ii->vfs_inode.i_ino); return err;
> >>> >                         }
> >>> > -                       mark_buffer_dirty(ibh);
> >>> > -                       nilfs_mdt_mark_dirty(ifile);
> >>> >                         spin_lock(&nilfs->ns_inode_lock);
> >>> >                         if (likely(!ii->i_bh))
> >>> >                                 ii->i_bh = ibh;
> >>> > @@ -1968,6 +1966,10 @@ static int
> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
> >>> > goto retry; }
> >>> >
> >>> > +               // Always redirty the buffer to avoid race
> >>> > condition
> >>> > +               mark_buffer_dirty(ii->i_bh);
> >>> > +               nilfs_mdt_mark_dirty(ifile);
> >>> > +
> >>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
> >>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
> >>> >                 list_move_tail(&ii->i_dirty,
> >>> > &sci->sc_dirty_files); --
> >>> > 2.13.2
> >>> >
> >>> > --
> >>> > To unsubscribe from this list: send the line "unsubscribe
> >>> > linux-nilfs" in the body of a message to
> >>> > majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
> >>> > http://vger.kernel.org/majordomo-info.html  
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe
> >>> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>> More majordomo info at
> >>> http://vger.kernel.org/majordomo-info.html  
> >>  
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
       [not found]               ` <CAKFNMomWEWyPZ4jSvJLk9AdzuDxn1T4-gxsOpbgRoZ-=6jH1Tg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-10-29  5:24                 ` Ryusuke Konishi
  2017-10-29  8:07                 ` Andreas Rohner
@ 2017-10-29  9:46                 ` Andreas Rohner
  2017-10-29 10:23                   ` Clemens Eisserer
  2 siblings, 1 reply; 17+ messages in thread
From: Andreas Rohner @ 2017-10-29  9:46 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs

Hi Ryusuke,

I ran my tests again and I can reproduce the issue within half an hour:

[ 9823.056522] WARNING: CPU: 5 PID: 18461 at fs/nilfs2//dat.c:160
nilfs_dat_prepare_end+0xd3/0xe0 [nilfs2]

...

[ 9823.056570] Call Trace:
[ 9823.056616]  nilfs_dat_prepare_update+0x20/0x60 [nilfs2]
[ 9823.056620]  nilfs_direct_propagate+0x71/0xd0 [nilfs2]
[ 9823.056623]  nilfs_bmap_propagate+0x2d/0x70 [nilfs2]
[ 9823.056626]  nilfs_collect_file_data+0x23/0x50 [nilfs2]
[ 9823.056629]  ? nilfs_collect_dat_data+0x50/0x50 [nilfs2]
[ 9823.056636]  nilfs_segctor_apply_buffers+0x7b/0xf0 [nilfs2]
[ 9823.056640]  nilfs_segctor_scan_file+0x135/0x190 [nilfs2]
[ 9823.056643]  nilfs_segctor_do_construct+0xb7a/0x2100 [nilfs2]
[ 9823.056647]  nilfs_segctor_construct+0x1c9/0x300 [nilfs2]
[ 9823.056650]  ? nilfs_segctor_construct+0x1c9/0x300 [nilfs2]
[ 9823.056653]  nilfs_segctor_thread+0x10a/0x390 [nilfs2]
[ 9823.056659]  kthread+0x109/0x140
[ 9823.056662]  ? nilfs_segctor_construct+0x300/0x300 [nilfs2]
[ 9823.056663]  ? kthread_create_on_node+0x70/0x70
[ 9823.056665]  ret_from_fork+0x25/0x30

The result is a broken file system, and the kernel log is spammed with
the above warnings and stack traces.

The file system usage is at 71%:

$ LANG=C df /dev/sdb1 
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/sdb1      104849404 70205436  29401088  71% /mnt

The patch for the race condition definitely resolves the problem. With
it, I can run the bugsimulator for hours without any problems.

However, I used quite an old kernel 4.12.0. I will update to the
latest version and test again.

Best regards,

Andreas

Am Sun, 29 Oct 2017 14:09:40 +0900
schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:

> Unfortunately the patch didn't fix this issue:
> 
> [  612.614615] NILFS (sda1): segctord starting. Construction interval
> = 5 seconds, CP frequency < 30 seconds
> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting
> data buffer: ino=1155, cno=11, offset=1, blocknr=1216036,
> vblocknr=1198436
> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read
> source blocks
> 
> Look like the bugsimulator hit other bug.
> 
> When this happened, the disk usage was 100%:
> 
> $ df /dev/sda1
> Filesystem     1K-blocks     Used Available Use% Mounted on
> /dev/sda1      104849404 99655676         0 100% /test
> 
> And, the error killed cleanerd.   However, I could cleanly unmount the
> partition, and
> I didn't see any errors after remount the partition.
> 
> As you commented, I have changed some parameters
> in /etc/nilfs_cleanerd.conf:
> ------------------------------------------- mc_nsegments_per_clean  16
> nsegments_per_clean     16
> max_clean_segments              30%
> cleaning_interval       0.1
> mc_cleaning_interval    0.1
> -------------------------------------------
> 
> I will look into this issue with bugsimulator.
> 
> Anyone can reproduce the original corruption issue in other way?
> 
> Regards,
> Ryusuke Konishi
> 
> 
> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> > Hi,
> >
> > Sorry for my late reponse.
> > I could reproduce a corruption issue with the bug simulator.
> >
> > -----
> > [88805.039545] NILFS (sda1): segctord starting. Construction
> > interval = 5 second s, CP frequency < 30 seconds
> > [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block:
> > conflicting data buff er: ino=10846, cno=92, offset=2,
> > blocknr=11425709, vblocknr=11227714 [92817.941936] NILFS (sda1):
> > error -17 preparing GC: cannot read source blocks -----
> >
> > Now I'm testing the patch.   I will send it to upstream after it
> > turned out to suppress the issue.
> >
> > The patch itself looks good to me though I guess there may be
> > similar bugs. What makes things complicated seems that inode block
> > is not marked as dirty just when "propagate" function carries a
> > dirty state from a dirty node/data block.
> >
> > Anyway, I hasten to apply this patch.
> >
> > Thanks,
> > Ryusuke Konishi
> >
> >
> > 2017-07-13 15:44 GMT+09:00 Andreas Rohner
> > <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:  
> >> Hi,
> >>
> >> Am Wed, 12 Jul 2017 09:08:57 +0900
> >> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> >>  
> >>> Hi,
> >>>
> >>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner
> >>> <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:  
> >>> >
> >>> > There is a race condition between the function
> >>> > nilfs_dirty_inode() and nilfs_set_file_dirty().
> >>> >
> >>> > When a file is opened, nilfs_dirty_inode() is called to update
> >>> > the access timestamp in the inode. It calls
> >>> > __nilfs_mark_inode_dirty() in a separate transaction.
> >>> > __nilfs_mark_inode_dirty() caches the ifile buffer_head in the
> >>> > i_bh field of the inode info structure and marks it as dirty.
> >>> >
> >>> > After some data was written to the file in another transaction,
> >>> > the function nilfs_set_file_dirty() is called, which adds the
> >>> > inode to the ns_dirty_files list.
> >>> >
> >>> > Then the segment construction calls
> >>> > nilfs_segctor_collect_dirty_files(), which goes through the
> >>> > ns_dirty_files list and checks the i_bh field. If there is a
> >>> > cached buffer_head in i_bh it is not marked as dirty again.
> >>> >
> >>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use
> >>> > separate transactions, it is possible that a segment
> >>> > construction that writes out the ifile occurs in-between the
> >>> > two. If this happens the inode is not on the ns_dirty_files
> >>> > list, but its ifile block is still marked as dirty and written
> >>> > out.
> >>> >
> >>> > In the next segment construction, the data for the file is
> >>> > written out and nilfs_bmap_propagate() updates the b-tree.
> >>> > Eventually the bmap root is written into the i_bh block, which
> >>> > is not dirty, because it was written out in another segment
> >>> > construction.
> >>> >
> >>> > As a result the bmap update can be lost, which leads to file
> >>> > system corruption. Either the virtual block address points to an
> >>> > unallocated DAT block, or the DAT entry will be reused for
> >>> > something different.
> >>> >
> >>> > The error can remain undetected for a long time. A typical error
> >>> > message would be one of the "bad btree" errors or a warning
> >>> > that a DAT entry could not be found.
> >>> >
> >>> > This bug can be reproduced reliably by a simple benchmark that
> >>> > creates and overwrites millions of 4k files.
> >>> >
> >>> > Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>  
> >>>
> >>> Will review, please wait for a while.  
> >>
> >> Thank you, I have uploaded my benchmark on GitHub:
> >>
> >> https://github.com/zeitgeist87/bugsimulator
> >>
> >> The benchmark is very simple. It writes out 13107200 4k files in a
> >> two layer directory tree, so that every directory contains about
> >> 1000 files. Then it overwrites these files in an infinite loop.
> >>
> >> You need quite aggressive cleaner settings for it to keep up with
> >> the bugsimulator:
> >>
> >> mc_nsegments_per_clean  16
> >> nsegments_per_clean     16
> >> max_clean_segments              30%
> >> cleaning_interval       0.1
> >> mc_cleaning_interval    0.1
> >>
> >> If you run this benchmark on a fresh 100GB NILFS2 partition with
> >> the above cleaner settings, the file system will be corrupt in
> >> about half an hour. At least on my machine.
> >>
> >> With the patch for the race condition, it can run indefinetly.
> >>
> >> Regards,
> >> Andreas
> >>  
> >>> Regards,
> >>> Ryusuke Konishi
> >>>  
> >>> >
> >>> > ---
> >>> >  fs/nilfs2/segment.c | 6 ++++--
> >>> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >>> >
> >>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> >>> > index 70ded52dc1dd..50e12956c737 100644
> >>> > --- a/fs/nilfs2/segment.c
> >>> > +++ b/fs/nilfs2/segment.c
> >>> > @@ -1958,8 +1958,6 @@ static int
> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
> >>> > err, ii->vfs_inode.i_ino); return err;
> >>> >                         }
> >>> > -                       mark_buffer_dirty(ibh);
> >>> > -                       nilfs_mdt_mark_dirty(ifile);
> >>> >                         spin_lock(&nilfs->ns_inode_lock);
> >>> >                         if (likely(!ii->i_bh))
> >>> >                                 ii->i_bh = ibh;
> >>> > @@ -1968,6 +1966,10 @@ static int
> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
> >>> > goto retry; }
> >>> >
> >>> > +               // Always redirty the buffer to avoid race
> >>> > condition
> >>> > +               mark_buffer_dirty(ii->i_bh);
> >>> > +               nilfs_mdt_mark_dirty(ifile);
> >>> > +
> >>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
> >>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
> >>> >                 list_move_tail(&ii->i_dirty,
> >>> > &sci->sc_dirty_files); --
> >>> > 2.13.2
> >>> >
> >>> > --
> >>> > To unsubscribe from this list: send the line "unsubscribe
> >>> > linux-nilfs" in the body of a message to
> >>> > majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
> >>> > http://vger.kernel.org/majordomo-info.html  
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe
> >>> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>> More majordomo info at
> >>> http://vger.kernel.org/majordomo-info.html  
> >>  
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
  2017-10-29  9:46                 ` Andreas Rohner
@ 2017-10-29 10:23                   ` Clemens Eisserer
  0 siblings, 0 replies; 17+ messages in thread
From: Clemens Eisserer @ 2017-10-29 10:23 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi Ryusuke,

Thanks for taking the time and having a look at the issue / patch.
I will be able to conduct a few tests by tomorrow.

Best regards, Clemens

2017-10-29 10:46 GMT+01:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
> Hi Ryusuke,
>
> I ran my tests again and I can reproduce the issue within half an hour:
>
> [ 9823.056522] WARNING: CPU: 5 PID: 18461 at fs/nilfs2//dat.c:160
> nilfs_dat_prepare_end+0xd3/0xe0 [nilfs2]
>
> ...
>
> [ 9823.056570] Call Trace:
> [ 9823.056616]  nilfs_dat_prepare_update+0x20/0x60 [nilfs2]
> [ 9823.056620]  nilfs_direct_propagate+0x71/0xd0 [nilfs2]
> [ 9823.056623]  nilfs_bmap_propagate+0x2d/0x70 [nilfs2]
> [ 9823.056626]  nilfs_collect_file_data+0x23/0x50 [nilfs2]
> [ 9823.056629]  ? nilfs_collect_dat_data+0x50/0x50 [nilfs2]
> [ 9823.056636]  nilfs_segctor_apply_buffers+0x7b/0xf0 [nilfs2]
> [ 9823.056640]  nilfs_segctor_scan_file+0x135/0x190 [nilfs2]
> [ 9823.056643]  nilfs_segctor_do_construct+0xb7a/0x2100 [nilfs2]
> [ 9823.056647]  nilfs_segctor_construct+0x1c9/0x300 [nilfs2]
> [ 9823.056650]  ? nilfs_segctor_construct+0x1c9/0x300 [nilfs2]
> [ 9823.056653]  nilfs_segctor_thread+0x10a/0x390 [nilfs2]
> [ 9823.056659]  kthread+0x109/0x140
> [ 9823.056662]  ? nilfs_segctor_construct+0x300/0x300 [nilfs2]
> [ 9823.056663]  ? kthread_create_on_node+0x70/0x70
> [ 9823.056665]  ret_from_fork+0x25/0x30
>
> The result is a broken file system, and the kernel log is spammed with
> the above warnings and stack traces.
>
> The file system usage is at 71%:
>
> $ LANG=C df /dev/sdb1
> Filesystem     1K-blocks     Used Available Use% Mounted on
> /dev/sdb1      104849404 70205436  29401088  71% /mnt
>
> The patch for the race condition definitely resolves the problem. With
> it, I can run the bugsimulator for hours without any problems.
>
> However, I used quite an old kernel 4.12.0. I will update to the
> latest version and test again.
>
> Best regards,
>
> Andreas
>
> Am Sun, 29 Oct 2017 14:09:40 +0900
> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>
>> Unfortunately the patch didn't fix this issue:
>>
>> [  612.614615] NILFS (sda1): segctord starting. Construction interval
>> = 5 seconds, CP frequency < 30 seconds
>> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting
>> data buffer: ino=1155, cno=11, offset=1, blocknr=1216036,
>> vblocknr=1198436
>> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read
>> source blocks
>>
>> Look like the bugsimulator hit other bug.
>>
>> When this happened, the disk usage was 100%:
>>
>> $ df /dev/sda1
>> Filesystem     1K-blocks     Used Available Use% Mounted on
>> /dev/sda1      104849404 99655676         0 100% /test
>>
>> And, the error killed cleanerd.   However, I could cleanly unmount the
>> partition, and
>> I didn't see any errors after remount the partition.
>>
>> As you commented, I have changed some parameters
>> in /etc/nilfs_cleanerd.conf:
>> ------------------------------------------- mc_nsegments_per_clean  16
>> nsegments_per_clean     16
>> max_clean_segments              30%
>> cleaning_interval       0.1
>> mc_cleaning_interval    0.1
>> -------------------------------------------
>>
>> I will look into this issue with bugsimulator.
>>
>> Anyone can reproduce the original corruption issue in other way?
>>
>> Regards,
>> Ryusuke Konishi
>>
>>
>> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>> > Hi,
>> >
>> > Sorry for my late reponse.
>> > I could reproduce a corruption issue with the bug simulator.
>> >
>> > -----
>> > [88805.039545] NILFS (sda1): segctord starting. Construction
>> > interval = 5 second s, CP frequency < 30 seconds
>> > [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block:
>> > conflicting data buff er: ino=10846, cno=92, offset=2,
>> > blocknr=11425709, vblocknr=11227714 [92817.941936] NILFS (sda1):
>> > error -17 preparing GC: cannot read source blocks -----
>> >
>> > Now I'm testing the patch.   I will send it to upstream after it
>> > turned out to suppress the issue.
>> >
>> > The patch itself looks good to me though I guess there may be
>> > similar bugs. What makes things complicated seems that inode block
>> > is not marked as dirty just when "propagate" function carries a
>> > dirty state from a dirty node/data block.
>> >
>> > Anyway, I hasten to apply this patch.
>> >
>> > Thanks,
>> > Ryusuke Konishi
>> >
>> >
>> > 2017-07-13 15:44 GMT+09:00 Andreas Rohner
>> > <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>> >> Hi,
>> >>
>> >> Am Wed, 12 Jul 2017 09:08:57 +0900
>> >> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>> >>
>> >>> Hi,
>> >>>
>> >>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner
>> >>> <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>> >>> >
>> >>> > There is a race condition between the function
>> >>> > nilfs_dirty_inode() and nilfs_set_file_dirty().
>> >>> >
>> >>> > When a file is opened, nilfs_dirty_inode() is called to update
>> >>> > the access timestamp in the inode. It calls
>> >>> > __nilfs_mark_inode_dirty() in a separate transaction.
>> >>> > __nilfs_mark_inode_dirty() caches the ifile buffer_head in the
>> >>> > i_bh field of the inode info structure and marks it as dirty.
>> >>> >
>> >>> > After some data was written to the file in another transaction,
>> >>> > the function nilfs_set_file_dirty() is called, which adds the
>> >>> > inode to the ns_dirty_files list.
>> >>> >
>> >>> > Then the segment construction calls
>> >>> > nilfs_segctor_collect_dirty_files(), which goes through the
>> >>> > ns_dirty_files list and checks the i_bh field. If there is a
>> >>> > cached buffer_head in i_bh it is not marked as dirty again.
>> >>> >
>> >>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use
>> >>> > separate transactions, it is possible that a segment
>> >>> > construction that writes out the ifile occurs in-between the
>> >>> > two. If this happens the inode is not on the ns_dirty_files
>> >>> > list, but its ifile block is still marked as dirty and written
>> >>> > out.
>> >>> >
>> >>> > In the next segment construction, the data for the file is
>> >>> > written out and nilfs_bmap_propagate() updates the b-tree.
>> >>> > Eventually the bmap root is written into the i_bh block, which
>> >>> > is not dirty, because it was written out in another segment
>> >>> > construction.
>> >>> >
>> >>> > As a result the bmap update can be lost, which leads to file
>> >>> > system corruption. Either the virtual block address points to an
>> >>> > unallocated DAT block, or the DAT entry will be reused for
>> >>> > something different.
>> >>> >
>> >>> > The error can remain undetected for a long time. A typical error
>> >>> > message would be one of the "bad btree" errors or a warning
>> >>> > that a DAT entry could not be found.
>> >>> >
>> >>> > This bug can be reproduced reliably by a simple benchmark that
>> >>> > creates and overwrites millions of 4k files.
>> >>> >
>> >>> > Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> >>>
>> >>> Will review, please wait for a while.
>> >>
>> >> Thank you, I have uploaded my benchmark on GitHub:
>> >>
>> >> https://github.com/zeitgeist87/bugsimulator
>> >>
>> >> The benchmark is very simple. It writes out 13107200 4k files in a
>> >> two layer directory tree, so that every directory contains about
>> >> 1000 files. Then it overwrites these files in an infinite loop.
>> >>
>> >> You need quite aggressive cleaner settings for it to keep up with
>> >> the bugsimulator:
>> >>
>> >> mc_nsegments_per_clean  16
>> >> nsegments_per_clean     16
>> >> max_clean_segments              30%
>> >> cleaning_interval       0.1
>> >> mc_cleaning_interval    0.1
>> >>
>> >> If you run this benchmark on a fresh 100GB NILFS2 partition with
>> >> the above cleaner settings, the file system will be corrupt in
>> >> about half an hour. At least on my machine.
>> >>
>> >> With the patch for the race condition, it can run indefinetly.
>> >>
>> >> Regards,
>> >> Andreas
>> >>
>> >>> Regards,
>> >>> Ryusuke Konishi
>> >>>
>> >>> >
>> >>> > ---
>> >>> >  fs/nilfs2/segment.c | 6 ++++--
>> >>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>> >>> >
>> >>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> >>> > index 70ded52dc1dd..50e12956c737 100644
>> >>> > --- a/fs/nilfs2/segment.c
>> >>> > +++ b/fs/nilfs2/segment.c
>> >>> > @@ -1958,8 +1958,6 @@ static int
>> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>> >>> > err, ii->vfs_inode.i_ino); return err;
>> >>> >                         }
>> >>> > -                       mark_buffer_dirty(ibh);
>> >>> > -                       nilfs_mdt_mark_dirty(ifile);
>> >>> >                         spin_lock(&nilfs->ns_inode_lock);
>> >>> >                         if (likely(!ii->i_bh))
>> >>> >                                 ii->i_bh = ibh;
>> >>> > @@ -1968,6 +1966,10 @@ static int
>> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>> >>> > goto retry; }
>> >>> >
>> >>> > +               // Always redirty the buffer to avoid race
>> >>> > condition
>> >>> > +               mark_buffer_dirty(ii->i_bh);
>> >>> > +               nilfs_mdt_mark_dirty(ifile);
>> >>> > +
>> >>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
>> >>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
>> >>> >                 list_move_tail(&ii->i_dirty,
>> >>> > &sci->sc_dirty_files); --
>> >>> > 2.13.2
>> >>> >
>> >>> > --
>> >>> > To unsubscribe from this list: send the line "unsubscribe
>> >>> > linux-nilfs" in the body of a message to
>> >>> > majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
>> >>> > http://vger.kernel.org/majordomo-info.html
>> >>> --
>> >>> To unsubscribe from this list: send the line "unsubscribe
>> >>> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >>> More majordomo info at
>> >>> http://vger.kernel.org/majordomo-info.html
>> >>
>> --
>> To unsubscribe from this list: send the line "unsubscribe
>> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
  2017-10-29  8:07                 ` Andreas Rohner
@ 2017-10-29 12:53                   ` Ryusuke Konishi
       [not found]                     ` <CAKFNMo=6uOSA-bdkxN6dYnPo-7Svkm2vMXbFLzcP_eEp9ra_jg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Ryusuke Konishi @ 2017-10-29 12:53 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs, Clemens Eisserer

2017-10-29 17:07 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
> Hi Ryusuke,
>
> Thank you for reviewing my patch. I will rerun my tests to make sure
> my conclusions were correct.
>
> The log output suggests to me, that you are hitting a totally unrelated
> bug. I have seen that message before on one of my experimental branches,
> but I didn't think it could happen with stock NILFS2.
>
> My solution was to add the following piece of code to
> nilfs_vdesc_is_live() in nilfs-utils:
>
> if (vdesc->vd_period.p_end == vdesc->vd_cno) {
>         /*
>          * This block was overwritten in the same logical segment, but
>          * in a different partial segment. Probably because of
>          * fdatasync() or a flush to disk.
>          * Without this check, gc will cause buffer confliction error
>          * if both partial segments are cleaned at the same time.
>          * In that case there will be two vdesc with the same ino,
>          * cno and offset.
>          */
>         return 0;
> }
>
> I can't tell for sure if that will fix your problem. I am just guessing.
> It isn't necessary with my setup, to reproduce the race condition bug.

Thank you, Andreas.  This snippet looks to have fixed the error so far.

Could you shape it to a patch (preferably a separate one) for nilfs-utils repo ?

I haven't yet succeeded to reproduce the original fs corruption error.
Will continue to try reproducing the bug waiting for report from you
and/or Clemens.

Thanks,
Ryusuke Konishi

>
> I am rerunning my tests now to make sure I didn't make a mistake.
>
> Best regards,
>
> Andreas
>
> Am Sun, 29 Oct 2017 14:09:40 +0900
> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>
>> Unfortunately the patch didn't fix this issue:
>>
>> [  612.614615] NILFS (sda1): segctord starting. Construction interval
>> = 5 seconds, CP frequency < 30 seconds
>> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting
>> data buffer: ino=1155, cno=11, offset=1, blocknr=1216036,
>> vblocknr=1198436
>> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read
>> source blocks
>>
>> Look like the bugsimulator hit other bug.
>>
>> When this happened, the disk usage was 100%:
>>
>> $ df /dev/sda1
>> Filesystem     1K-blocks     Used Available Use% Mounted on
>> /dev/sda1      104849404 99655676         0 100% /test
>>
>> And, the error killed cleanerd.   However, I could cleanly unmount the
>> partition, and
>> I didn't see any errors after remount the partition.
>>
>> As you commented, I have changed some parameters
>> in /etc/nilfs_cleanerd.conf:
>> ------------------------------------------- mc_nsegments_per_clean  16
>> nsegments_per_clean     16
>> max_clean_segments              30%
>> cleaning_interval       0.1
>> mc_cleaning_interval    0.1
>> -------------------------------------------
>>
>> I will look into this issue with bugsimulator.
>>
>> Anyone can reproduce the original corruption issue in other way?
>>
>> Regards,
>> Ryusuke Konishi
>>
>>
>> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>> > Hi,
>> >
>> > Sorry for my late reponse.
>> > I could reproduce a corruption issue with the bug simulator.
>> >
>> > -----
>> > [88805.039545] NILFS (sda1): segctord starting. Construction
>> > interval = 5 second s, CP frequency < 30 seconds
>> > [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block:
>> > conflicting data buff er: ino=10846, cno=92, offset=2,
>> > blocknr=11425709, vblocknr=11227714 [92817.941936] NILFS (sda1):
>> > error -17 preparing GC: cannot read source blocks -----
>> >
>> > Now I'm testing the patch.   I will send it to upstream after it
>> > turned out to suppress the issue.
>> >
>> > The patch itself looks good to me though I guess there may be
>> > similar bugs. What makes things complicated seems that inode block
>> > is not marked as dirty just when "propagate" function carries a
>> > dirty state from a dirty node/data block.
>> >
>> > Anyway, I hasten to apply this patch.
>> >
>> > Thanks,
>> > Ryusuke Konishi
>> >
>> >
>> > 2017-07-13 15:44 GMT+09:00 Andreas Rohner
>> > <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>> >> Hi,
>> >>
>> >> Am Wed, 12 Jul 2017 09:08:57 +0900
>> >> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>> >>
>> >>> Hi,
>> >>>
>> >>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner
>> >>> <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>> >>> >
>> >>> > There is a race condition between the function
>> >>> > nilfs_dirty_inode() and nilfs_set_file_dirty().
>> >>> >
>> >>> > When a file is opened, nilfs_dirty_inode() is called to update
>> >>> > the access timestamp in the inode. It calls
>> >>> > __nilfs_mark_inode_dirty() in a separate transaction.
>> >>> > __nilfs_mark_inode_dirty() caches the ifile buffer_head in the
>> >>> > i_bh field of the inode info structure and marks it as dirty.
>> >>> >
>> >>> > After some data was written to the file in another transaction,
>> >>> > the function nilfs_set_file_dirty() is called, which adds the
>> >>> > inode to the ns_dirty_files list.
>> >>> >
>> >>> > Then the segment construction calls
>> >>> > nilfs_segctor_collect_dirty_files(), which goes through the
>> >>> > ns_dirty_files list and checks the i_bh field. If there is a
>> >>> > cached buffer_head in i_bh it is not marked as dirty again.
>> >>> >
>> >>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use
>> >>> > separate transactions, it is possible that a segment
>> >>> > construction that writes out the ifile occurs in-between the
>> >>> > two. If this happens the inode is not on the ns_dirty_files
>> >>> > list, but its ifile block is still marked as dirty and written
>> >>> > out.
>> >>> >
>> >>> > In the next segment construction, the data for the file is
>> >>> > written out and nilfs_bmap_propagate() updates the b-tree.
>> >>> > Eventually the bmap root is written into the i_bh block, which
>> >>> > is not dirty, because it was written out in another segment
>> >>> > construction.
>> >>> >
>> >>> > As a result the bmap update can be lost, which leads to file
>> >>> > system corruption. Either the virtual block address points to an
>> >>> > unallocated DAT block, or the DAT entry will be reused for
>> >>> > something different.
>> >>> >
>> >>> > The error can remain undetected for a long time. A typical error
>> >>> > message would be one of the "bad btree" errors or a warning
>> >>> > that a DAT entry could not be found.
>> >>> >
>> >>> > This bug can be reproduced reliably by a simple benchmark that
>> >>> > creates and overwrites millions of 4k files.
>> >>> >
>> >>> > Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> >>>
>> >>> Will review, please wait for a while.
>> >>
>> >> Thank you, I have uploaded my benchmark on GitHub:
>> >>
>> >> https://github.com/zeitgeist87/bugsimulator
>> >>
>> >> The benchmark is very simple. It writes out 13107200 4k files in a
>> >> two layer directory tree, so that every directory contains about
>> >> 1000 files. Then it overwrites these files in an infinite loop.
>> >>
>> >> You need quite aggressive cleaner settings for it to keep up with
>> >> the bugsimulator:
>> >>
>> >> mc_nsegments_per_clean  16
>> >> nsegments_per_clean     16
>> >> max_clean_segments              30%
>> >> cleaning_interval       0.1
>> >> mc_cleaning_interval    0.1
>> >>
>> >> If you run this benchmark on a fresh 100GB NILFS2 partition with
>> >> the above cleaner settings, the file system will be corrupt in
>> >> about half an hour. At least on my machine.
>> >>
>> >> With the patch for the race condition, it can run indefinetly.
>> >>
>> >> Regards,
>> >> Andreas
>> >>
>> >>> Regards,
>> >>> Ryusuke Konishi
>> >>>
>> >>> >
>> >>> > ---
>> >>> >  fs/nilfs2/segment.c | 6 ++++--
>> >>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>> >>> >
>> >>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> >>> > index 70ded52dc1dd..50e12956c737 100644
>> >>> > --- a/fs/nilfs2/segment.c
>> >>> > +++ b/fs/nilfs2/segment.c
>> >>> > @@ -1958,8 +1958,6 @@ static int
>> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>> >>> > err, ii->vfs_inode.i_ino); return err;
>> >>> >                         }
>> >>> > -                       mark_buffer_dirty(ibh);
>> >>> > -                       nilfs_mdt_mark_dirty(ifile);
>> >>> >                         spin_lock(&nilfs->ns_inode_lock);
>> >>> >                         if (likely(!ii->i_bh))
>> >>> >                                 ii->i_bh = ibh;
>> >>> > @@ -1968,6 +1966,10 @@ static int
>> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>> >>> > goto retry; }
>> >>> >
>> >>> > +               // Always redirty the buffer to avoid race
>> >>> > condition
>> >>> > +               mark_buffer_dirty(ii->i_bh);
>> >>> > +               nilfs_mdt_mark_dirty(ifile);
>> >>> > +
>> >>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
>> >>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
>> >>> >                 list_move_tail(&ii->i_dirty,
>> >>> > &sci->sc_dirty_files); --
>> >>> > 2.13.2
>> >>> >
>> >>> > --
>> >>> > To unsubscribe from this list: send the line "unsubscribe
>> >>> > linux-nilfs" in the body of a message to
>> >>> > majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
>> >>> > http://vger.kernel.org/majordomo-info.html
>> >>> --
>> >>> To unsubscribe from this list: send the line "unsubscribe
>> >>> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >>> More majordomo info at
>> >>> http://vger.kernel.org/majordomo-info.html
>> >>
>> --
>> To unsubscribe from this list: send the line "unsubscribe
>> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
       [not found]                     ` <CAKFNMo=6uOSA-bdkxN6dYnPo-7Svkm2vMXbFLzcP_eEp9ra_jg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-29 14:06                       ` Andreas Rohner
  2017-10-29 16:29                         ` Ryusuke Konishi
  0 siblings, 1 reply; 17+ messages in thread
From: Andreas Rohner @ 2017-10-29 14:06 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs, Clemens Eisserer

Am Sun, 29 Oct 2017 21:53:38 +0900
schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:

> 2017-10-29 17:07 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
> > Hi Ryusuke,
> >
> > Thank you for reviewing my patch. I will rerun my tests to make sure
> > my conclusions were correct.
> >
> > The log output suggests to me, that you are hitting a totally
> > unrelated bug. I have seen that message before on one of my
> > experimental branches, but I didn't think it could happen with
> > stock NILFS2.
> >
> > My solution was to add the following piece of code to
> > nilfs_vdesc_is_live() in nilfs-utils:
> >
> > if (vdesc->vd_period.p_end == vdesc->vd_cno) {
> >         /*
> >          * This block was overwritten in the same logical segment,
> > but
> >          * in a different partial segment. Probably because of
> >          * fdatasync() or a flush to disk.
> >          * Without this check, gc will cause buffer confliction
> > error
> >          * if both partial segments are cleaned at the same time.
> >          * In that case there will be two vdesc with the same ino,
> >          * cno and offset.
> >          */
> >         return 0;
> > }
> >
> > I can't tell for sure if that will fix your problem. I am just
> > guessing. It isn't necessary with my setup, to reproduce the race
> > condition bug.  
> 
> Thank you, Andreas.  This snippet looks to have fixed the error so
> far.
> 
> Could you shape it to a patch (preferably a separate one) for
> nilfs-utils repo ?

Of course, but I am not sure how this error can occur exactly.
I tried to reproduce it with a small shell script using
fdatasync(), but I wasn't able to.

However, I am pretty sure, that the patch does no harm and is safe.

> I haven't yet succeeded to reproduce the original fs corruption error.
> Will continue to try reproducing the bug waiting for report from you
> and/or Clemens.

I think I forgot to mention, that I use a protection period of 30
seconds. Here is my full nilfs_cleanerd.conf:

protection_period	30
min_clean_segments	10%
max_clean_segments 30%
clean_check_interval	10
nsegments_per_clean 16
mc_nsegments_per_clean 16
cleaning_interval 0.1
mc_cleaning_interval 0.1
retry_interval		60
min_reclaimable_blocks	10%
mc_min_reclaimable_blocks	1%
use_set_suinfo
use_mmap
log_priority		info

I can reproduce the corruption error with the latest kernel 4.14.0, and
I can confirm that it is fixed after applying the patch.
I used an older six core AMD Phenom(tm) II X6 1090T Processor and a
Samsung 850 EVO SSD.

Best regards,

Andreas

> Thanks,
> Ryusuke Konishi
> 
> >
> > I am rerunning my tests now to make sure I didn't make a mistake.
> >
> > Best regards,
> >
> > Andreas
> >
> > Am Sun, 29 Oct 2017 14:09:40 +0900
> > schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> >  
> >> Unfortunately the patch didn't fix this issue:
> >>
> >> [  612.614615] NILFS (sda1): segctord starting. Construction
> >> interval = 5 seconds, CP frequency < 30 seconds
> >> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block:
> >> conflicting data buffer: ino=1155, cno=11, offset=1,
> >> blocknr=1216036, vblocknr=1198436
> >> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read
> >> source blocks
> >>
> >> Look like the bugsimulator hit other bug.
> >>
> >> When this happened, the disk usage was 100%:
> >>
> >> $ df /dev/sda1
> >> Filesystem     1K-blocks     Used Available Use% Mounted on
> >> /dev/sda1      104849404 99655676         0 100% /test
> >>
> >> And, the error killed cleanerd.   However, I could cleanly unmount
> >> the partition, and
> >> I didn't see any errors after remount the partition.
> >>
> >> As you commented, I have changed some parameters
> >> in /etc/nilfs_cleanerd.conf:
> >> -------------------------------------------
> >> mc_nsegments_per_clean  16 nsegments_per_clean     16
> >> max_clean_segments              30%
> >> cleaning_interval       0.1
> >> mc_cleaning_interval    0.1
> >> -------------------------------------------
> >>
> >> I will look into this issue with bugsimulator.
> >>
> >> Anyone can reproduce the original corruption issue in other way?
> >>
> >> Regards,
> >> Ryusuke Konishi
> >>
> >>
> >> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi
> >> <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:  
> >> > Hi,
> >> >
> >> > Sorry for my late reponse.
> >> > I could reproduce a corruption issue with the bug simulator.
> >> >
> >> > -----
> >> > [88805.039545] NILFS (sda1): segctord starting. Construction
> >> > interval = 5 second s, CP frequency < 30 seconds
> >> > [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block:
> >> > conflicting data buff er: ino=10846, cno=92, offset=2,
> >> > blocknr=11425709, vblocknr=11227714 [92817.941936] NILFS (sda1):
> >> > error -17 preparing GC: cannot read source blocks -----
> >> >
> >> > Now I'm testing the patch.   I will send it to upstream after it
> >> > turned out to suppress the issue.
> >> >
> >> > The patch itself looks good to me though I guess there may be
> >> > similar bugs. What makes things complicated seems that inode
> >> > block is not marked as dirty just when "propagate" function
> >> > carries a dirty state from a dirty node/data block.
> >> >
> >> > Anyway, I hasten to apply this patch.
> >> >
> >> > Thanks,
> >> > Ryusuke Konishi
> >> >
> >> >
> >> > 2017-07-13 15:44 GMT+09:00 Andreas Rohner
> >> > <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:  
> >> >> Hi,
> >> >>
> >> >> Am Wed, 12 Jul 2017 09:08:57 +0900
> >> >> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> >> >>  
> >> >>> Hi,
> >> >>>
> >> >>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner
> >> >>> <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:  
> >> >>> >
> >> >>> > There is a race condition between the function
> >> >>> > nilfs_dirty_inode() and nilfs_set_file_dirty().
> >> >>> >
> >> >>> > When a file is opened, nilfs_dirty_inode() is called to
> >> >>> > update the access timestamp in the inode. It calls
> >> >>> > __nilfs_mark_inode_dirty() in a separate transaction.
> >> >>> > __nilfs_mark_inode_dirty() caches the ifile buffer_head in
> >> >>> > the i_bh field of the inode info structure and marks it as
> >> >>> > dirty.
> >> >>> >
> >> >>> > After some data was written to the file in another
> >> >>> > transaction, the function nilfs_set_file_dirty() is called,
> >> >>> > which adds the inode to the ns_dirty_files list.
> >> >>> >
> >> >>> > Then the segment construction calls
> >> >>> > nilfs_segctor_collect_dirty_files(), which goes through the
> >> >>> > ns_dirty_files list and checks the i_bh field. If there is a
> >> >>> > cached buffer_head in i_bh it is not marked as dirty again.
> >> >>> >
> >> >>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use
> >> >>> > separate transactions, it is possible that a segment
> >> >>> > construction that writes out the ifile occurs in-between the
> >> >>> > two. If this happens the inode is not on the ns_dirty_files
> >> >>> > list, but its ifile block is still marked as dirty and
> >> >>> > written out.
> >> >>> >
> >> >>> > In the next segment construction, the data for the file is
> >> >>> > written out and nilfs_bmap_propagate() updates the b-tree.
> >> >>> > Eventually the bmap root is written into the i_bh block,
> >> >>> > which is not dirty, because it was written out in another
> >> >>> > segment construction.
> >> >>> >
> >> >>> > As a result the bmap update can be lost, which leads to file
> >> >>> > system corruption. Either the virtual block address points
> >> >>> > to an unallocated DAT block, or the DAT entry will be reused
> >> >>> > for something different.
> >> >>> >
> >> >>> > The error can remain undetected for a long time. A typical
> >> >>> > error message would be one of the "bad btree" errors or a
> >> >>> > warning that a DAT entry could not be found.
> >> >>> >
> >> >>> > This bug can be reproduced reliably by a simple benchmark
> >> >>> > that creates and overwrites millions of 4k files.
> >> >>> >
> >> >>> > Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>  
> >> >>>
> >> >>> Will review, please wait for a while.  
> >> >>
> >> >> Thank you, I have uploaded my benchmark on GitHub:
> >> >>
> >> >> https://github.com/zeitgeist87/bugsimulator
> >> >>
> >> >> The benchmark is very simple. It writes out 13107200 4k files
> >> >> in a two layer directory tree, so that every directory contains
> >> >> about 1000 files. Then it overwrites these files in an infinite
> >> >> loop.
> >> >>
> >> >> You need quite aggressive cleaner settings for it to keep up
> >> >> with the bugsimulator:
> >> >>
> >> >> mc_nsegments_per_clean  16
> >> >> nsegments_per_clean     16
> >> >> max_clean_segments              30%
> >> >> cleaning_interval       0.1
> >> >> mc_cleaning_interval    0.1
> >> >>
> >> >> If you run this benchmark on a fresh 100GB NILFS2 partition with
> >> >> the above cleaner settings, the file system will be corrupt in
> >> >> about half an hour. At least on my machine.
> >> >>
> >> >> With the patch for the race condition, it can run indefinetly.
> >> >>
> >> >> Regards,
> >> >> Andreas
> >> >>  
> >> >>> Regards,
> >> >>> Ryusuke Konishi
> >> >>>  
> >> >>> >
> >> >>> > ---
> >> >>> >  fs/nilfs2/segment.c | 6 ++++--
> >> >>> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >> >>> >
> >> >>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> >> >>> > index 70ded52dc1dd..50e12956c737 100644
> >> >>> > --- a/fs/nilfs2/segment.c
> >> >>> > +++ b/fs/nilfs2/segment.c
> >> >>> > @@ -1958,8 +1958,6 @@ static int
> >> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
> >> >>> > err, ii->vfs_inode.i_ino); return err;
> >> >>> >                         }
> >> >>> > -                       mark_buffer_dirty(ibh);
> >> >>> > -                       nilfs_mdt_mark_dirty(ifile);
> >> >>> >                         spin_lock(&nilfs->ns_inode_lock);
> >> >>> >                         if (likely(!ii->i_bh))
> >> >>> >                                 ii->i_bh = ibh;
> >> >>> > @@ -1968,6 +1966,10 @@ static int
> >> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
> >> >>> > goto retry; }
> >> >>> >
> >> >>> > +               // Always redirty the buffer to avoid race
> >> >>> > condition
> >> >>> > +               mark_buffer_dirty(ii->i_bh);
> >> >>> > +               nilfs_mdt_mark_dirty(ifile);
> >> >>> > +
> >> >>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
> >> >>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
> >> >>> >                 list_move_tail(&ii->i_dirty,
> >> >>> > &sci->sc_dirty_files); --
> >> >>> > 2.13.2
> >> >>> >
> >> >>> > --
> >> >>> > To unsubscribe from this list: send the line "unsubscribe
> >> >>> > linux-nilfs" in the body of a message to
> >> >>> > majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
> >> >>> > http://vger.kernel.org/majordomo-info.html  
> >> >>> --
> >> >>> To unsubscribe from this list: send the line "unsubscribe
> >> >>> linux-nilfs" in the body of a message to
> >> >>> majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
> >> >>> http://vger.kernel.org/majordomo-info.html  
> >> >>  
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe
> >> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> More majordomo info at
> >> http://vger.kernel.org/majordomo-info.html  
> >  
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
  2017-10-29 14:06                       ` Andreas Rohner
@ 2017-10-29 16:29                         ` Ryusuke Konishi
       [not found]                           ` <CAKFNMonWJtOHxaHQO8SNPGOUx6H=ghp7EQ+L1s3nbjMZt-Zr_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Ryusuke Konishi @ 2017-10-29 16:29 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs, Clemens Eisserer

>> I haven't yet succeeded to reproduce the original fs corruption error.
>> Will continue to try reproducing the bug waiting for report from you
>> and/or Clemens.
>
> I think I forgot to mention, that I use a protection period of 30
> seconds. Here is my full nilfs_cleanerd.conf:

Thanks.   This is likely to affect the test.

Regards,
Ryusuke Konishi

2017-10-29 23:06 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
> Am Sun, 29 Oct 2017 21:53:38 +0900
> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>
>> 2017-10-29 17:07 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>> > Hi Ryusuke,
>> >
>> > Thank you for reviewing my patch. I will rerun my tests to make sure
>> > my conclusions were correct.
>> >
>> > The log output suggests to me, that you are hitting a totally
>> > unrelated bug. I have seen that message before on one of my
>> > experimental branches, but I didn't think it could happen with
>> > stock NILFS2.
>> >
>> > My solution was to add the following piece of code to
>> > nilfs_vdesc_is_live() in nilfs-utils:
>> >
>> > if (vdesc->vd_period.p_end == vdesc->vd_cno) {
>> >         /*
>> >          * This block was overwritten in the same logical segment,
>> > but
>> >          * in a different partial segment. Probably because of
>> >          * fdatasync() or a flush to disk.
>> >          * Without this check, gc will cause buffer confliction
>> > error
>> >          * if both partial segments are cleaned at the same time.
>> >          * In that case there will be two vdesc with the same ino,
>> >          * cno and offset.
>> >          */
>> >         return 0;
>> > }
>> >
>> > I can't tell for sure if that will fix your problem. I am just
>> > guessing. It isn't necessary with my setup, to reproduce the race
>> > condition bug.
>>
>> Thank you, Andreas.  This snippet looks to have fixed the error so
>> far.
>>
>> Could you shape it to a patch (preferably a separate one) for
>> nilfs-utils repo ?
>
> Of course, but I am not sure how this error can occur exactly.
> I tried to reproduce it with a small shell script using
> fdatasync(), but I wasn't able to.
>
> However, I am pretty sure, that the patch does no harm and is safe.
>
>> I haven't yet succeeded to reproduce the original fs corruption error.
>> Will continue to try reproducing the bug waiting for report from you
>> and/or Clemens.
>
> I think I forgot to mention, that I use a protection period of 30
> seconds. Here is my full nilfs_cleanerd.conf:
>
> protection_period       30
> min_clean_segments      10%
> max_clean_segments 30%
> clean_check_interval    10
> nsegments_per_clean 16
> mc_nsegments_per_clean 16
> cleaning_interval 0.1
> mc_cleaning_interval 0.1
> retry_interval          60
> min_reclaimable_blocks  10%
> mc_min_reclaimable_blocks       1%
> use_set_suinfo
> use_mmap
> log_priority            info
>
> I can reproduce the corruption error with the latest kernel 4.14.0, and
> I can confirm that it is fixed after applying the patch.
> I used an older six core AMD Phenom(tm) II X6 1090T Processor and a
> Samsung 850 EVO SSD.
>
> Best regards,
>
> Andreas
>
>> Thanks,
>> Ryusuke Konishi
>>
>> >
>> > I am rerunning my tests now to make sure I didn't make a mistake.
>> >
>> > Best regards,
>> >
>> > Andreas
>> >
>> > Am Sun, 29 Oct 2017 14:09:40 +0900
>> > schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>> >
>> >> Unfortunately the patch didn't fix this issue:
>> >>
>> >> [  612.614615] NILFS (sda1): segctord starting. Construction
>> >> interval = 5 seconds, CP frequency < 30 seconds
>> >> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block:
>> >> conflicting data buffer: ino=1155, cno=11, offset=1,
>> >> blocknr=1216036, vblocknr=1198436
>> >> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read
>> >> source blocks
>> >>
>> >> Look like the bugsimulator hit other bug.
>> >>
>> >> When this happened, the disk usage was 100%:
>> >>
>> >> $ df /dev/sda1
>> >> Filesystem     1K-blocks     Used Available Use% Mounted on
>> >> /dev/sda1      104849404 99655676         0 100% /test
>> >>
>> >> And, the error killed cleanerd.   However, I could cleanly unmount
>> >> the partition, and
>> >> I didn't see any errors after remount the partition.
>> >>
>> >> As you commented, I have changed some parameters
>> >> in /etc/nilfs_cleanerd.conf:
>> >> -------------------------------------------
>> >> mc_nsegments_per_clean  16 nsegments_per_clean     16
>> >> max_clean_segments              30%
>> >> cleaning_interval       0.1
>> >> mc_cleaning_interval    0.1
>> >> -------------------------------------------
>> >>
>> >> I will look into this issue with bugsimulator.
>> >>
>> >> Anyone can reproduce the original corruption issue in other way?
>> >>
>> >> Regards,
>> >> Ryusuke Konishi
>> >>
>> >>
>> >> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi
>> >> <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>> >> > Hi,
>> >> >
>> >> > Sorry for my late reponse.
>> >> > I could reproduce a corruption issue with the bug simulator.
>> >> >
>> >> > -----
>> >> > [88805.039545] NILFS (sda1): segctord starting. Construction
>> >> > interval = 5 second s, CP frequency < 30 seconds
>> >> > [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block:
>> >> > conflicting data buff er: ino=10846, cno=92, offset=2,
>> >> > blocknr=11425709, vblocknr=11227714 [92817.941936] NILFS (sda1):
>> >> > error -17 preparing GC: cannot read source blocks -----
>> >> >
>> >> > Now I'm testing the patch.   I will send it to upstream after it
>> >> > turned out to suppress the issue.
>> >> >
>> >> > The patch itself looks good to me though I guess there may be
>> >> > similar bugs. What makes things complicated seems that inode
>> >> > block is not marked as dirty just when "propagate" function
>> >> > carries a dirty state from a dirty node/data block.
>> >> >
>> >> > Anyway, I hasten to apply this patch.
>> >> >
>> >> > Thanks,
>> >> > Ryusuke Konishi
>> >> >
>> >> >
>> >> > 2017-07-13 15:44 GMT+09:00 Andreas Rohner
>> >> > <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>> >> >> Hi,
>> >> >>
>> >> >> Am Wed, 12 Jul 2017 09:08:57 +0900
>> >> >> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>> >> >>
>> >> >>> Hi,
>> >> >>>
>> >> >>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner
>> >> >>> <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>> >> >>> >
>> >> >>> > There is a race condition between the function
>> >> >>> > nilfs_dirty_inode() and nilfs_set_file_dirty().
>> >> >>> >
>> >> >>> > When a file is opened, nilfs_dirty_inode() is called to
>> >> >>> > update the access timestamp in the inode. It calls
>> >> >>> > __nilfs_mark_inode_dirty() in a separate transaction.
>> >> >>> > __nilfs_mark_inode_dirty() caches the ifile buffer_head in
>> >> >>> > the i_bh field of the inode info structure and marks it as
>> >> >>> > dirty.
>> >> >>> >
>> >> >>> > After some data was written to the file in another
>> >> >>> > transaction, the function nilfs_set_file_dirty() is called,
>> >> >>> > which adds the inode to the ns_dirty_files list.
>> >> >>> >
>> >> >>> > Then the segment construction calls
>> >> >>> > nilfs_segctor_collect_dirty_files(), which goes through the
>> >> >>> > ns_dirty_files list and checks the i_bh field. If there is a
>> >> >>> > cached buffer_head in i_bh it is not marked as dirty again.
>> >> >>> >
>> >> >>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use
>> >> >>> > separate transactions, it is possible that a segment
>> >> >>> > construction that writes out the ifile occurs in-between the
>> >> >>> > two. If this happens the inode is not on the ns_dirty_files
>> >> >>> > list, but its ifile block is still marked as dirty and
>> >> >>> > written out.
>> >> >>> >
>> >> >>> > In the next segment construction, the data for the file is
>> >> >>> > written out and nilfs_bmap_propagate() updates the b-tree.
>> >> >>> > Eventually the bmap root is written into the i_bh block,
>> >> >>> > which is not dirty, because it was written out in another
>> >> >>> > segment construction.
>> >> >>> >
>> >> >>> > As a result the bmap update can be lost, which leads to file
>> >> >>> > system corruption. Either the virtual block address points
>> >> >>> > to an unallocated DAT block, or the DAT entry will be reused
>> >> >>> > for something different.
>> >> >>> >
>> >> >>> > The error can remain undetected for a long time. A typical
>> >> >>> > error message would be one of the "bad btree" errors or a
>> >> >>> > warning that a DAT entry could not be found.
>> >> >>> >
>> >> >>> > This bug can be reproduced reliably by a simple benchmark
>> >> >>> > that creates and overwrites millions of 4k files.
>> >> >>> >
>> >> >>> > Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> >> >>>
>> >> >>> Will review, please wait for a while.
>> >> >>
>> >> >> Thank you, I have uploaded my benchmark on GitHub:
>> >> >>
>> >> >> https://github.com/zeitgeist87/bugsimulator
>> >> >>
>> >> >> The benchmark is very simple. It writes out 13107200 4k files
>> >> >> in a two layer directory tree, so that every directory contains
>> >> >> about 1000 files. Then it overwrites these files in an infinite
>> >> >> loop.
>> >> >>
>> >> >> You need quite aggressive cleaner settings for it to keep up
>> >> >> with the bugsimulator:
>> >> >>
>> >> >> mc_nsegments_per_clean  16
>> >> >> nsegments_per_clean     16
>> >> >> max_clean_segments              30%
>> >> >> cleaning_interval       0.1
>> >> >> mc_cleaning_interval    0.1
>> >> >>
>> >> >> If you run this benchmark on a fresh 100GB NILFS2 partition with
>> >> >> the above cleaner settings, the file system will be corrupt in
>> >> >> about half an hour. At least on my machine.
>> >> >>
>> >> >> With the patch for the race condition, it can run indefinetly.
>> >> >>
>> >> >> Regards,
>> >> >> Andreas
>> >> >>
>> >> >>> Regards,
>> >> >>> Ryusuke Konishi
>> >> >>>
>> >> >>> >
>> >> >>> > ---
>> >> >>> >  fs/nilfs2/segment.c | 6 ++++--
>> >> >>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>> >> >>> >
>> >> >>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> >> >>> > index 70ded52dc1dd..50e12956c737 100644
>> >> >>> > --- a/fs/nilfs2/segment.c
>> >> >>> > +++ b/fs/nilfs2/segment.c
>> >> >>> > @@ -1958,8 +1958,6 @@ static int
>> >> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>> >> >>> > err, ii->vfs_inode.i_ino); return err;
>> >> >>> >                         }
>> >> >>> > -                       mark_buffer_dirty(ibh);
>> >> >>> > -                       nilfs_mdt_mark_dirty(ifile);
>> >> >>> >                         spin_lock(&nilfs->ns_inode_lock);
>> >> >>> >                         if (likely(!ii->i_bh))
>> >> >>> >                                 ii->i_bh = ibh;
>> >> >>> > @@ -1968,6 +1966,10 @@ static int
>> >> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>> >> >>> > goto retry; }
>> >> >>> >
>> >> >>> > +               // Always redirty the buffer to avoid race
>> >> >>> > condition
>> >> >>> > +               mark_buffer_dirty(ii->i_bh);
>> >> >>> > +               nilfs_mdt_mark_dirty(ifile);
>> >> >>> > +
>> >> >>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
>> >> >>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
>> >> >>> >                 list_move_tail(&ii->i_dirty,
>> >> >>> > &sci->sc_dirty_files); --
>> >> >>> > 2.13.2
>> >> >>> >
>> >> >>> > --
>> >> >>> > To unsubscribe from this list: send the line "unsubscribe
>> >> >>> > linux-nilfs" in the body of a message to
>> >> >>> > majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
>> >> >>> > http://vger.kernel.org/majordomo-info.html
>> >> >>> --
>> >> >>> To unsubscribe from this list: send the line "unsubscribe
>> >> >>> linux-nilfs" in the body of a message to
>> >> >>> majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
>> >> >>> http://vger.kernel.org/majordomo-info.html
>> >> >>
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe
>> >> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >> More majordomo info at
>> >> http://vger.kernel.org/majordomo-info.html
>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe
>> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
       [not found]                           ` <CAKFNMonWJtOHxaHQO8SNPGOUx6H=ghp7EQ+L1s3nbjMZt-Zr_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-29 22:19                             ` Ryusuke Konishi
       [not found]                               ` <CAKFNMo=NnxYgxXhBASBY7QVO9phf=p3RgXCpCb6qWR9N99YedA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Ryusuke Konishi @ 2017-10-29 22:19 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs, Clemens Eisserer

I could reproduce the issue in two and a half hours.
Will proceed verification of the patch later.

Regards,
Ryusuke Konishi

2017-10-30 1:29 GMT+09:00 Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>>> I haven't yet succeeded to reproduce the original fs corruption error.
>>> Will continue to try reproducing the bug waiting for report from you
>>> and/or Clemens.
>>
>> I think I forgot to mention, that I use a protection period of 30
>> seconds. Here is my full nilfs_cleanerd.conf:
>
> Thanks.   This is likely to affect the test.
>
> Regards,
> Ryusuke Konishi
>
> 2017-10-29 23:06 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>> Am Sun, 29 Oct 2017 21:53:38 +0900
>> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>>
>>> 2017-10-29 17:07 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>>> > Hi Ryusuke,
>>> >
>>> > Thank you for reviewing my patch. I will rerun my tests to make sure
>>> > my conclusions were correct.
>>> >
>>> > The log output suggests to me, that you are hitting a totally
>>> > unrelated bug. I have seen that message before on one of my
>>> > experimental branches, but I didn't think it could happen with
>>> > stock NILFS2.
>>> >
>>> > My solution was to add the following piece of code to
>>> > nilfs_vdesc_is_live() in nilfs-utils:
>>> >
>>> > if (vdesc->vd_period.p_end == vdesc->vd_cno) {
>>> >         /*
>>> >          * This block was overwritten in the same logical segment,
>>> > but
>>> >          * in a different partial segment. Probably because of
>>> >          * fdatasync() or a flush to disk.
>>> >          * Without this check, gc will cause buffer confliction
>>> > error
>>> >          * if both partial segments are cleaned at the same time.
>>> >          * In that case there will be two vdesc with the same ino,
>>> >          * cno and offset.
>>> >          */
>>> >         return 0;
>>> > }
>>> >
>>> > I can't tell for sure if that will fix your problem. I am just
>>> > guessing. It isn't necessary with my setup, to reproduce the race
>>> > condition bug.
>>>
>>> Thank you, Andreas.  This snippet looks to have fixed the error so
>>> far.
>>>
>>> Could you shape it to a patch (preferably a separate one) for
>>> nilfs-utils repo ?
>>
>> Of course, but I am not sure how this error can occur exactly.
>> I tried to reproduce it with a small shell script using
>> fdatasync(), but I wasn't able to.
>>
>> However, I am pretty sure, that the patch does no harm and is safe.
>>
>>> I haven't yet succeeded to reproduce the original fs corruption error.
>>> Will continue to try reproducing the bug waiting for report from you
>>> and/or Clemens.
>>
>> I think I forgot to mention, that I use a protection period of 30
>> seconds. Here is my full nilfs_cleanerd.conf:
>>
>> protection_period       30
>> min_clean_segments      10%
>> max_clean_segments 30%
>> clean_check_interval    10
>> nsegments_per_clean 16
>> mc_nsegments_per_clean 16
>> cleaning_interval 0.1
>> mc_cleaning_interval 0.1
>> retry_interval          60
>> min_reclaimable_blocks  10%
>> mc_min_reclaimable_blocks       1%
>> use_set_suinfo
>> use_mmap
>> log_priority            info
>>
>> I can reproduce the corruption error with the latest kernel 4.14.0, and
>> I can confirm that it is fixed after applying the patch.
>> I used an older six core AMD Phenom(tm) II X6 1090T Processor and a
>> Samsung 850 EVO SSD.
>>
>> Best regards,
>>
>> Andreas
>>
>>> Thanks,
>>> Ryusuke Konishi
>>>
>>> >
>>> > I am rerunning my tests now to make sure I didn't make a mistake.
>>> >
>>> > Best regards,
>>> >
>>> > Andreas
>>> >
>>> > Am Sun, 29 Oct 2017 14:09:40 +0900
>>> > schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>>> >
>>> >> Unfortunately the patch didn't fix this issue:
>>> >>
>>> >> [  612.614615] NILFS (sda1): segctord starting. Construction
>>> >> interval = 5 seconds, CP frequency < 30 seconds
>>> >> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block:
>>> >> conflicting data buffer: ino=1155, cno=11, offset=1,
>>> >> blocknr=1216036, vblocknr=1198436
>>> >> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read
>>> >> source blocks
>>> >>
>>> >> Look like the bugsimulator hit other bug.
>>> >>
>>> >> When this happened, the disk usage was 100%:
>>> >>
>>> >> $ df /dev/sda1
>>> >> Filesystem     1K-blocks     Used Available Use% Mounted on
>>> >> /dev/sda1      104849404 99655676         0 100% /test
>>> >>
>>> >> And, the error killed cleanerd.   However, I could cleanly unmount
>>> >> the partition, and
>>> >> I didn't see any errors after remount the partition.
>>> >>
>>> >> As you commented, I have changed some parameters
>>> >> in /etc/nilfs_cleanerd.conf:
>>> >> -------------------------------------------
>>> >> mc_nsegments_per_clean  16 nsegments_per_clean     16
>>> >> max_clean_segments              30%
>>> >> cleaning_interval       0.1
>>> >> mc_cleaning_interval    0.1
>>> >> -------------------------------------------
>>> >>
>>> >> I will look into this issue with bugsimulator.
>>> >>
>>> >> Anyone can reproduce the original corruption issue in other way?
>>> >>
>>> >> Regards,
>>> >> Ryusuke Konishi
>>> >>
>>> >>
>>> >> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi
>>> >> <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>>> >> > Hi,
>>> >> >
>>> >> > Sorry for my late reponse.
>>> >> > I could reproduce a corruption issue with the bug simulator.
>>> >> >
>>> >> > -----
>>> >> > [88805.039545] NILFS (sda1): segctord starting. Construction
>>> >> > interval = 5 second s, CP frequency < 30 seconds
>>> >> > [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block:
>>> >> > conflicting data buff er: ino=10846, cno=92, offset=2,
>>> >> > blocknr=11425709, vblocknr=11227714 [92817.941936] NILFS (sda1):
>>> >> > error -17 preparing GC: cannot read source blocks -----
>>> >> >
>>> >> > Now I'm testing the patch.   I will send it to upstream after it
>>> >> > turned out to suppress the issue.
>>> >> >
>>> >> > The patch itself looks good to me though I guess there may be
>>> >> > similar bugs. What makes things complicated seems that inode
>>> >> > block is not marked as dirty just when "propagate" function
>>> >> > carries a dirty state from a dirty node/data block.
>>> >> >
>>> >> > Anyway, I hasten to apply this patch.
>>> >> >
>>> >> > Thanks,
>>> >> > Ryusuke Konishi
>>> >> >
>>> >> >
>>> >> > 2017-07-13 15:44 GMT+09:00 Andreas Rohner
>>> >> > <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>>> >> >> Hi,
>>> >> >>
>>> >> >> Am Wed, 12 Jul 2017 09:08:57 +0900
>>> >> >> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>>> >> >>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner
>>> >> >>> <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>>> >> >>> >
>>> >> >>> > There is a race condition between the function
>>> >> >>> > nilfs_dirty_inode() and nilfs_set_file_dirty().
>>> >> >>> >
>>> >> >>> > When a file is opened, nilfs_dirty_inode() is called to
>>> >> >>> > update the access timestamp in the inode. It calls
>>> >> >>> > __nilfs_mark_inode_dirty() in a separate transaction.
>>> >> >>> > __nilfs_mark_inode_dirty() caches the ifile buffer_head in
>>> >> >>> > the i_bh field of the inode info structure and marks it as
>>> >> >>> > dirty.
>>> >> >>> >
>>> >> >>> > After some data was written to the file in another
>>> >> >>> > transaction, the function nilfs_set_file_dirty() is called,
>>> >> >>> > which adds the inode to the ns_dirty_files list.
>>> >> >>> >
>>> >> >>> > Then the segment construction calls
>>> >> >>> > nilfs_segctor_collect_dirty_files(), which goes through the
>>> >> >>> > ns_dirty_files list and checks the i_bh field. If there is a
>>> >> >>> > cached buffer_head in i_bh it is not marked as dirty again.
>>> >> >>> >
>>> >> >>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use
>>> >> >>> > separate transactions, it is possible that a segment
>>> >> >>> > construction that writes out the ifile occurs in-between the
>>> >> >>> > two. If this happens the inode is not on the ns_dirty_files
>>> >> >>> > list, but its ifile block is still marked as dirty and
>>> >> >>> > written out.
>>> >> >>> >
>>> >> >>> > In the next segment construction, the data for the file is
>>> >> >>> > written out and nilfs_bmap_propagate() updates the b-tree.
>>> >> >>> > Eventually the bmap root is written into the i_bh block,
>>> >> >>> > which is not dirty, because it was written out in another
>>> >> >>> > segment construction.
>>> >> >>> >
>>> >> >>> > As a result the bmap update can be lost, which leads to file
>>> >> >>> > system corruption. Either the virtual block address points
>>> >> >>> > to an unallocated DAT block, or the DAT entry will be reused
>>> >> >>> > for something different.
>>> >> >>> >
>>> >> >>> > The error can remain undetected for a long time. A typical
>>> >> >>> > error message would be one of the "bad btree" errors or a
>>> >> >>> > warning that a DAT entry could not be found.
>>> >> >>> >
>>> >> >>> > This bug can be reproduced reliably by a simple benchmark
>>> >> >>> > that creates and overwrites millions of 4k files.
>>> >> >>> >
>>> >> >>> > Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>>> >> >>>
>>> >> >>> Will review, please wait for a while.
>>> >> >>
>>> >> >> Thank you, I have uploaded my benchmark on GitHub:
>>> >> >>
>>> >> >> https://github.com/zeitgeist87/bugsimulator
>>> >> >>
>>> >> >> The benchmark is very simple. It writes out 13107200 4k files
>>> >> >> in a two layer directory tree, so that every directory contains
>>> >> >> about 1000 files. Then it overwrites these files in an infinite
>>> >> >> loop.
>>> >> >>
>>> >> >> You need quite aggressive cleaner settings for it to keep up
>>> >> >> with the bugsimulator:
>>> >> >>
>>> >> >> mc_nsegments_per_clean  16
>>> >> >> nsegments_per_clean     16
>>> >> >> max_clean_segments              30%
>>> >> >> cleaning_interval       0.1
>>> >> >> mc_cleaning_interval    0.1
>>> >> >>
>>> >> >> If you run this benchmark on a fresh 100GB NILFS2 partition with
>>> >> >> the above cleaner settings, the file system will be corrupt in
>>> >> >> about half an hour. At least on my machine.
>>> >> >>
>>> >> >> With the patch for the race condition, it can run indefinetly.
>>> >> >>
>>> >> >> Regards,
>>> >> >> Andreas
>>> >> >>
>>> >> >>> Regards,
>>> >> >>> Ryusuke Konishi
>>> >> >>>
>>> >> >>> >
>>> >> >>> > ---
>>> >> >>> >  fs/nilfs2/segment.c | 6 ++++--
>>> >> >>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>>> >> >>> >
>>> >> >>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>>> >> >>> > index 70ded52dc1dd..50e12956c737 100644
>>> >> >>> > --- a/fs/nilfs2/segment.c
>>> >> >>> > +++ b/fs/nilfs2/segment.c
>>> >> >>> > @@ -1958,8 +1958,6 @@ static int
>>> >> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>> >> >>> > err, ii->vfs_inode.i_ino); return err;
>>> >> >>> >                         }
>>> >> >>> > -                       mark_buffer_dirty(ibh);
>>> >> >>> > -                       nilfs_mdt_mark_dirty(ifile);
>>> >> >>> >                         spin_lock(&nilfs->ns_inode_lock);
>>> >> >>> >                         if (likely(!ii->i_bh))
>>> >> >>> >                                 ii->i_bh = ibh;
>>> >> >>> > @@ -1968,6 +1966,10 @@ static int
>>> >> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>> >> >>> > goto retry; }
>>> >> >>> >
>>> >> >>> > +               // Always redirty the buffer to avoid race
>>> >> >>> > condition
>>> >> >>> > +               mark_buffer_dirty(ii->i_bh);
>>> >> >>> > +               nilfs_mdt_mark_dirty(ifile);
>>> >> >>> > +
>>> >> >>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
>>> >> >>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
>>> >> >>> >                 list_move_tail(&ii->i_dirty,
>>> >> >>> > &sci->sc_dirty_files); --
>>> >> >>> > 2.13.2
>>> >> >>> >
>>> >> >>> > --
>>> >> >>> > To unsubscribe from this list: send the line "unsubscribe
>>> >> >>> > linux-nilfs" in the body of a message to
>>> >> >>> > majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
>>> >> >>> > http://vger.kernel.org/majordomo-info.html
>>> >> >>> --
>>> >> >>> To unsubscribe from this list: send the line "unsubscribe
>>> >> >>> linux-nilfs" in the body of a message to
>>> >> >>> majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
>>> >> >>> http://vger.kernel.org/majordomo-info.html
>>> >> >>
>>> >> --
>>> >> To unsubscribe from this list: send the line "unsubscribe
>>> >> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> >> More majordomo info at
>>> >> http://vger.kernel.org/majordomo-info.html
>>> >
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe
>>> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] nilfs2: Fix race condition that causes file system corruption
       [not found]                               ` <CAKFNMo=NnxYgxXhBASBY7QVO9phf=p3RgXCpCb6qWR9N99YedA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-30 11:25                                 ` Ryusuke Konishi
  0 siblings, 0 replies; 17+ messages in thread
From: Ryusuke Konishi @ 2017-10-30 11:25 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs, Clemens Eisserer

Hi,

The patch by Andreas seems to have fixed the corruption issue
that I could reproduced yesterday.   In the stress test at my
environment,  it prevented fs corruption more than 9 hours.

Therefore, I applied the patch to nilfs repo with Tested-by tags.

Thanks,
Ryusuke Konishi

2017-10-30 7:19 GMT+09:00 Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> I could reproduce the issue in two and a half hours.
> Will proceed verification of the patch later.
>
> Regards,
> Ryusuke Konishi
>
> 2017-10-30 1:29 GMT+09:00 Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>>>> I haven't yet succeeded to reproduce the original fs corruption error.
>>>> Will continue to try reproducing the bug waiting for report from you
>>>> and/or Clemens.
>>>
>>> I think I forgot to mention, that I use a protection period of 30
>>> seconds. Here is my full nilfs_cleanerd.conf:
>>
>> Thanks.   This is likely to affect the test.
>>
>> Regards,
>> Ryusuke Konishi
>>
>> 2017-10-29 23:06 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>>> Am Sun, 29 Oct 2017 21:53:38 +0900
>>> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>>>
>>>> 2017-10-29 17:07 GMT+09:00 Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>>>> > Hi Ryusuke,
>>>> >
>>>> > Thank you for reviewing my patch. I will rerun my tests to make sure
>>>> > my conclusions were correct.
>>>> >
>>>> > The log output suggests to me, that you are hitting a totally
>>>> > unrelated bug. I have seen that message before on one of my
>>>> > experimental branches, but I didn't think it could happen with
>>>> > stock NILFS2.
>>>> >
>>>> > My solution was to add the following piece of code to
>>>> > nilfs_vdesc_is_live() in nilfs-utils:
>>>> >
>>>> > if (vdesc->vd_period.p_end == vdesc->vd_cno) {
>>>> >         /*
>>>> >          * This block was overwritten in the same logical segment,
>>>> > but
>>>> >          * in a different partial segment. Probably because of
>>>> >          * fdatasync() or a flush to disk.
>>>> >          * Without this check, gc will cause buffer confliction
>>>> > error
>>>> >          * if both partial segments are cleaned at the same time.
>>>> >          * In that case there will be two vdesc with the same ino,
>>>> >          * cno and offset.
>>>> >          */
>>>> >         return 0;
>>>> > }
>>>> >
>>>> > I can't tell for sure if that will fix your problem. I am just
>>>> > guessing. It isn't necessary with my setup, to reproduce the race
>>>> > condition bug.
>>>>
>>>> Thank you, Andreas.  This snippet looks to have fixed the error so
>>>> far.
>>>>
>>>> Could you shape it to a patch (preferably a separate one) for
>>>> nilfs-utils repo ?
>>>
>>> Of course, but I am not sure how this error can occur exactly.
>>> I tried to reproduce it with a small shell script using
>>> fdatasync(), but I wasn't able to.
>>>
>>> However, I am pretty sure, that the patch does no harm and is safe.
>>>
>>>> I haven't yet succeeded to reproduce the original fs corruption error.
>>>> Will continue to try reproducing the bug waiting for report from you
>>>> and/or Clemens.
>>>
>>> I think I forgot to mention, that I use a protection period of 30
>>> seconds. Here is my full nilfs_cleanerd.conf:
>>>
>>> protection_period       30
>>> min_clean_segments      10%
>>> max_clean_segments 30%
>>> clean_check_interval    10
>>> nsegments_per_clean 16
>>> mc_nsegments_per_clean 16
>>> cleaning_interval 0.1
>>> mc_cleaning_interval 0.1
>>> retry_interval          60
>>> min_reclaimable_blocks  10%
>>> mc_min_reclaimable_blocks       1%
>>> use_set_suinfo
>>> use_mmap
>>> log_priority            info
>>>
>>> I can reproduce the corruption error with the latest kernel 4.14.0, and
>>> I can confirm that it is fixed after applying the patch.
>>> I used an older six core AMD Phenom(tm) II X6 1090T Processor and a
>>> Samsung 850 EVO SSD.
>>>
>>> Best regards,
>>>
>>> Andreas
>>>
>>>> Thanks,
>>>> Ryusuke Konishi
>>>>
>>>> >
>>>> > I am rerunning my tests now to make sure I didn't make a mistake.
>>>> >
>>>> > Best regards,
>>>> >
>>>> > Andreas
>>>> >
>>>> > Am Sun, 29 Oct 2017 14:09:40 +0900
>>>> > schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>>>> >
>>>> >> Unfortunately the patch didn't fix this issue:
>>>> >>
>>>> >> [  612.614615] NILFS (sda1): segctord starting. Construction
>>>> >> interval = 5 seconds, CP frequency < 30 seconds
>>>> >> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block:
>>>> >> conflicting data buffer: ino=1155, cno=11, offset=1,
>>>> >> blocknr=1216036, vblocknr=1198436
>>>> >> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read
>>>> >> source blocks
>>>> >>
>>>> >> Look like the bugsimulator hit other bug.
>>>> >>
>>>> >> When this happened, the disk usage was 100%:
>>>> >>
>>>> >> $ df /dev/sda1
>>>> >> Filesystem     1K-blocks     Used Available Use% Mounted on
>>>> >> /dev/sda1      104849404 99655676         0 100% /test
>>>> >>
>>>> >> And, the error killed cleanerd.   However, I could cleanly unmount
>>>> >> the partition, and
>>>> >> I didn't see any errors after remount the partition.
>>>> >>
>>>> >> As you commented, I have changed some parameters
>>>> >> in /etc/nilfs_cleanerd.conf:
>>>> >> -------------------------------------------
>>>> >> mc_nsegments_per_clean  16 nsegments_per_clean     16
>>>> >> max_clean_segments              30%
>>>> >> cleaning_interval       0.1
>>>> >> mc_cleaning_interval    0.1
>>>> >> -------------------------------------------
>>>> >>
>>>> >> I will look into this issue with bugsimulator.
>>>> >>
>>>> >> Anyone can reproduce the original corruption issue in other way?
>>>> >>
>>>> >> Regards,
>>>> >> Ryusuke Konishi
>>>> >>
>>>> >>
>>>> >> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi
>>>> >> <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>>>> >> > Hi,
>>>> >> >
>>>> >> > Sorry for my late reponse.
>>>> >> > I could reproduce a corruption issue with the bug simulator.
>>>> >> >
>>>> >> > -----
>>>> >> > [88805.039545] NILFS (sda1): segctord starting. Construction
>>>> >> > interval = 5 second s, CP frequency < 30 seconds
>>>> >> > [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block:
>>>> >> > conflicting data buff er: ino=10846, cno=92, offset=2,
>>>> >> > blocknr=11425709, vblocknr=11227714 [92817.941936] NILFS (sda1):
>>>> >> > error -17 preparing GC: cannot read source blocks -----
>>>> >> >
>>>> >> > Now I'm testing the patch.   I will send it to upstream after it
>>>> >> > turned out to suppress the issue.
>>>> >> >
>>>> >> > The patch itself looks good to me though I guess there may be
>>>> >> > similar bugs. What makes things complicated seems that inode
>>>> >> > block is not marked as dirty just when "propagate" function
>>>> >> > carries a dirty state from a dirty node/data block.
>>>> >> >
>>>> >> > Anyway, I hasten to apply this patch.
>>>> >> >
>>>> >> > Thanks,
>>>> >> > Ryusuke Konishi
>>>> >> >
>>>> >> >
>>>> >> > 2017-07-13 15:44 GMT+09:00 Andreas Rohner
>>>> >> > <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>>>> >> >> Hi,
>>>> >> >>
>>>> >> >> Am Wed, 12 Jul 2017 09:08:57 +0900
>>>> >> >> schrieb Ryusuke Konishi <konishi.ryusuke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
>>>> >> >>
>>>> >> >>> Hi,
>>>> >> >>>
>>>> >> >>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner
>>>> >> >>> <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>:
>>>> >> >>> >
>>>> >> >>> > There is a race condition between the function
>>>> >> >>> > nilfs_dirty_inode() and nilfs_set_file_dirty().
>>>> >> >>> >
>>>> >> >>> > When a file is opened, nilfs_dirty_inode() is called to
>>>> >> >>> > update the access timestamp in the inode. It calls
>>>> >> >>> > __nilfs_mark_inode_dirty() in a separate transaction.
>>>> >> >>> > __nilfs_mark_inode_dirty() caches the ifile buffer_head in
>>>> >> >>> > the i_bh field of the inode info structure and marks it as
>>>> >> >>> > dirty.
>>>> >> >>> >
>>>> >> >>> > After some data was written to the file in another
>>>> >> >>> > transaction, the function nilfs_set_file_dirty() is called,
>>>> >> >>> > which adds the inode to the ns_dirty_files list.
>>>> >> >>> >
>>>> >> >>> > Then the segment construction calls
>>>> >> >>> > nilfs_segctor_collect_dirty_files(), which goes through the
>>>> >> >>> > ns_dirty_files list and checks the i_bh field. If there is a
>>>> >> >>> > cached buffer_head in i_bh it is not marked as dirty again.
>>>> >> >>> >
>>>> >> >>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use
>>>> >> >>> > separate transactions, it is possible that a segment
>>>> >> >>> > construction that writes out the ifile occurs in-between the
>>>> >> >>> > two. If this happens the inode is not on the ns_dirty_files
>>>> >> >>> > list, but its ifile block is still marked as dirty and
>>>> >> >>> > written out.
>>>> >> >>> >
>>>> >> >>> > In the next segment construction, the data for the file is
>>>> >> >>> > written out and nilfs_bmap_propagate() updates the b-tree.
>>>> >> >>> > Eventually the bmap root is written into the i_bh block,
>>>> >> >>> > which is not dirty, because it was written out in another
>>>> >> >>> > segment construction.
>>>> >> >>> >
>>>> >> >>> > As a result the bmap update can be lost, which leads to file
>>>> >> >>> > system corruption. Either the virtual block address points
>>>> >> >>> > to an unallocated DAT block, or the DAT entry will be reused
>>>> >> >>> > for something different.
>>>> >> >>> >
>>>> >> >>> > The error can remain undetected for a long time. A typical
>>>> >> >>> > error message would be one of the "bad btree" errors or a
>>>> >> >>> > warning that a DAT entry could not be found.
>>>> >> >>> >
>>>> >> >>> > This bug can be reproduced reliably by a simple benchmark
>>>> >> >>> > that creates and overwrites millions of 4k files.
>>>> >> >>> >
>>>> >> >>> > Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>>>> >> >>>
>>>> >> >>> Will review, please wait for a while.
>>>> >> >>
>>>> >> >> Thank you, I have uploaded my benchmark on GitHub:
>>>> >> >>
>>>> >> >> https://github.com/zeitgeist87/bugsimulator
>>>> >> >>
>>>> >> >> The benchmark is very simple. It writes out 13107200 4k files
>>>> >> >> in a two layer directory tree, so that every directory contains
>>>> >> >> about 1000 files. Then it overwrites these files in an infinite
>>>> >> >> loop.
>>>> >> >>
>>>> >> >> You need quite aggressive cleaner settings for it to keep up
>>>> >> >> with the bugsimulator:
>>>> >> >>
>>>> >> >> mc_nsegments_per_clean  16
>>>> >> >> nsegments_per_clean     16
>>>> >> >> max_clean_segments              30%
>>>> >> >> cleaning_interval       0.1
>>>> >> >> mc_cleaning_interval    0.1
>>>> >> >>
>>>> >> >> If you run this benchmark on a fresh 100GB NILFS2 partition with
>>>> >> >> the above cleaner settings, the file system will be corrupt in
>>>> >> >> about half an hour. At least on my machine.
>>>> >> >>
>>>> >> >> With the patch for the race condition, it can run indefinetly.
>>>> >> >>
>>>> >> >> Regards,
>>>> >> >> Andreas
>>>> >> >>
>>>> >> >>> Regards,
>>>> >> >>> Ryusuke Konishi
>>>> >> >>>
>>>> >> >>> >
>>>> >> >>> > ---
>>>> >> >>> >  fs/nilfs2/segment.c | 6 ++++--
>>>> >> >>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>>>> >> >>> >
>>>> >> >>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>>>> >> >>> > index 70ded52dc1dd..50e12956c737 100644
>>>> >> >>> > --- a/fs/nilfs2/segment.c
>>>> >> >>> > +++ b/fs/nilfs2/segment.c
>>>> >> >>> > @@ -1958,8 +1958,6 @@ static int
>>>> >> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>>> >> >>> > err, ii->vfs_inode.i_ino); return err;
>>>> >> >>> >                         }
>>>> >> >>> > -                       mark_buffer_dirty(ibh);
>>>> >> >>> > -                       nilfs_mdt_mark_dirty(ifile);
>>>> >> >>> >                         spin_lock(&nilfs->ns_inode_lock);
>>>> >> >>> >                         if (likely(!ii->i_bh))
>>>> >> >>> >                                 ii->i_bh = ibh;
>>>> >> >>> > @@ -1968,6 +1966,10 @@ static int
>>>> >> >>> > nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>>> >> >>> > goto retry; }
>>>> >> >>> >
>>>> >> >>> > +               // Always redirty the buffer to avoid race
>>>> >> >>> > condition
>>>> >> >>> > +               mark_buffer_dirty(ii->i_bh);
>>>> >> >>> > +               nilfs_mdt_mark_dirty(ifile);
>>>> >> >>> > +
>>>> >> >>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
>>>> >> >>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
>>>> >> >>> >                 list_move_tail(&ii->i_dirty,
>>>> >> >>> > &sci->sc_dirty_files); --
>>>> >> >>> > 2.13.2
>>>> >> >>> >
>>>> >> >>> > --
>>>> >> >>> > To unsubscribe from this list: send the line "unsubscribe
>>>> >> >>> > linux-nilfs" in the body of a message to
>>>> >> >>> > majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
>>>> >> >>> > http://vger.kernel.org/majordomo-info.html
>>>> >> >>> --
>>>> >> >>> To unsubscribe from this list: send the line "unsubscribe
>>>> >> >>> linux-nilfs" in the body of a message to
>>>> >> >>> majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at
>>>> >> >>> http://vger.kernel.org/majordomo-info.html
>>>> >> >>
>>>> >> --
>>>> >> To unsubscribe from this list: send the line "unsubscribe
>>>> >> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> >> More majordomo info at
>>>> >> http://vger.kernel.org/majordomo-info.html
>>>> >
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> linux-nilfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-10-30 11:25 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-07-11  6:19 [PATCH] nilfs2: Fix race condition that causes file system corruption Andreas Rohner
     [not found] ` <20170711061907.19259-1-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2017-07-12  0:08   ` Ryusuke Konishi
     [not found]     ` <CAKFNMonzhYBKkjotNiNY0x0DnWxzHYF8WvO7f7xjGwr-8a=aRQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-07-13  6:44       ` Andreas Rohner
2017-10-28 17:16         ` Ryusuke Konishi
     [not found]           ` <CAKFNMomD0qE3Q-bSfH=6-A82LaEVh2fK8Asp6emgYMa9BstQKw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-29  5:09             ` Ryusuke Konishi
     [not found]               ` <CAKFNMomWEWyPZ4jSvJLk9AdzuDxn1T4-gxsOpbgRoZ-=6jH1Tg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-29  5:24                 ` Ryusuke Konishi
2017-10-29  8:07                 ` Andreas Rohner
2017-10-29 12:53                   ` Ryusuke Konishi
     [not found]                     ` <CAKFNMo=6uOSA-bdkxN6dYnPo-7Svkm2vMXbFLzcP_eEp9ra_jg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-29 14:06                       ` Andreas Rohner
2017-10-29 16:29                         ` Ryusuke Konishi
     [not found]                           ` <CAKFNMonWJtOHxaHQO8SNPGOUx6H=ghp7EQ+L1s3nbjMZt-Zr_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-29 22:19                             ` Ryusuke Konishi
     [not found]                               ` <CAKFNMo=NnxYgxXhBASBY7QVO9phf=p3RgXCpCb6qWR9N99YedA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-30 11:25                                 ` Ryusuke Konishi
2017-10-29  9:46                 ` Andreas Rohner
2017-10-29 10:23                   ` Clemens Eisserer
2017-10-24  3:18       ` Clemens Eisserer
     [not found]         ` <CAFvQSYR6vXuyGzGjeogKdg1Xzg=4QDE4TM7r4G-UnZ7hLniDkw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-24 17:06           ` Viacheslav Dubeyko
     [not found]             ` <1508864769.501.5.camel-yeENwD64cLxBDgjK7y7TUQ@public.gmane.org>
2017-10-24 19:29               ` Clemens Eisserer

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.