From: Matthew Wilcox <willy@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>,
Ross Zwisler <ross.zwisler@linux.intel.com>,
linux-kernel@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Andreas Dilger <adilger.kernel@dilger.ca>,
Andrew Morton <akpm@linux-foundation.org>,
Dan Williams <dan.j.williams@intel.com>, Jan Kara <jack@suse.com>,
linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvdimm@lists.01.org, xfs@oss.sgi.com
Subject: Re: [RFC PATCH] dax, ext2, ext4, XFS: fix data corruption race
Date: Tue, 26 Jan 2016 09:47:46 -0500
Message-ID: <20160126144746.GL2948@linux.intel.com>
In-Reply-To: <20160126130521.GB23820@quack.suse.cz>

On Tue, Jan 26, 2016 at 02:05:21PM +0100, Jan Kara wrote:
> On Tue 26-01-16 07:48:12, Matthew Wilcox wrote:
> > I *think* that what Dave's proposing (and if he isn't, I'm proposing it
> > for him) is that the filesystem takes its allocation lock shared during
> > the ->fault handler, then in the ->page_mkwrite handler, it knows that an
> > allocation is coming, so it takes its allocation lock in exclusive mode.
> >
> > So read vs write faults won't be able to race because the allocation lock
> > will prevent it.
>
> So this is correct and clean design but we will take the lock in exclusive
> mode (and thus hurt scalability) for every write fault, not just for the
> ones allocating blocks. And at the moment we take exclusive lock for write
> faults, there's no more need for having the hole page instantiated - we can
> still do it for simplicity but it's no longer necessary to avoid data
> corruption.

In my mind we take it only for allocating writes, because we also include
the patch to insert PFNs with the writable bit set in the dax_fault
handler if the page fault was for writes.

Although that only works when the *first* fault is a write ... if we
read a page then write the same page, we will indeed take the lock
in exclusive mode. I think that's fixable too -- in the page_mkwrite
handler, take the lock in exclusive mode only if there's a page in the
radix tree. I'll take a look at that optimisation after doing the first
couple of steps.