Are there u32 atomic bitops? (or dealing w/ i

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Are there u32 atomic bitops? (or dealing w/ i_flags)
@ 2012-12-18  1:10 Andy Lutomirski
  2012-12-18  1:34 ` Ming Lei
  2012-12-18  1:57 ` Al Viro
  0 siblings, 2 replies; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-18  1:10 UTC (permalink / raw)
  To: linux-kernel, Linux FS Devel

I want to change inode->i_flags access to be atomic -- there are some
locking oddities right now, I think, and I want to use a new inode
flag to signal mtime updates from page_mkwrite.  The problem is that
i_flags is an unsigned int, and making it an unsigned long seems like
a waste, but there aren't any u32 atomic bitops.

What should I do?  Suck it up and waste four bytes on 64-bit machines?

In general, having atomic flag words be long seems likely to waste
bits on 64-bit architectures.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Are there u32 atomic bitops? (or dealing w/ i_flags)
  2012-12-18  1:10 Are there u32 atomic bitops? (or dealing w/ i_flags) Andy Lutomirski
@ 2012-12-18  1:34 ` Ming Lei
  2012-12-18  1:57 ` Al Viro
  1 sibling, 0 replies; 26+ messages in thread
From: Ming Lei @ 2012-12-18  1:34 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: linux-kernel, Linux FS Devel

On Tue, Dec 18, 2012 at 9:10 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> I want to change inode->i_flags access to be atomic -- there are some
> locking oddities right now, I think, and I want to use a new inode
> flag to signal mtime updates from page_mkwrite.  The problem is that
> i_flags is an unsigned int, and making it an unsigned long seems like
> a waste, but there aren't any u32 atomic bitops.

See below in include/linux/types.h

typedef struct {
        int counter;
} atomic_t;

#ifdef CONFIG_64BIT
typedef struct {
        long counter;
} atomic64_t;
#endif



Thanks,
-- 
Ming Lei

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Are there u32 atomic bitops? (or dealing w/ i_flags)
  2012-12-18  1:10 Are there u32 atomic bitops? (or dealing w/ i_flags) Andy Lutomirski
  2012-12-18  1:34 ` Ming Lei
@ 2012-12-18  1:57 ` Al Viro
  2012-12-18  2:42   ` Andy Lutomirski
  1 sibling, 1 reply; 26+ messages in thread
From: Al Viro @ 2012-12-18  1:57 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: linux-kernel, Linux FS Devel

On Mon, Dec 17, 2012 at 05:10:21PM -0800, Andy Lutomirski wrote:
> I want to change inode->i_flags access to be atomic -- there are some
> locking oddities right now, I think, and I want to use a new inode
> flag to signal mtime updates from page_mkwrite.  The problem is that
> i_flags is an unsigned int, and making it an unsigned long seems like
> a waste, but there aren't any u32 atomic bitops.

... and atomic accesses cost more.  A lot more on some architectures.
FWIW, atomic_t *is* 32bit on 32bit architectures, which still doesn't
make it a good idea.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Are there u32 atomic bitops? (or dealing w/ i_flags)
  2012-12-18  1:57 ` Al Viro
@ 2012-12-18  2:42   ` Andy Lutomirski
  2012-12-18 21:30     ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-18  2:42 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, Linux FS Devel

On Mon, Dec 17, 2012 at 5:57 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Dec 17, 2012 at 05:10:21PM -0800, Andy Lutomirski wrote:
>> I want to change inode->i_flags access to be atomic -- there are some
>> locking oddities right now, I think, and I want to use a new inode
>> flag to signal mtime updates from page_mkwrite.  The problem is that
>> i_flags is an unsigned int, and making it an unsigned long seems like
>> a waste, but there aren't any u32 atomic bitops.
>
> ... and atomic accesses cost more.  A lot more on some architectures.
> FWIW, atomic_t *is* 32bit on 32bit architectures, which still doesn't
> make it a good idea.

Are atomic_set_mask and atomic_clear_mask as fast as set_bit and
friends on all archs?

In any case, i_flags looks like it's rarely written, so I find it a
bit hard to believe that making it atomic would hurt.  Isn't
atomic_read equivalent to non-atomic reads everywhere?

I want page_mkwrite to set a flag (without taking i_mutex) but *not*
call file_update_time and then to have the writeback paths update the
inode time.  (This, along with stable pages, is the major cause of
long sleeps in my application.)  OTOH, maybe I should just use i_state
and i_lock for this.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Are there u32 atomic bitops? (or dealing w/ i_flags)
  2012-12-18  2:42   ` Andy Lutomirski
@ 2012-12-18 21:30     ` Dave Chinner
  2012-12-18 22:20       ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2012-12-18 21:30 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Al Viro, linux-kernel, Linux FS Devel

On Mon, Dec 17, 2012 at 06:42:44PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 17, 2012 at 5:57 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Mon, Dec 17, 2012 at 05:10:21PM -0800, Andy Lutomirski wrote:
> >> I want to change inode->i_flags access to be atomic -- there are some
> >> locking oddities right now, I think, and I want to use a new inode
> >> flag to signal mtime updates from page_mkwrite.  The problem is that
> >> i_flags is an unsigned int, and making it an unsigned long seems like
> >> a waste, but there aren't any u32 atomic bitops.
> >
> > ... and atomic accesses cost more.  A lot more on some architectures.
> > FWIW, atomic_t *is* 32bit on 32bit architectures, which still doesn't
> > make it a good idea.
> 
> Are atomic_set_mask and atomic_clear_mask as fast as set_bit and
> friends on all archs?
> 
> In any case, i_flags looks like it's rarely written, so I find it a
> bit hard to believe that making it atomic would hurt.  Isn't
> atomic_read equivalent to non-atomic reads everywhere?
> 
> I want page_mkwrite to set a flag (without taking i_mutex) but *not*
> call file_update_time and then to have the writeback paths update the
> inode time.

Deadlocks ahoy!

We don't currently take the i_mutex anywhere in the writeback path
as the writeback path is nested inside the i_mutex. Hence this seems
like an extremely dangerous thing to do...

> (This, along with stable pages, is the major cause of
> long sleeps in my application.)  OTOH, maybe I should just use i_state
> and i_lock for this.

Or, perhaps, export O_CMTIME to fcntl and/or open?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Are there u32 atomic bitops? (or dealing w/ i_flags)
  2012-12-18 21:30     ` Dave Chinner
@ 2012-12-18 22:20       ` Andy Lutomirski
  2012-12-20  7:03         ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-18 22:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, linux-kernel, Linux FS Devel

On Tue, Dec 18, 2012 at 1:30 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Dec 17, 2012 at 06:42:44PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 17, 2012 at 5:57 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> > On Mon, Dec 17, 2012 at 05:10:21PM -0800, Andy Lutomirski wrote:
>> >> I want to change inode->i_flags access to be atomic -- there are some
>> >> locking oddities right now, I think, and I want to use a new inode
>> >> flag to signal mtime updates from page_mkwrite.  The problem is that
>> >> i_flags is an unsigned int, and making it an unsigned long seems like
>> >> a waste, but there aren't any u32 atomic bitops.
>> >
>> > ... and atomic accesses cost more.  A lot more on some architectures.
>> > FWIW, atomic_t *is* 32bit on 32bit architectures, which still doesn't
>> > make it a good idea.
>>
>> Are atomic_set_mask and atomic_clear_mask as fast as set_bit and
>> friends on all archs?
>>
>> In any case, i_flags looks like it's rarely written, so I find it a
>> bit hard to believe that making it atomic would hurt.  Isn't
>> atomic_read equivalent to non-atomic reads everywhere?
>>
>> I want page_mkwrite to set a flag (without taking i_mutex) but *not*
>> call file_update_time and then to have the writeback paths update the
>> inode time.
>
> Deadlocks ahoy!
>
> We don't currently take the i_mutex anywhere in the writeback path
> as the writeback path is nested inside the i_mutex. Hence this seems
> like an extremely dangerous thing to do...

Hmm.

This can all be made fs-specific: page_mkclean_file sets a bit in the
inode flags or inode state or address_space_flags.  (The latter might
be nice -- it's already atomic and I think there are plenty of bits
free.)  Then a filesystem could stop calling file_update_time in
->page_mkwrite and instead test-clear that bit in ->writepage.

At that point, maybe the fs knows it's safe to mark the inode dirty
and update times.  (The actual time fields don't seem to be uniformly
protected by any lock.)  Can ext4_mark_inode_dirty be called from
->writepage?

Filesystems that haven't been converted can will continue to update
times in ->page_mkwrite.

>
>> (This, along with stable pages, is the major cause of
>> long sleeps in my application.)  OTOH, maybe I should just use i_state
>> and i_lock for this.
>
> Or, perhaps, export O_CMTIME to fcntl and/or open?

That would work, but it would be kind of nice for the time stamps on
my files to get updated correctly.  I'd argue that it's more correct
for the timestamp to be after the update (at page cleaning time,
probably) than at page_mkwrite time.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Are there u32 atomic bitops? (or dealing w/ i_flags)
  2012-12-18 22:20       ` Andy Lutomirski
@ 2012-12-20  7:03         ` Dave Chinner
  2012-12-20 20:05           ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2012-12-20  7:03 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Al Viro, linux-kernel, Linux FS Devel

On Tue, Dec 18, 2012 at 02:20:06PM -0800, Andy Lutomirski wrote:
> On Tue, Dec 18, 2012 at 1:30 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Dec 17, 2012 at 06:42:44PM -0800, Andy Lutomirski wrote:
> >> On Mon, Dec 17, 2012 at 5:57 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >> > On Mon, Dec 17, 2012 at 05:10:21PM -0800, Andy Lutomirski wrote:
> >> >> I want to change inode->i_flags access to be atomic -- there are some
> >> >> locking oddities right now, I think, and I want to use a new inode
> >> >> flag to signal mtime updates from page_mkwrite.  The problem is that
> >> >> i_flags is an unsigned int, and making it an unsigned long seems like
> >> >> a waste, but there aren't any u32 atomic bitops.
> >> >
> >> > ... and atomic accesses cost more.  A lot more on some architectures.
> >> > FWIW, atomic_t *is* 32bit on 32bit architectures, which still doesn't
> >> > make it a good idea.
> >>
> >> Are atomic_set_mask and atomic_clear_mask as fast as set_bit and
> >> friends on all archs?
> >>
> >> In any case, i_flags looks like it's rarely written, so I find it a
> >> bit hard to believe that making it atomic would hurt.  Isn't
> >> atomic_read equivalent to non-atomic reads everywhere?
> >>
> >> I want page_mkwrite to set a flag (without taking i_mutex) but *not*
> >> call file_update_time and then to have the writeback paths update the
> >> inode time.
> >
> > Deadlocks ahoy!
> >
> > We don't currently take the i_mutex anywhere in the writeback path
> > as the writeback path is nested inside the i_mutex. Hence this seems
> > like an extremely dangerous thing to do...
> 
> Hmm.
> 
> This can all be made fs-specific: page_mkclean_file sets a bit in the
> inode flags or inode state or address_space_flags.  (The latter might
> be nice -- it's already atomic and I think there are plenty of bits
> free.)  Then a filesystem could stop calling file_update_time in
> ->page_mkwrite and instead test-clear that bit in ->writepage.

writepage happens with pages locked. There's a limit to what you can
do in a filesystem while a page is locked, and you most certainly
don't want to be taking the i_mutex at that point...

> At that point, maybe the fs knows it's safe to mark the inode dirty
> and update times.  (The actual time fields don't seem to be uniformly
> protected by any lock.)  Can ext4_mark_inode_dirty be called from
> ->writepage?

Probably, but that's a filesystem internal function that has to be
called from with a valid transaction handle.

The fact you are conerned about this function tells me something
important - that you aren't having problems with i_mutex (show me
where i_mutex is taken on the page_mkwrite path ;), but you are
having latency problems with the ext4 .dirty_inode method starting a
new transaction when it is called from mark_inode_dirty_sync().

So, a filesystem specific problem, perhaps?

> Filesystems that haven't been converted can will continue to update
> times in ->page_mkwrite.

You don't need to change this at all. If you have ext4
implement .update_timestamp to do whatever timestamp trickery you
want and avoid ext4 starting a new transaction in .dirty_inode for
pure timestamp updates, you can move the timestamp update into the
ext4 writeback path. The ext4 writeback path is already quite
special, so I'm sure people would welcome another weird behaviour
being added to it :)

IOWs, what you want to do doesn't seem to require any changes to the
generic code.  Just make it do timestamp updates in a manner similar
to XFS and btrfs, and you can handle it all completely internally to
the filesystem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Are there u32 atomic bitops? (or dealing w/ i_flags)
  2012-12-20  7:03         ` Dave Chinner
@ 2012-12-20 20:05           ` Andy Lutomirski
  2012-12-20 23:10             ` [RFC PATCH 0/4] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
  2012-12-20 23:36             ` Are there u32 atomic bitops? (or dealing w/ i_flags) Dave Chinner
  0 siblings, 2 replies; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-20 20:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, linux-kernel, Linux FS Devel

On Wed, Dec 19, 2012 at 11:03 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> The fact you are conerned about this function tells me something
> important - that you aren't having problems with i_mutex (show me
> where i_mutex is taken on the page_mkwrite path ;), but you are
> having latency problems with the ext4 .dirty_inode method starting a
> new transaction when it is called from mark_inode_dirty_sync().
>
> So, a filesystem specific problem, perhaps?

i_mutex isn't involved.  On ext4, it looks like this (from latencytop):

start_this_handle jbd2__journal_start jbd2_journal_start
ext4_journal_start_sb ext4_dirty_inode __mark_inode_dirty update_time
file_update_time ext4_page_mkwrite do_wp_page handle_pte_fault
handle_mm_fault

This is a showstopper for my software -- I'm running on a kernel with
the call to file_update_time commented out.

>
>> Filesystems that haven't been converted can will continue to update
>> times in ->page_mkwrite.
>
> You don't need to change this at all. If you have ext4
> implement .update_timestamp to do whatever timestamp trickery you
> want and avoid ext4 starting a new transaction in .dirty_inode for
> pure timestamp updates, you can move the timestamp update into the
> ext4 writeback path. The ext4 writeback path is already quite
> special, so I'm sure people would welcome another weird behaviour
> being added to it :)
>
> IOWs, what you want to do doesn't seem to require any changes to the
> generic code.  Just make it do timestamp updates in a manner similar
> to XFS and btrfs, and you can handle it all completely internally to
> the filesystem...

Are XFS and btrfs really better?  XFS does:

	tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
	error = xfs_trans_reserve(tp, 0, XFS_FSYNC_TS_LOG_RES(mp), 0, 0, 0);
	if (error) {
		xfs_trans_cancel(tp, 0);
		return -error;
	}

	xfs_ilock(ip, XFS_ILOCK_EXCL);

This looks like it could sleep in a couple of places.  I admit I
haven't actually tried it.


In any case, I have an alternative approach that I'm currently playing
with.  If it survives a bit of testing, I'll send patches.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC PATCH 0/4] Rework mtime and ctime updates on mmaped writes
  2012-12-20 20:05           ` Andy Lutomirski
@ 2012-12-20 23:10             ` Andy Lutomirski
  2012-12-20 23:10               ` [RFC PATCH 1/4] mm: Explicitly track when the page dirty bit is transferred from a pte Andy Lutomirski
                                 ` (3 more replies)
  2012-12-20 23:36             ` Are there u32 atomic bitops? (or dealing w/ i_flags) Dave Chinner
  1 sibling, 4 replies; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-20 23:10 UTC (permalink / raw)
  To: linux-kernel, Linux FS Devel
  Cc: Dave Chinner, Al Viro, Jan Kara, Andy Lutomirski

Writes via mmap currently update mtime and ctime in ->page_mkwrite.
This is unfortunate from a performance and a correctness point of view.
The file times should be updated after writes, not before (so that every
write eventually results in a fresh timestamp).  More importantly
(for me), ->page_mkwrite is called periodically even on mlocked pages,
and some filesystems can sleep in mark_inode_dirty.

This patchset attempts to fix both issues at once.  It adds a new
address_space flag AS_CMTIME that is set atomically whenever the system
transfers a pte dirty bit to a struct page backed by the address_space.
This can happen with various locks held and when low on memory.

Later on, whenever syncing an inode (which happens indirectly in msync)
or whenever a vma is torn down, if AS_CMTIME is set, then the file times
are updated.  This happens in a context from which (I think) it's safe
to dirty inodes.

One nice property of this approach is that it requires no fs-specific
work.  It's actually quite a bit simpler than I expected.

I've tested this, and mtime and ctime are updated on munmap, exit, MS_SYNC,
and fsync after writing via mmap.  The times are also updated 30 seconds
after writing, all by themselves :)  Lockdep has no complaints.

NB: I am not at all an expert in anything fs or pagecache related.  Please
help me find things that may be wrong with these patches.

Andy Lutomirski (4):
  mm: Explicitly track when the page dirty bit is transferred from a
    pte
  mm: Update file times when inodes are written after mmaped writes
  Remove file_update_time from all mkwrite paths
  ext4: Fix an incorrect comment about i_mutex

 fs/9p/vfs_file.c        |  3 ---
 fs/btrfs/inode.c        |  4 +---
 fs/buffer.c             |  6 ------
 fs/ceph/addr.c          |  3 ---
 fs/ext4/fsync.c         |  2 --
 fs/ext4/inode.c         |  1 -
 fs/gfs2/file.c          |  3 ---
 fs/inode.c              | 37 +++++++++++++++++++++++++++++++++++++
 fs/nilfs2/file.c        |  1 -
 fs/sysfs/bin.c          |  2 --
 include/linux/fs.h      |  1 +
 include/linux/mm.h      |  1 +
 include/linux/pagemap.h |  3 +++
 mm/filemap.c            |  1 -
 mm/memory-failure.c     |  4 +---
 mm/memory.c             |  5 +----
 mm/mmap.c               |  4 ++++
 mm/page-writeback.c     | 42 ++++++++++++++++++++++++++++++++++++++++--
 mm/rmap.c               |  9 ++++++---
 19 files changed, 95 insertions(+), 37 deletions(-)

-- 
1.7.11.7

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC PATCH 1/4] mm: Explicitly track when the page dirty bit is transferred from a pte
  2012-12-20 23:10             ` [RFC PATCH 0/4] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
@ 2012-12-20 23:10               ` Andy Lutomirski
  2012-12-20 23:10               ` [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes Andy Lutomirski
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-20 23:10 UTC (permalink / raw)
  To: linux-kernel, Linux FS Devel
  Cc: Dave Chinner, Al Viro, Jan Kara, Andy Lutomirski

This is a slight cleanup, but, more importantly, it will let us
easily detect writes via mmap when the process is done writing
(e.g. munmaps, msyncs, fsyncs, or dies).

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/mm.h  |  1 +
 mm/memory-failure.c |  4 +---
 mm/page-writeback.c | 16 ++++++++++++++--
 mm/rmap.c           |  9 ++++++---
 4 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a44aa00..4c7e49b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1037,6 +1037,7 @@ int redirty_page_for_writepage(struct writeback_control *wbc,
 void account_page_dirtied(struct page *page, struct address_space *mapping);
 void account_page_writeback(struct page *page);
 int set_page_dirty(struct page *page);
+int set_page_dirty_from_pte(struct page *page);
 int set_page_dirty_lock(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8b20278..ab5c3f4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -892,9 +892,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	mapping = page_mapping(hpage);
 	if (!(flags & MF_MUST_KILL) && !PageDirty(hpage) && mapping &&
 	    mapping_cap_writeback_dirty(mapping)) {
-		if (page_mkclean(hpage)) {
-			SetPageDirty(hpage);
-		} else {
+		if (!page_mkclean(hpage)) {
 			kill = 0;
 			ttu |= TTU_IGNORE_HWPOISON;
 			printk(KERN_INFO
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 830893b..cdea11a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2109,6 +2109,19 @@ int set_page_dirty(struct page *page)
 EXPORT_SYMBOL(set_page_dirty);
 
 /*
+ * Dirty a page due to the page table dirty bit.
+ *
+ * For pages with a mapping this should be done under the page lock.
+ * This does set_page_dirty plus extra work to account for the fact that
+ * the kernel was not notified when the actual write was done.
+ */
+int set_page_dirty_from_pte(struct page *page)
+{
+	/* Doesn't do anything interesting yet. */
+	return set_page_dirty(page);
+}
+
+/*
  * set_page_dirty() is racy if the caller has no reference against
  * page->mapping->host, and if the page is unlocked.  This is because another
  * CPU could truncate the page off the mapping and then free the mapping.
@@ -2175,8 +2188,7 @@ int clear_page_dirty_for_io(struct page *page)
 		 * as a serialization point for all the different
 		 * threads doing their things.
 		 */
-		if (page_mkclean(page))
-			set_page_dirty(page);
+		page_mkclean(page);
 		/*
 		 * We carefully synchronise fault handlers against
 		 * installing a dirty pte and marking the page dirty
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ee1ef0..b8fe00e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -931,6 +931,9 @@ int page_mkclean(struct page *page)
 			ret = page_mkclean_file(mapping, page);
 	}
 
+	if (ret)
+		set_page_dirty_from_pte(page);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
@@ -1151,7 +1154,7 @@ void page_remove_rmap(struct page *page)
 	 */
 	if (mapping && !mapping_cap_account_dirty(mapping) &&
 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
-		set_page_dirty(page);
+		set_page_dirty_from_pte(page);
 	/*
 	 * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
 	 * and not charged by memcg for now.
@@ -1229,7 +1232,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
-		set_page_dirty(page);
+		set_page_dirty_from_pte(page);
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
@@ -1423,7 +1426,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 		/* Move the dirty bit to the physical page now the pte is gone. */
 		if (pte_dirty(pteval))
-			set_page_dirty(page);
+			set_page_dirty_from_pte(page);
 
 		page_remove_rmap(page);
 		page_cache_release(page);
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
  2012-12-20 23:10             ` [RFC PATCH 0/4] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
  2012-12-20 23:10               ` [RFC PATCH 1/4] mm: Explicitly track when the page dirty bit is transferred from a pte Andy Lutomirski
@ 2012-12-20 23:10               ` Andy Lutomirski
  2012-12-21  0:14                 ` Dave Chinner
  2012-12-21  0:34                 ` Jan Kara
  2012-12-20 23:10               ` [RFC PATCH 3/4] Remove file_update_time from all mkwrite paths Andy Lutomirski
  2012-12-20 23:10               ` [RFC PATCH 4/4] ext4: Fix an incorrect comment about i_mutex Andy Lutomirski
  3 siblings, 2 replies; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-20 23:10 UTC (permalink / raw)
  To: linux-kernel, Linux FS Devel
  Cc: Dave Chinner, Al Viro, Jan Kara, Andy Lutomirski

The onus is currently on filesystems to call file_update_time
somewhere in the page_mkwrite path.  This is unfortunate for three
reasons:

1. page_mkwrite on a locked page should be fast.  ext4, for example,
   often sleeps while dirtying inodes.

2. The current behavior is surprising -- the timestamp resulting from
   an mmaped write will be before the write, not after.  This contradicts
   the mmap(2) manpage, which says:

     The st_ctime and st_mtime field for a file mapped with  PROT_WRITE  and
     MAP_SHARED  will  be  updated  after  a write to the mapped region, and
     before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if  one
     occurs.

3. (An ulterior motive) I'd like to use hardware dirty tracking for
   shared, locked, writable mappings (instead of page faults).  Moving
   important work out of the page_mkwrite path is an important first step.

The next patch will remove the now-unnecessary file_update_time calls.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 fs/inode.c              | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/fs.h      |  1 +
 include/linux/pagemap.h |  3 +++
 mm/memory.c             |  2 +-
 mm/mmap.c               |  4 ++++
 mm/page-writeback.c     | 30 ++++++++++++++++++++++++++++--
 6 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 64999f1..d881d0b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1688,6 +1688,43 @@ int file_update_time(struct file *file)
 }
 EXPORT_SYMBOL(file_update_time);
 
+/**
+ *	inode_update_time_writable	-	update mtime and ctime time
+ *	@inode: inode accessed
+ *
+ *	This is like file_update_time, but it assumes the mnt is writable
+ *	and takes an inode parameter instead.
+ */
+
+int inode_update_time_writable(struct inode *inode)
+{
+	struct timespec now;
+	int sync_it = 0;
+	int ret;
+
+	/* First try to exhaust all avenues to not sync */
+	if (IS_NOCMTIME(inode))
+		return 0;
+
+	now = current_fs_time(inode->i_sb);
+	if (!timespec_equal(&inode->i_mtime, &now))
+		sync_it = S_MTIME;
+
+	if (!timespec_equal(&inode->i_ctime, &now))
+		sync_it |= S_CTIME;
+
+	if (IS_I_VERSION(inode))
+		sync_it |= S_VERSION;
+
+	if (!sync_it)
+		return 0;
+
+	ret = update_time(inode, &now, sync_it);
+
+	return ret;
+}
+EXPORT_SYMBOL(inode_update_time_writable);
+
 int inode_needs_sync(struct inode *inode)
 {
 	if (IS_SYNC(inode))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 75fe9a1..c95f9fa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2554,6 +2554,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
 extern void setattr_copy(struct inode *inode, const struct iattr *attr);
 
 extern int file_update_time(struct file *file);
+extern int inode_update_time_writable(struct inode *inode);
 
 extern int generic_show_options(struct seq_file *m, struct dentry *root);
 extern void save_mount_options(struct super_block *sb, char *options);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e42c762..a038ed9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -24,6 +24,7 @@ enum mapping_flags {
 	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
+	AS_CMTIME	= __GFP_BITS_SHIFT + 4, /* written via pte */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -68,6 +69,8 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 				(__force unsigned long)mask;
 }
 
+extern void mapping_flush_cmtime(struct address_space *mapping);
+
 /*
  * The page cache can done in larger chunks than
  * one page, because it allows for more efficient
diff --git a/mm/memory.c b/mm/memory.c
index 221fc9f..086b901 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1160,7 +1160,7 @@ again:
 				rss[MM_ANONPAGES]--;
 			else {
 				if (pte_dirty(ptent))
-					set_page_dirty(page);
+					set_page_dirty_from_pte(page);
 				if (pte_young(ptent) &&
 				    likely(!VM_SequentialReadHint(vma)))
 					mark_page_accessed(page);
diff --git a/mm/mmap.c b/mm/mmap.c
index 3913262..60301dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -223,6 +223,10 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	struct vm_area_struct *next = vma->vm_next;
 
 	might_sleep();
+
+	if (vma->vm_file)
+		mapping_flush_cmtime(vma->vm_file->f_mapping);
+
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
 	if (vma->vm_file)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index cdea11a..8cbb7fb 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1910,6 +1910,13 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
+
+	/*
+	 * This is after writepages because the AS_CMTIME bit won't
+	 * bet set until writepages is called.
+	 */
+	mapping_flush_cmtime(mapping);
+
 	return ret;
 }
 
@@ -2117,8 +2124,17 @@ EXPORT_SYMBOL(set_page_dirty);
  */
 int set_page_dirty_from_pte(struct page *page)
 {
-	/* Doesn't do anything interesting yet. */
-	return set_page_dirty(page);
+	int ret = set_page_dirty(page);
+
+	/*
+	 * We may be out of memory and/or have various locks held, so
+	 * there isn't much we can do in here.
+	 */
+	struct address_space *mapping = page_mapping(page);
+	if (mapping)
+		set_bit(AS_CMTIME, &mapping->flags);
+
+	return ret;
 }
 
 /*
@@ -2287,3 +2303,13 @@ int mapping_tagged(struct address_space *mapping, int tag)
 	return radix_tree_tagged(&mapping->page_tree, tag);
 }
 EXPORT_SYMBOL(mapping_tagged);
+
+/*
+ * Call from any context from which inode_update_time_writable would be okay
+ * to flush deferred cmtime changes.
+ */
+void mapping_flush_cmtime(struct address_space *mapping)
+{
+	if (test_and_clear_bit(AS_CMTIME, &mapping->flags))
+		inode_update_time_writable(mapping->host);
+}
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 3/4] Remove file_update_time from all mkwrite paths
  2012-12-20 23:10             ` [RFC PATCH 0/4] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
  2012-12-20 23:10               ` [RFC PATCH 1/4] mm: Explicitly track when the page dirty bit is transferred from a pte Andy Lutomirski
  2012-12-20 23:10               ` [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes Andy Lutomirski
@ 2012-12-20 23:10               ` Andy Lutomirski
  2012-12-20 23:10               ` [RFC PATCH 4/4] ext4: Fix an incorrect comment about i_mutex Andy Lutomirski
  3 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-20 23:10 UTC (permalink / raw)
  To: linux-kernel, Linux FS Devel
  Cc: Dave Chinner, Al Viro, Jan Kara, Andy Lutomirski

The times are now updated at sync time.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 fs/9p/vfs_file.c | 3 ---
 fs/btrfs/inode.c | 4 +---
 fs/buffer.c      | 6 ------
 fs/ceph/addr.c   | 3 ---
 fs/ext4/inode.c  | 1 -
 fs/gfs2/file.c   | 3 ---
 fs/nilfs2/file.c | 1 -
 fs/sysfs/bin.c   | 2 --
 mm/filemap.c     | 1 -
 mm/memory.c      | 3 ---
 10 files changed, 1 insertion(+), 26 deletions(-)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index c2483e9..34b84f0 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -610,9 +610,6 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	p9_debug(P9_DEBUG_VFS, "page %p fid %lx\n",
 		 page, (unsigned long)filp->private_data);
 
-	/* Update file times before taking page lock */
-	file_update_time(filp);
-
 	v9inode = V9FS_I(inode);
 	/* make sure the cache has finished storing the page */
 	v9fs_fscache_wait_on_page_write(inode, page);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 95542a1..6fb8558 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6747,10 +6747,8 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	sb_start_pagefault(inode->i_sb);
 	ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
 	if (!ret) {
-		ret = file_update_time(vma->vm_file);
 		reserved = 1;
-	}
-	if (ret) {
+	} else if (ret) {
 		if (ret == -ENOMEM)
 			ret = VM_FAULT_OOM;
 		else /* -ENOSPC, -EIO, etc */
diff --git a/fs/buffer.c b/fs/buffer.c
index ec0aca8..a9a8f2a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2379,12 +2379,6 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 
 	sb_start_pagefault(sb);
 
-	/*
-	 * Update file times before taking page lock. We may end up failing the
-	 * fault so this update may be superfluous but who really cares...
-	 */
-	file_update_time(vma->vm_file);
-
 	ret = __block_page_mkwrite(vma, vmf, get_block);
 	sb_end_pagefault(sb);
 	return block_page_mkwrite_return(ret);
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6690269..19af339 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1183,9 +1183,6 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	loff_t size, len;
 	int ret;
 
-	/* Update time before taking page lock */
-	file_update_time(vma->vm_file);
-
 	size = i_size_read(inode);
 	if (off + PAGE_CACHE_SIZE <= size)
 		len = PAGE_CACHE_SIZE;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b3c243b..5ddbc75 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4780,7 +4780,6 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	int retries = 0;
 
 	sb_start_pagefault(inode->i_sb);
-	file_update_time(vma->vm_file);
 	/* Delalloc case is easy... */
 	if (test_opt(inode->i_sb, DELALLOC) &&
 	    !ext4_should_journal_data(inode) &&
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index e056b4c..999b1ed 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -398,9 +398,6 @@ static int gfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 	sb_start_pagefault(inode->i_sb);
 
-	/* Update file times before taking page lock */
-	file_update_time(vma->vm_file);
-
 	ret = gfs2_rs_alloc(ip);
 	if (ret)
 		return ret;
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 16f35f7..185b8e6 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -116,7 +116,6 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (unlikely(ret))
 		goto out;
 
-	file_update_time(vma->vm_file);
 	ret = __block_page_mkwrite(vma, vmf, nilfs_get_block);
 	if (ret) {
 		nilfs_transaction_abort(inode->i_sb);
diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
index 614b2b5..a475983 100644
--- a/fs/sysfs/bin.c
+++ b/fs/sysfs/bin.c
@@ -228,8 +228,6 @@ static int bin_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	ret = 0;
 	if (bb->vm_ops->page_mkwrite)
 		ret = bb->vm_ops->page_mkwrite(vma, vmf);
-	else
-		file_update_time(file);
 
 	sysfs_put_active(attr_sd);
 	return ret;
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..feb0540 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1715,7 +1715,6 @@ int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	int ret = VM_FAULT_LOCKED;
 
 	sb_start_pagefault(inode->i_sb);
-	file_update_time(vma->vm_file);
 	lock_page(page);
 	if (page->mapping != inode->i_mapping) {
 		unlock_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 086b901..07a0b0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2658,9 +2658,6 @@ reuse:
 		if (!page_mkwrite) {
 			wait_on_page_locked(dirty_page);
 			set_page_dirty_balance(dirty_page, page_mkwrite);
-			/* file_update_time outside page_lock */
-			if (vma->vm_file)
-				file_update_time(vma->vm_file);
 		}
 		put_page(dirty_page);
 		if (page_mkwrite) {
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 4/4] ext4: Fix an incorrect comment about i_mutex
  2012-12-20 23:10             ` [RFC PATCH 0/4] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
                                 ` (2 preceding siblings ...)
  2012-12-20 23:10               ` [RFC PATCH 3/4] Remove file_update_time from all mkwrite paths Andy Lutomirski
@ 2012-12-20 23:10               ` Andy Lutomirski
  2012-12-20 23:42                 ` Jan Kara
  3 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-20 23:10 UTC (permalink / raw)
  To: linux-kernel, Linux FS Devel
  Cc: Dave Chinner, Al Viro, Jan Kara, Andy Lutomirski

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 fs/ext4/fsync.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index be1d89f..8c15642 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -113,8 +113,6 @@ static int __sync_inode(struct inode *inode, int datasync)
  *
  * What we do is just kick off a commit and wait on it.  This will snapshot the
  * inode to disk.
- *
- * i_mutex lock is held when entering and exiting this function
  */
 
 int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: Are there u32 atomic bitops? (or dealing w/ i_flags)
  2012-12-20 20:05           ` Andy Lutomirski
  2012-12-20 23:10             ` [RFC PATCH 0/4] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
@ 2012-12-20 23:36             ` Dave Chinner
  2012-12-20 23:42               ` Andy Lutomirski
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2012-12-20 23:36 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Al Viro, linux-kernel, Linux FS Devel

On Thu, Dec 20, 2012 at 12:05:09PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 19, 2012 at 11:03 PM, Dave Chinner <david@fromorbit.com> wrote:
> start_this_handle jbd2__journal_start jbd2_journal_start
> ext4_journal_start_sb ext4_dirty_inode __mark_inode_dirty update_time
> file_update_time ext4_page_mkwrite do_wp_page handle_pte_fault
> handle_mm_fault

Yup, as I suspected. It's a filesystem specific problem.

> This is a showstopper for my software -- I'm running on a kernel with
> the call to file_update_time commented out.

Which means you are effectively running with O_CMTIME on all mmapped
files....

> >> Filesystems that haven't been converted can will continue to update
> >> times in ->page_mkwrite.
> >
> > You don't need to change this at all. If you have ext4
> > implement .update_timestamp to do whatever timestamp trickery you
> > want and avoid ext4 starting a new transaction in .dirty_inode for
> > pure timestamp updates, you can move the timestamp update into the
> > ext4 writeback path. The ext4 writeback path is already quite
> > special, so I'm sure people would welcome another weird behaviour
> > being added to it :)
> >
> > IOWs, what you want to do doesn't seem to require any changes to the
> > generic code.  Just make it do timestamp updates in a manner similar
> > to XFS and btrfs, and you can handle it all completely internally to
> > the filesystem...
> 
> Are XFS and btrfs really better?  XFS does:
> 
> 	tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
> 	error = xfs_trans_reserve(tp, 0, XFS_FSYNC_TS_LOG_RES(mp), 0, 0, 0);
> 	if (error) {
> 		xfs_trans_cancel(tp, 0);
> 		return -error;
> 	}
> 
> 	xfs_ilock(ip, XFS_ILOCK_EXCL);
> 
> This looks like it could sleep in a couple of places.  I admit I
> haven't actually tried it.

It certainly can if there is no log space available, but that's a
filesystem specific problem, not a problem with the way the VFS does
timestamp updates.

Indeed, it was only recently we changed this code to use a
transaction, previously it was doing exactly what you are proposing
completely internally to XFS. i.e. it copied the timestamp and set a
dirty flag that then trigger the next transaction that dirtied the
inode or triggered inode writeback to copy the new timestamps into
the inode with a transaction.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Are there u32 atomic bitops? (or dealing w/ i_flags)
  2012-12-20 23:36             ` Are there u32 atomic bitops? (or dealing w/ i_flags) Dave Chinner
@ 2012-12-20 23:42               ` Andy Lutomirski
  0 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-20 23:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, linux-kernel, Linux FS Devel

On Thu, Dec 20, 2012 at 3:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Dec 20, 2012 at 12:05:09PM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 19, 2012 at 11:03 PM, Dave Chinner <david@fromorbit.com> wrote:
>> start_this_handle jbd2__journal_start jbd2_journal_start
>> ext4_journal_start_sb ext4_dirty_inode __mark_inode_dirty update_time
>> file_update_time ext4_page_mkwrite do_wp_page handle_pte_fault
>> handle_mm_fault
>
> Yup, as I suspected. It's a filesystem specific problem.
>
>> This is a showstopper for my software -- I'm running on a kernel with
>> the call to file_update_time commented out.
>
> Which means you are effectively running with O_CMTIME on all mmapped
> files....

Indeed.  I'm not thrilled by not having timestamps update, and IMO
adding O_CMTIME as part of the API seems like a silly way to "fix"
this.  I'd rather fix the implementation.

The problems may be fs-specific in the sense that an fs could fix them
internally, but I think something like my AS_CMTIME approach could fix
it everywhere with much less complexity.  Any thoughts on my patch
set?  (Are you the right person to review it?)

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 4/4] ext4: Fix an incorrect comment about i_mutex
  2012-12-20 23:10               ` [RFC PATCH 4/4] ext4: Fix an incorrect comment about i_mutex Andy Lutomirski
@ 2012-12-20 23:42                 ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-12-20 23:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Linux FS Devel, Dave Chinner, Al Viro, Jan Kara

On Thu 20-12-12 15:10:12, Andy Lutomirski wrote:
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
  The patch looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

  BTW: The right person for this patch is ext4 maintainer:
"Theodore Ts'o" <tytso@mit.edu>
  Since this is completely independent of anything else in this series,
just submit the patch to him. Thanks!

								Honza
> ---
>  fs/ext4/fsync.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index be1d89f..8c15642 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -113,8 +113,6 @@ static int __sync_inode(struct inode *inode, int datasync)
>   *
>   * What we do is just kick off a commit and wait on it.  This will snapshot the
>   * inode to disk.
> - *
> - * i_mutex lock is held when entering and exiting this function
>   */
>  
>  int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> -- 
> 1.7.11.7
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
  2012-12-20 23:10               ` [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes Andy Lutomirski
@ 2012-12-21  0:14                 ` Dave Chinner
  2012-12-21  0:58                   ` Jan Kara
  2012-12-21  5:36                   ` Andy Lutomirski
  2012-12-21  0:34                 ` Jan Kara
  1 sibling, 2 replies; 26+ messages in thread
From: Dave Chinner @ 2012-12-21  0:14 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: linux-kernel, Linux FS Devel, Al Viro, Jan Kara

On Thu, Dec 20, 2012 at 03:10:10PM -0800, Andy Lutomirski wrote:
> The onus is currently on filesystems to call file_update_time
> somewhere in the page_mkwrite path.  This is unfortunate for three
> reasons:
> 
> 1. page_mkwrite on a locked page should be fast.  ext4, for example,
>    often sleeps while dirtying inodes.

That's an ext4 problem, not a page fault or timestamp update
problem. Fix ext4.

> 2. The current behavior is surprising -- the timestamp resulting from
>    an mmaped write will be before the write, not after.  This contradicts
>    the mmap(2) manpage, which says:
> 
>      The st_ctime and st_mtime field for a file mapped with  PROT_WRITE  and
>      MAP_SHARED  will  be  updated  after  a write to the mapped region, and
>      before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if  one
>      occurs.

What you propose (time updates in do_writepages()) violates this.
msync(MS_ASYNC) doesn't actually start any IO, therefore the
timestamp wil not be updated.

Besides, what POSIX actually says is:

| The st_ctime and st_mtime fields of a file that is mapped with
| MAP_SHARED and PROT_WRITE shall be marked for update at some point
| in the interval between a write reference to the mapped region and
| the next call to msync() with MS_ASYNC or MS_SYNC for that portion
| of the file by any process.

Which means updating the timestamp during the first write is
perfectly acceptible. Indeed, by definition, we are compliant with
the man page because the update is after the write has occurred.
That is, the write triggered the page fault, so the page fault
processing where we update the timestamps is definitely after the
write occurred. :)

> 3. (An ulterior motive) I'd like to use hardware dirty tracking for
>    shared, locked, writable mappings (instead of page faults).  Moving
>    important work out of the page_mkwrite path is an important first step.

I don't think you're going to get very far doing this. page_mkwrite
needs to do:

	a) block allocation in page_mkwrite() for faults over holes
	   to detect ENOSPC conditions immediately rather than in
	   writeback when such an error results in data loss.
	b) detect writes over unwritten extents so that the pages in
	   the page cache can be set up correctly for conversion to
	   occur during writeback.

Correcting these two problems was the reason for introducing
page_mkwrite in the first place - we need to do this stuff before
the page fault is completed, and that means, by definition,
page_mkwrite needs to be able to block. Moving c/mtime updates out
of the way does not, in any way, change these requirements.

Perhaps you should implement everything you want to do inside ext4
first, so we can get an idea of exactly what you want page_mkwrite()
to do, how you want it to behave, and how you expect filesystems to
handle the above situations correctly....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
  2012-12-20 23:10               ` [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes Andy Lutomirski
  2012-12-21  0:14                 ` Dave Chinner
@ 2012-12-21  0:34                 ` Jan Kara
  2012-12-21  5:42                   ` Andy Lutomirski
  1 sibling, 1 reply; 26+ messages in thread
From: Jan Kara @ 2012-12-21  0:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Linux FS Devel, Dave Chinner, Al Viro, Jan Kara

On Thu 20-12-12 15:10:10, Andy Lutomirski wrote:
> The onus is currently on filesystems to call file_update_time
> somewhere in the page_mkwrite path.  This is unfortunate for three
> reasons:
> 
> 1. page_mkwrite on a locked page should be fast.  ext4, for example,
>    often sleeps while dirtying inodes.
> 
> 2. The current behavior is surprising -- the timestamp resulting from
>    an mmaped write will be before the write, not after.  This contradicts
>    the mmap(2) manpage, which says:
> 
>      The st_ctime and st_mtime field for a file mapped with  PROT_WRITE  and
>      MAP_SHARED  will  be  updated  after  a write to the mapped region, and
>      before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if  one
>      occurs.
  I agree your behavior is more correct wrt to the manpage / spec. OTOH I
could dig out several emails where users complain time stamps magically
change some time after the file was written via mmap (because writeback
happened at that time and it did some allocation to the inode). People hit
this e.g. when compiling something, ld(1) writes final binary through mmap,
the package / archive the final binary and later some sanity check finds
the time stamp on the binary is newer than the package / archive.

Looking more into the patch you end up updating timestamps on munmap(2)
(thus on file close in particular). That should avoid the most surprising
cases and users hopefully won't notice the difference. Good. But please
mention this explicitely in the changelog.


 
> 3. (An ulterior motive) I'd like to use hardware dirty tracking for
>    shared, locked, writable mappings (instead of page faults).  Moving
>    important work out of the page_mkwrite path is an important first step.
> 
> The next patch will remove the now-unnecessary file_update_time calls.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  fs/inode.c              | 37 +++++++++++++++++++++++++++++++++++++
>  include/linux/fs.h      |  1 +
>  include/linux/pagemap.h |  3 +++
>  mm/memory.c             |  2 +-
>  mm/mmap.c               |  4 ++++
>  mm/page-writeback.c     | 30 ++++++++++++++++++++++++++++--
>  6 files changed, 74 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 64999f1..d881d0b 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1688,6 +1688,43 @@ int file_update_time(struct file *file)
>  }
>  EXPORT_SYMBOL(file_update_time);
>  
> +/**
> + *	inode_update_time_writable	-	update mtime and ctime time
> + *	@inode: inode accessed
> + *
> + *	This is like file_update_time, but it assumes the mnt is writable
> + *	and takes an inode parameter instead.
> + */
> +
> +int inode_update_time_writable(struct inode *inode)
> +{
> +	struct timespec now;
> +	int sync_it = 0;
> +	int ret;
> +
> +	/* First try to exhaust all avenues to not sync */
> +	if (IS_NOCMTIME(inode))
> +		return 0;
> +
> +	now = current_fs_time(inode->i_sb);
> +	if (!timespec_equal(&inode->i_mtime, &now))
> +		sync_it = S_MTIME;
> +
> +	if (!timespec_equal(&inode->i_ctime, &now))
> +		sync_it |= S_CTIME;
> +
> +	if (IS_I_VERSION(inode))
> +		sync_it |= S_VERSION;
> +
> +	if (!sync_it)
> +		return 0;
> +
> +	ret = update_time(inode, &now, sync_it);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(inode_update_time_writable);
> +
  So this differs from file_update_time() only by not calling
__mnt_want_write(). Why this special function? It is actually unsafe wrt
remounts read-only or filesystem freezing... For that you need to call
sb_start_write() / sb_end_write() around the timestamp update. Umm, or
better sb_start_pagefault() / sb_end_pagefault() because the call in
remove_vma() gets called under mmap_sem so we are in a rather similar
situation to ->page_mkwrite.

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3913262..60301dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -223,6 +223,10 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>  	struct vm_area_struct *next = vma->vm_next;
>  
>  	might_sleep();
> +
> +	if (vma->vm_file)
> +		mapping_flush_cmtime(vma->vm_file->f_mapping);
> +
>  	if (vma->vm_ops && vma->vm_ops->close)
>  		vma->vm_ops->close(vma);
>  	if (vma->vm_file)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index cdea11a..8cbb7fb 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1910,6 +1910,13 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  		ret = mapping->a_ops->writepages(mapping, wbc);
>  	else
>  		ret = generic_writepages(mapping, wbc);
> +
> +	/*
> +	 * This is after writepages because the AS_CMTIME bit won't
> +	 * bet set until writepages is called.
> +	 */
> +	mapping_flush_cmtime(mapping);
> +
>  	return ret;
>  }
>  
> @@ -2117,8 +2124,17 @@ EXPORT_SYMBOL(set_page_dirty);
>   */
>  int set_page_dirty_from_pte(struct page *page)
>  {
> -	/* Doesn't do anything interesting yet. */
> -	return set_page_dirty(page);
> +	int ret = set_page_dirty(page);
> +
> +	/*
> +	 * We may be out of memory and/or have various locks held, so
> +	 * there isn't much we can do in here.
> +	 */
> +	struct address_space *mapping = page_mapping(page);
  Declarations should go together please. So something like:
	int ret = set_page_dirty(page);
	struct address_space *mapping = page_mapping(page);

	/* comment... */

> +	if (mapping)
> +		set_bit(AS_CMTIME, &mapping->flags);
> +
> +	return ret;
>  }
>  
>  /*
> @@ -2287,3 +2303,13 @@ int mapping_tagged(struct address_space *mapping, int tag)
>  	return radix_tree_tagged(&mapping->page_tree, tag);
>  }
>  EXPORT_SYMBOL(mapping_tagged);
> +
> +/*
> + * Call from any context from which inode_update_time_writable would be okay
> + * to flush deferred cmtime changes.
> + */
> +void mapping_flush_cmtime(struct address_space *mapping)
> +{
> +	if (test_and_clear_bit(AS_CMTIME, &mapping->flags))
> +		inode_update_time_writable(mapping->host);
> +}

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
  2012-12-21  0:14                 ` Dave Chinner
@ 2012-12-21  0:58                   ` Jan Kara
  2012-12-21  1:12                     ` Dave Chinner
  2012-12-21  5:36                   ` Andy Lutomirski
  1 sibling, 1 reply; 26+ messages in thread
From: Jan Kara @ 2012-12-21  0:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andy Lutomirski, linux-kernel, Linux FS Devel, Al Viro, Jan Kara

On Fri 21-12-12 11:14:57, Dave Chinner wrote:
> On Thu, Dec 20, 2012 at 03:10:10PM -0800, Andy Lutomirski wrote:
> > The onus is currently on filesystems to call file_update_time
> > somewhere in the page_mkwrite path.  This is unfortunate for three
> > reasons:
> > 
> > 1. page_mkwrite on a locked page should be fast.  ext4, for example,
> >    often sleeps while dirtying inodes.
> 
> That's an ext4 problem, not a page fault or timestamp update
> problem. Fix ext4.
  Well, XFS doesn't journal the timestamp update which is why it gets away
without blocking on journal. Other filesystems (and I don't think it's just
ext4) are so benevolent with timestamps so their updates are more costly...

> > 2. The current behavior is surprising -- the timestamp resulting from
> >    an mmaped write will be before the write, not after.  This contradicts
> >    the mmap(2) manpage, which says:
> > 
> >      The st_ctime and st_mtime field for a file mapped with  PROT_WRITE  and
> >      MAP_SHARED  will  be  updated  after  a write to the mapped region, and
> >      before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if  one
> >      occurs.
> 
> What you propose (time updates in do_writepages()) violates this.
> msync(MS_ASYNC) doesn't actually start any IO, therefore the
> timestamp wil not be updated.
> 
> Besides, what POSIX actually says is:
> 
> | The st_ctime and st_mtime fields of a file that is mapped with
> | MAP_SHARED and PROT_WRITE shall be marked for update at some point
> | in the interval between a write reference to the mapped region and
> | the next call to msync() with MS_ASYNC or MS_SYNC for that portion
> | of the file by any process.
> 
> Which means updating the timestamp during the first write is
> perfectly acceptible. Indeed, by definition, we are compliant with
> the man page because the update is after the write has occurred.
> That is, the write triggered the page fault, so the page fault
> processing where we update the timestamps is definitely after the
> write occurred. :)
  Well, but there can be more writes to the already write faulted page.
They can come seconds after we called ->page_mkwrite() and thus updated
time stamps. So Andy is correct we violate the spec AFAICT.

> > 3. (An ulterior motive) I'd like to use hardware dirty tracking for
> >    shared, locked, writable mappings (instead of page faults).  Moving
> >    important work out of the page_mkwrite path is an important first step.
> 
> I don't think you're going to get very far doing this. page_mkwrite
> needs to do:
> 
> 	a) block allocation in page_mkwrite() for faults over holes
> 	   to detect ENOSPC conditions immediately rather than in
> 	   writeback when such an error results in data loss.
> 	b) detect writes over unwritten extents so that the pages in
> 	   the page cache can be set up correctly for conversion to
> 	   occur during writeback.
> 
> Correcting these two problems was the reason for introducing
> page_mkwrite in the first place - we need to do this stuff before
> the page fault is completed, and that means, by definition,
> page_mkwrite needs to be able to block. Moving c/mtime updates out
> of the way does not, in any way, change these requirements.
  Here I completely agree. I wanted to comment on it in my post as well but
then forgot about it.

> Perhaps you should implement everything you want to do inside ext4
> first, so we can get an idea of exactly what you want page_mkwrite()
> to do, how you want it to behave, and how you expect filesystems to
> handle the above situations correctly....
  Ah, now I noticed we don't call file_update_time() from
__block_page_mkwrite() so yes, just changing ext4 without touching generic
code is easily possible.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
  2012-12-21  0:58                   ` Jan Kara
@ 2012-12-21  1:12                     ` Dave Chinner
  2012-12-21  1:36                       ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2012-12-21  1:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andy Lutomirski, linux-kernel, Linux FS Devel, Al Viro

On Fri, Dec 21, 2012 at 01:58:25AM +0100, Jan Kara wrote:
> On Fri 21-12-12 11:14:57, Dave Chinner wrote:
> > On Thu, Dec 20, 2012 at 03:10:10PM -0800, Andy Lutomirski wrote:
> > > The onus is currently on filesystems to call file_update_time
> > > somewhere in the page_mkwrite path.  This is unfortunate for three
> > > reasons:
> > > 
> > > 1. page_mkwrite on a locked page should be fast.  ext4, for example,
> > >    often sleeps while dirtying inodes.
> > 
> > That's an ext4 problem, not a page fault or timestamp update
> > problem. Fix ext4.
>   Well, XFS doesn't journal the timestamp update which is why it gets away
> without blocking on journal. Other filesystems (and I don't think it's just
> ext4) are so benevolent with timestamps so their updates are more costly...

XFS does journal it's timestamps now. We changed that code recently,
so XFs is no different to any other filesystem in this respect.

My point is, though, that this choice (of when to update the
timestamp) can already be made by filesytsems without changing the
high level code.

> > > 2. The current behavior is surprising -- the timestamp resulting from
> > >    an mmaped write will be before the write, not after.  This contradicts
> > >    the mmap(2) manpage, which says:
> > > 
> > >      The st_ctime and st_mtime field for a file mapped with  PROT_WRITE  and
> > >      MAP_SHARED  will  be  updated  after  a write to the mapped region, and
> > >      before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if  one
> > >      occurs.
> > 
> > What you propose (time updates in do_writepages()) violates this.
> > msync(MS_ASYNC) doesn't actually start any IO, therefore the
> > timestamp wil not be updated.
> > 
> > Besides, what POSIX actually says is:
> > 
> > | The st_ctime and st_mtime fields of a file that is mapped with
> > | MAP_SHARED and PROT_WRITE shall be marked for update at some point
> > | in the interval between a write reference to the mapped region and
> > | the next call to msync() with MS_ASYNC or MS_SYNC for that portion
> > | of the file by any process.
> > 
> > Which means updating the timestamp during the first write is
> > perfectly acceptible. Indeed, by definition, we are compliant with
> > the man page because the update is after the write has occurred.
> > That is, the write triggered the page fault, so the page fault
> > processing where we update the timestamps is definitely after the
> > write occurred. :)
>   Well, but there can be more writes to the already write faulted page.
> They can come seconds after we called ->page_mkwrite() and thus updated
> time stamps. So Andy is correct we violate the spec AFAICT.

Depends how you read it. It can be updated at *any time* between the
write and the msync() call, which is exactly what happens right now.
The fact that second and subsequent writes between the first write
and the msync call do not change it is irrelevant, as the first one
is the one that matters... Indeed, if you read to the letter of the
posix definition, then updating timestamps in the msync call is also
incorrect, because that is not between the write and the msync()
call.

What I'm saying is saying the current behaviour is wrong is
dependent on a specific intepretation of the standard, and the same
arguments can be made against this proposal. Hence such arguments
are not a convincing/compelling reason to change behaviours.

> > Perhaps you should implement everything you want to do inside ext4
> > first, so we can get an idea of exactly what you want page_mkwrite()
> > to do, how you want it to behave, and how you expect filesystems to
> > handle the above situations correctly....
>   Ah, now I noticed we don't call file_update_time() from
> __block_page_mkwrite() so yes, just changing ext4 without touching generic
> code is easily possible.

Yup. I'm not opposed to making such changes, but I want to see the
bigger picture before making the little changes that start moving in
that direction. Until it's demonstrated as possible with a
custom page_mkwrite implementation....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
  2012-12-21  1:12                     ` Dave Chinner
@ 2012-12-21  1:36                       ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-12-21  1:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Andy Lutomirski, linux-kernel, Linux FS Devel, Al Viro

On Fri 21-12-12 12:12:46, Dave Chinner wrote:
> > > > 2. The current behavior is surprising -- the timestamp resulting from
> > > >    an mmaped write will be before the write, not after.  This contradicts
> > > >    the mmap(2) manpage, which says:
> > > > 
> > > >      The st_ctime and st_mtime field for a file mapped with  PROT_WRITE  and
> > > >      MAP_SHARED  will  be  updated  after  a write to the mapped region, and
> > > >      before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if  one
> > > >      occurs.
> > > 
> > > What you propose (time updates in do_writepages()) violates this.
> > > msync(MS_ASYNC) doesn't actually start any IO, therefore the
> > > timestamp wil not be updated.
> > > 
> > > Besides, what POSIX actually says is:
> > > 
> > > | The st_ctime and st_mtime fields of a file that is mapped with
> > > | MAP_SHARED and PROT_WRITE shall be marked for update at some point
> > > | in the interval between a write reference to the mapped region and
> > > | the next call to msync() with MS_ASYNC or MS_SYNC for that portion
> > > | of the file by any process.
> > > 
> > > Which means updating the timestamp during the first write is
> > > perfectly acceptible. Indeed, by definition, we are compliant with
> > > the man page because the update is after the write has occurred.
> > > That is, the write triggered the page fault, so the page fault
> > > processing where we update the timestamps is definitely after the
> > > write occurred. :)
> >   Well, but there can be more writes to the already write faulted page.
> > They can come seconds after we called ->page_mkwrite() and thus updated
> > time stamps. So Andy is correct we violate the spec AFAICT.
> 
> Depends how you read it. It can be updated at *any time* between the
> write and the msync() call, which is exactly what happens right now.
> The fact that second and subsequent writes between the first write
> and the msync call do not change it is irrelevant, as the first one
> is the one that matters... Indeed, if you read to the letter of the
> posix definition, then updating timestamps in the msync call is also
> incorrect, because that is not between the write and the msync()
> call.
> 
> What I'm saying is saying the current behaviour is wrong is
> dependent on a specific intepretation of the standard, and the same
> arguments can be made against this proposal. Hence such arguments
> are not a convincing/compelling reason to change behaviours.
  I have to say I'm not following you :) If I have the program that does:

  fd = open("file", O_RDWR);
  addr = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
  addr[0] = 'a';
  sleep(1);
  addr[1] = 'b';
  close(fd);

  Then application of the spec to the second write clearly states that time
stamps should be updated sometime after the write of 'b'. I don't see any
space for other interpretation there... And currently we update time stamps
only at the moment we write 'a'.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
  2012-12-21  0:14                 ` Dave Chinner
  2012-12-21  0:58                   ` Jan Kara
@ 2012-12-21  5:36                   ` Andy Lutomirski
  2012-12-21 10:51                     ` Jan Kara
  1 sibling, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-21  5:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, Linux FS Devel, Al Viro, Jan Kara

On Thu, Dec 20, 2012 at 4:14 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Dec 20, 2012 at 03:10:10PM -0800, Andy Lutomirski wrote:
>> The onus is currently on filesystems to call file_update_time
>> somewhere in the page_mkwrite path.  This is unfortunate for three
>> reasons:
>>
>> 1. page_mkwrite on a locked page should be fast.  ext4, for example,
>>    often sleeps while dirtying inodes.
>
> That's an ext4 problem, not a page fault or timestamp update
> problem. Fix ext4.

I could fix ext4 (or rather, someone who understands jbd2 could fix
ext4), but I think my approach is considerably simpler and fixes all
filesystems at once.

>
>> 2. The current behavior is surprising -- the timestamp resulting from
>>    an mmaped write will be before the write, not after.  This contradicts
>>    the mmap(2) manpage, which says:
>>
>>      The st_ctime and st_mtime field for a file mapped with  PROT_WRITE  and
>>      MAP_SHARED  will  be  updated  after  a write to the mapped region, and
>>      before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if  one
>>      occurs.
>
> What you propose (time updates in do_writepages()) violates this.
> msync(MS_ASYNC) doesn't actually start any IO, therefore the
> timestamp wil not be updated.

That's easy enough -- I can add special-case code for MS_ASYNC.  It'll
make an operation that is otherwise a no-op considerably less free,
though.

>
> Besides, what POSIX actually says is:

[snipped -- covered elsewhere in the thread]

>
>> 3. (An ulterior motive) I'd like to use hardware dirty tracking for
>>    shared, locked, writable mappings (instead of page faults).  Moving
>>    important work out of the page_mkwrite path is an important first step.
>
> I don't think you're going to get very far doing this. page_mkwrite
> needs to do:
>
>         a) block allocation in page_mkwrite() for faults over holes
>            to detect ENOSPC conditions immediately rather than in
>            writeback when such an error results in data loss.
>         b) detect writes over unwritten extents so that the pages in
>            the page cache can be set up correctly for conversion to
>            occur during writeback.
>
> Correcting these two problems was the reason for introducing
> page_mkwrite in the first place - we need to do this stuff before
> the page fault is completed, and that means, by definition,
> page_mkwrite needs to be able to block. Moving c/mtime updates out
> of the way does not, in any way, change these requirements.

I was unclear.  I have no problem if PROT_WRITE, MAP_SHARED pages
start out write protected.  I want two changes from the status quo,
though:

1. ptes (maybe only if mlocked or maybe even only if some new madvise
flag is set) should be marked clean but stay writable during and after
writeback.  (This would only work when stable pages are not in
effect.)

2. There should be a way to request that pages be made clean-but-writable.

#1 should be mostly fine already from the filesystems' point of view.
#2 would involve calling page_mkwrite or some equivalent.

I suspect that there are bugs that are currently untriggerable
involving clean-but-writable pages.  For example, what happens if cpu
0 (atomically) changes a pte from clean/writable to not present, but
cpu 1 writes the page before the TLB flush happens.  I think the pte
ends up not present + dirty, but the current unmapping code won't
notice.

This would involve splitting page_mkclean into "make clean and write
protect" and "make clean but leave writable".

Doing this would avoid ~1M "soft" page faults per day in a single
real-time thread in my software.  (The file time patches are to make
sure that the "soft" page faults actually don't sleep.)

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
  2012-12-21  0:34                 ` Jan Kara
@ 2012-12-21  5:42                   ` Andy Lutomirski
  2012-12-21 11:03                     ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-21  5:42 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, Linux FS Devel, Dave Chinner, Al Viro

On Thu, Dec 20, 2012 at 4:34 PM, Jan Kara <jack@suse.cz> wrote:
> On Thu 20-12-12 15:10:10, Andy Lutomirski wrote:
>> The onus is currently on filesystems to call file_update_time
>> somewhere in the page_mkwrite path.  This is unfortunate for three
>> reasons:
>>
>> 1. page_mkwrite on a locked page should be fast.  ext4, for example,
>>    often sleeps while dirtying inodes.
>>
>> 2. The current behavior is surprising -- the timestamp resulting from
>>    an mmaped write will be before the write, not after.  This contradicts
>>    the mmap(2) manpage, which says:
>>
>>      The st_ctime and st_mtime field for a file mapped with  PROT_WRITE  and
>>      MAP_SHARED  will  be  updated  after  a write to the mapped region, and
>>      before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if  one
>>      occurs.
>   I agree your behavior is more correct wrt to the manpage / spec. OTOH I
> could dig out several emails where users complain time stamps magically
> change some time after the file was written via mmap (because writeback
> happened at that time and it did some allocation to the inode). People hit
> this e.g. when compiling something, ld(1) writes final binary through mmap,
> the package / archive the final binary and later some sanity check finds
> the time stamp on the binary is newer than the package / archive.
>
> Looking more into the patch you end up updating timestamps on munmap(2)
> (thus on file close in particular). That should avoid the most surprising
> cases and users hopefully won't notice the difference. Good. But please
> mention this explicitely in the changelog.

I was careful to get that case right.  I'll update the changelog.

In particular, I've so far tested munmap, msync(MS_SYNC), fsync,
waiting 30 seconds, and dying by fatal signal.  All of those paths
work right.

>> +/**
>> + *   inode_update_time_writable      -       update mtime and ctime time
>> + *   @inode: inode accessed
>> + *
>> + *   This is like file_update_time, but it assumes the mnt is writable
>> + *   and takes an inode parameter instead.
>> + */
>> +
>> +int inode_update_time_writable(struct inode *inode)
>> +{
>> +     struct timespec now;
>> +     int sync_it = 0;
>> +     int ret;
>> +
>> +     /* First try to exhaust all avenues to not sync */
>> +     if (IS_NOCMTIME(inode))
>> +             return 0;
>> +
>> +     now = current_fs_time(inode->i_sb);
>> +     if (!timespec_equal(&inode->i_mtime, &now))
>> +             sync_it = S_MTIME;
>> +
>> +     if (!timespec_equal(&inode->i_ctime, &now))
>> +             sync_it |= S_CTIME;
>> +
>> +     if (IS_I_VERSION(inode))
>> +             sync_it |= S_VERSION;
>> +
>> +     if (!sync_it)
>> +             return 0;
>> +
>> +     ret = update_time(inode, &now, sync_it);
>> +
>> +     return ret;
>> +}
>> +EXPORT_SYMBOL(inode_update_time_writable);
>> +
>   So this differs from file_update_time() only by not calling
> __mnt_want_write(). Why this special function? It is actually unsafe wrt
> remounts read-only or filesystem freezing... For that you need to call
> sb_start_write() / sb_end_write() around the timestamp update. Umm, or
> better sb_start_pagefault() / sb_end_pagefault() because the call in
> remove_vma() gets called under mmap_sem so we are in a rather similar
> situation to ->page_mkwrite.

The important difference is that it takes an inode* as a parameter
instead of a file*.  I don't think that inodes have a struct vfsmount,
so I can't call __mnt_want_write.  I'll take a look at
sb_start_pagefault.  I'll also refactor this a bit to minimize code
duplication.  The current approach was for the v1 rfc version. :)

>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 3913262..60301dc 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -223,6 +223,10 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>>       struct vm_area_struct *next = vma->vm_next;
>>
>>       might_sleep();
>> +
>> +     if (vma->vm_file)
>> +             mapping_flush_cmtime(vma->vm_file->f_mapping);
>> +
>>       if (vma->vm_ops && vma->vm_ops->close)
>>               vma->vm_ops->close(vma);
>>       if (vma->vm_file)
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index cdea11a..8cbb7fb 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1910,6 +1910,13 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
>>               ret = mapping->a_ops->writepages(mapping, wbc);
>>       else
>>               ret = generic_writepages(mapping, wbc);
>> +
>> +     /*
>> +      * This is after writepages because the AS_CMTIME bit won't
>> +      * bet set until writepages is called.
>> +      */
>> +     mapping_flush_cmtime(mapping);
>> +
>>       return ret;
>>  }
>>
>> @@ -2117,8 +2124,17 @@ EXPORT_SYMBOL(set_page_dirty);
>>   */
>>  int set_page_dirty_from_pte(struct page *page)
>>  {
>> -     /* Doesn't do anything interesting yet. */
>> -     return set_page_dirty(page);
>> +     int ret = set_page_dirty(page);
>> +
>> +     /*
>> +      * We may be out of memory and/or have various locks held, so
>> +      * there isn't much we can do in here.
>> +      */
>> +     struct address_space *mapping = page_mapping(page);
>   Declarations should go together please. So something like:
>         int ret = set_page_dirty(page);
>         struct address_space *mapping = page_mapping(page);
>
>         /* comment... */

Will do.  Some day I'll learn how to act less like a C99/C++
programmer when writing kernel code.

Am I correct in interpreting this as "these patches may be
sufficiently non-insane that I should keep working on them"?  I admit
I'm pretty far out of my depth working on vm/vfs stuff.

Thanks,
Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
  2012-12-21  5:36                   ` Andy Lutomirski
@ 2012-12-21 10:51                     ` Jan Kara
  2012-12-21 18:26                       ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2012-12-21 10:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Chinner, linux-kernel, Linux FS Devel, Al Viro, Jan Kara

On Thu 20-12-12 21:36:58, Andy Lutomirski wrote:
> On Thu, Dec 20, 2012 at 4:14 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Dec 20, 2012 at 03:10:10PM -0800, Andy Lutomirski wrote:
> >> The onus is currently on filesystems to call file_update_time
> >> somewhere in the page_mkwrite path.  This is unfortunate for three
> >> reasons:
> >>
> >> 1. page_mkwrite on a locked page should be fast.  ext4, for example,
> >>    often sleeps while dirtying inodes.
> >
> > That's an ext4 problem, not a page fault or timestamp update
> > problem. Fix ext4.
> 
> I could fix ext4 (or rather, someone who understands jbd2 could fix
> ext4), but I think my approach is considerably simpler and fixes all
> filesystems at once.
  Well, you could implement the fix you did just inside ext4. Remove
timestamp update from ext4_page_mkwrite(), just set some inode flag there,
update times in ext4_writepages(), ext4_sync_file(), and
ext4_release_file(). It won't be as elegant but it would work.

> >> 3. (An ulterior motive) I'd like to use hardware dirty tracking for
> >>    shared, locked, writable mappings (instead of page faults).  Moving
> >>    important work out of the page_mkwrite path is an important first step.
> >
> > I don't think you're going to get very far doing this. page_mkwrite
> > needs to do:
> >
> >         a) block allocation in page_mkwrite() for faults over holes
> >            to detect ENOSPC conditions immediately rather than in
> >            writeback when such an error results in data loss.
> >         b) detect writes over unwritten extents so that the pages in
> >            the page cache can be set up correctly for conversion to
> >            occur during writeback.
> >
> > Correcting these two problems was the reason for introducing
> > page_mkwrite in the first place - we need to do this stuff before
> > the page fault is completed, and that means, by definition,
> > page_mkwrite needs to be able to block. Moving c/mtime updates out
> > of the way does not, in any way, change these requirements.
> 
> I was unclear.  I have no problem if PROT_WRITE, MAP_SHARED pages
> start out write protected.  I want two changes from the status quo,
> though:
> 
> 1. ptes (maybe only if mlocked or maybe even only if some new madvise
> flag is set) should be marked clean but stay writable during and after
> writeback.  (This would only work when stable pages are not in
> effect.)
  This is actually problematic - s390 architecture doesn't have PTE dirty
bit. It has per page HW dirty bit but that's problematic to use so it's
completely relying on page being protected after writeback so that it hits
page fault when it should be dirtied again. Also memory management uses
these page faults to track number of dirty pages which is then important to
throttle aggressive writers etc. (we used to do what you describe sometimes
in early 2.6 days but then we switched to writeprotecting the page for
writeback to be able to do dirty page tracking).

If we limit the behavior you desribe to mlocked pages, I could imagine
things would work out from the accounting POV (we could treat mlocked pages
as always dirty for accounting purposes) but it's definitely not a change
to be easily made.

> 2. There should be a way to request that pages be made clean-but-writable.
> 
> #1 should be mostly fine already from the filesystems' point of view.
> #2 would involve calling page_mkwrite or some equivalent.
> 
> I suspect that there are bugs that are currently untriggerable
> involving clean-but-writable pages.  For example, what happens if cpu
> 0 (atomically) changes a pte from clean/writable to not present, but
> cpu 1 writes the page before the TLB flush happens.  I think the pte
> ends up not present + dirty, but the current unmapping code won't
> notice.
> 
> This would involve splitting page_mkclean into "make clean and write
> protect" and "make clean but leave writable".
> 
> Doing this would avoid ~1M "soft" page faults per day in a single
> real-time thread in my software.  (The file time patches are to make
> sure that the "soft" page faults actually don't sleep.)
  Yes, there's a non-negligible cost of writeprotecting the page during
writeback. But the avoidance of OOM conditions due to too aggressive
writers simply has precedence... Anyway, if you want to push more in this
direction, I suggest to move this discussion to linux-mm@kvack.org as there
are lingering more appropriate listeners :). I'm just a filesystem guy with
some basic knowledge of MM.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
  2012-12-21  5:42                   ` Andy Lutomirski
@ 2012-12-21 11:03                     ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-12-21 11:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, linux-kernel, Linux FS Devel, Dave Chinner, Al Viro

On Thu 20-12-12 21:42:46, Andy Lutomirski wrote:
> On Thu, Dec 20, 2012 at 4:34 PM, Jan Kara <jack@suse.cz> wrote:
> >> +/**
> >> + *   inode_update_time_writable      -       update mtime and ctime time
> >> + *   @inode: inode accessed
> >> + *
> >> + *   This is like file_update_time, but it assumes the mnt is writable
> >> + *   and takes an inode parameter instead.
> >> + */
> >> +
> >> +int inode_update_time_writable(struct inode *inode)
> >> +{
> >> +     struct timespec now;
> >> +     int sync_it = 0;
> >> +     int ret;
> >> +
> >> +     /* First try to exhaust all avenues to not sync */
> >> +     if (IS_NOCMTIME(inode))
> >> +             return 0;
> >> +
> >> +     now = current_fs_time(inode->i_sb);
> >> +     if (!timespec_equal(&inode->i_mtime, &now))
> >> +             sync_it = S_MTIME;
> >> +
> >> +     if (!timespec_equal(&inode->i_ctime, &now))
> >> +             sync_it |= S_CTIME;
> >> +
> >> +     if (IS_I_VERSION(inode))
> >> +             sync_it |= S_VERSION;
> >> +
> >> +     if (!sync_it)
> >> +             return 0;
> >> +
> >> +     ret = update_time(inode, &now, sync_it);
> >> +
> >> +     return ret;
> >> +}
> >> +EXPORT_SYMBOL(inode_update_time_writable);
> >> +
> >   So this differs from file_update_time() only by not calling
> > __mnt_want_write(). Why this special function? It is actually unsafe wrt
> > remounts read-only or filesystem freezing... For that you need to call
> > sb_start_write() / sb_end_write() around the timestamp update. Umm, or
> > better sb_start_pagefault() / sb_end_pagefault() because the call in
> > remove_vma() gets called under mmap_sem so we are in a rather similar
> > situation to ->page_mkwrite.
> 
> The important difference is that it takes an inode* as a parameter
> instead of a file*.  I don't think that inodes have a struct vfsmount,
> so I can't call __mnt_want_write.  I'll take a look at
> sb_start_pagefault.  I'll also refactor this a bit to minimize code
> duplication.  The current approach was for the v1 rfc version. :)
  I see. OK. __mnt_want_write() is the less important part and I guess we
can just skip that for timestamp updates (we are sure update was generated
because of some writeable mount so it's not really important via which
mount writing of them to disk was triggered). But freeze protection is a
must (otherwise e.g. dm snapshot could create corrupted filesystems). Also
reducing code duplication would be nice as you say.

> >> diff --git a/mm/mmap.c b/mm/mmap.c
> >> index 3913262..60301dc 100644
> >> --- a/mm/mmap.c
> >> +++ b/mm/mmap.c
> >> @@ -223,6 +223,10 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
> >>       struct vm_area_struct *next = vma->vm_next;
> >>
> >>       might_sleep();
> >> +
> >> +     if (vma->vm_file)
> >> +             mapping_flush_cmtime(vma->vm_file->f_mapping);
> >> +
> >>       if (vma->vm_ops && vma->vm_ops->close)
> >>               vma->vm_ops->close(vma);
> >>       if (vma->vm_file)
> >> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> >> index cdea11a..8cbb7fb 100644
> >> --- a/mm/page-writeback.c
> >> +++ b/mm/page-writeback.c
> >> @@ -1910,6 +1910,13 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
> >>               ret = mapping->a_ops->writepages(mapping, wbc);
> >>       else
> >>               ret = generic_writepages(mapping, wbc);
> >> +
> >> +     /*
> >> +      * This is after writepages because the AS_CMTIME bit won't
> >> +      * bet set until writepages is called.
> >> +      */
> >> +     mapping_flush_cmtime(mapping);
> >> +
> >>       return ret;
> >>  }
> >>
> >> @@ -2117,8 +2124,17 @@ EXPORT_SYMBOL(set_page_dirty);
> >>   */
> >>  int set_page_dirty_from_pte(struct page *page)
> >>  {
> >> -     /* Doesn't do anything interesting yet. */
> >> -     return set_page_dirty(page);
> >> +     int ret = set_page_dirty(page);
> >> +
> >> +     /*
> >> +      * We may be out of memory and/or have various locks held, so
> >> +      * there isn't much we can do in here.
> >> +      */
> >> +     struct address_space *mapping = page_mapping(page);
> >   Declarations should go together please. So something like:
> >         int ret = set_page_dirty(page);
> >         struct address_space *mapping = page_mapping(page);
> >
> >         /* comment... */
> 
> Will do.  Some day I'll learn how to act less like a C99/C++
> programmer when writing kernel code.
  We are conservative sometimes ;)

> Am I correct in interpreting this as "these patches may be
> sufficiently non-insane that I should keep working on them"?  I admit
> I'm pretty far out of my depth working on vm/vfs stuff.
  Well, they look sane to me. The 'ext4 sometime hangs' part isn't that
compelling in general (although I understand it's your main motivation) but
'we should follow POSIX' part is. I suggest you CC linux-mm@kvack.org on
your next iteration as some changes involve more MM than anything else.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
  2012-12-21 10:51                     ` Jan Kara
@ 2012-12-21 18:26                       ` Andy Lutomirski
  0 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-21 18:26 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Chinner, linux-kernel, Linux FS Devel, Al Viro

On Fri, Dec 21, 2012 at 2:51 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 20-12-12 21:36:58, Andy Lutomirski wrote:
>> On Thu, Dec 20, 2012 at 4:14 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Dec 20, 2012 at 03:10:10PM -0800, Andy Lutomirski wrote:
>> >> The onus is currently on filesystems to call file_update_time
>> >> somewhere in the page_mkwrite path.  This is unfortunate for three
>> >> reasons:
>> >>
>> >> 1. page_mkwrite on a locked page should be fast.  ext4, for example,
>> >>    often sleeps while dirtying inodes.
>> >
>> > That's an ext4 problem, not a page fault or timestamp update
>> > problem. Fix ext4.
>>
>> I could fix ext4 (or rather, someone who understands jbd2 could fix
>> ext4), but I think my approach is considerably simpler and fixes all
>> filesystems at once.
>   Well, you could implement the fix you did just inside ext4. Remove
> timestamp update from ext4_page_mkwrite(), just set some inode flag there,
> update times in ext4_writepages(), ext4_sync_file(), and
> ext4_release_file(). It won't be as elegant but it would work.

I'm worried about funny corner cases.  Suppose fd1 and fd2 refer to
two separate struct files for the same inode.  Then do this:

addr1 = mmap(fd1);
addr2 = mmap(fd2);
*addr2 = 0;  <-- calls page_mkwrite
munmap(addr1);
close(fd1);
sleep(10);
*addr2 = 1;  <-- probably does not call page_mkwrite
munmap(addr2);
close(fd2);

Without extra care, the file timestamp will be ten seconds too early.
To make this work, ext4 would need to keep track of which vma is
dirty, which would involve a lot more complexity.  Alternatively, ext4
could call page_mkclean on all pages in the mapping in
ext4_release_file, but that would suck for performance.  (Actually,
that would probably be awful for my workload -- I have a background
task that periodically opens and (critically) closes the same files
that are mmaped in real-time thread, and all those extra tlb flushes
and page faults would hurt.)

The approach in my patches magically avoids this problem by using the
per-pte dirty bit as opposed to anything that's per page. :)
Admittedly, I could leave the AS_CMTIME stuff in the core and just
sprinkle mapping_flush_cmtime calls around ext4.

One of these days I hope to open-source the thing that hits all these
issues.  It's useful and has no good reason to be proprietary, other
than its present messiness.

>
>> >> 3. (An ulterior motive) I'd like to use hardware dirty tracking for
>> >>    shared, locked, writable mappings (instead of page faults).  Moving
>> >>    important work out of the page_mkwrite path is an important first step.
>> >
>> > I don't think you're going to get very far doing this. page_mkwrite
>> > needs to do:
>> >
>> >         a) block allocation in page_mkwrite() for faults over holes
>> >            to detect ENOSPC conditions immediately rather than in
>> >            writeback when such an error results in data loss.
>> >         b) detect writes over unwritten extents so that the pages in
>> >            the page cache can be set up correctly for conversion to
>> >            occur during writeback.
>> >
>> > Correcting these two problems was the reason for introducing
>> > page_mkwrite in the first place - we need to do this stuff before
>> > the page fault is completed, and that means, by definition,
>> > page_mkwrite needs to be able to block. Moving c/mtime updates out
>> > of the way does not, in any way, change these requirements.
>>
>> I was unclear.  I have no problem if PROT_WRITE, MAP_SHARED pages
>> start out write protected.  I want two changes from the status quo,
>> though:
>>
>> 1. ptes (maybe only if mlocked or maybe even only if some new madvise
>> flag is set) should be marked clean but stay writable during and after
>> writeback.  (This would only work when stable pages are not in
>> effect.)
>   This is actually problematic - s390 architecture doesn't have PTE dirty
> bit. It has per page HW dirty bit but that's problematic to use so it's
> completely relying on page being protected after writeback so that it hits
> page fault when it should be dirtied again. Also memory management uses
> these page faults to track number of dirty pages which is then important to
> throttle aggressive writers etc. (we used to do what you describe sometimes
> in early 2.6 days but then we switched to writeprotecting the page for
> writeback to be able to do dirty page tracking).

Ugh.  I thought arches without per-pte dirty bits were supposed to
emulate them by treating writable+clean as write protected and fixing
everything up when page faults happen.

This whole thing could just be turned off on s390, though.

>
> If we limit the behavior you desribe to mlocked pages, I could imagine
> things would work out from the accounting POV (we could treat mlocked pages
> as always dirty for accounting purposes) but it's definitely not a change
> to be easily made.
>
>> 2. There should be a way to request that pages be made clean-but-writable.
>>
>> #1 should be mostly fine already from the filesystems' point of view.
>> #2 would involve calling page_mkwrite or some equivalent.
>>
>> I suspect that there are bugs that are currently untriggerable
>> involving clean-but-writable pages.  For example, what happens if cpu
>> 0 (atomically) changes a pte from clean/writable to not present, but
>> cpu 1 writes the page before the TLB flush happens.  I think the pte
>> ends up not present + dirty, but the current unmapping code won't
>> notice.
>>
>> This would involve splitting page_mkclean into "make clean and write
>> protect" and "make clean but leave writable".
>>
>> Doing this would avoid ~1M "soft" page faults per day in a single
>> real-time thread in my software.  (The file time patches are to make
>> sure that the "soft" page faults actually don't sleep.)
>   Yes, there's a non-negligible cost of writeprotecting the page during
> writeback. But the avoidance of OOM conditions due to too aggressive
> writers simply has precedence... Anyway, if you want to push more in this
> direction, I suggest to move this discussion to linux-mm@kvack.org as there
> are lingering more appropriate listeners :). I'm just a filesystem guy with
> some basic knowledge of MM.

Will do.  I'll send out updated patches as well.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2012-12-21 18:26 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-12-18  1:10 Are there u32 atomic bitops? (or dealing w/ i_flags) Andy Lutomirski
2012-12-18  1:34 ` Ming Lei
2012-12-18  1:57 ` Al Viro
2012-12-18  2:42   ` Andy Lutomirski
2012-12-18 21:30     ` Dave Chinner
2012-12-18 22:20       ` Andy Lutomirski
2012-12-20  7:03         ` Dave Chinner
2012-12-20 20:05           ` Andy Lutomirski
2012-12-20 23:10             ` [RFC PATCH 0/4] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
2012-12-20 23:10               ` [RFC PATCH 1/4] mm: Explicitly track when the page dirty bit is transferred from a pte Andy Lutomirski
2012-12-20 23:10               ` [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes Andy Lutomirski
2012-12-21  0:14                 ` Dave Chinner
2012-12-21  0:58                   ` Jan Kara
2012-12-21  1:12                     ` Dave Chinner
2012-12-21  1:36                       ` Jan Kara
2012-12-21  5:36                   ` Andy Lutomirski
2012-12-21 10:51                     ` Jan Kara
2012-12-21 18:26                       ` Andy Lutomirski
2012-12-21  0:34                 ` Jan Kara
2012-12-21  5:42                   ` Andy Lutomirski
2012-12-21 11:03                     ` Jan Kara
2012-12-20 23:10               ` [RFC PATCH 3/4] Remove file_update_time from all mkwrite paths Andy Lutomirski
2012-12-20 23:10               ` [RFC PATCH 4/4] ext4: Fix an incorrect comment about i_mutex Andy Lutomirski
2012-12-20 23:42                 ` Jan Kara
2012-12-20 23:36             ` Are there u32 atomic bitops? (or dealing w/ i_flags) Dave Chinner
2012-12-20 23:42               ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).