* [RFC PATCH 1/4] mm: Explicitly track when the page dirty bit is transferred from a pte
2012-12-20 23:10 ` [RFC PATCH 0/4] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
@ 2012-12-20 23:10 ` Andy Lutomirski
2012-12-20 23:10 ` [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes Andy Lutomirski
` (2 subsequent siblings)
3 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-20 23:10 UTC (permalink / raw)
To: linux-kernel, Linux FS Devel
Cc: Dave Chinner, Al Viro, Jan Kara, Andy Lutomirski
This is a slight cleanup, but, more importantly, it will let us
easily detect writes via mmap when the process is done writing
(e.g. munmaps, msyncs, fsyncs, or dies).
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
include/linux/mm.h | 1 +
mm/memory-failure.c | 4 +---
mm/page-writeback.c | 16 ++++++++++++++--
mm/rmap.c | 9 ++++++---
4 files changed, 22 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a44aa00..4c7e49b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1037,6 +1037,7 @@ int redirty_page_for_writepage(struct writeback_control *wbc,
void account_page_dirtied(struct page *page, struct address_space *mapping);
void account_page_writeback(struct page *page);
int set_page_dirty(struct page *page);
+int set_page_dirty_from_pte(struct page *page);
int set_page_dirty_lock(struct page *page);
int clear_page_dirty_for_io(struct page *page);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8b20278..ab5c3f4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -892,9 +892,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
mapping = page_mapping(hpage);
if (!(flags & MF_MUST_KILL) && !PageDirty(hpage) && mapping &&
mapping_cap_writeback_dirty(mapping)) {
- if (page_mkclean(hpage)) {
- SetPageDirty(hpage);
- } else {
+ if (!page_mkclean(hpage)) {
kill = 0;
ttu |= TTU_IGNORE_HWPOISON;
printk(KERN_INFO
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 830893b..cdea11a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2109,6 +2109,19 @@ int set_page_dirty(struct page *page)
EXPORT_SYMBOL(set_page_dirty);
/*
+ * Dirty a page due to the page table dirty bit.
+ *
+ * For pages with a mapping this should be done under the page lock.
+ * This does set_page_dirty plus extra work to account for the fact that
+ * the kernel was not notified when the actual write was done.
+ */
+int set_page_dirty_from_pte(struct page *page)
+{
+ /* Doesn't do anything interesting yet. */
+ return set_page_dirty(page);
+}
+
+/*
* set_page_dirty() is racy if the caller has no reference against
* page->mapping->host, and if the page is unlocked. This is because another
* CPU could truncate the page off the mapping and then free the mapping.
@@ -2175,8 +2188,7 @@ int clear_page_dirty_for_io(struct page *page)
* as a serialization point for all the different
* threads doing their things.
*/
- if (page_mkclean(page))
- set_page_dirty(page);
+ page_mkclean(page);
/*
* We carefully synchronise fault handlers against
* installing a dirty pte and marking the page dirty
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ee1ef0..b8fe00e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -931,6 +931,9 @@ int page_mkclean(struct page *page)
ret = page_mkclean_file(mapping, page);
}
+ if (ret)
+ set_page_dirty_from_pte(page);
+
return ret;
}
EXPORT_SYMBOL_GPL(page_mkclean);
@@ -1151,7 +1154,7 @@ void page_remove_rmap(struct page *page)
*/
if (mapping && !mapping_cap_account_dirty(mapping) &&
page_test_and_clear_dirty(page_to_pfn(page), 1))
- set_page_dirty(page);
+ set_page_dirty_from_pte(page);
/*
* Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
* and not charged by memcg for now.
@@ -1229,7 +1232,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
- set_page_dirty(page);
+ set_page_dirty_from_pte(page);
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
@@ -1423,7 +1426,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
- set_page_dirty(page);
+ set_page_dirty_from_pte(page);
page_remove_rmap(page);
page_cache_release(page);
--
1.7.11.7
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
2012-12-20 23:10 ` [RFC PATCH 0/4] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
2012-12-20 23:10 ` [RFC PATCH 1/4] mm: Explicitly track when the page dirty bit is transferred from a pte Andy Lutomirski
@ 2012-12-20 23:10 ` Andy Lutomirski
2012-12-21 0:14 ` Dave Chinner
2012-12-21 0:34 ` Jan Kara
2012-12-20 23:10 ` [RFC PATCH 3/4] Remove file_update_time from all mkwrite paths Andy Lutomirski
2012-12-20 23:10 ` [RFC PATCH 4/4] ext4: Fix an incorrect comment about i_mutex Andy Lutomirski
3 siblings, 2 replies; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-20 23:10 UTC (permalink / raw)
To: linux-kernel, Linux FS Devel
Cc: Dave Chinner, Al Viro, Jan Kara, Andy Lutomirski
The onus is currently on filesystems to call file_update_time
somewhere in the page_mkwrite path. This is unfortunate for three
reasons:
1. page_mkwrite on a locked page should be fast. ext4, for example,
often sleeps while dirtying inodes.
2. The current behavior is surprising -- the timestamp resulting from
an mmaped write will be before the write, not after. This contradicts
the mmap(2) manpage, which says:
The st_ctime and st_mtime field for a file mapped with PROT_WRITE and
MAP_SHARED will be updated after a write to the mapped region, and
before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one
occurs.
3. (An ulterior motive) I'd like to use hardware dirty tracking for
shared, locked, writable mappings (instead of page faults). Moving
important work out of the page_mkwrite path is an important first step.
The next patch will remove the now-unnecessary file_update_time calls.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
fs/inode.c | 37 +++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 1 +
include/linux/pagemap.h | 3 +++
mm/memory.c | 2 +-
mm/mmap.c | 4 ++++
mm/page-writeback.c | 30 ++++++++++++++++++++++++++++--
6 files changed, 74 insertions(+), 3 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 64999f1..d881d0b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1688,6 +1688,43 @@ int file_update_time(struct file *file)
}
EXPORT_SYMBOL(file_update_time);
+/**
+ * inode_update_time_writable - update mtime and ctime time
+ * @inode: inode accessed
+ *
+ * This is like file_update_time, but it assumes the mnt is writable
+ * and takes an inode parameter instead.
+ */
+
+int inode_update_time_writable(struct inode *inode)
+{
+ struct timespec now;
+ int sync_it = 0;
+ int ret;
+
+ /* First try to exhaust all avenues to not sync */
+ if (IS_NOCMTIME(inode))
+ return 0;
+
+ now = current_fs_time(inode->i_sb);
+ if (!timespec_equal(&inode->i_mtime, &now))
+ sync_it = S_MTIME;
+
+ if (!timespec_equal(&inode->i_ctime, &now))
+ sync_it |= S_CTIME;
+
+ if (IS_I_VERSION(inode))
+ sync_it |= S_VERSION;
+
+ if (!sync_it)
+ return 0;
+
+ ret = update_time(inode, &now, sync_it);
+
+ return ret;
+}
+EXPORT_SYMBOL(inode_update_time_writable);
+
int inode_needs_sync(struct inode *inode)
{
if (IS_SYNC(inode))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 75fe9a1..c95f9fa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2554,6 +2554,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
extern void setattr_copy(struct inode *inode, const struct iattr *attr);
extern int file_update_time(struct file *file);
+extern int inode_update_time_writable(struct inode *inode);
extern int generic_show_options(struct seq_file *m, struct dentry *root);
extern void save_mount_options(struct super_block *sb, char *options);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e42c762..a038ed9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -24,6 +24,7 @@ enum mapping_flags {
AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */
AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */
+ AS_CMTIME = __GFP_BITS_SHIFT + 4, /* written via pte */
};
static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -68,6 +69,8 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
(__force unsigned long)mask;
}
+extern void mapping_flush_cmtime(struct address_space *mapping);
+
/*
* The page cache can done in larger chunks than
* one page, because it allows for more efficient
diff --git a/mm/memory.c b/mm/memory.c
index 221fc9f..086b901 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1160,7 +1160,7 @@ again:
rss[MM_ANONPAGES]--;
else {
if (pte_dirty(ptent))
- set_page_dirty(page);
+ set_page_dirty_from_pte(page);
if (pte_young(ptent) &&
likely(!VM_SequentialReadHint(vma)))
mark_page_accessed(page);
diff --git a/mm/mmap.c b/mm/mmap.c
index 3913262..60301dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -223,6 +223,10 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
struct vm_area_struct *next = vma->vm_next;
might_sleep();
+
+ if (vma->vm_file)
+ mapping_flush_cmtime(vma->vm_file->f_mapping);
+
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
if (vma->vm_file)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index cdea11a..8cbb7fb 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1910,6 +1910,13 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
ret = mapping->a_ops->writepages(mapping, wbc);
else
ret = generic_writepages(mapping, wbc);
+
+ /*
+ * This is after writepages because the AS_CMTIME bit won't
+ * bet set until writepages is called.
+ */
+ mapping_flush_cmtime(mapping);
+
return ret;
}
@@ -2117,8 +2124,17 @@ EXPORT_SYMBOL(set_page_dirty);
*/
int set_page_dirty_from_pte(struct page *page)
{
- /* Doesn't do anything interesting yet. */
- return set_page_dirty(page);
+ int ret = set_page_dirty(page);
+
+ /*
+ * We may be out of memory and/or have various locks held, so
+ * there isn't much we can do in here.
+ */
+ struct address_space *mapping = page_mapping(page);
+ if (mapping)
+ set_bit(AS_CMTIME, &mapping->flags);
+
+ return ret;
}
/*
@@ -2287,3 +2303,13 @@ int mapping_tagged(struct address_space *mapping, int tag)
return radix_tree_tagged(&mapping->page_tree, tag);
}
EXPORT_SYMBOL(mapping_tagged);
+
+/*
+ * Call from any context from which inode_update_time_writable would be okay
+ * to flush deferred cmtime changes.
+ */
+void mapping_flush_cmtime(struct address_space *mapping)
+{
+ if (test_and_clear_bit(AS_CMTIME, &mapping->flags))
+ inode_update_time_writable(mapping->host);
+}
--
1.7.11.7
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
2012-12-20 23:10 ` [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes Andy Lutomirski
@ 2012-12-21 0:14 ` Dave Chinner
2012-12-21 0:58 ` Jan Kara
2012-12-21 5:36 ` Andy Lutomirski
2012-12-21 0:34 ` Jan Kara
1 sibling, 2 replies; 26+ messages in thread
From: Dave Chinner @ 2012-12-21 0:14 UTC (permalink / raw)
To: Andy Lutomirski; +Cc: linux-kernel, Linux FS Devel, Al Viro, Jan Kara
On Thu, Dec 20, 2012 at 03:10:10PM -0800, Andy Lutomirski wrote:
> The onus is currently on filesystems to call file_update_time
> somewhere in the page_mkwrite path. This is unfortunate for three
> reasons:
>
> 1. page_mkwrite on a locked page should be fast. ext4, for example,
> often sleeps while dirtying inodes.
That's an ext4 problem, not a page fault or timestamp update
problem. Fix ext4.
> 2. The current behavior is surprising -- the timestamp resulting from
> an mmaped write will be before the write, not after. This contradicts
> the mmap(2) manpage, which says:
>
> The st_ctime and st_mtime field for a file mapped with PROT_WRITE and
> MAP_SHARED will be updated after a write to the mapped region, and
> before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one
> occurs.
What you propose (time updates in do_writepages()) violates this.
msync(MS_ASYNC) doesn't actually start any IO, therefore the
timestamp wil not be updated.
Besides, what POSIX actually says is:
| The st_ctime and st_mtime fields of a file that is mapped with
| MAP_SHARED and PROT_WRITE shall be marked for update at some point
| in the interval between a write reference to the mapped region and
| the next call to msync() with MS_ASYNC or MS_SYNC for that portion
| of the file by any process.
Which means updating the timestamp during the first write is
perfectly acceptible. Indeed, by definition, we are compliant with
the man page because the update is after the write has occurred.
That is, the write triggered the page fault, so the page fault
processing where we update the timestamps is definitely after the
write occurred. :)
> 3. (An ulterior motive) I'd like to use hardware dirty tracking for
> shared, locked, writable mappings (instead of page faults). Moving
> important work out of the page_mkwrite path is an important first step.
I don't think you're going to get very far doing this. page_mkwrite
needs to do:
a) block allocation in page_mkwrite() for faults over holes
to detect ENOSPC conditions immediately rather than in
writeback when such an error results in data loss.
b) detect writes over unwritten extents so that the pages in
the page cache can be set up correctly for conversion to
occur during writeback.
Correcting these two problems was the reason for introducing
page_mkwrite in the first place - we need to do this stuff before
the page fault is completed, and that means, by definition,
page_mkwrite needs to be able to block. Moving c/mtime updates out
of the way does not, in any way, change these requirements.
Perhaps you should implement everything you want to do inside ext4
first, so we can get an idea of exactly what you want page_mkwrite()
to do, how you want it to behave, and how you expect filesystems to
handle the above situations correctly....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
2012-12-21 0:14 ` Dave Chinner
@ 2012-12-21 0:58 ` Jan Kara
2012-12-21 1:12 ` Dave Chinner
2012-12-21 5:36 ` Andy Lutomirski
1 sibling, 1 reply; 26+ messages in thread
From: Jan Kara @ 2012-12-21 0:58 UTC (permalink / raw)
To: Dave Chinner
Cc: Andy Lutomirski, linux-kernel, Linux FS Devel, Al Viro, Jan Kara
On Fri 21-12-12 11:14:57, Dave Chinner wrote:
> On Thu, Dec 20, 2012 at 03:10:10PM -0800, Andy Lutomirski wrote:
> > The onus is currently on filesystems to call file_update_time
> > somewhere in the page_mkwrite path. This is unfortunate for three
> > reasons:
> >
> > 1. page_mkwrite on a locked page should be fast. ext4, for example,
> > often sleeps while dirtying inodes.
>
> That's an ext4 problem, not a page fault or timestamp update
> problem. Fix ext4.
Well, XFS doesn't journal the timestamp update which is why it gets away
without blocking on journal. Other filesystems (and I don't think it's just
ext4) are so benevolent with timestamps so their updates are more costly...
> > 2. The current behavior is surprising -- the timestamp resulting from
> > an mmaped write will be before the write, not after. This contradicts
> > the mmap(2) manpage, which says:
> >
> > The st_ctime and st_mtime field for a file mapped with PROT_WRITE and
> > MAP_SHARED will be updated after a write to the mapped region, and
> > before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one
> > occurs.
>
> What you propose (time updates in do_writepages()) violates this.
> msync(MS_ASYNC) doesn't actually start any IO, therefore the
> timestamp wil not be updated.
>
> Besides, what POSIX actually says is:
>
> | The st_ctime and st_mtime fields of a file that is mapped with
> | MAP_SHARED and PROT_WRITE shall be marked for update at some point
> | in the interval between a write reference to the mapped region and
> | the next call to msync() with MS_ASYNC or MS_SYNC for that portion
> | of the file by any process.
>
> Which means updating the timestamp during the first write is
> perfectly acceptible. Indeed, by definition, we are compliant with
> the man page because the update is after the write has occurred.
> That is, the write triggered the page fault, so the page fault
> processing where we update the timestamps is definitely after the
> write occurred. :)
Well, but there can be more writes to the already write faulted page.
They can come seconds after we called ->page_mkwrite() and thus updated
time stamps. So Andy is correct we violate the spec AFAICT.
> > 3. (An ulterior motive) I'd like to use hardware dirty tracking for
> > shared, locked, writable mappings (instead of page faults). Moving
> > important work out of the page_mkwrite path is an important first step.
>
> I don't think you're going to get very far doing this. page_mkwrite
> needs to do:
>
> a) block allocation in page_mkwrite() for faults over holes
> to detect ENOSPC conditions immediately rather than in
> writeback when such an error results in data loss.
> b) detect writes over unwritten extents so that the pages in
> the page cache can be set up correctly for conversion to
> occur during writeback.
>
> Correcting these two problems was the reason for introducing
> page_mkwrite in the first place - we need to do this stuff before
> the page fault is completed, and that means, by definition,
> page_mkwrite needs to be able to block. Moving c/mtime updates out
> of the way does not, in any way, change these requirements.
Here I completely agree. I wanted to comment on it in my post as well but
then forgot about it.
> Perhaps you should implement everything you want to do inside ext4
> first, so we can get an idea of exactly what you want page_mkwrite()
> to do, how you want it to behave, and how you expect filesystems to
> handle the above situations correctly....
Ah, now I noticed we don't call file_update_time() from
__block_page_mkwrite() so yes, just changing ext4 without touching generic
code is easily possible.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
2012-12-21 0:58 ` Jan Kara
@ 2012-12-21 1:12 ` Dave Chinner
2012-12-21 1:36 ` Jan Kara
0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2012-12-21 1:12 UTC (permalink / raw)
To: Jan Kara; +Cc: Andy Lutomirski, linux-kernel, Linux FS Devel, Al Viro
On Fri, Dec 21, 2012 at 01:58:25AM +0100, Jan Kara wrote:
> On Fri 21-12-12 11:14:57, Dave Chinner wrote:
> > On Thu, Dec 20, 2012 at 03:10:10PM -0800, Andy Lutomirski wrote:
> > > The onus is currently on filesystems to call file_update_time
> > > somewhere in the page_mkwrite path. This is unfortunate for three
> > > reasons:
> > >
> > > 1. page_mkwrite on a locked page should be fast. ext4, for example,
> > > often sleeps while dirtying inodes.
> >
> > That's an ext4 problem, not a page fault or timestamp update
> > problem. Fix ext4.
> Well, XFS doesn't journal the timestamp update which is why it gets away
> without blocking on journal. Other filesystems (and I don't think it's just
> ext4) are so benevolent with timestamps so their updates are more costly...
XFS does journal it's timestamps now. We changed that code recently,
so XFs is no different to any other filesystem in this respect.
My point is, though, that this choice (of when to update the
timestamp) can already be made by filesytsems without changing the
high level code.
> > > 2. The current behavior is surprising -- the timestamp resulting from
> > > an mmaped write will be before the write, not after. This contradicts
> > > the mmap(2) manpage, which says:
> > >
> > > The st_ctime and st_mtime field for a file mapped with PROT_WRITE and
> > > MAP_SHARED will be updated after a write to the mapped region, and
> > > before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one
> > > occurs.
> >
> > What you propose (time updates in do_writepages()) violates this.
> > msync(MS_ASYNC) doesn't actually start any IO, therefore the
> > timestamp wil not be updated.
> >
> > Besides, what POSIX actually says is:
> >
> > | The st_ctime and st_mtime fields of a file that is mapped with
> > | MAP_SHARED and PROT_WRITE shall be marked for update at some point
> > | in the interval between a write reference to the mapped region and
> > | the next call to msync() with MS_ASYNC or MS_SYNC for that portion
> > | of the file by any process.
> >
> > Which means updating the timestamp during the first write is
> > perfectly acceptible. Indeed, by definition, we are compliant with
> > the man page because the update is after the write has occurred.
> > That is, the write triggered the page fault, so the page fault
> > processing where we update the timestamps is definitely after the
> > write occurred. :)
> Well, but there can be more writes to the already write faulted page.
> They can come seconds after we called ->page_mkwrite() and thus updated
> time stamps. So Andy is correct we violate the spec AFAICT.
Depends how you read it. It can be updated at *any time* between the
write and the msync() call, which is exactly what happens right now.
The fact that second and subsequent writes between the first write
and the msync call do not change it is irrelevant, as the first one
is the one that matters... Indeed, if you read to the letter of the
posix definition, then updating timestamps in the msync call is also
incorrect, because that is not between the write and the msync()
call.
What I'm saying is saying the current behaviour is wrong is
dependent on a specific intepretation of the standard, and the same
arguments can be made against this proposal. Hence such arguments
are not a convincing/compelling reason to change behaviours.
> > Perhaps you should implement everything you want to do inside ext4
> > first, so we can get an idea of exactly what you want page_mkwrite()
> > to do, how you want it to behave, and how you expect filesystems to
> > handle the above situations correctly....
> Ah, now I noticed we don't call file_update_time() from
> __block_page_mkwrite() so yes, just changing ext4 without touching generic
> code is easily possible.
Yup. I'm not opposed to making such changes, but I want to see the
bigger picture before making the little changes that start moving in
that direction. Until it's demonstrated as possible with a
custom page_mkwrite implementation....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
2012-12-21 1:12 ` Dave Chinner
@ 2012-12-21 1:36 ` Jan Kara
0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-12-21 1:36 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, Andy Lutomirski, linux-kernel, Linux FS Devel, Al Viro
On Fri 21-12-12 12:12:46, Dave Chinner wrote:
> > > > 2. The current behavior is surprising -- the timestamp resulting from
> > > > an mmaped write will be before the write, not after. This contradicts
> > > > the mmap(2) manpage, which says:
> > > >
> > > > The st_ctime and st_mtime field for a file mapped with PROT_WRITE and
> > > > MAP_SHARED will be updated after a write to the mapped region, and
> > > > before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one
> > > > occurs.
> > >
> > > What you propose (time updates in do_writepages()) violates this.
> > > msync(MS_ASYNC) doesn't actually start any IO, therefore the
> > > timestamp wil not be updated.
> > >
> > > Besides, what POSIX actually says is:
> > >
> > > | The st_ctime and st_mtime fields of a file that is mapped with
> > > | MAP_SHARED and PROT_WRITE shall be marked for update at some point
> > > | in the interval between a write reference to the mapped region and
> > > | the next call to msync() with MS_ASYNC or MS_SYNC for that portion
> > > | of the file by any process.
> > >
> > > Which means updating the timestamp during the first write is
> > > perfectly acceptible. Indeed, by definition, we are compliant with
> > > the man page because the update is after the write has occurred.
> > > That is, the write triggered the page fault, so the page fault
> > > processing where we update the timestamps is definitely after the
> > > write occurred. :)
> > Well, but there can be more writes to the already write faulted page.
> > They can come seconds after we called ->page_mkwrite() and thus updated
> > time stamps. So Andy is correct we violate the spec AFAICT.
>
> Depends how you read it. It can be updated at *any time* between the
> write and the msync() call, which is exactly what happens right now.
> The fact that second and subsequent writes between the first write
> and the msync call do not change it is irrelevant, as the first one
> is the one that matters... Indeed, if you read to the letter of the
> posix definition, then updating timestamps in the msync call is also
> incorrect, because that is not between the write and the msync()
> call.
>
> What I'm saying is saying the current behaviour is wrong is
> dependent on a specific intepretation of the standard, and the same
> arguments can be made against this proposal. Hence such arguments
> are not a convincing/compelling reason to change behaviours.
I have to say I'm not following you :) If I have the program that does:
fd = open("file", O_RDWR);
addr = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
addr[0] = 'a';
sleep(1);
addr[1] = 'b';
close(fd);
Then application of the spec to the second write clearly states that time
stamps should be updated sometime after the write of 'b'. I don't see any
space for other interpretation there... And currently we update time stamps
only at the moment we write 'a'.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
2012-12-21 0:14 ` Dave Chinner
2012-12-21 0:58 ` Jan Kara
@ 2012-12-21 5:36 ` Andy Lutomirski
2012-12-21 10:51 ` Jan Kara
1 sibling, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-21 5:36 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, Linux FS Devel, Al Viro, Jan Kara
On Thu, Dec 20, 2012 at 4:14 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Dec 20, 2012 at 03:10:10PM -0800, Andy Lutomirski wrote:
>> The onus is currently on filesystems to call file_update_time
>> somewhere in the page_mkwrite path. This is unfortunate for three
>> reasons:
>>
>> 1. page_mkwrite on a locked page should be fast. ext4, for example,
>> often sleeps while dirtying inodes.
>
> That's an ext4 problem, not a page fault or timestamp update
> problem. Fix ext4.
I could fix ext4 (or rather, someone who understands jbd2 could fix
ext4), but I think my approach is considerably simpler and fixes all
filesystems at once.
>
>> 2. The current behavior is surprising -- the timestamp resulting from
>> an mmaped write will be before the write, not after. This contradicts
>> the mmap(2) manpage, which says:
>>
>> The st_ctime and st_mtime field for a file mapped with PROT_WRITE and
>> MAP_SHARED will be updated after a write to the mapped region, and
>> before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one
>> occurs.
>
> What you propose (time updates in do_writepages()) violates this.
> msync(MS_ASYNC) doesn't actually start any IO, therefore the
> timestamp wil not be updated.
That's easy enough -- I can add special-case code for MS_ASYNC. It'll
make an operation that is otherwise a no-op considerably less free,
though.
>
> Besides, what POSIX actually says is:
[snipped -- covered elsewhere in the thread]
>
>> 3. (An ulterior motive) I'd like to use hardware dirty tracking for
>> shared, locked, writable mappings (instead of page faults). Moving
>> important work out of the page_mkwrite path is an important first step.
>
> I don't think you're going to get very far doing this. page_mkwrite
> needs to do:
>
> a) block allocation in page_mkwrite() for faults over holes
> to detect ENOSPC conditions immediately rather than in
> writeback when such an error results in data loss.
> b) detect writes over unwritten extents so that the pages in
> the page cache can be set up correctly for conversion to
> occur during writeback.
>
> Correcting these two problems was the reason for introducing
> page_mkwrite in the first place - we need to do this stuff before
> the page fault is completed, and that means, by definition,
> page_mkwrite needs to be able to block. Moving c/mtime updates out
> of the way does not, in any way, change these requirements.
I was unclear. I have no problem if PROT_WRITE, MAP_SHARED pages
start out write protected. I want two changes from the status quo,
though:
1. ptes (maybe only if mlocked or maybe even only if some new madvise
flag is set) should be marked clean but stay writable during and after
writeback. (This would only work when stable pages are not in
effect.)
2. There should be a way to request that pages be made clean-but-writable.
#1 should be mostly fine already from the filesystems' point of view.
#2 would involve calling page_mkwrite or some equivalent.
I suspect that there are bugs that are currently untriggerable
involving clean-but-writable pages. For example, what happens if cpu
0 (atomically) changes a pte from clean/writable to not present, but
cpu 1 writes the page before the TLB flush happens. I think the pte
ends up not present + dirty, but the current unmapping code won't
notice.
This would involve splitting page_mkclean into "make clean and write
protect" and "make clean but leave writable".
Doing this would avoid ~1M "soft" page faults per day in a single
real-time thread in my software. (The file time patches are to make
sure that the "soft" page faults actually don't sleep.)
--Andy
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
2012-12-21 5:36 ` Andy Lutomirski
@ 2012-12-21 10:51 ` Jan Kara
2012-12-21 18:26 ` Andy Lutomirski
0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2012-12-21 10:51 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Dave Chinner, linux-kernel, Linux FS Devel, Al Viro, Jan Kara
On Thu 20-12-12 21:36:58, Andy Lutomirski wrote:
> On Thu, Dec 20, 2012 at 4:14 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Dec 20, 2012 at 03:10:10PM -0800, Andy Lutomirski wrote:
> >> The onus is currently on filesystems to call file_update_time
> >> somewhere in the page_mkwrite path. This is unfortunate for three
> >> reasons:
> >>
> >> 1. page_mkwrite on a locked page should be fast. ext4, for example,
> >> often sleeps while dirtying inodes.
> >
> > That's an ext4 problem, not a page fault or timestamp update
> > problem. Fix ext4.
>
> I could fix ext4 (or rather, someone who understands jbd2 could fix
> ext4), but I think my approach is considerably simpler and fixes all
> filesystems at once.
Well, you could implement the fix you did just inside ext4. Remove
timestamp update from ext4_page_mkwrite(), just set some inode flag there,
update times in ext4_writepages(), ext4_sync_file(), and
ext4_release_file(). It won't be as elegant but it would work.
> >> 3. (An ulterior motive) I'd like to use hardware dirty tracking for
> >> shared, locked, writable mappings (instead of page faults). Moving
> >> important work out of the page_mkwrite path is an important first step.
> >
> > I don't think you're going to get very far doing this. page_mkwrite
> > needs to do:
> >
> > a) block allocation in page_mkwrite() for faults over holes
> > to detect ENOSPC conditions immediately rather than in
> > writeback when such an error results in data loss.
> > b) detect writes over unwritten extents so that the pages in
> > the page cache can be set up correctly for conversion to
> > occur during writeback.
> >
> > Correcting these two problems was the reason for introducing
> > page_mkwrite in the first place - we need to do this stuff before
> > the page fault is completed, and that means, by definition,
> > page_mkwrite needs to be able to block. Moving c/mtime updates out
> > of the way does not, in any way, change these requirements.
>
> I was unclear. I have no problem if PROT_WRITE, MAP_SHARED pages
> start out write protected. I want two changes from the status quo,
> though:
>
> 1. ptes (maybe only if mlocked or maybe even only if some new madvise
> flag is set) should be marked clean but stay writable during and after
> writeback. (This would only work when stable pages are not in
> effect.)
This is actually problematic - s390 architecture doesn't have PTE dirty
bit. It has per page HW dirty bit but that's problematic to use so it's
completely relying on page being protected after writeback so that it hits
page fault when it should be dirtied again. Also memory management uses
these page faults to track number of dirty pages which is then important to
throttle aggressive writers etc. (we used to do what you describe sometimes
in early 2.6 days but then we switched to writeprotecting the page for
writeback to be able to do dirty page tracking).
If we limit the behavior you desribe to mlocked pages, I could imagine
things would work out from the accounting POV (we could treat mlocked pages
as always dirty for accounting purposes) but it's definitely not a change
to be easily made.
> 2. There should be a way to request that pages be made clean-but-writable.
>
> #1 should be mostly fine already from the filesystems' point of view.
> #2 would involve calling page_mkwrite or some equivalent.
>
> I suspect that there are bugs that are currently untriggerable
> involving clean-but-writable pages. For example, what happens if cpu
> 0 (atomically) changes a pte from clean/writable to not present, but
> cpu 1 writes the page before the TLB flush happens. I think the pte
> ends up not present + dirty, but the current unmapping code won't
> notice.
>
> This would involve splitting page_mkclean into "make clean and write
> protect" and "make clean but leave writable".
>
> Doing this would avoid ~1M "soft" page faults per day in a single
> real-time thread in my software. (The file time patches are to make
> sure that the "soft" page faults actually don't sleep.)
Yes, there's a non-negligible cost of writeprotecting the page during
writeback. But the avoidance of OOM conditions due to too aggressive
writers simply has precedence... Anyway, if you want to push more in this
direction, I suggest to move this discussion to linux-mm@kvack.org as there
are lingering more appropriate listeners :). I'm just a filesystem guy with
some basic knowledge of MM.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
2012-12-21 10:51 ` Jan Kara
@ 2012-12-21 18:26 ` Andy Lutomirski
0 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-21 18:26 UTC (permalink / raw)
To: Jan Kara; +Cc: Dave Chinner, linux-kernel, Linux FS Devel, Al Viro
On Fri, Dec 21, 2012 at 2:51 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 20-12-12 21:36:58, Andy Lutomirski wrote:
>> On Thu, Dec 20, 2012 at 4:14 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Dec 20, 2012 at 03:10:10PM -0800, Andy Lutomirski wrote:
>> >> The onus is currently on filesystems to call file_update_time
>> >> somewhere in the page_mkwrite path. This is unfortunate for three
>> >> reasons:
>> >>
>> >> 1. page_mkwrite on a locked page should be fast. ext4, for example,
>> >> often sleeps while dirtying inodes.
>> >
>> > That's an ext4 problem, not a page fault or timestamp update
>> > problem. Fix ext4.
>>
>> I could fix ext4 (or rather, someone who understands jbd2 could fix
>> ext4), but I think my approach is considerably simpler and fixes all
>> filesystems at once.
> Well, you could implement the fix you did just inside ext4. Remove
> timestamp update from ext4_page_mkwrite(), just set some inode flag there,
> update times in ext4_writepages(), ext4_sync_file(), and
> ext4_release_file(). It won't be as elegant but it would work.
I'm worried about funny corner cases. Suppose fd1 and fd2 refer to
two separate struct files for the same inode. Then do this:
addr1 = mmap(fd1);
addr2 = mmap(fd2);
*addr2 = 0; <-- calls page_mkwrite
munmap(addr1);
close(fd1);
sleep(10);
*addr2 = 1; <-- probably does not call page_mkwrite
munmap(addr2);
close(fd2);
Without extra care, the file timestamp will be ten seconds too early.
To make this work, ext4 would need to keep track of which vma is
dirty, which would involve a lot more complexity. Alternatively, ext4
could call page_mkclean on all pages in the mapping in
ext4_release_file, but that would suck for performance. (Actually,
that would probably be awful for my workload -- I have a background
task that periodically opens and (critically) closes the same files
that are mmaped in real-time thread, and all those extra tlb flushes
and page faults would hurt.)
The approach in my patches magically avoids this problem by using the
per-pte dirty bit as opposed to anything that's per page. :)
Admittedly, I could leave the AS_CMTIME stuff in the core and just
sprinkle mapping_flush_cmtime calls around ext4.
One of these days I hope to open-source the thing that hits all these
issues. It's useful and has no good reason to be proprietary, other
than its present messiness.
>
>> >> 3. (An ulterior motive) I'd like to use hardware dirty tracking for
>> >> shared, locked, writable mappings (instead of page faults). Moving
>> >> important work out of the page_mkwrite path is an important first step.
>> >
>> > I don't think you're going to get very far doing this. page_mkwrite
>> > needs to do:
>> >
>> > a) block allocation in page_mkwrite() for faults over holes
>> > to detect ENOSPC conditions immediately rather than in
>> > writeback when such an error results in data loss.
>> > b) detect writes over unwritten extents so that the pages in
>> > the page cache can be set up correctly for conversion to
>> > occur during writeback.
>> >
>> > Correcting these two problems was the reason for introducing
>> > page_mkwrite in the first place - we need to do this stuff before
>> > the page fault is completed, and that means, by definition,
>> > page_mkwrite needs to be able to block. Moving c/mtime updates out
>> > of the way does not, in any way, change these requirements.
>>
>> I was unclear. I have no problem if PROT_WRITE, MAP_SHARED pages
>> start out write protected. I want two changes from the status quo,
>> though:
>>
>> 1. ptes (maybe only if mlocked or maybe even only if some new madvise
>> flag is set) should be marked clean but stay writable during and after
>> writeback. (This would only work when stable pages are not in
>> effect.)
> This is actually problematic - s390 architecture doesn't have PTE dirty
> bit. It has per page HW dirty bit but that's problematic to use so it's
> completely relying on page being protected after writeback so that it hits
> page fault when it should be dirtied again. Also memory management uses
> these page faults to track number of dirty pages which is then important to
> throttle aggressive writers etc. (we used to do what you describe sometimes
> in early 2.6 days but then we switched to writeprotecting the page for
> writeback to be able to do dirty page tracking).
Ugh. I thought arches without per-pte dirty bits were supposed to
emulate them by treating writable+clean as write protected and fixing
everything up when page faults happen.
This whole thing could just be turned off on s390, though.
>
> If we limit the behavior you desribe to mlocked pages, I could imagine
> things would work out from the accounting POV (we could treat mlocked pages
> as always dirty for accounting purposes) but it's definitely not a change
> to be easily made.
>
>> 2. There should be a way to request that pages be made clean-but-writable.
>>
>> #1 should be mostly fine already from the filesystems' point of view.
>> #2 would involve calling page_mkwrite or some equivalent.
>>
>> I suspect that there are bugs that are currently untriggerable
>> involving clean-but-writable pages. For example, what happens if cpu
>> 0 (atomically) changes a pte from clean/writable to not present, but
>> cpu 1 writes the page before the TLB flush happens. I think the pte
>> ends up not present + dirty, but the current unmapping code won't
>> notice.
>>
>> This would involve splitting page_mkclean into "make clean and write
>> protect" and "make clean but leave writable".
>>
>> Doing this would avoid ~1M "soft" page faults per day in a single
>> real-time thread in my software. (The file time patches are to make
>> sure that the "soft" page faults actually don't sleep.)
> Yes, there's a non-negligible cost of writeprotecting the page during
> writeback. But the avoidance of OOM conditions due to too aggressive
> writers simply has precedence... Anyway, if you want to push more in this
> direction, I suggest to move this discussion to linux-mm@kvack.org as there
> are lingering more appropriate listeners :). I'm just a filesystem guy with
> some basic knowledge of MM.
Will do. I'll send out updated patches as well.
--Andy
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
2012-12-20 23:10 ` [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes Andy Lutomirski
2012-12-21 0:14 ` Dave Chinner
@ 2012-12-21 0:34 ` Jan Kara
2012-12-21 5:42 ` Andy Lutomirski
1 sibling, 1 reply; 26+ messages in thread
From: Jan Kara @ 2012-12-21 0:34 UTC (permalink / raw)
To: Andy Lutomirski
Cc: linux-kernel, Linux FS Devel, Dave Chinner, Al Viro, Jan Kara
On Thu 20-12-12 15:10:10, Andy Lutomirski wrote:
> The onus is currently on filesystems to call file_update_time
> somewhere in the page_mkwrite path. This is unfortunate for three
> reasons:
>
> 1. page_mkwrite on a locked page should be fast. ext4, for example,
> often sleeps while dirtying inodes.
>
> 2. The current behavior is surprising -- the timestamp resulting from
> an mmaped write will be before the write, not after. This contradicts
> the mmap(2) manpage, which says:
>
> The st_ctime and st_mtime field for a file mapped with PROT_WRITE and
> MAP_SHARED will be updated after a write to the mapped region, and
> before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one
> occurs.
I agree your behavior is more correct wrt to the manpage / spec. OTOH I
could dig out several emails where users complain time stamps magically
change some time after the file was written via mmap (because writeback
happened at that time and it did some allocation to the inode). People hit
this e.g. when compiling something, ld(1) writes final binary through mmap,
the package / archive the final binary and later some sanity check finds
the time stamp on the binary is newer than the package / archive.
Looking more into the patch you end up updating timestamps on munmap(2)
(thus on file close in particular). That should avoid the most surprising
cases and users hopefully won't notice the difference. Good. But please
mention this explicitely in the changelog.
> 3. (An ulterior motive) I'd like to use hardware dirty tracking for
> shared, locked, writable mappings (instead of page faults). Moving
> important work out of the page_mkwrite path is an important first step.
>
> The next patch will remove the now-unnecessary file_update_time calls.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
> fs/inode.c | 37 +++++++++++++++++++++++++++++++++++++
> include/linux/fs.h | 1 +
> include/linux/pagemap.h | 3 +++
> mm/memory.c | 2 +-
> mm/mmap.c | 4 ++++
> mm/page-writeback.c | 30 ++++++++++++++++++++++++++++--
> 6 files changed, 74 insertions(+), 3 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 64999f1..d881d0b 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1688,6 +1688,43 @@ int file_update_time(struct file *file)
> }
> EXPORT_SYMBOL(file_update_time);
>
> +/**
> + * inode_update_time_writable - update mtime and ctime time
> + * @inode: inode accessed
> + *
> + * This is like file_update_time, but it assumes the mnt is writable
> + * and takes an inode parameter instead.
> + */
> +
> +int inode_update_time_writable(struct inode *inode)
> +{
> + struct timespec now;
> + int sync_it = 0;
> + int ret;
> +
> + /* First try to exhaust all avenues to not sync */
> + if (IS_NOCMTIME(inode))
> + return 0;
> +
> + now = current_fs_time(inode->i_sb);
> + if (!timespec_equal(&inode->i_mtime, &now))
> + sync_it = S_MTIME;
> +
> + if (!timespec_equal(&inode->i_ctime, &now))
> + sync_it |= S_CTIME;
> +
> + if (IS_I_VERSION(inode))
> + sync_it |= S_VERSION;
> +
> + if (!sync_it)
> + return 0;
> +
> + ret = update_time(inode, &now, sync_it);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(inode_update_time_writable);
> +
So this differs from file_update_time() only by not calling
__mnt_want_write(). Why this special function? It is actually unsafe wrt
remounts read-only or filesystem freezing... For that you need to call
sb_start_write() / sb_end_write() around the timestamp update. Umm, or
better sb_start_pagefault() / sb_end_pagefault() because the call in
remove_vma() gets called under mmap_sem so we are in a rather similar
situation to ->page_mkwrite.
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3913262..60301dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -223,6 +223,10 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
> struct vm_area_struct *next = vma->vm_next;
>
> might_sleep();
> +
> + if (vma->vm_file)
> + mapping_flush_cmtime(vma->vm_file->f_mapping);
> +
> if (vma->vm_ops && vma->vm_ops->close)
> vma->vm_ops->close(vma);
> if (vma->vm_file)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index cdea11a..8cbb7fb 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1910,6 +1910,13 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
> ret = mapping->a_ops->writepages(mapping, wbc);
> else
> ret = generic_writepages(mapping, wbc);
> +
> + /*
> + * This is after writepages because the AS_CMTIME bit won't
> + * bet set until writepages is called.
> + */
> + mapping_flush_cmtime(mapping);
> +
> return ret;
> }
>
> @@ -2117,8 +2124,17 @@ EXPORT_SYMBOL(set_page_dirty);
> */
> int set_page_dirty_from_pte(struct page *page)
> {
> - /* Doesn't do anything interesting yet. */
> - return set_page_dirty(page);
> + int ret = set_page_dirty(page);
> +
> + /*
> + * We may be out of memory and/or have various locks held, so
> + * there isn't much we can do in here.
> + */
> + struct address_space *mapping = page_mapping(page);
Declarations should go together please. So something like:
int ret = set_page_dirty(page);
struct address_space *mapping = page_mapping(page);
/* comment... */
> + if (mapping)
> + set_bit(AS_CMTIME, &mapping->flags);
> +
> + return ret;
> }
>
> /*
> @@ -2287,3 +2303,13 @@ int mapping_tagged(struct address_space *mapping, int tag)
> return radix_tree_tagged(&mapping->page_tree, tag);
> }
> EXPORT_SYMBOL(mapping_tagged);
> +
> +/*
> + * Call from any context from which inode_update_time_writable would be okay
> + * to flush deferred cmtime changes.
> + */
> +void mapping_flush_cmtime(struct address_space *mapping)
> +{
> + if (test_and_clear_bit(AS_CMTIME, &mapping->flags))
> + inode_update_time_writable(mapping->host);
> +}
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
2012-12-21 0:34 ` Jan Kara
@ 2012-12-21 5:42 ` Andy Lutomirski
2012-12-21 11:03 ` Jan Kara
0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-21 5:42 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-kernel, Linux FS Devel, Dave Chinner, Al Viro
On Thu, Dec 20, 2012 at 4:34 PM, Jan Kara <jack@suse.cz> wrote:
> On Thu 20-12-12 15:10:10, Andy Lutomirski wrote:
>> The onus is currently on filesystems to call file_update_time
>> somewhere in the page_mkwrite path. This is unfortunate for three
>> reasons:
>>
>> 1. page_mkwrite on a locked page should be fast. ext4, for example,
>> often sleeps while dirtying inodes.
>>
>> 2. The current behavior is surprising -- the timestamp resulting from
>> an mmaped write will be before the write, not after. This contradicts
>> the mmap(2) manpage, which says:
>>
>> The st_ctime and st_mtime field for a file mapped with PROT_WRITE and
>> MAP_SHARED will be updated after a write to the mapped region, and
>> before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one
>> occurs.
> I agree your behavior is more correct wrt to the manpage / spec. OTOH I
> could dig out several emails where users complain time stamps magically
> change some time after the file was written via mmap (because writeback
> happened at that time and it did some allocation to the inode). People hit
> this e.g. when compiling something, ld(1) writes final binary through mmap,
> the package / archive the final binary and later some sanity check finds
> the time stamp on the binary is newer than the package / archive.
>
> Looking more into the patch you end up updating timestamps on munmap(2)
> (thus on file close in particular). That should avoid the most surprising
> cases and users hopefully won't notice the difference. Good. But please
> mention this explicitely in the changelog.
I was careful to get that case right. I'll update the changelog.
In particular, I've so far tested munmap, msync(MS_SYNC), fsync,
waiting 30 seconds, and dying by fatal signal. All of those paths
work right.
>> +/**
>> + * inode_update_time_writable - update mtime and ctime time
>> + * @inode: inode accessed
>> + *
>> + * This is like file_update_time, but it assumes the mnt is writable
>> + * and takes an inode parameter instead.
>> + */
>> +
>> +int inode_update_time_writable(struct inode *inode)
>> +{
>> + struct timespec now;
>> + int sync_it = 0;
>> + int ret;
>> +
>> + /* First try to exhaust all avenues to not sync */
>> + if (IS_NOCMTIME(inode))
>> + return 0;
>> +
>> + now = current_fs_time(inode->i_sb);
>> + if (!timespec_equal(&inode->i_mtime, &now))
>> + sync_it = S_MTIME;
>> +
>> + if (!timespec_equal(&inode->i_ctime, &now))
>> + sync_it |= S_CTIME;
>> +
>> + if (IS_I_VERSION(inode))
>> + sync_it |= S_VERSION;
>> +
>> + if (!sync_it)
>> + return 0;
>> +
>> + ret = update_time(inode, &now, sync_it);
>> +
>> + return ret;
>> +}
>> +EXPORT_SYMBOL(inode_update_time_writable);
>> +
> So this differs from file_update_time() only by not calling
> __mnt_want_write(). Why this special function? It is actually unsafe wrt
> remounts read-only or filesystem freezing... For that you need to call
> sb_start_write() / sb_end_write() around the timestamp update. Umm, or
> better sb_start_pagefault() / sb_end_pagefault() because the call in
> remove_vma() gets called under mmap_sem so we are in a rather similar
> situation to ->page_mkwrite.
The important difference is that it takes an inode* as a parameter
instead of a file*. I don't think that inodes have a struct vfsmount,
so I can't call __mnt_want_write. I'll take a look at
sb_start_pagefault. I'll also refactor this a bit to minimize code
duplication. The current approach was for the v1 rfc version. :)
>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 3913262..60301dc 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -223,6 +223,10 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>> struct vm_area_struct *next = vma->vm_next;
>>
>> might_sleep();
>> +
>> + if (vma->vm_file)
>> + mapping_flush_cmtime(vma->vm_file->f_mapping);
>> +
>> if (vma->vm_ops && vma->vm_ops->close)
>> vma->vm_ops->close(vma);
>> if (vma->vm_file)
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index cdea11a..8cbb7fb 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1910,6 +1910,13 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
>> ret = mapping->a_ops->writepages(mapping, wbc);
>> else
>> ret = generic_writepages(mapping, wbc);
>> +
>> + /*
>> + * This is after writepages because the AS_CMTIME bit won't
>> + * bet set until writepages is called.
>> + */
>> + mapping_flush_cmtime(mapping);
>> +
>> return ret;
>> }
>>
>> @@ -2117,8 +2124,17 @@ EXPORT_SYMBOL(set_page_dirty);
>> */
>> int set_page_dirty_from_pte(struct page *page)
>> {
>> - /* Doesn't do anything interesting yet. */
>> - return set_page_dirty(page);
>> + int ret = set_page_dirty(page);
>> +
>> + /*
>> + * We may be out of memory and/or have various locks held, so
>> + * there isn't much we can do in here.
>> + */
>> + struct address_space *mapping = page_mapping(page);
> Declarations should go together please. So something like:
> int ret = set_page_dirty(page);
> struct address_space *mapping = page_mapping(page);
>
> /* comment... */
Will do. Some day I'll learn how to act less like a C99/C++
programmer when writing kernel code.
Am I correct in interpreting this as "these patches may be
sufficiently non-insane that I should keep working on them"? I admit
I'm pretty far out of my depth working on vm/vfs stuff.
Thanks,
Andy
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
2012-12-21 5:42 ` Andy Lutomirski
@ 2012-12-21 11:03 ` Jan Kara
0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-12-21 11:03 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Jan Kara, linux-kernel, Linux FS Devel, Dave Chinner, Al Viro
On Thu 20-12-12 21:42:46, Andy Lutomirski wrote:
> On Thu, Dec 20, 2012 at 4:34 PM, Jan Kara <jack@suse.cz> wrote:
> >> +/**
> >> + * inode_update_time_writable - update mtime and ctime time
> >> + * @inode: inode accessed
> >> + *
> >> + * This is like file_update_time, but it assumes the mnt is writable
> >> + * and takes an inode parameter instead.
> >> + */
> >> +
> >> +int inode_update_time_writable(struct inode *inode)
> >> +{
> >> + struct timespec now;
> >> + int sync_it = 0;
> >> + int ret;
> >> +
> >> + /* First try to exhaust all avenues to not sync */
> >> + if (IS_NOCMTIME(inode))
> >> + return 0;
> >> +
> >> + now = current_fs_time(inode->i_sb);
> >> + if (!timespec_equal(&inode->i_mtime, &now))
> >> + sync_it = S_MTIME;
> >> +
> >> + if (!timespec_equal(&inode->i_ctime, &now))
> >> + sync_it |= S_CTIME;
> >> +
> >> + if (IS_I_VERSION(inode))
> >> + sync_it |= S_VERSION;
> >> +
> >> + if (!sync_it)
> >> + return 0;
> >> +
> >> + ret = update_time(inode, &now, sync_it);
> >> +
> >> + return ret;
> >> +}
> >> +EXPORT_SYMBOL(inode_update_time_writable);
> >> +
> > So this differs from file_update_time() only by not calling
> > __mnt_want_write(). Why this special function? It is actually unsafe wrt
> > remounts read-only or filesystem freezing... For that you need to call
> > sb_start_write() / sb_end_write() around the timestamp update. Umm, or
> > better sb_start_pagefault() / sb_end_pagefault() because the call in
> > remove_vma() gets called under mmap_sem so we are in a rather similar
> > situation to ->page_mkwrite.
>
> The important difference is that it takes an inode* as a parameter
> instead of a file*. I don't think that inodes have a struct vfsmount,
> so I can't call __mnt_want_write. I'll take a look at
> sb_start_pagefault. I'll also refactor this a bit to minimize code
> duplication. The current approach was for the v1 rfc version. :)
I see. OK. __mnt_want_write() is the less important part and I guess we
can just skip that for timestamp updates (we are sure update was generated
because of some writeable mount so it's not really important via which
mount writing of them to disk was triggered). But freeze protection is a
must (otherwise e.g. dm snapshot could create corrupted filesystems). Also
reducing code duplication would be nice as you say.
> >> diff --git a/mm/mmap.c b/mm/mmap.c
> >> index 3913262..60301dc 100644
> >> --- a/mm/mmap.c
> >> +++ b/mm/mmap.c
> >> @@ -223,6 +223,10 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
> >> struct vm_area_struct *next = vma->vm_next;
> >>
> >> might_sleep();
> >> +
> >> + if (vma->vm_file)
> >> + mapping_flush_cmtime(vma->vm_file->f_mapping);
> >> +
> >> if (vma->vm_ops && vma->vm_ops->close)
> >> vma->vm_ops->close(vma);
> >> if (vma->vm_file)
> >> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> >> index cdea11a..8cbb7fb 100644
> >> --- a/mm/page-writeback.c
> >> +++ b/mm/page-writeback.c
> >> @@ -1910,6 +1910,13 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
> >> ret = mapping->a_ops->writepages(mapping, wbc);
> >> else
> >> ret = generic_writepages(mapping, wbc);
> >> +
> >> + /*
> >> + * This is after writepages because the AS_CMTIME bit won't
> >> + * bet set until writepages is called.
> >> + */
> >> + mapping_flush_cmtime(mapping);
> >> +
> >> return ret;
> >> }
> >>
> >> @@ -2117,8 +2124,17 @@ EXPORT_SYMBOL(set_page_dirty);
> >> */
> >> int set_page_dirty_from_pte(struct page *page)
> >> {
> >> - /* Doesn't do anything interesting yet. */
> >> - return set_page_dirty(page);
> >> + int ret = set_page_dirty(page);
> >> +
> >> + /*
> >> + * We may be out of memory and/or have various locks held, so
> >> + * there isn't much we can do in here.
> >> + */
> >> + struct address_space *mapping = page_mapping(page);
> > Declarations should go together please. So something like:
> > int ret = set_page_dirty(page);
> > struct address_space *mapping = page_mapping(page);
> >
> > /* comment... */
>
> Will do. Some day I'll learn how to act less like a C99/C++
> programmer when writing kernel code.
We are conservative sometimes ;)
> Am I correct in interpreting this as "these patches may be
> sufficiently non-insane that I should keep working on them"? I admit
> I'm pretty far out of my depth working on vm/vfs stuff.
Well, they look sane to me. The 'ext4 sometime hangs' part isn't that
compelling in general (although I understand it's your main motivation) but
'we should follow POSIX' part is. I suggest you CC linux-mm@kvack.org on
your next iteration as some changes involve more MM than anything else.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 26+ messages in thread
* [RFC PATCH 3/4] Remove file_update_time from all mkwrite paths
2012-12-20 23:10 ` [RFC PATCH 0/4] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
2012-12-20 23:10 ` [RFC PATCH 1/4] mm: Explicitly track when the page dirty bit is transferred from a pte Andy Lutomirski
2012-12-20 23:10 ` [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes Andy Lutomirski
@ 2012-12-20 23:10 ` Andy Lutomirski
2012-12-20 23:10 ` [RFC PATCH 4/4] ext4: Fix an incorrect comment about i_mutex Andy Lutomirski
3 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-20 23:10 UTC (permalink / raw)
To: linux-kernel, Linux FS Devel
Cc: Dave Chinner, Al Viro, Jan Kara, Andy Lutomirski
The times are now updated at sync time.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
fs/9p/vfs_file.c | 3 ---
fs/btrfs/inode.c | 4 +---
fs/buffer.c | 6 ------
fs/ceph/addr.c | 3 ---
fs/ext4/inode.c | 1 -
fs/gfs2/file.c | 3 ---
fs/nilfs2/file.c | 1 -
fs/sysfs/bin.c | 2 --
mm/filemap.c | 1 -
mm/memory.c | 3 ---
10 files changed, 1 insertion(+), 26 deletions(-)
diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index c2483e9..34b84f0 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -610,9 +610,6 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
p9_debug(P9_DEBUG_VFS, "page %p fid %lx\n",
page, (unsigned long)filp->private_data);
- /* Update file times before taking page lock */
- file_update_time(filp);
-
v9inode = V9FS_I(inode);
/* make sure the cache has finished storing the page */
v9fs_fscache_wait_on_page_write(inode, page);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 95542a1..6fb8558 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6747,10 +6747,8 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
sb_start_pagefault(inode->i_sb);
ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
if (!ret) {
- ret = file_update_time(vma->vm_file);
reserved = 1;
- }
- if (ret) {
+ } else if (ret) {
if (ret == -ENOMEM)
ret = VM_FAULT_OOM;
else /* -ENOSPC, -EIO, etc */
diff --git a/fs/buffer.c b/fs/buffer.c
index ec0aca8..a9a8f2a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2379,12 +2379,6 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
sb_start_pagefault(sb);
- /*
- * Update file times before taking page lock. We may end up failing the
- * fault so this update may be superfluous but who really cares...
- */
- file_update_time(vma->vm_file);
-
ret = __block_page_mkwrite(vma, vmf, get_block);
sb_end_pagefault(sb);
return block_page_mkwrite_return(ret);
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6690269..19af339 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1183,9 +1183,6 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
loff_t size, len;
int ret;
- /* Update time before taking page lock */
- file_update_time(vma->vm_file);
-
size = i_size_read(inode);
if (off + PAGE_CACHE_SIZE <= size)
len = PAGE_CACHE_SIZE;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b3c243b..5ddbc75 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4780,7 +4780,6 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
int retries = 0;
sb_start_pagefault(inode->i_sb);
- file_update_time(vma->vm_file);
/* Delalloc case is easy... */
if (test_opt(inode->i_sb, DELALLOC) &&
!ext4_should_journal_data(inode) &&
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index e056b4c..999b1ed 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -398,9 +398,6 @@ static int gfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
sb_start_pagefault(inode->i_sb);
- /* Update file times before taking page lock */
- file_update_time(vma->vm_file);
-
ret = gfs2_rs_alloc(ip);
if (ret)
return ret;
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 16f35f7..185b8e6 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -116,7 +116,6 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (unlikely(ret))
goto out;
- file_update_time(vma->vm_file);
ret = __block_page_mkwrite(vma, vmf, nilfs_get_block);
if (ret) {
nilfs_transaction_abort(inode->i_sb);
diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
index 614b2b5..a475983 100644
--- a/fs/sysfs/bin.c
+++ b/fs/sysfs/bin.c
@@ -228,8 +228,6 @@ static int bin_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
ret = 0;
if (bb->vm_ops->page_mkwrite)
ret = bb->vm_ops->page_mkwrite(vma, vmf);
- else
- file_update_time(file);
sysfs_put_active(attr_sd);
return ret;
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..feb0540 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1715,7 +1715,6 @@ int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
int ret = VM_FAULT_LOCKED;
sb_start_pagefault(inode->i_sb);
- file_update_time(vma->vm_file);
lock_page(page);
if (page->mapping != inode->i_mapping) {
unlock_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 086b901..07a0b0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2658,9 +2658,6 @@ reuse:
if (!page_mkwrite) {
wait_on_page_locked(dirty_page);
set_page_dirty_balance(dirty_page, page_mkwrite);
- /* file_update_time outside page_lock */
- if (vma->vm_file)
- file_update_time(vma->vm_file);
}
put_page(dirty_page);
if (page_mkwrite) {
--
1.7.11.7
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [RFC PATCH 4/4] ext4: Fix an incorrect comment about i_mutex
2012-12-20 23:10 ` [RFC PATCH 0/4] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
` (2 preceding siblings ...)
2012-12-20 23:10 ` [RFC PATCH 3/4] Remove file_update_time from all mkwrite paths Andy Lutomirski
@ 2012-12-20 23:10 ` Andy Lutomirski
2012-12-20 23:42 ` Jan Kara
3 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2012-12-20 23:10 UTC (permalink / raw)
To: linux-kernel, Linux FS Devel
Cc: Dave Chinner, Al Viro, Jan Kara, Andy Lutomirski
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
fs/ext4/fsync.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index be1d89f..8c15642 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -113,8 +113,6 @@ static int __sync_inode(struct inode *inode, int datasync)
*
* What we do is just kick off a commit and wait on it. This will snapshot the
* inode to disk.
- *
- * i_mutex lock is held when entering and exiting this function
*/
int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
--
1.7.11.7
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 4/4] ext4: Fix an incorrect comment about i_mutex
2012-12-20 23:10 ` [RFC PATCH 4/4] ext4: Fix an incorrect comment about i_mutex Andy Lutomirski
@ 2012-12-20 23:42 ` Jan Kara
0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-12-20 23:42 UTC (permalink / raw)
To: Andy Lutomirski
Cc: linux-kernel, Linux FS Devel, Dave Chinner, Al Viro, Jan Kara
On Thu 20-12-12 15:10:12, Andy Lutomirski wrote:
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
The patch looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
BTW: The right person for this patch is ext4 maintainer:
"Theodore Ts'o" <tytso@mit.edu>
Since this is completely independent of anything else in this series,
just submit the patch to him. Thanks!
Honza
> ---
> fs/ext4/fsync.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index be1d89f..8c15642 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -113,8 +113,6 @@ static int __sync_inode(struct inode *inode, int datasync)
> *
> * What we do is just kick off a commit and wait on it. This will snapshot the
> * inode to disk.
> - *
> - * i_mutex lock is held when entering and exiting this function
> */
>
> int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> --
> 1.7.11.7
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 26+ messages in thread