From: Andy Lutomirski <luto@amacapital.net>
To: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>,
linux-kernel@vger.kernel.org,
Linux FS Devel <linux-fsdevel@vger.kernel.org>,
Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes
Date: Fri, 21 Dec 2012 10:26:37 -0800 [thread overview]
Message-ID: <CALCETrX3uSkASU-aw5HF9mNtzVr4HsXt-2a97wzDfc9sDwp8Qw@mail.gmail.com> (raw)
In-Reply-To: <20121221105126.GD17357@quack.suse.cz>
On Fri, Dec 21, 2012 at 2:51 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 20-12-12 21:36:58, Andy Lutomirski wrote:
>> On Thu, Dec 20, 2012 at 4:14 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Dec 20, 2012 at 03:10:10PM -0800, Andy Lutomirski wrote:
>> >> The onus is currently on filesystems to call file_update_time
>> >> somewhere in the page_mkwrite path. This is unfortunate for three
>> >> reasons:
>> >>
>> >> 1. page_mkwrite on a locked page should be fast. ext4, for example,
>> >> often sleeps while dirtying inodes.
>> >
>> > That's an ext4 problem, not a page fault or timestamp update
>> > problem. Fix ext4.
>>
>> I could fix ext4 (or rather, someone who understands jbd2 could fix
>> ext4), but I think my approach is considerably simpler and fixes all
>> filesystems at once.
> Well, you could implement the fix you did just inside ext4. Remove
> timestamp update from ext4_page_mkwrite(), just set some inode flag there,
> update times in ext4_writepages(), ext4_sync_file(), and
> ext4_release_file(). It won't be as elegant but it would work.
I'm worried about funny corner cases. Suppose fd1 and fd2 refer to
two separate struct files for the same inode. Then do this:
addr1 = mmap(fd1);
addr2 = mmap(fd2);
*addr2 = 0; <-- calls page_mkwrite
munmap(addr1);
close(fd1);
sleep(10);
*addr2 = 1; <-- probably does not call page_mkwrite
munmap(addr2);
close(fd2);
Without extra care, the file timestamp will be ten seconds too early.
To make this work, ext4 would need to keep track of which vma is
dirty, which would involve a lot more complexity. Alternatively, ext4
could call page_mkclean on all pages in the mapping in
ext4_release_file, but that would suck for performance. (Actually,
that would probably be awful for my workload -- I have a background
task that periodically opens and (critically) closes the same files
that are mmaped in real-time thread, and all those extra tlb flushes
and page faults would hurt.)
The approach in my patches magically avoids this problem by using the
per-pte dirty bit as opposed to anything that's per page. :)
Admittedly, I could leave the AS_CMTIME stuff in the core and just
sprinkle mapping_flush_cmtime calls around ext4.
One of these days I hope to open-source the thing that hits all these
issues. It's useful and has no good reason to be proprietary, other
than its present messiness.
>
>> >> 3. (An ulterior motive) I'd like to use hardware dirty tracking for
>> >> shared, locked, writable mappings (instead of page faults). Moving
>> >> important work out of the page_mkwrite path is an important first step.
>> >
>> > I don't think you're going to get very far doing this. page_mkwrite
>> > needs to do:
>> >
>> > a) block allocation in page_mkwrite() for faults over holes
>> > to detect ENOSPC conditions immediately rather than in
>> > writeback when such an error results in data loss.
>> > b) detect writes over unwritten extents so that the pages in
>> > the page cache can be set up correctly for conversion to
>> > occur during writeback.
>> >
>> > Correcting these two problems was the reason for introducing
>> > page_mkwrite in the first place - we need to do this stuff before
>> > the page fault is completed, and that means, by definition,
>> > page_mkwrite needs to be able to block. Moving c/mtime updates out
>> > of the way does not, in any way, change these requirements.
>>
>> I was unclear. I have no problem if PROT_WRITE, MAP_SHARED pages
>> start out write protected. I want two changes from the status quo,
>> though:
>>
>> 1. ptes (maybe only if mlocked or maybe even only if some new madvise
>> flag is set) should be marked clean but stay writable during and after
>> writeback. (This would only work when stable pages are not in
>> effect.)
> This is actually problematic - s390 architecture doesn't have PTE dirty
> bit. It has per page HW dirty bit but that's problematic to use so it's
> completely relying on page being protected after writeback so that it hits
> page fault when it should be dirtied again. Also memory management uses
> these page faults to track number of dirty pages which is then important to
> throttle aggressive writers etc. (we used to do what you describe sometimes
> in early 2.6 days but then we switched to writeprotecting the page for
> writeback to be able to do dirty page tracking).
Ugh. I thought arches without per-pte dirty bits were supposed to
emulate them by treating writable+clean as write protected and fixing
everything up when page faults happen.
This whole thing could just be turned off on s390, though.
>
> If we limit the behavior you desribe to mlocked pages, I could imagine
> things would work out from the accounting POV (we could treat mlocked pages
> as always dirty for accounting purposes) but it's definitely not a change
> to be easily made.
>
>> 2. There should be a way to request that pages be made clean-but-writable.
>>
>> #1 should be mostly fine already from the filesystems' point of view.
>> #2 would involve calling page_mkwrite or some equivalent.
>>
>> I suspect that there are bugs that are currently untriggerable
>> involving clean-but-writable pages. For example, what happens if cpu
>> 0 (atomically) changes a pte from clean/writable to not present, but
>> cpu 1 writes the page before the TLB flush happens. I think the pte
>> ends up not present + dirty, but the current unmapping code won't
>> notice.
>>
>> This would involve splitting page_mkclean into "make clean and write
>> protect" and "make clean but leave writable".
>>
>> Doing this would avoid ~1M "soft" page faults per day in a single
>> real-time thread in my software. (The file time patches are to make
>> sure that the "soft" page faults actually don't sleep.)
> Yes, there's a non-negligible cost of writeprotecting the page during
> writeback. But the avoidance of OOM conditions due to too aggressive
> writers simply has precedence... Anyway, if you want to push more in this
> direction, I suggest to move this discussion to linux-mm@kvack.org as there
> are lingering more appropriate listeners :). I'm just a filesystem guy with
> some basic knowledge of MM.
Will do. I'll send out updated patches as well.
--Andy
next prev parent reply other threads:[~2012-12-21 18:26 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-12-18 1:10 Are there u32 atomic bitops? (or dealing w/ i_flags) Andy Lutomirski
2012-12-18 1:34 ` Ming Lei
2012-12-18 1:57 ` Al Viro
2012-12-18 2:42 ` Andy Lutomirski
2012-12-18 21:30 ` Dave Chinner
2012-12-18 22:20 ` Andy Lutomirski
2012-12-20 7:03 ` Dave Chinner
2012-12-20 20:05 ` Andy Lutomirski
2012-12-20 23:10 ` [RFC PATCH 0/4] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
2012-12-20 23:10 ` [RFC PATCH 1/4] mm: Explicitly track when the page dirty bit is transferred from a pte Andy Lutomirski
2012-12-20 23:10 ` [RFC PATCH 2/4] mm: Update file times when inodes are written after mmaped writes Andy Lutomirski
2012-12-21 0:14 ` Dave Chinner
2012-12-21 0:58 ` Jan Kara
2012-12-21 1:12 ` Dave Chinner
2012-12-21 1:36 ` Jan Kara
2012-12-21 5:36 ` Andy Lutomirski
2012-12-21 10:51 ` Jan Kara
2012-12-21 18:26 ` Andy Lutomirski [this message]
2012-12-21 0:34 ` Jan Kara
2012-12-21 5:42 ` Andy Lutomirski
2012-12-21 11:03 ` Jan Kara
2012-12-20 23:10 ` [RFC PATCH 3/4] Remove file_update_time from all mkwrite paths Andy Lutomirski
2012-12-20 23:10 ` [RFC PATCH 4/4] ext4: Fix an incorrect comment about i_mutex Andy Lutomirski
2012-12-20 23:42 ` Jan Kara
2012-12-20 23:36 ` Are there u32 atomic bitops? (or dealing w/ i_flags) Dave Chinner
2012-12-20 23:42 ` Andy Lutomirski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CALCETrX3uSkASU-aw5HF9mNtzVr4HsXt-2a97wzDfc9sDwp8Qw@mail.gmail.com \
--to=luto@amacapital.net \
--cc=david@fromorbit.com \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).