From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Jan Kara <jack@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Theodore Ts'o <tytso@mit.edu>,
Andreas Dilger <adilger.kernel@dilger.ca>,
Jan Kara <jack@suse.com>,
Andrew Morton <akpm@linux-foundation.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Hugh Dickins <hughd@google.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Dave Hansen <dave.hansen@intel.com>,
Vlastimil Babka <vbabka@suse.cz>,
Matthew Wilcox <willy@infradead.org>,
Ross Zwisler <ross.zwisler@linux.intel.com>,
linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
linux-block@vger.kernel.org
Subject: Re: [PATCHv3 13/41] truncate: make sure invalidate_mapping_pages() can discard huge pages
Date: Mon, 24 Oct 2016 13:41:02 +0300
Message-ID: <20161024104102.GA2849@node.shutemov.name>
In-Reply-To: <20161012064320.GA13896@quack2.suse.cz>
On Wed, Oct 12, 2016 at 08:43:20AM +0200, Jan Kara wrote:
> On Wed 12-10-16 00:53:49, Kirill A. Shutemov wrote:
> > On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> > > On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> > > > invalidate_inode_page() has expectation about page_count() of the page
> > > > -- if it's not 2 (one to caller, one to radix-tree), it will not be
> > > > dropped. That condition almost never met for THPs -- tail pages are
> > > > pinned to the pagevec.
> > > >
> > > > Let's drop them, before calling invalidate_inode_page().
> > > >
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > ---
> > > > mm/truncate.c | 11 +++++++++++
> > > > 1 file changed, 11 insertions(+)
> > > >
> > > > diff --git a/mm/truncate.c b/mm/truncate.c
> > > > index a01cce450a26..ce904e4b1708 100644
> > > > --- a/mm/truncate.c
> > > > +++ b/mm/truncate.c
> > > > @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > > > /* 'end' is in the middle of THP */
> > > > if (index == round_down(end, HPAGE_PMD_NR))
> > > > continue;
> > > > + /*
> > > > + * invalidate_inode_page() expects
> > > > + * page_count(page) == 2 to drop page from page
> > > > + * cache -- drop tail pages references.
> > > > + */
> > > > + get_page(page);
> > > > + pagevec_release(&pvec);
> > >
> > > I'm not quite sure why this is needed. When you have multiorder entry in
> > > the radix tree for your huge page, then you should not get more entries in
> > > the pagevec for your huge page. What do I miss?
> >
> > For compatibility reason find_get_entries() (which is called by
> > pagevec_lookup_entries()) collects all subpages of huge page in the range
> > (head/tails). See patch [07/41]
> >
> > So huge page, which is fully in the range it will be pinned up to
> > PAGEVEC_SIZE times.
>
> Yeah, I see. But then won't it be cleaner to provide iteration method that
> would add to pagevec each radix tree entry (regardless of its order) only
> once and then use it in places where we care? Instead of strange dances
> like you do here?
Maybe. It would require doubling the number of find_get_* helpers or
adding a flag to each. We have too many already.
And the multi-order entry interface for the radix tree has not settled yet.
I would rather defer such a rework until it has taken full shape.
Let's come back to this later.
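A minimal userspace sketch (not kernel code) of the reference counting
discussed above may help. It assumes a lookup that pins the same huge page
once per subpage slot, up to PAGEVEC_SIZE, and an invalidate step that only
drops the page when the count is exactly 2. All toy_* names are hypothetical,
and PAGEVEC_SIZE is set to 14 as an assumption about kernels of that period.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGEVEC_SIZE 14 /* assumption: pagevec capacity in 2016-era kernels */

/* Hypothetical stand-ins for struct page and struct pagevec. */
struct toy_page {
	int refcount;
	bool in_cache;
};

struct toy_pagevec {
	int nr;
	struct toy_page *pages[PAGEVEC_SIZE];
};

static void toy_get_page(struct toy_page *page)
{
	page->refcount++;
}

static void toy_put_page(struct toy_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/* Pin the huge page once per subpage in the range, capped by the pagevec. */
static void toy_lookup(struct toy_pagevec *pvec, struct toy_page *huge,
		       int nr_subpages)
{
	pvec->nr = 0;
	while (pvec->nr < nr_subpages && pvec->nr < PAGEVEC_SIZE) {
		toy_get_page(huge);
		pvec->pages[pvec->nr++] = huge;
	}
}

/* Drop every reference held by the pagevec. */
static void toy_pagevec_release(struct toy_pagevec *pvec)
{
	for (int i = 0; i < pvec->nr; i++)
		toy_put_page(pvec->pages[i]);
	pvec->nr = 0;
}

/* Mimic the "count must be 2" expectation: one ref for the caller, one for the cache. */
static bool toy_invalidate(struct toy_page *page)
{
	if (!page->in_cache || page->refcount != 2)
		return false;
	page->in_cache = false;
	toy_put_page(page); /* drop the page cache's reference */
	return true;
}
```

In this model, a page that starts with only the page cache reference ends up
with 1 + PAGEVEC_SIZE references after toy_lookup(), so toy_invalidate()
refuses it. Taking one extra reference and then calling toy_pagevec_release()
leaves exactly 2, which is the sequence the patch adds.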
> Ultimately we could convert all the places to use these new iteration
> methods but I don't see that as immediately necessary and maybe there are
> places where getting all the subpages in the pagevec actually makes life
> simpler for us (please point me if you know about such place).
I did it this way so as not to evaluate each use of find_get_*() one by one.
I guessed most of the callers of find_get_page() would be confused by
getting the head page instead of the relevant subpage. Maybe I was wrong and
it would have been easier to make the callers work with that. I don't know...
> On a somewhat unrelated note: I've noticed that you don't invalidate
> a huge page when only part of it should be invalidated. That actually
> breaks some assumptions filesystems make. In particular direct IO code
> assumes that if you do
>
> filemap_write_and_wait_range(inode, start, end);
> invalidate_inode_pages2_range(inode, start, end);
>
> all the page cache covering start-end *will* be invalidated. Your skipping
> of partial pages breaks this assumption and thus can bring consistency
> issues (e.g. write done using direct IO won't be seen by following buffered
> read).
Actually, invalidate_inode_pages2_range() does invalidate the whole page if
part of it is in the range. I caught this problem during testing.
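As a rough illustration of "the whole page if part of it is in the range",
the sketch below computes which huge pages an inclusive index range touches.
It is a userspace model only: the toy_* helpers are hypothetical,
TOY_HPAGE_NR = 512 assumes x86-64 with 4K base pages, and it assumes every
index in the file is backed by a huge page.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_HPAGE_NR 512UL /* assumption: subpages per huge page */

/* Index of the head subpage of the huge page containing 'index'. */
static unsigned long toy_round_down(unsigned long index)
{
	return index - index % TOY_HPAGE_NR;
}

/* True if the huge page starting at 'head' overlaps [start, end] at all. */
static bool toy_hpage_in_range(unsigned long head, unsigned long start,
			       unsigned long end)
{
	assert(head % TOY_HPAGE_NR == 0);
	assert(start <= end);
	return head <= end && head + TOY_HPAGE_NR - 1 >= start;
}

/* Widen [start, end] so it covers every huge page it partially touches. */
static void toy_effective_range(unsigned long start, unsigned long end,
				unsigned long *first, unsigned long *last)
{
	assert(start <= end);
	*first = toy_round_down(start);
	*last = toy_round_down(end) + TOY_HPAGE_NR - 1;
}
```

For a request covering indices 100 to 600, the widened range in this model is
0 to 1023: both huge pages that are partly inside the request are invalidated
as a whole, and the one starting at index 1024 is left alone.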
--
Kirill A. Shutemov