From: Dave Chinner <david@fromorbit.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>,
Dave Hansen <dave.hansen@intel.com>,
linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Vlastimil Babka <vbabka@suse.cz>,
Christoph Lameter <cl@gentwo.org>,
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
Jerome Marchand <jmarchan@redhat.com>,
Yang Shi <yang.shi@linaro.org>,
Sasha Levin <sasha.levin@oracle.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE
Date: Sun, 6 Mar 2016 09:38:11 +1100
Message-ID: <20160305223811.GD11282@dastard>
In-Reply-To: <20160304232412.GC12498@node.shutemov.name>

On Sat, Mar 05, 2016 at 02:24:12AM +0300, Kirill A. Shutemov wrote:
> On Sat, Mar 05, 2016 at 10:05:48AM +1100, Dave Chinner wrote:
> > On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote:
> > > On Fri, 4 Mar 2016, Dave Hansen wrote:
> > > > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > > > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> > > > > >> Truncate and punch hole that cover only part of a THP range are
> > > > > >> implemented by zeroing out that part of the THP.
> > > > > >>
> > > > > >> This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > > > > >> As we don't really create a hole in this case, lseek(SEEK_HOLE) may have
> > > > > >> inconsistent results depending on which pages happened to be allocated.
> > > > > >> Not sure if it should be considered an ABI break or not.
> > > > >
> > > > > Looks like this shouldn't be a problem. man 2 fallocate:
> > > > >
> > > > > Within the specified range, partial filesystem blocks are zeroed,
> > > > > and whole filesystem blocks are removed from the file. After a
> > > > > successful call, subsequent reads from this range will return
> > > > > zeroes.
> > > > >
> > > > > It means we effectively have 2M filesystem block size.
> > > >
> > > > The question is still whether this will cause problems for apps.
> > > >
> > > > Isn't 2MB a quite unusual block size? Wouldn't some files on a tmpfs
> > > > filesystem act like they have a 2M blocksize and others like they have
> > > > 4k? Would that confuse apps?
> > >
> > > At risk of addressing the tip of an iceberg, before diving down to
> > > scope out the rest of the iceberg...
> > ....
> >
> > > (Though in the case of my huge tmpfs, it's the reverse: the small hole
> > > punch splits the hugepage; but it's natural that Kirill's way would try
> > > to hold on to its compound pages for longer than I do, and that's fine
> > > so long as it's all consistent.)
> > ....
> > > Ah, but suppose someone holepunches out most of each 2M page: they would
> > > expect the memcg not to be charged for those holes (just as when they
> > > munmap most of an anonymous THP) - that does suggest splitting is needed.
> >
> > I think filesystems will expect splitting to happen. They call
> > truncate_pagecache_range() on the region that the hole is being
> > punched out of, and they expect page cache pages over this range to
> > be unmapped, invalidated and then removed from the mapping tree as a
> > result. Also, most filesystems think the page cache only contains
> > PAGE_CACHE_SIZE mappings, so they are completely unaware of the
> > limitations THP might have when it comes to invalidation.
> >
> > IOWs, if this range is not aligned to huge page boundaries, then it
> > implies the huge page is either split into PAGE_SIZE mappings and
> > then the range is invalidated as expected, or it is completely
> > invalidated and then refaulted on future accesses which determine if
> > THP or normal pages are used for the page being faulted....
>
> The filesystem in question is tmpfs and complete invalidation is not
> always an option.

Then your two options are: splitting the page and rerunning the hole
punch, or simply zeroing the sections of the THP rather than trying
to punch out the backing store.

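To make the difference between those two outcomes concrete, here is a
rough userspace sketch of the arithmetic only. It is not kernel code:
struct punch_plan and plan_punch() are hypothetical names, and the 2M
and 4k sizes used with it are just the usual x86-64 values. It splits
a punch range into the partial blocks that can only be zeroed and the
whole blocks whose backing store can actually be removed:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for illustration; not a kernel structure. */
struct punch_plan {
	uint64_t zero_head_start, zero_head_len; /* leading partial block: zeroed */
	uint64_t free_start, free_len;           /* whole blocks: can be removed */
	uint64_t zero_tail_start, zero_tail_len; /* trailing partial block: zeroed */
};

/* Split [off, off + len) on boundaries of blk, which must be a power of two. */
static struct punch_plan plan_punch(uint64_t off, uint64_t len, uint64_t blk)
{
	struct punch_plan p = {0};
	uint64_t end, first, last;

	assert(blk != 0 && (blk & (blk - 1)) == 0);

	end = off + len;
	first = (off + blk - 1) & ~(blk - 1); /* off rounded up to a block */
	last = end & ~(blk - 1);              /* end rounded down to a block */

	p.zero_head_start = off;
	if (first > last) {
		/* The range sits inside one block: nothing can be removed. */
		p.zero_head_len = len;
		return p;
	}

	p.zero_head_len = first - off;
	p.free_start = first;
	p.free_len = last - first;
	p.zero_tail_start = last;
	p.zero_tail_len = end - last;
	return p;
}
```

With blk = 4096, an 8k punch at offset 4k is all removable blocks; with
blk = 2M the same call removes nothing and zeroes the lot, which is the
behavioural difference being discussed.
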
> For other filesystems it also can be unavailable
> immediately if the page is dirty (the dirty flag is tracked on per-THP
> basis at the moment).

Filesystems with persistent storage flush the range being punched
first to ensure that partial blocks are correctly written before we
start freeing the backing store. This is needed on XFS to ensure
hole punch plays nicely with delayed allocation and other extent
based operations. Hence we know that we have clean pages over the
hole we are about to punch and so there is no reason the
invalidation should *ever* fail.

tmpfs is a special snowflake when it comes to these fallocate based
filesystem layout manipulation functions - it does not have
persistent storage, so you have to do things very differently to
ensure that data is not lost.

> Would it be acceptable for fallocate(FALLOC_FL_PUNCH_HOLE) to return
> -EBUSY (or other errno of your choice), if we cannot split the page
> right away?

Which means THP are not transparent any more. What does an
application do when it gets an EBUSY, anyway? It needs to punch a
hole, and failure to do so could result in data corruption or stale
data exposure if the hole isn't punched and the data purged from the
range.
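
For illustration only: about the best an application could do with
such an error is fall back to overwriting the range with zeros itself.
A minimal C sketch, assuming the -EBUSY return proposed above (it is
only a proposal in this thread), with punch_or_zero() being a made-up
helper name:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for fallocate() and the FALLOC_FL_* flags */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to punch a hole, and if the kernel refuses
 * with EBUSY (the proposal above) or EOPNOTSUPP, purge the data by
 * writing zeros instead. Returns 0 on success, -1 with errno set.
 *
 * Caveats of the fallback: it frees no backing store, and unlike
 * FALLOC_FL_KEEP_SIZE it extends the file if the range runs past EOF.
 */
static int punch_or_zero(int fd, off_t off, off_t len)
{
	char zeros[4096];

	assert(off >= 0 && len > 0);

	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      off, len) == 0)
		return 0;
	if (errno != EBUSY && errno != EOPNOTSUPP)
		return -1;

	memset(zeros, 0, sizeof(zeros));
	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(zeros) ?
			       (size_t)len : sizeof(zeros);
		ssize_t n = pwrite(fd, zeros, chunk, off);

		if (n < 0) {
			if (errno == EINTR)
				continue; /* interrupted: retry the same chunk */
			return -1;
		}
		off += n;
		len -= n;
	}
	return 0;
}
```

Even then the application gets zeros but not a hole: the memory stays
allocated and charged, so the fallback does not recover what the hole
punch was supposed to provide.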

And it's not just hole punching that has this problem. Direct IO is
going to have the same issue with invalidation of the mapped ranges
over the IO being done. XFS already WARNs when page cache
invalidation fails with EBUSY in direct IO, because that is
indicative of an application with a potential data corruption vector
and there's nothing we can do in the kernel code to prevent it.

I think the same issues also exist with DAX using huge (and giant)
pages. Hence it seems like we need to think about these interactions
carefully, because they will no longer be isolated to tmpfs and
THP...

> > Just to complicate things, keep in mind that some filesystems may
> > have a PAGE_SIZE block size, but can be convinced to only
> > allocate/punch/truncate/etc extents on larger alignments on a
> > per-inode basis. IOWs, THP vs hole punch behaviour is not actually
> > a filesystem type specific behaviour - it's per-inode specific...
>
> There is also similar question about THP vs. i_size vs. SIGBUS.
>
> For small pages an application will not get SIGBUS on an mmap()ed file
> unless it tries to access beyond round_up(i_size, PAGE_CACHE_SIZE) - 1.
>
> For THP it would be round_up(i_size, HPAGE_PMD_SIZE) - 1.
>
> Is it a problem?

No idea. I'm guessing that there may be significant stale data
exposure issues here as filesystems do not guarantee that blocks
completely beyond EOF contain zeros.
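
To put rough numbers on that window, here is a small userspace sketch
of the two round_up() expressions quoted above. The helper names are
hypothetical, and 4096 and 2M stand in for PAGE_CACHE_SIZE and
HPAGE_PMD_SIZE on the assumption of a typical x86-64 configuration:

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of align, which must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Last file offset reachable through the mapping without SIGBUS. */
static uint64_t last_mapped_offset(uint64_t i_size, uint64_t page_size)
{
	assert(i_size > 0);
	return round_up_pow2(i_size, page_size) - 1;
}

/* Number of bytes beyond EOF that the mapping makes reachable. */
static uint64_t bytes_past_eof(uint64_t i_size, uint64_t page_size)
{
	return last_mapped_offset(i_size, page_size) + 1 - i_size;
}
```

For a 5000 byte file that is 3192 bytes past EOF with 4k pages, but
2092152 bytes with a 2M page, i.e. far more than one filesystem block
on most filesystems.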

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com