From: Nishanth Aravamudan <nacc@us.ibm.com>
To: Ken Chen <kenchen@google.com>
Cc: William Lee Irwin III <wli@holomorphy.com>,
linux-mm@kvack.org, agl@us.ibm.com, dwg@au1.ibm.com
Subject: Re: FADV_DONTNEED on hugetlbfs files broken
Date: Sun, 18 Mar 2007 10:27:11 -0700
Message-ID: <20070318172711.GA12978@us.ibm.com>
In-Reply-To: <b040c32a0703180043t29c675bfr9a9554575a261f96@mail.gmail.com>
On 18.03.2007 [00:43:01 -0700], Ken Chen wrote:
> On 3/17/07, Nishanth Aravamudan <nacc@us.ibm.com> wrote:
> >Yes, that could be :) Sorry if my e-mail indicated I was asking
> >otherwise. I don't want Ken's commit to be reverted, as that would
> >make hugepages very nearly unusable on x86 and x86_64. But I had
> >found a functional change and wanted it to be documented. If
> >hugepages can no longer be dropped from the page cache, then we
> >should make sure that is clear (and expected/desired).
>
> Oh gosh, I think you are really abusing the buggy hugetlb behavior in
> the dark age of 2.6.19. Hugetlb file does not have disk based backing
> store. The in-core page that resides in the page cache is the only
> copy of the file. For pages that are dirty, there is no place to
> sync them to and thus they have to stay in the page cache for the life
> of the file.
And 2.6.20, fwiw. Your explanation makes sense. Frustrating, though,
since it means segment remapping uses twice as many huge pages as it
needs to for each writable segment.
> And currently, there is no way to allocate hugetlb page in "clean"
> state because we can't mmap hugetlb page onto a disk file. So pages
> for live file in hugetlbfs are always being written to initially and
> it is just not possible to drop them out of page cache, otherwise we
> suffer from data corruption.
Let's be clear, for the sake of the archives of the world, this is only
for *writable* allocations. In make_huge_pte():
	if (writable) {
		entry =
		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
	} else {
		entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
	}
Probably obvious to anyone, since you need to be able to dirty the page to have
it in a dirty state.
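For the archives, a toy userspace model of that branch may make the point
concrete. It is only a sketch: the toy_* names and the flag bits are made up
for illustration and are not the kernel's pte helpers or any architecture's
real PTE layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; real PTE layouts are architecture-specific. */
#define TOY_PTE_WRITE (1u << 0)
#define TOY_PTE_DIRTY (1u << 1)

typedef uint32_t toy_pte_t;

static inline toy_pte_t toy_mkwrite(toy_pte_t pte) { return pte | TOY_PTE_WRITE; }
static inline toy_pte_t toy_mkdirty(toy_pte_t pte) { return pte | TOY_PTE_DIRTY; }
static inline toy_pte_t toy_wrprotect(toy_pte_t pte) { return pte & ~TOY_PTE_WRITE; }
static inline bool toy_pte_dirty(toy_pte_t pte) { return (pte & TOY_PTE_DIRTY) != 0; }
static inline bool toy_pte_write(toy_pte_t pte) { return (pte & TOY_PTE_WRITE) != 0; }

/* Same shape as the branch above; "base" stands in for mk_pte(). */
static inline toy_pte_t toy_make_huge_pte(toy_pte_t base, bool writable)
{
	/* Assumption of this model: a freshly built entry starts clean. */
	assert(!toy_pte_dirty(base));

	if (writable)
		return toy_mkwrite(toy_mkdirty(base));
	return toy_wrprotect(base);
}
```

In this model a writable entry comes out both writable and dirty, while a
read-only entry comes out clean and write-protected.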
> >Now, even if I call fsync() on the file descriptor, I still don't get
> >the pages out of the page cache. It seems to me like fsync() would
> >clear the dirty state -- although perhaps with Ken's patch, writable
> >hugetlbfs pages will *always* be dirty? I'm still trying to figure
> >out what ever clears that dirty state (in hugetlbfs or anywhere
> >else). Seems like hugetlbfs truncates call cancel_dirty_page(), but
> >the comment there indicates it's only for truncates.
>
> fsync cannot drop dirty pages out of page cache because there is no
> backing store. I believe truncate is the only way to remove hugetlb
> page out of page cache.
Which won't work here, because we don't want to lose the data. We just
want to drop the original MAP_SHARED copy of the file out of the page
cache. I tried ftruncate()'ing the file down to 0 after we've mapped it
PRIVATE and COW'd each hugepage, but then the process (obviously) SEGVs,
since we lose all of the hugepages in the page cache.
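For reference, the fsync()-then-fadvise sequence discussed above looks
roughly like the sketch below. try_drop_page_cache() is a name made up for
illustration, and fd is assumed to refer to an already-open file. Even when
both calls succeed, that only means the requests were accepted; it says
nothing about whether any pages actually left the page cache.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush the file, then advise the kernel that its cached pages are no
 * longer needed. Returns 0 on success or an errno-style value on failure
 * (posix_fadvise() reports errors through its return value, not errno).
 */
static int try_drop_page_cache(int fd)
{
	assert(fd >= 0);

	if (fsync(fd) != 0)
		return errno;

	/* offset 0 with len 0 covers the whole file */
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

On a filesystem with a real backing store, the fsync() is what lets clean
pages be dropped afterwards. With no backing store there is nowhere to write
the dirty pages, which is the case Ken describes.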
> >> Perhaps we should ask what ramfs, tmpfs, et al would do. Or, for
> >> that matter, if they suffer from the same issue as Ken Chen
> >> identified for hugetlbfs. Perhaps the issue is not hugetlb's dirty
> >> state, but drop_pagecache_sb() failing to check the bdi for
> >> BDI_CAP_NO_WRITEBACK. Or perhaps what safety guarantees
> >> drop_pagecache_sb() is supposed to have or lack.
>
> I looked, ramfs and tmpfs do the same thing. fadvise(DONTNEED)
> doesn't do anything to live files.
Ok, thanks for looking into it, Ken.
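For anyone who wants to observe this from userspace, one possible approach
(not something used in this thread) is mincore(2), which reports which pages
of an ordinary mapping are resident. A minimal sketch follows;
count_resident_pages() is a made-up helper name, and it has not been tried
against hugetlbfs mappings here, so treat it as illustrative only.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count how many base pages of the mapping [addr, addr + len) the kernel
 * reports as resident. addr must be page-aligned. Returns -1 on failure.
 */
static long count_resident_pages(void *addr, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t npages = (len + page - 1) / page;
	unsigned char *vec;
	long resident = 0;

	assert(addr != NULL && len > 0);

	vec = malloc(npages);
	if (vec == NULL)
		return -1;

	if (mincore(addr, len, vec) != 0) {
		free(vec);
		return -1;
	}

	/* The low bit of each vector byte is set when that page is resident. */
	for (size_t i = 0; i < npages; i++)
		resident += vec[i] & 1;

	free(vec);
	return resident;
}
```

Comparing the count before and after posix_fadvise(POSIX_FADV_DONTNEED) on a
mapped tmpfs file would be one way to see the no-op behaviour Ken describes.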
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center