Re: [Qemu-devel] Fix refcounting in hugetlbfs quota handling

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Andrew Barry <abarry@cray.com>
To: Hugh Dickins <hughd@google.com>, Avi Kivity <avi@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Jan Kiszka <jan.kiszka@web.de>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"agraf@suse.de" <agraf@suse.de>, kvm <kvm@vger.kernel.org>,
	Paul Mackerras <paulus@samba.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Mel Gorman <mgorman@suse.de>, Andrew Hastings <abh@cray.com>,
	Adam Litke <agl@us.ibm.com>, Minchan Kim <minchan.kim@gmail.com>
Subject: Re: [Qemu-devel] Fix refcounting in hugetlbfs quota handling
Date: Tue, 16 Aug 2011 12:45:30 -0500	[thread overview]
Message-ID: <4E4AACBA.50408@cray.com> (raw)
In-Reply-To: <20110816034748.GG18203@yookeroo.fritz.box>

It is not a unique KVM problem. We saw the race while doing large async rDMA in
our network driver, but I can imagine it happening with a slow NFS server, or
other DMA that could complete after umount.

What I need, in order to push this upstream, is:
1. For you to light a fire under my feet to get this going. (done)
2. For you all to review the patch and let me know if I've missed anything.
Patch RFC coming ASAP.

-Andrew Barry


On 08/15/2011 10:47 PM, David Gibson wrote:
> On Mon, Aug 15, 2011 at 03:25:35PM -0500, Andrew Barry wrote:
>> I've been doing something similar to this last proposal. I put a
>> hugetlbfs_sb_info pointer into page_private, and dropped a reference counter and
>> an active/inactive bit into the hugetlbfs_sb_info struct. At Umount time, the
>> sbinfo is freed, only if the reference count is zero. Otherwise, the last
>> put_quota frees the sbinfo structure. This fixed the race we were seeing between
>> umount and a put_quota from an rdma transaction. I just gave it a cursory test
>> on a 3.0 kernel; it has seen quite a lot more testing on a 2.6.32-derived
>> kernel, with no more hits of the umount race.
>>
>> Does this address the problems you were thinking about?
> 
> Ah, this looks much better than my patch.  And the fact that you've
> seen your race demonstrates clearly that this isn't just a kvm
> problem.  I hope we can push this upstream very soon - what can I do
> to help?
> 
>> -Andrew Barry
>>

>>
>> On 08/15/2011 01:00 PM, Hugh Dickins wrote:
>>> On Sat, 13 Aug 2011, David Gibson wrote:
>>>> On Fri, Aug 12, 2011 at 12:15:21PM -0700, Hugh Dickins wrote:
>>>>>
>>>>> Setting that aside, I think this thing of grabbing a reference to inode
>>>>> for each page just does not work as you wish: when we unlink an inode,
>>>>> all its pages should be freed; but because they are themselves holding
>>>>> references to the inode, it and its pages stick around forever.
>>>>
>>>> Ugh, yes.  You're absolutely right.  That circular reference will mess
>>>> everything up.  Thinking it through and testing fail.
>>>>
>>>>> A quick experiment with your patch versus without confirmed that:
>>>>> meminfo HugePages_Free stayed down with your patch, but went back to
>>>>> HugePages_Total without it.  Please check, perhaps I'm just mistaken.
>>>>>
>>>>> Sorry, I've not looked into what a constructive alternative might be;
>>>>> and it's not the first time we've had this difficulty - it came up last
>>>>> year when the ->freepage function was added, that the inode may be gone
>>>>> by the time ->freepage(page) is called.
>>>>
>>>> Ok, so.  In fact the quota functions we call at free time only need
>>>> the super block, not the inode per se.  If we put a superblock pointer
>>>> instead of an inode pointer in page private, and refcounted that, I
>>>> think that should remove the circular ref.  The only reason I didn't
>>>> do it before is that the superblock refcounting functions didn't seem
>>>> to be globally visible in an obvious way.
>>>>
>>>> Does that sound like a reasonable approach?
>>>
>>> That does sound closer to a reaonable approach, but my guess is that it
>>> will suck you into a world of superblock mutexes and semaphores, which
>>> you cannot take at free_huge_page() time.
>>>
>>> It might be necessary to hang your own tiny structure off the superblock,
>>> with one refcount for the superblock, and one for each hugepage attached,
>>> you freeing that structure when the count goes down to zero from either
>>> direction.
>>>
>>> Whatever you do needs testing with lockdep and atomic sleep checks.
>>>
>>> I do dislike tying these separate levels together in such an unusual way,
>>> but it is a difficult problem and I don't know of an easy answer.  Maybe
>>> we'll need to find a better answer for other reasons, it does come up
>>> from time to time e.g. recent race between evicting inode and nrpages
>>> going down to 0.
>>>
>>> You might care to take a look at how tmpfs (mm/shmem.c) deals with
>>> the equivalent issue there (sbinfo->used_blocks).  But I expect you to
>>> conclude that hugetlbfs cannot afford the kind of approximations that
>>> tmpfs can afford.
>>>
>>> Although I think tmpfs is more correct, to be associating the count
>>> with pagecache (so the count goes down as soon as a page is truncated
>>> or evicted from pagecache), your fewer and huger pages, and reservation
>>> conventions, may well demand that the count stays up until the page is
>>> actually freed back to hugepool.  And let's not pretend that what tmpfs
>>> does is wonderful: the strange shmem_recalc_inode() tries its best to
>>> notice when memory pressure has freed clean pages, but it never looks
>>> beyond the inode being accessed at the times it's called.  Not at all
>>> satisfactory, but not actually an issue in practice, since we stopped
>>> allocating pages for simple reads from sparse file.  I did want to
>>> convert tmpfs to use ->freepage(), but couldn't manage it without
>>> stable mapping - same problem as you have.
>>>
>>> Hugh
>>
> 
> --
> David Gibson                    | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
>                                 | _way_ _around_!
> http://www.ozlabs.org/~dgibson

next prev parent reply	other threads:[~2011-08-16 17:46 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-08-11  6:40 [Qemu-devel] Fix refcounting in hugetlbfs quota handling David Gibson
2011-08-12  0:48 ` Linus Torvalds
2011-08-12  4:34   ` Minchan Kim
2011-08-12 19:15     ` Hugh Dickins
2011-08-13  1:08       ` David Gibson
2011-08-15 18:00         ` Hugh Dickins
2011-08-15 20:25           ` Andrew Barry
2011-08-16  3:47             ` David Gibson
2011-08-16 17:45               ` [Qemu-devel] [RFC PATCH] mm/hugepages: Fix race between hugetlbfs umount and quota update Andrew Barry
2011-08-16 17:45               ` Andrew Barry [this message]
2011-08-12 22:20 ` [Qemu-devel] Fix refcounting in hugetlbfs quota handling Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4E4AACBA.50408@cray.com \
    --to=abarry@cray.com \
    --cc=abh@cray.com \
    --cc=agl@us.ibm.com \
    --cc=agraf@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=avi@redhat.com \
    --cc=hughd@google.com \
    --cc=jan.kiszka@web.de \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=minchan.kim@gmail.com \
    --cc=mtosatti@redhat.com \
    --cc=paulus@samba.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).