public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Jin Dongming <jin.dongming@np.css.fujitsu.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Andi Kleen" <andi@firstfloor.org>,
	AKPM  <akpm@linux-foundation.org>,
	"Hidetoshi Seto" <seto.hidetoshi@jp.fujitsu.com>,
	"Huang Ying" <ying.huang@intel.com>,
	LKLM <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/3] Isolate only one page of anonymous THP
Date: Thu, 27 Jan 2011 09:12:36 +0900	[thread overview]
Message-ID: <4D40B874.6000203@np.css.fujitsu.com> (raw)
In-Reply-To: <20110125224244.GK926@random.random>

Hi, Andrea
(2011/01/26 7:42), Andrea Arcangeli wrote:
> Hello Jin,
> 
> On Tue, Jan 25, 2011 at 02:46:08PM +0900, Jin Dongming wrote:
>> When the tail page of THP is poisoned, the head page will be
>> poisoned too.
>>
>> Lets make an assumption like following:
>>   1. Guest OS is running on KVM.
>>   2. Two processes(A and B) on Guest OS use pages in the same THP
>>      of Host.
>>      - process A is using the head page.
>>      - process B is using the tail pages.
>>
>>   So when the tail page is poisoned, process B should be killed.
>>   But process A is killed and process B is still alive in fact.
> 
> Do we have paravirt memory-failure support to actually kill single
> processes in guest depending on which page they run on instead of the
> whole guest?
Yes.
When the guest is running on KVM, such functionality is supported.

> 
>>   The reason for process A killed is that the head page is poisoned
>>   when the tail page is poisoned and the address reported
>>   with sigbus is the address of head page not the poisoned tail page.
>>
>>   The reason for process B alive is that PG_hwpoisoned of the poisoned
>>   tail page is cleared after the poisoned THP is split and the address
>>   reported with sigbus is the address of head page.
> 
> Agree about the fact we mark the head page poisoned instead of the
> tail page and it's not accurate as it should.
> 
>> It is expected that the process using the poisoned tail page is killed,
>> but not that the process using the healthy head page is killed.
>>
>> So it is better to avoid poisoning other than the page which is really
>> poisoned.
>>   (While we poison all pages in a huge page in case of hugetlb,
>>    we can do this for THP thanks to split_huge_page().)
> 
> Yes we can be more accurate with THP thanks to your change to
> __split_huge_page_refcount, full agreement with your huge_memory.c change.
> 
>>
>> Here we fix two parts:
>>   1. poison the real poisoned page only.
> 
> Agreed.
> 
>>   2. make the poisoned page work as the poisoned regular
>>      page(4k page).
> 
> You're adding one more unsafe compound_head() usage.
> 
> This is not a remark not specific to this patch, and I pointed this
> out already in some header commit of THP, but it likely got lost in
> the noise so I remind it here (and no, I don't expect everyone to read
> the headers, you're already doing great job at reading the code and
> the details of split_huge_page internals to fix these bits).
> 
> So in short my remark is that most compound_head() are broken for
> hugetlbfs too in memory-failure.c.
> 
> I introduced a compound_trans_head that can be used safely like
> memory-failure does in most places:
> 
> get_page_unless_zero(compound_head(pfn_to_page(pfn))) -> unsafe for THP and hugetlbfs (current status)
> get_page_unless_zero(compound_trans_head(pfn_to_page(pfn))) -> safe only for THP
> 
> I didn't go search and replace compound_trans_head in memory-failure
> because it would remain broken for hugetlbfs... so it wasn't
> definitive solution.
> 
> I however made a compound_trans_order that is safe for both (hugetlbfs
> I doubt frees the page from under you like split_huge_page could do,
> so after get_page_unless_zero succeeds on the head, hugetlbfs should
> be safe calling both compound_order or compound_trans_order, THP
> instead needs compound_trans_order). I'm afraid however that in some
> place compound_trans_order may be still unsafe for hugetlbfs if
> compound_trans_order is called on a hugetlbfs page before getting its
> refcount (it's definitely safe for THP instead).
> 
> I'm uncertain if it's worth to fix these races considering it's
> hardware failure we're talking about, so maybe math guarantee isn't
> needed for something that deals with an imperfect world that can crash
> anyway if the memory failure happens in a kernel .text. hugetlbfs
> allocations usually are static and practically only used by commercial
> DB, so it's almost impossible to hit the race in any practical case
> (unless you exercise it while the db shutsdown). It's your call...
> 
> If we don't fix the race for hugetlbfs, then you can go ahead and
> replace compound_head with compound_trans_head for every place where
> __split_huge_page_refcount can run from under you, and then THP will
> be math-safe. Very easy fix.
> 
> NOTE: when memcg calls compound_order it's safe because it's calling
> it always on the head page. Also if you call compound_order inside a
> page_table_lock when the hugepmd page isn't set to splitting yet, it's
> safe. The unsafe is taking an arbitrary pfn, convert to page struct,
> and run compound_order or compound_head on it, and it's equally unsafe
> for hugetlbfs and THP. The difference is, after get_page_unless_zero
> succeeds on hugetlbfs then you're safe calling compound_order, while
> for THP split_huge_page can still run and so compound_trans_order is
> needed.
> 
>> @@ -906,6 +907,14 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
>>  	}
>>  
>>  	/*
>> +	 * ppage: poisoned page
>> +	 *   if p is regular page(4k page) or THP(anonymous),
>> +	 *        ppage == real poisoned page;
>> +	 *   else p is hugetlb or others, ppage == head page.
>> +	 */
>> +	ppage = compound_head(p);
>> +
>> +	/*
>>  	 * First collect all the processes that have the page
>>  	 * mapped in dirty form.  This has to be done before try_to_unmap,
>>  	 * because ttu takes the rmap data structures down.
> 
> I would suggest to change it like this pseudocode:
> 
>  if (PageTransHuge(hpage)) {
>    /* verify that this isn't a hugetlbfs head page, the check for
>    PageAnon is just for avoid tripping a split_huge_page internal
>    debug check, as split_huge_page refuses to deal with anything that
>    isn't an anon page. PageAnon can't go away from under us because we
>    hold a refcount on the hpage, without a refcount on the hpage
>    split_huge_page can't be safely called in the first place, having a
>    refcount on the tail isn't enough to be safe. */
>    if (!PageHuge(hpage) && PageAnon(hpage)) {
>      if (split_huge_page(hpage)) {
>       /*
>        * The vma/anon_vma was freed from under us, the page should be
>        * unused, let it be freed by releasing it.
>        */
>        /* comment for you: we must make sure to mark the already unused pages poisoned */
>        return SWAP_FAIL;
>      }
>      hpage = p;
>    }
>  }
> 
OK. I will modify this part.

Thanks.

Best Regards,
Jin Dongming
> 
> Something like that, which also avoids to call a second compound_head
> at all for CONFIG_TRANSPARENT_HUGEPAGE=n as the whole block goes away.
> 
> Thanks a lot for looking into the THP memory-failure support, I
> appreciate!
> 
> Andrea
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> 



      reply	other threads:[~2011-01-27  0:09 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-01-25  5:46 [PATCH 2/3] Isolate only one page of anonymous THP Jin Dongming
2011-01-25 22:42 ` Andrea Arcangeli
2011-01-27  0:12   ` Jin Dongming [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4D40B874.6000203@np.css.fujitsu.com \
    --to=jin.dongming@np.css.fujitsu.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=seto.hidetoshi@jp.fujitsu.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox