From mboxrd@z Thu Jan  1 00:00:00 1970
From: Larry Woodman <lwoodman@redhat.com>
Subject: Re: [PATCH] fix pgd_lock deadlock
Date: Tue, 15 Feb 2011 14:31:30 -0500
Message-ID: <4D5AD492.5030500@redhat.com>
References: <4CB76E8B.2090309@goop.org> <4CC0AB73.8060609@goop.org>
	<20110203024838.GI5843@random.random> <4D4B1392.5090603@goop.org>
	<20110204012109.GP5843@random.random> <4D4C6F45.6010204@goop.org>
	<20110207232045.GJ3347@random.random>
	<20110215190710.GL5935@random.random>
	<alpine.LFD.2.00.1102152020590.26192@localhost6.localdomain6>
Reply-To: lwoodman@redhat.com
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============0741702087=="
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <alpine.LFD.2.00.1102152020590.26192@localhost6.localdomain6>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>, Jeremy Fitzhardinge <jeremy@goop.org>, "Xen-devel@lists.xensource.com" <Xen-devel@lists.xensource.com>, Ian Campbell <Ian.Campbell@citrix.com>, the arch/x86 maintainers <x86@kernel.org>, Hugh Dickins <hughd@google.com>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Jan Beulich <JBeulich@novell.com>, Andi Kleen <ak@suse.de>, "H. Peter Anvin" <hpa@zytor.com>, Andrew Morton <akpm@linux-foundation.org>, Johannes Weiner <jweiner@redhat.com>
List-Id: xen-devel@lists.xenproject.org

This is a multi-part message in MIME format.
--===============0741702087==
Content-Type: multipart/alternative;
	boundary="------------040305040806010203010400"

This is a multi-part message in MIME format.
--------------040305040806010203010400
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

On 02/15/2011 02:26 PM, Thomas Gleixner wrote:
> On Tue, 15 Feb 2011, Andrea Arcangeli wrote:
>
>> Hello,
>>
>> Without this patch we can deadlock in the page_table_lock with NR_CPUS
>> <  4 or THP on, with this patch we hopefully won't deadlock in the
>> pgd_lock (if taken from irq). I can't see anything taking it from irq
>> (maybe aio? to check I also tried the libaio testuite with no apparent
>> VM_BUG_ON triggering), so unless somebody sees it, I think we should
>> apply it. I've been running for a while with this patch applied
>> without apparent problems. Other archs may follow suit if it's proven
>> that there's nothing taking the pgd_lock from irq.
>>
>> ===
>> Subject: fix pgd_lock deadlock
>>
>> From: Andrea Arcangeli<aarcange@redhat.com>
>>
>> It's forbidden to take the page_table_lock with the irq disabled or if there's
>> contention the IPIs (for tlb flushes) sent with the page_table_lock held will
>> never run leading to a deadlock.
> I really read this thing 5 times and still cannot make any sense of it.
>
> You talk about page_table_lock and then fiddle with pgd_lock.
>
> -ENOSENSE
>
> 	tglx
>
>
I put this explanation in the Red Hat BZ; it says it all:


Larry Woodman <mailto:lwoodman@redhat.com> 2011-01-21 15:54:58 EST

The problem is with THP.  The page reclaim code calls page_referenced_one()
which takes the mm->page_table_lock on one CPU before sending an IPI to other
CPU(s):

On CPU1 we take the mm->page_table_lock, send IPIs and wait for a response:
page_referenced_one(...)
         if (unlikely(PageTransHuge(page))) {
                 pmd_t *pmd;

                 spin_lock(&mm->page_table_lock);
                 pmd = page_check_address_pmd(page, mm, address,
                                              PAGE_CHECK_ADDRESS_PMD_FLAG);
                 if (pmd && !pmd_trans_splitting(*pmd) &&
                     pmdp_clear_flush_young_notify(vma, address, pmd))
                         referenced++;
                 spin_unlock(&mm->page_table_lock);
         } else {


CPU2 can race in vmalloc_sync_all() because it disables interrupts (preventing
a response to the IPI from CPU1), takes the pgd_lock, and then spins on the
mm->page_table_lock, which is already held on CPU1.

                 spin_lock_irqsave(&pgd_lock, flags);
                 list_for_each_entry(page, &pgd_list, lru) {
                         pgd_t *pgd;
                         spinlock_t *pgt_lock;

                         pgd = (pgd_t *)page_address(page) + pgd_index(address);

                         pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
                         spin_lock(pgt_lock);


At this point the system is deadlocked.  pmdp_clear_flush_young_notify()
needs to do its PMD business with the page_table_lock held, then release that
lock before sending the IPIs to the other CPUs.
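
To make the cycle explicit, here is a rough userspace model of the two CPUs.
It is a sketch only, not kernel code: struct cpu_state, blocked_on(),
deadlocked() and the scenario helpers are names made up for illustration.
Each CPU is either free to run or blocked on the other one, and a deadlock
is a cycle in that wait-for relation:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only: none of these names exist in the kernel. */
enum { CPU1 = 0, CPU2 = 1, NCPU = 2, NOBODY = -1 };

struct cpu_state {
	bool irqs_disabled;     /* cannot service an incoming IPI */
	bool holds_ptl;         /* holds mm->page_table_lock */
	bool wants_ptl;         /* spinning on mm->page_table_lock */
	bool waits_for_ipi_ack; /* sent a TLB-flush IPI, waiting for the ack */
};

/* Return the CPU that `self` is blocked on, or NOBODY if it can progress. */
static int blocked_on(const struct cpu_state *cpus, int self)
{
	assert(self >= 0 && self < NCPU);
	for (int other = 0; other < NCPU; other++) {
		if (other == self)
			continue;
		/* Spinning on a lock that the other CPU holds. */
		if (cpus[self].wants_ptl && cpus[other].holds_ptl)
			return other;
		/* Waiting for an ack from a CPU that cannot take the IPI. */
		if (cpus[self].waits_for_ipi_ack && cpus[other].irqs_disabled)
			return other;
	}
	return NOBODY;
}

/* Deadlock: following "blocked on" edges from some CPU leads back to it. */
static bool deadlocked(const struct cpu_state *cpus)
{
	for (int start = 0; start < NCPU; start++) {
		int cur = start;
		for (int steps = 0; steps < NCPU; steps++) {
			cur = blocked_on(cpus, cur);
			if (cur == NOBODY)
				break;
			if (cur == start)
				return true;
		}
	}
	return false;
}

/* The situation described above: CPU1 flushes with the lock held. */
static bool original_deadlocks(void)
{
	struct cpu_state cpus[NCPU] = {
		[CPU1] = { .holds_ptl = true, .waits_for_ipi_ack = true },
		[CPU2] = { .irqs_disabled = true, .wants_ptl = true },
	};
	return deadlocked(cpus);
}

/* CPU1 drops the page_table_lock before sending the IPIs. */
static bool unlock_before_ipi_deadlocks(void)
{
	struct cpu_state cpus[NCPU] = {
		[CPU1] = { .holds_ptl = false, .waits_for_ipi_ack = true },
		[CPU2] = { .irqs_disabled = true, .wants_ptl = true },
	};
	return deadlocked(cpus);
}

/* Hypothetical variant: CPU2 spins on the lock with interrupts enabled. */
static bool irqs_enabled_deadlocks(void)
{
	struct cpu_state cpus[NCPU] = {
		[CPU1] = { .holds_ptl = true, .waits_for_ipi_ack = true },
		[CPU2] = { .irqs_disabled = false, .wants_ptl = true },
	};
	return deadlocked(cpus);
}
```

In this model only the first scenario has a cycle.  In the second, CPU1 is
still waiting for the ack, but CPU2 can take the page_table_lock and make
progress; in the third, CPU2 can service the IPI while it spins.  The model
ignores the pgd_lock itself, since nothing on CPU1 is waiting for it here.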


Larry


--------------040305040806010203010400--


--===============0741702087==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

--===============0741702087==--