I put this expanation in the redhat BZ, says it all:On Tue, 15 Feb 2011, Andrea Arcangeli wrote:Hello, Without this patch we can deadlock in the page_table_lock with NR_CPUS < 4 or THP on, with this patch we hopefully won't deadlock in the pgd_lock (if taken from irq). I can't see anything taking it from irq (maybe aio? to check I also tried the libaio testuite with no apparent VM_BUG_ON triggering), so unless somebody sees it, I think we should apply it. I've been running for a while with this patch applied without apparent problems. Other archs may follow suit if it's proven that there's nothing taking the pgd_lock from irq. === Subject: fix pgd_lock deadlock From: Andrea Arcangeli <aarcange@redhat.com> It's forbidden to take the page_table_lock with the irq disabled or if there's contention the IPIs (for tlb flushes) sent with the page_table_lock held will never run leading to a deadlock.I really read this thing 5 times and still cannot make any sense of it. You talk about page_table_lock and then fiddle with pgd_lock. -ENOSENSE tglx
The problem is with THP. The page reclaim code calls page_referenced_one()
which takes the mm->page_table_lock on one CPU before sending an IPI to other
CPU(s):
On CPU1 we take the mm->page_table_lock, send IPIs and wait for a response:
page_referenced_one(...)
if (unlikely(PageTransHuge(page))) {
pmd_t *pmd;
spin_lock(&mm->page_table_lock);
pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_FLAG);
if (pmd && !pmd_trans_splitting(*pmd) &&
pmdp_clear_flush_young_notify(vma, address, pmd))
referenced++;
spin_unlock(&mm->page_table_lock);
} else {
CPU2 can race in vmalloc_sync_all() because it disables interrupt(preventing a
response to the IPI from CPU1) and takes the pgd_lock then spins in the
mm->page_table_lock which is already held on CPU1.
spin_lock_irqsave(&pgd_lock, flags);
list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
spinlock_t *pgt_lock;
pgd = (pgd_t *)page_address(page) + pgd_index(address);
pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
spin_lock(pgt_lock);
At this point the system is deadlocked. The pmdp_clear_flush_young_notify
needs to do its PDG business with the page_table_lock held then release that
lock before sending the IPIs to the other CPUs.
Larry