From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Jeremy Fitzhardinge <jeremy@goop.org>,
"H. Peter Anvin" <hpa@zytor.com>,
the arch/x86 maintainers <x86@kernel.org>,
"Xen-devel@lists.xensource.com" <Xen-devel@lists.xensource.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Ian Campbell <Ian.Campbell@citrix.com>,
Jan Beulich <JBeulich@novell.com>,
Larry Woodman <lwoodman@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Andi Kleen <andi@firstfloor.org>,
Johannes Weiner <jweiner@redhat.com>,
Hugh Dickins <hughd@google.com>, Rik van Riel <riel@redhat.com>
Subject: Re: [PATCH] fix pgd_lock deadlock
Date: Wed, 16 Feb 2011 16:34:39 -0500 [thread overview]
Message-ID: <20110216213439.GA26109@dumpdata.com> (raw)
In-Reply-To: <20110216183304.GD5935@random.random>
On Wed, Feb 16, 2011 at 07:33:04PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 15, 2011 at 09:05:20PM +0100, Thomas Gleixner wrote:
> > Did you try with DEBUG_PAGEALLOC, which is calling into cpa quite a
> > lot?
>
> I tried DEBUG_PAGEALLOC and it seems to work fine (in addition to
> lockdep), it doesn't spwan any debug check.
>
> In addition to testing it (both prev patch and below one) I looked
I checked this under Xen running as PV and Dom0 and had no trouble.
You can stick:
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> into the code and the free_pages calling into
> pageattr->split_large_page apparently happens all at boot time.
>
> Now one doubt remains if we need change_page_attr to run from irqs
> (not from DEBUG_PAGEALLOC though). Now is change_page_attr really sane
> to run from irqs? I thought __flush_tlb_all was delivering IPI (in
> that case it also wouldn't have been safe in the first place to run
> with irq disabled) but of course the "__" version is local, so after
> all maybe it's safe to run with interrupts too (I'd be amazed if
> somebody is calling it from irq, if not even DEBUG_PAGEALLOC does) but
> with the below patch it will remain safe from irq as far as the
> pgd_lock is concerned.
>
> I think the previous patch was safe too though, avoiding VM
> manipulations from interrupts makes everything simpler. Normally only
> gart drivers should call it at init time to avoid prefetching of
> cachelines in the next 2m page with different (writeback) cache
> attributes of the pages physically aliased in the gart and mapped with
> different cache attribute, that init stuff happening from interrupt
> sounds weird. Anyway I post the below patch too as an alternative to
> still allow pageattr from irq.
>
> With both patches the big dependency remains on __mmdrop not to run
> from irq. The alternative approach is to remove the page_table_lock
> from vmalloc_sync_all (which is only needed by Xen paravirt guest
> AFIK) and solve that problem in a different way, but I don't even know
> why they need it exactly, I tried not to impact that.
>
> ===
> Subject: fix pgd_lock deadlock
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> It's forbidden to take the page_table_lock with the irq disabled or if there's
> contention the IPIs (for tlb flushes) sent with the page_table_lock held will
> never run leading to a deadlock.
>
> pageattr also seems to take advantage the pgd_lock to serialize
> split_large_page which isn't necessary so split it off to a separate spinlock
> (as a read_lock wouldn't enough to serialize split_large_page). Then we can use
> a read-write lock to allow pageattr to keep taking it from interrupts, but
> without having to always clear interrupts for readers. Writers are only safe
> from regular kernel context and they must clear irqs before taking it.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> arch/x86/include/asm/pgtable.h | 2 +-
> arch/x86/mm/fault.c | 23 ++++++++++++++++++-----
> arch/x86/mm/init_64.c | 6 +++---
> arch/x86/mm/pageattr.c | 17 ++++++++++-------
> arch/x86/mm/pgtable.c | 10 +++++-----
> arch/x86/xen/mmu.c | 10 ++++------
> 6 files changed, 41 insertions(+), 27 deletions(-)
>
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -179,7 +179,21 @@ force_sig_info_fault(int si_signo, int s
> force_sig_info(si_signo, &info, tsk);
> }
>
> -DEFINE_SPINLOCK(pgd_lock);
> +/*
> + * The pgd_lock only protects the pgd_list. It can be taken from
> + * interrupt context but only in read mode (there's no need to clear
> + * interrupts before taking it in read mode). Write mode can only be
> + * taken from regular kernel context (not from irqs) and it must
> + * disable irqs before taking it. This way it can still be taken in
> + * read mode from interrupt context (in case
> + * pageattr->split_large_page is ever called from interrupt context),
> + * but without forcing the pgd_list readers to clear interrupts before
> + * taking it (like vmalloc_sync_all() that then wants to take the
> + * page_table_lock while holding the pgd_lock, which can only be taken
> + * while irqs are left enabled to avoid deadlocking against the IPI
> + * delivery of an SMP TLB flush run with the page_table_lock hold).
> + */
> +DEFINE_RWLOCK(pgd_lock);
> LIST_HEAD(pgd_list);
>
> #ifdef CONFIG_X86_32
> @@ -229,15 +243,14 @@ void vmalloc_sync_all(void)
> for (address = VMALLOC_START & PMD_MASK;
> address >= TASK_SIZE && address < FIXADDR_TOP;
> address += PMD_SIZE) {
> -
> - unsigned long flags;
> struct page *page;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + read_lock(&pgd_lock);
> list_for_each_entry(page, &pgd_list, lru) {
> spinlock_t *pgt_lock;
> pmd_t *ret;
>
> + /* the pgt_lock only for Xen */
> pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
>
> spin_lock(pgt_lock);
> @@ -247,7 +260,7 @@ void vmalloc_sync_all(void)
> if (!ret)
> break;
> }
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + read_unlock(&pgd_lock);
> }
> }
>
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -105,18 +105,18 @@ void sync_global_pgds(unsigned long star
>
> for (address = start; address <= end; address += PGDIR_SIZE) {
> const pgd_t *pgd_ref = pgd_offset_k(address);
> - unsigned long flags;
> struct page *page;
>
> if (pgd_none(*pgd_ref))
> continue;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + read_lock(&pgd_lock);
> list_for_each_entry(page, &pgd_list, lru) {
> pgd_t *pgd;
> spinlock_t *pgt_lock;
>
> pgd = (pgd_t *)page_address(page) + pgd_index(address);
> + /* the pgt_lock only for Xen */
> pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
> spin_lock(pgt_lock);
>
> @@ -128,7 +128,7 @@ void sync_global_pgds(unsigned long star
>
> spin_unlock(pgt_lock);
> }
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + read_unlock(&pgd_lock);
> }
> }
>
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -47,6 +47,7 @@ struct cpa_data {
> * splitting a large page entry along with changing the attribute.
> */
> static DEFINE_SPINLOCK(cpa_lock);
> +static DEFINE_SPINLOCK(pgattr_lock);
>
> #define CPA_FLUSHTLB 1
> #define CPA_ARRAY 2
> @@ -60,9 +61,9 @@ void update_page_count(int level, unsign
> unsigned long flags;
>
> /* Protect against CPA */
> - spin_lock_irqsave(&pgd_lock, flags);
> + spin_lock_irqsave(&pgattr_lock, flags);
> direct_pages_count[level] += pages;
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + spin_unlock_irqrestore(&pgattr_lock, flags);
> }
>
> static void split_page_count(int level)
> @@ -376,6 +377,7 @@ static void __set_pmd_pte(pte_t *kpte, u
> if (!SHARED_KERNEL_PMD) {
> struct page *page;
>
> + read_lock(&pgd_lock);
> list_for_each_entry(page, &pgd_list, lru) {
> pgd_t *pgd;
> pud_t *pud;
> @@ -386,6 +388,7 @@ static void __set_pmd_pte(pte_t *kpte, u
> pmd = pmd_offset(pud, address);
> set_pte_atomic((pte_t *)pmd, pte);
> }
> + read_unlock(&pgd_lock);
> }
> #endif
> }
> @@ -403,7 +406,7 @@ try_preserve_large_page(pte_t *kpte, uns
> if (cpa->force_split)
> return 1;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + spin_lock_irqsave(&pgattr_lock, flags);
> /*
> * Check for races, another CPU might have split this page
> * up already:
> @@ -498,7 +501,7 @@ try_preserve_large_page(pte_t *kpte, uns
> }
>
> out_unlock:
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + spin_unlock_irqrestore(&pgattr_lock, flags);
>
> return do_split;
> }
> @@ -519,7 +522,7 @@ static int split_large_page(pte_t *kpte,
> if (!base)
> return -ENOMEM;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + spin_lock_irqsave(&pgattr_lock, flags);
> /*
> * Check for races, another CPU might have split this page
> * up for us already:
> @@ -587,11 +590,11 @@ static int split_large_page(pte_t *kpte,
> out_unlock:
> /*
> * If we dropped out via the lookup_address check under
> - * pgd_lock then stick the page back into the pool:
> + * pgattr_lock then stick the page back into the pool:
> */
> if (base)
> __free_page(base);
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + spin_unlock_irqrestore(&pgattr_lock, flags);
>
> return 0;
> }
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -121,14 +121,14 @@ static void pgd_ctor(struct mm_struct *m
>
> static void pgd_dtor(pgd_t *pgd)
> {
> - unsigned long flags; /* can be called from interrupt context */
> + unsigned long flags;
>
> if (SHARED_KERNEL_PMD)
> return;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + write_lock_irqsave(&pgd_lock, flags);
> pgd_list_del(pgd);
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + write_unlock_irqrestore(&pgd_lock, flags);
> }
>
> /*
> @@ -280,12 +280,12 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> * respect to anything walking the pgd_list, so that they
> * never see a partially populated pgd.
> */
> - spin_lock_irqsave(&pgd_lock, flags);
> + write_lock_irqsave(&pgd_lock, flags);
>
> pgd_ctor(mm, pgd);
> pgd_prepopulate_pmd(mm, pgd, pmds);
>
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + write_unlock_irqrestore(&pgd_lock, flags);
>
> return pgd;
>
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -986,10 +986,9 @@ static void xen_pgd_pin(struct mm_struct
> */
> void xen_mm_pin_all(void)
> {
> - unsigned long flags;
> struct page *page;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + read_lock(&pgd_lock);
>
> list_for_each_entry(page, &pgd_list, lru) {
> if (!PagePinned(page)) {
> @@ -998,7 +997,7 @@ void xen_mm_pin_all(void)
> }
> }
>
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + read_unlock(&pgd_lock);
> }
>
> /*
> @@ -1099,10 +1098,9 @@ static void xen_pgd_unpin(struct mm_stru
> */
> void xen_mm_unpin_all(void)
> {
> - unsigned long flags;
> struct page *page;
>
> - spin_lock_irqsave(&pgd_lock, flags);
> + read_lock(&pgd_lock);
>
> list_for_each_entry(page, &pgd_list, lru) {
> if (PageSavePinned(page)) {
> @@ -1112,7 +1110,7 @@ void xen_mm_unpin_all(void)
> }
> }
>
> - spin_unlock_irqrestore(&pgd_lock, flags);
> + read_unlock(&pgd_lock);
> }
>
> void xen_activate_mm(struct mm_struct *prev, struct mm_struct *next)
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -25,7 +25,7 @@
> extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
> #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
>
> -extern spinlock_t pgd_lock;
> +extern rwlock_t pgd_lock;
> extern struct list_head pgd_list;
>
> extern struct mm_struct *pgd_page_get_mm(struct page *page);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
next prev parent reply other threads:[~2011-02-16 21:36 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-10-14 20:56 [PATCH] x86: hold mm->page_table_lock while doing vmalloc_sync Jeremy Fitzhardinge
2010-10-15 17:07 ` [Xen-devel] " Jeremy Fitzhardinge
2010-10-19 22:17 ` [tip:x86/mm] x86, mm: Hold " tip-bot for Jeremy Fitzhardinge
2010-10-20 10:36 ` Borislav Petkov
2010-10-20 19:31 ` [tip:x86/mm] x86, mm: Fix incorrect data type in vmalloc_sync_all() tip-bot for tip-bot for Jeremy Fitzhardinge
2010-10-20 19:50 ` Borislav Petkov
2010-10-20 19:53 ` H. Peter Anvin
2010-10-20 20:10 ` Borislav Petkov
2010-10-20 20:13 ` H. Peter Anvin
2010-10-20 22:11 ` Borislav Petkov
2010-10-20 21:26 ` Ben Pfaff
2010-10-20 19:58 ` tip-bot for Borislav Petkov
2010-10-21 21:06 ` [PATCH] x86: hold mm->page_table_lock while doing vmalloc_sync Jeremy Fitzhardinge
2010-10-21 21:26 ` H. Peter Anvin
2010-10-21 21:34 ` Jeremy Fitzhardinge
2011-02-03 2:48 ` Andrea Arcangeli
2011-02-03 20:44 ` Jeremy Fitzhardinge
2011-02-04 1:21 ` Andrea Arcangeli
2011-02-04 21:27 ` Jeremy Fitzhardinge
2011-02-07 23:20 ` Andrea Arcangeli
2011-02-15 19:07 ` [PATCH] fix pgd_lock deadlock Andrea Arcangeli
2011-02-15 19:26 ` Thomas Gleixner
2011-02-15 19:54 ` Andrea Arcangeli
2011-02-15 20:05 ` Thomas Gleixner
2011-02-15 20:26 ` Thomas Gleixner
2011-02-15 22:52 ` Andrea Arcangeli
2011-02-15 23:03 ` Thomas Gleixner
2011-02-15 23:17 ` Andrea Arcangeli
2011-02-16 9:58 ` Peter Zijlstra
2011-02-16 10:15 ` Andrea Arcangeli
2011-02-16 10:28 ` Ingo Molnar
2011-02-16 14:49 ` Andrea Arcangeli
2011-02-16 16:26 ` Rik van Riel
2011-02-16 20:15 ` Ingo Molnar
2012-04-23 9:07 ` [2.6.32.y][PATCH] " Philipp Hahn
2012-04-23 19:09 ` Willy Tarreau
2011-02-16 18:33 ` [PATCH] " Andrea Arcangeli
2011-02-16 21:34 ` Konrad Rzeszutek Wilk [this message]
2011-02-17 10:19 ` Johannes Weiner
2011-02-21 14:30 ` Andrea Arcangeli
2011-02-21 14:53 ` Johannes Weiner
2011-02-22 7:48 ` Jan Beulich
2011-02-22 13:49 ` Andrea Arcangeli
2011-02-22 14:22 ` Jan Beulich
2011-02-22 14:34 ` Andrea Arcangeli
2011-02-22 17:08 ` Jeremy Fitzhardinge
2011-02-22 17:13 ` Andrea Arcangeli
2011-02-24 4:22 ` Andrea Arcangeli
2011-02-24 8:23 ` Jan Beulich
2011-02-24 14:11 ` Andrea Arcangeli
2011-02-21 17:40 ` Jeremy Fitzhardinge
2011-02-03 20:59 ` [PATCH] x86: hold mm->page_table_lock while doing vmalloc_sync Larry Woodman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110216213439.GA26109@dumpdata.com \
--to=konrad.wilk@oracle.com \
--cc=Ian.Campbell@citrix.com \
--cc=JBeulich@novell.com \
--cc=Xen-devel@lists.xensource.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=andi@firstfloor.org \
--cc=hpa@zytor.com \
--cc=hughd@google.com \
--cc=jeremy@goop.org \
--cc=jweiner@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=lwoodman@redhat.com \
--cc=riel@redhat.com \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.