From: Nick Piggin <npiggin@suse.de>
To: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrea Arcangeli <andrea@qumranet.com>,
Andrew Morton <akpm@linux-foundation.org>,
Christoph Lameter <clameter@sgi.com>,
Jack Steiner <steiner@sgi.com>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
kvm-devel@lists.sourceforge.net,
Kanoj Sarcar <kanojsarcar@yahoo.com>,
Roland Dreier <rdreier@cisco.com>,
Steve Wise <swise@opengridcomputing.com>,
linux-kernel@vger.kernel.org, Avi Kivity <avi@qumranet.com>,
linux-mm@kvack.org, general@lists.openfabrics.org,
Hugh Dickins <hugh@veritas.com>,
Rusty Russell <rusty@rustcorp.com.au>,
Anthony Liguori <aliguori@us.ibm.com>,
Chris Wright <chrisw@redhat.com>,
Marcelo Tosatti <marcelo@kvack.org>,
Eric Dumazet <dada1@cosmosbay.com>,
"Paul E. McKenney" <paulmck@us.ibm.com>
Subject: Re: [PATCH 08 of 11] anon-vma-rwsem
Date: Thu, 15 May 2008 09:57:47 +0200
Message-ID: <20080515075747.GA7177@wotan.suse.de>
In-Reply-To: <20080514112625.GY9878@sgi.com>
On Wed, May 14, 2008 at 06:26:25AM -0500, Robin Holt wrote:
> On Wed, May 14, 2008 at 06:11:22AM +0200, Nick Piggin wrote:
> >
> > I guess that you have found a way to perform TLB flushing within coherent
> > domains over the numalink interconnect without sleeping. I'm sure it would
> > be possible to send similar messages between non coherent domains.
>
> I assume by coherent domains, you are actually talking about system
> images.

Yes.

> Our memory coherence domain on the 3700 family is 512 processors
> on 128 nodes. On the 4700 family, it is 16,384 processors on 4096 nodes.
> We extend a "Read-Exclusive" mode beyond the coherence domain so any
> processor is able to read any cacheline on the system. We also provide
> uncached access for certain types of memory beyond the coherence domain.

Yes, I understand the basics.

> For the other partitions, the exporting partition does not know what
> virtual address the imported pages are mapped at. The pages are frequently
> mapped in a different order by the MPI library to help with MPI collective
> operations.
>
> For the exporting side to do those TLB flushes, we would need to replicate
> all that importing information back to the exporting side.

Right. Or the exporting side could be passed tokens that it tracks itself,
rather than virtual addresses.

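To make the token idea concrete, here is a minimal user-space sketch of what
the exporting side might track. Everything in it is an assumption made for
illustration (the names export_table, export_range and tokens_to_invalidate,
the fixed-size table, the use of page frame numbers); it is not XPMEM's
actual data structure.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_EXPORTS 64

/* Hypothetical exporter-side record: one per exported page range. */
struct export_entry {
	uint64_t token;     /* opaque handle given to importers; 0 = free slot */
	uint64_t first_pfn; /* first exported page frame number */
	uint64_t nr_pages;  /* length of the range in pages */
};

struct export_table {
	struct export_entry slot[MAX_EXPORTS];
	uint64_t next_token; /* monotonically increasing, never 0 */
};

static void export_table_init(struct export_table *t)
{
	memset(t, 0, sizeof(*t));
	t->next_token = 1;
}

/* Record an exported range; returns its token, or 0 if the table is full. */
static uint64_t export_range(struct export_table *t, uint64_t first_pfn,
			     uint64_t nr_pages)
{
	for (size_t i = 0; i < MAX_EXPORTS; i++) {
		if (t->slot[i].token == 0) {
			t->slot[i].token = t->next_token++;
			t->slot[i].first_pfn = first_pfn;
			t->slot[i].nr_pages = nr_pages;
			return t->slot[i].token;
		}
	}
	return 0;
}

/*
 * Collect the tokens of all exports overlapping [pfn, pfn + nr_pages).
 * Writes at most max tokens to out and returns how many were written.
 */
static size_t tokens_to_invalidate(const struct export_table *t, uint64_t pfn,
				   uint64_t nr_pages, uint64_t *out, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < MAX_EXPORTS && n < max; i++) {
		const struct export_entry *e = &t->slot[i];

		if (e->token == 0)
			continue;
		/* Half-open interval overlap test. */
		if (e->first_pfn < pfn + nr_pages &&
		    pfn < e->first_pfn + e->nr_pages)
			out[n++] = e->token;
	}
	return n;
}
```

The point of the sketch is that the exporter only ever names ranges by token,
so it never needs to learn where, or in what order, an importer mapped them.
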
> Additionally, the hardware that does the TLB flushing is protected
> by a spinlock on each system image. We would need to change that
> simple spinlock into a type of hardware lock that would work (on 3700)
> outside the processor's coherence domain. The only way to do that is to
> use uncached addresses with our Atomic Memory Operations which do the
> cmpxchg at the memory controller. The uncached accesses are an order
> of magnitude or more slower.

I'm not sure we're thinking of the same thing. With the scheme I'm
imagining, all you will need is some way to raise an IPI-like interrupt
on the target domain. The IPI target will have a driver to handle the
interrupt, which will determine the mm and virtual addresses which are
to be invalidated, and will then tear down those page tables and issue
hardware TLB flushes within its domain. On the Linux side, I don't see
why this can't be done.

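Roughly, the receiving side of that scheme could look like the user-space
model below. It is only a sketch under assumptions: import_table,
handle_invalidate and zap_and_flush are made-up names, mm_id stands in for a
real mm, and the "flush" is just a counter in place of page table teardown
and a hardware TLB flush.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_IMPORTS 64

/* Hypothetical importer-side record mapping a token to a local mapping. */
struct import_entry {
	uint64_t token; /* token received from the exporter; 0 = free slot */
	int mm_id;      /* stand-in for the importing process's mm */
	uint64_t vaddr; /* where this partition chose to map the pages */
	uint64_t len;   /* length of the mapping in bytes */
	int mapped;     /* nonzero while page tables still reference the pages */
};

struct import_table {
	struct import_entry slot[MAX_IMPORTS];
	unsigned long tlb_flushes; /* counts local flushes, for illustration */
};

/* Stand-in for tearing down page tables and flushing the local TLBs. */
static void zap_and_flush(struct import_table *t, struct import_entry *e)
{
	e->mapped = 0;
	t->tlb_flushes++;
}

/*
 * Body of the hypothetical interrupt handler: invalidate every local
 * mapping of the given token without sleeping. Returns the number of
 * mappings that were torn down.
 */
static size_t handle_invalidate(struct import_table *t, uint64_t token)
{
	size_t zapped = 0;

	if (token == 0)
		return 0;

	for (size_t i = 0; i < MAX_IMPORTS; i++) {
		struct import_entry *e = &t->slot[i];

		if (e->token == token && e->mapped) {
			zap_and_flush(t, e);
			zapped++;
		}
	}
	return zapped;
}
```

All knowledge of mm and virtual address stays on the importing partition; the
interrupt only has to carry enough to identify which export is going away.
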
> > So yes, I'd much rather rework such highly specialized system to fit in
> > closer with Linux than rework Linux to fit with these machines (and
> > apparently slow everyone else down).
>
> But it isn't that we are having a problem adapting to just the hardware.
> One of the limiting factors is Linux on the other partition.

In what way is Linux the limiting factor?

> > > Additionally, the call to zap_page_range expects to have the mmap_sem
> > > held. I suppose we could use something other than zap_page_range and
> > > atomically clear the process page tables.
> >
> > zap_page_range does not expect to have mmap_sem held. I think for anon
> > pages it is always called with mmap_sem, however try_to_unmap_anon is
> > not (although it expects page lock to be held, I think we should be able
> > to avoid that).
>
> zap_page_range calls unmap_vmas which walks to vma->next. Are you saying
> that can be walked without grabbing the mmap_sem at least readably?

Oh, I get that confused because of the mixed-up naming conventions
there: unmap_page_range should actually be called zap_page_range. But
at any rate, yes, we can easily zap pagetables without holding mmap_sem.

> I feel my understanding of list management and locking completely
> shifting.

FWIW, mmap_sem isn't held to protect vma->next there anyway, because at
that point the vmas are detached from the mm's rbtree and linked list.
But sure, in that particular path it is held for other reasons.

> > > Doing that will not alleviate
> > > the need to sleep for the messaging to the other partitions.
> >
> > No, but I'd venture to guess that it is not impossible to implement even
> > on your current hardware (maybe a firmware update is needed)?
>
> Are you suggesting the sending side would not need to sleep or the
> receiving side? Assuming you meant the sender, it spins waiting for the
> remote side to acknowledge the invalidate request? We place the data
> into a previously agreed upon buffer and send an interrupt. At this
> point, we would need to start spinning and waiting for completion.
> Let's assume we never run out of buffer space.

How would you run out of buffer space if it is synchronous?

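For illustration, a synchronous sender along the lines described above (place
the data in an agreed-upon buffer, send an interrupt, spin for completion)
needs only one slot, since it cannot issue a second request before the first
is acknowledged. The sketch below is a user-space model under assumptions:
sync_slot, slot_send and slot_complete are invented names, C11 atomics stand
in for whatever memory both partitions can reach, and loopback_irq is a test
stand-in that "handles" the interrupt inline.

```c
#include <stdatomic.h>
#include <stdint.h>

enum { SLOT_IDLE, SLOT_PENDING, SLOT_DONE };

/* Hypothetical single-slot buffer agreed upon by sender and receiver. */
struct sync_slot {
	uint64_t token;   /* request payload */
	atomic_int state; /* SLOT_IDLE, SLOT_PENDING or SLOT_DONE */
};

/* Receiver: called from its interrupt handler once the work is finished. */
static void slot_complete(struct sync_slot *s)
{
	atomic_store_explicit(&s->state, SLOT_DONE, memory_order_release);
}

/*
 * Sender: publish one request, kick the receiver, then spin for the ack.
 * raise_irq stands in for whatever raises the cross-partition interrupt.
 * Returns the number of spin iterations spent waiting.
 */
static unsigned long slot_send(struct sync_slot *s, uint64_t token,
			       void (*raise_irq)(struct sync_slot *))
{
	unsigned long spins = 0;

	s->token = token;
	atomic_store_explicit(&s->state, SLOT_PENDING, memory_order_release);
	raise_irq(s);

	while (atomic_load_explicit(&s->state, memory_order_acquire) != SLOT_DONE)
		spins++;

	/* The slot is free again only after the ack, so it can never overflow. */
	atomic_store_explicit(&s->state, SLOT_IDLE, memory_order_relaxed);
	return spins;
}

/* Loopback stand-in for the remote handler: completes the request inline. */
static uint64_t last_handled;

static void loopback_irq(struct sync_slot *s)
{
	last_handled = s->token;
	slot_complete(s);
}
```

A real implementation would also need a timeout on the spin loop; that is
left out here to keep the single-slot argument visible.
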
> The receiving side receives an interrupt. The interrupt currently wakes
> an XPC thread to do the work of transferring and delivering the message
> to XPMEM. The transfer of the data which XPC does uses the BTE engine
> which takes up to 28 seconds to timeout (hardware timeout before raising
> an error) and the BTE code automatically does a retry for certain
> types of failure. We currently need to grab semaphores which _MAY_
> be able to be reworked into other types of locks.

Sure, you obviously would need to rework your code because it's been
written with the assumption that it can sleep.

What is XPMEM exactly anyway? I'd assumed it is a Linux driver.