Re: [RFC]numa: improve I/O performance by optimizing numa interleave allocation

From: Shaohua Li <shaohua.li@intel.com>
To: Andi Kleen <ak@linux.intel.com>
Cc: lkml <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jens Axboe <axboe@kernel.dk>, Christoph Lameter <cl@linux.com>,
	"lee.schermerhorn@hp.com" <lee.schermerhorn@hp.com>
Subject: Re: [RFC]numa: improve I/O performance by optimizing numa interleave allocation
Date: Wed, 23 Nov 2011 11:36:23 +0800
Message-ID: <1322019383.22361.346.camel@sli10-conroe>
In-Reply-To: <1321839585.22361.328.camel@sli10-conroe>

On Mon, 2011-11-21 at 09:39 +0800, Shaohua Li wrote:
> On Sat, 2011-11-19 at 01:30 +0800, Andi Kleen wrote:
> > On Fri, Nov 18, 2011 at 03:12:12PM +0800, Shaohua Li wrote:
> > > If the mem policy is interleave, we allocate pages from nodes in a round
> > > robin way. This surely interleaves fairly, but it is not optimal.
> > > 
> > > Say the pages will be used for I/O later. With interleave allocation, two pages
> > > are allocated from two nodes, so the pages are not physically contiguous. Later
> > > each page needs one segment for DMA scatter-gather. But the maximum hardware
> > > segment number is limited. The non-contiguous pages will soon use up the maximum
> > > hardware segment number, and we can't merge I/O into bigger DMA. Allocating
> > > pages from one node doesn't have this issue. The memory allocator's pcp list makes
> > > it quite likely that we get physically contiguous pages across several allocations.
> > 
> > FWIW it depends a lot on the IO hardware if the SG limitation
> > really makes a measurable difference for IO performance. I saw some wins from 
> > clustering using the IOMMU before, but that was a long time ago. I wouldn't 
> > consider it a truth without strong numbers, and then also only
> > for that particular device measured.
> > 
> > My understanding is that modern IO devices like NHM Express will
> > be faster at large SG lists.
> This is an LSI SAS1068E HBA card with some hard disks attached. The
> clustering has a real benefit here: I/O throughput increases 3% or so.
> I'm not sure about NHM Express, and I wonder why a large SG list would be faster.
> Doesn't a large SG list mean a large DMA descriptor?
> 
> > > So can we make both interleave fairness and contiguous allocation happy?
> > > Simply, we can adjust the round robin algorithm: we switch to another node
> > > after several (N) allocations happen. If N isn't too big, we can still get
> > > fair allocation, and we get N contiguous pages. I use N=8 in the patch below.
> > > I think 8 isn't too big for a modern NUMA machine. Applications which use
> > > interleave are unlikely to run for a short time, so I think fairness still works.
> > 
> > It depends a lot on the CPU access pattern.
> > 
> > Some workloads seem to do reasonably well with 2MB huge page interleaving.
> > But others actually prefer the cache line interleaving supplied by
> > the BIOS.
> > 
> > So you can have a trade off between IO and CPU performance.
> > When in doubt I usually opt for CPU performance by default.
> Can you elaborate on this? Cache line interleaving can only be
> supplied by the BIOS. The OS can provide N*PAGE_SIZE interleave. I'm wondering
> what the difference is between, for example, a 4k and an 8k interleave for CPU
> performance. Actually, adjacent pages interleaved across two nodes could
> have the same cache color, while two adjacent pages allocated from one
> node would not. So clustering could be more cache efficient from a coloring
> point of view.
> 
> > I definitely wouldn't make it the default, but if there are workloads
> > that benefit a lot it could be an additional parameter to the
> > interleave policy.
> Christoph suggested the same thing. The problem is that we need to change the API,
> right? And how are users supposed to use it? It would be difficult to
> determine the correct parameter.
> 
> If 8-page clustering is too big, maybe we can use a smaller value. I guess
> 2-page clustering is a big win too.
> 
> And I didn't change the case of allocation with a VMA, which is supposed to
> be used for anonymous pages.
I tried 2-page clustering; it has the same effect as 8-page
clustering in my test environment.
Would making the clustering a config option or sysctl be better?
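
For illustration, here is a minimal sketch of the clustered round robin
discussed above. It is not the patch itself. The names il_state,
next_node_clustered and count_segments are hypothetical, and the segment
count rests on an idealized assumption: consecutive pages taken from the
same node are physically contiguous, and pages from different nodes never
are. Real allocators and HBAs will behave less cleanly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor for clustered interleave; not kernel code. */
struct il_state {
    unsigned int node;  /* node currently being allocated from */
    unsigned int used;  /* pages already taken from it in this cluster */
};

/*
 * Return the node for the next page. The cursor stays on one node for
 * `cluster` consecutive allocations, then advances round robin.
 * cluster == 1 is plain per-page interleave.
 */
static unsigned int next_node_clustered(struct il_state *st,
                                        unsigned int nr_nodes,
                                        unsigned int cluster)
{
    assert(nr_nodes > 0 && cluster > 0);
    if (st->used == cluster) {
        st->node = (st->node + 1) % nr_nodes;
        st->used = 0;
    }
    st->used++;
    return st->node;
}

/*
 * Count DMA segments needed for nr_pages pages under the idealized
 * contiguity assumption stated above: a new segment starts whenever
 * the node changes.
 */
static size_t count_segments(unsigned int nr_nodes, unsigned int cluster,
                             size_t nr_pages)
{
    struct il_state st = { 0, 0 };
    size_t segments = 0;
    unsigned int prev = 0;

    for (size_t i = 0; i < nr_pages; i++) {
        unsigned int node = next_node_clustered(&st, nr_nodes, cluster);

        if (i == 0 || node != prev)
            segments++;
        prev = node;
    }
    return segments;
}
```

Under this model, 16 pages on two nodes need 16 segments with N=1,
8 segments with N=2, and 2 segments with N=8. Whether fewer segments
translates into throughput depends on the device's segment limit.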

Thanks,
Shaohua



Thread overview: 5+ messages
2011-11-18  7:12 [RFC]numa: improve I/O performance by optimizing numa interleave allocation Shaohua Li
2011-11-18 15:56 ` Christoph Lameter
2011-11-18 17:30 ` Andi Kleen
2011-11-21  1:39   ` Shaohua Li
2011-11-23  3:36     ` Shaohua Li [this message]