Re: [PATCH] Avoiding fragmentation through different allocator


From: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
To: Mel Gorman <mel@csn.ul.ie>
Cc: William Lee Irwin III <wli@holomorphy.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Avoiding fragmentation through different allocator
Date: Sun, 23 Jan 2005 11:28:10 -0200
Message-ID: <20050123132810.GA583@logos.cnet>
In-Reply-To: <20050122215949.GD26391@logos.cnet>

On Sat, Jan 22, 2005 at 07:59:49PM -0200, Marcelo Tosatti wrote:
> On Sat, Jan 22, 2005 at 09:48:20PM +0000, Mel Gorman wrote:
> > On Fri, 21 Jan 2005, Marcelo Tosatti wrote:
> > 
> > > On Thu, Jan 20, 2005 at 10:13:00AM +0000, Mel Gorman wrote:
> > > > <Changelog snipped>
> > >
> > > Hi Mel,
> > >
> > > I was thinking that it would be nice to have a set of high-order
> > > intensive workloads, and I wonder what are the most common high-order
> > > allocation paths which fail.
> > >
> > 
> > Agreed. As I am not fully sure what workloads require high-order
> > allocations, I updated VMRegress to keep track of the count of
> > allocations and released 0.11
> > (http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.11.tar.gz). To
> > use it to track allocations, do the following
> > 
> > 1. Download and unpack vmregress
> > 2. Patch a kernel with kernel_patches/v2.6/trace_pagealloc-count.diff .
> > The patch currently requires the modified allocator but I can fix that up
> > if people want it. Build and deploy the kernel
> > 3. Build vmregress by
> >   ./configure --with-linux=/usr/src/linux-2.6.11-rc1-mbuddy
> >   (or whatever path is appropriate)
> >   make
> > 4. Load the modules with;
> >   insmod src/code/vmregress_core.ko
> >   insmod src/sense/trace_alloccount.ko
> > 
> > This will create a proc entry /proc/vmregress/trace_alloccount that looks
> > something like;
> > 
> > Allocations (V1)
> > -----------
> > KernNoRclm   997453      370       50        0        0        0        0        0        0        0        0
> > KernRclm      35279        0        0        0        0        0        0        0        0        0        0
> > UserRclm    9870808        0        0        0        0        0        0        0        0        0        0
> > Total      10903540      370       50        0        0        0        0        0        0        0        0
> > 
> > Frees
> > -----
> > KernNoRclm   590965      244       28        0        0        0        0        0        0        0        0
> > KernRclm     227100       60        5        0        0        0        0        0        0        0        0
> > UserRclm    7974200       73       17        0        0        0        0        0        0        0        0
> > Total      19695805      747      100        0        0        0        0        0        0        0        0
> > 
> > To blank the counters, use
> > 
> > echo 0 > /proc/vmregress/trace_alloccount
> > 
> > Whatever workload we come up with, this proc entry will tell us if it is
> > exercising high-order allocations right now.
> 
> Great, excellent! Thanks.
> 
> I plan to spend some time testing and trying to understand the vmregress package 
> this week.
>  
> > > It mostly depends on hardware because most high-order allocations happen
> > > inside device drivers? What are the kernel codepaths which try to do
> > > high-order allocations and fall back on failure?
> > >
> > 
> > I'm not sure. I think that the paths we exercise right now will be largely
> > artificial. For example, you can force order-2 allocations by scping a
> > large file through localhost (because of the large MTU in that interface).
> > I have not come up with another meaningful workload that guarantees
> > high-order allocations yet.
> 
> Thoughts and criticism of the following ideas are very much appreciated:
> 
> In private conversation with wli (who helped me by providing this information) we can 
> conjecture the following:
> 
> Modern IO devices are capable of doing scatter/gather IO.
> 
> There is overhead associated with setting up and managing the scatter/gather tables. 
> 
> The benefit of large physically contiguous blocks is the ability to avoid the SG 
> management overhead. 
> 
> Now the question is: does the added overhead of allocating high-order blocks through migration 
> offset the overhead of SG IO? Quantifying that would be interesting.

What is the overhead of SG IO management, and how much improvement is there without it?

Are block IO drivers trying to allocate big physical segments? I bet they are not, because the
"pool of huge pages" (as you say) is limited.

> 
> This depends on the driver implementation (how efficiently it is able to manage the SG IO tables) and 
> device/IO subsystem characteristics.
> 
> Also filesystems benefit from big physically contiguous blocks. Quoting wli
> "they want bigger blocks and contiguous memory to match bigger blocks..."
> 
> I completely agree that your simplified allocator decreases fragmentation which in turn
> benefits the system overall. 
> 
> This is an area which can be further improved - ie efficiency in reducing fragmentation 
> is excellent. 
> I sincerely appreciate the work you are doing!
> 
> > > To measure whether the cost of page migration offsets the ability to be
> > > able to deliver high-order allocations we want a set of meaningful
> > > performance tests?
> > >
> > 
> > Bear in mind, there are more considerations. The allocator potentially
> > makes hotplug problems easier and could be easily tied into any
> > page-zeroing system. Some of your own benchmarks also implied that the
> > modified allocator helped some types of workloads which is beneficial in
> > itself. The last consideration is HugeTLB pages, which I am hoping William
> > will weigh in on.
> > 
> > Right now, I believe that the pool of huge pages is of a fixed size
> > because of fragmentation difficulties. If we knew we could allocate huge
> > pages, this pool would not have to be fixed. Some applications will
> > heavily benefit from this. While databases are the obvious one,
> > applications with large heaps will also benefit like Java Virtual
> > Machines. I can dig up papers that measured this on Solaris although I
> > don't have them at hand right now.
> 
> Please.
> 
> > We know right now that the overhead of this allocator is fairly low
> > (anyone got benchmarks to disagree) but I understand that page migration
> > is relatively expensive. The allocator also does not have adverse
> > CPU+cache effects like migration and the concept is fairly simple.
> 
> Agreed.
> 
> > > It's quite possible that not all unsatisfiable high-order allocations
> > > want to force page migration (which is quite expensive in terms of
> > > CPU/cache). Only migrate on __GFP_NOFAIL ?
> > >
> > 
> > I still believe with the allocator, we will only have to migrate in
> > exceptional circumstances.
> 
> Agreed - best scenario is the guaranteed availability of high-order blocks, where 
> migration is not necessary.
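
For anyone wanting to check a workload against the counters Mel describes, the
/proc/vmregress/trace_alloccount output quoted above could be summarised with
something like the sketch below. It is only an illustration: it assumes the
eleven numeric columns are orders 0 through 10 (the message does not label
them), and the function names parse_alloccount and high_order_counts are made
up here and are not part of VMRegress. The sample text is copied from the
quoted output.

```python
# Sample copied from the quoted trace_alloccount output in this thread.
SAMPLE = """\
Allocations (V1)
-----------
KernNoRclm   997453      370       50        0        0        0        0        0        0        0        0
KernRclm      35279        0        0        0        0        0        0        0        0        0        0
UserRclm    9870808        0        0        0        0        0        0        0        0        0        0
Total      10903540      370       50        0        0        0        0        0        0        0        0

Frees
-----
KernNoRclm   590965      244       28        0        0        0        0        0        0        0        0
KernRclm     227100       60        5        0        0        0        0        0        0        0        0
UserRclm    7974200       73       17        0        0        0        0        0        0        0        0
Total      19695805      747      100        0        0        0        0        0        0        0        0
"""


def parse_alloccount(text):
    """Parse the text into {section: {row_label: [count per column]}}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the dashed underlines.
        if not line or set(line) == {"-"}:
            continue
        parts = line.split()
        # A data row is a label followed only by integers.
        if len(parts) > 1 and all(p.isdigit() for p in parts[1:]):
            if current is not None:
                sections[current][parts[0]] = [int(p) for p in parts[1:]]
        else:
            # Anything else is a section header ("Allocations (V1)", "Frees").
            current = parts[0]
            sections[current] = {}
    return sections


def high_order_counts(rows):
    """Sum every column after the first (ASSUMED to be orders >= 1)."""
    return {label: sum(counts[1:]) for label, counts in rows.items()}


if __name__ == "__main__":
    parsed = parse_alloccount(SAMPLE)
    for section, rows in parsed.items():
        print(section, high_order_counts(rows))
```

On a live system the text would be read from /proc/vmregress/trace_alloccount
instead of SAMPLE, after blanking the counters with the echo command quoted
above and running the workload under test.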


