From: Wu Fengguang <fengguang.wu@intel.com>
To: Enrik Berkhan <Enrik.Berkhan@ge.com>
Cc: Clemens Eisserer <linuxhippy@gmail.com>,
	linux-kernel@vger.kernel.org,
	"kosaki.motohiro@jp.fujitsu.com" <kosaki.motohiro@jp.fujitsu.com>,
	Neil Brown <neilb@suse.de>, Jens Axboe <jens.axboe@oracle.com>,
	linux-fsdevel@vger.kernel.org
Subject: [RFC] md: don't scale up readahead size if RAID chunk size >= 4MB
Date: Fri, 11 Sep 2009 15:08:47 +0800
Message-ID: <20090911070847.GI6267@localhost>
In-Reply-To: <4AA9170B.70306@ge.com>

On Thu, Sep 10, 2009 at 05:11:07PM +0200, Enrik Berkhan wrote:
> Clemens Eisserer wrote:
>> Does nobody have an idea what could be the cause of this OOM situation?
>
> I guess it's too large readahead. I had this situation recently, too,  
> with a raid0 of 8 disks (4MB chunks) that set the file readahead count  
> to 32MB or so (on a 60MB NOMMU system).

The default readahead size would be 2 * 8 * 4MB = 64MB for such a
software RAID. However max_sane_readahead() will limit the runtime
readahead size to available_cache / 2, which is ~30MB on your system.
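
Roughly, the sizing works out like this (just a sketch of the
arithmetic, not the actual kernel code; the page size is assumed to be
4KB, the disk count, chunk size and memory size are from your report):

    /* sketch of the raid0 readahead sizing, not kernel code */
    #include <stdio.h>

    int main(void)
    {
        unsigned long page_size   = 4096;        /* assumed 4KB pages */
        unsigned long raid_disks  = 8;           /* from your report  */
        unsigned long chunk_bytes = 4UL << 20;   /* 4MB chunks        */
        unsigned long mem_bytes   = 60UL << 20;  /* ~60MB NOMMU box   */

        /* raid0_run(): ra_pages = 2 * stripe */
        unsigned long stripe_pages = raid_disks * chunk_bytes / page_size;
        unsigned long ra_bytes     = 2 * stripe_pages * page_size;

        /* max_sane_readahead() caps a single readahead at roughly
         * half of the memory available for caching */
        unsigned long cap_bytes    = mem_bytes / 2;

        printf("default ra: %luMB, runtime cap: ~%luMB\n",
               ra_bytes >> 20, cap_bytes >> 20);
        return 0;
    }

That prints "default ra: 64MB, runtime cap: ~30MB".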

> When I tried to read a 100MB file via sendfile(), the kernel insisted on  
> doing the 32MB readahead ... (in __do_page_cache_readahead, like in your  
> trace).

You could configure readahead size with the blockdev command.
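
For example (the device name is just a placeholder for your array;
--setra takes 512-byte sectors):

    # show the current readahead setting, in sectors
    blockdev --getra /dev/md0

    # limit readahead to 4MB (8192 sectors)
    blockdev --setra 8192 /dev/md0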

But I admit that the default 64MB readahead size is insanely large.
I have been meaning to change this for a long time, though I'm not
sure it is exactly what people with big arrays want. Anyway, here is
the patch; maybe some storage gurus can give us some hints about
their use cases.
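
To illustrate with the array above: 4MB chunks are 8192 sectors, which
is not below the 4 * 1024 * 1024 / 512 = 8192 sector threshold, so the
patch leaves ra_pages at the per-device default instead of raising it
to 2 * stripe = 64MB; arrays with smaller chunks keep the old scaling.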

Thanks,
Fengguang
---
md: don't scale up readahead size for large RAID chunk size

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 drivers/md/raid0.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

--- linux.orig/drivers/md/raid0.c	2009-09-11 14:36:02.000000000 +0800
+++ linux/drivers/md/raid0.c	2009-09-11 14:51:13.000000000 +0800
@@ -341,12 +341,16 @@ static int raid0_run(mddev_t *mddev)
 	 * chunk size, then we will not drive that device as hard as it
 	 * wants.  We consider this a configuration error: a larger
 	 * chunksize should be used in that case.
+	 * Also leave the readahead size alone if the chunk size is already
+	 * large (>= 4MB): each device then receives big enough IOs anyway,
+	 * and (2 * stripe) would otherwise grow too large.
 	 */
 	{
 		int stripe = mddev->raid_disks *
 			(mddev->chunk_sectors << 9) / PAGE_SIZE;
-		if (mddev->queue->backing_dev_info.ra_pages < 2* stripe)
-			mddev->queue->backing_dev_info.ra_pages = 2* stripe;
+		if (mddev->chunk_sectors < (4 * 1024 * 1024 / 512) &&
+			mddev->queue->backing_dev_info.ra_pages < 2 * stripe)
+			mddev->queue->backing_dev_info.ra_pages = 2 * stripe;
 	}
 
 	blk_queue_merge_bvec(mddev->queue, raid0_mergeable_bvec);
