(no subject) - Wu Fengguang

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Wu Fengguang <fengguang.wu@intel.com>
To: Quentin Barnes <qbarnes+nfs@yahoo-inc.com>
Cc: Andi Kleen <andi@firstfloor.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org, Nick Piggin <npiggin@suse.de>
Date: Wed, 30 Dec 2009 13:41:19 +0800	[thread overview]
Message-ID: <20091230054119.GB16308@localhost> (raw)

Andrew Morton <akpm@linux-foundation.org> 
CC: Steven Whitehouse <swhiteho@redhat.com>
Subject: Re: [RFC][PATCH] Disabling read-ahead makes I/O of large reads
	small
Reply-To: 
In-Reply-To: <87aax18xms.fsf@basil.nowhere.org>

Hi Quentin,

Quentin Barnes <qbarnes+nfs@yahoo-inc.com> writes:

> Adding the posix_fadvise(..., POSIX_FADV_RANDOM) call sets

Have you tried w/o POSIX_FADV_RANDOM (ie. comment out the fadivse call)?
It should be able to achieve the same good performance. The heuristic
readahead logic should detect that the application is doing random
reads.

> ra_pages=0.  This has a very odd side-effect in the kernel.  Once
> read-ahead is disabled, subsequent calls to read(2) are now done in
> the kernel via ->readpage() callback doing I/O one page at a time!
> 
> Pouring through the code in mm/filemap.c I see that the kernel has
> commingled read-ahead and plain read implementations.  The algorithms
> have much in common, so I can see why it was done, but it left this
> anomaly of severely pimping read(2) calls on file descriptors with
> read-ahead disabled.
> 
> 
> For example, with a read(2) of 98K bytes of a file opened with
> O_DIRECT accessed over NFSv3 with rsize=32768, I see:
> =========
> V3 ACCESS Call (Reply In 249), FH:0xf3a8e519
> V3 ACCESS Reply (Call In 248)
> V3 READ Call (Reply In 321), FH:0xf3a8e519 Offset:0 Len:32768
> V3 READ Call (Reply In 287), FH:0xf3a8e519 Offset:32768 Len:32768
> V3 READ Call (Reply In 356), FH:0xf3a8e519 Offset:65536 Len:32768
> V3 READ Reply (Call In 251) Len:32768
> V3 READ Reply (Call In 250) Len:32768
> V3 READ Reply (Call In 252) Len:32768
> =========
> I would expect three READs issued of size 32K, and that's exactly
> what I see.
> 
> 
> For the same file without O_DIRECT but with read-ahead disabled
> (its ra_pages=0), I see:
> =========
> V3 ACCESS Call (Reply In 167), FH:0xf3a8e519
> V3 ACCESS Reply (Call In 166)
> V3 READ Call (Reply In 172), FH:0xf3a8e519 Offset:0 Len:4096 
> V3 READ Reply (Call In 168) Len:4096
> V3 READ Call (Reply In 177), FH:0xf3a8e519 Offset:4096 Len:4096  
> V3 READ Reply (Call In 173) Len:4096 
> V3 READ Call (Reply In 182), FH:0xf3a8e519 Offset:8192 Len:4096
> V3 READ Reply (Call In 178) Len:4096
> [... READ Call/Reply pairs repeated another 21 times ...]
> =========
> Now I see 24 READ calls of 4K each!

Good catch, Thank you very much!

> A workaround for this kernel problem is to hack the app to do a
> readahead(2) call prior to the read(2), however, I would think a
> better approach would be to fix the kernel.  I came up with the
> included patch that once applied restores the expected read(2)
> behavior.  For the latter test case above of a file with read-ahead
> disabled but now with the patch below applied, I now see:
> =========
> V3 ACCESS Call (Reply In 1350), FH:0xf3a8e519
> V3 ACCESS Reply (Call In 1349)
> V3 READ Call (Reply In 1387), FH:0xf3a8e519 Offset:0 Len:32768
> V3 READ Call (Reply In 1421), FH:0xf3a8e519 Offset:32768 Len:32768
> V3 READ Call (Reply In 1456), FH:0xf3a8e519 Offset:65536 Len:32768
> V3 READ Reply (Call In 1351) Len:32768
> V3 READ Reply (Call In 1352) Len:32768
> V3 READ Reply (Call In 1353) Len:32768
> =========
> Which is what I would expect -- back to just three 32K READs.
> 
> After this change, the overall performance of the application
> increased by 313%!

And awesome improvements!

> 
> I have no idea if my patch is the appropriate fix.  I'm well out of
> my area in this part of the kernel.  It solves this one problem, but
> I have no idea how many boundary cases it doesn't cover or even if
> it is the right way to go about addressing this issue.
> 
> Is this behavior of shorting I/O of read(2) considered a bug?  And
> is this approach for a fix approriate?

The approach is mostly OK for the bug. However one issue is missed --
the ra_pages is somehow overloaded. I try to fix the problems in the
two patches just posted. Will that solve your problem?

Thanks,
Fengguang

> 
> --- linux-2.6.32.2/mm/filemap.c 2009-12-18 16:27:07.000000000 -0600
> +++ linux-2.6.32.2-rapatch/mm/filemap.c 2009-12-24 13:07:07.000000000 -0600
> @@ -1012,9 +1012,13 @@ static void do_generic_file_read(struct 
>  find_page:
>                 page = find_get_page(mapping, index);
>                 if (!page) {
> -                       page_cache_sync_readahead(mapping,
> -                                       ra, filp,
> -                                       index, last_index - index);
> +                       if (ra->ra_pages)
> +                               page_cache_sync_readahead(mapping,
> +                                               ra, filp,
> +                                               index, last_index - index);
> +                       else
> +                               force_page_cache_readahead(mapping, filp,
> +                                               index, last_index - index);
>                         page = find_get_page(mapping, index);
>                         if (unlikely(page == NULL))
>                                 goto no_cached_page;

WARNING: multiple messages have this Message-ID (diff)

From: Wu Fengguang <fengguang.wu@intel.com>
To: Quentin Barnes <qbarnes+nfs@yahoo-inc.com>
Cc: Andi Kleen <andi@firstfloor.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org, Nick Piggin <npiggin@suse.de>
Subject: (unknown)
Date: Wed, 30 Dec 2009 13:41:19 +0800	[thread overview]
Message-ID: <20091230054119.GB16308@localhost> (raw)

Andrew Morton <akpm@linux-foundation.org> 
CC: Steven Whitehouse <swhiteho@redhat.com>
Subject: Re: [RFC][PATCH] Disabling read-ahead makes I/O of large reads
	small
Reply-To: 
In-Reply-To: <87aax18xms.fsf@basil.nowhere.org>

Hi Quentin,

Quentin Barnes <qbarnes+nfs@yahoo-inc.com> writes:

> Adding the posix_fadvise(..., POSIX_FADV_RANDOM) call sets

Have you tried w/o POSIX_FADV_RANDOM (ie. comment out the fadivse call)?
It should be able to achieve the same good performance. The heuristic
readahead logic should detect that the application is doing random
reads.

> ra_pages=0.  This has a very odd side-effect in the kernel.  Once
> read-ahead is disabled, subsequent calls to read(2) are now done in
> the kernel via ->readpage() callback doing I/O one page at a time!
> 
> Pouring through the code in mm/filemap.c I see that the kernel has
> commingled read-ahead and plain read implementations.  The algorithms
> have much in common, so I can see why it was done, but it left this
> anomaly of severely pimping read(2) calls on file descriptors with
> read-ahead disabled.
> 
> 
> For example, with a read(2) of 98K bytes of a file opened with
> O_DIRECT accessed over NFSv3 with rsize=32768, I see:
> =========
> V3 ACCESS Call (Reply In 249), FH:0xf3a8e519
> V3 ACCESS Reply (Call In 248)
> V3 READ Call (Reply In 321), FH:0xf3a8e519 Offset:0 Len:32768
> V3 READ Call (Reply In 287), FH:0xf3a8e519 Offset:32768 Len:32768
> V3 READ Call (Reply In 356), FH:0xf3a8e519 Offset:65536 Len:32768
> V3 READ Reply (Call In 251) Len:32768
> V3 READ Reply (Call In 250) Len:32768
> V3 READ Reply (Call In 252) Len:32768
> =========
> I would expect three READs issued of size 32K, and that's exactly
> what I see.
> 
> 
> For the same file without O_DIRECT but with read-ahead disabled
> (its ra_pages=0), I see:
> =========
> V3 ACCESS Call (Reply In 167), FH:0xf3a8e519
> V3 ACCESS Reply (Call In 166)
> V3 READ Call (Reply In 172), FH:0xf3a8e519 Offset:0 Len:4096 
> V3 READ Reply (Call In 168) Len:4096
> V3 READ Call (Reply In 177), FH:0xf3a8e519 Offset:4096 Len:4096  
> V3 READ Reply (Call In 173) Len:4096 
> V3 READ Call (Reply In 182), FH:0xf3a8e519 Offset:8192 Len:4096
> V3 READ Reply (Call In 178) Len:4096
> [... READ Call/Reply pairs repeated another 21 times ...]
> =========
> Now I see 24 READ calls of 4K each!

Good catch, Thank you very much!

> A workaround for this kernel problem is to hack the app to do a
> readahead(2) call prior to the read(2), however, I would think a
> better approach would be to fix the kernel.  I came up with the
> included patch that once applied restores the expected read(2)
> behavior.  For the latter test case above of a file with read-ahead
> disabled but now with the patch below applied, I now see:
> =========
> V3 ACCESS Call (Reply In 1350), FH:0xf3a8e519
> V3 ACCESS Reply (Call In 1349)
> V3 READ Call (Reply In 1387), FH:0xf3a8e519 Offset:0 Len:32768
> V3 READ Call (Reply In 1421), FH:0xf3a8e519 Offset:32768 Len:32768
> V3 READ Call (Reply In 1456), FH:0xf3a8e519 Offset:65536 Len:32768
> V3 READ Reply (Call In 1351) Len:32768
> V3 READ Reply (Call In 1352) Len:32768
> V3 READ Reply (Call In 1353) Len:32768
> =========
> Which is what I would expect -- back to just three 32K READs.
> 
> After this change, the overall performance of the application
> increased by 313%!

And awesome improvements!

> 
> I have no idea if my patch is the appropriate fix.  I'm well out of
> my area in this part of the kernel.  It solves this one problem, but
> I have no idea how many boundary cases it doesn't cover or even if
> it is the right way to go about addressing this issue.
> 
> Is this behavior of shorting I/O of read(2) considered a bug?  And
> is this approach for a fix approriate?

The approach is mostly OK for the bug. However one issue is missed --
the ra_pages is somehow overloaded. I try to fix the problems in the
two patches just posted. Will that solve your problem?

Thanks,
Fengguang

> 
> --- linux-2.6.32.2/mm/filemap.c 2009-12-18 16:27:07.000000000 -0600
> +++ linux-2.6.32.2-rapatch/mm/filemap.c 2009-12-24 13:07:07.000000000 -0600
> @@ -1012,9 +1012,13 @@ static void do_generic_file_read(struct 
>  find_page:
>                 page = find_get_page(mapping, index);
>                 if (!page) {
> -                       page_cache_sync_readahead(mapping,
> -                                       ra, filp,
> -                                       index, last_index - index);
> +                       if (ra->ra_pages)
> +                               page_cache_sync_readahead(mapping,
> +                                               ra, filp,
> +                                               index, last_index - index);
> +                       else
> +                               force_page_cache_readahead(mapping, filp,
> +                                               index, last_index - index);
>                         page = find_get_page(mapping, index);
>                         if (unlikely(page == NULL))
>                                 goto no_cached_page;

next             reply	other threads:[~2009-12-30  5:41 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-12-30  5:41 Wu Fengguang [this message]
2009-12-30  5:41 ` (unknown) Wu Fengguang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20091230054119.GB16308@localhost \
    --to=fengguang.wu@intel.com \
    --cc=andi@firstfloor.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=npiggin@suse.de \
    --cc=qbarnes+nfs@yahoo-inc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.