From mboxrd@z Thu Jan 1 00:00:00 1970
From: Minchan Kim
Subject: Re: [RFC][PATCH v3] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM
Date: Mon, 4 Jan 2010 14:20:49 +0900
Message-ID: <28c262361001032120v284e92b5ub1211f3d1fca6140@mail.gmail.com>
References: <20091225000717.GA26949@yahoo-inc.com> <87aax18xms.fsf@basil.nowhere.org> <20091230051540.GA16308@localhost> <20091230052402.GB26364@localhost> <873a2s8hmp.fsf@basil.nowhere.org> <20100104045020.GA21021@localhost>
In-Reply-To: <20100104045020.GA21021@localhost>
To: Wu Fengguang
Cc: Andi Kleen, Andrew Morton, Quentin Barnes, "linux-kernel@vger.kernel.org", "linux-fsdevel@vger.kernel.org", Nick Piggin, Steven Whitehouse, David Howells, Al Viro, Jonathan Corbet, Christoph Hellwig

Hi, Wu.

On Mon, Jan 4, 2010 at 1:50 PM, Wu Fengguang wrote:
> This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
>
> POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor
> performance: a 16K read will be carried out in 4 _sync_ 1-page reads.
>
> In other places, ra_pages==0 means
> - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
> - some IO error happened
> where multi-page read IO won't help or should be avoided.
>
> POSIX_FADV_RANDOM actually want a different semantics: to disable the
> *heuristic* readahead algorithm, and to use a dumb one which faithfully
> submit read IO for whatever application requests.
>
> So introduce a flag O_RANDOM for POSIX_FADV_RANDOM.
> It will be visible to fcntl(F_GETFL).
>
> Note that the random hint is not likely to help random reads performance
> noticeably.
> And it may be too permissive on huge request size (its IO
> size is not limited by read_ahead_kb).
>
> In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
> (NFS read) performance of the application increased by 313%!
>
> v3: use O_RANDOM to indicate both read/write access pattern as in
>     posix_fadvise(), although it only takes effect for read() now
>     (proposed by Quentin)
> v2: use O_RANDOM_READ to avoid race conditions (pointed out by Andi)
>
> CC: Nick Piggin
> CC: Andi Kleen
> CC: Steven Whitehouse
> CC: David Howells
> CC: Al Viro
> CC: Jonathan Corbet
> CC: Christoph Hellwig
> Tested-by: Quentin Barnes
> Signed-off-by: Wu Fengguang
> ---
>  include/asm-generic/fcntl.h |    4 ++++
>  mm/fadvise.c                |   10 +++++++++-
>  mm/readahead.c              |    6 ++++++
>  3 files changed, 19 insertions(+), 1 deletion(-)
>
> --- linux.orig/include/asm-generic/fcntl.h      2010-01-04 12:39:29.000000000 +0800
> +++ linux/include/asm-generic/fcntl.h   2010-01-04 12:40:11.000000000 +0800
> @@ -80,6 +80,10 @@
>  #define O_NDELAY       O_NONBLOCK
>  #endif
>
> +#ifndef O_RANDOM
> +#define O_RANDOM       010000000       /* random access pattern hint */
> +#endif
> +
>  #define F_DUPFD                0       /* dup */
>  #define F_GETFD                1       /* get close_on_exec */
>  #define F_SETFD                2       /* set/clear close_on_exec */
> --- linux.orig/mm/fadvise.c     2010-01-04 12:39:29.000000000 +0800
> +++ linux/mm/fadvise.c  2010-01-04 12:39:30.000000000 +0800
> @@ -77,12 +77,20 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
>         switch (advice) {
>         case POSIX_FADV_NORMAL:
>                 file->f_ra.ra_pages = bdi->ra_pages;
> +               spin_lock(&file->f_lock);
> +               file->f_flags &= ~O_RANDOM;
> +               spin_unlock(&file->f_lock);
>                 break;
>         case POSIX_FADV_RANDOM:
> -               file->f_ra.ra_pages = 0;
> +               spin_lock(&file->f_lock);
> +               file->f_flags |= O_RANDOM;
> +               spin_unlock(&file->f_lock);
>                 break;
>         case POSIX_FADV_SEQUENTIAL:
>                 file->f_ra.ra_pages = bdi->ra_pages * 2;
> +               spin_lock(&file->f_lock);
> +               file->f_flags &= ~O_RANDOM;
> +               spin_unlock(&file->f_lock);
>                 break;
>         case POSIX_FADV_WILLNEED:
>                 if (!mapping->a_ops->readpage) {
> --- linux.orig/mm/readahead.c   2010-01-04 12:39:29.000000000 +0800
> +++ linux/mm/readahead.c        2010-01-04 12:39:30.000000000 +0800
> @@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
>         if (!ra->ra_pages)
>                 return;
>
> +       /* be dumb */
> +       if (filp->f_flags & O_RANDOM) {
> +               force_page_cache_readahead(mapping, filp, offset, req_size);
> +               return;
> +       }
> +

Let me ask a dumb question. :)
How about testing O_RANDOM before the ra_pages test?
My point is that even when we turn off ra, it would be better to read
a contiguous block all at once than to have the readpage() callback do
I/O one page at a time.

Would that break some semantics or cause a problem in ondemand readahead?

>         /* do read-ahead */
>         ondemand_readahead(mapping, ra, filp, false, offset, req_size);
>  }

-- 
Kind regards,
Minchan Kim