* [LSF/MM TOPIC] a few storage topics
  From: Mike Snitzer @ 2012-01-17 20:06 UTC
  To: lsf-pc; +Cc: linux-scsi, dm-devel, linux-fsdevel

1) expose WRITE SAME via a higher-level interface (a la sb_issue_discard) for
   more efficient zeroing on SCSI devices that support it
   - dm-thinp and dm-kcopyd could benefit from offloading the zeroing to the
     array
   - I'll be reviewing this closer to assess the scope of the work

2) revise fs/block_dev.c:__blkdev_put's "sync on last close" semantic; please
   see Mikulas Patocka's recent proposal on dm-devel:
   http://www.redhat.com/archives/dm-devel/2012-January/msg00021.html
   - the patch didn't create much discussion (other than hch's suggestion to
     use file->private_data).  Are the current semantics somehow important to
     some filesystems (e.g. NFS)?
   - allowing read-only opens to _not_ trigger a sync is desirable (e.g. if
     dm-thinp's storage pool was exhausted we should still be able to read
     data from thinp devices)

3) Are there any SSD+rotational storage caching layers being developed for
   upstream consideration (there were: bcache, fb's flashcache, etc.)?
   - Red Hat would like to know if leveraging the dm-thinp infrastructure to
     implement a new DM target for caching would be well received by the
     greater community
   - and are there any proposals for classifying data/files as cache hot, etc.
     (T10 has an active proposal for passing info in the CDB) -- is anyone
     working in this area?

4) is anyone working on an interface to GET LBA STATUS?
   - Martin Petersen added GET LBA STATUS support to scsi_debug, but is there
     a vision for how tools (e.g. pvmove) could access such info in a uniform
     way across different vendors' storage?

5) Any more progress on stable pages?
   - I know Darrick Wong had some proposals; what remains?
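Regarding item 1: a minimal sketch of the kind of call site dm-thinp or
dm-kcopyd could use, assuming the existing blkdev_issue_zeroout() helper (or a
sibling of it) were taught to issue WRITE SAME on devices that advertise
support.  The zero_extent() wrapper, its byte-based arguments and the GFP_NOFS
choice are illustrative assumptions, not an existing API.

#include <linux/blkdev.h>
#include <linux/gfp.h>

/*
 * Illustrative only: zero 'len' bytes starting at byte offset 'pos' of 'bdev'.
 * blkdev_issue_zeroout() is the existing in-kernel zeroing helper; the idea
 * floated above is to let it offload to WRITE SAME on capable devices instead
 * of submitting bios full of zero pages.
 */
static int zero_extent(struct block_device *bdev, loff_t pos, size_t len)
{
	sector_t sector = pos >> 9;		/* 512-byte sectors */
	sector_t nr_sects = len >> 9;

	return blkdev_issue_zeroout(bdev, sector, nr_sects, GFP_NOFS);
}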
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jan Kara @ 2012-01-17 21:36 UTC
  To: Mike Snitzer; +Cc: lsf-pc, linux-fsdevel, dm-devel, linux-scsi

On Tue 17-01-12 15:06:12, Mike Snitzer wrote:
> 5) Any more progress on stable pages?
>    - I know Darrick Wong had some proposals; what remains?
  As far as I know this is done for XFS, btrfs, ext4. Is more needed?

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Darrick J. Wong @ 2012-01-18 22:58 UTC
  To: Jan Kara; +Cc: Mike Snitzer, lsf-pc, linux-fsdevel, dm-devel, linux-scsi

On Tue, Jan 17, 2012 at 10:36:48PM +0100, Jan Kara wrote:
> On Tue 17-01-12 15:06:12, Mike Snitzer wrote:
> > 5) Any more progress on stable pages?
> >    - I know Darrick Wong had some proposals; what remains?
>   As far as I know this is done for XFS, btrfs, ext4. Is more needed?

Yep, it's done for those three fses.

I suppose it might help some people if instead of wait_on_page_writeback we
could simply page-migrate all the processes onto a new page...?

Or possibly modify md-raid5 not to snapshot dirty pages prior to xor/write?
(I never really bothered to find out if it really does this.)

--D

> 								Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
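For context on what "done for those three fses" means mechanically: the
stable-page wait lives in the write-fault path.  Below is a minimal,
filesystem-neutral sketch of a ->page_mkwrite handler - not any particular
filesystem's actual code - showing the wait_on_page_writeback() call that
keeps page contents fixed while lower layers read them, and that introduces
the latency discussed later in this thread.

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Sketch only: block a write fault until in-flight writeback of the page has
 * completed, so the block/RAID/DIF layer below never sees the data change
 * under it.  Real implementations also reserve blocks, handle truncate races,
 * etc.; this shows just the wait.
 */
static int sketch_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;

	lock_page(page);
	if (page->mapping != vma->vm_file->f_mapping) {
		unlock_page(page);
		return VM_FAULT_NOPAGE;
	}
	wait_on_page_writeback(page);	/* the stable-page wait */
	return VM_FAULT_LOCKED;		/* page stays locked and gets dirtied */
}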
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jan Kara @ 2012-01-18 23:22 UTC
  To: Darrick J. Wong; Cc: Jan Kara, Mike Snitzer, lsf-pc, linux-fsdevel, dm-devel, linux-scsi, neilb

On Wed 18-01-12 14:58:08, Darrick J. Wong wrote:
> On Tue, Jan 17, 2012 at 10:36:48PM +0100, Jan Kara wrote:
> > On Tue 17-01-12 15:06:12, Mike Snitzer wrote:
> > > 5) Any more progress on stable pages?
> > >    - I know Darrick Wong had some proposals; what remains?
> >   As far as I know this is done for XFS, btrfs, ext4. Is more needed?
>
> Yep, it's done for those three fses.
>
> I suppose it might help some people if instead of wait_on_page_writeback we
> could simply page-migrate all the processes onto a new page...?
  Well, but it will cost some more memory & copying, so whether it's faster
or not pretty much depends on the workload, doesn't it? Anyway, I've already
heard one guy complaining that his RT application redirties mmapped pages
and started seeing big latencies due to the stable pages work. So for these
guys migrating might be an option (or maybe an fadvise/madvise flag to do
copy-out before submitting for IO?).

> Or possibly modify md-raid5 not to snapshot dirty pages prior to xor/write?
> (I never really bothered to find out if it really does this.)
  Not sure either. Neil should know :) (added to CC).

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Boaz Harrosh @ 2012-01-18 23:42 UTC
  To: Jan Kara; +Cc: Mike Snitzer, linux-scsi, dm-devel, linux-fsdevel, lsf-pc

On 01/19/2012 01:22 AM, Jan Kara wrote:
> On Wed 18-01-12 14:58:08, Darrick J. Wong wrote:
>> On Tue, Jan 17, 2012 at 10:36:48PM +0100, Jan Kara wrote:
>>> On Tue 17-01-12 15:06:12, Mike Snitzer wrote:
>>>> 5) Any more progress on stable pages?
>>>>    - I know Darrick Wong had some proposals; what remains?
>>>   As far as I know this is done for XFS, btrfs, ext4. Is more needed?
>>
>> Yep, it's done for those three fses.
>>
>> I suppose it might help some people if instead of wait_on_page_writeback we
>> could simply page-migrate all the processes onto a new page...?
>
>   Well, but it will cost some more memory & copying, so whether it's faster
> or not pretty much depends on the workload, doesn't it? Anyway, I've already
> heard one guy complaining that his RT application redirties mmapped pages
> and started seeing big latencies due to the stable pages work. So for these
> guys migrating might be an option (or maybe an fadvise/madvise flag to do
> copy-out before submitting for IO?).
>

OK, that one is interesting. Because I'd imagine that the kernel would not
start write-out on a busily modified page. Some heavy modifying, then a
single write. If it's not so then there is already great inefficiency, just
now exposed, but it was always there. The "page-migrate" mentioned here will
not help.

Could we not better our page write-out algorithms to avoid heavily contended
pages?

Do you have a more detailed description of the workload? Is it theoretically
avoidable?

>> Or possibly modify md-raid5 not to snapshot dirty pages prior to xor/write?
>> (I never really bothered to find out if it really does this.)

md-raid5/1 currently copies all pages, if that's what you meant.

>   Not sure either. Neil should know :) (added to CC).
>
> 								Honza

Thanks
Boaz
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jan Kara @ 2012-01-19 9:46 UTC
  To: Boaz Harrosh; Cc: Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Thu 19-01-12 01:42:12, Boaz Harrosh wrote:
> On 01/19/2012 01:22 AM, Jan Kara wrote:
> > On Wed 18-01-12 14:58:08, Darrick J. Wong wrote:
> >> On Tue, Jan 17, 2012 at 10:36:48PM +0100, Jan Kara wrote:
> >>> On Tue 17-01-12 15:06:12, Mike Snitzer wrote:
> >>>> 5) Any more progress on stable pages?
> >>>>    - I know Darrick Wong had some proposals; what remains?
> >>>   As far as I know this is done for XFS, btrfs, ext4. Is more needed?
> >>
> >> Yep, it's done for those three fses.
> >>
> >> I suppose it might help some people if instead of wait_on_page_writeback we
> >> could simply page-migrate all the processes onto a new page...?
> >
> >   Well, but it will cost some more memory & copying, so whether it's faster
> > or not pretty much depends on the workload, doesn't it? Anyway, I've already
> > heard one guy complaining that his RT application redirties mmapped pages
> > and started seeing big latencies due to the stable pages work. So for these
> > guys migrating might be an option (or maybe an fadvise/madvise flag to do
> > copy-out before submitting for IO?).
>
> OK, that one is interesting. Because I'd imagine that the kernel would not
> start write-out on a busily modified page.
  So currently writeback doesn't use the fact of how busily a page is
modified. After all, the whole mm has only two sorts of pages - active &
inactive - which reflects how often a page is accessed but says nothing
about how often it is dirtied. So we don't have this information in the
kernel and it would be relatively (memory) expensive to keep it.

> Some heavy modifying, then a single write. If it's not so then there is
> already great inefficiency, just now exposed, but it was always there. The
> "page-migrate" mentioned here will not help.
  Yes, but I believe the RT guy doesn't redirty the page that often. It is
just that if you have to meet certain latency criteria, you cannot afford a
single case where you have to wait. And if you redirty pages, you are bound
to hit the PageWriteback case sooner or later.

> Could we not better our page write-out algorithms to avoid heavily contended
> pages?
  That's not so easy. Firstly, you'd have to track and keep that information
somehow. Secondly, it is better to write out a busily dirtied page than to
introduce a seek. Also the definition of 'busy' differs for different
purposes. So to make this useful the logic won't be trivial. Thirdly, the
benefit is questionable anyway (at least for most realistic workloads)
because the flusher thread doesn't write the pages all that often - when
there are not many pages, we write them out just once every couple of
seconds; when we have lots of dirty pages we cycle through all of them, so
one page is not written that often.

> Do you have a more detailed description of the workload? Is it theoretically
> avoidable?
  See https://lkml.org/lkml/2011/10/23/156. Using page migration or copyout
would solve the problems of this guy.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Andrea Arcangeli @ 2012-01-19 15:08 UTC
  To: Jan Kara; Cc: Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Thu, Jan 19, 2012 at 10:46:37AM +0100, Jan Kara wrote:
> So to make this useful the logic won't be trivial. Thirdly, the benefit is
> questionable anyway (at least for most realistic workloads) because the
> flusher thread doesn't write the pages all that often - when there are not
> many pages, we write them out just once every couple of seconds; when we
> have lots of dirty pages we cycle through all of them, so one page is not
> written that often.

If you mean migrate as in mm/migrate.c, that's also not cheap: it will page
fault anybody accessing the page, it'll do the page copy, it'll IPI all cpus
that had the mm in the TLB, and it locks the page too and does all sorts of
checks. But it's true it'll be CPU bound... while I understand the current
solution is I/O bound.

> > Do you have a more detailed description of the workload? Is it theoretically
> > avoidable?
>  See https://lkml.org/lkml/2011/10/23/156. Using page migration or copyout
> would solve the problems of this guy.

Copying in the I/O layer should be better than page migration:
1) copying the page to an I/O kernel buffer won't involve the expensive TLB
   IPIs that migration requires,
2) copying the page to an I/O kernel buffer won't cause page faults because
   of migration entries being set,
3) migration has to copy too, so the cost on the memory bus is the same.

So unless I'm missing something, page migration and pte/tlb mangling (I mean
as in mm/migrate.c) is worse in every way than bounce buffering at the I/O
layer if you notice the page can be modified while it's under I/O.
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jan Kara @ 2012-01-19 20:52 UTC
  To: Andrea Arcangeli; Cc: Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J. Wong

On Thu 19-01-12 16:08:49, Andrea Arcangeli wrote:
> On Thu, Jan 19, 2012 at 10:46:37AM +0100, Jan Kara wrote:
> > So to make this useful the logic won't be trivial. Thirdly, the benefit is
> > questionable anyway (at least for most realistic workloads) because the
> > flusher thread doesn't write the pages all that often - when there are not
> > many pages, we write them out just once every couple of seconds; when we
> > have lots of dirty pages we cycle through all of them, so one page is not
> > written that often.
>
> If you mean migrate as in mm/migrate.c, that's also not cheap: it will page
> fault anybody accessing the page, it'll do the page copy, it'll IPI all
> cpus that had the mm in the TLB, and it locks the page too and does all
> sorts of checks. But it's true it'll be CPU bound... while I understand the
> current solution is I/O bound.
  Thanks for the explanation. You are right that currently we are I/O bound,
so migration is probably faster on most HW, but as I said earlier, different
things might end up better in different workloads.

> > > Do you have a more detailed description of the workload? Is it theoretically
> > > avoidable?
> > See https://lkml.org/lkml/2011/10/23/156. Using page migration or copyout
> > would solve the problems of this guy.
>
> Copying in the I/O layer should be better than page migration:
> 1) copying the page to an I/O kernel buffer won't involve the expensive TLB
>    IPIs that migration requires,
> 2) copying the page to an I/O kernel buffer won't cause page faults because
>    of migration entries being set,
> 3) migration has to copy too, so the cost on the memory bus is the same.
>
> So unless I'm missing something, page migration and pte/tlb mangling (I mean
> as in mm/migrate.c) is worse in every way than bounce buffering at the I/O
> layer if you notice the page can be modified while it's under I/O.
  Well, but the advantage of migration is that you need to do it only if the
page is redirtied while under IO. Copying to an I/O buffer would have to be
done for *all* pages because once we submit the bio, we cannot change
anything. So what will be cheaper depends on how often pages are redirtied
while under IO. This is rather rare because pages aren't flushed all that
often, so the effect of stable pages is not observable on throughput. But
you can certainly see it on max latency...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Andrea Arcangeli @ 2012-01-19 21:39 UTC
  To: Jan Kara; Cc: Mike Snitzer, linux-scsi, neilb, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J. Wong

On Thu, Jan 19, 2012 at 09:52:11PM +0100, Jan Kara wrote:
> anything. So what will be cheaper depends on how often pages are redirtied
> while under IO. This is rather rare because pages aren't flushed all that
> often, so the effect of stable pages is not observable on throughput. But
> you can certainly see it on max latency...

I see your point. A problem with migrate though is that the page must be
pinned by the I/O layer to prevent migration from freeing the page under
I/O; how else could it be safe to read from a freed page? And if the page is
pinned, migration won't work at all. See page_freeze_refs in
migrate_page_move_mapping. So the pinning issue would need to be handled
somehow. It's needed for example when there's an O_DIRECT read and the I/O
is going to the page: if the page is migrated in that case, we'd lose a part
of the I/O. Differentiating how many page pins are ok to be ignored by
migration won't be trivial, but it is probably possible to do.

Another way maybe would be to detect when there's too much re-dirtying of
pages in flight in a short amount of time, and to start the bounce buffering
and stop waiting until the re-dirtying stops, and then stop the bounce
buffering. But unlike migration, it can't prevent an initial burst of high
fault latency...
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Boaz Harrosh @ 2012-01-22 11:31 UTC
  To: Andrea Arcangeli; Cc: Jan Kara, Mike Snitzer, linux-scsi, dm-devel, linux-fsdevel, lsf-pc

On 01/19/2012 11:39 PM, Andrea Arcangeli wrote:
> On Thu, Jan 19, 2012 at 09:52:11PM +0100, Jan Kara wrote:
>> anything. So what will be cheaper depends on how often pages are redirtied
>> while under IO. This is rather rare because pages aren't flushed all that
>> often, so the effect of stable pages is not observable on throughput. But
>> you can certainly see it on max latency...
>
> I see your point. A problem with migrate though is that the page must be
> pinned by the I/O layer to prevent migration from freeing the page under
> I/O; how else could it be safe to read from a freed page? And if the page
> is pinned, migration won't work at all. See page_freeze_refs in
> migrate_page_move_mapping. So the pinning issue would need to be handled
> somehow. It's needed for example when there's an O_DIRECT read and the I/O
> is going to the page: if the page is migrated in that case, we'd lose a
> part of the I/O. Differentiating how many page pins are ok to be ignored
> by migration won't be trivial, but it is probably possible to do.
>
> Another way maybe would be to detect when there's too much re-dirtying of
> pages in flight in a short amount of time, and to start the bounce
> buffering and stop waiting until the re-dirtying stops, and then stop the
> bounce buffering. But unlike migration, it can't prevent an initial burst
> of high fault latency...

Or just change that RT program, which is, one - latency bound, but, two -
does unpredictable, statistically bad things to a memory-mapped file.

Can a memory-mapped-file writer have some control over the time of writeback
with data_sync or such, or is it purely: timer fired, kernel sees a dirty
page, starts a writeout? What if the application maps a portion of the file
at a time, and the kernel gets lazier on an actively memory-mapped region?
(That's what Windows NT does. It will never IO a mapped section unless in
OOM conditions. The application needs to map small sections and unmap to IO.
It's more of a direct-IO than mmap.)

In any case, if you are very latency sensitive, an mmap writeout is bad for
you. Not only because of this new problem, but because mmap writeout can
sync with tons of other things that are due to memory management (as
mentioned by Andrea). The best option for a latency-sensitive application is
asynchronous direct-IO, by far. Only with asynchronous direct-IO can you
have any real control over your latency. (I understand they used to have an
empirically observed latency bound, but that is just luck, not real
control.)

BTW: The application mentioned would probably not want its IO bounced at the
block layer; otherwise why would it use mmap, if not to prevent the copy
induced by buffered IO?

All that said, a mount option for ext4 (is ext4 used?) to revert to the old
behavior is the easiest solution. When we originally brought this up at LSF
my thought was that the block request queue should have some flag that says
need_stable_pages. If it is set by the likes of dm/md-raid,
iSCSI-with-data-signing, DIF-enabled devices and so on, and the FS does not
guarantee/want stable pages, then an IO bounce is set up. But if it is not
set, then the likes of ext4 need not bother.
Thanks
Boaz
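A rough sketch of the need_stable_pages idea Boaz floats above. No such
request-queue flag exists in the kernels this thread discusses; the bit value
and helper names below are invented for illustration. The point is only the
shape of the interface: drivers that truly need stable pages advertise it,
and everyone else skips the wait (or the bounce).

#include <linux/blkdev.h>

/*
 * Hypothetical -- this flag does not exist; the bit number and names are
 * illustrative.  Drivers that need stable pages (md/dm RAID with parity,
 * iSCSI with data digests, DIF-capable HBAs) would set it at queue setup,
 * and filesystems / the block layer would wait or bounce only when set.
 */
#define QUEUE_FLAG_STABLE_PAGES	20	/* assumed-free bit, illustration only */

static inline bool blk_queue_stable_pages_required(struct request_queue *q)
{
	return test_bit(QUEUE_FLAG_STABLE_PAGES, &q->queue_flags);
}

static void example_driver_init(struct request_queue *q)
{
	queue_flag_set_unlocked(QUEUE_FLAG_STABLE_PAGES, q);
}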
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jan Kara @ 2012-01-23 16:30 UTC
  To: Boaz Harrosh; Cc: Andrea Arcangeli, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Sun 22-01-12 13:31:38, Boaz Harrosh wrote:
> On 01/19/2012 11:39 PM, Andrea Arcangeli wrote:
> > On Thu, Jan 19, 2012 at 09:52:11PM +0100, Jan Kara wrote:
> >> anything. So what will be cheaper depends on how often pages are redirtied
> >> while under IO. This is rather rare because pages aren't flushed all that
> >> often, so the effect of stable pages is not observable on throughput. But
> >> you can certainly see it on max latency...
> >
> > I see your point. A problem with migrate though is that the page must be
> > pinned by the I/O layer to prevent migration from freeing the page under
> > I/O; how else could it be safe to read from a freed page? And if the page
> > is pinned, migration won't work at all. See page_freeze_refs in
> > migrate_page_move_mapping. So the pinning issue would need to be handled
> > somehow. It's needed for example when there's an O_DIRECT read and the I/O
> > is going to the page: if the page is migrated in that case, we'd lose a
> > part of the I/O. Differentiating how many page pins are ok to be ignored
> > by migration won't be trivial, but it is probably possible to do.
> >
> > Another way maybe would be to detect when there's too much re-dirtying of
> > pages in flight in a short amount of time, and to start the bounce
> > buffering and stop waiting until the re-dirtying stops, and then stop the
> > bounce buffering. But unlike migration, it can't prevent an initial burst
> > of high fault latency...
>
> Or just change that RT program, which is, one - latency bound, but, two -
> does unpredictable, statistically bad things to a memory-mapped file.
  Right. That's what I told the RT guy as well :) But he didn't like to hear
that because it meant more coding for him.

> Can a memory-mapped-file writer have some control over the time of
> writeback with data_sync or such, or is it purely: timer fired, kernel sees
> a dirty page, starts a writeout? What if the application maps a portion of
> the file at a time, and the kernel gets lazier on an actively memory-mapped
> region? (That's what Windows NT does. It will never IO a mapped section
> unless in OOM conditions. The application needs to map small sections and
> unmap to IO. It's more of a direct-IO than mmap.)
  You can always start writeback by sync_file_range() but you have no
guarantees about what writeback does. Also, if you need to redirty the page
permanently (e.g. it's the head of your transaction log), there's simply no
good time when it can be written when you also want stable pages.

> In any case, if you are very latency sensitive, an mmap writeout is bad for
> you. Not only because of this new problem, but because mmap writeout can
> sync with tons of other things that are due to memory management (as
> mentioned by Andrea). The best option for a latency-sensitive application
> is asynchronous direct-IO, by far. Only with asynchronous direct-IO can you
> have any real control over your latency.
> (I understand they used to have an empirically observed latency bound, but
> that is just luck, not real control.)
>
> BTW: The application mentioned would probably not want its IO bounced at
> the block layer; otherwise why would it use mmap, if not to prevent the
> copy induced by buffered IO?
  Yeah, I'm not sure why their design was as it was.

> All that said, a mount option for ext4 (is ext4 used?) to revert to the old
> behavior is the easiest solution. When we originally brought this up at LSF
> my thought was that the block request queue should have some flag that says
> need_stable_pages. If it is set by the likes of dm/md-raid,
> iSCSI-with-data-signing, DIF-enabled devices and so on, and the FS does not
> guarantee/want stable pages, then an IO bounce is set up. But if it is not
> set, then the likes of ext4 need not bother.
  There's no mount option. The behavior is on unconditionally. And so far I
have not seen enough people complain to introduce something like that -
automatic logic is a different thing, of course. That might be nice to have.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
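For reference, the sync_file_range() call Jan mentions, as an application
would issue it to kick off writeback of a dirty range without waiting for
completion (and, as he notes, without any guarantee about what writeback
ultimately does).  The file name, offset and length here are example values.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Start writeback of the first 1 MiB; does not wait for completion. */
	if (sync_file_range(fd, 0, 1 << 20, SYNC_FILE_RANGE_WRITE) < 0)
		perror("sync_file_range");
	return 0;
}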
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Boaz Harrosh @ 2012-01-22 12:21 UTC
  To: Jan Kara; +Cc: Mike Snitzer, linux-scsi, dm-devel, linux-fsdevel, lsf-pc

On 01/19/2012 11:46 AM, Jan Kara wrote:
>>
>> OK, that one is interesting. Because I'd imagine that the kernel would not
>> start write-out on a busily modified page.
>   So currently writeback doesn't use the fact of how busily a page is
> modified. After all, the whole mm has only two sorts of pages - active &
> inactive - which reflects how often a page is accessed but says nothing
> about how often it is dirtied. So we don't have this information in the
> kernel and it would be relatively (memory) expensive to keep it.
>

Don't we? What about the information used by the IO elevators per io-group?
Is it not collected at redirty time, or is it only recorded by the time a
bio is submitted? How does the io-elevator keep small IO behind a heavy
writer latency bound? We could use the reverse of that to avoid doing IO on
the "too soon" pages.

>> Some heavy modifying, then a single write. If it's not so then there is
>> already great inefficiency, just now exposed, but it was always there. The
>> "page-migrate" mentioned here will not help.
>   Yes, but I believe the RT guy doesn't redirty the page that often. It is
> just that if you have to meet certain latency criteria, you cannot afford a
> single case where you have to wait. And if you redirty pages, you are bound
> to hit the PageWriteback case sooner or later.
>

OK, thanks. I needed this overview. What you mean is that since the
writeback fires periodically, there must be times when the page or group of
pages is just in the middle of changing and the writeback catches only half
of the modification.

So what if we let the dirty data always wait out that writeback timeout: if
the pages are "too new" and memory conditions are fine, then postpone the
writeout to the next round (assuming we have that information from the
first part).

>> Could we not better our page write-out algorithms to avoid heavily
>> contended pages?
>   That's not so easy. Firstly, you'd have to track and keep that information
> somehow. Secondly, it is better to write out a busily dirtied page than to
> introduce a seek.

Sure, I'd say we just go on the timestamp of the first page in the group,
because I'd imagine that the application has changed that group of pages
roughly at the same time.

> Also the definition of 'busy' differs for different purposes.
> So to make this useful the logic won't be trivial.

I don't think so. 1st: IO the oldest data. 2nd: postpone the IO of "too new"
data. So any dirtying has some "aging time" before we attack it. The aging
time is very much related to your writeback timer (which is "the amount of
memory buffer you want to keep" divided by your writeout rate).

> Thirdly, the benefit is
> questionable anyway (at least for most realistic workloads) because the
> flusher thread doesn't write the pages all that often - when there are not
> many pages, we write them out just once every couple of seconds; when we
> have lots of dirty pages we cycle through all of them, so one page is not
> written that often.
>

Exactly, so let's make sure dirty data is always a "couple of seconds" old.
Don't let that timer sample data that has just been dirtied.

Which brings me to another subject, the second case: "when we have lots of
dirty pages".
I wish we could talk at LSF/MM about how to not do a dumb cycle over an sb's
inodes but do a time-sorted write-out. The writeout is always started from
the lowest addressed page (lowest page->index), so take the dirty time of
that page as the sorting factor of the inode. And maybe keep a
min-inode-dirty-time per SB to prioritize between SBs.

Because, you see, elevator-less filesystems - the non-block-dev BDIs like
NFS or exofs - have a problem: a heavy writer can easily totally starve a
slow IOer (read or write). I can easily demonstrate how an NFS heavy writer
starves a KDE desktop to a crawl.

We should be starting to think about IO fairness and interactivity at the
VFS layer, so as to not let every non-block FS solve its own problem all
over again.

>> Do you have a more detailed description of the workload? Is it
>> theoretically avoidable?
>   See https://lkml.org/lkml/2011/10/23/156. Using page migration or copyout
> would solve the problems of this guy.
>
> 								Honza

Thanks
Boaz
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jan Kara @ 2012-01-23 16:18 UTC
  To: Boaz Harrosh; Cc: Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Sun 22-01-12 14:21:51, Boaz Harrosh wrote:
> On 01/19/2012 11:46 AM, Jan Kara wrote:
> >>
> >> OK, that one is interesting. Because I'd imagine that the kernel would not
> >> start write-out on a busily modified page.
> >   So currently writeback doesn't use the fact of how busily a page is
> > modified. After all, the whole mm has only two sorts of pages - active &
> > inactive - which reflects how often a page is accessed but says nothing
> > about how often it is dirtied. So we don't have this information in the
> > kernel and it would be relatively (memory) expensive to keep it.
>
> Don't we? What about the information used by the IO elevators per io-group?
> Is it not collected at redirty time, or is it only recorded by the time a
> bio is submitted? How does the io-elevator keep small IO behind a heavy
> writer latency bound? We could use the reverse of that to avoid doing IO on
> the "too soon" pages.
  The IO elevator is at a rather different level. It only starts tracking
something once we have a struct request, so it knows nothing about
redirtying, or even about pages as such. Also prioritization works only at
the request granularity. Sure, big requests will take longer to complete,
but the maximum request size is relatively low (512k by default) so writing
a maximum-sized request isn't that much slower than writing 4k. So it works
OK in practice.

> >> Some heavy modifying, then a single write. If it's not so then there is
> >> already great inefficiency, just now exposed, but it was always there. The
> >> "page-migrate" mentioned here will not help.
> >   Yes, but I believe the RT guy doesn't redirty the page that often. It is
> > just that if you have to meet certain latency criteria, you cannot afford a
> > single case where you have to wait. And if you redirty pages, you are bound
> > to hit the PageWriteback case sooner or later.
>
> OK, thanks. I needed this overview. What you mean is that since the
> writeback fires periodically, there must be times when the page or group of
> pages is just in the middle of changing and the writeback catches only half
> of the modification.
>
> So what if we let the dirty data always wait out that writeback timeout: if
  What do you mean by writeback timeout?

> the pages are "too new" and memory conditions are fine, then postpone the
  And what do you mean by "too new"?

> writeout to the next round (assuming we have that information from the
> first part).
  Sorry, I don't understand your idea...

> >> Could we not better our page write-out algorithms to avoid heavily
> >> contended pages?
> >   That's not so easy. Firstly, you'd have to track and keep that information
> > somehow. Secondly, it is better to write out a busily dirtied page than to
> > introduce a seek.
>
> Sure, I'd say we just go on the timestamp of the first page in the group,
> because I'd imagine that the application has changed that group of pages
> roughly at the same time.
  We don't have a timestamp on a page. What we have is a timestamp on an
inode. Ideally that would be the time when the oldest dirty page in the
inode was dirtied. Practically, we cannot really keep that information
accurate (e.g. after writing just some dirty pages in an inode), so it is a
rather crude approximation of that.
> > Also the definition of 'busy' differs for different purposes.
> > So to make this useful the logic won't be trivial.
>
> I don't think so. 1st: IO the oldest data. 2nd: postpone the IO of "too new"
> data. So any dirtying has some "aging time" before we attack it. The aging
> time is very much related to your writeback timer (which is "the amount of
> memory buffer you want to keep" divided by your writeout rate).
  Again I repeat - you don't want to introduce a seek into your IO stream
only because a single page got dirtied too recently. For randomly written
files there's always some compromise between how linear you want the IO to
be and how much you want to reflect page aging. Currently we go for 'totally
linear', which is easier to do and generally better for throughput.

> > Thirdly, the benefit is
> > questionable anyway (at least for most realistic workloads) because the
> > flusher thread doesn't write the pages all that often - when there are not
> > many pages, we write them out just once every couple of seconds; when we
> > have lots of dirty pages we cycle through all of them, so one page is not
> > written that often.
>
> Exactly, so let's make sure dirty data is always a "couple of seconds" old.
> Don't let that timer sample data that has just been dirtied.
>
> Which brings me to another subject, the second case: "when we have lots of
> dirty pages". I wish we could talk at LSF/MM about how to not do a dumb
> cycle over an sb's inodes but do a time-sorted write-out. The writeout is
> always started from the lowest addressed page (lowest page->index), so take
> the dirty time of that page as the sorting factor of the inode. And maybe
> keep a min-inode-dirty-time per SB to prioritize between SBs.
  Boaz, we already do track inodes by dirty time and do writeback in that
order. Go read the code in fs/fs-writeback.c.

> Because, you see, elevator-less filesystems - the non-block-dev BDIs like
> NFS or exofs - have a problem: a heavy writer can easily totally starve a
> slow IOer (read or write). I can easily demonstrate how an NFS heavy writer
> starves a KDE desktop to a crawl.
  Currently, we rely on the IO scheduler to protect light writers / readers.
You are right that for non-block filesystems that is problematic because for
them it is easy for heavy writers to starve light readers. But that doesn't
seem like a problem of writeback but rather a problem of the NFS client or
exofs? Especially in the reader-vs-writer case writeback simply doesn't have
enough information and isn't the right place to solve your problems. And I
agree it would be stupid to duplicate the code in CFQ in several places, so
maybe you could lift some parts of it and generalize them enough that they
can be used by others.

> We should be starting to think about IO fairness and interactivity at the
> VFS layer, so as to not let every non-block FS solve its own problem all
> over again.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
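For readers following along, a heavily simplified paraphrase of the age
ordering Jan points to in fs/fs-writeback.c: dirty inodes carry a
dirtied_when timestamp and the flusher dispatches only those dirtied before
an expiry cutoff, oldest first.  The function below sketches that idea
(compare move_expired_inodes()); it is not the actual kernel code, which
additionally handles superblock grouping, list splicing and jiffies
wraparound.

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/jiffies.h>

/*
 * Sketch of the time-ordered queueing in fs/fs-writeback.c: b_dirty is kept
 * ordered by inode->dirtied_when, oldest first, so we can stop at the first
 * inode that is still too young and move the rest to the dispatch list.
 */
static void sketch_queue_expired(struct list_head *b_dirty,
				 struct list_head *b_io,
				 unsigned long older_than_this)
{
	struct inode *inode, *next;

	list_for_each_entry_safe(inode, next, b_dirty, i_wb_list) {
		if (time_after(inode->dirtied_when, older_than_this))
			break;			/* everything after is younger */
		list_move_tail(&inode->i_wb_list, b_io);
	}
}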
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Andrea Arcangeli @ 2012-01-23 17:53 UTC
  To: Jan Kara; Cc: Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote:
> request granularity. Sure, big requests will take longer to complete, but
> the maximum request size is relatively low (512k by default) so writing a
> maximum-sized request isn't that much slower than writing 4k. So it works
> OK in practice.

Totally unrelated to the writeback, but the merged big 512k requests
actually add some measurable I/O scheduler latency, and that in turn
slightly diminishes the fairness that cfq could provide with a smaller max
request size. Probably even more measurable with SSDs (but then SSDs are
even faster).
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jeff Moyer @ 2012-01-23 18:28 UTC
  To: Andrea Arcangeli; Cc: Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

Andrea Arcangeli <aarcange@redhat.com> writes:

> On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote:
>> request granularity. Sure, big requests will take longer to complete, but
>> the maximum request size is relatively low (512k by default) so writing a
>> maximum-sized request isn't that much slower than writing 4k. So it works
>> OK in practice.
>
> Totally unrelated to the writeback, but the merged big 512k requests
> actually add some measurable I/O scheduler latency, and that in turn
> slightly diminishes the fairness that cfq could provide with a smaller max
> request size. Probably even more measurable with SSDs (but then SSDs are
> even faster).

Are you speaking from experience? If so, what workloads were negatively
affected by merging, and how did you measure that?

Cheers,
Jeff
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Andrea Arcangeli @ 2012-01-23 18:56 UTC
  To: Jeff Moyer; Cc: Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote:
> Are you speaking from experience? If so, what workloads were negatively
> affected by merging, and how did you measure that?

Any workload where two processes compete for access to the same disk and one
process writes big requests (usually async writes), the other small ones
(usually sync reads). The one with the small 4k requests (usually reads)
gets some artificial latency if the big requests are 512k. Vivek did a
recent measurement to verify the issue is still there, and it's basically a
hardware issue. Software can't do much other than possibly reducing the max
request size when we notice such an I/O pattern coming in cfq. I did old
measurements - that's how I knew about it - but they were so ancient they're
worthless by now, which is why Vivek had to repeat them to verify the issue
before we could assume it still existed on recent hardware.

These days with cgroups it may be a bit more relevant, as max write
bandwidth may be secondary to latency/QoS.
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jeff Moyer @ 2012-01-23 19:19 UTC
  To: Andrea Arcangeli; Cc: Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

Andrea Arcangeli <aarcange@redhat.com> writes:

> On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote:
>> Are you speaking from experience? If so, what workloads were negatively
>> affected by merging, and how did you measure that?
>
> Any workload where two processes compete for access to the same disk and
> one process writes big requests (usually async writes), the other small
> ones (usually sync reads). The one with the small 4k requests (usually
> reads) gets some artificial latency if the big requests are 512k. Vivek
> did a recent measurement to verify the issue is still there, and it's
> basically a hardware issue. Software can't do much other than possibly
> reducing the max request size when we notice such an I/O pattern coming in
> cfq. I did old measurements - that's how I knew about it - but they were
> so ancient they're worthless by now, which is why Vivek had to repeat them
> to verify the issue before we could assume it still existed on recent
> hardware.
>
> These days with cgroups it may be a bit more relevant, as max write
> bandwidth may be secondary to latency/QoS.

Thanks, Vivek was able to point me at the old thread:
http://www.spinics.net/lists/linux-fsdevel/msg44191.html

Cheers,
Jeff
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Chris Mason @ 2012-01-24 15:15 UTC
  To: Jeff Moyer; Cc: Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote:
> Andrea Arcangeli <aarcange@redhat.com> writes:
>
> > On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote:
> >> request granularity. Sure, big requests will take longer to complete,
> >> but the maximum request size is relatively low (512k by default) so
> >> writing a maximum-sized request isn't that much slower than writing 4k.
> >> So it works OK in practice.
> >
> > Totally unrelated to the writeback, but the merged big 512k requests
> > actually add some measurable I/O scheduler latency, and that in turn
> > slightly diminishes the fairness that cfq could provide with a smaller
> > max request size. Probably even more measurable with SSDs (but then SSDs
> > are even faster).
>
> Are you speaking from experience? If so, what workloads were negatively
> affected by merging, and how did you measure that?

https://lkml.org/lkml/2011/12/13/326

This patch is another example, although for a slightly different reason. I
really have no idea yet what the right answer is in a generic sense, but you
don't need a 512K request to see higher latencies from merging.

-chris
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Christoph Hellwig @ 2012-01-24 16:56 UTC
  To: Chris Mason, Jeff Moyer, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
> https://lkml.org/lkml/2011/12/13/326
>
> This patch is another example, although for a slightly different reason. I
> really have no idea yet what the right answer is in a generic sense, but
> you don't need a 512K request to see higher latencies from merging.

That assumes the 512k requests are created by merging. We have enough
workloads that create large I/O from the get-go, and not splitting them and
eventually merging them again would be a big win. E.g. I'm currently looking
at a distributed block device which uses internal 4MB chunks, and increasing
the maximum request size to that dramatically increases the read
performance.
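To make "increasing the maximum request size" concrete: the limit being
raised here is the queue's max_hw_sectors/max_sectors, which the driver
advertises and which caps how large a single request may grow.  A sketch,
assuming a driver whose hardware can take 4 MB transfers; the function name
is illustrative.

#include <linux/blkdev.h>

/*
 * Sketch: a block driver that can handle 4 MiB requests advertises that
 * limit on its queue.  8192 sectors * 512 bytes = 4 MiB.  The admin-visible
 * cap, /sys/block/<dev>/queue/max_sectors_kb, can then be raised up to this
 * hardware limit.
 */
static void example_setup_queue_limits(struct request_queue *q)
{
	blk_queue_max_hw_sectors(q, 8192);
}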
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Andreas Dilger @ 2012-01-24 17:01 UTC
  To: Christoph Hellwig; Cc: Chris Mason, Jeff Moyer, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

Cheers, Andreas

On 2012-01-24, at 9:56, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
>> https://lkml.org/lkml/2011/12/13/326
>>
>> This patch is another example, although for a slightly different reason.
>> I really have no idea yet what the right answer is in a generic sense,
>> but you don't need a 512K request to see higher latencies from merging.
>
> That assumes the 512k requests are created by merging. We have enough
> workloads that create large I/O from the get-go, and not splitting them
> and eventually merging them again would be a big win. E.g. I'm currently
> looking at a distributed block device which uses internal 4MB chunks, and
> increasing the maximum request size to that dramatically increases the
> read performance.
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics
  From: Andrea Arcangeli @ 2012-01-24 17:06 UTC
  To: Christoph Hellwig; Cc: Chris Mason, Jeff Moyer, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Tue, Jan 24, 2012 at 11:56:31AM -0500, Christoph Hellwig wrote:
> That assumes the 512k requests are created by merging. We have enough
> workloads that create large I/O from the get-go, and not splitting them
> and eventually merging them again would be a big win. E.g. I'm currently
> looking at a distributed block device which uses internal 4MB chunks, and
> increasing the maximum request size to that dramatically increases the
> read performance.

Depends on the device though: if it's a normal disk, it likely only reduces
the number of DMA ops without increasing performance too much. Most disks
should reach platter speed at 64KB, so larger requests only save a bit of
CPU in interrupts and such. But I think nobody here was suggesting reducing
the request size by default.

cfq should easily notice when there are multiple queues being submitted to
in the same time range. A device, in addition to specifying the max request
DMA size it can handle, could specify the minimum size at which it reaches
platter speed, and cfq could degrade to that when there are multiple queues
running in parallel over the same millisecond or so. Reads will return to
the I/O queue almost immediately, but they'll be out for a little while
until the data is copied to userland. So it'd need to keep the request size
down to the minimum the device needs to reach platter speed for a little
while; then, if no other queue presents itself, it could double the request
size for each unit of time until it reaches the max again. Maybe that could
work, maybe not :). Waiting only once for a 4MB request sounds better than
waiting a 4MB's worth of time for every 4k metadata-seeking read.
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics
  From: Chris Mason @ 2012-01-24 17:08 UTC
  To: Christoph Hellwig; Cc: Jeff Moyer, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Tue, Jan 24, 2012 at 11:56:31AM -0500, Christoph Hellwig wrote:
> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
> > https://lkml.org/lkml/2011/12/13/326
> >
> > This patch is another example, although for a slightly different reason.
> > I really have no idea yet what the right answer is in a generic sense,
> > but you don't need a 512K request to see higher latencies from merging.
>
> That assumes the 512k requests are created by merging. We have enough
> workloads that create large I/O from the get-go, and not splitting them
> and eventually merging them again would be a big win. E.g. I'm currently
> looking at a distributed block device which uses internal 4MB chunks, and
> increasing the maximum request size to that dramatically increases the
> read performance.

Is this read latency or read throughput? If you're waiting on the whole 4MB
anyway, I'd expect one request to be better for both.

But Andrea's original question was on the impact of the big request on other
requests being serviced by the drive... there's really not much we can do
about that outside of more knobs for the admin.

-chris
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Andreas Dilger @ 2012-01-24 17:08 UTC
  To: Christoph Hellwig; Cc: Andrea Arcangeli, Jan Kara, linux-scsi, Mike Snitzer, dm-devel, Jeff Moyer, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason

On 2012-01-24, at 9:56, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
>> https://lkml.org/lkml/2011/12/13/326
>>
>> This patch is another example, although for a slightly different reason.
>> I really have no idea yet what the right answer is in a generic sense,
>> but you don't need a 512K request to see higher latencies from merging.
>
> That assumes the 512k requests are created by merging. We have enough
> workloads that create large I/O from the get-go, and not splitting them
> and eventually merging them again would be a big win. E.g. I'm currently
> looking at a distributed block device which uses internal 4MB chunks, and
> increasing the maximum request size to that dramatically increases the
> read performance.

(sorry about the last email, hit send by accident)

I don't think we can have a "one size fits all" policy here. In most RAID
devices the IO size needs to be at least 1MB, and with newer devices 4MB
gives better performance.

One of the reasons that Lustre used to hack so much around the VFS and VM
APIs is exactly to avoid the splitting of read/write requests into pages and
then depending on the elevator to reconstruct a good-sized IO out of them.

Things have gotten better with newer kernels, but there is still a ways to
go w.r.t. allowing large IO requests to pass unhindered through to disk (or
at least as far as ensuring that the IO is aligned to the underlying disk
geometry).

Cheers, Andreas
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jeff Moyer @ 2012-01-24 18:05 UTC
  To: Andreas Dilger; Cc: Christoph Hellwig, Chris Mason, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

Andreas Dilger <adilger@dilger.ca> writes:

> On 2012-01-24, at 9:56, Christoph Hellwig <hch@infradead.org> wrote:
>> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
>>> https://lkml.org/lkml/2011/12/13/326
>>>
>>> This patch is another example, although for a slightly different reason.
>>> I really have no idea yet what the right answer is in a generic sense,
>>> but you don't need a 512K request to see higher latencies from merging.
>>
>> That assumes the 512k requests are created by merging. We have enough
>> workloads that create large I/O from the get-go, and not splitting them
>> and eventually merging them again would be a big win. E.g. I'm currently
>> looking at a distributed block device which uses internal 4MB chunks, and
>> increasing the maximum request size to that dramatically increases the
>> read performance.
>
> (sorry about the last email, hit send by accident)
>
> I don't think we can have a "one size fits all" policy here. In most RAID
> devices the IO size needs to be at least 1MB, and with newer devices 4MB
> gives better performance.

Right, and there's more to it than just I/O size. There's access pattern,
and more importantly, workload and related requirements (latency vs
throughput).

> One of the reasons that Lustre used to hack so much around the VFS and VM
> APIs is exactly to avoid the splitting of read/write requests into pages
> and then depending on the elevator to reconstruct a good-sized IO out of
> them.
>
> Things have gotten better with newer kernels, but there is still a ways to
> go w.r.t. allowing large IO requests to pass unhindered through to disk
> (or at least as far as ensuring that the IO is aligned to the underlying
> disk geometry).

I've been wondering if it's gotten better, so I decided to run a few quick
tests.  Kernel version 3.2.0, storage: HP EVA FC array, I/O scheduler: cfq,
max_sectors_kb: 1024, test program: dd.

ext3:
- buffered writes and buffered O_SYNC writes, all 1MB block size, show 4k
  I/Os passed down to the I/O scheduler
- buffered 1MB reads are a little better, typically in the 128k-256k range
  when they hit the I/O scheduler

ext4:
- buffered writes: 512K I/Os show up at the elevator
- buffered O_SYNC writes: data is again 512KB, journal writes are 4K
- buffered 1MB reads get down to the scheduler in 128KB chunks

xfs:
- buffered writes: 1MB I/Os show up at the elevator
- buffered O_SYNC writes: 1MB I/Os
- buffered 1MB reads: 128KB chunks show up at the I/O scheduler

So, ext4 is doing better than ext3, but still not perfect. xfs is kicking
ass for writes, but reads are still split up.

Cheers,
Jeff
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Christoph Hellwig @ 2012-01-24 18:40 UTC
  To: Jeff Moyer; Cc: Andreas Dilger, Christoph Hellwig, Chris Mason, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Tue, Jan 24, 2012 at 01:05:50PM -0500, Jeff Moyer wrote:
> ext3:
> - buffered writes and buffered O_SYNC writes, all 1MB block size, show 4k
>   I/Os passed down to the I/O scheduler
> - buffered 1MB reads are a little better, typically in the 128k-256k range
>   when they hit the I/O scheduler
>
> ext4:
> - buffered writes: 512K I/Os show up at the elevator
> - buffered O_SYNC writes: data is again 512KB, journal writes are 4K
> - buffered 1MB reads get down to the scheduler in 128KB chunks
>
> xfs:
> - buffered writes: 1MB I/Os show up at the elevator
> - buffered O_SYNC writes: 1MB I/Os
> - buffered 1MB reads: 128KB chunks show up at the I/O scheduler
>
> So, ext4 is doing better than ext3, but still not perfect. xfs is kicking
> ass for writes, but reads are still split up.

All three filesystems use the generic mpages code for reads, so they all get
the same (bad) I/O patterns. Looks like we need to fix this up ASAP.
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 18:40 ` Christoph Hellwig @ 2012-01-24 19:07 ` Chris Mason 2012-01-24 19:14 ` Jeff Moyer 2012-01-24 19:11 ` [dm-devel] [Lsf-pc] " Jeff Moyer 1 sibling, 1 reply; 76+ messages in thread From: Chris Mason @ 2012-01-24 19:07 UTC (permalink / raw) To: Christoph Hellwig Cc: Jeff Moyer, Andreas Dilger, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi@vger.kernel.org, neilb@suse.de, dm-devel@redhat.com, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Darrick J.Wong On Tue, Jan 24, 2012 at 01:40:54PM -0500, Christoph Hellwig wrote: > On Tue, Jan 24, 2012 at 01:05:50PM -0500, Jeff Moyer wrote: > > - buffered writes and buffered O_SYNC writes, all 1MB block size show 4k > > I/Os passed down to the I/O scheduler > > - buffered 1MB reads are a little better, typically in the 128k-256k > > range when they hit the I/O scheduler. > > > > ext4: > > - buffered writes: 512K I/Os show up at the elevator > > - buffered O_SYNC writes: data is again 512KB, journal writes are 4K > > - buffered 1MB reads get down to the scheduler in 128KB chunks > > > > xfs: > > - buffered writes: 1MB I/Os show up at the elevator > > - buffered O_SYNC writes: 1MB I/Os > > - buffered 1MB reads: 128KB chunks show up at the I/O scheduler > > > > So, ext4 is doing better than ext3, but still not perfect. xfs is > > kicking ass for writes, but reads are still split up. > > All three filesystems use the generic mpages code for reads, so they > all get the same (bad) I/O patterns. Looks like we need to fix this up > ASAP. Can you easily run btrfs through the same rig? We don't use mpages and I'm curious. -chris ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 19:07 ` Chris Mason @ 2012-01-24 19:14 ` Jeff Moyer 2012-01-24 20:09 ` [Lsf-pc] [dm-devel] " Jan Kara 0 siblings, 1 reply; 76+ messages in thread From: Jeff Moyer @ 2012-01-24 19:14 UTC (permalink / raw) To: Chris Mason Cc: Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, Christoph Hellwig, dm-devel@redhat.com, fengguang.wu, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org Chris Mason <chris.mason@oracle.com> writes: >> All three filesystems use the generic mpages code for reads, so they >> all get the same (bad) I/O patterns. Looks like we need to fix this up >> ASAP. > > Can you easily run btrfs through the same rig? We don't use mpages and > I'm curious. The readahead code was to blame, here. I wonder if we can change the logic there to not break larger I/Os down into smaller sized ones. Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os, when 128KB is the read_ahead_kb value. Is there any heuristic you could apply to not break larger I/Os up like this? Does that make sense? Cheers, Jeff ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-24 19:14 ` Jeff Moyer @ 2012-01-24 20:09 ` Jan Kara 2012-01-24 20:13 ` [Lsf-pc] " Jeff Moyer 0 siblings, 1 reply; 76+ messages in thread From: Jan Kara @ 2012-01-24 20:09 UTC (permalink / raw) To: Jeff Moyer Cc: Chris Mason, Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, fengguang.wu, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Darrick J.Wong On Tue 24-01-12 14:14:14, Jeff Moyer wrote: > Chris Mason <chris.mason@oracle.com> writes: > > >> All three filesystems use the generic mpages code for reads, so they > >> all get the same (bad) I/O patterns. Looks like we need to fix this up > >> ASAP. > > > > Can you easily run btrfs through the same rig? We don't use mpages and > > I'm curious. > > The readahead code was to blame, here. I wonder if we can change the > logic there to not break larger I/Os down into smaller sized ones. > Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os, > when 128KB is the read_ahead_kb value. Is there any heuristic you could > apply to not break larger I/Os up like this? Does that make sense? Well, not breaking up I/Os would be fairly simple as ondemand_readahead() already knows how much do we want to read. We just trim the submitted I/O to read_ahead_kb artificially. And that is done so that you don't trash page cache (possibly evicting pages you have not yet copied to userspace) when there are several processes doing large reads. Maybe 128 KB is a too small default these days but OTOH noone prevents you from raising it (e.g. SLES uses 1 MB as a default). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 76+ messages in thread
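As an aside to Jan's point that nothing prevents raising the limit: the per-device value can be changed at runtime without any kernel change. The sketch below uses the long-standing BLKRAGET/BLKRASET ioctls (the same ones blockdev --getra/--setra use, in units of 512-byte sectors); writing to /sys/block/<dev>/queue/read_ahead_kb is equivalent. The device name is an example and BLKRASET requires CAP_SYS_ADMIN.

/*
 * Inspect and raise the per-device readahead window at runtime.
 * BLKRAGET/BLKRASET work in 512-byte sectors, so 2048 == 1 MB.
 */
#include <sys/ioctl.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/dev/sdb";     /* example device */
    unsigned long new_ra = 2048;      /* 1 MB expressed in 512-byte sectors */
    long ra;
    int fd = open(dev, O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(fd, BLKRAGET, &ra) == 0)
        printf("current readahead: %ld sectors (%ld KB)\n", ra, ra / 2);
    if (ioctl(fd, BLKRASET, new_ra) != 0)   /* needs CAP_SYS_ADMIN */
        perror("BLKRASET");
    close(fd);
    return 0;
}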
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 20:09 ` [Lsf-pc] [dm-devel] " Jan Kara @ 2012-01-24 20:13 ` Jeff Moyer 2012-01-24 20:39 ` [Lsf-pc] [dm-devel] " Jan Kara 0 siblings, 1 reply; 76+ messages in thread From: Jeff Moyer @ 2012-01-24 20:13 UTC (permalink / raw) To: Jan Kara Cc: Andreas Dilger, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, Christoph Hellwig, dm-devel@redhat.com, fengguang.wu, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason Jan Kara <jack@suse.cz> writes: > On Tue 24-01-12 14:14:14, Jeff Moyer wrote: >> Chris Mason <chris.mason@oracle.com> writes: >> >> >> All three filesystems use the generic mpages code for reads, so they >> >> all get the same (bad) I/O patterns. Looks like we need to fix this up >> >> ASAP. >> > >> > Can you easily run btrfs through the same rig? We don't use mpages and >> > I'm curious. >> >> The readahead code was to blame, here. I wonder if we can change the >> logic there to not break larger I/Os down into smaller sized ones. >> Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os, >> when 128KB is the read_ahead_kb value. Is there any heuristic you could >> apply to not break larger I/Os up like this? Does that make sense? > Well, not breaking up I/Os would be fairly simple as ondemand_readahead() > already knows how much do we want to read. We just trim the submitted I/O to > read_ahead_kb artificially. And that is done so that you don't trash page > cache (possibly evicting pages you have not yet copied to userspace) when > there are several processes doing large reads. Do you really think applications issue large reads and then don't use the data? I mean, I've seen some bad programming, so I can believe that would be the case. Still, I'd like to think it doesn't happen. ;-) > Maybe 128 KB is a too small default these days but OTOH noone prevents you > from raising it (e.g. SLES uses 1 MB as a default). For some reason, I thought it had been bumped to 512KB by default. Must be that overactive imagination I have... Anyway, if all of the distros start bumping the default, don't you think it's time to consider bumping it upstream, too? I thought there was a lot of work put into not being too aggressive on readahead, so the downside of having a larger read_ahead_kb setting was fairly small. Cheers, Jeff ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-24 20:13 ` [Lsf-pc] " Jeff Moyer @ 2012-01-24 20:39 ` Jan Kara 2012-01-24 20:59 ` Jeff Moyer 2012-01-25 3:29 ` Wu Fengguang 0 siblings, 2 replies; 76+ messages in thread From: Jan Kara @ 2012-01-24 20:39 UTC (permalink / raw) To: Jeff Moyer Cc: Jan Kara, Andreas Dilger, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, fengguang.wu, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > Jan Kara <jack@suse.cz> writes: > > > On Tue 24-01-12 14:14:14, Jeff Moyer wrote: > >> Chris Mason <chris.mason@oracle.com> writes: > >> > >> >> All three filesystems use the generic mpages code for reads, so they > >> >> all get the same (bad) I/O patterns. Looks like we need to fix this up > >> >> ASAP. > >> > > >> > Can you easily run btrfs through the same rig? We don't use mpages and > >> > I'm curious. > >> > >> The readahead code was to blame, here. I wonder if we can change the > >> logic there to not break larger I/Os down into smaller sized ones. > >> Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os, > >> when 128KB is the read_ahead_kb value. Is there any heuristic you could > >> apply to not break larger I/Os up like this? Does that make sense? > > Well, not breaking up I/Os would be fairly simple as ondemand_readahead() > > already knows how much do we want to read. We just trim the submitted I/O to > > read_ahead_kb artificially. And that is done so that you don't trash page > > cache (possibly evicting pages you have not yet copied to userspace) when > > there are several processes doing large reads. > > Do you really think applications issue large reads and then don't use > the data? I mean, I've seen some bad programming, so I can believe that > would be the case. Still, I'd like to think it doesn't happen. ;-) No, I meant a cache thrashing problem. Suppose that we always readahead as much as user asks and there are say 100 processes each wanting to read 4 MB. Then you need to find 400 MB in the page cache so that all reads can fit. And if you don't have them, reads for process 50 may evict pages we already preread for process 1, but process one didn't yet get to CPU to copy the data to userspace buffer. So the read becomes wasted. > > Maybe 128 KB is a too small default these days but OTOH noone prevents you > > from raising it (e.g. SLES uses 1 MB as a default). > > For some reason, I thought it had been bumped to 512KB by default. Must > be that overactive imagination I have... Anyway, if all of the distros > start bumping the default, don't you think it's time to consider bumping > it upstream, too? I thought there was a lot of work put into not being > too aggressive on readahead, so the downside of having a larger > read_ahead_kb setting was fairly small. Yeah, I believe 512KB should be pretty safe these days except for embedded world. OTOH average desktop user doesn't really care so it's mostly servers with beefy storage that care... (note that I wrote we raised the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise distro)). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-24 20:39 ` [Lsf-pc] [dm-devel] " Jan Kara @ 2012-01-24 20:59 ` Jeff Moyer 2012-01-24 21:08 ` Jan Kara 2012-01-25 3:29 ` Wu Fengguang 1 sibling, 1 reply; 76+ messages in thread From: Jeff Moyer @ 2012-01-24 20:59 UTC (permalink / raw) To: Jan Kara Cc: Andreas Dilger, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, fengguang.wu, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong Jan Kara <jack@suse.cz> writes: > On Tue 24-01-12 15:13:40, Jeff Moyer wrote: >> Jan Kara <jack@suse.cz> writes: >> >> > On Tue 24-01-12 14:14:14, Jeff Moyer wrote: >> >> Chris Mason <chris.mason@oracle.com> writes: >> >> >> >> >> All three filesystems use the generic mpages code for reads, so they >> >> >> all get the same (bad) I/O patterns. Looks like we need to fix this up >> >> >> ASAP. >> >> > >> >> > Can you easily run btrfs through the same rig? We don't use mpages and >> >> > I'm curious. >> >> >> >> The readahead code was to blame, here. I wonder if we can change the >> >> logic there to not break larger I/Os down into smaller sized ones. >> >> Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os, >> >> when 128KB is the read_ahead_kb value. Is there any heuristic you could >> >> apply to not break larger I/Os up like this? Does that make sense? >> > Well, not breaking up I/Os would be fairly simple as ondemand_readahead() >> > already knows how much do we want to read. We just trim the submitted I/O to >> > read_ahead_kb artificially. And that is done so that you don't trash page >> > cache (possibly evicting pages you have not yet copied to userspace) when >> > there are several processes doing large reads. >> >> Do you really think applications issue large reads and then don't use >> the data? I mean, I've seen some bad programming, so I can believe that >> would be the case. Still, I'd like to think it doesn't happen. ;-) > No, I meant a cache thrashing problem. Suppose that we always readahead > as much as user asks and there are say 100 processes each wanting to read 4 > MB. Then you need to find 400 MB in the page cache so that all reads can > fit. And if you don't have them, reads for process 50 may evict pages we > already preread for process 1, but process one didn't yet get to CPU to > copy the data to userspace buffer. So the read becomes wasted. Yeah, you're right, cache thrashing is an issue. In my tests, I didn't actually see the *initial* read come through as a full 1MB I/O, though. That seems odd to me. >> > Maybe 128 KB is a too small default these days but OTOH noone prevents you >> > from raising it (e.g. SLES uses 1 MB as a default). >> >> For some reason, I thought it had been bumped to 512KB by default. Must >> be that overactive imagination I have... Anyway, if all of the distros >> start bumping the default, don't you think it's time to consider bumping >> it upstream, too? I thought there was a lot of work put into not being >> too aggressive on readahead, so the downside of having a larger >> read_ahead_kb setting was fairly small. > Yeah, I believe 512KB should be pretty safe these days except for > embedded world. OTOH average desktop user doesn't really care so it's > mostly servers with beefy storage that care... (note that I wrote we raised > the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > distro)). Fair enough. 
Cheers, Jeff ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-24 20:59 ` Jeff Moyer @ 2012-01-24 21:08 ` Jan Kara 0 siblings, 0 replies; 76+ messages in thread From: Jan Kara @ 2012-01-24 21:08 UTC (permalink / raw) To: Jeff Moyer Cc: Jan Kara, Andreas Dilger, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, fengguang.wu, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Tue 24-01-12 15:59:02, Jeff Moyer wrote: > Jan Kara <jack@suse.cz> writes: > > On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > >> Jan Kara <jack@suse.cz> writes: > >> > >> > On Tue 24-01-12 14:14:14, Jeff Moyer wrote: > >> >> Chris Mason <chris.mason@oracle.com> writes: > >> >> > >> >> >> All three filesystems use the generic mpages code for reads, so they > >> >> >> all get the same (bad) I/O patterns. Looks like we need to fix this up > >> >> >> ASAP. > >> >> > > >> >> > Can you easily run btrfs through the same rig? We don't use mpages and > >> >> > I'm curious. > >> >> > >> >> The readahead code was to blame, here. I wonder if we can change the > >> >> logic there to not break larger I/Os down into smaller sized ones. > >> >> Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os, > >> >> when 128KB is the read_ahead_kb value. Is there any heuristic you could > >> >> apply to not break larger I/Os up like this? Does that make sense? > >> > Well, not breaking up I/Os would be fairly simple as ondemand_readahead() > >> > already knows how much do we want to read. We just trim the submitted I/O to > >> > read_ahead_kb artificially. And that is done so that you don't trash page > >> > cache (possibly evicting pages you have not yet copied to userspace) when > >> > there are several processes doing large reads. > >> > >> Do you really think applications issue large reads and then don't use > >> the data? I mean, I've seen some bad programming, so I can believe that > >> would be the case. Still, I'd like to think it doesn't happen. ;-) > > No, I meant a cache thrashing problem. Suppose that we always readahead > > as much as user asks and there are say 100 processes each wanting to read 4 > > MB. Then you need to find 400 MB in the page cache so that all reads can > > fit. And if you don't have them, reads for process 50 may evict pages we > > already preread for process 1, but process one didn't yet get to CPU to > > copy the data to userspace buffer. So the read becomes wasted. > > Yeah, you're right, cache thrashing is an issue. In my tests, I didn't > actually see the *initial* read come through as a full 1MB I/O, though. > That seems odd to me. At first sight yes. But buffered reading internally works page-by-page so what it does is that it looks at the first page it wants, sees we don't have that in memory, so we submit readahead (hence 128 KB request) and then wait for that page to become uptodate. Then, when we are coming to the end of preread window (trip over marked page), we submit another chunk of readahead... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 76+ messages in thread
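One practical consequence of the mechanism Jan describes: an application that knows it is about to stream a large range does not have to wait for the window to ramp up page by page; it can hint the kernel explicitly. A minimal sketch follows, assuming a hypothetical file path, using posix_fadvise(POSIX_FADV_SEQUENTIAL) and the readahead(2) syscall; how much is actually read ahead remains subject to the kernel's own limits.

/* Hint the kernel before streaming a large file sequentially. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/test/bigfile";   /* example path */
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* declare a sequential scan; the kernel may enlarge the readahead window */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    /* explicitly pull the first 4 MB into the page cache before reading */
    if (readahead(fd, 0, 4 * 1024 * 1024) != 0)
        perror("readahead");
    /* ... normal read() loop would follow ... */
    close(fd);
    return 0;
}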
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-24 20:39 ` [Lsf-pc] [dm-devel] " Jan Kara 2012-01-24 20:59 ` Jeff Moyer @ 2012-01-25 3:29 ` Wu Fengguang 2012-01-25 6:15 ` [Lsf-pc] " Andreas Dilger 1 sibling, 1 reply; 76+ messages in thread From: Wu Fengguang @ 2012-01-25 3:29 UTC (permalink / raw) To: Jan Kara Cc: Jeff Moyer, Andreas Dilger, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > On Tue 24-01-12 15:13:40, Jeff Moyer wrote: [snip] > > > Maybe 128 KB is a too small default these days but OTOH noone prevents you > > > from raising it (e.g. SLES uses 1 MB as a default). > > > > For some reason, I thought it had been bumped to 512KB by default. Must > > be that overactive imagination I have... Anyway, if all of the distros > > start bumping the default, don't you think it's time to consider bumping > > it upstream, too? I thought there was a lot of work put into not being > > too aggressive on readahead, so the downside of having a larger > > read_ahead_kb setting was fairly small. > Yeah, I believe 512KB should be pretty safe these days except for > embedded world. OTOH average desktop user doesn't really care so it's > mostly servers with beefy storage that care... (note that I wrote we raised > the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > distro)). Maybe we don't need to care much about the embedded world when raising the default readahead size? Because even the current 128KB is too much for them, and I see Android setting the readahead size to 4KB... Some time ago I posted a series for raising the default readahead size to 512KB. But I'm open to use 1MB now (shall we vote on it?). Thanks, Fengguang ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-25 3:29 ` Wu Fengguang @ 2012-01-25 6:15 ` Andreas Dilger 2012-01-25 6:35 ` [Lsf-pc] [dm-devel] " Wu Fengguang 2012-01-25 14:33 ` Steven Whitehouse 0 siblings, 2 replies; 76+ messages in thread From: Andreas Dilger @ 2012-01-25 6:15 UTC (permalink / raw) To: Wu Fengguang Cc: Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you >>>> from raising it (e.g. SLES uses 1 MB as a default). >>> >>> For some reason, I thought it had been bumped to 512KB by default. Must >>> be that overactive imagination I have... Anyway, if all of the distros >>> start bumping the default, don't you think it's time to consider bumping >>> it upstream, too? I thought there was a lot of work put into not being >>> too aggressive on readahead, so the downside of having a larger >>> read_ahead_kb setting was fairly small. >> >> Yeah, I believe 512KB should be pretty safe these days except for >> embedded world. OTOH average desktop user doesn't really care so it's >> mostly servers with beefy storage that care... (note that I wrote we raised >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise >> distro)). > > Maybe we don't need to care much about the embedded world when raising > the default readahead size? Because even the current 128KB is too much > for them, and I see Android setting the readahead size to 4KB... > > Some time ago I posted a series for raising the default readahead size > to 512KB. But I'm open to use 1MB now (shall we vote on it?). I'm all in favour of 1MB (aligned) readahead. I think the embedded folks already set enough CONFIG opts that we could trigger on one of those (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. It would also be possible to trigger on the size of the device so that the 32MB USB stick doesn't sit busy for a minute with readahead that is useless. Cheers, Andreas ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 6:15 ` [Lsf-pc] " Andreas Dilger @ 2012-01-25 6:35 ` Wu Fengguang 2012-01-25 14:00 ` Jan Kara 2012-01-26 16:25 ` Vivek Goyal 2012-01-25 14:33 ` Steven Whitehouse 1 sibling, 2 replies; 76+ messages in thread From: Wu Fengguang @ 2012-01-25 6:35 UTC (permalink / raw) To: Andreas Dilger Cc: Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Tue, Jan 24, 2012 at 11:15:13PM -0700, Andreas Dilger wrote: > On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you > >>>> from raising it (e.g. SLES uses 1 MB as a default). > >>> > >>> For some reason, I thought it had been bumped to 512KB by default. Must > >>> be that overactive imagination I have... Anyway, if all of the distros > >>> start bumping the default, don't you think it's time to consider bumping > >>> it upstream, too? I thought there was a lot of work put into not being > >>> too aggressive on readahead, so the downside of having a larger > >>> read_ahead_kb setting was fairly small. > >> > >> Yeah, I believe 512KB should be pretty safe these days except for > >> embedded world. OTOH average desktop user doesn't really care so it's > >> mostly servers with beefy storage that care... (note that I wrote we raised > >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > >> distro)). > > > > Maybe we don't need to care much about the embedded world when raising > > the default readahead size? Because even the current 128KB is too much > > for them, and I see Android setting the readahead size to 4KB... > > > > Some time ago I posted a series for raising the default readahead size > > to 512KB. But I'm open to use 1MB now (shall we vote on it?). > > I'm all in favour of 1MB (aligned) readahead. 1MB readahead aligned to i*1MB boundaries? I like this idea. It will work well if the filesystems employ the same alignment rule for large files. > I think the embedded folks > already set enough CONFIG opts that we could trigger on one of those > (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. Good point. We could add a configurable CONFIG_READAHEAD_KB=128 when CONFIG_EMBEDDED is selected. > It would also be > possible to trigger on the size of the device so that the 32MB USB stick > doesn't sit busy for a minute with readahead that is useless. Yeah, I do have a patch for shrinking readahead size based on device size. Thanks, Fengguang ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 6:35 ` [Lsf-pc] [dm-devel] " Wu Fengguang @ 2012-01-25 14:00 ` Jan Kara 2012-01-26 12:29 ` Andreas Dilger 2012-01-26 16:25 ` Vivek Goyal 1 sibling, 1 reply; 76+ messages in thread From: Jan Kara @ 2012-01-25 14:00 UTC (permalink / raw) To: Wu Fengguang Cc: Andreas Dilger, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Wed 25-01-12 14:35:52, Wu Fengguang wrote: > On Tue, Jan 24, 2012 at 11:15:13PM -0700, Andreas Dilger wrote: > > On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > > > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > > >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > > >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you > > >>>> from raising it (e.g. SLES uses 1 MB as a default). > > >>> > > >>> For some reason, I thought it had been bumped to 512KB by default. Must > > >>> be that overactive imagination I have... Anyway, if all of the distros > > >>> start bumping the default, don't you think it's time to consider bumping > > >>> it upstream, too? I thought there was a lot of work put into not being > > >>> too aggressive on readahead, so the downside of having a larger > > >>> read_ahead_kb setting was fairly small. > > >> > > >> Yeah, I believe 512KB should be pretty safe these days except for > > >> embedded world. OTOH average desktop user doesn't really care so it's > > >> mostly servers with beefy storage that care... (note that I wrote we raised > > >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > > >> distro)). > > > > > > Maybe we don't need to care much about the embedded world when raising > > > the default readahead size? Because even the current 128KB is too much > > > for them, and I see Android setting the readahead size to 4KB... > > > > > > Some time ago I posted a series for raising the default readahead size > > > to 512KB. But I'm open to use 1MB now (shall we vote on it?). > > > > I'm all in favour of 1MB (aligned) readahead. > > 1MB readahead aligned to i*1MB boundaries? I like this idea. It will > work well if the filesystems employ the same alignment rule for large > files. Yeah. Clever filesystems (e.g. XFS) can be configured to align files e.g. to raid stripes AFAIK so for them this could be worthwhile. > > I think the embedded folks > > already set enough CONFIG opts that we could trigger on one of those > > (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. > > Good point. We could add a configurable CONFIG_READAHEAD_KB=128 when > CONFIG_EMBEDDED is selected. Sounds good. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 14:00 ` Jan Kara @ 2012-01-26 12:29 ` Andreas Dilger 2012-01-27 17:03 ` Ted Ts'o 0 siblings, 1 reply; 76+ messages in thread From: Andreas Dilger @ 2012-01-26 12:29 UTC (permalink / raw) To: Wu Fengguang Cc: Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb@suse.de Brown, Christoph Hellwig, dm-devel@redhat.com development, Boaz Harrosh, linux-fsdevel@vger.kernel.org Devel, lsf-pc, Chris Mason, Darrick J.Wong On 2012-01-25, at 7:00 AM, Jan Kara wrote: > On Wed 25-01-12 14:35:52, Wu Fengguang wrote: >> On Tue, Jan 24, 2012 at 11:15:13PM -0700, Andreas Dilger wrote: >>> I'm all in favour of 1MB (aligned) readahead. >> >> 1MB readahead aligned to i*1MB boundaries? I like this idea. It will >> work well if the filesystems employ the same alignment rule for large >> files. > > Yeah. Clever filesystems (e.g. XFS) can be configured to align files e.g. > to raid stripes AFAIK so for them this could be worthwhile. Ext4 will also align IO to 1MB boundaries (from the start of LUN/partition) by default. If the mke2fs code detects the underlying RAID geometry (or the sysadmin sets this manually with tune2fs) it will store this in the superblock for the allocator to pick a better alignment. Cheers, Andreas ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 12:29 ` Andreas Dilger @ 2012-01-27 17:03 ` Ted Ts'o 0 siblings, 0 replies; 76+ messages in thread From: Ted Ts'o @ 2012-01-27 17:03 UTC (permalink / raw) To: Andreas Dilger Cc: Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb@suse.de Brown, Christoph Hellwig, dm-devel@redhat.com development, Boaz Harrosh, linux-fsdevel@vger.kernel.org Devel, lsf-pc, Chris Mason, Darrick J.Wong On Thu, Jan 26, 2012 at 05:29:03AM -0700, Andreas Dilger wrote: > > Ext4 will also align IO to 1MB boundaries (from the start of > LUN/partition) by default. If the mke2fs code detects the > underlying RAID geometry (or the sysadmin sets this manually with > tune2fs) it will store this in the superblock for the allocator to > pick a better alignment. (Still in Hawaii on vacation, but picked this up while I was quickly scanning through e-mail.) This is true only if you're using the special (non-upstream'ed) Lustre interfaces for writing Lustre objects. The writepages interface doesn't have all of the necessary smarts to do the right thing. It's been on my todo list to look at, but I've been mostly concentrated on single disk file systems since that's what we use at Google. (GFS can scale to many many file systems and servers, and avoiding RAID means fast FSCK recoveries, simplifying things since we don't have to worry about RAID-related failures, etc.) Eventually I'd like ext4 to handle RAID better, but unless you're forced to support really large files, I've come around to believing that n=3 replication or Reed-Solomon encoding across multiple servers is a much better way of achieving data robustness, so it's just not been high on my list of priorities. I'm much more interested in making sure ext4 works well under high memory pressure, and other cloud-related issues. - Ted ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 6:35 ` [Lsf-pc] [dm-devel] " Wu Fengguang 2012-01-25 14:00 ` Jan Kara @ 2012-01-26 16:25 ` Vivek Goyal 2012-01-26 20:37 ` Jan Kara 2012-01-26 22:34 ` Dave Chinner 1 sibling, 2 replies; 76+ messages in thread From: Vivek Goyal @ 2012-01-26 16:25 UTC (permalink / raw) To: Wu Fengguang Cc: Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Wed, Jan 25, 2012 at 02:35:52PM +0800, Wu Fengguang wrote: > On Tue, Jan 24, 2012 at 11:15:13PM -0700, Andreas Dilger wrote: > > On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > > > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > > >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > > >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you > > >>>> from raising it (e.g. SLES uses 1 MB as a default). > > >>> > > >>> For some reason, I thought it had been bumped to 512KB by default. Must > > >>> be that overactive imagination I have... Anyway, if all of the distros > > >>> start bumping the default, don't you think it's time to consider bumping > > >>> it upstream, too? I thought there was a lot of work put into not being > > >>> too aggressive on readahead, so the downside of having a larger > > >>> read_ahead_kb setting was fairly small. > > >> > > >> Yeah, I believe 512KB should be pretty safe these days except for > > >> embedded world. OTOH average desktop user doesn't really care so it's > > >> mostly servers with beefy storage that care... (note that I wrote we raised > > >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > > >> distro)). > > > > > > Maybe we don't need to care much about the embedded world when raising > > > the default readahead size? Because even the current 128KB is too much > > > for them, and I see Android setting the readahead size to 4KB... > > > > > > Some time ago I posted a series for raising the default readahead size > > > to 512KB. But I'm open to use 1MB now (shall we vote on it?). > > > > I'm all in favour of 1MB (aligned) readahead. > > 1MB readahead aligned to i*1MB boundaries? I like this idea. It will > work well if the filesystems employ the same alignment rule for large > files. > > > I think the embedded folks > > already set enough CONFIG opts that we could trigger on one of those > > (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. > > Good point. We could add a configurable CONFIG_READAHEAD_KB=128 when > CONFIG_EMBEDDED is selected. > > > It would also be > > possible to trigger on the size of the device so that the 32MB USB stick > > doesn't sit busy for a minute with readahead that is useless. > > Yeah, I do have a patch for shrinking readahead size based on device size. Should it be a udev rule to change read_ahead_kb on device based on device size, instead of a kernel patch? This is assuming device size is a good way to determine read ahead window size. I would guess that device speed should also matter though isn't it. If device is small but fast then it is probably ok to have larger read ahead window and vice versa. Thanks Vivek ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 16:25 ` Vivek Goyal @ 2012-01-26 20:37 ` Jan Kara 2012-01-26 22:34 ` Dave Chinner 1 sibling, 0 replies; 76+ messages in thread From: Jan Kara @ 2012-01-26 20:37 UTC (permalink / raw) To: Vivek Goyal Cc: Wu Fengguang, Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Jeff Moyer, Christoph Hellwig, dm-devel@redhat.com, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Thu 26-01-12 11:25:56, Vivek Goyal wrote: > On Wed, Jan 25, 2012 at 02:35:52PM +0800, Wu Fengguang wrote: > > On Tue, Jan 24, 2012 at 11:15:13PM -0700, Andreas Dilger wrote: > > > On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > > > > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > > > >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > > > >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you > > > >>>> from raising it (e.g. SLES uses 1 MB as a default). > > > >>> > > > >>> For some reason, I thought it had been bumped to 512KB by default. Must > > > >>> be that overactive imagination I have... Anyway, if all of the distros > > > >>> start bumping the default, don't you think it's time to consider bumping > > > >>> it upstream, too? I thought there was a lot of work put into not being > > > >>> too aggressive on readahead, so the downside of having a larger > > > >>> read_ahead_kb setting was fairly small. > > > >> > > > >> Yeah, I believe 512KB should be pretty safe these days except for > > > >> embedded world. OTOH average desktop user doesn't really care so it's > > > >> mostly servers with beefy storage that care... (note that I wrote we raised > > > >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > > > >> distro)). > > > > > > > > Maybe we don't need to care much about the embedded world when raising > > > > the default readahead size? Because even the current 128KB is too much > > > > for them, and I see Android setting the readahead size to 4KB... > > > > > > > > Some time ago I posted a series for raising the default readahead size > > > > to 512KB. But I'm open to use 1MB now (shall we vote on it?). > > > > > > I'm all in favour of 1MB (aligned) readahead. > > > > 1MB readahead aligned to i*1MB boundaries? I like this idea. It will > > work well if the filesystems employ the same alignment rule for large > > files. > > > > > I think the embedded folks > > > already set enough CONFIG opts that we could trigger on one of those > > > (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. > > > > Good point. We could add a configurable CONFIG_READAHEAD_KB=128 when > > CONFIG_EMBEDDED is selected. > > > > > It would also be > > > possible to trigger on the size of the device so that the 32MB USB stick > > > doesn't sit busy for a minute with readahead that is useless. > > > > Yeah, I do have a patch for shrinking readahead size based on device size. > > Should it be a udev rule to change read_ahead_kb on device based on device > size, instead of a kernel patch? Yes, we talked about that and I think having the logic in udev rule is easier. Just if we decided the logic should use a lot of kernel internal state, then it's better to have it in kernel. > This is assuming device size is a good way to determine read ahead window > size. I would guess that device speed should also matter though isn't it. 
> If device is small but fast then it is probably ok to have larger read ahead > window and vice versa. Yes, but speed is harder to measure than size ;) Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 16:25 ` Vivek Goyal 2012-01-26 20:37 ` Jan Kara @ 2012-01-26 22:34 ` Dave Chinner 2012-01-27 3:27 ` Wu Fengguang 1 sibling, 1 reply; 76+ messages in thread From: Dave Chinner @ 2012-01-26 22:34 UTC (permalink / raw) To: Vivek Goyal Cc: Wu Fengguang, Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Thu, Jan 26, 2012 at 11:25:56AM -0500, Vivek Goyal wrote: > On Wed, Jan 25, 2012 at 02:35:52PM +0800, Wu Fengguang wrote: > > > It would also be > > > possible to trigger on the size of the device so that the 32MB USB stick > > > doesn't sit busy for a minute with readahead that is useless. > > > > Yeah, I do have a patch for shrinking readahead size based on device size. > > Should it be a udev rule to change read_ahead_kb on device based on device > size, instead of a kernel patch? That's effectively what vendors like SGI have been doing since udev was first introduced, though more often the rules are based on device type rather than size. e.g. a 64GB device might be a USB flash drive now, but a 40GB device might be a really fast SSD.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 76+ messages in thread
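To make the udev-rule idea concrete, a hypothetical helper along these lines could be run from a rule (e.g. via RUN+=) to scale read_ahead_kb with device size; the thresholds below are illustrative only, not a recommendation, and as Dave points out device type is often a better key than raw size.

/*
 * Hypothetical policy helper a udev rule could invoke: pick read_ahead_kb
 * from the device size. Thresholds are illustrative only.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
    char path[256];
    unsigned long long sectors;     /* /sys/block/<dev>/size is in 512-byte units */
    unsigned int ra_kb;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <blockdev name, e.g. sdb>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/sys/block/%s/size", argv[1]);
    f = fopen(path, "r");
    if (!f || fscanf(f, "%llu", &sectors) != 1) {
        perror(path);
        return 1;
    }
    fclose(f);

    if (sectors < 2ULL * 1024 * 1024)           /* < 1 GB: tiny USB stick etc. */
        ra_kb = 128;
    else if (sectors < 200ULL * 1024 * 1024)    /* < ~100 GB */
        ra_kb = 512;
    else
        ra_kb = 1024;

    snprintf(path, sizeof(path), "/sys/block/%s/queue/read_ahead_kb", argv[1]);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "%u\n", ra_kb);
    fclose(f);
    return 0;
}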
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 22:34 ` Dave Chinner @ 2012-01-27 3:27 ` Wu Fengguang 2012-01-27 5:25 ` Andreas Dilger 0 siblings, 1 reply; 76+ messages in thread From: Wu Fengguang @ 2012-01-27 3:27 UTC (permalink / raw) To: Dave Chinner Cc: Vivek Goyal, Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Fri, Jan 27, 2012 at 09:34:49AM +1100, Dave Chinner wrote: > On Thu, Jan 26, 2012 at 11:25:56AM -0500, Vivek Goyal wrote: > > On Wed, Jan 25, 2012 at 02:35:52PM +0800, Wu Fengguang wrote: > > > > It would also be > > > > possible to trigger on the size of the device so that the 32MB USB stick > > > > doesn't sit busy for a minute with readahead that is useless. > > > > > > Yeah, I do have a patch for shrinking readahead size based on device size. > > > > Should it be a udev rule to change read_ahead_kb on device based on device > > size, instead of a kernel patch? > > That's effectively what vendors like SGI have been doing since udev > was first introduced, though more often the rules are based on device > type rather than size. e.g. a 64GB device might be a USB flash drive > now, but a 40GB device might be a really fast SSD.... Fair enough. I'll drop this kernel policy patch block: limit default readahead size for small devices https://lkml.org/lkml/2011/12/19/89 Thanks, Fengguang ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-27 3:27 ` Wu Fengguang @ 2012-01-27 5:25 ` Andreas Dilger 2012-01-27 7:53 ` Wu Fengguang 0 siblings, 1 reply; 76+ messages in thread From: Andreas Dilger @ 2012-01-27 5:25 UTC (permalink / raw) To: Wu Fengguang Cc: Dave Chinner, Vivek Goyal, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On 2012-01-26, at 8:27 PM, Wu Fengguang wrote: > On Fri, Jan 27, 2012 at 09:34:49AM +1100, Dave Chinner wrote: >> On Thu, Jan 26, 2012 at 11:25:56AM -0500, Vivek Goyal wrote: >>> On Wed, Jan 25, 2012 at 02:35:52PM +0800, Wu Fengguang wrote: >>>>> It would also be >>>>> possible to trigger on the size of the device so that the 32MB USB stick >>>>> doesn't sit busy for a minute with readahead that is useless. >>>> >>>> Yeah, I do have a patch for shrinking readahead size based on device size. >>> >>> Should it be a udev rule to change read_ahead_kb on device based on device >>> size, instead of a kernel patch? >> >> That's effectively what vendors like SGI have been doing since udev >> was first introduced, though more often the rules are based on device >> type rather than size. e.g. a 64GB device might be a USB flash drive >> now, but a 40GB device might be a really fast SSD.... > > Fair enough. I'll drop this kernel policy patch > > block: limit default readahead size for small devices > https://lkml.org/lkml/2011/12/19/89 Fengguang, Doesn't the kernel derive at least some idea of the speed of a device due to the writeback changes that you made? It would be very useful if we could derive at least some rough metric for the device performance in the kernel and use that as input to the readahead window size as well. Cheers, Andreas ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-27 5:25 ` Andreas Dilger @ 2012-01-27 7:53 ` Wu Fengguang 0 siblings, 0 replies; 76+ messages in thread From: Wu Fengguang @ 2012-01-27 7:53 UTC (permalink / raw) To: Andreas Dilger Cc: Dave Chinner, Vivek Goyal, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Thu, Jan 26, 2012 at 10:25:33PM -0700, Andreas Dilger wrote: [snip] > Doesn't the kernel derive at least some idea of the speed of a device > due to the writeback changes that you made? It would be very useful > if we could derive at least some rough metric for the device performance > in the kernel and use that as input to the readahead window size as well. Yeah we now have bdi->write_bandwidth (exported as "BdiWriteBandwidth" in /debug/bdi/8:0/stats) for estimating the bdi write bandwidth. However the value is not reflecting the sequential throughput in some cases: 1) when doing random writes 2) when doing mixed reads+writes 3) when not enough IO have been issued 4) in the rare case, when writing to a small area repeatedly so that it's effectively writing to the internal disk buffer at high speed So there are still some challenges in getting a reliably usable runtime estimation. Thanks, Fengguang ^ permalink raw reply [flat|nested] 76+ messages in thread
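For reference, the estimate Fengguang mentions can be inspected directly; below is a trivial sketch that prints the BdiWriteBandwidth line from the per-bdi stats file. It assumes debugfs is mounted at /sys/kernel/debug (the mail quotes the older /debug mount point), and the 8:0 device number is just an example.

/* Print the estimated write bandwidth line for one backing device. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *path = "/sys/kernel/debug/bdi/8:0/stats";  /* example bdi */
    char line[256];
    FILE *f = fopen(path, "r");

    if (!f) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        if (strstr(line, "BdiWriteBandwidth"))
            fputs(line, stdout);
    fclose(f);
    return 0;
}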
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 6:15 ` [Lsf-pc] " Andreas Dilger 2012-01-25 6:35 ` [Lsf-pc] [dm-devel] " Wu Fengguang @ 2012-01-25 14:33 ` Steven Whitehouse 2012-01-25 14:45 ` Jan Kara 2012-01-25 16:22 ` Loke, Chetan 1 sibling, 2 replies; 76+ messages in thread From: Steven Whitehouse @ 2012-01-25 14:33 UTC (permalink / raw) To: Andreas Dilger Cc: Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong Hi, On Tue, 2012-01-24 at 23:15 -0700, Andreas Dilger wrote: > On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you > >>>> from raising it (e.g. SLES uses 1 MB as a default). > >>> > >>> For some reason, I thought it had been bumped to 512KB by default. Must > >>> be that overactive imagination I have... Anyway, if all of the distros > >>> start bumping the default, don't you think it's time to consider bumping > >>> it upstream, too? I thought there was a lot of work put into not being > >>> too aggressive on readahead, so the downside of having a larger > >>> read_ahead_kb setting was fairly small. > >> > >> Yeah, I believe 512KB should be pretty safe these days except for > >> embedded world. OTOH average desktop user doesn't really care so it's > >> mostly servers with beefy storage that care... (note that I wrote we raised > >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > >> distro)). > > > > Maybe we don't need to care much about the embedded world when raising > > the default readahead size? Because even the current 128KB is too much > > for them, and I see Android setting the readahead size to 4KB... > > > > Some time ago I posted a series for raising the default readahead size > > to 512KB. But I'm open to use 1MB now (shall we vote on it?). > > I'm all in favour of 1MB (aligned) readahead. I think the embedded folks > already set enough CONFIG opts that we could trigger on one of those > (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. It would also be > possible to trigger on the size of the device so that the 32MB USB stick > doesn't sit busy for a minute with readahead that is useless. > > Cheers, Andreas > If the reason for not setting a larger readahead value is just that it might increase memory pressure and thus decrease performance, is it possible to use a suitable metric from the VM in order to set the value automatically according to circumstances? Steve. ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 14:33 ` Steven Whitehouse @ 2012-01-25 14:45 ` Jan Kara 2012-01-25 16:22 ` Loke, Chetan 1 sibling, 0 replies; 76+ messages in thread From: Jan Kara @ 2012-01-25 14:45 UTC (permalink / raw) To: Steven Whitehouse Cc: Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Wed 25-01-12 14:33:54, Steven Whitehouse wrote: > On Tue, 2012-01-24 at 23:15 -0700, Andreas Dilger wrote: > > On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > > > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > > >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > > >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you > > >>>> from raising it (e.g. SLES uses 1 MB as a default). > > >>> > > >>> For some reason, I thought it had been bumped to 512KB by default. Must > > >>> be that overactive imagination I have... Anyway, if all of the distros > > >>> start bumping the default, don't you think it's time to consider bumping > > >>> it upstream, too? I thought there was a lot of work put into not being > > >>> too aggressive on readahead, so the downside of having a larger > > >>> read_ahead_kb setting was fairly small. > > >> > > >> Yeah, I believe 512KB should be pretty safe these days except for > > >> embedded world. OTOH average desktop user doesn't really care so it's > > >> mostly servers with beefy storage that care... (note that I wrote we raised > > >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > > >> distro)). > > > > > > Maybe we don't need to care much about the embedded world when raising > > > the default readahead size? Because even the current 128KB is too much > > > for them, and I see Android setting the readahead size to 4KB... > > > > > > Some time ago I posted a series for raising the default readahead size > > > to 512KB. But I'm open to use 1MB now (shall we vote on it?). > > > > I'm all in favour of 1MB (aligned) readahead. I think the embedded folks > > already set enough CONFIG opts that we could trigger on one of those > > (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. It would also be > > possible to trigger on the size of the device so that the 32MB USB stick > > doesn't sit busy for a minute with readahead that is useless. > > > > Cheers, Andreas > > > > If the reason for not setting a larger readahead value is just that it > might increase memory pressure and thus decrease performance, is it > possible to use a suitable metric from the VM in order to set the value > automatically according to circumstances? In theory yes. In practice - do you have such heuristic ;)? There are lot of factors and it's hard to quantify how increased cache pressure influences performance of a particular workload. We could introduce some adaptive logic but so far fixed upperbound worked OK. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 14:33 ` Steven Whitehouse 2012-01-25 14:45 ` Jan Kara @ 2012-01-25 16:22 ` Loke, Chetan 2012-01-25 16:40 ` Steven Whitehouse 1 sibling, 1 reply; 76+ messages in thread From: Loke, Chetan @ 2012-01-25 16:22 UTC (permalink / raw) To: Steven Whitehouse, Andreas Dilger Cc: Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong > If the reason for not setting a larger readahead value is just that it > might increase memory pressure and thus decrease performance, is it > possible to use a suitable metric from the VM in order to set the value > automatically according to circumstances? > How about tracking heuristics for 'read-hits from previous read-aheads'? If the hits are in an acceptable range (user-configurable knob?) then keep reading ahead, else back off a little on the read-ahead? > Steve. Chetan Loke ^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 16:22 ` Loke, Chetan @ 2012-01-25 16:40 ` Steven Whitehouse 2012-01-25 17:08 ` Loke, Chetan ` (2 more replies) 0 siblings, 3 replies; 76+ messages in thread From: Steven Whitehouse @ 2012-01-25 16:40 UTC (permalink / raw) To: Loke, Chetan Cc: Andreas Dilger, Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong Hi, On Wed, 2012-01-25 at 11:22 -0500, Loke, Chetan wrote: > > If the reason for not setting a larger readahead value is just that it > > might increase memory pressure and thus decrease performance, is it > > possible to use a suitable metric from the VM in order to set the value > > automatically according to circumstances? > > > > How about tracking heuristics for 'read-hits from previous read-aheads'? If the hits are in acceptable range(user-configurable knob?) then keep seeking else back-off a little on the read-ahead? > > > Steve. > > Chetan Loke I'd been wondering about something similar to that. The basic scheme would be: - Set a page flag when readahead is performed - Clear the flag when the page is read (or on page fault for mmap) (i.e. when it is first used after readahead) Then when the VM scans for pages to eject from cache, check the flag and keep an exponential average (probably on a per-cpu basis) of the rate at which such flagged pages are ejected. That number can then be used to reduce the max readahead value. The questions are whether this would provide a fast enough reduction in readahead size to avoid problems? and whether the extra complication is worth it compared with using an overall metric for memory pressure? There may well be better solutions though, Steve. ^ permalink raw reply [flat|nested] 76+ messages in thread
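To make Steven's proposal a bit more concrete, here is a small user-space toy model of the bookkeeping; this is not kernel code and all names are invented. Evictions of still-flagged (readahead, never accessed) pages feed an exponential moving average, and that average scales the next readahead size down.

/* Toy model of "futile readahead" accounting; all names are invented. */
#include <stdio.h>

#define MAX_RA_PAGES 256    /* 1 MB window in 4 KB pages */
#define MIN_RA_PAGES 32     /* 128 KB floor */

static double futile_ewma;  /* fraction of evicted readahead pages never accessed */

/* called at eviction time; ra_unused is 1 if the page still had the flag set */
static void account_eviction(int ra_unused)
{
    const double weight = 0.125;    /* smoothing factor of 1/8 */

    futile_ewma = (1 - weight) * futile_ewma + weight * ra_unused;
}

/* next readahead size, shrunk in proportion to how much readahead was wasted */
static int next_ra_pages(void)
{
    int ra = (int)(MAX_RA_PAGES * (1.0 - futile_ewma));

    return ra < MIN_RA_PAGES ? MIN_RA_PAGES : ra;
}

int main(void)
{
    int i;

    /* pretend 3 out of every 4 evicted readahead pages were never used */
    for (i = 0; i < 100; i++)
        account_eviction(i % 4 != 0);
    printf("futile fraction %.2f -> next readahead %d pages\n",
           futile_ewma, next_ra_pages());
    return 0;
}

Whether such feedback reacts quickly enough, and whether per-CPU averaging is the right granularity, are exactly the open questions raised above; the sketch only shows the shape of the accounting.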
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 16:40 ` Steven Whitehouse @ 2012-01-25 17:08 ` Loke, Chetan 2012-01-25 17:32 ` James Bottomley 2012-02-03 12:55 ` Wu Fengguang 2 siblings, 0 replies; 76+ messages in thread From: Loke, Chetan @ 2012-01-25 17:08 UTC (permalink / raw) To: Steven Whitehouse Cc: Andreas Dilger, Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong > > How about tracking heuristics for 'read-hits from previous read-aheads'? > > If the hits are in acceptable range(user-configurable knob?) then keep seeking else back-off a little on the read-ahead? > > > > I'd been wondering about something similar to that. The basic scheme > would be: > > - Set a page flag when readahead is performed > - Clear the flag when the page is read (or on page fault for mmap) > (i.e. when it is first used after readahead) > > Then when the VM scans for pages to eject from cache, check the flag > and keep an exponential average (probably on a per-cpu basis) of the rate > at which such flagged pages are ejected. That number can then be used to > reduce the max readahead value. > > The questions are whether this would provide a fast enough reduction in > readahead size to avoid problems? and whether the extra complication is > worth it compared with using an overall metric for memory pressure? > Steve - I'm not a VM guy so can't help much. But if we maintain a separate list of pages 'fetched with read-ahead' then we can use the flag you suggested above. So when memory pressure is triggered: a) Evict these pages (which still have the page-flag set) first as they were a pure opportunistic bet from our side. b) scale-down(or just temporarily disable?) on read-aheads till the pressure goes low. c) admission control - disable(?) read-aheads for new threads/processes that are created? Then enable once we are ok? > There may well be better solutions though, Quite possible. But we need to start somewhere with the adaptive logic otherwise we will just keep on increasing(second guessing?) the upper bound and assuming that's what applications want. Increasing it to MB[s] may not be attractive for desktop users. If we raise it to MB[s] then desktop distro's might scale it down to KB[s].Exactly opposite of what enterprise distro's could be doing today. > Steve. > Chetan Loke ^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 16:40 ` Steven Whitehouse 2012-01-25 17:08 ` Loke, Chetan @ 2012-01-25 17:32 ` James Bottomley 2012-01-25 18:28 ` Loke, Chetan 2012-02-03 12:55 ` Wu Fengguang 2 siblings, 1 reply; 76+ messages in thread From: James Bottomley @ 2012-01-25 17:32 UTC (permalink / raw) To: Steven Whitehouse Cc: Loke, Chetan, Andreas Dilger, Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong, linux-mm On Wed, 2012-01-25 at 16:40 +0000, Steven Whitehouse wrote: > Hi, > > On Wed, 2012-01-25 at 11:22 -0500, Loke, Chetan wrote: > > > If the reason for not setting a larger readahead value is just that it > > > might increase memory pressure and thus decrease performance, is it > > > possible to use a suitable metric from the VM in order to set the value > > > automatically according to circumstances? > > > > > > > How about tracking heuristics for 'read-hits from previous read-aheads'? If the hits are in acceptable range(user-configurable knob?) then keep seeking else back-off a little on the read-ahead? > > > > > Steve. > > > > Chetan Loke > > I'd been wondering about something similar to that. The basic scheme > would be: > > - Set a page flag when readahead is performed > - Clear the flag when the page is read (or on page fault for mmap) > (i.e. when it is first used after readahead) > > Then when the VM scans for pages to eject from cache, check the flag and > keep an exponential average (probably on a per-cpu basis) of the rate at > which such flagged pages are ejected. That number can then be used to > reduce the max readahead value. > > The questions are whether this would provide a fast enough reduction in > readahead size to avoid problems? and whether the extra complication is > worth it compared with using an overall metric for memory pressure? > > There may well be better solutions though, So there are two separate problems mentioned here. The first is to ensure that readahead (RA) pages are treated as more disposable than accessed pages under memory pressure and then to derive a statistic for futile RA (those pages that were read in but never accessed). The first sounds really like its an LRU thing rather than adding yet another page flag. We need a position in the LRU list for never accessed ... that way they're first to be evicted as memory pressure rises. The second is you can derive this futile readahead statistic from the LRU position of unaccessed pages ... you could keep this globally. Now the problem: if you trash all unaccessed RA pages first, you end up with the situation of say playing a movie under moderate memory pressure that we do RA, then trash the RA page then have to re-read to display to the user resulting in an undesirable uptick in read I/O. Based on the above, it sounds like a better heuristic would be to evict accessed clean pages at the top of the LRU list before unaccessed clean pages because the expectation is that the unaccessed clean pages will be accessed (that's after all, why we did the readahead). As RA pages age in the LRU list, they become candidates for being futile, since they've been in memory for a while and no-one has accessed them, leading to the conclusion that they aren't ever going to be read. So I think futility is a measure of unaccessed aging, not necessarily of ejection (which is a memory pressure response). 
James ^ permalink raw reply [flat|nested] 76+ messages in thread
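A toy rendering of the distinction James draws above, using an invented page descriptor: futility is a property of how long an unaccessed readahead page has aged on the LRU, not of whether it happened to be ejected under memory pressure.

#include <stdbool.h>
#include <stdint.h>

/* Toy page descriptor; field names are made up for this sketch. */
struct toy_page {
        bool     readahead;     /* brought in by readahead */
        bool     accessed;      /* touched since it was read in */
        uint64_t lru_age;       /* how long it has sat on the LRU */
};

/*
 * An RA page only counts as futile once it has aged past some threshold
 * without ever being used; eviction order still favours keeping young
 * unaccessed RA pages, since they are expected to be read soon.
 */
static bool ra_page_is_futile(const struct toy_page *p, uint64_t age_threshold)
{
        return p->readahead && !p->accessed && p->lru_age > age_threshold;
}

int main(void)
{
        struct toy_page p = { .readahead = true, .accessed = false, .lru_age = 900 };

        return ra_page_is_futile(&p, 500) ? 0 : 1;
}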
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 17:32 ` James Bottomley @ 2012-01-25 18:28 ` Loke, Chetan 2012-01-25 18:37 ` Loke, Chetan ` (2 more replies) 0 siblings, 3 replies; 76+ messages in thread From: Loke, Chetan @ 2012-01-25 18:28 UTC (permalink / raw) To: James Bottomley, Steven Whitehouse Cc: Andreas Dilger, Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong, linux-mm > So there are two separate problems mentioned here. The first is to > ensure that readahead (RA) pages are treated as more disposable than > accessed pages under memory pressure and then to derive a statistic for > futile RA (those pages that were read in but never accessed). > > The first sounds really like its an LRU thing rather than adding yet > another page flag. We need a position in the LRU list for never > accessed ... that way they're first to be evicted as memory pressure > rises. > > The second is you can derive this futile readahead statistic from the > LRU position of unaccessed pages ... you could keep this globally. > > Now the problem: if you trash all unaccessed RA pages first, you end up > with the situation of say playing a movie under moderate memory > pressure that we do RA, then trash the RA page then have to re-read to display > to the user resulting in an undesirable uptick in read I/O. > > Based on the above, it sounds like a better heuristic would be to evict > accessed clean pages at the top of the LRU list before unaccessed clean > pages because the expectation is that the unaccessed clean pages will > be accessed (that's after all, why we did the readahead). As RA pages age Well, the movie example is one case where evicting unaccessed page may not be the right thing to do. But what about a workload that perform a random one-shot search? The search was done and the RA'd blocks are of no use anymore. So it seems one solution would hurt another. We can try to bring-in process run-time heuristics while evicting pages. So in the one-shot search case, the application did it's thing and went to sleep. While the movie-app has a pretty good run-time and is still running. So be a little gentle(?) on such apps? Selective eviction? In addition what if we do something like this: RA block[X], RA block[X+1], ... , RA block[X+m] Assume a block reads 'N' pages. Evict unaccessed RA page 'a' from block[X+2] and not [X+1]. We might need tracking at the RA-block level. This way if a movie touched RA-page 'a' from block[X], it would at least have [X+1] in cache. And while [X+1] is being read, the new slow-down version of RA will not RA that many blocks. Also, application's should use xxx_fadvise calls to give us hints... > James Chetan Loke ^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 18:28 ` Loke, Chetan @ 2012-01-25 18:37 ` Loke, Chetan 2012-01-25 18:37 ` James Bottomley 2012-01-25 18:44 ` Boaz Harrosh 2 siblings, 0 replies; 76+ messages in thread From: Loke, Chetan @ 2012-01-25 18:37 UTC (permalink / raw) To: James Bottomley, Steven Whitehouse Cc: Andreas Dilger, Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong, linux-mm > > > So there are two separate problems mentioned here. The first is to > > ensure that readahead (RA) pages are treated as more disposable than > > accessed pages under memory pressure and then to derive a statistic for > > futile RA (those pages that were read in but never accessed). > > > > The first sounds really like its an LRU thing rather than adding yet > > another page flag. We need a position in the LRU list for never > > accessed ... that way they're first to be evicted as memory pressure > > rises. > > > > The second is you can derive this futile readahead statistic from the > > LRU position of unaccessed pages ... you could keep this globally. > > > > Now the problem: if you trash all unaccessed RA pages first, you end up > > with the situation of say playing a movie under moderate memory > > pressure that we do RA, then trash the RA page then have to re-read to display > > to the user resulting in an undesirable uptick in read I/O. > > James - now that I'm thinking about it. I think the movie should be fine because when we calculate the read-hit from RA'd pages, the movie RA blocks will get a good hit-ratio and hence it's RA'd blocks won't be touched. But then we might need to track the hit-ratio at the RA-block(?) level. Chetan ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 18:28 ` Loke, Chetan 2012-01-25 18:37 ` Loke, Chetan @ 2012-01-25 18:37 ` James Bottomley 2012-01-25 20:06 ` Chris Mason 2012-01-26 16:17 ` Loke, Chetan 2012-01-25 18:44 ` Boaz Harrosh 2 siblings, 2 replies; 76+ messages in thread From: James Bottomley @ 2012-01-25 18:37 UTC (permalink / raw) To: Loke, Chetan Cc: Steven Whitehouse, Andreas Dilger, Andrea Arcangeli, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong On Wed, 2012-01-25 at 13:28 -0500, Loke, Chetan wrote: > > So there are two separate problems mentioned here. The first is to > > ensure that readahead (RA) pages are treated as more disposable than > > accessed pages under memory pressure and then to derive a statistic for > > futile RA (those pages that were read in but never accessed). > > > > The first sounds really like its an LRU thing rather than adding yet > > another page flag. We need a position in the LRU list for never > > accessed ... that way they're first to be evicted as memory pressure > > rises. > > > > The second is you can derive this futile readahead statistic from the > > LRU position of unaccessed pages ... you could keep this globally. > > > > Now the problem: if you trash all unaccessed RA pages first, you end up > > with the situation of say playing a movie under moderate memory > > pressure that we do RA, then trash the RA page then have to re-read to display > > to the user resulting in an undesirable uptick in read I/O. > > > > Based on the above, it sounds like a better heuristic would be to evict > > accessed clean pages at the top of the LRU list before unaccessed clean > > pages because the expectation is that the unaccessed clean pages will > > be accessed (that's after all, why we did the readahead). As RA pages age > > Well, the movie example is one case where evicting unaccessed page may not be the right thing to do. But what about a workload that perform a random one-shot search? > The search was done and the RA'd blocks are of no use anymore. So it seems one solution would hurt another. Well not really: RA is always wrong for random reads. The whole purpose of RA is assumption of sequential access patterns. The point I'm making is that for the case where RA works (sequential patterns), evicting unaccessed RA pages before accessed ones is the wrong thing to do, so the heuristic isn't what you first thought of (evicting unaccessed RA pages first). For the random read case, either heuristic is wrong, so it doesn't matter. However, when you add the futility measure, random read processes will end up with aged unaccessed RA pages, so its RA window will get closed. > We can try to bring-in process run-time heuristics while evicting pages. So in the one-shot search case, the application did it's thing and went to sleep. > While the movie-app has a pretty good run-time and is still running. So be a little gentle(?) on such apps? Selective eviction? > > In addition what if we do something like this: > > RA block[X], RA block[X+1], ... , RA block[X+m] > > Assume a block reads 'N' pages. > > Evict unaccessed RA page 'a' from block[X+2] and not [X+1]. > > We might need tracking at the RA-block level. This way if a movie touched RA-page 'a' from block[X], it would at least have [X+1] in cache. And while [X+1] is being read, the new slow-down version of RA will not RA that many blocks. 
> > Also, application's should use xxx_fadvise calls to give us hints... I think that's a bit over complex. As long as the futility measure works, a sequential pattern read process gets a reasonable RA window. The trick is to prove that the simple doesn't work before considering the complex. James -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 18:37 ` James Bottomley @ 2012-01-25 20:06 ` Chris Mason 2012-01-25 22:46 ` Andrea Arcangeli 2012-01-26 22:38 ` Dave Chinner 2012-01-26 16:17 ` Loke, Chetan 1 sibling, 2 replies; 76+ messages in thread From: Chris Mason @ 2012-01-25 20:06 UTC (permalink / raw) To: James Bottomley Cc: Loke, Chetan, Steven Whitehouse, Andreas Dilger, Andrea Arcangeli, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong On Wed, Jan 25, 2012 at 12:37:48PM -0600, James Bottomley wrote: > On Wed, 2012-01-25 at 13:28 -0500, Loke, Chetan wrote: > > > So there are two separate problems mentioned here. The first is to > > > ensure that readahead (RA) pages are treated as more disposable than > > > accessed pages under memory pressure and then to derive a statistic for > > > futile RA (those pages that were read in but never accessed). > > > > > > The first sounds really like its an LRU thing rather than adding yet > > > another page flag. We need a position in the LRU list for never > > > accessed ... that way they're first to be evicted as memory pressure > > > rises. > > > > > > The second is you can derive this futile readahead statistic from the > > > LRU position of unaccessed pages ... you could keep this globally. > > > > > > Now the problem: if you trash all unaccessed RA pages first, you end up > > > with the situation of say playing a movie under moderate memory > > > pressure that we do RA, then trash the RA page then have to re-read to display > > > to the user resulting in an undesirable uptick in read I/O. > > > > > > Based on the above, it sounds like a better heuristic would be to evict > > > accessed clean pages at the top of the LRU list before unaccessed clean > > > pages because the expectation is that the unaccessed clean pages will > > > be accessed (that's after all, why we did the readahead). As RA pages age > > > > Well, the movie example is one case where evicting unaccessed page may not be the right thing to do. But what about a workload that perform a random one-shot search? > > The search was done and the RA'd blocks are of no use anymore. So it seems one solution would hurt another. > > Well not really: RA is always wrong for random reads. The whole purpose > of RA is assumption of sequential access patterns. Just to jump back, Jeff's benchmark that started this (on xfs and ext4): - buffered 1MB reads get down to the scheduler in 128KB chunks The really hard part about readahead is that you don't know what userland wants. In Jeff's test, he's telling the kernel he wants 1MB ios and our RA engine is doing 128KB ios. We can talk about scaling up how big the RA windows get on their own, but if userland asks for 1MB, we don't have to worry about futile RA, we just have to make sure we don't oom the box trying to honor 1MB reads from 5000 different procs. -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 20:06 ` Chris Mason @ 2012-01-25 22:46 ` Andrea Arcangeli 2012-01-25 22:58 ` Jan Kara ` (2 more replies) 2012-01-26 22:38 ` Dave Chinner 1 sibling, 3 replies; 76+ messages in thread From: Andrea Arcangeli @ 2012-01-25 22:46 UTC (permalink / raw) To: Chris Mason, James Bottomley, Loke, Chetan, Steven Whitehouse, Andreas Dilger, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong On Wed, Jan 25, 2012 at 03:06:13PM -0500, Chris Mason wrote: > We can talk about scaling up how big the RA windows get on their own, > but if userland asks for 1MB, we don't have to worry about futile RA, we > just have to make sure we don't oom the box trying to honor 1MB reads > from 5000 different procs. :) that's for sure if read has a 1M buffer as destination. However even cp /dev/sda reads/writes through a 32kb buffer, so it's not so common to read in 1m buffers. But I also would prefer to stay on the simple side (on a side note we run out of page flags already on 32bit I think as I had to nuke PG_buddy already). Overall I think the risk of the pages being evicted before they can be copied to userland is quite a minor risk. A 16G system with 100 readers all hitting on disk at the same time using 100M readahead would still only create a 100m memory pressure... So it'd sure be ok, 100m is less than what kswapd keeps always free for example. Think a 4TB system. Especially if 128k fixed has been ok so far on a 1G system. If we really want to be more dynamic than a setting at boot depending on ram size, we could limit it to a fraction of freeable memory (using similar math to determine_dirtyable_memory, maybe calling it over time but not too frequently to reduce the overhead). Like if there's 0 memory freeable keep it low. If there's 1G freeable out of that math (and we assume the readahead hit rate is near 100%), raise the maximum readahead to 1M even if the total ram is only 1G. So we allow up to 1000 readers before we even recycle the readahead. I doubt the complexity of tracking exactly how many pages are getting recycled before they're copied to userland would be worth it, besides it'd be 0% for 99% of systems and workloads. Way more important is to have feedback on the readahead hits and be sure when readahead is raised to the maximum the hit rate is near 100% and fallback to lower readaheads if we don't get that hit rate. But that's not a VM problem and it's a readahead issue only. The actual VM pressure side of it, sounds minor issue if the hit rate of the readahead cache is close to 100%. The config option is also ok with me, but I think it'd be nicer to set it at boot depending on ram size (one less option to configure manually and zero overhead). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
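A standalone sketch of the sizing Andrea suggests, assuming the amount of freeable memory is already known (in the real kernel it would come from something in the spirit of determine_dirtyable_memory()); the divisor and the clamps are made-up numbers chosen only to mirror the "1000 readers" reasoning above.

#include <stdio.h>

static unsigned long ra_max_bytes(unsigned long freeable_bytes)
{
        unsigned long ra = freeable_bytes / 1000;   /* allow ~1000 readers */

        if (ra < 128 * 1024)
                ra = 128 * 1024;                    /* current default floor */
        if (ra > 1024 * 1024)
                ra = 1024 * 1024;                   /* 1MB ceiling */
        return ra;
}

int main(void)
{
        printf("1G freeable  -> %lu KB max RA\n", ra_max_bytes(1UL << 30) >> 10);
        printf("64M freeable -> %lu KB max RA\n", ra_max_bytes(64UL << 20) >> 10);
        return 0;
}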
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 22:46 ` Andrea Arcangeli @ 2012-01-25 22:58 ` Jan Kara 2012-01-26 8:59 ` Boaz Harrosh 2012-01-26 16:40 ` Loke, Chetan 2 siblings, 0 replies; 76+ messages in thread From: Jan Kara @ 2012-01-25 22:58 UTC (permalink / raw) To: Andrea Arcangeli Cc: Chris Mason, James Bottomley, Loke, Chetan, Steven Whitehouse, Andreas Dilger, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong On Wed 25-01-12 23:46:14, Andrea Arcangeli wrote: > On Wed, Jan 25, 2012 at 03:06:13PM -0500, Chris Mason wrote: > > We can talk about scaling up how big the RA windows get on their own, > > but if userland asks for 1MB, we don't have to worry about futile RA, we > > just have to make sure we don't oom the box trying to honor 1MB reads > > from 5000 different procs. > > :) that's for sure if read has a 1M buffer as destination. However > even cp /dev/sda reads/writes through a 32kb buffer, so it's not so > common to read in 1m buffers. > > But I also would prefer to stay on the simple side (on a side note we > run out of page flags already on 32bit I think as I had to nuke > PG_buddy already). > > Overall I think the risk of the pages being evicted before they can be > copied to userland is quite a minor risk. A 16G system with 100 > readers all hitting on disk at the same time using 100M readahead > would still only create a 100m memory pressure... So it'd sure be ok, > 100m is less than what kswapd keeps always free for example. Think a > 4TB system. Especially if 128k fixed has been ok so far on a 1G system. > > If we really want to be more dynamic than a setting at boot depending > on ram size, we could limit it to a fraction of freeable memory (using > similar math to determine_dirtyable_memory, maybe calling it over time > but not too frequently to reduce the overhead). Like if there's 0 > memory freeable keep it low. If there's 1G freeable out of that math > (and we assume the readahead hit rate is near 100%), raise the maximum > readahead to 1M even if the total ram is only 1G. So we allow up to > 1000 readers before we even recycle the readahead. > > I doubt the complexity of tracking exactly how many pages are getting > recycled before they're copied to userland would be worth it, besides > it'd be 0% for 99% of systems and workloads. > > Way more important is to have feedback on the readahead hits and be > sure when readahead is raised to the maximum the hit rate is near 100% > and fallback to lower readaheads if we don't get that hit rate. But > that's not a VM problem and it's a readahead issue only. > > The actual VM pressure side of it, sounds minor issue if the hit rate > of the readahead cache is close to 100%. > > The config option is also ok with me, but I think it'd be nicer to set > it at boot depending on ram size (one less option to configure > manually and zero overhead). Yeah. I'd also keep it simple. Tuning max readahead size based on available memory (and device size) once in a while is about the maximum complexity I'd consider meaningful. If you have real data that shows problems which are not solved by that simple strategy, then sure, we can speak about more complex algorithms. But currently I don't think they are needed. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 22:46 ` Andrea Arcangeli 2012-01-25 22:58 ` Jan Kara @ 2012-01-26 8:59 ` Boaz Harrosh 2012-01-26 16:40 ` Loke, Chetan 2 siblings, 0 replies; 76+ messages in thread From: Boaz Harrosh @ 2012-01-26 8:59 UTC (permalink / raw) To: Andrea Arcangeli Cc: Chris Mason, James Bottomley, Loke, Chetan, Steven Whitehouse, Andreas Dilger, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, linux-fsdevel, lsf-pc, Darrick J.Wong On 01/26/2012 12:46 AM, Andrea Arcangeli wrote: > On Wed, Jan 25, 2012 at 03:06:13PM -0500, Chris Mason wrote: >> We can talk about scaling up how big the RA windows get on their own, >> but if userland asks for 1MB, we don't have to worry about futile RA, we >> just have to make sure we don't oom the box trying to honor 1MB reads >> from 5000 different procs. > > :) that's for sure if read has a 1M buffer as destination. However > even cp /dev/sda reads/writes through a 32kb buffer, so it's not so > common to read in 1m buffers. > That's not so true. cp is a bad example because it's brain dead and someone should fix it. cp performance is terrible. Even KDE's GUI copy is better. But applications (and dd users) that do care about read performance do use large buffers and want the Kernel to not ignore that. What a better hint for Kernel is the read() destination buffer size. > But I also would prefer to stay on the simple side (on a side note we > run out of page flags already on 32bit I think as I had to nuke > PG_buddy already). > So what would be more simple then not ignoring read() request size from application, which will give applications all the control they need. <snip> (I Agree) > The config option is also ok with me, but I think it'd be nicer to set > it at boot depending on ram size (one less option to configure > manually and zero overhead). If you actually take into account the destination buffer size, you'll see that the read-ahead size becomes less important for these workloads that actually care. But Yes some mount time heuristics could be nice, depending on DEV size and MEM size. For example in my file-system with self registered BDI I set readhead sizes according to raid-strip sizes and such so to get good read performance. And speaking of reads and readhead. What about alignments? both of offset and length? though in reads it's not so important. One thing some people have ask for is raid-verify-reads as a mount option. Thanks Boaz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
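A sketch of the mount-time choice Boaz describes for a filesystem with a self-registered BDI: derive the readahead size from the RAID geometry rather than inheriting the global 128K default. Helper name and numbers are invented for illustration.

#include <stdio.h>

static unsigned long ra_from_geometry(unsigned long stripe_unit_bytes,
                                      unsigned int data_disks)
{
        unsigned long full_stripe = stripe_unit_bytes * data_disks;

        /* Read ahead a couple of full stripes so every spindle stays busy. */
        return 2 * full_stripe;
}

int main(void)
{
        /* e.g. 128K stripe unit across 6 data disks -> 1536 KB readahead */
        printf("ra = %lu KB\n", ra_from_geometry(128 * 1024, 6) >> 10);
        return 0;
}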
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 22:46 ` Andrea Arcangeli 2012-01-25 22:58 ` Jan Kara 2012-01-26 8:59 ` Boaz Harrosh @ 2012-01-26 16:40 ` Loke, Chetan 2012-01-26 17:00 ` Andreas Dilger 2012-02-03 12:37 ` Wu Fengguang 2 siblings, 2 replies; 76+ messages in thread From: Loke, Chetan @ 2012-01-26 16:40 UTC (permalink / raw) To: Andrea Arcangeli, Chris Mason, James Bottomley, Steven Whitehouse, Andreas Dilger, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong > From: Andrea Arcangeli [mailto:aarcange@redhat.com] > Sent: January 25, 2012 5:46 PM .... > Way more important is to have feedback on the readahead hits and be > sure when readahead is raised to the maximum the hit rate is near 100% > and fallback to lower readaheads if we don't get that hit rate. But > that's not a VM problem and it's a readahead issue only. > A quick google showed up - http://kerneltrap.org/node/6642 Interesting thread to follow. I haven't looked further as to what was merged and what wasn't. A quote from the patch - " It works by peeking into the file cache and check if there are any history pages present or accessed." Now I don't understand anything about this but I would think digging the file-cache isn't needed(?). So, yes, a simple RA hit-rate feedback could be fine. And 'maybe' for adaptive RA just increase the RA-blocks by '1'(or some N) over period of time. No more smartness. A simple 10 line function is easy to debug/maintain. That is, a scaled-down version of ramp-up/ramp-down. Don't go crazy by ramping-up/down after every RA(like SCSI LLDD madness). Wait for some event to happen. I can see where Andrew Morton's concerns could be(just my interpretation). We may not want to end up like a protocol state machine code: tcp slow-start, then increase , then congestion, then let's back-off. hmmm, slow-start is a problem for my business logic, so let's speed-up slow-start ;). Chetan -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 16:40 ` Loke, Chetan @ 2012-01-26 17:00 ` Andreas Dilger 2012-01-26 17:16 ` Loke, Chetan 2012-02-03 12:37 ` Wu Fengguang 1 sibling, 1 reply; 76+ messages in thread From: Andreas Dilger @ 2012-01-26 17:00 UTC (permalink / raw) To: Loke, Chetan Cc: Andrea Arcangeli, Chris Mason, James Bottomley, Steven Whitehouse, Andreas Dilger, Jan Kara, Mike Snitzer, <linux-scsi@vger.kernel.org>, <neilb@suse.de>, <dm-devel@redhat.com>, Christoph Hellwig, <linux-mm@kvack.org>, Jeff Moyer, Wu Fengguang, Boaz Harrosh, <linux-fsdevel@vger.kernel.org>, <lsf-pc@lists.linux-foundation.org>, Darrick J.Wong On 2012-01-26, at 9:40, "Loke, Chetan" <Chetan.Loke@netscout.com> wrote: > And 'maybe' for adaptive RA just increase the RA-blocks by '1'(or some > N) over period of time. No more smartness. A simple 10 line function is > easy to debug/maintain. That is, a scaled-down version of > ramp-up/ramp-down. Don't go crazy by ramping-up/down after every RA(like > SCSI LLDD madness). Wait for some event to happen. Doing 1-block readahead increments is a performance disaster on RAID-5/6. That means you seek all the disks, but use only a fraction of the data that the controller read internally and had to parity check. It makes more sense to keep the read units the same size as write units (1 MB or as dictated by RAID geometry) that the filesystem is also hopefully using for allocation. When doing a readahead it should fetch the whole chunk at one time, then not do another until it needs another full chunk. Cheers, Andreas ^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 17:00 ` Andreas Dilger @ 2012-01-26 17:16 ` Loke, Chetan 0 siblings, 0 replies; 76+ messages in thread From: Loke, Chetan @ 2012-01-26 17:16 UTC (permalink / raw) To: Andreas Dilger Cc: Andrea Arcangeli, Chris Mason, James Bottomley, Steven Whitehouse, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong > > And 'maybe' for adaptive RA just increase the RA-blocks by '1'(or some N) over period of time. No more smartness. A simple 10 line function is > > easy to debug/maintain. That is, a scaled-down version of ramp-up/ramp-down. Don't go crazy by ramping-up/down after every RA(like > > SCSI LLDD madness). Wait for some event to happen. > > Doing 1-block readahead increments is a performance disaster on RAID- > 5/6. That means you seek all the disks, but use only a fraction of the > data that the controller read internally and had to parity check. > > It makes more sense to keep the read units the same size as write units > (1 MB or as dictated by RAID geometry) that the filesystem is also > hopefully using for allocation. When doing a readahead it should fetch > the whole chunk at one time, then not do another until it needs another > full chunk. > I was using it loosely(don't confuse it with 1 block as in 4K :). RA could be tied to whatever appropriate parameters depending on the setup(underlying backing store) etc. But the point I'm trying to make is to (may be)keep the adaptive logic simple. So if you start with RA-chunk == 512KB/xMB, then when we increment it, do something like (RA-chunk << N). BTW, it's not just RAID but also different abstractions you might have. Stripe-width worth of RA is still useless if your LVM chunk is N * stripe-width. > Cheers, Andreas Chetan -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
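A toy ramp along the lines Chetan and Andreas are converging on: grow the window multiplicatively (the "RA-chunk << N" style above) but always round the result to whole RAID chunks, so each readahead I/O covers full stripe-sized units. The numbers are illustrative only, and the back-off half of the logic is left out.

#include <stdio.h>

static unsigned long grow_ra(unsigned long cur, unsigned long chunk)
{
        unsigned long next = cur ? cur * 2 : chunk;   /* multiplicative ramp */

        return (next + chunk - 1) / chunk * chunk;    /* round up to a chunk */
}

int main(void)
{
        unsigned long chunk = 512 * 1024, ra = 0;

        for (int i = 0; i < 4; i++) {
                ra = grow_ra(ra, chunk);
                printf("window: %lu KB\n", ra >> 10);
        }
        return 0;
}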
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 16:40 ` Loke, Chetan 2012-01-26 17:00 ` Andreas Dilger @ 2012-02-03 12:37 ` Wu Fengguang 1 sibling, 0 replies; 76+ messages in thread From: Wu Fengguang @ 2012-02-03 12:37 UTC (permalink / raw) To: Loke, Chetan Cc: Andrea Arcangeli, Chris Mason, James Bottomley, Steven Whitehouse, Andreas Dilger, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong, Dan Magenheimer On Thu, Jan 26, 2012 at 11:40:47AM -0500, Loke, Chetan wrote: > > From: Andrea Arcangeli [mailto:aarcange@redhat.com] > > Sent: January 25, 2012 5:46 PM > > .... > > > Way more important is to have feedback on the readahead hits and be > > sure when readahead is raised to the maximum the hit rate is near 100% > > and fallback to lower readaheads if we don't get that hit rate. But > > that's not a VM problem and it's a readahead issue only. > > > > A quick google showed up - http://kerneltrap.org/node/6642 > > Interesting thread to follow. I haven't looked further as to what was > merged and what wasn't. > > A quote from the patch - " It works by peeking into the file cache and > check if there are any history pages present or accessed." > Now I don't understand anything about this but I would think digging the > file-cache isn't needed(?). So, yes, a simple RA hit-rate feedback could > be fine. > > And 'maybe' for adaptive RA just increase the RA-blocks by '1'(or some > N) over period of time. No more smartness. A simple 10 line function is > easy to debug/maintain. That is, a scaled-down version of > ramp-up/ramp-down. Don't go crazy by ramping-up/down after every RA(like > SCSI LLDD madness). Wait for some event to happen. > > I can see where Andrew Morton's concerns could be(just my > interpretation). We may not want to end up like a protocol state machine > code: tcp slow-start, then increase , then congestion, then let's > back-off. hmmm, slow-start is a problem for my business logic, so let's > speed-up slow-start ;). Loke, Thrashing safe readahead can work as simple as: readahead_size = min(nr_history_pages, MAX_READAHEAD_PAGES) No need for more slow-start or back-off magics. This is because nr_history_pages is a lower estimation of the threshing threshold: chunk A chunk B chunk C head l01 l11 l12 l21 l22 | |-->|-->| |------>|-->| |------>| | +-------+ +-----------+ +-------------+ | | | # | | # | | # | | | +-------+ +-----------+ +-------------+ | | |<==============|<===========================|<============================| L0 L1 L2 Let f(l) = L be a map from l: the number of pages read by the stream to L: the number of pages pushed into inactive_list in the mean time then f(l01) <= L0 f(l11 + l12) = L1 f(l21 + l22) = L2 ... f(l01 + l11 + ...) <= Sum(L0 + L1 + ...) <= Length(inactive_list) = f(thrashing-threshold) So the count of continuous history pages left in inactive_list is always a lower estimation of the true thrashing-threshold. Given a stable workload, the readahead size will keep ramping up and then stabilize in range (thrashing_threshold/2, thrashing_threshold) Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
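Fengguang's rule above reduces to a one-line clamp. The sketch below is a user-space illustration only, with nr_history_pages assumed to be supplied by the page-cache peek he describes and the maximum chosen arbitrarily.

#include <stdio.h>

#define MAX_READAHEAD_PAGES 256   /* 1MB of 4K pages, for illustration */

/*
 * The readahead size for a stream is capped by the number of its history
 * pages still sitting in the page cache, which is a lower estimation of
 * that stream's thrashing threshold.
 */
static unsigned long thrash_safe_ra(unsigned long nr_history_pages)
{
        return nr_history_pages < MAX_READAHEAD_PAGES ?
               nr_history_pages : MAX_READAHEAD_PAGES;
}

int main(void)
{
        printf("%lu\n", thrash_safe_ra(40));    /* thrashed stream: shrink */
        printf("%lu\n", thrash_safe_ra(4000));  /* plenty of room: full size */
        return 0;
}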
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 20:06 ` Chris Mason 2012-01-25 22:46 ` Andrea Arcangeli @ 2012-01-26 22:38 ` Dave Chinner 1 sibling, 0 replies; 76+ messages in thread From: Dave Chinner @ 2012-01-26 22:38 UTC (permalink / raw) To: Chris Mason, James Bottomley, Loke, Chetan, Steven Whitehouse, Andreas Dilger, Andrea Arcangeli, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong On Wed, Jan 25, 2012 at 03:06:13PM -0500, Chris Mason wrote: > On Wed, Jan 25, 2012 at 12:37:48PM -0600, James Bottomley wrote: > > On Wed, 2012-01-25 at 13:28 -0500, Loke, Chetan wrote: > > > > So there are two separate problems mentioned here. The first is to > > > > ensure that readahead (RA) pages are treated as more disposable than > > > > accessed pages under memory pressure and then to derive a statistic for > > > > futile RA (those pages that were read in but never accessed). > > > > > > > > The first sounds really like its an LRU thing rather than adding yet > > > > another page flag. We need a position in the LRU list for never > > > > accessed ... that way they're first to be evicted as memory pressure > > > > rises. > > > > > > > > The second is you can derive this futile readahead statistic from the > > > > LRU position of unaccessed pages ... you could keep this globally. > > > > > > > > Now the problem: if you trash all unaccessed RA pages first, you end up > > > > with the situation of say playing a movie under moderate memory > > > > pressure that we do RA, then trash the RA page then have to re-read to display > > > > to the user resulting in an undesirable uptick in read I/O. > > > > > > > > Based on the above, it sounds like a better heuristic would be to evict > > > > accessed clean pages at the top of the LRU list before unaccessed clean > > > > pages because the expectation is that the unaccessed clean pages will > > > > be accessed (that's after all, why we did the readahead). As RA pages age > > > > > > Well, the movie example is one case where evicting unaccessed page may not be the right thing to do. But what about a workload that perform a random one-shot search? > > > The search was done and the RA'd blocks are of no use anymore. So it seems one solution would hurt another. > > > > Well not really: RA is always wrong for random reads. The whole purpose > > of RA is assumption of sequential access patterns. > > Just to jump back, Jeff's benchmark that started this (on xfs and ext4): > > - buffered 1MB reads get down to the scheduler in 128KB chunks > > The really hard part about readahead is that you don't know what > userland wants. In Jeff's test, he's telling the kernel he wants 1MB > ios and our RA engine is doing 128KB ios. > > We can talk about scaling up how big the RA windows get on their own, > but if userland asks for 1MB, we don't have to worry about futile RA, we > just have to make sure we don't oom the box trying to honor 1MB reads > from 5000 different procs. Right - if we know the read request is larger than the RA window, then we should ignore the RA window and just service the request in a single bio. Well, at least, in chunks as large as the underlying device will allow us to build.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 76+ messages in thread
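A small sketch of the sizing rule Dave states, with invented parameter names: when the explicit read is already larger than the readahead window, size the I/O from the request and ignore the window, capped only by what the underlying device will accept.

#include <stdio.h>

static unsigned long read_io_size(unsigned long request_bytes,
                                  unsigned long ra_window_bytes,
                                  unsigned long max_hw_io_bytes)
{
        unsigned long want = request_bytes > ra_window_bytes ?
                             request_bytes : ra_window_bytes;

        return want < max_hw_io_bytes ? want : max_hw_io_bytes;
}

int main(void)
{
        /* 1MB read, 128K RA window, 512K max_sectors: two 512K I/Os result. */
        printf("%lu KB per I/O\n",
               read_io_size(1024 * 1024, 128 * 1024, 512 * 1024) >> 10);
        return 0;
}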
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 18:37 ` James Bottomley 2012-01-25 20:06 ` Chris Mason @ 2012-01-26 16:17 ` Loke, Chetan 1 sibling, 0 replies; 76+ messages in thread From: Loke, Chetan @ 2012-01-26 16:17 UTC (permalink / raw) To: James Bottomley Cc: Steven Whitehouse, Andreas Dilger, Andrea Arcangeli, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong > > Well, the movie example is one case where evicting unaccessed page may not be the right thing to do. But what about a workload that perform a random one-shot search? > > The search was done and the RA'd blocks are of no use anymore. So it seems one solution would hurt another. > > Well not really: RA is always wrong for random reads. The whole purpose of RA is assumption of sequential access patterns. > James - I must agree that 'random' was not the proper choice of word here. What I meant was this - search-app reads enough data to trick the lazy/deferred-RA logic. RA thinks, oh well, this is now a sequential pattern and will RA. But all this search-app did was that it kept reading till it found what it was looking for. Once it was done, it went back to sleep waiting for the next query. Now all that RA data could be of total waste if the read-hit on the RA data-set was 'zero percent'. Some would argue that how would we(the kernel) know that the next query may not be close the earlier data-set? Well, we don't and we may not want to. That is why the application better know how to use XXX_advise calls. If they are not using it then well it's their problem. The app knows about the statistics/etc about the queries. What was used and what wasn't. > James Chetan Loke ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 18:28 ` Loke, Chetan 2012-01-25 18:37 ` Loke, Chetan 2012-01-25 18:37 ` James Bottomley @ 2012-01-25 18:44 ` Boaz Harrosh 2 siblings, 0 replies; 76+ messages in thread From: Boaz Harrosh @ 2012-01-25 18:44 UTC (permalink / raw) To: Loke, Chetan Cc: James Bottomley, Steven Whitehouse, Andreas Dilger, Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong, linux-mm On 01/25/2012 08:28 PM, Loke, Chetan wrote: >> So there are two separate problems mentioned here. The first is to >> ensure that readahead (RA) pages are treated as more disposable than >> accessed pages under memory pressure and then to derive a statistic for >> futile RA (those pages that were read in but never accessed). >> >> The first sounds really like its an LRU thing rather than adding yet >> another page flag. We need a position in the LRU list for never >> accessed ... that way they're first to be evicted as memory pressure >> rises. >> >> The second is you can derive this futile readahead statistic from the >> LRU position of unaccessed pages ... you could keep this globally. >> >> Now the problem: if you trash all unaccessed RA pages first, you end up >> with the situation of say playing a movie under moderate memory >> pressure that we do RA, then trash the RA page then have to re-read to display >> to the user resulting in an undesirable uptick in read I/O. >> >> Based on the above, it sounds like a better heuristic would be to evict >> accessed clean pages at the top of the LRU list before unaccessed clean >> pages because the expectation is that the unaccessed clean pages will >> be accessed (that's after all, why we did the readahead). As RA pages age > > Well, the movie example is one case where evicting unaccessed page > may not be the right thing to do. But what about a workload that > perform a random one-shot search? The search was done and the RA'd > blocks are of no use anymore. So it seems one solution would hurt > another. > I think there is a "seeky" flag the Kernel keeps to prevent read-ahead in the case of seeks. > We can try to bring-in process run-time heuristics while evicting > pages. So in the one-shot search case, the application did it's thing > and went to sleep. While the movie-app has a pretty good run-time and > is still running. So be a little gentle(?) on such apps? Selective > eviction? > > In addition what if we do something like this: > > RA block[X], RA block[X+1], ... , RA block[X+m] > > Assume a block reads 'N' pages. > > Evict unaccessed RA page 'a' from block[X+2] and not [X+1]. > > We might need tracking at the RA-block level. This way if a movie > touched RA-page 'a' from block[X], it would at least have [X+1] in > cache. And while [X+1] is being read, the new slow-down version of RA > will not RA that many blocks. > > Also, application's should use xxx_fadvise calls to give us hints... > Lets start by reading the number of pages requested by the read() call, first. The application is reading 4M and we still send 128K. Don't you think that would be fadvise enough? Lets start with the simple stuff. The only flag I see on read pages is that if it's read ahead pages that we Kernel initiated without an application request. Like beyond the read() call or a surrounding an mmap read that was not actually requested by the application. 
For generality we always initiate a read in the page fault and lose all the wonderful information the app gave us in the different read APIs. Let's start with that. > >> James > > Chetan Loke Boaz ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 16:40 ` Steven Whitehouse 2012-01-25 17:08 ` Loke, Chetan 2012-01-25 17:32 ` James Bottomley @ 2012-02-03 12:55 ` Wu Fengguang 2 siblings, 0 replies; 76+ messages in thread From: Wu Fengguang @ 2012-02-03 12:55 UTC (permalink / raw) To: Steven Whitehouse Cc: Loke, Chetan, Andreas Dilger, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong, Dan Magenheimer On Wed, Jan 25, 2012 at 04:40:23PM +0000, Steven Whitehouse wrote: > Hi, > > On Wed, 2012-01-25 at 11:22 -0500, Loke, Chetan wrote: > > > If the reason for not setting a larger readahead value is just that it > > > might increase memory pressure and thus decrease performance, is it > > > possible to use a suitable metric from the VM in order to set the value > > > automatically according to circumstances? > > > > > > > How about tracking heuristics for 'read-hits from previous read-aheads'? If the hits are in acceptable range(user-configurable knob?) then keep seeking else back-off a little on the read-ahead? > > > > > Steve. > > > > Chetan Loke > > I'd been wondering about something similar to that. The basic scheme > would be: > > - Set a page flag when readahead is performed > - Clear the flag when the page is read (or on page fault for mmap) > (i.e. when it is first used after readahead) > > Then when the VM scans for pages to eject from cache, check the flag and > keep an exponential average (probably on a per-cpu basis) of the rate at > which such flagged pages are ejected. That number can then be used to > reduce the max readahead value. > > The questions are whether this would provide a fast enough reduction in > readahead size to avoid problems? and whether the extra complication is > worth it compared with using an overall metric for memory pressure? > > There may well be better solutions though, The caveat is, on a consistently thrashed machine, the readahead size should better be determined for each read stream. Repeated readahead thrashing typically happen in a file server with large number of concurrent clients. For example, if there are 1000 read streams each doing 1MB readahead, since there are 2 readahead window for each stream, there could be up to 2GB readahead pages that will sure be thrashed in a server with only 1GB memory. Typically the 1000 clients will have different read speeds. A few of them will be doing 1MB/s, most others may be doing 100KB/s. In this case, we shall only decrease readahead size for the 100KB/s clients. The 1MB/s clients actually won't see readahead thrashing at all and we'll want them to do large 1MB I/O to achieve good disk utilization. So we need something better than the "global feedback" scheme, and we do have such a solution ;) As said in my other email, the number of history pages remained in the page cache is a good estimation of that particular read stream's thrashing safe readahead size. Thanks, Fengguang ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 18:40 ` Christoph Hellwig 2012-01-24 19:07 ` Chris Mason @ 2012-01-24 19:11 ` Jeff Moyer 1 sibling, 0 replies; 76+ messages in thread From: Jeff Moyer @ 2012-01-24 19:11 UTC (permalink / raw) To: Christoph Hellwig Cc: Andreas Dilger, Chris Mason, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi@vger.kernel.org, neilb@suse.de, dm-devel@redhat.com, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Darrick J.Wong Christoph Hellwig <hch@infradead.org> writes: > All three filesystems use the generic mpages code for reads, so they > all get the same (bad) I/O patterns. Looks like we need to fix this up > ASAP. Actually, in discussing this with Vivek, he mentioned that read ahead might be involved. Sure enough, after bumping read_ahead_kb, 1MB I/Os are sent down to the storage (for xfs anyway). Cheers, Jeff ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 18:05 ` [dm-devel] " Jeff Moyer 2012-01-24 18:40 ` Christoph Hellwig @ 2012-01-26 22:31 ` Dave Chinner 1 sibling, 0 replies; 76+ messages in thread From: Dave Chinner @ 2012-01-26 22:31 UTC (permalink / raw) To: Jeff Moyer Cc: Andreas Dilger, Christoph Hellwig, Chris Mason, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi@vger.kernel.org, neilb@suse.de, dm-devel@redhat.com, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Darrick J.Wong On Tue, Jan 24, 2012 at 01:05:50PM -0500, Jeff Moyer wrote: > Andreas Dilger <adilger@dilger.ca> writes: > I've been wondering if it's gotten better, so decided to run a few quick > tests. > > kernel version 3.2.0, storage: hp eva fc array, i/o scheduler cfq, > max_sectors_kb: 1024, test program: dd > > ext3: > - buffered writes and buffered O_SYNC writes, all 1MB block size show 4k > I/Os passed down to the I/O scheduler > - buffered 1MB reads are a little better, typically in the 128k-256k > range when they hit the I/O scheduler. > > ext4: > - buffered writes: 512K I/Os show up at the elevator > - buffered O_SYNC writes: data is again 512KB, journal writes are 4K > - buffered 1MB reads get down to the scheduler in 128KB chunks > > xfs: > - buffered writes: 1MB I/Os show up at the elevator > - buffered O_SYNC writes: 1MB I/Os > - buffered 1MB reads: 128KB chunks show up at the I/O scheduler > > So, ext4 is doing better than ext3, but still not perfect. xfs is > kicking ass for writes, but reads are still split up. Isn't that simply because the default readahead is 128k? Change the readahead to be much larger, and you should see much larger IOs being issued.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 15:15 ` Chris Mason 2012-01-24 16:56 ` [dm-devel] " Christoph Hellwig @ 2012-01-24 17:12 ` Jeff Moyer 2012-01-24 17:32 ` Chris Mason 1 sibling, 1 reply; 76+ messages in thread From: Jeff Moyer @ 2012-01-24 17:12 UTC (permalink / raw) To: Chris Mason Cc: Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong Chris Mason <chris.mason@oracle.com> writes: > On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote: >> Andrea Arcangeli <aarcange@redhat.com> writes: >> >> > On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote: >> >> requst granularity. Sure, big requests will take longer to complete but >> >> maximum request size is relatively low (512k by default) so writing maximum >> >> sized request isn't that much slower than writing 4k. So it works OK in >> >> practice. >> > >> > Totally unrelated to the writeback, but the merged big 512k requests >> > actually adds up some measurable I/O scheduler latencies and they in >> > turn slightly diminish the fairness that cfq could provide with >> > smaller max request size. Probably even more measurable with SSDs (but >> > then SSDs are even faster). >> >> Are you speaking from experience? If so, what workloads were negatively >> affected by merging, and how did you measure that? > > https://lkml.org/lkml/2011/12/13/326 > > This patch is another example, although for a slight different reason. > I really have no idea yet what the right answer is in a generic sense, > but you don't need a 512K request to see higher latencies from merging. Well, this patch has almost nothing to with merging, right? It's about keeping I/O from the I/O scheduler for too long (or, prior to on-stack plugging, it was about keeping the queue plugged for too long). And, I'm pretty sure that the testing involved there was with deadline or noop, nothing to do with CFQ fairness. ;-) However, this does bring to light the bigger problem of optimizing for the underlying storage and the workload requirements. Some tuning can be done in the I/O scheduler, but the plugging definitely circumvents that a little bit. -Jeff ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 17:12 ` Jeff Moyer @ 2012-01-24 17:32 ` Chris Mason 2012-01-24 18:14 ` Jeff Moyer 0 siblings, 1 reply; 76+ messages in thread From: Chris Mason @ 2012-01-24 17:32 UTC (permalink / raw) To: Jeff Moyer Cc: Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong On Tue, Jan 24, 2012 at 12:12:30PM -0500, Jeff Moyer wrote: > Chris Mason <chris.mason@oracle.com> writes: > > > On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote: > >> Andrea Arcangeli <aarcange@redhat.com> writes: > >> > >> > On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote: > >> >> requst granularity. Sure, big requests will take longer to complete but > >> >> maximum request size is relatively low (512k by default) so writing maximum > >> >> sized request isn't that much slower than writing 4k. So it works OK in > >> >> practice. > >> > > >> > Totally unrelated to the writeback, but the merged big 512k requests > >> > actually adds up some measurable I/O scheduler latencies and they in > >> > turn slightly diminish the fairness that cfq could provide with > >> > smaller max request size. Probably even more measurable with SSDs (but > >> > then SSDs are even faster). > >> > >> Are you speaking from experience? If so, what workloads were negatively > >> affected by merging, and how did you measure that? > > > > https://lkml.org/lkml/2011/12/13/326 > > > > This patch is another example, although for a slight different reason. > > I really have no idea yet what the right answer is in a generic sense, > > but you don't need a 512K request to see higher latencies from merging. > > Well, this patch has almost nothing to with merging, right? It's about > keeping I/O from the I/O scheduler for too long (or, prior to on-stack > plugging, it was about keeping the queue plugged for too long). And, > I'm pretty sure that the testing involved there was with deadline or > noop, nothing to do with CFQ fairness. ;-) > > However, this does bring to light the bigger problem of optimizing for > the underlying storage and the workload requirements. Some tuning can > be done in the I/O scheduler, but the plugging definitely circumvents > that a little bit. Well, its merging in the sense that we know with perfect accuracy how often it happens (all the time) and how big an impact it had on latency. You're right that it isn't related to fairness because in this workload the only IO being sent down was these writes, and only one process was doing it. I mention it mostly because the numbers go against all common sense (at least for me). Storage just isn't as predictable anymore. The benchmarking team later reported the patch improved latencies on all io, not just the log writer. This one box is fairly consistent. -chris ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 17:32 ` Chris Mason @ 2012-01-24 18:14 ` Jeff Moyer 0 siblings, 0 replies; 76+ messages in thread From: Jeff Moyer @ 2012-01-24 18:14 UTC (permalink / raw) To: Chris Mason Cc: Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong Chris Mason <chris.mason@oracle.com> writes: > On Tue, Jan 24, 2012 at 12:12:30PM -0500, Jeff Moyer wrote: >> Chris Mason <chris.mason@oracle.com> writes: >> >> > On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote: >> >> Andrea Arcangeli <aarcange@redhat.com> writes: >> >> >> >> > On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote: >> >> >> requst granularity. Sure, big requests will take longer to complete but >> >> >> maximum request size is relatively low (512k by default) so writing maximum >> >> >> sized request isn't that much slower than writing 4k. So it works OK in >> >> >> practice. >> >> > >> >> > Totally unrelated to the writeback, but the merged big 512k requests >> >> > actually adds up some measurable I/O scheduler latencies and they in >> >> > turn slightly diminish the fairness that cfq could provide with >> >> > smaller max request size. Probably even more measurable with SSDs (but >> >> > then SSDs are even faster). >> >> >> >> Are you speaking from experience? If so, what workloads were negatively >> >> affected by merging, and how did you measure that? >> > >> > https://lkml.org/lkml/2011/12/13/326 >> > >> > This patch is another example, although for a slight different reason. >> > I really have no idea yet what the right answer is in a generic sense, >> > but you don't need a 512K request to see higher latencies from merging. >> >> Well, this patch has almost nothing to with merging, right? It's about >> keeping I/O from the I/O scheduler for too long (or, prior to on-stack >> plugging, it was about keeping the queue plugged for too long). And, >> I'm pretty sure that the testing involved there was with deadline or >> noop, nothing to do with CFQ fairness. ;-) >> >> However, this does bring to light the bigger problem of optimizing for >> the underlying storage and the workload requirements. Some tuning can >> be done in the I/O scheduler, but the plugging definitely circumvents >> that a little bit. > > Well, its merging in the sense that we know with perfect accuracy how > often it happens (all the time) and how big an impact it had on latency. > You're right that it isn't related to fairness because in this workload > the only IO being sent down was these writes, and only one process was > doing it. > > I mention it mostly because the numbers go against all common sense (at > least for me). Storage just isn't as predictable anymore. Right, strange that we saw an improvement with the patch even on FC storage. So, it's not just fast SSDs that benefit. > The benchmarking team later reported the patch improved latencies on all > io, not just the log writer. This one box is fairly consistent. We've been running tests with that patch as well, and I've yet to find a downside. I haven't yet run the original synthetic workload, since I wanted real-world data first. It's on my list to keep poking at it. I haven't yet run against really slow storage, either, which I expect to show some regression with the patch. Cheers, Jeff ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-18 23:42 ` Boaz Harrosh 2012-01-19 9:46 ` Jan Kara @ 2012-01-25 0:23 ` NeilBrown 2012-01-25 6:11 ` Andreas Dilger 1 sibling, 1 reply; 76+ messages in thread From: NeilBrown @ 2012-01-25 0:23 UTC (permalink / raw) To: Boaz Harrosh Cc: Jan Kara, Darrick J. Wong, Mike Snitzer, lsf-pc, linux-fsdevel, dm-devel, linux-scsi [-- Attachment #1: Type: text/plain, Size: 717 bytes --] On Thu, 19 Jan 2012 01:42:12 +0200 Boaz Harrosh <bharrosh@panasas.com> wrote: > >> Or possibly modify md-raid5 not to snapshot dirty pages prior to xor/write? > >> (I never really bothered to find out if it really does this.) > > md-raid5/1 currently copies all pages if that what you meant. > Small correction: RAID5 and RAID6 copy all pages. RAID1 and RAID10 do not. If the incoming bios had nicely aligned pages which were somehow flagged to say that they would not change until the request completed, then it should be trivial to avoid that copy. NeilBrown > > Not sure either. Neil should know :) (added to CC). > > > > Honze > > Thanks > Boaz [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  2012-01-25 0:23          ` NeilBrown
@ 2012-01-25 6:11            ` Andreas Dilger
  0 siblings, 0 replies; 76+ messages in thread
From: Andreas Dilger @ 2012-01-25 6:11 UTC (permalink / raw)
  To: NeilBrown
  Cc: Boaz Harrosh, Jan Kara, Darrick J. Wong, Mike Snitzer, lsf-pc,
	linux-fsdevel, dm-devel, linux-scsi

On 2012-01-24, at 5:23 PM, NeilBrown wrote:
> On Thu, 19 Jan 2012 01:42:12 +0200 Boaz Harrosh <bharrosh@panasas.com> wrote:
>
>>>> Or possibly modify md-raid5 not to snapshot dirty pages prior to xor/write?
>>>> (I never really bothered to find out if it really does this.)
>>
>> md-raid5/1 currently copies all pages if that what you meant.
>>
>
> Small correction: RAID5 and RAID6 copy all pages.
> RAID1 and RAID10 do not.
>
> If the incoming bios had nicely aligned pages which were somehow flagged to
> say that they would not change until the request completed, then it should be
> trivial to avoid that copy.

Lustre has a patch to that effect that we've been carrying for several
years. It avoids copying of the pages submitted to the RAID5/6 layer, and
provides a significant improvement in performance and efficiency.

A version of the patches for RHEL6 is available at:
http://review.whamcloud.com/1142 though I don't know how close it is to
working with the latest kernel.

Cheers, Andreas

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  2012-01-18 22:58   ` Darrick J. Wong
  2012-01-18 23:22     ` Jan Kara
@ 2012-01-18 23:39     ` Dan Williams
  1 sibling, 0 replies; 76+ messages in thread
From: Dan Williams @ 2012-01-18 23:39 UTC (permalink / raw)
  To: djwong; +Cc: Jan Kara, linux-fsdevel, dm-devel, lsf-pc, linux-scsi, Mike Snitzer

On Wed, Jan 18, 2012 at 2:58 PM, Darrick J. Wong <djwong@us.ibm.com> wrote:
> Or possibly modify md-raid5 not to snapshot dirty pages prior to xor/write?
> (I never really bothered to find out if it really does this.)

It does. ops_run_biodrain() copies from bio to the stripe cache before
performing xor.

--
Dan

^ permalink raw reply	[flat|nested] 76+ messages in thread
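(Not from the thread: the hazard that the copy in ops_run_biodrain() guards against can be
illustrated from userspace. The sketch below assumes glibc POSIX AIO, a filesystem that
accepts O_DIRECT, and a scratch file named "testfile"; build with something like
"gcc demo.c -lrt". It starts an asynchronous direct write and then scribbles on the buffer
while the request is in flight. Without a copy or a stable-page guarantee the device may
see either version of the data, which for RAID5/6 would mean parity computed over data
that later changed.)

#define _GNU_SOURCE
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct aiocb cb;
	char *buf;
	int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);

	if (fd < 0 || posix_memalign((void **)&buf, 4096, 4096))
		return 1;

	memset(buf, 'A', 4096);
	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = 4096;
	cb.aio_offset = 0;

	if (aio_write(&cb))			/* the request is now "in flight" */
		return 1;

	memset(buf, 'B', 4096);			/* ...and we modify the page anyway */

	while (aio_error(&cb) == EINPROGRESS)
		;				/* busy-wait; fine for a demo */

	/* The on-disk block may now hold 'A's, 'B's, or a mix.  This is the
	 * race that md-raid5's stripe-cache copy (or true stable pages)
	 * closes: parity must not be computed over data that can still move. */
	printf("write returned %zd\n", aio_return(&cb));
	close(fd);
	return 0;
}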
* Re: [LSF/MM TOPIC] a few storage topics
  2012-01-17 20:06 ` [LSF/MM TOPIC] a few storage topics Mike Snitzer
  2012-01-17 21:36   ` [Lsf-pc] " Jan Kara
@ 2012-01-24 17:59   ` Martin K. Petersen
  2012-01-24 19:48     ` Douglas Gilbert
  1 sibling, 1 reply; 76+ messages in thread
From: Martin K. Petersen @ 2012-01-24 17:59 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: lsf-pc, linux-scsi, dm-devel, linux-fsdevel

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> 1) expose WRITE SAME via higher level interface (ala
Mike> sb_issue_discard) for more efficient zeroing on SCSI devices
Mike> that support it

I actually thought I had submitted those patches as part of the thin
provisioning update. Looks like I held them back for some reason. I'll
check my notes to figure out why and get the kit merged forward ASAP!

Mike> 4) is anyone working on an interface to GET LBA STATUS?
Mike>    - Martin Petersen added GET LBA STATUS support to scsi_debug,
Mike>      but is there a vision for how tools (e.g. pvmove) could
Mike>      access such info in a uniform way across different vendors'
Mike>      storage?

I hadn't thought of that use case. Going to be a bit tricky given how
GET LBA STATUS works...

--
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 76+ messages in thread
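(Not from the thread: GET LBA STATUS is already reachable from userspace through SG_IO, and
sg3_utils ships an sg_get_lba_status utility that does this properly. The rough sketch below
issues the command as defined in SBC-3, SERVICE ACTION IN(16) with service action 0x12,
against a device given on the command line and dumps the returned (LBA, length, provisioning
status) descriptors. It is only meant to show the shape of the data a tool like pvmove would
have to consume; there is no sense decoding, it starts at LBA 0, and it needs root.)

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(int argc, char **argv)
{
	unsigned char cdb[16] = { 0x9e, 0x12 };	/* SERVICE ACTION IN(16) / GET LBA STATUS */
	unsigned char buf[4096], sense[32];
	struct sg_io_hdr hdr;
	uint64_t lba = 0;			/* starting LBA for this sketch */
	uint32_t plen;
	unsigned int off, end;
	int fd, i;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
		fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
		return 1;
	}

	for (i = 0; i < 8; i++)			/* starting LBA, big-endian */
		cdb[2 + i] = lba >> (56 - 8 * i);
	cdb[10] = sizeof(buf) >> 24;		/* allocation length, big-endian */
	cdb[11] = sizeof(buf) >> 16;
	cdb[12] = sizeof(buf) >> 8;
	cdb[13] = sizeof(buf) & 0xff;

	memset(&hdr, 0, sizeof(hdr));
	hdr.interface_id = 'S';
	hdr.dxfer_direction = SG_DXFER_FROM_DEV;
	hdr.cmd_len = sizeof(cdb);
	hdr.cmdp = cdb;
	hdr.dxferp = buf;
	hdr.dxfer_len = sizeof(buf);
	hdr.sbp = sense;
	hdr.mx_sb_len = sizeof(sense);
	hdr.timeout = 20000;			/* milliseconds */

	if (ioctl(fd, SG_IO, &hdr) < 0 || (hdr.info & SG_INFO_OK_MASK) != SG_INFO_OK) {
		fprintf(stderr, "GET LBA STATUS failed (device may not support it)\n");
		return 1;
	}

	/* Parameter data: 4-byte length, 4 reserved bytes, then 16-byte
	 * descriptors: 8-byte LBA, 4-byte block count, provisioning status. */
	plen = ((uint32_t)buf[0] << 24) | (buf[1] << 16) | (buf[2] << 8) | buf[3];
	end = plen + 4 < sizeof(buf) ? plen + 4 : sizeof(buf);

	for (off = 8; off + 16 <= end; off += 16) {
		uint64_t dlba = 0;
		uint32_t blocks;
		int st;

		for (i = 0; i < 8; i++)
			dlba = (dlba << 8) | buf[off + i];
		blocks = ((uint32_t)buf[off + 8] << 24) | (buf[off + 9] << 16) |
			 (buf[off + 10] << 8) | buf[off + 11];
		st = buf[off + 12] & 0x0f;
		printf("LBA %llu, %u blocks: %s\n", (unsigned long long)dlba, blocks,
		       st == 0 ? "mapped" : st == 1 ? "deallocated" : "anchored/other");
	}
	close(fd);
	return 0;
}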
* Re: [LSF/MM TOPIC] a few storage topics
  2012-01-24 17:59   ` Martin K. Petersen
@ 2012-01-24 19:48     ` Douglas Gilbert
  2012-01-24 20:04       ` Martin K. Petersen
  0 siblings, 1 reply; 76+ messages in thread
From: Douglas Gilbert @ 2012-01-24 19:48 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Mike Snitzer, lsf-pc, linux-scsi, dm-devel, linux-fsdevel

On 12-01-24 12:59 PM, Martin K. Petersen wrote:
>>>>>> "Mike" == Mike Snitzer<snitzer@redhat.com> writes:
>
> Mike> 1) expose WRITE SAME via higher level interface (ala
> Mike> sb_issue_discard) for more efficient zeroing on SCSI devices
> Mike> that support it
>
> I actually thought I had submitted those patches as part of the thin
> provisioning update. Looks like I held them back for some reason. I'll
> check my notes to figure out why and get the kit merged forward ASAP!
>
>
> Mike> 4) is anyone working on an interface to GET LBA STATUS?
> Mike>    - Martin Petersen added GET LBA STATUS support to scsi_debug,
> Mike>      but is there a vision for how tools (e.g. pvmove) could
> Mike>      access such info in a uniform way across different vendors'
> Mike>      storage?
>
> I hadn't thought of that use case. Going to be a bit tricky given how
> GET LBA STATUS works...

What's new in ACS-3 (t13.org ATA Command Set): .....
f10138r6    Adds the ability for the device to return a
list of the LBAs that are currently trimmed.

So it looks like t13.org are adding a GET LBA STATUS type
facility. That in turn should lead to a SAT-3 (SCSI to ATA
Translation) definition of a mapping between both facilities.

Doug Gilbert

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: [LSF/MM TOPIC] a few storage topics
  2012-01-24 19:48     ` Douglas Gilbert
@ 2012-01-24 20:04       ` Martin K. Petersen
  0 siblings, 0 replies; 76+ messages in thread
From: Martin K. Petersen @ 2012-01-24 20:04 UTC (permalink / raw)
  To: dgilbert
  Cc: Martin K. Petersen, Mike Snitzer, lsf-pc, linux-scsi, dm-devel,
	linux-fsdevel

>>>>> "Doug" == Douglas Gilbert <dgilbert@interlog.com> writes:

>> I hadn't thought of that use case. Going to be a bit tricky given how
>> GET LBA STATUS works...

Doug> What's new in ACS-3 (t13.org ATA Command Set): .....
Doug> f10138r6    Adds the ability for the device to return a
Doug> list of the LBAs that are currently trimmed.

Doug> So it looks like t13.org are adding a GET LBA STATUS type
Doug> facility. That in turn should lead to a SAT-3 (SCSI to ATA
Doug> Translation) definition of a mapping between both facilities.

Yep. It is mostly how to handle the multi-range stuff going up the stack
that concerns me. We'd need something like FIEMAP...

--
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 76+ messages in thread
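(For reference, and not from the thread: the FIEMAP model Martin mentions is a single ioctl
that returns an array of (logical, physical, length, flags) ranges. Below is a minimal
userspace sketch of walking a file's extents with FS_IOC_FIEMAP; an interface reporting LBA
provisioning status up the stack would presumably look much the same, just with
mapped/deallocated/anchored flags on each range.)

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	unsigned int i, n = 32;			/* ask for up to 32 extents */
	int fd;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fm = calloc(1, sizeof(*fm) + n * sizeof(struct fiemap_extent));
	if (!fm)
		return 1;
	fm->fm_start = 0;
	fm->fm_length = ~0ULL;			/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush dirty data first */
	fm->fm_extent_count = n;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("logical %llu  physical %llu  length %llu%s\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_physical,
		       (unsigned long long)fe->fe_length,
		       (fe->fe_flags & FIEMAP_EXTENT_LAST) ? "  (last)" : "");
	}
	free(fm);
	close(fd);
	return 0;
}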