* [LSF/MM TOPIC] a few storage topics
  From: Mike Snitzer @ 2012-01-17 20:06 UTC
  To: lsf-pc; +Cc: linux-scsi, dm-devel, linux-fsdevel

1) expose WRITE SAME via a higher-level interface (a la sb_issue_discard) for
   more efficient zeroing on SCSI devices that support it
   - dm-thinp and dm-kcopyd could benefit from offloading the zeroing to the
     array
   - I'll be reviewing this closer to assess the scope of the work

2) revise fs/block_dev.c:__blkdev_put's "sync on last close" semantic; please
   see Mikulas Patocka's recent proposal on dm-devel:
   http://www.redhat.com/archives/dm-devel/2012-January/msg00021.html
   - the patch didn't create much discussion (other than hch's suggestion to
     use file->private_data).  Are the current semantics somehow important to
     some filesystems (e.g. NFS)?
   - allowing read-only opens to _not_ trigger a sync is desirable (e.g. if
     dm-thinp's storage pool was exhausted we should still be able to read
     data from thinp devices)

3) Are there any SSD+rotational storage caching layers being developed for
   upstream consideration (there were: bcache, fb's flashcache, etc.)?
   - Red Hat would like to know if leveraging the dm-thinp infrastructure to
     implement a new DM target for caching would be well received by the
     greater community
   - and are there any proposals for classifying data/files as cache hot, etc.
     (T10 has an active proposal for passing info in the CDB) -- is anyone
     working in this area?

4) is anyone working on an interface to GET LBA STATUS?
   - Martin Petersen added GET LBA STATUS support to scsi_debug, but is there
     a vision for how tools (e.g. pvmove) could access such info in a uniform
     way across different vendors' storage?

5) Any more progress on stable pages?
   - I know Darrick Wong had some proposals; what remains?
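Regarding item 1: a minimal sketch of the kind of call site dm-thinp or
dm-kcopyd could use, assuming the existing blkdev_issue_zeroout() helper (or a
sibling of it) were taught to issue WRITE SAME on devices that advertise
support.  The zero_extent() wrapper, its byte-based arguments and the GFP_NOFS
choice are illustrative assumptions, not an existing API.

#include <linux/blkdev.h>
#include <linux/gfp.h>

/*
 * Illustrative only: zero 'len' bytes starting at byte offset 'pos' of 'bdev'.
 * blkdev_issue_zeroout() is the existing in-kernel zeroing helper; the idea
 * floated above is to let it offload to WRITE SAME on capable devices instead
 * of submitting bios full of zero pages.
 */
static int zero_extent(struct block_device *bdev, loff_t pos, size_t len)
{
	sector_t sector = pos >> 9;		/* 512-byte sectors */
	sector_t nr_sects = len >> 9;

	return blkdev_issue_zeroout(bdev, sector, nr_sects, GFP_NOFS);
}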
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jan Kara @ 2012-01-17 21:36 UTC
  To: Mike Snitzer; +Cc: lsf-pc, linux-fsdevel, dm-devel, linux-scsi

On Tue 17-01-12 15:06:12, Mike Snitzer wrote:
> 5) Any more progress on stable pages?
>    - I know Darrick Wong had some proposals; what remains?
  As far as I know this is done for XFS, btrfs, ext4. Is more needed?

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Darrick J. Wong @ 2012-01-18 22:58 UTC
  To: Jan Kara; +Cc: Mike Snitzer, lsf-pc, linux-fsdevel, dm-devel, linux-scsi

On Tue, Jan 17, 2012 at 10:36:48PM +0100, Jan Kara wrote:
> On Tue 17-01-12 15:06:12, Mike Snitzer wrote:
> > 5) Any more progress on stable pages?
> >    - I know Darrick Wong had some proposals; what remains?
>   As far as I know this is done for XFS, btrfs, ext4. Is more needed?

Yep, it's done for those three fses.

I suppose it might help some people if instead of wait_on_page_writeback we
could simply page-migrate all the processes onto a new page...?

Or possibly modify md-raid5 not to snapshot dirty pages prior to xor/write?
(I never really bothered to find out if it really does this.)

--D

> 								Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
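For context on what "done for those three fses" means mechanically: the
stable-page wait lives in the write-fault path.  Below is a minimal,
filesystem-neutral sketch of a ->page_mkwrite handler - not any particular
filesystem's actual code - showing the wait_on_page_writeback() call that
keeps page contents fixed while lower layers read them, and that introduces
the latency discussed later in this thread.

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Sketch only: block a write fault until in-flight writeback of the page has
 * completed, so the block/RAID/DIF layer below never sees the data change
 * under it.  Real implementations also reserve blocks, handle truncate races,
 * etc.; this shows just the wait.
 */
static int sketch_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;

	lock_page(page);
	if (page->mapping != vma->vm_file->f_mapping) {
		unlock_page(page);
		return VM_FAULT_NOPAGE;
	}
	wait_on_page_writeback(page);	/* the stable-page wait */
	return VM_FAULT_LOCKED;		/* page stays locked and gets dirtied */
}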
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jan Kara @ 2012-01-18 23:22 UTC
  To: Darrick J. Wong; Cc: Jan Kara, Mike Snitzer, lsf-pc, linux-fsdevel, dm-devel, linux-scsi, neilb

On Wed 18-01-12 14:58:08, Darrick J. Wong wrote:
> On Tue, Jan 17, 2012 at 10:36:48PM +0100, Jan Kara wrote:
> > On Tue 17-01-12 15:06:12, Mike Snitzer wrote:
> > > 5) Any more progress on stable pages?
> > >    - I know Darrick Wong had some proposals; what remains?
> >   As far as I know this is done for XFS, btrfs, ext4. Is more needed?
>
> Yep, it's done for those three fses.
>
> I suppose it might help some people if instead of wait_on_page_writeback we
> could simply page-migrate all the processes onto a new page...?
  Well, but it will cost some more memory & copying, so whether it's faster
or not pretty much depends on the workload, doesn't it? Anyway, I've already
heard one guy complaining that his RT application redirties mmapped pages
and started seeing big latencies due to the stable pages work. So for these
guys migrating might be an option (or maybe an fadvise/madvise flag to do
copy-out before submitting for IO?).

> Or possibly modify md-raid5 not to snapshot dirty pages prior to xor/write?
> (I never really bothered to find out if it really does this.)
  Not sure either. Neil should know :) (added to CC).

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Boaz Harrosh @ 2012-01-18 23:42 UTC
  To: Jan Kara; +Cc: Mike Snitzer, linux-scsi, dm-devel, linux-fsdevel, lsf-pc

On 01/19/2012 01:22 AM, Jan Kara wrote:
> On Wed 18-01-12 14:58:08, Darrick J. Wong wrote:
>> On Tue, Jan 17, 2012 at 10:36:48PM +0100, Jan Kara wrote:
>>> On Tue 17-01-12 15:06:12, Mike Snitzer wrote:
>>>> 5) Any more progress on stable pages?
>>>>    - I know Darrick Wong had some proposals; what remains?
>>>   As far as I know this is done for XFS, btrfs, ext4. Is more needed?
>>
>> Yep, it's done for those three fses.
>>
>> I suppose it might help some people if instead of wait_on_page_writeback we
>> could simply page-migrate all the processes onto a new page...?
>
>   Well, but it will cost some more memory & copying, so whether it's faster
> or not pretty much depends on the workload, doesn't it? Anyway, I've already
> heard one guy complaining that his RT application redirties mmapped pages
> and started seeing big latencies due to the stable pages work. So for these
> guys migrating might be an option (or maybe an fadvise/madvise flag to do
> copy-out before submitting for IO?).
>

OK, that one is interesting. Because I'd imagine that the kernel would not
start write-out on a busily modified page. Some heavy modifying, then a
single write. If it's not so then there is already great inefficiency, just
now exposed, but it was always there. The "page-migrate" mentioned here will
not help.

Could we not better our page write-out algorithms to avoid heavily contended
pages?

Do you have a more detailed description of the workload? Is it theoretically
avoidable?

>> Or possibly modify md-raid5 not to snapshot dirty pages prior to xor/write?
>> (I never really bothered to find out if it really does this.)

md-raid5/1 currently copies all pages, if that's what you meant.

>   Not sure either. Neil should know :) (added to CC).
>
> 								Honza

Thanks
Boaz
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jan Kara @ 2012-01-19 9:46 UTC
  To: Boaz Harrosh; Cc: Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Thu 19-01-12 01:42:12, Boaz Harrosh wrote:
> On 01/19/2012 01:22 AM, Jan Kara wrote:
> > On Wed 18-01-12 14:58:08, Darrick J. Wong wrote:
> >> On Tue, Jan 17, 2012 at 10:36:48PM +0100, Jan Kara wrote:
> >>> On Tue 17-01-12 15:06:12, Mike Snitzer wrote:
> >>>> 5) Any more progress on stable pages?
> >>>>    - I know Darrick Wong had some proposals; what remains?
> >>>   As far as I know this is done for XFS, btrfs, ext4. Is more needed?
> >>
> >> Yep, it's done for those three fses.
> >>
> >> I suppose it might help some people if instead of wait_on_page_writeback we
> >> could simply page-migrate all the processes onto a new page...?
> >
> >   Well, but it will cost some more memory & copying, so whether it's faster
> > or not pretty much depends on the workload, doesn't it? Anyway, I've already
> > heard one guy complaining that his RT application redirties mmapped pages
> > and started seeing big latencies due to the stable pages work. So for these
> > guys migrating might be an option (or maybe an fadvise/madvise flag to do
> > copy-out before submitting for IO?).
>
> OK, that one is interesting. Because I'd imagine that the kernel would not
> start write-out on a busily modified page.
  So currently writeback doesn't use the fact of how busily a page is
modified. After all, the whole mm has only two sorts of pages - active &
inactive - which reflects how often a page is accessed but says nothing
about how often it is dirtied. So we don't have this information in the
kernel and it would be relatively (memory) expensive to keep it.

> Some heavy modifying, then a single write. If it's not so then there is
> already great inefficiency, just now exposed, but it was always there. The
> "page-migrate" mentioned here will not help.
  Yes, but I believe the RT guy doesn't redirty the page that often. It is
just that if you have to meet certain latency criteria, you cannot afford a
single case where you have to wait. And if you redirty pages, you are bound
to hit the PageWriteback case sooner or later.

> Could we not better our page write-out algorithms to avoid heavily contended
> pages?
  That's not so easy. Firstly, you'd have to track and keep that information
somehow. Secondly, it is better to write out a busily dirtied page than to
introduce a seek. Also the definition of 'busy' differs for different
purposes. So to make this useful the logic won't be trivial. Thirdly, the
benefit is questionable anyway (at least for most realistic workloads)
because the flusher thread doesn't write the pages all that often - when
there are not many pages, we write them out just once every couple of
seconds; when we have lots of dirty pages we cycle through all of them, so
one page is not written that often.

> Do you have a more detailed description of the workload? Is it theoretically
> avoidable?
  See https://lkml.org/lkml/2011/10/23/156. Using page migration or copyout
would solve the problems of this guy.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Andrea Arcangeli @ 2012-01-19 15:08 UTC
  To: Jan Kara; Cc: Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Thu, Jan 19, 2012 at 10:46:37AM +0100, Jan Kara wrote:
> So to make this useful the logic won't be trivial. Thirdly, the benefit is
> questionable anyway (at least for most realistic workloads) because the
> flusher thread doesn't write the pages all that often - when there are not
> many pages, we write them out just once every couple of seconds; when we
> have lots of dirty pages we cycle through all of them, so one page is not
> written that often.

If you mean migrate as in mm/migrate.c, that's also not cheap: it will page
fault anybody accessing the page, it'll do the page copy, it'll IPI all cpus
that had the mm in the TLB, and it locks the page too and does all sorts of
checks. But it's true it'll be CPU bound... while I understand the current
solution is I/O bound.

> > Do you have a more detailed description of the workload? Is it theoretically
> > avoidable?
>  See https://lkml.org/lkml/2011/10/23/156. Using page migration or copyout
> would solve the problems of this guy.

Copying in the I/O layer should be better than page migration:
1) copying the page to an I/O kernel buffer won't involve the expensive TLB
   IPIs that migration requires,
2) copying the page to an I/O kernel buffer won't cause page faults because
   of migration entries being set,
3) migration has to copy too, so the cost on the memory bus is the same.

So unless I'm missing something, page migration and pte/tlb mangling (I mean
as in mm/migrate.c) is worse in every way than bounce buffering at the I/O
layer if you notice the page can be modified while it's under I/O.
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jan Kara @ 2012-01-19 20:52 UTC
  To: Andrea Arcangeli; Cc: Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J. Wong

On Thu 19-01-12 16:08:49, Andrea Arcangeli wrote:
> On Thu, Jan 19, 2012 at 10:46:37AM +0100, Jan Kara wrote:
> > So to make this useful the logic won't be trivial. Thirdly, the benefit is
> > questionable anyway (at least for most realistic workloads) because the
> > flusher thread doesn't write the pages all that often - when there are not
> > many pages, we write them out just once every couple of seconds; when we
> > have lots of dirty pages we cycle through all of them, so one page is not
> > written that often.
>
> If you mean migrate as in mm/migrate.c, that's also not cheap: it will page
> fault anybody accessing the page, it'll do the page copy, it'll IPI all
> cpus that had the mm in the TLB, and it locks the page too and does all
> sorts of checks. But it's true it'll be CPU bound... while I understand the
> current solution is I/O bound.
  Thanks for the explanation. You are right that currently we are I/O bound,
so migration is probably faster on most HW, but as I said earlier, different
things might end up better in different workloads.

> > > Do you have a more detailed description of the workload? Is it theoretically
> > > avoidable?
> > See https://lkml.org/lkml/2011/10/23/156. Using page migration or copyout
> > would solve the problems of this guy.
>
> Copying in the I/O layer should be better than page migration:
> 1) copying the page to an I/O kernel buffer won't involve the expensive TLB
>    IPIs that migration requires,
> 2) copying the page to an I/O kernel buffer won't cause page faults because
>    of migration entries being set,
> 3) migration has to copy too, so the cost on the memory bus is the same.
>
> So unless I'm missing something, page migration and pte/tlb mangling (I mean
> as in mm/migrate.c) is worse in every way than bounce buffering at the I/O
> layer if you notice the page can be modified while it's under I/O.
  Well, but the advantage of migration is that you need to do it only if the
page is redirtied while under IO. Copying to an I/O buffer would have to be
done for *all* pages because once we submit the bio, we cannot change
anything. So what will be cheaper depends on how often pages are redirtied
while under IO. This is rather rare because pages aren't flushed all that
often, so the effect of stable pages is not observable on throughput. But
you can certainly see it on max latency...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Andrea Arcangeli @ 2012-01-19 21:39 UTC
  To: Jan Kara; Cc: Mike Snitzer, linux-scsi, neilb, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J. Wong

On Thu, Jan 19, 2012 at 09:52:11PM +0100, Jan Kara wrote:
> anything. So what will be cheaper depends on how often pages are redirtied
> while under IO. This is rather rare because pages aren't flushed all that
> often, so the effect of stable pages is not observable on throughput. But
> you can certainly see it on max latency...

I see your point. A problem with migrate though is that the page must be
pinned by the I/O layer to prevent migration from freeing the page under
I/O; how else could it be safe to read from a freed page? And if the page is
pinned, migration won't work at all. See page_freeze_refs in
migrate_page_move_mapping. So the pinning issue would need to be handled
somehow. It's needed for example when there's an O_DIRECT read and the I/O
is going to the page: if the page is migrated in that case, we'd lose a part
of the I/O. Differentiating how many page pins are ok to be ignored by
migration won't be trivial, but it is probably possible to do.

Another way maybe would be to detect when there's too much re-dirtying of
pages in flight in a short amount of time, and to start the bounce buffering
and stop waiting until the re-dirtying stops, and then stop the bounce
buffering. But unlike migration, it can't prevent an initial burst of high
fault latency...
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Boaz Harrosh @ 2012-01-22 11:31 UTC
  To: Andrea Arcangeli; Cc: Jan Kara, Mike Snitzer, linux-scsi, dm-devel, linux-fsdevel, lsf-pc

On 01/19/2012 11:39 PM, Andrea Arcangeli wrote:
> On Thu, Jan 19, 2012 at 09:52:11PM +0100, Jan Kara wrote:
>> anything. So what will be cheaper depends on how often pages are redirtied
>> while under IO. This is rather rare because pages aren't flushed all that
>> often, so the effect of stable pages is not observable on throughput. But
>> you can certainly see it on max latency...
>
> I see your point. A problem with migrate though is that the page must be
> pinned by the I/O layer to prevent migration from freeing the page under
> I/O; how else could it be safe to read from a freed page? And if the page
> is pinned, migration won't work at all. See page_freeze_refs in
> migrate_page_move_mapping. So the pinning issue would need to be handled
> somehow. It's needed for example when there's an O_DIRECT read and the I/O
> is going to the page: if the page is migrated in that case, we'd lose a
> part of the I/O. Differentiating how many page pins are ok to be ignored
> by migration won't be trivial, but it is probably possible to do.
>
> Another way maybe would be to detect when there's too much re-dirtying of
> pages in flight in a short amount of time, and to start the bounce
> buffering and stop waiting until the re-dirtying stops, and then stop the
> bounce buffering. But unlike migration, it can't prevent an initial burst
> of high fault latency...

Or just change that RT program, which is, one - latency bound, but, two -
does unpredictable, statistically bad things to a memory-mapped file.

Can a memory-mapped-file writer have some control over the time of writeback
with data_sync or such, or is it purely: timer fired, kernel sees a dirty
page, starts a writeout? What if the application maps a portion of the file
at a time, and the kernel gets lazier on an actively memory-mapped region?
(That's what Windows NT does. It will never IO a mapped section unless in
OOM conditions. The application needs to map small sections and unmap to IO.
It's more of a direct-IO than mmap.)

In any case, if you are very latency sensitive, an mmap writeout is bad for
you. Not only because of this new problem, but because mmap writeout can
sync with tons of other things that are due to memory management (as
mentioned by Andrea). The best option for a latency-sensitive application is
asynchronous direct-IO, by far. Only with asynchronous direct-IO can you
have any real control over your latency. (I understand they used to have an
empirically observed latency bound, but that is just luck, not real
control.)

BTW: The application mentioned would probably not want its IO bounced at the
block layer; otherwise why would it use mmap, if not to prevent the copy
induced by buffered IO?

All that said, a mount option for ext4 (is ext4 used?) to revert to the old
behavior is the easiest solution. When we originally brought this up at LSF
my thought was that the block request queue should have some flag that says
need_stable_pages. If it is set by the likes of dm/md-raid,
iSCSI-with-data-signing, DIF-enabled devices and so on, and the FS does not
guarantee/want stable pages, then an IO bounce is set up. But if it is not
set, then the likes of ext4 need not bother.
Thanks
Boaz
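A rough sketch of the need_stable_pages idea Boaz floats above. No such
request-queue flag exists in the kernels this thread discusses; the bit value
and helper names below are invented for illustration. The point is only the
shape of the interface: drivers that truly need stable pages advertise it,
and everyone else skips the wait (or the bounce).

#include <linux/blkdev.h>

/*
 * Hypothetical -- this flag does not exist; the bit number and names are
 * illustrative.  Drivers that need stable pages (md/dm RAID with parity,
 * iSCSI with data digests, DIF-capable HBAs) would set it at queue setup,
 * and filesystems / the block layer would wait or bounce only when set.
 */
#define QUEUE_FLAG_STABLE_PAGES	20	/* assumed-free bit, illustration only */

static inline bool blk_queue_stable_pages_required(struct request_queue *q)
{
	return test_bit(QUEUE_FLAG_STABLE_PAGES, &q->queue_flags);
}

static void example_driver_init(struct request_queue *q)
{
	queue_flag_set_unlocked(QUEUE_FLAG_STABLE_PAGES, q);
}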
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jan Kara @ 2012-01-23 16:30 UTC
  To: Boaz Harrosh; Cc: Andrea Arcangeli, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Sun 22-01-12 13:31:38, Boaz Harrosh wrote:
> On 01/19/2012 11:39 PM, Andrea Arcangeli wrote:
> > On Thu, Jan 19, 2012 at 09:52:11PM +0100, Jan Kara wrote:
> >> anything. So what will be cheaper depends on how often pages are redirtied
> >> while under IO. This is rather rare because pages aren't flushed all that
> >> often, so the effect of stable pages is not observable on throughput. But
> >> you can certainly see it on max latency...
> >
> > I see your point. A problem with migrate though is that the page must be
> > pinned by the I/O layer to prevent migration from freeing the page under
> > I/O; how else could it be safe to read from a freed page? And if the page
> > is pinned, migration won't work at all. See page_freeze_refs in
> > migrate_page_move_mapping. So the pinning issue would need to be handled
> > somehow. It's needed for example when there's an O_DIRECT read and the I/O
> > is going to the page: if the page is migrated in that case, we'd lose a
> > part of the I/O. Differentiating how many page pins are ok to be ignored
> > by migration won't be trivial, but it is probably possible to do.
> >
> > Another way maybe would be to detect when there's too much re-dirtying of
> > pages in flight in a short amount of time, and to start the bounce
> > buffering and stop waiting until the re-dirtying stops, and then stop the
> > bounce buffering. But unlike migration, it can't prevent an initial burst
> > of high fault latency...
>
> Or just change that RT program, which is, one - latency bound, but, two -
> does unpredictable, statistically bad things to a memory-mapped file.
  Right. That's what I told the RT guy as well :) But he didn't like to hear
that because it meant more coding for him.

> Can a memory-mapped-file writer have some control over the time of
> writeback with data_sync or such, or is it purely: timer fired, kernel sees
> a dirty page, starts a writeout? What if the application maps a portion of
> the file at a time, and the kernel gets lazier on an actively memory-mapped
> region? (That's what Windows NT does. It will never IO a mapped section
> unless in OOM conditions. The application needs to map small sections and
> unmap to IO. It's more of a direct-IO than mmap.)
  You can always start writeback by sync_file_range() but you have no
guarantees about what writeback does. Also, if you need to redirty the page
permanently (e.g. it's the head of your transaction log), there's simply no
good time when it can be written when you also want stable pages.

> In any case, if you are very latency sensitive, an mmap writeout is bad for
> you. Not only because of this new problem, but because mmap writeout can
> sync with tons of other things that are due to memory management (as
> mentioned by Andrea). The best option for a latency-sensitive application
> is asynchronous direct-IO, by far. Only with asynchronous direct-IO can you
> have any real control over your latency.
> (I understand they used to have an empirically observed latency bound, but
> that is just luck, not real control.)
>
> BTW: The application mentioned would probably not want its IO bounced at
> the block layer; otherwise why would it use mmap, if not to prevent the
> copy induced by buffered IO?
  Yeah, I'm not sure why their design was as it was.

> All that said, a mount option for ext4 (is ext4 used?) to revert to the old
> behavior is the easiest solution. When we originally brought this up at LSF
> my thought was that the block request queue should have some flag that says
> need_stable_pages. If it is set by the likes of dm/md-raid,
> iSCSI-with-data-signing, DIF-enabled devices and so on, and the FS does not
> guarantee/want stable pages, then an IO bounce is set up. But if it is not
> set, then the likes of ext4 need not bother.
  There's no mount option. The behavior is on unconditionally. And so far I
have not seen enough people complain to introduce something like that -
automatic logic is a different thing, of course. That might be nice to have.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
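For reference, the sync_file_range() call Jan mentions, as an application
would issue it to kick off writeback of a dirty range without waiting for
completion (and, as he notes, without any guarantee about what writeback
ultimately does).  The file name, offset and length here are example values.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Start writeback of the first 1 MiB; does not wait for completion. */
	if (sync_file_range(fd, 0, 1 << 20, SYNC_FILE_RANGE_WRITE) < 0)
		perror("sync_file_range");
	return 0;
}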
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Boaz Harrosh @ 2012-01-22 12:21 UTC
  To: Jan Kara; +Cc: Mike Snitzer, linux-scsi, dm-devel, linux-fsdevel, lsf-pc

On 01/19/2012 11:46 AM, Jan Kara wrote:
>>
>> OK, that one is interesting. Because I'd imagine that the kernel would not
>> start write-out on a busily modified page.
>   So currently writeback doesn't use the fact of how busily a page is
> modified. After all, the whole mm has only two sorts of pages - active &
> inactive - which reflects how often a page is accessed but says nothing
> about how often it is dirtied. So we don't have this information in the
> kernel and it would be relatively (memory) expensive to keep it.
>

Don't we? What about the information used by the IO elevators per io-group?
Is it not collected at redirty time, or is it only recorded by the time a
bio is submitted? How does the io-elevator keep small IO behind a heavy
writer latency bound? We could use the reverse of that to avoid doing IO on
the "too soon" pages.

>> Some heavy modifying, then a single write. If it's not so then there is
>> already great inefficiency, just now exposed, but it was always there. The
>> "page-migrate" mentioned here will not help.
>   Yes, but I believe the RT guy doesn't redirty the page that often. It is
> just that if you have to meet certain latency criteria, you cannot afford a
> single case where you have to wait. And if you redirty pages, you are bound
> to hit the PageWriteback case sooner or later.
>

OK, thanks. I needed this overview. What you mean is that since the
writeback fires periodically, there must be times when the page or group of
pages is just in the middle of changing and the writeback catches only half
of the modification.

So what if we let the dirty data always wait out that writeback timeout: if
the pages are "too new" and memory conditions are fine, then postpone the
writeout to the next round (assuming we have that information from the
first part).

>> Could we not better our page write-out algorithms to avoid heavily
>> contended pages?
>   That's not so easy. Firstly, you'd have to track and keep that information
> somehow. Secondly, it is better to write out a busily dirtied page than to
> introduce a seek.

Sure, I'd say we just go on the timestamp of the first page in the group,
because I'd imagine that the application has changed that group of pages
roughly at the same time.

> Also the definition of 'busy' differs for different purposes.
> So to make this useful the logic won't be trivial.

I don't think so. 1st: IO the oldest data. 2nd: postpone the IO of "too new"
data. So any dirtying has some "aging time" before we attack it. The aging
time is very much related to your writeback timer (which is "the amount of
memory buffer you want to keep" divided by your writeout rate).

> Thirdly, the benefit is
> questionable anyway (at least for most realistic workloads) because the
> flusher thread doesn't write the pages all that often - when there are not
> many pages, we write them out just once every couple of seconds; when we
> have lots of dirty pages we cycle through all of them, so one page is not
> written that often.
>

Exactly, so let's make sure dirty data is always a "couple of seconds" old.
Don't let that timer sample data that has just been dirtied.

Which brings me to another subject, the second case: "when we have lots of
dirty pages".
I wish we could talk at LSF/MM about how to not do a dumb cycle over an sb's
inodes but do a time-sorted write-out. The writeout is always started from
the lowest addressed page (lowest page->index), so take the dirty time of
that page as the sorting factor of the inode. And maybe keep a
min-inode-dirty-time per SB to prioritize between SBs.

Because, you see, elevator-less filesystems - the non-block-dev BDIs like
NFS or exofs - have a problem: a heavy writer can easily totally starve a
slow IOer (read or write). I can easily demonstrate how an NFS heavy writer
starves a KDE desktop to a crawl.

We should be starting to think about IO fairness and interactivity at the
VFS layer, so as to not let every non-block FS solve its own problem all
over again.

>> Do you have a more detailed description of the workload? Is it
>> theoretically avoidable?
>   See https://lkml.org/lkml/2011/10/23/156. Using page migration or copyout
> would solve the problems of this guy.
>
> 								Honza

Thanks
Boaz
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jan Kara @ 2012-01-23 16:18 UTC
  To: Boaz Harrosh; Cc: Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Sun 22-01-12 14:21:51, Boaz Harrosh wrote:
> On 01/19/2012 11:46 AM, Jan Kara wrote:
> >>
> >> OK, that one is interesting. Because I'd imagine that the kernel would not
> >> start write-out on a busily modified page.
> >   So currently writeback doesn't use the fact of how busily a page is
> > modified. After all, the whole mm has only two sorts of pages - active &
> > inactive - which reflects how often a page is accessed but says nothing
> > about how often it is dirtied. So we don't have this information in the
> > kernel and it would be relatively (memory) expensive to keep it.
>
> Don't we? What about the information used by the IO elevators per io-group?
> Is it not collected at redirty time, or is it only recorded by the time a
> bio is submitted? How does the io-elevator keep small IO behind a heavy
> writer latency bound? We could use the reverse of that to avoid doing IO on
> the "too soon" pages.
  The IO elevator is at a rather different level. It only starts tracking
something once we have a struct request, so it knows nothing about
redirtying, or even about pages as such. Also prioritization works only at
the request granularity. Sure, big requests will take longer to complete,
but the maximum request size is relatively low (512k by default) so writing
a maximum-sized request isn't that much slower than writing 4k. So it works
OK in practice.

> >> Some heavy modifying, then a single write. If it's not so then there is
> >> already great inefficiency, just now exposed, but it was always there. The
> >> "page-migrate" mentioned here will not help.
> >   Yes, but I believe the RT guy doesn't redirty the page that often. It is
> > just that if you have to meet certain latency criteria, you cannot afford a
> > single case where you have to wait. And if you redirty pages, you are bound
> > to hit the PageWriteback case sooner or later.
>
> OK, thanks. I needed this overview. What you mean is that since the
> writeback fires periodically, there must be times when the page or group of
> pages is just in the middle of changing and the writeback catches only half
> of the modification.
>
> So what if we let the dirty data always wait out that writeback timeout: if
  What do you mean by writeback timeout?

> the pages are "too new" and memory conditions are fine, then postpone the
  And what do you mean by "too new"?

> writeout to the next round (assuming we have that information from the
> first part).
  Sorry, I don't understand your idea...

> >> Could we not better our page write-out algorithms to avoid heavily
> >> contended pages?
> >   That's not so easy. Firstly, you'd have to track and keep that information
> > somehow. Secondly, it is better to write out a busily dirtied page than to
> > introduce a seek.
>
> Sure, I'd say we just go on the timestamp of the first page in the group,
> because I'd imagine that the application has changed that group of pages
> roughly at the same time.
  We don't have a timestamp on a page. What we have is a timestamp on an
inode. Ideally that would be the time when the oldest dirty page in the
inode was dirtied. Practically, we cannot really keep that information
accurate (e.g. after writing just some dirty pages in an inode), so it is a
rather crude approximation of that.
> > Also the definition of 'busy' differs for different purposes.
> > So to make this useful the logic won't be trivial.
>
> I don't think so. 1st: IO the oldest data. 2nd: postpone the IO of "too new"
> data. So any dirtying has some "aging time" before we attack it. The aging
> time is very much related to your writeback timer (which is "the amount of
> memory buffer you want to keep" divided by your writeout rate).
  Again I repeat - you don't want to introduce a seek into your IO stream
only because a single page got dirtied too recently. For randomly written
files there's always some compromise between how linear you want the IO to
be and how much you want to reflect page aging. Currently we go for 'totally
linear', which is easier to do and generally better for throughput.

> > Thirdly, the benefit is
> > questionable anyway (at least for most realistic workloads) because the
> > flusher thread doesn't write the pages all that often - when there are not
> > many pages, we write them out just once every couple of seconds; when we
> > have lots of dirty pages we cycle through all of them, so one page is not
> > written that often.
>
> Exactly, so let's make sure dirty data is always a "couple of seconds" old.
> Don't let that timer sample data that has just been dirtied.
>
> Which brings me to another subject, the second case: "when we have lots of
> dirty pages". I wish we could talk at LSF/MM about how to not do a dumb
> cycle over an sb's inodes but do a time-sorted write-out. The writeout is
> always started from the lowest addressed page (lowest page->index), so take
> the dirty time of that page as the sorting factor of the inode. And maybe
> keep a min-inode-dirty-time per SB to prioritize between SBs.
  Boaz, we already do track inodes by dirty time and do writeback in that
order. Go read the code in fs/fs-writeback.c.

> Because, you see, elevator-less filesystems - the non-block-dev BDIs like
> NFS or exofs - have a problem: a heavy writer can easily totally starve a
> slow IOer (read or write). I can easily demonstrate how an NFS heavy writer
> starves a KDE desktop to a crawl.
  Currently, we rely on the IO scheduler to protect light writers / readers.
You are right that for non-block filesystems that is problematic because for
them it is easy for heavy writers to starve light readers. But that doesn't
seem like a problem of writeback but rather a problem of the NFS client or
exofs? Especially in the reader-vs-writer case writeback simply doesn't have
enough information and isn't the right place to solve your problems. And I
agree it would be stupid to duplicate the code in CFQ in several places, so
maybe you could lift some parts of it and generalize them enough that they
can be used by others.

> We should be starting to think about IO fairness and interactivity at the
> VFS layer, so as to not let every non-block FS solve its own problem all
> over again.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
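For readers following along, a heavily simplified paraphrase of the age
ordering Jan points to in fs/fs-writeback.c: dirty inodes carry a
dirtied_when timestamp and the flusher dispatches only those dirtied before
an expiry cutoff, oldest first.  The function below sketches that idea
(compare move_expired_inodes()); it is not the actual kernel code, which
additionally handles superblock grouping, list splicing and jiffies
wraparound.

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/jiffies.h>

/*
 * Sketch of the time-ordered queueing in fs/fs-writeback.c: b_dirty is kept
 * ordered by inode->dirtied_when, oldest first, so we can stop at the first
 * inode that is still too young and move the rest to the dispatch list.
 */
static void sketch_queue_expired(struct list_head *b_dirty,
				 struct list_head *b_io,
				 unsigned long older_than_this)
{
	struct inode *inode, *next;

	list_for_each_entry_safe(inode, next, b_dirty, i_wb_list) {
		if (time_after(inode->dirtied_when, older_than_this))
			break;			/* everything after is younger */
		list_move_tail(&inode->i_wb_list, b_io);
	}
}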
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Andrea Arcangeli @ 2012-01-23 17:53 UTC
  To: Jan Kara; Cc: Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote:
> request granularity. Sure, big requests will take longer to complete, but
> the maximum request size is relatively low (512k by default) so writing a
> maximum-sized request isn't that much slower than writing 4k. So it works
> OK in practice.

Totally unrelated to the writeback, but the merged big 512k requests
actually add some measurable I/O scheduler latency, and that in turn
slightly diminishes the fairness that cfq could provide with a smaller max
request size. Probably even more measurable with SSDs (but then SSDs are
even faster).
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jeff Moyer @ 2012-01-23 18:28 UTC
  To: Andrea Arcangeli; Cc: Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

Andrea Arcangeli <aarcange@redhat.com> writes:

> On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote:
>> request granularity. Sure, big requests will take longer to complete, but
>> the maximum request size is relatively low (512k by default) so writing a
>> maximum-sized request isn't that much slower than writing 4k. So it works
>> OK in practice.
>
> Totally unrelated to the writeback, but the merged big 512k requests
> actually add some measurable I/O scheduler latency, and that in turn
> slightly diminishes the fairness that cfq could provide with a smaller max
> request size. Probably even more measurable with SSDs (but then SSDs are
> even faster).

Are you speaking from experience? If so, what workloads were negatively
affected by merging, and how did you measure that?

Cheers,
Jeff
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Andrea Arcangeli @ 2012-01-23 18:56 UTC
  To: Jeff Moyer; Cc: Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote:
> Are you speaking from experience? If so, what workloads were negatively
> affected by merging, and how did you measure that?

Any workload where two processes compete for access to the same disk and one
process writes big requests (usually async writes), the other small ones
(usually sync reads). The one with the small 4k requests (usually reads)
gets some artificial latency if the big requests are 512k. Vivek did a
recent measurement to verify the issue is still there, and it's basically a
hardware issue. Software can't do much other than possibly reducing the max
request size when we notice such an I/O pattern coming in cfq. I did old
measurements - that's how I knew about it - but they were so ancient they're
worthless by now, which is why Vivek had to repeat them to verify the issue
before we could assume it still existed on recent hardware.

These days with cgroups it may be a bit more relevant, as max write
bandwidth may be secondary to latency/QoS.
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jeff Moyer @ 2012-01-23 19:19 UTC
  To: Andrea Arcangeli; Cc: Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

Andrea Arcangeli <aarcange@redhat.com> writes:

> On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote:
>> Are you speaking from experience? If so, what workloads were negatively
>> affected by merging, and how did you measure that?
>
> Any workload where two processes compete for access to the same disk and
> one process writes big requests (usually async writes), the other small
> ones (usually sync reads). The one with the small 4k requests (usually
> reads) gets some artificial latency if the big requests are 512k. Vivek
> did a recent measurement to verify the issue is still there, and it's
> basically a hardware issue. Software can't do much other than possibly
> reducing the max request size when we notice such an I/O pattern coming in
> cfq. I did old measurements - that's how I knew about it - but they were
> so ancient they're worthless by now, which is why Vivek had to repeat them
> to verify the issue before we could assume it still existed on recent
> hardware.
>
> These days with cgroups it may be a bit more relevant, as max write
> bandwidth may be secondary to latency/QoS.

Thanks, Vivek was able to point me at the old thread:
http://www.spinics.net/lists/linux-fsdevel/msg44191.html

Cheers,
Jeff
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Chris Mason @ 2012-01-24 15:15 UTC
  To: Jeff Moyer; Cc: Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote:
> Andrea Arcangeli <aarcange@redhat.com> writes:
>
> > On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote:
> >> request granularity. Sure, big requests will take longer to complete,
> >> but the maximum request size is relatively low (512k by default) so
> >> writing a maximum-sized request isn't that much slower than writing 4k.
> >> So it works OK in practice.
> >
> > Totally unrelated to the writeback, but the merged big 512k requests
> > actually add some measurable I/O scheduler latency, and that in turn
> > slightly diminishes the fairness that cfq could provide with a smaller
> > max request size. Probably even more measurable with SSDs (but then SSDs
> > are even faster).
>
> Are you speaking from experience? If so, what workloads were negatively
> affected by merging, and how did you measure that?

https://lkml.org/lkml/2011/12/13/326

This patch is another example, although for a slightly different reason. I
really have no idea yet what the right answer is in a generic sense, but you
don't need a 512K request to see higher latencies from merging.

-chris
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Christoph Hellwig @ 2012-01-24 16:56 UTC
  To: Chris Mason, Jeff Moyer, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
> https://lkml.org/lkml/2011/12/13/326
>
> This patch is another example, although for a slightly different reason. I
> really have no idea yet what the right answer is in a generic sense, but
> you don't need a 512K request to see higher latencies from merging.

That assumes the 512k requests are created by merging. We have enough
workloads that create large I/O from the get-go, and not splitting them and
eventually merging them again would be a big win. E.g. I'm currently looking
at a distributed block device which uses internal 4MB chunks, and increasing
the maximum request size to that dramatically increases the read
performance.
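To make "increasing the maximum request size" concrete: the limit being
raised here is the queue's max_hw_sectors/max_sectors, which the driver
advertises and which caps how large a single request may grow.  A sketch,
assuming a driver whose hardware can take 4 MB transfers; the function name
is illustrative.

#include <linux/blkdev.h>

/*
 * Sketch: a block driver that can handle 4 MiB requests advertises that
 * limit on its queue.  8192 sectors * 512 bytes = 4 MiB.  The admin-visible
 * cap, /sys/block/<dev>/queue/max_sectors_kb, can then be raised up to this
 * hardware limit.
 */
static void example_setup_queue_limits(struct request_queue *q)
{
	blk_queue_max_hw_sectors(q, 8192);
}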
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Andreas Dilger @ 2012-01-24 17:01 UTC
  To: Christoph Hellwig; Cc: Chris Mason, Jeff Moyer, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

Cheers, Andreas

On 2012-01-24, at 9:56, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
>> https://lkml.org/lkml/2011/12/13/326
>>
>> This patch is another example, although for a slightly different reason.
>> I really have no idea yet what the right answer is in a generic sense,
>> but you don't need a 512K request to see higher latencies from merging.
>
> That assumes the 512k requests are created by merging. We have enough
> workloads that create large I/O from the get-go, and not splitting them
> and eventually merging them again would be a big win. E.g. I'm currently
> looking at a distributed block device which uses internal 4MB chunks, and
> increasing the maximum request size to that dramatically increases the
> read performance.
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics
  From: Andrea Arcangeli @ 2012-01-24 17:06 UTC
  To: Christoph Hellwig; Cc: Chris Mason, Jeff Moyer, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Tue, Jan 24, 2012 at 11:56:31AM -0500, Christoph Hellwig wrote:
> That assumes the 512k requests are created by merging. We have enough
> workloads that create large I/O from the get-go, and not splitting them
> and eventually merging them again would be a big win. E.g. I'm currently
> looking at a distributed block device which uses internal 4MB chunks, and
> increasing the maximum request size to that dramatically increases the
> read performance.

Depends on the device though: if it's a normal disk, it likely only reduces
the number of DMA ops without increasing performance too much. Most disks
should reach platter speed at 64KB, so larger requests only save a bit of
CPU in interrupts and such. But I think nobody here was suggesting reducing
the request size by default.

cfq should easily notice when there are multiple queues being submitted to
in the same time range. A device, in addition to specifying the max request
DMA size it can handle, could specify the minimum size at which it reaches
platter speed, and cfq could degrade to that when there are multiple queues
running in parallel over the same millisecond or so. Reads will return to
the I/O queue almost immediately, but they'll be out for a little while
until the data is copied to userland. So it'd need to keep the request size
down to the minimum the device needs to reach platter speed for a little
while; then, if no other queue presents itself, it could double the request
size for each unit of time until it reaches the max again. Maybe that could
work, maybe not :). Waiting only once for a 4MB request sounds better than
waiting a 4MB's worth of time for every 4k metadata-seeking read.
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics
  From: Chris Mason @ 2012-01-24 17:08 UTC
  To: Christoph Hellwig; Cc: Jeff Moyer, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Tue, Jan 24, 2012 at 11:56:31AM -0500, Christoph Hellwig wrote:
> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
> > https://lkml.org/lkml/2011/12/13/326
> >
> > This patch is another example, although for a slightly different reason.
> > I really have no idea yet what the right answer is in a generic sense,
> > but you don't need a 512K request to see higher latencies from merging.
>
> That assumes the 512k requests are created by merging. We have enough
> workloads that create large I/O from the get-go, and not splitting them
> and eventually merging them again would be a big win. E.g. I'm currently
> looking at a distributed block device which uses internal 4MB chunks, and
> increasing the maximum request size to that dramatically increases the
> read performance.

Is this read latency or read throughput? If you're waiting on the whole 4MB
anyway, I'd expect one request to be better for both.

But Andrea's original question was on the impact of the big request on other
requests being serviced by the drive... there's really not much we can do
about that outside of more knobs for the admin.

-chris
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Andreas Dilger @ 2012-01-24 17:08 UTC
  To: Christoph Hellwig; Cc: Andrea Arcangeli, Jan Kara, linux-scsi, Mike Snitzer, dm-devel, Jeff Moyer, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason

On 2012-01-24, at 9:56, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
>> https://lkml.org/lkml/2011/12/13/326
>>
>> This patch is another example, although for a slightly different reason.
>> I really have no idea yet what the right answer is in a generic sense,
>> but you don't need a 512K request to see higher latencies from merging.
>
> That assumes the 512k requests are created by merging. We have enough
> workloads that create large I/O from the get-go, and not splitting them
> and eventually merging them again would be a big win. E.g. I'm currently
> looking at a distributed block device which uses internal 4MB chunks, and
> increasing the maximum request size to that dramatically increases the
> read performance.

(sorry about the last email, hit send by accident)

I don't think we can have a "one size fits all" policy here. In most RAID
devices the IO size needs to be at least 1MB, and with newer devices 4MB
gives better performance.

One of the reasons that Lustre used to hack so much around the VFS and VM
APIs is exactly to avoid the splitting of read/write requests into pages and
then depending on the elevator to reconstruct a good-sized IO out of them.

Things have gotten better with newer kernels, but there is still a ways to
go w.r.t. allowing large IO requests to pass unhindered through to disk (or
at least as far as ensuring that the IO is aligned to the underlying disk
geometry).

Cheers, Andreas
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Jeff Moyer @ 2012-01-24 18:05 UTC
  To: Andreas Dilger; Cc: Christoph Hellwig, Chris Mason, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

Andreas Dilger <adilger@dilger.ca> writes:

> On 2012-01-24, at 9:56, Christoph Hellwig <hch@infradead.org> wrote:
>> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
>>> https://lkml.org/lkml/2011/12/13/326
>>>
>>> This patch is another example, although for a slightly different reason.
>>> I really have no idea yet what the right answer is in a generic sense,
>>> but you don't need a 512K request to see higher latencies from merging.
>>
>> That assumes the 512k requests are created by merging. We have enough
>> workloads that create large I/O from the get-go, and not splitting them
>> and eventually merging them again would be a big win. E.g. I'm currently
>> looking at a distributed block device which uses internal 4MB chunks, and
>> increasing the maximum request size to that dramatically increases the
>> read performance.
>
> (sorry about the last email, hit send by accident)
>
> I don't think we can have a "one size fits all" policy here. In most RAID
> devices the IO size needs to be at least 1MB, and with newer devices 4MB
> gives better performance.

Right, and there's more to it than just I/O size. There's access pattern,
and more importantly, workload and related requirements (latency vs
throughput).

> One of the reasons that Lustre used to hack so much around the VFS and VM
> APIs is exactly to avoid the splitting of read/write requests into pages
> and then depending on the elevator to reconstruct a good-sized IO out of
> them.
>
> Things have gotten better with newer kernels, but there is still a ways to
> go w.r.t. allowing large IO requests to pass unhindered through to disk
> (or at least as far as ensuring that the IO is aligned to the underlying
> disk geometry).

I've been wondering if it's gotten better, so I decided to run a few quick
tests.  Kernel version 3.2.0, storage: HP EVA FC array, I/O scheduler: cfq,
max_sectors_kb: 1024, test program: dd.

ext3:
- buffered writes and buffered O_SYNC writes, all 1MB block size, show 4k
  I/Os passed down to the I/O scheduler
- buffered 1MB reads are a little better, typically in the 128k-256k range
  when they hit the I/O scheduler

ext4:
- buffered writes: 512K I/Os show up at the elevator
- buffered O_SYNC writes: data is again 512KB, journal writes are 4K
- buffered 1MB reads get down to the scheduler in 128KB chunks

xfs:
- buffered writes: 1MB I/Os show up at the elevator
- buffered O_SYNC writes: 1MB I/Os
- buffered 1MB reads: 128KB chunks show up at the I/O scheduler

So, ext4 is doing better than ext3, but still not perfect. xfs is kicking
ass for writes, but reads are still split up.

Cheers,
Jeff
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  From: Christoph Hellwig @ 2012-01-24 18:40 UTC
  To: Jeff Moyer; Cc: Andreas Dilger, Christoph Hellwig, Chris Mason, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong

On Tue, Jan 24, 2012 at 01:05:50PM -0500, Jeff Moyer wrote:
> ext3:
> - buffered writes and buffered O_SYNC writes, all 1MB block size, show 4k
>   I/Os passed down to the I/O scheduler
> - buffered 1MB reads are a little better, typically in the 128k-256k range
>   when they hit the I/O scheduler
>
> ext4:
> - buffered writes: 512K I/Os show up at the elevator
> - buffered O_SYNC writes: data is again 512KB, journal writes are 4K
> - buffered 1MB reads get down to the scheduler in 128KB chunks
>
> xfs:
> - buffered writes: 1MB I/Os show up at the elevator
> - buffered O_SYNC writes: 1MB I/Os
> - buffered 1MB reads: 128KB chunks show up at the I/O scheduler
>
> So, ext4 is doing better than ext3, but still not perfect. xfs is kicking
> ass for writes, but reads are still split up.

All three filesystems use the generic mpages code for reads, so they all get
the same (bad) I/O patterns. Looks like we need to fix this up ASAP.
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 18:40 ` Christoph Hellwig @ 2012-01-24 19:07 ` Chris Mason 2012-01-24 19:14 ` Jeff Moyer 2012-01-24 19:11 ` [dm-devel] [Lsf-pc] " Jeff Moyer 1 sibling, 1 reply; 76+ messages in thread From: Chris Mason @ 2012-01-24 19:07 UTC (permalink / raw) To: Christoph Hellwig Cc: Jeff Moyer, Andreas Dilger, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi@vger.kernel.org, neilb@suse.de, dm-devel@redhat.com, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Darrick J.Wong On Tue, Jan 24, 2012 at 01:40:54PM -0500, Christoph Hellwig wrote: > On Tue, Jan 24, 2012 at 01:05:50PM -0500, Jeff Moyer wrote: > > - buffered writes and buffered O_SYNC writes, all 1MB block size show 4k > > I/Os passed down to the I/O scheduler > > - buffered 1MB reads are a little better, typically in the 128k-256k > > range when they hit the I/O scheduler. > > > > ext4: > > - buffered writes: 512K I/Os show up at the elevator > > - buffered O_SYNC writes: data is again 512KB, journal writes are 4K > > - buffered 1MB reads get down to the scheduler in 128KB chunks > > > > xfs: > > - buffered writes: 1MB I/Os show up at the elevator > > - buffered O_SYNC writes: 1MB I/Os > > - buffered 1MB reads: 128KB chunks show up at the I/O scheduler > > > > So, ext4 is doing better than ext3, but still not perfect. xfs is > > kicking ass for writes, but reads are still split up. > > All three filesystems use the generic mpages code for reads, so they > all get the same (bad) I/O patterns. Looks like we need to fix this up > ASAP. Can you easily run btrfs through the same rig? We don't use mpages and I'm curious. -chris ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 19:07 ` Chris Mason @ 2012-01-24 19:14 ` Jeff Moyer 2012-01-24 20:09 ` [Lsf-pc] [dm-devel] " Jan Kara 0 siblings, 1 reply; 76+ messages in thread From: Jeff Moyer @ 2012-01-24 19:14 UTC (permalink / raw) To: Chris Mason Cc: Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, Christoph Hellwig, dm-devel@redhat.com, fengguang.wu, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org Chris Mason <chris.mason@oracle.com> writes: >> All three filesystems use the generic mpages code for reads, so they >> all get the same (bad) I/O patterns. Looks like we need to fix this up >> ASAP. > > Can you easily run btrfs through the same rig? We don't use mpages and > I'm curious. The readahead code was to blame, here. I wonder if we can change the logic there to not break larger I/Os down into smaller sized ones. Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os, when 128KB is the read_ahead_kb value. Is there any heuristic you could apply to not break larger I/Os up like this? Does that make sense? Cheers, Jeff ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-24 19:14 ` Jeff Moyer @ 2012-01-24 20:09 ` Jan Kara 2012-01-24 20:13 ` [Lsf-pc] " Jeff Moyer 0 siblings, 1 reply; 76+ messages in thread From: Jan Kara @ 2012-01-24 20:09 UTC (permalink / raw) To: Jeff Moyer Cc: Chris Mason, Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, fengguang.wu, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Darrick J.Wong On Tue 24-01-12 14:14:14, Jeff Moyer wrote: > Chris Mason <chris.mason@oracle.com> writes: > > >> All three filesystems use the generic mpages code for reads, so they > >> all get the same (bad) I/O patterns. Looks like we need to fix this up > >> ASAP. > > > > Can you easily run btrfs through the same rig? We don't use mpages and > > I'm curious. > > The readahead code was to blame, here. I wonder if we can change the > logic there to not break larger I/Os down into smaller sized ones. > Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os, > when 128KB is the read_ahead_kb value. Is there any heuristic you could > apply to not break larger I/Os up like this? Does that make sense? Well, not breaking up I/Os would be fairly simple as ondemand_readahead() already knows how much do we want to read. We just trim the submitted I/O to read_ahead_kb artificially. And that is done so that you don't trash page cache (possibly evicting pages you have not yet copied to userspace) when there are several processes doing large reads. Maybe 128 KB is a too small default these days but OTOH noone prevents you from raising it (e.g. SLES uses 1 MB as a default). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 76+ messages in thread
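As an aside to Jan's point that nothing prevents raising the limit: the per-device value can be changed at runtime without any kernel change. The sketch below uses the long-standing BLKRAGET/BLKRASET ioctls (the same ones blockdev --getra/--setra use, in units of 512-byte sectors); writing to /sys/block/<dev>/queue/read_ahead_kb is equivalent. The device name is an example and BLKRASET requires CAP_SYS_ADMIN.

/*
 * Inspect and raise the per-device readahead window at runtime.
 * BLKRAGET/BLKRASET work in 512-byte sectors, so 2048 == 1 MB.
 */
#include <sys/ioctl.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/dev/sdb";     /* example device */
    unsigned long new_ra = 2048;      /* 1 MB expressed in 512-byte sectors */
    long ra;
    int fd = open(dev, O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(fd, BLKRAGET, &ra) == 0)
        printf("current readahead: %ld sectors (%ld KB)\n", ra, ra / 2);
    if (ioctl(fd, BLKRASET, new_ra) != 0)   /* needs CAP_SYS_ADMIN */
        perror("BLKRASET");
    close(fd);
    return 0;
}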
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 20:09 ` [Lsf-pc] [dm-devel] " Jan Kara @ 2012-01-24 20:13 ` Jeff Moyer 2012-01-24 20:39 ` [Lsf-pc] [dm-devel] " Jan Kara 0 siblings, 1 reply; 76+ messages in thread From: Jeff Moyer @ 2012-01-24 20:13 UTC (permalink / raw) To: Jan Kara Cc: Andreas Dilger, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, Christoph Hellwig, dm-devel@redhat.com, fengguang.wu, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason Jan Kara <jack@suse.cz> writes: > On Tue 24-01-12 14:14:14, Jeff Moyer wrote: >> Chris Mason <chris.mason@oracle.com> writes: >> >> >> All three filesystems use the generic mpages code for reads, so they >> >> all get the same (bad) I/O patterns. Looks like we need to fix this up >> >> ASAP. >> > >> > Can you easily run btrfs through the same rig? We don't use mpages and >> > I'm curious. >> >> The readahead code was to blame, here. I wonder if we can change the >> logic there to not break larger I/Os down into smaller sized ones. >> Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os, >> when 128KB is the read_ahead_kb value. Is there any heuristic you could >> apply to not break larger I/Os up like this? Does that make sense? > Well, not breaking up I/Os would be fairly simple as ondemand_readahead() > already knows how much do we want to read. We just trim the submitted I/O to > read_ahead_kb artificially. And that is done so that you don't trash page > cache (possibly evicting pages you have not yet copied to userspace) when > there are several processes doing large reads. Do you really think applications issue large reads and then don't use the data? I mean, I've seen some bad programming, so I can believe that would be the case. Still, I'd like to think it doesn't happen. ;-) > Maybe 128 KB is a too small default these days but OTOH noone prevents you > from raising it (e.g. SLES uses 1 MB as a default). For some reason, I thought it had been bumped to 512KB by default. Must be that overactive imagination I have... Anyway, if all of the distros start bumping the default, don't you think it's time to consider bumping it upstream, too? I thought there was a lot of work put into not being too aggressive on readahead, so the downside of having a larger read_ahead_kb setting was fairly small. Cheers, Jeff ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-24 20:13 ` [Lsf-pc] " Jeff Moyer @ 2012-01-24 20:39 ` Jan Kara 2012-01-24 20:59 ` Jeff Moyer 2012-01-25 3:29 ` Wu Fengguang 0 siblings, 2 replies; 76+ messages in thread From: Jan Kara @ 2012-01-24 20:39 UTC (permalink / raw) To: Jeff Moyer Cc: Jan Kara, Andreas Dilger, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, fengguang.wu, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > Jan Kara <jack@suse.cz> writes: > > > On Tue 24-01-12 14:14:14, Jeff Moyer wrote: > >> Chris Mason <chris.mason@oracle.com> writes: > >> > >> >> All three filesystems use the generic mpages code for reads, so they > >> >> all get the same (bad) I/O patterns. Looks like we need to fix this up > >> >> ASAP. > >> > > >> > Can you easily run btrfs through the same rig? We don't use mpages and > >> > I'm curious. > >> > >> The readahead code was to blame, here. I wonder if we can change the > >> logic there to not break larger I/Os down into smaller sized ones. > >> Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os, > >> when 128KB is the read_ahead_kb value. Is there any heuristic you could > >> apply to not break larger I/Os up like this? Does that make sense? > > Well, not breaking up I/Os would be fairly simple as ondemand_readahead() > > already knows how much do we want to read. We just trim the submitted I/O to > > read_ahead_kb artificially. And that is done so that you don't trash page > > cache (possibly evicting pages you have not yet copied to userspace) when > > there are several processes doing large reads. > > Do you really think applications issue large reads and then don't use > the data? I mean, I've seen some bad programming, so I can believe that > would be the case. Still, I'd like to think it doesn't happen. ;-) No, I meant a cache thrashing problem. Suppose that we always readahead as much as user asks and there are say 100 processes each wanting to read 4 MB. Then you need to find 400 MB in the page cache so that all reads can fit. And if you don't have them, reads for process 50 may evict pages we already preread for process 1, but process one didn't yet get to CPU to copy the data to userspace buffer. So the read becomes wasted. > > Maybe 128 KB is a too small default these days but OTOH noone prevents you > > from raising it (e.g. SLES uses 1 MB as a default). > > For some reason, I thought it had been bumped to 512KB by default. Must > be that overactive imagination I have... Anyway, if all of the distros > start bumping the default, don't you think it's time to consider bumping > it upstream, too? I thought there was a lot of work put into not being > too aggressive on readahead, so the downside of having a larger > read_ahead_kb setting was fairly small. Yeah, I believe 512KB should be pretty safe these days except for embedded world. OTOH average desktop user doesn't really care so it's mostly servers with beefy storage that care... (note that I wrote we raised the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise distro)). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-24 20:39 ` [Lsf-pc] [dm-devel] " Jan Kara @ 2012-01-24 20:59 ` Jeff Moyer 2012-01-24 21:08 ` Jan Kara 2012-01-25 3:29 ` Wu Fengguang 1 sibling, 1 reply; 76+ messages in thread From: Jeff Moyer @ 2012-01-24 20:59 UTC (permalink / raw) To: Jan Kara Cc: Andreas Dilger, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, fengguang.wu, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong Jan Kara <jack@suse.cz> writes: > On Tue 24-01-12 15:13:40, Jeff Moyer wrote: >> Jan Kara <jack@suse.cz> writes: >> >> > On Tue 24-01-12 14:14:14, Jeff Moyer wrote: >> >> Chris Mason <chris.mason@oracle.com> writes: >> >> >> >> >> All three filesystems use the generic mpages code for reads, so they >> >> >> all get the same (bad) I/O patterns. Looks like we need to fix this up >> >> >> ASAP. >> >> > >> >> > Can you easily run btrfs through the same rig? We don't use mpages and >> >> > I'm curious. >> >> >> >> The readahead code was to blame, here. I wonder if we can change the >> >> logic there to not break larger I/Os down into smaller sized ones. >> >> Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os, >> >> when 128KB is the read_ahead_kb value. Is there any heuristic you could >> >> apply to not break larger I/Os up like this? Does that make sense? >> > Well, not breaking up I/Os would be fairly simple as ondemand_readahead() >> > already knows how much do we want to read. We just trim the submitted I/O to >> > read_ahead_kb artificially. And that is done so that you don't trash page >> > cache (possibly evicting pages you have not yet copied to userspace) when >> > there are several processes doing large reads. >> >> Do you really think applications issue large reads and then don't use >> the data? I mean, I've seen some bad programming, so I can believe that >> would be the case. Still, I'd like to think it doesn't happen. ;-) > No, I meant a cache thrashing problem. Suppose that we always readahead > as much as user asks and there are say 100 processes each wanting to read 4 > MB. Then you need to find 400 MB in the page cache so that all reads can > fit. And if you don't have them, reads for process 50 may evict pages we > already preread for process 1, but process one didn't yet get to CPU to > copy the data to userspace buffer. So the read becomes wasted. Yeah, you're right, cache thrashing is an issue. In my tests, I didn't actually see the *initial* read come through as a full 1MB I/O, though. That seems odd to me. >> > Maybe 128 KB is a too small default these days but OTOH noone prevents you >> > from raising it (e.g. SLES uses 1 MB as a default). >> >> For some reason, I thought it had been bumped to 512KB by default. Must >> be that overactive imagination I have... Anyway, if all of the distros >> start bumping the default, don't you think it's time to consider bumping >> it upstream, too? I thought there was a lot of work put into not being >> too aggressive on readahead, so the downside of having a larger >> read_ahead_kb setting was fairly small. > Yeah, I believe 512KB should be pretty safe these days except for > embedded world. OTOH average desktop user doesn't really care so it's > mostly servers with beefy storage that care... (note that I wrote we raised > the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > distro)). Fair enough. 
Cheers, Jeff ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-24 20:59 ` Jeff Moyer @ 2012-01-24 21:08 ` Jan Kara 0 siblings, 0 replies; 76+ messages in thread From: Jan Kara @ 2012-01-24 21:08 UTC (permalink / raw) To: Jeff Moyer Cc: Jan Kara, Andreas Dilger, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, fengguang.wu, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Tue 24-01-12 15:59:02, Jeff Moyer wrote: > Jan Kara <jack@suse.cz> writes: > > On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > >> Jan Kara <jack@suse.cz> writes: > >> > >> > On Tue 24-01-12 14:14:14, Jeff Moyer wrote: > >> >> Chris Mason <chris.mason@oracle.com> writes: > >> >> > >> >> >> All three filesystems use the generic mpages code for reads, so they > >> >> >> all get the same (bad) I/O patterns. Looks like we need to fix this up > >> >> >> ASAP. > >> >> > > >> >> > Can you easily run btrfs through the same rig? We don't use mpages and > >> >> > I'm curious. > >> >> > >> >> The readahead code was to blame, here. I wonder if we can change the > >> >> logic there to not break larger I/Os down into smaller sized ones. > >> >> Fengguang, doing a dd if=file of=/dev/null bs=1M results in 128K I/Os, > >> >> when 128KB is the read_ahead_kb value. Is there any heuristic you could > >> >> apply to not break larger I/Os up like this? Does that make sense? > >> > Well, not breaking up I/Os would be fairly simple as ondemand_readahead() > >> > already knows how much do we want to read. We just trim the submitted I/O to > >> > read_ahead_kb artificially. And that is done so that you don't trash page > >> > cache (possibly evicting pages you have not yet copied to userspace) when > >> > there are several processes doing large reads. > >> > >> Do you really think applications issue large reads and then don't use > >> the data? I mean, I've seen some bad programming, so I can believe that > >> would be the case. Still, I'd like to think it doesn't happen. ;-) > > No, I meant a cache thrashing problem. Suppose that we always readahead > > as much as user asks and there are say 100 processes each wanting to read 4 > > MB. Then you need to find 400 MB in the page cache so that all reads can > > fit. And if you don't have them, reads for process 50 may evict pages we > > already preread for process 1, but process one didn't yet get to CPU to > > copy the data to userspace buffer. So the read becomes wasted. > > Yeah, you're right, cache thrashing is an issue. In my tests, I didn't > actually see the *initial* read come through as a full 1MB I/O, though. > That seems odd to me. At first sight yes. But buffered reading internally works page-by-page so what it does is that it looks at the first page it wants, sees we don't have that in memory, so we submit readahead (hence 128 KB request) and then wait for that page to become uptodate. Then, when we are coming to the end of preread window (trip over marked page), we submit another chunk of readahead... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 76+ messages in thread
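One practical consequence of the mechanism Jan describes: an application that knows it is about to stream a large range does not have to wait for the window to ramp up page by page; it can hint the kernel explicitly. A minimal sketch follows, assuming a hypothetical file path, using posix_fadvise(POSIX_FADV_SEQUENTIAL) and the readahead(2) syscall; how much is actually read ahead remains subject to the kernel's own limits.

/* Hint the kernel before streaming a large file sequentially. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/test/bigfile";   /* example path */
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* declare a sequential scan; the kernel may enlarge the readahead window */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    /* explicitly pull the first 4 MB into the page cache before reading */
    if (readahead(fd, 0, 4 * 1024 * 1024) != 0)
        perror("readahead");
    /* ... normal read() loop would follow ... */
    close(fd);
    return 0;
}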
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-24 20:39 ` [Lsf-pc] [dm-devel] " Jan Kara 2012-01-24 20:59 ` Jeff Moyer @ 2012-01-25 3:29 ` Wu Fengguang 2012-01-25 6:15 ` [Lsf-pc] " Andreas Dilger 1 sibling, 1 reply; 76+ messages in thread From: Wu Fengguang @ 2012-01-25 3:29 UTC (permalink / raw) To: Jan Kara Cc: Jeff Moyer, Andreas Dilger, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > On Tue 24-01-12 15:13:40, Jeff Moyer wrote: [snip] > > > Maybe 128 KB is a too small default these days but OTOH noone prevents you > > > from raising it (e.g. SLES uses 1 MB as a default). > > > > For some reason, I thought it had been bumped to 512KB by default. Must > > be that overactive imagination I have... Anyway, if all of the distros > > start bumping the default, don't you think it's time to consider bumping > > it upstream, too? I thought there was a lot of work put into not being > > too aggressive on readahead, so the downside of having a larger > > read_ahead_kb setting was fairly small. > Yeah, I believe 512KB should be pretty safe these days except for > embedded world. OTOH average desktop user doesn't really care so it's > mostly servers with beefy storage that care... (note that I wrote we raised > the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > distro)). Maybe we don't need to care much about the embedded world when raising the default readahead size? Because even the current 128KB is too much for them, and I see Android setting the readahead size to 4KB... Some time ago I posted a series for raising the default readahead size to 512KB. But I'm open to use 1MB now (shall we vote on it?). Thanks, Fengguang ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-25 3:29 ` Wu Fengguang @ 2012-01-25 6:15 ` Andreas Dilger 2012-01-25 6:35 ` [Lsf-pc] [dm-devel] " Wu Fengguang 2012-01-25 14:33 ` Steven Whitehouse 0 siblings, 2 replies; 76+ messages in thread From: Andreas Dilger @ 2012-01-25 6:15 UTC (permalink / raw) To: Wu Fengguang Cc: Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you >>>> from raising it (e.g. SLES uses 1 MB as a default). >>> >>> For some reason, I thought it had been bumped to 512KB by default. Must >>> be that overactive imagination I have... Anyway, if all of the distros >>> start bumping the default, don't you think it's time to consider bumping >>> it upstream, too? I thought there was a lot of work put into not being >>> too aggressive on readahead, so the downside of having a larger >>> read_ahead_kb setting was fairly small. >> >> Yeah, I believe 512KB should be pretty safe these days except for >> embedded world. OTOH average desktop user doesn't really care so it's >> mostly servers with beefy storage that care... (note that I wrote we raised >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise >> distro)). > > Maybe we don't need to care much about the embedded world when raising > the default readahead size? Because even the current 128KB is too much > for them, and I see Android setting the readahead size to 4KB... > > Some time ago I posted a series for raising the default readahead size > to 512KB. But I'm open to use 1MB now (shall we vote on it?). I'm all in favour of 1MB (aligned) readahead. I think the embedded folks already set enough CONFIG opts that we could trigger on one of those (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. It would also be possible to trigger on the size of the device so that the 32MB USB stick doesn't sit busy for a minute with readahead that is useless. Cheers, Andreas ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 6:15 ` [Lsf-pc] " Andreas Dilger @ 2012-01-25 6:35 ` Wu Fengguang 2012-01-25 14:00 ` Jan Kara 2012-01-26 16:25 ` Vivek Goyal 2012-01-25 14:33 ` Steven Whitehouse 1 sibling, 2 replies; 76+ messages in thread From: Wu Fengguang @ 2012-01-25 6:35 UTC (permalink / raw) To: Andreas Dilger Cc: Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Tue, Jan 24, 2012 at 11:15:13PM -0700, Andreas Dilger wrote: > On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you > >>>> from raising it (e.g. SLES uses 1 MB as a default). > >>> > >>> For some reason, I thought it had been bumped to 512KB by default. Must > >>> be that overactive imagination I have... Anyway, if all of the distros > >>> start bumping the default, don't you think it's time to consider bumping > >>> it upstream, too? I thought there was a lot of work put into not being > >>> too aggressive on readahead, so the downside of having a larger > >>> read_ahead_kb setting was fairly small. > >> > >> Yeah, I believe 512KB should be pretty safe these days except for > >> embedded world. OTOH average desktop user doesn't really care so it's > >> mostly servers with beefy storage that care... (note that I wrote we raised > >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > >> distro)). > > > > Maybe we don't need to care much about the embedded world when raising > > the default readahead size? Because even the current 128KB is too much > > for them, and I see Android setting the readahead size to 4KB... > > > > Some time ago I posted a series for raising the default readahead size > > to 512KB. But I'm open to use 1MB now (shall we vote on it?). > > I'm all in favour of 1MB (aligned) readahead. 1MB readahead aligned to i*1MB boundaries? I like this idea. It will work well if the filesystems employ the same alignment rule for large files. > I think the embedded folks > already set enough CONFIG opts that we could trigger on one of those > (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. Good point. We could add a configurable CONFIG_READAHEAD_KB=128 when CONFIG_EMBEDDED is selected. > It would also be > possible to trigger on the size of the device so that the 32MB USB stick > doesn't sit busy for a minute with readahead that is useless. Yeah, I do have a patch for shrinking readahead size based on device size. Thanks, Fengguang ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 6:35 ` [Lsf-pc] [dm-devel] " Wu Fengguang @ 2012-01-25 14:00 ` Jan Kara 2012-01-26 12:29 ` Andreas Dilger 2012-01-26 16:25 ` Vivek Goyal 1 sibling, 1 reply; 76+ messages in thread From: Jan Kara @ 2012-01-25 14:00 UTC (permalink / raw) To: Wu Fengguang Cc: Andreas Dilger, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Wed 25-01-12 14:35:52, Wu Fengguang wrote: > On Tue, Jan 24, 2012 at 11:15:13PM -0700, Andreas Dilger wrote: > > On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > > > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > > >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > > >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you > > >>>> from raising it (e.g. SLES uses 1 MB as a default). > > >>> > > >>> For some reason, I thought it had been bumped to 512KB by default. Must > > >>> be that overactive imagination I have... Anyway, if all of the distros > > >>> start bumping the default, don't you think it's time to consider bumping > > >>> it upstream, too? I thought there was a lot of work put into not being > > >>> too aggressive on readahead, so the downside of having a larger > > >>> read_ahead_kb setting was fairly small. > > >> > > >> Yeah, I believe 512KB should be pretty safe these days except for > > >> embedded world. OTOH average desktop user doesn't really care so it's > > >> mostly servers with beefy storage that care... (note that I wrote we raised > > >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > > >> distro)). > > > > > > Maybe we don't need to care much about the embedded world when raising > > > the default readahead size? Because even the current 128KB is too much > > > for them, and I see Android setting the readahead size to 4KB... > > > > > > Some time ago I posted a series for raising the default readahead size > > > to 512KB. But I'm open to use 1MB now (shall we vote on it?). > > > > I'm all in favour of 1MB (aligned) readahead. > > 1MB readahead aligned to i*1MB boundaries? I like this idea. It will > work well if the filesystems employ the same alignment rule for large > files. Yeah. Clever filesystems (e.g. XFS) can be configured to align files e.g. to raid stripes AFAIK so for them this could be worthwhile. > > I think the embedded folks > > already set enough CONFIG opts that we could trigger on one of those > > (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. > > Good point. We could add a configurable CONFIG_READAHEAD_KB=128 when > CONFIG_EMBEDDED is selected. Sounds good. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 14:00 ` Jan Kara @ 2012-01-26 12:29 ` Andreas Dilger 2012-01-27 17:03 ` Ted Ts'o 0 siblings, 1 reply; 76+ messages in thread From: Andreas Dilger @ 2012-01-26 12:29 UTC (permalink / raw) To: Wu Fengguang Cc: Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb@suse.de Brown, Christoph Hellwig, dm-devel@redhat.com development, Boaz Harrosh, linux-fsdevel@vger.kernel.org Devel, lsf-pc, Chris Mason, Darrick J.Wong On 2012-01-25, at 7:00 AM, Jan Kara wrote: > On Wed 25-01-12 14:35:52, Wu Fengguang wrote: >> On Tue, Jan 24, 2012 at 11:15:13PM -0700, Andreas Dilger wrote: >>> I'm all in favour of 1MB (aligned) readahead. >> >> 1MB readahead aligned to i*1MB boundaries? I like this idea. It will >> work well if the filesystems employ the same alignment rule for large >> files. > > Yeah. Clever filesystems (e.g. XFS) can be configured to align files e.g. > to raid stripes AFAIK so for them this could be worthwhile. Ext4 will also align IO to 1MB boundaries (from the start of LUN/partition) by default. If the mke2fs code detects the underlying RAID geometry (or the sysadmin sets this manually with tune2fs) it will store this in the superblock for the allocator to pick a better alignment. Cheers, Andreas ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 12:29 ` Andreas Dilger @ 2012-01-27 17:03 ` Ted Ts'o 0 siblings, 0 replies; 76+ messages in thread From: Ted Ts'o @ 2012-01-27 17:03 UTC (permalink / raw) To: Andreas Dilger Cc: Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb@suse.de Brown, Christoph Hellwig, dm-devel@redhat.com development, Boaz Harrosh, linux-fsdevel@vger.kernel.org Devel, lsf-pc, Chris Mason, Darrick J.Wong On Thu, Jan 26, 2012 at 05:29:03AM -0700, Andreas Dilger wrote: > > Ext4 will also align IO to 1MB boundaries (from the start of > LUN/partition) by default. If the mke2fs code detects the > underlying RAID geometry (or the sysadmin sets this manually with > tune2fs) it will store this in the superblock for the allocator to > pick a better alignment. (Still in Hawaii on vacation, but picked this up while I was quickly scanning through e-mail.) This is true only if you're using the special (non-upstream'ed) Lustre interfaces for writing Lustre objects. The writepages interface doesn't have all of the necessary smarts to do the right thing. It's been on my todo list to look at, but I've been mostly concentrated on single disk file systems since that's what we use at Google. (GFS can scale to many many file systems and servers, and avoiding RAID means fast FSCK recoveries, simplifying things since we don't have to worry about RAID-related failures, etc.) Eventually I'd like ext4 to handle RAID better, but unless you're forced to support really large files, I've come around to believing that n=3 replication or Reed-Solomon encoding across multiple servers is a much better way of achieving data robustness, so it's just not been high on my list of priorities. I'm much more interested in making sure ext4 works well under high memory pressure, and other cloud-related issues. - Ted ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 6:35 ` [Lsf-pc] [dm-devel] " Wu Fengguang 2012-01-25 14:00 ` Jan Kara @ 2012-01-26 16:25 ` Vivek Goyal 2012-01-26 20:37 ` Jan Kara 2012-01-26 22:34 ` Dave Chinner 1 sibling, 2 replies; 76+ messages in thread From: Vivek Goyal @ 2012-01-26 16:25 UTC (permalink / raw) To: Wu Fengguang Cc: Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Wed, Jan 25, 2012 at 02:35:52PM +0800, Wu Fengguang wrote: > On Tue, Jan 24, 2012 at 11:15:13PM -0700, Andreas Dilger wrote: > > On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > > > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > > >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > > >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you > > >>>> from raising it (e.g. SLES uses 1 MB as a default). > > >>> > > >>> For some reason, I thought it had been bumped to 512KB by default. Must > > >>> be that overactive imagination I have... Anyway, if all of the distros > > >>> start bumping the default, don't you think it's time to consider bumping > > >>> it upstream, too? I thought there was a lot of work put into not being > > >>> too aggressive on readahead, so the downside of having a larger > > >>> read_ahead_kb setting was fairly small. > > >> > > >> Yeah, I believe 512KB should be pretty safe these days except for > > >> embedded world. OTOH average desktop user doesn't really care so it's > > >> mostly servers with beefy storage that care... (note that I wrote we raised > > >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > > >> distro)). > > > > > > Maybe we don't need to care much about the embedded world when raising > > > the default readahead size? Because even the current 128KB is too much > > > for them, and I see Android setting the readahead size to 4KB... > > > > > > Some time ago I posted a series for raising the default readahead size > > > to 512KB. But I'm open to use 1MB now (shall we vote on it?). > > > > I'm all in favour of 1MB (aligned) readahead. > > 1MB readahead aligned to i*1MB boundaries? I like this idea. It will > work well if the filesystems employ the same alignment rule for large > files. > > > I think the embedded folks > > already set enough CONFIG opts that we could trigger on one of those > > (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. > > Good point. We could add a configurable CONFIG_READAHEAD_KB=128 when > CONFIG_EMBEDDED is selected. > > > It would also be > > possible to trigger on the size of the device so that the 32MB USB stick > > doesn't sit busy for a minute with readahead that is useless. > > Yeah, I do have a patch for shrinking readahead size based on device size. Should it be a udev rule to change read_ahead_kb on device based on device size, instead of a kernel patch? This is assuming device size is a good way to determine read ahead window size. I would guess that device speed should also matter though isn't it. If device is small but fast then it is probably ok to have larger read ahead window and vice versa. Thanks Vivek ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 16:25 ` Vivek Goyal @ 2012-01-26 20:37 ` Jan Kara 2012-01-26 22:34 ` Dave Chinner 1 sibling, 0 replies; 76+ messages in thread From: Jan Kara @ 2012-01-26 20:37 UTC (permalink / raw) To: Vivek Goyal Cc: Wu Fengguang, Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Jeff Moyer, Christoph Hellwig, dm-devel@redhat.com, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Thu 26-01-12 11:25:56, Vivek Goyal wrote: > On Wed, Jan 25, 2012 at 02:35:52PM +0800, Wu Fengguang wrote: > > On Tue, Jan 24, 2012 at 11:15:13PM -0700, Andreas Dilger wrote: > > > On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > > > > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > > > >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > > > >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you > > > >>>> from raising it (e.g. SLES uses 1 MB as a default). > > > >>> > > > >>> For some reason, I thought it had been bumped to 512KB by default. Must > > > >>> be that overactive imagination I have... Anyway, if all of the distros > > > >>> start bumping the default, don't you think it's time to consider bumping > > > >>> it upstream, too? I thought there was a lot of work put into not being > > > >>> too aggressive on readahead, so the downside of having a larger > > > >>> read_ahead_kb setting was fairly small. > > > >> > > > >> Yeah, I believe 512KB should be pretty safe these days except for > > > >> embedded world. OTOH average desktop user doesn't really care so it's > > > >> mostly servers with beefy storage that care... (note that I wrote we raised > > > >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > > > >> distro)). > > > > > > > > Maybe we don't need to care much about the embedded world when raising > > > > the default readahead size? Because even the current 128KB is too much > > > > for them, and I see Android setting the readahead size to 4KB... > > > > > > > > Some time ago I posted a series for raising the default readahead size > > > > to 512KB. But I'm open to use 1MB now (shall we vote on it?). > > > > > > I'm all in favour of 1MB (aligned) readahead. > > > > 1MB readahead aligned to i*1MB boundaries? I like this idea. It will > > work well if the filesystems employ the same alignment rule for large > > files. > > > > > I think the embedded folks > > > already set enough CONFIG opts that we could trigger on one of those > > > (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. > > > > Good point. We could add a configurable CONFIG_READAHEAD_KB=128 when > > CONFIG_EMBEDDED is selected. > > > > > It would also be > > > possible to trigger on the size of the device so that the 32MB USB stick > > > doesn't sit busy for a minute with readahead that is useless. > > > > Yeah, I do have a patch for shrinking readahead size based on device size. > > Should it be a udev rule to change read_ahead_kb on device based on device > size, instead of a kernel patch? Yes, we talked about that and I think having the logic in udev rule is easier. Just if we decided the logic should use a lot of kernel internal state, then it's better to have it in kernel. > This is assuming device size is a good way to determine read ahead window > size. I would guess that device speed should also matter though isn't it. 
> If device is small but fast then it is probably ok to have larger read ahead > window and vice versa. Yes, but speed is harder to measure than size ;) Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 16:25 ` Vivek Goyal 2012-01-26 20:37 ` Jan Kara @ 2012-01-26 22:34 ` Dave Chinner 2012-01-27 3:27 ` Wu Fengguang 1 sibling, 1 reply; 76+ messages in thread From: Dave Chinner @ 2012-01-26 22:34 UTC (permalink / raw) To: Vivek Goyal Cc: Wu Fengguang, Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Thu, Jan 26, 2012 at 11:25:56AM -0500, Vivek Goyal wrote: > On Wed, Jan 25, 2012 at 02:35:52PM +0800, Wu Fengguang wrote: > > > It would also be > > > possible to trigger on the size of the device so that the 32MB USB stick > > > doesn't sit busy for a minute with readahead that is useless. > > > > Yeah, I do have a patch for shrinking readahead size based on device size. > > Should it be a udev rule to change read_ahead_kb on device based on device > size, instead of a kernel patch? That's effectively what vendors like SGI have been doing since udev was first introduced, though more often the rules are based on device type rather than size. e.g. a 64GB device might be a USB flash drive now, but a 40GB device might be a really fast SSD.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 76+ messages in thread
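To make the udev-rule idea concrete, a hypothetical helper along these lines could be run from a rule (e.g. via RUN+=) to scale read_ahead_kb with device size; the thresholds below are illustrative only, not a recommendation, and as Dave points out device type is often a better key than raw size.

/*
 * Hypothetical policy helper a udev rule could invoke: pick read_ahead_kb
 * from the device size. Thresholds are illustrative only.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
    char path[256];
    unsigned long long sectors;     /* /sys/block/<dev>/size is in 512-byte units */
    unsigned int ra_kb;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <blockdev name, e.g. sdb>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/sys/block/%s/size", argv[1]);
    f = fopen(path, "r");
    if (!f || fscanf(f, "%llu", &sectors) != 1) {
        perror(path);
        return 1;
    }
    fclose(f);

    if (sectors < 2ULL * 1024 * 1024)           /* < 1 GB: tiny USB stick etc. */
        ra_kb = 128;
    else if (sectors < 200ULL * 1024 * 1024)    /* < ~100 GB */
        ra_kb = 512;
    else
        ra_kb = 1024;

    snprintf(path, sizeof(path), "/sys/block/%s/queue/read_ahead_kb", argv[1]);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "%u\n", ra_kb);
    fclose(f);
    return 0;
}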
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 22:34 ` Dave Chinner @ 2012-01-27 3:27 ` Wu Fengguang 2012-01-27 5:25 ` Andreas Dilger 0 siblings, 1 reply; 76+ messages in thread From: Wu Fengguang @ 2012-01-27 3:27 UTC (permalink / raw) To: Dave Chinner Cc: Vivek Goyal, Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Fri, Jan 27, 2012 at 09:34:49AM +1100, Dave Chinner wrote: > On Thu, Jan 26, 2012 at 11:25:56AM -0500, Vivek Goyal wrote: > > On Wed, Jan 25, 2012 at 02:35:52PM +0800, Wu Fengguang wrote: > > > > It would also be > > > > possible to trigger on the size of the device so that the 32MB USB stick > > > > doesn't sit busy for a minute with readahead that is useless. > > > > > > Yeah, I do have a patch for shrinking readahead size based on device size. > > > > Should it be a udev rule to change read_ahead_kb on device based on device > > size, instead of a kernel patch? > > That's effectively what vendors like SGI have been doing since udev > was first introduced, though more often the rules are based on device > type rather than size. e.g. a 64GB device might be a USB flash drive > now, but a 40GB device might be a really fast SSD.... Fair enough. I'll drop this kernel policy patch block: limit default readahead size for small devices https://lkml.org/lkml/2011/12/19/89 Thanks, Fengguang ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-27 3:27 ` Wu Fengguang @ 2012-01-27 5:25 ` Andreas Dilger 2012-01-27 7:53 ` Wu Fengguang 0 siblings, 1 reply; 76+ messages in thread From: Andreas Dilger @ 2012-01-27 5:25 UTC (permalink / raw) To: Wu Fengguang Cc: Dave Chinner, Vivek Goyal, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On 2012-01-26, at 8:27 PM, Wu Fengguang wrote: > On Fri, Jan 27, 2012 at 09:34:49AM +1100, Dave Chinner wrote: >> On Thu, Jan 26, 2012 at 11:25:56AM -0500, Vivek Goyal wrote: >>> On Wed, Jan 25, 2012 at 02:35:52PM +0800, Wu Fengguang wrote: >>>>> It would also be >>>>> possible to trigger on the size of the device so that the 32MB USB stick >>>>> doesn't sit busy for a minute with readahead that is useless. >>>> >>>> Yeah, I do have a patch for shrinking readahead size based on device size. >>> >>> Should it be a udev rule to change read_ahead_kb on device based on device >>> size, instead of a kernel patch? >> >> That's effectively what vendors like SGI have been doing since udev >> was first introduced, though more often the rules are based on device >> type rather than size. e.g. a 64GB device might be a USB flash drive >> now, but a 40GB device might be a really fast SSD.... > > Fair enough. I'll drop this kernel policy patch > > block: limit default readahead size for small devices > https://lkml.org/lkml/2011/12/19/89 Fengguang, Doesn't the kernel derive at least some idea of the speed of a device due to the writeback changes that you made? It would be very useful if we could derive at least some rough metric for the device performance in the kernel and use that as input to the readahead window size as well. Cheers, Andreas ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-27 5:25 ` Andreas Dilger @ 2012-01-27 7:53 ` Wu Fengguang 0 siblings, 0 replies; 76+ messages in thread From: Wu Fengguang @ 2012-01-27 7:53 UTC (permalink / raw) To: Andreas Dilger Cc: Dave Chinner, Vivek Goyal, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Thu, Jan 26, 2012 at 10:25:33PM -0700, Andreas Dilger wrote: [snip] > Doesn't the kernel derive at least some idea of the speed of a device > due to the writeback changes that you made? It would be very useful > if we could derive at least some rough metric for the device performance > in the kernel and use that as input to the readahead window size as well. Yeah we now have bdi->write_bandwidth (exported as "BdiWriteBandwidth" in /debug/bdi/8:0/stats) for estimating the bdi write bandwidth. However the value is not reflecting the sequential throughput in some cases: 1) when doing random writes 2) when doing mixed reads+writes 3) when not enough IO have been issued 4) in the rare case, when writing to a small area repeatedly so that it's effectively writing to the internal disk buffer at high speed So there are still some challenges in getting a reliably usable runtime estimation. Thanks, Fengguang ^ permalink raw reply [flat|nested] 76+ messages in thread
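For reference, the estimate Fengguang mentions can be inspected directly; below is a trivial sketch that prints the BdiWriteBandwidth line from the per-bdi stats file. It assumes debugfs is mounted at /sys/kernel/debug (the mail quotes the older /debug mount point), and the 8:0 device number is just an example.

/* Print the estimated write bandwidth line for one backing device. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *path = "/sys/kernel/debug/bdi/8:0/stats";  /* example bdi */
    char line[256];
    FILE *f = fopen(path, "r");

    if (!f) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        if (strstr(line, "BdiWriteBandwidth"))
            fputs(line, stdout);
    fclose(f);
    return 0;
}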
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 6:15 ` [Lsf-pc] " Andreas Dilger 2012-01-25 6:35 ` [Lsf-pc] [dm-devel] " Wu Fengguang @ 2012-01-25 14:33 ` Steven Whitehouse 2012-01-25 14:45 ` Jan Kara 2012-01-25 16:22 ` Loke, Chetan 1 sibling, 2 replies; 76+ messages in thread From: Steven Whitehouse @ 2012-01-25 14:33 UTC (permalink / raw) To: Andreas Dilger Cc: Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, Christoph Hellwig, dm-devel@redhat.com, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong Hi, On Tue, 2012-01-24 at 23:15 -0700, Andreas Dilger wrote: > On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you > >>>> from raising it (e.g. SLES uses 1 MB as a default). > >>> > >>> For some reason, I thought it had been bumped to 512KB by default. Must > >>> be that overactive imagination I have... Anyway, if all of the distros > >>> start bumping the default, don't you think it's time to consider bumping > >>> it upstream, too? I thought there was a lot of work put into not being > >>> too aggressive on readahead, so the downside of having a larger > >>> read_ahead_kb setting was fairly small. > >> > >> Yeah, I believe 512KB should be pretty safe these days except for > >> embedded world. OTOH average desktop user doesn't really care so it's > >> mostly servers with beefy storage that care... (note that I wrote we raised > >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > >> distro)). > > > > Maybe we don't need to care much about the embedded world when raising > > the default readahead size? Because even the current 128KB is too much > > for them, and I see Android setting the readahead size to 4KB... > > > > Some time ago I posted a series for raising the default readahead size > > to 512KB. But I'm open to use 1MB now (shall we vote on it?). > > I'm all in favour of 1MB (aligned) readahead. I think the embedded folks > already set enough CONFIG opts that we could trigger on one of those > (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. It would also be > possible to trigger on the size of the device so that the 32MB USB stick > doesn't sit busy for a minute with readahead that is useless. > > Cheers, Andreas > If the reason for not setting a larger readahead value is just that it might increase memory pressure and thus decrease performance, is it possible to use a suitable metric from the VM in order to set the value automatically according to circumstances? Steve. ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 14:33 ` Steven Whitehouse @ 2012-01-25 14:45 ` Jan Kara 2012-01-25 16:22 ` Loke, Chetan 1 sibling, 0 replies; 76+ messages in thread From: Jan Kara @ 2012-01-25 14:45 UTC (permalink / raw) To: Steven Whitehouse Cc: Andreas Dilger, Andrea Arcangeli, Jan Kara, linux-scsi@vger.kernel.org, Mike Snitzer, neilb@suse.de, dm-devel@redhat.com, Christoph Hellwig, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Chris Mason, Darrick J.Wong On Wed 25-01-12 14:33:54, Steven Whitehouse wrote: > On Tue, 2012-01-24 at 23:15 -0700, Andreas Dilger wrote: > > On 2012-01-24, at 8:29 PM, Wu Fengguang wrote: > > > On Tue, Jan 24, 2012 at 09:39:36PM +0100, Jan Kara wrote: > > >> On Tue 24-01-12 15:13:40, Jeff Moyer wrote: > > >>>> Maybe 128 KB is a too small default these days but OTOH noone prevents you > > >>>> from raising it (e.g. SLES uses 1 MB as a default). > > >>> > > >>> For some reason, I thought it had been bumped to 512KB by default. Must > > >>> be that overactive imagination I have... Anyway, if all of the distros > > >>> start bumping the default, don't you think it's time to consider bumping > > >>> it upstream, too? I thought there was a lot of work put into not being > > >>> too aggressive on readahead, so the downside of having a larger > > >>> read_ahead_kb setting was fairly small. > > >> > > >> Yeah, I believe 512KB should be pretty safe these days except for > > >> embedded world. OTOH average desktop user doesn't really care so it's > > >> mostly servers with beefy storage that care... (note that I wrote we raised > > >> the read_ahead_kb for SLES but not for openSUSE or SLED (desktop enterprise > > >> distro)). > > > > > > Maybe we don't need to care much about the embedded world when raising > > > the default readahead size? Because even the current 128KB is too much > > > for them, and I see Android setting the readahead size to 4KB... > > > > > > Some time ago I posted a series for raising the default readahead size > > > to 512KB. But I'm open to use 1MB now (shall we vote on it?). > > > > I'm all in favour of 1MB (aligned) readahead. I think the embedded folks > > already set enough CONFIG opts that we could trigger on one of those > > (e.g. CONFIG_EMBEDDED) to avoid stepping on their toes. It would also be > > possible to trigger on the size of the device so that the 32MB USB stick > > doesn't sit busy for a minute with readahead that is useless. > > > > Cheers, Andreas > > > > If the reason for not setting a larger readahead value is just that it > might increase memory pressure and thus decrease performance, is it > possible to use a suitable metric from the VM in order to set the value > automatically according to circumstances? In theory yes. In practice - do you have such heuristic ;)? There are lot of factors and it's hard to quantify how increased cache pressure influences performance of a particular workload. We could introduce some adaptive logic but so far fixed upperbound worked OK. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 14:33 ` Steven Whitehouse 2012-01-25 14:45 ` Jan Kara @ 2012-01-25 16:22 ` Loke, Chetan 2012-01-25 16:40 ` Steven Whitehouse 1 sibling, 1 reply; 76+ messages in thread From: Loke, Chetan @ 2012-01-25 16:22 UTC (permalink / raw) To: Steven Whitehouse, Andreas Dilger Cc: Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong > If the reason for not setting a larger readahead value is just that it > might increase memory pressure and thus decrease performance, is it > possible to use a suitable metric from the VM in order to set the value > automatically according to circumstances? > How about tracking heuristics for 'read-hits from previous read-aheads'? If the hits are in an acceptable range (user-configurable knob?) then keep reading ahead, else back off a little on the read-ahead? > Steve. Chetan Loke ^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 16:22 ` Loke, Chetan @ 2012-01-25 16:40 ` Steven Whitehouse 2012-01-25 17:08 ` Loke, Chetan ` (2 more replies) 0 siblings, 3 replies; 76+ messages in thread From: Steven Whitehouse @ 2012-01-25 16:40 UTC (permalink / raw) To: Loke, Chetan Cc: Andreas Dilger, Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong Hi, On Wed, 2012-01-25 at 11:22 -0500, Loke, Chetan wrote: > > If the reason for not setting a larger readahead value is just that it > > might increase memory pressure and thus decrease performance, is it > > possible to use a suitable metric from the VM in order to set the value > > automatically according to circumstances? > > > > How about tracking heuristics for 'read-hits from previous read-aheads'? If the hits are in acceptable range(user-configurable knob?) then keep seeking else back-off a little on the read-ahead? > > > Steve. > > Chetan Loke I'd been wondering about something similar to that. The basic scheme would be: - Set a page flag when readahead is performed - Clear the flag when the page is read (or on page fault for mmap) (i.e. when it is first used after readahead) Then when the VM scans for pages to eject from cache, check the flag and keep an exponential average (probably on a per-cpu basis) of the rate at which such flagged pages are ejected. That number can then be used to reduce the max readahead value. The questions are whether this would provide a fast enough reduction in readahead size to avoid problems? and whether the extra complication is worth it compared with using an overall metric for memory pressure? There may well be better solutions though, Steve. ^ permalink raw reply [flat|nested] 76+ messages in thread
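To make Steven's proposal a bit more concrete, here is a small user-space toy model of the bookkeeping; this is not kernel code and all names are invented. Evictions of still-flagged (readahead, never accessed) pages feed an exponential moving average, and that average scales the next readahead size down.

/* Toy model of "futile readahead" accounting; all names are invented. */
#include <stdio.h>

#define MAX_RA_PAGES 256    /* 1 MB window in 4 KB pages */
#define MIN_RA_PAGES 32     /* 128 KB floor */

static double futile_ewma;  /* fraction of evicted readahead pages never accessed */

/* called at eviction time; ra_unused is 1 if the page still had the flag set */
static void account_eviction(int ra_unused)
{
    const double weight = 0.125;    /* smoothing factor of 1/8 */

    futile_ewma = (1 - weight) * futile_ewma + weight * ra_unused;
}

/* next readahead size, shrunk in proportion to how much readahead was wasted */
static int next_ra_pages(void)
{
    int ra = (int)(MAX_RA_PAGES * (1.0 - futile_ewma));

    return ra < MIN_RA_PAGES ? MIN_RA_PAGES : ra;
}

int main(void)
{
    int i;

    /* pretend 3 out of every 4 evicted readahead pages were never used */
    for (i = 0; i < 100; i++)
        account_eviction(i % 4 != 0);
    printf("futile fraction %.2f -> next readahead %d pages\n",
           futile_ewma, next_ra_pages());
    return 0;
}

Whether such feedback reacts quickly enough, and whether per-CPU averaging is the right granularity, are exactly the open questions raised above; the sketch only shows the shape of the accounting.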
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 16:40 ` Steven Whitehouse @ 2012-01-25 17:08 ` Loke, Chetan 2012-01-25 17:32 ` James Bottomley 2012-02-03 12:55 ` Wu Fengguang 2 siblings, 0 replies; 76+ messages in thread From: Loke, Chetan @ 2012-01-25 17:08 UTC (permalink / raw) To: Steven Whitehouse Cc: Andreas Dilger, Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong > > How about tracking heuristics for 'read-hits from previous read-aheads'? > > If the hits are in acceptable range(user-configurable knob?) then keep seeking else back-off a little on the read-ahead? > > > > I'd been wondering about something similar to that. The basic scheme > would be: > > - Set a page flag when readahead is performed > - Clear the flag when the page is read (or on page fault for mmap) > (i.e. when it is first used after readahead) > > Then when the VM scans for pages to eject from cache, check the flag > and keep an exponential average (probably on a per-cpu basis) of the rate > at which such flagged pages are ejected. That number can then be used to > reduce the max readahead value. > > The questions are whether this would provide a fast enough reduction in > readahead size to avoid problems? and whether the extra complication is > worth it compared with using an overall metric for memory pressure? > Steve - I'm not a VM guy so can't help much. But if we maintain a separate list of pages 'fetched with read-ahead' then we can use the flag you suggested above. So when memory pressure is triggered: a) Evict these pages (which still have the page-flag set) first as they were a pure opportunistic bet from our side. b) scale-down(or just temporarily disable?) on read-aheads till the pressure goes low. c) admission control - disable(?) read-aheads for new threads/processes that are created? Then enable once we are ok? > There may well be better solutions though, Quite possible. But we need to start somewhere with the adaptive logic otherwise we will just keep on increasing(second guessing?) the upper bound and assuming that's what applications want. Increasing it to MB[s] may not be attractive for desktop users. If we raise it to MB[s] then desktop distro's might scale it down to KB[s].Exactly opposite of what enterprise distro's could be doing today. > Steve. > Chetan Loke ^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 16:40 ` Steven Whitehouse 2012-01-25 17:08 ` Loke, Chetan @ 2012-01-25 17:32 ` James Bottomley 2012-01-25 18:28 ` Loke, Chetan 2012-02-03 12:55 ` Wu Fengguang 2 siblings, 1 reply; 76+ messages in thread From: James Bottomley @ 2012-01-25 17:32 UTC (permalink / raw) To: Steven Whitehouse Cc: Loke, Chetan, Andreas Dilger, Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong, linux-mm On Wed, 2012-01-25 at 16:40 +0000, Steven Whitehouse wrote: > Hi, > > On Wed, 2012-01-25 at 11:22 -0500, Loke, Chetan wrote: > > > If the reason for not setting a larger readahead value is just that it > > > might increase memory pressure and thus decrease performance, is it > > > possible to use a suitable metric from the VM in order to set the value > > > automatically according to circumstances? > > > > > > > How about tracking heuristics for 'read-hits from previous read-aheads'? If the hits are in acceptable range(user-configurable knob?) then keep seeking else back-off a little on the read-ahead? > > > > > Steve. > > > > Chetan Loke > > I'd been wondering about something similar to that. The basic scheme > would be: > > - Set a page flag when readahead is performed > - Clear the flag when the page is read (or on page fault for mmap) > (i.e. when it is first used after readahead) > > Then when the VM scans for pages to eject from cache, check the flag and > keep an exponential average (probably on a per-cpu basis) of the rate at > which such flagged pages are ejected. That number can then be used to > reduce the max readahead value. > > The questions are whether this would provide a fast enough reduction in > readahead size to avoid problems? and whether the extra complication is > worth it compared with using an overall metric for memory pressure? > > There may well be better solutions though, So there are two separate problems mentioned here. The first is to ensure that readahead (RA) pages are treated as more disposable than accessed pages under memory pressure and then to derive a statistic for futile RA (those pages that were read in but never accessed). The first sounds really like its an LRU thing rather than adding yet another page flag. We need a position in the LRU list for never accessed ... that way they're first to be evicted as memory pressure rises. The second is you can derive this futile readahead statistic from the LRU position of unaccessed pages ... you could keep this globally. Now the problem: if you trash all unaccessed RA pages first, you end up with the situation of say playing a movie under moderate memory pressure that we do RA, then trash the RA page then have to re-read to display to the user resulting in an undesirable uptick in read I/O. Based on the above, it sounds like a better heuristic would be to evict accessed clean pages at the top of the LRU list before unaccessed clean pages because the expectation is that the unaccessed clean pages will be accessed (that's after all, why we did the readahead). As RA pages age in the LRU list, they become candidates for being futile, since they've been in memory for a while and no-one has accessed them, leading to the conclusion that they aren't ever going to be read. So I think futility is a measure of unaccessed aging, not necessarily of ejection (which is a memory pressure response). 
James ^ permalink raw reply [flat|nested] 76+ messages in thread
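A toy rendering of the distinction James draws above, using an invented page descriptor: futility is a property of how long an unaccessed readahead page has aged on the LRU, not of whether it happened to be ejected under memory pressure.

#include <stdbool.h>
#include <stdint.h>

/* Toy page descriptor; field names are made up for this sketch. */
struct toy_page {
        bool     readahead;     /* brought in by readahead */
        bool     accessed;      /* touched since it was read in */
        uint64_t lru_age;       /* how long it has sat on the LRU */
};

/*
 * An RA page only counts as futile once it has aged past some threshold
 * without ever being used; eviction order still favours keeping young
 * unaccessed RA pages, since they are expected to be read soon.
 */
static bool ra_page_is_futile(const struct toy_page *p, uint64_t age_threshold)
{
        return p->readahead && !p->accessed && p->lru_age > age_threshold;
}

int main(void)
{
        struct toy_page p = { .readahead = true, .accessed = false, .lru_age = 900 };

        return ra_page_is_futile(&p, 500) ? 0 : 1;
}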
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 17:32 ` James Bottomley @ 2012-01-25 18:28 ` Loke, Chetan 2012-01-25 18:37 ` Loke, Chetan ` (2 more replies) 0 siblings, 3 replies; 76+ messages in thread From: Loke, Chetan @ 2012-01-25 18:28 UTC (permalink / raw) To: James Bottomley, Steven Whitehouse Cc: Andreas Dilger, Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong, linux-mm > So there are two separate problems mentioned here. The first is to > ensure that readahead (RA) pages are treated as more disposable than > accessed pages under memory pressure and then to derive a statistic for > futile RA (those pages that were read in but never accessed). > > The first sounds really like its an LRU thing rather than adding yet > another page flag. We need a position in the LRU list for never > accessed ... that way they're first to be evicted as memory pressure > rises. > > The second is you can derive this futile readahead statistic from the > LRU position of unaccessed pages ... you could keep this globally. > > Now the problem: if you trash all unaccessed RA pages first, you end up > with the situation of say playing a movie under moderate memory > pressure that we do RA, then trash the RA page then have to re-read to display > to the user resulting in an undesirable uptick in read I/O. > > Based on the above, it sounds like a better heuristic would be to evict > accessed clean pages at the top of the LRU list before unaccessed clean > pages because the expectation is that the unaccessed clean pages will > be accessed (that's after all, why we did the readahead). As RA pages age Well, the movie example is one case where evicting unaccessed page may not be the right thing to do. But what about a workload that perform a random one-shot search? The search was done and the RA'd blocks are of no use anymore. So it seems one solution would hurt another. We can try to bring-in process run-time heuristics while evicting pages. So in the one-shot search case, the application did it's thing and went to sleep. While the movie-app has a pretty good run-time and is still running. So be a little gentle(?) on such apps? Selective eviction? In addition what if we do something like this: RA block[X], RA block[X+1], ... , RA block[X+m] Assume a block reads 'N' pages. Evict unaccessed RA page 'a' from block[X+2] and not [X+1]. We might need tracking at the RA-block level. This way if a movie touched RA-page 'a' from block[X], it would at least have [X+1] in cache. And while [X+1] is being read, the new slow-down version of RA will not RA that many blocks. Also, application's should use xxx_fadvise calls to give us hints... > James Chetan Loke ^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 18:28 ` Loke, Chetan @ 2012-01-25 18:37 ` Loke, Chetan 2012-01-25 18:37 ` James Bottomley 2012-01-25 18:44 ` Boaz Harrosh 2 siblings, 0 replies; 76+ messages in thread From: Loke, Chetan @ 2012-01-25 18:37 UTC (permalink / raw) To: James Bottomley, Steven Whitehouse Cc: Andreas Dilger, Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong, linux-mm > > > So there are two separate problems mentioned here. The first is to > > ensure that readahead (RA) pages are treated as more disposable than > > accessed pages under memory pressure and then to derive a statistic for > > futile RA (those pages that were read in but never accessed). > > > > The first sounds really like its an LRU thing rather than adding yet > > another page flag. We need a position in the LRU list for never > > accessed ... that way they're first to be evicted as memory pressure > > rises. > > > > The second is you can derive this futile readahead statistic from the > > LRU position of unaccessed pages ... you could keep this globally. > > > > Now the problem: if you trash all unaccessed RA pages first, you end up > > with the situation of say playing a movie under moderate memory > > pressure that we do RA, then trash the RA page then have to re-read to display > > to the user resulting in an undesirable uptick in read I/O. > > James - now that I'm thinking about it. I think the movie should be fine because when we calculate the read-hit from RA'd pages, the movie RA blocks will get a good hit-ratio and hence it's RA'd blocks won't be touched. But then we might need to track the hit-ratio at the RA-block(?) level. Chetan ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 18:28 ` Loke, Chetan 2012-01-25 18:37 ` Loke, Chetan @ 2012-01-25 18:37 ` James Bottomley 2012-01-25 20:06 ` Chris Mason 2012-01-26 16:17 ` Loke, Chetan 2012-01-25 18:44 ` Boaz Harrosh 2 siblings, 2 replies; 76+ messages in thread From: James Bottomley @ 2012-01-25 18:37 UTC (permalink / raw) To: Loke, Chetan Cc: Steven Whitehouse, Andreas Dilger, Andrea Arcangeli, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong On Wed, 2012-01-25 at 13:28 -0500, Loke, Chetan wrote: > > So there are two separate problems mentioned here. The first is to > > ensure that readahead (RA) pages are treated as more disposable than > > accessed pages under memory pressure and then to derive a statistic for > > futile RA (those pages that were read in but never accessed). > > > > The first sounds really like its an LRU thing rather than adding yet > > another page flag. We need a position in the LRU list for never > > accessed ... that way they're first to be evicted as memory pressure > > rises. > > > > The second is you can derive this futile readahead statistic from the > > LRU position of unaccessed pages ... you could keep this globally. > > > > Now the problem: if you trash all unaccessed RA pages first, you end up > > with the situation of say playing a movie under moderate memory > > pressure that we do RA, then trash the RA page then have to re-read to display > > to the user resulting in an undesirable uptick in read I/O. > > > > Based on the above, it sounds like a better heuristic would be to evict > > accessed clean pages at the top of the LRU list before unaccessed clean > > pages because the expectation is that the unaccessed clean pages will > > be accessed (that's after all, why we did the readahead). As RA pages age > > Well, the movie example is one case where evicting unaccessed page may not be the right thing to do. But what about a workload that perform a random one-shot search? > The search was done and the RA'd blocks are of no use anymore. So it seems one solution would hurt another. Well not really: RA is always wrong for random reads. The whole purpose of RA is assumption of sequential access patterns. The point I'm making is that for the case where RA works (sequential patterns), evicting unaccessed RA pages before accessed ones is the wrong thing to do, so the heuristic isn't what you first thought of (evicting unaccessed RA pages first). For the random read case, either heuristic is wrong, so it doesn't matter. However, when you add the futility measure, random read processes will end up with aged unaccessed RA pages, so its RA window will get closed. > We can try to bring-in process run-time heuristics while evicting pages. So in the one-shot search case, the application did it's thing and went to sleep. > While the movie-app has a pretty good run-time and is still running. So be a little gentle(?) on such apps? Selective eviction? > > In addition what if we do something like this: > > RA block[X], RA block[X+1], ... , RA block[X+m] > > Assume a block reads 'N' pages. > > Evict unaccessed RA page 'a' from block[X+2] and not [X+1]. > > We might need tracking at the RA-block level. This way if a movie touched RA-page 'a' from block[X], it would at least have [X+1] in cache. And while [X+1] is being read, the new slow-down version of RA will not RA that many blocks. 
> > Also, application's should use xxx_fadvise calls to give us hints... I think that's a bit over complex. As long as the futility measure works, a sequential pattern read process gets a reasonable RA window. The trick is to prove that the simple doesn't work before considering the complex. James -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 18:37 ` James Bottomley @ 2012-01-25 20:06 ` Chris Mason 2012-01-25 22:46 ` Andrea Arcangeli 2012-01-26 22:38 ` Dave Chinner 2012-01-26 16:17 ` Loke, Chetan 1 sibling, 2 replies; 76+ messages in thread From: Chris Mason @ 2012-01-25 20:06 UTC (permalink / raw) To: James Bottomley Cc: Loke, Chetan, Steven Whitehouse, Andreas Dilger, Andrea Arcangeli, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong On Wed, Jan 25, 2012 at 12:37:48PM -0600, James Bottomley wrote: > On Wed, 2012-01-25 at 13:28 -0500, Loke, Chetan wrote: > > > So there are two separate problems mentioned here. The first is to > > > ensure that readahead (RA) pages are treated as more disposable than > > > accessed pages under memory pressure and then to derive a statistic for > > > futile RA (those pages that were read in but never accessed). > > > > > > The first sounds really like its an LRU thing rather than adding yet > > > another page flag. We need a position in the LRU list for never > > > accessed ... that way they're first to be evicted as memory pressure > > > rises. > > > > > > The second is you can derive this futile readahead statistic from the > > > LRU position of unaccessed pages ... you could keep this globally. > > > > > > Now the problem: if you trash all unaccessed RA pages first, you end up > > > with the situation of say playing a movie under moderate memory > > > pressure that we do RA, then trash the RA page then have to re-read to display > > > to the user resulting in an undesirable uptick in read I/O. > > > > > > Based on the above, it sounds like a better heuristic would be to evict > > > accessed clean pages at the top of the LRU list before unaccessed clean > > > pages because the expectation is that the unaccessed clean pages will > > > be accessed (that's after all, why we did the readahead). As RA pages age > > > > Well, the movie example is one case where evicting unaccessed page may not be the right thing to do. But what about a workload that perform a random one-shot search? > > The search was done and the RA'd blocks are of no use anymore. So it seems one solution would hurt another. > > Well not really: RA is always wrong for random reads. The whole purpose > of RA is assumption of sequential access patterns. Just to jump back, Jeff's benchmark that started this (on xfs and ext4): - buffered 1MB reads get down to the scheduler in 128KB chunks The really hard part about readahead is that you don't know what userland wants. In Jeff's test, he's telling the kernel he wants 1MB ios and our RA engine is doing 128KB ios. We can talk about scaling up how big the RA windows get on their own, but if userland asks for 1MB, we don't have to worry about futile RA, we just have to make sure we don't oom the box trying to honor 1MB reads from 5000 different procs. -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 20:06 ` Chris Mason @ 2012-01-25 22:46 ` Andrea Arcangeli 2012-01-25 22:58 ` Jan Kara ` (2 more replies) 2012-01-26 22:38 ` Dave Chinner 1 sibling, 3 replies; 76+ messages in thread From: Andrea Arcangeli @ 2012-01-25 22:46 UTC (permalink / raw) To: Chris Mason, James Bottomley, Loke, Chetan, Steven Whitehouse, Andreas Dilger, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong On Wed, Jan 25, 2012 at 03:06:13PM -0500, Chris Mason wrote: > We can talk about scaling up how big the RA windows get on their own, > but if userland asks for 1MB, we don't have to worry about futile RA, we > just have to make sure we don't oom the box trying to honor 1MB reads > from 5000 different procs. :) that's for sure if read has a 1M buffer as destination. However even cp /dev/sda reads/writes through a 32kb buffer, so it's not so common to read in 1m buffers. But I also would prefer to stay on the simple side (on a side note we run out of page flags already on 32bit I think as I had to nuke PG_buddy already). Overall I think the risk of the pages being evicted before they can be copied to userland is quite a minor risk. A 16G system with 100 readers all hitting on disk at the same time using 100M readahead would still only create a 100m memory pressure... So it'd sure be ok, 100m is less than what kswapd keeps always free for example. Think a 4TB system. Especially if 128k fixed has been ok so far on a 1G system. If we really want to be more dynamic than a setting at boot depending on ram size, we could limit it to a fraction of freeable memory (using similar math to determine_dirtyable_memory, maybe calling it over time but not too frequently to reduce the overhead). Like if there's 0 memory freeable keep it low. If there's 1G freeable out of that math (and we assume the readahead hit rate is near 100%), raise the maximum readahead to 1M even if the total ram is only 1G. So we allow up to 1000 readers before we even recycle the readahead. I doubt the complexity of tracking exactly how many pages are getting recycled before they're copied to userland would be worth it, besides it'd be 0% for 99% of systems and workloads. Way more important is to have feedback on the readahead hits and be sure when readahead is raised to the maximum the hit rate is near 100% and fallback to lower readaheads if we don't get that hit rate. But that's not a VM problem and it's a readahead issue only. The actual VM pressure side of it, sounds minor issue if the hit rate of the readahead cache is close to 100%. The config option is also ok with me, but I think it'd be nicer to set it at boot depending on ram size (one less option to configure manually and zero overhead). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
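A standalone sketch of the sizing Andrea suggests, assuming the amount of freeable memory is already known (in the real kernel it would come from something in the spirit of determine_dirtyable_memory()); the divisor and the clamps are made-up numbers chosen only to mirror the "1000 readers" reasoning above.

#include <stdio.h>

static unsigned long ra_max_bytes(unsigned long freeable_bytes)
{
        unsigned long ra = freeable_bytes / 1000;   /* allow ~1000 readers */

        if (ra < 128 * 1024)
                ra = 128 * 1024;                    /* current default floor */
        if (ra > 1024 * 1024)
                ra = 1024 * 1024;                   /* 1MB ceiling */
        return ra;
}

int main(void)
{
        printf("1G freeable  -> %lu KB max RA\n", ra_max_bytes(1UL << 30) >> 10);
        printf("64M freeable -> %lu KB max RA\n", ra_max_bytes(64UL << 20) >> 10);
        return 0;
}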
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 22:46 ` Andrea Arcangeli @ 2012-01-25 22:58 ` Jan Kara 2012-01-26 8:59 ` Boaz Harrosh 2012-01-26 16:40 ` Loke, Chetan 2 siblings, 0 replies; 76+ messages in thread From: Jan Kara @ 2012-01-25 22:58 UTC (permalink / raw) To: Andrea Arcangeli Cc: Chris Mason, James Bottomley, Loke, Chetan, Steven Whitehouse, Andreas Dilger, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong On Wed 25-01-12 23:46:14, Andrea Arcangeli wrote: > On Wed, Jan 25, 2012 at 03:06:13PM -0500, Chris Mason wrote: > > We can talk about scaling up how big the RA windows get on their own, > > but if userland asks for 1MB, we don't have to worry about futile RA, we > > just have to make sure we don't oom the box trying to honor 1MB reads > > from 5000 different procs. > > :) that's for sure if read has a 1M buffer as destination. However > even cp /dev/sda reads/writes through a 32kb buffer, so it's not so > common to read in 1m buffers. > > But I also would prefer to stay on the simple side (on a side note we > run out of page flags already on 32bit I think as I had to nuke > PG_buddy already). > > Overall I think the risk of the pages being evicted before they can be > copied to userland is quite a minor risk. A 16G system with 100 > readers all hitting on disk at the same time using 100M readahead > would still only create a 100m memory pressure... So it'd sure be ok, > 100m is less than what kswapd keeps always free for example. Think a > 4TB system. Especially if 128k fixed has been ok so far on a 1G system. > > If we really want to be more dynamic than a setting at boot depending > on ram size, we could limit it to a fraction of freeable memory (using > similar math to determine_dirtyable_memory, maybe calling it over time > but not too frequently to reduce the overhead). Like if there's 0 > memory freeable keep it low. If there's 1G freeable out of that math > (and we assume the readahead hit rate is near 100%), raise the maximum > readahead to 1M even if the total ram is only 1G. So we allow up to > 1000 readers before we even recycle the readahead. > > I doubt the complexity of tracking exactly how many pages are getting > recycled before they're copied to userland would be worth it, besides > it'd be 0% for 99% of systems and workloads. > > Way more important is to have feedback on the readahead hits and be > sure when readahead is raised to the maximum the hit rate is near 100% > and fallback to lower readaheads if we don't get that hit rate. But > that's not a VM problem and it's a readahead issue only. > > The actual VM pressure side of it, sounds minor issue if the hit rate > of the readahead cache is close to 100%. > > The config option is also ok with me, but I think it'd be nicer to set > it at boot depending on ram size (one less option to configure > manually and zero overhead). Yeah. I'd also keep it simple. Tuning max readahead size based on available memory (and device size) once in a while is about the maximum complexity I'd consider meaningful. If you have real data that shows problems which are not solved by that simple strategy, then sure, we can speak about more complex algorithms. But currently I don't think they are needed. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 22:46 ` Andrea Arcangeli 2012-01-25 22:58 ` Jan Kara @ 2012-01-26 8:59 ` Boaz Harrosh 2012-01-26 16:40 ` Loke, Chetan 2 siblings, 0 replies; 76+ messages in thread From: Boaz Harrosh @ 2012-01-26 8:59 UTC (permalink / raw) To: Andrea Arcangeli Cc: Chris Mason, James Bottomley, Loke, Chetan, Steven Whitehouse, Andreas Dilger, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, linux-fsdevel, lsf-pc, Darrick J.Wong On 01/26/2012 12:46 AM, Andrea Arcangeli wrote: > On Wed, Jan 25, 2012 at 03:06:13PM -0500, Chris Mason wrote: >> We can talk about scaling up how big the RA windows get on their own, >> but if userland asks for 1MB, we don't have to worry about futile RA, we >> just have to make sure we don't oom the box trying to honor 1MB reads >> from 5000 different procs. > > :) that's for sure if read has a 1M buffer as destination. However > even cp /dev/sda reads/writes through a 32kb buffer, so it's not so > common to read in 1m buffers. > That's not so true. cp is a bad example because it's brain dead and someone should fix it. cp performance is terrible. Even KDE's GUI copy is better. But applications (and dd users) that do care about read performance do use large buffers and want the Kernel to not ignore that. What a better hint for Kernel is the read() destination buffer size. > But I also would prefer to stay on the simple side (on a side note we > run out of page flags already on 32bit I think as I had to nuke > PG_buddy already). > So what would be more simple then not ignoring read() request size from application, which will give applications all the control they need. <snip> (I Agree) > The config option is also ok with me, but I think it'd be nicer to set > it at boot depending on ram size (one less option to configure > manually and zero overhead). If you actually take into account the destination buffer size, you'll see that the read-ahead size becomes less important for these workloads that actually care. But Yes some mount time heuristics could be nice, depending on DEV size and MEM size. For example in my file-system with self registered BDI I set readhead sizes according to raid-strip sizes and such so to get good read performance. And speaking of reads and readhead. What about alignments? both of offset and length? though in reads it's not so important. One thing some people have ask for is raid-verify-reads as a mount option. Thanks Boaz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
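A sketch of the mount-time choice Boaz describes for a filesystem with a self-registered BDI: derive the readahead size from the RAID geometry rather than inheriting the global 128K default. Helper name and numbers are invented for illustration.

#include <stdio.h>

static unsigned long ra_from_geometry(unsigned long stripe_unit_bytes,
                                      unsigned int data_disks)
{
        unsigned long full_stripe = stripe_unit_bytes * data_disks;

        /* Read ahead a couple of full stripes so every spindle stays busy. */
        return 2 * full_stripe;
}

int main(void)
{
        /* e.g. 128K stripe unit across 6 data disks -> 1536 KB readahead */
        printf("ra = %lu KB\n", ra_from_geometry(128 * 1024, 6) >> 10);
        return 0;
}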
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 22:46 ` Andrea Arcangeli 2012-01-25 22:58 ` Jan Kara 2012-01-26 8:59 ` Boaz Harrosh @ 2012-01-26 16:40 ` Loke, Chetan 2012-01-26 17:00 ` Andreas Dilger 2012-02-03 12:37 ` Wu Fengguang 2 siblings, 2 replies; 76+ messages in thread From: Loke, Chetan @ 2012-01-26 16:40 UTC (permalink / raw) To: Andrea Arcangeli, Chris Mason, James Bottomley, Steven Whitehouse, Andreas Dilger, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong > From: Andrea Arcangeli [mailto:aarcange@redhat.com] > Sent: January 25, 2012 5:46 PM .... > Way more important is to have feedback on the readahead hits and be > sure when readahead is raised to the maximum the hit rate is near 100% > and fallback to lower readaheads if we don't get that hit rate. But > that's not a VM problem and it's a readahead issue only. > A quick google showed up - http://kerneltrap.org/node/6642 Interesting thread to follow. I haven't looked further as to what was merged and what wasn't. A quote from the patch - " It works by peeking into the file cache and check if there are any history pages present or accessed." Now I don't understand anything about this but I would think digging the file-cache isn't needed(?). So, yes, a simple RA hit-rate feedback could be fine. And 'maybe' for adaptive RA just increase the RA-blocks by '1'(or some N) over period of time. No more smartness. A simple 10 line function is easy to debug/maintain. That is, a scaled-down version of ramp-up/ramp-down. Don't go crazy by ramping-up/down after every RA(like SCSI LLDD madness). Wait for some event to happen. I can see where Andrew Morton's concerns could be(just my interpretation). We may not want to end up like a protocol state machine code: tcp slow-start, then increase , then congestion, then let's back-off. hmmm, slow-start is a problem for my business logic, so let's speed-up slow-start ;). Chetan -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 16:40 ` Loke, Chetan @ 2012-01-26 17:00 ` Andreas Dilger 2012-01-26 17:16 ` Loke, Chetan 2012-02-03 12:37 ` Wu Fengguang 1 sibling, 1 reply; 76+ messages in thread From: Andreas Dilger @ 2012-01-26 17:00 UTC (permalink / raw) To: Loke, Chetan Cc: Andrea Arcangeli, Chris Mason, James Bottomley, Steven Whitehouse, Andreas Dilger, Jan Kara, Mike Snitzer, <linux-scsi@vger.kernel.org>, <neilb@suse.de>, <dm-devel@redhat.com>, Christoph Hellwig, <linux-mm@kvack.org>, Jeff Moyer, Wu Fengguang, Boaz Harrosh, <linux-fsdevel@vger.kernel.org>, <lsf-pc@lists.linux-foundation.org>, Darrick J.Wong On 2012-01-26, at 9:40, "Loke, Chetan" <Chetan.Loke@netscout.com> wrote: > And 'maybe' for adaptive RA just increase the RA-blocks by '1'(or some > N) over period of time. No more smartness. A simple 10 line function is > easy to debug/maintain. That is, a scaled-down version of > ramp-up/ramp-down. Don't go crazy by ramping-up/down after every RA(like > SCSI LLDD madness). Wait for some event to happen. Doing 1-block readahead increments is a performance disaster on RAID-5/6. That means you seek all the disks, but use only a fraction of the data that the controller read internally and had to parity check. It makes more sense to keep the read units the same size as write units (1 MB or as dictated by RAID geometry) that the filesystem is also hopefully using for allocation. When doing a readahead it should fetch the whole chunk at one time, then not do another until it needs another full chunk. Cheers, Andreas ^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 17:00 ` Andreas Dilger @ 2012-01-26 17:16 ` Loke, Chetan 0 siblings, 0 replies; 76+ messages in thread From: Loke, Chetan @ 2012-01-26 17:16 UTC (permalink / raw) To: Andreas Dilger Cc: Andrea Arcangeli, Chris Mason, James Bottomley, Steven Whitehouse, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong > > And 'maybe' for adaptive RA just increase the RA-blocks by '1'(or some N) over period of time. No more smartness. A simple 10 line function is > > easy to debug/maintain. That is, a scaled-down version of ramp-up/ramp-down. Don't go crazy by ramping-up/down after every RA(like > > SCSI LLDD madness). Wait for some event to happen. > > Doing 1-block readahead increments is a performance disaster on RAID- > 5/6. That means you seek all the disks, but use only a fraction of the > data that the controller read internally and had to parity check. > > It makes more sense to keep the read units the same size as write units > (1 MB or as dictated by RAID geometry) that the filesystem is also > hopefully using for allocation. When doing a readahead it should fetch > the whole chunk at one time, then not do another until it needs another > full chunk. > I was using it loosely(don't confuse it with 1 block as in 4K :). RA could be tied to whatever appropriate parameters depending on the setup(underlying backing store) etc. But the point I'm trying to make is to (may be)keep the adaptive logic simple. So if you start with RA-chunk == 512KB/xMB, then when we increment it, do something like (RA-chunk << N). BTW, it's not just RAID but also different abstractions you might have. Stripe-width worth of RA is still useless if your LVM chunk is N * stripe-width. > Cheers, Andreas Chetan -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
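A toy ramp along the lines Chetan and Andreas are converging on: grow the window multiplicatively (the "RA-chunk << N" style above) but always round the result to whole RAID chunks, so each readahead I/O covers full stripe-sized units. The numbers are illustrative only, and the back-off half of the logic is left out.

#include <stdio.h>

static unsigned long grow_ra(unsigned long cur, unsigned long chunk)
{
        unsigned long next = cur ? cur * 2 : chunk;   /* multiplicative ramp */

        return (next + chunk - 1) / chunk * chunk;    /* round up to a chunk */
}

int main(void)
{
        unsigned long chunk = 512 * 1024, ra = 0;

        for (int i = 0; i < 4; i++) {
                ra = grow_ra(ra, chunk);
                printf("window: %lu KB\n", ra >> 10);
        }
        return 0;
}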
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-26 16:40 ` Loke, Chetan 2012-01-26 17:00 ` Andreas Dilger @ 2012-02-03 12:37 ` Wu Fengguang 1 sibling, 0 replies; 76+ messages in thread From: Wu Fengguang @ 2012-02-03 12:37 UTC (permalink / raw) To: Loke, Chetan Cc: Andrea Arcangeli, Chris Mason, James Bottomley, Steven Whitehouse, Andreas Dilger, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong, Dan Magenheimer On Thu, Jan 26, 2012 at 11:40:47AM -0500, Loke, Chetan wrote: > > From: Andrea Arcangeli [mailto:aarcange@redhat.com] > > Sent: January 25, 2012 5:46 PM > > .... > > > Way more important is to have feedback on the readahead hits and be > > sure when readahead is raised to the maximum the hit rate is near 100% > > and fallback to lower readaheads if we don't get that hit rate. But > > that's not a VM problem and it's a readahead issue only. > > > > A quick google showed up - http://kerneltrap.org/node/6642 > > Interesting thread to follow. I haven't looked further as to what was > merged and what wasn't. > > A quote from the patch - " It works by peeking into the file cache and > check if there are any history pages present or accessed." > Now I don't understand anything about this but I would think digging the > file-cache isn't needed(?). So, yes, a simple RA hit-rate feedback could > be fine. > > And 'maybe' for adaptive RA just increase the RA-blocks by '1'(or some > N) over period of time. No more smartness. A simple 10 line function is > easy to debug/maintain. That is, a scaled-down version of > ramp-up/ramp-down. Don't go crazy by ramping-up/down after every RA(like > SCSI LLDD madness). Wait for some event to happen. > > I can see where Andrew Morton's concerns could be(just my > interpretation). We may not want to end up like a protocol state machine > code: tcp slow-start, then increase , then congestion, then let's > back-off. hmmm, slow-start is a problem for my business logic, so let's > speed-up slow-start ;). Loke, Thrashing safe readahead can work as simple as: readahead_size = min(nr_history_pages, MAX_READAHEAD_PAGES) No need for more slow-start or back-off magics. This is because nr_history_pages is a lower estimation of the threshing threshold: chunk A chunk B chunk C head l01 l11 l12 l21 l22 | |-->|-->| |------>|-->| |------>| | +-------+ +-----------+ +-------------+ | | | # | | # | | # | | | +-------+ +-----------+ +-------------+ | | |<==============|<===========================|<============================| L0 L1 L2 Let f(l) = L be a map from l: the number of pages read by the stream to L: the number of pages pushed into inactive_list in the mean time then f(l01) <= L0 f(l11 + l12) = L1 f(l21 + l22) = L2 ... f(l01 + l11 + ...) <= Sum(L0 + L1 + ...) <= Length(inactive_list) = f(thrashing-threshold) So the count of continuous history pages left in inactive_list is always a lower estimation of the true thrashing-threshold. Given a stable workload, the readahead size will keep ramping up and then stabilize in range (thrashing_threshold/2, thrashing_threshold) Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 76+ messages in thread
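Fengguang's rule above reduces to a one-line clamp. The sketch below is a user-space illustration only, with nr_history_pages assumed to be supplied by the page-cache peek he describes and the maximum chosen arbitrarily.

#include <stdio.h>

#define MAX_READAHEAD_PAGES 256   /* 1MB of 4K pages, for illustration */

/*
 * The readahead size for a stream is capped by the number of its history
 * pages still sitting in the page cache, which is a lower estimation of
 * that stream's thrashing threshold.
 */
static unsigned long thrash_safe_ra(unsigned long nr_history_pages)
{
        return nr_history_pages < MAX_READAHEAD_PAGES ?
               nr_history_pages : MAX_READAHEAD_PAGES;
}

int main(void)
{
        printf("%lu\n", thrash_safe_ra(40));    /* thrashed stream: shrink */
        printf("%lu\n", thrash_safe_ra(4000));  /* plenty of room: full size */
        return 0;
}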
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 20:06 ` Chris Mason 2012-01-25 22:46 ` Andrea Arcangeli @ 2012-01-26 22:38 ` Dave Chinner 1 sibling, 0 replies; 76+ messages in thread From: Dave Chinner @ 2012-01-26 22:38 UTC (permalink / raw) To: Chris Mason, James Bottomley, Loke, Chetan, Steven Whitehouse, Andreas Dilger, Andrea Arcangeli, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Darrick J.Wong On Wed, Jan 25, 2012 at 03:06:13PM -0500, Chris Mason wrote: > On Wed, Jan 25, 2012 at 12:37:48PM -0600, James Bottomley wrote: > > On Wed, 2012-01-25 at 13:28 -0500, Loke, Chetan wrote: > > > > So there are two separate problems mentioned here. The first is to > > > > ensure that readahead (RA) pages are treated as more disposable than > > > > accessed pages under memory pressure and then to derive a statistic for > > > > futile RA (those pages that were read in but never accessed). > > > > > > > > The first sounds really like its an LRU thing rather than adding yet > > > > another page flag. We need a position in the LRU list for never > > > > accessed ... that way they're first to be evicted as memory pressure > > > > rises. > > > > > > > > The second is you can derive this futile readahead statistic from the > > > > LRU position of unaccessed pages ... you could keep this globally. > > > > > > > > Now the problem: if you trash all unaccessed RA pages first, you end up > > > > with the situation of say playing a movie under moderate memory > > > > pressure that we do RA, then trash the RA page then have to re-read to display > > > > to the user resulting in an undesirable uptick in read I/O. > > > > > > > > Based on the above, it sounds like a better heuristic would be to evict > > > > accessed clean pages at the top of the LRU list before unaccessed clean > > > > pages because the expectation is that the unaccessed clean pages will > > > > be accessed (that's after all, why we did the readahead). As RA pages age > > > > > > Well, the movie example is one case where evicting unaccessed page may not be the right thing to do. But what about a workload that perform a random one-shot search? > > > The search was done and the RA'd blocks are of no use anymore. So it seems one solution would hurt another. > > > > Well not really: RA is always wrong for random reads. The whole purpose > > of RA is assumption of sequential access patterns. > > Just to jump back, Jeff's benchmark that started this (on xfs and ext4): > > - buffered 1MB reads get down to the scheduler in 128KB chunks > > The really hard part about readahead is that you don't know what > userland wants. In Jeff's test, he's telling the kernel he wants 1MB > ios and our RA engine is doing 128KB ios. > > We can talk about scaling up how big the RA windows get on their own, > but if userland asks for 1MB, we don't have to worry about futile RA, we > just have to make sure we don't oom the box trying to honor 1MB reads > from 5000 different procs. Right - if we know the read request is larger than the RA window, then we should ignore the RA window and just service the request in a single bio. Well, at least, in chunks as large as the underlying device will allow us to build.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 76+ messages in thread
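A small sketch of the sizing rule Dave states, with invented parameter names: when the explicit read is already larger than the readahead window, size the I/O from the request and ignore the window, capped only by what the underlying device will accept.

#include <stdio.h>

static unsigned long read_io_size(unsigned long request_bytes,
                                  unsigned long ra_window_bytes,
                                  unsigned long max_hw_io_bytes)
{
        unsigned long want = request_bytes > ra_window_bytes ?
                             request_bytes : ra_window_bytes;

        return want < max_hw_io_bytes ? want : max_hw_io_bytes;
}

int main(void)
{
        /* 1MB read, 128K RA window, 512K max_sectors: two 512K I/Os result. */
        printf("%lu KB per I/O\n",
               read_io_size(1024 * 1024, 128 * 1024, 512 * 1024) >> 10);
        return 0;
}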
* RE: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 18:37 ` James Bottomley 2012-01-25 20:06 ` Chris Mason @ 2012-01-26 16:17 ` Loke, Chetan 1 sibling, 0 replies; 76+ messages in thread From: Loke, Chetan @ 2012-01-26 16:17 UTC (permalink / raw) To: James Bottomley Cc: Steven Whitehouse, Andreas Dilger, Andrea Arcangeli, Jan Kara, Mike Snitzer, linux-scsi, neilb, dm-devel, Christoph Hellwig, linux-mm, Jeff Moyer, Wu Fengguang, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong > > Well, the movie example is one case where evicting unaccessed page may not be the right thing to do. But what about a workload that perform a random one-shot search? > > The search was done and the RA'd blocks are of no use anymore. So it seems one solution would hurt another. > > Well not really: RA is always wrong for random reads. The whole purpose of RA is assumption of sequential access patterns. > James - I must agree that 'random' was not the proper choice of word here. What I meant was this - search-app reads enough data to trick the lazy/deferred-RA logic. RA thinks, oh well, this is now a sequential pattern and will RA. But all this search-app did was that it kept reading till it found what it was looking for. Once it was done, it went back to sleep waiting for the next query. Now all that RA data could be of total waste if the read-hit on the RA data-set was 'zero percent'. Some would argue that how would we(the kernel) know that the next query may not be close the earlier data-set? Well, we don't and we may not want to. That is why the application better know how to use XXX_advise calls. If they are not using it then well it's their problem. The app knows about the statistics/etc about the queries. What was used and what wasn't. > James Chetan Loke ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 18:28 ` Loke, Chetan 2012-01-25 18:37 ` Loke, Chetan 2012-01-25 18:37 ` James Bottomley @ 2012-01-25 18:44 ` Boaz Harrosh 2 siblings, 0 replies; 76+ messages in thread From: Boaz Harrosh @ 2012-01-25 18:44 UTC (permalink / raw) To: Loke, Chetan Cc: James Bottomley, Steven Whitehouse, Andreas Dilger, Wu Fengguang, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong, linux-mm On 01/25/2012 08:28 PM, Loke, Chetan wrote: >> So there are two separate problems mentioned here. The first is to >> ensure that readahead (RA) pages are treated as more disposable than >> accessed pages under memory pressure and then to derive a statistic for >> futile RA (those pages that were read in but never accessed). >> >> The first sounds really like its an LRU thing rather than adding yet >> another page flag. We need a position in the LRU list for never >> accessed ... that way they're first to be evicted as memory pressure >> rises. >> >> The second is you can derive this futile readahead statistic from the >> LRU position of unaccessed pages ... you could keep this globally. >> >> Now the problem: if you trash all unaccessed RA pages first, you end up >> with the situation of say playing a movie under moderate memory >> pressure that we do RA, then trash the RA page then have to re-read to display >> to the user resulting in an undesirable uptick in read I/O. >> >> Based on the above, it sounds like a better heuristic would be to evict >> accessed clean pages at the top of the LRU list before unaccessed clean >> pages because the expectation is that the unaccessed clean pages will >> be accessed (that's after all, why we did the readahead). As RA pages age > > Well, the movie example is one case where evicting unaccessed page > may not be the right thing to do. But what about a workload that > perform a random one-shot search? The search was done and the RA'd > blocks are of no use anymore. So it seems one solution would hurt > another. > I think there is a "seeky" flag the Kernel keeps to prevent read-ahead in the case of seeks. > We can try to bring-in process run-time heuristics while evicting > pages. So in the one-shot search case, the application did it's thing > and went to sleep. While the movie-app has a pretty good run-time and > is still running. So be a little gentle(?) on such apps? Selective > eviction? > > In addition what if we do something like this: > > RA block[X], RA block[X+1], ... , RA block[X+m] > > Assume a block reads 'N' pages. > > Evict unaccessed RA page 'a' from block[X+2] and not [X+1]. > > We might need tracking at the RA-block level. This way if a movie > touched RA-page 'a' from block[X], it would at least have [X+1] in > cache. And while [X+1] is being read, the new slow-down version of RA > will not RA that many blocks. > > Also, application's should use xxx_fadvise calls to give us hints... > Lets start by reading the number of pages requested by the read() call, first. The application is reading 4M and we still send 128K. Don't you think that would be fadvise enough? Lets start with the simple stuff. The only flag I see on read pages is that if it's read ahead pages that we Kernel initiated without an application request. Like beyond the read() call or a surrounding an mmap read that was not actually requested by the application. 
For generality we always initiate a read in the page fault and lose all the wonderful information the app gave us in the different read APIs. Let's start with that. > >> James > > Chetan Loke Boaz ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [dm-devel] [LSF/MM TOPIC] a few storage topics 2012-01-25 16:40 ` Steven Whitehouse 2012-01-25 17:08 ` Loke, Chetan 2012-01-25 17:32 ` James Bottomley @ 2012-02-03 12:55 ` Wu Fengguang 2 siblings, 0 replies; 76+ messages in thread From: Wu Fengguang @ 2012-02-03 12:55 UTC (permalink / raw) To: Steven Whitehouse Cc: Loke, Chetan, Andreas Dilger, Jan Kara, Jeff Moyer, Andrea Arcangeli, linux-scsi, Mike Snitzer, neilb, Christoph Hellwig, dm-devel, Boaz Harrosh, linux-fsdevel, lsf-pc, Chris Mason, Darrick J.Wong, Dan Magenheimer On Wed, Jan 25, 2012 at 04:40:23PM +0000, Steven Whitehouse wrote: > Hi, > > On Wed, 2012-01-25 at 11:22 -0500, Loke, Chetan wrote: > > > If the reason for not setting a larger readahead value is just that it > > > might increase memory pressure and thus decrease performance, is it > > > possible to use a suitable metric from the VM in order to set the value > > > automatically according to circumstances? > > > > > > > How about tracking heuristics for 'read-hits from previous read-aheads'? If the hits are in acceptable range(user-configurable knob?) then keep seeking else back-off a little on the read-ahead? > > > > > Steve. > > > > Chetan Loke > > I'd been wondering about something similar to that. The basic scheme > would be: > > - Set a page flag when readahead is performed > - Clear the flag when the page is read (or on page fault for mmap) > (i.e. when it is first used after readahead) > > Then when the VM scans for pages to eject from cache, check the flag and > keep an exponential average (probably on a per-cpu basis) of the rate at > which such flagged pages are ejected. That number can then be used to > reduce the max readahead value. > > The questions are whether this would provide a fast enough reduction in > readahead size to avoid problems? and whether the extra complication is > worth it compared with using an overall metric for memory pressure? > > There may well be better solutions though, The caveat is, on a consistently thrashed machine, the readahead size should better be determined for each read stream. Repeated readahead thrashing typically happen in a file server with large number of concurrent clients. For example, if there are 1000 read streams each doing 1MB readahead, since there are 2 readahead window for each stream, there could be up to 2GB readahead pages that will sure be thrashed in a server with only 1GB memory. Typically the 1000 clients will have different read speeds. A few of them will be doing 1MB/s, most others may be doing 100KB/s. In this case, we shall only decrease readahead size for the 100KB/s clients. The 1MB/s clients actually won't see readahead thrashing at all and we'll want them to do large 1MB I/O to achieve good disk utilization. So we need something better than the "global feedback" scheme, and we do have such a solution ;) As said in my other email, the number of history pages remained in the page cache is a good estimation of that particular read stream's thrashing safe readahead size. Thanks, Fengguang ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 18:40 ` Christoph Hellwig 2012-01-24 19:07 ` Chris Mason @ 2012-01-24 19:11 ` Jeff Moyer 1 sibling, 0 replies; 76+ messages in thread From: Jeff Moyer @ 2012-01-24 19:11 UTC (permalink / raw) To: Christoph Hellwig Cc: Andreas Dilger, Chris Mason, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi@vger.kernel.org, neilb@suse.de, dm-devel@redhat.com, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Darrick J.Wong Christoph Hellwig <hch@infradead.org> writes: > All three filesystems use the generic mpages code for reads, so they > all get the same (bad) I/O patterns. Looks like we need to fix this up > ASAP. Actually, in discussing this with Vivek, he mentioned that read ahead might be involved. Sure enough, after bumping read_ahead_kb, 1MB I/Os are sent down to the storage (for xfs anyway). Cheers, Jeff ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 18:05 ` [dm-devel] " Jeff Moyer 2012-01-24 18:40 ` Christoph Hellwig @ 2012-01-26 22:31 ` Dave Chinner 1 sibling, 0 replies; 76+ messages in thread From: Dave Chinner @ 2012-01-26 22:31 UTC (permalink / raw) To: Jeff Moyer Cc: Andreas Dilger, Christoph Hellwig, Chris Mason, Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi@vger.kernel.org, neilb@suse.de, dm-devel@redhat.com, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Darrick J.Wong On Tue, Jan 24, 2012 at 01:05:50PM -0500, Jeff Moyer wrote: > Andreas Dilger <adilger@dilger.ca> writes: > I've been wondering if it's gotten better, so decided to run a few quick > tests. > > kernel version 3.2.0, storage: hp eva fc array, i/o scheduler cfq, > max_sectors_kb: 1024, test program: dd > > ext3: > - buffered writes and buffered O_SYNC writes, all 1MB block size show 4k > I/Os passed down to the I/O scheduler > - buffered 1MB reads are a little better, typically in the 128k-256k > range when they hit the I/O scheduler. > > ext4: > - buffered writes: 512K I/Os show up at the elevator > - buffered O_SYNC writes: data is again 512KB, journal writes are 4K > - buffered 1MB reads get down to the scheduler in 128KB chunks > > xfs: > - buffered writes: 1MB I/Os show up at the elevator > - buffered O_SYNC writes: 1MB I/Os > - buffered 1MB reads: 128KB chunks show up at the I/O scheduler > > So, ext4 is doing better than ext3, but still not perfect. xfs is > kicking ass for writes, but reads are still split up. Isn't that simply because the default readahead is 128k? Change the readahead to be much larger, and you should see much larger IOs being issued.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 15:15 ` Chris Mason 2012-01-24 16:56 ` [dm-devel] " Christoph Hellwig @ 2012-01-24 17:12 ` Jeff Moyer 2012-01-24 17:32 ` Chris Mason 1 sibling, 1 reply; 76+ messages in thread From: Jeff Moyer @ 2012-01-24 17:12 UTC (permalink / raw) To: Chris Mason Cc: Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong Chris Mason <chris.mason@oracle.com> writes: > On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote: >> Andrea Arcangeli <aarcange@redhat.com> writes: >> >> > On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote: >> >> requst granularity. Sure, big requests will take longer to complete but >> >> maximum request size is relatively low (512k by default) so writing maximum >> >> sized request isn't that much slower than writing 4k. So it works OK in >> >> practice. >> > >> > Totally unrelated to the writeback, but the merged big 512k requests >> > actually adds up some measurable I/O scheduler latencies and they in >> > turn slightly diminish the fairness that cfq could provide with >> > smaller max request size. Probably even more measurable with SSDs (but >> > then SSDs are even faster). >> >> Are you speaking from experience? If so, what workloads were negatively >> affected by merging, and how did you measure that? > > https://lkml.org/lkml/2011/12/13/326 > > This patch is another example, although for a slight different reason. > I really have no idea yet what the right answer is in a generic sense, > but you don't need a 512K request to see higher latencies from merging. Well, this patch has almost nothing to with merging, right? It's about keeping I/O from the I/O scheduler for too long (or, prior to on-stack plugging, it was about keeping the queue plugged for too long). And, I'm pretty sure that the testing involved there was with deadline or noop, nothing to do with CFQ fairness. ;-) However, this does bring to light the bigger problem of optimizing for the underlying storage and the workload requirements. Some tuning can be done in the I/O scheduler, but the plugging definitely circumvents that a little bit. -Jeff ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 17:12 ` Jeff Moyer @ 2012-01-24 17:32 ` Chris Mason 2012-01-24 18:14 ` Jeff Moyer 0 siblings, 1 reply; 76+ messages in thread From: Chris Mason @ 2012-01-24 17:32 UTC (permalink / raw) To: Jeff Moyer Cc: Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong On Tue, Jan 24, 2012 at 12:12:30PM -0500, Jeff Moyer wrote: > Chris Mason <chris.mason@oracle.com> writes: > > > On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote: > >> Andrea Arcangeli <aarcange@redhat.com> writes: > >> > >> > On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote: > >> >> requst granularity. Sure, big requests will take longer to complete but > >> >> maximum request size is relatively low (512k by default) so writing maximum > >> >> sized request isn't that much slower than writing 4k. So it works OK in > >> >> practice. > >> > > >> > Totally unrelated to the writeback, but the merged big 512k requests > >> > actually adds up some measurable I/O scheduler latencies and they in > >> > turn slightly diminish the fairness that cfq could provide with > >> > smaller max request size. Probably even more measurable with SSDs (but > >> > then SSDs are even faster). > >> > >> Are you speaking from experience? If so, what workloads were negatively > >> affected by merging, and how did you measure that? > > > > https://lkml.org/lkml/2011/12/13/326 > > > > This patch is another example, although for a slight different reason. > > I really have no idea yet what the right answer is in a generic sense, > > but you don't need a 512K request to see higher latencies from merging. > > Well, this patch has almost nothing to with merging, right? It's about > keeping I/O from the I/O scheduler for too long (or, prior to on-stack > plugging, it was about keeping the queue plugged for too long). And, > I'm pretty sure that the testing involved there was with deadline or > noop, nothing to do with CFQ fairness. ;-) > > However, this does bring to light the bigger problem of optimizing for > the underlying storage and the workload requirements. Some tuning can > be done in the I/O scheduler, but the plugging definitely circumvents > that a little bit. Well, its merging in the sense that we know with perfect accuracy how often it happens (all the time) and how big an impact it had on latency. You're right that it isn't related to fairness because in this workload the only IO being sent down was these writes, and only one process was doing it. I mention it mostly because the numbers go against all common sense (at least for me). Storage just isn't as predictable anymore. The benchmarking team later reported the patch improved latencies on all io, not just the log writer. This one box is fairly consistent. -chris ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-24 17:32 ` Chris Mason @ 2012-01-24 18:14 ` Jeff Moyer 0 siblings, 0 replies; 76+ messages in thread From: Jeff Moyer @ 2012-01-24 18:14 UTC (permalink / raw) To: Chris Mason Cc: Andrea Arcangeli, Jan Kara, Boaz Harrosh, Mike Snitzer, linux-scsi, neilb, dm-devel, linux-fsdevel, lsf-pc, Darrick J. Wong Chris Mason <chris.mason@oracle.com> writes: > On Tue, Jan 24, 2012 at 12:12:30PM -0500, Jeff Moyer wrote: >> Chris Mason <chris.mason@oracle.com> writes: >> >> > On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote: >> >> Andrea Arcangeli <aarcange@redhat.com> writes: >> >> >> >> > On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote: >> >> >> requst granularity. Sure, big requests will take longer to complete but >> >> >> maximum request size is relatively low (512k by default) so writing maximum >> >> >> sized request isn't that much slower than writing 4k. So it works OK in >> >> >> practice. >> >> > >> >> > Totally unrelated to the writeback, but the merged big 512k requests >> >> > actually adds up some measurable I/O scheduler latencies and they in >> >> > turn slightly diminish the fairness that cfq could provide with >> >> > smaller max request size. Probably even more measurable with SSDs (but >> >> > then SSDs are even faster). >> >> >> >> Are you speaking from experience? If so, what workloads were negatively >> >> affected by merging, and how did you measure that? >> > >> > https://lkml.org/lkml/2011/12/13/326 >> > >> > This patch is another example, although for a slight different reason. >> > I really have no idea yet what the right answer is in a generic sense, >> > but you don't need a 512K request to see higher latencies from merging. >> >> Well, this patch has almost nothing to with merging, right? It's about >> keeping I/O from the I/O scheduler for too long (or, prior to on-stack >> plugging, it was about keeping the queue plugged for too long). And, >> I'm pretty sure that the testing involved there was with deadline or >> noop, nothing to do with CFQ fairness. ;-) >> >> However, this does bring to light the bigger problem of optimizing for >> the underlying storage and the workload requirements. Some tuning can >> be done in the I/O scheduler, but the plugging definitely circumvents >> that a little bit. > > Well, its merging in the sense that we know with perfect accuracy how > often it happens (all the time) and how big an impact it had on latency. > You're right that it isn't related to fairness because in this workload > the only IO being sent down was these writes, and only one process was > doing it. > > I mention it mostly because the numbers go against all common sense (at > least for me). Storage just isn't as predictable anymore. Right, strange that we saw an improvement with the patch even on FC storage. So, it's not just fast SSDs that benefit. > The benchmarking team later reported the patch improved latencies on all > io, not just the log writer. This one box is fairly consistent. We've been running tests with that patch as well, and I've yet to find a downside. I haven't yet run the original synthetic workload, since I wanted real-world data first. It's on my list to keep poking at it. I haven't yet run against really slow storage, either, which I expect to show some regression with the patch. Cheers, Jeff ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics 2012-01-18 23:42 ` Boaz Harrosh 2012-01-19 9:46 ` Jan Kara @ 2012-01-25 0:23 ` NeilBrown 2012-01-25 6:11 ` Andreas Dilger 1 sibling, 1 reply; 76+ messages in thread From: NeilBrown @ 2012-01-25 0:23 UTC (permalink / raw) To: Boaz Harrosh Cc: Jan Kara, Darrick J. Wong, Mike Snitzer, lsf-pc, linux-fsdevel, dm-devel, linux-scsi [-- Attachment #1: Type: text/plain, Size: 717 bytes --] On Thu, 19 Jan 2012 01:42:12 +0200 Boaz Harrosh <bharrosh@panasas.com> wrote: > >> Or possibly modify md-raid5 not to snapshot dirty pages prior to xor/write? > >> (I never really bothered to find out if it really does this.) > > md-raid5/1 currently copies all pages if that what you meant. > Small correction: RAID5 and RAID6 copy all pages. RAID1 and RAID10 do not. If the incoming bios had nicely aligned pages which were somehow flagged to say that they would not change until the request completed, then it should be trivial to avoid that copy. NeilBrown > > Not sure either. Neil should know :) (added to CC). > > > > Honze > > Thanks > Boaz [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  2012-01-25 0:23          ` NeilBrown
@ 2012-01-25 6:11            ` Andreas Dilger
  0 siblings, 0 replies; 76+ messages in thread
From: Andreas Dilger @ 2012-01-25 6:11 UTC (permalink / raw)
  To: NeilBrown
  Cc: Boaz Harrosh, Jan Kara, Darrick J. Wong, Mike Snitzer, lsf-pc,
	linux-fsdevel, dm-devel, linux-scsi

On 2012-01-24, at 5:23 PM, NeilBrown wrote:
> On Thu, 19 Jan 2012 01:42:12 +0200 Boaz Harrosh <bharrosh@panasas.com> wrote:
>
>>>> Or possibly modify md-raid5 not to snapshot dirty pages prior to xor/write?
>>>> (I never really bothered to find out if it really does this.)
>>
>> md-raid5/1 currently copies all pages if that what you meant.
>>
>
> Small correction: RAID5 and RAID6 copy all pages.
> RAID1 and RAID10 do not.
>
> If the incoming bios had nicely aligned pages which were somehow flagged to
> say that they would not change until the request completed, then it should be
> trivial to avoid that copy.

Lustre has a patch to that effect that we've been carrying for several
years. It avoids copying of the pages submitted to the RAID5/6 layer, and
provides a significant improvement in performance and efficiency.

A version of the patches for RHEL6 is available at:
http://review.whamcloud.com/1142 though I don't know how close it is to
working with the latest kernel.

Cheers, Andreas

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] a few storage topics
  2012-01-18 22:58   ` Darrick J. Wong
  2012-01-18 23:22     ` Jan Kara
@ 2012-01-18 23:39     ` Dan Williams
  1 sibling, 0 replies; 76+ messages in thread
From: Dan Williams @ 2012-01-18 23:39 UTC (permalink / raw)
  To: djwong; +Cc: Jan Kara, linux-fsdevel, dm-devel, lsf-pc, linux-scsi, Mike Snitzer

On Wed, Jan 18, 2012 at 2:58 PM, Darrick J. Wong <djwong@us.ibm.com> wrote:
> Or possibly modify md-raid5 not to snapshot dirty pages prior to xor/write?
> (I never really bothered to find out if it really does this.)

It does. ops_run_biodrain() copies from bio to the stripe cache before
performing xor.

--
Dan

^ permalink raw reply	[flat|nested] 76+ messages in thread
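(Not from the thread: the hazard that the copy in ops_run_biodrain() guards against can be
illustrated from userspace. The sketch below assumes glibc POSIX AIO, a filesystem that
accepts O_DIRECT, and a scratch file named "testfile"; build with something like
"gcc demo.c -lrt". It starts an asynchronous direct write and then scribbles on the buffer
while the request is in flight. Without a copy or a stable-page guarantee the device may
see either version of the data, which for RAID5/6 would mean parity computed over data
that later changed.)

#define _GNU_SOURCE
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct aiocb cb;
	char *buf;
	int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);

	if (fd < 0 || posix_memalign((void **)&buf, 4096, 4096))
		return 1;

	memset(buf, 'A', 4096);
	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = 4096;
	cb.aio_offset = 0;

	if (aio_write(&cb))			/* the request is now "in flight" */
		return 1;

	memset(buf, 'B', 4096);			/* ...and we modify the page anyway */

	while (aio_error(&cb) == EINPROGRESS)
		;				/* busy-wait; fine for a demo */

	/* The on-disk block may now hold 'A's, 'B's, or a mix.  This is the
	 * race that md-raid5's stripe-cache copy (or true stable pages)
	 * closes: parity must not be computed over data that can still move. */
	printf("write returned %zd\n", aio_return(&cb));
	close(fd);
	return 0;
}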
* Re: [LSF/MM TOPIC] a few storage topics
  2012-01-17 20:06 ` [LSF/MM TOPIC] a few storage topics Mike Snitzer
  2012-01-17 21:36   ` [Lsf-pc] " Jan Kara
@ 2012-01-24 17:59   ` Martin K. Petersen
  2012-01-24 19:48     ` Douglas Gilbert
  1 sibling, 1 reply; 76+ messages in thread
From: Martin K. Petersen @ 2012-01-24 17:59 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: lsf-pc, linux-scsi, dm-devel, linux-fsdevel

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> 1) expose WRITE SAME via higher level interface (ala
Mike> sb_issue_discard) for more efficient zeroing on SCSI devices
Mike> that support it

I actually thought I had submitted those patches as part of the thin
provisioning update. Looks like I held them back for some reason. I'll
check my notes to figure out why and get the kit merged forward ASAP!

Mike> 4) is anyone working on an interface to GET LBA STATUS?
Mike>    - Martin Petersen added GET LBA STATUS support to scsi_debug,
Mike>      but is there a vision for how tools (e.g. pvmove) could
Mike>      access such info in a uniform way across different vendors'
Mike>      storage?

I hadn't thought of that use case. Going to be a bit tricky given how
GET LBA STATUS works...

--
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 76+ messages in thread
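(Not from the thread: GET LBA STATUS is already reachable from userspace through SG_IO, and
sg3_utils ships an sg_get_lba_status utility that does this properly. The rough sketch below
issues the command as defined in SBC-3, SERVICE ACTION IN(16) with service action 0x12,
against a device given on the command line and dumps the returned (LBA, length, provisioning
status) descriptors. It is only meant to show the shape of the data a tool like pvmove would
have to consume; there is no sense decoding, it starts at LBA 0, and it needs root.)

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(int argc, char **argv)
{
	unsigned char cdb[16] = { 0x9e, 0x12 };	/* SERVICE ACTION IN(16) / GET LBA STATUS */
	unsigned char buf[4096], sense[32];
	struct sg_io_hdr hdr;
	uint64_t lba = 0;			/* starting LBA for this sketch */
	uint32_t plen;
	unsigned int off, end;
	int fd, i;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
		fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
		return 1;
	}

	for (i = 0; i < 8; i++)			/* starting LBA, big-endian */
		cdb[2 + i] = lba >> (56 - 8 * i);
	cdb[10] = sizeof(buf) >> 24;		/* allocation length, big-endian */
	cdb[11] = sizeof(buf) >> 16;
	cdb[12] = sizeof(buf) >> 8;
	cdb[13] = sizeof(buf) & 0xff;

	memset(&hdr, 0, sizeof(hdr));
	hdr.interface_id = 'S';
	hdr.dxfer_direction = SG_DXFER_FROM_DEV;
	hdr.cmd_len = sizeof(cdb);
	hdr.cmdp = cdb;
	hdr.dxferp = buf;
	hdr.dxfer_len = sizeof(buf);
	hdr.sbp = sense;
	hdr.mx_sb_len = sizeof(sense);
	hdr.timeout = 20000;			/* milliseconds */

	if (ioctl(fd, SG_IO, &hdr) < 0 || (hdr.info & SG_INFO_OK_MASK) != SG_INFO_OK) {
		fprintf(stderr, "GET LBA STATUS failed (device may not support it)\n");
		return 1;
	}

	/* Parameter data: 4-byte length, 4 reserved bytes, then 16-byte
	 * descriptors: 8-byte LBA, 4-byte block count, provisioning status. */
	plen = ((uint32_t)buf[0] << 24) | (buf[1] << 16) | (buf[2] << 8) | buf[3];
	end = plen + 4 < sizeof(buf) ? plen + 4 : sizeof(buf);

	for (off = 8; off + 16 <= end; off += 16) {
		uint64_t dlba = 0;
		uint32_t blocks;
		int st;

		for (i = 0; i < 8; i++)
			dlba = (dlba << 8) | buf[off + i];
		blocks = ((uint32_t)buf[off + 8] << 24) | (buf[off + 9] << 16) |
			 (buf[off + 10] << 8) | buf[off + 11];
		st = buf[off + 12] & 0x0f;
		printf("LBA %llu, %u blocks: %s\n", (unsigned long long)dlba, blocks,
		       st == 0 ? "mapped" : st == 1 ? "deallocated" : "anchored/other");
	}
	close(fd);
	return 0;
}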
* Re: [LSF/MM TOPIC] a few storage topics
  2012-01-24 17:59   ` Martin K. Petersen
@ 2012-01-24 19:48     ` Douglas Gilbert
  2012-01-24 20:04       ` Martin K. Petersen
  0 siblings, 1 reply; 76+ messages in thread
From: Douglas Gilbert @ 2012-01-24 19:48 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Mike Snitzer, lsf-pc, linux-scsi, dm-devel, linux-fsdevel

On 12-01-24 12:59 PM, Martin K. Petersen wrote:
>>>>>> "Mike" == Mike Snitzer<snitzer@redhat.com> writes:
>
> Mike> 1) expose WRITE SAME via higher level interface (ala
> Mike> sb_issue_discard) for more efficient zeroing on SCSI devices
> Mike> that support it
>
> I actually thought I had submitted those patches as part of the thin
> provisioning update. Looks like I held them back for some reason. I'll
> check my notes to figure out why and get the kit merged forward ASAP!
>
>
> Mike> 4) is anyone working on an interface to GET LBA STATUS?
> Mike>    - Martin Petersen added GET LBA STATUS support to scsi_debug,
> Mike>      but is there a vision for how tools (e.g. pvmove) could
> Mike>      access such info in a uniform way across different vendors'
> Mike>      storage?
>
> I hadn't thought of that use case. Going to be a bit tricky given how
> GET LBA STATUS works...

What's new in ACS-3 (t13.org ATA Command Set): .....
f10138r6    Adds the ability for the device to return a
list of the LBAs that are currently trimmed.

So it looks like t13.org are adding a GET LBA STATUS type
facility. That in turn should lead to a SAT-3 (SCSI to ATA
Translation) definition of a mapping between both facilities.

Doug Gilbert

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: [LSF/MM TOPIC] a few storage topics
  2012-01-24 19:48     ` Douglas Gilbert
@ 2012-01-24 20:04       ` Martin K. Petersen
  0 siblings, 0 replies; 76+ messages in thread
From: Martin K. Petersen @ 2012-01-24 20:04 UTC (permalink / raw)
  To: dgilbert
  Cc: Martin K. Petersen, Mike Snitzer, lsf-pc, linux-scsi, dm-devel,
	linux-fsdevel

>>>>> "Doug" == Douglas Gilbert <dgilbert@interlog.com> writes:

>> I hadn't thought of that use case. Going to be a bit tricky given how
>> GET LBA STATUS works...

Doug> What's new in ACS-3 (t13.org ATA Command Set): .....
Doug> f10138r6    Adds the ability for the device to return a
Doug> list of the LBAs that are currently trimmed.

Doug> So it looks like t13.org are adding a GET LBA STATUS type
Doug> facility. That in turn should lead to a SAT-3 (SCSI to ATA
Doug> Translation) definition of a mapping between both facilities.

Yep. It is mostly how to handle the multi-range stuff going up the stack
that concerns me. We'd need something like FIEMAP...

--
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 76+ messages in thread
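(For reference, and not from the thread: the FIEMAP model Martin mentions is a single ioctl
that returns an array of (logical, physical, length, flags) ranges. Below is a minimal
userspace sketch of walking a file's extents with FS_IOC_FIEMAP; an interface reporting LBA
provisioning status up the stack would presumably look much the same, just with
mapped/deallocated/anchored flags on each range.)

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	unsigned int i, n = 32;			/* ask for up to 32 extents */
	int fd;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fm = calloc(1, sizeof(*fm) + n * sizeof(struct fiemap_extent));
	if (!fm)
		return 1;
	fm->fm_start = 0;
	fm->fm_length = ~0ULL;			/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush dirty data first */
	fm->fm_extent_count = n;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("logical %llu  physical %llu  length %llu%s\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_physical,
		       (unsigned long long)fe->fe_length,
		       (fe->fe_flags & FIEMAP_EXTENT_LAST) ? "  (last)" : "");
	}
	free(fm);
	close(fd);
	return 0;
}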