* LSF/MM 2014 Call For Proposals
From: Mel Gorman
Date: 2013-12-20 9:30 UTC
To: linux-scsi, linux-ide, linux-mm, linux-fsdevel
Cc: linux-kernel, lsf-pc

The annual Linux Storage, Filesystem and Memory Management Summit for 2014
will be held on March 24th and 25th, before the Linux Foundation
Collaboration Summit at The Meritage Resort, Napa Valley, CA.

http://events.linuxfoundation.org/events/linux-storage-filesystem-and-mm-summit
http://events.linuxfoundation.org/events/collaboration-summit

Note that we are running LSF/MM a little earlier in 2014 than in previous
years.

On behalf of the committee I would like to issue a call for agenda proposals
that are suitable for cross-track discussion as well as more technical
subjects for discussion in the breakout sessions.

1) Suggestions for agenda topics should be sent before January 31st 2014 to:

   lsf-pc@lists.linux-foundation.org

   and cc the Linux list or lists that are most interested in it:

   ATA:   linux-ide@vger.kernel.org
   FS:    linux-fsdevel@vger.kernel.org
   MM:    linux-mm@kvack.org
   SCSI:  linux-scsi@vger.kernel.org

   People who need more time for visa applications should send proposals
   before January 15th. The committee will complete the first round of
   selections on that date to accommodate those applications.

   Please remember to tag your subject with [LSF/MM TOPIC] to make it easier
   to track. Agenda topics and attendees will be selected by the program
   committee, but the final agenda will be formed by consensus of the
   attendees on the day.

   We will try to cap attendance at around 25-30 per track to facilitate
   discussion, although the final numbers will depend on the room sizes at
   the venue.

2) Requests to attend the summit should be sent to:

   lsf-pc@lists.linux-foundation.org

   Please summarize what expertise you will bring to the meeting and what
   you would like to discuss. Please also tag your email with [LSF/MM ATTEND]
   so there is less chance of it getting lost in the large mail pile.

   Presentations are allowed to guide discussion but are strongly
   discouraged. There will be no recording or audio bridge; however, we
   expect that written minutes will be published as in previous years:

   2013: http://lwn.net/Articles/548089/
   2012: http://lwn.net/Articles/490114/
         http://lwn.net/Articles/490501/
   2011: http://lwn.net/Articles/436871/
         http://lwn.net/Articles/437066/

3) If you have feedback on last year's meeting that we can use to improve
   this year's, please also send that to:

   lsf-pc@lists.linux-foundation.org

Thank you on behalf of the program committee:

Storage:     James Bottomley, Martin K. Petersen
Filesystems: Trond Myklebust, Jeff Layton, Dave Chinner, Jan Kara, Ted Ts'o
MM:          Rik van Riel, Michel Lespinasse

-- 
Mel Gorman
SUSE Labs

* [LSF/MM TOPIC] [ATTEND] persistent memory progress, management of storage & file systems
From: Ric Wheeler
Date: 2014-01-06 22:20 UTC
To: linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc
Cc: linux-kernel

I would like to attend this year and continue to talk about the work on
enabling the new class of persistent memory devices. Specifically, I am very
interested in talking about both using a block driver under our existing
stack and also progress at the file system layer (adding xip/mmap tweaks to
existing file systems and looking at new file systems).

We also have a lot of work left to do on unifying management; it would be
good to resync on that.

Regards,

Ric

* RE: [LSF/MM TOPIC] [ATTEND] persistent memory progress, management of storage & file systems
From: faibish, sorin
Date: 2014-01-06 22:32 UTC
To: Ric Wheeler, linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc
Cc: linux-kernel

Speaking of persistent memory, I would like to discuss PMFS as well as the
RDMA aspects of the persistent memory model. I would also like to discuss KV
stores and object stores on persistent memory. I was involved in PMFS as a
tester and I found several issues that I would like to discuss with the
community. I assume that others from Intel could join this discussion
besides Andy and Matt, who have already asked for this topic. Thanks

./Sorin

-----Original Message-----
From: Ric Wheeler
Sent: Monday, January 06, 2014 5:21 PM
To: linux-scsi; linux-ide; linux-mm; linux-fsdevel; lsf-pc
Cc: linux-kernel
Subject: [LSF/MM TOPIC] [ATTEND] persistent memory progress, management of storage & file systems

I would like to attend this year and continue to talk about the work on
enabling the new class of persistent memory devices. [...]

* Re: [LSF/MM TOPIC] [ATTEND] persistent memory progress, management of storage & file systems
From: Joel Becker
Date: 2014-01-07 19:44 UTC
To: faibish, sorin
Cc: Ric Wheeler, linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc, linux-kernel

On Mon, Jan 06, 2014 at 05:32:56PM -0500, faibish, sorin wrote:
> Speaking of persistent memory, I would like to discuss PMFS as well as the
> RDMA aspects of the persistent memory model. I would also like to discuss
> KV stores and object stores on persistent memory. [...]

Ooh, and the cluster/remote filesystem stories there (e.g., RDMA) are
probably pretty cool.

Joel

-- 
The herd instinct among economists makes sheep look like independent
thinkers.

http://www.jlbec.org/
jlbec@evilplan.org

* Re: LSF/MM 2014 Call For Proposals
From: Michel Lespinasse
Date: 2014-01-21 7:00 UTC
To: Mel Gorman
Cc: linux-scsi, linux-ide, linux-mm, linux-fsdevel, linux-kernel, lsf-pc

On Fri, Dec 20, 2013 at 1:30 AM, Mel Gorman <mgorman@suse.de> wrote:
> The annual Linux Storage, Filesystem and Memory Management Summit for 2014
> will be held on March 24th and 25th, before the Linux Foundation
> Collaboration Summit at The Meritage Resort, Napa Valley, CA.
>
> http://events.linuxfoundation.org/events/linux-storage-filesystem-and-mm-summit
> http://events.linuxfoundation.org/events/collaboration-summit

Just a reminder for anyone who wants to participate in LSF/MM: if you
haven't already done so, please send us your request and/or topic proposals
by January 31st...

> [...]

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

* [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: Ric Wheeler
Date: 2014-01-22 3:04 UTC
To: linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc
Cc: linux-kernel

One topic that has been lurking forever at the edges is the current 4k
limitation for file system block sizes. Some devices in production today,
and others coming soon, have larger sectors and it would be interesting to
see if it is time to poke at this topic again.

LSF/MM seems to be pretty much the only event of the year where most of the
key people will be present, so this should be a great topic for a joint
session.

Ric

* Re: [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: Joel Becker
Date: 2014-01-22 5:20 UTC
To: Ric Wheeler
Cc: linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc, linux-kernel

On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:
> One topic that has been lurking forever at the edges is the current 4k
> limitation for file system block sizes. Some devices in production today,
> and others coming soon, have larger sectors and it would be interesting to
> see if it is time to poke at this topic again.
>
> LSF/MM seems to be pretty much the only event of the year where most of
> the key people will be present, so this should be a great topic for a
> joint session.

Oh yes, I want in on this. We handle 4k/16k/64k pages "seamlessly," and we
would want to do the same for larger sectors. In theory, our code should
handle it with the appropriate defines updated.

Joel

* Re: [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: Hannes Reinecke
Date: 2014-01-22 7:14 UTC
To: Ric Wheeler, linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc, linux-kernel

On 01/22/2014 06:20 AM, Joel Becker wrote:
> On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:
>> One topic that has been lurking forever at the edges is the current 4k
>> limitation for file system block sizes. [...]
>
> Oh yes, I want in on this. We handle 4k/16k/64k pages "seamlessly," and we
> would want to do the same for larger sectors. In theory, our code should
> handle it with the appropriate defines updated.

+1

The shingled drive folks would really love us for this. Plus it would make
life really easy for that type of device.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   zSeries & Storage
hare@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nurnberg
GF: J. Hawn, J. Guild, F. Imendorffer, HRB 16746 (AG Nurnberg)

* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: Mel Gorman
Date: 2014-01-22 9:34 UTC
To: Ric Wheeler
Cc: linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc, linux-kernel

On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:
> One topic that has been lurking forever at the edges is the current 4k
> limitation for file system block sizes. Some devices in production today,
> and others coming soon, have larger sectors and it would be interesting to
> see if it is time to poke at this topic again.

Large block support was proposed years ago by Christoph Lameter
(http://lwn.net/Articles/232757/). I think I was just getting started in the
community at the time, so I do not recall any of the details. I do believe
it motivated an alternative by Nick Piggin called fsblock, though
(http://lwn.net/Articles/321390/). At the very least it would be nice to
know why neither was ever merged, for those of us who were not around at the
time and who may not have the chance to dive through mailing list archives
between now and March.

FWIW, I would expect that a show-stopper for any proposal is requiring
high-order allocations to succeed for the system to behave correctly.

-- 
Mel Gorman
SUSE Labs

* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: Ric Wheeler
Date: 2014-01-22 14:10 UTC
To: Mel Gorman
Cc: linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc, linux-kernel, Andrew Morton

On 01/22/2014 04:34 AM, Mel Gorman wrote:
> Large block support was proposed years ago by Christoph Lameter
> (http://lwn.net/Articles/232757/). I think I was just getting started in
> the community at the time, so I do not recall any of the details. I do
> believe it motivated an alternative by Nick Piggin called fsblock, though
> (http://lwn.net/Articles/321390/). At the very least it would be nice to
> know why neither was ever merged, for those of us who were not around at
> the time and who may not have the chance to dive through mailing list
> archives between now and March.
>
> FWIW, I would expect that a show-stopper for any proposal is requiring
> high-order allocations to succeed for the system to behave correctly.

I have a somewhat hazy memory of Andrew warning us that touching this code
takes us into dark and scary places.

ric

* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: Mel Gorman
Date: 2014-01-22 14:34 UTC
To: Ric Wheeler
Cc: linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc, linux-kernel, Andrew Morton

On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote:
> On 01/22/2014 04:34 AM, Mel Gorman wrote:
>> Large block support was proposed years ago by Christoph Lameter
>> (http://lwn.net/Articles/232757/). [...]
>>
>> FWIW, I would expect that a show-stopper for any proposal is requiring
>> high-order allocations to succeed for the system to behave correctly.
>
> I have a somewhat hazy memory of Andrew warning us that touching this code
> takes us into dark and scary places.

That is a light summary. As Andrew tends to reject patches with poor
documentation in case we forget the details in 6 months, I'm going to guess
that he does not remember the details of a discussion from 7ish years ago.
This is where Andrew swoops in with a dazzling display of his eidetic memory
just to prove me wrong.

Ric, are there any storage vendors that are pushing for this right now? Is
someone working on this right now or planning to? If they are, have they
looked into the history of fsblock (Nick) and large block support
(Christoph) to see if they are candidates for forward porting or
reimplementation? I ask because without that person there is a risk that the
discussion will go as follows:

Topic leader: Does anyone have an objection to supporting larger block
              sizes than the page size?
Room: Send patches and we'll talk.

-- 
Mel Gorman
SUSE Labs

* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: Ric Wheeler
Date: 2014-01-22 14:58 UTC
To: Mel Gorman
Cc: linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc, linux-kernel, Andrew Morton

On 01/22/2014 09:34 AM, Mel Gorman wrote:
> Ric, are there any storage vendors that are pushing for this right now? Is
> someone working on this right now or planning to? If they are, have they
> looked into the history of fsblock (Nick) and large block support
> (Christoph) to see if they are candidates for forward porting or
> reimplementation? I ask because without that person there is a risk that
> the discussion will go as follows:
>
> Topic leader: Does anyone have an objection to supporting larger block
>               sizes than the page size?
> Room: Send patches and we'll talk.

I will have to see if I can get a storage vendor to make a public statement,
but there are vendors hoping to see this land in Linux in the next few
years. I assume that anyone with a shipping device will have to at least
emulate the 4KB sector size for years to come, but that there might be a
significant performance win for platforms that can do a larger block.

Note that Windows seems to suffer from the exact same limitation, so we are
not alone here with the VM page size / FS block size entanglement...

ric

* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: Mel Gorman
Date: 2014-01-22 15:19 UTC
To: Ric Wheeler
Cc: linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc, linux-kernel, Andrew Morton

On Wed, Jan 22, 2014 at 09:58:46AM -0500, Ric Wheeler wrote:
> On 01/22/2014 09:34 AM, Mel Gorman wrote:
>> Ric, are there any storage vendors that are pushing for this right now?
>> Is someone working on this right now or planning to? If they are, have
>> they looked into the history of fsblock (Nick) and large block support
>> (Christoph) to see if they are candidates for forward porting or
>> reimplementation? [...]
>
> I will have to see if I can get a storage vendor to make a public
> statement, but there are vendors hoping to see this land in Linux in the
> next few years.

What about the second and third questions -- is someone working on this
right now or planning to? Have they looked into the history of fsblock
(Nick) and large block support (Christoph) to see if they are candidates for
forward porting or reimplementation?

Don't get me wrong, I'm interested in the topic, but I severely doubt I'd
have the capacity to research the background of this in advance. It's also
unlikely that I'd work on it in the future without throwing out my current
TODO list. In an ideal world someone will have done the legwork in advance
of LSF/MM to help drive the topic.

-- 
Mel Gorman
SUSE Labs

* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: Chris Mason
Date: 2014-01-22 17:02 UTC
To: mgorman@suse.de
Cc: linux-kernel, linux-ide, lsf-pc, linux-mm, linux-scsi, akpm@linux-foundation.org, rwheeler@redhat.com, linux-fsdevel

On Wed, 2014-01-22 at 15:19 +0000, Mel Gorman wrote:
> On Wed, Jan 22, 2014 at 09:58:46AM -0500, Ric Wheeler wrote:
>> I will have to see if I can get a storage vendor to make a public
>> statement, but there are vendors hoping to see this land in Linux in the
>> next few years.
>
> What about the second and third questions -- is someone working on this
> right now or planning to? Have they looked into the history of fsblock
> (Nick) and large block support (Christoph) to see if they are candidates
> for forward porting or reimplementation?

I really think that if we want to make progress on this one, we need code
and someone who owns it. Nick's work was impressive, but it was mostly there
for getting rid of buffer heads. If we have a device that needs it and
someone working to enable that device, we'll go forward much faster.

-chris

* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: James Bottomley
Date: 2014-01-22 17:21 UTC
To: Chris Mason
Cc: mgorman@suse.de, rwheeler@redhat.com, akpm@linux-foundation.org, linux-kernel, linux-ide, lsf-pc, linux-mm, linux-scsi, linux-fsdevel

On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote:
> On Wed, 2014-01-22 at 15:19 +0000, Mel Gorman wrote:
>> What about the second and third questions -- is someone working on this
>> right now or planning to? Have they looked into the history of fsblock
>> (Nick) and large block support (Christoph) to see if they are candidates
>> for forward porting or reimplementation?
>
> I really think that if we want to make progress on this one, we need code
> and someone who owns it. Nick's work was impressive, but it was mostly
> there for getting rid of buffer heads. If we have a device that needs it
> and someone working to enable that device, we'll go forward much faster.

Do we even need to do that (eliminate buffer heads)? We cope with
4k-sector-only devices just fine today because the bh mechanisms now operate
on top of the page cache and can do the RMW necessary to update a bh in the
page cache itself, which allows us to do only 4k chunked writes. So we could
keep the bh system and just alter the granularity of the page cache.

The other question is, if the drive does RMW between 4k and whatever its
physical sector size, do we need to do anything to take advantage of it ...
as in, what would altering the granularity of the page cache buy us?

James

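To make the read-modify-write James describes concrete, here is a minimal
userspace sketch of the access pattern under the assumption of a
hypothetical device with 16k physical sectors: updating a 4k range means
reading the containing physical sector, patching it in memory, and writing
the whole sector back. The sizes, the rmw_write() helper and the scratch
file are made up for the illustration; this is not the kernel's buffer-head
implementation.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define PHYS   16384   /* assumed physical sector size */
#define LBLOCK  4096   /* logical block being rewritten */

static int rmw_write(int fd, off_t pos, const void *buf, size_t len)
{
	off_t start = pos & ~((off_t)PHYS - 1);  /* containing physical sector */
	char *sector = malloc(PHYS);

	if (!sector)
		return -1;
	/* Read the whole physical sector... */
	if (pread(fd, sector, PHYS, start) != PHYS)
		goto fail;
	/* ...patch the sub-sector range in memory... */
	memcpy(sector + (pos - start), buf, len);
	/* ...and write the full sector back as one aligned I/O. */
	if (pwrite(fd, sector, PHYS, start) != PHYS)
		goto fail;
	free(sector);
	return 0;
fail:
	free(sector);
	return -1;
}

int main(int argc, char **argv)
{
	char block[LBLOCK];
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <scratch-file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return 1;
	/* Give the scratch file two physical sectors to work with. */
	if (ftruncate(fd, 2 * PHYS) < 0)
		return 1;
	memset(block, 0xab, sizeof(block));
	/* Rewrite the 4k block at offset 20k: 4k-aligned, not 16k-aligned. */
	if (rmw_write(fd, 20480, block, sizeof(block)))
		perror("rmw_write");
	close(fd);
	return 0;
}
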
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: Chris Mason
Date: 2014-01-22 18:02 UTC
To: James.Bottomley@HansenPartnership.com
Cc: linux-kernel, linux-ide, lsf-pc, linux-mm, linux-scsi, rwheeler@redhat.com, akpm@linux-foundation.org, linux-fsdevel, mgorman@suse.de

On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
> On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote:

[ I like big sectors and I cannot lie ]

>> I really think that if we want to make progress on this one, we need code
>> and someone who owns it. Nick's work was impressive, but it was mostly
>> there for getting rid of buffer heads. If we have a device that needs it
>> and someone working to enable that device, we'll go forward much faster.
>
> Do we even need to do that (eliminate buffer heads)? We cope with
> 4k-sector-only devices just fine today because the bh mechanisms now
> operate on top of the page cache and can do the RMW necessary to update a
> bh in the page cache itself, which allows us to do only 4k chunked writes.
> So we could keep the bh system and just alter the granularity of the page
> cache.

We're likely to have people mixing 4K drives and <fill in some other size
here> on the same box. We could just go with the biggest size and use the
existing bh code for the sub-pagesized blocks, but I really hesitate to
change VM fundamentals for this.

From a pure code point of view, it may be less work to change it once in the
VM. But from an overall system impact point of view, it's a big change in
how the system behaves just for filesystem metadata.

> The other question is, if the drive does RMW between 4k and whatever its
> physical sector size, do we need to do anything to take advantage of
> it ... as in, what would altering the granularity of the page cache buy
> us?

The real benefit is when and how the reads get scheduled. We're able to do a
much better job pipelining the reads, controlling our caches and reducing
write latency by having the reads done up in the OS instead of the drive.

-chris

* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: James Bottomley
Date: 2014-01-22 18:13 UTC
To: Chris Mason
Cc: linux-kernel, linux-ide, lsf-pc, linux-mm, linux-scsi, rwheeler@redhat.com, akpm@linux-foundation.org, linux-fsdevel, mgorman@suse.de

On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:
> On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
>
> [ I like big sectors and I cannot lie ]

I think I might be sceptical, but I don't think that's showing in my
concerns ...

>> Do we even need to do that (eliminate buffer heads)? We cope with
>> 4k-sector-only devices just fine today because the bh mechanisms now
>> operate on top of the page cache and can do the RMW necessary to update a
>> bh in the page cache itself, which allows us to do only 4k chunked
>> writes. So we could keep the bh system and just alter the granularity of
>> the page cache.
>
> We're likely to have people mixing 4K drives and <fill in some other size
> here> on the same box. We could just go with the biggest size and use the
> existing bh code for the sub-pagesized blocks, but I really hesitate to
> change VM fundamentals for this.

If the page cache had a variable granularity per device, that would cope
with this. It's the variable granularity that's the VM problem.

> From a pure code point of view, it may be less work to change it once in
> the VM. But from an overall system impact point of view, it's a big change
> in how the system behaves just for filesystem metadata.

Agreed, but only if we don't do RMW in the buffer cache ... which may be a
good reason to keep it.

>> The other question is, if the drive does RMW between 4k and whatever its
>> physical sector size, do we need to do anything to take advantage of
>> it ... as in, what would altering the granularity of the page cache buy
>> us?
>
> The real benefit is when and how the reads get scheduled. We're able to do
> a much better job pipelining the reads, controlling our caches and
> reducing write latency by having the reads done up in the OS instead of
> the drive.

I agree with all of that, but my question is still: can we do this by
propagating alignment and chunk size information (i.e. the physical sector
size) like we do today? If the FS knows the optimal I/O patterns and tries
to follow them, the odd cockup won't impact performance dramatically. The
real question is, can the FS make use of this layout information *without*
changing the page cache granularity? Only if you answer me "no" to this do I
think we need to worry about changing page cache granularity.

Realistically, if you look at what the I/O schedulers output on a standard
(spinning rust) workload, it's mostly large transfers. Obviously these are
misaligned at the ends, but we can fix some of that in the scheduler,
particularly if the FS helps us with layout. My instinct tells me that we
can fix 99% of this with layout on the FS + I/O schedulers ... the remaining
1% goes to the drive as needing to do RMW in the device, but the net impact
to our throughput shouldn't be that great.

James

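The alignment and chunk-size information James talks about propagating is
what the block layer already exports as I/O topology hints. Below is a
minimal sketch of how a mkfs-style tool might read those hints before
choosing its block size and allocation alignment, assuming the standard
block-device ioctls (BLKSSZGET, BLKPBSZGET, BLKIOMIN, BLKIOOPT); the policy
in the comment is only one possible interpretation, not any particular
filesystem's behaviour.

/*
 * Query the I/O topology a block device advertises, roughly as mkfs-style
 * tools do before choosing block size and allocation alignment.
 * Error handling is kept minimal for brevity.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	unsigned int pbs = 0, io_min = 0, io_opt = 0;
	int lbs = 0, fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	ioctl(fd, BLKSSZGET, &lbs);    /* logical sector size */
	ioctl(fd, BLKPBSZGET, &pbs);   /* physical sector size */
	ioctl(fd, BLKIOMIN, &io_min);  /* minimum I/O size hint */
	ioctl(fd, BLKIOOPT, &io_opt);  /* optimal I/O size hint */

	printf("logical %d, physical %u, min_io %u, opt_io %u\n",
	       lbs, pbs, io_min, io_opt);
	/*
	 * One layout policy in the spirit of James's suggestion: keep the FS
	 * block size at 4k but round allocation starts up to the physical
	 * sector size (or io_min, if larger) so most writes stay aligned.
	 */
	close(fd);
	return 0;
}
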
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: Ric Wheeler
Date: 2014-01-22 18:17 UTC
To: James Bottomley, Chris Mason
Cc: linux-kernel, linux-ide, lsf-pc, linux-mm, linux-scsi, akpm@linux-foundation.org, linux-fsdevel, mgorman@suse.de

On 01/22/2014 01:13 PM, James Bottomley wrote:
> I agree with all of that, but my question is still: can we do this by
> propagating alignment and chunk size information (i.e. the physical sector
> size) like we do today? If the FS knows the optimal I/O patterns and tries
> to follow them, the odd cockup won't impact performance dramatically. The
> real question is, can the FS make use of this layout information *without*
> changing the page cache granularity? Only if you answer me "no" to this do
> I think we need to worry about changing page cache granularity.
>
> Realistically, if you look at what the I/O schedulers output on a standard
> (spinning rust) workload, it's mostly large transfers. Obviously these are
> misaligned at the ends, but we can fix some of that in the scheduler,
> particularly if the FS helps us with layout. My instinct tells me that we
> can fix 99% of this with layout on the FS + I/O schedulers ... the
> remaining 1% goes to the drive as needing to do RMW in the device, but the
> net impact to our throughput shouldn't be that great.

I think that the key to having the file system work with larger sectors is
to create them properly aligned and use the actual, native sector size as
their FS block size. Which is pretty much back to the original challenge.

Teaching each and every file system to be aligned at the storage
granularity/minimum IO size, when that is larger than the physical sector
size, is harder, I think.

ric

* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
From: James Bottomley
Date: 2014-01-22 18:35 UTC
To: Ric Wheeler
Cc: Chris Mason, linux-kernel, linux-ide, lsf-pc, linux-mm, linux-scsi, akpm@linux-foundation.org, linux-fsdevel, mgorman@suse.de

On Wed, 2014-01-22 at 13:17 -0500, Ric Wheeler wrote:
> I think that the key to having the file system work with larger sectors is
> to create them properly aligned and use the actual, native sector size as
> their FS block size. Which is pretty much back to the original challenge.

Only if you think laying out stuff requires block size changes. If a 4k
block filesystem's allocation algorithm tried to allocate on a 16k boundary,
for instance, that gets us a lot of the performance without needing a lot of
alteration.

It's not even obvious that an ignorant 4k layout is going to be so bad ...
the RMW occurs only at the ends of the transfers, not in the middle. If we
say 16k physical block and average 128k transfers, probabilistically we
misalign on 6 out of 31 sectors (or 19% of the time). We can make that
better by increasing the transfer size (it comes down to 10% for 256k
transfers).

> Teaching each and every file system to be aligned at the storage
> granularity/minimum IO size, when that is larger than the physical sector
> size, is harder, I think.

But you're making assumptions about needing larger block sizes. I'm asking
what we can do with what we currently have. Increasing the transfer size is
a way of mitigating the problem with no FS support whatever. Adding
alignment to the FS layout algorithm is another. When you've done both of
those, I think you're already at the 99% aligned case, which is "do we need
to bother any more" territory for me.

James

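James's percentages can be reproduced with a simple back-of-the-envelope
model; the model is an assumption made for illustration, not something
spelled out in the thread. Assume transfers start on a random 4k boundary,
so 3 out of 4 starting offsets leave the transfer straddling 16k physical
blocks, and each misaligned end costs one read-modify-write of a physical
block (1.5 expected RMW blocks per transfer). The sketch below works that
out for 128k and 256k transfers and lands near the quoted 19% and 10%.

/*
 * Back-of-the-envelope estimate of the fraction of physical blocks that
 * need RMW per transfer. Assumptions: 4k-aligned transfer starts, 16k
 * physical sectors, one RMW per misaligned end of a transfer.
 */
#include <stdio.h>

int main(void)
{
	const double lbs = 4096.0;    /* logical block / alignment grain */
	const double pbs = 16384.0;   /* assumed physical sector size */
	const unsigned int sizes[] = { 131072, 262144 };  /* 128k, 256k */

	for (unsigned int i = 0; i < 2; i++) {
		double blocks = sizes[i] / pbs;            /* physical blocks per transfer */
		double p_misaligned = 1.0 - lbs / pbs;     /* 3 out of 4 start offsets */
		double rmw = 2.0 * p_misaligned;           /* head + tail, expected value */

		printf("%uk transfer: ~%.1f%% of physical blocks need RMW\n",
		       sizes[i] / 1024, 100.0 * rmw / blocks);
	}
	return 0;
}
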
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 18:35 ` James Bottomley @ 2014-01-22 18:39 ` Ric Wheeler 2014-01-22 19:30 ` James Bottomley 0 siblings, 1 reply; 59+ messages in thread From: Ric Wheeler @ 2014-01-22 18:39 UTC (permalink / raw) To: James Bottomley Cc: Chris Mason, linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-scsi@vger.kernel.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, mgorman@suse.de On 01/22/2014 01:35 PM, James Bottomley wrote: > On Wed, 2014-01-22 at 13:17 -0500, Ric Wheeler wrote: >> On 01/22/2014 01:13 PM, James Bottomley wrote: >>> On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote: >>>> On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote: >>>>> On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote: >>>> [ I like big sectors and I cannot lie ] >>> I think I might be sceptical, but I don't think that's showing in my >>> concerns ... >>> >>>>>> I really think that if we want to make progress on this one, we need >>>>>> code and someone that owns it. Nick's work was impressive, but it was >>>>>> mostly there for getting rid of buffer heads. If we have a device that >>>>>> needs it and someone working to enable that device, we'll go forward >>>>>> much faster. >>>>> Do we even need to do that (eliminate buffer heads)? We cope with 4k >>>>> sector only devices just fine today because the bh mechanisms now >>>>> operate on top of the page cache and can do the RMW necessary to update >>>>> a bh in the page cache itself which allows us to do only 4k chunked >>>>> writes, so we could keep the bh system and just alter the granularity of >>>>> the page cache. >>>>> >>>> We're likely to have people mixing 4K drives and <fill in some other >>>> size here> on the same box. We could just go with the biggest size and >>>> use the existing bh code for the sub-pagesized blocks, but I really >>>> hesitate to change VM fundamentals for this. >>> If the page cache had a variable granularity per device, that would cope >>> with this. It's the variable granularity that's the VM problem. >>> >>>> From a pure code point of view, it may be less work to change it once in >>>> the VM. But from an overall system impact point of view, it's a big >>>> change in how the system behaves just for filesystem metadata. >>> Agreed, but only if we don't do RMW in the buffer cache ... which may be >>> a good reason to keep it. >>> >>>>> The other question is if the drive does RMW between 4k and whatever its >>>>> physical sector size, do we need to do anything to take advantage of >>>>> it ... as in what would altering the granularity of the page cache buy >>>>> us? >>>> The real benefit is when and how the reads get scheduled. We're able to >>>> do a much better job pipelining the reads, controlling our caches and >>>> reducing write latency by having the reads done up in the OS instead of >>>> the drive. >>> I agree with all of that, but my question is still can we do this by >>> propagating alignment and chunk size information (i.e. the physical >>> sector size) like we do today. If the FS knows the optimal I/O patterns >>> and tries to follow them, the odd cockup won't impact performance >>> dramatically. The real question is can the FS make use of this layout >>> information *without* changing the page cache granularity? Only if you >>> answer me "no" to this do I think we need to worry about changing page >>> cache granularity. 
>>> >>> Realistically, if you look at what the I/O schedulers output on a >>> standard (spinning rust) workload, it's mostly large transfers. >>> Obviously these are misalgned at the ends, but we can fix some of that >>> in the scheduler. Particularly if the FS helps us with layout. My >>> instinct tells me that we can fix 99% of this with layout on the FS + io >>> schedulers ... the remaining 1% goes to the drive as needing to do RMW >>> in the device, but the net impact to our throughput shouldn't be that >>> great. >>> >>> James >>> >> I think that the key to having the file system work with larger >> sectors is to >> create them properly aligned and use the actual, native sector size as >> their FS >> block size. Which is pretty much back the original challenge. > Only if you think laying out stuff requires block size changes. If a 4k > block filesystem's allocation algorithm tried to allocate on a 16k > boundary for instance, that gets us a lot of the performance without > needing a lot of alteration. The key here is that we cannot assume that writes happen only during allocation/append mode. Unless the block size enforces it, we will have non-aligned, small block IO done to allocated regions that won't get coalesced. > > It's not even obvious that an ignorant 4k layout is going to be so > bad ... the RMW occurs only at the ends of the transfers, not in the > middle. If we say 16k physical block and average 128k transfers, > probabalistically we misalign on 6 out of 31 sectors (or 19% of the > time). We can make that better by increasing the transfer size (it > comes down to 10% for 256k transfers. This really depends on the nature of the device. Some devices could produce very erratic performance or even (not today, but some day) reject the IO. > >> Teaching each and every file system to be aligned at the storage >> granularity/minimum IO size when that is larger than the physical >> sector size is >> harder I think. > But you're making assumptions about needing larger block sizes. I'm > asking what can we do with what we currently have? Increasing the > transfer size is a way of mitigating the problem with no FS support > whatever. Adding alignment to the FS layout algorithm is another. When > you've done both of those, I think you're already at the 99% aligned > case, which is "do we need to bother any more" territory for me. > I would say no, we will eventually need larger file system block sizes. Tuning and getting 95% (98%?) of the way there with alignment and IO scheduler does help a lot. That is what we do today and it is important when looking for high performance. However, this is more of a short term work around for a lack of a fundamental ability to do the right sized file system block for a specific class of device. As such, not a crisis that must be solved today, but rather something that I think is definitely worth looking at so we can figure this out over the next year or so. Ric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 18:39 ` Ric Wheeler @ 2014-01-22 19:30 ` James Bottomley 2014-01-22 19:50 ` Andrew Morton 2014-01-22 20:57 ` Martin K. Petersen 0 siblings, 2 replies; 59+ messages in thread From: James Bottomley @ 2014-01-22 19:30 UTC (permalink / raw) To: Ric Wheeler Cc: linux-ide@vger.kernel.org, linux-scsi@vger.kernel.org, Chris Mason, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mgorman@suse.de, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org On Wed, 2014-01-22 at 13:39 -0500, Ric Wheeler wrote: > On 01/22/2014 01:35 PM, James Bottomley wrote: > > On Wed, 2014-01-22 at 13:17 -0500, Ric Wheeler wrote: [...] > >> I think that the key to having the file system work with larger > >> sectors is to > >> create them properly aligned and use the actual, native sector size as > >> their FS > >> block size. Which is pretty much back the original challenge. > > Only if you think laying out stuff requires block size changes. If a 4k > > block filesystem's allocation algorithm tried to allocate on a 16k > > boundary for instance, that gets us a lot of the performance without > > needing a lot of alteration. > > The key here is that we cannot assume that writes happen only during > allocation/append mode. But that doesn't matter at all, does it? If the file is sector aligned, then the write is aligned. If the write is short on a large block fs, well we'd just have to do the RMW in the OS anyway ... is that any better than doing it in the device? > Unless the block size enforces it, we will have non-aligned, small > block IO done > to allocated regions that won't get coalesced. We always get that if it's the use pattern ... the question merely becomes who bears the burden of RMW. > > It's not even obvious that an ignorant 4k layout is going to be so > > bad ... the RMW occurs only at the ends of the transfers, not in the > > middle. If we say 16k physical block and average 128k transfers, > > probabalistically we misalign on 6 out of 31 sectors (or 19% of the > > time). We can make that better by increasing the transfer size (it > > comes down to 10% for 256k transfers. > > This really depends on the nature of the device. Some devices could > produce very > erratic performance Yes, we get that today with misaligned writes to the 4k devices. > or even (not today, but some day) reject the IO. I really doubt this. All 4k drives today do RMW ... I don't see that changing any time soon. > >> Teaching each and every file system to be aligned at the storage > >> granularity/minimum IO size when that is larger than the physical > >> sector size is > >> harder I think. > > But you're making assumptions about needing larger block sizes. I'm > > asking what can we do with what we currently have? Increasing the > > transfer size is a way of mitigating the problem with no FS support > > whatever. Adding alignment to the FS layout algorithm is another. When > > you've done both of those, I think you're already at the 99% aligned > > case, which is "do we need to bother any more" territory for me. > > > > I would say no, we will eventually need larger file system block sizes. > > Tuning and getting 95% (98%?) of the way there with alignment and IO > scheduler > does help a lot. That is what we do today and it is important when > looking for > high performance. 
> > However, this is more of a short term work around for a lack of a > fundamental > ability to do the right sized file system block for a specific class > of device. > As such, not a crisis that must be solved today, but rather something > that I > think is definitely worth looking at so we can figure this out over > the next > year or so. But this, I think, is the fundamental point for debate. If we can pull alignment and other tricks to solve 99% of the problem is there a need for radical VM surgery? Is there anything coming down the pipe in the future that may move the devices ahead of the tricks? James -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 19:30 ` James Bottomley @ 2014-01-22 19:50 ` Andrew Morton 2014-01-22 20:13 ` Chris Mason 2014-01-23 8:35 ` Dave Chinner 2014-01-22 20:57 ` Martin K. Petersen 1 sibling, 2 replies; 59+ messages in thread From: Andrew Morton @ 2014-01-22 19:50 UTC (permalink / raw) To: James Bottomley Cc: Ric Wheeler, linux-ide@vger.kernel.org, linux-scsi@vger.kernel.org, Chris Mason, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mgorman@suse.de, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org On Wed, 22 Jan 2014 11:30:19 -0800 James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > But this, I think, is the fundamental point for debate. If we can pull > alignment and other tricks to solve 99% of the problem is there a need > for radical VM surgery? Is there anything coming down the pipe in the > future that may move the devices ahead of the tricks? I expect it would be relatively simple to get large blocksizes working on powerpc with 64k PAGE_SIZE. So before diving in and doing huge amounts of work, perhaps someone can do a proof-of-concept on powerpc (or ia64) with 64k blocksize. That way we'll at least have an understanding of what the potential gains will be. If the answer is "1.5%" then poof - go off and do something else. (And the gains on powerpc would be an upper bound - unlike powerpc, x86 still has to fiddle around with 16x as many pages and perhaps order-4 allocations(?)) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 19:50 ` Andrew Morton @ 2014-01-22 20:13 ` Chris Mason 2014-01-23 2:46 ` David Lang 0 siblings, 1 reply; 59+ messages in thread From: Chris Mason @ 2014-01-22 20:13 UTC (permalink / raw) To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-scsi@vger.kernel.org, rwheeler@redhat.com, James.Bottomley@hansenpartnership.com, linux-fsdevel@vger.kernel.org, mgorman@suse.de On Wed, 2014-01-22 at 11:50 -0800, Andrew Morton wrote: > On Wed, 22 Jan 2014 11:30:19 -0800 James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > > But this, I think, is the fundamental point for debate. If we can pull > > alignment and other tricks to solve 99% of the problem is there a need > > for radical VM surgery? Is there anything coming down the pipe in the > > future that may move the devices ahead of the tricks? > > I expect it would be relatively simple to get large blocksizes working > on powerpc with 64k PAGE_SIZE. So before diving in and doing huge > amounts of work, perhaps someone can do a proof-of-concept on powerpc > (or ia64) with 64k blocksize. Maybe 5 drives in raid5 on MD, with 4K coming from each drive. Well-aligned 16K IO will work; everything else will be about the same as an RMW from a single drive. -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
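For that example configuration (4 data disks plus one parity disk, 4K per-drive chunks, so a 16K full stripe of data), the constraint MD effectively applies can be sketched as below; this only illustrates the alignment rule, it is not MD's actual code:

#include <stdbool.h>
#include <stdint.h>

#define CHUNK_SIZE	4096ULL				/* bytes per drive per stripe */
#define DATA_DISKS	4				/* 5-drive RAID5: 4 data + 1 parity */
#define STRIPE_SIZE	(CHUNK_SIZE * DATA_DISKS)	/* 16K of data per stripe */

/*
 * A write can skip the read-modify-write path only if it covers whole
 * stripes: start offset and length must both be multiples of 16K.
 * Anything else behaves much like a sub-sector write to a drive with
 * a 16K physical sector.
 */
static bool write_is_full_stripe(uint64_t offset, uint64_t len)
{
	return (offset % STRIPE_SIZE) == 0 && (len % STRIPE_SIZE) == 0;
}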
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 20:13 ` Chris Mason @ 2014-01-23 2:46 ` David Lang 2014-01-23 5:21 ` Theodore Ts'o 0 siblings, 1 reply; 59+ messages in thread From: David Lang @ 2014-01-23 2:46 UTC (permalink / raw) To: Chris Mason Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-scsi@vger.kernel.org, rwheeler@redhat.com, James.Bottomley@hansenpartnership.com, linux-fsdevel@vger.kernel.org, mgorman@suse.de On Wed, 22 Jan 2014, Chris Mason wrote: > On Wed, 2014-01-22 at 11:50 -0800, Andrew Morton wrote: >> On Wed, 22 Jan 2014 11:30:19 -0800 James Bottomley <James.Bottomley@hansenpartnership.com> wrote: >> >>> But this, I think, is the fundamental point for debate. If we can pull >>> alignment and other tricks to solve 99% of the problem is there a need >>> for radical VM surgery? Is there anything coming down the pipe in the >>> future that may move the devices ahead of the tricks? >> >> I expect it would be relatively simple to get large blocksizes working >> on powerpc with 64k PAGE_SIZE. So before diving in and doing huge >> amounts of work, perhaps someone can do a proof-of-concept on powerpc >> (or ia64) with 64k blocksize. > > > Maybe 5 drives in raid5 on MD, with 4K coming from each drive. Well > aligned 16K IO will work, everything else will about the same as a rmw > from a single drive. I think this is the key point to think about here. How will these new hard drive large block sizes differ from RAID stripes and SSD eraseblocks? In all of these cases there are very clear advantages to doing the writes in properly sized and aligned chunks that correspond with the underlying structure to avoid the RMW overhead. It's extremely unlikely that drive manufacturers will produce drives that won't work with any existing OS, so they are going to support smaller writes in firmware. If they don't, they won't be able to sell their drives to anyone running existing software. Given the Enterprise software upgrade cycle compared to the expanding storage needs, whatever they ship will have to work on OS and firmware releases that happened several years ago. I think what is needed is some way to be able to get a report on how man RMW cycles have to happen. Then people can work on ways to reduce this number and measure the results. I don't know if md and dm are currently smart enough to realize that the entire stripe is being overwritten and avoid the RMW cycle. If they can't, I would expect that once we start measuring it, they will gain such support. David Lang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
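A counter of the kind David is asking for could be as simple as the sketch below. It is purely hypothetical, with made-up structure and function names; nothing like this is being claimed to exist in md or dm today:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device accounting, not an existing md/dm interface. */
struct rmw_stats {
	atomic_ulong full_stripe_writes;	/* no RMW needed */
	atomic_ulong rmw_writes;		/* partial stripe, RMW required */
};

static void rmw_account_write(struct rmw_stats *st, uint64_t offset,
			      uint64_t len, uint64_t stripe_size)
{
	bool full = (offset % stripe_size) == 0 && (len % stripe_size) == 0;

	if (full)
		atomic_fetch_add(&st->full_stripe_writes, 1);
	else
		atomic_fetch_add(&st->rmw_writes, 1);
}

Exposing two such numbers per device, however it ends up being wired in, would be enough to measure whether layout and scheduler changes actually reduce the RMW rate.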
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-23 2:46 ` David Lang @ 2014-01-23 5:21 ` Theodore Ts'o 0 siblings, 0 replies; 59+ messages in thread From: Theodore Ts'o @ 2014-01-23 5:21 UTC (permalink / raw) To: David Lang Cc: Chris Mason, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-scsi@vger.kernel.org, rwheeler@redhat.com, James.Bottomley@hansenpartnership.com, linux-fsdevel@vger.kernel.org, mgorman@suse.de On Wed, Jan 22, 2014 at 06:46:11PM -0800, David Lang wrote: > It's extremely unlikely that drive manufacturers will produce drives > that won't work with any existing OS, so they are going to support > smaller writes in firmware. If they don't, they won't be able to > sell their drives to anyone running existing software. Given the > Enterprise software upgrade cycle compared to the expanding storage > needs, whatever they ship will have to work on OS and firmware > releases that happened several years ago. I've been talking to a number of HDD vendors, and while most of the discussion has been about SMR, the topic of 64k sectors did come up recently. In the opinion of at least one drive vendor, the pressure for 64k sectors will start increasing (roughly paraphrasing that vendor's engineer, "it's a matter of physics"), and it would not be surprising if, in 2 or 3 years, we start seeing drives with 64k sectors. Like with 4k sector drives, it's likely that at least the initial such drives will have an emulation mode where sub-64k writes will require a read-modify-write cycle. What I told that vendor was that if this were the case, he should seriously consider submitting a topic proposal to the LSF/MM, since if he wants those drives to be well supported, we need to start thinking about what changes might be necessary at the VM and FS layers now. So hopefully we'll see a topic proposal from that HDD vendor in the next couple of days. The bottom line is that I'm pretty well convinced that like SMR drives, 64k sector drives will be coming, and it's not something we can duck. It might not come as quickly as the HDD vendor community might like --- I remember attending an IDEMA conference in 2008 where they confidently predicted that 4k sector drives would be the default in 2 years, and it took a wee bit longer than that. But nevertheless, looking at the most likely roadmap and trajectory of hard drive technology, these are two things that will very likely be coming down the pike, and it would be best if we start thinking about how to engage with these changes constructively sooner rather than putting it off and then getting caught behind the eight-ball later. Cheers, - Ted -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 19:50 ` Andrew Morton 2014-01-22 20:13 ` Chris Mason @ 2014-01-23 8:35 ` Dave Chinner 2014-01-23 12:55 ` Theodore Ts'o 1 sibling, 1 reply; 59+ messages in thread From: Dave Chinner @ 2014-01-23 8:35 UTC (permalink / raw) To: Andrew Morton Cc: James Bottomley, linux-scsi@vger.kernel.org, linux-ide@vger.kernel.org, Chris Mason, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mgorman@suse.de, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Ric Wheeler On Wed, Jan 22, 2014 at 11:50:02AM -0800, Andrew Morton wrote: > On Wed, 22 Jan 2014 11:30:19 -0800 James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > > But this, I think, is the fundamental point for debate. If we can pull > > alignment and other tricks to solve 99% of the problem is there a need > > for radical VM surgery? Is there anything coming down the pipe in the > > future that may move the devices ahead of the tricks? > > I expect it would be relatively simple to get large blocksizes working > on powerpc with 64k PAGE_SIZE. So before diving in and doing huge > amounts of work, perhaps someone can do a proof-of-concept on powerpc > (or ia64) with 64k blocksize. Reality check: 64k block sizes on 64k page Linux machines has been used in production on XFS for at least 10 years. It's exactly the same case as 4k block size on 4k page size - one page, one buffer head, one filesystem block. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-23 8:35 ` Dave Chinner @ 2014-01-23 12:55 ` Theodore Ts'o 2014-01-23 19:49 ` Dave Chinner 2014-01-23 21:21 ` Joel Becker 0 siblings, 2 replies; 59+ messages in thread From: Theodore Ts'o @ 2014-01-23 12:55 UTC (permalink / raw) To: Dave Chinner Cc: Andrew Morton, linux-scsi@vger.kernel.org, linux-mm@kvack.org, Chris Mason, linux-kernel@vger.kernel.org, James Bottomley, linux-ide@vger.kernel.org, mgorman@suse.de, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Ric Wheeler On Thu, Jan 23, 2014 at 07:35:58PM +1100, Dave Chinner wrote: > > > > I expect it would be relatively simple to get large blocksizes working > > on powerpc with 64k PAGE_SIZE. So before diving in and doing huge > > amounts of work, perhaps someone can do a proof-of-concept on powerpc > > (or ia64) with 64k blocksize. > > Reality check: 64k block sizes on 64k page Linux machines has been > used in production on XFS for at least 10 years. It's exactly the > same case as 4k block size on 4k page size - one page, one buffer > head, one filesystem block. This is true for ext4 as well. Block size == page size support is pretty easy; the hard part is when block size > page size, due to assumptions in the VM layer that require the FS to do a lot of extra work to fudge around. So the real problem comes with trying to support 64k block sizes on a 4k page architecture, and whether we can do it in a way where every single file system doesn't have to do its own specific hacks to work around assumptions made in the VM layer. Some of the problems include handling the case where someone dirties a single page in a sparse block, and the FS needs to manually fault in the rest of the 64k block around that single page. Or the VM not understanding that page eviction needs to be done in chunks of 64k, so we don't have part of the block evicted but not all of it, etc. - Ted -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
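The first problem Ted describes can be pictured with a toy model; it is ordinary userspace C, nothing kernel-specific, and simply assumes 64k blocks on 4k pages: when only one page of a block is present and dirty, writing the block back means reading everything around it first.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE	4096
#define FS_BLOCK_SIZE	65536
#define PAGES_PER_BLOCK	(FS_BLOCK_SIZE / PAGE_SIZE)

/*
 * Toy model of one 64k filesystem block cached as 16 separate 4k
 * pages.  Only page 5 is present and dirty; writeback of the block
 * still has to read the other 15 pages first, i.e. the filesystem
 * ends up doing its own read-modify-write.
 */
struct cached_block {
	bool page_present[PAGES_PER_BLOCK];
	bool page_dirty[PAGES_PER_BLOCK];
};

static void write_back_block(struct cached_block *b)
{
	int i, reads = 0;

	for (i = 0; i < PAGES_PER_BLOCK; i++)
		if (!b->page_present[i]) {
			/* stand-in for a real read from the device */
			b->page_present[i] = true;
			reads++;
		}

	printf("wrote one 64k block, had to read back %d of %d pages\n",
	       reads, PAGES_PER_BLOCK);
}

int main(void)
{
	struct cached_block b = { 0 };

	b.page_present[5] = b.page_dirty[5] = true;	/* one dirty 4k page */
	write_back_block(&b);
	return 0;
}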
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-23 12:55 ` Theodore Ts'o @ 2014-01-23 19:49 ` Dave Chinner 2014-01-23 21:21 ` Joel Becker 1 sibling, 0 replies; 59+ messages in thread From: Dave Chinner @ 2014-01-23 19:49 UTC (permalink / raw) To: Theodore Ts'o, Andrew Morton, linux-scsi@vger.kernel.org, linux-mm@kvack.org, Chris Mason, linux-kernel@vger.kernel.org, James Bottomley, linux-ide@vger.kernel.org, mgorman@suse.de, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Ric Wheeler On Thu, Jan 23, 2014 at 07:55:50AM -0500, Theodore Ts'o wrote: > On Thu, Jan 23, 2014 at 07:35:58PM +1100, Dave Chinner wrote: > > > > > > I expect it would be relatively simple to get large blocksizes working > > > on powerpc with 64k PAGE_SIZE. So before diving in and doing huge > > > amounts of work, perhaps someone can do a proof-of-concept on powerpc > > > (or ia64) with 64k blocksize. > > > > Reality check: 64k block sizes on 64k page Linux machines has been > > used in production on XFS for at least 10 years. It's exactly the > > same case as 4k block size on 4k page size - one page, one buffer > > head, one filesystem block. > > This is true for ext4 as well. Block size == page size support is > pretty easy; the hard part is when block size > page size, due to > assumptions in the VM layer that requires that FS system needs to do a > lot of extra work to fudge around. So the real problem comes with > trying to support 64k block sizes on a 4k page architecture, and can > we do it in a way where every single file system doesn't have to do > their own specific hacks to work around assumptions made in the VM > layer. > > Some of the problems include handling the case where you get someone > dirties a single block in a sparse page, and the FS needs to manually > fault in the other 56k pages around that single page. Or the VM not > understanding that page eviction needs to be done in chunks of 64k so > we don't have part of the block evicted but not all of it, etc. Right, this is part of the problem that fsblock tried to handle, and some of the nastiness it had was that a page fault only resulted in the individual page being read from the underlying block. This means that it was entirely possible that the filesystem would need to do RMW cycles in the writeback path itself to handle things like block checksums, copy-on-write, unwritten extent conversion, etc. i.e. all the stuff that the page cache currently handles by doing RMW cycles at the page level. The method of using compound pages in the page cache so that the page cache could do 64k RMW cycles so that a filesystem never had to deal with new issues like the above was one of the reasons that approach is so appealing to us filesystem people. ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-23 12:55 ` Theodore Ts'o 2014-01-23 19:49 ` Dave Chinner @ 2014-01-23 21:21 ` Joel Becker 1 sibling, 0 replies; 59+ messages in thread From: Joel Becker @ 2014-01-23 21:21 UTC (permalink / raw) To: Theodore Ts'o, Dave Chinner, Andrew Morton, linux-scsi@vger.kernel.org, linux-mm@kvack.org, Chris Mason, linux-kernel@vger.kernel.org, James Bottomley, linux-ide@vger.kernel.org, mgorman@suse.de, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, Ric Wheeler On Thu, Jan 23, 2014 at 07:55:50AM -0500, Theodore Ts'o wrote: > On Thu, Jan 23, 2014 at 07:35:58PM +1100, Dave Chinner wrote: > > > > > > I expect it would be relatively simple to get large blocksizes working > > > on powerpc with 64k PAGE_SIZE. So before diving in and doing huge > > > amounts of work, perhaps someone can do a proof-of-concept on powerpc > > > (or ia64) with 64k blocksize. > > > > Reality check: 64k block sizes on 64k page Linux machines has been > > used in production on XFS for at least 10 years. It's exactly the > > same case as 4k block size on 4k page size - one page, one buffer > > head, one filesystem block. > > This is true for ext4 as well. Block size == page size support is > pretty easy; the hard part is when block size > page size, due to > assumptions in the VM layer that requires that FS system needs to do a > lot of extra work to fudge around. So the real problem comes with > trying to support 64k block sizes on a 4k page architecture, and can > we do it in a way where every single file system doesn't have to do > their own specific hacks to work around assumptions made in the VM > layer. Yup, ditto for ocfs2. Joel -- "One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important." - Bertrand Russell http://www.jlbec.org/ jlbec@evilplan.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 19:30 ` James Bottomley 2014-01-22 19:50 ` Andrew Morton @ 2014-01-22 20:57 ` Martin K. Petersen 1 sibling, 0 replies; 59+ messages in thread From: Martin K. Petersen @ 2014-01-22 20:57 UTC (permalink / raw) To: James Bottomley Cc: Ric Wheeler, linux-scsi@vger.kernel.org, linux-ide@vger.kernel.org, Chris Mason, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mgorman@suse.de, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org >>>>> "James" == James Bottomley <James.Bottomley@HansenPartnership.com> writes: >> or even (not today, but some day) reject the IO. James> I really doubt this. All 4k drives today do RMW ... I don't see James> that changing any time soon. All consumer grade 4K phys drives do RMW. It's a different story for enterprise drives. The vendors appear to be divided between 4Kn and 512e with RMW mitigation. -- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
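The 4Kn-versus-512e split Martin mentions is already visible from userspace through the queue limits the kernel exports under /sys/block/<dev>/queue/; a small sketch follows, where the device name is only an example:

#include <stdio.h>

/* Read one integer attribute from /sys/block/<dev>/queue/. */
static long read_queue_attr(const char *dev, const char *attr)
{
	char path[256];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	const char *dev = "sda";	/* example device name */
	long lbs = read_queue_attr(dev, "logical_block_size");
	long pbs = read_queue_attr(dev, "physical_block_size");

	if (lbs > 0 && pbs > lbs)
		printf("%s: %ldB logical / %ldB physical: the drive does RMW "
		       "for sub-%ldB writes (512e-style emulation)\n",
		       dev, lbs, pbs, pbs);
	else
		printf("%s: logical %ld, physical %ld\n", dev, lbs, pbs);
	return 0;
}

A 64k-sector drive with an emulation mode would presumably show up the same way, just with a larger physical_block_size.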
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 18:13 ` James Bottomley 2014-01-22 18:17 ` Ric Wheeler @ 2014-01-22 18:37 ` Chris Mason 2014-01-22 18:40 ` Ric Wheeler 2014-01-22 18:47 ` James Bottomley 2014-01-23 8:27 ` Dave Chinner 2 siblings, 2 replies; 59+ messages in thread From: Chris Mason @ 2014-01-22 18:37 UTC (permalink / raw) To: James.Bottomley@HansenPartnership.com Cc: linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-scsi@vger.kernel.org, akpm@linux-foundation.org, rwheeler@redhat.com, linux-fsdevel@vger.kernel.org, mgorman@suse.de On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote: > On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote: > > > We're likely to have people mixing 4K drives and <fill in some other > > size here> on the same box. We could just go with the biggest size and > > use the existing bh code for the sub-pagesized blocks, but I really > > hesitate to change VM fundamentals for this. > > If the page cache had a variable granularity per device, that would cope > with this. It's the variable granularity that's the VM problem. Agreed. But once we go variable granularity we're basically talking the large order allocation problem. > > > From a pure code point of view, it may be less work to change it once in > > the VM. But from an overall system impact point of view, it's a big > > change in how the system behaves just for filesystem metadata. > > Agreed, but only if we don't do RMW in the buffer cache ... which may be > a good reason to keep it. > > > > The other question is if the drive does RMW between 4k and whatever its > > > physical sector size, do we need to do anything to take advantage of > > > it ... as in what would altering the granularity of the page cache buy > > > us? > > > > The real benefit is when and how the reads get scheduled. We're able to > > do a much better job pipelining the reads, controlling our caches and > > reducing write latency by having the reads done up in the OS instead of > > the drive. > > I agree with all of that, but my question is still can we do this by > propagating alignment and chunk size information (i.e. the physical > sector size) like we do today. If the FS knows the optimal I/O patterns > and tries to follow them, the odd cockup won't impact performance > dramatically. The real question is can the FS make use of this layout > information *without* changing the page cache granularity? Only if you > answer me "no" to this do I think we need to worry about changing page > cache granularity. Can it mostly work? I think the answer is yes. If not we'd have a lot of miserable people on top of raid5/6 right now. We can always make a generic r/m/w engine in DM that supports larger sectors transparently. > > Realistically, if you look at what the I/O schedulers output on a > standard (spinning rust) workload, it's mostly large transfers. > Obviously these are misalgned at the ends, but we can fix some of that > in the scheduler. Particularly if the FS helps us with layout. My > instinct tells me that we can fix 99% of this with layout on the FS + io > schedulers ... the remaining 1% goes to the drive as needing to do RMW > in the device, but the net impact to our throughput shouldn't be that > great. There are a few workloads where the VM and the FS would team up to make this fairly miserable Small files. 
Delayed allocation fixes a lot of this, but the VM doesn't realize that fileA, fileB, fileC, and fileD all need to be written at the same time to avoid RMW. Btrfs and MD have set up plugging callbacks to accumulate full stripes as much as possible, but it still hurts. Metadata. These writes are very latency-sensitive and we'll gain a lot if the FS is explicitly trying to build full sector IOs. I do agree that it's very likely these drives are going to silently RMW in the background for us. Circling back to what we might talk about at the conference, Ric, do you have any ideas on when these drives might hit the wild? -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
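The small-file point reduces to a batching problem: accumulate enough dirty data to cover a full stripe before anything is submitted. The toy sketch below only illustrates that idea; it is not how the btrfs or MD plugging callbacks are actually implemented, and the stripe size is just an example.

#include <stdint.h>
#include <stdio.h>

#define STRIPE_SIZE	65536ULL	/* example full-stripe size */

/* Toy write batcher: only submit data in full-stripe units. */
struct batch {
	uint64_t pending;	/* bytes accumulated but not yet submitted */
	uint64_t submitted;	/* bytes pushed out as full stripes */
};

static void queue_small_file(struct batch *b, uint64_t bytes)
{
	b->pending += bytes;
	while (b->pending >= STRIPE_SIZE) {
		b->pending -= STRIPE_SIZE;
		b->submitted += STRIPE_SIZE;	/* one aligned, RMW-free write */
	}
}

int main(void)
{
	struct batch b = { 0 };
	uint64_t files[] = { 8192, 12288, 20480, 32768 };	/* fileA..fileD */
	int i;

	for (i = 0; i < 4; i++)
		queue_small_file(&b, files[i]);

	printf("submitted %llu bytes as full stripes, %llu bytes left over\n",
	       (unsigned long long)b.submitted, (unsigned long long)b.pending);
	return 0;
}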
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 18:37 ` Chris Mason @ 2014-01-22 18:40 ` Ric Wheeler 2014-01-22 18:47 ` James Bottomley 1 sibling, 0 replies; 59+ messages in thread From: Ric Wheeler @ 2014-01-22 18:40 UTC (permalink / raw) To: Chris Mason, James.Bottomley@HansenPartnership.com Cc: linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-scsi@vger.kernel.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, mgorman@suse.de On 01/22/2014 01:37 PM, Chris Mason wrote: > Circling back to what we might talk about at the conference, Ric do you > have any ideas on when these drives might hit the wild? > > -chris I will poke at vendors to see if we can get someone to make a public statement, but I cannot do that for them. Ric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 18:37 ` Chris Mason 2014-01-22 18:40 ` Ric Wheeler @ 2014-01-22 18:47 ` James Bottomley 2014-01-23 21:27 ` Joel Becker 1 sibling, 1 reply; 59+ messages in thread From: James Bottomley @ 2014-01-22 18:47 UTC (permalink / raw) To: Chris Mason Cc: linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-scsi@vger.kernel.org, akpm@linux-foundation.org, rwheeler@redhat.com, linux-fsdevel@vger.kernel.org, mgorman@suse.de On Wed, 2014-01-22 at 18:37 +0000, Chris Mason wrote: > On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote: > > On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote: [agreement cut because it's boring for the reader] > > Realistically, if you look at what the I/O schedulers output on a > > standard (spinning rust) workload, it's mostly large transfers. > > Obviously these are misalgned at the ends, but we can fix some of that > > in the scheduler. Particularly if the FS helps us with layout. My > > instinct tells me that we can fix 99% of this with layout on the FS + io > > schedulers ... the remaining 1% goes to the drive as needing to do RMW > > in the device, but the net impact to our throughput shouldn't be that > > great. > > There are a few workloads where the VM and the FS would team up to make > this fairly miserable > > Small files. Delayed allocation fixes a lot of this, but the VM doesn't > realize that fileA, fileB, fileC, and fileD all need to be written at > the same time to avoid RMW. Btrfs and MD have setup plugging callbacks > to accumulate full stripes as much as possible, but it still hurts. > > Metadata. These writes are very latency sensitive and we'll gain a lot > if the FS is explicitly trying to build full sector IOs. OK, so these two cases I buy ... the question is can we do something about them today without increasing the block size? The metadata problem, in particular, might be block independent: we still have a lot of small chunks to write out at fractured locations. With a large block size, the FS knows it's been bad and can expect the rolled up newspaper, but it's not clear what it could do about it. The small files issue looks like something we should be tackling today since writing out adjacent files would actually help us get bigger transfers. James -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 18:47 ` James Bottomley @ 2014-01-23 21:27 ` Joel Becker 2014-01-23 21:34 ` Chris Mason 0 siblings, 1 reply; 59+ messages in thread From: Joel Becker @ 2014-01-23 21:27 UTC (permalink / raw) To: James Bottomley Cc: Chris Mason, linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-scsi@vger.kernel.org, akpm@linux-foundation.org, rwheeler@redhat.com, linux-fsdevel@vger.kernel.org, mgorman@suse.de On Wed, Jan 22, 2014 at 10:47:01AM -0800, James Bottomley wrote: > On Wed, 2014-01-22 at 18:37 +0000, Chris Mason wrote: > > On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote: > > > On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote: > [agreement cut because it's boring for the reader] > > > Realistically, if you look at what the I/O schedulers output on a > > > standard (spinning rust) workload, it's mostly large transfers. > > > Obviously these are misalgned at the ends, but we can fix some of that > > > in the scheduler. Particularly if the FS helps us with layout. My > > > instinct tells me that we can fix 99% of this with layout on the FS + io > > > schedulers ... the remaining 1% goes to the drive as needing to do RMW > > > in the device, but the net impact to our throughput shouldn't be that > > > great. > > > > There are a few workloads where the VM and the FS would team up to make > > this fairly miserable > > > > Small files. Delayed allocation fixes a lot of this, but the VM doesn't > > realize that fileA, fileB, fileC, and fileD all need to be written at > > the same time to avoid RMW. Btrfs and MD have setup plugging callbacks > > to accumulate full stripes as much as possible, but it still hurts. > > > > Metadata. These writes are very latency sensitive and we'll gain a lot > > if the FS is explicitly trying to build full sector IOs. > > OK, so these two cases I buy ... the question is can we do something > about them today without increasing the block size? > > The metadata problem, in particular, might be block independent: we > still have a lot of small chunks to write out at fractured locations. > With a large block size, the FS knows it's been bad and can expect the > rolled up newspaper, but it's not clear what it could do about it. > > The small files issue looks like something we should be tackling today > since writing out adjacent files would actually help us get bigger > transfers. ocfs2 can actually take significant advantage here, because we store small file data in-inode. This would grow our in-inode size from ~3K to ~15K or ~63K. We'd actually have to do more work to start putting more than one inode in a block (thought that would be a promising avenue too once the coordination is solved generically. Joel -- "One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important." - Bertrand Russell http://www.jlbec.org/ jlbec@evilplan.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-23 21:27 ` Joel Becker @ 2014-01-23 21:34 ` Chris Mason 0 siblings, 0 replies; 59+ messages in thread From: Chris Mason @ 2014-01-23 21:34 UTC (permalink / raw) To: jlbec@evilplan.org Cc: linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-scsi@vger.kernel.org, rwheeler@redhat.com, akpm@linux-foundation.org, James.Bottomley@HansenPartnership.com, linux-fsdevel@vger.kernel.org, mgorman@suse.de On Thu, 2014-01-23 at 13:27 -0800, Joel Becker wrote: > On Wed, Jan 22, 2014 at 10:47:01AM -0800, James Bottomley wrote: > > On Wed, 2014-01-22 at 18:37 +0000, Chris Mason wrote: > > > On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote: > > > > On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote: > > [agreement cut because it's boring for the reader] > > > > Realistically, if you look at what the I/O schedulers output on a > > > > standard (spinning rust) workload, it's mostly large transfers. > > > > Obviously these are misalgned at the ends, but we can fix some of that > > > > in the scheduler. Particularly if the FS helps us with layout. My > > > > instinct tells me that we can fix 99% of this with layout on the FS + io > > > > schedulers ... the remaining 1% goes to the drive as needing to do RMW > > > > in the device, but the net impact to our throughput shouldn't be that > > > > great. > > > > > > There are a few workloads where the VM and the FS would team up to make > > > this fairly miserable > > > > > > Small files. Delayed allocation fixes a lot of this, but the VM doesn't > > > realize that fileA, fileB, fileC, and fileD all need to be written at > > > the same time to avoid RMW. Btrfs and MD have setup plugging callbacks > > > to accumulate full stripes as much as possible, but it still hurts. > > > > > > Metadata. These writes are very latency sensitive and we'll gain a lot > > > if the FS is explicitly trying to build full sector IOs. > > > > OK, so these two cases I buy ... the question is can we do something > > about them today without increasing the block size? > > > > The metadata problem, in particular, might be block independent: we > > still have a lot of small chunks to write out at fractured locations. > > With a large block size, the FS knows it's been bad and can expect the > > rolled up newspaper, but it's not clear what it could do about it. > > > > The small files issue looks like something we should be tackling today > > since writing out adjacent files would actually help us get bigger > > transfers. > > ocfs2 can actually take significant advantage here, because we store > small file data in-inode. This would grow our in-inode size from ~3K to > ~15K or ~63K. We'd actually have to do more work to start putting more > than one inode in a block (thought that would be a promising avenue too > once the coordination is solved generically. Btrfs already defaults to 16K metadata and can go as high as 64k. The part we don't do is multi-page sectors for data blocks. I'd tend to leverage the read/modify/write engine from the raid code for that. -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 18:13 ` James Bottomley 2014-01-22 18:17 ` Ric Wheeler 2014-01-22 18:37 ` Chris Mason @ 2014-01-23 8:27 ` Dave Chinner 2014-01-23 15:47 ` James Bottomley 2 siblings, 1 reply; 59+ messages in thread From: Dave Chinner @ 2014-01-23 8:27 UTC (permalink / raw) To: James Bottomley Cc: Chris Mason, linux-scsi@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, mgorman@suse.de, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org, rwheeler@redhat.com On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote: > On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote: > > On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote: > > > On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote: > > > > [ I like big sectors and I cannot lie ] > > I think I might be sceptical, but I don't think that's showing in my > concerns ... > > > > > I really think that if we want to make progress on this one, we need > > > > code and someone that owns it. Nick's work was impressive, but it was > > > > mostly there for getting rid of buffer heads. If we have a device that > > > > needs it and someone working to enable that device, we'll go forward > > > > much faster. > > > > > > Do we even need to do that (eliminate buffer heads)? We cope with 4k > > > sector only devices just fine today because the bh mechanisms now > > > operate on top of the page cache and can do the RMW necessary to update > > > a bh in the page cache itself which allows us to do only 4k chunked > > > writes, so we could keep the bh system and just alter the granularity of > > > the page cache. > > > > > > > We're likely to have people mixing 4K drives and <fill in some other > > size here> on the same box. We could just go with the biggest size and > > use the existing bh code for the sub-pagesized blocks, but I really > > hesitate to change VM fundamentals for this. > > If the page cache had a variable granularity per device, that would cope > with this. It's the variable granularity that's the VM problem. > > > From a pure code point of view, it may be less work to change it once in > > the VM. But from an overall system impact point of view, it's a big > > change in how the system behaves just for filesystem metadata. > > Agreed, but only if we don't do RMW in the buffer cache ... which may be > a good reason to keep it. > > > > The other question is if the drive does RMW between 4k and whatever its > > > physical sector size, do we need to do anything to take advantage of > > > it ... as in what would altering the granularity of the page cache buy > > > us? > > > > The real benefit is when and how the reads get scheduled. We're able to > > do a much better job pipelining the reads, controlling our caches and > > reducing write latency by having the reads done up in the OS instead of > > the drive. > > I agree with all of that, but my question is still can we do this by > propagating alignment and chunk size information (i.e. the physical > sector size) like we do today. If the FS knows the optimal I/O patterns > and tries to follow them, the odd cockup won't impact performance > dramatically. The real question is can the FS make use of this layout > information *without* changing the page cache granularity? Only if you > answer me "no" to this do I think we need to worry about changing page > cache granularity. We already do this today. 
The problem is that we are limited by the page cache assumption that the block device/filesystem never need to manage multiple pages as an atomic unit of change. Hence we can't use the generic infrastructure as it stands to handle block/sector sizes larger than a page size... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-23 8:27 ` Dave Chinner @ 2014-01-23 15:47 ` James Bottomley 2014-01-23 16:44 ` Mel Gorman 2014-01-23 20:54 ` Christoph Lameter 0 siblings, 2 replies; 59+ messages in thread From: James Bottomley @ 2014-01-23 15:47 UTC (permalink / raw) To: Dave Chinner Cc: Chris Mason, linux-scsi@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, mgorman@suse.de, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org, rwheeler@redhat.com On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote: > On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote: > > On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote: > > > On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote: > > > > On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote: > > > > > > [ I like big sectors and I cannot lie ] > > > > I think I might be sceptical, but I don't think that's showing in my > > concerns ... > > > > > > > I really think that if we want to make progress on this one, we need > > > > > code and someone that owns it. Nick's work was impressive, but it was > > > > > mostly there for getting rid of buffer heads. If we have a device that > > > > > needs it and someone working to enable that device, we'll go forward > > > > > much faster. > > > > > > > > Do we even need to do that (eliminate buffer heads)? We cope with 4k > > > > sector only devices just fine today because the bh mechanisms now > > > > operate on top of the page cache and can do the RMW necessary to update > > > > a bh in the page cache itself which allows us to do only 4k chunked > > > > writes, so we could keep the bh system and just alter the granularity of > > > > the page cache. > > > > > > > > > > We're likely to have people mixing 4K drives and <fill in some other > > > size here> on the same box. We could just go with the biggest size and > > > use the existing bh code for the sub-pagesized blocks, but I really > > > hesitate to change VM fundamentals for this. > > > > If the page cache had a variable granularity per device, that would cope > > with this. It's the variable granularity that's the VM problem. > > > > > From a pure code point of view, it may be less work to change it once in > > > the VM. But from an overall system impact point of view, it's a big > > > change in how the system behaves just for filesystem metadata. > > > > Agreed, but only if we don't do RMW in the buffer cache ... which may be > > a good reason to keep it. > > > > > > The other question is if the drive does RMW between 4k and whatever its > > > > physical sector size, do we need to do anything to take advantage of > > > > it ... as in what would altering the granularity of the page cache buy > > > > us? > > > > > > The real benefit is when and how the reads get scheduled. We're able to > > > do a much better job pipelining the reads, controlling our caches and > > > reducing write latency by having the reads done up in the OS instead of > > > the drive. > > > > I agree with all of that, but my question is still can we do this by > > propagating alignment and chunk size information (i.e. the physical > > sector size) like we do today. If the FS knows the optimal I/O patterns > > and tries to follow them, the odd cockup won't impact performance > > dramatically. The real question is can the FS make use of this layout > > information *without* changing the page cache granularity? 
Only if you > > answer me "no" to this do I think we need to worry about changing page > > cache granularity. > > We already do this today. > > The problem is that we are limited by the page cache assumption that > the block device/filesystem never need to manage multiple pages as > an atomic unit of change. Hence we can't use the generic > infrastructure as it stands to handle block/sector sizes larger than > a page size... If the compound page infrastructure exists today and is usable for this, what else do we need to do? ... because if it's a couple of trivial changes and a few minor patches to filesystems to take advantage of it, we might as well do it anyway. I was only objecting on the grounds that the last time we looked at it, it was major VM surgery. Can someone give a summary of how far we are away from being able to do this with the VM system today and what extra work is needed (and how big is this piece of work)? James -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-23 15:47 ` James Bottomley @ 2014-01-23 16:44 ` Mel Gorman 2014-01-23 19:55 ` James Bottomley 2014-01-23 20:34 ` Dave Chinner 2014-01-23 20:54 ` Christoph Lameter 1 sibling, 2 replies; 59+ messages in thread From: Mel Gorman @ 2014-01-23 16:44 UTC (permalink / raw) To: James Bottomley Cc: Dave Chinner, Chris Mason, linux-scsi@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org, rwheeler@redhat.com On Thu, Jan 23, 2014 at 07:47:53AM -0800, James Bottomley wrote: > On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote: > > On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote: > > > On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote: > > > > On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote: > > > > > On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote: > > > > > > > > [ I like big sectors and I cannot lie ] > > > > > > I think I might be sceptical, but I don't think that's showing in my > > > concerns ... > > > > > > > > > I really think that if we want to make progress on this one, we need > > > > > > code and someone that owns it. Nick's work was impressive, but it was > > > > > > mostly there for getting rid of buffer heads. If we have a device that > > > > > > needs it and someone working to enable that device, we'll go forward > > > > > > much faster. > > > > > > > > > > Do we even need to do that (eliminate buffer heads)? We cope with 4k > > > > > sector only devices just fine today because the bh mechanisms now > > > > > operate on top of the page cache and can do the RMW necessary to update > > > > > a bh in the page cache itself which allows us to do only 4k chunked > > > > > writes, so we could keep the bh system and just alter the granularity of > > > > > the page cache. > > > > > > > > > > > > > We're likely to have people mixing 4K drives and <fill in some other > > > > size here> on the same box. We could just go with the biggest size and > > > > use the existing bh code for the sub-pagesized blocks, but I really > > > > hesitate to change VM fundamentals for this. > > > > > > If the page cache had a variable granularity per device, that would cope > > > with this. It's the variable granularity that's the VM problem. > > > > > > > From a pure code point of view, it may be less work to change it once in > > > > the VM. But from an overall system impact point of view, it's a big > > > > change in how the system behaves just for filesystem metadata. > > > > > > Agreed, but only if we don't do RMW in the buffer cache ... which may be > > > a good reason to keep it. > > > > > > > > The other question is if the drive does RMW between 4k and whatever its > > > > > physical sector size, do we need to do anything to take advantage of > > > > > it ... as in what would altering the granularity of the page cache buy > > > > > us? > > > > > > > > The real benefit is when and how the reads get scheduled. We're able to > > > > do a much better job pipelining the reads, controlling our caches and > > > > reducing write latency by having the reads done up in the OS instead of > > > > the drive. > > > > > > I agree with all of that, but my question is still can we do this by > > > propagating alignment and chunk size information (i.e. the physical > > > sector size) like we do today. 
If the FS knows the optimal I/O patterns > > > and tries to follow them, the odd cockup won't impact performance > > > dramatically. The real question is can the FS make use of this layout > > > information *without* changing the page cache granularity? Only if you > > > answer me "no" to this do I think we need to worry about changing page > > > cache granularity. > > > > We already do this today. > > > > The problem is that we are limited by the page cache assumption that > > the block device/filesystem never need to manage multiple pages as > > an atomic unit of change. Hence we can't use the generic > > infrastructure as it stands to handle block/sector sizes larger than > > a page size... > > If the compound page infrastructure exists today and is usable for this, > what else do we need to do? ... because if it's a couple of trivial > changes and a few minor patches to filesystems to take advantage of it, > we might as well do it anyway. Do not do this as there is no guarantee that a compound allocation will succeed. If the allocation fails then it is potentially unrecoverable because we can no longer write to storage then you're hosed. If you are now thinking mempool then the problem becomes that the system will be in a state of degraded performance for an unknowable length of time and may never recover fully. 64K MMU page size systems get away with this because the blocksize is still <= PAGE_SIZE and no core VM changes are necessary. Critically, pages like the page table pages are the same size as the basic unit of allocation used by the kernel so external fragmentation simply is not a severe problem. > I was only objecting on the grounds that > the last time we looked at it, it was major VM surgery. Can someone > give a summary of how far we are away from being able to do this with > the VM system today and what extra work is needed (and how big is this > piece of work)? > Offhand no idea. For fsblock, probably a similar amount of work than had to be done in 2007 and I'd expect it would still require filesystem awareness problems that Dave Chinner pointer out earlier. For large block, it'd hit into the same wall that allocations must always succeed. If we want to break the connection between the basic unit of memory managed by the kernel and the MMU page size then I don't know but it would be a fairly large amount of surgery and need a lot of design work. Minimally, anything dealing with an MMU-sized amount of memory would now need to deal with sub-pages and there would need to be some restrictions on how sub-pages were used to mitigate the risk of external fragmentation -- do not mix page table page allocations with pages mapped into the address space, do not allow sub pages to be used by different processes etc. At the very least there would be a performance impact because PAGE_SIZE is no longer a compile-time constant. However, it would potentially allow the block size to be at least the same size as this new basic allocation unit. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
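For concreteness, the mempool approach Mel is cautioning against would look roughly like the sketch below, assuming order-2 (16k) filesystem blocks. mempool_create_page_pool(), mempool_alloc() and mempool_free() are the stock mempool API; the pool size, GFP flags and fs_block_* names are illustrative only.

#include <linux/mempool.h>
#include <linux/gfp.h>
#include <linux/errno.h>

#define FS_BLOCK_ORDER	2		/* 16k units on a 4k PAGE_SIZE system */

static mempool_t *fs_block_pool;

static int fs_block_pool_init(void)
{
	/* Emergency reserve of sixteen order-2 pages. */
	fs_block_pool = mempool_create_page_pool(16, FS_BLOCK_ORDER);
	return fs_block_pool ? 0 : -ENOMEM;
}

static struct page *fs_block_alloc(void)
{
	/*
	 * Falls back to the reserve when the buddy allocator cannot
	 * satisfy the order-2 request.  Once the reserve is exhausted,
	 * callers wait on reclaim/compaction -- the "degraded for an
	 * unknowable length of time" case described above.
	 */
	return mempool_alloc(fs_block_pool, GFP_NOFS);
}

static void fs_block_free(struct page *page)
{
	mempool_free(page, fs_block_pool);
}

A reserve like this guarantees forward progress for a bounded number of in-flight blocks, but it does nothing for the steady-state case where every page cache fill needs a high-order page.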
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-23 16:44 ` Mel Gorman @ 2014-01-23 19:55 ` James Bottomley 2014-01-24 10:57 ` Mel Gorman 2014-01-23 20:34 ` Dave Chinner 1 sibling, 1 reply; 59+ messages in thread From: James Bottomley @ 2014-01-23 19:55 UTC (permalink / raw) To: Mel Gorman Cc: linux-scsi@vger.kernel.org, Chris Mason, Dave Chinner, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-ide@vger.kernel.org, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org, rwheeler@redhat.com On Thu, 2014-01-23 at 16:44 +0000, Mel Gorman wrote: > On Thu, Jan 23, 2014 at 07:47:53AM -0800, James Bottomley wrote: > > On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote: > > > On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote: > > > > On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote: > > > > > On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote: > > > > > > On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote: > > > > > > > > > > [ I like big sectors and I cannot lie ] > > > > > > > > I think I might be sceptical, but I don't think that's showing in my > > > > concerns ... > > > > > > > > > > > I really think that if we want to make progress on this one, we need > > > > > > > code and someone that owns it. Nick's work was impressive, but it was > > > > > > > mostly there for getting rid of buffer heads. If we have a device that > > > > > > > needs it and someone working to enable that device, we'll go forward > > > > > > > much faster. > > > > > > > > > > > > Do we even need to do that (eliminate buffer heads)? We cope with 4k > > > > > > sector only devices just fine today because the bh mechanisms now > > > > > > operate on top of the page cache and can do the RMW necessary to update > > > > > > a bh in the page cache itself which allows us to do only 4k chunked > > > > > > writes, so we could keep the bh system and just alter the granularity of > > > > > > the page cache. > > > > > > > > > > > > > > > > We're likely to have people mixing 4K drives and <fill in some other > > > > > size here> on the same box. We could just go with the biggest size and > > > > > use the existing bh code for the sub-pagesized blocks, but I really > > > > > hesitate to change VM fundamentals for this. > > > > > > > > If the page cache had a variable granularity per device, that would cope > > > > with this. It's the variable granularity that's the VM problem. > > > > > > > > > From a pure code point of view, it may be less work to change it once in > > > > > the VM. But from an overall system impact point of view, it's a big > > > > > change in how the system behaves just for filesystem metadata. > > > > > > > > Agreed, but only if we don't do RMW in the buffer cache ... which may be > > > > a good reason to keep it. > > > > > > > > > > The other question is if the drive does RMW between 4k and whatever its > > > > > > physical sector size, do we need to do anything to take advantage of > > > > > > it ... as in what would altering the granularity of the page cache buy > > > > > > us? > > > > > > > > > > The real benefit is when and how the reads get scheduled. We're able to > > > > > do a much better job pipelining the reads, controlling our caches and > > > > > reducing write latency by having the reads done up in the OS instead of > > > > > the drive. > > > > > > > > I agree with all of that, but my question is still can we do this by > > > > propagating alignment and chunk size information (i.e. 
the physical > > > > sector size) like we do today. If the FS knows the optimal I/O patterns > > > > and tries to follow them, the odd cockup won't impact performance > > > > dramatically. The real question is can the FS make use of this layout > > > > information *without* changing the page cache granularity? Only if you > > > > answer me "no" to this do I think we need to worry about changing page > > > > cache granularity. > > > > > > We already do this today. > > > > > > The problem is that we are limited by the page cache assumption that > > > the block device/filesystem never need to manage multiple pages as > > > an atomic unit of change. Hence we can't use the generic > > > infrastructure as it stands to handle block/sector sizes larger than > > > a page size... > > > > If the compound page infrastructure exists today and is usable for this, > > what else do we need to do? ... because if it's a couple of trivial > > changes and a few minor patches to filesystems to take advantage of it, > > we might as well do it anyway. > > Do not do this as there is no guarantee that a compound allocation will > succeed. I presume this is because in the current implementation compound pages have to be physically contiguous. For increasing granularity in the page cache, we don't necessarily need this ... however, getting write out to work properly without physically contiguous pages would be a bit more challenging (but not impossible) to solve. > If the allocation fails then it is potentially unrecoverable > because we can no longer write to storage then you're hosed. If you are > now thinking mempool then the problem becomes that the system will be > in a state of degraded performance for an unknowable length of time and > may never recover fully. 64K MMU page size systems get away with this > because the blocksize is still <= PAGE_SIZE and no core VM changes are > necessary. Critically, pages like the page table pages are the same size as > the basic unit of allocation used by the kernel so external fragmentation > simply is not a severe problem. Right, I understand this ... but we still need to wonder about what it would take. Even the simple fail a compound page allocation gets treated in the kernel the same way as failing a single page allocation in the page cache. > > I was only objecting on the grounds that > > the last time we looked at it, it was major VM surgery. Can someone > > give a summary of how far we are away from being able to do this with > > the VM system today and what extra work is needed (and how big is this > > piece of work)? > > > > Offhand no idea. For fsblock, probably a similar amount of work than > had to be done in 2007 and I'd expect it would still require filesystem > awareness problems that Dave Chinner pointer out earlier. For large block, > it'd hit into the same wall that allocations must always succeed. I don't understand this. Why must they succeed? 4k page allocations don't have to succeed today in the page cache, so why would compound page allocations have to succeed? > If we > want to break the connection between the basic unit of memory managed > by the kernel and the MMU page size then I don't know but it would be a > fairly large amount of surgery and need a lot of design work. 
Minimally, > anything dealing with an MMU-sized amount of memory would now need to > deal with sub-pages and there would need to be some restrictions on how > sub-pages were used to mitigate the risk of external fragmentation -- do not > mix page table page allocations with pages mapped into the address space, > do not allow sub pages to be used by different processes etc. At the very > least there would be a performance impact because PAGE_SIZE is no longer a > compile-time constant. However, it would potentially allow the block size > to be at least the same size as this new basic allocation unit. Hm, OK, so less appealing then. James -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-23 19:55 ` James Bottomley @ 2014-01-24 10:57 ` Mel Gorman 2014-01-30 4:52 ` Matthew Wilcox 0 siblings, 1 reply; 59+ messages in thread From: Mel Gorman @ 2014-01-24 10:57 UTC (permalink / raw) To: James Bottomley Cc: linux-scsi@vger.kernel.org, Chris Mason, Dave Chinner, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-ide@vger.kernel.org, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org, rwheeler@redhat.com On Thu, Jan 23, 2014 at 11:55:35AM -0800, James Bottomley wrote: > > > > > > <SNIP> > > > > > > The real benefit is when and how the reads get scheduled. We're able to > > > > > > do a much better job pipelining the reads, controlling our caches and > > > > > > reducing write latency by having the reads done up in the OS instead of > > > > > > the drive. > > > > > > > > > > I agree with all of that, but my question is still can we do this by > > > > > propagating alignment and chunk size information (i.e. the physical > > > > > sector size) like we do today. If the FS knows the optimal I/O patterns > > > > > and tries to follow them, the odd cockup won't impact performance > > > > > dramatically. The real question is can the FS make use of this layout > > > > > information *without* changing the page cache granularity? Only if you > > > > > answer me "no" to this do I think we need to worry about changing page > > > > > cache granularity. > > > > > > > > We already do this today. > > > > > > > > The problem is that we are limited by the page cache assumption that > > > > the block device/filesystem never need to manage multiple pages as > > > > an atomic unit of change. Hence we can't use the generic > > > > infrastructure as it stands to handle block/sector sizes larger than > > > > a page size... > > > > > > If the compound page infrastructure exists today and is usable for this, > > > what else do we need to do? ... because if it's a couple of trivial > > > changes and a few minor patches to filesystems to take advantage of it, > > > we might as well do it anyway. > > > > Do not do this as there is no guarantee that a compound allocation will > > succeed. > > I presume this is because in the current implementation compound pages > have to be physically contiguous. Well.... yes. In VM terms, a compound page is a high-order physically contiguous page that has additional metadata and a destructor. A potentially discontiguous buffer would need a different structure and always be accessed with base-page-sized iterators. > For increasing granularity in the > page cache, we don't necessarily need this ... however, getting write > out to work properly without physically contiguous pages would be a bit > more challenging (but not impossible) to solve. > Every filesystem would have to be aware of this potentially discontiguous buffer. I do not know what the mechanics of fsblock were but I bet it had to handle some sort of multiple page read/write when block size was bigger than PAGE_SIZE. > > If the allocation fails then it is potentially unrecoverable > > because we can no longer write to storage then you're hosed. If you are > > now thinking mempool then the problem becomes that the system will be > > in a state of degraded performance for an unknowable length of time and > > may never recover fully. 64K MMU page size systems get away with this > > because the blocksize is still <= PAGE_SIZE and no core VM changes are > > necessary. 
Critically, pages like the page table pages are the same size as > > the basic unit of allocation used by the kernel so external fragmentation > > simply is not a severe problem. > > Right, I understand this ... but we still need to wonder about what it > would take. So far on the table is 1. major filesystem overhawl 2. major vm overhawl 3. use compound pages as they are today and hope it does not go completely to hell, reboot when it does > Even the simple fail a compound page allocation gets > treated in the kernel the same way as failing a single page allocation > in the page cache. > The percentages of failures are the problem here. If an order-0 allocation fails then any number of actions the kernel takes will result in a free page that can be used to satisfy the allocation. At worst, OOM killing a process is guaranteed to free up order-0 pages but the same is not true for compaction. Anti-fragmentation and compaction make this very difficult and they go a long way here but it is not a 100% guarantee a compound allocation will succeed in the future or be a cheap allocation. > > > I was only objecting on the grounds that > > > the last time we looked at it, it was major VM surgery. Can someone > > > give a summary of how far we are away from being able to do this with > > > the VM system today and what extra work is needed (and how big is this > > > piece of work)? > > > > > > > Offhand no idea. For fsblock, probably a similar amount of work than > > had to be done in 2007 and I'd expect it would still require filesystem > > awareness problems that Dave Chinner pointer out earlier. For large block, > > it'd hit into the same wall that allocations must always succeed. > > I don't understand this. Why must they succeed? 4k page allocations > don't have to succeed today in the page cache, so why would compound > page allocations have to succeed? > 4K page allocations can temporarily fail but almost any reclaim action with the exception of slab reclaim will result in 4K allocation requests succeeding again. The same is not true of compound pages. An adverse workload could potentially use page table pages (unreclaimable other than OOM kill) to prevent compound allocations ever succeeding. That's why I suggested that it may be necessary to change the basic unit of allocation the kernel uses to be larger than the MMU page size and restrict how the sub pages are used. The requirement is to preserve the property that "with the exception of slab reclaim that any reclaim action will result in K-sized allocation succeeding" where K is the largest blocksize used by any underlying storage device. From an FS perspective then certain things would look similar to what they do today. Block data would be on physically contiguous pages, buffer_heads would still manage the case where block_size <= PAGEALLOC_PAGE_SIZE (as opposed to MMU_PAGE_SIZE), particularly for dirty tracking and so on. The VM perspective is different because now it has to handle MMU_PAGE_SIZE in a very different way, page reclaim of a page becomes multiple unmap events and so on. There would also be anomalies such as mlock of a range smaller than PAGEALLOC_PAGE_SIZE becomes difficult if not impossible to sensibly manage because mlock of a 4K page effectively pins the rest and it's not obvious how we would deal with the VMAs in that case. It would get more than just the storage gains though. 
Some of the scalability problems that deal with the massive number of struct pages may magically go away if the base unit of allocation and management changes. > > If we > > want to break the connection between the basic unit of memory managed > > by the kernel and the MMU page size then I don't know but it would be a > > fairly large amount of surgery and need a lot of design work. Minimally, > > anything dealing with an MMU-sized amount of memory would now need to > > deal with sub-pages and there would need to be some restrictions on how > > sub-pages were used to mitigate the risk of external fragmentation -- do not > > mix page table page allocations with pages mapped into the address space, > > do not allow sub pages to be used by different processes etc. At the very > > least there would be a performance impact because PAGE_SIZE is no longer a > > compile-time constant. However, it would potentially allow the block size > > to be at least the same size as this new basic allocation unit. > > Hm, OK, so less appealing then. > Yes. On the plus side, you get the type of compound pages you want. On the negative side this would be a massive overhaul of a large chunk of the VM with lots of nasty details. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
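To make the proposed split concrete, here is a purely hypothetical sketch using the names from the message above (MMU_PAGE_SIZE, PAGEALLOC_PAGE_SIZE). Nothing like this exists in the tree, and a real implementation would most likely make the allocation-unit size a boot-time variable rather than a compile-time constant, which is exactly the performance concern raised above.

/* What the hardware MMU maps and what a single pte_t describes. */
#define MMU_PAGE_SHIFT		12
#define MMU_PAGE_SIZE		(1UL << MMU_PAGE_SHIFT)

/*
 * Proposed base unit of allocation and of struct page management: at
 * least as large as the biggest block size of any attached storage
 * device, 16k in this example.
 */
#define PAGEALLOC_PAGE_SHIFT	14
#define PAGEALLOC_PAGE_SIZE	(1UL << PAGEALLOC_PAGE_SHIFT)

/* Each allocator page covers several MMU-sized sub-pages. */
#define SUBPAGES_PER_PAGE	(PAGEALLOC_PAGE_SIZE / MMU_PAGE_SIZE)

Reclaim of one such page then becomes SUBPAGES_PER_PAGE unmap events, and an mlock of a single 4k sub-page pins the whole unit, which is the anomaly described above.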
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-24 10:57 ` Mel Gorman @ 2014-01-30 4:52 ` Matthew Wilcox 2014-01-30 6:01 ` Dave Chinner 2014-01-30 10:50 ` Mel Gorman 0 siblings, 2 replies; 59+ messages in thread From: Matthew Wilcox @ 2014-01-30 4:52 UTC (permalink / raw) To: Mel Gorman Cc: James Bottomley, linux-scsi@vger.kernel.org, Chris Mason, Dave Chinner, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-ide@vger.kernel.org, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org, rwheeler@redhat.com On Fri, Jan 24, 2014 at 10:57:48AM +0000, Mel Gorman wrote: > So far on the table is > > 1. major filesystem overhawl > 2. major vm overhawl > 3. use compound pages as they are today and hope it does not go > completely to hell, reboot when it does Is the below paragraph an exposition of option 2, or is it an option 4, change the VM unit of allocation? Other than the names you're using, this is basically what I said to Kirill in an earlier thread; either scrap the difference between PAGE_SIZE and PAGE_CACHE_SIZE, or start making use of it. The fact that EVERYBODY in this thread has been using PAGE_SIZE when they should have been using PAGE_CACHE_SIZE makes me wonder if part of the problem is that the split in naming went the wrong way. ie use PTE_SIZE for 'the amount of memory pointed to by a pte_t' and use PAGE_SIZE for 'the amount of memory described by a struct page'. (we need to remove the current users of PTE_SIZE; sparc32 and powerpc32, but that's just a detail) And we need to fix all the places that are currently getting the distinction wrong. SMOP ... ;-) What would help is correct typing of variables, possibly with sparse support to help us out. Big Job. > That's why I suggested that it may be necessary to change the basic unit of > allocation the kernel uses to be larger than the MMU page size and restrict > how the sub pages are used. The requirement is to preserve the property that > "with the exception of slab reclaim that any reclaim action will result > in K-sized allocation succeeding" where K is the largest blocksize used by > any underlying storage device. From an FS perspective then certain things > would look similar to what they do today. Block data would be on physically > contiguous pages, buffer_heads would still manage the case where block_size > <= PAGEALLOC_PAGE_SIZE (as opposed to MMU_PAGE_SIZE), particularly for > dirty tracking and so on. The VM perspective is different because now it > has to handle MMU_PAGE_SIZE in a very different way, page reclaim of a page > becomes multiple unmap events and so on. There would also be anomalies such > as mlock of a range smaller than PAGEALLOC_PAGE_SIZE becomes difficult if > not impossible to sensibly manage because mlock of a 4K page effectively > pins the rest and it's not obvious how we would deal with the VMAs in that > case. It would get more than just the storage gains though. Some of the > scalability problems that deal with massive amount of struct pages may > magically go away if the base unit of allocation and management changes. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
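For readers without the source to hand: in the kernels of that era the distinction Matthew refers to exists only in name. include/linux/pagemap.h defined the page cache macros as straight aliases, roughly:

#define PAGE_CACHE_SHIFT	PAGE_SHIFT
#define PAGE_CACHE_SIZE		PAGE_SIZE
#define PAGE_CACHE_MASK		PAGE_MASK
#define PAGE_CACHE_ALIGN(addr)	(((addr) + PAGE_CACHE_SIZE - 1) & PAGE_CACHE_MASK)

#define page_cache_get(page)		get_page(page)
#define page_cache_release(page)	put_page(page)

So "start making use of it" means turning those aliases into something that can actually differ from the MMU page size, which is where the renaming and typing cleanup Matthew describes comes in.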
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-30 4:52 ` Matthew Wilcox @ 2014-01-30 6:01 ` Dave Chinner 2014-01-30 10:50 ` Mel Gorman 1 sibling, 0 replies; 59+ messages in thread From: Dave Chinner @ 2014-01-30 6:01 UTC (permalink / raw) To: Matthew Wilcox Cc: Mel Gorman, James Bottomley, linux-scsi@vger.kernel.org, Chris Mason, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-ide@vger.kernel.org, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org, rwheeler@redhat.com On Wed, Jan 29, 2014 at 09:52:46PM -0700, Matthew Wilcox wrote: > On Fri, Jan 24, 2014 at 10:57:48AM +0000, Mel Gorman wrote: > > So far on the table is > > > > 1. major filesystem overhawl > > 2. major vm overhawl > > 3. use compound pages as they are today and hope it does not go > > completely to hell, reboot when it does > > Is the below paragraph an exposition of option 2, or is it an option 4, > change the VM unit of allocation? Other than the names you're using, > this is basically what I said to Kirill in an earlier thread; either > scrap the difference between PAGE_SIZE and PAGE_CACHE_SIZE, or start > making use of it. Christoph Lamater's compound page patch set scrapped PAGE_CACHE_SIZE and made it a variable that was set on the struct address_space when it was instantiated by the filesystem. In effect, it allowed filesystems to specify the unit of page cache allocation on a per-inode basis. > The fact that EVERYBODY in this thread has been using PAGE_SIZE when they > should have been using PAGE_CACHE_SIZE makes me wonder if part of the > problem is that the split in naming went the wrong way. ie use PTE_SIZE > for 'the amount of memory pointed to by a pte_t' and use PAGE_SIZE for > 'the amount of memory described by a struct page'. PAGE_CACHE_SIZE was never distributed sufficiently to be used, and if you #define it to something other than PAGE_SIZE stuff will simply break. > (we need to remove the current users of PTE_SIZE; sparc32 and powerpc32, > but that's just a detail) > > And we need to fix all the places that are currently getting the > distinction wrong. SMOP ... ;-) What would help is correct typing of > variables, possibly with sparse support to help us out. Big Job. Yes, that's what the Christoph's patchset did. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
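An illustrative sketch (not the actual 2007 patch set, whose field and helper names may well differ) of what "PAGE_CACHE_SIZE becomes a per-address_space value set by the filesystem" could look like; it assumes a page_order field added to struct address_space:

/* Hypothetical field added to struct address_space in <linux/fs.h>:
 *	unsigned int	page_order;	-- allocation order of this mapping's page cache units
 */

static inline unsigned long mapping_page_cache_size(struct address_space *mapping)
{
	return PAGE_SIZE << mapping->page_order;	/* needs the field above */
}

/* A filesystem with 16k blocks would then request 16k cache units when
 * instantiating the inode, e.g.: */
static void example_set_large_blocksize(struct inode *inode)
{
	inode->i_mapping->page_order = 2;	/* 4 x 4k pages per cache unit */
}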
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-30 4:52 ` Matthew Wilcox 2014-01-30 6:01 ` Dave Chinner @ 2014-01-30 10:50 ` Mel Gorman 1 sibling, 0 replies; 59+ messages in thread From: Mel Gorman @ 2014-01-30 10:50 UTC (permalink / raw) To: Matthew Wilcox Cc: James Bottomley, linux-scsi@vger.kernel.org, Chris Mason, Dave Chinner, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-ide@vger.kernel.org, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org, rwheeler@redhat.com On Wed, Jan 29, 2014 at 09:52:46PM -0700, Matthew Wilcox wrote: > On Fri, Jan 24, 2014 at 10:57:48AM +0000, Mel Gorman wrote: > > So far on the table is > > > > 1. major filesystem overhawl > > 2. major vm overhawl > > 3. use compound pages as they are today and hope it does not go > > completely to hell, reboot when it does > > Is the below paragraph an exposition of option 2, or is it an option 4, > change the VM unit of allocation? Changing the VM unit of allocation is a major VM overhawl > Other than the names you're using, > this is basically what I said to Kirill in an earlier thread; either > scrap the difference between PAGE_SIZE and PAGE_CACHE_SIZE, or start > making use of it. > No. The PAGE_CACHE_SIZE would depend on the underlying address space and vary. The large block patchset would have to have done this but I did not go back and review the patches due to lack of time. With that it starts hitting into fragmentation problems that have to be addressed somehow and cannot just be waved away. > The fact that EVERYBODY in this thread has been using PAGE_SIZE when they > should have been using PAGE_CACHE_SIZE makes me wonder if part of the > problem is that the split in naming went the wrong way. ie use PTE_SIZE > for 'the amount of memory pointed to by a pte_t' and use PAGE_SIZE for > 'the amount of memory described by a struct page'. > > (we need to remove the current users of PTE_SIZE; sparc32 and powerpc32, > but that's just a detail) > > And we need to fix all the places that are currently getting the > distinction wrong. SMOP ... ;-) What would help is correct typing of > variables, possibly with sparse support to help us out. Big Job. > That's taking the approach of the large block patchset (as I understand it, not reviewed, not working on this etc) without dealing with potential fragmentation problems. Of course they could be remapped virtually if necessary but that will be very constrained on 32-bit, the final transfer to hardware will require scatter/gather and there is a setup/teardown cost with virtual mappings such as faulting (setup) and IPIs to flush TLBs (teardown) that would add overhead. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-23 16:44 ` Mel Gorman 2014-01-23 19:55 ` James Bottomley @ 2014-01-23 20:34 ` Dave Chinner 1 sibling, 0 replies; 59+ messages in thread From: Dave Chinner @ 2014-01-23 20:34 UTC (permalink / raw) To: Mel Gorman Cc: James Bottomley, Chris Mason, linux-scsi@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org, rwheeler@redhat.com On Thu, Jan 23, 2014 at 04:44:38PM +0000, Mel Gorman wrote: > On Thu, Jan 23, 2014 at 07:47:53AM -0800, James Bottomley wrote: > > On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote: > > > On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote: > > > > On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote: > > > > > > The other question is if the drive does RMW between 4k and whatever its > > > > > > physical sector size, do we need to do anything to take advantage of > > > > > > it ... as in what would altering the granularity of the page cache buy > > > > > > us? > > > > > > > > > > The real benefit is when and how the reads get scheduled. We're able to > > > > > do a much better job pipelining the reads, controlling our caches and > > > > > reducing write latency by having the reads done up in the OS instead of > > > > > the drive. > > > > > > > > I agree with all of that, but my question is still can we do this by > > > > propagating alignment and chunk size information (i.e. the physical > > > > sector size) like we do today. If the FS knows the optimal I/O patterns > > > > and tries to follow them, the odd cockup won't impact performance > > > > dramatically. The real question is can the FS make use of this layout > > > > information *without* changing the page cache granularity? Only if you > > > > answer me "no" to this do I think we need to worry about changing page > > > > cache granularity. > > > > > > We already do this today. > > > > > > The problem is that we are limited by the page cache assumption that > > > the block device/filesystem never need to manage multiple pages as > > > an atomic unit of change. Hence we can't use the generic > > > infrastructure as it stands to handle block/sector sizes larger than > > > a page size... > > > > If the compound page infrastructure exists today and is usable for this, > > what else do we need to do? ... because if it's a couple of trivial > > changes and a few minor patches to filesystems to take advantage of it, > > we might as well do it anyway. > > Do not do this as there is no guarantee that a compound allocation will > succeed. If the allocation fails then it is potentially unrecoverable > because we can no longer write to storage then you're hosed. If you are > now thinking mempool then the problem becomes that the system will be > in a state of degraded performance for an unknowable length of time and > may never recover fully. We are talking about page cache allocation here, not something deep down inside the IO path that requires mempools to guarantee IO completion. IOWs, we have an *existing error path* to return ENOMEM to userspace when page cache allocation fails. > 64K MMU page size systems get away with this > because the blocksize is still <= PAGE_SIZE and no core VM changes are > necessary. 
Critically, pages like the page table pages are the same size as > the basic unit of allocation used by the kernel so external fragmentation > simply is not a severe problem. Christoph's old patches didn't need 64k MMU page sizes to work. IIRC, the compound page was mapped via into the page cache as individual 4k pages. Any change of state on the child pages followed the back pointer to the head of the compound page and changed the state of that page. On page faults, the individual 4k pages were mapped to userspace rather than the compound page, so there was no userspace visible change, either. The question I had at the time that was never answered was this: if pages are faulted and mapped individually through their own ptes, why did the compound pages need to be contiguous? copy-in/out through read/write was still done a PAGE_SIZE granularity, mmap mappings were still on PAGE_SIZE granularity, so why can't we build a compound page for the page cache out of discontiguous pages? FWIW, XFS has long used discontiguous pages for large block support in metadata. Some of that is vmapped to make metadata processing simple. The point of this is that we don't need *contiguous* compound pages in the page cache if we can map them into userspace as individual PAGE_SIZE pages. Only the page cache management needs to handle the groups of pages that make up a filesystem block as a compound page.... > > I was only objecting on the grounds that > > the last time we looked at it, it was major VM surgery. Can someone > > give a summary of how far we are away from being able to do this with > > the VM system today and what extra work is needed (and how big is this > > piece of work)? > > > > Offhand no idea. For fsblock, probably a similar amount of work than > had to be done in 2007 and I'd expect it would still require filesystem > awareness problems that Dave Chinner pointer out earlier. For large block, > it'd hit into the same wall that allocations must always succeed. If we > want to break the connection between the basic unit of memory managed > by the kernel and the MMU page size then I don't know but it would be a > fairly large amount of surgery and need a lot of design work. Here's the patch that Christoph wrote backin 2007 to add PAGE_SIZE based mmap support: http://thread.gmane.org/gmane.linux.file-systems/18004 I don't claim to understand all of it, but it seems to me that most of the design and implementation problems were solved.... ..... > At the very > least there would be a performance impact because PAGE_SIZE is no longer a > compile-time constant. Christoph's patchset did this, and no discernable performance difference could be measured as a result of making PAGE_SIZE a variable rather than a compile time constant. I doubt that this has changed much since then... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
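A minimal sketch of the "discontiguous pages, virtually mapped" scheme Dave describes for XFS metadata buffers, assuming a 16k block assembled from four independently allocated 4k pages; vmap()/vunmap() are the stock interfaces and the surrounding helpers are illustrative:

#include <linux/mm.h>
#include <linux/vmalloc.h>

#define BLOCK_PAGES	4	/* 16k block on a 4k PAGE_SIZE system */

/* pages[] holds BLOCK_PAGES independently allocated order-0 pages. */
static void *map_block(struct page **pages)
{
	/*
	 * No physically contiguous allocation is required; the price is
	 * the vmalloc-space mapping plus the TLB flush on teardown that
	 * is mentioned elsewhere in the thread.
	 */
	return vmap(pages, BLOCK_PAGES, VM_MAP, PAGE_KERNEL);
}

static void unmap_block(void *addr)
{
	vunmap(addr);
}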
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-23 15:47 ` James Bottomley 2014-01-23 16:44 ` Mel Gorman @ 2014-01-23 20:54 ` Christoph Lameter 1 sibling, 0 replies; 59+ messages in thread From: Christoph Lameter @ 2014-01-23 20:54 UTC (permalink / raw) To: James Bottomley Cc: Dave Chinner, Chris Mason, linux-scsi@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, mgorman@suse.de, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org, rwheeler@redhat.com On Thu, 23 Jan 2014, James Bottomley wrote: > If the compound page infrastructure exists today and is usable for this, > what else do we need to do? ... because if it's a couple of trivial > changes and a few minor patches to filesystems to take advantage of it, > we might as well do it anyway. I was only objecting on the grounds that > the last time we looked at it, it was major VM surgery. Can someone > give a summary of how far we are away from being able to do this with > the VM system today and what extra work is needed (and how big is this > piece of work)? The main problem for me was the page cache. The VM would not be such a problem. Changing the page cache function required updates to many filesystems. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
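An example of the kind of call site Christoph means: essentially every filesystem computes page cache indices on the assumption of one cache entry per PAGE_CACHE_SIZE, so changing the granularity means auditing each of them. The helpers below are the stock page cache API; the wrapper function is illustrative.

#include <linux/pagemap.h>

static struct page *example_get_cache_page(struct address_space *mapping,
					   loff_t pos)
{
	/* One page cache entry per PAGE_CACHE_SIZE is baked in here. */
	pgoff_t index = pos >> PAGE_CACHE_SHIFT;

	return find_or_create_page(mapping, index, GFP_NOFS);
}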
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 17:21 ` James Bottomley 2014-01-22 18:02 ` Chris Mason @ 2014-01-23 8:24 ` Dave Chinner 1 sibling, 0 replies; 59+ messages in thread From: Dave Chinner @ 2014-01-23 8:24 UTC (permalink / raw) To: James Bottomley Cc: Chris Mason, linux-scsi@vger.kernel.org, linux-ide@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mgorman@suse.de, linux-fsdevel@vger.kernel.org, akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org, rwheeler@redhat.com On Wed, Jan 22, 2014 at 09:21:40AM -0800, James Bottomley wrote: > On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote: > > On Wed, 2014-01-22 at 15:19 +0000, Mel Gorman wrote: > > > On Wed, Jan 22, 2014 at 09:58:46AM -0500, Ric Wheeler wrote: > > > > On 01/22/2014 09:34 AM, Mel Gorman wrote: > > > > >On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote: > > > > >>On 01/22/2014 04:34 AM, Mel Gorman wrote: > > > > >>>On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: > > > > >>>>One topic that has been lurking forever at the edges is the current > > > > >>>>4k limitation for file system block sizes. Some devices in > > > > >>>>production today and others coming soon have larger sectors and it > > > > >>>>would be interesting to see if it is time to poke at this topic > > > > >>>>again. > > > > >>>> > > > > >>>Large block support was proposed years ago by Christoph Lameter > > > > >>>(http://lwn.net/Articles/232757/). I think I was just getting started > > > > >>>in the community at the time so I do not recall any of the details. I do > > > > >>>believe it motivated an alternative by Nick Piggin called fsblock though > > > > >>>(http://lwn.net/Articles/321390/). At the very least it would be nice to > > > > >>>know why neither were never merged for those of us that were not around > > > > >>>at the time and who may not have the chance to dive through mailing list > > > > >>>archives between now and March. > > > > >>> > > > > >>>FWIW, I would expect that a show-stopper for any proposal is requiring > > > > >>>high-order allocations to succeed for the system to behave correctly. > > > > >>> > > > > >>I have a somewhat hazy memory of Andrew warning us that touching > > > > >>this code takes us into dark and scary places. > > > > >> > > > > >That is a light summary. As Andrew tends to reject patches with poor > > > > >documentation in case we forget the details in 6 months, I'm going to guess > > > > >that he does not remember the details of a discussion from 7ish years ago. > > > > >This is where Andrew swoops in with a dazzling display of his eidetic > > > > >memory just to prove me wrong. > > > > > > > > > >Ric, are there any storage vendor that is pushing for this right now? > > > > >Is someone working on this right now or planning to? If they are, have they > > > > >looked into the history of fsblock (Nick) and large block support (Christoph) > > > > >to see if they are candidates for forward porting or reimplementation? > > > > >I ask because without that person there is a risk that the discussion > > > > >will go as follows > > > > > > > > > >Topic leader: Does anyone have an objection to supporting larger block > > > > > sizes than the page size? > > > > >Room: Send patches and we'll talk. > > > > > > > > > > > > > I will have to see if I can get a storage vendor to make a public > > > > statement, but there are vendors hoping to see this land in Linux in > > > > the next few years. 
> > > > > > > What about the second and third questions -- is someone working on this > > > right now or planning to? Have they looked into the history of fsblock > > > (Nick) and large block support (Christoph) to see if they are candidates > > > for forward porting or reimplementation? > > > > I really think that if we want to make progress on this one, we need > > code and someone that owns it. Nick's work was impressive, but it was > > mostly there for getting rid of buffer heads. If we have a device that > > needs it and someone working to enable that device, we'll go forward > > much faster. > > Do we even need to do that (eliminate buffer heads)? No, the reason bufferheads were replaced was that a bufferhead can only reference a single page. i.e. the structure is that a page can reference multiple bufferheads (block size <= page size) but a bufferhead can't reference multiple pages, which is what is needed for block size > page size. fsblock was designed to handle both cases. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
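For reference, the limitation Dave describes is visible directly in the structure; a lightly abridged excerpt of struct buffer_head from include/linux/buffer_head.h:

struct buffer_head {
	unsigned long b_state;			/* buffer state bitmap */
	struct buffer_head *b_this_page;	/* circular list of the page's buffers */
	struct page *b_page;			/* the single page this bh is mapped to */
	sector_t b_blocknr;			/* start block number */
	size_t b_size;				/* size of mapping */
	char *b_data;				/* pointer to data within the page */
	/* ... */
};

b_this_page lets one page carry several buffer_heads (block size <= page size), but b_page and b_data can only ever point into a single page, so a block larger than a page has nowhere to hang; fsblock's replacement structure was designed to cover that case as well.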
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 15:19 ` Mel Gorman 2014-01-22 17:02 ` Chris Mason @ 2014-01-23 20:48 ` Christoph Lameter 1 sibling, 0 replies; 59+ messages in thread From: Christoph Lameter @ 2014-01-23 20:48 UTC (permalink / raw) To: Mel Gorman Cc: Ric Wheeler, linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc, linux-kernel, Andrew Morton On Wed, 22 Jan 2014, Mel Gorman wrote: > Don't get me wrong, I'm interested in the topic but I severely doubt I'd > have the capacity to research the background of this in advance. It's also > unlikely that I'd work on it in the future without throwing out my current > TODO list. In an ideal world someone will have done the legwork in advance > of LSF/MM to help drive the topic. I can give an overview of the history and the challenges of the approaches if needed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 14:58 ` Ric Wheeler 2014-01-22 15:19 ` Mel Gorman @ 2014-01-22 20:47 ` Martin K. Petersen 1 sibling, 0 replies; 59+ messages in thread From: Martin K. Petersen @ 2014-01-22 20:47 UTC (permalink / raw) To: Ric Wheeler Cc: Mel Gorman, linux-scsi, linux-mm, linux-kernel, linux-ide, linux-fsdevel, Andrew Morton, lsf-pc >>>>> "Ric" == Ric Wheeler <rwheeler@redhat.com> writes: Ric> I will have to see if I can get a storage vendor to make a public Ric> statement, but there are vendors hoping to see this land in Linux Ric> in the next few years. I assume that anyone with a shipping device Ric> will have to at least emulate the 4KB sector size for years to Ric> come, but that there might be a significant performance win for Ric> platforms that can do a larger block. I am aware of two companies that already created devices with 8KB logical blocks and expected Linux to work. I had to do some explaining. I agree with Ric that this is something we'll need to address sooner rather than later. -- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 14:34 ` Mel Gorman 2014-01-22 14:58 ` Ric Wheeler @ 2014-01-23 8:21 ` Dave Chinner 1 sibling, 0 replies; 59+ messages in thread From: Dave Chinner @ 2014-01-23 8:21 UTC (permalink / raw) To: Mel Gorman Cc: Ric Wheeler, linux-scsi, linux-mm, linux-kernel, linux-ide, linux-fsdevel, Andrew Morton, lsf-pc On Wed, Jan 22, 2014 at 02:34:52PM +0000, Mel Gorman wrote: > On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote: > > On 01/22/2014 04:34 AM, Mel Gorman wrote: > > >On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: > > >>One topic that has been lurking forever at the edges is the current > > >>4k limitation for file system block sizes. Some devices in > > >>production today and others coming soon have larger sectors and it > > >>would be interesting to see if it is time to poke at this topic > > >>again. > > >> > > >Large block support was proposed years ago by Christoph Lameter > > >(http://lwn.net/Articles/232757/). I think I was just getting started > > >in the community at the time so I do not recall any of the details. I do > > >believe it motivated an alternative by Nick Piggin called fsblock though > > >(http://lwn.net/Articles/321390/). At the very least it would be nice to > > >know why neither were never merged for those of us that were not around > > >at the time and who may not have the chance to dive through mailing list > > >archives between now and March. > > > > > >FWIW, I would expect that a show-stopper for any proposal is requiring > > >high-order allocations to succeed for the system to behave correctly. > > > > > > > I have a somewhat hazy memory of Andrew warning us that touching > > this code takes us into dark and scary places. > > > > That is a light summary. As Andrew tends to reject patches with poor > documentation in case we forget the details in 6 months, I'm going to guess > that he does not remember the details of a discussion from 7ish years ago. > This is where Andrew swoops in with a dazzling display of his eidetic > memory just to prove me wrong. > > Ric, are there any storage vendor that is pushing for this right now? > Is someone working on this right now or planning to? If they are, have they > looked into the history of fsblock (Nick) and large block support (Christoph) > to see if they are candidates for forward porting or reimplementation? > I ask because without that person there is a risk that the discussion > will go as follows > > Topic leader: Does anyone have an objection to supporting larger block > sizes than the page size? > Room: Send patches and we'll talk. So, from someone who was done in the trenches of the large filesystem block size code wars, the main objection to Christoph lameter's patchset was that it used high order compound pages in the page cache so that nothing at filesystem level needed to be changed to support large block sizes. The patch to enable XFS to use 64k block sizes with Christoph's patches was simply removing 5 lines of code that limited the block size to PAGE_SIZE. And everything just worked. Given that compound pages are used all over the place now and we also have page migration, compaction and other MM support that greatly improves high order memory allocation, perhaps we should revisit this approach. As to Nick's fsblock rewrite, he basically rewrote all the bufferhead head code to handle filesystem blocks larger than a page whilst leaving the page cache untouched. i.e. 
the complete opposite approach. The problem with this approach is that every filesystem needs to be re-written to use fsblocks rather than bufferheads. For some filesystems that isn't hard (e.g. ext2) but for filesystems that use bufferheads in the core of their journalling subsystems that's a completely different story. And for filesystems like XFS, it doesn't solve any of the problem with using bufferheads that we have now, so it simply introduces a huge amount of IO path rework and validation without providing any advantage from a feature or performance point of view. i.e. extent based filesystems mostly negate the impact of filesystem block size on IO performance... Realistically, if I'm going to do something in XFS to add block size > page size support, I'm going to do it wiht somethign XFS can track through it's own journal so I can add data=journal functionality with the same filesystem block/extent header structures used to track the pages in blocks larger than PAGE_SIZE. And given that we already have such infrastructure in XFS to support directory blocks larger than filesystem block size.... FWIW, as to the original "large sector size" support question, XFS already supports sector sizes up to 32k in size. The limitation is actually a limitation of the journal format, so going larger than that would take some work... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 9:34 ` [Lsf-pc] " Mel Gorman 2014-01-22 14:10 ` Ric Wheeler @ 2014-01-22 15:14 ` Chris Mason 2014-01-22 16:03 ` James Bottomley 2014-01-23 20:47 ` Christoph Lameter 2 siblings, 1 reply; 59+ messages in thread From: Chris Mason @ 2014-01-22 15:14 UTC (permalink / raw) To: mgorman@suse.de Cc: linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, rwheeler@redhat.com, lsf-pc@lists.linux-foundation.org, linux-ide@vger.kernel.org, linux-fsdevel@vger.kernel.org On Wed, 2014-01-22 at 09:34 +0000, Mel Gorman wrote: > On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: > > One topic that has been lurking forever at the edges is the current > > 4k limitation for file system block sizes. Some devices in > > production today and others coming soon have larger sectors and it > > would be interesting to see if it is time to poke at this topic > > again. > > > > Large block support was proposed years ago by Christoph Lameter > (http://lwn.net/Articles/232757/). I think I was just getting started > in the community at the time so I do not recall any of the details. I do > believe it motivated an alternative by Nick Piggin called fsblock though > (http://lwn.net/Articles/321390/). At the very least it would be nice to > know why neither were never merged for those of us that were not around > at the time and who may not have the chance to dive through mailing list > archives between now and March. > > FWIW, I would expect that a show-stopper for any proposal is requiring > high-order allocations to succeed for the system to behave correctly. > My memory is that Nick's work just didn't have the momentum to get pushed in. It all seemed very reasonable though, I think our hatred of buffered heads just wasn't yet bigger than the fear of moving away. But, the bigger question is how big are the blocks going to be? At some point (64K?) we might as well just make a log structured dm target and have a single setup for both shingled and large sector drives. -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 15:14 ` Chris Mason @ 2014-01-22 16:03 ` James Bottomley 2014-01-22 16:45 ` Ric Wheeler 0 siblings, 1 reply; 59+ messages in thread From: James Bottomley @ 2014-01-22 16:03 UTC (permalink / raw) To: Chris Mason Cc: mgorman@suse.de, linux-scsi@vger.kernel.org, linux-ide@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, rwheeler@redhat.com On Wed, 2014-01-22 at 15:14 +0000, Chris Mason wrote: > On Wed, 2014-01-22 at 09:34 +0000, Mel Gorman wrote: > > On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: > > > One topic that has been lurking forever at the edges is the current > > > 4k limitation for file system block sizes. Some devices in > > > production today and others coming soon have larger sectors and it > > > would be interesting to see if it is time to poke at this topic > > > again. > > > > > > > Large block support was proposed years ago by Christoph Lameter > > (http://lwn.net/Articles/232757/). I think I was just getting started > > in the community at the time so I do not recall any of the details. I do > > believe it motivated an alternative by Nick Piggin called fsblock though > > (http://lwn.net/Articles/321390/). At the very least it would be nice to > > know why neither were never merged for those of us that were not around > > at the time and who may not have the chance to dive through mailing list > > archives between now and March. > > > > FWIW, I would expect that a show-stopper for any proposal is requiring > > high-order allocations to succeed for the system to behave correctly. > > > > My memory is that Nick's work just didn't have the momentum to get > pushed in. It all seemed very reasonable though, I think our hatred of > buffered heads just wasn't yet bigger than the fear of moving away. > > But, the bigger question is how big are the blocks going to be? At some > point (64K?) we might as well just make a log structured dm target and > have a single setup for both shingled and large sector drives. There is no real point. Even with 4k drives today using 4k sectors in the filesystem, we still get 512 byte writes because of journalling and the buffer cache. The question is what would we need to do to support these devices and the answer is "try to send IO in x byte multiples x byte aligned" this really becomes an ioscheduler problem, not a supporting large page problem. James -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
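The "x byte multiples, x byte aligned" information James refers to is what a driver already advertises through the request queue limits. A sketch for a hypothetical drive with 512-byte logical and 16k physical sectors follows; the blk_queue_* calls are the stock API, while the setup function and the values are illustrative.

#include <linux/blkdev.h>

static void example_set_queue_limits(struct request_queue *q)
{
	blk_queue_logical_block_size(q, 512);	  /* smallest addressable unit */
	blk_queue_physical_block_size(q, 16384);  /* drive's internal RMW unit */
	blk_queue_io_min(q, 16384);		  /* preferred minimum I/O size */
	blk_queue_io_opt(q, 16384);		  /* preferred I/O granularity */
}

These limits surface to userspace under /sys/block/<dev>/queue/ and to filesystems through the topology helpers, which is the existing propagation path the thread keeps coming back to.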
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes 2014-01-22 16:03 ` James Bottomley @ 2014-01-22 16:45 ` Ric Wheeler 2014-01-22 17:00 ` James Bottomley 0 siblings, 1 reply; 59+ messages in thread From: Ric Wheeler @ 2014-01-22 16:45 UTC (permalink / raw) To: James Bottomley, Chris Mason Cc: mgorman@suse.de, linux-scsi@vger.kernel.org, linux-ide@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org On 01/22/2014 11:03 AM, James Bottomley wrote: > On Wed, 2014-01-22 at 15:14 +0000, Chris Mason wrote: >> On Wed, 2014-01-22 at 09:34 +0000, Mel Gorman wrote: >>> On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: >>>> One topic that has been lurking forever at the edges is the current >>>> 4k limitation for file system block sizes. Some devices in >>>> production today and others coming soon have larger sectors and it >>>> would be interesting to see if it is time to poke at this topic >>>> again. >>>> >>> Large block support was proposed years ago by Christoph Lameter >>> (http://lwn.net/Articles/232757/). I think I was just getting started >>> in the community at the time so I do not recall any of the details. I do >>> believe it motivated an alternative by Nick Piggin called fsblock though >>> (http://lwn.net/Articles/321390/). At the very least it would be nice to >>> know why neither were never merged for those of us that were not around >>> at the time and who may not have the chance to dive through mailing list >>> archives between now and March. >>> >>> FWIW, I would expect that a show-stopper for any proposal is requiring >>> high-order allocations to succeed for the system to behave correctly. >>> >> My memory is that Nick's work just didn't have the momentum to get >> pushed in. It all seemed very reasonable though, I think our hatred of >> buffered heads just wasn't yet bigger than the fear of moving away. >> >> But, the bigger question is how big are the blocks going to be? At some >> point (64K?) we might as well just make a log structured dm target and >> have a single setup for both shingled and large sector drives. > There is no real point. Even with 4k drives today using 4k sectors in > the filesystem, we still get 512 byte writes because of journalling and > the buffer cache. I think that you are wrong here James. Even with 512 byte drives, the IO's we send down tend to be 4k or larger. Do you have traces that show this and details? > > The question is what would we need to do to support these devices and > the answer is "try to send IO in x byte multiples x byte aligned" this > really becomes an ioscheduler problem, not a supporting large page > problem. > > James > Not that simple. The requirement of some of these devices are that you *never* send down a partial write or an unaligned write. Also keep in mind that larger block sizes allow us to track larger files with smaller amounts of metadata which is a second win. Ric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 59+ messages in thread
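A minimal sketch of the filesystem-side check Ric's requirement implies: never issue a write whose offset or length is not a whole multiple of the device's physical block size. bdev_physical_block_size() is the stock helper; the checking function is illustrative and assumes the physical block size is a power of two.

#include <linux/blkdev.h>

static bool write_is_device_aligned(struct block_device *bdev,
				    loff_t pos, size_t len)
{
	unsigned int pbs = bdev_physical_block_size(bdev);

	/* Physical block sizes are powers of two, so a mask test suffices. */
	return !(pos & (pbs - 1)) && !(len & (pbs - 1));
}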
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
  From: James Bottomley @ 2014-01-22 17:00 UTC
  To: Ric Wheeler
  Cc: Chris Mason, mgorman@suse.de, linux-scsi@vger.kernel.org, linux-ide@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org

On Wed, 2014-01-22 at 11:45 -0500, Ric Wheeler wrote:
> On 01/22/2014 11:03 AM, James Bottomley wrote:
> > [...]
> > There is no real point. Even with 4k drives today using 4k sectors in
> > the filesystem, we still get 512 byte writes because of journalling
> > and the buffer cache.
>
> I think that you are wrong here, James. Even with 512-byte drives, the
> IOs we send down tend to be 4k or larger. Do you have traces and
> details that show this?

It's mostly an ext3 journalling issue ... it's only metadata, and the
ioschedulers can mostly elevate it into 4k chunks, so yes, most of our
writes are 4k+ and this is a red herring.

> > The question is what we would need to do to support these devices,
> > and the answer is "try to send IO in x byte multiples, x byte
> > aligned". That really becomes an ioscheduler problem, not a problem
> > of supporting large pages.
>
> It is not that simple. The requirement of some of these devices is that
> you *never* send down a partial write or an unaligned write.

But this is the million dollar question. That was originally going to be
the requirement of the 4k sector devices, but look what happened in the
market.

> Also keep in mind that larger block sizes allow us to track larger
> files with smaller amounts of metadata, which is a second win.

Larger filesystem block sizes are completely independent of larger
device block sizes (we can have 16k filesystem block sizes on 4k or even
512b devices). The questions on larger block size devices are twofold:

     1. If manufacturers tell us that they'll only support I/O at the
        physical sector size, do we believe them, given that they said
        this before for 4k and then backed down? All the logical vs
        physical sector handling is now in the T10 standards, so why
        would they go all-physical again, especially as they've now all
        written firmware that does the necessary RMW?
     2. If we agree they'll do RMW in firmware again, what do we have to
        do to take advantage of larger sector sizes beyond what we
        currently do in alignment and chunking? There may still be
        issues in FS journal and data layouts.

James
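The "alignment and chunking" James mentions in question 2 is, at its core, simple arithmetic: round the start of an I/O down and the end up to the device granularity, or verify that a request already satisfies it. A minimal sketch, with hypothetical helper names and assuming a power-of-two granularity:

/* Illustrative helpers only; 'gran' is the device's preferred I/O unit. */
#include <stdint.h>

static inline uint64_t round_down_to(uint64_t v, uint64_t align)
{
	return v & ~(align - 1);	/* align must be a power of two */
}

static inline uint64_t round_up_to(uint64_t v, uint64_t align)
{
	return (v + align - 1) & ~(align - 1);
}

/* Returns 1 if [start, start+len) already meets the device granularity. */
static inline int io_is_aligned(uint64_t start, uint64_t len, uint64_t gran)
{
	return len != 0 && (start % gran == 0) && (len % gran == 0);
}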
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
  From: Jan Kara @ 2014-01-22 21:05 UTC
  To: James Bottomley
  Cc: Ric Wheeler, linux-ide@vger.kernel.org, linux-scsi@vger.kernel.org, Chris Mason, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mgorman@suse.de, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org

On Wed 22-01-14 09:00:33, James Bottomley wrote:
> On Wed, 2014-01-22 at 11:45 -0500, Ric Wheeler wrote:
> > [...]
> > I think that you are wrong here, James. Even with 512-byte drives,
> > the IOs we send down tend to be 4k or larger. Do you have traces and
> > details that show this?
>
> It's mostly an ext3 journalling issue ... it's only metadata, and the
> ioschedulers can mostly elevate it into 4k chunks, so yes, most of our
> writes are 4k+ and this is a red herring.

ext3 (like ext4) does block-level journalling, meaning that it journals
*only* full blocks. So an ext3/4 filesystem with a 4 KB block size will
never journal anything other than full 4 KB blocks. I'm not sure where
this idea of 512-byte writes came from.

> > Also keep in mind that larger block sizes allow us to track larger
> > files with smaller amounts of metadata, which is a second win.
>
> Larger filesystem block sizes are completely independent of larger
> device block sizes (we can have 16k filesystem block sizes on 4k or
> even 512b devices). The questions on larger block size devices are
> twofold:
>
>      1. If manufacturers tell us that they'll only support I/O at the
>         physical sector size, do we believe them, given that they said
>         this before for 4k and then backed down? All the logical vs
>         physical sector handling is now in the T10 standards, so why
>         would they go all-physical again, especially as they've now
>         all written firmware that does the necessary RMW?
>      2. If we agree they'll do RMW in firmware again, what do we have
>         to do to take advantage of larger sector sizes beyond what we
>         currently do in alignment and chunking? There may still be
>         issues in FS journal and data layouts.

I also believe drives will support smaller-than-blocksize writes. But
supporting a larger fs block size can sometimes be beneficial for other
reasons (think of performance with specialized workloads, because the
amount of metadata is smaller and fragmentation is lower).

Currently ocfs2, ext4, and possibly others jump through hoops to support
allocating file data in chunks larger than the fs block size - at first
sight that should be straightforward, but if you look at the code you
find nasty corner cases which make it pretty ugly. And each fs doing
these large data allocations currently invents its own way to deal with
the problems. So providing some common infrastructure for dealing with
blocks larger than the page size would definitely relieve some pain.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
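A rough illustration of the bookkeeping Jan alludes to, in the spirit of ext4's bigalloc or ocfs2 clusters but with hypothetical names and constants: when space is allocated in clusters of 2^N filesystem blocks while the file mapping and page cache still work in blocks, every allocation path has to convert between the two units and round correctly.

/*
 * Sketch of cluster-vs-block arithmetic; shifts and helper names are
 * assumptions, not any filesystem's actual code.
 */
#include <stdint.h>

#define BLOCK_SHIFT	12	/* 4 KB filesystem block */
#define CLUSTER_SHIFT	4	/* 16 blocks = 64 KB allocation cluster */

static inline uint64_t block_to_cluster(uint64_t block)
{
	return block >> CLUSTER_SHIFT;
}

static inline uint64_t cluster_to_first_block(uint64_t cluster)
{
	return cluster << CLUSTER_SHIFT;
}

/* How many whole clusters does a write of 'len' bytes at 'pos' touch?
 * (len must be > 0) */
static inline uint64_t clusters_spanned(uint64_t pos, uint64_t len)
{
	uint64_t first = pos >> (BLOCK_SHIFT + CLUSTER_SHIFT);
	uint64_t last  = (pos + len - 1) >> (BLOCK_SHIFT + CLUSTER_SHIFT);

	return last - first + 1;
}

The ugly corner cases Jan mentions tend to live around the edges of this arithmetic, for example deciding how much of a newly allocated cluster must be zeroed when a write only covers part of it.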
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
  From: Christoph Lameter @ 2014-01-23 20:47 UTC
  To: Mel Gorman
  Cc: Ric Wheeler, linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc, linux-kernel

On Wed, 22 Jan 2014, Mel Gorman wrote:

> Large block support was proposed years ago by Christoph Lameter
> (http://lwn.net/Articles/232757/). I think I was just getting started
> in the community at the time so I do not recall any of the details. I
> do believe it motivated an alternative by Nick Piggin called fsblock,
> though (http://lwn.net/Articles/321390/). At the very least it would be
> nice to know why neither was ever merged, for those of us who were not
> around at the time and who may not have the chance to dive through
> mailing list archives between now and March.

It was rejected at first because of the necessity of higher-order page
allocations. Nick and I then added ways to virtually map higher-order
pages if the page allocator could no longer provide them.

All of this required changes to the basic page cache operations. I added
a way for the mapping to indicate an order for an address range and then
modified the page cache operations to be able to operate on pages of any
order. The patchset that introduced the ability to specify different
orders for page cache address ranges was not accepted by Andrew because
he thought there was no chance for the rest of the modifications to
become acceptable.
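Christoph's description can be illustrated with a small sketch of the index arithmetic (not the actual patchset code; the structure and names are hypothetical): if each mapping carried an allocation order, the generic page cache would address entries in units of PAGE_SIZE << order rather than PAGE_SIZE.

/* Hypothetical illustration of per-mapping page cache order. */
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

struct mapping_geom {
	unsigned int order;	/* 0 = 4k, 2 = 16k, 4 = 64k cache units */
};

static inline unsigned long cache_unit_size(const struct mapping_geom *m)
{
	return PAGE_SIZE << m->order;
}

/* Which page-cache entry does this file position fall into? */
static inline unsigned long pos_to_index(const struct mapping_geom *m,
					 uint64_t pos)
{
	return pos >> (PAGE_SHIFT + m->order);
}

/* Offset within that (possibly compound) cache entry. */
static inline unsigned long pos_to_offset(const struct mapping_geom *m,
					  uint64_t pos)
{
	return pos & (cache_unit_size(m) - 1);
}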
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
  From: Mel Gorman @ 2014-01-24 11:09 UTC
  To: Christoph Lameter
  Cc: Ric Wheeler, linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc, linux-kernel

On Thu, Jan 23, 2014 at 02:47:10PM -0600, Christoph Lameter wrote:
> On Wed, 22 Jan 2014, Mel Gorman wrote:
>
> > Large block support was proposed years ago by Christoph Lameter
> > (http://lwn.net/Articles/232757/). [...]
>
> It was rejected at first because of the necessity of higher-order page
> allocations. Nick and I then added ways to virtually map higher-order
> pages if the page allocator could no longer provide them.

That'd be OK-ish for 64-bit at least, although it would show up as
degraded performance in some cases when virtually contiguous buffers
were used. Aside from the higher setup, access and teardown costs of a
virtually contiguous buffer, the underlying storage would no longer get
a single buffer as part of the IO request. Would that not offset many of
the advantages?

--
Mel Gorman
SUSE Labs
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
  From: Christoph Lameter @ 2014-01-24 15:44 UTC
  To: Mel Gorman
  Cc: Ric Wheeler, linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc, linux-kernel

On Fri, 24 Jan 2014, Mel Gorman wrote:

> That'd be OK-ish for 64-bit at least, although it would show up as
> degraded performance in some cases when virtually contiguous buffers
> were used. Aside from the higher setup, access and teardown costs of a
> virtually contiguous buffer, the underlying storage would no longer get
> a single buffer as part of the IO request. Would that not offset many
> of the advantages?

It would offset some of that. But the major benefit of a large-order
page cache was the reduction in the number of operations the kernel has
to perform: a 64k page contains 16 4k pages, so only one kernel
operation is required instead of 16.

If the page is virtually allocated, then the higher-level kernel
functions still operate on only one page struct. The lower levels (bio)
will then have to deal with the virtual mappings and create a
scatter-gather list. That is some more overhead, but not much.

Doing something like this will put more stress on the defragmentation
logic in the kernel. In general I think we need more contiguous physical
memory.
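A kernel-style sketch of the fallback Christoph describes: prefer a physically contiguous higher-order allocation and, when the page allocator cannot provide one, map individual 4k pages into a virtually contiguous buffer that the upper layers can treat as one unit, while the bio layer sees a scatter-gather list of the underlying pages. The helper is hypothetical and cleanup of the two allocation flavours is omitted, but alloc_pages(), vmap() and friends are the real interfaces.

/*
 * Hedged sketch, not the original patchset: physically contiguous when
 * possible, virtually contiguous otherwise.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *alloc_large_buffer(unsigned int order, struct page ***pagesp)
{
	unsigned int i, nr = 1U << order;
	struct page *page, **pages;

	/* Try for one physically contiguous higher-order block first. */
	page = alloc_pages(GFP_KERNEL | __GFP_NOWARN, order);
	if (page) {
		*pagesp = NULL;
		return page_address(page);
	}

	/* Fall back to 4k pages mapped into one virtual range. */
	pages = kcalloc(nr, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;
	for (i = 0; i < nr; i++) {
		pages[i] = alloc_page(GFP_KERNEL);
		if (!pages[i])
			goto fail;
	}
	*pagesp = pages;
	/* Virtually contiguous view over discontiguous 4k pages. */
	return vmap(pages, nr, VM_MAP, PAGE_KERNEL);

fail:
	while (i--)
		__free_page(pages[i]);
	kfree(pages);
	return NULL;
}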
* Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
  From: James Bottomley @ 2014-01-22 15:54 UTC
  To: Ric Wheeler
  Cc: linux-scsi, linux-ide, linux-mm, linux-fsdevel, lsf-pc, linux-kernel

On Tue, 2014-01-21 at 22:04 -0500, Ric Wheeler wrote:
> One topic that has been lurking forever at the edges is the current 4k
> limitation for file system block sizes. Some devices in production
> today and others coming soon have larger sectors and it would be
> interesting to see if it is time to poke at this topic again.
>
> LSF/MM seems to be pretty much the only event of the year where most of
> the key people will be present, so this should be a great topic for a
> joint session.

But the question is what the impact will be. A huge amount of fuss was
made about the 512->4k transition. Linux was totally ready because we
had variable block sizes and our page size is 4k; I even have one pure
4k-sector drive that works in one of my test systems. However, the
market chose to go the physical/logical route because of other operating
system considerations: all 4k drives expose 512-byte sectors and do RMW
internally.

For us it becomes about layout and alignment, which we already do. I
can't see how going to 8k or 16k would be any different from what we've
already done. In other words, this is an already-solved problem.

James
* Update on LSF/MM [was Re: LSF/MM 2014 Call For Proposals]
  From: James Bottomley @ 2014-03-14 9:02 UTC
  To: Mel Gorman
  Cc: linux-scsi, linux-ide, linux-mm, linux-fsdevel, linux-kernel, lsf-pc

Hi everyone,

We're about three weeks out from LSF/MM, so the PC is putting together
the agenda here:

https://docs.google.com/spreadsheet/pub?key=0ArurRVMVCSnkdHU2Zk1KbFhmeVZFVmFMQ19nakJYaFE&gid=0

The current list of attendees is:

https://docs.google.com/spreadsheet/pub?key=0ArurRVMVCSnkdHU2Zk1KbFhmeVZFVmFMQ19nakJYaFE&gid=1

As usual, we schedule things just in time, so if you feel there are any
topics we're missing, please let us know (send an email to
lsf@lists.linux-foundation.org and cc the relevant Linux list). There's
always time to add something we forgot and, if the topic is very
relevant, we might even invite you to present it.

James