linux-mm.kvack.org archive mirror
From: Jan Kara <jack@suse.cz>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Ric Wheeler <rwheeler@redhat.com>,
	"linux-ide@vger.kernel.org" <linux-ide@vger.kernel.org>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	Chris Mason <clm@fb.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"mgorman@suse.de" <mgorman@suse.de>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
Date: Wed, 22 Jan 2014 22:05:24 +0100	[thread overview]
Message-ID: <20140122210524.GA27916@quack.suse.cz> (raw)
In-Reply-To: <1390410033.2372.28.camel@dabdike.int.hansenpartnership.com>

On Wed 22-01-14 09:00:33, James Bottomley wrote:
> On Wed, 2014-01-22 at 11:45 -0500, Ric Wheeler wrote:
> > On 01/22/2014 11:03 AM, James Bottomley wrote:
> > > On Wed, 2014-01-22 at 15:14 +0000, Chris Mason wrote:
> > >> On Wed, 2014-01-22 at 09:34 +0000, Mel Gorman wrote:
> > >>> On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:
> > >>>> One topic that has been lurking forever at the edges is the current
> > >>>> 4k limitation for file system block sizes. Some devices in
> > >>>> production today and others coming soon have larger sectors and it
> > >>>> would be interesting to see if it is time to poke at this topic
> > >>>> again.
> > >>>>
> > >>> Large block support was proposed years ago by Christoph Lameter
> > >>> (http://lwn.net/Articles/232757/). I think I was just getting started
> > >>> in the community at the time so I do not recall any of the details. I do
> > >>> believe it motivated an alternative by Nick Piggin called fsblock though
> > >>> (http://lwn.net/Articles/321390/). At the very least it would be nice to
> > >>> know why neither was ever merged for those of us who were not around
> > >>> at the time and who may not have the chance to dive through mailing list
> > >>> archives between now and March.
> > >>>
> > >>> FWIW, I would expect that a show-stopper for any proposal is requiring
> > >>> high-order allocations to succeed for the system to behave correctly.
> > >>>
> > >> My memory is that Nick's work just didn't have the momentum to get
> > >> pushed in.  It all seemed very reasonable though, I think our hatred of
> > >> buffered heads just wasn't yet bigger than the fear of moving away.
> > >>
> > >> But, the bigger question is how big are the blocks going to be?  At some
> > >> point (64K?) we might as well just make a log structured dm target and
> > >> have a single setup for both shingled and large sector drives.
> > > There is no real point.  Even with 4k drives today using 4k sectors in
> > > the filesystem, we still get 512 byte writes because of journalling and
> > > the buffer cache.
> > 
> > I think that you are wrong here James. Even with 512 byte drives, the IO's we 
> > send down tend to be 4k or larger. Do you have traces that show this and details?
> 
> It's mostly an ext3 journalling issue ... and it's only metadata and
> mostly the ioschedulers can elevate it into 4k chunks, so yes, most of
> our writes are 4k+, so this is a red herring, yes.
  ext3 (like ext4) does block-level journalling, meaning that it
journals *only* full blocks. So an ext3/4 filesystem with a 4 KB block size
will never journal anything other than full 4 KB blocks. So I'm not sure
where this idea of 512-byte writes came from...
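To make the granularity concrete, here is a small illustrative sketch (plain
arithmetic, not jbd/jbd2 code) of what block-level journalling implies for
write amplification:

```python
BLOCK_SIZE = 4096  # fs block size

def journalled_bytes(offset, length, block_size=BLOCK_SIZE):
    """A change to any byte range is journalled as the full blocks
    containing it -- the journal logs whole buffers, never sub-block
    deltas."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return (last - first + 1) * block_size

# A 2-byte bitmap update still costs one full 4 KB block in the journal:
assert journalled_bytes(100, 2) == 4096
# A 10-byte change straddling a block boundary costs two full blocks:
assert journalled_bytes(4090, 10) == 8192
```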

> > Also keep in mind that larger block sizes allow us to track larger
> > files with 
> > smaller amounts of metadata which is a second win.
> 
> Larger file block sizes are completely independent from larger device
> block sizes (we can have 16k file block sizes on 4k or even 512b
> devices).  The questions on larger block size devices are twofold:
> 
>      1. If manufacturers tell us that they'll only support I/O on the
>         physical sector size, do we believe them, given that they said
>         this before on 4k and then backed down.  All the logical vs
>         physical sector stuff is now in T10 standards, why would they
>         try to go all physical again, especially as they've now all
>         written firmware that does the necessary RMW?
>      2. If we agree they'll do RMW in Firmware again, what do we have to
>         do to take advantage of larger sector sizes beyond what we
>         currently do in alignment and chunking?  There may still be
>         issues in FS journal and data layouts.
  I also believe drives will support smaller-than-blocksize writes. But
supporting a larger fs block size can still be beneficial for other reasons
(think of performance with specialized workloads: less metadata, less
fragmentation, ...). Currently ocfs2, ext4, and possibly others jump
through hoops to support allocating file data in chunks larger than the fs
block size. At first sight that should be straightforward, but if you look
at the code you find nasty corner cases that make it pretty ugly. And each
filesystem doing these large data allocations currently invents its own way
of dealing with the problems. So providing some common infrastructure for
dealing with blocks larger than the page size would definitely relieve some
pain.
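As a rough illustration of the metadata savings (simplified: real ext4 uses
variable-length extents, so this assumes the worst case of one mapping entry
per allocation unit with no contiguity between units):

```python
def mapping_entries(file_size, alloc_unit):
    """Worst-case number of mapping entries if every allocation unit
    of the file needs its own entry."""
    return -(-file_size // alloc_unit)  # ceiling division

GiB = 1 << 30
# Worst-case entries to map a 1 GiB file:
assert mapping_entries(GiB, 4096) == 262144      # 4 KB blocks
assert mapping_entries(GiB, 64 * 1024) == 16384  # 64 KB allocation chunks
```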

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR



Thread overview: 59+ messages
2013-12-20  9:30 LSF/MM 2014 Call For Proposals Mel Gorman
2014-01-06 22:20 ` [LSF/MM TOPIC] [ATTEND] persistent memory progress, management of storage & file systems Ric Wheeler
2014-01-06 22:32   ` faibish, sorin
2014-01-07 19:44     ` Joel Becker
2014-01-21  7:00 ` LSF/MM 2014 Call For Proposals Michel Lespinasse
2014-01-22  3:04 ` [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes Ric Wheeler
2014-01-22  5:20   ` Joel Becker
2014-01-22  7:14     ` Hannes Reinecke
2014-01-22  9:34   ` [Lsf-pc] " Mel Gorman
2014-01-22 14:10     ` Ric Wheeler
2014-01-22 14:34       ` Mel Gorman
2014-01-22 14:58         ` Ric Wheeler
2014-01-22 15:19           ` Mel Gorman
2014-01-22 17:02             ` Chris Mason
2014-01-22 17:21               ` James Bottomley
2014-01-22 18:02                 ` Chris Mason
2014-01-22 18:13                   ` James Bottomley
2014-01-22 18:17                     ` Ric Wheeler
2014-01-22 18:35                       ` James Bottomley
2014-01-22 18:39                         ` Ric Wheeler
2014-01-22 19:30                           ` James Bottomley
2014-01-22 19:50                             ` Andrew Morton
2014-01-22 20:13                               ` Chris Mason
2014-01-23  2:46                                 ` David Lang
2014-01-23  5:21                                   ` Theodore Ts'o
2014-01-23  8:35                               ` Dave Chinner
2014-01-23 12:55                                 ` Theodore Ts'o
2014-01-23 19:49                                   ` Dave Chinner
2014-01-23 21:21                                   ` Joel Becker
2014-01-22 20:57                             ` Martin K. Petersen
2014-01-22 18:37                     ` Chris Mason
2014-01-22 18:40                       ` Ric Wheeler
2014-01-22 18:47                       ` James Bottomley
2014-01-23 21:27                         ` Joel Becker
2014-01-23 21:34                           ` Chris Mason
2014-01-23  8:27                     ` Dave Chinner
2014-01-23 15:47                       ` James Bottomley
2014-01-23 16:44                         ` Mel Gorman
2014-01-23 19:55                           ` James Bottomley
2014-01-24 10:57                             ` Mel Gorman
2014-01-30  4:52                               ` Matthew Wilcox
2014-01-30  6:01                                 ` Dave Chinner
2014-01-30 10:50                                 ` Mel Gorman
2014-01-23 20:34                           ` Dave Chinner
2014-01-23 20:54                         ` Christoph Lameter
2014-01-23  8:24                 ` Dave Chinner
2014-01-23 20:48             ` Christoph Lameter
2014-01-22 20:47           ` Martin K. Petersen
2014-01-23  8:21         ` Dave Chinner
2014-01-22 15:14     ` Chris Mason
2014-01-22 16:03       ` James Bottomley
2014-01-22 16:45         ` Ric Wheeler
2014-01-22 17:00           ` James Bottomley
2014-01-22 21:05             ` Jan Kara [this message]
2014-01-23 20:47     ` Christoph Lameter
2014-01-24 11:09       ` Mel Gorman
2014-01-24 15:44         ` Christoph Lameter
2014-01-22 15:54   ` James Bottomley
2014-03-14  9:02 ` Update on LSF/MM [was Re: LSF/MM 2014 Call For Proposals] James Bottomley
