From: Chris Mason <clm@fb.com>
To: "jlbec@evilplan.org" <jlbec@evilplan.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-ide@vger.kernel.org" <linux-ide@vger.kernel.org>,
"lsf-pc@lists.linux-foundation.org"
<lsf-pc@lists.linux-foundation.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
"rwheeler@redhat.com" <rwheeler@redhat.com>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"James.Bottomley@HansenPartnership.com"
<James.Bottomley@HansenPartnership.com>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"mgorman@suse.de" <mgorman@suse.de>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
Date: Thu, 23 Jan 2014 21:34:08 +0000 [thread overview]
Message-ID: <1390512936.1198.76.camel@ret.masoncoding.com> (raw)
In-Reply-To: <20140123212714.GB25376@localhost>
On Thu, 2014-01-23 at 13:27 -0800, Joel Becker wrote:
> On Wed, Jan 22, 2014 at 10:47:01AM -0800, James Bottomley wrote:
> > On Wed, 2014-01-22 at 18:37 +0000, Chris Mason wrote:
> > > On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote:
> > > > On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:
> > [agreement cut because it's boring for the reader]
> > > > Realistically, if you look at what the I/O schedulers output on a
> > > > standard (spinning rust) workload, it's mostly large transfers.
> > > > Obviously these are misaligned at the ends, but we can fix some of that
> > > > in the scheduler. Particularly if the FS helps us with layout. My
> > > > instinct tells me that we can fix 99% of this with layout on the FS + io
> > > > schedulers ... the remaining 1% goes to the drive as needing to do RMW
> > > > in the device, but the net impact to our throughput shouldn't be that
> > > > great.
> > >
> > > There are a few workloads where the VM and the FS would team up to make
> > > this fairly miserable:
> > >
> > > Small files. Delayed allocation fixes a lot of this, but the VM doesn't
> > > realize that fileA, fileB, fileC, and fileD all need to be written at
> > > the same time to avoid RMW. Btrfs and MD have set up plugging callbacks
> > > to accumulate full stripes as much as possible, but it still hurts.
> > >
> > > Metadata. These writes are very latency sensitive and we'll gain a lot
> > > if the FS is explicitly trying to build full sector IOs.
> >
> > OK, so these two cases I buy ... the question is can we do something
> > about them today without increasing the block size?
> >
> > The metadata problem, in particular, might be block independent: we
> > still have a lot of small chunks to write out at fractured locations.
> > With a large block size, the FS knows it's been bad and can expect the
> > rolled up newspaper, but it's not clear what it could do about it.
> >
> > The small files issue looks like something we should be tackling today
> > since writing out adjacent files would actually help us get bigger
> > transfers.
>
> ocfs2 can actually take significant advantage here, because we store
> small file data in-inode. This would grow our in-inode size from ~3K to
> ~15K or ~63K. We'd actually have to do more work to start putting more
> than one inode in a block (though that would be a promising avenue too,
> once the coordination is solved generically).
Btrfs already defaults to 16K metadata and can go as high as 64K. The
part we don't do is multi-page sectors for data blocks.
I'd tend to leverage the read/modify/write engine from the raid code for
that.
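To make the RMW cost being discussed concrete, here is a minimal sketch of the arithmetic, not kernel or MD code: with a physical sector larger than the write, any write that only partially covers a sector forces a read of that sector first. The `SECTOR` size and function name are illustrative assumptions.

```python
# Hypothetical sketch of device-side read-modify-write (RMW) for a
# write that is smaller than, or misaligned with, the physical sector.
# Illustrates the arithmetic only; this is not the MD/raid engine.

SECTOR = 16 * 1024  # assume a 16K physical sector for illustration

def rmw_io_counts(offset: int, length: int) -> dict:
    """Count sector reads/writes needed to service one logical write."""
    first = offset // SECTOR                 # first sector touched
    last = (offset + length - 1) // SECTOR   # last sector touched
    sectors = last - first + 1
    # A sector needs a preliminary read only if the write covers it
    # partially: a misaligned head, a misaligned tail, or both.
    head_partial = offset % SECTOR != 0
    tail_partial = (offset + length) % SECTOR != 0
    if sectors == 1:
        reads = 1 if (head_partial or tail_partial) else 0
    else:
        reads = int(head_partial) + int(tail_partial)
    return {"sectors_written": sectors, "sectors_read": reads}

# An aligned full-sector write needs no read at all...
print(rmw_io_counts(0, SECTOR))    # {'sectors_written': 1, 'sectors_read': 0}
# ...while a 4K write inside a 16K sector pays a sector read first.
print(rmw_io_counts(4096, 4096))   # {'sectors_written': 1, 'sectors_read': 1}
```

This is also why accumulating adjacent small writes into full sectors (or full stripes) helps: only the misaligned ends of a large merged transfer pay the read penalty.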
-chris
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .