public inbox for linux-xfs@vger.kernel.org
* Re: file journal fadvise
       [not found]       ` <alpine.DEB.2.00.1412011122020.3471@cobra.newdream.net>
@ 2014-12-01 22:31         ` Mark Nelson
  2014-12-01 22:51           ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Mark Nelson @ 2014-12-01 22:31 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, xfs, 马建朋



On 12/01/2014 01:23 PM, Sage Weil wrote:
> On Mon, 1 Dec 2014, Mark Nelson wrote:
>> On 11/30/2014 09:26 PM, Sage Weil wrote:
>>> On Mon, 1 Dec 2014, 马建朋 wrote:
>>>> Hi Sage:
>>>>    fadvise_random only changes the file readahead, so I think it makes
>>>> no sense for XFS.
>>>> Because XFS, unlike btrfs, always writes the journal in the same place
>>>> (where it was first allocated), we can only make those blocks contiguous.
>>>
>>> I'm thinking of the OSD journal, which can be a regular file.  I guess it
>>> would probably be an allocator mode, set via a XFS_XFLAG_* flag passed to
>>> an ioctl, which makes the delayed allocation especially unconcerned with
>>> keeping blocks contiguous.  It would need to be combined with the discard
>>> ioctl so that any journal write can be allocated wherever it is most
>>> convenient (hopefully contiguous to some other write).
>>>
>>> sage
>>
>> Hi Sage,
>>
>> Could you quickly write down the steps you're thinking we'd take to implement
>> this?  I'm concerned about the amount of overhead this could cause, but I want
>> to make sure I'm thinking about it correctly, especially when trim happens and
>> what you think/expect to happen at the FS and device levels.
>
> 1- set journal_discard = true
> 2- add journal_preallocate = true config option, set it to false, and make
> the fallocate(2) call on journal create conditional on that.
> 3- test with defaults (discard = false, preallocate = true) and
> compare it to discard = true + preallocate = false (with file journal).
> 4- possibly add a call to set extsize to something small on the journal
> file.  Or give xfs some other appropriate hint, if one exists.
>
> sage
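
Step 4's extent size hint maps onto the fsxattr ioctl. A sketch, assuming the
uapi spellings from linux/fs.h (older XFS headers call the same ioctls
XFS_IOC_FSGETXATTR/FSSETXATTR); the helper name and the 64 KiB value are
illustrative, not part of the plan above:

```c
/* Set an extent-size allocation hint on the journal file.  The flag
 * and ioctl names come from linux/fs.h; the value is only a guess at
 * "something small". */
#include <linux/fs.h>      /* struct fsxattr, FS_IOC_FS*XATTR, FS_XFLAG_EXTSIZE */
#include <sys/ioctl.h>

static int set_extsize_hint(int fd, unsigned int bytes)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;
    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;   /* honour fsx_extsize below */
    fsx.fsx_extsize = bytes;              /* e.g. 64 * 1024 */
    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}
```

Note that XFS only accepts the extsize flag on a file with no allocated
extents, so this would have to run at journal create time, before the first
write.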

CCing XFS devel so we can get some feedback from those guys too.

Question:  Looking through our discard code in common/blkdev.cc, it 
looks like the new discard implementation is using blkdiscard.  For 
co-located journals should we be using fstrim_range?
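
For reference, the two interfaces differ in scope. A sketch (helper names are
illustrative): BLKDISCARD discards device sectors directly and bypasses the
filesystem, while FITRIM takes a struct fstrim_range and trims only the
filesystem's *free* space:

```c
/* The two discard interfaces being compared.  BLKDISCARD acts on a raw
 * block device; FITRIM acts on a mounted filesystem and only trims
 * unallocated space, so it never touches blocks still owned by a
 * file-backed journal. */
#include <linux/fs.h>      /* BLKDISCARD, FITRIM, struct fstrim_range */
#include <stdint.h>
#include <sys/ioctl.h>

static int discard_blockdev(int fd, uint64_t offset, uint64_t len)
{
    uint64_t range[2] = { offset, len };   /* byte offset, byte length */
    return ioctl(fd, BLKDISCARD, &range);
}

static int trim_free_space(int fd, uint64_t start, uint64_t len)
{
    struct fstrim_range r = { .start = start, .len = len, .minlen = 0 };
    return ioctl(fd, FITRIM, &r);          /* fd: any file/dir on the fs */
}
```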

FWIW there were some performance tests done quite a while ago:

http://people.redhat.com/lczerner/discard/files/Performance_evaluation_of_Linux_DIscard_support_Dev_Con2011_Brno.pdf

>
>>
>> Mark
>>
>>>
>>>
>>>>
>>>> Thanks!
>>>> Jianpeng
>>>>
>>>> 2014-12-01 2:46 GMT+08:00 Sage Weil <sweil@redhat.com>:
>>>>> Currently, when an OSD journal is stored as a file, we preallocate it as
>>>>> a
>>>>> large contiguous extent.  That means that for every journal write we're
>>>>> seeking back to wherever the journal is.  That's possibly not ideal for
>>>>> writes.  For reads it's great, but that's the last thing we care about
>>>>> optimizing (we only read the journal after a failure, which is very
>>>>> rare).
>>>>>
>>>>> I wonder if we would do better if we:
>>>>>
>>>>>    1- trim/discard the old journal contents,
>>>>>    2- posix_fadvise RANDOM
>>>>>
>>>>> I'm not sure what the XFS behavior is in this case, but ideally it seems
>>>>> what we want it to do is write the journal wherever on disk it is most
>>>>> convenient... ideally contiguous with some other write that it is
>>>>> already
>>>>> doing.  If fadvise random doesn't do that, perhaps there is another
>>>>> allocator hint we can give it that will get us that behavior...
>>>>>
>>>>> sage
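
In call form, those two steps would look something like the sketch below
(the helper name is illustrative; the file-backed "trim" is expressed as a
hole punch, and on Linux POSIX_FADV_RANDOM currently only disables
readahead, as noted elsewhere in this thread):

```c
/* Steps 1 and 2 above as calls: punch out the old journal contents,
 * then hint random access. */
#define _GNU_SOURCE
#include <fcntl.h>

static int reset_journal(int fd, off_t journal_bytes)
{
    /* 1- trim/discard the old journal contents (file-backed case) */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, journal_bytes) < 0)
        return -1;
    /* 2- posix_fadvise RANDOM (returns an errno value, 0 on success) */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
}
```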
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>
>>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: file journal fadvise
  2014-12-01 22:31         ` file journal fadvise Mark Nelson
@ 2014-12-01 22:51           ` Dave Chinner
  2014-12-02  0:12             ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2014-12-01 22:51 UTC (permalink / raw)
  To: mnelson; +Cc: Sage Weil, ceph-devel, 马建朋, xfs

On Mon, Dec 01, 2014 at 04:31:18PM -0600, Mark Nelson wrote:
> 
> 
> On 12/01/2014 01:23 PM, Sage Weil wrote:
> >On Mon, 1 Dec 2014, Mark Nelson wrote:
> >>On 11/30/2014 09:26 PM, Sage Weil wrote:
> >>>On Mon, 1 Dec 2014, 马建朋 wrote:
> >>>>Hi Sage:
> >>>>   fadvise_random only changes the file readahead, so I think it makes
> >>>>no sense for XFS.
> >>>>Because XFS, unlike btrfs, always writes the journal in the same place
> >>>>(where it was first allocated), we can only make those blocks contiguous.
> >>>
> >>>I'm thinking of the OSD journal, which can be a regular file.  I guess it
> >>>would probably be an allocator mode, set via a XFS_XFLAG_* flag passed to
> >>>an ioctl, which makes the delayed allocation especially unconcerned with
> >>>keeping blocks contiguous.  It would need to be combined with the discard
> >>>ioctl so that any journal write can be allocated wherever it is most
> >>>convenient (hopefully contiguous to some other write).
> >>>
> >>>sage
> >>
> >>Hi Sage,
> >>
> >>Could you quickly write down the steps you're thinking we'd take to implement
> >>this?  I'm concerned about the amount of overhead this could cause, but I want
> >>to make sure I'm thinking about it correctly, especially when trim happens and
> >>what you think/expect to happen at the FS and device levels.
> >
> >1- set journal_discard = true
> >2- add journal_preallocate = true config option, set it to false, and make
> >the fallocate(2) call on journal create conditional on that.
> >3- test with defaults (discard = false, preallocate = true) and
> >compare it to discard = true + preallocate = false (with file journal).
> >4- possibly add a call to set extsize to something small on the journal
> >file.  Or give xfs some other appropriate hint, if one exists.

What behaviour are you wanting for a journal file? It sounds like
you want it to behave like a wandering log: automatically allocating
its next block wherever the previous write of any kind occurred?

We can't actually do that in XFS - we have no idea where the last
write IO occurred because that's several layers down the IO stack.
We could store where the last allocation was, but that doesn't
guarantee we can allocate another block contiguously to that. Even
if we do, that then fragments whatever file the journal block now
sits adjacent to.

The other issue is that block allocation is divided up into
allocation groups, and allocation is mostly siloed to avoid randomly
allocating a file into different AGs. Just randomly allocating
blocks to a file is the polar opposite of everything the XFS
allocation strategies do, hence a bit more clarity on what the
overall goal is would be helpful. ;)

> >
> >sage
> 
> CCing XFS devel so we can get some feedback from those guys too.
> 
> Question:  Looking through our discard code in common/blkdev.cc, it
> looks like the new discard implementation is using blkdiscard.  For
> co-located journals should we be using fstrim_range?

If you are talking about journals hosted in files on a filesystem,
then discard is the wrong operation to be performing. Discard/trim
operates solely on free filesystem space, and you have to free the
space from the file before you can discard it. To free the space
from the file you need to punch a hole in it. i.e. you need to use
fallocate(FALLOC_FL_PUNCH_HOLE).
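
A sketch of that call for a file-backed journal (the helper name and range
values are illustrative):

```c
/* Free previously written journal space back to the filesystem by
 * punching a hole.  PUNCH_HOLE must be OR'd with KEEP_SIZE: the file
 * length is unchanged, but the blocks in [offset, offset+len) are
 * deallocated and become free space the fs can later discard. */
#define _GNU_SOURCE
#include <fcntl.h>

static int journal_free_range(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```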

> FWIW there were some performance tests done quite a while ago:
> 
> http://people.redhat.com/lczerner/discard/files/Performance_evaluation_of_Linux_DIscard_support_Dev_Con2011_Brno.pdf

Quite frankly, you do not want to use realtime discard - it has too
many performance issues associated with it, not to mention there are
randomly broken firmwares out there that don't handle high volumes
or frequent discard operations at all well (i.e. the devices hang
and/or trash the wrong data).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: file journal fadvise
  2014-12-01 22:51           ` Dave Chinner
@ 2014-12-02  0:12             ` Sage Weil
  2014-12-02  0:32               ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2014-12-02  0:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ceph-devel, 马建朋, mnelson, xfs

On Tue, 2 Dec 2014, Dave Chinner wrote:
> On Mon, Dec 01, 2014 at 04:31:18PM -0600, Mark Nelson wrote:
> > 
> > 
> > On 12/01/2014 01:23 PM, Sage Weil wrote:
> > >On Mon, 1 Dec 2014, Mark Nelson wrote:
> > >>On 11/30/2014 09:26 PM, Sage Weil wrote:
> > >>>On Mon, 1 Dec 2014, 马建朋 wrote:
> > >>>>Hi Sage:
> > >>>>   fadvise_random only changes the file readahead, so I think it makes
> > >>>>no sense for XFS.
> > >>>>Because XFS, unlike btrfs, always writes the journal in the same place
> > >>>>(where it was first allocated), we can only make those blocks contiguous.
> > >>>
> > >>>I'm thinking of the OSD journal, which can be a regular file.  I guess it
> > >>>would probably be an allocator mode, set via a XFS_XFLAG_* flag passed to
> > >>>an ioctl, which makes the delayed allocation especially unconcerned with
> > >>>keeping blocks contiguous.  It would need to be combined with the discard
> > >>>ioctl so that any journal write can be allocated wherever it is most
> > >>>convenient (hopefully contiguous to some other write).
> > >>>
> > >>>sage
> > >>
> > >>Hi Sage,
> > >>
> > >>Could you quickly write down the steps you're thinking we'd take to implement
> > >>this?  I'm concerned about the amount of overhead this could cause, but I want
> > >>to make sure I'm thinking about it correctly, especially when trim happens and
> > >>what you think/expect to happen at the FS and device levels.
> > >
> > >1- set journal_discard = true
> > >2- add journal_preallocate = true config option, set it to false, and make
> > >the fallocate(2) call on journal create conditional on that.
> > >3- test with defaults (discard = false, preallocate = true) and
> > >compare it to discard = true + preallocate = false (with file journal).
> > >4- possibly add a call to set extsize to something small on the journal
> > >file.  Or give xfs some other appropriate hint, if one exists.
> 
> What behaviour are you wanting for a journal file? It sounds like
> you want it to behave like a wandering log: automatically allocating
> its next block wherever the previous write of any kind occurred?

Precisely.  Well, as long as it is adjacent to *some* other scheduled 
write, it would save us a seek.  The real question, I guess, is whether 
there is an XFS allocation mode that makes no attempt to avoid 
fragmentation for the file and that chooses something adjacent to other 
small, newly-written data during delayed allocation.

> We can't actually do that in XFS - we have no idea where the last
> write IO occurred because that's several layers down the IO stack.
> We could store where the last allocation was, but that doesn't
> guarantee we can allocate another block contiguously to that. Even
> if we do, that then fragments whatever file the journal block now
> sits adjacent to.
> 
> The other issue is that block allocation is divided up into
> allocation groups, and allocation is mostly siloed to avoid randomly
> allocating a file into different AGs. Just randomly allocating
> blocks to a file is the polar opposite of everything the XFS
> allocation strategies do, hence a bit more clarity on what the
> overall goal is would be helpful. ;)

It's a circular file, usually a few GB in size, written sequentially with 
a range of small to large (block-aligned) write sizes, and (for all 
intents and purposes) is never read.  We periodically overwrite the first 
block with recent start and end pointers and other metadata.

> > CCing XFS devel so we can get some feedback from those guys too.
> > 
> > Question:  Looking through our discard code in common/blkdev.cc, it
> > looks like the new discard implementation is using blkdiscard.  For
> > co-located journals should we be using fstrim_range?
> 
> If you are talking about journals hosted in files on a filesystem,
> then discard is the wrong operation to be performing. Discard/trim
> operates solely on free filesystem space, and you have to free the
> space from the file before you can discard it. To free the space
> from the file you need to punch a hole in it. i.e. you need to use
> fallocate(FALLOC_FL_PUNCH_HOLE).

Yeah.  Right now it uses the BLKDISCARD ioctl if the fd references a 
block device and the option is enabled; it needs to use fallocate in the 
file case.
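
That dispatch might look like the following (a sketch, not the actual
common/blkdev.cc code):

```c
/* Choose the right "discard" operation for the journal fd: BLKDISCARD
 * for a raw block device, hole-punching for a regular file. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>      /* BLKDISCARD */
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/stat.h>

static int journal_discard(int fd, uint64_t offset, uint64_t len)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    if (S_ISBLK(st.st_mode)) {
        uint64_t range[2] = { offset, len };
        return ioctl(fd, BLKDISCARD, &range);
    }
    /* Regular file: free the blocks back to the filesystem instead. */
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     (off_t)offset, (off_t)len);
}
```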

This may still have some minor value in the btrfs case because we are 
doing the deallocation work at trim time instead of overwrite time.  
We'll get the wandering log behavior more or less for free just by 
disabling the initial fallocate call since that's how allocation works in 
general.

Thanks, Dave!
sage



* Re: file journal fadvise
  2014-12-02  0:12             ` Sage Weil
@ 2014-12-02  0:32               ` Dave Chinner
  2014-12-02  1:24                 ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2014-12-02  0:32 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, 马建朋, mnelson, xfs

On Mon, Dec 01, 2014 at 04:12:03PM -0800, Sage Weil wrote:
> On Tue, 2 Dec 2014, Dave Chinner wrote:
> > On Mon, Dec 01, 2014 at 04:31:18PM -0600, Mark Nelson wrote:
> > > 
> > > 
> > > On 12/01/2014 01:23 PM, Sage Weil wrote:
> > > >On Mon, 1 Dec 2014, Mark Nelson wrote:
> > > >>On 11/30/2014 09:26 PM, Sage Weil wrote:
> > > >>>On Mon, 1 Dec 2014, 马建朋 wrote:
> > > >>>>Hi Sage:
> > > >>>>   fadvise_random only changes the file readahead, so I think it makes
> > > >>>>no sense for XFS.
> > > >>>>Because XFS, unlike btrfs, always writes the journal in the same place
> > > >>>>(where it was first allocated), we can only make those blocks contiguous.
> > > >>>
> > > >>>I'm thinking of the OSD journal, which can be a regular file.  I guess it
> > > >>>would probably be an allocator mode, set via a XFS_XFLAG_* flag passed to
> > > >>>an ioctl, which makes the delayed allocation especially unconcerned with
> > > >>>keeping blocks contiguous.  It would need to be combined with the discard
> > > >>>ioctl so that any journal write can be allocated wherever it is most
> > > >>>convenient (hopefully contiguous to some other write).
> > > >>>
> > > >>>sage
> > > >>
> > > >>Hi Sage,
> > > >>
> > > >>Could you quickly write down the steps you're thinking we'd take to implement
> > > >>this?  I'm concerned about the amount of overhead this could cause, but I want
> > > >>to make sure I'm thinking about it correctly, especially when trim happens and
> > > >>what you think/expect to happen at the FS and device levels.
> > > >
> > > >1- set journal_discard = true
> > > >2- add journal_preallocate = true config option, set it to false, and make
> > > >the fallocate(2) call on journal create conditional on that.
> > > >3- test with defaults (discard = false, preallocate = true) and
> > > >compare it to discard = true + preallocate = false (with file journal).
> > > >4- possibly add a call to set extsize to something small on the journal
> > > >file.  Or give xfs some other appropriate hint, if one exists.
> > 
> > What behaviour are you wanting for a journal file? It sounds like
> > you want it to behave like a wandering log: automatically allocating
> > its next block wherever the previous write of any kind occurred?
> 
> Precisely.  Well, as long as it is adjacent to *some* other scheduled 
> write, it would save us a seek.  The real question, I guess, is whether 
> there is an XFS allocation mode that makes no attempt to avoid 
> fragmentation for the file and that chooses something adjacent to other 
> small, newly-written data during delayed allocation.

Ok, so what is the most common underlying storage you need to
optimise for? Is it raid5/6 where a small write will trigger a
larger RMW cycle and so proximity rather than exact adjacency
matters, or is it raid 0/1/jbod where exact adjacency is the only
way to avoid a seek?

I suspect that we can play certain tricks to trigger unaligned,
discontiguous allocation (i.e. no target allocation block), but the
question is whether we can determine sufficient
allocation/writeback context to enable delayed allocation to make
sensible "next written block" decisions.

> > We can't actually do that in XFS - we have no idea where the last
> > write IO occurred because that's several layers down the IO stack.
> > We could store where the last allocation was, but that doesn't
> > guarantee we can allocate another block contiguously to that. Even
> > if we do, that then fragments whatever file the journal block now
> > sits adjacent to.
> > 
> > The other issue is that block allocation is divided up into
> > allocation groups, and allocation is mostly siloed to avoid randomly
> > allocating a file into different AGs. Just randomly allocating
> > blocks to a file is the polar opposite of everything the XFS
> > allocation strategies do, hence a bit more clarity on what the
> > overall goal is would be helpful. ;)
> 
> It's a circular file, usually a few GB in size, written sequentially with 
> a range of small to large (block-aligned) write sizes, and (for all 
> intents and purposes) is never read.  We periodically overwrite the first 
> block with recent start and end pointers and other metadata.

Ok, so it's just another typical WAL file. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: file journal fadvise
  2014-12-02  0:32               ` Dave Chinner
@ 2014-12-02  1:24                 ` Sage Weil
  2014-12-02  2:01                   ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2014-12-02  1:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ceph-devel, 马建朋, mnelson, xfs

On Tue, 2 Dec 2014, Dave Chinner wrote:
> On Mon, Dec 01, 2014 at 04:12:03PM -0800, Sage Weil wrote:
> > On Tue, 2 Dec 2014, Dave Chinner wrote:
> > > What behaviour are you wanting for a journal file? It sounds like
> > > you want it to behave like a wandering log: automatically allocating
> > > its next block wherever the previous write of any kind occurred?
> > 
> > Precisely.  Well, as long as it is adjacent to *some* other scheduled 
> > write, it would save us a seek.  The real question, I guess, is whether 
> > there is an XFS allocation mode that makes no attempt to avoid 
> > fragmentation for the file and that chooses something adjacent to other 
> > small, newly-written data during delayed allocation.
> 
> Ok, so what is the most common underlying storage you need to
> optimise for? Is it raid5/6 where a small write will trigger a
> larger RMW cycle and so proximity rather than exact adjacency
> matters, or is it raid 0/1/jbod where exact adjacency is the only
> way to avoid a seek?

The common case is a single raw disk.

> I suspect that we can play certain tricks to trigger unaligned,
> discontiguous allocation (i.e. no target allocation block), but the
> question is whether we can determine sufficient
> allocation/writeback context to enable delayed allocation to make
> sensible "next written block" decisions.

Yeah.

> > It's a circular file, usually a few GB in size, written sequentially with 
> > a range of small to large (block-aligned) write sizes, and (for all 
> > intents and purposes) is never read.  We periodically overwrite the first 
> > block with recent start and end pointers and other metadata.
> 
> Ok, so it's just another typical WAL file. ;)

Nothing to lose sleep over if this mode doesn't already exist, but I 
expect a fair number of applications could make use of this.

FWIW, while I am already distracting you from useful things, I suspect 
(batched) aio_fsync would be a bigger win for us and probably a smaller 
investment of effort.  :)

sage



* Re: file journal fadvise
  2014-12-02  1:24                 ` Sage Weil
@ 2014-12-02  2:01                   ` Dave Chinner
  0 siblings, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2014-12-02  2:01 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, 马建朋, mnelson, xfs

On Mon, Dec 01, 2014 at 05:24:46PM -0800, Sage Weil wrote:
> On Tue, 2 Dec 2014, Dave Chinner wrote:
> > On Mon, Dec 01, 2014 at 04:12:03PM -0800, Sage Weil wrote:
> > > On Tue, 2 Dec 2014, Dave Chinner wrote:
> > > > What behaviour are you wanting for a journal file? It sounds like
> > > > you want it to behave like a wandering log: automatically allocating
> > > > its next block wherever the previous write of any kind occurred?
> > > 
> > > Precisely.  Well, as long as it is adjacent to *some* other scheduled 
> > > write, it would save us a seek.  The real question, I guess, is whether 
> > > there is an XFS allocation mode that makes no attempt to avoid 
> > > fragmentation for the file and that chooses something adjacent to other 
> > > small, newly-written data during delayed allocation.
> > 
> > Ok, so what is the most common underlying storage you need to
> > optimise for? Is it raid5/6 where a small write will trigger a
> > larger RMW cycle and so proximity rather than exact adjacency
> > matters, or is it raid 0/1/jbod where exact adjacency is the only
> > way to avoid a seek?
> 
> The common case is a single raw disk.

Ok, so it's an exact match that is really required. I'll have a
think about it.

> > > It's a circular file, usually a few GB in size, written sequentially with 
> > > a range of small to large (block-aligned) write sizes, and (for all 
> > > intents and purposes) is never read.  We periodically overwrite the first 
> > > block with recent start and end pointers and other metadata.
> > 
> > Ok, so it's just another typical WAL file. ;)
> 
> Nothing to lose sleep over if this mode doesn't already exist, but I 
> expect a fair number of applications could make use of this.
> 
> FWIW, while I am already distracting you from useful things, I suspect 
> (batched) aio_fsync would be a bigger win for us and probably a smaller 
> investment of effort.  :)

If you want to test a patch that implements a basic, simple
implementation of aio_fsync:

http://oss.sgi.com/archives/xfs/2014-06/msg00214.html
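
For reference, the userspace shape of the call under discussion, via POSIX
AIO (the linked patch is the kernel side; the helper names here are
illustrative):

```c
/* Queue an asynchronous fsync and later wait for its completion. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>      /* O_SYNC */
#include <string.h>

static int queue_fsync(struct aiocb *cb, int fd)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    return aio_fsync(O_SYNC, cb);      /* returns once queued */
}

static int wait_fsync(struct aiocb *cb)
{
    const struct aiocb *list[1] = { cb };

    while (aio_error(cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);    /* block until completion */
    return aio_return(cb);             /* 0 on success */
}
```

Queuing several of these before waiting on the batch is where the win Sage
mentions would come from.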

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



end of thread, other threads:[~2014-12-02  2:02 UTC | newest]

Thread overview: 6+ messages
-- links below jump to the message on this page --
     [not found] <alpine.DEB.2.00.1411301013490.352@cobra.newdream.net>
     [not found] ` <CALurOm2tEV=RqN21eFJvfU1zTtJkbz2gHDCk_Ntsy4oz9iwHoA@mail.gmail.com>
     [not found]   ` <alpine.DEB.2.00.1411301922220.352@cobra.newdream.net>
     [not found]     ` <547CBEFA.3000204@redhat.com>
     [not found]       ` <alpine.DEB.2.00.1412011122020.3471@cobra.newdream.net>
2014-12-01 22:31         ` file journal fadvise Mark Nelson
2014-12-01 22:51           ` Dave Chinner
2014-12-02  0:12             ` Sage Weil
2014-12-02  0:32               ` Dave Chinner
2014-12-02  1:24                 ` Sage Weil
2014-12-02  2:01                   ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox