question on new feature complexity/possibility/sensibility? (^ Alternate)

public inbox for linux-xfs@vger.kernel.org

* question on new feature complexity/possibility/sensibility? (^ Alternate)
@ 2011-06-26  0:29 Linda A. Walsh
  2011-06-27  0:32 ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Linda A. Walsh @ 2011-06-26  0:29 UTC (permalink / raw)
  To: xfs-oss

I noticed that the 'cp' command (coreutils 8.9-4.1) on suse has a
"--reflink" option that controls "clone/CoW" copies -- which says
it performs a 'lightweight' copy where the data blocks are copied
only when modified.  Now the 'when modified' part is vague (i.e.
does it mean blocks that differ between the two copies, src
and dst, or does it mean only blocks that were modified
since 'some point') -- it doesn't say, but I would guess it's src/dst diffs
(I wonder if it is restricted to the same physical filesystem).

Anyway, turns out, it's only for BTRFS (which I haven't yet used, 
and therefore know only that it supports operations like the above).  
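
[Editorial sketch, not part of the original message: a clone request of
the kind "cp --reflink" makes can be illustrated roughly as below. It
assumes a Linux system; the FICLONE request number is hard-coded as an
assumption, and clone_or_copy is a hypothetical helper name. On a
filesystem without clone support, or across two filesystems, the ioctl
is expected to fail and the sketch falls back to an ordinary copy.]

```python
import fcntl
import shutil

# Assumption: Linux FICLONE request number, i.e. _IOW(0x94, 9, int).
FICLONE = 0x40049409


def clone_or_copy(src_path, dst_path):
    """Try to share src's data blocks with dst; fall back to a full copy.

    Returns "clone" if the filesystem accepted the clone request,
    otherwise "copy".
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # Ask the filesystem to make dst reference src's blocks.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "clone"
        except OSError:
            # Not supported here (or src/dst are on different filesystems).
            pass
    shutil.copyfile(src_path, dst_path)
    return "copy"
```

Either way the destination reads back identical to the source; the only
difference is whether blocks are shared until one side is modified.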

Would it be practical to implement some similar feature in XFS?

It wouldn't be practical or useful to do it on an 'extent' basis due
to their large size...So to do something similar on XFS, I was
thinking,  with "some amount of effort", some number of 
"updated extents" could be kept, in addition to the original data.

Any future modifications to the file would also have the extents 
modified, but any extents that overlap previous mods will be 
merged, and only the newest data would be kept (meaning that
new sections that are written, that skip over parts of the 
file, wouldn't overwrite a pending change to that section -- only
the bytes (granularity?) that were changed).

I.e. file is 1MB.
User1 updates bytes 1k-200k.
User2 later updates bytes 100k-300k.
New modification 'extent' is created with 1k-300k, with bytes
1k-(100k-1) from user1 being saved, and 100k-300k from user2.
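
[Editorial sketch, not part of the original message: the merge rule in
the example above (newer writes win where ranges overlap, older bytes
survive elsewhere) can be modelled as below. Ranges are half-open byte
offsets, "k" is taken to be 1024, and the names apply_write and
merged_spans are hypothetical.]

```python
K = 1024


def apply_write(pieces, start, end, writer):
    """Overlay a write on [start, end); newer data replaces older overlap.

    pieces is a list of (start, end, writer) tuples that do not overlap.
    """
    result = []
    for s, e, w in pieces:
        if e <= start or s >= end:
            result.append((s, e, w))          # untouched by the new write
            continue
        if s < start:
            result.append((s, start, w))      # older bytes before the write
        if e > end:
            result.append((end, e, w))        # older bytes after the write
    result.append((start, end, writer))
    return sorted(result)


def merged_spans(pieces):
    """Coalesce touching or overlapping pieces into modification 'extents'."""
    spans = []
    for s, e, _ in sorted(pieces):
        if spans and s <= spans[-1][1]:
            spans[-1] = (spans[-1][0], max(spans[-1][1], e))
        else:
            spans.append((s, e))
    return spans


pieces = apply_write([], 1 * K, 200 * K, "user1")
pieces = apply_write(pieces, 100 * K, 300 * K, "user2")
print(pieces)
print(merged_spans(pieces))
```

With the two writes from the example, user1 keeps 1k up to 100k, user2
owns 100k up to 300k, and the pieces coalesce into one 1k-300k span.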

Changes to the 'base' copy would be made upon some ioctl 'sync'
command (file-by-file)...  

It would require up to double the amount of file space.

----
Another possibility would simply be to create a record of byte
ranges that have been updated in the extent and the extent's last
modification time.  Then one could compare the mod times and 
apply the changes.   The problem there would be having to keep a
possibly 'large' log of changes (what if it's not sync/purged...
couldn't be circular as that would allow events to be lost -- though
the file system could be forced 'offline' if the event log became full
...a major pain...)..., but if it was created with a few G of space, 
might take a while...and if synced in time, no prob.

Still, may be no great desire or benefit, but DAMN if I haven't
wanted copy-on-write files for a LONG time.

I.e. being able to hardlink files, but have an option to mark it as
copy on write -- allowing space to be saved when copying directory trees,
but then dynamically making new copies when someone updates one of the
linked copies.

Maybe that's a different feature that could be more easily done?

Anyway -- the new copy option just got me thinking....

Any ideas or work in thinking about things in this area for XFS to
keep it updated and 'current' as a "Filesystem of Choice"....?

Thanks,
Linda

so if a user wants to modify it -- it 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: question on new feature complexity/possibility/sensibility? (^ Alternate)
  2011-06-26  0:29 question on new feature complexity/possibility/sensibility? (^ Alternate) Linda A. Walsh
@ 2011-06-27  0:32 ` Dave Chinner
  2011-06-27  4:08   ` Linda Walsh
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2011-06-27  0:32 UTC (permalink / raw)
  To: Linda A. Walsh; +Cc: xfs-oss

On Sat, Jun 25, 2011 at 05:29:53PM -0700, Linda A. Walsh wrote:
> I noticed in the 'cp' (coreutils 8.9-4.1) command on suse, there is
> a "--reflink" that controls "clone/CoW" copies -- which says
> it performs a 'lightweight' copy where the data blocks are copied
> only when modified.  Now it is vague the 'when modified', (i.e.
> does it mean ones that are different between the two copies) (src
> and dst), or does it mean only to copy blocks that were modified
> since 'some point' -- Doesn't say, but would guess it's src/dst diffs
> (I wonder if it is restricted to the same physical filesystem).
> 
> Anyway, turns out, it's only for BTRFS (which I haven't yet used,
> and therefore know only that it supports operations like the above).

Yup, it requires a refcounted, shared, copy-on-write extent index to
do efficiently.

> Would it be practical to implement some similar feature in XFS?

It could be done, but it's a fairly large chunk of work....

> It wouldn't be practical or useful to do it on an 'extent' basis due
> to their large size...So to do something similar on XFS, I was
> thinking,  with "some amount of effort", some number of "updated
> extents" could be kept, in addition to the original data.

Kept where, exactly? And how do you share the original extent tree
between multiple inodes? And if all the inodes that share the
original tree get truncated, how do you know that you can break the
ref-link state and remove the original, now unreferenced tree?

Once you have solved those problems, you have effectively designed a
refcounted, shared, copy-on-write extent index....
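
[Editorial sketch, not part of the original message: a toy model of what
"refcounted, shared, copy-on-write" means for the three questions above.
It is not the XFS design described later in this mail; ExtentStore and
Inode are hypothetical names, and a dict stands in for any on-disk
btree. Each inode keeps its own block map, shared blocks carry a
reference count, a write to a shared block allocates a private copy,
and a block is freed when its last reference goes away.]

```python
class ExtentStore:
    """Toy block store: block id -> [refcount, data]."""

    def __init__(self):
        self.blocks = {}
        self.next_id = 0

    def alloc(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = [1, data]
        return bid

    def ref(self, bid):
        self.blocks[bid][0] += 1

    def unref(self, bid):
        self.blocks[bid][0] -= 1
        if self.blocks[bid][0] == 0:
            del self.blocks[bid]              # last owner gone: free it


class Inode:
    """Each file has its own inode (owner, block map) but may share blocks."""

    def __init__(self, store, owner):
        self.store = store
        self.owner = owner
        self.map = []                         # list of block ids

    def append(self, data):
        self.map.append(self.store.alloc(data))

    def read(self, index):
        return self.store.blocks[self.map[index]][1]

    def reflink(self, owner):
        clone = Inode(self.store, owner)
        for bid in self.map:
            self.store.ref(bid)               # share, do not copy
            clone.map.append(bid)
        return clone

    def write(self, index, data):
        bid = self.map[index]
        if self.store.blocks[bid][0] > 1:
            self.store.unref(bid)             # shared: copy on write
            self.map[index] = self.store.alloc(data)
        else:
            self.store.blocks[bid][1] = data  # sole owner: write in place

    def truncate(self):
        for bid in self.map:
            self.store.unref(bid)
        self.map = []
```

In this model, truncating every file that shares a block drops its count
to zero, which is how the store knows the original data is unreferenced.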

> Any future modifications to the file would also have the extents
> modified, but any extents that overlap previous mods will be merged,
> and only the newest data would be kept (meaning that
> new sections that are written, that skip over parts of the file,
> wouldn't overwrite a pending change to that section -- only
> the bytes (granularity?) that were changed).
> 
> I.e. file is 1MB.
> User1 updates bytes 1k-200k.
> User2 later updates bytes 100k-300k. New modification 'extent' is
> created with 1k-300k, with bytes
> 1k-(100k-1) from user1 being saved, and 100k-300k from user2.
> 
> Changes to the 'base' copy would be made upon some ioctl 'sync'
> command (file-by-file)...
> 
> It would require up to double the amount of file space.

For a single reflink copy, yes. But there's nothing stopping you
from having multiple ref-link copies of the one file. And so the
problem is far more complex than you are considering.

I've looked at what it would require to implement reflinks
transparently in XFS, and it's not pretty.  Major surgery to the
bmap code, a new btree type that includes back pointers to all the
owner inodes, a new shadow inode type that holds the original tree,
a new reflink inode type that contains the overwrite extent tree
instead of a normal extent tree, a bunch of new transactions, new
extent lookup/seek code, etc.

I'd estimate it to be a 6 month project for someone who knew what
they were doing. It's not just kernel code, but all the userspace
tools need to be updated to understand reflinks and the COW based
format (repair, check, db, bmap, etc)

FWIW, I haven't even looked at how extended attributes are
supposed to be handled on reflinked files, so that could increase
the complexity significantly.

> ----
> Another possibility would simply be to create a record of byte
> ranges that have been updated in the extent and the extent's last
> modification time.  Then one could compare the mod times and apply
> the changes.   The problem there would be having to keep a
> possibly 'large' log of changes (what if it's not sync/purged...
> couldn't be circular as that would allow events to be lost -- though
> the file system could be forced 'offline' if the event log became full
> ...a major pain...)..., but if it was created with a few G of space,
> might take a while...and if synced in time, no prob.
> 
> Still, may be no great desire or benefit, but DAMN if I haven't
> wanted copy-on-write files for a LONG time.

So use a filesystem that supports them natively ;)

> I.e. being able to hardlink files, but have an option to mark it as
> copy on write -- allowing space to be saved when copying directory trees,
> but then dynamically making new copies when someone updates one of the
> linked copies.

The problem is that a reflink sort of looks like a hard link, but in
many cases behaves like a soft link (e.g. different owners,
permissions, etc are possible) and hence - combined with the
copy-on-write behaviour - they need to be treated more like a
soft-link in terms of implementation. Soft links have their own
inode so can hold state separate to the inode they are pointing to,
and for reflinked files it is simply not practical to retroactively
modify the directory structure to point at a different inode when
the first COW operation occurs.

Like I said, it can be done, but it's not a small project. If you
want to sink a significant amount of development time to the
project, we will help you in any way we can. However, I don't think
anyone has the time to do something like this for you....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: question on new feature complexity/possibility/sensibility? (^ Alternate)
  2011-06-27  0:32 ` Dave Chinner
@ 2011-06-27  4:08   ` Linda Walsh
  2011-06-27  6:02     ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Linda Walsh @ 2011-06-27  4:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs-oss



Dave Chinner wrote:
> So use a filesystem that supports them natively ;)
---
	Got any in mind that also support acls, extended attrs
and have the reliability and performance of xfs?  ;-)
> 
>> I.e. being able to hardlink files, but have an option to mark it as
>> copy on write -- allowing space to be saved when copying directory trees,
>> but then dynamically making new copies when someone updates one of the
>> linked copies.
> 
> The problem is that a reflink sort of looks like a hard link, but in
> many cases behaves like a soft link (e.g. different owners,
> permissions, etc are possible) and hence - combined with the
> copy-on-write behaviour - they need to be treated more like a
> soft-link in terms of implementation.
----
	I can see this -- now this doesn't mean we are talking about the same
type of reflink with the shared data above, does it?  Cuz in this lower
case, I was talking about hmmm....interesting....

	I was thinking of just a copy-of-the whole file on write,
but that'd be a potential pain on some systems depending on the file
size....   

> Soft links have their own
> inode so can hold state separate to the inode they are pointing to,
> and for reflinked files it is simply not practical to retroactively
> modify the directory structure to point at a different inode when
> the first COW operation occurs.
----
	I see....   
> 
> Like I said, it can be done, but it's not a small project. If you
> want to sink a significant amount of development time to the
> project, we will help you in any way we can. However, I don't think
> anyone has the time to do something like this for you....
---
	If I had the time and mental resources...I'd love to.
But am a bit overwhelmed for time now, and even if I wasn't, I'd
not be anywhere near certain I'd be able to maintain continuous focus
for the length of time necessary to do that length of project....  
if you know what I mean...

	Maybe certain if it was something that I or others could break
down into useful 'subchunks' that could go in at separate times.  Nothing
useless by itself (if nothing less than specific 'upgrading' of 
data infrastructure (data fields on disk, routines to parse such...etc)),
but the whole thing at once seems pretty large.

(I know...I'm being a wimp)...

to support such things in the future



* Re: question on new feature complexity/possibility/sensibility? (^ Alternate)
  2011-06-27  4:08   ` Linda Walsh
@ 2011-06-27  6:02     ` Dave Chinner
  2011-06-27  8:22       ` Stan Hoeppner
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2011-06-27  6:02 UTC (permalink / raw)
  To: Linda Walsh; +Cc: xfs-oss

On Sun, Jun 26, 2011 at 09:08:41PM -0700, Linda Walsh wrote:
> 
> 
> Dave Chinner wrote:
> >So use a filesystem that supports them natively ;)
> ---
> 	Got any in in mind that also support acls, extended attrs
> and has the reliability and performance of xfs?  ;-)

For most "normal" workloads, ZFS would probably be your only
production ready option. But that's not something you could use on
Linux, is it? :/

Until btrfs is a completely baked cake, you won't be able to tick
all those boxes, and even then there will be questions about
performance...

Mmmm, cake.... :)

> >>I.e. being able to hardlink files, but have an option to mark it as
> >>copy on write -- allowing space to be saved when copying directory trees,
> >>but then dynamically making new copies when someone updates one of the
> >>linked copies.
> >
> >The problem is that a reflink sort of looks like a hard link, but in
> >many cases behaves like a soft link (e.g. different owners,
> >permissions, etc are possible) and hence - combined with the
> >copy-on-write behaviour - they need to be treated more like a
> >soft-link in terms of implementation.
> ----
> 	I can see this -- now this doesn't mean we are talking the same
> type of reflink with the shared data above is it?  Cuz in this lower
> case, I was talking about hmmm....interesting....
> 
> 	I was thinking of just a copy-of-the whole file on write,
> but that'd be a potential pain on some systems depending on the file
> size....

If you are going to take the pain of copying the entire file anyway,
just copy it up front. XFS is designed for fast, large storage
subsystems, so make use of its capabilities ;)

> >Soft links have their own
> >inode so can hold state separate to the inode they are pointing to,
> >and for reflinked files it is simply not practical to retroactively
> >modify the directory structure to point at a different inode when
> >the first COW operation occurs.
> ----
> 	I see....
> >
> >Like I said, it can be done, but it's not a small project. If you
> >want to sink a significant amount of development time to the
> >project, we will help you in any way we can. However, I don't think
> >anyone has the time to do something like this for you....
> ---
> 	If I had the time and mental resources...I'd love to.
> But am a bit overwhelmed for time now, and even if I wasn't, I'd
> not be anywhere near certain I'd be able to maintain continuous focus
> for the length of time necessary to do that length of project....
> if you know what I mean...
> 
> 	Maybe certain if it was something that I or others could break
> down into useful 'subchunks' that could go in at separate times.  Nothing
> useless by itself (if nothing less than specific 'upgrading' of data
> infrastructure (data fields on disk, routines to parse such...etc),
> but the whole thing at once seems pretty large.

Actually, none of it needs to be merged until complete support is
available. If there's a need for parts of the work in mainline
before the bigger set of work is complete, then we'd merge the bits
needed and rebase the working tree on top of the new mainline tree. git
makes this sort of operation easy, so development of such a feature
could be done quite simply in a parallel branch.

IOWs, even if I was doing this, I'd still be doing it a small chunk
at a time before committing it to a public branch somewhere. It's
easy to point interested parties at the code that way to collaborate
on a separate branch until everything is done. I just wouldn't be
asking for review/mainline inclusion until I had everything done and
tested....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: question on new feature complexity/possibility/sensibility? (^ Alternate)
  2011-06-27  6:02     ` Dave Chinner
@ 2011-06-27  8:22       ` Stan Hoeppner
  0 siblings, 0 replies; 5+ messages in thread
From: Stan Hoeppner @ 2011-06-27  8:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs-oss

On 6/27/2011 1:02 AM, Dave Chinner wrote:
> On Sun, Jun 26, 2011 at 09:08:41PM -0700, Linda Walsh wrote:
>>
>>
>> Dave Chinner wrote:
>>> So use a filesystem that supports them natively ;)
>> ---
>> 	Got any in in mind that also support acls, extended attrs
>> and has the reliability and performance of xfs?  ;-)
> 
> For most "normal" workloads, ZFS would probably be your only
> production ready option. But that's not something you could use on
> Linux, is it? :/

It is apparently usable on Linux for those willing to build and maintain
it themselves.  But that level of administrative burden in itself is the
opposite of production ready.

> Until btrfs is a completely baked cake, you won't be able to tick
> all those boxes, and even then there will be questions about
> performance...

Or until Oracle releases ZFS under GPL, or a compatible license, which
probably isn't going to happen.  Apparently Oracle believes their SPARC
and x86 hardware won't sell if ZFS is free on Linux.  Now that Oracle
owns Solaris and ZFS I'd bet they wish they could put the BTRFS genie
back in the bottle, as well as OCFS, etc.

-- 
Stan
