* dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
@ 2012-05-01 12:53 Spelic
  2012-05-01 13:08 ` Christoph Hellwig
  2012-05-01 14:10 ` Joe Thornber
  0 siblings, 2 replies; 7+ messages in thread
From: Spelic @ 2012-05-01 12:53 UTC (permalink / raw)
  To: device-mapper development

Dear dm-thin developers,
I thought that it would be immensely useful to have a SEEK_DATA / 
SEEK_HOLE implementation for dm-thin and/or even for the older non-thin 
snapshotting mechanism.
This would make it possible to implement a mechanism like the acclaimed "zfs send" 
with dm snapshots, i.e. cheaply replicate a thin snapshot remotely once 
the parent snapshot has been replicated already. Extremely useful imho.
Is there any plan to do that?
The "HOLE" would mean "data comes from parent snapshot/device", while 
DATA is "data that has changed since the parent snapshot". Discarded 
regions that were not discarded in the parent snapshot should preferably 
appear as zeroed DATA and not HOLE, or a new type SEEK_DISCARD because 
if you make it HOLE, you lose information (you lose: "such data region 
was meaningful in the parent snapshot but is not meaningful in the child 
snapshot", and this kind of information cannot be recovered later in any 
way) and you lose the property that reading those regions returns zeroed 
data, which is a major problem for backups, see next paragraph.
Instead, if a discarded region returns zeroed DATA, not much information 
is lost because any long string of zeroes is interchangeable with a 
discard, i.e. you can detect zeroes and perform the discard afterwards. 
A new type SEEK_DISCARD could still be better.
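
For reference, this is roughly how SEEK_DATA / SEEK_HOLE are used on a regular sparse file today. It is only a sketch in Python: the helper names `data_extents` and `make_sparse_file` are made up for illustration, and nothing here works on a dm-thin device, which is exactly what this request is about.

```python
import errno
import os
import tempfile


def data_extents(fd, size):
    """Return a list of (start, end) byte ranges that lseek reports as DATA."""
    extents = []
    pos = 0
    while pos < size:
        try:
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:
                # ENXIO means there is no more data at or after pos.
                break
            raise
        # There is always an implicit hole at end of file, so this succeeds.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        extents.append((start, end))
        pos = end
    return extents


def make_sparse_file(path, size, offset, payload):
    """Create a file of `size` bytes with `payload` written at `offset`."""
    with open(path, "wb") as f:
        f.truncate(size)
        f.seek(offset)
        f.write(payload)
```

A replication tool would copy only the ranges returned by `data_extents()`. On filesystems without hole support the whole file is simply reported as one DATA extent, so the loop stays correct but saves nothing.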

Another question / feature request: I would like to know if reading an 
area of a thin device after a discard is guaranteed to return zeroes 
(and/or can be identified as empty from userspace via a seek_data / 
seek_hole or equivalent mechanism). This would be very important for 
backups, so as not to get poorly compressible garbage out of an old and 
now unused region.
If yes: how big should such discarded area be for that area to be seen 
from userspace as hole/zeroes: 512b, 4k, or 64m? E.g. a 512b discarded 
area surrounded by nondiscarded data will return zeroes on read?
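
The "detect zeroes and perform the discard afterwards" idea from the first paragraph could look something like the sketch below on the backup side. The function name `zero_runs` and the choice of block size are assumptions for illustration; the right granularity is precisely the open question above.

```python
def zero_runs(data, block_size):
    """Yield (offset, length) for each maximal run of all-zero blocks."""
    zero_block = bytes(block_size)
    run_start = None
    for off in range(0, len(data), block_size):
        chunk = data[off:off + block_size]
        # The last chunk may be shorter than block_size.
        is_zero = chunk == zero_block[:len(chunk)]
        if is_zero and run_start is None:
            run_start = off
        elif not is_zero and run_start is not None:
            yield (run_start, off - run_start)
            run_start = None
    if run_start is not None:
        yield (run_start, len(data) - run_start)
```

Each yielded range is a candidate for a discard on the receiving side instead of a write of literal zeroes.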

Thank you
S.


* Re: dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
  2012-05-01 12:53 dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD Spelic
@ 2012-05-01 13:08 ` Christoph Hellwig
  2012-05-01 14:10 ` Joe Thornber
  1 sibling, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2012-05-01 13:08 UTC (permalink / raw)
  To: device-mapper development

On Tue, May 01, 2012 at 02:53:22PM +0200, Spelic wrote:
> Dear dm-thin developers,
> I thought that it would be immensely useful to have a SEEK_DATA /
> SEEK_HOLE implementation for dm-thin and/or even for the older
> non-thin snapshotting mechanism.

You can't implement it directly, as device mapper doesn't implement the
file operations.  But I think adding block device operations that back
it would be a good idea; they could also be implemented for arrays
implementing thin provisioning using the GET LBA STATUS SCSI command.


* Re: dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
  2012-05-01 12:53 dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD Spelic
  2012-05-01 13:08 ` Christoph Hellwig
@ 2012-05-01 14:10 ` Joe Thornber
  2012-05-01 15:52   ` Spelic
  1 sibling, 1 reply; 7+ messages in thread
From: Joe Thornber @ 2012-05-01 14:10 UTC (permalink / raw)
  To: device-mapper development

On Tue, May 01, 2012 at 02:53:22PM +0200, Spelic wrote:
> Dear dm-thin developers,
> I thought that it would be immensely useful to have a SEEK_DATA /
> SEEK_HOLE implementation for dm-thin and/or even for the older
> non-thin snapshotting mechanism.
> This would allow to implement a mechanism like the acclaimed "zfs
> send" with dm snapshots, i.e. cheaply replicate a thin snapshot
> remotely once the parent snapshot has been replicated already.
> Extremely useful imho.
> Is there any plan to do that?

I'm planning to do replication via userland.  There's a new message
that allows userland to access a read-only copy of the metadata.  From
this, and using some intermediate snapshots we can work out what data
is changing and replicate it (asynchronously).
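
As a rough illustration of working out what changed between two metadata snapshots: the sketch below assumes each snapshot's mappings have already been read into a plain dict of virtual block to data block, and that a rewritten block ends up mapped to a different data block. Both the dict representation and the name `changed_blocks` are hypothetical; the real metadata format is not described in this thread.

```python
def changed_blocks(old_map, new_map):
    """Compare two {virtual_block: data_block} mappings of the same device.

    Returns (changed, removed): virtual blocks that are new or remapped in
    new_map, and virtual blocks that were mapped in old_map but no longer are.
    """
    changed = sorted(vb for vb, db in new_map.items() if old_map.get(vb) != db)
    removed = sorted(vb for vb in old_map if vb not in new_map)
    return changed, removed
```

A replicator would then send the contents of the `changed` blocks and issue discards for the `removed` ones on the remote side.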

> The "HOLE" would mean "data comes from parent snapshot/device",
> while DATA is "data that has changed since the parent snapshot".

This sounds like the external snapshots feature that I just added.
See documentation in latest kernel.

> Another question / feature request: I would like to know if reading
> an area of a thin device after a discard is guaranteed to return
> zeroes (and/or can be identified as empty from userspace via a
> seek_data / seek_hole or equivalent mechanism).

A great question.  If the discard exactly covers some dm-thin blocks,
then the mappings will be removed.  Any future io to that block will
trigger the block to be reprovisioned.  If you've set the block
zeroing flag in the pool, you are guaranteed to have zeroes come
out.

Any partial block discards will get passed down to the underlying data
device (assuming you've selected that option).  Any zeroing side
effects depend on the underlying device.
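
The distinction between whole-block and partial discards can be sketched as below. This is only an illustration of the arithmetic, not the dm-thin code; `split_discard` is a made-up name and the offsets are in bytes rather than sectors.

```python
def split_discard(offset, length, block_size):
    """Split a discard into whole pool blocks plus partial byte ranges.

    Returns (full_blocks, partials): full_blocks is a range of block indices
    entirely covered by the discard; partials is a list of (offset, length)
    byte ranges that cover only part of a block.
    """
    end = offset + length
    first_full = -(-offset // block_size)  # round the start up to a boundary
    last_full = end // block_size          # round the end down to a boundary
    if first_full >= last_full:
        # No whole block is covered: the entire request is partial.
        partials = [(offset, length)] if length else []
        return range(0), partials
    partials = []
    head_end = first_full * block_size
    if offset < head_end:
        partials.append((offset, head_end - offset))
    tail_start = last_full * block_size
    if tail_start < end:
        partials.append((tail_start, end - tail_start))
    return range(first_full, last_full), partials
```

Only the blocks in `full_blocks` would have their mappings removed; the `partials` are what gets passed down to the data device, if that option is selected.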

As for identifying empty blocks from userland: there is an inherent
race here.  What would you do with the info?

- Joe


* Re: dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
  2012-05-01 14:10 ` Joe Thornber
@ 2012-05-01 15:52   ` Spelic
  2012-05-03  9:14     ` Joe Thornber
  0 siblings, 1 reply; 7+ messages in thread
From: Spelic @ 2012-05-01 15:52 UTC (permalink / raw)
  To: device-mapper development; +Cc: Joe Thornber

On 05/01/12 16:10, Joe Thornber wrote:
> On Tue, May 01, 2012 at 02:53:22PM +0200, Spelic wrote:
>> Dear dm-thin developers,
>> I thought that it would be immensely useful to have a SEEK_DATA /
>> SEEK_HOLE implementation for dm-thin and/or even for the older
>> non-thin snapshotting mechanism.
>> This would allow to implement a mechanism like the acclaimed "zfs
>> send" with dm snapshots, i.e. cheaply replicate a thin snapshot
>> remotely once the parent snapshot has been replicated already.
>> Extremely useful imho.
>> Is there any plan to do that?
> I'm planning to do replication via userland.  There's a new message
> that allows userland to access a read-only copy of the metadata.  From
> this, and using some intermediate snapshots we can work out what data
> is changing and replicate it (asynchronously).
>
>> The "HOLE" would mean "data comes from parent snapshot/device",
>> while DATA is "data that has changed since the parent snapshot".
> This sounds like the external snapshots feature that I just added.
> See documentation in latest kernel.

I'm looking at it right now.
Well, I was thinking of a parent snapshot and child snapshot (or anyway 
an older and a more recent snapshot of the same device) so I'm not sure 
that's the feature I needed... probably I'm missing something and need 
to study more.

>> Another question / feature request: I would like to know if reading
>> an area of a thin device after a discard is guaranteed to return
>> zeroes (and/or can be identified as empty from userspace via a
>> seek_data / seek_hole or equivalent mechanism).
> A great question.  If the discard exactly covers some dm-thin blocks,

I'm not sure I have understood the full nomenclature of dm-thin yet :-) 
... "dm-thin blocks" would be the same thing as the so-called "pool 
blocksize" as discussed in the thread " Re: [PATCH 2/2] dm thin: support 
for non power of 2 pool blocksize" right? so that's customizable now and 
not necessarily a power of 2...

But those are anyway quite big, default is what, 64 megabytes? (which is 
in fact a good thing for preventing excessive fragmentation...)

Now an obvious question:
If userspace sends multiple smaller discards eventually covering the 
whole block, the block will still be unmapped correctly, right?
If yes: so you do preserve the information of what part of the block 
has already been discarded, and what part has not... so it would be 
possible to return zeroes if the unmapped sub-part of the block is being 
read... right?

> then the mappings will be removed.  Any future io to that block will
> trigger the block to be reprovisioned.

(note: here we are talking of a full block now unmapped, different 
situation from above)
Ok, supposing I do *not* write, so it does not get reprovisioned, what 
does reading from there return: does it return zeroes, or does it return 
nonzero data coming from the parent snapshot at the same offset?

> ...
> As for identifying empty blocks from userland: there is an inherent
> race here.  What would you do with the info?

You are right, I would definitely need to take a snapshot prior to 
reading that... so consider my question related to reading a snapshot of 
a device which has been partially discarded...

Thank you
S.


* Re: dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
  2012-05-01 15:52   ` Spelic
@ 2012-05-03  9:14     ` Joe Thornber
  2012-05-04 17:16       ` Spelic
  0 siblings, 1 reply; 7+ messages in thread
From: Joe Thornber @ 2012-05-03  9:14 UTC (permalink / raw)
  To: Spelic; +Cc: device-mapper development

On Tue, May 01, 2012 at 05:52:45PM +0200, Spelic wrote:
> I'm looking at it right now
> Well, I was thinking at a parent snapshot and child snapshot (or
> anyway an older and a more recent snapshot of the same device) so
> I'm not sure that's the feature I needed... probably I'm missing
> something and need to study more

I'm not really following you here.  You can have arbitrary depth of
snapshots (snaps of snaps) if that helps.

> I'm not sure I have understood the full nomenclature of dm-thin yet
> :-) ... "dm-thin blocks" would be the same thing as so called "pool
> blocksize" as talked in the thread " Re: [PATCH 2/2] dm thin:
> support for non power of 2 pool blocksize" right? so that's
> customizable now and not necessarily in power of 2...
> 
> But those are anyway quite big, default is what, 64 megabytes?
> (which is in fact a good thing for preventing excessive
> fragmentation...)

Yes, this is the pool block size, it's the atomic unit used for
provisioning and copy-on-write.  I think the LVM2 tools default this
to be 512 _k_.  You'd only set it to 64M if you had little interest in
snapshot performance.

> Now an obvious question:
> If userspace sends multiple smaller discards eventually covering the
> whole block, the block will still be unmapped correctly, right?

No, I don't track anything smaller than a block.  (Note, blocks are
typically much smaller than you've been envisioning.)

> If yes: so you do preserve the information of what part of the block
> has already been discarded, and what part is not... so it would
> be possible to return zeroes if the unmapped sub-part of the block
> is being read... right?

No, but the underlying device may do ...

> >then the mappings will be removed.  Any future io to that block will
> >trigger the block to be reprovisioned.
> 
> (note: here we are talking of a full block now unmapped, different
> situation from above)
> Ok, supposing I do *not* write, so it does not get reprovisioned,
> what does reading from there return; does it return zeroes, or it
> returns nonzero data coming from the parent snapshot at the same
> offset?

zeroes.

> >...
> >As for identifying empty blocks from userland: there is an inherent
> >race here.  What would you do with the info?
> 
> You are right , I would definitely need to take a snapshot prior to
> reading that... so consider my question related to reading a
> snapshot of a device which has been partially discarded...

Y, I'll provide tools to let you do this.  If you wish to help with
writing a replicator please email me.  It's a project I'm keen to get
going.

- Joe


* Re: dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
  2012-05-03  9:14     ` Joe Thornber
@ 2012-05-04 17:16       ` Spelic
  2012-05-09  7:55         ` Joe Thornber
  0 siblings, 1 reply; 7+ messages in thread
From: Spelic @ 2012-05-04 17:16 UTC (permalink / raw)
  To: dm-devel

On 05/03/12 11:14, Joe Thornber wrote:
> On Tue, May 01, 2012 at 05:52:45PM +0200, Spelic wrote:
>> I'm looking at it right now
>> Well, I was thinking at a parent snapshot and child snapshot (or
>> anyway an older and a more recent snapshot of the same device) so
>> I'm not sure that's the feature I needed... probably I'm missing
>> something and need to study more
> I'm not really following you here.  You can have arbitrary depth of
> snapshots (snaps of snaps) if that helps.

I'm not following you either (you pointed me to the external snapshot 
feature but this would not be an "external origin" methinks...?), but 
this is probably irrelevant after having seen the rest of the replies 
because I now finally understand what metadata is available inside 
dm-thin. Thanks for such clear replies.

With your implementation there's the problem of fragmentation and RAID 
alignment vs. the discard implementation. With concurrent access to many 
thin provisioned devices, if the blocksize is small, fragmentation is 
likely to come out bad: HDD streaming reads can suffer a lot on fragmented 
areas (up to a factor of 1000), and on parity RAID, write performance would 
also suffer. If instead the blocksize is set to be large (such as one RAID 
stripe), block unmapping on discards is not likely to work, because one 
discard per file would be received but most files would be smaller than 
a thinpool block (smaller than a RAID stripe: in fact it is recommended 
that the RAID chunk be made equal to the expected average file size, so 
average file size and average discard size would be 1/N of the thinpool 
block size), so nothing would be unprovisioned.

There would be another way to do it (please excuse my obvious arrogance; 
I know I should write code instead of writing emails), with two layers: 
the blocksize for provisioning is e.g. 64M (this one should be customizable 
like you have now), while the blocksize for tracking writes and discards is 
e.g. 4K. You make the btree only for the 64M blocks, and inside that you 
keep 2 bitmaps for tracking its 16384 4K-blocks. One bit is "4K block 
has been written", and if this is zero, reads go against the parent 
snapshot (this avoids CoW costs when provisioning a new 64M block). The 
other bit is "4K block has been discarded" and if this is set, reads 
return zero, and if all 16384 bits are set, the 64M block gets 
un-provisioned. This would play well with RAID alignment, with HDD 
fragmentation, with CoW (normally no cow performed if writes are 4K or 
bigger... "read optimizations" could do that afterwards if needed), with 
multiple small discards, with tracking differences between parent 
snapshot and current snapshot for remote replication, and with 
compressed backups which would see zeroes on all discarded areas.
It should be possible to add this into your implementation because added 
metadata is just 2 bitmaps more for each block than what you have now.
I would really like to try to write code for this but unfortunately I 
foresee I won't have time to write code for a good while.
I don't want this to appear as if I don't appreciate your 
current implementation, which is great work, was very much needed, and in 
fact I will definitely use it for our production systems after 3.4 is 
stable (I was waiting for discards).
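
To make the two-bitmap idea concrete, here is a toy sketch of the per-block state in Python. The class and method names are invented for illustration, and clearing one bit when the other is set is my own assumption about how the two bitmaps would interact; it says nothing about how this would fit the actual dm-thin metadata.

```python
class BigBlock:
    """One provisioned big block tracked at sub-block granularity (sketch)."""

    def __init__(self, sub_blocks=16384):  # 64 MiB / 4 KiB = 16384
        self.written = 0    # bit i set => sub-block i has been written here
        self.discarded = 0  # bit i set => sub-block i has been discarded
        self.full_mask = (1 << sub_blocks) - 1

    def write(self, i):
        # Assumption: a write cancels an earlier discard of the same sub-block.
        self.written |= 1 << i
        self.discarded &= ~(1 << i)

    def discard(self, i):
        # Assumption: a discard invalidates earlier data in the same sub-block.
        self.discarded |= 1 << i
        self.written &= ~(1 << i)

    def read_source(self, i):
        """Say where a read of sub-block i would be served from."""
        bit = 1 << i
        if self.discarded & bit:
            return "zeroes"
        if self.written & bit:
            return "local"
        return "parent"

    def can_unprovision(self):
        """True once every sub-block has been discarded."""
        return self.discarded == self.full_mask
```

With 16384 sub-blocks each bitmap is 2 KiB per 64M block, which is the extra metadata cost referred to above.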

> Y, I'll provide tools to let you do this.  If you wish to help with
> writing a replicator please email me.  It's a project I'm keen to get
> going.

Thanks for the opportunity but for now it seems I can only be a leech, 
at most I have time for writing a few emails :-(

Thank you
S.


* Re: dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
  2012-05-04 17:16       ` Spelic
@ 2012-05-09  7:55         ` Joe Thornber
  0 siblings, 0 replies; 7+ messages in thread
From: Joe Thornber @ 2012-05-09  7:55 UTC (permalink / raw)
  To: device-mapper development

On Fri, May 04, 2012 at 07:16:52PM +0200, Spelic wrote:
> On 05/03/12 11:14, Joe Thornber wrote:
> >On Tue, May 01, 2012 at 05:52:45PM +0200, Spelic wrote:
> >>I'm looking at it right now
> >>Well, I was thinking at a parent snapshot and child snapshot (or
> >>anyway an older and a more recent snapshot of the same device) so
> >>I'm not sure that's the feature I needed... probably I'm missing
> >>something and need to study more
> >I'm not really following you here.  You can have arbitrary depth of
> >snapshots (snaps of snaps) if that helps.
> 
> I'm not following you either (you pointed me to the external
> snapshot feature but this would not be an "external origin"
> methinks...?),

Yes, it's a snapshot of an external origin.

> With your implementation there's the problem of fragmentation and
> RAID alignment vs discards implementation.

This is always going to be an issue with thin provisioning.

> (such as one RAID stripe), block unmapping on discards is not likely
> to work because one discard per file would be received but most
> files would be smaller than a thinpool block (smaller than a RAID
> stripe: in fact it is recommended that the raid chunk is made equal
> to the prospected average file size so average file size and average
> discard size would be 1/N of the thinpool block size) so nothing
> would be unprovisioned.

You're right.  In general discard is an expensive operation (on all
devices, not just thin), so you want to use it infrequently and on
large chunks.  I suspect that most people, rather than turning on
discard within the file system, will just periodically run a cleanup
program that inspects the fs and discards unused blocks.

> There would be another way to do it (pls excuse my obvious arrogance
> and I know I should write code instead of write emails) two layers:
> blocksize for provisioning is e.g. 64M (this one should be
> customizable like you have now), while blocksize for tracking writes
> and discards is e.g. 4K. You make the btree only for the 64M blocks,
> and inside that you keep 2 bitmaps for tracking its 16384
> 4K-blocks.

Yes, we could track discards and aggregate them into bigger blocks.
Doing so would require more metadata, and more commits (which are
synchronous operations).  The 2 blocks size approach has a lot going
for it, but it does add a lot of complexity - I deliberately kept thin
simple.  One concern I have is that it demotes the snapshots to second
class citizens since they're composed of the smaller blocks and will
not have the adjacency properties of the thin that is provisioned
solely with big blocks.  I'd rather just do the CoW on the whole
block, and boost performance by putting an SSD (via a caching target)
in front of the data device.  That way the CoW would complete
v. quickly, and could be written back to the device slowly in the
background iff it's infrequently used.

- Joe

