* btrfs for enterprise raid arrays
@ 2009-04-03  4:34 Erwin van Londen
  2009-04-03  7:32 ` Sander
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Erwin van Londen @ 2009-04-03  4:34 UTC (permalink / raw)
  To: linux-btrfs

Dear all,

While going through the mailing list archive and the wiki, I didn't
find any indication of whether Btrfs will include optimizations to make
efficient use of the functions and features that exist today on
enterprise-class storage arrays.

One exception to that was the ssd option, which I think can improve
read and write I/O. However, when attached to a storage array it doesn't
really matter from an OS perspective, since the OS can't look behind the
array front-end interface anyhow (whether it is FC, iSCSI or anything
else).

There are, however, more options that we could think of. Almost all
storage arrays these days can replicate a volume (or part of it in COW
cases) either within the system or remotely. It would be handy if a
Btrfs-formatted volume could make use of those features, since this
might offload a lot of the processing time involved in maintaining
snapshots. The arrays already have optimized code to make these
snapshots. I'm not saying we should step away from host-based snapshots,
but integration would be very nice.

Furthermore, some enterprise arrays have a feature that allows full or
partial staging of data in cache. By this I mean that when a volume
contains a certain number of blocks, you can define the first X blocks
to be pre-staged in cache, which enables extremely high IO rates on
those blocks. An option related to the -ssd parameter could be a mount
command such as "mount -t btrfs -ssd 0-10000", so Btrfs knows what to
expect from that partial area and can maybe optimize the locality of
frequently used blocks to improve performance.

Another thing is that some arrays have the capability to
"thin-provision" volumes. In the back-end, on the physical layer, the
array configures, let's say, a 1 TB volume and virtually provisions 5 TB
to the host. On writes it dynamically allocates more pages in the pool,
up to the 5 TB point. Now if for some reason large holes occur on the
volume, maybe a couple of ISO images that have been deleted, what
normally happens is that just some pointers in the inodes get deleted.
From the array's perspective there is still data at those locations, so
it will never release those allocated blocks. New firmware/microcode
versions are able to reclaim that space if they see a certain number of
consecutive zeros, and will return it to the volume pool. Are there any
thoughts on writing a low-priority thread that zeros out those
"non-used" blocks?
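
To make the reclaim behaviour described above concrete, here is a
minimal sketch in Python. It is a toy model, not any vendor's firmware:
the class name ThinVolume, the 4096-byte page size and the whole-page
write restriction are all assumptions made for illustration. It only
shows the idea that pages are allocated from the pool on first write and
handed back once they contain nothing but zeros.

```python
PAGE_SIZE = 4096  # hypothetical allocation unit, in bytes


class ThinVolume:
    """Toy model of a thin-provisioned volume backed by a shared pool."""

    def __init__(self, virtual_pages, pool_pages):
        self.virtual_pages = virtual_pages   # size advertised to the host
        self.pool_free = pool_pages          # physical pages still unallocated
        self.mapped = {}                     # virtual page number -> bytes

    def write(self, page_no, data):
        if not 0 <= page_no < self.virtual_pages:
            raise IndexError("page outside the virtual volume")
        if len(data) != PAGE_SIZE:
            raise ValueError("this sketch only accepts whole-page writes")
        if page_no not in self.mapped:
            if self.pool_free == 0:
                raise RuntimeError("pool exhausted")
            self.pool_free -= 1              # allocate on first write
        self.mapped[page_no] = bytes(data)

    def read(self, page_no):
        # Unmapped pages read back as zeros.
        return self.mapped.get(page_no, bytes(PAGE_SIZE))

    def reclaim_zero_pages(self):
        """Return all-zero pages to the pool; report how many were freed."""
        zero = bytes(PAGE_SIZE)
        victims = [n for n, d in self.mapped.items() if d == zero]
        for n in victims:
            del self.mapped[n]
        self.pool_free += len(victims)
        return len(victims)
```

In this model a host that overwrites a deleted file's pages with zeros
gets the space back on the next reclaim pass, at the cost of actually
sending all those zeros over the wire.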

Given the scalability targets of Btrfs, it will most likely be heavily
used in enterprise environments once it reaches a stable code level. If
we were able to interface with these array-based features, that would be
very beneficial.

Furthermore, one question also comes to mind when looking at the
scalability of Btrfs and its targeted capacity levels: I think we will
run into problems with the capabilities of the server hardware itself.
From what I can see now, Btrfs is not being designed as a distributed
file system with an integrated distributed lock manager to scale out
over multiple nodes. (I know Oracle is working on a similar thing, but
this might make things more complicated than they already are.) This
might cause some serious issues with recovery scenarios like
backup/restore, since it will take quite some time to back up or restore
a multi-PB system when it resides on just one physical host, even when
we're talking about high-end P-series, I25K's or Superdome class
machines.

I'm not a coder, but I have been heavily involved in the storage
industry for the past 15 years, so these are just some of the things I
come across in real-life enterprise customer environments and some of my
own musings.

There are some more; however, these would be best covered in another
topic.

Let me know your thoughts.

Kind regards,

Erwin van Londen
Systems Engineer
HITACHI DATA SYSTEMS
Level 4, 441 St. Kilda Rd.
Melbourne, Victoria, 3004
Australia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: btrfs for enterprise raid arrays
  2009-04-03  4:34 btrfs for enterprise raid arrays Erwin van Londen
@ 2009-04-03  7:32 ` Sander
  2009-04-03 11:51   ` Ric Wheeler
  2009-04-03 11:43 ` Ric Wheeler
  2009-04-03 13:22 ` Chris Mason
  2 siblings, 1 reply; 11+ messages in thread
From: Sander @ 2009-04-03  7:32 UTC (permalink / raw)
  To: Erwin van Londen; +Cc: linux-btrfs

Dear Erwin,

Erwin van Londen wrote (ao):
> Another thing is that some arrays have the capability to
> "thin-provision" volumes. In the back-end on the physical layer the
> array configures, let say, a 1 TB volume and virtually provisions 5TB
> to the host. On writes it dynamically allocates more pages in the pool
> up to the 5TB point. Now if for some reason large holes occur on the
> volume, maybe a couple of ISO images that have been deleted, what
> normally happens is just some pointers in the inodes get deleted so
> from an array perspective there is still data on those locations and
> will never release those allocated blocks. New firmware/microcode
> versions are able to reclaim that space if it sees a certain number of
> consecutive zero's and will reclaim that space to the volume pool. Are
> there any thoughts on writing a low-priority tread that zeros out
> those "non-used" blocks?

SSD would also benefit from such a feature as it doesn't need to copy
deleted data when erasing blocks.

The storage could use the ATA/SCSI commands TRIM, UNMAP and DISCARD for
that?

I have one question on thin provisioning: if Windows XP performs defrag
on a 20GB 'virtual' size LUN with 2GB in actual use, will the volume
grow to 20GB on the storage and never shrink again afterwards, while
the client still has only 2GB in use?

This would make thin provisioning on virtual desktops less useful.

Do you have any numbers on the performance impact of thin provisioning?
I can imagine that thin provisioning causes on-storage fragmentation
of disk images, which would kill any OS optimisations like grouping often
read files.

	With kind regards, Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net


* Re: btrfs for enterprise raid arrays
  2009-04-03  4:34 btrfs for enterprise raid arrays Erwin van Londen
  2009-04-03  7:32 ` Sander
@ 2009-04-03 11:43 ` Ric Wheeler
  2009-04-03 11:58   ` David Woodhouse
  2009-04-03 13:51   ` James Bottomley
  2009-04-03 13:22 ` Chris Mason
  2 siblings, 2 replies; 11+ messages in thread
From: Ric Wheeler @ 2009-04-03 11:43 UTC (permalink / raw)
  To: Erwin van Londen
  Cc: linux-btrfs, Alasdair G Kergon, James Bottomley, willy,
	David.Woodhouse, Tom Coughlan

Erwin van Londen wrote:
> Dear all,
>
> While going through the archived mailing list and crawling along the wiki I didn't find any clues if there would be any optimizations in Btrfs to make efficient use of functions and features that today exist on enterprise class storage arrays.
>
> One exception to that was the ssd option which I think can make a improvement on read and write IO's however when attached to a storage array, from an OS perspective, it doesn't really matter since it can't look behind the array front-end interface anyhow(whether it FC/iSCSI or any other).
>
> There are however more options that we could think of. Almost all storage arrays these days have the capabilities to replicate volume (or part of it in COW cases) either in the system or remotely. It would be handy that if a Btrfs formatted volume could make use of those features since this might offload a lot of the processing time involved in maintaining these. The arrays already have optimized code to make these snapshots. I'm not saying we should step away from the host based snapshots but integration would be very nice.
>   
I agree - it would be great to have a standard way to invoke built-in 
snapshots in enterprise arrays. Does HDS export something that we could 
invoke from kernel context to perform a snapshot?

> Furthermore some enterprise array have a feature that allows for full or partial staging data in cache. By this I mean when a volume contains a certain amount of blocks you can define to have the first X number of blocks pre-staged in cache which enables you to have extremely high IO rates on these first ones. An option related to the -ssd parameter could be to have a mount command say "mount -t btrfs -ssd 0-10000" so Btrfs knows what to expect from the partial area and maybe can optimize the locality of frequently used blocks to optimize performance.
>   
Effectively, you prefetch and pin these blocks in cache?  Is this 
something we can preconfigure via some interface per LUN?
> Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end on the physical layer the array configures, let say, a 1 TB volume and virtually provisions 5TB to the host. On writes it dynamically allocates more pages in the pool up to the 5TB point. Now if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is just some pointers in the inodes get deleted so from an array perspective there is still data on those locations and will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if it sees a certain number of consecutive zero's and will reclaim that space to the volume pool. Are there any thoughts on writing a low-priority tread that zeros out those "non-used" blocks?
>   

Patches have been floating around to support this - see the recent 
patches around "DISCARD" on linux-ide and lkml.  It would be great to 
get access to a box that implemented the T10 proposed UNMAP commands 
that we could test against. 
> Given the scalability targets of Btrfs it will most likely be heavily used in the enterprise environment once it reaches a stable code level. If we would be able to interface with these array based features that would be very beneficial. 
>
> Furthermore one question also pops to mind and that's when looking at the scalability of Btrfs and its targeted capacity levels I think we will run into problems with the capabilities of the server hardware itself. From what I can see now it will not be designed as a distributed file-system with integrated distributed lock manager to scale out over multiple nodes. (I know Oracle is working on a similar thing but this might get things more complicated than it already is.) This might impose some serious issues with recovery scenarios like backup/restore since it will take quite some time to backup/restore a multi PB system when it resides on just 1 physical host even when we're talking high end P-series, I25K's or Superdome class.
>
> I'm not a coder but am heavily involved in the storage industry for the past 15 years so this is just some of the things I come across in real life enterprise customer environments so these are just some of my mind spinnings.
>
> There are some more however these would be best covered in another topic.
>
> Let me know your thoughts.
>
> Kind regards,
>
> Erwin van Londen
> Systems Engineer
> HITACHI DATA SYSTEMS
> Level 4, 441 St. Kilda Rd.
> Melbourne, Victoria, 3004
> Australia
>  
>   
Erwin, the real key here is to figure out what standard interfaces we 
can use to invoke these functions (from kernel context). Having a 
background in storage myself, the challenge in taking advantage of these 
advanced array features has been that each vendor has its own back-door 
APIs to control this. Linux prefers to take advantage of features that 
are well supported by multiple vendors.

If you have HDS engineers interested in exploring this or spilling 
details on how to trigger these, it would be a great conversation (but 
not just for btrfs, you would need to talk to the broader SCSI, LVM, etc 
lists as well :-))

Ric



* Re: btrfs for enterprise raid arrays
  2009-04-03  7:32 ` Sander
@ 2009-04-03 11:51   ` Ric Wheeler
  0 siblings, 0 replies; 11+ messages in thread
From: Ric Wheeler @ 2009-04-03 11:51 UTC (permalink / raw)
  To: sander; +Cc: Erwin van Londen, linux-btrfs

Sander wrote:
> Dear Erwin,
>
> Erwin van Londen wrote (ao):
>   
>> Another thing is that some arrays have the capability to
>> "thin-provision" volumes. In the back-end on the physical layer the
>> array configures, let say, a 1 TB volume and virtually provisions 5TB
>> to the host. On writes it dynamically allocates more pages in the pool
>> up to the 5TB point. Now if for some reason large holes occur on the
>> volume, maybe a couple of ISO images that have been deleted, what
>> normally happens is just some pointers in the inodes get deleted so
>> from an array perspective there is still data on those locations and
>> will never release those allocated blocks. New firmware/microcode
>> versions are able to reclaim that space if it sees a certain number of
>> consecutive zero's and will reclaim that space to the volume pool. Are
>> there any thoughts on writing a low-priority tread that zeros out
>> those "non-used" blocks?
>>     
>
> SSD would also benefit from such a feature as it doesn't need to copy
> deleted data when erasing blocks.
>   

The joy of the proposed T10 UNMAP commands is that you send one block 
down with special bits set which lets the target know you don't need the 
whole block range. No real data movement to or from the storage device 
(other than that one special sector).

> The storage could use the ATA/SCSI commands TRIM, UNMAP and DISCARD for
> that?
>   
That is the current plan & is being implemented in a way that should let 
us transparently (from a btrfs level) take advantage of arrays that do 
unmap or SSD trim enabled devices with no fs level changes.  Currently, 
as far as I know, we have no thin enabled devices to play with.
> I have one question on thin provisioning: if Windows XP performs defrag
> on a 20GB 'virtual' size LUN with 2GB in actuall use, whil the volume
> grow to 20GB on the storage and never shrink afterwards anymore, while
> the client still has only 2GB in use?
>   

That is the inverse of what would happen - the windows people will 
probably hack defrag to emit its equivalent of block discard commands 
for the old blocks after defragging a file (just guessing here). With a 
per fs defrag pass, you could use this kind of hook to resync the actual 
allocated block state between your fs image and the storage.
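
As a rough illustration of the "resync" pass mentioned above, the
following Python sketch walks a per-block allocation map and turns every
free run into a discard request. The function names, the boolean-list
bitmap and the min_blocks threshold are assumptions for illustration
only; a real filesystem would walk its own free-space structures and
hand the ranges to the block layer.

```python
def free_extents(allocated):
    """Yield (start, length) runs of unallocated blocks.

    `allocated` is a sequence of booleans, one per block.
    """
    start = None
    for i, used in enumerate(allocated):
        if not used and start is None:
            start = i                      # a free run begins here
        elif used and start is not None:
            yield (start, i - start)       # the free run just ended
            start = None
    if start is not None:
        yield (start, len(allocated) - start)


def discard_ranges(allocated, block_size=4096, min_blocks=1):
    """Convert free runs into (byte_offset, byte_length) discard requests.

    Runs shorter than `min_blocks` are skipped, since very small discards
    may not be worth sending to the device.
    """
    return [(s * block_size, n * block_size)
            for s, n in free_extents(allocated) if n >= min_blocks]
```

Skipping short runs is one possible way to limit the cost of small
discard commands; whether that is a good trade-off depends on the
device.
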
> This would make thin provisioning on virtual desktops less useful.
>
> Do you have any numbers on the performance impact of thin provisioning?
> I can imagine that thin provisioning causes on-storage defragmentation
> of disk images, which would kill any OS optimisations like grouping often
> read files.
>
> 	With kind regards, Sander
>
>   
Big arrays have virtual block ranges anyway, this is just a different 
layer of abstraction that is invisible to us.

Thin enabled devices might have a performance hit for small discard 
commands (that would impact truncate/unlink performance).  I suspect the 
performance impact will vary a lot depending on how the feature is coded 
internally in the storage, but we will only know when we get one to play 
with :-)

ric




* Re: btrfs for enterprise raid arrays
  2009-04-03 11:43 ` Ric Wheeler
@ 2009-04-03 11:58   ` David Woodhouse
  2009-04-03 12:02     ` Ric Wheeler
  2009-04-03 13:27     ` Matthew Wilcox
  2009-04-03 13:51   ` James Bottomley
  1 sibling, 2 replies; 11+ messages in thread
From: David Woodhouse @ 2009-04-03 11:58 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Erwin van Londen, linux-btrfs@vger.kernel.org, Alasdair G Kergon,
	James Bottomley, willy@linux.intel.com, Tom Coughlan

On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
> > New firmware/microcode versions are able to reclaim that space if it
> > sees a certain number of consecutive zero's and will reclaim that 
> > space to the volume pool. Are there any thoughts on writing a 
> > low-priority tread that zeros out those "non-used" blocks?
> 
> Patches have been floating around to support this - see the recent 
> patches around "DISCARD" on linux-ide and lkml.  It would be great to 
> get access to a box that implemented the T10 proposed UNMAP commands 
> that we could test against. 

We've already made btrfs support TRIM, and Matthew has patches which
hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard
once the dust settles on the spec.

I don't think I've seen anybody talking about deliberately writing
zeroes instead of just issuing a discard command though. That doesn't
seem like a massively cunning plan.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation



* Re: btrfs for enterprise raid arrays
  2009-04-03 11:58   ` David Woodhouse
@ 2009-04-03 12:02     ` Ric Wheeler
  2009-04-03 13:27     ` Matthew Wilcox
  1 sibling, 0 replies; 11+ messages in thread
From: Ric Wheeler @ 2009-04-03 12:02 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Ric Wheeler, Erwin van Londen, linux-btrfs@vger.kernel.org,
	Alasdair G Kergon, James Bottomley, willy@linux.intel.com,
	Tom Coughlan

David Woodhouse wrote:
> On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
>   
>>> New firmware/microcode versions are able to reclaim that space if it
>>> sees a certain number of consecutive zero's and will reclaim that 
>>> space to the volume pool. Are there any thoughts on writing a 
>>> low-priority tread that zeros out those "non-used" blocks?
>>>       
>> Patches have been floating around to support this - see the recent 
>> patches around "DISCARD" on linux-ide and lkml.  It would be great to 
>> get access to a box that implemented the T10 proposed UNMAP commands 
>> that we could test against. 
>>     
>
> We've already made btrfs support TRIM, and Matthew has patches which
> hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard
> once the dust settles on the spec.
>
> I don't think I've seen anybody talking about deliberately writing
> zeroes instead of just issuing a discard command though. That doesn't
> seem like a massively cunning plan.
>
>   
What the SCSI spec says is that you can use "WRITE SAME" with a discard 
bit set.  What the array would do with that is array dependent - it 
could in fact write that same block out to each of the blocks if it 
chooses to do so. The intention would be, of course, to manipulate 
internal array tracking so that you do no IO.
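
For readers who have not looked at the command itself, here is a small
Python sketch that builds a WRITE SAME(16) command descriptor block with
the unmap bit set. The byte layout follows my understanding of the
SBC-3 text as it was later published (opcode 0x93, UNMAP as bit 3 of
byte 1); at the time of this thread the proposal was still in flux, so
treat the constants as an assumption. The function name is made up for
this example, and nothing here actually sends the command to a device.

```python
import struct

WRITE_SAME_16 = 0x93   # operation code
UNMAP_BIT = 0x08       # bit 3 of byte 1


def write_same16_cdb(lba, num_blocks, unmap=True):
    """Build a 16-byte WRITE SAME(16) CDB."""
    flags = UNMAP_BIT if unmap else 0
    # opcode, flags, 8-byte LBA, 4-byte block count, group number, control
    return struct.pack(">BBQIBB", WRITE_SAME_16, flags, lba, num_blocks, 0, 0)
```

The data-out buffer that accompanies this CDB is a single logical block
(for example, all zeros), which is why no real data movement is needed
for a large range.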

We should avoid doing that command to arrays that don't really implement 
the unmap part, it could take a long time to complete a single largish 
discard request :-)

The nice part of the write same with unmap flavour of the T10 command is 
that it is very clear about the semantics of what you should get back.

Ric



* Re: btrfs for enterprise raid arrays
  2009-04-03  4:34 btrfs for enterprise raid arrays Erwin van Londen
  2009-04-03  7:32 ` Sander
  2009-04-03 11:43 ` Ric Wheeler
@ 2009-04-03 13:22 ` Chris Mason
  2009-04-06  0:29   ` Erwin van Londen
  2 siblings, 1 reply; 11+ messages in thread
From: Chris Mason @ 2009-04-03 13:22 UTC (permalink / raw)
  To: Erwin van Londen; +Cc: linux-btrfs

On Thu, 2009-04-02 at 21:34 -0700, Erwin van Londen wrote:
> Dear all,
> 
> While going through the archived mailing list and crawling along the
> wiki I didn't find any clues if there would be any optimizations in
> Btrfs to make efficient use of functions and features that today exist
> on enterprise class storage arrays.
> 
> One exception to that was the ssd option which I think can make a
> improvement on read and write IO's however when attached to a storage
> array, from an OS perspective, it doesn't really matter since it can't
> look behind the array front-end interface anyhow(whether it FC/iSCSI
> or any other).
> 
> There are however more options that we could think of. Almost all
> storage arrays these days have the capabilities to replicate volume
> (or part of it in COW cases) either in the system or remotely. It
> would be handy that if a Btrfs formatted volume could make use of
> those features since this might offload a lot of the processing time
> involved in maintaining these. The arrays already have optimized code
> to make these snapshots. I'm not saying we should step away from the
> host based snapshots but integration would be very nice.

Storage based snapshotting would definitely be useful for replication in
btrfs, and in that case we could wire it up from userland.  Basically
there is a point during commit where a storage snapshot could be taken
and fully consistent.

Outside of replication though, I'm not sure exactly where storage based
snapshotting would come in.  It wouldn't really be compatible with the
snapshots btrfs is already doing (but I'm always open to more ideas).

> Furthermore some enterprise array have a feature that allows for full
> or partial staging data in cache. By this I mean when a volume
> contains a certain amount of blocks you can define to have the first X
> number of blocks pre-staged in cache which enables you to have
> extremely high IO rates on these first ones. An option related to the
> -ssd parameter could be to have a mount command say "mount -t btrfs
> -ssd 0-10000" so Btrfs knows what to expect from the partial area and
> maybe can optimize the locality of frequently used blocks to optimize
> performance.

This would be very useful, although I would tend to export it to btrfs
as a second lun.  My long term goal is to have code in btrfs that
supports a super fast staging lun, which might be an ssd or cache carved
out of a high end array.
> 
> Another thing is that some arrays have the capability to
> "thin-provision" volumes. In the back-end on the physical layer the
> array configures, let say, a 1 TB volume and virtually provisions 5TB
> to the host. On writes it dynamically allocates more pages in the pool
> up to the 5TB point. Now if for some reason large holes occur on the
> volume, maybe a couple of ISO images that have been deleted, what
> normally happens is just some pointers in the inodes get deleted so
> from an array perspective there is still data on those locations and
> will never release those allocated blocks. New firmware/microcode
> versions are able to reclaim that space if it sees a certain number of
> consecutive zero's and will reclaim that space to the volume pool. Are
> there any thoughts on writing a low-priority tread that zeros out
> those "non-used" blocks?

Other people have replied about the trim commands, which btrfs can issue
on every block it frees.  But, another way to look at this is that btrfs
already is thinly provisioned.  When you add storage to btrfs, it
allocates from that storage in 1GB chunks, and then hands those over to
the FS allocation code for more fine grained use.
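
The two-level scheme described above can be sketched in a few lines of
Python. This is a toy model and not the btrfs allocator: the class name
ChunkedAllocator, the first-fit policy and the lack of any free
operation are simplifications chosen for illustration. It only shows
that device space is claimed one chunk at a time and that finer-grained
extents are then handed out from inside those chunks.

```python
GIB = 1 << 30


class ChunkedAllocator:
    """Toy two-level allocator: carve a device into chunks on demand,
    then hand out smaller extents from inside the chunks."""

    def __init__(self, device_bytes, chunk_bytes=GIB):
        self.device_bytes = device_bytes
        self.chunk_bytes = chunk_bytes
        self.next_chunk_start = 0      # device space below this is in chunks
        self.chunks = []               # each entry: [start, bytes_used]

    def _new_chunk(self):
        if self.next_chunk_start + self.chunk_bytes > self.device_bytes:
            raise RuntimeError("no room for another chunk")
        self.chunks.append([self.next_chunk_start, 0])
        self.next_chunk_start += self.chunk_bytes

    def alloc(self, size):
        """Return the device offset of a new extent of `size` bytes."""
        if size > self.chunk_bytes:
            raise ValueError("extent larger than a chunk")
        for chunk in self.chunks:
            if chunk[1] + size <= self.chunk_bytes:
                break                  # first chunk with enough room
        else:
            self._new_chunk()          # nothing fits: claim another chunk
            chunk = self.chunks[-1]
        offset = chunk[0] + chunk[1]
        chunk[1] += size
        return offset

    def touched_bytes(self):
        """Device space that has been handed to the fine-grained level."""
        return len(self.chunks) * self.chunk_bytes
```

On a thin-provisioned LUN, touched_bytes() is roughly the amount of
space the array would have to back, which is why the two schemes are
worth discussing together.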

It may make sense to talk about how that can fit in with your own thin
provisioning.

> Given the scalability targets of Btrfs it will most likely be heavily
> used in the enterprise environment once it reaches a stable code
> level. If we would be able to interface with these array based
> features that would be very beneficial. 
> 
> Furthermore one question also pops to mind and that's when looking at
> the scalability of Btrfs and its targeted capacity levels I think we
> will run into problems with the capabilities of the server hardware
> itself. From what I can see now it will not be designed as a
> distributed file-system with integrated distributed lock manager to
> scale out over multiple nodes. (I know Oracle is working on a similar
> thing but this might get things more complicated than it already is.)
> This might impose some serious issues with recovery scenarios like
> backup/restore since it will take quite some time to backup/restore a
> multi PB system when it resides on just 1 physical host even when
> we're talking high end P-series, I25K's or Superdome class.

This is true.  Things like replication and failover are the best plans
for it today.

Thanks for your interest, we're always looking for ways to better
utilize high end storage features.

-chris




* Re: btrfs for enterprise raid arrays
  2009-04-03 11:58   ` David Woodhouse
  2009-04-03 12:02     ` Ric Wheeler
@ 2009-04-03 13:27     ` Matthew Wilcox
  2009-04-03 13:48       ` James Bottomley
  1 sibling, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2009-04-03 13:27 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Ric Wheeler, Erwin van Londen, linux-btrfs@vger.kernel.org,
	Alasdair G Kergon, James Bottomley, Tom Coughlan

On Fri, Apr 03, 2009 at 12:58:00PM +0100, David Woodhouse wrote:
> On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
> > > New firmware/microcode versions are able to reclaim that space if it
> > > sees a certain number of consecutive zero's and will reclaim that 
> > > space to the volume pool. Are there any thoughts on writing a 
> > > low-priority tread that zeros out those "non-used" blocks?
> > 
> > Patches have been floating around to support this - see the recent 
> > patches around "DISCARD" on linux-ide and lkml.  It would be great to 
> > get access to a box that implemented the T10 proposed UNMAP commands 
> > that we could test against. 
> 
> We've already made btrfs support TRIM, and Matthew has patches which
> hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard
> once the dust settles on the spec.

It seems like the dust has settled ... I just need to check that
my code still conforms to the spec.  Understandably, I've been focused
on TRIM ;-)

> I don't think I've seen anybody talking about deliberately writing
> zeroes instead of just issuing a discard command though. That doesn't
> seem like a massively cunning plan.

Yeah, WRITE SAME with the discard bit.  A bit of a crappy way to go, to
be sure.  I'm not exactly sure how we're supposed to be deciding whether
to issue an UNMAP or WRITE SAME command.  Perhaps if I read the spec
properly it'll tell me.

I just had a quick chat with someone from another storage vendor who
don't yet implement UNMAP -- if you do a WRITE SAME with all zeroes,
their device will notice that and unmap the LBAs in question.

Something for the plane on Sunday anyway.


* Re: btrfs for enterprise raid arrays
  2009-04-03 13:27     ` Matthew Wilcox
@ 2009-04-03 13:48       ` James Bottomley
  0 siblings, 0 replies; 11+ messages in thread
From: James Bottomley @ 2009-04-03 13:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Woodhouse, Ric Wheeler, Erwin van Londen,
	linux-btrfs@vger.kernel.org, Alasdair G Kergon, Tom Coughlan

On Fri, 2009-04-03 at 06:27 -0700, Matthew Wilcox wrote:
> On Fri, Apr 03, 2009 at 12:58:00PM +0100, David Woodhouse wrote:
> > On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
> > > > New firmware/microcode versions are able to reclaim that space if it
> > > > sees a certain number of consecutive zero's and will reclaim that 
> > > > space to the volume pool. Are there any thoughts on writing a 
> > > > low-priority tread that zeros out those "non-used" blocks?
> > > 
> > > Patches have been floating around to support this - see the recent 
> > > patches around "DISCARD" on linux-ide and lkml.  It would be great to 
> > > get access to a box that implemented the T10 proposed UNMAP commands 
> > > that we could test against. 
> > 
> > We've already made btrfs support TRIM, and Matthew has patches which
> > hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard
> > once the dust settles on the spec.
> 
> It seems like the dust has settled ... I just need to check that
> my code still conforms to the spec.  Understandably, I've been focused
> on TRIM ;-)
> 
> > I don't think I've seen anybody talking about deliberately writing
> > zeroes instead of just issuing a discard command though. That doesn't
> > seem like a massively cunning plan.
> 
> Yeah, WRITE SAME with the discard bit.  A bit of a crappy way to go, to
> be sure.  I'm not exactly sure how we're supposed to be deciding whether
> to issue an UNMAP or WRITE SAME command.  Perhaps if I read the spec
> properly it'll tell me.

Actually, the point about WRITE SAME is that it's a far smaller patch to
the standards (just a couple of bits).  Plus it gets around the problem
of what does the array return when an unmapped block is requested (which
occupies pages in the UNMAP proposal), so from that point of view it
seems very logical.

> I just had a quick chat with someone from another storage vendor who
> don't yet implement UNMAP -- if you do a WRITE SAME with all zeroes,
> their device will notice that and unmap the LBAs in question.

I actually already looked at using WRITE SAME in sd.c ... it turns out
to be surprisingly little work ... the thing you'll like about it is
that there are no extents to worry about and if you plan on writing all
zeros, you can keep a static zeroed data buffer around for the
purpose ...
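
For contrast with WRITE SAME, here is a Python sketch of what the
extent-based UNMAP variant has to carry: a parameter list with one
16-byte descriptor per extent. The layout reflects my understanding of
the SBC-3 text as later published (opcode 0x42, an 8-byte header
followed by LBA/count descriptors) and should be read as an assumption,
since the proposal was not final when this thread was written. The
function names are invented for this example and nothing is sent to a
device.

```python
import struct

UNMAP = 0x42  # operation code


def unmap_param_list(extents):
    """Build the UNMAP parameter list for a list of (lba, num_blocks)."""
    # Each descriptor: 8-byte LBA, 4-byte block count, 4 reserved bytes.
    descriptors = b"".join(
        struct.pack(">QI4x", lba, count) for lba, count in extents
    )
    # Header: data length (bytes following this field), descriptor bytes,
    # then four reserved bytes.
    header = struct.pack(">HH4x", len(descriptors) + 6, len(descriptors))
    return header + descriptors


def unmap_cdb(param_list_len):
    """Build the 10-byte UNMAP CDB."""
    # opcode, anchor/reserved, 4 reserved bytes, group number,
    # 2-byte parameter list length, control
    return struct.pack(">BB4xBHB", UNMAP, 0, 0, param_list_len, 0)
```

A WRITE SAME request needs none of this bookkeeping: one starting LBA,
one block count and a single block of data.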

James




* Re: btrfs for enterprise raid arrays
  2009-04-03 11:43 ` Ric Wheeler
  2009-04-03 11:58   ` David Woodhouse
@ 2009-04-03 13:51   ` James Bottomley
  1 sibling, 0 replies; 11+ messages in thread
From: James Bottomley @ 2009-04-03 13:51 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Erwin van Londen, linux-btrfs, Alasdair G Kergon, willy,
	David.Woodhouse, Tom Coughlan

On Fri, 2009-04-03 at 07:43 -0400, Ric Wheeler wrote:
> Erwin van Londen wrote:
> > Another thing is that some arrays have the capability to "thin-provision" volumes. In the back-end on the physical layer the array configures, let say, a 1 TB volume and virtually provisions 5TB to the host. On writes it dynamically allocates more pages in the pool up to the 5TB point. Now if for some reason large holes occur on the volume, maybe a couple of ISO images that have been deleted, what normally happens is just some pointers in the inodes get deleted so from an array perspective there is still data on those locations and will never release those allocated blocks. New firmware/microcode versions are able to reclaim that space if it sees a certain number of consecutive zero's and will reclaim that space to the volume pool. Are there any thoughts on writing a low-priority tread that zeros out those "non-used" blocks?
> >   
> 
> Patches have been floating around to support this - see the recent 
> patches around "DISCARD" on linux-ide and lkml.  It would be great to 
> get access to a box that implemented the T10 proposed UNMAP commands 
> that we could test against. 

So we went around the block several times for the upcoming Linux
Filesystem and Storage workshop to see if anyone from the array vendors
might be interested in discussing thin provisioning.  The general answer
was no, since travel is tight.  The upshot will be that most of our
discard infrastructure will be focussed on SSD TRIM, but that we'll try
to preserve the TP option for arrays ... there are still private
conversations going on with various people who know the UNMAP/WRITE SAME
requirements of the various arrays at the various vendors.
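
For readers who have not looked at the discard path: from userspace, a
discard can be requested on a block device with the BLKDISCARD ioctl,
which takes a {start, length} pair of 64-bit byte counts. The sketch
below only illustrates that interface. The helper names and the
512-byte alignment default are assumptions made for this example, and
whether the request ends up as an ATA TRIM, a SCSI UNMAP or nothing at
all depends on the device and on kernel support.

```python
import fcntl
import struct

# Value of BLKDISCARD from <linux/fs.h>: _IO(0x12, 119).
BLKDISCARD = 0x1277


def discard_arg(start, length, sector_size=512):
    """Pack a byte range as the uint64_t[2] {start, length} for BLKDISCARD.

    sector_size is an assumption of this sketch; the range must be a
    multiple of it.
    """
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length > 0")
    if start % sector_size or length % sector_size:
        raise ValueError("range must be sector aligned")
    return struct.pack("=QQ", start, length)


def discard(fd, start, length):
    """Request a discard of [start, start + length) on an open block device.

    Destructive: data in the range must be considered lost afterwards.
    """
    fcntl.ioctl(fd, BLKDISCARD, discard_arg(start, length))
```

Calling discard() needs a file descriptor opened for writing on a block
device you are prepared to lose data on, so try it only on a scratch
device.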

James




* RE: btrfs for enterprise raid arrays
  2009-04-03 13:22 ` Chris Mason
@ 2009-04-06  0:29   ` Erwin van Londen
  0 siblings, 0 replies; 11+ messages in thread
From: Erwin van Londen @ 2009-04-06  0:29 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Chris,
 

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-
> owner@vger.kernel.org] On Behalf Of Chris Mason
> Sent: Saturday, 4 April 2009 12:23 AM
> To: Erwin van Londen
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: btrfs for enterprise raid arrays
> 
> On Thu, 2009-04-02 at 21:34 -0700, Erwin van Londen wrote:
> > Dear all,
> >
> > While going through the archived mailing list and crawling along the
> > wiki I didn't find any clues if there would be any optimizations in
> > Btrfs to make efficient use of functions and features that today exist
> > on enterprise class storage arrays.
> >
> > One exception to that was the ssd option which I think can make an
> > improvement on read and write IO's however when attached to a storage
> > array, from an OS perspective, it doesn't really matter since it can't
> > look behind the array front-end interface anyhow (whether it is
> > FC/iSCSI or any other).
> >
> > There are however more options that we could think of. Almost all
> > storage arrays these days have the capabilities to replicate volume
> > (or part of it in COW cases) either in the system or remotely. It
> > would be handy that if a Btrfs formatted volume could make use of
> > those features since this might offload a lot of the processing time
> > involved in maintaining these. The arrays already have optimized code
> > to make these snapshots. I'm not saying we should step away from the
> > host based snapshots but integration would be very nice.
>
> Storage based snapshotting would definitely be useful for replication in
> btrfs, and in that case we could wire it up from userland.  Basically
> there is a point during commit where a storage snapshot could be taken
> and fully consistent.
>
> Outside of replication though, I'm not sure exactly where storage based
> snapshotting would come in.  It wouldn't really be compatible with the
> snapshots btrfs is already doing (but I'm always open to more ideas).

There is a Linux interface, although I don't think it is open source
(unfortunately), which runs from userland and "talks" directly to a
so-called "command device" on the array. From a usability perspective
you could mount this snapshot/shadow image on a second server and
process data from there (backups etc.).

> 
> > Furthermore some enterprise arrays have a feature that allows for full
> > or partial staging data in cache. By this I mean when a volume
> > contains a certain amount of blocks you can define to have the first X
> > number of blocks pre-staged in cache which enables you to have
> > extremely high IO rates on these first ones. An option related to the
> > -ssd parameter could be to have a mount command say "mount -t btrfs
> > -ssd 0-10000" so Btrfs knows what to expect from the partial area and
> > maybe can optimize the locality of frequently used blocks to optimize
> > performance.
>
> This would be very useful, although I would tend to export it to btrfs
> as a second lun.  My long term goal is to have code in btrfs that
> supports a super fast staging lun, which might be an ssd or cache carved
> out of a high end array.

The problem with that is addressability, especially if you have a
significant number of volumes attached to a host and are using FC
multi-pathing tools underneath. From an administrative point of view
this will complicate things a lot. The option that I mentioned is
transparent to the admin, who only needs to add the number of blocks
to the mount command or fstab.
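
To make the idea concrete, here is a minimal sketch of how such a
"FIRST-LAST" block range could be parsed and queried. Nothing like this
exists in btrfs today: the function names, the range syntax and the
inclusive upper bound are all assumptions made purely for illustration.

```python
def parse_staged_range(option):
    """Parse a hypothetical 'FIRST-LAST' block range such as '0-10000'."""
    first_text, sep, last_text = option.partition("-")
    if not sep:
        raise ValueError("expected FIRST-LAST, got %r" % option)
    # int() raises ValueError on empty or non-numeric parts.
    first, last = int(first_text), int(last_text)
    if first < 0 or last < first:
        raise ValueError("invalid block range %r" % option)
    return first, last


def is_staged(block, staged):
    """True if block lies in the cache-staged range (bounds inclusive)."""
    first, last = staged
    return first <= block <= last
```

An allocator could then consult is_staged() when deciding where to
place frequently used blocks.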

Bear in mind that this method (staging in cache) is still a lot faster
than having flash drives since there is no back-end traffic going on.
All IOs only touch cache and front-end ports. As I said there is also
the option to put a full volume in cache, however from a financial point
of view this will become expensive. That's one of the reasons why we
came up with the partial bit.

> >
> > Another thing is that some arrays have the capability to
> > "thin-provision" volumes. In the back-end on the physical layer the
> > array configures, let's say, a 1 TB volume and virtually provisions
> > 5TB to the host. On writes it dynamically allocates more pages in the
> > pool up to the 5TB point. Now if for some reason large holes occur on
> > the volume, maybe a couple of ISO images that have been deleted, what
> > normally happens is just some pointers in the inodes get deleted so
> > from an array perspective there is still data on those locations and
> > it will never release those allocated blocks. New firmware/microcode
> > versions are able to reclaim that space if it sees a certain number of
> > consecutive zeros and will reclaim that space to the volume pool. Are
> > there any thoughts on writing a low-priority thread that zeros out
> > those "non-used" blocks?
>
> Other people have replied about the trim commands, which btrfs can issue
> on every block it frees.  But, another way to look at this is that btrfs
> already is thinly provisioned.  When you add storage to btrfs, it
> allocates from that storage in 1GB chunks, and then hands those over to
> the FS allocation code for more fine grained use.
>
> It may make sense to talk about how that can fit in with your own thin
> provisioning.

The problem is that the arrays have to be pre-configured for volume
level allocation. Whether this is a normal volume or a thin-provisioned
volume doesn't matter. You're right that it would be fantastic if the
array had interface(s) to dynamically allocate blocks from those pools
as soon as btrfs addresses them. Unfortunately that's not the case
today and it's one-way traffic, not only with our arrays but with other
vendors' as well. So a thin-provisioned volume still presents a fixed
number of blocks to the host, although in the back end it only draws
those blocks from the pool as soon as they get "touched". After this
those blocks stay reserved for that volume unless something tells the
array to free them up. Currently, from an array perspective, the only
way to release those pages is for the array to see a consecutive run of
zeros. I'm not aware that the array software currently supports the
trim commands but I can have a look around.
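
The behaviour described above can be sketched as a toy model. This is
not how any particular array is implemented: the class name, the 4 KiB
page size and the whole-page write restriction are assumptions made for
the example. It only shows that pages are allocated on first write,
stay allocated when the host merely drops its pointers, and go back to
the pool once they contain nothing but zeros.

```python
class ThinVolume:
    """Toy model of a thin-provisioned volume (all names hypothetical)."""

    def __init__(self, virtual_pages, pool_pages, page_size=4096):
        self.virtual_pages = virtual_pages  # size advertised to the host
        self.pool_pages = pool_pages        # physical pages really available
        self.page_size = page_size          # assumed page size for the sketch
        self.pages = {}                     # page index -> data, set on first write

    @property
    def allocated(self):
        """Number of pool pages currently backing this volume."""
        return len(self.pages)

    def write(self, page, data):
        if not 0 <= page < self.virtual_pages:
            raise IndexError("write beyond the advertised volume size")
        if len(data) != self.page_size:
            raise ValueError("this sketch only handles whole-page writes")
        if page not in self.pages and self.allocated >= self.pool_pages:
            raise RuntimeError("pool exhausted")
        self.pages[page] = bytearray(data)

    def read(self, page):
        # Pages that were never written read back as zeros.
        return bytes(self.pages.get(page, bytes(self.page_size)))

    def reclaim_zero_pages(self):
        """Return all-zero pages to the pool and report how many were freed."""
        zero_pages = [p for p, buf in self.pages.items() if not any(buf)]
        for p in zero_pages:
            del self.pages[p]
        return len(zero_pages)
```

A host that wants space back from such a volume therefore has to
overwrite the freed blocks with zeros before the array's scan can
release them, which is what the proposed low-priority thread would do.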

> 
> > Given the scalability targets of Btrfs it will most likely be heavily
> > used in the enterprise environment once it reaches a stable code
> > level. If we would be able to interface with these array based
> > features that would be very beneficial.
> >
> > Furthermore one question also pops to mind and that's when looking at
> > the scalability of Btrfs and its targeted capacity levels I think we
> > will run into problems with the capabilities of the server hardware
> > itself. From what I can see now it will not be designed as a
> > distributed file-system with integrated distributed lock manager to
> > scale out over multiple nodes. (I know Oracle is working on a similar
> > thing but this might get things more complicated than it already is.)
> > This might impose some serious issues with recovery scenarios like
> > backup/restore since it will take quite some time to backup/restore a
> > multi PB system when it resides on just 1 physical host even when
> > we're talking high end P-series, I25K's or Superdome class.
>
> This is true.  Things like replication and failover are the best plans
> for it today.
>
> Thanks for your interest, we're always looking for ways to better
> utilize high end storage features.

No problem. My interest is in the adoption level as well, and given
that large companies will mostly be using these arrays, they are the
ones who will benefit most from a filesystem that gives them the
flexibility and robustness that we're targeting with btrfs.

> 
> -chris
> 
> 


