public inbox for linux-kernel@vger.kernel.org
* [help]How to block new write in a "Thin Provisioning" logical volume manager as a virtual device driver when physical spaces run out?
@ 2008-05-29  9:12 Xiaoming Li
  2008-05-29 10:31 ` Alan Cox
  0 siblings, 1 reply; 7+ messages in thread
From: Xiaoming Li @ 2008-05-29  9:12 UTC (permalink / raw)
  To: linux-kernel

Hi, guys,

We are currently developing a "Thin Provisioning" (see
http://en.wikipedia.org/wiki/Thin_Provisioning) logical volume manager
called ASD at the Linux block device level, which is somewhat like
LVM (see http://en.wikipedia.org/wiki/Logical_Volume_Manager_%28Linux%29).
But there is a big difference between ASD and LVM: LVM pre-allocates
all storage space at the very beginning ("fat provisioning"), while ASD
allocates storage space _ONLY WHEN NEEDED_.
As a result, with LVM, you can never export a "logical volume" whose
claimed storage space is larger than that of the underlying physical
storage devices (e.g. SATA disks, hardware RAID arrays, etc.); but with
ASD, you can export "logical volumes" with a much larger logical
storage space.

However, we have run into some trouble while developing ASD. The
thin-provisioning feature of ASD may inevitably lead to an "overcommit
problem" in the worst case; that is, the underlying physical storage
resources run out while the applications on top are still running. Since
the logical storage space seen by the upper-level applications is very
large, they never know how much physical storage space is actually left.
What's worse, with VFS delayed writes, a write request is
immediately satisfied by the VFS cache. Although the application thinks
the request has completed successfully, the data may actually be
silently discarded by the ASD driver later because there is no
physical resource left.

In an attempt to resolve this problem, we found that the prepare_write
method in the inode's address_space operations is invoked by VFS for any
newly written data; that is, it notifies the file system module to
allocate the required space for the new data. If our ASD driver can also
be informed via this hook, then we can solve the problem.
Fortunately, we can get at and modify this hook function (prepare_write)
in the case of raw block device I/O (VFS handles the ASD-exported device
as a whole).

However, this solution also has two limitations.

1. We cannot block write requests to a mount point where an ASD device
is mounted.
That is to say, if we run mkfs.ext2 on an ASD device, mount it at
/mnt/asd1, and then dd to the mount point /mnt/asd1, we will not be
able to block the new writes, which may exhaust the physical device
space.

2. We cannot block the writes if we stack other virtual devices
on top of ASD, unless we modify _ALL_ of the virtual devices in the I/O
stack.
Because of the way VFS works, upper-level applications will write
to the page cache of the newly added virtual devices above ASD. With
our current solution, we would have to replace the prepare_write()
function of each newly added virtual device. What's worse, the more
virtual devices there are in the I/O stack, the more prepare_write()
functions we would have to replace to achieve our goal, which is
unacceptable.

Does anyone have some ideas for a better solution?
Thanks a lot!

--
Regards,

alx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [help]How to block new write in a "Thin Provisioning" logical volume manager as a virtual device driver when physical spaces run out?
  2008-05-29  9:12 [help]How to block new write in a "Thin Provisioning" logical volume manager as a virtual device driver when physical spaces run out? Xiaoming Li
@ 2008-05-29 10:31 ` Alan Cox
  2008-05-29 16:08   ` Rik van Riel
  2008-06-02 17:22   ` Xiaoming Li
  0 siblings, 2 replies; 7+ messages in thread
From: Alan Cox @ 2008-05-29 10:31 UTC (permalink / raw)
  To: Xiaoming Li; +Cc: linux-kernel

> As a result, with LVM, you can never export a "logical volume" whose
> claimed storage space is larger than that of the underlying physical
> storage devices (e.g. SATA disks, hardware RAID arrays, etc.); but with
> ASD, you can export "logical volumes" with a much larger logical
> storage space.

Why do that ?

> Does anyone have some ideas for a better solution?

Take one file system such as ext3, or even a cluster file system like
GFS2 or OCFS. Create top-level subdirectories in it, one for each
machine. Either export the subdirectory via NFS, or mount the clustered
file system somewhere on each node and then remount the subdirectory
into the right place in the file tree.

(And for a clustered/shared pool root you can use pivot_root() to start
from initrd or local disk and then switch to running entirely within the
clusterfs)

Alan



* Re: [help]How to block new write in a "Thin Provisioning" logical volume manager as a virtual device driver when physical spaces run out?
  2008-05-29 10:31 ` Alan Cox
@ 2008-05-29 16:08   ` Rik van Riel
  2008-05-29 23:04     ` Dave Chinner
  2008-06-02 17:22   ` Xiaoming Li
  1 sibling, 1 reply; 7+ messages in thread
From: Rik van Riel @ 2008-05-29 16:08 UTC (permalink / raw)
  To: Alan Cox; +Cc: Xiaoming Li, linux-kernel

On Thu, 29 May 2008 11:31:59 +0100
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > Does anyone have some ideas for a better solution?
> 
> Take one file system such as ext3, or even a cluster file system like
> GFS2 or OCFS. Create top level subdirectories in it for each machine.
> Either export the subdirectory via NFS.

Xiaoming, let me point out another advantage of Alan's approach.

In a block based thin provisioning system, like you proposed, there
is no way to free up space.  Once a user's filesystem has written a
block, it is allocated - when the user deletes a file inside the
filesystem, the space will not be freed again...

This means that once the disk is full it will stay full, no matter
how many files are deleted in the guest filesystems. Overcommitting
an unfreeable resource is a bad idea.

-- 
All rights reversed.


* Re: [help]How to block new write in a "Thin Provisioning" logical volume manager as a virtual device driver when physical spaces run out?
  2008-05-29 16:08   ` Rik van Riel
@ 2008-05-29 23:04     ` Dave Chinner
  2008-05-30  7:31       ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2008-05-29 23:04 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Alan Cox, Xiaoming Li, linux-kernel

On Thu, May 29, 2008 at 12:08:15PM -0400, Rik van Riel wrote:
> On Thu, 29 May 2008 11:31:59 +0100
> Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> 
> > > Does anyone have some ideas for a better solution?
> > 
> > Take one file system such as ext3, or even a cluster file system like
> > GFS2 or OCFS. Create top level subdirectories in it for each machine.
> > Either export the subdirectory via NFS.
> 
> Xiaoming, let me point out another advantage of Alan's approach.
> 
> In a block based thin provisioning system, like you proposed, there
> is no way to free up space.  Once a user's filesystem has written a
> block, it is allocated - when the user deletes a file inside the
> filesystem, the space will not be freed again...

That's where we've been discussing bio hints to help communicate
space being allocated and freed by the filesystem to the lower layers.

See here:

http://marc.info/?l=linux-fsdevel&m=119370585902974&w=2

Alternatively, forget about block based thin provisioning and
just use XFS directory quotas....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [help]How to block new write in a "Thin Provisioning" logical volume manager as a virtual device driver when physical spaces run out?
  2008-05-29 23:04     ` Dave Chinner
@ 2008-05-30  7:31       ` Christoph Hellwig
  0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2008-05-30  7:31 UTC (permalink / raw)
  To: Rik van Riel, Alan Cox, Xiaoming Li, linux-kernel

On Fri, May 30, 2008 at 09:04:41AM +1000, Dave Chinner wrote:
> > In a block based thin provisioning system, like you proposed, there
> > is no way to free up space.  Once a user's filesystem has written a
> > block, it is allocated - when the user deletes a file inside the
> > filesystem, the space will not be freed again...
> 
> That's where we've been discussing bio hints to help communicate
> space being allocated and freed by the filesystem to the lower layers.

Or use the XFS multiple subvolume support based on that:

http://verein.lst.de/~hch/xfs/xfs-multiple-containers.txt

and yeah, I really need to finish the loose bits up and actually post it.
Hopefully next month.



* Re: [help]How to block new write in a "Thin Provisioning" logical volume manager as a virtual device driver when physical spaces run out?
  2008-05-29 10:31 ` Alan Cox
  2008-05-29 16:08   ` Rik van Riel
@ 2008-06-02 17:22   ` Xiaoming Li
  2008-06-02 19:41     ` Alan Cox
  1 sibling, 1 reply; 7+ messages in thread
From: Xiaoming Li @ 2008-06-02 17:22 UTC (permalink / raw)
  To: Alan Cox; +Cc: Rik van Riel, Dave Chinner, Christoph Hellwig, linux-kernel

Dear Alan,

On 5/29/08, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > As a result, with LVM, you can never export a "logical volume" whose
> > claimed storage space is larger than that of the underlying physical
> > storage devices (e.g. SATA disks, hardware RAID arrays, etc.); but with
> > ASD, you can export "logical volumes" with a much larger logical
> > storage space.
>
> Why do that ?
>
> > Does anyone have some ideas for a better solution?
>
> Take one file system such as ext3, or even a cluster file system like
> GFS2 or OCFS. Create top level subdirectories in it for each machine.
> Either export the subdirectory via NFS. Alternatively mount the clustered
> file system somewhere on each node and then remount the subdirectory into
> the right place in the file tree
>
> (And for a clustered/shared pool root you can use pivot_root() to start
> from initrd or local disk and then switch to running entirely within the
> clusterfs)

Thank you for your quick reply.

Yes, in our ASD system, we have implemented "Thin Provisioning" at the
block level, and we do have the "no way to free space" problem. Thanks
to Rik van Riel for pointing this out.

However, I want to ask: is there _any need_ to implement "Thin
Provisioning" at the block level rather than at the FS level?
In my opinion, there are some reasons why we implemented "Thin
Provisioning" at the block level:
1. We can use _all_ types of FS on our ASD device.
2. In our current system, we use some other virtual device drivers to
provide other features, like snapshots, cache management, exporting as
an iSCSI target in a SAN, etc. Please note, all of these virtual
device drivers have been developed already.
3. Some storage vendors (e.g. EMC) have their own "block-based thin
provisioning" products; they must have good reasons to do so.
Please check the following URL for reference:
http://www.wikibon.org/Maximizing_storage_returns_from_your_EMC_relationship#Strategic_fit_for_EMC_in_key_storage_modernization_projects
Then search for the string "Block-based thin provisioning".


* Re: [help]How to block new write in a "Thin Provisioning" logical volume manager as a virtual device driver when physical spaces run out?
  2008-06-02 17:22   ` Xiaoming Li
@ 2008-06-02 19:41     ` Alan Cox
  0 siblings, 0 replies; 7+ messages in thread
From: Alan Cox @ 2008-06-02 19:41 UTC (permalink / raw)
  To: Xiaoming Li; +Cc: Rik van Riel, Dave Chinner, Christoph Hellwig, linux-kernel

> However, I want to ask: is there _any need_ to implement "Thin
> Provisioning" at the block level rather than at the FS level?

A good question.

> In my opinion, there are some reasons why we implemented "Thin
> Provisioning" at the block level:
> 1. We can use _all_ types of FS on our ASD device.

Except that you can run out of space and die.

> 2. In our current system, we use some other virtual device drivers to
> provide other features, like snapshots, cache management, exporting as
> an iSCSI target in a SAN, etc. Please note, all of these virtual
> device drivers have been developed already.

A cluster file system can do cache management and, in theory, snapshots.
The iSCSI target is a block-level property - the equivalent in the FS
layer would, I guess, be NFS. Most of those have been developed too ;)

> 3. Some storage vendors (e.g. EMC) have their own "block-based thin
> provisioning" products; they must have good reasons to do so.

Some storage vendors do the most marvellously bizarre things. That
doesn't mean they are the right answers. EMC don't/didn't have a cluster
file system, so that rather limited their choice.

I think you missed one, however, and maybe one EMC considered - it's a
much easier way to do cross-platform, non-shared filestore as a device
than adding clustering file systems to do that.

However, if you overcommit, you have a problem. It's interesting as a
front-end technology with an array of large, slow disks behind it (so
you don't overcommit but push old storage to the slow disks). I don't
think it's interesting in the general case, except where you can
carefully avoid overcommit through management policies.

It's also not helped by the fact that your storage layer needs to
understand the filesystems it supports in order to deduce which blocks
are free so that it can recover them.

Alan


end of thread, other threads:[~2008-06-02 19:57 UTC | newest]

Thread overview: 7+ messages
-- links below jump to the message on this page --
2008-05-29  9:12 [help]How to block new write in a "Thin Provisioning" logical volume manager as a virtual device driver when physical spaces run out? Xiaoming Li
2008-05-29 10:31 ` Alan Cox
2008-05-29 16:08   ` Rik van Riel
2008-05-29 23:04     ` Dave Chinner
2008-05-30  7:31       ` Christoph Hellwig
2008-06-02 17:22   ` Xiaoming Li
2008-06-02 19:41     ` Alan Cox
