From: Dave Chinner <david@fromorbit.com>
To: David Woodhouse <dwmw2@infradead.org>
Cc: Black_David@emc.com, martin.petersen@oracle.com,
chris.mason@oracle.com, jens.axboe@oracle.com,
James.Bottomley@hansenpartnership.com, rwheeler@redhat.com,
linux-scsi@vger.kernel.org, linux-fsdevel@vger.kernel.org,
coughlan@redhat.com, matthew@wil.cx
Subject: Re: Thin provisioning & arrays
Date: Tue, 11 Nov 2008 09:18:07 +1100
Message-ID: <20081110221807.GI2373@disturbed>
In-Reply-To: <1226311189.4367.30.camel@macbook.infradead.org>
On Mon, Nov 10, 2008 at 10:59:49AM +0100, David Woodhouse wrote:
> On Mon, 2008-11-10 at 19:31 +1100, Dave Chinner wrote:
> > On Sun, Nov 09, 2008 at 10:40:24PM -0500, Black_David@emc.com wrote:
> > > There will be a chunk size value available in a VPD page that can be
> > > used to determine minimum size/alignment. For openers, I see
> > > essentially no point in a 512-byte UNMAP, even though it's allowed
> > > by the standard -
> > > I suspect most arrays (and many SSDs) will ignore it, and ignoring
> > > it is definitely within the spirit of the proposed T10 standard (hint:
> > > I'm one of the people directly working on that proposal).
> >
> > I think this is the crux of the issue. IMO, it's not much of a standard
> > when the spirit of the standard is to allow everyone to implement
> > different, non-deterministic behaviour....
>
> I disagree. The discard request is a _hint_ from the upper layers, and
> the storage device can act on that hint as it sees fit. There's nothing
> wrong with that; it doesn't make it "not much of a standard".

If it's not reliable, then it is effectively useless from a
design perspective. The fact that it is being treated as a hint
means that everyone is going to require "defrag" tools to clean
up the mess when the array runs out of space.

Treating it as a reliable command (i.e. it succeeds or returns
an error) means that we can implement filesystems that can do
unmapping in such a way that when the array reports that it is out
of space we *know* that there is no free space that can be unmapped.
i.e. no need for a "defrag" tool.
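
As a rough illustration of that argument (a minimal sketch, not kernel or
SCSI code): if unmap either succeeds or returns an error, a filesystem can
keep a list of the ranges whose unmap failed and retry them. An empty list
then means nothing unmappable is left behind. Every name below (fs_model,
unmap_fn, dev_ok, dev_eio, MAX_PENDING) is hypothetical and exists only for
this example.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model for illustration only; not a kernel or SCSI API. */
#define MAX_PENDING 16

struct extent {
	uint64_t start;
	uint64_t len;
};

/* Assumed contract: returns 0 if the range was unmapped, else -errno. */
typedef int (*unmap_fn)(uint64_t start, uint64_t len);

struct fs_model {
	unmap_fn unmap;
	struct extent pending[MAX_PENDING];	/* freed but still mapped */
	size_t npending;
};

/* Stub device that always unmaps successfully. */
static int dev_ok(uint64_t start, uint64_t len)
{
	(void)start;
	(void)len;
	return 0;
}

/* Stub device that always reports an I/O error. */
static int dev_eio(uint64_t start, uint64_t len)
{
	(void)start;
	(void)len;
	return -EIO;
}

/*
 * Free an extent and try to unmap it. On failure the extent is recorded
 * for a later retry and the device error is returned; if the record
 * table is full, -ENOMEM is returned instead.
 */
static int fs_free_extent(struct fs_model *fs, uint64_t start, uint64_t len)
{
	int err = fs->unmap(start, len);

	if (err == 0)
		return 0;
	if (fs->npending == MAX_PENDING)
		return -ENOMEM;
	fs->pending[fs->npending++] = (struct extent){ start, len };
	return err;
}

/* Retry every recorded extent; returns how many are still pending. */
static size_t fs_retry_pending(struct fs_model *fs)
{
	size_t kept = 0;

	for (size_t i = 0; i < fs->npending; i++) {
		struct extent e = fs->pending[i];

		if (fs->unmap(e.start, e.len) != 0)
			fs->pending[kept++] = e;
	}
	fs->npending = kept;
	return kept;
}

/*
 * With succeed-or-error semantics, an empty pending list means this
 * filesystem holds no freed-but-still-mapped space, so an out-of-space
 * report from the array cannot be cured by a "defrag" pass.
 */
static bool fs_enospc_is_genuine(const struct fs_model *fs)
{
	return fs->npending == 0;
}
```

Under hint semantics the return value of unmap carries no information, so
the pending list cannot be trusted and fs_enospc_is_genuine() has nothing
to base its answer on.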

The defrag tool approach is a cop-out. It simply does not scale to
environments where you have hundreds of luns spread over hundreds of
machines, and each of them needs to be "defragged" individually to
find all the unmappable space in the array. It gets worse in the
virtualised space where you might have tens of virtual machines
using each lun.

This is why unmap as a hint is a fundamentally broken model from an
overall storage stack perspective, no matter how appealing it is to
array vendors....

> Storage devices are complex enough that they _already_ exhibit behaviour
> which is fairly much non-deterministic in a number of ways. Especially
> if we're talking about SSDs or large arrays, rather than just disks.
> A standard needs to be clear about what _is_ guaranteed, and what is
> _not_ guaranteed. If it is explicit that the storage device is permitted
> to ignore the discard hint, and some storage devices do so under some
> circumstances, then that is just fine.

Right, it's non-deterministic even within a single device. That
makes it impossible to implement something reliable because the
higher layers are not provided with any guarantee they can rely
on. A hint is useless from a design perspective - guarantees are
required for reliable operation and if we are not designing new
storage features with reliability as a primary concern then we
are wasting our time...

> > Unmapping can and should be made reliable so that we don't have to
> > waste effort trying to fix up mismatches that shouldn't have occurred
> > in the first place...
>
> Perhaps so. But remember, this can only really be considered a
> correctness issue on thin-provisioned arrays -- because they may run out
> of space sooner than they should. But that kind of failure mode is
> something that is explicitly accepted by those designing and using such
> thin-provisioned arrays. It's not as if we're introducing any _new_ kind
> of problem.

Very true. But this is not a justification for not providing a
reliable unmapping service. If anything it's justification for being
reliable; that when you finally run out of space, there really is no
more space available....

Defrag is not the answer here.

> So I think it's perfectly acceptable for the operating system to treat
> discard requests as a hint, with best-effort semantics. And any device
> which _really_ cares will need to make sure for _itself_ that it handles
> those hints reliably.

So how do you propose that a storage architect who is trying to
design a reliable thin provisioning storage stack finds out which
devices actually do reliable unmapping? Vendors are simply going
to say they support the unmap command, which currently means
anything from "ignore completely" to "always do the right thing".

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com