From: Dave Chinner <david@fromorbit.com>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Ric Wheeler <rwheeler@redhat.com>,
Jens Axboe <jens.axboe@oracle.com>,
David Woodhouse <dwmw2@infradead.org>,
linux-scsi@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Black_David@emc.com,
"Martin K. Petersen" <martin.petersen@oracle.com>,
Tom Coughlan <coughlan@redhat.com>,
Matthew Wilcox <matthew@wil.cx>
Subject: Re: thin provisioned LUN support
Date: Mon, 10 Nov 2008 11:33:29 +1100 [thread overview]
Message-ID: <20081110003329.GU4985@disturbed> (raw)
In-Reply-To: <1226273859.19841.31.camel@localhost.localdomain>
On Sun, Nov 09, 2008 at 05:37:39PM -0600, James Bottomley wrote:
> On Mon, 2008-11-10 at 10:08 +1100, Dave Chinner wrote:
> > On Fri, Nov 07, 2008 at 09:20:30AM -0600, James Bottomley wrote:
> > > On Fri, 2008-11-07 at 07:14 -0500, Ric Wheeler wrote:
> > > > Jens Axboe wrote:
> > > > I think that discard merging would be helpful (especially for devices
> > > > with more reasonable sized unmap chunks).
> > >
> > > One of the ways the unmap command is set up is with a disjoint
> > > scatterlist, so we can send a large number of unmaps together. Whether
> > > they're merged or not really doesn't matter.
> > >
> > > The probable way a discard system would work if we wanted to endure the
> > > complexity would be to have the discard system in the underlying device
> > > driver (or possibly just above it in block, but different devices like
> > > SCSI or ATA have different discard characteristics). It would just
> > > accumulate block discard requests as ranges (and it would have to poke
> > > holes in the ranges as it sees read/write requests) which it flushes
> > > periodically.
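In outline, the accumulator being described here would do something
like the following - a toy userspace model for illustration only, not
kernel code, and every name in it is made up:

```c
#include <assert.h>

/*
 * Toy userspace model of the proposed discard accumulator.  Pending
 * discards are kept as disjoint [start, end) sector ranges; an
 * incoming write "pokes a hole" in any overlapping range so live
 * data is never discarded.  All names are hypothetical.
 */

#define MAX_RANGES	64

struct drange {
	unsigned long long start, end;	/* [start, end) */
};

static struct drange pending[MAX_RANGES];
static int npending;

/* Queue a discard; a real implementation would merge adjacent
 * ranges before flushing them out as one unmap scatterlist. */
static void discard_queue(unsigned long long start, unsigned long long end)
{
	assert(npending < MAX_RANGES);
	pending[npending].start = start;
	pending[npending].end = end;
	npending++;
}

/* A write to [start, end) removes that span from every pending
 * discard, possibly splitting a range into two. */
static void discard_poke_hole(unsigned long long start, unsigned long long end)
{
	int i;

	for (i = 0; i < npending; i++) {
		struct drange *r = &pending[i];

		if (end <= r->start || start >= r->end)
			continue;		/* no overlap */
		if (start > r->start && end < r->end) {
			/* write lands mid-range: split it in two */
			discard_queue(end, r->end);
			r->end = start;
		} else if (start <= r->start && end >= r->end) {
			/* write covers the whole range: drop it */
			*r = pending[--npending];
			i--;
		} else if (start <= r->start) {
			r->start = end;		/* trim the front */
		} else {
			r->end = start;		/* trim the tail */
		}
	}
}
```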
> >
> > It appears to me that discard requests are only being considered
> > here at a block and device level, and nobody is thinking about
> > the system level effects of such aggregation of discard requests.
> >
> > What happens on a system crash? We lose all the pending discard
> > requests, never to be sent again?
>
> Yes ... since this is for thin provisioning. Discard is best guess ...
> it doesn't affect integrity if we lose one and from the point of view of
> the array, 99% transmitted is far better than we do today. All that
> happens for a lost discard is that the array keeps a block that the
> filesystem isn't currently using. However, the chances are that it will
> get reused, so it shares a good probability of getting discarded again.
Ok. Given that a single extent free in XFS could span up to 2^37
bytes, is it considered acceptable to lose the discard request issued
by that transaction? I don't think it is....
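To put a number on that: the 2^37 figure comes from XFS's 2^21-block
extent length limit with 64 KiB blocks. Assuming a hypothetical
device cap of 1 GiB per unmap, one lost extent free silently drops
over a hundred device-level unmaps:

```c
#include <assert.h>

/* Back-of-envelope sketch; the per-unmap cap is a made-up device
 * limit used purely for illustration. */
static unsigned long long unmaps_lost(unsigned long long extent_bytes,
				      unsigned long long max_unmap_bytes)
{
	/* round up: a partial trailing chunk still needs its own unmap */
	return (extent_bytes + max_unmap_bytes - 1) / max_unmap_bytes;
}
```

unmaps_lost(1ULL << 37, 1ULL << 30) works out to 128.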
> > If so, how do we tell the device
> > that certain ranges have actually been discarded after the crash?
> > Are you expecting them to get replayed by a filesystem during
> > recovery? What if it was a userspace discard from something like
> > mkfs that was lost? How does this interact with sync or other
> > such user level filesystem synchronisation primitives? Does
> > sync_blockdev() flush out pending discard requests? Should fsync?
>
> No .. the syncs are all integrity based. Discard is simple opportunity
> based.
Given that discard requests modify the stable storage associated
with the filesystem, then shouldn't an integrity synchronisation
issue and complete all pending requests to the underlying storage
device?
If not, how do we guarantee them to all be flushed on remount-ro
or unmount-before-hot-unplug type of events?
> > And if the filesystem has to wait for discard requests to complete
> > to guarantee that they are done or can be recovered and replayed
> > after a crash, most filesystems are going to need modification. e.g.
> > XFS would need to prevent the tail of the log moving forward until
> > the discard request associated with a given extent free transaction
> > has been completed. That means we need to be able to specifically
> > flush queued discard requests and we'd need I/O completions to
> > run when they are done to do the filesystem level cleanup work....
>
> OK, I really don't follow the logic here. Discards have no effect on
> data integrity ... unless you're confusing them with secure deletion?
Not at all. I'm considering what is needed to allow the filesystem's
discard requests to be replayed during recovery. i.e. what is needed
to allow a filesystem to handle discard requests for thin
provisioning robustly.
If discard requests are not guaranteed to be issued to the storage
on a crash, then it is up to the filesystem to ensure that it
happens during recovery. That requires discard requests to behave
just like all other types of I/O and definitely requires a mechanism
to flush and wait for all discard requests to complete....
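The sort of interface the filesystem needs is small - in outline
(again a userspace model, with hypothetical names):

```c
#include <assert.h>

/*
 * Sketch of what a filesystem would need from the block layer to
 * track discards robustly (hypothetical names, userspace model):
 * count discards in flight, run a completion callback so the fs can
 * do its cleanup (e.g. let the log tail move past the extent-free
 * transaction), and provide a flush-and-wait primitive that sync
 * paths can call.
 */

static int discards_pending;	/* issued but not yet completed */
static int discards_completed;	/* completions seen by the fs */

static void discard_submit(void)
{
	discards_pending++;
}

/* I/O completion: where the filesystem-level cleanup work runs */
static void discard_complete(void)
{
	discards_pending--;
	discards_completed++;
}

/* What sync_blockdev()/fsync would need to call: drain the queue.
 * Completing inline here stands in for waiting on real I/O. */
static void discard_flush_wait(void)
{
	while (discards_pending > 0)
		discard_complete();
}
```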
> A
> discard merely tells the array that it doesn't need to back this block
> with an actual storage location anymore (until the next write for that
> region comes down).
Right. But really, it's the filesystem that is saying this, not the
block layer, so if the filesystem wants to be robust, then the block
layer can't queue these forever - they have to be issued in a timely
fashion so the filesystem can keep track of which discards have
completed or not....
> The ordering worry can be coped with in the same way we do barriers ...
> it's even safer for discards because if we know the block is going to be
> rewritten, we simply discard the discard.
Ordering is determined by the filesystem - barriers are just a
mechanism the filesystem uses to guarantee I/O ordering. If the
filesystem is tracking discard completion status, then it won't
be issuing I/O over the top of that region as the free transaction
won't be complete until the discard is done....
> > Let's keep the OS level interactions simple - if the array vendors
> > want to keep long queues of requests around before acting on them
> > to aggregate them, then that is an optimisation for them to
> > implement. They already do this with small data writes to NVRAM, so I
> > don't see how this should be treated any differently...
>
> Well, that's Chris' argument, and it has merit. I'm coming from the
> point of view that discards are actually a fundamentally different
> entity from anything else we process.
From a filesystem perspective, they are no different to any other
metadata I/O. They need to be tracked to allow robust crash recovery
semantics to be implemented in the filesystem.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com