* Proposal to improve filesystem/block snapshot interaction
  [not found] <20070927063113.GD2989@sgi.com>
@ 2007-10-30 1:04 ` Greg Banks
  2007-10-30 1:11 ` Greg Banks
  ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Greg Banks @ 2007-10-30 1:04 UTC (permalink / raw)
To: Linux Filesystem Mailing List
Cc: David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann

G'day,

A number of people have already seen this; I'm posting for wider comment and to move some interesting discussion to a public list. I'll apologise in advance for the talk about SGI technologies (including proprietary ones), but all the problems mentioned apply to in-tree technologies too.

This proposal seeks to solve three problems in our NAS server product due to the interaction of the filesystem (XFS) and the block-based snapshot feature (XVM snapshot). It's based on discussions held with various people over the last few weeks, including Roger Strassburg, Christoph Hellwig, David Chinner, and Donald Douwsma.

a) The first problem is the server's behaviour when a filesystem which is subject to snapshot is written to, and the snapshot repository runs out of room. The failure mode can be quite severe. XFS issues a metadata write to the block device, triggering a Copy-On-Write operation in the XVM snapshot element, which fails with EIO because the repository is full. When XFS sees the failure it shuts down the filesystem. All subsequent attempts to perform IO to the filesystem block indefinitely. In particular any NFS server thread will block and never reply to the NFS client. The NFS client will retry, causing another NFS server thread to block, and repeat until every NFS server thread is blocked. At this point all NFS service for all filesystems ceases. See PV 958220 and PV 958140 for a description of this problem and some of the approaches which have been discussed for resolving it.

b) The second problem is that certain common combinations of filesystem operations can waste large amounts of space in the XVM snapshot repository. Examples include writing the same file twice with dd, or writing a new file and deleting it. The cause is the inability of the XVM snapshot code to free regions in the snapshot repository that are no longer in use by the filesystem; this information is simply not available within the block layer.

Note that problem b) also contributes to problem a) by increasing repository usage and thus making it easier to encounter an out-of-space condition on the repository.

c) The third problem is an unfortunate interaction between an XFS internal log and block snapshots. The log is a fixed region of the block device which is written as a side effect of a great many different filesystem operations. The information written there has no value and is not even read until and unless log recovery needs to be performed after the server has crashed. This means the log does not need to be preserved by the block snapshot feature (because at the point in time when the snapshot is taken, log recovery must have already happened). In fact the correct procedure when mounting a read-only snapshot is to use the "norecovery" option to prevent any attempt to read the log (although the NAS server software actually doesn't do this). However, because the block device layer doesn't have enough information to know any better, the first pass of writes to the log is subjected to Copy-On-Write. This has two undesirable effects.
Firstly, it increases the amount of snapshot repository space used by each snapshot, thus contributing to problem a). Secondly, it puts a significant performance penalty on filesystem metadata operations for some time after each snapshot is taken; given that the NAS server can be configured to take regular frequent snapshots, this may mean all of the time.

An obvious solution is to use an external XFS log, but this is quite inconvenient for the NAS server software to arrange. For one thing, we would need to construct a separate external log device for the main filesystem and one for each mounted snapshot.

Note that these problems are not specific to XVM but will be encountered by any Linux block-COWing snapshot implementation. For example, the DM snapshot implementation is documented to suffer from problem a). From linux/Documentation/device-mapper/snapshot.txt:

> <COW device> will often be smaller than the origin and if it
> fills up the snapshot will become useless and be disabled,
> returning errors. So it is important to monitor the amount of
> free space and expand the <COW device> before it fills up.

During discussions, it became clear that we could solve all three of these problems by improving the block device interface to allow a filesystem to provide the block device with dynamic block usage hints.

For example, when unlinking a file the filesystem could tell the block device a hint of the form "I'm about to stop using these blocks". Most block devices would silently ignore these hints, but a snapshot COW implementation (the "copy-on-write" XVM element or the "snapshot-origin" dm target) could use them to help avoid these problems. For example, the response to the "I'm about to stop using these blocks" hint could be to free the space used in the snapshot repository for unnecessary copies of those blocks.

Of course snapshot cow elements may be part of more generic element trees. In general there may be more than one consumer of block usage hints in a given filesystem's element tree, and their locations in that tree are not predictable. This means the block extents mentioned in the usage hints need to be subject to the block mapping algorithms provided by the element tree. As those algorithms are currently implemented using bio mapping and splitting, the easiest way to reuse them is to add new bio flags.

First we need a mechanism to indicate that a bio is a hint rather than a real IO. Perhaps the easiest way is to add a new flag to the bi_rw field:

#define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */

We'll also need a field to tell us which kind of hint the bio represents. Perhaps a new field could be added, or perhaps the top 16 bits of bi_rw (currently used to encode the bio's priority, which has no meaning for hint bios) could be reused. The latter approach may allow hints to be used without modifying the bio structure or any code that uses it, other than the filesystem and the snapshot implementation. Such a property would have obvious advantages for our NAS server software, where XFS and XVM modules are provided but the other users of struct bio are stock SLES code.

Next we'll need three bio hint types with the following semantics.

BIO_HINT_ALLOCATE
    The bio's block extent will soon be written by the filesystem
    and any COW that may be necessary to achieve that should begin
    now. If the COW is going to fail, the bio should fail. Note
    that this provides a way for the filesystem to manage when and
    how failures to COW are reported.
BIO_HINT_RELEASE
    The bio's block extent is no longer in use by the filesystem
    and will not be read in the future. Any storage used to back
    the extent may be released without any threat to filesystem
    or data integrity.

BIO_HINT_DONTCOW
    (the Bart Simpson BIO). The bio's block extent is not needed
    in mounted snapshots and does not need to be subjected to COW.

Here's how these proposed hints help solve the abovementioned problems.

Problem a): The filesystem gives the BIO_HINT_ALLOCATE hint to the block device when preparing to write to blocks and when allocating blocks. The snapshot implementation checks whether COW is necessary, and if so performs it immediately. If the COW fails due to a lack of space in the snapshot repository, the bio fails. This can be caught in the filesystem and reported to userspace (or the NFS server) as ENOSPC via the existing mechanisms. Filesystem shutdown is no longer necessary.

Problem b) is solved by the filesystem giving the BIO_HINT_RELEASE hint to the block device every time it unmaps blocks in xfs_bunmapi. The snapshot implementation can then free unnecessary copies of those blocks.

Problem c) is solved by the filesystem giving the block device a BIO_HINT_DONTCOW hint describing the block extent of the internal log, at filesystem mount time. The snapshot implementation marks that extent, and subsequent writes to those blocks do not cause COWs but proceed directly to the origin filesystem.

Comments?

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere. Which MPHG character are you?
I don't speak for SGI.

^ permalink raw reply [flat|nested] 20+ messages in thread
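The bi_rw encoding proposed above (a new flag bit, with the hint type carried in the top 16 priority bits) can be sketched in a few lines. This is a hypothetical user-space model of the proposal, not existing kernel code; every helper name other than BIO_RW_HINT and the three hint types is invented for illustration.

/* Sketch only: models the proposed bi_rw packing, not real bio code. */
#include <stdio.h>

#define BIO_RW_HINT        5      /* bio is a hint, not a real IO; no pages */

/* Hypothetical hint types carried in the top 16 bits of bi_rw
 * (the bio priority field, which is meaningless for hint bios). */
#define BIO_HINT_SHIFT     16
#define BIO_HINT_ALLOCATE  1
#define BIO_HINT_RELEASE   2
#define BIO_HINT_DONTCOW   3

static unsigned long bio_make_hint_rw(unsigned int hint_type)
{
        return (1UL << BIO_RW_HINT) | ((unsigned long)hint_type << BIO_HINT_SHIFT);
}

static int bio_rw_is_hint(unsigned long rw)
{
        return (rw >> BIO_RW_HINT) & 1;
}

static unsigned int bio_rw_hint_type(unsigned long rw)
{
        return (unsigned int)(rw >> BIO_HINT_SHIFT) & 0xffff;
}

int main(void)
{
        unsigned long rw = bio_make_hint_rw(BIO_HINT_RELEASE);

        printf("bi_rw = %#lx, is_hint = %d, type = %u\n",
               rw, bio_rw_is_hint(rw), bio_rw_hint_type(rw));
        return 0;
}

A hint bio built this way would carry a sector and length like any other bio, but no pages, so the existing mapping and splitting code in the element tree could route it without change.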
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 1:04 ` Proposal to improve filesystem/block snapshot interaction Greg Banks @ 2007-10-30 1:11 ` Greg Banks 2007-10-30 4:16 ` Neil Brown 2007-10-30 9:35 ` Dongjun Shin 2 siblings, 0 replies; 20+ messages in thread From: Greg Banks @ 2007-10-30 1:11 UTC (permalink / raw) To: Linux Filesystem Mailing List Cc: David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann On Tue, Oct 30, 2007 at 12:51:47AM +0100, Arnd Bergmann wrote: > On Monday 29 October 2007, Christoph Hellwig wrote: > > ----- Forwarded message from Greg Banks <gnb@sgi.com> ----- > > > > Date: Thu, 27 Sep 2007 16:31:13 +1000 > > From: Greg Banks <gnb@sgi.com> > > Subject: Proposal to improve filesystem/block snapshot interaction > > To: David Chinner <dgc@melbourne.sgi.com>, Donald Douwsma <donaldd@sgi.com>, > > Christoph Hellwig <hch@infradead.org>, Roger Strassburg <rls@sgi.com> > > Cc: Mark Goodwin <markgw@sgi.com>, > > Brett Jon Grandbois <brettg@melbourne.sgi.com> > > > > > > > > This proposal seeks to solve three problems in our NAS server product > > due to the interaction of the filesystem (XFS) and the block-based > > snapshot feature (XVM snapshot). It's based on discussions held with > > various people over the last few weeks, including Roger Strassburg, > > Christoph Hellwig, David Chinner, and Donald Douwsma. > > Hi Greg, > > Christoph forwarded me your mail, because I mentioned to him that > I'm trying to come up with a similar change, and it might make sense > to combine our efforts. Excellent, thanks Christoph ;-) > > > For example, when unlinking a file the filesystem could tell the > > block device a hint of the form "I'm about to stop using these > > blocks". Most block devices would silently ignore these hints, but > > a snapshot COW implementation (the "copy-on-write" XVM element or > > the "snapshot-origin" dm target) could use them to help avoid these > > problems. For example, the response to the "I'm about to stop using > > these blocks" hint could be to free the space used in the snapshot > > repository for unnecessary copies of those blocks. > > The case I'm interested in is the more specific case of 'erase', > which is more of a performance optimization than a space optimization. > When you have a flash medium, it's useful to erase a block as soon > as it's becoming unused, so that a subsequent write will be faster. > Moreover, on an MTD medium, you may not even be able to write to > a block unless it has been erased before. Spending the device's time to erase early, when the CPU isn't waiting for it, instead of later, when it adds to effective write latency. Makes sense. > > Of course snapshot cow elements may be part of more generic element > > trees. In general there may be more than one consumer of block usage > > hints in a given filesystem's element tree, and their locations in that > > tree are not predictable. This means the block extents mentioned in > > the usage hints need to be subject to the block mapping algorithms > > provided by the element tree. As those algorithms are currently > > implemented using bio mapping and splitting, the easiest and simplest > > way to reuse those algorithms is to add new bio flags. > > > > First we need a mechanism to indicate that a bio is a hint rather > > than a real IO. 
Perhaps the easiest way is to add a new flag to > > the bi_rw field: > > > > #define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */ > > My first thought was to do this on the request layer, not already > on bio, but they can easily be combined, I guess. My first thoughts were along similar lines, but I wasn't expecting these hint bios to survive deep enough in the stack to need queuing and thus visibility in struct request; I was expecting their lifetime to be some passage and splitting through a volume manager and then conversion to synchronous metadata operations. Plus, hijacking bios means not having to modify every single DM target to duplicate it's block mapping algorithm. Basically, I was thinking of loopback-like block mapping and not considering flash. I suppose for flash where there's a real erase operation, you'd want to be queuing and that means a new request type. > > > We'll also need a field to tell us which kind of hint the bio > > represents. Perhaps a new field could be added, or perhaps the top > > 16 bits of bi_rw (currently used to encode the bio's priority, which > > has no meaning for hint bios) could be reused. The latter approach > > may allow hints to be used without modifying the bio structure or > > any code that uses it other than the filesystem and the snapshot > > implementation. Such a property would have obvious advantages for > > our NAS server software, where XFS and XVM modules are provided but > > the other users of struct bio are stock SLES code. > > > > > > Next we'll need three bio hints types with the following semantics. > > > > BIO_HINT_ALLOCATE > > The bio's block extent will soon be written by the filesystem > > and any COW that may be necessary to achieve that should begin > > now. If the COW is going to fail, the bio should fail. Note > > that this provides a way for the filesystem to manage when and > > how failures to COW are reported. > > > > BIO_HINT_RELEASE > > The bio's block extent is no longer in use by the filesystem > > and will not be read in the future. Any storage used to back > > the extent may be released without any threat to filesystem > > or data integrity. > > > > BIO_HINT_DONTCOW > > (the Bart Simpson BIO). The bio's block extent is not needed > > in mounted snapshots and does not need to be subjected to COW. > > > > My code currently needs four flags, which don't match yours too much: > > /* > * A number of different actions could be triggered by an erase request, > * depending on the underlying device. Each device specifies its > * capabilities with these flags, while a request specifies the options > * that are acceptable. If the logical AND from these two does not > * have any bits set, the request will result in > * an error. > */ > enum { > /* > * Device may choose to ignore the request, subsequent writes > * may return the original data. This is meant to work on Is this supposed to be "reads" ? > * any block device. When combined with other flags, the driver > * should only perform an actual erase if it makes sense > * from a performance perspective, e.g. speeding up subsequent > * writes. > */ > LB_ERASE_IGNORE = 0x01, > /* > * A subsequent read may return zero data for the erase, > * like on some high-level abstractions for flash memory, > * or a virtual device. > */ > LB_ERASE_ALL_ZERO = 0x02, > /* > * A subsequent read may return a block filled with 0xff, > * which is the typical behavior on raw NAND flash. 
> */ > LB_ERASE_ALL_ONE = 0x04, > /* > * The device may reject a read request for an erased block > * until the block has been written again. This is typical > * for NAND flash with builtin ECC checks, or for optical > * drives. > */ > LB_ERASE_NUKE = 0x08, > /* > * Used by file systems that know that data is no longer > * in use and want to optimize the next write operations. > */ > LB_ERASE_DISCARD = LB_ERASE_IGNORE | LB_ERASE_ALL_ZERO | > LB_ERASE_ALL_ONE | LB_ERASE_NUKE, > /* > * Used when we want the data to be invalidated and make sure > * it is no longer accessible. > */ > LB_ERASE_DESTROY = LB_ERASE_ALL_ZERO | LB_ERASE_ALL_ONE | > LB_ERASE_NUKE, > }; > > I guess BIO_HINT_RELEASE would match LB_ERASE_DISCARD best, Yep. Actually, I'm curious why you'd want to expose, outside the block driver, the semantics of reading a block which has been earlier explicitly discarded. Surely it's an error for a filesystem to do that? How does it help a filesystem to know in advance which error case that will trigger. > and perhaps > there should be some bio flag with LB_ERASE_DESTROY semantics, although > that doesn't really qualify as a hint any more. Yes, that's more of a command ;-) > My release command would be REQ_TYPE_LINUX_BLOCK/REQ_LB_OP_ERASE. Were > you thinking of adding REQ_LB_* operations as well, or just encoding > the hint in a REQ_TYPE_FS request? I wasn't expecting a request to be created for the hint bio at all. > Shall we move the discussion to a public mailing list? Feel free to > forward my mail anywhere you like. Done! Greg. -- Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI. ^ permalink raw reply [flat|nested] 20+ messages in thread
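The "logical AND" rule in the LB_ERASE_* flags quoted above amounts to a small capability negotiation. Here is a sketch under that assumption; only the LB_ERASE_* values come from the mail, while the function and its caller are hypothetical.

/* Sketch of the capability check implied by the LB_ERASE_* flags above. */
#include <errno.h>
#include <stdio.h>

enum {
        LB_ERASE_IGNORE   = 0x01,   /* device may ignore the request */
        LB_ERASE_ALL_ZERO = 0x02,   /* subsequent reads may return zeros */
        LB_ERASE_ALL_ONE  = 0x04,   /* subsequent reads may return 0xff */
        LB_ERASE_NUKE     = 0x08,   /* reads may fail until rewritten */

        LB_ERASE_DISCARD  = LB_ERASE_IGNORE | LB_ERASE_ALL_ZERO |
                            LB_ERASE_ALL_ONE | LB_ERASE_NUKE,
        LB_ERASE_DESTROY  = LB_ERASE_ALL_ZERO | LB_ERASE_ALL_ONE |
                            LB_ERASE_NUKE,
};

/* Return 0 and the behaviour actually used, or -EOPNOTSUPP if the
 * device offers no behaviour the caller will accept. */
static int lb_erase_negotiate(unsigned device_caps, unsigned acceptable,
                              unsigned *chosen)
{
        unsigned common = device_caps & acceptable;

        if (!common)
                return -EOPNOTSUPP;
        *chosen = common & ~(common - 1);       /* pick the lowest set bit */
        return 0;
}

int main(void)
{
        unsigned chosen;
        /* e.g. a raw NAND device that can only nuke or return 0xff */
        unsigned caps = LB_ERASE_ALL_ONE | LB_ERASE_NUKE;

        if (lb_erase_negotiate(caps, LB_ERASE_DISCARD, &chosen) == 0)
                printf("discard request satisfied with behaviour %#x\n", chosen);
        if (lb_erase_negotiate(caps, LB_ERASE_ALL_ZERO, &chosen) != 0)
                printf("zero-fill erase not supported by this device\n");
        return 0;
}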
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 1:04 ` Proposal to improve filesystem/block snapshot interaction Greg Banks 2007-10-30 1:11 ` Greg Banks @ 2007-10-30 4:16 ` Neil Brown 2007-10-30 5:12 ` Greg Banks 2007-10-30 23:56 ` David Chinner 2007-10-30 9:35 ` Dongjun Shin 2 siblings, 2 replies; 20+ messages in thread From: Neil Brown @ 2007-10-30 4:16 UTC (permalink / raw) To: Greg Banks Cc: Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann On Tuesday October 30, gnb@sgi.com wrote: > > Of course snapshot cow elements may be part of more generic element > trees. In general there may be more than one consumer of block usage > hints in a given filesystem's element tree, and their locations in that > tree are not predictable. This means the block extents mentioned in > the usage hints need to be subject to the block mapping algorithms > provided by the element tree. As those algorithms are currently > implemented using bio mapping and splitting, the easiest and simplest > way to reuse those algorithms is to add new bio flags. So are you imagining that you might have a distinct snapshotable elements, and that some of these might be combined by e.g. RAID0 into a larger device, then a filesystem is created on that? I ask because my first thought was that the sort of communication you want seems like it would be just between a filesystem and the block device that it talks directly to, and as you are particularly interested in XFS and XVM, should could come up with whatever protocol you want for those two to talk to either other, prototype it, iron out all the issues, then say "We've got this really cool thing to make snapshots much faster - wanna share?" and thus be presenting from a position of more strength (the old 'code talks' mantra). > > First we need a mechanism to indicate that a bio is a hint rather > than a real IO. Perhaps the easiest way is to add a new flag to > the bi_rw field: > > #define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */ Reminds me of the new approach to issue_flush_fn which is just to have a zero-length barrier bio (is that implemented yet? I lost track). But different as a zero length barrier has zero length, and your hints have a very meaningful length. > > Next we'll need three bio hints types with the following semantics. > > BIO_HINT_ALLOCATE > The bio's block extent will soon be written by the filesystem > and any COW that may be necessary to achieve that should begin > now. If the COW is going to fail, the bio should fail. Note > that this provides a way for the filesystem to manage when and > how failures to COW are reported. Would it make sense to allow the bi_sector to be changed by the device and to have that change honoured. i.e. "Please allocate 128 blocks, maybe 'here'" "OK, 128 blocks allocated, but they are actually over 'there'". If the device is tracking what space is and isn't used, it might make life easier for it to do the allocation. Maybe even have a variant "Allocate 128 blocks, I don't care where". Is this bio supposed to block until the copy has happened? Or only until the space of the copy has been allocated and possibly committed? Or must it return without doing any IO at all? > > BIO_HINT_RELEASE > The bio's block extent is no longer in use by the filesystem > and will not be read in the future. Any storage used to back > the extent may be released without any threat to filesystem > or data integrity. 
If the allocation unit of the storage device (e.g. a few MB) does not match the allocation unit of the filesystem (e.g. a few KB) then for this to be useful either the storage device must start recording tiny allocations, or the filesystem should re-release areas as they grow. i.e. when releasing a range of a device, look in the filesystem's usage records for the largest surrounding free space, and release all of that. Would this be a burden on the filesystems? Is my imagined disparity between block sizes valid? Would it be just as easy for the storage device to track small allocation/deallocations? > > BIO_HINT_DONTCOW > (the Bart Simpson BIO). The bio's block extent is not needed > in mounted snapshots and does not need to be subjected to COW. This seems like a much more domain-specific function that the other two which themselves could be more generally useful (I'm imagining using hints from them to e.g. accelerate RAID reconstruction). Surely the "correct" thing to do with the log is to put it on a separate device which itself isn't snapshotted. If you have a storage manager that is smart enough to handle these sorts of things, maybe the functionality you want is "Give me a subordinate device which is not snapshotted, size X", then journal to that virtual device. I guess that is equally domain specific, but the difference is that if you try to read from the DONTCOW part of the snapshot, you get bad old data, where as if you try to access the subordinate device of a snapshot, you get an IO error - which is probably safer. > > Comments? On the whole it seems reasonably sane .... providing you are from the school which believes that volume managers and filesystems should be kept separate :-) NeilBrown ^ permalink raw reply [flat|nested] 20+ messages in thread
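Neil's aside about accelerating RAID reconstruction could work roughly as below: if allocate/release hints maintain a coarse in-use map, resync only needs to copy chunks whose contents the filesystem still cares about. This is an invented illustration, not md code; the bitmap and chunk layout are made up.

/* Hypothetical sketch: skip released chunks during RAID resync. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CHUNKS 16

struct array_state {
        bool in_use[CHUNKS];    /* set by ALLOCATE hints, cleared by RELEASE hints */
};

static unsigned resync(const struct array_state *a)
{
        unsigned chunk, synced = 0;

        for (chunk = 0; chunk < CHUNKS; chunk++) {
                if (!a->in_use[chunk])
                        continue;       /* contents are meaningless, skip the copy */
                /* ...read the chunk from the good mirrors and rewrite the target... */
                synced++;
        }
        return synced;
}

int main(void)
{
        struct array_state a;

        memset(&a, 0, sizeof(a));
        a.in_use[0] = a.in_use[3] = a.in_use[4] = true;  /* only 3 chunks ever hinted as allocated */
        printf("resynced %u of %u chunks\n", resync(&a), CHUNKS);
        return 0;
}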
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 4:16 ` Neil Brown @ 2007-10-30 5:12 ` Greg Banks 2007-10-30 7:43 ` Arnd Bergmann 2007-11-20 23:43 ` Roger Strassburg 2007-10-30 23:56 ` David Chinner 1 sibling, 2 replies; 20+ messages in thread From: Greg Banks @ 2007-10-30 5:12 UTC (permalink / raw) To: Neil Brown Cc: Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: > On Tuesday October 30, gnb@sgi.com wrote: > > > > Of course snapshot cow elements may be part of more generic element > > trees. In general there may be more than one consumer of block usage > > hints in a given filesystem's element tree, and their locations in that > > tree are not predictable. This means the block extents mentioned in > > the usage hints need to be subject to the block mapping algorithms > > provided by the element tree. As those algorithms are currently > > implemented using bio mapping and splitting, the easiest and simplest > > way to reuse those algorithms is to add new bio flags. > > So are you imagining that you might have a distinct snapshotable > elements, and that some of these might be combined by e.g. RAID0 into > a larger device, then a filesystem is created on that? I was thinking more a concatenation than a stripe, but yes you could do such a thing, e.g. to parallelise the COW procedure. We don't do any such thing in our product; the COW element is always inserted at the top of the logical element tree. > I ask because my first thought was that the sort of communication you > want seems like it would be just between a filesystem and the block > device that it talks directly to, and as you are particularly > interested in XFS and XVM, should could come up with whatever protocol > you want for those two to talk to either other, prototype it, iron out > all the issues, then say "We've got this really cool thing to make > snapshots much faster - wanna share?" and thus be presenting from a > position of more strength (the old 'code talks' mantra). Indeed, code talks ;-) I was hoping someone else would do that talking for me, though. > > First we need a mechanism to indicate that a bio is a hint rather > > than a real IO. Perhaps the easiest way is to add a new flag to > > the bi_rw field: > > > > #define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */ > > Reminds me of the new approach to issue_flush_fn which is just to have > a zero-length barrier bio (is that implemented yet? I lost track). > But different as a zero length barrier has zero length, and your hints > have a very meaningful length. Yes. > > > > Next we'll need three bio hints types with the following semantics. > > > > BIO_HINT_ALLOCATE > > The bio's block extent will soon be written by the filesystem > > and any COW that may be necessary to achieve that should begin > > now. If the COW is going to fail, the bio should fail. Note > > that this provides a way for the filesystem to manage when and > > how failures to COW are reported. > > Would it make sense to allow the bi_sector to be changed by the device > and to have that change honoured. > i.e. "Please allocate 128 blocks, maybe 'here'" > "OK, 128 blocks allocated, but they are actually over 'there'". That wasn't the expectation at all. Perhaps "allocate" is a poor name. "I have just allocated, deal with it" might be more appropriate. Perhaps BIO_HINT_WILLUSE or something. 
> If the device is tracking what space is and isn't used, it might make > life easier for it to do the allocation. Maybe even have a variant > "Allocate 128 blocks, I don't care where". That kind of thing might perhaps be useful for flash, but I think current filesystems would have conniptions. > Is this bio supposed to block until the copy has happened? Or only > until the space of the copy has been allocated and possibly committed? The latter. The writes following will block until the COW has completed, or might be performed sufficiently later that the COW has meanwhile completed (I think this implies an extra state in the snapshot metadata to avoid double-COWing). The point of the hint is to allow the snapshot code to test for running out of repo space and report that failure at a time when the filesystem is able to handle it gracefully. > Or must it return without doing any IO at all? I would expect it would be a useful optimisation to start the IO but not wait for it's completion, but that the first implementation would just do a space check. > > > > BIO_HINT_RELEASE > > The bio's block extent is no longer in use by the filesystem > > and will not be read in the future. Any storage used to back > > the extent may be released without any threat to filesystem > > or data integrity. > > If the allocation unit of the storage device (e.g. a few MB) does not > match the allocation unit of the filesystem (e.g. a few KB) then for > this to be useful either the storage device must start recording tiny > allocations, or the filesystem should re-release areas as they grow. > i.e. when releasing a range of a device, look in the filesystem's usage > records for the largest surrounding free space, and release all of that. Good point. I was planning on ignoring this problem :-/ Given that current snapshot implementations waste *all* the blocks in deleted files, it would be an improvement to scavenge the blocks in large extents. This is especially true for XFS which goes to some effort to achieve large linear extents. > Would this be a burden on the filesystems? I think so. I would hope the hints could be done in a way which minimises the impact on filesystems, so that it would be easier to roll out. That implies pushing the responsibility for being smart about combining partial deallocations down to the block device/snapshot code. Any comments, Roger? > Is my imagined disparity between block sizes valid? Yep, at least for XFS and XVM. If the space was used in lots of little files, this rounding would probably eat a lot of the savings. > Would it be just as easy for the storage device to track small > allocation/deallocations? > > > > > BIO_HINT_DONTCOW > > (the Bart Simpson BIO). The bio's block extent is not needed > > in mounted snapshots and does not need to be subjected to COW. > > This seems like a much more domain-specific function that the other > two which themselves could be more generally useful Agreed, I can't offhand think of a use other than internal logs. > (I'm imagining > using hints from them to e.g. accelerate RAID reconstruction). Ah, interesting idea: delete a file to speed up RAID recovery ;-) > Surely the "correct" thing to do with the log is to put it on a separate > device which itself isn't snapshotted. Indeed. > If you have a storage manager that is smart enough to handle these > sorts of things, maybe the functionality you want is "Give me a > subordinate device which is not snapshotted, size X", then journal to > that virtual device. 
This is usually better, but is not always convenient for a number of reasons. For example, you might not have enough disks to build all of a base, a snapshot repo, and a log device. Also, the log really needs to be safe, so you want it mirrored or RAID5, and you want it fast, and you want it on separate spindles, so it needs several disks; but now you're using terabytes of disk space for 128 MiB of log. > I guess that is equally domain specific, but the difference is that if > you try to read from the DONTCOW part of the snapshot, you get bad > old data, where as if you try to access the subordinate device of a > snapshot, you get an IO error - which is probably safer. I believe (Dave or Roger will correct me here) that XFS needs a log when you mount, and you get to either provide an external one or use the internal one. So when you mount a snapshot of an XFS filesystem which was built with an external log, you need to provide a new external log device. So the storage manager needs to allocate an external log device for each snapshot it allows. > > > > > Comments? > > On the whole it seems reasonably sane .... providing you are from the > school which believes that volume managers and filesystems should be > kept separate :-) Yeah, I'm so old-school :-) Greg. -- Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI. ^ permalink raw reply [flat|nested] 20+ messages in thread
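The "just do a space check" behaviour described in the exchange above might look like the following on the snapshot side: the ALLOCATE hint reserves repository space up front and returns ENOSPC while the filesystem can still fail the operation cleanly. All structures and names here are hypothetical, not XVM or dm code.

/* Hypothetical COW-element handling of BIO_HINT_ALLOCATE: check (and
 * reserve) repository space before the real write arrives, so a full
 * repository surfaces as ENOSPC instead of a mid-write EIO. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct cow_repo {
        unsigned long free_chunks;      /* unused space left in the repository */
        unsigned long reserved_chunks;  /* reserved by earlier ALLOCATE hints */
};

/* Returns 0 if the extent can be COWed when the real write arrives,
 * -ENOSPC otherwise so the filesystem can report the failure gracefully. */
static int cow_hint_allocate(struct cow_repo *repo, unsigned long chunks_needed,
                             bool already_copied)
{
        if (already_copied)
                return 0;                       /* no COW needed, nothing to reserve */
        if (repo->free_chunks < chunks_needed)
                return -ENOSPC;
        repo->free_chunks -= chunks_needed;
        repo->reserved_chunks += chunks_needed; /* consumed when the COW actually runs */
        return 0;
}

int main(void)
{
        struct cow_repo repo = { .free_chunks = 2 };

        printf("hint for 1 chunk: %d\n", cow_hint_allocate(&repo, 1, false));
        printf("hint for 4 chunks: %d (expect -ENOSPC=%d)\n",
               cow_hint_allocate(&repo, 4, false), -ENOSPC);
        return 0;
}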
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 5:12 ` Greg Banks
@ 2007-10-30 7:43 ` Arnd Bergmann
  2007-11-20 23:43 ` Roger Strassburg
  1 sibling, 0 replies; 20+ messages in thread
From: Arnd Bergmann @ 2007-10-30 7:43 UTC (permalink / raw)
To: Greg Banks
Cc: Neil Brown, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois

On Tuesday 30 October 2007, Greg Banks wrote:
>
> > If the allocation unit of the storage device (e.g. a few MB) does not
> > match the allocation unit of the filesystem (e.g. a few KB) then for
> > this to be useful either the storage device must start recording tiny
> > allocations, or the filesystem should re-release areas as they grow.
> > i.e. when releasing a range of a device, look in the filesystem's usage
> > records for the largest surrounding free space, and release all of that.
>
> Good point. I was planning on ignoring this problem :-/ Given that
> current snapshot implementations waste *all* the blocks in deleted
> files, it would be an improvement to scavenge the blocks in large
> extents. This is especially true for XFS which goes to some effort
> to achieve large linear extents.
>
Ah, this is an important difference to my idea about an erase operation on the block device. For erase to be meaningful, you need to know the erase block size at the file system or user space, so it would be encoded in the struct block_device, and the user has to issue erase requests at erase block granularity.

	Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 20+ messages in thread
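The erase-block granularity Arnd mentions reduces to alignment arithmetic: only the part of a released extent that covers whole erase blocks can be erased immediately, and the ragged edges must be ignored (or remembered until their neighbours are released too). A minimal sketch with invented types, not an existing API:

/* Trim a released extent to whole erase blocks. */
#include <stdio.h>

typedef unsigned long long sector_t;

struct extent {
        sector_t start;
        sector_t len;
};

/* Returns the erasable sub-extent, which may have zero length. */
static struct extent align_to_erase_blocks(struct extent e, sector_t erase_block)
{
        sector_t first = (e.start + erase_block - 1) / erase_block * erase_block;
        sector_t end   = (e.start + e.len) / erase_block * erase_block;
        struct extent out = { first, end > first ? end - first : 0 };

        return out;
}

int main(void)
{
        struct extent small = { 700, 300 };   /* too small to cover a whole 512-sector erase block */
        struct extent big   = { 700, 2000 };
        struct extent a;

        a = align_to_erase_blocks(small, 512);
        printf("small release: erasable len=%llu\n", a.len);                      /* 0 */
        a = align_to_erase_blocks(big, 512);
        printf("big release:   erasable start=%llu len=%llu\n", a.start, a.len);  /* 1024, 1536 */
        return 0;
}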
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 5:12 ` Greg Banks 2007-10-30 7:43 ` Arnd Bergmann @ 2007-11-20 23:43 ` Roger Strassburg 1 sibling, 0 replies; 20+ messages in thread From: Roger Strassburg @ 2007-11-20 23:43 UTC (permalink / raw) To: Greg Banks Cc: Neil Brown, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann Greg, Sorry I didn't respond sooner - other things have gotten in the way of reading this thread. See comments below. Roger Greg Banks wrote: > On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: >> On Tuesday October 30, gnb@sgi.com wrote: >>> Of course snapshot cow elements may be part of more generic element >>> trees. In general there may be more than one consumer of block usage >>> hints in a given filesystem's element tree, and their locations in that >>> tree are not predictable. This means the block extents mentioned in >>> the usage hints need to be subject to the block mapping algorithms >>> provided by the element tree. As those algorithms are currently >>> implemented using bio mapping and splitting, the easiest and simplest >>> way to reuse those algorithms is to add new bio flags. >> So are you imagining that you might have a distinct snapshotable >> elements, and that some of these might be combined by e.g. RAID0 into >> a larger device, then a filesystem is created on that? > > I was thinking more a concatenation than a stripe, but yes you could > do such a thing, e.g. to parallelise the COW procedure. We don't do > any such thing in our product; the COW element is always inserted at > the top of the logical element tree. > >> I ask because my first thought was that the sort of communication you >> want seems like it would be just between a filesystem and the block >> device that it talks directly to, and as you are particularly >> interested in XFS and XVM, should could come up with whatever protocol >> you want for those two to talk to either other, prototype it, iron out >> all the issues, then say "We've got this really cool thing to make >> snapshots much faster - wanna share?" and thus be presenting from a >> position of more strength (the old 'code talks' mantra). > > Indeed, code talks ;-) I was hoping someone else would do that > talking for me, though. > >>> First we need a mechanism to indicate that a bio is a hint rather >>> than a real IO. Perhaps the easiest way is to add a new flag to >>> the bi_rw field: >>> >>> #define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */ >> Reminds me of the new approach to issue_flush_fn which is just to have >> a zero-length barrier bio (is that implemented yet? I lost track). >> But different as a zero length barrier has zero length, and your hints >> have a very meaningful length. > > Yes. > >>> Next we'll need three bio hints types with the following semantics. >>> >>> BIO_HINT_ALLOCATE >>> The bio's block extent will soon be written by the filesystem >>> and any COW that may be necessary to achieve that should begin >>> now. If the COW is going to fail, the bio should fail. Note >>> that this provides a way for the filesystem to manage when and >>> how failures to COW are reported. >> Would it make sense to allow the bi_sector to be changed by the device >> and to have that change honoured. >> i.e. "Please allocate 128 blocks, maybe 'here'" >> "OK, 128 blocks allocated, but they are actually over 'there'". > > That wasn't the expectation at all. 
Perhaps "allocate" is a poor > name. "I have just allocated, deal with it" might be more appropriate. > Perhaps BIO_HINT_WILLUSE or something. > >> If the device is tracking what space is and isn't used, it might make >> life easier for it to do the allocation. Maybe even have a variant >> "Allocate 128 blocks, I don't care where". > > That kind of thing might perhaps be useful for flash, but I think > current filesystems would have conniptions. > >> Is this bio supposed to block until the copy has happened? Or only >> until the space of the copy has been allocated and possibly committed? > > The latter. The writes following will block until the COW has > completed, or might be performed sufficiently later that the COW > has meanwhile completed (I think this implies an extra state in the > snapshot metadata to avoid double-COWing). The point of the hint is > to allow the snapshot code to test for running out of repo space and > report that failure at a time when the filesystem is able to handle > it gracefully. > >> Or must it return without doing any IO at all? > > I would expect it would be a useful optimisation to start the IO but > not wait for it's completion, but that the first implementation would > just do a space check. > >>> BIO_HINT_RELEASE >>> The bio's block extent is no longer in use by the filesystem >>> and will not be read in the future. Any storage used to back >>> the extent may be released without any threat to filesystem >>> or data integrity. >> If the allocation unit of the storage device (e.g. a few MB) does not >> match the allocation unit of the filesystem (e.g. a few KB) then for >> this to be useful either the storage device must start recording tiny >> allocations, or the filesystem should re-release areas as they grow. >> i.e. when releasing a range of a device, look in the filesystem's usage >> records for the largest surrounding free space, and release all of that. > > Good point. I was planning on ignoring this problem :-/ Given that > current snapshot implementations waste *all* the blocks in deleted > files, it would be an improvement to scavenge the blocks in large > extents. This is especially true for XFS which goes to some effort > to achieve large linear extents. > >> Would this be a burden on the filesystems? > > I think so. I would hope the hints could be done in a way which > minimises the impact on filesystems, so that it would be easier to roll > out. That implies pushing the responsibility for being smart about > combining partial deallocations down to the block device/snapshot code. > Any comments, Roger? I'm not sure how snapshot can really use a dealloc hint. Whatever you're deallocating is in the base, but you want it to stay in the snapshot, since the purpose of a snapshot is to keep track of what was there before. What makes more sense is to somehow pass a hint saying that the data being written is to space that wasn't allocated at the time the snapshot was created, but that would require the filesystem to have knowledge of the snapshot. This would prevent copying data that doesn't contain meaningful data in the first place. >> Is my imagined disparity between block sizes valid? > > Yep, at least for XFS and XVM. If the space was used in lots of > little files, this rounding would probably eat a lot of the savings. > >> Would it be just as easy for the storage device to track small >> allocation/deallocations? >> >>> BIO_HINT_DONTCOW >>> (the Bart Simpson BIO). 
The bio's block extent is not needed >>> in mounted snapshots and does not need to be subjected to COW. >> This seems like a much more domain-specific function that the other >> two which themselves could be more generally useful > > Agreed, I can't offhand think of a use other than internal logs. > >> (I'm imagining >> using hints from them to e.g. accelerate RAID reconstruction). > > Ah, interesting idea: delete a file to speed up RAID recovery ;-) > >> Surely the "correct" thing to do with the log is to put it on a separate >> device which itself isn't snapshotted. > > Indeed. > >> If you have a storage manager that is smart enough to handle these >> sorts of things, maybe the functionality you want is "Give me a >> subordinate device which is not snapshotted, size X", then journal to >> that virtual device. > > This is usually better, but is not always convenient for a number of > reasons. For example, you might not have enough disks to build all > of a base, a snapshot repo, and a log device. Also, the log really > needs to be safe, so you want it mirrored or RAID5, and you want it > fast, and you want it on separate spindles, so it needs several disks; > but now you're using terabytes of disk space for 128 MiB of log. The log doesn't need to be on a separate disk, just a separate logical volume. Also, you don't have to mirror the whole disk in order to mirror the log volume. Snapshots are done per logical volume, not per physical disk. >> I guess that is equally domain specific, but the difference is that if >> you try to read from the DONTCOW part of the snapshot, you get bad >> old data, where as if you try to access the subordinate device of a >> snapshot, you get an IO error - which is probably safer. > > I believe (Dave or Roger will correct me here) that XFS needs a log > when you mount, and you get to either provide an external one or use > the internal one. So when you mount a snapshot of an XFS filesystem > which was built with an external log, you need to provide a new > external log device. So the storage manager needs to allocate an > external log device for each snapshot it allows. That's correct. >>> Comments? >> On the whole it seems reasonably sane .... providing you are from the >> school which believes that volume managers and filesystems should be >> kept separate :-) > > Yeah, I'm so old-school :-) > > Greg. -- Roger Strassburg SGI Storage Systems Software +49-89-46108-142 ^ permalink raw reply [flat|nested] 20+ messages in thread
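Roger's alternative above, telling the snapshot that a write targets space that was free when the snapshot was created so there is nothing worth copying, could be modelled as below. Everything here is hypothetical, and as he notes it presumes the filesystem (or something else) can supply the allocation state as of snapshot time.

/* Hypothetical sketch: decide whether a write needs COW by consulting a
 * bitmap of blocks that were allocated when the snapshot was taken.
 * Blocks that were free at snapshot time hold no data worth preserving. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define ORIGIN_BLOCKS 64

struct snapshot {
        bool allocated_at_snap[ORIGIN_BLOCKS];  /* captured when the snapshot is taken */
        bool copied[ORIGIN_BLOCKS];             /* chunks already COWed */
};

static bool write_needs_cow(struct snapshot *s, unsigned block)
{
        if (!s->allocated_at_snap[block])
                return false;    /* was free at snapshot time: old contents are garbage */
        if (s->copied[block])
                return false;    /* already preserved */
        s->copied[block] = true; /* caller performs the copy before overwriting */
        return true;
}

int main(void)
{
        struct snapshot s;

        memset(&s, 0, sizeof(s));
        s.allocated_at_snap[10] = true;

        printf("write to block 10: COW=%d\n", write_needs_cow(&s, 10)); /* 1: must copy */
        printf("write to block 20: COW=%d\n", write_needs_cow(&s, 20)); /* 0: was free */
        return 0;
}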
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 4:16 ` Neil Brown
  2007-10-30 5:12 ` Greg Banks
@ 2007-10-30 23:56 ` David Chinner
  2007-10-31 4:01 ` Greg Banks
  1 sibling, 1 reply; 20+ messages in thread
From: David Chinner @ 2007-10-30 23:56 UTC (permalink / raw)
To: Neil Brown
Cc: Greg Banks, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann

On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
> On Tuesday October 30, gnb@sgi.com wrote:
> > BIO_HINT_RELEASE
> > The bio's block extent is no longer in use by the filesystem
> > and will not be read in the future. Any storage used to back
> > the extent may be released without any threat to filesystem
> > or data integrity.
>
> If the allocation unit of the storage device (e.g. a few MB) does not
> match the allocation unit of the filesystem (e.g. a few KB) then for
> this to be useful either the storage device must start recording tiny
> allocations, or the filesystem should re-release areas as they grow.
> i.e. when releasing a range of a device, look in the filesystem's usage
> records for the largest surrounding free space, and release all of that.

I figured that the easiest way around this is reporting free space extents, not the amount actually freed. e.g.

4k in file A @ block 10
4k in file B @ block 11
4k free space @ block 12
4k in file C @ block 13
1008k in free space at block 14.

If we free file A, we report that we've released an extent of 4k @ block 10.
If we then free file B, we report we've released an extent of 12k @ block 10.
If we then free file C, we report a release of 1024k @ block 10.

Then the underlying device knows what the aggregated free space regions are and can easily release large regions without needing to track tiny allocations and frees done by the filesystem.

> I guess that is equally domain specific, but the difference is that if
> you try to read from the DONTCOW part of the snapshot, you get bad
> old data, where as if you try to access the subordinate device of a
> snapshot, you get an IO error - which is probably safer.

If you read from a DONTCOW region you should get zeros back - it's a hole in the snapshot.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 20+ messages in thread
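Dave's worked example corresponds to a small free-space lookup: when an extent is freed, report the whole surrounding free region rather than just the blocks freed. The toy sketch below reproduces the 4k/12k/1024k progression from his mail; it is plain C for illustration, not the XFS free-space btrees.

/* Sketch of reporting aggregated free-space extents, following the
 * example above (4k blocks; files A, B and C at blocks 10, 11 and 13). */
#include <stdio.h>

#define MAX_BLOCK 1024

static int block_free[MAX_BLOCK];       /* 1 = free, 0 = allocated */

/* Free [start, start+len) and return the surrounding free extent, which
 * is what would be reported downstream in a release hint. */
static void free_and_report(unsigned start, unsigned len,
                            unsigned *rstart, unsigned *rlen)
{
        unsigned lo = start, hi = start + len, i;

        for (i = start; i < start + len; i++)
                block_free[i] = 1;
        while (lo > 0 && block_free[lo - 1])
                lo--;
        while (hi < MAX_BLOCK && block_free[hi])
                hi++;
        *rstart = lo;
        *rlen = hi - lo;
}

int main(void)
{
        unsigned i, s, l;

        /* Everything allocated except the free space in the example:
         * 4k free at block 12 and 1008k (252 blocks) free from block 14. */
        for (i = 0; i < MAX_BLOCK; i++)
                block_free[i] = 0;
        block_free[12] = 1;
        for (i = 14; i < 14 + 252; i++)
                block_free[i] = 1;

        free_and_report(10, 1, &s, &l);               /* free file A */
        printf("report %uk @ block %u\n", l * 4, s);  /* 4k @ 10 */
        free_and_report(11, 1, &s, &l);               /* free file B */
        printf("report %uk @ block %u\n", l * 4, s);  /* 12k @ 10 */
        free_and_report(13, 1, &s, &l);               /* free file C */
        printf("report %uk @ block %u\n", l * 4, s);  /* 1024k @ 10 */
        return 0;
}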
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 23:56 ` David Chinner @ 2007-10-31 4:01 ` Greg Banks 2007-10-31 7:04 ` David Chinner 0 siblings, 1 reply; 20+ messages in thread From: Greg Banks @ 2007-10-31 4:01 UTC (permalink / raw) To: David Chinner Cc: Neil Brown, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann On Wed, Oct 31, 2007 at 10:56:52AM +1100, David Chinner wrote: > On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: > > On Tuesday October 30, gnb@sgi.com wrote: > > > BIO_HINT_RELEASE > > > The bio's block extent is no longer in use by the filesystem > > > and will not be read in the future. Any storage used to back > > > the extent may be released without any threat to filesystem > > > or data integrity. > > > > If the allocation unit of the storage device (e.g. a few MB) does not > > match the allocation unit of the filesystem (e.g. a few KB) then for > > this to be useful either the storage device must start recording tiny > > allocations, or the filesystem should re-release areas as they grow. > > i.e. when releasing a range of a device, look in the filesystem's usage > > records for the largest surrounding free space, and release all of that. > > I figured that the easiest way around this is reporting free space > extents, not the amoutn actually freed. e.g. > > 4k in file A @ block 10 > 4k in file B @ block 11 > 4k free space @ block 12 > 4k in file C @ block 13 > 1008k in free space at block 14. > > If we free file A, we report that we've released an extent of 4k @ block 10. > if we then free file B, we report we've released an extent of 12k @ block 10. > If we then free file C, we report a release of 1024k @ block 10. > > Then the underlying device knows what the aggregated free space regions > are and can easily release large regions without needing to track tiny > allocations and frees done by the filesystem. If you could do that in the filesystem, it certainly solve the problem. In which case I'll explicitly allow for the hint's extent to overlap extents previous extents thus hinted, and define the semantics for overlaps. I think I'll rename the hint to BIO_HINT_RELEASED, I think that will make the semantics a little clearer. Greg. -- Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-31 4:01 ` Greg Banks @ 2007-10-31 7:04 ` David Chinner 0 siblings, 0 replies; 20+ messages in thread From: David Chinner @ 2007-10-31 7:04 UTC (permalink / raw) To: Greg Banks Cc: David Chinner, Neil Brown, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann On Wed, Oct 31, 2007 at 03:01:58PM +1100, Greg Banks wrote: > On Wed, Oct 31, 2007 at 10:56:52AM +1100, David Chinner wrote: > > On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: > > > On Tuesday October 30, gnb@sgi.com wrote: > > > > BIO_HINT_RELEASE > > > > The bio's block extent is no longer in use by the filesystem > > > > and will not be read in the future. Any storage used to back > > > > the extent may be released without any threat to filesystem > > > > or data integrity. > > > > > > If the allocation unit of the storage device (e.g. a few MB) does not > > > match the allocation unit of the filesystem (e.g. a few KB) then for > > > this to be useful either the storage device must start recording tiny > > > allocations, or the filesystem should re-release areas as they grow. > > > i.e. when releasing a range of a device, look in the filesystem's usage > > > records for the largest surrounding free space, and release all of that. > > > > I figured that the easiest way around this is reporting free space > > extents, not the amoutn actually freed. e.g. > > > > 4k in file A @ block 10 > > 4k in file B @ block 11 > > 4k free space @ block 12 > > 4k in file C @ block 13 > > 1008k in free space at block 14. > > > > If we free file A, we report that we've released an extent of 4k @ block 10. > > if we then free file B, we report we've released an extent of 12k @ block 10. > > If we then free file C, we report a release of 1024k @ block 10. > > > > Then the underlying device knows what the aggregated free space regions > > are and can easily release large regions without needing to track tiny > > allocations and frees done by the filesystem. > > If you could do that in the filesystem, it certainly solve the problem. > In which case I'll explicitly allow for the hint's extent to overlap > extents previous extents thus hinted, and define the semantics > for overlaps. I think I'll rename the hint to BIO_HINT_RELEASED, > I think that will make the semantics a little clearer. I think that can be done - i wouldn't have mentioned it if I didn't think it was possible to implement ;). It will require a further btree lookup once the free transaction hits the disk, but I think that's pretty easy to do. I'd probably hook xfs_alloc_clear_busy() to do this. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 1:04 ` Proposal to improve filesystem/block snapshot interaction Greg Banks 2007-10-30 1:11 ` Greg Banks 2007-10-30 4:16 ` Neil Brown @ 2007-10-30 9:35 ` Dongjun Shin 2007-10-30 10:15 ` Arnd Bergmann ` (2 more replies) 2 siblings, 3 replies; 20+ messages in thread From: Dongjun Shin @ 2007-10-30 9:35 UTC (permalink / raw) To: Greg Banks Cc: Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois, Arnd Bergmann On 10/30/07, Greg Banks <gnb@sgi.com> wrote: > > BIO_HINT_RELEASE > The bio's block extent is no longer in use by the filesystem > and will not be read in the future. Any storage used to back > the extent may be released without any threat to filesystem > or data integrity. > I'd like to second the proposal, but it would be more useful to bring the hint down to the physical devices. There is an ongoing discussion about adding 'Trim' ATA command for notifying the drive about the deleted blocks. http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf This is especially useful for the storage device like Solid State Drive (SSD). Dongjun ^ permalink raw reply [flat|nested] 20+ messages in thread
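For context, the Data Set Management ("Trim") command carries the deleted ranges as a data payload rather than as command parameters: 8-byte entries holding a 48-bit starting LBA and a 16-bit range length, padded out to a 512-byte block. The sketch below packs release extents into that layout; it follows my reading of the format that ended up in ACS-2, so verify against the draft linked above before relying on the details.

/* Hedged sketch: pack (lba, length) extents into TRIM-style 8-byte
 * entries (48-bit LBA, 16-bit range length) in a 512-byte payload. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TRIM_BLOCK_SIZE   512
#define TRIM_MAX_ENTRIES  (TRIM_BLOCK_SIZE / 8)
#define TRIM_MAX_RANGE    0xffffULL

struct extent {
        uint64_t lba;
        uint64_t len;           /* in sectors */
};

/* Returns the number of entries written, or -1 if they don't fit. */
static int trim_pack(const struct extent *ext, int n, uint8_t payload[TRIM_BLOCK_SIZE])
{
        int i, entry = 0;

        memset(payload, 0, TRIM_BLOCK_SIZE);
        for (i = 0; i < n; i++) {
                uint64_t lba = ext[i].lba, left = ext[i].len;

                while (left) {
                        uint64_t chunk = left > TRIM_MAX_RANGE ? TRIM_MAX_RANGE : left;
                        uint64_t word = (lba & 0xffffffffffffULL) | (chunk << 48);
                        int b;

                        if (entry == TRIM_MAX_ENTRIES)
                                return -1;
                        for (b = 0; b < 8; b++)         /* little-endian on the wire */
                                payload[entry * 8 + b] = (uint8_t)(word >> (8 * b));
                        entry++;
                        lba += chunk;
                        left -= chunk;
                }
        }
        return entry;
}

int main(void)
{
        uint8_t payload[TRIM_BLOCK_SIZE];
        struct extent released[] = { { 4096, 8 }, { 1u << 20, 200000 } };

        printf("entries used: %d\n", trim_pack(released, 2, payload));
        return 0;
}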
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 9:35 ` Dongjun Shin
@ 2007-10-30 10:15 ` Arnd Bergmann
  2007-10-30 10:49 ` Dongjun Shin
  2007-10-30 23:42 ` Kyungmin Park
  1 sibling, 2 replies; 20+ messages in thread
From: Arnd Bergmann @ 2007-10-30 10:15 UTC (permalink / raw)
To: Dongjun Shin
Cc: Greg Banks, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois

On Tuesday 30 October 2007, Dongjun Shin wrote:
> There is an ongoing discussion about adding a 'Trim' ATA command for notifying
> the drive about the deleted blocks.
>
> http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf
>
> This is especially useful for the storage device like Solid State Drive (SSD).
>
This makes me curious: why would t13 want to invent a new command when there is already the erase command from CFA?

It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf

	Arnd <><

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 10:15 ` Arnd Bergmann
@ 2007-10-30 10:49 ` Dongjun Shin
  2007-10-30 12:38 ` Arnd Bergmann
  1 sibling, 1 reply; 20+ messages in thread
From: Dongjun Shin @ 2007-10-30 10:49 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Greg Banks, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois

On 10/30/07, Arnd Bergmann <arnd@arndb.de> wrote:
> This makes me curious: why would t13 want to invent a new command when
> there is already the erase command from CFA?
>
> It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE
> should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
> http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf
>
I'm not sure about the background. However, it's definitely a sign that passing the deleted block info to the flash-based storage is useful.

Anyway, BIO_HINT_RELEASE could destroy the content of the blocks after being passed to the device. I think that other bios should not be reordered across that hint (just like a barrier).

Dongjun

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 10:49 ` Dongjun Shin @ 2007-10-30 12:38 ` Arnd Bergmann 2007-10-30 14:19 ` Dongjun Shin 0 siblings, 1 reply; 20+ messages in thread From: Arnd Bergmann @ 2007-10-30 12:38 UTC (permalink / raw) To: Dongjun Shin Cc: Greg Banks, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois On Tuesday 30 October 2007, Dongjun Shin wrote: > Anyway, BIO_HINT_RELEASE could destroy the content of the blocks > after being passed to the device. I think that other bio should not be > reordered accross that hint (just like barrier). Not sure. Why shouldn't you be able to reorder the hints provided that they don't overlap with read/write bios for the same block? Arnd <>< ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 12:38 ` Arnd Bergmann @ 2007-10-30 14:19 ` Dongjun Shin 2007-10-30 15:37 ` Jörn Engel 0 siblings, 1 reply; 20+ messages in thread From: Dongjun Shin @ 2007-10-30 14:19 UTC (permalink / raw) To: Arnd Bergmann Cc: Greg Banks, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois On 10/30/07, Arnd Bergmann <arnd@arndb.de> wrote: > > Not sure. Why shouldn't you be able to reorder the hints provided that > they don't overlap with read/write bios for the same block? > You're right. The bios can be reordered if they don't overlap with hint. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction 2007-10-30 14:19 ` Dongjun Shin @ 2007-10-30 15:37 ` Jörn Engel 2007-10-30 16:37 ` Arnd Bergmann 0 siblings, 1 reply; 20+ messages in thread From: Jörn Engel @ 2007-10-30 15:37 UTC (permalink / raw) To: Dongjun Shin Cc: Arnd Bergmann, Greg Banks, Linux Filesystem Mailing List, David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin, Brett Jon Grandbois On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote: > On 10/30/07, Arnd Bergmann <arnd@arndb.de> wrote: > > > > Not sure. Why shouldn't you be able to reorder the hints provided that > > they don't overlap with read/write bios for the same block? > > You're right. The bios can be reordered if they don't overlap with hint. I would keep things simpler. Bios can be reordered, full stop. If an erase and a write overlap, the caller (filesystem?) has to add a barrier. Jörn -- My second remark is that our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed. -- Edsger W. Dijkstra - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 15:37 ` Jörn Engel
@ 2007-10-30 16:37 ` Arnd Bergmann
  2007-10-30 23:19   ` Kyungmin Park
  0 siblings, 1 reply; 20+ messages in thread
From: Arnd Bergmann @ 2007-10-30 16:37 UTC (permalink / raw)
To: Jörn Engel
Cc: Dongjun Shin, Greg Banks, Linux Filesystem Mailing List,
    David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg,
    Mark Goodwin, Brett Jon Grandbois

On Tuesday 30 October 2007, Jörn Engel wrote:
> On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
> > On 10/30/07, Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > Not sure. Why shouldn't you be able to reorder the hints, provided
> > > that they don't overlap with read/write bios for the same blocks?
> >
> > You're right. The bios can be reordered if they don't overlap with
> > the hint.
>
> I would keep things simpler. Bios can be reordered, full stop. If an
> erase and a write overlap, the caller (the filesystem?) has to add a
> barrier.

I thought bios were already ordered if they affect the same blocks.
Either way, I agree that an erase should not be treated specially at
the bio layer; its ordering should be handled the same way we handle
it for writes.

	Arnd <><
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 16:37 ` Arnd Bergmann
@ 2007-10-30 23:19 ` Kyungmin Park
  0 siblings, 0 replies; 20+ messages in thread
From: Kyungmin Park @ 2007-10-30 23:19 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Jörn Engel, Dongjun Shin, Greg Banks, Linux Filesystem Mailing List,
    David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg,
    Mark Goodwin, Brett Jon Grandbois

On 10/31/07, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 30 October 2007, Jörn Engel wrote:
> > On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
> > > On 10/30/07, Arnd Bergmann <arnd@arndb.de> wrote:
> > > >
> > > > Not sure. Why shouldn't you be able to reorder the hints, provided
> > > > that they don't overlap with read/write bios for the same blocks?
> > >
> > > You're right. The bios can be reordered if they don't overlap with
> > > the hint.
> >
> > I would keep things simpler. Bios can be reordered, full stop. If an
> > erase and a write overlap, the caller (the filesystem?) has to add a
> > barrier.
>
> I thought bios were already ordered if they affect the same blocks.
> Either way, I agree that an erase should not be treated specially at
> the bio layer; its ordering should be handled the same way we handle
> it for writes.

To support the new ATA command (Trim, or Data Set Management), the
suggested hint is not enough. We have to send the bio with data (at
least one sector or more), since the new ATA command carries the data
set information as a payload. We also have to strictly follow the
ordering, using a barrier or other methods, at the filesystem level.

For example, consider the delete operation in ext3:

1. Some file is deleted.
2. ext3_delete_inode() is called.
3. ... -> ext3_free_blocks_sb() releases the free blocks.
4. If the hint is sent here, it breaks the ext3 power-off recovery
   scheme, because the device trims the data described by the hint and
   the blocks cannot be recovered after a reboot.
5. Only after the transaction, when all dirty pages have been flushed,
   can we trim the free blocks safely.

Another approach is to modify the block framework so that the I/O
scheduler does not merge a hint bio (in my terminology, bio control
info) with a general bio. In this case we must also consider the
reordering problem. I'm not sure this is possible at this time.

Thank you,
Kyungmin Park
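As a rough illustration of what Kyungmin means by "sending the bio with
data", here is a hedged sketch of a release hint that also carries one
sector of payload holding a single LBA range entry. BIO_RW_HINT and
BIO_HINT_RELEASE are the flags proposed in this thread, hint_end_io() is
a hypothetical completion handler that frees the page, and the entry
layout (bits 47:0 starting LBA, bits 63:48 range length) is an assumption
that should be checked against the T13 draft.

/*
 * Hedged sketch: a hint bio with a one-sector Data Set Management
 * payload.  Error unwinding is omitted for brevity.
 */
static int submit_release_hint(struct block_device *bdev,
			       sector_t start, unsigned int nsectors)
{
	struct bio *bio = bio_alloc(GFP_NOFS, 1);
	struct page *page = alloc_page(GFP_NOFS | __GFP_ZERO);
	__le64 *entry;

	if (!bio || !page)
		return -ENOMEM;		/* real code would unwind properly */

	/* assumed entry layout: low 48 bits LBA, high 16 bits range length */
	entry = page_address(page);
	entry[0] = cpu_to_le64(((u64)nsectors << 48) |
			       (start & ((1ULL << 48) - 1)));

	bio->bi_bdev = bdev;
	bio->bi_sector = start;
	bio->bi_end_io = hint_end_io;	/* hypothetical: frees the page */
	bio_add_page(bio, page, 512, 0);

	/* hint type carried in the priority bits, as proposed earlier */
	bio_set_prio(bio, BIO_HINT_RELEASE);
	submit_bio(1 << BIO_RW_HINT, bio);
	return 0;
}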
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30 10:15 ` Arnd Bergmann
  2007-10-30 10:49   ` Dongjun Shin
@ 2007-10-30 23:42   ` Kyungmin Park
  1 sibling, 0 replies; 20+ messages in thread
From: Kyungmin Park @ 2007-10-30 23:42 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Dongjun Shin, Greg Banks, Linux Filesystem Mailing List,
    David Chinner, Donald Douwsma, Christoph Hellwig, Roger Strassburg,
    Mark Goodwin, Brett Jon Grandbois

On 10/30/07, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 30 October 2007, Dongjun Shin wrote:
> > There is an ongoing discussion about adding a 'Trim' ATA command for
> > notifying the drive about deleted blocks.
> >
> > http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf
> >
> > This is especially useful for storage devices like Solid State
> > Drives (SSDs).
>
> This makes me curious: why would T13 want to invent a new command when
> there is already the erase command from CFA?
>
> It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE
> should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
> http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf

IMHO, the main difference is whether a physical operation is required.
CFA_ERASE erases the free blocks, so it requires a physical erase
operation. In the Trim case, the free blocks are just unmapped at the
FTL level; no physical operation is required. That saves time, and a
lot of work can then be done internally at the FTL level.

Thank you,
Kyungmin Park
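A toy illustration, not from the thread, of the distinction Kyungmin
draws: a CFA-style erase touches the flash itself, while a Trim-style
deallocation only drops a logical-to-physical mapping inside the FTL and
leaves the space for garbage collection to reclaim later. The mapping
table and physically_erase_block() are hypothetical.

struct toy_ftl {
	int *l2p;		/* logical page -> physical page, -1 = unmapped */
	unsigned int npages;
};

/* Trim-style: O(1) bookkeeping, no flash operation at all */
static void ftl_unmap(struct toy_ftl *ftl, unsigned int lpage)
{
	if (lpage < ftl->npages)
		ftl->l2p[lpage] = -1;
}

/* CFA-style: the mapped physical block really is erased, which is slow */
static void ftl_erase(struct toy_ftl *ftl, unsigned int lpage)
{
	if (lpage < ftl->npages && ftl->l2p[lpage] >= 0) {
		physically_erase_block(ftl->l2p[lpage]);	/* hypothetical */
		ftl->l2p[lpage] = -1;
	}
}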
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30  9:35 ` Dongjun Shin
  2007-10-30 10:15   ` Arnd Bergmann
@ 2007-10-30 14:06   ` Jörn Engel
  2007-10-31  3:44   ` Greg Banks
  2 siblings, 0 replies; 20+ messages in thread
From: Jörn Engel @ 2007-10-30 14:06 UTC (permalink / raw)
To: Dongjun Shin
Cc: Greg Banks, Linux Filesystem Mailing List, David Chinner,
    Donald Douwsma, Christoph Hellwig, Roger Strassburg, Mark Goodwin,
    Brett Jon Grandbois, Arnd Bergmann

On Tue, 30 October 2007 18:35:08 +0900, Dongjun Shin wrote:
> On 10/30/07, Greg Banks <gnb@sgi.com> wrote:
> >
> > BIO_HINT_RELEASE
> >     The bio's block extent is no longer in use by the filesystem
> >     and will not be read in the future. Any storage used to back
> >     the extent may be released without any threat to filesystem
> >     or data integrity.
>
> I'd like to second the proposal, but it would be more useful to bring
> the hint down to the physical devices.

Absolutely. Logfs would love to have an erase operation for block
devices as well. However, the above doesn't quite match my needs,
because the blocks _will_ be read in the future.

There are two reasons for reading things back later. The good one is to
determine whether the segment was erased or not. Reads should return
either valid data or one of (all-0xff, all-0x00, -ESOMETHING). Having a
dedicated error code would be best.

And getting the device erasesize would be useful as well, for obvious
reasons.

Jörn

--
When you close your hand, you own nothing. When you open it up, you
own the whole world.
-- Li Mu Bai in Tiger & Dragon
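A small sketch of the read-back check Jörn describes, under the
assumption that reading a previously erased region returns either
surviving data, a uniformly all-0x00 or all-0xff buffer, or a dedicated
error code. No such interface exists yet, so the error convention and the
helper are hypothetical; a real filesystem would additionally compare
against the data it expects to find there.

/*
 * Hedged sketch: decide whether a segment read back from the device
 * looks erased.  read_err would be the (hypothetical) dedicated error
 * code Jörn asks for.
 */
static int segment_was_erased(const unsigned char *buf, size_t len, int read_err)
{
	size_t i;
	unsigned char first;

	if (read_err)			/* dedicated "erased" error code */
		return 1;

	first = buf[0];
	if (first != 0x00 && first != 0xff)
		return 0;		/* ordinary data survived */

	for (i = 1; i < len; i++)
		if (buf[i] != first)
			return 0;	/* mixed content: not erased */

	return 1;			/* uniform 0x00 or 0xff: erased */
}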
* Re: Proposal to improve filesystem/block snapshot interaction
  2007-10-30  9:35 ` Dongjun Shin
  2007-10-30 10:15   ` Arnd Bergmann
  2007-10-30 14:06   ` Jörn Engel
@ 2007-10-31  3:44   ` Greg Banks
  2 siblings, 0 replies; 20+ messages in thread
From: Greg Banks @ 2007-10-31 3:44 UTC (permalink / raw)
To: Dongjun Shin
Cc: Linux Filesystem Mailing List, David Chinner, Donald Douwsma,
    Christoph Hellwig, Roger Strassburg, Mark Goodwin,
    Brett Jon Grandbois, Arnd Bergmann

On Tue, Oct 30, 2007 at 06:35:08PM +0900, Dongjun Shin wrote:
> On 10/30/07, Greg Banks <gnb@sgi.com> wrote:
> >
> > BIO_HINT_RELEASE
> >     The bio's block extent is no longer in use by the filesystem
> >     and will not be read in the future. Any storage used to back
> >     the extent may be released without any threat to filesystem
> >     or data integrity.
>
> I'd like to second the proposal, but it would be more useful to bring
> the hint down to the physical devices.
>
> There is an ongoing discussion about adding a 'Trim' ATA command for
> notifying the drive about deleted blocks.
>
> http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf

What an interesting document. Am I reading the change markup correctly?
Did it get *simpler* in the last revision? Wow.

I agree that BIO_HINT_RELEASE would be a good match for the proposed
Trim command. But I don't think we'll ever be issuing Trims with more
than a single LBA Range Entry; that feature seems unhelpful.

The Trim proposal doesn't specify what happens when a sector which is
already deallocated is deallocated again; presumably this is supposed
to be harmless?

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.
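For reference, a hedged sketch of decoding the single LBA Range Entry
Greg mentions, assuming the layout in the Data Set Management draft (low
48 bits starting LBA, high 16 bits range length, with a zero length
meaning the entry is unused). The exact layout should be verified against
the final T13 document; the struct and helper are illustrative only.

#include <stdint.h>
#include <stdbool.h>

struct lba_range {
	uint64_t lba;		/* starting LBA */
	uint16_t count;		/* number of sectors, 0 = entry not in use */
};

/* Decode one 8-byte range entry; returns false for padding entries. */
static bool decode_range_entry(uint64_t raw, struct lba_range *out)
{
	out->lba = raw & ((1ULL << 48) - 1);
	out->count = (uint16_t)(raw >> 48);
	return out->count != 0;
}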
Thread overview: 20+ messages (newest: 2007-11-20 23:41 UTC)
[not found] <20070927063113.GD2989@sgi.com>
2007-10-30 1:04 ` Proposal to improve filesystem/block snapshot interaction Greg Banks
2007-10-30 1:11 ` Greg Banks
2007-10-30 4:16 ` Neil Brown
2007-10-30 5:12 ` Greg Banks
2007-10-30 7:43 ` Arnd Bergmann
2007-11-20 23:43 ` Roger Strassburg
2007-10-30 23:56 ` David Chinner
2007-10-31 4:01 ` Greg Banks
2007-10-31 7:04 ` David Chinner
2007-10-30 9:35 ` Dongjun Shin
2007-10-30 10:15 ` Arnd Bergmann
2007-10-30 10:49 ` Dongjun Shin
2007-10-30 12:38 ` Arnd Bergmann
2007-10-30 14:19 ` Dongjun Shin
2007-10-30 15:37 ` Jörn Engel
2007-10-30 16:37 ` Arnd Bergmann
2007-10-30 23:19 ` Kyungmin Park
2007-10-30 23:42 ` Kyungmin Park
2007-10-30 14:06 ` Jörn Engel
2007-10-31 3:44 ` Greg Banks