* New TRIM/UNMAP tree published (2009-05-02) [not found] ` <20090503061150.GF10704@linux.intel.com> @ 2009-05-03 7:16 ` Matthew Wilcox 2009-05-03 13:07 ` Hugh Dickins 0 siblings, 1 reply; 22+ messages in thread From: Matthew Wilcox @ 2009-05-03 7:16 UTC (permalink / raw) To: Matthew Wilcox Cc: Boaz Harrosh, linux-ide, linux-kernel, jgarzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz On Sat, May 02, 2009 at 11:11:50PM -0700, Matthew Wilcox wrote: > > blkdev_issue_discard() and blk_ioctl_discard() has half a page > > of common (and changing) code, could be done to use a common > > helper that sets policy about bio allocation sizes and such. > > > > Just my $0.017 > > Yes, that works nicely. Thanks for the suggestion. I've pushed out a new git tree: http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090502 Changes since the last version: - Based on 2.6.30-rc4. - Dropped the three patches which were already merged. - Shuffled 'Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads' to the front of the queue since it's not controversial and could be merged earlier. - Inserted 'Unify blk_ioctl_discard and blkdev_issue_discard' as patch two. - Updated 'ide: Add support for TRIM' to compile with new IDE code. Still to do: - Understand all the changes Bart made to the IDE code; I didn't make all the changes that he did. I just made it compile for now. - Dave had some objections to the description of 'Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads', but didn't propose replacement text. - Figure out what to do about memory allocation. - Handle error returns from discard_prep. - Test the new code still works. - Delve into the latest SCSI spec and see if anything changed in UNMAP Given the extensive length of the todo list, I shan't send out the mailbomb. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 7:16 ` New TRIM/UNMAP tree published (2009-05-02) Matthew Wilcox @ 2009-05-03 13:07 ` Hugh Dickins 2009-05-03 14:48 ` Matthew Wilcox 0 siblings, 1 reply; 22+ messages in thread From: Hugh Dickins @ 2009-05-03 13:07 UTC (permalink / raw) To: Matthew Wilcox Cc: Matthew Wilcox, Boaz Harrosh, linux-ide, linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz, Mark Lord On Sun, 3 May 2009, Matthew Wilcox wrote: > On Sat, May 02, 2009 at 11:11:50PM -0700, Matthew Wilcox wrote: > > > blkdev_issue_discard() and blk_ioctl_discard() has half a page > > > of common (and changing) code, could be done to use a common > > > helper that sets policy about bio allocation sizes and such. > > > > > > Just my $0.017 > > > > Yes, that works nicely. Thanks for the suggestion. > > I've pushed out a new git tree: > > http://git.kernel.org/?p=linux/kernel/git/willy/ssd.git;a=shortlog;h=trim-20090502 I'm glad to see this coming alive again, thank you. I notice OCZ hope to have updated Vertex firmware mid-May, such that their TRIM support matches up with (y)ours. One comment below... > > Changes since the last version: > > - Based on 2.6.30-rc4. > - Dropped the three patches which were already merged. > - Shuffled 'Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of > reads' to the front of the queue since it's not controversial and could > be merged earlier. > - Inserted 'Unify blk_ioctl_discard and blkdev_issue_discard' as patch two. > - Updated 'ide: Add support for TRIM' to compile with new IDE code. > > Still to do: > - Understand all the changes Bart made to the IDE code; I didn't make all > the changes that he did. I just made it compile for now. > - Dave had some objections to the description of 'Make DISCARD_BARRIER and > DISCARD_NOBARRIER writes instead of reads', but didn't propose replacement > text. > - Figure out what to do about memory allocation. You have GFP_KERNEL allocations in your discard_fn()s. I believe that allocations down at that level will need to be GFP_NOIO - the bio mempool might be empty by this point, you don't want memory allocation to get into submitting even more I/O to satisfy your discard_fn(). Does it need its own mempool of reserve pages? My guess is that you can get away without that, since a failed discard, though regrettable, loses no data. Hugh > - Handle error returns from discard_prep. > - Test the new code still works. > - Delve into the latest SCSI spec and see if anything changed in UNMAP > > Given the extensive length of the todo list, I shan't send out the mailbomb. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 13:07 ` Hugh Dickins @ 2009-05-03 14:48 ` Matthew Wilcox 2009-05-03 15:02 ` Boaz Harrosh 2009-05-03 15:05 ` Hugh Dickins 0 siblings, 2 replies; 22+ messages in thread From: Matthew Wilcox @ 2009-05-03 14:48 UTC (permalink / raw) To: Hugh Dickins Cc: Matthew Wilcox, Boaz Harrosh, linux-ide, linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz, Mark Lord On Sun, May 03, 2009 at 02:07:32PM +0100, Hugh Dickins wrote: > I notice OCZ hope to have updated Vertex firmware mid-May, > such that their TRIM support matches up with (y)ours. Excellent. > You have GFP_KERNEL allocations in your discard_fn()s. I believe that > allocations down at that level will need to be GFP_NOIO - the bio mempool > might be empty by this point, you don't want memory allocation to get into > submitting even more I/O to satisfy your discard_fn(). Does it need its > own mempool of reserve pages? My guess is that you can get away without > that, since a failed discard, though regrettable, loses no data. You make a good point. There's definitely still work to be done around error handling. OTOH, we could always do what IDE does ... static void idedisk_prepare_flush(struct request_queue *q, struct request *rq) { ide_drive_t *drive = q->queuedata; struct ide_cmd *cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC); /* FIXME: map struct ide_taskfile on rq->cmd[] */ BUG_ON(cmd == NULL); -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 14:48 ` Matthew Wilcox @ 2009-05-03 15:02 ` Boaz Harrosh 2009-05-03 15:42 ` Matthew Wilcox 2009-05-03 15:05 ` Hugh Dickins 1 sibling, 1 reply; 22+ messages in thread From: Boaz Harrosh @ 2009-05-03 15:02 UTC (permalink / raw) To: Matthew Wilcox Cc: Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz, Mark Lord On 05/03/2009 05:48 PM, Matthew Wilcox wrote: > On Sun, May 03, 2009 at 02:07:32PM +0100, Hugh Dickins wrote: >> I notice OCZ hope to have updated Vertex firmware mid-May, >> such that their TRIM support matches up with (y)ours. > > Excellent. > >> You have GFP_KERNEL allocations in your discard_fn()s. I believe that >> allocations down at that level will need to be GFP_NOIO - the bio mempool >> might be empty by this point, you don't want memory allocation to get into >> submitting even more I/O to satisfy your discard_fn(). Does it need its >> own mempool of reserve pages? My guess is that you can get away without >> that, since a failed discard, though regrettable, loses no data. > > You make a good point. There's definitely still work to be done around > error handling. OTOH, we could always do what IDE does ... > > static void idedisk_prepare_flush(struct request_queue *q, struct request *rq) > { > ide_drive_t *drive = q->queuedata; > struct ide_cmd *cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC); > > /* FIXME: map struct ide_taskfile on rq->cmd[] */ > BUG_ON(cmd == NULL); > > > Out from the frying pan and into the fire. ;-) I agree with Hugh. The allocation is done at, too-low in the food chain. (And that free of buffer at upper layer allocated by lower layer). I think you need to separate the: "does lld need buffer, what size" from the "here is buffer prepare", so upper layer that can sleep does sleep. In all other buffer needing operations the allocation is done before submission of request, No? Just my $0.017 Thanks Boaz BTW: I've not have time to review second round will "TODO" ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 15:02 ` Boaz Harrosh @ 2009-05-03 15:42 ` Matthew Wilcox 2009-05-03 16:34 ` Boaz Harrosh 2009-05-03 18:48 ` Bartlomiej Zolnierkiewicz 0 siblings, 2 replies; 22+ messages in thread From: Matthew Wilcox @ 2009-05-03 15:42 UTC (permalink / raw) To: Boaz Harrosh Cc: Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz, Mark Lord On Sun, May 03, 2009 at 06:02:51PM +0300, Boaz Harrosh wrote: > I agree with Hugh. The allocation is done at, too-low in the food chain. > (And that free of buffer at upper layer allocated by lower layer). > > I think you need to separate the: "does lld need buffer, what size" > from the "here is buffer prepare", so upper layer that can sleep does > sleep. So you want two function pointers in the request queue relating to discard? > In all other buffer needing operations the allocation is done before > submission of request, No? It's not true for the flush request (the example I quoted). Obviously, the solution adopted here by IDE is Bad and Wrong ... -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 15:42 ` Matthew Wilcox @ 2009-05-03 16:34 ` Boaz Harrosh 2009-05-03 18:34 ` Jeff Garzik 2009-05-03 18:48 ` Bartlomiej Zolnierkiewicz 1 sibling, 1 reply; 22+ messages in thread From: Boaz Harrosh @ 2009-05-03 16:34 UTC (permalink / raw) To: Matthew Wilcox Cc: Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz, Mark Lord On 05/03/2009 06:42 PM, Matthew Wilcox wrote: > On Sun, May 03, 2009 at 06:02:51PM +0300, Boaz Harrosh wrote: >> I agree with Hugh. The allocation is done at, too-low in the food chain. >> (And that free of buffer at upper layer allocated by lower layer). >> >> I think you need to separate the: "does lld need buffer, what size" >> from the "here is buffer prepare", so upper layer that can sleep does >> sleep. > > So you want two function pointers in the request queue relating to discard? > OK I don't know what I want, I guess. ;-) I'm not a block-device export but from the small osdblk device I maintain it looks like osdblk_prepare_flush which is set into: blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, osdblk_prepare_flush); does some internal structure setup, but the actual flush command is only executed later in the global osdblk_rq_fn which is set into: blk_init_queue(osdblk_rq_fn, &osdev->lock); But I'm not even sure that prepare_flush is called in a better context then queue_fn, and what does it means to let block devices take care of another new command type at queue_fn. I guess it comes back to Jeff Garzik's comment about not having a central place to ask the request what we need to do. But I do hate that allocation is done by driver and free by mid-layer, so yes two vectors, request_queue is allocated once per device it's not that bad. And later when Jeff's comment is addressed it can be removed. >> In all other buffer needing operations the allocation is done before >> submission of request, No? > > It's not true for the flush request (the example I quoted). Obviously, > the solution adopted here by IDE is Bad and Wrong ... > So now I don't understand, is flush executed by the queue_fn or by the prepare_flush, or am I seeing two type of flushes? Thanks Boaz ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 16:34 ` Boaz Harrosh @ 2009-05-03 18:34 ` Jeff Garzik 2009-05-03 18:40 ` Jeff Garzik 2009-05-03 21:48 ` Matthew Wilcox 0 siblings, 2 replies; 22+ messages in thread From: Jeff Garzik @ 2009-05-03 18:34 UTC (permalink / raw) To: Boaz Harrosh Cc: Matthew Wilcox, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz, Mark Lord Boaz Harrosh wrote: > On 05/03/2009 06:42 PM, Matthew Wilcox wrote: >> On Sun, May 03, 2009 at 06:02:51PM +0300, Boaz Harrosh wrote: >>> I agree with Hugh. The allocation is done at, too-low in the food chain. >>> (And that free of buffer at upper layer allocated by lower layer). >>> >>> I think you need to separate the: "does lld need buffer, what size" >>> from the "here is buffer prepare", so upper layer that can sleep does >>> sleep. >> So you want two function pointers in the request queue relating to discard? >> > > OK I don't know what I want, I guess. ;-) > > I'm not a block-device export but from the small osdblk device I maintain > it looks like osdblk_prepare_flush which is set into: > blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, osdblk_prepare_flush); > > does some internal structure setup, but the actual flush command is only executed > later in the global osdblk_rq_fn which is set into: > blk_init_queue(osdblk_rq_fn, &osdev->lock); > > But I'm not even sure that prepare_flush is called in a better context then > queue_fn, and what does it means to let block devices take care of another > new command type at queue_fn. > > I guess it comes back to Jeff Garzik's comment about not having a central > place to ask the request what we need to do. > > But I do hate that allocation is done by driver and free by mid-layer, > so yes two vectors, request_queue is allocated once per device it's not > that bad. And later when Jeff's comment is addressed it can be removed. May I presume you are referring to the following osdblk.c comment? /* deduce our operation (read, write, flush) */ /* I wish the block layer simplified * cmd_type/cmd_flags/cmd[] * into a clearly defined set of RPC commands: * read, write, flush, scsi command, power mgmt req, * driver-specific, etc. */ Yes, the task of figuring out -what to do- in the queue's request function is quite complex, and discard makes it even more so. The API makes life difficult -- you have to pass temporary info to yourself in ->prepare_flush_fn() and ->prepare_discard_fn(), and the overall sum is a bewildering collection of opcodes, flags, and internal driver notes to itself. Add to this yet another prep function, ->prep_rq_fn() It definitely sucks, especially with regards to failed atomic allocations... but I think fixing this quite a big more than Matthew probably willing to tackle ;-) My ideal block layer interface would be a lot more opcode-based, e.g. (1) create REQ_TYPE_DISCARD (2) determine at init if queue (a) supports explicit DISCARD and/or (b) supports DISCARD flag passed with READ or WRITE (3) when creating a discard request, use block helpers w/ queue-specific knowledge to create either (a) one request, REQ_TYPE_FS, with discard flag or (b) two requests, REQ_TYPE_FS followed by REQ_TYPE_DISCARD (4) blkdev_issue_discard() would function like an empty barrier, and unconditionally create REQ_TYPE_DISCARD. This type of setup would require NO prepare_discard command, as all knowledge would be passed directly to ->prep_rq_fn() and ->request_fn() And to tangent a bit... I feel barriers should be handled in exactly the same way. Create REQ_TYPE_FLUSH, which would be issued for above examples #2a and #4, if the queue is setup that way. All this MINIMIZES the amount of information a driver must "pass to itself", by utilizing existing ->prep_fn_rq() and ->request_fn() pathways. Jeff ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 18:34 ` Jeff Garzik @ 2009-05-03 18:40 ` Jeff Garzik 2009-05-03 19:04 ` James Bottomley 2009-05-03 21:48 ` Matthew Wilcox 1 sibling, 1 reply; 22+ messages in thread From: Jeff Garzik @ 2009-05-03 18:40 UTC (permalink / raw) To: Matthew Wilcox, Jens Axboe Cc: Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord Jeff Garzik wrote: > (2) determine at init if queue (a) supports explicit DISCARD and/or (b) > supports DISCARD flag passed with READ or WRITE As an aside -- does any existing command set support case #b, above? AFAICT, ATA, SCSI and NVMHCI all have a single, explicit hardware command to discard/deallocate unused sectors. Therefore, creating REQ_TYPE_DISCARD seems to eliminate any need for new hook ->prepare_discard(). This provides a 1:1 correspondence between hardware and struct request, most closely matching the setup of known hardware. Jeff ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 18:40 ` Jeff Garzik @ 2009-05-03 19:04 ` James Bottomley 2009-05-03 19:20 ` Jeff Garzik 0 siblings, 1 reply; 22+ messages in thread From: James Bottomley @ 2009-05-03 19:04 UTC (permalink / raw) To: Jeff Garzik Cc: Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord On Sun, 2009-05-03 at 14:40 -0400, Jeff Garzik wrote: > Jeff Garzik wrote: > > (2) determine at init if queue (a) supports explicit DISCARD and/or (b) > > supports DISCARD flag passed with READ or WRITE > > > As an aside -- does any existing command set support case #b, above? Not to my knowledge. I think discard was modelled on barrier (which can be associated with data or stand on its own). > AFAICT, ATA, SCSI and NVMHCI all have a single, explicit hardware > command to discard/deallocate unused sectors. > > Therefore, creating REQ_TYPE_DISCARD seems to eliminate any need for new > hook ->prepare_discard(). Well, yes and no ... in the SCSI implementation of either UNMAP or WRITE SAME, we still need a data buffer to store either the extents or the actual same data. > This provides a 1:1 correspondence between hardware and struct request, > most closely matching the setup of known hardware. Agreed ... but we still have to allocate the adjunct buffer somewhere .... James ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 19:04 ` James Bottomley @ 2009-05-03 19:20 ` Jeff Garzik 2009-05-03 19:37 ` James Bottomley 2009-05-03 19:47 ` James Bottomley 0 siblings, 2 replies; 22+ messages in thread From: Jeff Garzik @ 2009-05-03 19:20 UTC (permalink / raw) To: James Bottomley Cc: Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord James Bottomley wrote: > On Sun, 2009-05-03 at 14:40 -0400, Jeff Garzik wrote: >> Jeff Garzik wrote: >>> (2) determine at init if queue (a) supports explicit DISCARD and/or (b) >>> supports DISCARD flag passed with READ or WRITE >> >> As an aside -- does any existing command set support case #b, above? > > Not to my knowledge. > > I think discard was modelled on barrier (which can be associated with > data or stand on its own). Yeah -- I am thinking we can reduce complexity if no real hardware supports "associated with data." >> AFAICT, ATA, SCSI and NVMHCI all have a single, explicit hardware >> command to discard/deallocate unused sectors. >> >> Therefore, creating REQ_TYPE_DISCARD seems to eliminate any need for new >> hook ->prepare_discard(). > > Well, yes and no ... in the SCSI implementation of either UNMAP or WRITE > SAME, we still need a data buffer to store either the extents or the > actual same data. Sure, UNMAP's extents would quite naturally travel with the REQ_TYPE_DISCARD struct request, and can be set up at struct request allocation time. In all of ATA, SCSI and NVMHCI, they appear to want the extents. Is WRITE SAME associated with this current DISCARD work, or is that just a similar example? I'm unfamiliar with its issues... >> This provides a 1:1 correspondence between hardware and struct request, >> most closely matching the setup of known hardware. > > Agreed ... but we still have to allocate the adjunct buffer > somewhere .... Agreed. I merely argue ->prep_rq_fn() would be a better place -- it can at least return a useful return value -- than ->prepare_discard(), which would be the natural disposition for a REQ_TYPE_DISCARD -- or ->request_fn() itself, if a block driver so chooses. The extents would be an argument to REQ_TYPE_DISCARD, and should be associated with struct request somehow, by struct request's creator. The extent info needs to be in some generic form, because all of ATA|SCSI|NVMHCI seem to want it. [tangent...] Does make you wonder if a ->init_rq_fn() would be helpful, one that could perform gfp_t allocations rather than GFP_ATOMIC? The idea being to call ->init_rq_fn() almost immediately after creation of struct request, by the struct request creator. I obviously have not thought in depth about this, but it does seem that init_rq_fn(), called earlier in struct request lifetime, could eliminate the need for ->prepare_flush, ->prepare_discard, and perhaps could be a better place for some of the ->prep_rq_fn logic. The creator of struct request generally has more freedom to sleep, and it seems logical to give low-level drivers a "fill in LLD-specific info" hook BEFORE the request is ever added to a request_queue. Who knows... Jeff ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 19:20 ` Jeff Garzik @ 2009-05-03 19:37 ` James Bottomley 2009-05-04 14:03 ` Douglas Gilbert 2009-05-03 19:47 ` James Bottomley 1 sibling, 1 reply; 22+ messages in thread From: James Bottomley @ 2009-05-03 19:37 UTC (permalink / raw) To: Jeff Garzik Cc: Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote: > Is WRITE SAME associated with this current DISCARD work, or is that just > a similar example? I'm unfamiliar with its issues... It's an adjunct body of work. T10 apparently ratified both UNMAP and the WRITE SAME extensions. What WRITE SAME does is write the same data block to multiple contiguous locations as specified in the CDB. What the thin provisioning update did for it is allow you to specify a flag saying I want these sectors unmapped. The perceived benefit of WRITE SAME is that you specify (with the write same data ... presumably all zeros) what an unmapped sector will return if it's ever read from again, which was a big argument in the UNMAP case. James ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 19:37 ` James Bottomley @ 2009-05-04 14:03 ` Douglas Gilbert 2009-05-04 14:40 ` James Bottomley 0 siblings, 1 reply; 22+ messages in thread From: Douglas Gilbert @ 2009-05-04 14:03 UTC (permalink / raw) To: James Bottomley Cc: Jeff Garzik, Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord James Bottomley wrote: > On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote: >> Is WRITE SAME associated with this current DISCARD work, or is that just >> a similar example? I'm unfamiliar with its issues... > > It's an adjunct body of work. T10 apparently ratified both UNMAP and > the WRITE SAME extensions. What WRITE SAME does is write the same data > block to multiple contiguous locations as specified in the CDB. What > the thin provisioning update did for it is allow you to specify a flag > saying I want these sectors unmapped. The perceived benefit of WRITE > SAME is that you specify (with the write same data ... presumably all > zeros) what an unmapped sector will return if it's ever read from again, > which was a big argument in the UNMAP case. James, Your presumption is correct. For the UNMAP bit to be honoured in the SCSI WRITE SAME command, the user data part of the data-out buffer needs to be all zeros, and, if present, the protection data part of the data-out buffer needs to be all 0xff_s (i.e. 8 bytes of 0xff). Otherwise the UNMAP bit in WRITE SAME command is ignored and it does a "normal" WRITE SAME. My $0.02's worth was a suggestion to report an error if the UNMAP bit was given to WRITE SAME and the data-out buffer did not comply with the above pattern. Alternatively the data-out buffer could just be ignored. The author of the WRITE SAME "unmap" facility duly noted my observations and rejected them :-) The wording in sbc3r18.pdf for WRITE SAME is contorted so there will be changes. And t10 is still having teleconferences about thin provisioning so there may be non-trivial changes in the near future. Doug Gilbert ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-04 14:03 ` Douglas Gilbert @ 2009-05-04 14:40 ` James Bottomley 2009-05-04 15:11 ` Douglas Gilbert 0 siblings, 1 reply; 22+ messages in thread From: James Bottomley @ 2009-05-04 14:40 UTC (permalink / raw) To: dgilbert Cc: Jeff Garzik, Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord On Mon, 2009-05-04 at 16:03 +0200, Douglas Gilbert wrote: > James Bottomley wrote: > > On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote: > >> Is WRITE SAME associated with this current DISCARD work, or is that just > >> a similar example? I'm unfamiliar with its issues... > > > > It's an adjunct body of work. T10 apparently ratified both UNMAP and > > the WRITE SAME extensions. What WRITE SAME does is write the same data > > block to multiple contiguous locations as specified in the CDB. What > > the thin provisioning update did for it is allow you to specify a flag > > saying I want these sectors unmapped. The perceived benefit of WRITE > > SAME is that you specify (with the write same data ... presumably all > > zeros) what an unmapped sector will return if it's ever read from again, > > which was a big argument in the UNMAP case. > > James, > Your presumption is correct. For the UNMAP bit to be honoured > in the SCSI WRITE SAME command, the user data part of the > data-out buffer needs to be all zeros, and, if present, > the protection data part of the data-out buffer needs > to be all 0xff_s (i.e. 8 bytes of 0xff). Otherwise the > UNMAP bit in WRITE SAME command is ignored and it does a > "normal" WRITE SAME. > > My $0.02's worth was a suggestion to report an error if the > UNMAP bit was given to WRITE SAME and the data-out > buffer did not comply with the above pattern. Alternatively > the data-out buffer could just be ignored. The author > of the WRITE SAME "unmap" facility duly noted my observations > and rejected them :-) The wording in sbc3r18.pdf for WRITE SAME > is contorted so there will be changes. And t10 is still > having teleconferences about thin provisioning so there may be > non-trivial changes in the near future. Actually, I'd just like something far more basic: forcing a thin provisioned array to support all of the three possible mechanisms. It's going to be a real mess trying to work out for any given array do you support UNMAP or WRITE SAME(16) or WRITE SAME(32)? We can only do this currently by trying the commands ... then we have to have support for all three built into sd just in case ... and we get precisely the same functionality in each case ... James ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-04 14:40 ` James Bottomley @ 2009-05-04 15:11 ` Douglas Gilbert 2009-05-04 15:23 ` James Bottomley 0 siblings, 1 reply; 22+ messages in thread From: Douglas Gilbert @ 2009-05-04 15:11 UTC (permalink / raw) To: James Bottomley Cc: Jeff Garzik, Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord James Bottomley wrote: > On Mon, 2009-05-04 at 16:03 +0200, Douglas Gilbert wrote: >> James Bottomley wrote: >>> On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote: >>>> Is WRITE SAME associated with this current DISCARD work, or is that just >>>> a similar example? I'm unfamiliar with its issues... >>> It's an adjunct body of work. T10 apparently ratified both UNMAP and >>> the WRITE SAME extensions. What WRITE SAME does is write the same data >>> block to multiple contiguous locations as specified in the CDB. What >>> the thin provisioning update did for it is allow you to specify a flag >>> saying I want these sectors unmapped. The perceived benefit of WRITE >>> SAME is that you specify (with the write same data ... presumably all >>> zeros) what an unmapped sector will return if it's ever read from again, >>> which was a big argument in the UNMAP case. >> James, >> Your presumption is correct. For the UNMAP bit to be honoured >> in the SCSI WRITE SAME command, the user data part of the >> data-out buffer needs to be all zeros, and, if present, >> the protection data part of the data-out buffer needs >> to be all 0xff_s (i.e. 8 bytes of 0xff). Otherwise the >> UNMAP bit in WRITE SAME command is ignored and it does a >> "normal" WRITE SAME. >> >> My $0.02's worth was a suggestion to report an error if the >> UNMAP bit was given to WRITE SAME and the data-out >> buffer did not comply with the above pattern. Alternatively >> the data-out buffer could just be ignored. The author >> of the WRITE SAME "unmap" facility duly noted my observations >> and rejected them :-) The wording in sbc3r18.pdf for WRITE SAME >> is contorted so there will be changes. And t10 is still >> having teleconferences about thin provisioning so there may be >> non-trivial changes in the near future. > > Actually, I'd just like something far more basic: forcing a thin > provisioned array to support all of the three possible mechanisms. It's > going to be a real mess trying to work out for any given array do you > support UNMAP or WRITE SAME(16) or WRITE SAME(32)? We can only do this > currently by trying the commands ... then we have to have support for > all three built into sd just in case ... and we get precisely the same > functionality in each case ... James, Another aspect, especially if a large amount of storage is to be trimmed, is how long will it take? This relates to the timeout value we should associate with such an invocation. The FORMAT UNIT and START STOP UNIT commands have an IMMED bit, but not WRITE SAME. Speaking of FORMAT UNIT, some words were added into sbc3r18 that suggest a FORMAT UNIT command could be interpreted as unmap/trim the whole disk. Doug Gilbert ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-04 15:11 ` Douglas Gilbert @ 2009-05-04 15:23 ` James Bottomley 0 siblings, 0 replies; 22+ messages in thread From: James Bottomley @ 2009-05-04 15:23 UTC (permalink / raw) To: dgilbert Cc: Jeff Garzik, Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord On Mon, 2009-05-04 at 17:11 +0200, Douglas Gilbert wrote: > Another aspect, especially if a large amount of storage > is to be trimmed, is how long will it take? This relates > to the timeout value we should associate with such an > invocation. The FORMAT UNIT and START STOP UNIT commands > have an IMMED bit, but not WRITE SAME. I was assuming it would be more or less instantaneous ... if it's not then the plan of trying to keep thin provisioning happy just by sending down a maximal trim from the filesystem is going to run into trouble. > Speaking of FORMAT UNIT, some words were added into sbc3r18 > that suggest a FORMAT UNIT command could be interpreted as > unmap/trim the whole disk. Seems reasonable ... if you format the device it could be argued you're not relying on the data contents any more ... James ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 19:20 ` Jeff Garzik 2009-05-03 19:37 ` James Bottomley @ 2009-05-03 19:47 ` James Bottomley 2009-05-03 22:47 ` Jeff Garzik 1 sibling, 1 reply; 22+ messages in thread From: James Bottomley @ 2009-05-03 19:47 UTC (permalink / raw) To: Jeff Garzik Cc: Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote: > [tangent...] > > Does make you wonder if a ->init_rq_fn() would be helpful, one that > could perform gfp_t allocations rather than GFP_ATOMIC? The idea being > to call ->init_rq_fn() almost immediately after creation of struct > request, by the struct request creator. Isn't that what the current prep_fn actually is? > I obviously have not thought in depth about this, but it does seem that > init_rq_fn(), called earlier in struct request lifetime, could eliminate > the need for ->prepare_flush, ->prepare_discard, and perhaps could be a > better place for some of the ->prep_rq_fn logic. It's hard to see how ... prep_rq_fn is already called pretty early ... almost as soon as the elevator has decided to spit out the request > The creator of struct request generally has more freedom to sleep, and > it seems logical to give low-level drivers a "fill in LLD-specific info" > hook BEFORE the request is ever added to a request_queue. Unfortunately it's not really possible to find a sleeping context in there: The elevators have to operate from the current elv_next_request() context, which, in most drivers can either be user or interrupt. The way the block layer is designed is to pull allocations up the stack much closer to the process (usually at the bio creation point) because that allows the elevators to operate even in memory starved conditions. If we pushed the allocation down into the request level, we'd need some type of threading (bad for performance) and the request processing would stall when some GFP_KERNEL allocation went out to lunch finding memory. The ideal for REQ_TYPE_DISCARD seems to be to force a page allocation tied to a bio when it's issued at the top. That way everyone has enough memory when it comes down the stack (both extents and WRITE SAME sector will fit into a page ... although only just for WRITE SAME on 4k sectors). James ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 19:47 ` James Bottomley @ 2009-05-03 22:47 ` Jeff Garzik 2009-05-04 15:28 ` Boaz Harrosh 0 siblings, 1 reply; 22+ messages in thread From: Jeff Garzik @ 2009-05-03 22:47 UTC (permalink / raw) To: James Bottomley Cc: Matthew Wilcox, Jens Axboe, Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord James Bottomley wrote: > On Sun, 2009-05-03 at 15:20 -0400, Jeff Garzik wrote: >> [tangent...] >> >> Does make you wonder if a ->init_rq_fn() would be helpful, one that >> could perform gfp_t allocations rather than GFP_ATOMIC? The idea being >> to call ->init_rq_fn() almost immediately after creation of struct >> request, by the struct request creator. > > Isn't that what the current prep_fn actually is? > It's hard to see how ... prep_rq_fn is already called pretty early ... > almost as soon as the elevator has decided to spit out the request prep_rq_fn is called very, very late -- from elv_next_request(), which is called from ->request_fn. As quoted above, I'm talking about something closer to -creation- time, not -execution- time. >> The creator of struct request generally has more freedom to sleep, and >> it seems logical to give low-level drivers a "fill in LLD-specific info" >> hook BEFORE the request is ever added to a request_queue. > > Unfortunately it's not really possible to find a sleeping context in > there: The elevators have to operate from the current > elv_next_request() context, which, in most drivers can either be user or > interrupt. Sure, because that's further down the pipeline at the request execution stage. Think more like make_request time... > The ideal for REQ_TYPE_DISCARD seems to be to force a page allocation > tied to a bio when it's issued at the top. That way everyone has enough > memory when it comes down the stack (both extents and WRITE SAME sector > will fit into a page ... although only just for WRITE SAME on 4k > sectors). Makes sense... Jeff ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 22:47 ` Jeff Garzik @ 2009-05-04 15:28 ` Boaz Harrosh 0 siblings, 0 replies; 22+ messages in thread From: Boaz Harrosh @ 2009-05-04 15:28 UTC (permalink / raw) To: Jeff Garzik Cc: James Bottomley, Matthew Wilcox, Jens Axboe, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, linux-scsi, Bartlomiej Zolnierkiewicz, Mark Lord On 05/04/2009 01:47 AM, Jeff Garzik wrote: > James Bottomley wrote: >> The ideal for REQ_TYPE_DISCARD seems to be to force a page allocation >> tied to a bio when it's issued at the top. That way everyone has enough >> memory when it comes down the stack (both extents and WRITE SAME sector >> will fit into a page ... although only just for WRITE SAME on 4k >> sectors). > > Makes sense... > > Jeff > I second that, let the allocator of the request supply a buffer and free it on return/done. Perhaps with a general helper that does all that. Boaz ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 18:34 ` Jeff Garzik 2009-05-03 18:40 ` Jeff Garzik @ 2009-05-03 21:48 ` Matthew Wilcox 2009-05-03 22:54 ` Jeff Garzik 1 sibling, 1 reply; 22+ messages in thread From: Matthew Wilcox @ 2009-05-03 21:48 UTC (permalink / raw) To: Jeff Garzik Cc: Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz, Mark Lord, David Woodhouse On Sun, May 03, 2009 at 02:34:35PM -0400, Jeff Garzik wrote: > Yes, the task of figuring out -what to do- in the queue's request > function is quite complex, and discard makes it even more so. > > The API makes life difficult -- you have to pass temporary info to > yourself in ->prepare_flush_fn() and ->prepare_discard_fn(), and the > overall sum is a bewildering collection of opcodes, flags, and internal > driver notes to itself. > > Add to this yet another prep function, ->prep_rq_fn() > > It definitely sucks, especially with regards to failed atomic > allocations... but I think fixing this quite a big more than Matthew > probably willing to tackle ;-) I'm completely confused by the block layer API, to be honest. Trying to deduce how to add a new feature at this stage is hard (compare it to adding the reflink operation to the VFS ...). I'm definitely willing to tackle changing the block device API, but it may take a while. > My ideal block layer interface would be a lot more opcode-based, e.g. > > (1) create REQ_TYPE_DISCARD > > (2) determine at init if queue (a) supports explicit DISCARD and/or (b) > supports DISCARD flag passed with READ or WRITE > > (3) when creating a discard request, use block helpers w/ queue-specific > knowledge to create either > (a) one request, REQ_TYPE_FS, with discard flag or > (b) two requests, REQ_TYPE_FS followed by REQ_TYPE_DISCARD I'm not sure we need option 3b. > (4) blkdev_issue_discard() would function like an empty barrier, and > unconditionally create REQ_TYPE_DISCARD. I can certainly prototype a replacement for discard_prep_fn along these lines. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 21:48 ` Matthew Wilcox @ 2009-05-03 22:54 ` Jeff Garzik 0 siblings, 0 replies; 22+ messages in thread From: Jeff Garzik @ 2009-05-03 22:54 UTC (permalink / raw) To: Matthew Wilcox Cc: Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz, Mark Lord, David Woodhouse Matthew Wilcox wrote: >> (3) when creating a discard request, use block helpers w/ queue-specific >> knowledge to create either >> (a) one request, REQ_TYPE_FS, with discard flag or >> (b) two requests, REQ_TYPE_FS followed by REQ_TYPE_DISCARD > > I'm not sure we need option 3b. Well -- it is a hard requirement to map 1:1 struct request with the underlying hardware device's command set. If a device command set lacks a READ-and-DISCARD operation, then you _must_ create two struct request. Otherwise you break block layer tagging and other 1:1-based assumptions that exist today. Thus, given that all of ATA|SCSI|NVMHCI have a DISCARD operation distinct from other commands... option 3b is the overwhelming common case. Jeff ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 15:42 ` Matthew Wilcox 2009-05-03 16:34 ` Boaz Harrosh @ 2009-05-03 18:48 ` Bartlomiej Zolnierkiewicz 1 sibling, 0 replies; 22+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2009-05-03 18:48 UTC (permalink / raw) To: Matthew Wilcox Cc: Boaz Harrosh, Hugh Dickins, Matthew Wilcox, linux-ide, linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe, Mark Lord On Sunday 03 May 2009 17:42:16 Matthew Wilcox wrote: > On Sun, May 03, 2009 at 06:02:51PM +0300, Boaz Harrosh wrote: > > I agree with Hugh. The allocation is done at, too-low in the food chain. > > (And that free of buffer at upper layer allocated by lower layer). > > > > I think you need to separate the: "does lld need buffer, what size" > > from the "here is buffer prepare", so upper layer that can sleep does > > sleep. > > So you want two function pointers in the request queue relating to discard? > > > In all other buffer needing operations the allocation is done before > > submission of request, No? > > It's not true for the flush request (the example I quoted). Obviously, > the solution adopted here by IDE is Bad and Wrong ... When speaking about "the solution" do you mean that quick & ugly hack that I needed to fix the unexpected regression during late -rc cycle? ;) The proper solution would involve adding 'struct ide_cmd flush_cmd' to ide_drive_t but I never got to verify/inquiry that is OK with the block layer (if there can be only one flush rq outstanding or if we can have both q->pre_flush_rq and q->post_flush_rq queued at the same time)... Thanks, Bart ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: New TRIM/UNMAP tree published (2009-05-02) 2009-05-03 14:48 ` Matthew Wilcox 2009-05-03 15:02 ` Boaz Harrosh @ 2009-05-03 15:05 ` Hugh Dickins 1 sibling, 0 replies; 22+ messages in thread From: Hugh Dickins @ 2009-05-03 15:05 UTC (permalink / raw) To: Matthew Wilcox Cc: Matthew Wilcox, Boaz Harrosh, linux-ide, linux-kernel, Jeff Garzik, linux-scsi, Jens Axboe, Bartlomiej Zolnierkiewicz, Mark Lord On Sun, 3 May 2009, Matthew Wilcox wrote: > > You make a good point. There's definitely still work to be done around > error handling. OTOH, we could always do what IDE does ... > > static void idedisk_prepare_flush(struct request_queue *q, struct request *rq) > { > ide_drive_t *drive = q->queuedata; > struct ide_cmd *cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC); > > /* FIXME: map struct ide_taskfile on rq->cmd[] */ > BUG_ON(cmd == NULL); Oooh, you're winding me up. Certainly not. Off with the Top Of your Head! Hugh ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2009-05-04 15:28 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <1238683047-13588-1-git-send-email-willy@linux.intel.com>
[not found] ` <49D8A3D7.5070507@panasas.com>
[not found] ` <20090503061150.GF10704@linux.intel.com>
2009-05-03 7:16 ` New TRIM/UNMAP tree published (2009-05-02) Matthew Wilcox
2009-05-03 13:07 ` Hugh Dickins
2009-05-03 14:48 ` Matthew Wilcox
2009-05-03 15:02 ` Boaz Harrosh
2009-05-03 15:42 ` Matthew Wilcox
2009-05-03 16:34 ` Boaz Harrosh
2009-05-03 18:34 ` Jeff Garzik
2009-05-03 18:40 ` Jeff Garzik
2009-05-03 19:04 ` James Bottomley
2009-05-03 19:20 ` Jeff Garzik
2009-05-03 19:37 ` James Bottomley
2009-05-04 14:03 ` Douglas Gilbert
2009-05-04 14:40 ` James Bottomley
2009-05-04 15:11 ` Douglas Gilbert
2009-05-04 15:23 ` James Bottomley
2009-05-03 19:47 ` James Bottomley
2009-05-03 22:47 ` Jeff Garzik
2009-05-04 15:28 ` Boaz Harrosh
2009-05-03 21:48 ` Matthew Wilcox
2009-05-03 22:54 ` Jeff Garzik
2009-05-03 18:48 ` Bartlomiej Zolnierkiewicz
2009-05-03 15:05 ` Hugh Dickins
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox