* [PATCH, RFC] xfs: batched discard support @ 2009-08-16 0:47 Christoph Hellwig 2009-08-16 1:35 ` Mark Lord 2009-08-19 20:39 ` Ingo Molnar 0 siblings, 2 replies; 27+ messages in thread From: Christoph Hellwig @ 2009-08-16 0:47 UTC (permalink / raw) To: xfs, linux-fsdevel, linux-scsi, linux-kernel, liml, jens.axboe [-- Attachment #1: Type: text/plain, Size: 9156 bytes --] Given that everyone is so big in the discard discussion I'd like to present what I had started to prepare for XFS. I didn't plan to send it out until I get my hands onto a TRIM capable device (or at least get time to add support to qemu), and so far it's only been tested in dry-run mode. The basic idea is to add an ioctl which walks the free space btrees in each allocation group and simply discard everythin that is free. Given that XFS doesn't gragment freespace very much that's a very efficient way to do it. In addition we also already support setting a threshold under which we don't bother to discard an extent, it's currently hardcoded in the helper tool. In the future we could also add things like a sequence number in the AG headers if anything has changed at all, but let's leave those optimizations until we need them. XFS locks the allocation btree using the btree buffers, so we do not block allocations from any extent which we're not currenly discarding. Now the caveat for that is that we really do want to do the discard synchronously, that is wait for the request to finish. That's what I've implemented in this patch, but it's the part I haven't been able to test so far. (and yes, this should be separate patch, but it's really just an RFC for now) Mark, any chance to try it? Just create an XFS filesystem, age it a bit and then call the attached little trim.c program on the mountmoint (or any file inside the filesystem for that matter) Index: linux-2.6/fs/xfs/linux-2.6/xfs_ioctl.c =================================================================== --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_ioctl.c 2009-08-15 20:15:14.379163976 -0300 +++ linux-2.6/fs/xfs/linux-2.6/xfs_ioctl.c 2009-08-15 21:19:19.342664224 -0300 @@ -1275,6 +1275,31 @@ xfs_ioc_getbmapx( return 0; } +STATIC int +xfs_ioc_trim( + struct xfs_mount *mp, + __uint32_t *argp) +{ + xfs_agnumber_t agno; + int error = 0; + __uint32_t minlen; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + if (get_user(minlen, argp)) + return -EFAULT; + + down_read(&mp->m_peraglock); + for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { + error = -xfs_trim_extents(mp, agno, minlen); + if (error) + break; + } + up_read(&mp->m_peraglock); + + return error; +} + /* * Note: some of the ioctl's return positive numbers as a * byte count indicating success, such as readlink_by_handle. @@ -1524,6 +1549,9 @@ xfs_file_ioctl( error = xfs_errortag_clearall(mp, 1); return -error; + case XFS_IOC_TRIM: + return xfs_ioc_trim(mp, arg); + default: return -ENOTTY; } Index: linux-2.6/fs/xfs/xfs_alloc.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_alloc.c 2009-08-15 20:11:00.791163409 -0300 +++ linux-2.6/fs/xfs/xfs_alloc.c 2009-08-15 21:40:16.226666638 -0300 @@ -2470,6 +2470,96 @@ error0: return error; } +STATIC int +xfs_trim_extent( + struct xfs_mount *mp, + xfs_agnumber_t agno, + xfs_agblock_t fbno, + xfs_extlen_t flen) +{ + xfs_daddr_t blkno = XFS_AGB_TO_DADDR(mp, agno, fbno); + sector_t nblks = XFS_FSB_TO_BB(mp, flen); + int error; + + xfs_fs_cmn_err(CE_NOTE, mp, "discarding sectors [0x%llx-0x%llx]", + blkno, nblks); + + error = -__blkdev_issue_discard(mp->m_ddev_targp->bt_bdev, + blkno, nblks, GFP_NOFS, 1); + if (error && error != EOPNOTSUPP) + xfs_fs_cmn_err(CE_NOTE, mp, "discard failed, error %d", error); + return error; +} + +/* + * Notify the underlying block device about our free extent map. + * + * This walks all free extents above a minimum threshold and notifies the + * underlying device that these blocks are unused. That information is + * useful for SSDs or thinly provisioned storage in high end arrays or + * virtualization scenarios. + */ +int +xfs_trim_extents( + struct xfs_mount *mp, + xfs_agnumber_t agno, + xfs_extlen_t minlen) /* minimum extent size to bother */ +{ + struct xfs_btree_cur *cur; /* cursor for the by-block btree */ + struct xfs_buf *agbp; /* AGF buffer pointer */ + xfs_agblock_t bno; /* block the for next search */ + xfs_agblock_t fbno; /* start block of found extent */ + xfs_extlen_t flen; /* length of found extent */ + int error; + int i; + + error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp); + if (error) + return error; + + bno = 0; + for (;;) { + cur = xfs_allocbt_init_cursor(mp, NULL, agbp, agno, + XFS_BTNUM_BNO); + + error = xfs_alloc_lookup_ge(cur, bno, minlen, &i); + if (error) + goto error0; + if (!i) { + /* + * No more free extents found: done. + */ + xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR); + break; + } + + error = xfs_alloc_get_rec(cur, &fbno, &flen, &i); + if (error) + goto error0; + XFS_WANT_CORRUPTED_GOTO(i == 1, error0); + + /* + * Pass if the freespace extent isn't long enough to bother. + */ + if (flen >= minlen) { + error = xfs_trim_extent(mp, agno, fbno, flen); + if (error) { + xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR); + break; + } + } + + xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR); + bno = fbno + flen; + } + +out: + xfs_buf_relse(agbp); + return error; +error0: + xfs_btree_del_cursor(cur, XFS_BTREE_ERROR); + goto out; +} /* * AG Busy list management Index: linux-2.6/fs/xfs/xfs_alloc.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_alloc.h 2009-08-15 20:12:51.762661386 -0300 +++ linux-2.6/fs/xfs/xfs_alloc.h 2009-08-15 20:15:07.334667592 -0300 @@ -217,4 +217,7 @@ xfs_free_extent( xfs_fsblock_t bno, /* starting block number of extent */ xfs_extlen_t len); /* length of extent */ +int xfs_trim_extents(struct xfs_mount *mp, xfs_agnumber_t agno, + xfs_extlen_t minlen); + #endif /* __XFS_ALLOC_H__ */ Index: linux-2.6/fs/xfs/xfs_fs.h =================================================================== --- linux-2.6.orig/fs/xfs/xfs_fs.h 2009-08-15 20:22:03.735200427 -0300 +++ linux-2.6/fs/xfs/xfs_fs.h 2009-08-15 20:24:18.677996430 -0300 @@ -475,6 +475,7 @@ typedef struct xfs_handle { #define XFS_IOC_ATTRMULTI_BY_HANDLE _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq) #define XFS_IOC_FSGEOMETRY _IOR ('X', 124, struct xfs_fsop_geom) #define XFS_IOC_GOINGDOWN _IOR ('X', 125, __uint32_t) +#define XFS_IOC_TRIM _IOR ('X', 126, __uint32_t) /* XFS_IOC_GETFSUUID ---------- deprecated 140 */ Index: linux-2.6/block/blk-barrier.c =================================================================== --- linux-2.6.orig/block/blk-barrier.c 2009-08-15 21:36:30.426696824 -0300 +++ linux-2.6/block/blk-barrier.c 2009-08-15 21:41:11.490664659 -0300 @@ -348,22 +348,27 @@ static void blkdev_discard_end_io(struct clear_bit(BIO_UPTODATE, &bio->bi_flags); } + if (bio->bi_private) + complete(bio->bi_private); bio_put(bio); } /** - * blkdev_issue_discard - queue a discard + * __blkdev_issue_discard - queue a discard * @bdev: blockdev to issue discard for * @sector: start sector * @nr_sects: number of sectors to discard * @gfp_mask: memory allocation flags (for bio_alloc) + * @wait: if %1 wait for the discard to finish * * Description: - * Issue a discard request for the sectors in question. Does not wait. + * Issue a discard request for the sectors in question. */ -int blkdev_issue_discard(struct block_device *bdev, - sector_t sector, sector_t nr_sects, gfp_t gfp_mask) +int __blkdev_issue_discard(struct block_device *bdev, + sector_t sector, sector_t nr_sects, gfp_t gfp_mask, + int wait) { + DECLARE_COMPLETION_ONSTACK(done); struct request_queue *q; struct bio *bio; int ret = 0; @@ -385,6 +390,7 @@ int blkdev_issue_discard(struct block_de bio->bi_end_io = blkdev_discard_end_io; bio->bi_bdev = bdev; + bio->bi_private = wait ? &done : NULL; bio->bi_sector = sector; @@ -399,6 +405,9 @@ int blkdev_issue_discard(struct block_de bio_get(bio); submit_bio(DISCARD_BARRIER, bio); + if (wait) + wait_for_completion(&done); + /* Check if it failed immediately */ if (bio_flagged(bio, BIO_EOPNOTSUPP)) ret = -EOPNOTSUPP; @@ -408,4 +417,4 @@ int blkdev_issue_discard(struct block_de } return ret; } -EXPORT_SYMBOL(blkdev_issue_discard); +EXPORT_SYMBOL(__blkdev_issue_discard); Index: linux-2.6/include/linux/blkdev.h =================================================================== --- linux-2.6.orig/include/linux/blkdev.h 2009-08-15 21:40:20.507164178 -0300 +++ linux-2.6/include/linux/blkdev.h 2009-08-15 21:42:15.734715355 -0300 @@ -977,8 +977,14 @@ static inline struct request *blk_map_qu } extern int blkdev_issue_flush(struct block_device *, sector_t *); -extern int blkdev_issue_discard(struct block_device *, - sector_t sector, sector_t nr_sects, gfp_t); +extern int __blkdev_issue_discard(struct block_device *, sector_t sector, + sector_t nr_sects, gfp_t, int wait); + +static inline int blkdev_issue_discard(struct block_device *bdev, + sector_t sector, sector_t nr_sects, gfp_t gfp_mask) +{ + return __blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, 0); +} static inline int sb_issue_discard(struct super_block *sb, sector_t block, sector_t nr_blocks) [-- Attachment #2: trim.c --] [-- Type: text/plain, Size: 576 bytes --] #include <errno.h> #include <fcntl.h> #include <stdio.h> #include <stdint.h> #include <sys/ioctl.h> #define XFS_IOC_TRIM _IOR ('X', 126, uint32_t) int main(int argc, char **argv) { int minsize = 4096; int fd; if (argc != 2) { fprintf(stderr, "usage: %s mountpoint\n", argv[0]); return 1; } fd = open(argv[1], O_RDONLY); if (fd < 0) { perror("open"); return 1; } if (ioctl(fd, XFS_IOC_TRIM, &minsize)) { if (errno == EOPNOTSUPP) fprintf(stderr, "TRIM not supported\n"); else perror("XFS_IOC_TRIM"); return 1; } return 0; } ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-16 0:47 [PATCH, RFC] xfs: batched discard support Christoph Hellwig @ 2009-08-16 1:35 ` Mark Lord 2009-08-16 2:19 ` Mark Lord 2009-08-19 20:39 ` Ingo Molnar 1 sibling, 1 reply; 27+ messages in thread From: Mark Lord @ 2009-08-16 1:35 UTC (permalink / raw) To: Christoph Hellwig Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe Christoph Hellwig wrote: > Given that everyone is so big in the discard discussion I'd like to > present what I had started to prepare for XFS. I didn't plan to send it > out until I get my hands onto a TRIM capable device (or at least get > time to add support to qemu), and so far it's only been tested in > dry-run mode. > > The basic idea is to add an ioctl which walks the free space btrees in > each allocation group and simply discard everythin that is free. Given > that XFS doesn't gragment freespace very much that's a very efficient > way to do it. In addition we also already support setting a threshold > under which we don't bother to discard an extent, it's currently > hardcoded in the helper tool. In the future we could also add things > like a sequence number in the AG headers if anything has changed at all, > but let's leave those optimizations until we need them. > > XFS locks the allocation btree using the btree buffers, so we do not > block allocations from any extent which we're not currenly discarding. > > Now the caveat for that is that we really do want to do the discard > synchronously, that is wait for the request to finish. That's what > I've implemented in this patch, but it's the part I haven't been > able to test so far. (and yes, this should be separate patch, but it's > really just an RFC for now) > > Mark, any chance to try it? Just create an XFS filesystem, age it a > bit and then call the attached little trim.c program on the mountmoint > (or any file inside the filesystem for that matter) .. Looking at it now. Thanks, Christoph! ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-16 1:35 ` Mark Lord @ 2009-08-16 2:19 ` Mark Lord 2009-08-16 2:25 ` Christoph Hellwig 0 siblings, 1 reply; 27+ messages in thread From: Mark Lord @ 2009-08-16 2:19 UTC (permalink / raw) To: Christoph Hellwig Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe Mark Lord wrote: > Christoph Hellwig wrote: .. >> Mark, any chance to try it? Just create an XFS filesystem, age it a >> bit and then call the attached little trim.c program on the mountmoint >> (or any file inside the filesystem for that matter) > .. > > Looking at it now. Thanks, Christoph! .. Fails to work on 64-bit kernel w/ 32-bit userspace (no compat ioctl). Rebuilding with 32-bit kernel now.. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-16 2:19 ` Mark Lord @ 2009-08-16 2:25 ` Christoph Hellwig 2009-08-16 2:49 ` Mark Lord 2009-08-16 13:00 ` Mark Lord 0 siblings, 2 replies; 27+ messages in thread From: Christoph Hellwig @ 2009-08-16 2:25 UTC (permalink / raw) To: Mark Lord Cc: Christoph Hellwig, xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe On Sat, Aug 15, 2009 at 10:19:21PM -0400, Mark Lord wrote: > Mark Lord wrote: >> Christoph Hellwig wrote: > .. >>> Mark, any chance to try it? Just create an XFS filesystem, age it a >>> bit and then call the attached little trim.c program on the mountmoint >>> (or any file inside the filesystem for that matter) >> .. >> >> Looking at it now. Thanks, Christoph! > .. > > Fails to work on 64-bit kernel w/ 32-bit userspace (no compat ioctl). > Rebuilding with 32-bit kernel now.. The actual ioctl is compatible, just add the case XFS_IOC_TRIM: return xfs_ioc_trim(mp, arg); to xfs_file_compat_ioctl(). I'll add this to the next spin of the patch. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-16 2:25 ` Christoph Hellwig @ 2009-08-16 2:49 ` Mark Lord 2009-08-16 3:25 ` Mark Lord 2009-08-16 13:00 ` Mark Lord 1 sibling, 1 reply; 27+ messages in thread From: Mark Lord @ 2009-08-16 2:49 UTC (permalink / raw) To: Christoph Hellwig Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe Christoph Hellwig wrote: > On Sat, Aug 15, 2009 at 10:19:21PM -0400, Mark Lord wrote: >> Mark Lord wrote: >>> Christoph Hellwig wrote: >> .. >>>> Mark, any chance to try it? Just create an XFS filesystem, age it a >>>> bit and then call the attached little trim.c program on the mountmoint >>>> (or any file inside the filesystem for that matter) >>> .. >>> >>> Looking at it now. Thanks, Christoph! >> .. >> >> Fails to work on 64-bit kernel w/ 32-bit userspace (no compat ioctl). >> Rebuilding with 32-bit kernel now.. > > The actual ioctl is compatible, just add the > > case XFS_IOC_TRIM: > return xfs_ioc_trim(mp, arg); > > to xfs_file_compat_ioctl(). I'll add this to the next spin of the patch. .. Peachy. Rebuilding again now.. Don't wait up for this .. I'll finish it on Sunday. Bedtime. :) ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-16 2:49 ` Mark Lord @ 2009-08-16 3:25 ` Mark Lord 0 siblings, 0 replies; 27+ messages in thread From: Mark Lord @ 2009-08-16 3:25 UTC (permalink / raw) To: Christoph Hellwig Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe Mark Lord wrote: > Christoph Hellwig wrote: >> On Sat, Aug 15, 2009 at 10:19:21PM -0400, Mark Lord wrote: >>> Mark Lord wrote: >>>> Christoph Hellwig wrote: >>> .. >>>>> Mark, any chance to try it? Just create an XFS filesystem, age it a >>>>> bit and then call the attached little trim.c program on the mountmoint >>>>> (or any file inside the filesystem for that matter) >>>> .. >>>> >>>> Looking at it now. Thanks, Christoph! >>> .. >>> >>> Fails to work on 64-bit kernel w/ 32-bit userspace (no compat ioctl). >>> Rebuilding with 32-bit kernel now.. >> >> The actual ioctl is compatible, just add the >> >> case XFS_IOC_TRIM: >> return xfs_ioc_trim(mp, arg); >> >> to xfs_file_compat_ioctl(). I'll add this to the next spin of the patch. > .. > > Peachy. Rebuilding again now.. > > Don't wait up for this .. I'll finish it on Sunday. Bedtime. :) .. Mmm.. doesn't work on stock 2.6.31-rc6. It must also require Matthew's patches for discard/trim support. Later. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-16 2:25 ` Christoph Hellwig 2009-08-16 2:49 ` Mark Lord @ 2009-08-16 13:00 ` Mark Lord 2009-08-16 13:53 ` Christoph Hellwig 2009-08-16 13:59 ` Mark Lord 1 sibling, 2 replies; 27+ messages in thread From: Mark Lord @ 2009-08-16 13:00 UTC (permalink / raw) To: Christoph Hellwig Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe Christoph Hellwig wrote: > On Sat, Aug 15, 2009 at 10:19:21PM -0400, Mark Lord wrote: >> Mark Lord wrote: >>> Christoph Hellwig wrote: >> .. >>>> Mark, any chance to try it? Just create an XFS filesystem, age it a >>>> bit and then call the attached little trim.c program on the mountmoint >>>> (or any file inside the filesystem for that matter) >>> .. >>> >>> Looking at it now. Thanks, Christoph! >> .. >> >> Fails to work on 64-bit kernel w/ 32-bit userspace (no compat ioctl). >> Rebuilding with 32-bit kernel now.. > > The actual ioctl is compatible, just add the > > case XFS_IOC_TRIM: > return xfs_ioc_trim(mp, arg); > > to xfs_file_compat_ioctl(). I'll add this to the next spin of the patch. .. Okay, this gives me ENOSYS now --> discard/trim support is missing from the lower layers. What other patches do I need to make this work? The latest from Matthew's discard tree (May 2009) don't appear to be sufficient, even after updating them for 2.6.31-rc6. ??? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-16 13:00 ` Mark Lord @ 2009-08-16 13:53 ` Christoph Hellwig 2009-08-16 13:59 ` Mark Lord 1 sibling, 0 replies; 27+ messages in thread From: Christoph Hellwig @ 2009-08-16 13:53 UTC (permalink / raw) To: Mark Lord Cc: Christoph Hellwig, linux-fsdevel, jens.axboe, linux-kernel, linux-scsi, xfs On Sun, Aug 16, 2009 at 09:00:35AM -0400, Mark Lord wrote: > Okay, this gives me ENOSYS now --> discard/trim support is missing from > the lower layers. ENOSYS or EOPNOTSUPP? > What other patches do I need to make this work? > > The latest from Matthew's discard tree (May 2009) don't appear to be sufficient, > even after updating them for 2.6.31-rc6. Hmm. That was kinda the reason why I didn't want to post stuff yet. In current mainline only MTD implements the prepare_discard_fn, and we'll need it implemented and working. I think Matthews patches should be sufficient, unless they have some ATA revision check to not issue the TRIM for your device. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-16 13:00 ` Mark Lord 2009-08-16 13:53 ` Christoph Hellwig @ 2009-08-16 13:59 ` Mark Lord 2009-08-16 14:06 ` Mark Lord 2009-08-16 14:23 ` Christoph Hellwig 1 sibling, 2 replies; 27+ messages in thread From: Mark Lord @ 2009-08-16 13:59 UTC (permalink / raw) To: Christoph Hellwig Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe, IDE/ATA development list [-- Attachment #1: Type: text/plain, Size: 2824 bytes --] Mark Lord wrote: > Christoph Hellwig wrote: >> On Sat, Aug 15, 2009 at 10:19:21PM -0400, Mark Lord wrote: >>> Mark Lord wrote: >>>> Christoph Hellwig wrote: >>> .. >>>>> Mark, any chance to try it? Just create an XFS filesystem, age it a >>>>> bit and then call the attached little trim.c program on the mountmoint >>>>> (or any file inside the filesystem for that matter) >>>> .. >>>> >>>> Looking at it now. Thanks, Christoph! >>> .. >>> >>> Fails to work on 64-bit kernel w/ 32-bit userspace (no compat ioctl). >>> Rebuilding with 32-bit kernel now.. >> >> The actual ioctl is compatible, just add the >> >> case XFS_IOC_TRIM: >> return xfs_ioc_trim(mp, arg); >> >> to xfs_file_compat_ioctl(). I'll add this to the next spin of the patch. > .. > > Okay, this gives me ENOSYS now --> discard/trim support is missing from > the lower layers. > > What other patches do I need to make this work? > > The latest from Matthew's discard tree (May 2009) don't appear to be sufficient, > even after updating them for 2.6.31-rc6. .. Okay, I got Matthews patches updated onto 2.6.31, and fixed the incompatibilities between those and the XFS TRIM patch (from Christoph), plus a sector_t printk issue. My apologies for attachments, but I am attaching the updated Christoph patch, as well as my hacked-up forward-port of Matthew's patches. Not pretty, but they work. :) Now.. running Christoph's "xfs trim" on a 4.6GB mostly already-trimmed XFS partition gave this for the first time around: [ 25.961891] Filesystem "sdb3": discarding sectors [0xc558-0x102328] [ 27.814553] Filesystem "sdb3": discarding sectors [0x10ea78-0x10e688] [ 29.771218] Filesystem "sdb3": discarding sectors [0x21d120-0x10e860] [ 31.726444] Filesystem "sdb3": discarding sectors [0x32b9a0-0x10e860] [ 33.679023] Filesystem "sdb3": discarding sectors [0x43f220-0x109860] [ 35.629948] Filesystem "sdb3": discarding sectors [0x548aa0-0x10e860] [ 37.583142] Filesystem "sdb3": discarding sectors [0x657320-0x10e860] [ 39.531822] Filesystem "sdb3": discarding sectors [0x765ba0-0x10e860] Slow, but presumably thorough. Subsequent runs were equally slow. The problem is, it still issues TRIMs to the LLD one extent at a time. Compare this with doing it all in a single TRIM command with the wiper.sh script (filesystem unmounted): [~] time wiper.sh /dev/sdb3 --commit wiper.sh: Linux SATA SSD TRIM utility, version 1.9b, by Mark Lord. Preparing for offline TRIM of free space on /dev/sdb3 (xfs non-mounted). This operation could destroy your data. Are you sure (y/N)? y Syncing disks.. Beginning TRIM operations.. Trimming 168 free extents encompassing 8793136 sectors (4294 MB) Done. real 0m1.249s user 0m0.110s sys 0m0.063s That includes the time for me to type 'y' and hit enter. :) Cheers [-- Attachment #2: 51_add_trim_support.patch --] [-- Type: text/x-diff, Size: 11019 bytes --] diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/block/blk-barrier.c linux/block/blk-barrier.c --- linux-2.6.31-rc6/block/blk-barrier.c 2009-08-16 09:16:36.303766940 -0400 +++ linux/block/blk-barrier.c 2009-08-16 09:19:07.287086209 -0400 @@ -348,30 +348,22 @@ clear_bit(BIO_UPTODATE, &bio->bi_flags); } + if (bio_has_data(bio)) + __free_page(bio_page(bio)); + + if (bio->bi_private) + complete(bio->bi_private); + bio_put(bio); } -/** - * blkdev_issue_discard - queue a discard - * @bdev: blockdev to issue discard for - * @sector: start sector - * @nr_sects: number of sectors to discard - * @gfp_mask: memory allocation flags (for bio_alloc) - * - * Description: - * Issue a discard request for the sectors in question. Does not wait. - */ -int blkdev_issue_discard(struct block_device *bdev, - sector_t sector, sector_t nr_sects, gfp_t gfp_mask) +int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, + sector_t nr_sects, gfp_t gfp_mask, + unsigned type, struct completion *completion) { - struct request_queue *q; - struct bio *bio; int ret = 0; + struct request_queue *q = bdev_get_queue(bdev); - if (bdev->bd_disk == NULL) - return -ENXIO; - - q = bdev_get_queue(bdev); if (!q) return -ENXIO; @@ -379,12 +371,13 @@ return -EOPNOTSUPP; while (nr_sects && !ret) { - bio = bio_alloc(gfp_mask, 0); + struct bio *bio = bio_alloc(gfp_mask, 1); if (!bio) return -ENOMEM; bio->bi_end_io = blkdev_discard_end_io; bio->bi_bdev = bdev; + bio->bi_private = completion; bio->bi_sector = sector; @@ -396,10 +389,13 @@ bio->bi_size = nr_sects << 9; nr_sects = 0; } + bio_get(bio); - submit_bio(DISCARD_BARRIER, bio); + submit_bio(type, bio); + + if (completion) + wait_for_completion(completion); - /* Check if it failed immediately */ if (bio_flagged(bio, BIO_EOPNOTSUPP)) ret = -EOPNOTSUPP; else if (!bio_flagged(bio, BIO_UPTODATE)) @@ -408,4 +404,24 @@ } return ret; } + +/** + * blkdev_issue_discard - queue a discard + * @bdev: blockdev to issue discard for + * @sector: start sector + * @nr_sects: number of sectors to discard + * @gfp_mask: memory allocation flags (for bio_alloc) + * + * Description: + * Issue a discard request for the sectors in question. Does not wait. + */ +int blkdev_issue_discard(struct block_device *bdev, + sector_t sector, sector_t nr_sects, gfp_t gfp_mask) +{ + if (bdev->bd_disk == NULL) + return -ENXIO; + + return __blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, + DISCARD_BARRIER, NULL); +} EXPORT_SYMBOL(blkdev_issue_discard); diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/block/blk-core.c linux/block/blk-core.c --- linux-2.6.31-rc6/block/blk-core.c 2009-08-16 09:16:36.307099905 -0400 +++ linux/block/blk-core.c 2009-08-16 08:53:19.000000000 -0400 @@ -1107,6 +1107,8 @@ void init_request_from_bio(struct request *req, struct bio *bio) { + might_sleep(); + req->cpu = bio->bi_comp_cpu; req->cmd_type = REQ_TYPE_FS; @@ -1127,7 +1129,7 @@ req->cmd_flags |= REQ_DISCARD; if (bio_barrier(bio)) req->cmd_flags |= REQ_SOFTBARRIER; - req->q->prepare_discard_fn(req->q, req); + req->q->prepare_discard_fn(req->q, req, bio); } else if (unlikely(bio_barrier(bio))) req->cmd_flags |= REQ_HARDBARRIER; diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/block/blk.h linux/block/blk.h --- linux-2.6.31-rc6/block/blk.h 2009-08-16 09:16:36.310433289 -0400 +++ linux/block/blk.h 2009-08-16 08:53:19.000000000 -0400 @@ -17,6 +17,10 @@ struct bio *bio); void blk_dequeue_request(struct request *rq); void __blk_queue_free_tags(struct request_queue *q); +int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, + sector_t nr_sects, gfp_t gfp_mask, + unsigned type, struct completion *completion); + void blk_unplug_work(struct work_struct *work); void blk_unplug_timeout(unsigned long data); diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/block/ioctl.c linux/block/ioctl.c --- linux-2.6.31-rc6/block/ioctl.c 2009-08-16 09:16:36.313766813 -0400 +++ linux/block/ioctl.c 2009-08-16 08:53:19.000000000 -0400 @@ -7,6 +7,7 @@ #include <linux/smp_lock.h> #include <linux/blktrace_api.h> #include <asm/uaccess.h> +#include "blk.h" static int blkpg_ioctl(struct block_device *bdev, struct blkpg_ioctl_arg __user *arg) { @@ -112,21 +113,10 @@ return res; } -static void blk_ioc_discard_endio(struct bio *bio, int err) -{ - if (err) { - if (err == -EOPNOTSUPP) - set_bit(BIO_EOPNOTSUPP, &bio->bi_flags); - clear_bit(BIO_UPTODATE, &bio->bi_flags); - } - complete(bio->bi_private); -} - static int blk_ioctl_discard(struct block_device *bdev, uint64_t start, uint64_t len) { - struct request_queue *q = bdev_get_queue(bdev); - int ret = 0; + DECLARE_COMPLETION_ONSTACK(wait); if (start & 511) return -EINVAL; @@ -138,39 +128,8 @@ if (start + len > (bdev->bd_inode->i_size >> 9)) return -EINVAL; - if (!q->prepare_discard_fn) - return -EOPNOTSUPP; - - while (len && !ret) { - DECLARE_COMPLETION_ONSTACK(wait); - struct bio *bio; - - bio = bio_alloc(GFP_KERNEL, 0); - - bio->bi_end_io = blk_ioc_discard_endio; - bio->bi_bdev = bdev; - bio->bi_private = &wait; - bio->bi_sector = start; - - if (len > queue_max_hw_sectors(q)) { - bio->bi_size = queue_max_hw_sectors(q) << 9; - len -= queue_max_hw_sectors(q); - start += queue_max_hw_sectors(q); - } else { - bio->bi_size = len << 9; - len = 0; - } - submit_bio(DISCARD_NOBARRIER, bio); - - wait_for_completion(&wait); - - if (bio_flagged(bio, BIO_EOPNOTSUPP)) - ret = -EOPNOTSUPP; - else if (!bio_flagged(bio, BIO_UPTODATE)) - ret = -EIO; - bio_put(bio); - } - return ret; + return __blkdev_issue_discard(bdev, start, len, GFP_KERNEL, + DISCARD_NOBARRIER, &wait); } static int put_ushort(unsigned long arg, unsigned short val) diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/drivers/ata/libata-scsi.c linux/drivers/ata/libata-scsi.c --- linux-2.6.31-rc6/drivers/ata/libata-scsi.c 2009-08-16 09:16:36.350433414 -0400 +++ linux/drivers/ata/libata-scsi.c 2009-08-16 08:53:19.000000000 -0400 @@ -1051,6 +1051,46 @@ desc[11] = block; } +static int ata_discard_fn(struct request_queue *q, struct request *req, + struct bio *bio) +{ + unsigned size; + struct page *page = alloc_page(GFP_KERNEL); + if (!page) + goto error; + + size = ata_set_lba_range_entries(page_address(page), PAGE_SIZE / 8, + bio->bi_sector, bio_sectors(bio)); + bio->bi_size = 0; + if (bio_add_pc_page(q, bio, page, size, 0) < size) + goto free_page; + + req->cmd_type = REQ_TYPE_BLOCK_PC; + req->cmd_len = 16; + req->cmd[0] = ATA_16; + req->cmd[1] = (6 << 1) | 1; /* dma, 48-bit */ + req->cmd[2] = 0x6; /* length, direction */ + req->cmd[3] = 0; /* feature high */ + req->cmd[4] = ATA_DSM_TRIM; /* feature low */ + req->cmd[5] = (size / 512) >> 8; /* nsect high */ + req->cmd[6] = size / 512; /* nsect low */ + req->cmd[7] = 0; /* lba */ + req->cmd[8] = 0; /* lba */ + req->cmd[9] = 0; /* lba */ + req->cmd[10] = 0; /* lba */ + req->cmd[11] = 0; /* lba */ + req->cmd[12] = 0; /* lba */ + req->cmd[13] = ATA_LBA; /* device */ + req->cmd[14] = ATA_CMD_DSM; /* command */ + req->cmd[15] = 0; /* control */ + + return 0; + free_page: + __free_page(page); + error: + return -ENOMEM; +} + static void ata_scsi_sdev_config(struct scsi_device *sdev) { sdev->use_10_for_rw = 1; @@ -1099,6 +1139,9 @@ /* configure max sectors */ blk_queue_max_sectors(sdev->request_queue, dev->max_sectors); + if (ata_id_has_trim(dev->id)) + blk_queue_set_discard(sdev->request_queue, ata_discard_fn); + if (dev->class == ATA_DEV_ATAPI) { struct request_queue *q = sdev->request_queue; void *buf; @@ -1747,6 +1790,12 @@ * whether the command completed successfully or not. If there * was no error, SK, ASC and ASCQ will all be zero. */ + + if (need_sense && qc->tf.command == ATA_CMD_DSM) { + ata_port_printk(ap, KERN_ERR, "%s: DISCARD/TRIM failed: disabling it\n", __func__); + blk_queue_set_discard(qc->dev->sdev->request_queue, NULL); + } + if (((cdb[0] == ATA_16) || (cdb[0] == ATA_12)) && ((cdb[2] & 0x20) || need_sense)) { ata_gen_passthru_sense(qc); diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/drivers/mtd/mtd_blkdevs.c linux/drivers/mtd/mtd_blkdevs.c --- linux-2.6.31-rc6/drivers/mtd/mtd_blkdevs.c 2009-08-16 09:16:36.963766818 -0400 +++ linux/drivers/mtd/mtd_blkdevs.c 2009-08-16 08:53:19.000000000 -0400 @@ -33,7 +33,7 @@ }; static int blktrans_discard_request(struct request_queue *q, - struct request *req) + struct request *req, struct bio *bio) { req->cmd_type = REQ_TYPE_LINUX_BLOCK; req->cmd[0] = REQ_LB_OP_DISCARD; diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/include/linux/blkdev.h linux/include/linux/blkdev.h --- linux-2.6.31-rc6/include/linux/blkdev.h 2009-08-16 09:16:39.053766322 -0400 +++ linux/include/linux/blkdev.h 2009-08-16 08:53:19.000000000 -0400 @@ -255,7 +255,8 @@ typedef int (make_request_fn) (struct request_queue *q, struct bio *bio); typedef int (prep_rq_fn) (struct request_queue *, struct request *); typedef void (unplug_fn) (struct request_queue *); -typedef int (prepare_discard_fn) (struct request_queue *, struct request *); +typedef int (prepare_discard_fn) (struct request_queue *, struct request *, + struct bio *bio); struct bio_vec; struct bvec_merge_data { diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/include/linux/fs.h linux/include/linux/fs.h --- linux-2.6.31-rc6/include/linux/fs.h 2009-08-16 09:16:39.070433246 -0400 +++ linux/include/linux/fs.h 2009-08-16 08:53:19.000000000 -0400 @@ -161,8 +161,8 @@ * These aren't really reads or writes, they pass down information about * parts of device that are now unused by the file system. */ -#define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD) -#define DISCARD_BARRIER ((1 << BIO_RW_DISCARD) | (1 << BIO_RW_BARRIER)) +#define DISCARD_NOBARRIER (WRITE | (1 << BIO_RW_DISCARD)) +#define DISCARD_BARRIER (DISCARD_NOBARRIER | (1 << BIO_RW_BARRIER)) #define SEL_IN 1 #define SEL_OUT 2 [-- Attachment #3: 52_christoph_xfs_trim.patch --] [-- Type: text/x-diff, Size: 6760 bytes --] diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/block/blk-barrier.c linux/block/blk-barrier.c --- linux-2.6.31-rc6/block/blk-barrier.c 2009-08-16 09:36:36.431146680 -0400 +++ linux/block/blk-barrier.c 2009-08-16 09:20:15.164578531 -0400 @@ -425,3 +425,4 @@ DISCARD_BARRIER, NULL); } EXPORT_SYMBOL(blkdev_issue_discard); +EXPORT_SYMBOL(__blkdev_issue_discard); diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/fs/xfs/linux-2.6/xfs_ioctl.c linux/fs/xfs/linux-2.6/xfs_ioctl.c --- linux-2.6.31-rc6/fs/xfs/linux-2.6/xfs_ioctl.c 2009-08-16 09:16:39.000433070 -0400 +++ linux/fs/xfs/linux-2.6/xfs_ioctl.c 2009-08-16 09:30:38.973683042 -0400 @@ -1274,6 +1274,31 @@ return 0; } +int +xfs_ioc_trim( + struct xfs_mount *mp, + __uint32_t *argp) +{ + xfs_agnumber_t agno; + int error = 0; + __uint32_t minlen; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + if (get_user(minlen, argp)) + return -EFAULT; + + down_read(&mp->m_peraglock); + for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { + error = -xfs_trim_extents(mp, agno, minlen); + if (error) + break; + } + up_read(&mp->m_peraglock); + + return error; +} + /* * Note: some of the ioctl's return positive numbers as a * byte count indicating success, such as readlink_by_handle. @@ -1523,6 +1548,9 @@ error = xfs_errortag_clearall(mp, 1); return -error; + case XFS_IOC_TRIM: + return xfs_ioc_trim(mp, arg); + default: return -ENOTTY; } diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/fs/xfs/linux-2.6/xfs_ioctl32.c linux/fs/xfs/linux-2.6/xfs_ioctl32.c --- linux-2.6.31-rc6/fs/xfs/linux-2.6/xfs_ioctl32.c 2009-06-09 23:05:27.000000000 -0400 +++ linux/fs/xfs/linux-2.6/xfs_ioctl32.c 2009-08-16 09:31:21.005588977 -0400 @@ -539,6 +539,7 @@ void __user *arg = (void __user *)p; int ioflags = 0; int error; + extern int xfs_ioc_trim(struct xfs_mount *mp, __uint32_t *argp); if (filp->f_mode & FMODE_NOCMTIME) ioflags |= IO_INVIS; @@ -564,6 +565,8 @@ case XFS_IOC_ERROR_INJECTION: case XFS_IOC_ERROR_CLEARALL: return xfs_file_ioctl(filp, cmd, p); + case XFS_IOC_TRIM: + return xfs_ioc_trim(mp, arg); #ifndef BROKEN_X86_ALIGNMENT /* These are handled fine if no alignment issues */ case XFS_IOC_ALLOCSP: diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/fs/xfs/xfs_alloc.h linux/fs/xfs/xfs_alloc.h --- linux-2.6.31-rc6/fs/xfs/xfs_alloc.h 2009-06-09 23:05:27.000000000 -0400 +++ linux/fs/xfs/xfs_alloc.h 2009-08-16 09:20:15.167913313 -0400 @@ -215,4 +215,7 @@ xfs_fsblock_t bno, /* starting block number of extent */ xfs_extlen_t len); /* length of extent */ +int xfs_trim_extents(struct xfs_mount *mp, xfs_agnumber_t agno, + xfs_extlen_t minlen); + #endif /* __XFS_ALLOC_H__ */ diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/fs/xfs/xfs_fs.h linux/fs/xfs/xfs_fs.h --- linux-2.6.31-rc6/fs/xfs/xfs_fs.h 2009-08-16 09:16:39.017099926 -0400 +++ linux/fs/xfs/xfs_fs.h 2009-08-16 09:20:15.171246419 -0400 @@ -475,6 +475,7 @@ #define XFS_IOC_ATTRMULTI_BY_HANDLE _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq) #define XFS_IOC_FSGEOMETRY _IOR ('X', 124, struct xfs_fsop_geom) #define XFS_IOC_GOINGDOWN _IOR ('X', 125, __uint32_t) +#define XFS_IOC_TRIM _IOR ('X', 126, __uint32_t) /* XFS_IOC_GETFSUUID ---------- deprecated 140 */ --- linux-2.6.31-rc6/fs/xfs/xfs_alloc.c 2009-06-09 23:05:27.000000000 -0400 +++ linux/fs/xfs/xfs_alloc.c 2009-08-16 09:44:51.073580438 -0400 @@ -39,6 +39,9 @@ #include "xfs_alloc.h" #include "xfs_error.h" +int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, + sector_t nr_sects, gfp_t gfp_mask, + unsigned type, struct completion *completion); #define XFS_ABSDIFF(a,b) (((a) <= (b)) ? ((b) - (a)) : ((a) - (b))) @@ -2609,6 +2612,97 @@ return error; } +STATIC int +xfs_trim_extent( + struct xfs_mount *mp, + xfs_agnumber_t agno, + xfs_agblock_t fbno, + xfs_extlen_t flen) +{ + xfs_daddr_t blkno = XFS_AGB_TO_DADDR(mp, agno, fbno); + sector_t nblks = XFS_FSB_TO_BB(mp, flen); + int error; + DECLARE_COMPLETION_ONSTACK(done); + + xfs_fs_cmn_err(CE_NOTE, mp, "discarding sectors [0x%llx-0x%llx]", + blkno, (u64)nblks); + + error = -__blkdev_issue_discard(mp->m_ddev_targp->bt_bdev, + blkno, nblks, GFP_NOFS, DISCARD_BARRIER, &done); + if (error && error != EOPNOTSUPP) + xfs_fs_cmn_err(CE_NOTE, mp, "discard failed, error %d", error); + return error; +} + +/* + * Notify the underlying block device about our free extent map. + * + * This walks all free extents above a minimum threshold and notifies the + * underlying device that these blocks are unused. That information is + * useful for SSDs or thinly provisioned storage in high end arrays or + * virtualization scenarios. + */ +int +xfs_trim_extents( + struct xfs_mount *mp, + xfs_agnumber_t agno, + xfs_extlen_t minlen) /* minimum extent size to bother */ +{ + struct xfs_btree_cur *cur; /* cursor for the by-block btree */ + struct xfs_buf *agbp; /* AGF buffer pointer */ + xfs_agblock_t bno; /* block the for next search */ + xfs_agblock_t fbno; /* start block of found extent */ + xfs_extlen_t flen; /* length of found extent */ + int error; + int i; + + error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp); + if (error) + return error; + + bno = 0; + for (;;) { + cur = xfs_allocbt_init_cursor(mp, NULL, agbp, agno, + XFS_BTNUM_BNO); + + error = xfs_alloc_lookup_ge(cur, bno, minlen, &i); + if (error) + goto error0; + if (!i) { + /* + * No more free extents found: done. + */ + xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR); + break; + } + + error = xfs_alloc_get_rec(cur, &fbno, &flen, &i); + if (error) + goto error0; + XFS_WANT_CORRUPTED_GOTO(i == 1, error0); + + /* + * Pass if the freespace extent isn't long enough to bother. + */ + if (flen >= minlen) { + error = xfs_trim_extent(mp, agno, fbno, flen); + if (error) { + xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR); + break; + } + } + + xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR); + bno = fbno + flen; + } + +out: + xfs_buf_relse(agbp); + return error; +error0: + xfs_btree_del_cursor(cur, XFS_BTREE_ERROR); + goto out; +} /* * AG Busy list management ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-16 13:59 ` Mark Lord @ 2009-08-16 14:06 ` Mark Lord 2009-08-16 14:23 ` Christoph Hellwig 1 sibling, 0 replies; 27+ messages in thread From: Mark Lord @ 2009-08-16 14:06 UTC (permalink / raw) To: Christoph Hellwig Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe, IDE/ATA development list Mark Lord wrote: .. > Slow, but presumably thorough. > Subsequent runs were equally slow. > > The problem is, it still issues TRIMs to the LLD one extent at a time. > Compare this with doing it all in a single TRIM command > with the wiper.sh script (filesystem unmounted): > > [~] time wiper.sh /dev/sdb3 --commit > > wiper.sh: Linux SATA SSD TRIM utility, version 1.9b, by Mark Lord. > Preparing for offline TRIM of free space on /dev/sdb3 (xfs > non-mounted). > This operation could destroy your data. Are you sure (y/N)? y > Syncing disks.. > Beginning TRIM operations.. > Trimming 168 free extents encompassing 8793136 sectors (4294 MB) > Done. > > real 0m1.249s > user 0m0.110s > sys 0m0.063s > > That includes the time for me to type 'y' and hit enter. :) .. For completeness, here's the same operation again, except this time on the *mounted* xfs filesystem. It won't be trimming quite as many blocks (leaves 1% free space in reserve), but otherwise is similar: [~] time wiper.sh /dev/sdb3 --commit wiper.sh: Linux SATA SSD TRIM utility, version 1.9b, by Mark Lord. Preparing for online TRIM of free space on /dev/sdb3 (xfs mounted read-write at /x). This operation could destroy your data. Are you sure (y/N)? y Creating temporary file (4348405 KB).. Syncing disks.. Beginning TRIM operations.. Trimming 134 free extents encompassing 8696816 sectors (4246 MB) Removing temporary file.. Syncing disks.. Done. real 0m1.212s user 0m0.043s sys 0m0.053s ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-16 13:59 ` Mark Lord 2009-08-16 14:06 ` Mark Lord @ 2009-08-16 14:23 ` Christoph Hellwig 2009-08-16 14:26 ` Mark Lord 1 sibling, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2009-08-16 14:23 UTC (permalink / raw) To: Mark Lord Cc: Christoph Hellwig, xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe, IDE/ATA development list On Sun, Aug 16, 2009 at 09:59:32AM -0400, Mark Lord wrote: > Okay, I got Matthews patches updated onto 2.6.31, and fixed the incompatibilities > between those and the XFS TRIM patch (from Christoph), plus a sector_t printk issue. > > My apologies for attachments, but I am attaching the updated Christoph patch, > as well as my hacked-up forward-port of Matthew's patches. > > Not pretty, but they work. :) > > Now.. running Christoph's "xfs trim" on a 4.6GB mostly already-trimmed > XFS partition gave this for the first time around: > The problem is, it still issues TRIMs to the LLD one extent at a time. > Compare this with doing it all in a single TRIM command > with the wiper.sh script (filesystem unmounted): I could do a variant which issues a single TRIM, but that would require us to lock out all other allocations for the time the trim takes. I'll hack that up once I get some time. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-16 14:23 ` Christoph Hellwig @ 2009-08-16 14:26 ` Mark Lord 0 siblings, 0 replies; 27+ messages in thread From: Mark Lord @ 2009-08-16 14:26 UTC (permalink / raw) To: Christoph Hellwig Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe, IDE/ATA development list Christoph Hellwig wrote: > On Sun, Aug 16, 2009 at 09:59:32AM -0400, Mark Lord wrote: >> Okay, I got Matthews patches updated onto 2.6.31, and fixed the incompatibilities >> between those and the XFS TRIM patch (from Christoph), plus a sector_t printk issue. >> >> My apologies for attachments, but I am attaching the updated Christoph patch, >> as well as my hacked-up forward-port of Matthew's patches. >> >> Not pretty, but they work. :) >> >> Now.. running Christoph's "xfs trim" on a 4.6GB mostly already-trimmed >> XFS partition gave this for the first time around: > >> The problem is, it still issues TRIMs to the LLD one extent at a time. >> Compare this with doing it all in a single TRIM command >> with the wiper.sh script (filesystem unmounted): > > I could do a variant which issues a single TRIM, but that would require > us to lock out all other allocations for the time the trim takes. I'll > hack that up once I get some time. .. Matthew's stuff will have to change to support that, too. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-16 0:47 [PATCH, RFC] xfs: batched discard support Christoph Hellwig 2009-08-16 1:35 ` Mark Lord @ 2009-08-19 20:39 ` Ingo Molnar 2009-08-20 1:05 ` Christoph Hellwig 2009-08-20 1:39 ` Mark Lord 1 sibling, 2 replies; 27+ messages in thread From: Ingo Molnar @ 2009-08-19 20:39 UTC (permalink / raw) To: Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, liml, jens.axboe * Christoph Hellwig <hch@infradead.org> wrote: > Given that everyone is so big in the discard discussion > I'd like to present what I had started to prepare for > XFS. I didn't plan to send it out until I get my hands > onto a TRIM capable device (or at least get time to add > support to qemu), and so far it's only been tested in > dry-run mode. > > The basic idea is to add an ioctl which walks the free > space btrees in each allocation group and simply > discard everythin that is free. [...] A general interface design question: you added a new ioctl XFS_IOC_TRIM case. It's a sub-case of an ugly-looking demultiplexing xfs_file_ioctl(). What is your threshold for turning something into a syscall? When are ioctls acceptable in your opinion? I'm asking this because we are facing a similar problem with perfcounters: we need to extend the ioctl functionality there but introducing a new syscall looks wasteful. So i'm torn about the 'syscall versus ioctl' issue, i'd like to avoid making interface design mistakes and i'd like to solicit some opinions about this. I've attached the perfcounters ioctl patch below. Thanks, Ingo ----- Forwarded message from Peter Zijlstra <a.p.zijlstra@chello.nl> ----- Date: Wed, 19 Aug 2009 11:18:27 +0200 From: Peter Zijlstra <a.p.zijlstra@chello.nl> To: Ingo Molnar <mingo@elte.hu>, Paul Mackerras <paulus@samba.org> Subject: [PATCH 4/4][RFC] perf_counter: Allow sharing of output channels Cc: Arnaldo Carvalho de Melo <acme@redhat.com>, Frederic Weisbecker <fweisbec@gmail.com>, Mike Galbraith <efault@gmx.de>, linux-kernel@vger.kernel.org, stephane eranian <eranian@googlemail.com>, Peter Zijlstra <a.p.zijlstra@chello.nl> Provide the ability to configure a counter to send its output to another (already existing) counter's output stream. [ compile tested only ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stephane eranian <eranian@googlemail.com> --- include/linux/perf_counter.h | 5 ++ kernel/perf_counter.c | 83 +++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 85 insertions(+), 3 deletions(-) Index: linux-2.6/include/linux/perf_counter.h =================================================================== --- linux-2.6.orig/include/linux/perf_counter.h +++ linux-2.6/include/linux/perf_counter.h @@ -216,6 +216,7 @@ struct perf_counter_attr { #define PERF_COUNTER_IOC_REFRESH _IO ('$', 2) #define PERF_COUNTER_IOC_RESET _IO ('$', 3) #define PERF_COUNTER_IOC_PERIOD _IOW('$', 4, u64) +#define PERF_COUNTER_IOC_SET_OUTPUT _IO ('$', 5) enum perf_counter_ioc_flags { PERF_IOC_FLAG_GROUP = 1U << 0, @@ -415,6 +416,9 @@ enum perf_callchain_context { PERF_CONTEXT_MAX = (__u64)-4095, }; +#define PERF_FLAG_FD_NO_GROUP (1U << 0) +#define PERF_FLAG_FD_OUTPUT (1U << 1) + #ifdef __KERNEL__ /* * Kernel-internal data types and definitions: @@ -536,6 +540,7 @@ struct perf_counter { struct list_head sibling_list; int nr_siblings; struct perf_counter *group_leader; + struct perf_counter *output; const struct pmu *pmu; enum perf_counter_active_state state; Index: linux-2.6/kernel/perf_counter.c =================================================================== --- linux-2.6.orig/kernel/perf_counter.c +++ linux-2.6/kernel/perf_counter.c @@ -1686,6 +1686,11 @@ static void free_counter(struct perf_cou atomic_dec(&nr_task_counters); } + if (counter->output) { + fput(counter->output->filp); + counter->output = NULL; + } + if (counter->destroy) counter->destroy(counter); @@ -1971,6 +1976,8 @@ unlock: return ret; } +int perf_counter_set_output(struct perf_counter *counter, int output_fd); + static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct perf_counter *counter = file->private_data; @@ -1994,6 +2001,9 @@ static long perf_ioctl(struct file *file case PERF_COUNTER_IOC_PERIOD: return perf_counter_period(counter, (u64 __user *)arg); + case PERF_COUNTER_IOC_SET_OUTPUT: + return perf_counter_set_output(counter, arg); + default: return -ENOTTY; } @@ -2264,6 +2274,11 @@ static int perf_mmap(struct file *file, WARN_ON_ONCE(counter->ctx->parent_ctx); mutex_lock(&counter->mmap_mutex); + if (counter->output) { + ret = -EINVAL; + goto unlock; + } + if (atomic_inc_not_zero(&counter->mmap_count)) { if (nr_pages != counter->data->nr_pages) ret = -EINVAL; @@ -2649,6 +2664,7 @@ static int perf_output_begin(struct perf struct perf_counter *counter, unsigned int size, int nmi, int sample) { + struct perf_counter *output_counter; struct perf_mmap_data *data; unsigned int offset, head; int have_lost; @@ -2658,13 +2674,17 @@ static int perf_output_begin(struct perf u64 lost; } lost_event; + rcu_read_lock(); /* * For inherited counters we send all the output towards the parent. */ if (counter->parent) counter = counter->parent; - rcu_read_lock(); + output_counter = rcu_dereference(counter->output); + if (output_counter) + counter = output_counter; + data = rcu_dereference(counter->data); if (!data) goto out; @@ -4214,6 +4234,57 @@ err_size: goto out; } +int perf_counter_set_output(struct perf_counter *counter, int output_fd) +{ + struct perf_counter *output_counter = NULL; + struct file *output_file = NULL; + struct perf_counter *old_output; + int fput_needed = 0; + int ret = -EINVAL; + + if (!output_fd) + goto set; + + output_file = fget_light(output_fd, &fput_needed); + if (!output_file) + return -EBADF; + + if (output_file->f_op != &perf_fops) + goto out; + + output_counter = output_file->private_data; + + /* Don't chain output fds */ + if (output_counter->output) + goto out; + + /* Don't set an output fd when we already have an output channel */ + if (counter->data) + goto out; + + atomic_long_inc(&output_file->f_count); + +set: + mutex_lock(&counter->mmap_mutex); + old_output = counter->output; + rcu_assign_pointer(counter->output, output_counter); + mutex_unlock(&counter->mmap_mutex); + + if (old_output) { + /* + * we need to make sure no existing perf_output_*() + * is still referencing this counter. + */ + synchronize_rcu(); + fput(old_output->filp); + } + + ret = 0; +out: + fput_light(output_file, fput_needed); + return ret; +} + /** * sys_perf_counter_open - open a performance counter, associate it to a task/cpu * @@ -4236,7 +4307,7 @@ SYSCALL_DEFINE5(perf_counter_open, int ret; /* for future expandability... */ - if (flags) + if (flags & ~(PERF_FLAG_FD_NO_GROUP | PERF_FLAG_FD_OUTPUT)) return -EINVAL; ret = perf_copy_attr(attr_uptr, &attr); @@ -4264,7 +4335,7 @@ SYSCALL_DEFINE5(perf_counter_open, * Look up the group leader (we will attach this counter to it): */ group_leader = NULL; - if (group_fd != -1) { + if (group_fd != -1 && !(flags & PERF_FLAG_FD_NO_GROUP)) { ret = -EINVAL; group_file = fget_light(group_fd, &fput_needed); if (!group_file) @@ -4306,6 +4377,12 @@ SYSCALL_DEFINE5(perf_counter_open, if (!counter_file) goto err_free_put_context; + if (flags & PERF_FLAG_FD_OUTPUT) { + ret = perf_counter_set_output(counter, group_fd); + if (ret) + goto err_free_put_context; + } + counter->filp = counter_file; WARN_ON_ONCE(ctx->parent_ctx); mutex_lock(&ctx->mutex); -- ----- End forwarded message ----- ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-19 20:39 ` Ingo Molnar @ 2009-08-20 1:05 ` Christoph Hellwig 2009-08-20 1:10 ` Jamie Lokier 2009-08-21 12:46 ` Ingo Molnar 2009-08-20 1:39 ` Mark Lord 1 sibling, 2 replies; 27+ messages in thread From: Christoph Hellwig @ 2009-08-20 1:05 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, liml, jens.axboe On Wed, Aug 19, 2009 at 10:39:16PM +0200, Ingo Molnar wrote: > A general interface design question: you added a new > ioctl XFS_IOC_TRIM case. It's a sub-case of an > ugly-looking demultiplexing xfs_file_ioctl(). ioctl is per defintion a multiplexer. > What is your threshold for turning something into a > syscall? When are ioctls acceptable in your opinion? > > I'm asking this because we are facing a similar problem > with perfcounters: we need to extend the ioctl > functionality there but introducing a new syscall looks > wasteful. > > So i'm torn about the 'syscall versus ioctl' issue, i'd > like to avoid making interface design mistakes and i'd > like to solicit some opinions about this. I've attached > the perfcounters ioctl patch below. Only add a syscall if it has _one_ clear defined purpose, which has kernel-wide meaning. Do not add an syscall that is just another multiplexer without structure. Most likely it will just be even worse than sys_ioctl. Also really don't bother adding a system call that is specific to one singler driver or filesystem. Besides horrible logistics - you'd need some always built-in stub calling out to the possibly modular drivers/filesystem - it also simply doesn't make any semantical sense. I can't say I like the ioctl use in perfcounters much, but adding a special syscalls instead would be even more horrible. As for the trim support this really just was an RFC to start bringing some code into play instead of the endless masturbation about hat code that doesn't exist happens on hardware most people don't have. The interface will most ceetainly change and I hope we will have a common interface for all filesystems (or at least those that care). ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-20 1:05 ` Christoph Hellwig @ 2009-08-20 1:10 ` Jamie Lokier 2009-08-20 1:38 ` Douglas Gilbert 2009-08-20 1:38 ` Mark Lord 2009-08-21 12:46 ` Ingo Molnar 1 sibling, 2 replies; 27+ messages in thread From: Jamie Lokier @ 2009-08-20 1:10 UTC (permalink / raw) To: Christoph Hellwig Cc: Ingo Molnar, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, liml, jens.axboe Christoph Hellwig wrote: > > So i'm torn about the 'syscall versus ioctl' issue, i'd > > like to avoid making interface design mistakes and i'd > > like to solicit some opinions about this. I've attached > > the perfcounters ioctl patch below. > > Only add a syscall if it has _one_ clear defined purpose, > which has kernel-wide meaning. One clear defined purpose which comes to mind is a "trim" or "punch" system call, for making holes in files as well as trimming block devices. Several other OSes have that capability on files. I don't remember - does TRIM guarantee the blocks read zeros afterwards? It would be tidy if it does, as it could have the same meaning with files. -- Jamie ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-20 1:10 ` Jamie Lokier @ 2009-08-20 1:38 ` Douglas Gilbert 2009-08-20 1:38 ` Mark Lord 1 sibling, 0 replies; 27+ messages in thread From: Douglas Gilbert @ 2009-08-20 1:38 UTC (permalink / raw) To: Jamie Lokier Cc: Christoph Hellwig, Ingo Molnar, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, liml, jens.axboe Jamie Lokier wrote: > Christoph Hellwig wrote: >>> So i'm torn about the 'syscall versus ioctl' issue, i'd >>> like to avoid making interface design mistakes and i'd >>> like to solicit some opinions about this. I've attached >>> the perfcounters ioctl patch below. >> Only add a syscall if it has _one_ clear defined purpose, >> which has kernel-wide meaning. > > One clear defined purpose which comes to mind is a "trim" or "punch" > system call, for making holes in files as well as trimming block > devices. Several other OSes have that capability on files. > > I don't remember - does TRIM guarantee the blocks read zeros afterwards? > > It would be tidy if it does, as it could have the same meaning with files. Both ATA and SCSI leave it up to the device to decide whether reads after a trim on a logical block are determinate (i.e. zeroes are returned) or indeterminate. In the case of ATA the IDENTIFY DEVICE data word 69 bit 14 gives the indication. In the case of SCSI the READ CAPACITY(16) command response TPRZ bit gives the indication. In both case a value of one indicates determinate. Doug Gilbert ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-20 1:10 ` Jamie Lokier 2009-08-20 1:38 ` Douglas Gilbert @ 2009-08-20 1:38 ` Mark Lord 1 sibling, 0 replies; 27+ messages in thread From: Mark Lord @ 2009-08-20 1:38 UTC (permalink / raw) To: Jamie Lokier Cc: Christoph Hellwig, Ingo Molnar, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe Jamie Lokier wrote: .. > I don't remember - does TRIM guarantee the blocks read zeros afterwards? .. No, it doesn't. A drive can optionally support "deterministic TRIM", whereby it will return consistent data for any given trimmed sector afterwards, but that doesn't mean zeros. -ml ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-20 1:05 ` Christoph Hellwig 2009-08-20 1:10 ` Jamie Lokier @ 2009-08-21 12:46 ` Ingo Molnar 1 sibling, 0 replies; 27+ messages in thread From: Ingo Molnar @ 2009-08-21 12:46 UTC (permalink / raw) To: Christoph Hellwig Cc: Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, liml, jens.axboe * Christoph Hellwig <hch@infradead.org> wrote: > On Wed, Aug 19, 2009 at 10:39:16PM +0200, Ingo Molnar wrote: > > A general interface design question: you added a new > > ioctl XFS_IOC_TRIM case. It's a sub-case of an > > ugly-looking demultiplexing xfs_file_ioctl(). > > ioctl is per defintion a multiplexer. Yes, and? There's two variants of multiplexing: - multiplex something rather straightforward and related - multiplex unrelated fields, data, structures I consider the second one 'ugly', the first one 'ok-ish'. YMMV. > > What is your threshold for turning something into a syscall? > > When are ioctls acceptable in your opinion? > > > > I'm asking this because we are facing a similar problem with > > perfcounters: we need to extend the ioctl functionality there > > but introducing a new syscall looks wasteful. > > > > So i'm torn about the 'syscall versus ioctl' issue, i'd like to > > avoid making interface design mistakes and i'd like to solicit > > some opinions about this. I've attached the perfcounters ioctl > > patch below. > > Only add a syscall if it has _one_ clear defined purpose, which > has kernel-wide meaning. > > Do not add an syscall that is just another multiplexer without > structure. Most likely it will just be even worse than sys_ioctl. > > Also really don't bother adding a system call that is specific to > one singler driver or filesystem. Besides horrible logistics - > you'd need some always built-in stub calling out to the possibly > modular drivers/filesystem - it also simply doesn't make any > semantical sense. I can't say I like the ioctl use in > perfcounters much, but adding a special syscalls instead would be > even more horrible. > > As for the trim support this really just was an RFC to start > bringing some code into play instead of the endless masturbation > about hat code that doesn't exist happens on hardware most people > don't have. The interface will most ceetainly change and I hope > we will have a common interface for all filesystems (or at least > those that care). Okay, i'm confused. I'd like to understand the technical basis of your critisism and i'd like to address any deficiencies of the perfcounters code. You said you dont like the ioctl solution we have, but that you'd like a separate syscall even less. Perfcounters are a kernel-wide concept, encompassing 100% of all Linux installations, not just some special hardware. So by your own standard above they seem to be more than eligible for system calls (i hope i'm not mis-stating it), as long as they are cleanly structured. Yet you dont like the interface nor any future pushing of the currently ioctl bits of the interface into syscalls. Is there any other interface form you'd like more? Thanks, Ingo ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-19 20:39 ` Ingo Molnar 2009-08-20 1:05 ` Christoph Hellwig @ 2009-08-20 1:39 ` Mark Lord 2009-08-20 13:48 ` Ric Wheeler 1 sibling, 1 reply; 27+ messages in thread From: Mark Lord @ 2009-08-20 1:39 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe, IDE/ATA development list [resending, after fixing the Cc: list; somebody trimmed it earlier] Jamie Lokier wrote: .. > I don't remember - does TRIM guarantee the blocks read zeros afterwards? .. No, it doesn't. A drive can optionally support "deterministic TRIM", whereby it will return consistent data for any given trimmed sector afterwards, but that doesn't mean zeros. -ml ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-20 1:39 ` Mark Lord @ 2009-08-20 13:48 ` Ric Wheeler 2009-08-20 14:38 ` Mark Lord 2009-08-20 14:58 ` Douglas Gilbert 0 siblings, 2 replies; 27+ messages in thread From: Ric Wheeler @ 2009-08-20 13:48 UTC (permalink / raw) To: Mark Lord Cc: Ingo Molnar, Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe, IDE/ATA development list, Neil Brown On 08/19/2009 09:39 PM, Mark Lord wrote: > [resending, after fixing the Cc: list; somebody trimmed it earlier] > > Jamie Lokier wrote: > .. >> I don't remember - does TRIM guarantee the blocks read zeros afterwards? > .. > > No, it doesn't. > > A drive can optionally support "deterministic TRIM", whereby it will > return > consistent data for any given trimmed sector afterwards, but that > doesn't mean zeros. > > -ml Note that returning consistent data is critical for devices that are used in a RAID group since you will need each RAID block that is used to compute the parity to continue to return the same data until you overwrite it with new data :-) If we have a device that does not support this (or is misconfigured not to do this), we should not use those devices in an MD group & do discard against it... ric ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-20 13:48 ` Ric Wheeler @ 2009-08-20 14:38 ` Mark Lord 2009-08-20 14:42 ` Ric Wheeler ` (2 more replies) 2009-08-20 14:58 ` Douglas Gilbert 1 sibling, 3 replies; 27+ messages in thread From: Mark Lord @ 2009-08-20 14:38 UTC (permalink / raw) To: Ric Wheeler Cc: Ingo Molnar, Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe, IDE/ATA development list, Neil Brown Ric Wheeler wrote: > > Note that returning consistent data is critical for devices that are > used in a RAID group since you will need each RAID block that is used to > compute the parity to continue to return the same data until you > overwrite it with new data :-) > > If we have a device that does not support this (or is misconfigured not > to do this), we should not use those devices in an MD group & do discard > against it... .. Well, that's a bit drastic. But the RAID software should at least not issue TRIM commands in ignorance of such. Would it still be okay to do the TRIMs when the entire parity stripe (across all members) is being discarded? (As opposed to just partial data there being dropped) ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-20 14:38 ` Mark Lord @ 2009-08-20 14:42 ` Ric Wheeler 2009-08-20 17:19 ` Greg Freemyer 2009-08-20 14:42 ` James Bottomley 2009-08-20 15:43 ` Rolf Eike Beer 2 siblings, 1 reply; 27+ messages in thread From: Ric Wheeler @ 2009-08-20 14:42 UTC (permalink / raw) To: Mark Lord Cc: Ingo Molnar, Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe, IDE/ATA development list, Neil Brown On 08/20/2009 10:38 AM, Mark Lord wrote: > Ric Wheeler wrote: >> >> Note that returning consistent data is critical for devices that are >> used in a RAID group since you will need each RAID block that is used >> to compute the parity to continue to return the same data until you >> overwrite it with new data :-) >> >> If we have a device that does not support this (or is misconfigured >> not to do this), we should not use those devices in an MD group & do >> discard against it... > .. > > Well, that's a bit drastic. But the RAID software should at least > not issue TRIM commands in ignorance of such. If the storage can return different data in a sequence of READ requests of the same sector (with no writes), there is nothing RAID could do. It would see total garbage... > Would it still be okay to do the TRIMs when the entire parity stripe > (across all members) is being discarded? (As opposed to just partial > data there being dropped) This should be safe if the MD bitmaps would prevent us from trying to READ/regenerate parity for that stripe... ric ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-20 14:42 ` Ric Wheeler @ 2009-08-20 17:19 ` Greg Freemyer 0 siblings, 0 replies; 27+ messages in thread From: Greg Freemyer @ 2009-08-20 17:19 UTC (permalink / raw) To: Ric Wheeler Cc: Mark Lord, Ingo Molnar, Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe, IDE/ATA development list, Neil Brown On Thu, Aug 20, 2009 at 10:42 AM, Ric Wheeler<rwheeler@redhat.com> wrote: > On 08/20/2009 10:38 AM, Mark Lord wrote: >> >> Ric Wheeler wrote: >>> >>> Note that returning consistent data is critical for devices that are >>> used in a RAID group since you will need each RAID block that is used >>> to compute the parity to continue to return the same data until you >>> overwrite it with new data :-) >>> >>> If we have a device that does not support this (or is misconfigured >>> not to do this), we should not use those devices in an MD group & do >>> discard against it... >> >> .. >> >> Well, that's a bit drastic. But the RAID software should at least >> not issue TRIM commands in ignorance of such. > > If the storage can return different data in a sequence of READ requests of > the same sector (with no writes), there is nothing RAID could do. It would > see total garbage... > >> Would it still be okay to do the TRIMs when the entire parity stripe >> (across all members) is being discarded? (As opposed to just partial >> data there being dropped) > > This should be safe if the MD bitmaps would prevent us from trying to > READ/regenerate parity for that stripe... > > ric The harder thing for mdraid is putting a stripe back in service. If even a single sector is written to a "discarded" stripe, the entire stripe has to be written with determinate data that has the right parity. ie. Only full stripes can be discarded and only full-stripes can be put back in service. Greg -- Greg Freemyer Head of EDD Tape Extraction and Processing team Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer Preservation and Forensic processing of Exchange Repositories White Paper - <http://www.norcrossgroup.com/forms/whitepapers/tng_whitepaper_fpe.html> The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-20 14:38 ` Mark Lord 2009-08-20 14:42 ` Ric Wheeler @ 2009-08-20 14:42 ` James Bottomley 2009-08-20 15:43 ` Rolf Eike Beer 2 siblings, 0 replies; 27+ messages in thread From: James Bottomley @ 2009-08-20 14:42 UTC (permalink / raw) To: Mark Lord Cc: Ric Wheeler, Ingo Molnar, Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe, IDE/ATA development list, Neil Brown On Thu, 2009-08-20 at 10:38 -0400, Mark Lord wrote: > Would it still be okay to do the TRIMs when the entire parity stripe > (across all members) is being discarded? (As opposed to just partial > data there being dropped) Not really. The problem is that array verification is done at the block level not the fs level (although, I suppose, we could change that). So a fully discarded stripe still has to verify OK (as in what's read for the parity must match what's read for the data). All of this is the reason for the TPRZ bit for SCSI UNMAP ... and why WRITE_SAME is also under consideration for discards in T10. James ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-20 14:38 ` Mark Lord 2009-08-20 14:42 ` Ric Wheeler 2009-08-20 14:42 ` James Bottomley @ 2009-08-20 15:43 ` Rolf Eike Beer 2009-08-20 17:00 ` Ric Wheeler 2 siblings, 1 reply; 27+ messages in thread From: Rolf Eike Beer @ 2009-08-20 15:43 UTC (permalink / raw) To: Mark Lord Cc: Ric Wheeler, Ingo Molnar, Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe, IDE/ATA development list, Neil Brown [-- Attachment #1: Type: Text/Plain, Size: 2366 bytes --] Mark Lord wrote: > Ric Wheeler wrote: > > Note that returning consistent data is critical for devices that are > > used in a RAID group since you will need each RAID block that is used to > > compute the parity to continue to return the same data until you > > overwrite it with new data :-) > > > > If we have a device that does not support this (or is misconfigured not > > to do this), we should not use those devices in an MD group & do discard > > against it... > > .. > > Well, that's a bit drastic. But the RAID software should at least > not issue TRIM commands in ignorance of such. > > Would it still be okay to do the TRIMs when the entire parity stripe > (across all members) is being discarded? (As opposed to just partial > data there being dropped) I think there might be a related usecase that could benefit from TRIM/UNMAP/whatever support in file systems even if the physical devices do not support that. I have a RAID5 at work with LVM over it. This week I deleted an old logical volume of some 200GB that has been moved to a different volume group, tomorrow I will start to replace all the disks in the raid with bigger ones. So if the LVM told the raid "hey, this space is totally garbage from now on" the raid would not have to do any calculation when it has to rebuild that but could simply write fixed patterns to all disks (e.g. 0 to first data, 0 to second data and 0 as "0 xor 0" to parity). With the knowledge that some of the underlying devices would support "write all to zero" this operation could be speed up even more, with "write all fixed pattern" every unused chunk would go down to a single write operation (per disk) on rebuild regardless which parity algorithm is used. And even if things are in use the RAID can benefit from such things. If we just define that every unmapped space will always be 0 when read and I write to a raid volume and the other part of the checksum calculation is unmapped checksumming becomes easy as we already know half of the values before: 0. So we can save the reads from the second data stripe and most of the calculation. "dd if=/dev/md0" on an unmapped space is more or less the same as "dd if=/dev/zero" than. I only fear that these things are too obviously as I would be the first to have this idea ;) Greetings, Eike [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-20 15:43 ` Rolf Eike Beer @ 2009-08-20 17:00 ` Ric Wheeler 0 siblings, 0 replies; 27+ messages in thread From: Ric Wheeler @ 2009-08-20 17:00 UTC (permalink / raw) To: Rolf Eike Beer Cc: Mark Lord, Ric Wheeler, Ingo Molnar, Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe, IDE/ATA development list, Neil Brown On 08/20/2009 11:43 AM, Rolf Eike Beer wrote: > Mark Lord wrote: > >> Ric Wheeler wrote: >> >>> Note that returning consistent data is critical for devices that are >>> used in a RAID group since you will need each RAID block that is used to >>> compute the parity to continue to return the same data until you >>> overwrite it with new data :-) >>> >>> If we have a device that does not support this (or is misconfigured not >>> to do this), we should not use those devices in an MD group& do discard >>> against it... >>> >> .. >> >> Well, that's a bit drastic. But the RAID software should at least >> not issue TRIM commands in ignorance of such. >> >> Would it still be okay to do the TRIMs when the entire parity stripe >> (across all members) is being discarded? (As opposed to just partial >> data there being dropped) >> > I think there might be a related usecase that could benefit from > TRIM/UNMAP/whatever support in file systems even if the physical devices do > not support that. I have a RAID5 at work with LVM over it. This week I deleted > an old logical volume of some 200GB that has been moved to a different volume > group, tomorrow I will start to replace all the disks in the raid with bigger > ones. So if the LVM told the raid "hey, this space is totally garbage from now > on" the raid would not have to do any calculation when it has to rebuild that > but could simply write fixed patterns to all disks (e.g. 0 to first data, 0 to > second data and 0 as "0 xor 0" to parity). With the knowledge that some of the > underlying devices would support "write all to zero" this operation could be > speed up even more, with "write all fixed pattern" every unused chunk would go > down to a single write operation (per disk) on rebuild regardless which parity > algorithm is used. > In the SCSI world, RAID array vendors use "WRITE_SAME" to do this. For the SCSI discard, the write same command has a discard bit set if I remember correctly so you basically get what you are describing above. ric > And even if things are in use the RAID can benefit from such things. If we > just define that every unmapped space will always be 0 when read and I write > to a raid volume and the other part of the checksum calculation is unmapped > checksumming becomes easy as we already know half of the values before: 0. So > we can save the reads from the second data stripe and most of the calculation. > "dd if=/dev/md0" on an unmapped space is more or less the same as "dd > if=/dev/zero" than. > > I only fear that these things are too obviously as I would be the first to > have this idea ;) > > Greetings, > > Eike > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH, RFC] xfs: batched discard support 2009-08-20 13:48 ` Ric Wheeler 2009-08-20 14:38 ` Mark Lord @ 2009-08-20 14:58 ` Douglas Gilbert 1 sibling, 0 replies; 27+ messages in thread From: Douglas Gilbert @ 2009-08-20 14:58 UTC (permalink / raw) To: Ric Wheeler Cc: Mark Lord, Ingo Molnar, Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe, IDE/ATA development list, Neil Brown Ric Wheeler wrote: > On 08/19/2009 09:39 PM, Mark Lord wrote: >> [resending, after fixing the Cc: list; somebody trimmed it earlier] >> >> Jamie Lokier wrote: >> .. >>> I don't remember - does TRIM guarantee the blocks read zeros afterwards? >> .. >> >> No, it doesn't. >> >> A drive can optionally support "deterministic TRIM", whereby it will >> return >> consistent data for any given trimmed sector afterwards, but that >> doesn't mean zeros. >> >> -ml > > Note that returning consistent data is critical for devices that are > used in a RAID group since you will need each RAID block that is used to > compute the parity to continue to return the same data until you > overwrite it with new data :-) > > If we have a device that does not support this (or is misconfigured not > to do this), we should not use those devices in an MD group & do discard > against it... A closer reading of d2015r2-ATAATAPI_Command_Set_-_2_ACS-2.pdf section 7.10.3.2 (latest ACS-2 draft from www.t13.org) shows that there are 3 possible variants for data read from a logical block that has been trimmed (or "unmapped"): a) indeterminate b) determinate c) determinate, return all zeroes In the case of b) the same data is returned for each subsequent read. And that data must not be something that has previously be written to some other LBA! In the case of SCSI (sbc3r19.pdf) case b) is not supported (very sensibly IMO). Another difference I noticed between SCSI and ATA drafts is with the SECURITY ERASE UNIT command which is somewhat similar to the SCSI FORMAT UNIT command (which includes a security erase option). The ATA draft says that all blocks are determinate ("mapped" in the SCSI state model) after a SECURITY ERASE UNIT. The SCSI draft says that all logical blocks may be unmapped after FORMAT UNIT. Doug Gilbert ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2009-08-21 12:46 UTC | newest] Thread overview: 27+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-08-16 0:47 [PATCH, RFC] xfs: batched discard support Christoph Hellwig 2009-08-16 1:35 ` Mark Lord 2009-08-16 2:19 ` Mark Lord 2009-08-16 2:25 ` Christoph Hellwig 2009-08-16 2:49 ` Mark Lord 2009-08-16 3:25 ` Mark Lord 2009-08-16 13:00 ` Mark Lord 2009-08-16 13:53 ` Christoph Hellwig 2009-08-16 13:59 ` Mark Lord 2009-08-16 14:06 ` Mark Lord 2009-08-16 14:23 ` Christoph Hellwig 2009-08-16 14:26 ` Mark Lord 2009-08-19 20:39 ` Ingo Molnar 2009-08-20 1:05 ` Christoph Hellwig 2009-08-20 1:10 ` Jamie Lokier 2009-08-20 1:38 ` Douglas Gilbert 2009-08-20 1:38 ` Mark Lord 2009-08-21 12:46 ` Ingo Molnar 2009-08-20 1:39 ` Mark Lord 2009-08-20 13:48 ` Ric Wheeler 2009-08-20 14:38 ` Mark Lord 2009-08-20 14:42 ` Ric Wheeler 2009-08-20 17:19 ` Greg Freemyer 2009-08-20 14:42 ` James Bottomley 2009-08-20 15:43 ` Rolf Eike Beer 2009-08-20 17:00 ` Ric Wheeler 2009-08-20 14:58 ` Douglas Gilbert
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).