[PATCH, RFC] xfs: batched discard support

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH, RFC] xfs: batched discard support
@ 2009-08-16  0:47 Christoph Hellwig
  2009-08-16  1:35 ` Mark Lord
  2009-08-19 20:39 ` Ingo Molnar
  0 siblings, 2 replies; 27+ messages in thread
From: Christoph Hellwig @ 2009-08-16  0:47 UTC (permalink / raw)
  To: xfs, linux-fsdevel, linux-scsi, linux-kernel, liml, jens.axboe

[-- Attachment #1: Type: text/plain, Size: 9156 bytes --]

Given that everyone is so big in the discard discussion I'd like to
present what I had started to prepare for XFS.  I didn't plan to send it
out until I get my hands onto a TRIM capable device (or at least get
time to add support to qemu), and so far it's only been tested in
dry-run mode.

The basic idea is to add an ioctl which walks the free space btrees in
each allocation group and simply discard everythin that is free.  Given
that XFS doesn't gragment freespace very much that's a very efficient
way to do it.  In addition we also already support setting a threshold
under which we don't bother to discard an extent, it's currently
hardcoded in the helper tool.  In the future we could also add things
like a sequence number in the AG headers if anything has changed at all,
but let's leave those optimizations until we need them.

XFS locks the allocation btree using the btree buffers, so we do not
block allocations from any extent which we're not currenly discarding.

Now the caveat for that is that we really do want to do the discard
synchronously, that is wait for the request to finish.  That's what
I've implemented in this patch, but it's the part I haven't been
able to test so far.  (and yes, this should be separate patch, but it's
really just an RFC for now)

Mark, any chance to try it?  Just create an XFS filesystem, age it a
bit and then call the attached little trim.c program on the mountmoint
(or any file inside the filesystem for that matter)


Index: linux-2.6/fs/xfs/linux-2.6/xfs_ioctl.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_ioctl.c	2009-08-15 20:15:14.379163976 -0300
+++ linux-2.6/fs/xfs/linux-2.6/xfs_ioctl.c	2009-08-15 21:19:19.342664224 -0300
@@ -1275,6 +1275,31 @@ xfs_ioc_getbmapx(
 	return 0;
 }
 
+STATIC int
+xfs_ioc_trim(
+	struct xfs_mount	*mp,
+	__uint32_t		*argp)
+{
+	xfs_agnumber_t		agno;
+	int			error = 0;
+	__uint32_t		minlen;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	if (get_user(minlen, argp))
+		return -EFAULT;
+
+	down_read(&mp->m_peraglock);
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		error = -xfs_trim_extents(mp, agno, minlen);
+		if (error)
+			break;
+	}
+	up_read(&mp->m_peraglock);
+
+	return error;
+}
+
 /*
  * Note: some of the ioctl's return positive numbers as a
  * byte count indicating success, such as readlink_by_handle.
@@ -1524,6 +1549,9 @@ xfs_file_ioctl(
 		error = xfs_errortag_clearall(mp, 1);
 		return -error;
 
+	case XFS_IOC_TRIM:
+		return xfs_ioc_trim(mp, arg);
+
 	default:
 		return -ENOTTY;
 	}
Index: linux-2.6/fs/xfs/xfs_alloc.c
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_alloc.c	2009-08-15 20:11:00.791163409 -0300
+++ linux-2.6/fs/xfs/xfs_alloc.c	2009-08-15 21:40:16.226666638 -0300
@@ -2470,6 +2470,96 @@ error0:
 	return error;
 }
 
+STATIC int
+xfs_trim_extent(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		fbno,
+	xfs_extlen_t		flen)
+{
+	xfs_daddr_t		blkno = XFS_AGB_TO_DADDR(mp, agno, fbno);
+	sector_t		nblks = XFS_FSB_TO_BB(mp, flen);
+	int			error;
+
+	xfs_fs_cmn_err(CE_NOTE, mp, "discarding sectors [0x%llx-0x%llx]",
+			blkno, nblks);
+
+	error = -__blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
+				        blkno, nblks, GFP_NOFS, 1);
+	if (error && error != EOPNOTSUPP)
+		xfs_fs_cmn_err(CE_NOTE, mp, "discard failed, error %d", error);
+	return error;
+}
+
+/*
+ * Notify the underlying block device about our free extent map.
+ *
+ * This walks all free extents above a minimum threshold and notifies the
+ * underlying device that these blocks are unused.  That information is
+ * useful for SSDs or thinly provisioned storage in high end arrays or
+ * virtualization scenarios.
+ */
+int
+xfs_trim_extents(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_extlen_t		minlen)	/* minimum extent size to bother */
+{
+	struct xfs_btree_cur	*cur;	/* cursor for the by-block btree */
+	struct xfs_buf		*agbp;	/* AGF buffer pointer */
+	xfs_agblock_t		bno;	/* block the for next search */
+	xfs_agblock_t		fbno;	/* start block of found extent */
+	xfs_extlen_t		flen;	/* length of found extent */
+	int			error;
+	int			i;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		return error;
+
+	bno = 0;
+	for (;;) {
+		cur = xfs_allocbt_init_cursor(mp, NULL, agbp, agno,
+					      XFS_BTNUM_BNO);
+
+		error = xfs_alloc_lookup_ge(cur, bno, minlen, &i);
+		if (error)
+			goto error0;
+		if (!i) {
+			/*
+			 * No more free extents found: done.
+			 */
+			xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+			break;
+		}
+
+		error = xfs_alloc_get_rec(cur, &fbno, &flen, &i);
+		if (error)
+			goto error0;
+		XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
+
+		/*
+		 * Pass if the freespace extent isn't long enough to bother.
+		 */
+		if (flen >= minlen) {
+			error = xfs_trim_extent(mp, agno, fbno, flen);
+			if (error) {
+				xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+				break;
+			}
+		}
+
+		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+		bno = fbno + flen;
+	}
+
+out:
+	xfs_buf_relse(agbp);
+	return error;
+error0:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	goto out;
+}
 
 /*
  * AG Busy list management
Index: linux-2.6/fs/xfs/xfs_alloc.h
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_alloc.h	2009-08-15 20:12:51.762661386 -0300
+++ linux-2.6/fs/xfs/xfs_alloc.h	2009-08-15 20:15:07.334667592 -0300
@@ -217,4 +217,7 @@ xfs_free_extent(
 	xfs_fsblock_t	bno,	/* starting block number of extent */
 	xfs_extlen_t	len);	/* length of extent */
 
+int xfs_trim_extents(struct xfs_mount *mp, xfs_agnumber_t agno,
+	xfs_extlen_t minlen);
+
 #endif	/* __XFS_ALLOC_H__ */
Index: linux-2.6/fs/xfs/xfs_fs.h
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_fs.h	2009-08-15 20:22:03.735200427 -0300
+++ linux-2.6/fs/xfs/xfs_fs.h	2009-08-15 20:24:18.677996430 -0300
@@ -475,6 +475,7 @@ typedef struct xfs_handle {
 #define XFS_IOC_ATTRMULTI_BY_HANDLE  _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq)
 #define XFS_IOC_FSGEOMETRY	     _IOR ('X', 124, struct xfs_fsop_geom)
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
+#define XFS_IOC_TRIM		     _IOR ('X', 126, __uint32_t)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
Index: linux-2.6/block/blk-barrier.c
===================================================================
--- linux-2.6.orig/block/blk-barrier.c	2009-08-15 21:36:30.426696824 -0300
+++ linux-2.6/block/blk-barrier.c	2009-08-15 21:41:11.490664659 -0300
@@ -348,22 +348,27 @@ static void blkdev_discard_end_io(struct
 		clear_bit(BIO_UPTODATE, &bio->bi_flags);
 	}
 
+	if (bio->bi_private)
+		complete(bio->bi_private);
 	bio_put(bio);
 }
 
 /**
- * blkdev_issue_discard - queue a discard
+ * __blkdev_issue_discard - queue a discard
  * @bdev:	blockdev to issue discard for
  * @sector:	start sector
  * @nr_sects:	number of sectors to discard
  * @gfp_mask:	memory allocation flags (for bio_alloc)
+ * @wait:	if %1 wait for the discard to finish
  *
  * Description:
- *    Issue a discard request for the sectors in question. Does not wait.
+ *    Issue a discard request for the sectors in question.
  */
-int blkdev_issue_discard(struct block_device *bdev,
-			 sector_t sector, sector_t nr_sects, gfp_t gfp_mask)
+int __blkdev_issue_discard(struct block_device *bdev,
+			 sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
+			 int wait)
 {
+	DECLARE_COMPLETION_ONSTACK(done);
 	struct request_queue *q;
 	struct bio *bio;
 	int ret = 0;
@@ -385,6 +390,7 @@ int blkdev_issue_discard(struct block_de
 
 		bio->bi_end_io = blkdev_discard_end_io;
 		bio->bi_bdev = bdev;
+		bio->bi_private = wait ? &done : NULL;
 
 		bio->bi_sector = sector;
 
@@ -399,6 +405,9 @@ int blkdev_issue_discard(struct block_de
 		bio_get(bio);
 		submit_bio(DISCARD_BARRIER, bio);
 
+		if (wait)
+			wait_for_completion(&done);
+
 		/* Check if it failed immediately */
 		if (bio_flagged(bio, BIO_EOPNOTSUPP))
 			ret = -EOPNOTSUPP;
@@ -408,4 +417,4 @@ int blkdev_issue_discard(struct block_de
 	}
 	return ret;
 }
-EXPORT_SYMBOL(blkdev_issue_discard);
+EXPORT_SYMBOL(__blkdev_issue_discard);
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2009-08-15 21:40:20.507164178 -0300
+++ linux-2.6/include/linux/blkdev.h	2009-08-15 21:42:15.734715355 -0300
@@ -977,8 +977,14 @@ static inline struct request *blk_map_qu
 }
 
 extern int blkdev_issue_flush(struct block_device *, sector_t *);
-extern int blkdev_issue_discard(struct block_device *,
-				sector_t sector, sector_t nr_sects, gfp_t);
+extern int __blkdev_issue_discard(struct block_device *, sector_t sector,
+		sector_t nr_sects, gfp_t, int wait);
+
+static inline int blkdev_issue_discard(struct block_device *bdev,
+		sector_t sector, sector_t nr_sects, gfp_t gfp_mask)
+{
+	return __blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, 0);
+}
 
 static inline int sb_issue_discard(struct super_block *sb,
 				   sector_t block, sector_t nr_blocks)

[-- Attachment #2: trim.c --]
[-- Type: text/plain, Size: 576 bytes --]


#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define XFS_IOC_TRIM                 _IOR ('X', 126, uint32_t)


int main(int argc, char **argv)
{
	int minsize = 4096;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s mountpoint\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (ioctl(fd, XFS_IOC_TRIM, &minsize)) {
		if (errno == EOPNOTSUPP)
			fprintf(stderr, "TRIM not supported\n");
		else
			perror("XFS_IOC_TRIM");
		return 1;
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-16  0:47 [PATCH, RFC] xfs: batched discard support Christoph Hellwig
@ 2009-08-16  1:35 ` Mark Lord
  2009-08-16  2:19   ` Mark Lord
  2009-08-19 20:39 ` Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: Mark Lord @ 2009-08-16  1:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe

Christoph Hellwig wrote:
> Given that everyone is so big in the discard discussion I'd like to
> present what I had started to prepare for XFS.  I didn't plan to send it
> out until I get my hands onto a TRIM capable device (or at least get
> time to add support to qemu), and so far it's only been tested in
> dry-run mode.
> 
> The basic idea is to add an ioctl which walks the free space btrees in
> each allocation group and simply discard everythin that is free.  Given
> that XFS doesn't gragment freespace very much that's a very efficient
> way to do it.  In addition we also already support setting a threshold
> under which we don't bother to discard an extent, it's currently
> hardcoded in the helper tool.  In the future we could also add things
> like a sequence number in the AG headers if anything has changed at all,
> but let's leave those optimizations until we need them.
> 
> XFS locks the allocation btree using the btree buffers, so we do not
> block allocations from any extent which we're not currenly discarding.
> 
> Now the caveat for that is that we really do want to do the discard
> synchronously, that is wait for the request to finish.  That's what
> I've implemented in this patch, but it's the part I haven't been
> able to test so far.  (and yes, this should be separate patch, but it's
> really just an RFC for now)
> 
> Mark, any chance to try it?  Just create an XFS filesystem, age it a
> bit and then call the attached little trim.c program on the mountmoint
> (or any file inside the filesystem for that matter)
..

Looking at it now.  Thanks, Christoph!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-16  1:35 ` Mark Lord
@ 2009-08-16  2:19   ` Mark Lord
  2009-08-16  2:25     ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Mark Lord @ 2009-08-16  2:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe

Mark Lord wrote:
> Christoph Hellwig wrote:
..
>> Mark, any chance to try it?  Just create an XFS filesystem, age it a
>> bit and then call the attached little trim.c program on the mountmoint
>> (or any file inside the filesystem for that matter)
> ..
> 
> Looking at it now.  Thanks, Christoph!
..

Fails to work on 64-bit kernel w/ 32-bit userspace (no compat ioctl).
Rebuilding with 32-bit kernel now..



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-16  2:19   ` Mark Lord
@ 2009-08-16  2:25     ` Christoph Hellwig
  2009-08-16  2:49       ` Mark Lord
  2009-08-16 13:00       ` Mark Lord
  0 siblings, 2 replies; 27+ messages in thread
From: Christoph Hellwig @ 2009-08-16  2:25 UTC (permalink / raw)
  To: Mark Lord
  Cc: Christoph Hellwig, xfs, linux-fsdevel, linux-scsi, linux-kernel,
	jens.axboe

On Sat, Aug 15, 2009 at 10:19:21PM -0400, Mark Lord wrote:
> Mark Lord wrote:
>> Christoph Hellwig wrote:
> ..
>>> Mark, any chance to try it?  Just create an XFS filesystem, age it a
>>> bit and then call the attached little trim.c program on the mountmoint
>>> (or any file inside the filesystem for that matter)
>> ..
>>
>> Looking at it now.  Thanks, Christoph!
> ..
>
> Fails to work on 64-bit kernel w/ 32-bit userspace (no compat ioctl).
> Rebuilding with 32-bit kernel now..

The actual ioctl is compatible, just add the

	case XFS_IOC_TRIM:
		return xfs_ioc_trim(mp, arg);

to xfs_file_compat_ioctl().  I'll add this to the next spin of the
patch.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-16  2:25     ` Christoph Hellwig
@ 2009-08-16  2:49       ` Mark Lord
  2009-08-16  3:25         ` Mark Lord
  2009-08-16 13:00       ` Mark Lord
  1 sibling, 1 reply; 27+ messages in thread
From: Mark Lord @ 2009-08-16  2:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe

Christoph Hellwig wrote:
> On Sat, Aug 15, 2009 at 10:19:21PM -0400, Mark Lord wrote:
>> Mark Lord wrote:
>>> Christoph Hellwig wrote:
>> ..
>>>> Mark, any chance to try it?  Just create an XFS filesystem, age it a
>>>> bit and then call the attached little trim.c program on the mountmoint
>>>> (or any file inside the filesystem for that matter)
>>> ..
>>>
>>> Looking at it now.  Thanks, Christoph!
>> ..
>>
>> Fails to work on 64-bit kernel w/ 32-bit userspace (no compat ioctl).
>> Rebuilding with 32-bit kernel now..
> 
> The actual ioctl is compatible, just add the
> 
> 	case XFS_IOC_TRIM:
> 		return xfs_ioc_trim(mp, arg);
> 
> to xfs_file_compat_ioctl().  I'll add this to the next spin of the patch.
..

Peachy.  Rebuilding again now..

Don't wait up for this .. I'll finish it on Sunday.  Bedtime.  :)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-16  2:49       ` Mark Lord
@ 2009-08-16  3:25         ` Mark Lord
  0 siblings, 0 replies; 27+ messages in thread
From: Mark Lord @ 2009-08-16  3:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe

Mark Lord wrote:
> Christoph Hellwig wrote:
>> On Sat, Aug 15, 2009 at 10:19:21PM -0400, Mark Lord wrote:
>>> Mark Lord wrote:
>>>> Christoph Hellwig wrote:
>>> ..
>>>>> Mark, any chance to try it?  Just create an XFS filesystem, age it a
>>>>> bit and then call the attached little trim.c program on the mountmoint
>>>>> (or any file inside the filesystem for that matter)
>>>> ..
>>>>
>>>> Looking at it now.  Thanks, Christoph!
>>> ..
>>>
>>> Fails to work on 64-bit kernel w/ 32-bit userspace (no compat ioctl).
>>> Rebuilding with 32-bit kernel now..
>>
>> The actual ioctl is compatible, just add the
>>
>>     case XFS_IOC_TRIM:
>>         return xfs_ioc_trim(mp, arg);
>>
>> to xfs_file_compat_ioctl().  I'll add this to the next spin of the patch.
> ..
> 
> Peachy.  Rebuilding again now..
> 
> Don't wait up for this .. I'll finish it on Sunday.  Bedtime.  :)
..

Mmm.. doesn't work on stock 2.6.31-rc6.
It must also require Matthew's patches for discard/trim support.

Later.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-16  2:25     ` Christoph Hellwig
  2009-08-16  2:49       ` Mark Lord
@ 2009-08-16 13:00       ` Mark Lord
  2009-08-16 13:53         ` Christoph Hellwig
  2009-08-16 13:59         ` Mark Lord
  1 sibling, 2 replies; 27+ messages in thread
From: Mark Lord @ 2009-08-16 13:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe

Christoph Hellwig wrote:
> On Sat, Aug 15, 2009 at 10:19:21PM -0400, Mark Lord wrote:
>> Mark Lord wrote:
>>> Christoph Hellwig wrote:
>> ..
>>>> Mark, any chance to try it?  Just create an XFS filesystem, age it a
>>>> bit and then call the attached little trim.c program on the mountmoint
>>>> (or any file inside the filesystem for that matter)
>>> ..
>>>
>>> Looking at it now.  Thanks, Christoph!
>> ..
>>
>> Fails to work on 64-bit kernel w/ 32-bit userspace (no compat ioctl).
>> Rebuilding with 32-bit kernel now..
> 
> The actual ioctl is compatible, just add the
> 
> 	case XFS_IOC_TRIM:
> 		return xfs_ioc_trim(mp, arg);
> 
> to xfs_file_compat_ioctl().  I'll add this to the next spin of the patch.
..

Okay, this gives me ENOSYS now --> discard/trim support is missing from
the lower layers.

What other patches do I need to make this work?

The latest from Matthew's discard tree (May 2009) don't appear to be sufficient,
even after updating them for 2.6.31-rc6.

???

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-16 13:00       ` Mark Lord
@ 2009-08-16 13:53         ` Christoph Hellwig
  2009-08-16 13:59         ` Mark Lord
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2009-08-16 13:53 UTC (permalink / raw)
  To: Mark Lord
  Cc: Christoph Hellwig, linux-fsdevel, jens.axboe, linux-kernel,
	linux-scsi, xfs

On Sun, Aug 16, 2009 at 09:00:35AM -0400, Mark Lord wrote:
> Okay, this gives me ENOSYS now --> discard/trim support is missing from
> the lower layers.

ENOSYS or EOPNOTSUPP?

> What other patches do I need to make this work?
>
> The latest from Matthew's discard tree (May 2009) don't appear to be sufficient,
> even after updating them for 2.6.31-rc6.

Hmm.  That was kinda the reason why I didn't want to post stuff yet.  In
current mainline only MTD implements the prepare_discard_fn, and we'll
need it implemented and working.  I think Matthews patches should be
sufficient, unless they have some ATA revision check to not issue the
TRIM for your device.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-16 13:00       ` Mark Lord
  2009-08-16 13:53         ` Christoph Hellwig
@ 2009-08-16 13:59         ` Mark Lord
  2009-08-16 14:06           ` Mark Lord
  2009-08-16 14:23           ` Christoph Hellwig
  1 sibling, 2 replies; 27+ messages in thread
From: Mark Lord @ 2009-08-16 13:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe,
	IDE/ATA development list

[-- Attachment #1: Type: text/plain, Size: 2824 bytes --]

Mark Lord wrote:
> Christoph Hellwig wrote:
>> On Sat, Aug 15, 2009 at 10:19:21PM -0400, Mark Lord wrote:
>>> Mark Lord wrote:
>>>> Christoph Hellwig wrote:
>>> ..
>>>>> Mark, any chance to try it?  Just create an XFS filesystem, age it a
>>>>> bit and then call the attached little trim.c program on the mountmoint
>>>>> (or any file inside the filesystem for that matter)
>>>> ..
>>>>
>>>> Looking at it now.  Thanks, Christoph!
>>> ..
>>>
>>> Fails to work on 64-bit kernel w/ 32-bit userspace (no compat ioctl).
>>> Rebuilding with 32-bit kernel now..
>>
>> The actual ioctl is compatible, just add the
>>
>>     case XFS_IOC_TRIM:
>>         return xfs_ioc_trim(mp, arg);
>>
>> to xfs_file_compat_ioctl().  I'll add this to the next spin of the patch.
> ..
> 
> Okay, this gives me ENOSYS now --> discard/trim support is missing from
> the lower layers.
> 
> What other patches do I need to make this work?
> 
> The latest from Matthew's discard tree (May 2009) don't appear to be sufficient,
> even after updating them for 2.6.31-rc6.
..

Okay, I got Matthews patches updated onto 2.6.31, and fixed the incompatibilities
between those and the XFS TRIM patch (from Christoph), plus a sector_t printk issue.

My apologies for attachments, but I am attaching the updated Christoph patch,
as well as my hacked-up forward-port of Matthew's patches.

Not pretty, but they work.  :)

Now.. running Christoph's "xfs trim" on a 4.6GB mostly already-trimmed
XFS partition gave this for the first time around:

[   25.961891] Filesystem "sdb3": discarding sectors [0xc558-0x102328]
[   27.814553] Filesystem "sdb3": discarding sectors [0x10ea78-0x10e688]
[   29.771218] Filesystem "sdb3": discarding sectors [0x21d120-0x10e860]
[   31.726444] Filesystem "sdb3": discarding sectors [0x32b9a0-0x10e860]
[   33.679023] Filesystem "sdb3": discarding sectors [0x43f220-0x109860]
[   35.629948] Filesystem "sdb3": discarding sectors [0x548aa0-0x10e860]
[   37.583142] Filesystem "sdb3": discarding sectors [0x657320-0x10e860]
[   39.531822] Filesystem "sdb3": discarding sectors [0x765ba0-0x10e860]

Slow, but presumably thorough.
Subsequent runs were equally slow.

The problem is, it still issues TRIMs to the LLD one extent at a time.
Compare this with doing it all in a single TRIM command
with the wiper.sh script (filesystem unmounted):

	[~] time wiper.sh /dev/sdb3 --commit
	
	wiper.sh: Linux SATA SSD TRIM utility, version 1.9b, by Mark Lord.
	Preparing for offline TRIM of free space on /dev/sdb3 (xfs non-mounted).
	This operation could destroy your data.  Are you sure (y/N)? y
	Syncing disks..
	Beginning TRIM operations..
	Trimming 168 free extents encompassing 8793136 sectors (4294 MB)
	Done.
	
	real    0m1.249s
	user    0m0.110s
	sys     0m0.063s

That includes the time for me to type 'y' and hit enter.  :)

Cheers

[-- Attachment #2: 51_add_trim_support.patch --]
[-- Type: text/x-diff, Size: 11019 bytes --]

diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/block/blk-barrier.c linux/block/blk-barrier.c
--- linux-2.6.31-rc6/block/blk-barrier.c	2009-08-16 09:16:36.303766940 -0400
+++ linux/block/blk-barrier.c	2009-08-16 09:19:07.287086209 -0400
@@ -348,30 +348,22 @@
 		clear_bit(BIO_UPTODATE, &bio->bi_flags);
 	}
 
+	if (bio_has_data(bio))
+		__free_page(bio_page(bio));
+
+	if (bio->bi_private)
+		complete(bio->bi_private);
+
 	bio_put(bio);
 }
 
-/**
- * blkdev_issue_discard - queue a discard
- * @bdev:	blockdev to issue discard for
- * @sector:	start sector
- * @nr_sects:	number of sectors to discard
- * @gfp_mask:	memory allocation flags (for bio_alloc)
- *
- * Description:
- *    Issue a discard request for the sectors in question. Does not wait.
- */
-int blkdev_issue_discard(struct block_device *bdev,
-			 sector_t sector, sector_t nr_sects, gfp_t gfp_mask)
+int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
+				sector_t nr_sects, gfp_t gfp_mask,
+				unsigned type, struct completion *completion)
 {
-	struct request_queue *q;
-	struct bio *bio;
 	int ret = 0;
+	struct request_queue *q = bdev_get_queue(bdev);
 
-	if (bdev->bd_disk == NULL)
-		return -ENXIO;
-
-	q = bdev_get_queue(bdev);
 	if (!q)
 		return -ENXIO;
 
@@ -379,12 +371,13 @@
 		return -EOPNOTSUPP;
 
 	while (nr_sects && !ret) {
-		bio = bio_alloc(gfp_mask, 0);
+		struct bio *bio = bio_alloc(gfp_mask, 1);
 		if (!bio)
 			return -ENOMEM;
 
 		bio->bi_end_io = blkdev_discard_end_io;
 		bio->bi_bdev = bdev;
+		bio->bi_private = completion;
 
 		bio->bi_sector = sector;
 
@@ -396,10 +389,13 @@
 			bio->bi_size = nr_sects << 9;
 			nr_sects = 0;
 		}
+
 		bio_get(bio);
-		submit_bio(DISCARD_BARRIER, bio);
+		submit_bio(type, bio);
+
+		if (completion)
+			wait_for_completion(completion);
 
-		/* Check if it failed immediately */
 		if (bio_flagged(bio, BIO_EOPNOTSUPP))
 			ret = -EOPNOTSUPP;
 		else if (!bio_flagged(bio, BIO_UPTODATE))
@@ -408,4 +404,24 @@
 	}
 	return ret;
 }
+
+/**
+ * blkdev_issue_discard - queue a discard
+ * @bdev:	blockdev to issue discard for
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to discard
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *    Issue a discard request for the sectors in question. Does not wait.
+ */
+int blkdev_issue_discard(struct block_device *bdev,
+			 sector_t sector, sector_t nr_sects, gfp_t gfp_mask)
+{
+	if (bdev->bd_disk == NULL)
+		return -ENXIO;
+
+	return __blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask,
+					DISCARD_BARRIER, NULL);
+}
 EXPORT_SYMBOL(blkdev_issue_discard);
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/block/blk-core.c linux/block/blk-core.c
--- linux-2.6.31-rc6/block/blk-core.c	2009-08-16 09:16:36.307099905 -0400
+++ linux/block/blk-core.c	2009-08-16 08:53:19.000000000 -0400
@@ -1107,6 +1107,8 @@
 
 void init_request_from_bio(struct request *req, struct bio *bio)
 {
+	might_sleep();
+
 	req->cpu = bio->bi_comp_cpu;
 	req->cmd_type = REQ_TYPE_FS;
 
@@ -1127,7 +1129,7 @@
 		req->cmd_flags |= REQ_DISCARD;
 		if (bio_barrier(bio))
 			req->cmd_flags |= REQ_SOFTBARRIER;
-		req->q->prepare_discard_fn(req->q, req);
+		req->q->prepare_discard_fn(req->q, req, bio);
 	} else if (unlikely(bio_barrier(bio)))
 		req->cmd_flags |= REQ_HARDBARRIER;
 
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/block/blk.h linux/block/blk.h
--- linux-2.6.31-rc6/block/blk.h	2009-08-16 09:16:36.310433289 -0400
+++ linux/block/blk.h	2009-08-16 08:53:19.000000000 -0400
@@ -17,6 +17,10 @@
 		      struct bio *bio);
 void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
+int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
+				sector_t nr_sects, gfp_t gfp_mask,
+				unsigned type, struct completion *completion);
+
 
 void blk_unplug_work(struct work_struct *work);
 void blk_unplug_timeout(unsigned long data);
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/block/ioctl.c linux/block/ioctl.c
--- linux-2.6.31-rc6/block/ioctl.c	2009-08-16 09:16:36.313766813 -0400
+++ linux/block/ioctl.c	2009-08-16 08:53:19.000000000 -0400
@@ -7,6 +7,7 @@
 #include <linux/smp_lock.h>
 #include <linux/blktrace_api.h>
 #include <asm/uaccess.h>
+#include "blk.h"
 
 static int blkpg_ioctl(struct block_device *bdev, struct blkpg_ioctl_arg __user *arg)
 {
@@ -112,21 +113,10 @@
 	return res;
 }
 
-static void blk_ioc_discard_endio(struct bio *bio, int err)
-{
-	if (err) {
-		if (err == -EOPNOTSUPP)
-			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
-		clear_bit(BIO_UPTODATE, &bio->bi_flags);
-	}
-	complete(bio->bi_private);
-}
-
 static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 			     uint64_t len)
 {
-	struct request_queue *q = bdev_get_queue(bdev);
-	int ret = 0;
+	DECLARE_COMPLETION_ONSTACK(wait);
 
 	if (start & 511)
 		return -EINVAL;
@@ -138,39 +128,8 @@
 	if (start + len > (bdev->bd_inode->i_size >> 9))
 		return -EINVAL;
 
-	if (!q->prepare_discard_fn)
-		return -EOPNOTSUPP;
-
-	while (len && !ret) {
-		DECLARE_COMPLETION_ONSTACK(wait);
-		struct bio *bio;
-
-		bio = bio_alloc(GFP_KERNEL, 0);
-
-		bio->bi_end_io = blk_ioc_discard_endio;
-		bio->bi_bdev = bdev;
-		bio->bi_private = &wait;
-		bio->bi_sector = start;
-
-		if (len > queue_max_hw_sectors(q)) {
-			bio->bi_size = queue_max_hw_sectors(q) << 9;
-			len -= queue_max_hw_sectors(q);
-			start += queue_max_hw_sectors(q);
-		} else {
-			bio->bi_size = len << 9;
-			len = 0;
-		}
-		submit_bio(DISCARD_NOBARRIER, bio);
-
-		wait_for_completion(&wait);
-
-		if (bio_flagged(bio, BIO_EOPNOTSUPP))
-			ret = -EOPNOTSUPP;
-		else if (!bio_flagged(bio, BIO_UPTODATE))
-			ret = -EIO;
-		bio_put(bio);
-	}
-	return ret;
+	return __blkdev_issue_discard(bdev, start, len, GFP_KERNEL,
+						DISCARD_NOBARRIER, &wait);
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/drivers/ata/libata-scsi.c linux/drivers/ata/libata-scsi.c
--- linux-2.6.31-rc6/drivers/ata/libata-scsi.c	2009-08-16 09:16:36.350433414 -0400
+++ linux/drivers/ata/libata-scsi.c	2009-08-16 08:53:19.000000000 -0400
@@ -1051,6 +1051,46 @@
 	desc[11] = block;
 }
 
+static int ata_discard_fn(struct request_queue *q, struct request *req,
+							struct bio *bio)
+{
+	unsigned size;
+	struct page *page = alloc_page(GFP_KERNEL);
+	if (!page)
+		goto error;
+
+	size = ata_set_lba_range_entries(page_address(page), PAGE_SIZE / 8,
+					bio->bi_sector, bio_sectors(bio));
+	bio->bi_size = 0;
+	if (bio_add_pc_page(q, bio, page, size, 0) < size)
+		goto free_page;
+
+	req->cmd_type = REQ_TYPE_BLOCK_PC;
+	req->cmd_len = 16;
+	req->cmd[0] = ATA_16;
+	req->cmd[1] = (6 << 1) | 1;		/* dma, 48-bit */
+	req->cmd[2] = 0x6;			/* length, direction */
+	req->cmd[3] = 0;			/* feature high */
+	req->cmd[4] = ATA_DSM_TRIM;		/* feature low */
+	req->cmd[5] = (size / 512) >> 8;	/* nsect high */
+	req->cmd[6] = size / 512;		/* nsect low */
+	req->cmd[7] = 0;			/* lba */
+	req->cmd[8] = 0;			/* lba */
+	req->cmd[9] = 0;			/* lba */
+	req->cmd[10] = 0;			/* lba */
+	req->cmd[11] = 0;			/* lba */
+	req->cmd[12] = 0;			/* lba */
+	req->cmd[13] = ATA_LBA;			/* device */
+	req->cmd[14] = ATA_CMD_DSM;		/* command */
+	req->cmd[15] = 0;			/* control */
+
+	return 0;
+ free_page:
+	__free_page(page);
+ error:
+	return -ENOMEM;
+}
+
 static void ata_scsi_sdev_config(struct scsi_device *sdev)
 {
 	sdev->use_10_for_rw = 1;
@@ -1099,6 +1139,9 @@
 	/* configure max sectors */
 	blk_queue_max_sectors(sdev->request_queue, dev->max_sectors);
 
+	if (ata_id_has_trim(dev->id))
+		blk_queue_set_discard(sdev->request_queue, ata_discard_fn);
+
 	if (dev->class == ATA_DEV_ATAPI) {
 		struct request_queue *q = sdev->request_queue;
 		void *buf;
@@ -1747,6 +1790,12 @@
 	 * whether the command completed successfully or not. If there
 	 * was no error, SK, ASC and ASCQ will all be zero.
 	 */
+
+	if (need_sense && qc->tf.command == ATA_CMD_DSM) {
+		ata_port_printk(ap, KERN_ERR, "%s: DISCARD/TRIM failed: disabling it\n", __func__);
+		blk_queue_set_discard(qc->dev->sdev->request_queue, NULL);
+	}
+
 	if (((cdb[0] == ATA_16) || (cdb[0] == ATA_12)) &&
 	    ((cdb[2] & 0x20) || need_sense)) {
 		ata_gen_passthru_sense(qc);
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/drivers/mtd/mtd_blkdevs.c linux/drivers/mtd/mtd_blkdevs.c
--- linux-2.6.31-rc6/drivers/mtd/mtd_blkdevs.c	2009-08-16 09:16:36.963766818 -0400
+++ linux/drivers/mtd/mtd_blkdevs.c	2009-08-16 08:53:19.000000000 -0400
@@ -33,7 +33,7 @@
 };
 
 static int blktrans_discard_request(struct request_queue *q,
-				    struct request *req)
+				    struct request *req, struct bio *bio)
 {
 	req->cmd_type = REQ_TYPE_LINUX_BLOCK;
 	req->cmd[0] = REQ_LB_OP_DISCARD;
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/include/linux/blkdev.h linux/include/linux/blkdev.h
--- linux-2.6.31-rc6/include/linux/blkdev.h	2009-08-16 09:16:39.053766322 -0400
+++ linux/include/linux/blkdev.h	2009-08-16 08:53:19.000000000 -0400
@@ -255,7 +255,8 @@
 typedef int (make_request_fn) (struct request_queue *q, struct bio *bio);
 typedef int (prep_rq_fn) (struct request_queue *, struct request *);
 typedef void (unplug_fn) (struct request_queue *);
-typedef int (prepare_discard_fn) (struct request_queue *, struct request *);
+typedef int (prepare_discard_fn) (struct request_queue *, struct request *,
+							struct bio *bio);
 
 struct bio_vec;
 struct bvec_merge_data {
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6/Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6/.gitignore linux-2.6.31-rc6/include/linux/fs.h linux/include/linux/fs.h
--- linux-2.6.31-rc6/include/linux/fs.h	2009-08-16 09:16:39.070433246 -0400
+++ linux/include/linux/fs.h	2009-08-16 08:53:19.000000000 -0400
@@ -161,8 +161,8 @@
  * These aren't really reads or writes, they pass down information about
  * parts of device that are now unused by the file system.
  */
-#define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD)
-#define DISCARD_BARRIER ((1 << BIO_RW_DISCARD) | (1 << BIO_RW_BARRIER))
+#define DISCARD_NOBARRIER (WRITE | (1 << BIO_RW_DISCARD))
+#define DISCARD_BARRIER (DISCARD_NOBARRIER | (1 << BIO_RW_BARRIER))
 
 #define SEL_IN		1
 #define SEL_OUT		2

[-- Attachment #3: 52_christoph_xfs_trim.patch --]
[-- Type: text/x-diff, Size: 6760 bytes --]

diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/block/blk-barrier.c linux/block/blk-barrier.c
--- linux-2.6.31-rc6/block/blk-barrier.c	2009-08-16 09:36:36.431146680 -0400
+++ linux/block/blk-barrier.c	2009-08-16 09:20:15.164578531 -0400
@@ -425,3 +425,4 @@
 					DISCARD_BARRIER, NULL);
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
+EXPORT_SYMBOL(__blkdev_issue_discard);
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/fs/xfs/linux-2.6/xfs_ioctl.c linux/fs/xfs/linux-2.6/xfs_ioctl.c
--- linux-2.6.31-rc6/fs/xfs/linux-2.6/xfs_ioctl.c	2009-08-16 09:16:39.000433070 -0400
+++ linux/fs/xfs/linux-2.6/xfs_ioctl.c	2009-08-16 09:30:38.973683042 -0400
@@ -1274,6 +1274,31 @@
 	return 0;
 }
 
+int
+xfs_ioc_trim(
+	struct xfs_mount	*mp,
+	__uint32_t		*argp)
+{
+	xfs_agnumber_t		agno;
+	int			error = 0;
+	__uint32_t		minlen;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	if (get_user(minlen, argp))
+		return -EFAULT;
+
+	down_read(&mp->m_peraglock);
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		error = -xfs_trim_extents(mp, agno, minlen);
+		if (error)
+			break;
+	}
+	up_read(&mp->m_peraglock);
+
+	return error;
+}
+
 /*
  * Note: some of the ioctl's return positive numbers as a
  * byte count indicating success, such as readlink_by_handle.
@@ -1523,6 +1548,9 @@
 		error = xfs_errortag_clearall(mp, 1);
 		return -error;
 
+	case XFS_IOC_TRIM:
+		return xfs_ioc_trim(mp, arg);
+
 	default:
 		return -ENOTTY;
 	}
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/fs/xfs/linux-2.6/xfs_ioctl32.c linux/fs/xfs/linux-2.6/xfs_ioctl32.c
--- linux-2.6.31-rc6/fs/xfs/linux-2.6/xfs_ioctl32.c	2009-06-09 23:05:27.000000000 -0400
+++ linux/fs/xfs/linux-2.6/xfs_ioctl32.c	2009-08-16 09:31:21.005588977 -0400
@@ -539,6 +539,7 @@
 	void			__user *arg = (void __user *)p;
 	int			ioflags = 0;
 	int			error;
+	extern int xfs_ioc_trim(struct xfs_mount *mp, __uint32_t *argp);
 
 	if (filp->f_mode & FMODE_NOCMTIME)
 		ioflags |= IO_INVIS;
@@ -564,6 +565,8 @@
 	case XFS_IOC_ERROR_INJECTION:
 	case XFS_IOC_ERROR_CLEARALL:
 		return xfs_file_ioctl(filp, cmd, p);
+	case XFS_IOC_TRIM:
+		return xfs_ioc_trim(mp, arg);
 #ifndef BROKEN_X86_ALIGNMENT
 	/* These are handled fine if no alignment issues */
 	case XFS_IOC_ALLOCSP:
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/fs/xfs/xfs_alloc.h linux/fs/xfs/xfs_alloc.h
--- linux-2.6.31-rc6/fs/xfs/xfs_alloc.h	2009-06-09 23:05:27.000000000 -0400
+++ linux/fs/xfs/xfs_alloc.h	2009-08-16 09:20:15.167913313 -0400
@@ -215,4 +215,7 @@
 	xfs_fsblock_t	bno,	/* starting block number of extent */
 	xfs_extlen_t	len);	/* length of extent */
 
+int xfs_trim_extents(struct xfs_mount *mp, xfs_agnumber_t agno,
+	xfs_extlen_t minlen);
+
 #endif	/* __XFS_ALLOC_H__ */
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/fs/xfs/xfs_fs.h linux/fs/xfs/xfs_fs.h
--- linux-2.6.31-rc6/fs/xfs/xfs_fs.h	2009-08-16 09:16:39.017099926 -0400
+++ linux/fs/xfs/xfs_fs.h	2009-08-16 09:20:15.171246419 -0400
@@ -475,6 +475,7 @@
 #define XFS_IOC_ATTRMULTI_BY_HANDLE  _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq)
 #define XFS_IOC_FSGEOMETRY	     _IOR ('X', 124, struct xfs_fsop_geom)
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
+#define XFS_IOC_TRIM		     _IOR ('X', 126, __uint32_t)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
--- linux-2.6.31-rc6/fs/xfs/xfs_alloc.c	2009-06-09 23:05:27.000000000 -0400
+++ linux/fs/xfs/xfs_alloc.c	2009-08-16 09:44:51.073580438 -0400
@@ -39,6 +39,9 @@
 #include "xfs_alloc.h"
 #include "xfs_error.h"
 
+int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
+				sector_t nr_sects, gfp_t gfp_mask,
+				unsigned type, struct completion *completion);
 
 #define XFS_ABSDIFF(a,b)	(((a) <= (b)) ? ((b) - (a)) : ((a) - (b)))
 
@@ -2609,6 +2612,97 @@
 	return error;
 }
 
+STATIC int
+xfs_trim_extent(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		fbno,
+	xfs_extlen_t		flen)
+{
+	xfs_daddr_t		blkno = XFS_AGB_TO_DADDR(mp, agno, fbno);
+	sector_t		nblks = XFS_FSB_TO_BB(mp, flen);
+	int			error;
+	DECLARE_COMPLETION_ONSTACK(done);
+
+	xfs_fs_cmn_err(CE_NOTE, mp, "discarding sectors [0x%llx-0x%llx]",
+			blkno, (u64)nblks);
+
+	error = -__blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
+			blkno, nblks, GFP_NOFS, DISCARD_BARRIER, &done);
+	if (error && error != EOPNOTSUPP)
+		xfs_fs_cmn_err(CE_NOTE, mp, "discard failed, error %d", error);
+	return error;
+}
+
+/*
+ * Notify the underlying block device about our free extent map.
+ *
+ * This walks all free extents above a minimum threshold and notifies the
+ * underlying device that these blocks are unused.  That information is
+ * useful for SSDs or thinly provisioned storage in high end arrays or
+ * virtualization scenarios.
+ */
+int
+xfs_trim_extents(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_extlen_t		minlen)	/* minimum extent size to bother */
+{
+	struct xfs_btree_cur	*cur;	/* cursor for the by-block btree */
+	struct xfs_buf		*agbp;	/* AGF buffer pointer */
+	xfs_agblock_t		bno;	/* block the for next search */
+	xfs_agblock_t		fbno;	/* start block of found extent */
+	xfs_extlen_t		flen;	/* length of found extent */
+	int			error;
+	int			i;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		return error;
+
+	bno = 0;
+	for (;;) {
+		cur = xfs_allocbt_init_cursor(mp, NULL, agbp, agno,
+					      XFS_BTNUM_BNO);
+
+		error = xfs_alloc_lookup_ge(cur, bno, minlen, &i);
+		if (error)
+			goto error0;
+		if (!i) {
+			/*
+			 * No more free extents found: done.
+			 */
+			xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+			break;
+		}
+
+		error = xfs_alloc_get_rec(cur, &fbno, &flen, &i);
+		if (error)
+			goto error0;
+		XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
+
+		/*
+		 * Pass if the freespace extent isn't long enough to bother.
+		 */
+		if (flen >= minlen) {
+			error = xfs_trim_extent(mp, agno, fbno, flen);
+			if (error) {
+				xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+				break;
+			}
+		}
+
+		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+		bno = fbno + flen;
+	}
+
+out:
+	xfs_buf_relse(agbp);
+	return error;
+error0:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	goto out;
+}
 
 /*
  * AG Busy list management

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-16 13:59         ` Mark Lord
@ 2009-08-16 14:06           ` Mark Lord
  2009-08-16 14:23           ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Mark Lord @ 2009-08-16 14:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe,
	IDE/ATA development list

Mark Lord wrote:
..
> Slow, but presumably thorough.
> Subsequent runs were equally slow.
> 
> The problem is, it still issues TRIMs to the LLD one extent at a time.
> Compare this with doing it all in a single TRIM command
> with the wiper.sh script (filesystem unmounted):
> 
>     [~] time wiper.sh /dev/sdb3 --commit
>     
>     wiper.sh: Linux SATA SSD TRIM utility, version 1.9b, by Mark Lord.
>     Preparing for offline TRIM of free space on /dev/sdb3 (xfs 
> non-mounted).
>     This operation could destroy your data.  Are you sure (y/N)? y
>     Syncing disks..
>     Beginning TRIM operations..
>     Trimming 168 free extents encompassing 8793136 sectors (4294 MB)
>     Done.
>     
>     real    0m1.249s
>     user    0m0.110s
>     sys     0m0.063s
> 
> That includes the time for me to type 'y' and hit enter.  :)
..

For completeness, here's the same operation again,
except this time on the *mounted* xfs filesystem.
It won't be trimming quite as many blocks
(leaves 1% free space in reserve),
but otherwise is similar:

	[~] time wiper.sh /dev/sdb3 --commit
	
	wiper.sh: Linux SATA SSD TRIM utility, version 1.9b, by Mark Lord.
	Preparing for online TRIM of free space on /dev/sdb3 (xfs mounted read-write at /x).
	This operation could destroy your data.  Are you sure (y/N)? y
	Creating temporary file (4348405 KB)..
	Syncing disks..
	Beginning TRIM operations..
	Trimming 134 free extents encompassing 8696816 sectors (4246 MB)
	Removing temporary file..
	Syncing disks..
	Done.
	
	real    0m1.212s
	user    0m0.043s
	sys     0m0.053s


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-16 13:59         ` Mark Lord
  2009-08-16 14:06           ` Mark Lord
@ 2009-08-16 14:23           ` Christoph Hellwig
  2009-08-16 14:26             ` Mark Lord
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2009-08-16 14:23 UTC (permalink / raw)
  To: Mark Lord
  Cc: Christoph Hellwig, xfs, linux-fsdevel, linux-scsi, linux-kernel,
	jens.axboe, IDE/ATA development list

On Sun, Aug 16, 2009 at 09:59:32AM -0400, Mark Lord wrote:
> Okay, I got Matthews patches updated onto 2.6.31, and fixed the incompatibilities
> between those and the XFS TRIM patch (from Christoph), plus a sector_t printk issue.
>
> My apologies for attachments, but I am attaching the updated Christoph patch,
> as well as my hacked-up forward-port of Matthew's patches.
>
> Not pretty, but they work.  :)
>
> Now.. running Christoph's "xfs trim" on a 4.6GB mostly already-trimmed
> XFS partition gave this for the first time around:

> The problem is, it still issues TRIMs to the LLD one extent at a time.
> Compare this with doing it all in a single TRIM command
> with the wiper.sh script (filesystem unmounted):

I could do a variant which issues a single TRIM, but that would require
us to lock out all other allocations for the time the trim takes.  I'll
hack that up once I get some time.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-16 14:23           ` Christoph Hellwig
@ 2009-08-16 14:26             ` Mark Lord
  0 siblings, 0 replies; 27+ messages in thread
From: Mark Lord @ 2009-08-16 14:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe,
	IDE/ATA development list

Christoph Hellwig wrote:
> On Sun, Aug 16, 2009 at 09:59:32AM -0400, Mark Lord wrote:
>> Okay, I got Matthews patches updated onto 2.6.31, and fixed the incompatibilities
>> between those and the XFS TRIM patch (from Christoph), plus a sector_t printk issue.
>>
>> My apologies for attachments, but I am attaching the updated Christoph patch,
>> as well as my hacked-up forward-port of Matthew's patches.
>>
>> Not pretty, but they work.  :)
>>
>> Now.. running Christoph's "xfs trim" on a 4.6GB mostly already-trimmed
>> XFS partition gave this for the first time around:
> 
>> The problem is, it still issues TRIMs to the LLD one extent at a time.
>> Compare this with doing it all in a single TRIM command
>> with the wiper.sh script (filesystem unmounted):
> 
> I could do a variant which issues a single TRIM, but that would require
> us to lock out all other allocations for the time the trim takes.  I'll
> hack that up once I get some time.
..

Matthew's stuff will have to change to support that, too.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-16  0:47 [PATCH, RFC] xfs: batched discard support Christoph Hellwig
  2009-08-16  1:35 ` Mark Lord
@ 2009-08-19 20:39 ` Ingo Molnar
  2009-08-20  1:05   ` Christoph Hellwig
  2009-08-20  1:39   ` Mark Lord
  1 sibling, 2 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-08-19 20:39 UTC (permalink / raw)
  To: Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds
  Cc: xfs, linux-fsdevel, linux-scsi, linux-kernel, liml, jens.axboe


* Christoph Hellwig <hch@infradead.org> wrote:

> Given that everyone is so big in the discard discussion 
> I'd like to present what I had started to prepare for 
> XFS.  I didn't plan to send it out until I get my hands 
> onto a TRIM capable device (or at least get time to add 
> support to qemu), and so far it's only been tested in 
> dry-run mode.
> 
> The basic idea is to add an ioctl which walks the free 
> space btrees in each allocation group and simply 
> discard everythin that is free. [...]

A general interface design question: you added a new 
ioctl XFS_IOC_TRIM case. It's a sub-case of an 
ugly-looking demultiplexing xfs_file_ioctl().

What is your threshold for turning something into a 
syscall? When are ioctls acceptable in your opinion?

I'm asking this because we are facing a similar problem 
with perfcounters: we need to extend the ioctl 
functionality there but introducing a new syscall looks 
wasteful.

So i'm torn about the 'syscall versus ioctl' issue, i'd 
like to avoid making interface design mistakes and i'd 
like to solicit some opinions about this. I've attached 
the perfcounters ioctl patch below.

Thanks,

	Ingo

----- Forwarded message from Peter Zijlstra <a.p.zijlstra@chello.nl> -----

Date: Wed, 19 Aug 2009 11:18:27 +0200
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Ingo Molnar <mingo@elte.hu>, Paul Mackerras <paulus@samba.org>
Subject: [PATCH 4/4][RFC] perf_counter: Allow sharing of output channels
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Mike Galbraith <efault@gmx.de>, linux-kernel@vger.kernel.org,
	stephane eranian <eranian@googlemail.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>

Provide the ability to configure a counter to send its output to
another (already existing) counter's output stream.

[ compile tested only ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stephane eranian <eranian@googlemail.com>
---
 include/linux/perf_counter.h |    5 ++
 kernel/perf_counter.c        |   83 +++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 85 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -216,6 +216,7 @@ struct perf_counter_attr {
 #define PERF_COUNTER_IOC_REFRESH	_IO ('$', 2)
 #define PERF_COUNTER_IOC_RESET		_IO ('$', 3)
 #define PERF_COUNTER_IOC_PERIOD		_IOW('$', 4, u64)
+#define PERF_COUNTER_IOC_SET_OUTPUT	_IO ('$', 5)
 
 enum perf_counter_ioc_flags {
 	PERF_IOC_FLAG_GROUP		= 1U << 0,
@@ -415,6 +416,9 @@ enum perf_callchain_context {
 	PERF_CONTEXT_MAX		= (__u64)-4095,
 };
 
+#define PERF_FLAG_FD_NO_GROUP	(1U << 0)
+#define PERF_FLAG_FD_OUTPUT	(1U << 1)
+
 #ifdef __KERNEL__
 /*
  * Kernel-internal data types and definitions:
@@ -536,6 +540,7 @@ struct perf_counter {
 	struct list_head		sibling_list;
 	int				nr_siblings;
 	struct perf_counter		*group_leader;
+	struct perf_counter		*output;
 	const struct pmu		*pmu;
 
 	enum perf_counter_active_state	state;
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1686,6 +1686,11 @@ static void free_counter(struct perf_cou
 			atomic_dec(&nr_task_counters);
 	}
 
+	if (counter->output) {
+		fput(counter->output->filp);
+		counter->output = NULL;
+	}
+
 	if (counter->destroy)
 		counter->destroy(counter);
 
@@ -1971,6 +1976,8 @@ unlock:
 	return ret;
 }
 
+int perf_counter_set_output(struct perf_counter *counter, int output_fd);
+
 static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct perf_counter *counter = file->private_data;
@@ -1994,6 +2001,9 @@ static long perf_ioctl(struct file *file
 	case PERF_COUNTER_IOC_PERIOD:
 		return perf_counter_period(counter, (u64 __user *)arg);
 
+	case PERF_COUNTER_IOC_SET_OUTPUT:
+		return perf_counter_set_output(counter, arg);
+
 	default:
 		return -ENOTTY;
 	}
@@ -2264,6 +2274,11 @@ static int perf_mmap(struct file *file, 
 
 	WARN_ON_ONCE(counter->ctx->parent_ctx);
 	mutex_lock(&counter->mmap_mutex);
+	if (counter->output) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
 	if (atomic_inc_not_zero(&counter->mmap_count)) {
 		if (nr_pages != counter->data->nr_pages)
 			ret = -EINVAL;
@@ -2649,6 +2664,7 @@ static int perf_output_begin(struct perf
 			     struct perf_counter *counter, unsigned int size,
 			     int nmi, int sample)
 {
+	struct perf_counter *output_counter;
 	struct perf_mmap_data *data;
 	unsigned int offset, head;
 	int have_lost;
@@ -2658,13 +2674,17 @@ static int perf_output_begin(struct perf
 		u64			 lost;
 	} lost_event;
 
+	rcu_read_lock();
 	/*
 	 * For inherited counters we send all the output towards the parent.
 	 */
 	if (counter->parent)
 		counter = counter->parent;
 
-	rcu_read_lock();
+	output_counter = rcu_dereference(counter->output);
+	if (output_counter)
+		counter = output_counter;
+
 	data = rcu_dereference(counter->data);
 	if (!data)
 		goto out;
@@ -4214,6 +4234,57 @@ err_size:
 	goto out;
 }
 
+int perf_counter_set_output(struct perf_counter *counter, int output_fd)
+{
+	struct perf_counter *output_counter = NULL;
+	struct file *output_file = NULL;
+	struct perf_counter *old_output;
+	int fput_needed = 0;
+	int ret = -EINVAL;
+
+	if (!output_fd)
+		goto set;
+
+	output_file = fget_light(output_fd, &fput_needed);
+	if (!output_file)
+		return -EBADF;
+
+	if (output_file->f_op != &perf_fops)
+		goto out;
+
+	output_counter = output_file->private_data;
+
+	/* Don't chain output fds */
+	if (output_counter->output)
+		goto out;
+
+	/* Don't set an output fd when we already have an output channel */
+	if (counter->data)
+		goto out;
+
+	atomic_long_inc(&output_file->f_count);
+
+set:
+	mutex_lock(&counter->mmap_mutex);
+	old_output = counter->output;
+	rcu_assign_pointer(counter->output, output_counter);
+	mutex_unlock(&counter->mmap_mutex);
+
+	if (old_output) {
+		/*
+		 * we need to make sure no existing perf_output_*()
+		 * is still referencing this counter.
+		 */
+		synchronize_rcu();
+		fput(old_output->filp);
+	}
+
+	ret = 0;
+out:
+	fput_light(output_file, fput_needed);
+	return ret;
+}
+
 /**
  * sys_perf_counter_open - open a performance counter, associate it to a task/cpu
  *
@@ -4236,7 +4307,7 @@ SYSCALL_DEFINE5(perf_counter_open,
 	int ret;
 
 	/* for future expandability... */
-	if (flags)
+	if (flags & ~(PERF_FLAG_FD_NO_GROUP | PERF_FLAG_FD_OUTPUT))
 		return -EINVAL;
 
 	ret = perf_copy_attr(attr_uptr, &attr);
@@ -4264,7 +4335,7 @@ SYSCALL_DEFINE5(perf_counter_open,
 	 * Look up the group leader (we will attach this counter to it):
 	 */
 	group_leader = NULL;
-	if (group_fd != -1) {
+	if (group_fd != -1 && !(flags & PERF_FLAG_FD_NO_GROUP)) {
 		ret = -EINVAL;
 		group_file = fget_light(group_fd, &fput_needed);
 		if (!group_file)
@@ -4306,6 +4377,12 @@ SYSCALL_DEFINE5(perf_counter_open,
 	if (!counter_file)
 		goto err_free_put_context;
 
+	if (flags & PERF_FLAG_FD_OUTPUT) {
+		ret = perf_counter_set_output(counter, group_fd);
+		if (ret)
+			goto err_free_put_context;
+	}
+
 	counter->filp = counter_file;
 	WARN_ON_ONCE(ctx->parent_ctx);
 	mutex_lock(&ctx->mutex);

-- 

----- End forwarded message -----

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-19 20:39 ` Ingo Molnar
@ 2009-08-20  1:05   ` Christoph Hellwig
  2009-08-20  1:10     ` Jamie Lokier
  2009-08-21 12:46     ` Ingo Molnar
  2009-08-20  1:39   ` Mark Lord
  1 sibling, 2 replies; 27+ messages in thread
From: Christoph Hellwig @ 2009-08-20  1:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds,
	xfs, linux-fsdevel, linux-scsi, linux-kernel, liml, jens.axboe

On Wed, Aug 19, 2009 at 10:39:16PM +0200, Ingo Molnar wrote:
> A general interface design question: you added a new 
> ioctl XFS_IOC_TRIM case. It's a sub-case of an 
> ugly-looking demultiplexing xfs_file_ioctl().

ioctl is per defintion a multiplexer.

> What is your threshold for turning something into a 
> syscall? When are ioctls acceptable in your opinion?
> 
> I'm asking this because we are facing a similar problem 
> with perfcounters: we need to extend the ioctl 
> functionality there but introducing a new syscall looks 
> wasteful.
> 
> So i'm torn about the 'syscall versus ioctl' issue, i'd 
> like to avoid making interface design mistakes and i'd 
> like to solicit some opinions about this. I've attached 
> the perfcounters ioctl patch below.

Only add a syscall if it has _one_ clear defined purpose,
which has kernel-wide meaning.

Do not add an syscall that is just another multiplexer without
structure. Most likely it will just be even worse than sys_ioctl.

Also really don't bother adding a system call that is specific to
one singler driver or filesystem.  Besides horrible logistics -
you'd need some always built-in stub calling out to the possibly
modular drivers/filesystem - it also simply doesn't make any semantical
sense.  I can't say I like the ioctl use in perfcounters much,
but adding a special syscalls instead would be even more horrible.

As for the trim support this really just was an RFC to start bringing
some code into play instead of the endless masturbation about hat code
that doesn't exist happens on hardware most people don't have.  The
interface will most ceetainly change and I hope we will have a common
interface for all filesystems (or at least those that care).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-20  1:05   ` Christoph Hellwig
@ 2009-08-20  1:10     ` Jamie Lokier
  2009-08-20  1:38       ` Douglas Gilbert
  2009-08-20  1:38       ` Mark Lord
  2009-08-21 12:46     ` Ingo Molnar
  1 sibling, 2 replies; 27+ messages in thread
From: Jamie Lokier @ 2009-08-20  1:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ingo Molnar, Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs,
	linux-fsdevel, linux-scsi, linux-kernel, liml, jens.axboe

Christoph Hellwig wrote:
> > So i'm torn about the 'syscall versus ioctl' issue, i'd 
> > like to avoid making interface design mistakes and i'd 
> > like to solicit some opinions about this. I've attached 
> > the perfcounters ioctl patch below.
> 
> Only add a syscall if it has _one_ clear defined purpose,
> which has kernel-wide meaning.

One clear defined purpose which comes to mind is a "trim" or "punch"
system call, for making holes in files as well as trimming block
devices.  Several other OSes have that capability on files.

I don't remember - does TRIM guarantee the blocks read zeros afterwards?

It would be tidy if it does, as it could have the same meaning with files.

-- Jamie


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-20  1:10     ` Jamie Lokier
@ 2009-08-20  1:38       ` Douglas Gilbert
  2009-08-20  1:38       ` Mark Lord
  1 sibling, 0 replies; 27+ messages in thread
From: Douglas Gilbert @ 2009-08-20  1:38 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Christoph Hellwig, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel,
	liml, jens.axboe

Jamie Lokier wrote:
> Christoph Hellwig wrote:
>>> So i'm torn about the 'syscall versus ioctl' issue, i'd 
>>> like to avoid making interface design mistakes and i'd 
>>> like to solicit some opinions about this. I've attached 
>>> the perfcounters ioctl patch below.
>> Only add a syscall if it has _one_ clear defined purpose,
>> which has kernel-wide meaning.
> 
> One clear defined purpose which comes to mind is a "trim" or "punch"
> system call, for making holes in files as well as trimming block
> devices.  Several other OSes have that capability on files.
> 
> I don't remember - does TRIM guarantee the blocks read zeros afterwards?
> 
> It would be tidy if it does, as it could have the same meaning with files.

Both ATA and SCSI leave it up to the device to decide
whether reads after a trim on a logical block are
determinate (i.e. zeroes are returned) or indeterminate.

In the case of ATA the IDENTIFY DEVICE data word 69 bit
14 gives the indication. In the case of SCSI the
READ CAPACITY(16) command response TPRZ bit gives the
indication. In both case a value of one indicates
determinate.

Doug Gilbert



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-20  1:10     ` Jamie Lokier
  2009-08-20  1:38       ` Douglas Gilbert
@ 2009-08-20  1:38       ` Mark Lord
  1 sibling, 0 replies; 27+ messages in thread
From: Mark Lord @ 2009-08-20  1:38 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Christoph Hellwig, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel,
	jens.axboe

Jamie Lokier wrote:
..
> I don't remember - does TRIM guarantee the blocks read zeros afterwards?
..

No, it doesn't.

A drive can optionally support "deterministic TRIM", whereby it will return
consistent data for any given trimmed sector afterwards, but that doesn't mean zeros.

-ml

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-20  1:05   ` Christoph Hellwig
  2009-08-20  1:10     ` Jamie Lokier
@ 2009-08-21 12:46     ` Ingo Molnar
  1 sibling, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-08-21 12:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs,
	linux-fsdevel, linux-scsi, linux-kernel, liml, jens.axboe


* Christoph Hellwig <hch@infradead.org> wrote:

> On Wed, Aug 19, 2009 at 10:39:16PM +0200, Ingo Molnar wrote:
> > A general interface design question: you added a new 
> > ioctl XFS_IOC_TRIM case. It's a sub-case of an 
> > ugly-looking demultiplexing xfs_file_ioctl().
> 
> ioctl is per defintion a multiplexer.

Yes, and? There's two variants of multiplexing:

  - multiplex something rather straightforward and related
  - multiplex unrelated fields, data, structures

I consider the second one 'ugly', the first one 'ok-ish'. YMMV.

> > What is your threshold for turning something into a syscall? 
> > When are ioctls acceptable in your opinion?
> > 
> > I'm asking this because we are facing a similar problem with 
> > perfcounters: we need to extend the ioctl functionality there 
> > but introducing a new syscall looks wasteful.
> > 
> > So i'm torn about the 'syscall versus ioctl' issue, i'd like to 
> > avoid making interface design mistakes and i'd like to solicit 
> > some opinions about this. I've attached the perfcounters ioctl 
> > patch below.
> 
> Only add a syscall if it has _one_ clear defined purpose, which 
> has kernel-wide meaning.
> 
> Do not add an syscall that is just another multiplexer without 
> structure. Most likely it will just be even worse than sys_ioctl.
> 
> Also really don't bother adding a system call that is specific to 
> one singler driver or filesystem.  Besides horrible logistics - 
> you'd need some always built-in stub calling out to the possibly 
> modular drivers/filesystem - it also simply doesn't make any 
> semantical sense.  I can't say I like the ioctl use in 
> perfcounters much, but adding a special syscalls instead would be 
> even more horrible.
> 
> As for the trim support this really just was an RFC to start 
> bringing some code into play instead of the endless masturbation 
> about hat code that doesn't exist happens on hardware most people 
> don't have.  The interface will most ceetainly change and I hope 
> we will have a common interface for all filesystems (or at least 
> those that care).

Okay, i'm confused.

I'd like to understand the technical basis of your critisism and i'd 
like to address any deficiencies of the perfcounters code. You said 
you dont like the ioctl solution we have, but that you'd like a 
separate syscall even less.

Perfcounters are a kernel-wide concept, encompassing 100% of all 
Linux installations, not just some special hardware.

So by your own standard above they seem to be more than eligible for 
system calls (i hope i'm not mis-stating it), as long as they are 
cleanly structured. Yet you dont like the interface nor any future 
pushing of the currently ioctl bits of the interface into syscalls.

Is there any other interface form you'd like more?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-19 20:39 ` Ingo Molnar
  2009-08-20  1:05   ` Christoph Hellwig
@ 2009-08-20  1:39   ` Mark Lord
  2009-08-20 13:48     ` Ric Wheeler
  1 sibling, 1 reply; 27+ messages in thread
From: Mark Lord @ 2009-08-20  1:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Peter Zijlstra, Paul Mackerras, Linus Torvalds,
	xfs, linux-fsdevel, linux-scsi, linux-kernel, jens.axboe,
	IDE/ATA development list

[resending, after fixing the Cc: list; somebody trimmed it earlier]

Jamie Lokier wrote:
..
> I don't remember - does TRIM guarantee the blocks read zeros afterwards?
..

No, it doesn't.

A drive can optionally support "deterministic TRIM", whereby it will return
consistent data for any given trimmed sector afterwards, but that doesn't mean zeros.

-ml

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-20  1:39   ` Mark Lord
@ 2009-08-20 13:48     ` Ric Wheeler
  2009-08-20 14:38       ` Mark Lord
  2009-08-20 14:58       ` Douglas Gilbert
  0 siblings, 2 replies; 27+ messages in thread
From: Ric Wheeler @ 2009-08-20 13:48 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ingo Molnar, Christoph Hellwig, Peter Zijlstra, Paul Mackerras,
	Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel,
	jens.axboe, IDE/ATA development list, Neil Brown

On 08/19/2009 09:39 PM, Mark Lord wrote:
> [resending, after fixing the Cc: list; somebody trimmed it earlier]
>
> Jamie Lokier wrote:
> ..
>> I don't remember - does TRIM guarantee the blocks read zeros afterwards?
> ..
>
> No, it doesn't.
>
> A drive can optionally support "deterministic TRIM", whereby it will 
> return
> consistent data for any given trimmed sector afterwards, but that 
> doesn't mean zeros.
>
> -ml

Note that returning consistent data is critical for devices that are 
used in a RAID group since you will need each RAID block that is used to 
compute the parity to continue to return the same data until you 
overwrite it with new data :-)

If we have a device that does not support this (or is misconfigured not 
to do this), we should not use those devices in an MD group & do discard 
against it...

ric


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-20 13:48     ` Ric Wheeler
@ 2009-08-20 14:38       ` Mark Lord
  2009-08-20 14:42         ` Ric Wheeler
                           ` (2 more replies)
  2009-08-20 14:58       ` Douglas Gilbert
  1 sibling, 3 replies; 27+ messages in thread
From: Mark Lord @ 2009-08-20 14:38 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Ingo Molnar, Christoph Hellwig, Peter Zijlstra, Paul Mackerras,
	Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel,
	jens.axboe, IDE/ATA development list, Neil Brown

Ric Wheeler wrote:
>
> Note that returning consistent data is critical for devices that are 
> used in a RAID group since you will need each RAID block that is used to 
> compute the parity to continue to return the same data until you 
> overwrite it with new data :-)
> 
> If we have a device that does not support this (or is misconfigured not 
> to do this), we should not use those devices in an MD group & do discard 
> against it...
..

Well, that's a bit drastic.  But the RAID software should at least
not issue TRIM commands in ignorance of such.

Would it still be okay to do the TRIMs when the entire parity stripe
(across all members) is being discarded?  (As opposed to just partial
data there being dropped)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-20 14:38       ` Mark Lord
@ 2009-08-20 14:42         ` Ric Wheeler
  2009-08-20 17:19           ` Greg Freemyer
  2009-08-20 14:42         ` James Bottomley
  2009-08-20 15:43         ` Rolf Eike Beer
  2 siblings, 1 reply; 27+ messages in thread
From: Ric Wheeler @ 2009-08-20 14:42 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ingo Molnar, Christoph Hellwig, Peter Zijlstra, Paul Mackerras,
	Linus Torvalds, xfs, linux-fsdevel, linux-scsi, linux-kernel,
	jens.axboe, IDE/ATA development list, Neil Brown

On 08/20/2009 10:38 AM, Mark Lord wrote:
> Ric Wheeler wrote:
>>
>> Note that returning consistent data is critical for devices that are
>> used in a RAID group since you will need each RAID block that is used
>> to compute the parity to continue to return the same data until you
>> overwrite it with new data :-)
>>
>> If we have a device that does not support this (or is misconfigured
>> not to do this), we should not use those devices in an MD group & do
>> discard against it...
> ..
>
> Well, that's a bit drastic. But the RAID software should at least
> not issue TRIM commands in ignorance of such.

If the storage can return different data in a sequence of READ requests of the 
same sector (with no writes), there is nothing RAID could do. It would see total 
garbage...

> Would it still be okay to do the TRIMs when the entire parity stripe
> (across all members) is being discarded? (As opposed to just partial
> data there being dropped)

This should be safe if the MD bitmaps would prevent us from trying to 
READ/regenerate parity for that stripe...

ric

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-20 14:42         ` Ric Wheeler
@ 2009-08-20 17:19           ` Greg Freemyer
  0 siblings, 0 replies; 27+ messages in thread
From: Greg Freemyer @ 2009-08-20 17:19 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Mark Lord, Ingo Molnar, Christoph Hellwig, Peter Zijlstra,
	Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi,
	linux-kernel, jens.axboe, IDE/ATA development list, Neil Brown

On Thu, Aug 20, 2009 at 10:42 AM, Ric Wheeler<rwheeler@redhat.com> wrote:
> On 08/20/2009 10:38 AM, Mark Lord wrote:
>>
>> Ric Wheeler wrote:
>>>
>>> Note that returning consistent data is critical for devices that are
>>> used in a RAID group since you will need each RAID block that is used
>>> to compute the parity to continue to return the same data until you
>>> overwrite it with new data :-)
>>>
>>> If we have a device that does not support this (or is misconfigured
>>> not to do this), we should not use those devices in an MD group & do
>>> discard against it...
>>
>> ..
>>
>> Well, that's a bit drastic. But the RAID software should at least
>> not issue TRIM commands in ignorance of such.
>
> If the storage can return different data in a sequence of READ requests of
> the same sector (with no writes), there is nothing RAID could do. It would
> see total garbage...
>
>> Would it still be okay to do the TRIMs when the entire parity stripe
>> (across all members) is being discarded? (As opposed to just partial
>> data there being dropped)
>
> This should be safe if the MD bitmaps would prevent us from trying to
> READ/regenerate parity for that stripe...
>
> ric

The harder thing for mdraid is putting a stripe back in service.  If
even a single sector is written to a "discarded" stripe, the entire
stripe has to be written with determinate data that has the right
parity.

ie. Only full stripes can be discarded and only full-stripes can be
put back in service.

Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
Preservation and Forensic processing of Exchange Repositories White Paper -
<http://www.norcrossgroup.com/forms/whitepapers/tng_whitepaper_fpe.html>

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-20 14:38       ` Mark Lord
  2009-08-20 14:42         ` Ric Wheeler
@ 2009-08-20 14:42         ` James Bottomley
  2009-08-20 15:43         ` Rolf Eike Beer
  2 siblings, 0 replies; 27+ messages in thread
From: James Bottomley @ 2009-08-20 14:42 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ric Wheeler, Ingo Molnar, Christoph Hellwig, Peter Zijlstra,
	Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi,
	linux-kernel, jens.axboe, IDE/ATA development list, Neil Brown

On Thu, 2009-08-20 at 10:38 -0400, Mark Lord wrote:
> Would it still be okay to do the TRIMs when the entire parity stripe
> (across all members) is being discarded?  (As opposed to just partial
> data there being dropped)

Not really.  The problem is that array verification is done at the block
level not the fs level (although, I suppose, we could change that).  So
a fully discarded stripe still has to verify OK (as in what's read for
the parity must match what's read for the data).  All of this is the
reason for the TPRZ bit for SCSI UNMAP ... and why WRITE_SAME is also
under consideration for discards in T10.

James

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-20 14:38       ` Mark Lord
  2009-08-20 14:42         ` Ric Wheeler
  2009-08-20 14:42         ` James Bottomley
@ 2009-08-20 15:43         ` Rolf Eike Beer
  2009-08-20 17:00           ` Ric Wheeler
  2 siblings, 1 reply; 27+ messages in thread
From: Rolf Eike Beer @ 2009-08-20 15:43 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ric Wheeler, Ingo Molnar, Christoph Hellwig, Peter Zijlstra,
	Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi,
	linux-kernel, jens.axboe, IDE/ATA development list, Neil Brown

[-- Attachment #1: Type: Text/Plain, Size: 2366 bytes --]

Mark Lord wrote:
> Ric Wheeler wrote:
> > Note that returning consistent data is critical for devices that are
> > used in a RAID group since you will need each RAID block that is used to
> > compute the parity to continue to return the same data until you
> > overwrite it with new data :-)
> >
> > If we have a device that does not support this (or is misconfigured not
> > to do this), we should not use those devices in an MD group & do discard
> > against it...
>
> ..
>
> Well, that's a bit drastic.  But the RAID software should at least
> not issue TRIM commands in ignorance of such.
>
> Would it still be okay to do the TRIMs when the entire parity stripe
> (across all members) is being discarded?  (As opposed to just partial
> data there being dropped)

I think there might be a related usecase that could benefit from 
TRIM/UNMAP/whatever support in file systems even if the physical devices do 
not support that. I have a RAID5 at work with LVM over it. This week I deleted 
an old logical volume of some 200GB that has been moved to a different volume 
group, tomorrow I will start to replace all the disks in the raid with bigger 
ones. So if the LVM told the raid "hey, this space is totally garbage from now 
on" the raid would not have to do any calculation when it has to rebuild that 
but could simply write fixed patterns to all disks (e.g. 0 to first data, 0 to 
second data and 0 as "0 xor 0" to parity). With the knowledge that some of the 
underlying devices would support "write all to zero" this operation could be 
speed up even more, with "write all fixed pattern" every unused chunk would go 
down to a single write operation (per disk) on rebuild regardless which parity 
algorithm is used.

And even if things are in use the RAID can benefit from such things. If we 
just define that every unmapped space will always be 0 when read and I write 
to a raid volume and the other part of the checksum calculation is unmapped 
checksumming becomes easy as we already know half of the values before: 0. So 
we can save the reads from the second data stripe and most of the calculation.
"dd if=/dev/md0" on an unmapped space is more or less the same as "dd 
if=/dev/zero" than.

I only fear that these things are too obviously as I would be the first to 
have this idea ;)

Greetings,

Eike

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-20 15:43         ` Rolf Eike Beer
@ 2009-08-20 17:00           ` Ric Wheeler
  0 siblings, 0 replies; 27+ messages in thread
From: Ric Wheeler @ 2009-08-20 17:00 UTC (permalink / raw)
  To: Rolf Eike Beer
  Cc: Mark Lord, Ric Wheeler, Ingo Molnar, Christoph Hellwig,
	Peter Zijlstra, Paul Mackerras, Linus Torvalds, xfs,
	linux-fsdevel, linux-scsi, linux-kernel, jens.axboe,
	IDE/ATA development list, Neil Brown

On 08/20/2009 11:43 AM, Rolf Eike Beer wrote:
> Mark Lord wrote:
>    
>> Ric Wheeler wrote:
>>      
>>> Note that returning consistent data is critical for devices that are
>>> used in a RAID group since you will need each RAID block that is used to
>>> compute the parity to continue to return the same data until you
>>> overwrite it with new data :-)
>>>
>>> If we have a device that does not support this (or is misconfigured not
>>> to do this), we should not use those devices in an MD group&  do discard
>>> against it...
>>>        
>> ..
>>
>> Well, that's a bit drastic.  But the RAID software should at least
>> not issue TRIM commands in ignorance of such.
>>
>> Would it still be okay to do the TRIMs when the entire parity stripe
>> (across all members) is being discarded?  (As opposed to just partial
>> data there being dropped)
>>      
> I think there might be a related usecase that could benefit from
> TRIM/UNMAP/whatever support in file systems even if the physical devices do
> not support that. I have a RAID5 at work with LVM over it. This week I deleted
> an old logical volume of some 200GB that has been moved to a different volume
> group, tomorrow I will start to replace all the disks in the raid with bigger
> ones. So if the LVM told the raid "hey, this space is totally garbage from now
> on" the raid would not have to do any calculation when it has to rebuild that
> but could simply write fixed patterns to all disks (e.g. 0 to first data, 0 to
> second data and 0 as "0 xor 0" to parity). With the knowledge that some of the
> underlying devices would support "write all to zero" this operation could be
> speed up even more, with "write all fixed pattern" every unused chunk would go
> down to a single write operation (per disk) on rebuild regardless which parity
> algorithm is used.
>    

In the SCSI world, RAID array vendors use "WRITE_SAME" to do this. For 
the SCSI discard, the write same command has a discard bit set if I 
remember correctly so you basically get what you are describing above.

ric

> And even if things are in use the RAID can benefit from such things. If we
> just define that every unmapped space will always be 0 when read and I write
> to a raid volume and the other part of the checksum calculation is unmapped
> checksumming becomes easy as we already know half of the values before: 0. So
> we can save the reads from the second data stripe and most of the calculation.
> "dd if=/dev/md0" on an unmapped space is more or less the same as "dd
> if=/dev/zero" than.
>
> I only fear that these things are too obviously as I would be the first to
> have this idea ;)
>
> Greetings,
>
> Eike
>    


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH, RFC] xfs: batched discard support
  2009-08-20 13:48     ` Ric Wheeler
  2009-08-20 14:38       ` Mark Lord
@ 2009-08-20 14:58       ` Douglas Gilbert
  1 sibling, 0 replies; 27+ messages in thread
From: Douglas Gilbert @ 2009-08-20 14:58 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Mark Lord, Ingo Molnar, Christoph Hellwig, Peter Zijlstra,
	Paul Mackerras, Linus Torvalds, xfs, linux-fsdevel, linux-scsi,
	linux-kernel, jens.axboe, IDE/ATA development list, Neil Brown

Ric Wheeler wrote:
> On 08/19/2009 09:39 PM, Mark Lord wrote:
>> [resending, after fixing the Cc: list; somebody trimmed it earlier]
>>
>> Jamie Lokier wrote:
>> ..
>>> I don't remember - does TRIM guarantee the blocks read zeros afterwards?
>> ..
>>
>> No, it doesn't.
>>
>> A drive can optionally support "deterministic TRIM", whereby it will 
>> return
>> consistent data for any given trimmed sector afterwards, but that 
>> doesn't mean zeros.
>>
>> -ml
> 
> Note that returning consistent data is critical for devices that are 
> used in a RAID group since you will need each RAID block that is used to 
> compute the parity to continue to return the same data until you 
> overwrite it with new data :-)
> 
> If we have a device that does not support this (or is misconfigured not 
> to do this), we should not use those devices in an MD group & do discard 
> against it...

A closer reading of d2015r2-ATAATAPI_Command_Set_-_2_ACS-2.pdf
section 7.10.3.2 (latest ACS-2 draft from www.t13.org) shows
that there are 3 possible variants for data read from a logical
block that has been trimmed (or "unmapped"):
    a) indeterminate
    b) determinate
    c) determinate, return all zeroes

In the case of b) the same data is returned for each
subsequent read. And that data must not be something
that has previously be written to some other LBA!

In the case of SCSI (sbc3r19.pdf) case b) is not
supported (very sensibly IMO).


Another difference I noticed between SCSI and ATA
drafts is with the SECURITY ERASE UNIT command which
is somewhat similar to the SCSI FORMAT UNIT command
(which includes a security erase option). The ATA
draft says that all blocks are determinate ("mapped"
in the SCSI state model) after a SECURITY ERASE UNIT.
The SCSI draft says that all logical blocks may be
unmapped after FORMAT UNIT.

Doug Gilbert

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2009-08-21 12:46 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-08-16  0:47 [PATCH, RFC] xfs: batched discard support Christoph Hellwig
2009-08-16  1:35 ` Mark Lord
2009-08-16  2:19   ` Mark Lord
2009-08-16  2:25     ` Christoph Hellwig
2009-08-16  2:49       ` Mark Lord
2009-08-16  3:25         ` Mark Lord
2009-08-16 13:00       ` Mark Lord
2009-08-16 13:53         ` Christoph Hellwig
2009-08-16 13:59         ` Mark Lord
2009-08-16 14:06           ` Mark Lord
2009-08-16 14:23           ` Christoph Hellwig
2009-08-16 14:26             ` Mark Lord
2009-08-19 20:39 ` Ingo Molnar
2009-08-20  1:05   ` Christoph Hellwig
2009-08-20  1:10     ` Jamie Lokier
2009-08-20  1:38       ` Douglas Gilbert
2009-08-20  1:38       ` Mark Lord
2009-08-21 12:46     ` Ingo Molnar
2009-08-20  1:39   ` Mark Lord
2009-08-20 13:48     ` Ric Wheeler
2009-08-20 14:38       ` Mark Lord
2009-08-20 14:42         ` Ric Wheeler
2009-08-20 17:19           ` Greg Freemyer
2009-08-20 14:42         ` James Bottomley
2009-08-20 15:43         ` Rolf Eike Beer
2009-08-20 17:00           ` Ric Wheeler
2009-08-20 14:58       ` Douglas Gilbert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).