linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
@ 2014-03-21  4:30 Darrick J. Wong
  2014-03-21  4:30 ` [PATCH 1/5] fs/bio-integrity: remove duplicate code Darrick J. Wong
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Darrick J. Wong @ 2014-03-21  4:30 UTC (permalink / raw)
  To: axboe, martin.petersen, darrick.wong, JBottomley, bcrl, viro
  Cc: linux-fsdevel, linux-aio, linux-scsi, linux-mm

This RFC provides a rough implementation of a mechanism to allow
userspace to attach protection information (e.g. T10 DIF) data to a
disk write and to receive the information alongside a disk read.  The
interface is an extension to the AIO interface: two new commands
(IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
arg list is interpreted to point to a buffer containing a header,
followed by the the PI data.  These patches are against 3.14-rc7.

The first patch is a little bit of code refactoring, as sent in by Gu
Zheng.  It seems to be queued up for 3.15, so I figured I might as well
start from there.

Patch #2 provides the plumbing to get the user's buffer all the way to
the block integrity code.  I'm not quite sure if the mechanism I took
(passing the results of get_user_pages around) actually works in all
cases (such as the user's buffer being swapped out), but it survives
a simple test.  Due to the way that the code deals with the array of
struct page*s that represent the PI buffer, there's an unfortunate
requirement that no PI tuple may cross a page boundary.  Given that
so far DIF is only 8 or 16 bytes this isn't a problem... yet.  There's
also no explicit fallback for the case where the user pages are not
within a device's DMA range.

Patch #3 builds on the previous patch to allow userspace to send some
flags along with the PI buffer.  The integrity provider now has a
"mod_user_buf_fn" hook that enables the provider to read the userspace
flags and modify the PI buffer before submit_bio.  For now, this means
that T10/DIF provider can be told to patch any of the reference, app,
or guard tags.  This is useful for sending PI data with an IO request
for a file on a filesystem, since the kernel can patch in the device's
LBA later.  Also it means that if you only care about, say, app tags,
you can provide those and let the kernel take care of the crc and the
LBA.  I don't know if that's anyone's requirement, but there we are.

Patch #4 provides a mechanism for integrity providers to advertise
both the per-logical-block PI buffer size and the flags that can be
passed to the mod_user_buf_fn hook.  The advertisements can be found
in sysfs, since that's where we present all the other PI details about
a device.

Patch #5 removes redundant code and modifies the tag get/set functions
to follow the other new functions and kmap/unmap the PI buffer page(s)
before messing with the PI buffers, instead of relying on pi_buf being
a valid pointer.

Comments and questions are, as always, welcome.  There will be a
session about this on the second day of LSF/MM, if I'm not mistaken.
A sample program follows this message.

$ cc -o prog prog.c
$ ./prog -rw -p r -s 2048 /path/to/pi/device

--D

/*
 * Userspace DIX API test program
 * Licensed under GPLv2. Copyright 2014 Oracle.
 *
 * XXX: We don't query the kernel for this information like we should!
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <libaio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <errno.h>
#include <stdlib.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#define IOCB_CMD_PREADVM	(9)
#define IOCB_CMD_PWRITEVM	(10)
#define GENERATE_GUARD	(1)
#define GENERATE_REF	(2)
#define GENERATE_APP	(4)
#define GENERATE_ALL	(7)

#define NR_IOS	(1)

static void dump_buffer(char *buf, size_t len)
{
	size_t off;
	char *p;

	for (p = buf; p < buf + len; p++) {
		off = p - buf;
		if (off % 32 == 0) {
			if (p != buf)
				printf("\n");
			printf("%05zu:", off);
		}
		printf(" %02x", *p & 0xFF);
	}
	printf("\n");
}

/* Table generated using the following polynomium:
 * x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
 * gt: 0x8bb7
 */
static const uint16_t t10_dif_crc_table[256] = {
	0x0000, 0x8BB7, 0x9CD9, 0x176E, 0xB205, 0x39B2, 0x2EDC, 0xA56B,
	0xEFBD, 0x640A, 0x7364, 0xF8D3, 0x5DB8, 0xD60F, 0xC161, 0x4AD6,
	0x54CD, 0xDF7A, 0xC814, 0x43A3, 0xE6C8, 0x6D7F, 0x7A11, 0xF1A6,
	0xBB70, 0x30C7, 0x27A9, 0xAC1E, 0x0975, 0x82C2, 0x95AC, 0x1E1B,
	0xA99A, 0x222D, 0x3543, 0xBEF4, 0x1B9F, 0x9028, 0x8746, 0x0CF1,
	0x4627, 0xCD90, 0xDAFE, 0x5149, 0xF422, 0x7F95, 0x68FB, 0xE34C,
	0xFD57, 0x76E0, 0x618E, 0xEA39, 0x4F52, 0xC4E5, 0xD38B, 0x583C,
	0x12EA, 0x995D, 0x8E33, 0x0584, 0xA0EF, 0x2B58, 0x3C36, 0xB781,
	0xD883, 0x5334, 0x445A, 0xCFED, 0x6A86, 0xE131, 0xF65F, 0x7DE8,
	0x373E, 0xBC89, 0xABE7, 0x2050, 0x853B, 0x0E8C, 0x19E2, 0x9255,
	0x8C4E, 0x07F9, 0x1097, 0x9B20, 0x3E4B, 0xB5FC, 0xA292, 0x2925,
	0x63F3, 0xE844, 0xFF2A, 0x749D, 0xD1F6, 0x5A41, 0x4D2F, 0xC698,
	0x7119, 0xFAAE, 0xEDC0, 0x6677, 0xC31C, 0x48AB, 0x5FC5, 0xD472,
	0x9EA4, 0x1513, 0x027D, 0x89CA, 0x2CA1, 0xA716, 0xB078, 0x3BCF,
	0x25D4, 0xAE63, 0xB90D, 0x32BA, 0x97D1, 0x1C66, 0x0B08, 0x80BF,
	0xCA69, 0x41DE, 0x56B0, 0xDD07, 0x786C, 0xF3DB, 0xE4B5, 0x6F02,
	0x3AB1, 0xB106, 0xA668, 0x2DDF, 0x88B4, 0x0303, 0x146D, 0x9FDA,
	0xD50C, 0x5EBB, 0x49D5, 0xC262, 0x6709, 0xECBE, 0xFBD0, 0x7067,
	0x6E7C, 0xE5CB, 0xF2A5, 0x7912, 0xDC79, 0x57CE, 0x40A0, 0xCB17,
	0x81C1, 0x0A76, 0x1D18, 0x96AF, 0x33C4, 0xB873, 0xAF1D, 0x24AA,
	0x932B, 0x189C, 0x0FF2, 0x8445, 0x212E, 0xAA99, 0xBDF7, 0x3640,
	0x7C96, 0xF721, 0xE04F, 0x6BF8, 0xCE93, 0x4524, 0x524A, 0xD9FD,
	0xC7E6, 0x4C51, 0x5B3F, 0xD088, 0x75E3, 0xFE54, 0xE93A, 0x628D,
	0x285B, 0xA3EC, 0xB482, 0x3F35, 0x9A5E, 0x11E9, 0x0687, 0x8D30,
	0xE232, 0x6985, 0x7EEB, 0xF55C, 0x5037, 0xDB80, 0xCCEE, 0x4759,
	0x0D8F, 0x8638, 0x9156, 0x1AE1, 0xBF8A, 0x343D, 0x2353, 0xA8E4,
	0xB6FF, 0x3D48, 0x2A26, 0xA191, 0x04FA, 0x8F4D, 0x9823, 0x1394,
	0x5942, 0xD2F5, 0xC59B, 0x4E2C, 0xEB47, 0x60F0, 0x779E, 0xFC29,
	0x4BA8, 0xC01F, 0xD771, 0x5CC6, 0xF9AD, 0x721A, 0x6574, 0xEEC3,
	0xA415, 0x2FA2, 0x38CC, 0xB37B, 0x1610, 0x9DA7, 0x8AC9, 0x017E,
	0x1F65, 0x94D2, 0x83BC, 0x080B, 0xAD60, 0x26D7, 0x31B9, 0xBA0E,
	0xF0D8, 0x7B6F, 0x6C01, 0xE7B6, 0x42DD, 0xC96A, 0xDE04, 0x55B3
};

uint16_t crc_t10dif(uint16_t crc, const unsigned char *buffer, uint32_t len)
{
	unsigned int i;

	for (i = 0 ; i < len ; i++)
		crc = (crc << 8) ^ t10_dif_crc_table[((crc >> 8) ^ buffer[i]) & 0xff];

	return crc;
}

struct sd_dif_tuple {
       uint16_t guard_tag;	/* Checksum */
       uint16_t app_tag;		/* Opaque storage */
       uint32_t ref_tag;		/* Target LBA or indirect LBA */
};

static void stamp_pi_buffer(struct sd_dif_tuple *t, uint16_t csum,
			    uint16_t tag, uint32_t sector)
{
	t->guard_tag = htons(csum);
	t->app_tag = htons(tag);
	t->ref_tag = htonl(sector);
}

static void print_help(const char *progname)
{
	printf("Usage: %s [OPTS] fname\n", progname);
	printf("-a	Use this application tag\n");
	printf("-d	Do not use O_DIRECT\n");
	printf("-o	Read/write this many sectors into the device\n");
	printf("-r	Use DIX to read\n");
	printf("-s	Allocate buffer of this many sectors\n");
	printf("-w	Use DIX to write\n");
	printf("-z	Do not use O_SYNC\n");
}

int main(int argc, char *argv[])
{
	struct sd_dif_tuple *pi;
	int page_size = sysconf(_SC_PAGESIZE);
	io_context_t ioctx;
	struct io_event events[NR_IOS];
	struct iocb iocbs[NR_IOS];
	struct iocb *iocbps[NR_IOS];
	void *buf, *buf2;
	unsigned char *p;
	void *mbuf, *mbuf2;
	int ret, fd, i;
	struct iovec iov[3];
	int opt;
	int dix_read = 0, dix_write = 0;
	unsigned int SECTOR_SIZE = 0;
	unsigned long long num_sectors = 8, BUF_SIZE;
	unsigned long long sector_offset = 256, BDEV_OFFSET;
	unsigned int APP_TAG = 0xEF53;
	unsigned int the_byte = 0x55;
	size_t pi_buflen;
	int o_direct = O_DIRECT;
	int o_sync = O_SYNC;
	uint32_t pi_flags = 0;

	while ((opt = getopt(argc, argv, "b:zdrws:o:a:p:")) != -1) {
		switch (opt) {
		case 'a':
			APP_TAG = strtoul(optarg, NULL, 0);
			break;
		case 'b':
			the_byte = strtoul(optarg, NULL, 0) & 0xFF;
			break;
		case 'd':
			o_direct = 0;
			break;
		case 'o':
			sector_offset = strtoull(optarg, NULL, 0);
			break;
		case 'p':
			for (i = 0; i < strlen(optarg); i++)
				switch (optarg[i]) {
				case 'a':
					pi_flags |= GENERATE_APP;
					break;
				case 'g':
					pi_flags |= GENERATE_GUARD;
					break;
				case 'r':
					pi_flags |= GENERATE_REF;
					break;
				default:
					print_help(argv[0]);
					return 2;
				}
			break;
		case 'r':
			dix_read = 1;
			break;
		case 's':
			num_sectors = strtoull(optarg, NULL, 0);
			break;
		case 'w':
			dix_write = 1;
			break;
		case 'z':
			o_sync = 0;
			break;
		default:
			print_help(argv[0]);
			return 0;
		}
	}

	if (optind >= argc) {
		print_help(argv[0]);
		return 0;
	}

	if (dix_read)
		fprintf(stderr, "Using DIX read.\n");
	if (dix_write)
		fprintf(stderr, "Using DIX write.\n");

	fd = open(argv[optind], o_direct | o_sync | O_RDWR);
	if (fd < 0) {
		perror(argv[optind]);
		return 1;
	}

	/* For now, don't let non-block devices in */
	SECTOR_SIZE = 512;
	if (ioctl(fd, BLKSSZGET, &SECTOR_SIZE)) {
		perror(argv[optind]);
	}

	pi_buflen = (num_sectors + 1) * sizeof(struct sd_dif_tuple);

	BUF_SIZE = num_sectors * SECTOR_SIZE;
	BDEV_OFFSET = sector_offset * SECTOR_SIZE;
	fprintf(stderr, "sector=%d num_sectors=%llu pi_len=%zu pi_flag=0x%x\n",
		SECTOR_SIZE, num_sectors, pi_buflen, pi_flags);
	if (posix_memalign(&buf, page_size, BUF_SIZE) ||
	    posix_memalign(&buf2, page_size, BUF_SIZE) ||
	    posix_memalign(&mbuf, page_size, pi_buflen) ||
	    posix_memalign(&mbuf2, page_size, pi_buflen)) {
		perror("memalign");
		return 1;
	}

	if (io_queue_init(2, &ioctx)) {
		perror("io_queue_init");
		return 1;
	}

	/* Write everything out */
	memcpy(mbuf, &pi_flags, sizeof(pi_flags));
	memset(buf, the_byte, BUF_SIZE);
	for (p = buf, i = 0, pi = mbuf + sizeof(struct sd_dif_tuple);
	     i < num_sectors;
	     i++, pi++, p += SECTOR_SIZE)
		stamp_pi_buffer(pi,
				pi_flags & GENERATE_GUARD ? 0 : crc_t10dif(0, p, SECTOR_SIZE),
				pi_flags & GENERATE_APP ? 0 : APP_TAG,
				pi_flags & GENERATE_REF ? 0 : (BDEV_OFFSET / SECTOR_SIZE) + i);
	iov[0].iov_base = buf;
	iov[0].iov_len = page_size;
	iov[1].iov_base = buf + page_size;
	iov[1].iov_len = BUF_SIZE - page_size;
	iov[2].iov_base = mbuf;
	iov[2].iov_len = pi_buflen;
	iocbps[0] = iocbs;
	io_prep_pwritev(iocbs, fd, iov, (dix_write ? 3 : 2), BDEV_OFFSET);
	if (dix_write)
		iocbs[0].aio_lio_opcode = IOCB_CMD_PWRITEVM;

	fprintf(stderr, "Writing %llu bytes\n", BUF_SIZE);
	ret = io_submit(ioctx, 1, iocbps);
	if (ret < 0) {
		errno = -ret;
		perror("io_submit");
		return 1;
	}

	ret = io_getevents(ioctx, 1, 1, events, NULL);
	if (ret < 0) {
		errno = -ret;
		perror("io_getevents");
		return 1;
	}

	if ((signed)events[0].res < 0) {
		errno = -((signed)events[0].res);
		perror("io_pwritev");
		return 1;
	}
	fprintf(stderr, "Wrote %lu bytes\n", events[0].res);

	/* Read everything back in */
	memset(buf2, 0x00, BUF_SIZE);
	memset(mbuf2, 0x00, pi_buflen);
	memcpy(mbuf2, &pi_flags, sizeof(pi_flags));
	iov[0].iov_base = buf2;
	iov[0].iov_len = page_size;
	iov[1].iov_base = buf2 + page_size;
	iov[1].iov_len = BUF_SIZE - page_size;
	iov[2].iov_base = mbuf2;
	iov[2].iov_len = pi_buflen;
	iocbps[0] = iocbs;
	io_prep_preadv(iocbs, fd, iov, (dix_read ? 3 : 2), BDEV_OFFSET);
	if (dix_read)
		iocbs[0].aio_lio_opcode = IOCB_CMD_PREADVM;

	fprintf(stderr, "Reading %llu bytes\n", BUF_SIZE);
	ret = io_submit(ioctx, 1, iocbps);
	if (ret < 0) {
		errno = -ret;
		perror("io_submit");
		return 1;
	}

	ret = io_getevents(ioctx, 1, 1, events, NULL);
	if (ret < 0) {
		errno = -ret;
		perror("io_getevents");
		return 1;
	}

	if ((signed)events[0].res < 0) {
		errno = -((signed)events[0].res);
		perror("io_preadv");
		return 1;
	}
	fprintf(stderr, "Read %lu bytes\n", events[0].res);

	/* Compare */
	ret = 0;
	if (memcmp(buf, buf2, BUF_SIZE)) {
		ret = 2;
		fprintf(stderr, "Buffers do not match!\n");
	}
	if (dix_read && dix_write) {
		fprintf(stderr, "write pi\n");
		dump_buffer(mbuf, pi_buflen);
		fprintf(stderr, "read pi\n");
		dump_buffer(mbuf2, pi_buflen);
		if(memcmp(mbuf, mbuf2, pi_buflen)) {
			ret = 2;
			fprintf(stderr, "DIX buffers do not match!\n");
		}
	} else
		fprintf(stderr, "Need to pass -rw to compare DIX buffers!\n");

	if (io_queue_release(ioctx)) {
		perror("io_queue_release");
		return 1;
	}

	close(fd);
	if (!ret)
		fprintf(stderr, "Success.\n");

	return ret;
}

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/5] fs/bio-integrity: remove duplicate code
  2014-03-21  4:30 [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO Darrick J. Wong
@ 2014-03-21  4:30 ` Darrick J. Wong
  2014-03-21  4:30 ` [PATCH 2/5] aio/dio: enable DIX passthrough Darrick J. Wong
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2014-03-21  4:30 UTC (permalink / raw)
  To: axboe, martin.petersen, darrick.wong, JBottomley, bcrl, viro
  Cc: linux-fsdevel, linux-aio, Gu Zheng, linux-scsi, linux-mm

Frøm: Gu Zheng <guz.fnst@cn.fujitsu.com>

Most code of function bio_integrity_verify and bio_integrity_generate
is the same, so introduce a help function bio_integrity_generate_verify()
to remove the duplicate code.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
---
 fs/bio-integrity.c |   83 +++++++++++++++++++++++-----------------------------
 1 file changed, 37 insertions(+), 46 deletions(-)


diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 4f70f38..413312f 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -301,25 +301,26 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
 /**
- * bio_integrity_generate - Generate integrity metadata for a bio
- * @bio:	bio to generate integrity metadata for
- *
- * Description: Generates integrity metadata for a bio by calling the
- * block device's generation callback function.  The bio must have a
- * bip attached with enough room to accommodate the generated
- * integrity metadata.
+ * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * @bio:	bio to generate/verify integrity metadata for
+ * @operate:	operate number, 1 for generate, 0 for verify
  */
-static void bio_integrity_generate(struct bio *bio)
+static int bio_integrity_generate_verify(struct bio *bio, int operate)
 {
 	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
 	struct blk_integrity_exchg bix;
 	struct bio_vec bv;
 	struct bvec_iter iter;
-	sector_t sector = bio->bi_iter.bi_sector;
-	unsigned int sectors, total;
+	sector_t sector;
+	unsigned int sectors, total, ret;
 	void *prot_buf = bio->bi_integrity->bip_buf;
 
-	total = 0;
+	if (operate)
+		sector = bio->bi_iter.bi_sector;
+	else
+		sector = bio->bi_integrity->bip_iter.bi_sector;
+
+	total = ret = 0;
 	bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
 	bix.sector_size = bi->sector_size;
 
@@ -330,7 +331,15 @@ static void bio_integrity_generate(struct bio *bio)
 		bix.prot_buf = prot_buf;
 		bix.sector = sector;
 
-		bi->generate_fn(&bix);
+		if (operate) {
+			bi->generate_fn(&bix);
+		} else {
+			ret = bi->verify_fn(&bix);
+			if (ret) {
+				kunmap_atomic(kaddr);
+				return ret;
+			}
+		}
 
 		sectors = bv.bv_len / bi->sector_size;
 		sector += sectors;
@@ -340,6 +349,21 @@ static void bio_integrity_generate(struct bio *bio)
 
 		kunmap_atomic(kaddr);
 	}
+	return ret;
+}
+
+/**
+ * bio_integrity_generate - Generate integrity metadata for a bio
+ * @bio:	bio to generate integrity metadata for
+ *
+ * Description: Generates integrity metadata for a bio by calling the
+ * block device's generation callback function.  The bio must have a
+ * bip attached with enough room to accommodate the generated
+ * integrity metadata.
+ */
+static void bio_integrity_generate(struct bio *bio)
+{
+	bio_integrity_generate_verify(bio, 1);
 }
 
 static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
@@ -454,40 +478,7 @@ EXPORT_SYMBOL(bio_integrity_prep);
  */
 static int bio_integrity_verify(struct bio *bio)
 {
-	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-	struct blk_integrity_exchg bix;
-	struct bio_vec *bv;
-	sector_t sector = bio->bi_integrity->bip_iter.bi_sector;
-	unsigned int sectors, ret = 0;
-	void *prot_buf = bio->bi_integrity->bip_buf;
-	int i;
-
-	bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
-	bix.sector_size = bi->sector_size;
-
-	bio_for_each_segment_all(bv, bio, i) {
-		void *kaddr = kmap_atomic(bv->bv_page);
-
-		bix.data_buf = kaddr + bv->bv_offset;
-		bix.data_size = bv->bv_len;
-		bix.prot_buf = prot_buf;
-		bix.sector = sector;
-
-		ret = bi->verify_fn(&bix);
-
-		if (ret) {
-			kunmap_atomic(kaddr);
-			return ret;
-		}
-
-		sectors = bv->bv_len / bi->sector_size;
-		sector += sectors;
-		prot_buf += sectors * bi->tuple_size;
-
-		kunmap_atomic(kaddr);
-	}
-
-	return ret;
+	return bio_integrity_generate_verify(bio, 0);
 }
 
 /**

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/5] aio/dio: enable DIX passthrough
  2014-03-21  4:30 [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO Darrick J. Wong
  2014-03-21  4:30 ` [PATCH 1/5] fs/bio-integrity: remove duplicate code Darrick J. Wong
@ 2014-03-21  4:30 ` Darrick J. Wong
  2014-03-21  4:31 ` [PATCH 3/5] aio/dio: allow user to ask kernel to fill in parts of the protection info Darrick J. Wong
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2014-03-21  4:30 UTC (permalink / raw)
  To: axboe, martin.petersen, darrick.wong, JBottomley, bcrl, viro
  Cc: linux-fsdevel, linux-aio, linux-scsi, linux-mm

Provide a set of new AIO commands (IOCB_CMD_P{READ,WRITE}VM) that
utilize the last iovec of the iovec array to convey protection
information to and from userspace.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 Documentation/block/data-integrity.txt |   11 ++
 fs/aio.c                               |   22 ++++
 fs/bio-integrity.c                     |   93 +++++++++++++++++++
 fs/direct-io.c                         |  157 +++++++++++++++++++++++++++++---
 include/linux/aio.h                    |    3 +
 include/linux/bio.h                    |   15 +++
 include/uapi/linux/aio_abi.h           |    2 
 mm/filemap.c                           |    7 +
 8 files changed, 294 insertions(+), 16 deletions(-)


diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index 2d735b0a..1d1f070 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -282,6 +282,17 @@ will require extra work due to the application tag.
       It is up to the receiver to process them and verify data
       integrity upon completion.
 
+    int bio_integrity_prep_buffer(struct bio *bio, int rw,
+				  struct bio_integrity_prep_iter *pi);
+
+      This function should be called before submit_bio; its purpose is to
+      attach an arbitrary array of struct page * containing integrity data
+      to an existing bio.  Primarily this is intended for AIO/DIO to be
+      able to attach a userspace buffer to a bio.
+
+      The bio_integrity_prep_iter should contain the page offset and buffer
+      length of the PI buffer, the number of pages, and the actual array of
+      pages, as returned by get_user_pages.
 
 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
     METADATA
diff --git a/fs/aio.c b/fs/aio.c
index 062a5f6..5d425d8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1259,6 +1259,11 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
 	struct iovec inline_vec, *iovec = &inline_vec;
 
 	switch (opcode) {
+	case IOCB_CMD_PREADVM:
+		if (!(file->f_flags & O_DIRECT))
+			return -EINVAL;
+		req->ki_flags |= KIOCB_USE_PI;
+
 	case IOCB_CMD_PREAD:
 	case IOCB_CMD_PREADV:
 		mode	= FMODE_READ;
@@ -1266,6 +1271,11 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
 		rw_op	= file->f_op->aio_read;
 		goto rw_common;
 
+	case IOCB_CMD_PWRITEVM:
+		if (!(file->f_flags & O_DIRECT))
+			return -EINVAL;
+		req->ki_flags |= KIOCB_USE_PI;
+
 	case IOCB_CMD_PWRITE:
 	case IOCB_CMD_PWRITEV:
 		mode	= FMODE_WRITE;
@@ -1280,7 +1290,9 @@ rw_common:
 			return -EINVAL;
 
 		ret = (opcode == IOCB_CMD_PREADV ||
-		       opcode == IOCB_CMD_PWRITEV)
+		       opcode == IOCB_CMD_PWRITEV ||
+		       opcode == IOCB_CMD_PREADVM ||
+		       opcode == IOCB_CMD_PWRITEVM)
 			? aio_setup_vectored_rw(req, rw, buf, &nr_segs,
 						&iovec, compat)
 			: aio_setup_single_vector(req, rw, buf, &nr_segs,
@@ -1288,6 +1300,13 @@ rw_common:
 		if (ret)
 			return ret;
 
+		if ((req->ki_flags & KIOCB_USE_PI) && nr_segs < 2) {
+			pr_err("%s: not enough iovecs for PI!\n", __func__);
+			if (iovec != &inline_vec)
+				kfree(iovec);
+			return -EINVAL;
+		}
+
 		ret = rw_verify_area(rw, file, &req->ki_pos, req->ki_nbytes);
 		if (ret < 0) {
 			if (iovec != &inline_vec)
@@ -1407,6 +1426,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	req->ki_user_data = iocb->aio_data;
 	req->ki_pos = iocb->aio_offset;
 	req->ki_nbytes = iocb->aio_nbytes;
+	req->ki_flags = 0;
 
 	ret = aio_run_iocb(req, iocb->aio_lio_opcode,
 			   (char __user *)(unsigned long)iocb->aio_buf,
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 413312f..af398f0 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -138,7 +138,7 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
 	struct bio_vec *iv;
 
 	if (bip->bip_vcnt >= bip_integrity_vecs(bip)) {
-		printk(KERN_ERR "%s: bip_vec full\n", __func__);
+		pr_err("%s: bip_vec full\n", __func__);
 		return 0;
 	}
 
@@ -250,7 +250,7 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
 					DIV_ROUND_UP(len, bi->tag_size));
 
 	if (nr_sectors * bi->tuple_size > bip->bip_iter.bi_size) {
-		printk(KERN_ERR "%s: tag too big for bio: %u > %u\n", __func__,
+		pr_err("%s: tag too big for bio: %u > %u\n", __func__,
 		       nr_sectors * bi->tuple_size, bip->bip_iter.bi_size);
 		return -1;
 	}
@@ -375,6 +375,95 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
 }
 
 /**
+ * bio_integrity_prep_buffer - Prepare bio for integrity I/O
+ * @bio:	bio to prepare
+ * @rw:		data direction for the bio
+ * @pi:		pi data to attach to bio
+ *
+ * Description: Allocates a buffer for integrity metadata, maps the
+ * pages and attaches them to a bio.  The bio must have target device
+ * and start sector set prior to calling.  The pages specified in the
+ * @pi argument should contain integrity metadata in the WRITE case,
+ * and should be ready to receive metadata in the READ case.
+ */
+int bio_integrity_prep_buffer(struct bio *bio, int rw,
+			      struct bio_integrity_prep_iter *pi)
+{
+	struct bio_integrity_payload *bip;
+	struct blk_integrity *bi;
+	unsigned long start, end;
+	unsigned int len, nr_pages;
+	unsigned int bytes, i;
+	unsigned int sectors;
+
+	bi = bdev_get_integrity(bio->bi_bdev);
+	BUG_ON(bi == NULL);
+	BUG_ON(bio_integrity(bio));
+
+	sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio));
+
+	/* Allocate kernel buffer for protection data */
+	len = sectors * blk_integrity_tuple_size(bi);
+	end = (pi->pi_offset + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	start = pi->pi_offset >> PAGE_SHIFT;
+	nr_pages = end - start;
+
+	if (pi->pi_len < len) {
+		pr_err("%s: not enough space left in buffer!\n", __func__);
+		return -ENOMEM;
+	}
+
+	/* Allocate bio integrity payload and integrity vectors */
+	bip = bio_integrity_alloc(bio, GFP_NOIO, pi->pi_nrpages);
+	if (unlikely(bip == NULL)) {
+		pr_err("could not allocate data integrity bioset\n");
+		return -EIO;
+	}
+
+	bip->bip_owns_buf = 0;
+	bip->bip_buf = NULL;
+	bip->bip_iter.bi_size = len;
+	bip->bip_iter.bi_sector = bio->bi_iter.bi_sector;
+
+	/* Map it */
+	for (i = 0 ; i < nr_pages ; i++) {
+		int ret;
+		bytes = PAGE_SIZE - pi->pi_offset;
+
+		if (bytes > pi->pi_len)
+			bytes = pi->pi_len;
+		if (bytes > len)
+			bytes = len;
+		if (pi->pi_len <= 0 || len == 0)
+			break;
+
+		ret = bio_integrity_add_page(bio, *pi->pi_userpages,
+					     bytes, pi->pi_offset);
+
+		if (ret == 0)
+			return -EIO;
+
+		if (ret < bytes)
+			break;
+
+		len -= bytes;
+		pi->pi_len -= bytes;
+		if (pi->pi_offset + bytes == PAGE_SIZE)
+			pi->pi_userpages++;
+		pi->pi_offset = (pi->pi_offset + bytes) % PAGE_SIZE;
+	}
+
+	/* Install custom I/O completion handler if read verify is enabled */
+	if ((rw & WRITE) == READ) {
+		bip->bip_end_io = bio->bi_end_io;
+		bio->bi_end_io = bio_integrity_endio;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(bio_integrity_prep_buffer);
+
+/**
  * bio_integrity_prep - Prepare bio for integrity I/O
  * @bio:	bio to prepare
  *
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 160a548..ee357dd 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -111,6 +111,10 @@ struct dio_submit {
 	 */
 	unsigned head;			/* next page to process */
 	unsigned tail;			/* last valid page + 1 */
+
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+	struct bio_integrity_prep_iter	pi_iter;
+#endif
 };
 
 /* dio_state communicated between submission path and end_io */
@@ -137,6 +141,10 @@ struct dio {
 	struct kiocb *iocb;		/* kiocb */
 	ssize_t result;                 /* IO result */
 
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+	struct bio_integrity_prep_iter	pi_iter;	/* PI buffers */
+#endif
+
 	/*
 	 * pages[] (and any fields placed after it) are not zeroed out at
 	 * allocation time.  Don't add new fields after pages[] unless you
@@ -221,6 +229,75 @@ static inline struct page *dio_get_page(struct dio *dio,
 	return dio->pages[sdio->head++];
 }
 
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+static int dio_tear_down_pi(struct dio *dio)
+{
+	size_t i;
+
+	if (!dio->pi_iter.pi_userpages)
+		return 0;
+
+	for (i = 0; i < dio->pi_iter.pi_nrpages; i++)
+		page_cache_release(dio->pi_iter.pi_userpages[i]);
+	kfree(dio->pi_iter.pi_userpages);
+	dio->pi_iter.pi_userpages = NULL;
+	return 0;
+}
+
+static int dio_prep_for_pi(struct dio *dio, struct block_device *bdev, int rw,
+			   struct iovec *pi_iov)
+{
+	unsigned long start, end;
+	struct request_queue *q;
+	int retval;
+
+	if (!pi_iov)
+		return 0;
+
+	if (pi_iov->iov_len == 0)
+		return -EINVAL;
+
+	end = (((unsigned long)pi_iov->iov_base) + pi_iov->iov_len +
+		PAGE_SIZE - 1) >> PAGE_SHIFT;
+	start = ((unsigned long)pi_iov->iov_base) >> PAGE_SHIFT;
+	dio->pi_iter.pi_offset = offset_in_page(pi_iov->iov_base);
+	dio->pi_iter.pi_len = pi_iov->iov_len;
+	dio->pi_iter.pi_nrpages = end - start;
+	q = bdev_get_queue(bdev);
+	dio->pi_iter.pi_userpages = kzalloc(dio->pi_iter.pi_nrpages *
+					    sizeof(struct page *),
+					    GFP_NOIO | q->bounce_gfp);
+	if (!dio->pi_iter.pi_userpages) {
+		pr_err("%s: no room for page array?\n", __func__);
+		return -ENOMEM;
+	}
+
+	retval = get_user_pages_fast((unsigned long)pi_iov->iov_base,
+				     dio->pi_iter.pi_nrpages, rw & WRITE,
+				     dio->pi_iter.pi_userpages);
+	if (retval != dio->pi_iter.pi_nrpages) {
+		pr_err("%s: couldn't map pages?\n", __func__);
+		dio_tear_down_pi(dio);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+#else
+static int dio_tear_down_pi(struct dio *dio)
+{
+	return 0;
+}
+
+static int dio_prep_for_pi(struct dio *dio, struct block_device *bdev, int rw,
+			   struct iovec *pi_iov)
+{
+	if (!pi_iov)
+		return 0;
+	return -EINVAL;
+}
+#endif /* CONFIG_BLK_DEV_INTEGRITY */
+
 /**
  * dio_complete() - called when all DIO BIO I/O has been completed
  * @offset: the byte offset in the file of the completed operation
@@ -255,6 +332,8 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
 			transferred = dio->i_size - offset;
 	}
 
+	dio_tear_down_pi(dio);
+
 	if (ret == 0)
 		ret = dio->page_errors;
 	if (ret == 0)
@@ -385,6 +464,22 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
 
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static int dio_prep_pi_buffers(struct dio *dio, struct dio_submit *sdio)
+{
+	struct bio *bio = sdio->bio;
+	if (sdio->pi_iter.pi_userpages == NULL || !bio_integrity_enabled(bio))
+		return 0;
+
+	return bio_integrity_prep_buffer(bio, dio->rw, &sdio->pi_iter);
+}
+#else
+static int dio_prep_pi_buffers(struct dio *dio, struct dio_submit *sdio)
+{
+	return 0;
+}
+#endif
+
 /*
  * In the AIO read case we speculatively dirty the pages before starting IO.
  * During IO completion, any of these pages which happen to have been written
@@ -392,13 +487,18 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
  *
  * bios hold a dio reference between submit_bio and ->end_io.
  */
-static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
+static inline int dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 {
 	struct bio *bio = sdio->bio;
 	unsigned long flags;
+	int ret = 0;
 
 	bio->bi_private = dio;
 
+	ret = dio_prep_pi_buffers(dio, sdio);
+	if (ret)
+		return ret;
+
 	spin_lock_irqsave(&dio->bio_lock, flags);
 	dio->refcount++;
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
@@ -415,6 +515,8 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 	sdio->bio = NULL;
 	sdio->boundary = 0;
 	sdio->logical_offset_in_bio = 0;
+
+	return ret;
 }
 
 /*
@@ -736,8 +838,11 @@ static inline int dio_send_cur_page(struct dio *dio, struct dio_submit *sdio,
 		 * have.
 		 */
 		if (sdio->final_block_in_bio != sdio->cur_page_block ||
-		    cur_offset != bio_next_offset)
-			dio_bio_submit(dio, sdio);
+		    cur_offset != bio_next_offset) {
+			ret = dio_bio_submit(dio, sdio);
+			if (ret)
+				goto out;
+		}
 	}
 
 	if (sdio->bio == NULL) {
@@ -747,7 +852,9 @@ static inline int dio_send_cur_page(struct dio *dio, struct dio_submit *sdio,
 	}
 
 	if (dio_bio_add_page(sdio) != 0) {
-		dio_bio_submit(dio, sdio);
+		ret = dio_bio_submit(dio, sdio);
+		if (ret)
+			goto out;
 		ret = dio_new_bio(dio, sdio, sdio->cur_page_block, map_bh);
 		if (ret == 0) {
 			ret = dio_bio_add_page(sdio);
@@ -823,8 +930,12 @@ out:
 	 * avoid metadata seeks.
 	 */
 	if (sdio->boundary) {
+		int ret2;
+
 		ret = dio_send_cur_page(dio, sdio, map_bh);
-		dio_bio_submit(dio, sdio);
+		ret2 = dio_bio_submit(dio, sdio);
+		if (ret == 0)
+			ret = ret2;
 		page_cache_release(sdio->cur_page);
 		sdio->cur_page = NULL;
 	}
@@ -1120,16 +1231,22 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	unsigned blocksize_mask = (1 << blkbits) - 1;
 	ssize_t retval = -EINVAL;
 	loff_t end = offset;
-	struct dio *dio;
+	struct dio *dio = NULL;
 	struct dio_submit sdio = { 0, };
 	unsigned long user_addr;
 	size_t bytes;
 	struct buffer_head map_bh = { 0, };
 	struct blk_plug plug;
+	struct iovec *pi_iov = NULL;
 
 	if (rw & WRITE)
 		rw = WRITE_ODIRECT;
 
+	if (iocb->ki_flags & KIOCB_USE_PI) {
+		nr_segs--;
+		pi_iov = (struct iovec *)(iov + nr_segs);
+	}
+
 	/*
 	 * Avoid references to bdev if not absolutely needed to give
 	 * the early prefetch in the caller enough time.
@@ -1174,6 +1291,11 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	 */
 	memset(dio, 0, offsetof(struct dio, pages));
 
+	/* Set up a buffer to hold DIX data */
+	retval = dio_prep_for_pi(dio, bdev, rw, pi_iov);
+	if (retval)
+		goto out_dio;
+
 	dio->flags = flags;
 	if (dio->flags & DIO_LOCKING) {
 		if (rw == READ) {
@@ -1187,8 +1309,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 							      end - 1);
 			if (retval) {
 				mutex_unlock(&inode->i_mutex);
-				kmem_cache_free(dio_cache, dio);
-				goto out;
+				goto out_pi;
 			}
 		}
 	}
@@ -1217,8 +1338,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 			 * We grab i_mutex only for reads so we don't have
 			 * to release it here
 			 */
-			kmem_cache_free(dio_cache, dio);
-			goto out;
+			goto out_pi;
 		}
 	}
 
@@ -1228,6 +1348,9 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	atomic_inc(&inode->i_dio_count);
 
 	retval = 0;
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+	sdio.pi_iter = dio->pi_iter;
+#endif
 	sdio.blkbits = blkbits;
 	sdio.blkfactor = i_blkbits - blkbits;
 	sdio.block_in_file = offset >> blkbits;
@@ -1315,8 +1438,12 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 		page_cache_release(sdio.cur_page);
 		sdio.cur_page = NULL;
 	}
-	if (sdio.bio)
-		dio_bio_submit(dio, &sdio);
+	if (sdio.bio) {
+		int ret2;
+		ret2 = dio_bio_submit(dio, &sdio);
+		if (retval == 0)
+			retval = ret2;
+	}
 
 	blk_finish_plug(&plug);
 
@@ -1353,7 +1480,11 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 		retval = dio_complete(dio, offset, retval, false);
 	} else
 		BUG_ON(retval != -EIOCBQUEUED);
-
+	return retval;
+out_pi:
+	dio_tear_down_pi(dio);
+out_dio:
+	kmem_cache_free(dio_cache, dio);
 out:
 	return retval;
 }
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..2060e66 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -29,6 +29,8 @@ struct kiocb;
 
 typedef int (kiocb_cancel_fn)(struct kiocb *);
 
+#define KIOCB_USE_PI		(1)
+
 struct kiocb {
 	struct file		*ki_filp;
 	struct kioctx		*ki_ctx;	/* NULL for sync ops */
@@ -52,6 +54,7 @@ struct kiocb {
 	 * this is the underlying eventfd context to deliver events to.
 	 */
 	struct eventfd_ctx	*ki_eventfd;
+	unsigned int		ki_flags;
 };
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5a4d39b..4729ab1 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -635,6 +635,13 @@ struct biovec_slab {
 	struct kmem_cache *slab;
 };
 
+struct bio_integrity_prep_iter {
+	struct page **pi_userpages;	/* Pages containing PI data */
+	size_t pi_nrpages;		/* Number of PI data pages */
+	size_t pi_offset;		/* Offset into the page */
+	size_t pi_len;			/* Length of the buffer */
+};
+
 /*
  * a small number of entries is fine, not going to be performance critical.
  * basically we just need to survive
@@ -663,6 +670,8 @@ extern int bio_integrity_enabled(struct bio *bio);
 extern int bio_integrity_set_tag(struct bio *, void *, unsigned int);
 extern int bio_integrity_get_tag(struct bio *, void *, unsigned int);
 extern int bio_integrity_prep(struct bio *);
+extern int bio_integrity_prep_buffer(struct bio *, int rw,
+				     struct bio_integrity_prep_iter *);
 extern void bio_integrity_endio(struct bio *, int);
 extern void bio_integrity_advance(struct bio *, unsigned int);
 extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
@@ -693,6 +702,12 @@ static inline void bioset_integrity_free (struct bio_set *bs)
 	return;
 }
 
+static inline int bio_integrity_prep_buffer(struct bio *bio, int rw,
+					    struct bio_integrity_prep_iter *pi)
+{
+	return 0;
+}
+
 static inline int bio_integrity_prep(struct bio *bio)
 {
 	return 0;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f..f8d70d0 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -44,6 +44,8 @@ enum {
 	IOCB_CMD_NOOP = 6,
 	IOCB_CMD_PREADV = 7,
 	IOCB_CMD_PWRITEV = 8,
+	IOCB_CMD_PREADVM = 9,
+	IOCB_CMD_PWRITEVM = 10,
 };
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6a..3aefb0e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2477,6 +2477,13 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 							ppos, count, ocount);
 		if (written < 0 || written == count)
 			goto out;
+
+		/* User-provided PI requires direct IO */
+		if (iocb->ki_flags & KIOCB_USE_PI) {
+			err = -EINVAL;
+			goto out;
+		}
+
 		/*
 		 * direct-io write to a hole: fall through to buffered I/O
 		 * for completing the rest of the request.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/5] aio/dio: allow user to ask kernel to fill in parts of the protection info
  2014-03-21  4:30 [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO Darrick J. Wong
  2014-03-21  4:30 ` [PATCH 1/5] fs/bio-integrity: remove duplicate code Darrick J. Wong
  2014-03-21  4:30 ` [PATCH 2/5] aio/dio: enable DIX passthrough Darrick J. Wong
@ 2014-03-21  4:31 ` Darrick J. Wong
  2014-03-21  4:31 ` [PATCH 4/5] aio/dio: advertise possible userspace flags Darrick J. Wong
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2014-03-21  4:31 UTC (permalink / raw)
  To: axboe, martin.petersen, darrick.wong, JBottomley, bcrl, viro
  Cc: linux-fsdevel, linux-aio, linux-scsi, linux-mm

Since userspace can now pass PI buffers through to the block integrity
provider, provide a means for userspace to specify a flags argument
with the PI buffer.  The initial user for this will be sd_dif, which
will enable user programs to ask the kernel to fill in whichever
fields they don't want to provide.  This is intended, for example, to
satisfy programs that really only care to provide an app tag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 Documentation/block/data-integrity.txt |   11 ++++
 block/blk-integrity.c                  |    1 
 drivers/scsi/sd_dif.c                  |   76 ++++++++++++++++++++++++----
 fs/bio-integrity.c                     |   87 +++++++++++++++++++++++++++++++-
 fs/direct-io.c                         |   15 ++++++
 include/linux/bio.h                    |    3 +
 include/linux/blkdev.h                 |    2 +
 7 files changed, 178 insertions(+), 17 deletions(-)


diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index 1d1f070..b72a54f 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -292,7 +292,10 @@ will require extra work due to the application tag.
 
       The bio_integrity_prep_iter should contain the page offset and buffer
       length of the PI buffer, the number of pages, and the actual array of
-      pages, as returned by get_user_pages.
+      pages, as returned by get_user_pages.  The user_flags argument should
+      contain whatever flag values were passed in by userspace; the values
+      of the flags are specific to the block integrity provider, and are
+      passed to the mod_user_buf_fn handler.
 
 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
     METADATA
@@ -332,6 +335,12 @@ will require extra work due to the application tag.
       are available per hardware sector.  For DIF this is either 2 or
       0 depending on the value of the Control Mode Page ATO bit.
 
+      'mod_user_buf_fn' updates the appropriate integrity metadata for
+      a WRITE operation.  This function is called when userspace passes
+      in a PI buffer along with file data; the flags argument (which is
+      specific to the blk_integrity provider) arrange for pre-processing
+      of the user buffer prior to issuing the IO.
+
       See 6.2 for a description of get_tag_fn and set_tag_fn.
 
 ----------------------------------------------------------------------
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 7fbab84..1cb1eb2 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -421,6 +421,7 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
 		bi->set_tag_fn = template->set_tag_fn;
 		bi->get_tag_fn = template->get_tag_fn;
 		bi->tag_size = template->tag_size;
+		bi->mod_user_buf_fn = template->mod_user_buf_fn;
 	} else
 		bi->name = bi_unsupported_name;
 
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index a7a691d..74182c9 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -53,31 +53,58 @@ static __u16 sd_dif_ip_fn(void *data, unsigned int len)
  * Type 1 and Type 2 protection use the same format: 16 bit guard tag,
  * 16 bit app tag, 32 bit reference tag.
  */
-static void sd_dif_type1_generate(struct blk_integrity_exchg *bix, csum_fn *fn)
+#define GENERATE_GUARD	(1)
+#define GENERATE_REF	(2)
+#define GENERATE_APP	(4)
+#define GENERATE_ALL	(7)
+static int sd_dif_type1_generate(struct blk_integrity_exchg *bix, csum_fn *fn,
+				 int flags)
 {
 	void *buf = bix->data_buf;
 	struct sd_dif_tuple *sdt = bix->prot_buf;
 	sector_t sector = bix->sector;
 	unsigned int i;
 
+	if (flags & ~GENERATE_ALL)
+		return -EINVAL;
+	if (!flags)
+		return -ENOTTY;
+
 	for (i = 0 ; i < bix->data_size ; i += bix->sector_size, sdt++) {
-		sdt->guard_tag = fn(buf, bix->sector_size);
-		sdt->ref_tag = cpu_to_be32(sector & 0xffffffff);
-		sdt->app_tag = 0;
+		if (flags & GENERATE_GUARD)
+			sdt->guard_tag = fn(buf, bix->sector_size);
+		if (flags & GENERATE_REF)
+			sdt->ref_tag = cpu_to_be32(sector & 0xffffffff);
+		if (flags & GENERATE_APP)
+			sdt->app_tag = 0;
 
 		buf += bix->sector_size;
 		sector++;
 	}
+
+	return 0;
 }
 
 static void sd_dif_type1_generate_crc(struct blk_integrity_exchg *bix)
 {
-	sd_dif_type1_generate(bix, sd_dif_crc_fn);
+	sd_dif_type1_generate(bix, sd_dif_crc_fn, GENERATE_ALL);
 }
 
 static void sd_dif_type1_generate_ip(struct blk_integrity_exchg *bix)
 {
-	sd_dif_type1_generate(bix, sd_dif_ip_fn);
+	sd_dif_type1_generate(bix, sd_dif_ip_fn, GENERATE_ALL);
+}
+
+static int sd_dif_type1_mod_crc(struct blk_integrity_exchg *bix,
+				int flags)
+{
+	return sd_dif_type1_generate(bix, sd_dif_crc_fn, flags);
+}
+
+static int sd_dif_type1_mod_ip(struct blk_integrity_exchg *bix,
+			       int flags)
+{
+	return sd_dif_type1_generate(bix, sd_dif_ip_fn, flags);
 }
 
 static int sd_dif_type1_verify(struct blk_integrity_exchg *bix, csum_fn *fn)
@@ -163,6 +190,7 @@ static struct blk_integrity dif_type1_integrity_crc = {
 	.set_tag_fn		= sd_dif_type1_set_tag,
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
+	.mod_user_buf_fn	= sd_dif_type1_mod_crc,
 };
 
 static struct blk_integrity dif_type1_integrity_ip = {
@@ -173,6 +201,7 @@ static struct blk_integrity dif_type1_integrity_ip = {
 	.set_tag_fn		= sd_dif_type1_set_tag,
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
+	.mod_user_buf_fn	= sd_dif_type1_mod_ip,
 };
 
 
@@ -180,29 +209,50 @@ static struct blk_integrity dif_type1_integrity_ip = {
  * Type 3 protection has a 16-bit guard tag and 16 + 32 bits of opaque
  * tag space.
  */
-static void sd_dif_type3_generate(struct blk_integrity_exchg *bix, csum_fn *fn)
+static int sd_dif_type3_generate(struct blk_integrity_exchg *bix, csum_fn *fn,
+				 int flags)
 {
 	void *buf = bix->data_buf;
 	struct sd_dif_tuple *sdt = bix->prot_buf;
 	unsigned int i;
 
+	if (flags & (~GENERATE_ALL | GENERATE_REF))
+		return -EINVAL;
+	if (!flags)
+		return -ENOTTY;
+
 	for (i = 0 ; i < bix->data_size ; i += bix->sector_size, sdt++) {
-		sdt->guard_tag = fn(buf, bix->sector_size);
-		sdt->ref_tag = 0;
-		sdt->app_tag = 0;
+		if (flags & GENERATE_GUARD)
+			sdt->guard_tag = fn(buf, bix->sector_size);
+		if (flags & GENERATE_APP) {
+			sdt->ref_tag = 0;
+			sdt->app_tag = 0;
+		}
 
 		buf += bix->sector_size;
 	}
+
+	return 0;
 }
 
 static void sd_dif_type3_generate_crc(struct blk_integrity_exchg *bix)
 {
-	sd_dif_type3_generate(bix, sd_dif_crc_fn);
+	sd_dif_type3_generate(bix, sd_dif_crc_fn, GENERATE_ALL);
 }
 
 static void sd_dif_type3_generate_ip(struct blk_integrity_exchg *bix)
 {
-	sd_dif_type3_generate(bix, sd_dif_ip_fn);
+	sd_dif_type3_generate(bix, sd_dif_ip_fn, GENERATE_ALL);
+}
+
+static int sd_dif_type3_mod_crc(struct blk_integrity_exchg *bix, int flags)
+{
+	return sd_dif_type3_generate(bix, sd_dif_crc_fn, flags);
+}
+
+static int sd_dif_type3_mod_ip(struct blk_integrity_exchg *bix, int flags)
+{
+	return sd_dif_type3_generate(bix, sd_dif_ip_fn, flags);
 }
 
 static int sd_dif_type3_verify(struct blk_integrity_exchg *bix, csum_fn *fn)
@@ -283,6 +333,7 @@ static struct blk_integrity dif_type3_integrity_crc = {
 	.set_tag_fn		= sd_dif_type3_set_tag,
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
+	.mod_user_buf_fn	= sd_dif_type3_mod_crc,
 };
 
 static struct blk_integrity dif_type3_integrity_ip = {
@@ -293,6 +344,7 @@ static struct blk_integrity dif_type3_integrity_ip = {
 	.set_tag_fn		= sd_dif_type3_set_tag,
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
+	.mod_user_buf_fn	= sd_dif_type3_mod_ip,
 };
 
 /*
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index af398f0..381ee38 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -301,6 +301,82 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
 /**
+ * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
+ * @bio:	bio to generate/verify integrity metadata for
+ */
+int bio_integrity_update_user_buffer(struct bio *bio)
+{
+	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+	struct blk_integrity_exchg bix;
+	struct bio_vec bv;
+	struct bvec_iter iter;
+	sector_t sector;
+	unsigned int sectors, total, ret;
+	void *prot_buf;
+	unsigned int prot_offset, prot_len, bv_offset, bv_len;
+	struct bio_vec *iv;
+	struct bio_integrity_payload *bip = bio->bi_integrity;
+
+	if (!bi->mod_user_buf_fn)
+		return 0;
+
+	sector = bio->bi_iter.bi_sector;
+
+	total = ret = 0;
+	bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
+	bix.sector_size = bi->sector_size;
+
+	iv = bip->bip_vec;
+	prot_offset = iv->bv_offset;
+	prot_len = iv->bv_len;
+	prot_buf = kmap_atomic(iv->bv_page);
+
+	bio_for_each_segment(bv, bio, iter) {
+		void *kaddr = kmap_atomic(bv.bv_page);
+		bv_len = bv.bv_len;
+		bv_offset = bv.bv_offset;
+
+		while (bv_len > 0) {
+			if (prot_len < bi->tuple_size) {
+				kunmap_atomic(prot_buf);
+				iv++;
+				BUG_ON(iv >= bip->bip_vec + bip->bip_vcnt);
+				prot_offset = iv->bv_offset;
+				prot_len = iv->bv_len;
+				prot_buf = kmap_atomic(iv->bv_page);
+			}
+			bix.data_buf = kaddr + bv_offset;
+			bix.data_size = min(bv_len,
+				prot_len / bi->tuple_size * bix.sector_size);
+			bix.prot_buf = prot_buf + prot_offset;
+			bix.sector = sector;
+
+			ret = bi->mod_user_buf_fn(&bix, bip->bip_user_flags);
+			if (ret) {
+				if (ret == -ENOTTY)
+					ret = 0;
+				kunmap_atomic(kaddr);
+				kunmap_atomic(prot_buf);
+				return ret;
+			}
+
+			bv_offset += bix.data_size;
+			bv_len -= bix.data_size;
+			sectors = bix.data_size / bi->sector_size;
+			sector += sectors;
+			prot_offset += sectors * bi->tuple_size;
+			prot_len -= sectors * bi->tuple_size;
+			total += sectors * bi->tuple_size;
+			BUG_ON(total > bio->bi_integrity->bip_iter.bi_size);
+		}
+		kunmap_atomic(kaddr);
+	}
+	kunmap_atomic(prot_buf);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(bio_integrity_update_user_buffer);
+
+/**
  * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
  * @bio:	bio to generate/verify integrity metadata for
  * @operate:	operate number, 1 for generate, 0 for verify
@@ -395,6 +471,7 @@ int bio_integrity_prep_buffer(struct bio *bio, int rw,
 	unsigned int len, nr_pages;
 	unsigned int bytes, i;
 	unsigned int sectors;
+	int ret;
 
 	bi = bdev_get_integrity(bio->bi_bdev);
 	BUG_ON(bi == NULL);
@@ -414,7 +491,7 @@ int bio_integrity_prep_buffer(struct bio *bio, int rw,
 	}
 
 	/* Allocate bio integrity payload and integrity vectors */
-	bip = bio_integrity_alloc(bio, GFP_NOIO, pi->pi_nrpages);
+	bip = bio_integrity_alloc(bio, GFP_NOIO, nr_pages);
 	if (unlikely(bip == NULL)) {
 		pr_err("could not allocate data integrity bioset\n");
 		return -EIO;
@@ -424,10 +501,10 @@ int bio_integrity_prep_buffer(struct bio *bio, int rw,
 	bip->bip_buf = NULL;
 	bip->bip_iter.bi_size = len;
 	bip->bip_iter.bi_sector = bio->bi_iter.bi_sector;
+	bip->bip_user_flags = pi->pi_userflags;
 
 	/* Map it */
 	for (i = 0 ; i < nr_pages ; i++) {
-		int ret;
 		bytes = PAGE_SIZE - pi->pi_offset;
 
 		if (bytes > pi->pi_len)
@@ -457,9 +534,11 @@ int bio_integrity_prep_buffer(struct bio *bio, int rw,
 	if ((rw & WRITE) == READ) {
 		bip->bip_end_io = bio->bi_end_io;
 		bio->bi_end_io = bio_integrity_endio;
-	}
+		ret = 0;
+	} else
+		ret = bio_integrity_update_user_buffer(bio);
 
-	return 0;
+	return ret;
 }
 EXPORT_SYMBOL(bio_integrity_prep_buffer);
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index ee357dd..9fef197 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -247,8 +247,11 @@ static int dio_tear_down_pi(struct dio *dio)
 static int dio_prep_for_pi(struct dio *dio, struct block_device *bdev, int rw,
 			   struct iovec *pi_iov)
 {
+	struct blk_integrity *bi;
+	size_t tuple_size;
 	unsigned long start, end;
 	struct request_queue *q;
+	uint32_t user_flags;
 	int retval;
 
 	if (!pi_iov)
@@ -257,6 +260,18 @@ static int dio_prep_for_pi(struct dio *dio, struct block_device *bdev, int rw,
 	if (pi_iov->iov_len == 0)
 		return -EINVAL;
 
+	retval = copy_from_user(&user_flags, pi_iov->iov_base,
+				sizeof(user_flags));
+	if (retval)
+		return retval;
+	bi = bdev_get_integrity(bdev);
+	tuple_size = bi->tuple_size;
+	if (tuple_size < sizeof(user_flags))
+		tuple_size = sizeof(user_flags);
+	pi_iov->iov_base += tuple_size;
+	pi_iov->iov_len -= tuple_size;
+	dio->pi_iter.pi_userflags = user_flags;
+
 	end = (((unsigned long)pi_iov->iov_base) + pi_iov->iov_len +
 		PAGE_SIZE - 1) >> PAGE_SHIFT;
 	start = ((unsigned long)pi_iov->iov_base) >> PAGE_SHIFT;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 4729ab1..5bd9618 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -304,7 +304,9 @@ struct bio_integrity_payload {
 	struct work_struct	bip_work;	/* I/O completion */
 
 	struct bio_vec		*bip_vec;
+	unsigned int		bip_user_flags;
 	struct bio_vec		bip_inline_vecs[0];/* embedded bvec array */
+	/* This must be last! */
 };
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
@@ -640,6 +642,7 @@ struct bio_integrity_prep_iter {
 	size_t pi_nrpages;		/* Number of PI data pages */
 	size_t pi_offset;		/* Offset into the page */
 	size_t pi_len;			/* Length of the buffer */
+	unsigned int pi_userflags;	/* Userspace flags */
 };
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4afa4f8..cf1ec22 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1426,12 +1426,14 @@ typedef void (integrity_gen_fn) (struct blk_integrity_exchg *);
 typedef int (integrity_vrfy_fn) (struct blk_integrity_exchg *);
 typedef void (integrity_set_tag_fn) (void *, void *, unsigned int);
 typedef void (integrity_get_tag_fn) (void *, void *, unsigned int);
+typedef int (integrity_mod_user_buf_fn) (struct blk_integrity_exchg *, int);
 
 struct blk_integrity {
 	integrity_gen_fn	*generate_fn;
 	integrity_vrfy_fn	*verify_fn;
 	integrity_set_tag_fn	*set_tag_fn;
 	integrity_get_tag_fn	*get_tag_fn;
+	integrity_mod_user_buf_fn	*mod_user_buf_fn;
 
 	unsigned short		flags;
 	unsigned short		tuple_size;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/5] aio/dio: advertise possible userspace flags
  2014-03-21  4:30 [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO Darrick J. Wong
                   ` (2 preceding siblings ...)
  2014-03-21  4:31 ` [PATCH 3/5] aio/dio: allow user to ask kernel to fill in parts of the protection info Darrick J. Wong
@ 2014-03-21  4:31 ` Darrick J. Wong
  2014-03-21  4:31 ` [PATCH 5/5] blk-integrity: refactor various routines Darrick J. Wong
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2014-03-21  4:31 UTC (permalink / raw)
  To: axboe, martin.petersen, darrick.wong, JBottomley, bcrl, viro
  Cc: linux-fsdevel, linux-aio, linux-scsi, linux-mm

Expose possible userland flags to the new AIO/DIO PI interface so that
userspace can discover what flags exist.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 Documentation/ABI/testing/sysfs-block  |   14 ++++++++++++++
 Documentation/block/data-integrity.txt |   26 +++++++++++++++++++++++++
 block/blk-integrity.c                  |   33 ++++++++++++++++++++++++++++++++
 drivers/scsi/sd_dif.c                  |   11 +++++++++++
 include/linux/blkdev.h                 |    7 +++++++
 5 files changed, 91 insertions(+)


diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index 279da08..989cb80 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -53,6 +53,20 @@ Description:
 		512 bytes of data.
 
 
+What:		/sys/block/<disk>/integrity/tuple_size
+Date:		March 2014
+Contact:	Darrick J. Wong <darrick.wong@oracle.com>
+Description:
+		Size in bytes of the integrity data buffer for each logical
+		block.
+
+What:		/sys/block/<disk>/integrity/write_user_flags
+Date:		March 2014
+Contact:	Darrick J. Wong <darrick.wong@oracle.com>
+Description:
+		Provides a list of flags that userspace can pass to the kernel
+		when supplying integrity data for a write IO.
+
 What:		/sys/block/<disk>/integrity/write_generate
 Date:		June 2008
 Contact:	Martin K. Petersen <martin.petersen@oracle.com>
diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index b72a54f..38a83a7 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -341,7 +341,33 @@ will require extra work due to the application tag.
       specific to the blk_integrity provider) arrange for pre-processing
       of the user buffer prior to issuing the IO.
 
+      'user_write_flags' points to an array of struct blk_integrity_flag,
+      which maps mod_user_buf_fn flags to a description of what they do.
+
       See 6.2 for a description of get_tag_fn and set_tag_fn.
 
+5.5 PASSING INTEGRITY DATA FROM USERSPACE
+
+    The AIO/DIO interface has been extended with a new API to provide
+    userspace programs the ability to provide PI data with a WRITE, or
+    to receive PI data with a READ.  There are two new AIO commands,
+    IOCB_CMD_PREADVM and IOCB_CMD_PWRITEVM.  They have the same general
+    struct iocb format as IOCB_CMD_PREADV and IOCB_CMD_PWRITEV, respectively.
+    The final struct iovec should point to the buffer that contains the
+    PI data.
+
+    This buffer must be aligned to a page boundary, and it must have the
+    following format: Flags are stored in a 32-bit integer.  There must
+    then be padding out to the next multiple of the tuple size.  After
+    that comes the tuple data.  Valid flag values can be found in
+    /sys/block/*/integrity/user_write_flags.  The tuple size can be found
+    in /sys/block/*/integrity/tuple_size.  Tuples must not split a page
+    boundary.
+
+    In general, the flags allow the user program to ask the in-kernel
+    integrity provider to fill in some parts of the tuples.  For example,
+    the T10 DIF provider can fill in the reference tag (sector number) so
+    that userspace can choose not to care about the reference tag.
+
 ----------------------------------------------------------------------
 2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 1cb1eb2..557d28e 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -307,6 +307,26 @@ static ssize_t integrity_write_show(struct blk_integrity *bi, char *page)
 	return sprintf(page, "%d\n", (bi->flags & INTEGRITY_FLAG_WRITE) != 0);
 }
 
+static ssize_t integrity_write_flags_show(struct blk_integrity *bi, char *page)
+{
+	struct blk_integrity_flag *flag = bi->user_write_flags;
+	char *p = page;
+	ssize_t ret = 0;
+
+	while (flag->value) {
+		ret += snprintf(p, PAGE_SIZE - ret, "0x%x: %s\n",
+				flag->value, flag->descr);
+		p = page + ret;
+		flag++;
+	}
+	return ret;
+}
+
+static ssize_t integrity_tuple_size_show(struct blk_integrity *bi, char *page)
+{
+	return sprintf(page, "%d\n", bi->tuple_size);
+}
+
 static struct integrity_sysfs_entry integrity_format_entry = {
 	.attr = { .name = "format", .mode = S_IRUGO },
 	.show = integrity_format_show,
@@ -329,11 +349,23 @@ static struct integrity_sysfs_entry integrity_write_entry = {
 	.store = integrity_write_store,
 };
 
+static struct integrity_sysfs_entry integrity_write_flags_entry = {
+	.attr = { .name = "write_user_flags", .mode = S_IRUGO },
+	.show = integrity_write_flags_show,
+};
+
+static struct integrity_sysfs_entry integrity_tuple_size_entry = {
+	.attr = { .name = "tuple_size", .mode = S_IRUGO },
+	.show = integrity_tuple_size_show,
+};
+
 static struct attribute *integrity_attrs[] = {
 	&integrity_format_entry.attr,
 	&integrity_tag_size_entry.attr,
 	&integrity_read_entry.attr,
 	&integrity_write_entry.attr,
+	&integrity_write_flags_entry.attr,
+	&integrity_tuple_size_entry.attr,
 	NULL,
 };
 
@@ -422,6 +454,7 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
 		bi->get_tag_fn = template->get_tag_fn;
 		bi->tag_size = template->tag_size;
 		bi->mod_user_buf_fn = template->mod_user_buf_fn;
+		bi->user_write_flags = template->user_write_flags;
 	} else
 		bi->name = bi_unsupported_name;
 
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index 74182c9..bea648b 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -182,6 +182,13 @@ static void sd_dif_type1_get_tag(void *prot, void *tag_buf, unsigned int sectors
 	}
 }
 
+static struct blk_integrity_flag dif_flags[] = {
+	{GENERATE_GUARD,	"generate guard tag"},
+	{GENERATE_REF,		"generate ref tag"},
+	{GENERATE_APP,		"generate app tag"},
+	{0, NULL},
+};
+
 static struct blk_integrity dif_type1_integrity_crc = {
 	.name			= "T10-DIF-TYPE1-CRC",
 	.generate_fn		= sd_dif_type1_generate_crc,
@@ -191,6 +198,7 @@ static struct blk_integrity dif_type1_integrity_crc = {
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
 	.mod_user_buf_fn	= sd_dif_type1_mod_crc,
+	.user_write_flags	= dif_flags,
 };
 
 static struct blk_integrity dif_type1_integrity_ip = {
@@ -202,6 +210,7 @@ static struct blk_integrity dif_type1_integrity_ip = {
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
 	.mod_user_buf_fn	= sd_dif_type1_mod_ip,
+	.user_write_flags	= dif_flags,
 };
 
 
@@ -334,6 +343,7 @@ static struct blk_integrity dif_type3_integrity_crc = {
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
 	.mod_user_buf_fn	= sd_dif_type3_mod_crc,
+	.user_write_flags	= dif_flags,
 };
 
 static struct blk_integrity dif_type3_integrity_ip = {
@@ -345,6 +355,7 @@ static struct blk_integrity dif_type3_integrity_ip = {
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
 	.mod_user_buf_fn	= sd_dif_type3_mod_ip,
+	.user_write_flags	= dif_flags,
 };
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cf1ec22..e8e6401 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1428,6 +1428,11 @@ typedef void (integrity_set_tag_fn) (void *, void *, unsigned int);
 typedef void (integrity_get_tag_fn) (void *, void *, unsigned int);
 typedef int (integrity_mod_user_buf_fn) (struct blk_integrity_exchg *, int);
 
+struct blk_integrity_flag {
+	unsigned int value;
+	const char *descr;
+};
+
 struct blk_integrity {
 	integrity_gen_fn	*generate_fn;
 	integrity_vrfy_fn	*verify_fn;
@@ -1443,6 +1448,8 @@ struct blk_integrity {
 	const char		*name;
 
 	struct kobject		kobj;
+
+	struct blk_integrity_flag	*user_write_flags;
 };
 
 extern bool blk_integrity_is_initialized(struct gendisk *);

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 5/5] blk-integrity: refactor various routines
  2014-03-21  4:30 [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO Darrick J. Wong
                   ` (3 preceding siblings ...)
  2014-03-21  4:31 ` [PATCH 4/5] aio/dio: advertise possible userspace flags Darrick J. Wong
@ 2014-03-21  4:31 ` Darrick J. Wong
  2014-03-21 14:57 ` [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO Jeff Moyer
  2014-03-21 18:23 ` Zach Brown
  6 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2014-03-21  4:31 UTC (permalink / raw)
  To: axboe, martin.petersen, darrick.wong, JBottomley, bcrl, viro
  Cc: linux-fsdevel, linux-aio, linux-scsi, linux-mm

Refactor blk-integrity.c to avoid duplicating similar functions, and
remove all users of pi_buf, since it's really only there to handle the
(common) case where the kernel auto-generates all the PI data.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/bio-integrity.c  |  120 +++++++++++++++++++++------------------------------
 include/linux/bio.h |    2 -
 2 files changed, 49 insertions(+), 73 deletions(-)


diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 381ee38..3ff1572 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -97,8 +97,7 @@ void bio_integrity_free(struct bio *bio)
 	struct bio_integrity_payload *bip = bio->bi_integrity;
 	struct bio_set *bs = bio->bi_pool;
 
-	if (bip->bip_owns_buf)
-		kfree(bip->bip_buf);
+	kfree(bip->bip_buf);
 
 	if (bs) {
 		if (bip->bip_slab != BIO_POOL_NONE)
@@ -239,9 +238,11 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
 {
 	struct bio_integrity_payload *bip = bio->bi_integrity;
 	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-	unsigned int nr_sectors;
-
-	BUG_ON(bip->bip_buf == NULL);
+	unsigned int nr_sectors, tag_offset, sectors;
+	void *prot_buf;
+	unsigned int prot_offset, prot_len;
+	struct bio_vec *iv;
+	void (*tag_fn)(void *buf, void *tag_buf, unsigned int);
 
 	if (bi->tag_size == 0)
 		return -1;
@@ -255,10 +256,30 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
 		return -1;
 	}
 
-	if (set)
-		bi->set_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
-	else
-		bi->get_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
+	iv = bip->bip_vec;
+	prot_offset = iv->bv_offset;
+	prot_len = iv->bv_len;
+	prot_buf = kmap_atomic(iv->bv_page);
+	tag_fn = set ? bi->set_tag_fn : bi->get_tag_fn;
+	tag_offset = 0;
+
+	while (nr_sectors) {
+		if (prot_len < bi->tuple_size) {
+			kunmap_atomic(prot_buf);
+			iv++;
+			BUG_ON(iv >= bip->bip_vec + bip->bip_vcnt);
+			prot_offset = iv->bv_offset;
+			prot_len = iv->bv_len;
+			prot_buf = kmap_atomic(iv->bv_page);
+		}
+		sectors = min(prot_len / bi->tuple_size, nr_sectors);
+		tag_fn(prot_buf + prot_offset, tag_buf + tag_offset, sectors);
+		nr_sectors -= sectors;
+		tag_offset += sectors * bi->tuple_size;
+		prot_offset += sectors * bi->tuple_size;
+		prot_len -= sectors * bi->tuple_size;
+	}
+	kunmap_atomic(prot_buf);
 
 	return 0;
 }
@@ -300,28 +321,24 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 }
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
-/**
- * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
- * @bio:	bio to generate/verify integrity metadata for
- */
-int bio_integrity_update_user_buffer(struct bio *bio)
+typedef int (walk_buf_fn)(struct blk_integrity_exchg *bi, int flags);
+
+static int bio_integrity_walk_bufs(struct bio *bio, sector_t sector,
+				   walk_buf_fn *mod_fn)
 {
 	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
 	struct blk_integrity_exchg bix;
 	struct bio_vec bv;
 	struct bvec_iter iter;
-	sector_t sector;
 	unsigned int sectors, total, ret;
 	void *prot_buf;
 	unsigned int prot_offset, prot_len, bv_offset, bv_len;
 	struct bio_vec *iv;
 	struct bio_integrity_payload *bip = bio->bi_integrity;
 
-	if (!bi->mod_user_buf_fn)
+	if (!mod_fn)
 		return 0;
 
-	sector = bio->bi_iter.bi_sector;
-
 	total = ret = 0;
 	bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
 	bix.sector_size = bi->sector_size;
@@ -351,7 +368,7 @@ int bio_integrity_update_user_buffer(struct bio *bio)
 			bix.prot_buf = prot_buf + prot_offset;
 			bix.sector = sector;
 
-			ret = bi->mod_user_buf_fn(&bix, bip->bip_user_flags);
+			ret = mod_fn(&bix, bip->bip_user_flags);
 			if (ret) {
 				if (ret == -ENOTTY)
 					ret = 0;
@@ -374,59 +391,19 @@ int bio_integrity_update_user_buffer(struct bio *bio)
 	kunmap_atomic(prot_buf);
 	return ret;
 }
-EXPORT_SYMBOL_GPL(bio_integrity_update_user_buffer);
 
 /**
- * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
  * @bio:	bio to generate/verify integrity metadata for
- * @operate:	operate number, 1 for generate, 0 for verify
+ * @sector:	stratin
  */
-static int bio_integrity_generate_verify(struct bio *bio, int operate)
+int bio_integrity_update_user_buffer(struct bio *bio)
 {
 	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-	struct blk_integrity_exchg bix;
-	struct bio_vec bv;
-	struct bvec_iter iter;
-	sector_t sector;
-	unsigned int sectors, total, ret;
-	void *prot_buf = bio->bi_integrity->bip_buf;
-
-	if (operate)
-		sector = bio->bi_iter.bi_sector;
-	else
-		sector = bio->bi_integrity->bip_iter.bi_sector;
-
-	total = ret = 0;
-	bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
-	bix.sector_size = bi->sector_size;
-
-	bio_for_each_segment(bv, bio, iter) {
-		void *kaddr = kmap_atomic(bv.bv_page);
-		bix.data_buf = kaddr + bv.bv_offset;
-		bix.data_size = bv.bv_len;
-		bix.prot_buf = prot_buf;
-		bix.sector = sector;
-
-		if (operate) {
-			bi->generate_fn(&bix);
-		} else {
-			ret = bi->verify_fn(&bix);
-			if (ret) {
-				kunmap_atomic(kaddr);
-				return ret;
-			}
-		}
-
-		sectors = bv.bv_len / bi->sector_size;
-		sector += sectors;
-		prot_buf += sectors * bi->tuple_size;
-		total += sectors * bi->tuple_size;
-		BUG_ON(total > bio->bi_integrity->bip_iter.bi_size);
-
-		kunmap_atomic(kaddr);
-	}
-	return ret;
+	return bio_integrity_walk_bufs(bio, bio->bi_iter.bi_sector,
+				       bi->mod_user_buf_fn);
 }
+EXPORT_SYMBOL_GPL(bio_integrity_update_user_buffer);
 
 /**
  * bio_integrity_generate - Generate integrity metadata for a bio
@@ -439,7 +416,9 @@ static int bio_integrity_generate_verify(struct bio *bio, int operate)
  */
 static void bio_integrity_generate(struct bio *bio)
 {
-	bio_integrity_generate_verify(bio, 1);
+	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+	bio_integrity_walk_bufs(bio, bio->bi_iter.bi_sector,
+				(walk_buf_fn *)bi->generate_fn);
 }
 
 static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
@@ -479,7 +458,6 @@ int bio_integrity_prep_buffer(struct bio *bio, int rw,
 
 	sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio));
 
-	/* Allocate kernel buffer for protection data */
 	len = sectors * blk_integrity_tuple_size(bi);
 	end = (pi->pi_offset + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	start = pi->pi_offset >> PAGE_SHIFT;
@@ -497,8 +475,6 @@ int bio_integrity_prep_buffer(struct bio *bio, int rw,
 		return -EIO;
 	}
 
-	bip->bip_owns_buf = 0;
-	bip->bip_buf = NULL;
 	bip->bip_iter.bi_size = len;
 	bip->bip_iter.bi_sector = bio->bi_iter.bi_sector;
 	bip->bip_user_flags = pi->pi_userflags;
@@ -591,7 +567,6 @@ int bio_integrity_prep(struct bio *bio)
 		return -EIO;
 	}
 
-	bip->bip_owns_buf = 1;
 	bip->bip_buf = buf;
 	bip->bip_iter.bi_size = len;
 	bip->bip_iter.bi_sector = bio->bi_iter.bi_sector;
@@ -646,7 +621,10 @@ EXPORT_SYMBOL(bio_integrity_prep);
  */
 static int bio_integrity_verify(struct bio *bio)
 {
-	return bio_integrity_generate_verify(bio, 0);
+	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+	return bio_integrity_walk_bufs(bio,
+				       bio->bi_integrity->bip_iter.bi_sector,
+				       (walk_buf_fn *)bi->verify_fn);
 }
 
 /**
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5bd9618..c2ad1a8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -292,14 +292,12 @@ struct bio_integrity_payload {
 
 	struct bvec_iter	bip_iter;
 
-	/* kill - should just use bip_vec */
 	void			*bip_buf;	/* generated integrity data */
 
 	bio_end_io_t		*bip_end_io;	/* saved I/O completion fn */
 
 	unsigned short		bip_slab;	/* slab the bip came from */
 	unsigned short		bip_vcnt;	/* # of integrity bio_vecs */
-	unsigned		bip_owns_buf:1;	/* should free bip_buf */
 
 	struct work_struct	bip_work;	/* I/O completion */
 

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
  2014-03-21  4:30 [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO Darrick J. Wong
                   ` (4 preceding siblings ...)
  2014-03-21  4:31 ` [PATCH 5/5] blk-integrity: refactor various routines Darrick J. Wong
@ 2014-03-21 14:57 ` Jeff Moyer
  2014-03-21 21:39   ` Darrick J. Wong
  2014-03-21 18:23 ` Zach Brown
  6 siblings, 1 reply; 19+ messages in thread
From: Jeff Moyer @ 2014-03-21 14:57 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, martin.petersen, JBottomley, bcrl, viro, linux-fsdevel,
	linux-aio, linux-scsi, linux-mm

"Darrick J. Wong" <darrick.wong@oracle.com> writes:

> This RFC provides a rough implementation of a mechanism to allow
> userspace to attach protection information (e.g. T10 DIF) data to a
> disk write and to receive the information alongside a disk read.  The
> interface is an extension to the AIO interface: two new commands
> (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the

Sorry for the shallow question, but what does that M stand for?

Cheers,
Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
  2014-03-21  4:30 [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO Darrick J. Wong
                   ` (5 preceding siblings ...)
  2014-03-21 14:57 ` [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO Jeff Moyer
@ 2014-03-21 18:23 ` Zach Brown
  2014-03-21 21:44   ` Benjamin LaHaise
  2014-03-21 22:20   ` Darrick J. Wong
  6 siblings, 2 replies; 19+ messages in thread
From: Zach Brown @ 2014-03-21 18:23 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, martin.petersen, JBottomley, bcrl, viro, linux-fsdevel,
	linux-aio, linux-scsi, linux-mm

On Thu, Mar 20, 2014 at 09:30:41PM -0700, Darrick J. Wong wrote:
> This RFC provides a rough implementation of a mechanism to allow
> userspace to attach protection information (e.g. T10 DIF) data to a
> disk write and to receive the information alongside a disk read.  The
> interface is an extension to the AIO interface: two new commands
> (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> arg list is interpreted to point to a buffer containing a header,
> followed by the the PI data.

Instead of adding commands that indicate that the final element is a
magical pi buffer, why not expand the iocb?

In the user iocb, a bit in aio_flags could indicate that aio_reserved2
is a pointer to an extension of the iocb.  In that extension could be a
full iov *, nr_segs for PI data.

You'd then translate that into a bigger kernel kiocb with a specific
pointer to PI data rather than having to bubble the tests for this magic
final iovec down through the kernel.

+       if (iocb->ki_flags & KIOCB_USE_PI) {
+               nr_segs--;
+               pi_iov = (struct iovec *)(iov + nr_segs);
+       }

I suggest this because there's already pressure to extend the iocb.
Folks want io priority inputs, completion time outputs, etc.

It's a much cleaner way to extend the interface without an explosion of
command enums that are really combinations of per-io arguments that are
present or not.

And heck, on the sync rw syscall side, add variant that have a pointer
to this same extension struct.  There's nothing inherently aio specific
about having lots more per-io inputs and outputs.

- z

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
  2014-03-21 14:57 ` [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO Jeff Moyer
@ 2014-03-21 21:39   ` Darrick J. Wong
  2014-03-21 23:48     ` Zach Brown
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2014-03-21 21:39 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: axboe, martin.petersen, JBottomley, bcrl, viro, linux-fsdevel,
	linux-aio, linux-scsi, linux-mm

On Fri, Mar 21, 2014 at 10:57:31AM -0400, Jeff Moyer wrote:
> "Darrick J. Wong" <darrick.wong@oracle.com> writes:
> 
> > This RFC provides a rough implementation of a mechanism to allow
> > userspace to attach protection information (e.g. T10 DIF) data to a
> > disk write and to receive the information alongside a disk read.  The
> > interface is an extension to the AIO interface: two new commands
> > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> 
> Sorry for the shallow question, but what does that M stand for?

Hmmm... I really don't remember why I picked 'M'.  Probably because it implied
that the IO has extra 'M'etadata associated with it.

But now I see, 'VM' connotes something entirely wrong.

--D
> 
> Cheers,
> Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
  2014-03-21 18:23 ` Zach Brown
@ 2014-03-21 21:44   ` Benjamin LaHaise
  2014-03-21 22:54     ` Darrick J. Wong
  2014-03-21 22:20   ` Darrick J. Wong
  1 sibling, 1 reply; 19+ messages in thread
From: Benjamin LaHaise @ 2014-03-21 21:44 UTC (permalink / raw)
  To: Zach Brown
  Cc: Darrick J. Wong, axboe, martin.petersen, JBottomley, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

Hi folks,

On Fri, Mar 21, 2014 at 11:23:32AM -0700, Zach Brown wrote:
> On Thu, Mar 20, 2014 at 09:30:41PM -0700, Darrick J. Wong wrote:
> > This RFC provides a rough implementation of a mechanism to allow
> > userspace to attach protection information (e.g. T10 DIF) data to a
> > disk write and to receive the information alongside a disk read.  The
> > interface is an extension to the AIO interface: two new commands
> > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> > arg list is interpreted to point to a buffer containing a header,
> > followed by the the PI data.
> 
> Instead of adding commands that indicate that the final element is a
> magical pi buffer, why not expand the iocb?
> 
> In the user iocb, a bit in aio_flags could indicate that aio_reserved2
> is a pointer to an extension of the iocb.  In that extension could be a
> full iov *, nr_segs for PI data.

I'm inclined to agree with Zach on this item.  Ultimately, we need an 
extensible data structure that can be grown without completely revising 
the ABI as new parameters are added.  We need something that is either 
TLV based, or an extensible array.

> You'd then translate that into a bigger kernel kiocb with a specific
> pointer to PI data rather than having to bubble the tests for this magic
> final iovec down through the kernel.
> 
> +       if (iocb->ki_flags & KIOCB_USE_PI) {
> +               nr_segs--;
> +               pi_iov = (struct iovec *)(iov + nr_segs);
> +       }
> 
> I suggest this because there's already pressure to extend the iocb.
> Folks want io priority inputs, completion time outputs, etc.

There are already folks at other companies looking at similar extensions.  
I think there are folks at Google who have similar requirements.

Do you have time to put in some effort into defining these extensions?

		-ben

> It's a much cleaner way to extend the interface without an explosion of
> command enums that are really combinations of per-io arguments that are
> present or not.
> 
> And heck, on the sync rw syscall side, add variant that have a pointer
> to this same extension struct.  There's nothing inherently aio specific
> about having lots more per-io inputs and outputs.
> 
> - z

-- 
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
  2014-03-21 18:23 ` Zach Brown
  2014-03-21 21:44   ` Benjamin LaHaise
@ 2014-03-21 22:20   ` Darrick J. Wong
  2014-03-22  0:00     ` Zach Brown
  1 sibling, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2014-03-21 22:20 UTC (permalink / raw)
  To: Zach Brown
  Cc: axboe, martin.petersen, JBottomley, bcrl, viro, linux-fsdevel,
	linux-aio, linux-scsi, linux-mm

On Fri, Mar 21, 2014 at 11:23:32AM -0700, Zach Brown wrote:
> On Thu, Mar 20, 2014 at 09:30:41PM -0700, Darrick J. Wong wrote:
> > This RFC provides a rough implementation of a mechanism to allow
> > userspace to attach protection information (e.g. T10 DIF) data to a
> > disk write and to receive the information alongside a disk read.  The
> > interface is an extension to the AIO interface: two new commands
> > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> > arg list is interpreted to point to a buffer containing a header,
> > followed by the the PI data.
> 
> Instead of adding commands that indicate that the final element is a
> magical pi buffer, why not expand the iocb?
> 
> In the user iocb, a bit in aio_flags could indicate that aio_reserved2
> is a pointer to an extension of the iocb.  In that extension could be a
> full iov *, nr_segs for PI data.
> 
> You'd then translate that into a bigger kernel kiocb with a specific
> pointer to PI data rather than having to bubble the tests for this magic
> final iovec down through the kernel.
> 
> +       if (iocb->ki_flags & KIOCB_USE_PI) {
> +               nr_segs--;
> +               pi_iov = (struct iovec *)(iov + nr_segs);
> +       }
> 
> I suggest this because there's already pressure to extend the iocb.
> Folks want io priority inputs, completion time outputs, etc.

I'm curious about the reqprio field -- it seems like it was put there to
request some kind of IO priority change, but the kernel doesn't use it.

If aio_reserved2 becomes a (flag-guarded) pointer to an array of aio
extensions, I'd be tempted to reuse the reqprio to signal the length of the
extension array, and if anyone wants to start using reqprio, they could add it
as an extension.

(More about this in my response to Ben LaHaise.)

> It's a much cleaner way to extend the interface without an explosion of
> command enums that are really combinations of per-io arguments that are
> present or not.

Agreed.

> And heck, on the sync rw syscall side, add variant that have a pointer
> to this same extension struct.  There's nothing inherently aio specific
> about having lots more per-io inputs and outputs.

I'm curious -- what kinds of extensions do you envision for sync()?

--D
> 
> - z
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
  2014-03-21 21:44   ` Benjamin LaHaise
@ 2014-03-21 22:54     ` Darrick J. Wong
  2014-03-22  0:29       ` Zach Brown
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2014-03-21 22:54 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Zach Brown, axboe, martin.petersen, JBottomley, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

On Fri, Mar 21, 2014 at 05:44:10PM -0400, Benjamin LaHaise wrote:
> Hi folks,
> 
> On Fri, Mar 21, 2014 at 11:23:32AM -0700, Zach Brown wrote:
> > On Thu, Mar 20, 2014 at 09:30:41PM -0700, Darrick J. Wong wrote:
> > > This RFC provides a rough implementation of a mechanism to allow
> > > userspace to attach protection information (e.g. T10 DIF) data to a
> > > disk write and to receive the information alongside a disk read.  The
> > > interface is an extension to the AIO interface: two new commands
> > > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> > > arg list is interpreted to point to a buffer containing a header,
> > > followed by the the PI data.
> > 
> > Instead of adding commands that indicate that the final element is a
> > magical pi buffer, why not expand the iocb?
> > 
> > In the user iocb, a bit in aio_flags could indicate that aio_reserved2
> > is a pointer to an extension of the iocb.  In that extension could be a
> > full iov *, nr_segs for PI data.
> 
> I'm inclined to agree with Zach on this item.  Ultimately, we need an 
> extensible data structure that can be grown without completely revising 
> the ABI as new parameters are added.  We need something that is either 
> TLV based, or an extensible array.

Ok.  Let's define IOCB_FLAG_EXTENSIONS as an iocb.aio_flags flag to indicate
that this struct iocb has extensions attached to it.  Then, iocb.aio_reserved2
becomes a pointer to an array of extension descriptors, and iocb.aio_reqprio
becomes a u16 that tells us the array length.  The libaio.h equivalents are
iocb.u.c.flags, iocb.u.c.__pad3, and iocb.aio_reqprio, respectively.

Next, let's define a conceptual structure for aio extensions:

struct iocb_extension {
	void *ie_buf;
	unsigned int ie_buflen;
	unsigned int ie_type;
	unsigned int ie_flags;
};

The actual definitions can be defined in a similar fashion to the other aio
structures so that the structures are padded to the same layout regardless of
bitness.  As mentioned above, iocb.aio_reserved2 points to an array of these.

Question: Do we want to allow ie_buf to be struct iovec[]?  Can we leave that
to the extension designer to decide if they want to support either a S-G list,
one big (vaddr) buffer, or toggle flags?

For the PI passthrough, I'll define IOCB_EXT_PI as the first ie_type, and move
the flags argument out of the PI buffer and into ie_flags.

I could also make an IOCB_EXT_REQPRIO where ie_flags = reqprio, but since the
kernel ignores it right now, I don't see much point.

> > You'd then translate that into a bigger kernel kiocb with a specific
> > pointer to PI data rather than having to bubble the tests for this magic
> > final iovec down through the kernel.
> > 
> > +       if (iocb->ki_flags & KIOCB_USE_PI) {
> > +               nr_segs--;
> > +               pi_iov = (struct iovec *)(iov + nr_segs);
> > +       }
> > 
> > I suggest this because there's already pressure to extend the iocb.
> > Folks want io priority inputs, completion time outputs, etc.
> 
> There are already folks at other companies looking at similar extensions.  
> I think there are folks at Google who have similar requirements.

To everyone else interested in AIO extensions: I'd love to hear your ideas.

> Do you have time to put in some effort into defining these extensions?

I think so.  Let's see how much we can get done.

--D
> 
> 		-ben
> 
> > It's a much cleaner way to extend the interface without an explosion of
> > command enums that are really combinations of per-io arguments that are
> > present or not.
> > 
> > And heck, on the sync rw syscall side, add variant that have a pointer
> > to this same extension struct.  There's nothing inherently aio specific
> > about having lots more per-io inputs and outputs.
> > 
> > - z
> 
> -- 
> "Thought is the essence of where you are now."
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
  2014-03-21 21:39   ` Darrick J. Wong
@ 2014-03-21 23:48     ` Zach Brown
  0 siblings, 0 replies; 19+ messages in thread
From: Zach Brown @ 2014-03-21 23:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jeff Moyer, axboe, martin.petersen, JBottomley, bcrl, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

On Fri, Mar 21, 2014 at 02:39:59PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 21, 2014 at 10:57:31AM -0400, Jeff Moyer wrote:
> > "Darrick J. Wong" <darrick.wong@oracle.com> writes:
> > 
> > > This RFC provides a rough implementation of a mechanism to allow
> > > userspace to attach protection information (e.g. T10 DIF) data to a
> > > disk write and to receive the information alongside a disk read.  The
> > > interface is an extension to the AIO interface: two new commands
> > > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> > 
> > Sorry for the shallow question, but what does that M stand for?
> 
> Hmmm... I really don't remember why I picked 'M'.  Probably because it implied
> that the IO has extra 'M'etadata associated with it.

Magical! :)

- z

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
  2014-03-21 22:20   ` Darrick J. Wong
@ 2014-03-22  0:00     ` Zach Brown
  0 siblings, 0 replies; 19+ messages in thread
From: Zach Brown @ 2014-03-22  0:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, martin.petersen, JBottomley, bcrl, viro, linux-fsdevel,
	linux-aio, linux-scsi, linux-mm

On Fri, Mar 21, 2014 at 03:20:25PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 21, 2014 at 11:23:32AM -0700, Zach Brown wrote:
> > On Thu, Mar 20, 2014 at 09:30:41PM -0700, Darrick J. Wong wrote:
> > > This RFC provides a rough implementation of a mechanism to allow
> > > userspace to attach protection information (e.g. T10 DIF) data to a
> > > disk write and to receive the information alongside a disk read.  The
> > > interface is an extension to the AIO interface: two new commands
> > > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> > > arg list is interpreted to point to a buffer containing a header,
> > > followed by the the PI data.
> > 
> > Instead of adding commands that indicate that the final element is a
> > magical pi buffer, why not expand the iocb?
> > 
> > In the user iocb, a bit in aio_flags could indicate that aio_reserved2
> > is a pointer to an extension of the iocb.  In that extension could be a
> > full iov *, nr_segs for PI data.
> > 
> > You'd then translate that into a bigger kernel kiocb with a specific
> > pointer to PI data rather than having to bubble the tests for this magic
> > final iovec down through the kernel.
> > 
> > +       if (iocb->ki_flags & KIOCB_USE_PI) {
> > +               nr_segs--;
> > +               pi_iov = (struct iovec *)(iov + nr_segs);
> > +       }
> > 
> > I suggest this because there's already pressure to extend the iocb.
> > Folks want io priority inputs, completion time outputs, etc.
> 
> I'm curious about the reqprio field -- it seems like it was put there to
> request some kind of IO priority change, but the kernel doesn't use it.

The user-facing iocbs were derived from the posix aio interface which
has a reqprio field (aio(7), aio_reqprio).  I don't think anything's
ever been done with it.

I don't know more about what current io prio stuff people might want to
specify..  ioprio_set(2) args instead of having to bounce through
syscalls and current-> for each op?  cgroup bits?  No idea.

> If aio_reserved2 becomes a (flag-guarded) pointer to an array of aio
> extensions, I'd be tempted to reuse the reqprio to signal the length of the
> extension array, and if anyone wants to start using reqprio, they could add it
> as an extension.

I'll admit, I'm hesitant to cannibalize reqprio for this.  It's a lame
s16.  But maybe it'll be the least awful alternative.

> (More about this in my response to Ben LaHaise.)

(I'll go reply over there too.)

> > And heck, on the sync rw syscall side, add variant that have a pointer
> > to this same extension struct.  There's nothing inherently aio specific
> > about having lots more per-io inputs and outputs.
> 
> I'm curious -- what kinds of extensions do you envision for sync()?

Sorry, that was poorly worded.  By 'sync' I meant the synchronous
classic sys_*write* syscalls.  Maybe we should add another variant with
a "struct io_goo *" pointer, or whatever.

- z

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
  2014-03-21 22:54     ` Darrick J. Wong
@ 2014-03-22  0:29       ` Zach Brown
  2014-03-22  2:32         ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Zach Brown @ 2014-03-22  0:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Benjamin LaHaise, axboe, martin.petersen, JBottomley, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

On Fri, Mar 21, 2014 at 03:54:37PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 21, 2014 at 05:44:10PM -0400, Benjamin LaHaise wrote:
>
> > I'm inclined to agree with Zach on this item.  Ultimately, we need an 
> > extensible data structure that can be grown without completely revising 
> > the ABI as new parameters are added.  We need something that is either 
> > TLV based, or an extensible array.
> 
> Ok.  Let's define IOCB_FLAG_EXTENSIONS as an iocb.aio_flags flag to indicate
> that this struct iocb has extensions attached to it.  Then, iocb.aio_reserved2
> becomes a pointer to an array of extension descriptors, and iocb.aio_reqprio
> becomes a u16 that tells us the array length.  The libaio.h equivalents are
> iocb.u.c.flags, iocb.u.c.__pad3, and iocb.aio_reqprio, respectively.
> 
> Next, let's define a conceptual structure for aio extensions:
> 
> struct iocb_extension {
> 	void *ie_buf;
> 	unsigned int ie_buflen;
> 	unsigned int ie_type;
> 	unsigned int ie_flags;
> };
> 
> The actual definitions can be defined in a similar fashion to the other aio
> structures so that the structures are padded to the same layout regardless of
> bitness.  As mentioned above, iocb.aio_reserved2 points to an array of these.

I'm firmly in the camp that doesn't want to go down this abstract road.
We had this conversation with Kent when he wanted to do something very
similar.

What happens if there are duplicate ie_types?  Is that universally
prohibited, validity left up to the types that are duplicated?  What if
the len is not the right size?  Who checks that?  What if the extension
(they're arguments, but one thing at a time) is writable and the buf
pointers overlap or is unaligned?  Is that cool, who checks it?

Who defines the acceptable set?  Can drivers make up their own weird
types?  How does strace print all this?  How does the security module
universe declare policies that can forbid or allow these things?

Personally, I think this level of dynamism is not worth the complexity.

Can we instead just have a nice easy struct with fixed members that only
grows?

struct some_more_args {
	u64 has; /* = HAS_PI_VEC; */
	u64 pi_vec_ptr;
	u64 pi_vec_nr_segs;
};

struct some_more_args {
	u64 has; /* = HAS_PI_VEC | HAS_MAGIC_THING */
	u64 pi_vec_ptr;
	u64 pi_vec_nr_segs;
	u64 magic_thing;
};

If it only grows and has bits indicating presence then I think we're
good.   You only fetch the space for the bits that are indicated.  You
can return errors for bits you don't recognize.  You could perhaps offer
some way to announce the bits you recognize.

I'll admit, though, that I don't really like having to fetch the 'has'
bits first to find out how large the rest of the struct is.  Maybe
that's not worth worrying about.

Thoughts?  Am I out to lunch here?

> Question: Do we want to allow ie_buf to be struct iovec[]?  Can we leave that
> to the extension designer to decide if they want to support either a S-G list,
> one big (vaddr) buffer, or toggle flags?

No idea.  Either seems doable.  I'd aim for simpler to reduce the number
of weird cases to handle or forbid (iovecs with a byte per page!) unless
Martin thinks people want to vector the PI goo.

> I think so.  Let's see how much we can get done.

FWIW, I'm happy to chat about this in person at LSF next week.  I'll be
around.

- z

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
  2014-03-22  0:29       ` Zach Brown
@ 2014-03-22  2:32         ` Darrick J. Wong
  2014-03-22  9:43           ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2014-03-22  2:32 UTC (permalink / raw)
  To: Zach Brown
  Cc: Benjamin LaHaise, axboe, martin.petersen, JBottomley, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

On Fri, Mar 21, 2014 at 05:29:09PM -0700, Zach Brown wrote:
> On Fri, Mar 21, 2014 at 03:54:37PM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 21, 2014 at 05:44:10PM -0400, Benjamin LaHaise wrote:
> >
> > > I'm inclined to agree with Zach on this item.  Ultimately, we need an 
> > > extensible data structure that can be grown without completely revising 
> > > the ABI as new parameters are added.  We need something that is either 
> > > TLV based, or an extensible array.
> > 
> > Ok.  Let's define IOCB_FLAG_EXTENSIONS as an iocb.aio_flags flag to indicate
> > that this struct iocb has extensions attached to it.  Then, iocb.aio_reserved2
> > becomes a pointer to an array of extension descriptors, and iocb.aio_reqprio
> > becomes a u16 that tells us the array length.  The libaio.h equivalents are
> > iocb.u.c.flags, iocb.u.c.__pad3, and iocb.aio_reqprio, respectively.
> > 
> > Next, let's define a conceptual structure for aio extensions:
> > 
> > struct iocb_extension {
> > 	void *ie_buf;
> > 	unsigned int ie_buflen;
> > 	unsigned int ie_type;
> > 	unsigned int ie_flags;
> > };
> > 
> > The actual definitions can be defined in a similar fashion to the other aio
> > structures so that the structures are padded to the same layout regardless of
> > bitness.  As mentioned above, iocb.aio_reserved2 points to an array of these.
> 
> I'm firmly in the camp that doesn't want to go down this abstract road.
> We had this conversation with Kent when he wanted to do something very
> similar.

Could you point me to this discussion?  I'd like to read it.

> What happens if there are duplicate ie_types?  Is that universally
> prohibited, validity left up to the types that are duplicated?

Yes.

> What if the len is not the right size?  Who checks that?

The extension driver, presumably.

>  What if the extension (they're arguments, but one thing at a time) is
>  writable and the buf pointers overlap or is unaligned?  Is that cool, who
>  checks it?

Each extension driver has to check the alignment.  I don't know what to do
about buffer pointer overlap; if you want to shoot yourself in the foot that's
fine with me.

> Who defines the acceptable set?


>  Can drivers make up their own weird types?

How do you mean?  As far as whatever's in the ie_buf, I think that depends on
the extension.

>  How does strace print all this?  How does the security module universe
>  declare policies that can forbid or allow these things?

I don't know.

> Personally, I think this level of dynamism is not worth the complexity.
> 
> Can we instead just have a nice easy struct with fixed members that only
> grows?
> 
> struct some_more_args {
> 	u64 has; /* = HAS_PI_VEC; */
> 	u64 pi_vec_ptr;
> 	u64 pi_vec_nr_segs;
> };
> 
> struct some_more_args {
> 	u64 has; /* = HAS_PI_VEC | HAS_MAGIC_THING */
> 	u64 pi_vec_ptr;
> 	u64 pi_vec_nr_segs;
> 	u64 magic_thing;
> };
> 
> If it only grows and has bits indicating presence then I think we're
> good.   You only fetch the space for the bits that are indicated.  You
> can return errors for bits you don't recognize.  You could perhaps offer
> some way to announce the bits you recognize.

<shrug> I was gonna just -EINVAL for types we don't recognize, or which don't
apply in this scenario.

> I'll admit, though, that I don't really like having to fetch the 'has'
> bits first to find out how large the rest of the struct is.  Maybe
> that's not worth worrying about.

I'm not worrying about having to pluck 'has' out of the structure, but needing
a function to tell me how big of a buffer I need for a given pile of flags
seems ... icky.  But maybe the ease of modifying strace and security auditors
would make it worth it?

> Thoughts?  Am I out to lunch here?

I don't have a problem adopting your design, aside from the complications of
figuring out how big struct some_more_args really is.

> > Question: Do we want to allow ie_buf to be struct iovec[]?  Can we leave that
> > to the extension designer to decide if they want to support either a S-G list,
> > one big (vaddr) buffer, or toggle flags?
> 
> No idea.  Either seems doable.  I'd aim for simpler to reduce the number
> of weird cases to handle or forbid (iovecs with a byte per page!) unless
> Martin thinks people want to vector the PI goo.

For now I'll leave it as a simple buffer until I hear otherwise.

> > I think so.  Let's see how much we can get done.
> 
> FWIW, I'm happy to chat about this in person at LSF next week.  I'll be
> around.

Me too!

--D
> 
> - z
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
  2014-03-22  2:32         ` Darrick J. Wong
@ 2014-03-22  9:43           ` Darrick J. Wong
  2014-03-23 14:02             ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2014-03-22  9:43 UTC (permalink / raw)
  To: Zach Brown
  Cc: Benjamin LaHaise, axboe, martin.petersen, JBottomley, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

On Fri, Mar 21, 2014 at 07:32:16PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 21, 2014 at 05:29:09PM -0700, Zach Brown wrote:
> > On Fri, Mar 21, 2014 at 03:54:37PM -0700, Darrick J. Wong wrote:
> > > On Fri, Mar 21, 2014 at 05:44:10PM -0400, Benjamin LaHaise wrote:
> > >
> > > > I'm inclined to agree with Zach on this item.  Ultimately, we need an 
> > > > extensible data structure that can be grown without completely revising 
> > > > the ABI as new parameters are added.  We need something that is either 
> > > > TLV based, or an extensible array.
> > > 
> > > Ok.  Let's define IOCB_FLAG_EXTENSIONS as an iocb.aio_flags flag to indicate
> > > that this struct iocb has extensions attached to it.  Then, iocb.aio_reserved2
> > > becomes a pointer to an array of extension descriptors, and iocb.aio_reqprio
> > > becomes a u16 that tells us the array length.  The libaio.h equivalents are
> > > iocb.u.c.flags, iocb.u.c.__pad3, and iocb.aio_reqprio, respectively.
> > > 
> > > Next, let's define a conceptual structure for aio extensions:
> > > 
> > > struct iocb_extension {
> > > 	void *ie_buf;
> > > 	unsigned int ie_buflen;
> > > 	unsigned int ie_type;
> > > 	unsigned int ie_flags;
> > > };
> > > 
> > > The actual definitions can be defined in a similar fashion to the other aio
> > > structures so that the structures are padded to the same layout regardless of
> > > bitness.  As mentioned above, iocb.aio_reserved2 points to an array of these.
> > 
> > I'm firmly in the camp that doesn't want to go down this abstract road.
> > We had this conversation with Kent when he wanted to do something very
> > similar.
> 
> Could you point me to this discussion?  I'd like to read it.

Is it "[RFC, PATCH] Extensible AIO interface"?
http://lkml.iu.edu//hypermail/linux/kernel/1210.0/00651.html 

Regrettably that discussion happened right during that period where I was
pleasantly AWOL from work for a few months. :)

Will read ... tomorrow.

> > What happens if there are duplicate ie_types?  Is that universally
> > prohibited, validity left up to the types that are duplicated?
> 
> Yes.
> 
> > What if the len is not the right size?  Who checks that?
> 
> The extension driver, presumably.
> 
> >  What if the extension (they're arguments, but one thing at a time) is
> >  writable and the buf pointers overlap or is unaligned?  Is that cool, who
> >  checks it?
> 
> Each extension driver has to check the alignment.  I don't know what to do
> about buffer pointer overlap; if you want to shoot yourself in the foot that's
> fine with me.
> 
> > Who defines the acceptable set?

(This was an "I don't know", for anyone who cares.)

> 
> >  Can drivers make up their own weird types?
> 
> How do you mean?  As far as whatever's in the ie_buf, I think that depends on
> the extension.
> 
> >  How does strace print all this?  How does the security module universe
> >  declare policies that can forbid or allow these things?
> 
> I don't know.
> 
> > Personally, I think this level of dynamism is not worth the complexity.
> > 
> > Can we instead just have a nice easy struct with fixed members that only
> > grows?
> > 
> > struct some_more_args {
> > 	u64 has; /* = HAS_PI_VEC; */
> > 	u64 pi_vec_ptr;
> > 	u64 pi_vec_nr_segs;
> > };
> > 
> > struct some_more_args {
> > 	u64 has; /* = HAS_PI_VEC | HAS_MAGIC_THING */
> > 	u64 pi_vec_ptr;
> > 	u64 pi_vec_nr_segs;
> > 	u64 magic_thing;
> > };
> > 
> > If it only grows and has bits indicating presence then I think we're
> > good.   You only fetch the space for the bits that are indicated.  You
> > can return errors for bits you don't recognize.  You could perhaps offer
> > some way to announce the bits you recognize.
> 
> <shrug> I was gonna just -EINVAL for types we don't recognize, or which don't
> apply in this scenario.
> 
> > I'll admit, though, that I don't really like having to fetch the 'has'
> > bits first to find out how large the rest of the struct is.  Maybe
> > that's not worth worrying about.
> 
> I'm not worrying about having to pluck 'has' out of the structure, but needing
> a function to tell me how big of a buffer I need for a given pile of flags
> seems ... icky.  But maybe the ease of modifying strace and security auditors
> would make it worth it?

How about explicitly specifying the structure size in struct some_more_args,
and checking that against whatever we find in .has?  Hm.  I still think that's
too clever for my brain to keep together for long.

I'm also nervous that we could be creating this monster of a structure wherein
some user wants to tack the first and last hints ever created onto an IO, so
now we have to lug this huge structure around that has space for hints that
we're not going to use, and most of which is zeroes.

I think it would be easy to add one of these interfaces to the regular
{read,write}{,v} calls too.

--D
> 
> > Thoughts?  Am I out to lunch here?
> 
> I don't have a problem adopting your design, aside from the complications of
> figuring out how big struct some_more_args really is.
> 
> > > Question: Do we want to allow ie_buf to be struct iovec[]?  Can we leave that
> > > to the extension designer to decide if they want to support either a S-G list,
> > > one big (vaddr) buffer, or toggle flags?
> > 
> > No idea.  Either seems doable.  I'd aim for simpler to reduce the number
> > of weird cases to handle or forbid (iovecs with a byte per page!) unless
> > Martin thinks people want to vector the PI goo.
> 
> For now I'll leave it as a simple buffer until I hear otherwise.
> 
> > > I think so.  Let's see how much we can get done.
> > 
> > FWIW, I'm happy to chat about this in person at LSF next week.  I'll be
> > around.
> 
> Me too!
> 
> --D
> > 
> > - z
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
  2014-03-22  9:43           ` Darrick J. Wong
@ 2014-03-23 14:02             ` Jan Kara
  2014-03-23 17:07               ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2014-03-23 14:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Zach Brown, Benjamin LaHaise, axboe, martin.petersen, JBottomley,
	viro, linux-fsdevel, linux-aio, linux-scsi, linux-mm

On Sat 22-03-14 02:43:20, Darrick J. Wong wrote:
> On Fri, Mar 21, 2014 at 07:32:16PM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 21, 2014 at 05:29:09PM -0700, Zach Brown wrote:
> > > I'll admit, though, that I don't really like having to fetch the 'has'
> > > bits first to find out how large the rest of the struct is.  Maybe
> > > that's not worth worrying about.
> > 
> > I'm not worrying about having to pluck 'has' out of the structure, but needing
> > a function to tell me how big of a buffer I need for a given pile of flags
> > seems ... icky.  But maybe the ease of modifying strace and security auditors
> > would make it worth it?
> 
> How about explicitly specifying the structure size in struct some_more_args,
> and checking that against whatever we find in .has?  Hm.  I still think that's
> too clever for my brain to keep together for long.
> 
> I'm also nervous that we could be creating this monster of a structure wherein
> some user wants to tack the first and last hints ever created onto an IO, so
> now we have to lug this huge structure around that has space for hints that
> we're not going to use, and most of which is zeroes.
  Well, why does it matter that the structure would be big? Are do you
think the memory consumption would matter?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO
  2014-03-23 14:02             ` Jan Kara
@ 2014-03-23 17:07               ` Darrick J. Wong
  0 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2014-03-23 17:07 UTC (permalink / raw)
  To: Jan Kara
  Cc: Zach Brown, Benjamin LaHaise, axboe, martin.petersen, JBottomley,
	viro, linux-fsdevel, linux-aio, linux-scsi, linux-mm

On Sun, Mar 23, 2014 at 03:02:44PM +0100, Jan Kara wrote:
> On Sat 22-03-14 02:43:20, Darrick J. Wong wrote:
> > On Fri, Mar 21, 2014 at 07:32:16PM -0700, Darrick J. Wong wrote:
> > > On Fri, Mar 21, 2014 at 05:29:09PM -0700, Zach Brown wrote:
> > > > I'll admit, though, that I don't really like having to fetch the 'has'
> > > > bits first to find out how large the rest of the struct is.  Maybe
> > > > that's not worth worrying about.
> > > 
> > > I'm not worrying about having to pluck 'has' out of the structure, but needing
> > > a function to tell me how big of a buffer I need for a given pile of flags
> > > seems ... icky.  But maybe the ease of modifying strace and security auditors
> > > would make it worth it?
> > 
> > How about explicitly specifying the structure size in struct some_more_args,
> > and checking that against whatever we find in .has?  Hm.  I still think that's
> > too clever for my brain to keep together for long.
> > 
> > I'm also nervous that we could be creating this monster of a structure wherein
> > some user wants to tack the first and last hints ever created onto an IO, so
> > now we have to lug this huge structure around that has space for hints that
> > we're not going to use, and most of which is zeroes.
>   Well, why does it matter that the structure would be big? Are do you
> think the memory consumption would matter?

I doubt the memory consumption will be a big deal (compared to the size of the
IOs), but I'm a little concerned about the overhead of copying a mostly-zeroes
user buffer into the kernel.  I guess it's not a big deal to copy the whole
thing now and if people complain about the overhead, switch it to let the IO
attribute controllers selectively copy_from_user later.

--D
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-03-23 17:07 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-03-21  4:30 [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO Darrick J. Wong
2014-03-21  4:30 ` [PATCH 1/5] fs/bio-integrity: remove duplicate code Darrick J. Wong
2014-03-21  4:30 ` [PATCH 2/5] aio/dio: enable DIX passthrough Darrick J. Wong
2014-03-21  4:31 ` [PATCH 3/5] aio/dio: allow user to ask kernel to fill in parts of the protection info Darrick J. Wong
2014-03-21  4:31 ` [PATCH 4/5] aio/dio: advertise possible userspace flags Darrick J. Wong
2014-03-21  4:31 ` [PATCH 5/5] blk-integrity: refactor various routines Darrick J. Wong
2014-03-21 14:57 ` [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO Jeff Moyer
2014-03-21 21:39   ` Darrick J. Wong
2014-03-21 23:48     ` Zach Brown
2014-03-21 18:23 ` Zach Brown
2014-03-21 21:44   ` Benjamin LaHaise
2014-03-21 22:54     ` Darrick J. Wong
2014-03-22  0:29       ` Zach Brown
2014-03-22  2:32         ` Darrick J. Wong
2014-03-22  9:43           ` Darrick J. Wong
2014-03-23 14:02             ` Jan Kara
2014-03-23 17:07               ` Darrick J. Wong
2014-03-21 22:20   ` Darrick J. Wong
2014-03-22  0:00     ` Zach Brown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).