linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO
@ 2014-03-24 16:22 Darrick J. Wong
  2014-03-24 16:22 ` [PATCH 1/6] fs/bio-integrity: remove duplicate code Darrick J. Wong
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Darrick J. Wong @ 2014-03-24 16:22 UTC (permalink / raw)
  To: axboe, zab, martin.petersen, darrick.wong, JBottomley, jmoyer,
	bcrl, viro
  Cc: linux-fsdevel, linux-aio, linux-scsi, linux-mm

This RFC provides a rough implementation of a mechanism to allow
userspace to attach protection information (e.g. T10 DIF) data to a
disk write and to receive the information alongside a disk read.
There's a new "IO extension" interface wherein we define a structure
(per zab's comments on the v2 series) io_extension that points to the
the PI data buffer.  These patches are against 3.14-rc7.

NOTE: As far as I know this works, but this is just a refresh of last
week's patchset to start the discussion at LSF, which was moved up to
today.  I've not done rigorous testing, hence the 'donotmerge'.

The first patch is a little bit of code refactoring, as sent in by Gu
Zheng.  It seems to be queued up for 3.15, so I figured I might as well
start from there.

Patch #2 implements a generic IO extension interface so that we can
receive a struct io_extension from userspace containing the structure
size, a flag telling us which extensions we'd like to use (ie_has),
and (eventually) extension data.  There's a small framework for
mapping ie_has bits to actual extensions.

Patch #3 provides the plumbing to get the user's buffer all the way to
the block integrity code.  Due to the way that the code deals with the
array of struct page*s that represent the PI buffer, there's an
unfortunate requirement that no PI tuple may cross a page boundary.
Given that so far DIF is only 8 or 16 bytes this isn't a problem yet.
There's also no explicit fallback for the case where the user pages
are not within a device's DMA range.  This patch hooks into the IO
extension interface.

Patch #4 builds on the previous patch to allow userspace to send some
flags along with the PI buffer.  The integrity provider now has a
"mod_user_buf_fn" hook that enables the provider to read the userspace
flags and modify the PI buffer before submit_bio.  For now, this means
that T10/DIF provider can be told to patch any of the reference, app,
or guard tags.  This is useful for sending PI data with an IO request
for a file on a filesystem, since the kernel can patch in the device's
LBA later.  Also it means that if you only care about, say, app tags,
you can provide those and let the kernel take care of the crc and the
LBA.  I don't know if that's anyone's requirement, but there we are.

Patch #5 provides a mechanism for integrity providers to advertise
both the per-logical-block PI buffer size and the flags that can be
passed to the mod_user_buf_fn hook.  The advertisements can be found
in sysfs, since that's where we present all the other PI details about
a device.

Patch #6 removes redundant code and modifies the tag get/set functions
to follow the other new functions and kmap/unmap the PI buffer page(s)
before messing with the PI buffers, instead of relying on pi_buf being
a valid pointer.

Eventually there will be a patch #7 that makes it so that IO
extensions can be piped through the synchronous IO calls, but it was
nowhere near ready when I sent this patchset. :(

Comments and questions are, as always, welcome.  There will be a
session about this on the second day of LSF/MM, if I'm not mistaken.
A sample program follows this message.

$ cc -o prog prog.c
$ ./prog -rw -pr -s 2048 /path/to/pi/device

--D

/*
 * Userspace DIX API test program
 * Licensed under GPLv2. Copyright 2014 Oracle.
 *
 * XXX: We don't query the kernel for this information like we should!
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <libaio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <errno.h>
#include <stdlib.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <sys/syscall.h>

#define GENERATE_GUARD	(1)
#define GENERATE_REF	(2)
#define GENERATE_APP	(4)
#define GENERATE_ALL	(7)

#define NR_IOS		(1)
#define NR_IOVS		(2)
#define NR_IOCB_EXTS	(1)

/* Stuff that should go in libaio.h */
#define IO_EXT_INVALID	(0)
#define	IO_EXT_PI	(1)	/* protection info attached */

#define IOCB_FLAG_EXTENSIONS	(1 << 1)

#define __FIOEXT	040000000

struct io_extension {
	__u64 ie_size;
	__u64 ie_has;

	/* PI stuff */
	__u64 ie_pi_buf;
	__u32 ie_pi_buflen;
	__u32 ie_pi_ret;
	__u32 ie_pi_flags;
};

static void io_prep_extensions(struct iocb *iocb, struct io_extension *ext,
			       unsigned int nr)
{
	iocb->u.c.flags |= IOCB_FLAG_EXTENSIONS;
	iocb->u.c.__pad3 = (long long)ext;
}

static void io_prep_extension(struct io_extension *ext)
{
	memset(ext, 0, sizeof(struct io_extension));
	ext->ie_size = sizeof(*ext);
}

static void io_prep_extension_pi(struct io_extension *ext, void *buf,
				 unsigned int buflen, unsigned int flags)
{
	ext->ie_has |= IO_EXT_PI;
	ext->ie_pi_buf = (__u64)buf;
	ext->ie_pi_buflen = buflen;
	ext->ie_pi_flags = flags;
}
/* End stuff for libaio.h */

static void dump_buffer(char *buf, size_t len)
{
	size_t off;
	char *p;

	for (p = buf; p < buf + len; p++) {
		off = p - buf;
		if (off % 32 == 0) {
			if (p != buf)
				printf("\n");
			printf("%05zu:", off);
		}
		printf(" %02x", *p & 0xFF);
	}
	printf("\n");
}

/* Table generated using the following polynomium:
 * x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
 * gt: 0x8bb7
 */
static const uint16_t t10_dif_crc_table[256] = {
	0x0000, 0x8BB7, 0x9CD9, 0x176E, 0xB205, 0x39B2, 0x2EDC, 0xA56B,
	0xEFBD, 0x640A, 0x7364, 0xF8D3, 0x5DB8, 0xD60F, 0xC161, 0x4AD6,
	0x54CD, 0xDF7A, 0xC814, 0x43A3, 0xE6C8, 0x6D7F, 0x7A11, 0xF1A6,
	0xBB70, 0x30C7, 0x27A9, 0xAC1E, 0x0975, 0x82C2, 0x95AC, 0x1E1B,
	0xA99A, 0x222D, 0x3543, 0xBEF4, 0x1B9F, 0x9028, 0x8746, 0x0CF1,
	0x4627, 0xCD90, 0xDAFE, 0x5149, 0xF422, 0x7F95, 0x68FB, 0xE34C,
	0xFD57, 0x76E0, 0x618E, 0xEA39, 0x4F52, 0xC4E5, 0xD38B, 0x583C,
	0x12EA, 0x995D, 0x8E33, 0x0584, 0xA0EF, 0x2B58, 0x3C36, 0xB781,
	0xD883, 0x5334, 0x445A, 0xCFED, 0x6A86, 0xE131, 0xF65F, 0x7DE8,
	0x373E, 0xBC89, 0xABE7, 0x2050, 0x853B, 0x0E8C, 0x19E2, 0x9255,
	0x8C4E, 0x07F9, 0x1097, 0x9B20, 0x3E4B, 0xB5FC, 0xA292, 0x2925,
	0x63F3, 0xE844, 0xFF2A, 0x749D, 0xD1F6, 0x5A41, 0x4D2F, 0xC698,
	0x7119, 0xFAAE, 0xEDC0, 0x6677, 0xC31C, 0x48AB, 0x5FC5, 0xD472,
	0x9EA4, 0x1513, 0x027D, 0x89CA, 0x2CA1, 0xA716, 0xB078, 0x3BCF,
	0x25D4, 0xAE63, 0xB90D, 0x32BA, 0x97D1, 0x1C66, 0x0B08, 0x80BF,
	0xCA69, 0x41DE, 0x56B0, 0xDD07, 0x786C, 0xF3DB, 0xE4B5, 0x6F02,
	0x3AB1, 0xB106, 0xA668, 0x2DDF, 0x88B4, 0x0303, 0x146D, 0x9FDA,
	0xD50C, 0x5EBB, 0x49D5, 0xC262, 0x6709, 0xECBE, 0xFBD0, 0x7067,
	0x6E7C, 0xE5CB, 0xF2A5, 0x7912, 0xDC79, 0x57CE, 0x40A0, 0xCB17,
	0x81C1, 0x0A76, 0x1D18, 0x96AF, 0x33C4, 0xB873, 0xAF1D, 0x24AA,
	0x932B, 0x189C, 0x0FF2, 0x8445, 0x212E, 0xAA99, 0xBDF7, 0x3640,
	0x7C96, 0xF721, 0xE04F, 0x6BF8, 0xCE93, 0x4524, 0x524A, 0xD9FD,
	0xC7E6, 0x4C51, 0x5B3F, 0xD088, 0x75E3, 0xFE54, 0xE93A, 0x628D,
	0x285B, 0xA3EC, 0xB482, 0x3F35, 0x9A5E, 0x11E9, 0x0687, 0x8D30,
	0xE232, 0x6985, 0x7EEB, 0xF55C, 0x5037, 0xDB80, 0xCCEE, 0x4759,
	0x0D8F, 0x8638, 0x9156, 0x1AE1, 0xBF8A, 0x343D, 0x2353, 0xA8E4,
	0xB6FF, 0x3D48, 0x2A26, 0xA191, 0x04FA, 0x8F4D, 0x9823, 0x1394,
	0x5942, 0xD2F5, 0xC59B, 0x4E2C, 0xEB47, 0x60F0, 0x779E, 0xFC29,
	0x4BA8, 0xC01F, 0xD771, 0x5CC6, 0xF9AD, 0x721A, 0x6574, 0xEEC3,
	0xA415, 0x2FA2, 0x38CC, 0xB37B, 0x1610, 0x9DA7, 0x8AC9, 0x017E,
	0x1F65, 0x94D2, 0x83BC, 0x080B, 0xAD60, 0x26D7, 0x31B9, 0xBA0E,
	0xF0D8, 0x7B6F, 0x6C01, 0xE7B6, 0x42DD, 0xC96A, 0xDE04, 0x55B3
};

uint16_t crc_t10dif(uint16_t crc, const unsigned char *buffer, uint32_t len)
{
	unsigned int i;

	for (i = 0 ; i < len ; i++)
		crc = (crc << 8) ^ t10_dif_crc_table[((crc >> 8) ^ buffer[i]) & 0xff];

	return crc;
}

struct sd_dif_tuple {
       uint16_t guard_tag;	/* Checksum */
       uint16_t app_tag;		/* Opaque storage */
       uint32_t ref_tag;		/* Target LBA or indirect LBA */
};

static void stamp_pi_buffer(struct sd_dif_tuple *t, uint16_t csum,
			    uint16_t tag, uint32_t sector)
{
	t->guard_tag = htons(csum);
	t->app_tag = htons(tag);
	t->ref_tag = htonl(sector);
}

static void print_help(const char *progname)
{
	printf("Usage: %s [OPTS] fname\n", progname);
	printf("-a	Use this application tag\n");
	printf("-d	Do not use O_DIRECT\n");
	printf("-o	Read/write this many sectors into the device\n");
	printf("-q	Don't use AIO.\n");
	printf("-r	Use DIX to read\n");
	printf("-s	Allocate buffer of this many sectors\n");
	printf("-w	Use DIX to write\n");
	printf("-z	Do not use O_SYNC\n");
}

int main(int argc, char *argv[])
{
	struct sd_dif_tuple *pi;
	int page_size = sysconf(_SC_PAGESIZE);
	io_context_t ioctx;
	struct io_event events[NR_IOS];
	struct iocb iocbs[NR_IOS];
	struct iocb *iocbps[NR_IOS];
	void *buf, *buf2;
	unsigned char *p;
	void *mbuf, *mbuf2;
	int ret, fd, i;
	struct iovec iov[NR_IOVS];
	struct io_extension iocb_ext[NR_IOCB_EXTS];
	int opt;
	int dix_read = 0, dix_write = 0;
	unsigned int SECTOR_SIZE = 0;
	unsigned long long num_sectors = 8, BUF_SIZE;
	unsigned long long sector_offset = 256, BDEV_OFFSET;
	unsigned int APP_TAG = 0xEF53;
	unsigned int the_byte = 0x55;
	size_t pi_buflen;
	int o_direct = O_DIRECT;
	int o_sync = O_SYNC;
	int use_aio = 1;
	uint32_t pi_flags = 0;

	while ((opt = getopt(argc, argv, "b:zdrws:o:a:p:q")) != -1) {
		switch (opt) {
		case 'a':
			APP_TAG = strtoul(optarg, NULL, 0);
			break;
		case 'b':
			the_byte = strtoul(optarg, NULL, 0) & 0xFF;
			break;
		case 'd':
			o_direct = 0;
			break;
		case 'o':
			sector_offset = strtoull(optarg, NULL, 0);
			break;
		case 'p':
			for (i = 0; i < strlen(optarg); i++)
				switch (optarg[i]) {
				case 'a':
					pi_flags |= GENERATE_APP;
					break;
				case 'g':
					pi_flags |= GENERATE_GUARD;
					break;
				case 'r':
					pi_flags |= GENERATE_REF;
					break;
				default:
					print_help(argv[0]);
					return 2;
				}
			break;
		case 'r':
			dix_read = 1;
			break;
		case 's':
			num_sectors = strtoull(optarg, NULL, 0);
			break;
		case 'w':
			dix_write = 1;
			break;
		case 'z':
			o_sync = 0;
			break;
		case 'q':
			use_aio = 0;
			break;
		default:
			print_help(argv[0]);
			return 0;
		}
	}

	if (optind >= argc) {
		print_help(argv[0]);
		return 0;
	}

	if (dix_read)
		fprintf(stderr, "Using DIX read.\n");
	if (dix_write)
		fprintf(stderr, "Using DIX write.\n");

	fd = open(argv[optind], o_direct | o_sync | O_RDWR);
	if (fd < 0) {
		perror(argv[optind]);
		return 1;
	}

	/* For now, don't let non-block devices in */
	SECTOR_SIZE = 512;
	if (ioctl(fd, BLKSSZGET, &SECTOR_SIZE)) {
		perror(argv[optind]);
	}

	pi_buflen = num_sectors * sizeof(struct sd_dif_tuple);

	BUF_SIZE = num_sectors * SECTOR_SIZE;
	BDEV_OFFSET = sector_offset * SECTOR_SIZE;
	fprintf(stderr, "sector=%d num_sectors=%llu pi_len=%zu pi_flag=0x%x\n",
		SECTOR_SIZE, num_sectors, pi_buflen, pi_flags);
	if (posix_memalign(&buf, page_size, BUF_SIZE) ||
	    posix_memalign(&buf2, page_size, BUF_SIZE) ||
	    posix_memalign(&mbuf, page_size, pi_buflen) ||
	    posix_memalign(&mbuf2, page_size, pi_buflen)) {
		perror("memalign");
		return 1;
	}

	if (io_queue_init(2, &ioctx)) {
		perror("io_queue_init");
		return 1;
	}

	/* Write everything out */
	memset(mbuf, 0, pi_buflen);
	memset(buf, the_byte, BUF_SIZE);
	for (p = buf, i = 0, pi = mbuf;
	     i < num_sectors;
	     i++, pi++, p += SECTOR_SIZE)
		stamp_pi_buffer(pi,
				pi_flags & GENERATE_GUARD ? 0 : crc_t10dif(0, p, SECTOR_SIZE),
				pi_flags & GENERATE_APP ? 0 : APP_TAG,
				pi_flags & GENERATE_REF ? 0 : (BDEV_OFFSET / SECTOR_SIZE) + i);
	io_prep_extension(iocb_ext);
	io_prep_extension_pi(iocb_ext, mbuf, pi_buflen, pi_flags);
	iov[0].iov_base = buf;
	iov[0].iov_len = page_size;
	iov[1].iov_base = buf + page_size;
	iov[1].iov_len = BUF_SIZE - page_size;
	iocbps[0] = iocbs;
	io_prep_pwritev(iocbs, fd, iov, NR_IOVS, BDEV_OFFSET);
	if (dix_write)
		io_prep_extensions(iocbs, iocb_ext, NR_IOCB_EXTS);

	fprintf(stderr, "Writing %llu bytes\n", BUF_SIZE);
	if (use_aio) {
		ret = io_submit(ioctx, 1, iocbps);
		if (ret < 0) {
			errno = -ret;
			perror("io_submit");
			return 1;
		}

		ret = io_getevents(ioctx, 1, 1, events, NULL);
		if (ret < 0) {
			errno = -ret;
			perror("io_getevents");
			return 1;
		}

		if ((signed)events[0].res < 0) {
			errno = -((signed)events[0].res);
			perror("io_pwritev");
			return 1;
		}
	} else {
		ret = fcntl(fd, F_GETFL);
		ret = fcntl(fd, F_SETFL, ret | __FIOEXT);
		if (ret) {
			perror("fcntl set");
			return 1;
		}
		fprintf(stderr, "flags=0x%x iocb=%p\n", fcntl(fd, F_GETFL),
			iocb_ext);
		/* ret = pwritev(fd, iov, NR_IOVS, BDEV_OFFSET); */
		ret = syscall(SYS_pwritev, fd, iov, NR_IOVS, BDEV_OFFSET,
			      dix_write ? iocb_ext : NULL);
		if (ret < 0) {
			errno = -ret;
			perror("pwritev");
			return 1;
		}
	}
	fprintf(stderr, "Wrote %lu bytes\n", events[0].res);

	/* Read everything back in */
	memset(buf2, 0x00, BUF_SIZE);
	memset(mbuf2, 0x00, pi_buflen);
	io_prep_extension(iocb_ext);
	io_prep_extension_pi(iocb_ext, mbuf2, pi_buflen, 0);
	iov[0].iov_base = buf2;
	iov[0].iov_len = page_size;
	iov[1].iov_base = buf2 + page_size;
	iov[1].iov_len = BUF_SIZE - page_size;
	iocbps[0] = iocbs;
	io_prep_preadv(iocbs, fd, iov, NR_IOVS, BDEV_OFFSET);
	if (dix_read)
		io_prep_extensions(iocbs, iocb_ext, NR_IOCB_EXTS);

	fprintf(stderr, "Reading %llu bytes\n", BUF_SIZE);
	if (use_aio) {
		ret = io_submit(ioctx, 1, iocbps);
		if (ret < 0) {
			errno = -ret;
			perror("io_submit");
			return 1;
		}

		ret = io_getevents(ioctx, 1, 1, events, NULL);
		if (ret < 0) {
			errno = -ret;
			perror("io_getevents");
			return 1;
		}

		if ((signed)events[0].res < 0) {
			errno = -((signed)events[0].res);
			perror("io_preadv");
			return 1;
		}
	} else {
		ret = fcntl(fd, F_GETFL);
		ret = fcntl(fd, F_SETFL, ret | __FIOEXT);
		if (ret) {
			perror("fcntl set");
			return 1;
		}
		fprintf(stderr, "flags=0x%x\n", fcntl(fd, F_GETFL));
		/* ret = preadv(fd, iov, NR_IOVS, BDEV_OFFSET, iocb_ext); */
		ret = syscall(SYS_preadv, fd, iov, NR_IOVS, BDEV_OFFSET,
			      dix_read ? iocb_ext : NULL);
		if (ret < 0) {
			errno = -ret;
			perror("preadv");
			return 1;
		}
	}
	fprintf(stderr, "Read %lu bytes\n", events[0].res);

	/* Compare */
	ret = 0;
	if (memcmp(buf, buf2, BUF_SIZE)) {
		ret = 2;
		fprintf(stderr, "Buffers do not match!\n");
	}
	if (dix_read && dix_write) {
		fprintf(stderr, "write pi\n");
		dump_buffer(mbuf, pi_buflen);
		fprintf(stderr, "read pi\n");
		dump_buffer(mbuf2, pi_buflen);
		if(memcmp(mbuf, mbuf2, pi_buflen)) {
			ret = 2;
			fprintf(stderr, "DIX buffers do not match!\n");
		}
	} else
		fprintf(stderr, "Need to pass -rw to compare DIX buffers!\n");

	if (io_queue_release(ioctx)) {
		perror("io_queue_release");
		return 1;
	}

	close(fd);
	if (!ret)
		fprintf(stderr, "Success.\n");

	return ret;
}

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/6] fs/bio-integrity: remove duplicate code
  2014-03-24 16:22 [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO Darrick J. Wong
@ 2014-03-24 16:22 ` Darrick J. Wong
  2014-04-02 19:17   ` Zach Brown
  2014-03-24 16:22 ` [PATCH 2/6] io: define an interface for IO extensions Darrick J. Wong
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2014-03-24 16:22 UTC (permalink / raw)
  To: axboe, zab, martin.petersen, darrick.wong, JBottomley, jmoyer,
	bcrl, viro
  Cc: linux-fsdevel, linux-aio, Gu Zheng, linux-scsi, linux-mm

Frøm: Gu Zheng <guz.fnst@cn.fujitsu.com>

Most code of function bio_integrity_verify and bio_integrity_generate
is the same, so introduce a help function bio_integrity_generate_verify()
to remove the duplicate code.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
---
 fs/bio-integrity.c |   83 +++++++++++++++++++++++-----------------------------
 1 file changed, 37 insertions(+), 46 deletions(-)


diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 4f70f38..413312f 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -301,25 +301,26 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
 /**
- * bio_integrity_generate - Generate integrity metadata for a bio
- * @bio:	bio to generate integrity metadata for
- *
- * Description: Generates integrity metadata for a bio by calling the
- * block device's generation callback function.  The bio must have a
- * bip attached with enough room to accommodate the generated
- * integrity metadata.
+ * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * @bio:	bio to generate/verify integrity metadata for
+ * @operate:	operate number, 1 for generate, 0 for verify
  */
-static void bio_integrity_generate(struct bio *bio)
+static int bio_integrity_generate_verify(struct bio *bio, int operate)
 {
 	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
 	struct blk_integrity_exchg bix;
 	struct bio_vec bv;
 	struct bvec_iter iter;
-	sector_t sector = bio->bi_iter.bi_sector;
-	unsigned int sectors, total;
+	sector_t sector;
+	unsigned int sectors, total, ret;
 	void *prot_buf = bio->bi_integrity->bip_buf;
 
-	total = 0;
+	if (operate)
+		sector = bio->bi_iter.bi_sector;
+	else
+		sector = bio->bi_integrity->bip_iter.bi_sector;
+
+	total = ret = 0;
 	bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
 	bix.sector_size = bi->sector_size;
 
@@ -330,7 +331,15 @@ static void bio_integrity_generate(struct bio *bio)
 		bix.prot_buf = prot_buf;
 		bix.sector = sector;
 
-		bi->generate_fn(&bix);
+		if (operate) {
+			bi->generate_fn(&bix);
+		} else {
+			ret = bi->verify_fn(&bix);
+			if (ret) {
+				kunmap_atomic(kaddr);
+				return ret;
+			}
+		}
 
 		sectors = bv.bv_len / bi->sector_size;
 		sector += sectors;
@@ -340,6 +349,21 @@ static void bio_integrity_generate(struct bio *bio)
 
 		kunmap_atomic(kaddr);
 	}
+	return ret;
+}
+
+/**
+ * bio_integrity_generate - Generate integrity metadata for a bio
+ * @bio:	bio to generate integrity metadata for
+ *
+ * Description: Generates integrity metadata for a bio by calling the
+ * block device's generation callback function.  The bio must have a
+ * bip attached with enough room to accommodate the generated
+ * integrity metadata.
+ */
+static void bio_integrity_generate(struct bio *bio)
+{
+	bio_integrity_generate_verify(bio, 1);
 }
 
 static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
@@ -454,40 +478,7 @@ EXPORT_SYMBOL(bio_integrity_prep);
  */
 static int bio_integrity_verify(struct bio *bio)
 {
-	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-	struct blk_integrity_exchg bix;
-	struct bio_vec *bv;
-	sector_t sector = bio->bi_integrity->bip_iter.bi_sector;
-	unsigned int sectors, ret = 0;
-	void *prot_buf = bio->bi_integrity->bip_buf;
-	int i;
-
-	bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
-	bix.sector_size = bi->sector_size;
-
-	bio_for_each_segment_all(bv, bio, i) {
-		void *kaddr = kmap_atomic(bv->bv_page);
-
-		bix.data_buf = kaddr + bv->bv_offset;
-		bix.data_size = bv->bv_len;
-		bix.prot_buf = prot_buf;
-		bix.sector = sector;
-
-		ret = bi->verify_fn(&bix);
-
-		if (ret) {
-			kunmap_atomic(kaddr);
-			return ret;
-		}
-
-		sectors = bv->bv_len / bi->sector_size;
-		sector += sectors;
-		prot_buf += sectors * bi->tuple_size;
-
-		kunmap_atomic(kaddr);
-	}
-
-	return ret;
+	return bio_integrity_generate_verify(bio, 0);
 }
 
 /**

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/6] io: define an interface for IO extensions
  2014-03-24 16:22 [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO Darrick J. Wong
  2014-03-24 16:22 ` [PATCH 1/6] fs/bio-integrity: remove duplicate code Darrick J. Wong
@ 2014-03-24 16:22 ` Darrick J. Wong
       [not found]   ` <20140324162244.10848.46322.stgit-PTl6brltDGh4DFYR7WNSRA@public.gmane.org>
  2014-04-02 19:49   ` Zach Brown
  2014-03-24 16:22 ` [PATCH 3/6] aio/dio: enable PI passthrough Darrick J. Wong
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 21+ messages in thread
From: Darrick J. Wong @ 2014-03-24 16:22 UTC (permalink / raw)
  To: axboe, zab, martin.petersen, darrick.wong, JBottomley, jmoyer,
	bcrl, viro
  Cc: linux-fsdevel, linux-aio, linux-scsi, linux-mm

Define a generic interface to allow userspace to attach metadata to an
IO operation.  This interface will be used initially to implement
protection information (PI) pass through, though it ought to be usable
by anyone else desiring to extend the IO interface.  It should not be
difficult to modify the non-AIO calls to use this mechanism.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/aio.c                     |  180 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/aio.h          |    7 ++
 include/uapi/linux/aio_abi.h |   15 +++-
 3 files changed, 197 insertions(+), 5 deletions(-)


diff --git a/fs/aio.c b/fs/aio.c
index 062a5f6..0c40bdc 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -158,6 +158,11 @@ static struct vfsmount *aio_mnt;
 static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
+static int io_teardown_extensions(struct kiocb *req);
+static int io_setup_extensions(struct kiocb *req, int is_write,
+			       struct io_extension __user *ioext);
+static int iocb_setup_extensions(struct iocb *iocb, struct kiocb *req);
+
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
 	struct qstr this = QSTR_INIT("[aio]", 5);
@@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 	struct io_event	*ev_page, *event;
 	unsigned long	flags;
 	unsigned tail, pos;
+	int ret;
+
+	ret = io_teardown_extensions(iocb);
+	if (ret) {
+		if (!res)
+			res = ret;
+		else if (!res2)
+			res2 = ret;
+		else
+			pr_err("error %d tearing down aio extensions\n", ret);
+	}
 
 	/*
 	 * Special case handling for sync iocbs:
@@ -1350,15 +1366,167 @@ rw_common:
 	return 0;
 }
 
+/* IO extension code */
+#define REQUIRED_STRUCTURE_SIZE(type, member)	\
+	(offsetof(type, member) + sizeof(((type *)NULL)->member))
+#define IO_EXT_SIZE(member) \
+	REQUIRED_STRUCTURE_SIZE(struct io_extension, member)
+
+struct io_extension_type {
+	unsigned int type;
+	unsigned int extension_struct_size;
+	int (*setup_fn)(struct kiocb *, int is_write);
+	int (*destroy_fn)(struct kiocb *);
+};
+
+static struct io_extension_type extensions[] = {
+	{IO_EXT_INVALID, 0, NULL, NULL},
+};
+
+static int is_write_iocb(struct iocb *iocb)
+{
+	switch (iocb->aio_lio_opcode) {
+	case IOCB_CMD_PWRITE:
+	case IOCB_CMD_PWRITEV:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
+static int io_teardown_extensions(struct kiocb *req)
+{
+	struct io_extension_type *iet;
+	int ret, ret2;
+
+	if (req->ki_ioext == NULL)
+		return 0;
+
+	/* Shut down all the extensions */
+	ret = 0;
+	for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
+		if (!(req->ki_ioext->ke_kern.ie_has & iet->type))
+			continue;
+		ret2 = iet->destroy_fn(req);
+		if (ret2 && !ret)
+			ret = ret2;
+	}
+
+	/* Copy out return values */
+	if (unlikely(copy_to_user(req->ki_ioext->ke_user,
+				  &req->ki_ioext->ke_kern,
+				  sizeof(struct io_extension)))) {
+		if (!ret)
+			ret = -EFAULT;
+	}
+
+	kfree(req->ki_ioext);
+	req->ki_ioext = NULL;
+	return ret;
+}
+
+static int io_check_bufsize(__u64 has, __u64 size)
+{
+	struct io_extension_type *iet;
+	__u64 all_flags = 0;
+
+	for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
+		all_flags |= iet->type;
+		if (!(has & iet->type))
+			continue;
+		if (iet->extension_struct_size > size)
+			return -EINVAL;
+	}
+
+	if (has & ~all_flags)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int io_setup_extensions(struct kiocb *req, int is_write,
+			       struct io_extension __user *ioext)
+{
+	struct io_extension_type *iet;
+	__u64 sz, has;
+	int ret;
+
+	/* Check size of buffer */
+	if (unlikely(copy_from_user(&sz, &ioext->ie_size, sizeof(sz))))
+		return -EFAULT;
+	if (sz > PAGE_SIZE ||
+	    sz > sizeof(struct io_extension) ||
+	    sz < IO_EXT_SIZE(ie_has))
+		return -EINVAL;
+
+	/* Check that the buffer's big enough */
+	if (unlikely(copy_from_user(&has, &ioext->ie_has, sizeof(has))))
+		return -EFAULT;
+	ret = io_check_bufsize(has, sz);
+	if (ret)
+		return ret;
+
+	/* Copy from userland */
+	req->ki_ioext = kzalloc(sizeof(struct kio_extension), GFP_NOIO);
+	if (!req->ki_ioext)
+		return -ENOMEM;
+
+	req->ki_ioext->ke_user = ioext;
+	if (unlikely(copy_from_user(&req->ki_ioext->ke_kern, ioext, sz))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/* Try to initialize all the extensions */
+	has = 0;
+	for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
+		if (!(req->ki_ioext->ke_kern.ie_has & iet->type))
+			continue;
+		ret = iet->setup_fn(req, is_write);
+		if (ret) {
+			req->ki_ioext->ke_kern.ie_has = has;
+			goto out_destroy;
+		}
+		has |= iet->type;
+	}
+
+	return 0;
+out_destroy:
+	io_teardown_extensions(req);
+out:
+	kfree(req->ki_ioext);
+	req->ki_ioext = NULL;
+	return ret;
+}
+
+static int iocb_setup_extensions(struct iocb *iocb, struct kiocb *req)
+{
+	struct io_extension __user *user_ext;
+
+	if (!(iocb->aio_flags & IOCB_FLAG_EXTENSIONS))
+		return 0;
+
+	user_ext = (struct io_extension __user *)iocb->aio_extension_ptr;
+	return io_setup_extensions(req, is_write_iocb(iocb), user_ext);
+}
+
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 struct iocb *iocb, bool compat)
 {
 	struct kiocb *req;
 	ssize_t ret;
 
+	/* check for flags we don't know about */
+	if (iocb->aio_flags & ~IOCB_FLAG_ALL) {
+		pr_debug("EINVAL: invalid flags\n");
+		return -EINVAL;
+	}
+
 	/* enforce forwards compatibility on users */
-	if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
-		pr_debug("EINVAL: reserve field set\n");
+	if (unlikely(iocb->aio_reserved1 ||
+	    (!(iocb->aio_flags & IOCB_FLAG_EXTENSIONS) &&
+	     iocb->aio_extension_ptr))) {
+		pr_debug("EINVAL: reserved field set\n");
 		return -EINVAL;
 	}
 
@@ -1408,13 +1576,19 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	req->ki_pos = iocb->aio_offset;
 	req->ki_nbytes = iocb->aio_nbytes;
 
+	ret = iocb_setup_extensions(iocb, req);
+	if (unlikely(ret))
+		goto out_del_ext;
+
 	ret = aio_run_iocb(req, iocb->aio_lio_opcode,
 			   (char __user *)(unsigned long)iocb->aio_buf,
 			   compat);
 	if (ret)
-		goto out_put_req;
+		goto out_del_ext;
 
 	return 0;
+out_del_ext:
+	io_teardown_extensions(req);
 out_put_req:
 	put_reqs_available(ctx, 1);
 	percpu_ref_put(&ctx->reqs);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..60f4364 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -29,6 +29,10 @@ struct kiocb;
 
 typedef int (kiocb_cancel_fn)(struct kiocb *);
 
+struct kio_extension {
+	struct io_extension __user *ke_user;
+	struct io_extension ke_kern;
+};
 struct kiocb {
 	struct file		*ki_filp;
 	struct kioctx		*ki_ctx;	/* NULL for sync ops */
@@ -52,6 +56,9 @@ struct kiocb {
 	 * this is the underlying eventfd context to deliver events to.
 	 */
 	struct eventfd_ctx	*ki_eventfd;
+
+	/* Kernel copy of extension descriptors */
+	struct kio_extension	*ki_ioext;
 };
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f..07ffd1f 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -53,6 +53,8 @@ enum {
  *                   is valid.
  */
 #define IOCB_FLAG_RESFD		(1 << 0)
+#define IOCB_FLAG_EXTENSIONS	(1 << 1)
+#define IOCB_FLAG_ALL		(IOCB_FLAG_RESFD | IOCB_FLAG_EXTENSIONS)
 
 /* read() from /dev/aio returns these structures. */
 struct io_event {
@@ -70,6 +72,15 @@ struct io_event {
 #error edit for your odd byteorder.
 #endif
 
+/* IO extension types */
+#define IO_EXT_INVALID	(0)
+
+/* IO extension descriptor */
+struct io_extension {
+	__u64 ie_size;
+	__u64 ie_has;
+};
+
 /*
  * we always use a 64bit off_t when communicating
  * with userland.  its up to libraries to do the
@@ -91,8 +102,8 @@ struct iocb {
 	__u64	aio_nbytes;
 	__s64	aio_offset;
 
-	/* extra parameters */
-	__u64	aio_reserved2;	/* TODO: use this for a (struct sigevent *) */
+	/* aio extensions, only present if IOCB_FLAG_EXTENSIONS */
+	__u64	aio_extension_ptr;
 
 	/* flags for the "struct iocb" */
 	__u32	aio_flags;

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/6] aio/dio: enable PI passthrough
  2014-03-24 16:22 [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO Darrick J. Wong
  2014-03-24 16:22 ` [PATCH 1/6] fs/bio-integrity: remove duplicate code Darrick J. Wong
  2014-03-24 16:22 ` [PATCH 2/6] io: define an interface for IO extensions Darrick J. Wong
@ 2014-03-24 16:22 ` Darrick J. Wong
  2014-04-02 20:01   ` Zach Brown
  2014-03-24 16:22 ` [PATCH 4/6] PI IO extension: allow user to ask kernel to fill in parts of the protection info Darrick J. Wong
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2014-03-24 16:22 UTC (permalink / raw)
  To: axboe, zab, martin.petersen, darrick.wong, JBottomley, jmoyer,
	bcrl, viro
  Cc: linux-fsdevel, linux-aio, linux-scsi, linux-mm

Provide an IO extension handler that attaches PI data from the io
extension structure to a kiocb, then teach directio how to attach the
pages representing the PI buffer directly to a bio.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 Documentation/block/data-integrity.txt |   11 ++++
 fs/aio.c                               |   62 +++++++++++++++++++++
 fs/bio-integrity.c                     |   94 +++++++++++++++++++++++++++++++-
 fs/direct-io.c                         |   70 +++++++++++++++++++-----
 include/linux/aio.h                    |   10 +++
 include/linux/bio.h                    |   15 +++++
 include/uapi/linux/aio_abi.h           |    6 ++
 mm/filemap.c                           |    6 ++
 8 files changed, 259 insertions(+), 15 deletions(-)


diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index 2d735b0a..1d1f070 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -282,6 +282,17 @@ will require extra work due to the application tag.
       It is up to the receiver to process them and verify data
       integrity upon completion.
 
+    int bio_integrity_prep_buffer(struct bio *bio, int rw,
+				  struct bio_integrity_prep_iter *pi);
+
+      This function should be called before submit_bio; its purpose is to
+      attach an arbitrary array of struct page * containing integrity data
+      to an existing bio.  Primarily this is intended for AIO/DIO to be
+      able to attach a userspace buffer to a bio.
+
+      The bio_integrity_prep_iter should contain the page offset and buffer
+      length of the PI buffer, the number of pages, and the actual array of
+      pages, as returned by get_user_pages.
 
 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
     METADATA
diff --git a/fs/aio.c b/fs/aio.c
index 0c40bdc..3f932c3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1379,7 +1379,69 @@ struct io_extension_type {
 	int (*destroy_fn)(struct kiocb *);
 };
 
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static int destroy_pi_ext(struct kiocb *req)
+{
+	unsigned int i;
+
+	if (req->ki_ioext->ke_pi_iter.pi_userpages == NULL)
+		return 0;
+
+	for (i = 0; i < req->ki_ioext->ke_pi_iter.pi_nrpages; i++)
+		page_cache_release(req->ki_ioext->ke_pi_iter.pi_userpages[i]);
+	kfree(req->ki_ioext->ke_pi_iter.pi_userpages);
+	req->ki_ioext->ke_pi_iter.pi_userpages = NULL;
+
+	return 0;
+}
+
+static int setup_pi_ext(struct kiocb *req, int is_write)
+{
+	struct file *file = req->ki_filp;
+	struct io_extension *ext = &req->ki_ioext->ke_kern;
+	void *p;
+	unsigned long start, end;
+	int retval;
+
+	if (!(file->f_flags & O_DIRECT)) {
+		pr_debug("EINVAL: can't use PI without O_DIRECT.\n");
+		return -EINVAL;
+	}
+
+	BUG_ON(req->ki_ioext->ke_pi_iter.pi_userpages);
+
+	end = (((unsigned long)ext->ie_pi_buf) + ext->ie_pi_buflen +
+		PAGE_SIZE - 1) >> PAGE_SHIFT;
+	start = ((unsigned long)ext->ie_pi_buf) >> PAGE_SHIFT;
+	req->ki_ioext->ke_pi_iter.pi_offset = offset_in_page(ext->ie_pi_buf);
+	req->ki_ioext->ke_pi_iter.pi_len = ext->ie_pi_buflen;
+	req->ki_ioext->ke_pi_iter.pi_nrpages = end - start;
+	p = kzalloc(req->ki_ioext->ke_pi_iter.pi_nrpages *
+		    sizeof(struct page *),
+		    GFP_NOIO);
+	if (p == NULL) {
+		pr_err("%s: no room for page array?\n", __func__);
+		return -ENOMEM;
+	}
+	req->ki_ioext->ke_pi_iter.pi_userpages = p;
+
+	retval = get_user_pages_fast((unsigned long)ext->ie_pi_buf,
+				     req->ki_ioext->ke_pi_iter.pi_nrpages,
+				     is_write,
+				     req->ki_ioext->ke_pi_iter.pi_userpages);
+	if (retval != req->ki_ioext->ke_pi_iter.pi_nrpages) {
+		pr_err("%s: couldn't map pages?\n", __func__);
+		req->ki_ioext->ke_pi_iter.pi_nrpages = retval;
+		return -ENOMEM;
+	}
+	req->ki_flags |= KIOCB_DIO_ONLY;
+
+	return 0;
+}
+#endif
+
 static struct io_extension_type extensions[] = {
+	{IO_EXT_PI, IO_EXT_SIZE(ie_pi_ret), setup_pi_ext, destroy_pi_ext},
 	{IO_EXT_INVALID, 0, NULL, NULL},
 };
 
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 413312f..3df9aeb 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -138,7 +138,7 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
 	struct bio_vec *iv;
 
 	if (bip->bip_vcnt >= bip_integrity_vecs(bip)) {
-		printk(KERN_ERR "%s: bip_vec full\n", __func__);
+		pr_err("%s: bip_vec full\n", __func__);
 		return 0;
 	}
 
@@ -250,7 +250,7 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
 					DIV_ROUND_UP(len, bi->tag_size));
 
 	if (nr_sectors * bi->tuple_size > bip->bip_iter.bi_size) {
-		printk(KERN_ERR "%s: tag too big for bio: %u > %u\n", __func__,
+		pr_err("%s: tag too big for bio: %u > %u\n", __func__,
 		       nr_sectors * bi->tuple_size, bip->bip_iter.bi_size);
 		return -1;
 	}
@@ -375,6 +375,96 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
 }
 
 /**
+ * bio_integrity_prep_buffer - Prepare bio for integrity I/O
+ * @bio:	bio to prepare
+ * @rw:		data direction for the bio
+ * @pi:		pi data to attach to bio
+ *
+ * Description: Allocates a buffer for integrity metadata, maps the
+ * pages and attaches them to a bio.  The bio must have target device
+ * and start sector set prior to calling.  The pages specified in the
+ * @pi argument should contain integrity metadata in the WRITE case,
+ * and should be ready to receive metadata in the READ case.
+ */
+int bio_integrity_prep_buffer(struct bio *bio, int rw,
+			      struct bio_integrity_prep_iter *pi)
+{
+	struct bio_integrity_payload *bip;
+	struct blk_integrity *bi;
+	unsigned long start, end;
+	unsigned int len, nr_pages;
+	unsigned int bytes, i;
+	unsigned int sectors;
+	int ret;
+
+	bi = bdev_get_integrity(bio->bi_bdev);
+	BUG_ON(bi == NULL);
+	BUG_ON(bio_integrity(bio));
+
+	sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio));
+
+	/* Allocate kernel buffer for protection data */
+	len = sectors * blk_integrity_tuple_size(bi);
+	end = (pi->pi_offset + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	start = pi->pi_offset >> PAGE_SHIFT;
+	nr_pages = end - start;
+
+	if (pi->pi_len < len) {
+		pr_err("%s: not enough space left in buffer!\n", __func__);
+		return -ENOMEM;
+	}
+
+	/* Allocate bio integrity payload and integrity vectors */
+	bip = bio_integrity_alloc(bio, GFP_NOIO, nr_pages);
+	if (unlikely(bip == NULL)) {
+		pr_err("could not allocate data integrity bioset\n");
+		return -EIO;
+	}
+
+	bip->bip_owns_buf = 0;
+	bip->bip_buf = NULL;
+	bip->bip_iter.bi_size = len;
+	bip->bip_iter.bi_sector = bio->bi_iter.bi_sector;
+
+	/* Map it */
+	for (i = 0 ; i < nr_pages ; i++) {
+		bytes = PAGE_SIZE - pi->pi_offset;
+
+		if (bytes > pi->pi_len)
+			bytes = pi->pi_len;
+		if (bytes > len)
+			bytes = len;
+		if (pi->pi_len <= 0 || len == 0)
+			break;
+
+		ret = bio_integrity_add_page(bio, *pi->pi_userpages,
+					     bytes, pi->pi_offset);
+
+		if (ret == 0)
+			return -EIO;
+
+		if (ret < bytes)
+			break;
+
+		len -= bytes;
+		pi->pi_len -= bytes;
+		if (pi->pi_offset + bytes == PAGE_SIZE)
+			pi->pi_userpages++;
+		pi->pi_offset = (pi->pi_offset + bytes) % PAGE_SIZE;
+	}
+
+	/* Install custom I/O completion handler if read verify is enabled */
+	if ((rw & WRITE) == READ) {
+		bip->bip_end_io = bio->bi_end_io;
+		bio->bi_end_io = bio_integrity_endio;
+		ret = 0;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(bio_integrity_prep_buffer);
+
+/**
  * bio_integrity_prep - Prepare bio for integrity I/O
  * @bio:	bio to prepare
  *
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 160a548..3f591f8 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -111,6 +111,10 @@ struct dio_submit {
 	 */
 	unsigned head;			/* next page to process */
 	unsigned tail;			/* last valid page + 1 */
+
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+	struct bio_integrity_prep_iter	pi_iter;
+#endif
 };
 
 /* dio_state communicated between submission path and end_io */
@@ -221,6 +225,7 @@ static inline struct page *dio_get_page(struct dio *dio,
 	return dio->pages[sdio->head++];
 }
 
+
 /**
  * dio_complete() - called when all DIO BIO I/O has been completed
  * @offset: the byte offset in the file of the completed operation
@@ -385,6 +390,22 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
 
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static int dio_prep_pi_buffers(struct dio *dio, struct dio_submit *sdio)
+{
+	struct bio *bio = sdio->bio;
+	if (sdio->pi_iter.pi_userpages == NULL || !bio_integrity_enabled(bio))
+		return 0;
+
+	return bio_integrity_prep_buffer(bio, dio->rw, &sdio->pi_iter);
+}
+#else
+static int dio_prep_pi_buffers(struct dio *dio, struct dio_submit *sdio)
+{
+	return 0;
+}
+#endif
+
 /*
  * In the AIO read case we speculatively dirty the pages before starting IO.
  * During IO completion, any of these pages which happen to have been written
@@ -392,13 +413,18 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
  *
  * bios hold a dio reference between submit_bio and ->end_io.
  */
-static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
+static inline int dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 {
 	struct bio *bio = sdio->bio;
 	unsigned long flags;
+	int ret = 0;
 
 	bio->bi_private = dio;
 
+	ret = dio_prep_pi_buffers(dio, sdio);
+	if (ret)
+		return ret;
+
 	spin_lock_irqsave(&dio->bio_lock, flags);
 	dio->refcount++;
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
@@ -415,6 +441,8 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 	sdio->bio = NULL;
 	sdio->boundary = 0;
 	sdio->logical_offset_in_bio = 0;
+
+	return ret;
 }
 
 /*
@@ -736,8 +764,11 @@ static inline int dio_send_cur_page(struct dio *dio, struct dio_submit *sdio,
 		 * have.
 		 */
 		if (sdio->final_block_in_bio != sdio->cur_page_block ||
-		    cur_offset != bio_next_offset)
-			dio_bio_submit(dio, sdio);
+		    cur_offset != bio_next_offset) {
+			ret = dio_bio_submit(dio, sdio);
+			if (ret)
+				goto out;
+		}
 	}
 
 	if (sdio->bio == NULL) {
@@ -747,7 +778,9 @@ static inline int dio_send_cur_page(struct dio *dio, struct dio_submit *sdio,
 	}
 
 	if (dio_bio_add_page(sdio) != 0) {
-		dio_bio_submit(dio, sdio);
+		ret = dio_bio_submit(dio, sdio);
+		if (ret)
+			goto out;
 		ret = dio_new_bio(dio, sdio, sdio->cur_page_block, map_bh);
 		if (ret == 0) {
 			ret = dio_bio_add_page(sdio);
@@ -823,8 +856,12 @@ out:
 	 * avoid metadata seeks.
 	 */
 	if (sdio->boundary) {
+		int ret2;
+
 		ret = dio_send_cur_page(dio, sdio, map_bh);
-		dio_bio_submit(dio, sdio);
+		ret2 = dio_bio_submit(dio, sdio);
+		if (ret == 0)
+			ret = ret2;
 		page_cache_release(sdio->cur_page);
 		sdio->cur_page = NULL;
 	}
@@ -1120,7 +1157,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	unsigned blocksize_mask = (1 << blkbits) - 1;
 	ssize_t retval = -EINVAL;
 	loff_t end = offset;
-	struct dio *dio;
+	struct dio *dio = NULL;
 	struct dio_submit sdio = { 0, };
 	unsigned long user_addr;
 	size_t bytes;
@@ -1187,8 +1224,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 							      end - 1);
 			if (retval) {
 				mutex_unlock(&inode->i_mutex);
-				kmem_cache_free(dio_cache, dio);
-				goto out;
+				goto out_dio;
 			}
 		}
 	}
@@ -1217,8 +1253,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 			 * We grab i_mutex only for reads so we don't have
 			 * to release it here
 			 */
-			kmem_cache_free(dio_cache, dio);
-			goto out;
+			goto out_dio;
 		}
 	}
 
@@ -1228,6 +1263,9 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	atomic_inc(&inode->i_dio_count);
 
 	retval = 0;
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+	sdio.pi_iter = iocb->ki_ioext->ke_pi_iter;
+#endif
 	sdio.blkbits = blkbits;
 	sdio.blkfactor = i_blkbits - blkbits;
 	sdio.block_in_file = offset >> blkbits;
@@ -1315,8 +1353,12 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 		page_cache_release(sdio.cur_page);
 		sdio.cur_page = NULL;
 	}
-	if (sdio.bio)
-		dio_bio_submit(dio, &sdio);
+	if (sdio.bio) {
+		int ret2;
+		ret2 = dio_bio_submit(dio, &sdio);
+		if (retval == 0)
+			retval = ret2;
+	}
 
 	blk_finish_plug(&plug);
 
@@ -1353,7 +1395,9 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 		retval = dio_complete(dio, offset, retval, false);
 	} else
 		BUG_ON(retval != -EIOCBQUEUED);
-
+	return retval;
+out_dio:
+	kmem_cache_free(dio_cache, dio);
 out:
 	return retval;
 }
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 60f4364..3f142b8 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -6,6 +6,7 @@
 #include <linux/aio_abi.h>
 #include <linux/uio.h>
 #include <linux/rcupdate.h>
+#include <linux/bio.h>
 
 #include <linux/atomic.h>
 
@@ -14,6 +15,8 @@ struct kiocb;
 
 #define KIOCB_KEY		0
 
+#define KIOCB_DIO_ONLY	(1)	/* don't try buffered if directio fails */
+
 /*
  * We use ki_cancel == KIOCB_CANCELLED to indicate that a kiocb has been either
  * cancelled or completed (this makes a certain amount of sense because
@@ -29,10 +32,15 @@ struct kiocb;
 
 typedef int (kiocb_cancel_fn)(struct kiocb *);
 
+/* per-kiocb extension data */
 struct kio_extension {
 	struct io_extension __user *ke_user;
 	struct io_extension ke_kern;
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+	struct bio_integrity_prep_iter	ke_pi_iter;	/* PI buffers */
+#endif
 };
+
 struct kiocb {
 	struct file		*ki_filp;
 	struct kioctx		*ki_ctx;	/* NULL for sync ops */
@@ -59,6 +67,8 @@ struct kiocb {
 
 	/* Kernel copy of extension descriptors */
 	struct kio_extension	*ki_ioext;
+
+	unsigned int		ki_flags;
 };
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5a4d39b..4729ab1 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -635,6 +635,13 @@ struct biovec_slab {
 	struct kmem_cache *slab;
 };
 
+struct bio_integrity_prep_iter {
+	struct page **pi_userpages;	/* Pages containing PI data */
+	size_t pi_nrpages;		/* Number of PI data pages */
+	size_t pi_offset;		/* Offset into the page */
+	size_t pi_len;			/* Length of the buffer */
+};
+
 /*
  * a small number of entries is fine, not going to be performance critical.
  * basically we just need to survive
@@ -663,6 +670,8 @@ extern int bio_integrity_enabled(struct bio *bio);
 extern int bio_integrity_set_tag(struct bio *, void *, unsigned int);
 extern int bio_integrity_get_tag(struct bio *, void *, unsigned int);
 extern int bio_integrity_prep(struct bio *);
+extern int bio_integrity_prep_buffer(struct bio *, int rw,
+				     struct bio_integrity_prep_iter *);
 extern void bio_integrity_endio(struct bio *, int);
 extern void bio_integrity_advance(struct bio *, unsigned int);
 extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
@@ -693,6 +702,12 @@ static inline void bioset_integrity_free (struct bio_set *bs)
 	return;
 }
 
+static inline int bio_integrity_prep_buffer(struct bio *bio, int rw,
+					    struct bio_integrity_prep_iter *pi)
+{
+	return 0;
+}
+
 static inline int bio_integrity_prep(struct bio *bio)
 {
 	return 0;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 07ffd1f..d7b8c68 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -74,11 +74,17 @@ struct io_event {
 
 /* IO extension types */
 #define IO_EXT_INVALID	(0)
+#define IO_EXT_PI	(1)	/* protection info (checksums, etc) */
 
 /* IO extension descriptor */
 struct io_extension {
 	__u64 ie_size;
 	__u64 ie_has;
+
+	/* PI stuff */
+	__u64 ie_pi_buf;
+	__u32 ie_pi_buflen;
+	__u32 ie_pi_ret;
 };
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6a..d35ddb3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2477,6 +2477,12 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 							ppos, count, ocount);
 		if (written < 0 || written == count)
 			goto out;
+
+		if (iocb->ki_flags & KIOCB_DIO_ONLY) {
+			err = -EINVAL;
+			goto out;
+		}
+
 		/*
 		 * direct-io write to a hole: fall through to buffered I/O
 		 * for completing the rest of the request.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/6] PI IO extension: allow user to ask kernel to fill in parts of the protection info
  2014-03-24 16:22 [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO Darrick J. Wong
                   ` (2 preceding siblings ...)
  2014-03-24 16:22 ` [PATCH 3/6] aio/dio: enable PI passthrough Darrick J. Wong
@ 2014-03-24 16:22 ` Darrick J. Wong
  2014-03-24 16:23 ` [PATCH 5/6] PI IO extension: advertise possible userspace flags Darrick J. Wong
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2014-03-24 16:22 UTC (permalink / raw)
  To: axboe, zab, martin.petersen, darrick.wong, JBottomley, jmoyer,
	bcrl, viro
  Cc: linux-fsdevel, linux-aio, linux-scsi, linux-mm

Since userspace can now pass PI buffers through to the block integrity
provider, provide a means for userspace to specify a flags argument
with the PI buffer.  The initial user for this will be sd_dif, which
will enable user programs to ask the kernel to fill in whichever
fields they don't want to provide.  This is intended, for example, to
satisfy programs that really only care to provide an app tag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 Documentation/block/data-integrity.txt |   11 ++++
 block/blk-integrity.c                  |    1 
 drivers/scsi/sd_dif.c                  |   76 ++++++++++++++++++++++++++----
 fs/aio.c                               |    3 +
 fs/bio-integrity.c                     |   80 ++++++++++++++++++++++++++++++++
 fs/direct-io.c                         |    1 
 include/linux/bio.h                    |    3 +
 include/linux/blkdev.h                 |    2 +
 include/uapi/linux/aio_abi.h           |    1 
 9 files changed, 162 insertions(+), 16 deletions(-)


diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index 1d1f070..b72a54f 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -292,7 +292,10 @@ will require extra work due to the application tag.
 
       The bio_integrity_prep_iter should contain the page offset and buffer
       length of the PI buffer, the number of pages, and the actual array of
-      pages, as returned by get_user_pages.
+      pages, as returned by get_user_pages.  The user_flags argument should
+      contain whatever flag values were passed in by userspace; the values
+      of the flags are specific to the block integrity provider, and are
+      passed to the mod_user_buf_fn handler.
 
 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
     METADATA
@@ -332,6 +335,12 @@ will require extra work due to the application tag.
       are available per hardware sector.  For DIF this is either 2 or
       0 depending on the value of the Control Mode Page ATO bit.
 
+      'mod_user_buf_fn' updates the appropriate integrity metadata for
+      a WRITE operation.  This function is called when userspace passes
+      in a PI buffer along with file data; the flags argument (which is
+      specific to the blk_integrity provider) arrange for pre-processing
+      of the user buffer prior to issuing the IO.
+
       See 6.2 for a description of get_tag_fn and set_tag_fn.
 
 ----------------------------------------------------------------------
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 7fbab84..1cb1eb2 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -421,6 +421,7 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
 		bi->set_tag_fn = template->set_tag_fn;
 		bi->get_tag_fn = template->get_tag_fn;
 		bi->tag_size = template->tag_size;
+		bi->mod_user_buf_fn = template->mod_user_buf_fn;
 	} else
 		bi->name = bi_unsupported_name;
 
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index a7a691d..74182c9 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -53,31 +53,58 @@ static __u16 sd_dif_ip_fn(void *data, unsigned int len)
  * Type 1 and Type 2 protection use the same format: 16 bit guard tag,
  * 16 bit app tag, 32 bit reference tag.
  */
-static void sd_dif_type1_generate(struct blk_integrity_exchg *bix, csum_fn *fn)
+#define GENERATE_GUARD	(1)
+#define GENERATE_REF	(2)
+#define GENERATE_APP	(4)
+#define GENERATE_ALL	(7)
+static int sd_dif_type1_generate(struct blk_integrity_exchg *bix, csum_fn *fn,
+				 int flags)
 {
 	void *buf = bix->data_buf;
 	struct sd_dif_tuple *sdt = bix->prot_buf;
 	sector_t sector = bix->sector;
 	unsigned int i;
 
+	if (flags & ~GENERATE_ALL)
+		return -EINVAL;
+	if (!flags)
+		return -ENOTTY;
+
 	for (i = 0 ; i < bix->data_size ; i += bix->sector_size, sdt++) {
-		sdt->guard_tag = fn(buf, bix->sector_size);
-		sdt->ref_tag = cpu_to_be32(sector & 0xffffffff);
-		sdt->app_tag = 0;
+		if (flags & GENERATE_GUARD)
+			sdt->guard_tag = fn(buf, bix->sector_size);
+		if (flags & GENERATE_REF)
+			sdt->ref_tag = cpu_to_be32(sector & 0xffffffff);
+		if (flags & GENERATE_APP)
+			sdt->app_tag = 0;
 
 		buf += bix->sector_size;
 		sector++;
 	}
+
+	return 0;
 }
 
 static void sd_dif_type1_generate_crc(struct blk_integrity_exchg *bix)
 {
-	sd_dif_type1_generate(bix, sd_dif_crc_fn);
+	sd_dif_type1_generate(bix, sd_dif_crc_fn, GENERATE_ALL);
 }
 
 static void sd_dif_type1_generate_ip(struct blk_integrity_exchg *bix)
 {
-	sd_dif_type1_generate(bix, sd_dif_ip_fn);
+	sd_dif_type1_generate(bix, sd_dif_ip_fn, GENERATE_ALL);
+}
+
+static int sd_dif_type1_mod_crc(struct blk_integrity_exchg *bix,
+				int flags)
+{
+	return sd_dif_type1_generate(bix, sd_dif_crc_fn, flags);
+}
+
+static int sd_dif_type1_mod_ip(struct blk_integrity_exchg *bix,
+			       int flags)
+{
+	return sd_dif_type1_generate(bix, sd_dif_ip_fn, flags);
 }
 
 static int sd_dif_type1_verify(struct blk_integrity_exchg *bix, csum_fn *fn)
@@ -163,6 +190,7 @@ static struct blk_integrity dif_type1_integrity_crc = {
 	.set_tag_fn		= sd_dif_type1_set_tag,
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
+	.mod_user_buf_fn	= sd_dif_type1_mod_crc,
 };
 
 static struct blk_integrity dif_type1_integrity_ip = {
@@ -173,6 +201,7 @@ static struct blk_integrity dif_type1_integrity_ip = {
 	.set_tag_fn		= sd_dif_type1_set_tag,
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
+	.mod_user_buf_fn	= sd_dif_type1_mod_ip,
 };
 
 
@@ -180,29 +209,50 @@ static struct blk_integrity dif_type1_integrity_ip = {
  * Type 3 protection has a 16-bit guard tag and 16 + 32 bits of opaque
  * tag space.
  */
-static void sd_dif_type3_generate(struct blk_integrity_exchg *bix, csum_fn *fn)
+static int sd_dif_type3_generate(struct blk_integrity_exchg *bix, csum_fn *fn,
+				 int flags)
 {
 	void *buf = bix->data_buf;
 	struct sd_dif_tuple *sdt = bix->prot_buf;
 	unsigned int i;
 
+	if (flags & (~GENERATE_ALL | GENERATE_REF))
+		return -EINVAL;
+	if (!flags)
+		return -ENOTTY;
+
 	for (i = 0 ; i < bix->data_size ; i += bix->sector_size, sdt++) {
-		sdt->guard_tag = fn(buf, bix->sector_size);
-		sdt->ref_tag = 0;
-		sdt->app_tag = 0;
+		if (flags & GENERATE_GUARD)
+			sdt->guard_tag = fn(buf, bix->sector_size);
+		if (flags & GENERATE_APP) {
+			sdt->ref_tag = 0;
+			sdt->app_tag = 0;
+		}
 
 		buf += bix->sector_size;
 	}
+
+	return 0;
 }
 
 static void sd_dif_type3_generate_crc(struct blk_integrity_exchg *bix)
 {
-	sd_dif_type3_generate(bix, sd_dif_crc_fn);
+	sd_dif_type3_generate(bix, sd_dif_crc_fn, GENERATE_ALL);
 }
 
 static void sd_dif_type3_generate_ip(struct blk_integrity_exchg *bix)
 {
-	sd_dif_type3_generate(bix, sd_dif_ip_fn);
+	sd_dif_type3_generate(bix, sd_dif_ip_fn, GENERATE_ALL);
+}
+
+static int sd_dif_type3_mod_crc(struct blk_integrity_exchg *bix, int flags)
+{
+	return sd_dif_type3_generate(bix, sd_dif_crc_fn, flags);
+}
+
+static int sd_dif_type3_mod_ip(struct blk_integrity_exchg *bix, int flags)
+{
+	return sd_dif_type3_generate(bix, sd_dif_ip_fn, flags);
 }
 
 static int sd_dif_type3_verify(struct blk_integrity_exchg *bix, csum_fn *fn)
@@ -283,6 +333,7 @@ static struct blk_integrity dif_type3_integrity_crc = {
 	.set_tag_fn		= sd_dif_type3_set_tag,
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
+	.mod_user_buf_fn	= sd_dif_type3_mod_crc,
 };
 
 static struct blk_integrity dif_type3_integrity_ip = {
@@ -293,6 +344,7 @@ static struct blk_integrity dif_type3_integrity_ip = {
 	.set_tag_fn		= sd_dif_type3_set_tag,
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
+	.mod_user_buf_fn	= sd_dif_type3_mod_ip,
 };
 
 /*
diff --git a/fs/aio.c b/fs/aio.c
index 3f932c3..2e80fbb 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1424,6 +1424,7 @@ static int setup_pi_ext(struct kiocb *req, int is_write)
 		return -ENOMEM;
 	}
 	req->ki_ioext->ke_pi_iter.pi_userpages = p;
+	req->ki_ioext->ke_pi_iter.pi_userflags = ext->ie_pi_flags;
 
 	retval = get_user_pages_fast((unsigned long)ext->ie_pi_buf,
 				     req->ki_ioext->ke_pi_iter.pi_nrpages,
@@ -1441,7 +1442,7 @@ static int setup_pi_ext(struct kiocb *req, int is_write)
 #endif
 
 static struct io_extension_type extensions[] = {
-	{IO_EXT_PI, IO_EXT_SIZE(ie_pi_ret), setup_pi_ext, destroy_pi_ext},
+	{IO_EXT_PI, IO_EXT_SIZE(ie_pi_flags), setup_pi_ext, destroy_pi_ext},
 	{IO_EXT_INVALID, 0, NULL, NULL},
 };
 
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 3df9aeb..381ee38 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -301,6 +301,82 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
 /**
+ * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
+ * @bio:	bio to generate/verify integrity metadata for
+ */
+int bio_integrity_update_user_buffer(struct bio *bio)
+{
+	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+	struct blk_integrity_exchg bix;
+	struct bio_vec bv;
+	struct bvec_iter iter;
+	sector_t sector;
+	unsigned int sectors, total, ret;
+	void *prot_buf;
+	unsigned int prot_offset, prot_len, bv_offset, bv_len;
+	struct bio_vec *iv;
+	struct bio_integrity_payload *bip = bio->bi_integrity;
+
+	if (!bi->mod_user_buf_fn)
+		return 0;
+
+	sector = bio->bi_iter.bi_sector;
+
+	total = ret = 0;
+	bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
+	bix.sector_size = bi->sector_size;
+
+	iv = bip->bip_vec;
+	prot_offset = iv->bv_offset;
+	prot_len = iv->bv_len;
+	prot_buf = kmap_atomic(iv->bv_page);
+
+	bio_for_each_segment(bv, bio, iter) {
+		void *kaddr = kmap_atomic(bv.bv_page);
+		bv_len = bv.bv_len;
+		bv_offset = bv.bv_offset;
+
+		while (bv_len > 0) {
+			if (prot_len < bi->tuple_size) {
+				kunmap_atomic(prot_buf);
+				iv++;
+				BUG_ON(iv >= bip->bip_vec + bip->bip_vcnt);
+				prot_offset = iv->bv_offset;
+				prot_len = iv->bv_len;
+				prot_buf = kmap_atomic(iv->bv_page);
+			}
+			bix.data_buf = kaddr + bv_offset;
+			bix.data_size = min(bv_len,
+				prot_len / bi->tuple_size * bix.sector_size);
+			bix.prot_buf = prot_buf + prot_offset;
+			bix.sector = sector;
+
+			ret = bi->mod_user_buf_fn(&bix, bip->bip_user_flags);
+			if (ret) {
+				if (ret == -ENOTTY)
+					ret = 0;
+				kunmap_atomic(kaddr);
+				kunmap_atomic(prot_buf);
+				return ret;
+			}
+
+			bv_offset += bix.data_size;
+			bv_len -= bix.data_size;
+			sectors = bix.data_size / bi->sector_size;
+			sector += sectors;
+			prot_offset += sectors * bi->tuple_size;
+			prot_len -= sectors * bi->tuple_size;
+			total += sectors * bi->tuple_size;
+			BUG_ON(total > bio->bi_integrity->bip_iter.bi_size);
+		}
+		kunmap_atomic(kaddr);
+	}
+	kunmap_atomic(prot_buf);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(bio_integrity_update_user_buffer);
+
+/**
  * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
  * @bio:	bio to generate/verify integrity metadata for
  * @operate:	operate number, 1 for generate, 0 for verify
@@ -425,6 +501,7 @@ int bio_integrity_prep_buffer(struct bio *bio, int rw,
 	bip->bip_buf = NULL;
 	bip->bip_iter.bi_size = len;
 	bip->bip_iter.bi_sector = bio->bi_iter.bi_sector;
+	bip->bip_user_flags = pi->pi_userflags;
 
 	/* Map it */
 	for (i = 0 ; i < nr_pages ; i++) {
@@ -458,7 +535,8 @@ int bio_integrity_prep_buffer(struct bio *bio, int rw,
 		bip->bip_end_io = bio->bi_end_io;
 		bio->bi_end_io = bio_integrity_endio;
 		ret = 0;
-	}
+	} else
+		ret = bio_integrity_update_user_buffer(bio);
 
 	return ret;
 }
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 3f591f8..6740638 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -225,7 +225,6 @@ static inline struct page *dio_get_page(struct dio *dio,
 	return dio->pages[sdio->head++];
 }
 
-
 /**
  * dio_complete() - called when all DIO BIO I/O has been completed
  * @offset: the byte offset in the file of the completed operation
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 4729ab1..5bd9618 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -304,7 +304,9 @@ struct bio_integrity_payload {
 	struct work_struct	bip_work;	/* I/O completion */
 
 	struct bio_vec		*bip_vec;
+	unsigned int		bip_user_flags;
 	struct bio_vec		bip_inline_vecs[0];/* embedded bvec array */
+	/* This must be last! */
 };
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
@@ -640,6 +642,7 @@ struct bio_integrity_prep_iter {
 	size_t pi_nrpages;		/* Number of PI data pages */
 	size_t pi_offset;		/* Offset into the page */
 	size_t pi_len;			/* Length of the buffer */
+	unsigned int pi_userflags;	/* Userspace flags */
 };
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4afa4f8..cf1ec22 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1426,12 +1426,14 @@ typedef void (integrity_gen_fn) (struct blk_integrity_exchg *);
 typedef int (integrity_vrfy_fn) (struct blk_integrity_exchg *);
 typedef void (integrity_set_tag_fn) (void *, void *, unsigned int);
 typedef void (integrity_get_tag_fn) (void *, void *, unsigned int);
+typedef int (integrity_mod_user_buf_fn) (struct blk_integrity_exchg *, int);
 
 struct blk_integrity {
 	integrity_gen_fn	*generate_fn;
 	integrity_vrfy_fn	*verify_fn;
 	integrity_set_tag_fn	*set_tag_fn;
 	integrity_get_tag_fn	*get_tag_fn;
+	integrity_mod_user_buf_fn	*mod_user_buf_fn;
 
 	unsigned short		flags;
 	unsigned short		tuple_size;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index d7b8c68..f6556a6 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -85,6 +85,7 @@ struct io_extension {
 	__u64 ie_pi_buf;
 	__u32 ie_pi_buflen;
 	__u32 ie_pi_ret;
+	__u32 ie_pi_flags;
 };
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/6] PI IO extension: advertise possible userspace flags
  2014-03-24 16:22 [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO Darrick J. Wong
                   ` (3 preceding siblings ...)
  2014-03-24 16:22 ` [PATCH 4/6] PI IO extension: allow user to ask kernel to fill in parts of the protection info Darrick J. Wong
@ 2014-03-24 16:23 ` Darrick J. Wong
  2014-03-24 16:23 ` [PATCH 6/6] blk-integrity: refactor various routines Darrick J. Wong
  2014-04-02 19:14 ` [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO Zach Brown
  6 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2014-03-24 16:23 UTC (permalink / raw)
  To: axboe, zab, martin.petersen, darrick.wong, JBottomley, jmoyer,
	bcrl, viro
  Cc: linux-fsdevel, linux-aio, linux-scsi, linux-mm

Expose possible userland flags to the new PI IO extension so that
userspace can discover what flags exist.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 Documentation/ABI/testing/sysfs-block  |   14 ++++++++++++++
 Documentation/block/data-integrity.txt |   22 +++++++++++++++++++++
 block/blk-integrity.c                  |   33 ++++++++++++++++++++++++++++++++
 drivers/scsi/sd_dif.c                  |   11 +++++++++++
 include/linux/blkdev.h                 |    7 +++++++
 5 files changed, 87 insertions(+)


diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index 279da08..989cb80 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -53,6 +53,20 @@ Description:
 		512 bytes of data.
 
 
+What:		/sys/block/<disk>/integrity/tuple_size
+Date:		March 2014
+Contact:	Darrick J. Wong <darrick.wong@oracle.com>
+Description:
+		Size in bytes of the integrity data buffer for each logical
+		block.
+
+What:		/sys/block/<disk>/integrity/write_user_flags
+Date:		March 2014
+Contact:	Darrick J. Wong <darrick.wong@oracle.com>
+Description:
+		Provides a list of flags that userspace can pass to the kernel
+		when supplying integrity data for a write IO.
+
 What:		/sys/block/<disk>/integrity/write_generate
 Date:		June 2008
 Contact:	Martin K. Petersen <martin.petersen@oracle.com>
diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index b72a54f..e33d4a7 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -341,7 +341,29 @@ will require extra work due to the application tag.
       specific to the blk_integrity provider) arrange for pre-processing
       of the user buffer prior to issuing the IO.
 
+      'user_write_flags' points to an array of struct blk_integrity_flag,
+      which maps mod_user_buf_fn flags to a description of what they do.
+
       See 6.2 for a description of get_tag_fn and set_tag_fn.
 
+5.5 PASSING INTEGRITY DATA FROM USERSPACE
+
+    The "IO extension" interface has been expanded to provide
+    userspace programs with the ability to provide PI data with a WRITE,
+    or to receive PI data with a READ.  The fields ie_pi_buf,
+    ie_pi_buflen, and ie_pi_flags should contain a pointer to the PI
+    buffer, the length of the PI buffer, and any flags that should be
+    passed to the PI provider.
+
+    This buffer must contain PI tuples.  Tuples must NOT split a page
+    boundary.  Valid flag values can be found in
+    /sys/block/*/integrity/user_write_flags.  The tuple size can be found
+    in /sys/block/*/integrity/tuple_size.
+
+    In general, the flags allow the user program to ask the in-kernel
+    integrity provider to fill in some parts of the tuples.  For example,
+    the T10 DIF provider can fill in the reference tag (sector number) so
+    that userspace can choose not to care about the reference tag.
+
 ----------------------------------------------------------------------
 2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 1cb1eb2..557d28e 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -307,6 +307,26 @@ static ssize_t integrity_write_show(struct blk_integrity *bi, char *page)
 	return sprintf(page, "%d\n", (bi->flags & INTEGRITY_FLAG_WRITE) != 0);
 }
 
+static ssize_t integrity_write_flags_show(struct blk_integrity *bi, char *page)
+{
+	struct blk_integrity_flag *flag = bi->user_write_flags;
+	char *p = page;
+	ssize_t ret = 0;
+
+	while (flag->value) {
+		ret += snprintf(p, PAGE_SIZE - ret, "0x%x: %s\n",
+				flag->value, flag->descr);
+		p = page + ret;
+		flag++;
+	}
+	return ret;
+}
+
+static ssize_t integrity_tuple_size_show(struct blk_integrity *bi, char *page)
+{
+	return sprintf(page, "%d\n", bi->tuple_size);
+}
+
 static struct integrity_sysfs_entry integrity_format_entry = {
 	.attr = { .name = "format", .mode = S_IRUGO },
 	.show = integrity_format_show,
@@ -329,11 +349,23 @@ static struct integrity_sysfs_entry integrity_write_entry = {
 	.store = integrity_write_store,
 };
 
+static struct integrity_sysfs_entry integrity_write_flags_entry = {
+	.attr = { .name = "write_user_flags", .mode = S_IRUGO },
+	.show = integrity_write_flags_show,
+};
+
+static struct integrity_sysfs_entry integrity_tuple_size_entry = {
+	.attr = { .name = "tuple_size", .mode = S_IRUGO },
+	.show = integrity_tuple_size_show,
+};
+
 static struct attribute *integrity_attrs[] = {
 	&integrity_format_entry.attr,
 	&integrity_tag_size_entry.attr,
 	&integrity_read_entry.attr,
 	&integrity_write_entry.attr,
+	&integrity_write_flags_entry.attr,
+	&integrity_tuple_size_entry.attr,
 	NULL,
 };
 
@@ -422,6 +454,7 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
 		bi->get_tag_fn = template->get_tag_fn;
 		bi->tag_size = template->tag_size;
 		bi->mod_user_buf_fn = template->mod_user_buf_fn;
+		bi->user_write_flags = template->user_write_flags;
 	} else
 		bi->name = bi_unsupported_name;
 
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index 74182c9..bea648b 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -182,6 +182,13 @@ static void sd_dif_type1_get_tag(void *prot, void *tag_buf, unsigned int sectors
 	}
 }
 
+static struct blk_integrity_flag dif_flags[] = {
+	{GENERATE_GUARD,	"generate guard tag"},
+	{GENERATE_REF,		"generate ref tag"},
+	{GENERATE_APP,		"generate app tag"},
+	{0, NULL},
+};
+
 static struct blk_integrity dif_type1_integrity_crc = {
 	.name			= "T10-DIF-TYPE1-CRC",
 	.generate_fn		= sd_dif_type1_generate_crc,
@@ -191,6 +198,7 @@ static struct blk_integrity dif_type1_integrity_crc = {
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
 	.mod_user_buf_fn	= sd_dif_type1_mod_crc,
+	.user_write_flags	= dif_flags,
 };
 
 static struct blk_integrity dif_type1_integrity_ip = {
@@ -202,6 +210,7 @@ static struct blk_integrity dif_type1_integrity_ip = {
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
 	.mod_user_buf_fn	= sd_dif_type1_mod_ip,
+	.user_write_flags	= dif_flags,
 };
 
 
@@ -334,6 +343,7 @@ static struct blk_integrity dif_type3_integrity_crc = {
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
 	.mod_user_buf_fn	= sd_dif_type3_mod_crc,
+	.user_write_flags	= dif_flags,
 };
 
 static struct blk_integrity dif_type3_integrity_ip = {
@@ -345,6 +355,7 @@ static struct blk_integrity dif_type3_integrity_ip = {
 	.tuple_size		= sizeof(struct sd_dif_tuple),
 	.tag_size		= 0,
 	.mod_user_buf_fn	= sd_dif_type3_mod_ip,
+	.user_write_flags	= dif_flags,
 };
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cf1ec22..e8e6401 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1428,6 +1428,11 @@ typedef void (integrity_set_tag_fn) (void *, void *, unsigned int);
 typedef void (integrity_get_tag_fn) (void *, void *, unsigned int);
 typedef int (integrity_mod_user_buf_fn) (struct blk_integrity_exchg *, int);
 
+struct blk_integrity_flag {
+	unsigned int value;
+	const char *descr;
+};
+
 struct blk_integrity {
 	integrity_gen_fn	*generate_fn;
 	integrity_vrfy_fn	*verify_fn;
@@ -1443,6 +1448,8 @@ struct blk_integrity {
 	const char		*name;
 
 	struct kobject		kobj;
+
+	struct blk_integrity_flag	*user_write_flags;
 };
 
 extern bool blk_integrity_is_initialized(struct gendisk *);

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/6] blk-integrity: refactor various routines
  2014-03-24 16:22 [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO Darrick J. Wong
                   ` (4 preceding siblings ...)
  2014-03-24 16:23 ` [PATCH 5/6] PI IO extension: advertise possible userspace flags Darrick J. Wong
@ 2014-03-24 16:23 ` Darrick J. Wong
  2014-04-02 19:14 ` [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO Zach Brown
  6 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2014-03-24 16:23 UTC (permalink / raw)
  To: axboe, zab, martin.petersen, darrick.wong, JBottomley, jmoyer,
	bcrl, viro
  Cc: linux-fsdevel, linux-aio, linux-scsi, linux-mm

Refactor blk-integrity.c to avoid duplicating similar functions, and
remove all users of pi_buf, since it's really only there to handle the
(common) case where the kernel auto-generates all the PI data.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/bio-integrity.c  |  120 +++++++++++++++++++++------------------------------
 include/linux/bio.h |    2 -
 2 files changed, 49 insertions(+), 73 deletions(-)


diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 381ee38..3ff1572 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -97,8 +97,7 @@ void bio_integrity_free(struct bio *bio)
 	struct bio_integrity_payload *bip = bio->bi_integrity;
 	struct bio_set *bs = bio->bi_pool;
 
-	if (bip->bip_owns_buf)
-		kfree(bip->bip_buf);
+	kfree(bip->bip_buf);
 
 	if (bs) {
 		if (bip->bip_slab != BIO_POOL_NONE)
@@ -239,9 +238,11 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
 {
 	struct bio_integrity_payload *bip = bio->bi_integrity;
 	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-	unsigned int nr_sectors;
-
-	BUG_ON(bip->bip_buf == NULL);
+	unsigned int nr_sectors, tag_offset, sectors;
+	void *prot_buf;
+	unsigned int prot_offset, prot_len;
+	struct bio_vec *iv;
+	void (*tag_fn)(void *buf, void *tag_buf, unsigned int);
 
 	if (bi->tag_size == 0)
 		return -1;
@@ -255,10 +256,30 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
 		return -1;
 	}
 
-	if (set)
-		bi->set_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
-	else
-		bi->get_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
+	iv = bip->bip_vec;
+	prot_offset = iv->bv_offset;
+	prot_len = iv->bv_len;
+	prot_buf = kmap_atomic(iv->bv_page);
+	tag_fn = set ? bi->set_tag_fn : bi->get_tag_fn;
+	tag_offset = 0;
+
+	while (nr_sectors) {
+		if (prot_len < bi->tuple_size) {
+			kunmap_atomic(prot_buf);
+			iv++;
+			BUG_ON(iv >= bip->bip_vec + bip->bip_vcnt);
+			prot_offset = iv->bv_offset;
+			prot_len = iv->bv_len;
+			prot_buf = kmap_atomic(iv->bv_page);
+		}
+		sectors = min(prot_len / bi->tuple_size, nr_sectors);
+		tag_fn(prot_buf + prot_offset, tag_buf + tag_offset, sectors);
+		nr_sectors -= sectors;
+		tag_offset += sectors * bi->tuple_size;
+		prot_offset += sectors * bi->tuple_size;
+		prot_len -= sectors * bi->tuple_size;
+	}
+	kunmap_atomic(prot_buf);
 
 	return 0;
 }
@@ -300,28 +321,24 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 }
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
-/**
- * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
- * @bio:	bio to generate/verify integrity metadata for
- */
-int bio_integrity_update_user_buffer(struct bio *bio)
+typedef int (walk_buf_fn)(struct blk_integrity_exchg *bi, int flags);
+
+static int bio_integrity_walk_bufs(struct bio *bio, sector_t sector,
+				   walk_buf_fn *mod_fn)
 {
 	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
 	struct blk_integrity_exchg bix;
 	struct bio_vec bv;
 	struct bvec_iter iter;
-	sector_t sector;
 	unsigned int sectors, total, ret;
 	void *prot_buf;
 	unsigned int prot_offset, prot_len, bv_offset, bv_len;
 	struct bio_vec *iv;
 	struct bio_integrity_payload *bip = bio->bi_integrity;
 
-	if (!bi->mod_user_buf_fn)
+	if (!mod_fn)
 		return 0;
 
-	sector = bio->bi_iter.bi_sector;
-
 	total = ret = 0;
 	bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
 	bix.sector_size = bi->sector_size;
@@ -351,7 +368,7 @@ int bio_integrity_update_user_buffer(struct bio *bio)
 			bix.prot_buf = prot_buf + prot_offset;
 			bix.sector = sector;
 
-			ret = bi->mod_user_buf_fn(&bix, bip->bip_user_flags);
+			ret = mod_fn(&bix, bip->bip_user_flags);
 			if (ret) {
 				if (ret == -ENOTTY)
 					ret = 0;
@@ -374,59 +391,19 @@ int bio_integrity_update_user_buffer(struct bio *bio)
 	kunmap_atomic(prot_buf);
 	return ret;
 }
-EXPORT_SYMBOL_GPL(bio_integrity_update_user_buffer);
 
 /**
- * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
  * @bio:	bio to generate/verify integrity metadata for
- * @operate:	operate number, 1 for generate, 0 for verify
+ * @sector:	stratin
  */
-static int bio_integrity_generate_verify(struct bio *bio, int operate)
+int bio_integrity_update_user_buffer(struct bio *bio)
 {
 	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-	struct blk_integrity_exchg bix;
-	struct bio_vec bv;
-	struct bvec_iter iter;
-	sector_t sector;
-	unsigned int sectors, total, ret;
-	void *prot_buf = bio->bi_integrity->bip_buf;
-
-	if (operate)
-		sector = bio->bi_iter.bi_sector;
-	else
-		sector = bio->bi_integrity->bip_iter.bi_sector;
-
-	total = ret = 0;
-	bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
-	bix.sector_size = bi->sector_size;
-
-	bio_for_each_segment(bv, bio, iter) {
-		void *kaddr = kmap_atomic(bv.bv_page);
-		bix.data_buf = kaddr + bv.bv_offset;
-		bix.data_size = bv.bv_len;
-		bix.prot_buf = prot_buf;
-		bix.sector = sector;
-
-		if (operate) {
-			bi->generate_fn(&bix);
-		} else {
-			ret = bi->verify_fn(&bix);
-			if (ret) {
-				kunmap_atomic(kaddr);
-				return ret;
-			}
-		}
-
-		sectors = bv.bv_len / bi->sector_size;
-		sector += sectors;
-		prot_buf += sectors * bi->tuple_size;
-		total += sectors * bi->tuple_size;
-		BUG_ON(total > bio->bi_integrity->bip_iter.bi_size);
-
-		kunmap_atomic(kaddr);
-	}
-	return ret;
+	return bio_integrity_walk_bufs(bio, bio->bi_iter.bi_sector,
+				       bi->mod_user_buf_fn);
 }
+EXPORT_SYMBOL_GPL(bio_integrity_update_user_buffer);
 
 /**
  * bio_integrity_generate - Generate integrity metadata for a bio
@@ -439,7 +416,9 @@ static int bio_integrity_generate_verify(struct bio *bio, int operate)
  */
 static void bio_integrity_generate(struct bio *bio)
 {
-	bio_integrity_generate_verify(bio, 1);
+	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+	bio_integrity_walk_bufs(bio, bio->bi_iter.bi_sector,
+				(walk_buf_fn *)bi->generate_fn);
 }
 
 static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
@@ -479,7 +458,6 @@ int bio_integrity_prep_buffer(struct bio *bio, int rw,
 
 	sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio));
 
-	/* Allocate kernel buffer for protection data */
 	len = sectors * blk_integrity_tuple_size(bi);
 	end = (pi->pi_offset + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	start = pi->pi_offset >> PAGE_SHIFT;
@@ -497,8 +475,6 @@ int bio_integrity_prep_buffer(struct bio *bio, int rw,
 		return -EIO;
 	}
 
-	bip->bip_owns_buf = 0;
-	bip->bip_buf = NULL;
 	bip->bip_iter.bi_size = len;
 	bip->bip_iter.bi_sector = bio->bi_iter.bi_sector;
 	bip->bip_user_flags = pi->pi_userflags;
@@ -591,7 +567,6 @@ int bio_integrity_prep(struct bio *bio)
 		return -EIO;
 	}
 
-	bip->bip_owns_buf = 1;
 	bip->bip_buf = buf;
 	bip->bip_iter.bi_size = len;
 	bip->bip_iter.bi_sector = bio->bi_iter.bi_sector;
@@ -646,7 +621,10 @@ EXPORT_SYMBOL(bio_integrity_prep);
  */
 static int bio_integrity_verify(struct bio *bio)
 {
-	return bio_integrity_generate_verify(bio, 0);
+	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+	return bio_integrity_walk_bufs(bio,
+				       bio->bi_integrity->bip_iter.bi_sector,
+				       (walk_buf_fn *)bi->verify_fn);
 }
 
 /**
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5bd9618..c2ad1a8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -292,14 +292,12 @@ struct bio_integrity_payload {
 
 	struct bvec_iter	bip_iter;
 
-	/* kill - should just use bip_vec */
 	void			*bip_buf;	/* generated integrity data */
 
 	bio_end_io_t		*bip_end_io;	/* saved I/O completion fn */
 
 	unsigned short		bip_slab;	/* slab the bip came from */
 	unsigned short		bip_vcnt;	/* # of integrity bio_vecs */
-	unsigned		bip_owns_buf:1;	/* should free bip_buf */
 
 	struct work_struct	bip_work;	/* I/O completion */
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO
  2014-03-24 16:22 [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO Darrick J. Wong
                   ` (5 preceding siblings ...)
  2014-03-24 16:23 ` [PATCH 6/6] blk-integrity: refactor various routines Darrick J. Wong
@ 2014-04-02 19:14 ` Zach Brown
  2014-04-02 20:05   ` Zach Brown
  6 siblings, 1 reply; 21+ messages in thread
From: Zach Brown @ 2014-04-02 19:14 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, martin.petersen, JBottomley, jmoyer, bcrl, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

On Mon, Mar 24, 2014 at 09:22:31AM -0700, Darrick J. Wong wrote:
> This RFC provides a rough implementation of a mechanism to allow
> userspace to attach protection information (e.g. T10 DIF) data to a
> disk write and to receive the information alongside a disk read.

I have some comments for you! :)  Mostly about the interface up in aio.
I don't have all that much to say about the bio/pi bits.

> Patch #2 implements a generic IO extension interface so that we can
> receive a struct io_extension from userspace containing the structure
> size, a flag telling us which extensions we'd like to use (ie_has),
> and (eventually) extension data.  There's a small framework for
> mapping ie_has bits to actual extensions.

I still really don't think that we should be thinking of these as
generic extensions.  We're talking about arguments to syscalls.  's a
small number of them with strong semantics because they're a part of the
syscall ABI.  I don't think we should implement them by iterating over
per-field ops structs.

Anyway, more in reply to the patches.

- z

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/6] fs/bio-integrity: remove duplicate code
  2014-03-24 16:22 ` [PATCH 1/6] fs/bio-integrity: remove duplicate code Darrick J. Wong
@ 2014-04-02 19:17   ` Zach Brown
  2014-04-02 20:35     ` Darrick J. Wong
  0 siblings, 1 reply; 21+ messages in thread
From: Zach Brown @ 2014-04-02 19:17 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, martin.petersen, JBottomley, jmoyer, bcrl, viro,
	linux-fsdevel, linux-aio, Gu Zheng, linux-scsi, linux-mm

> +static int bio_integrity_generate_verify(struct bio *bio, int operate)
>  {

> +	if (operate)
> +		sector = bio->bi_iter.bi_sector;
> +	else
> +		sector = bio->bi_integrity->bip_iter.bi_sector;

> +		if (operate) {
> +			bi->generate_fn(&bix);
> +		} else {
> +			ret = bi->verify_fn(&bix);
> +			if (ret) {
> +				kunmap_atomic(kaddr);
> +				return ret;
> +			}
> +		}

I was glad to see this replaced with explicit sector and func arguments
in later refactoring in the 6/ patch.

But I don't think the function poiner casts in that 6/ patch are wise
(Or even safe all the time, given crazy function pointer trampolines?
Is that still a thing?).  I'd have made a single walk_fn type that
returns and have the non-returning iterators just return 0.

- z

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] io: define an interface for IO extensions
       [not found]   ` <20140324162244.10848.46322.stgit-PTl6brltDGh4DFYR7WNSRA@public.gmane.org>
@ 2014-04-02 19:22     ` Jeff Moyer
  2014-04-02 22:08       ` Darrick J. Wong
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Moyer @ 2014-04-02 19:22 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, zab-H+wXaHxf7aLQT0dZR+AlfA,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA,
	JBottomley-bzQdu9zFT3WakBO8gow8eQ, bcrl-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-man-u79uwXL29TY76Z2rM5mHXA

"Darrick J. Wong" <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> writes:

> Define a generic interface to allow userspace to attach metadata to an
> IO operation.  This interface will be used initially to implement
> protection information (PI) pass through, though it ought to be usable
> by anyone else desiring to extend the IO interface.  It should not be
> difficult to modify the non-AIO calls to use this mechanism.

My main issue with this patch is determining what exactly gets returned
to userspace when there is an issue in the teardown_extensions path.
It looks like you'll get the first error propagated from
io_teardown_extensions, others are ignored.  Then, in aio_complete, if
there was no error with the I/O, then you'll get the teardown error
reported in event->res, otherwise you'll get it in event->res2.  So,
what are the valid errors returned by the teardown routine for
extensions?  How is the userspace app supposed to determine where the
error came from, the I/O or a failure in the extension teardown?

I think it may make sense to only use res2 for reporting io extension
teardown failures.  Any new code that will use extensions can certainly
be written to check both res and res2, and this method would prevent the
ambiguity I mentioned.

Finally, I know this is an RFC, but please add some man-page changes to
your patch set, and CC linux-man.  Michael Kerrisk typically has
valuable advice on new APIs.

Cheers,
Jeff

>
> Signed-off-by: Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/aio.c                     |  180 +++++++++++++++++++++++++++++++++++++++++-
>  include/linux/aio.h          |    7 ++
>  include/uapi/linux/aio_abi.h |   15 +++-
>  3 files changed, 197 insertions(+), 5 deletions(-)
>
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 062a5f6..0c40bdc 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -158,6 +158,11 @@ static struct vfsmount *aio_mnt;
>  static const struct file_operations aio_ring_fops;
>  static const struct address_space_operations aio_ctx_aops;
>  
> +static int io_teardown_extensions(struct kiocb *req);
> +static int io_setup_extensions(struct kiocb *req, int is_write,
> +			       struct io_extension __user *ioext);
> +static int iocb_setup_extensions(struct iocb *iocb, struct kiocb *req);
> +
>  static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
>  {
>  	struct qstr this = QSTR_INIT("[aio]", 5);
> @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
>  	struct io_event	*ev_page, *event;
>  	unsigned long	flags;
>  	unsigned tail, pos;
> +	int ret;
> +
> +	ret = io_teardown_extensions(iocb);
> +	if (ret) {
> +		if (!res)
> +			res = ret;
> +		else if (!res2)
> +			res2 = ret;
> +		else
> +			pr_err("error %d tearing down aio extensions\n", ret);
> +	}
>  
>  	/*
>  	 * Special case handling for sync iocbs:
> @@ -1350,15 +1366,167 @@ rw_common:
>  	return 0;
>  }
>  
> +/* IO extension code */
> +#define REQUIRED_STRUCTURE_SIZE(type, member)	\
> +	(offsetof(type, member) + sizeof(((type *)NULL)->member))
> +#define IO_EXT_SIZE(member) \
> +	REQUIRED_STRUCTURE_SIZE(struct io_extension, member)
> +
> +struct io_extension_type {
> +	unsigned int type;
> +	unsigned int extension_struct_size;
> +	int (*setup_fn)(struct kiocb *, int is_write);
> +	int (*destroy_fn)(struct kiocb *);
> +};
> +
> +static struct io_extension_type extensions[] = {
> +	{IO_EXT_INVALID, 0, NULL, NULL},
> +};
> +
> +static int is_write_iocb(struct iocb *iocb)
> +{
> +	switch (iocb->aio_lio_opcode) {
> +	case IOCB_CMD_PWRITE:
> +	case IOCB_CMD_PWRITEV:
> +		return 1;
> +	default:
> +		return 0;
> +	}
> +}
> +
> +static int io_teardown_extensions(struct kiocb *req)
> +{
> +	struct io_extension_type *iet;
> +	int ret, ret2;
> +
> +	if (req->ki_ioext == NULL)
> +		return 0;
> +
> +	/* Shut down all the extensions */
> +	ret = 0;
> +	for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
> +		if (!(req->ki_ioext->ke_kern.ie_has & iet->type))
> +			continue;
> +		ret2 = iet->destroy_fn(req);
> +		if (ret2 && !ret)
> +			ret = ret2;
> +	}
> +
> +	/* Copy out return values */
> +	if (unlikely(copy_to_user(req->ki_ioext->ke_user,
> +				  &req->ki_ioext->ke_kern,
> +				  sizeof(struct io_extension)))) {
> +		if (!ret)
> +			ret = -EFAULT;
> +	}
> +
> +	kfree(req->ki_ioext);
> +	req->ki_ioext = NULL;
> +	return ret;
> +}
> +
> +static int io_check_bufsize(__u64 has, __u64 size)
> +{
> +	struct io_extension_type *iet;
> +	__u64 all_flags = 0;
> +
> +	for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
> +		all_flags |= iet->type;
> +		if (!(has & iet->type))
> +			continue;
> +		if (iet->extension_struct_size > size)
> +			return -EINVAL;
> +	}
> +
> +	if (has & ~all_flags)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int io_setup_extensions(struct kiocb *req, int is_write,
> +			       struct io_extension __user *ioext)
> +{
> +	struct io_extension_type *iet;
> +	__u64 sz, has;
> +	int ret;
> +
> +	/* Check size of buffer */
> +	if (unlikely(copy_from_user(&sz, &ioext->ie_size, sizeof(sz))))
> +		return -EFAULT;
> +	if (sz > PAGE_SIZE ||
> +	    sz > sizeof(struct io_extension) ||
> +	    sz < IO_EXT_SIZE(ie_has))
> +		return -EINVAL;
> +
> +	/* Check that the buffer's big enough */
> +	if (unlikely(copy_from_user(&has, &ioext->ie_has, sizeof(has))))
> +		return -EFAULT;
> +	ret = io_check_bufsize(has, sz);
> +	if (ret)
> +		return ret;
> +
> +	/* Copy from userland */
> +	req->ki_ioext = kzalloc(sizeof(struct kio_extension), GFP_NOIO);
> +	if (!req->ki_ioext)
> +		return -ENOMEM;
> +
> +	req->ki_ioext->ke_user = ioext;
> +	if (unlikely(copy_from_user(&req->ki_ioext->ke_kern, ioext, sz))) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	/* Try to initialize all the extensions */
> +	has = 0;
> +	for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
> +		if (!(req->ki_ioext->ke_kern.ie_has & iet->type))
> +			continue;
> +		ret = iet->setup_fn(req, is_write);
> +		if (ret) {
> +			req->ki_ioext->ke_kern.ie_has = has;
> +			goto out_destroy;
> +		}
> +		has |= iet->type;
> +	}
> +
> +	return 0;
> +out_destroy:
> +	io_teardown_extensions(req);
> +out:
> +	kfree(req->ki_ioext);
> +	req->ki_ioext = NULL;
> +	return ret;
> +}
> +
> +static int iocb_setup_extensions(struct iocb *iocb, struct kiocb *req)
> +{
> +	struct io_extension __user *user_ext;
> +
> +	if (!(iocb->aio_flags & IOCB_FLAG_EXTENSIONS))
> +		return 0;
> +
> +	user_ext = (struct io_extension __user *)iocb->aio_extension_ptr;
> +	return io_setup_extensions(req, is_write_iocb(iocb), user_ext);
> +}
> +
>  static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>  			 struct iocb *iocb, bool compat)
>  {
>  	struct kiocb *req;
>  	ssize_t ret;
>  
> +	/* check for flags we don't know about */
> +	if (iocb->aio_flags & ~IOCB_FLAG_ALL) {
> +		pr_debug("EINVAL: invalid flags\n");
> +		return -EINVAL;
> +	}
> +
>  	/* enforce forwards compatibility on users */
> -	if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
> -		pr_debug("EINVAL: reserve field set\n");
> +	if (unlikely(iocb->aio_reserved1 ||
> +	    (!(iocb->aio_flags & IOCB_FLAG_EXTENSIONS) &&
> +	     iocb->aio_extension_ptr))) {
> +		pr_debug("EINVAL: reserved field set\n");
>  		return -EINVAL;
>  	}
>  
> @@ -1408,13 +1576,19 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>  	req->ki_pos = iocb->aio_offset;
>  	req->ki_nbytes = iocb->aio_nbytes;
>  
> +	ret = iocb_setup_extensions(iocb, req);
> +	if (unlikely(ret))
> +		goto out_del_ext;
> +
>  	ret = aio_run_iocb(req, iocb->aio_lio_opcode,
>  			   (char __user *)(unsigned long)iocb->aio_buf,
>  			   compat);
>  	if (ret)
> -		goto out_put_req;
> +		goto out_del_ext;
>  
>  	return 0;
> +out_del_ext:
> +	io_teardown_extensions(req);
>  out_put_req:
>  	put_reqs_available(ctx, 1);
>  	percpu_ref_put(&ctx->reqs);
> diff --git a/include/linux/aio.h b/include/linux/aio.h
> index d9c92da..60f4364 100644
> --- a/include/linux/aio.h
> +++ b/include/linux/aio.h
> @@ -29,6 +29,10 @@ struct kiocb;
>  
>  typedef int (kiocb_cancel_fn)(struct kiocb *);
>  
> +struct kio_extension {
> +	struct io_extension __user *ke_user;
> +	struct io_extension ke_kern;
> +};
>  struct kiocb {
>  	struct file		*ki_filp;
>  	struct kioctx		*ki_ctx;	/* NULL for sync ops */
> @@ -52,6 +56,9 @@ struct kiocb {
>  	 * this is the underlying eventfd context to deliver events to.
>  	 */
>  	struct eventfd_ctx	*ki_eventfd;
> +
> +	/* Kernel copy of extension descriptors */
> +	struct kio_extension	*ki_ioext;
>  };
>  
>  static inline bool is_sync_kiocb(struct kiocb *kiocb)
> diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
> index bb2554f..07ffd1f 100644
> --- a/include/uapi/linux/aio_abi.h
> +++ b/include/uapi/linux/aio_abi.h
> @@ -53,6 +53,8 @@ enum {
>   *                   is valid.
>   */
>  #define IOCB_FLAG_RESFD		(1 << 0)
> +#define IOCB_FLAG_EXTENSIONS	(1 << 1)
> +#define IOCB_FLAG_ALL		(IOCB_FLAG_RESFD | IOCB_FLAG_EXTENSIONS)
>  
>  /* read() from /dev/aio returns these structures. */
>  struct io_event {
> @@ -70,6 +72,15 @@ struct io_event {
>  #error edit for your odd byteorder.
>  #endif
>  
> +/* IO extension types */
> +#define IO_EXT_INVALID	(0)
> +
> +/* IO extension descriptor */
> +struct io_extension {
> +	__u64 ie_size;
> +	__u64 ie_has;
> +};
> +
>  /*
>   * we always use a 64bit off_t when communicating
>   * with userland.  its up to libraries to do the
> @@ -91,8 +102,8 @@ struct iocb {
>  	__u64	aio_nbytes;
>  	__s64	aio_offset;
>  
> -	/* extra parameters */
> -	__u64	aio_reserved2;	/* TODO: use this for a (struct sigevent *) */
> +	/* aio extensions, only present if IOCB_FLAG_EXTENSIONS */
> +	__u64	aio_extension_ptr;
>  
>  	/* flags for the "struct iocb" */
>  	__u32	aio_flags;
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] io: define an interface for IO extensions
  2014-03-24 16:22 ` [PATCH 2/6] io: define an interface for IO extensions Darrick J. Wong
       [not found]   ` <20140324162244.10848.46322.stgit-PTl6brltDGh4DFYR7WNSRA@public.gmane.org>
@ 2014-04-02 19:49   ` Zach Brown
  2014-04-02 22:28     ` Darrick J. Wong
  1 sibling, 1 reply; 21+ messages in thread
From: Zach Brown @ 2014-04-02 19:49 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, martin.petersen, JBottomley, jmoyer, bcrl, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

> @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
>  	struct io_event	*ev_page, *event;
>  	unsigned long	flags;
>  	unsigned tail, pos;
> +	int ret;
> +
> +	ret = io_teardown_extensions(iocb);
> +	if (ret) {
> +		if (!res)
> +			res = ret;
> +		else if (!res2)
> +			res2 = ret;
> +		else
> +			pr_err("error %d tearing down aio extensions\n", ret);
> +	}

This ends up trying to copy the kernel's io_extension copy back to
userspace from interrupts, which obviously won't fly.

And to what end?  So that maybe someone can later add an 'extension'
that can fill in some field that's then copied to userspace?  But by
copying the entire argument struct back?

Let's not get ahead of ourselves.  If they're going to try and give
userspace some feedback after IO completion they're going to have to try
a lot harder because they don't have acces to the submitting task
context anymore.  They'd have to pin some reference to a feedback
mechanism in the in-flight io.  I think we'd want that explicit in the
iocb, not hiding off on the other side of this extension interface.

I'd just remove this generic teardown callback path entirely.  If
there's PI state hanging off the iocb tear it down during iocb teardown.

> +struct io_extension_type {
> +	unsigned int type;
> +	unsigned int extension_struct_size;
> +	int (*setup_fn)(struct kiocb *, int is_write);
> +	int (*destroy_fn)(struct kiocb *);
> +};

I'd also get rid of all of this.  More below.

> +static int io_setup_extensions(struct kiocb *req, int is_write,
> +			       struct io_extension __user *ioext)
> +{
> +	struct io_extension_type *iet;
> +	__u64 sz, has;
> +	int ret;
> +
> +	/* Check size of buffer */
> +	if (unlikely(copy_from_user(&sz, &ioext->ie_size, sizeof(sz))))
> +		return -EFAULT;
> +	if (sz > PAGE_SIZE ||
> +	    sz > sizeof(struct io_extension) ||
> +	    sz < IO_EXT_SIZE(ie_has))
> +		return -EINVAL;
> +
> +	/* Check that the buffer's big enough */
> +	if (unlikely(copy_from_user(&has, &ioext->ie_has, sizeof(has))))
> +		return -EFAULT;
> +	ret = io_check_bufsize(has, sz);
> +	if (ret)
> +		return ret;
> +
> +	/* Copy from userland */
> +	req->ki_ioext = kzalloc(sizeof(struct kio_extension), GFP_NOIO);
> +	if (!req->ki_ioext)
> +		return -ENOMEM;
> +
> +	req->ki_ioext->ke_user = ioext;
> +	if (unlikely(copy_from_user(&req->ki_ioext->ke_kern, ioext, sz))) {
> +		ret = -EFAULT;
> +		goto out;
> +	}

(Isn't there some allocate-and-copy-from-userspace helper now? But..)

I don't like the rudundancy of the implicit size requirement by a
field's flag being set being duplicated by the explicit size argument.
What does that give us, exactly?

Our notion of the total size only seems to only matter if we're copying
the entire struct from userspace and I'm don't think we need to do that.

For each argument, we're translating it into some kernel equivalent,
right?  Fields in the iocb  As each of these are initialized I'd just
test the presence bits and __get_user() the userspace arguemnts
directly, or copy_from_user() something slightly more complicated on to
the stack.

That gets rid of us having to care about the size at all.  It stops us
from allocating a kernel copy and pinning it for the duration of the IO.
We'd just be sampling the present userspace arguments as we initialie
the iocb during submission.

> +	/* Try to initialize all the extensions */
> +	has = 0;
> +	for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
> +		if (!(req->ki_ioext->ke_kern.ie_has & iet->type))
> +			continue;
> +		ret = iet->setup_fn(req, is_write);
> +		if (ret) {
> +			req->ki_ioext->ke_kern.ie_has = has;
> +			goto out_destroy;
> +		}
> +		has |= iet->type;
> +	}

So instead of doing all this we'd test explicit bits and act
accordingly.  If they're trivial translations between userspace fields
and iocb fields we could just do it inline in this helper that'd be more
like iocb_parse_more_args(iocb, struct __user *ptr).  For more
complicated stuff, like the PI page pinning, it could call out to PI.

> +	user_ext = (struct io_extension __user *)iocb->aio_extension_ptr;

Need a __force there?  Has this been run through sparse?

- z

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] aio/dio: enable PI passthrough
  2014-03-24 16:22 ` [PATCH 3/6] aio/dio: enable PI passthrough Darrick J. Wong
@ 2014-04-02 20:01   ` Zach Brown
  2014-04-02 20:44     ` Darrick J. Wong
  0 siblings, 1 reply; 21+ messages in thread
From: Zach Brown @ 2014-04-02 20:01 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, martin.petersen, JBottomley, jmoyer, bcrl, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

> +static int setup_pi_ext(struct kiocb *req, int is_write)
> +{
> +	struct file *file = req->ki_filp;
> +	struct io_extension *ext = &req->ki_ioext->ke_kern;
> +	void *p;
> +	unsigned long start, end;
> +	int retval;
> +
> +	if (!(file->f_flags & O_DIRECT)) {
> +		pr_debug("EINVAL: can't use PI without O_DIRECT.\n");
> +		return -EINVAL;
> +	}
> +
> +	BUG_ON(req->ki_ioext->ke_pi_iter.pi_userpages);
> +
> +	end = (((unsigned long)ext->ie_pi_buf) + ext->ie_pi_buflen +
> +		PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	start = ((unsigned long)ext->ie_pi_buf) >> PAGE_SHIFT;
> +	req->ki_ioext->ke_pi_iter.pi_offset = offset_in_page(ext->ie_pi_buf);
> +	req->ki_ioext->ke_pi_iter.pi_len = ext->ie_pi_buflen;
> +	req->ki_ioext->ke_pi_iter.pi_nrpages = end - start;
> +	p = kzalloc(req->ki_ioext->ke_pi_iter.pi_nrpages *
> +		    sizeof(struct page *),
> +		    GFP_NOIO);

Can userspace give us bad data and get us to generate insane allcation
attempt warnings?

> +	if (p == NULL) {
> +		pr_err("%s: no room for page array?\n", __func__);
> +		return -ENOMEM;
> +	}
> +	req->ki_ioext->ke_pi_iter.pi_userpages = p;
> +
> +	retval = get_user_pages_fast((unsigned long)ext->ie_pi_buf,
> +				     req->ki_ioext->ke_pi_iter.pi_nrpages,
> +				     is_write,

Isn't this is_write backwards?  If it's a write syscall then the PI
pages is going to be read from.

- z

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO
  2014-04-02 19:14 ` [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO Zach Brown
@ 2014-04-02 20:05   ` Zach Brown
  0 siblings, 0 replies; 21+ messages in thread
From: Zach Brown @ 2014-04-02 20:05 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, martin.petersen, JBottomley, jmoyer, bcrl, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

> 's a small number of them with strong semantics because they're a part
> of the syscall ABI.

("There's" a small number of them..  vim troubles :))

- z

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/6] fs/bio-integrity: remove duplicate code
  2014-04-02 19:17   ` Zach Brown
@ 2014-04-02 20:35     ` Darrick J. Wong
  0 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2014-04-02 20:35 UTC (permalink / raw)
  To: Zach Brown
  Cc: axboe, martin.petersen, JBottomley, jmoyer, bcrl, viro,
	linux-fsdevel, linux-aio, Gu Zheng, linux-scsi, linux-mm

On Wed, Apr 02, 2014 at 12:17:58PM -0700, Zach Brown wrote:
> > +static int bio_integrity_generate_verify(struct bio *bio, int operate)
> >  {
> 
> > +	if (operate)
> > +		sector = bio->bi_iter.bi_sector;
> > +	else
> > +		sector = bio->bi_integrity->bip_iter.bi_sector;
> 
> > +		if (operate) {
> > +			bi->generate_fn(&bix);
> > +		} else {
> > +			ret = bi->verify_fn(&bix);
> > +			if (ret) {
> > +				kunmap_atomic(kaddr);
> > +				return ret;
> > +			}
> > +		}
> 
> I was glad to see this replaced with explicit sector and func arguments
> in later refactoring in the 6/ patch.
> 
> But I don't think the function poiner casts in that 6/ patch are wise
> (Or even safe all the time, given crazy function pointer trampolines?
> Is that still a thing?).  I'd have made a single walk_fn type that
> returns and have the non-returning iterators just return 0.

Noted.  I cleaned all that crap out just yesterday, so now there's only one
walk function and some context data that gets passed to the iterator function.
Much less horrifying.

(I really only included this patch so that I'd have less rebasing work when
3.15-rc1 comes out.)

--D
> 
> - z
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] aio/dio: enable PI passthrough
  2014-04-02 20:01   ` Zach Brown
@ 2014-04-02 20:44     ` Darrick J. Wong
  2014-04-02 22:33       ` Zach Brown
  0 siblings, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2014-04-02 20:44 UTC (permalink / raw)
  To: Zach Brown
  Cc: axboe, martin.petersen, JBottomley, jmoyer, bcrl, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

On Wed, Apr 02, 2014 at 01:01:33PM -0700, Zach Brown wrote:
> > +static int setup_pi_ext(struct kiocb *req, int is_write)
> > +{
> > +	struct file *file = req->ki_filp;
> > +	struct io_extension *ext = &req->ki_ioext->ke_kern;
> > +	void *p;
> > +	unsigned long start, end;
> > +	int retval;
> > +
> > +	if (!(file->f_flags & O_DIRECT)) {
> > +		pr_debug("EINVAL: can't use PI without O_DIRECT.\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	BUG_ON(req->ki_ioext->ke_pi_iter.pi_userpages);
> > +
> > +	end = (((unsigned long)ext->ie_pi_buf) + ext->ie_pi_buflen +
> > +		PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +	start = ((unsigned long)ext->ie_pi_buf) >> PAGE_SHIFT;
> > +	req->ki_ioext->ke_pi_iter.pi_offset = offset_in_page(ext->ie_pi_buf);
> > +	req->ki_ioext->ke_pi_iter.pi_len = ext->ie_pi_buflen;
> > +	req->ki_ioext->ke_pi_iter.pi_nrpages = end - start;
> > +	p = kzalloc(req->ki_ioext->ke_pi_iter.pi_nrpages *
> > +		    sizeof(struct page *),
> > +		    GFP_NOIO);
> 
> Can userspace give us bad data and get us to generate insane allcation
> attempt warnings?

Easily.  One of the bits I have to work on for the PI part is figuring out how
to check with the PI provider that the arguments (the iovec and the pi buffer)
actually make any sense, in terms of length and alignment requirements (PI
tuples can't cross pages).  I think it's as simple as adding a bio_integrity
ops call, and then calling down to it from the kiocb level.

One thing I'm not sure about: What's the largest IO (in terms of # of blocks,
not # of struct iovecs) that I can throw at the kernel?

> > +	if (p == NULL) {
> > +		pr_err("%s: no room for page array?\n", __func__);
> > +		return -ENOMEM;
> > +	}
> > +	req->ki_ioext->ke_pi_iter.pi_userpages = p;
> > +
> > +	retval = get_user_pages_fast((unsigned long)ext->ie_pi_buf,
> > +				     req->ki_ioext->ke_pi_iter.pi_nrpages,
> > +				     is_write,
> 
> Isn't this is_write backwards?  If it's a write syscall then the PI
> pages is going to be read from.

Yes, I think so.  Good catch!

--D
> 
> - z
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] io: define an interface for IO extensions
  2014-04-02 19:22     ` Jeff Moyer
@ 2014-04-02 22:08       ` Darrick J. Wong
  0 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2014-04-02 22:08 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: axboe, zab, martin.petersen, JBottomley, bcrl, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm, linux-man

On Wed, Apr 02, 2014 at 03:22:20PM -0400, Jeff Moyer wrote:
> "Darrick J. Wong" <darrick.wong@oracle.com> writes:
> 
> > Define a generic interface to allow userspace to attach metadata to an
> > IO operation.  This interface will be used initially to implement
> > protection information (PI) pass through, though it ought to be usable
> > by anyone else desiring to extend the IO interface.  It should not be
> > difficult to modify the non-AIO calls to use this mechanism.
> 
> My main issue with this patch is determining what exactly gets returned
> to userspace when there is an issue in the teardown_extensions path.
> It looks like you'll get the first error propagated from
> io_teardown_extensions, others are ignored.  Then, in aio_complete, if
> there was no error with the I/O, then you'll get the teardown error
> reported in event->res, otherwise you'll get it in event->res2.  So,
> what are the valid errors returned by the teardown routine for
> extensions?  How is the userspace app supposed to determine where the
> error came from, the I/O or a failure in the extension teardown?

There's also the question of which extension spat out the error.  One solution
would be to augment struct io_extension with all the error fields that we want
(an extension can declare its own if needed) as we do now, and if errors happen
during setup, we can just copy_to_user them back.  If nothing else fails with
the IO setup, the setup routine can return -EINVAL, and userspace can look for
updated error fields in the struct.

Unfortunately for the teardown error case you'd have to pin the whole page in
memory for the duration of the IO just to have it around.  For now this isn't a
problem because teardown can't fail anyway.

> I think it may make sense to only use res2 for reporting io extension
> teardown failures.  Any new code that will use extensions can certainly
> be written to check both res and res2, and this method would prevent the
> ambiguity I mentioned.

Hmm, doesn't look like anyone actually uses res2 except for USB gadgets.

It's tempting just to shove the first ioextension error code that comes along
into res2 and abort the whole thing, and let userspace guess where the res2
code came from.  I think there's an additional problem with stuffing return
codes: in the case of synchronous IO syscalls, we'd have to deal with how to
cram error codes from (potentially) multiple sources into the single return
value, while not giving userspace any help as to where the code came from.

Now that I've written all that out, I don't like this idea so I'll drop it. :)

> Finally, I know this is an RFC, but please add some man-page changes to
> your patch set, and CC linux-man.  Michael Kerrisk typically has
> valuable advice on new APIs.

I'll do that the next time I rev the patches.  Thank you for the suggestion.

--D
> 
> Cheers,
> Jeff
> 
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/aio.c                     |  180 +++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/aio.h          |    7 ++
> >  include/uapi/linux/aio_abi.h |   15 +++-
> >  3 files changed, 197 insertions(+), 5 deletions(-)
> >
> >
> > diff --git a/fs/aio.c b/fs/aio.c
> > index 062a5f6..0c40bdc 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -158,6 +158,11 @@ static struct vfsmount *aio_mnt;
> >  static const struct file_operations aio_ring_fops;
> >  static const struct address_space_operations aio_ctx_aops;
> >  
> > +static int io_teardown_extensions(struct kiocb *req);
> > +static int io_setup_extensions(struct kiocb *req, int is_write,
> > +			       struct io_extension __user *ioext);
> > +static int iocb_setup_extensions(struct iocb *iocb, struct kiocb *req);
> > +
> >  static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
> >  {
> >  	struct qstr this = QSTR_INIT("[aio]", 5);
> > @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
> >  	struct io_event	*ev_page, *event;
> >  	unsigned long	flags;
> >  	unsigned tail, pos;
> > +	int ret;
> > +
> > +	ret = io_teardown_extensions(iocb);
> > +	if (ret) {
> > +		if (!res)
> > +			res = ret;
> > +		else if (!res2)
> > +			res2 = ret;
> > +		else
> > +			pr_err("error %d tearing down aio extensions\n", ret);
> > +	}
> >  
> >  	/*
> >  	 * Special case handling for sync iocbs:
> > @@ -1350,15 +1366,167 @@ rw_common:
> >  	return 0;
> >  }
> >  
> > +/* IO extension code */
> > +#define REQUIRED_STRUCTURE_SIZE(type, member)	\
> > +	(offsetof(type, member) + sizeof(((type *)NULL)->member))
> > +#define IO_EXT_SIZE(member) \
> > +	REQUIRED_STRUCTURE_SIZE(struct io_extension, member)
> > +
> > +struct io_extension_type {
> > +	unsigned int type;
> > +	unsigned int extension_struct_size;
> > +	int (*setup_fn)(struct kiocb *, int is_write);
> > +	int (*destroy_fn)(struct kiocb *);
> > +};
> > +
> > +static struct io_extension_type extensions[] = {
> > +	{IO_EXT_INVALID, 0, NULL, NULL},
> > +};
> > +
> > +static int is_write_iocb(struct iocb *iocb)
> > +{
> > +	switch (iocb->aio_lio_opcode) {
> > +	case IOCB_CMD_PWRITE:
> > +	case IOCB_CMD_PWRITEV:
> > +		return 1;
> > +	default:
> > +		return 0;
> > +	}
> > +}
> > +
> > +static int io_teardown_extensions(struct kiocb *req)
> > +{
> > +	struct io_extension_type *iet;
> > +	int ret, ret2;
> > +
> > +	if (req->ki_ioext == NULL)
> > +		return 0;
> > +
> > +	/* Shut down all the extensions */
> > +	ret = 0;
> > +	for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
> > +		if (!(req->ki_ioext->ke_kern.ie_has & iet->type))
> > +			continue;
> > +		ret2 = iet->destroy_fn(req);
> > +		if (ret2 && !ret)
> > +			ret = ret2;
> > +	}
> > +
> > +	/* Copy out return values */
> > +	if (unlikely(copy_to_user(req->ki_ioext->ke_user,
> > +				  &req->ki_ioext->ke_kern,
> > +				  sizeof(struct io_extension)))) {
> > +		if (!ret)
> > +			ret = -EFAULT;
> > +	}
> > +
> > +	kfree(req->ki_ioext);
> > +	req->ki_ioext = NULL;
> > +	return ret;
> > +}
> > +
> > +static int io_check_bufsize(__u64 has, __u64 size)
> > +{
> > +	struct io_extension_type *iet;
> > +	__u64 all_flags = 0;
> > +
> > +	for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
> > +		all_flags |= iet->type;
> > +		if (!(has & iet->type))
> > +			continue;
> > +		if (iet->extension_struct_size > size)
> > +			return -EINVAL;
> > +	}
> > +
> > +	if (has & ~all_flags)
> > +		return -EINVAL;
> > +
> > +	return 0;
> > +}
> > +
> > +static int io_setup_extensions(struct kiocb *req, int is_write,
> > +			       struct io_extension __user *ioext)
> > +{
> > +	struct io_extension_type *iet;
> > +	__u64 sz, has;
> > +	int ret;
> > +
> > +	/* Check size of buffer */
> > +	if (unlikely(copy_from_user(&sz, &ioext->ie_size, sizeof(sz))))
> > +		return -EFAULT;
> > +	if (sz > PAGE_SIZE ||
> > +	    sz > sizeof(struct io_extension) ||
> > +	    sz < IO_EXT_SIZE(ie_has))
> > +		return -EINVAL;
> > +
> > +	/* Check that the buffer's big enough */
> > +	if (unlikely(copy_from_user(&has, &ioext->ie_has, sizeof(has))))
> > +		return -EFAULT;
> > +	ret = io_check_bufsize(has, sz);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/* Copy from userland */
> > +	req->ki_ioext = kzalloc(sizeof(struct kio_extension), GFP_NOIO);
> > +	if (!req->ki_ioext)
> > +		return -ENOMEM;
> > +
> > +	req->ki_ioext->ke_user = ioext;
> > +	if (unlikely(copy_from_user(&req->ki_ioext->ke_kern, ioext, sz))) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	/* Try to initialize all the extensions */
> > +	has = 0;
> > +	for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
> > +		if (!(req->ki_ioext->ke_kern.ie_has & iet->type))
> > +			continue;
> > +		ret = iet->setup_fn(req, is_write);
> > +		if (ret) {
> > +			req->ki_ioext->ke_kern.ie_has = has;
> > +			goto out_destroy;
> > +		}
> > +		has |= iet->type;
> > +	}
> > +
> > +	return 0;
> > +out_destroy:
> > +	io_teardown_extensions(req);
> > +out:
> > +	kfree(req->ki_ioext);
> > +	req->ki_ioext = NULL;
> > +	return ret;
> > +}
> > +
> > +static int iocb_setup_extensions(struct iocb *iocb, struct kiocb *req)
> > +{
> > +	struct io_extension __user *user_ext;
> > +
> > +	if (!(iocb->aio_flags & IOCB_FLAG_EXTENSIONS))
> > +		return 0;
> > +
> > +	user_ext = (struct io_extension __user *)iocb->aio_extension_ptr;
> > +	return io_setup_extensions(req, is_write_iocb(iocb), user_ext);
> > +}
> > +
> >  static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
> >  			 struct iocb *iocb, bool compat)
> >  {
> >  	struct kiocb *req;
> >  	ssize_t ret;
> >  
> > +	/* check for flags we don't know about */
> > +	if (iocb->aio_flags & ~IOCB_FLAG_ALL) {
> > +		pr_debug("EINVAL: invalid flags\n");
> > +		return -EINVAL;
> > +	}
> > +
> >  	/* enforce forwards compatibility on users */
> > -	if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
> > -		pr_debug("EINVAL: reserve field set\n");
> > +	if (unlikely(iocb->aio_reserved1 ||
> > +	    (!(iocb->aio_flags & IOCB_FLAG_EXTENSIONS) &&
> > +	     iocb->aio_extension_ptr))) {
> > +		pr_debug("EINVAL: reserved field set\n");
> >  		return -EINVAL;
> >  	}
> >  
> > @@ -1408,13 +1576,19 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
> >  	req->ki_pos = iocb->aio_offset;
> >  	req->ki_nbytes = iocb->aio_nbytes;
> >  
> > +	ret = iocb_setup_extensions(iocb, req);
> > +	if (unlikely(ret))
> > +		goto out_del_ext;
> > +
> >  	ret = aio_run_iocb(req, iocb->aio_lio_opcode,
> >  			   (char __user *)(unsigned long)iocb->aio_buf,
> >  			   compat);
> >  	if (ret)
> > -		goto out_put_req;
> > +		goto out_del_ext;
> >  
> >  	return 0;
> > +out_del_ext:
> > +	io_teardown_extensions(req);
> >  out_put_req:
> >  	put_reqs_available(ctx, 1);
> >  	percpu_ref_put(&ctx->reqs);
> > diff --git a/include/linux/aio.h b/include/linux/aio.h
> > index d9c92da..60f4364 100644
> > --- a/include/linux/aio.h
> > +++ b/include/linux/aio.h
> > @@ -29,6 +29,10 @@ struct kiocb;
> >  
> >  typedef int (kiocb_cancel_fn)(struct kiocb *);
> >  
> > +struct kio_extension {
> > +	struct io_extension __user *ke_user;
> > +	struct io_extension ke_kern;
> > +};
> >  struct kiocb {
> >  	struct file		*ki_filp;
> >  	struct kioctx		*ki_ctx;	/* NULL for sync ops */
> > @@ -52,6 +56,9 @@ struct kiocb {
> >  	 * this is the underlying eventfd context to deliver events to.
> >  	 */
> >  	struct eventfd_ctx	*ki_eventfd;
> > +
> > +	/* Kernel copy of extension descriptors */
> > +	struct kio_extension	*ki_ioext;
> >  };
> >  
> >  static inline bool is_sync_kiocb(struct kiocb *kiocb)
> > diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
> > index bb2554f..07ffd1f 100644
> > --- a/include/uapi/linux/aio_abi.h
> > +++ b/include/uapi/linux/aio_abi.h
> > @@ -53,6 +53,8 @@ enum {
> >   *                   is valid.
> >   */
> >  #define IOCB_FLAG_RESFD		(1 << 0)
> > +#define IOCB_FLAG_EXTENSIONS	(1 << 1)
> > +#define IOCB_FLAG_ALL		(IOCB_FLAG_RESFD | IOCB_FLAG_EXTENSIONS)
> >  
> >  /* read() from /dev/aio returns these structures. */
> >  struct io_event {
> > @@ -70,6 +72,15 @@ struct io_event {
> >  #error edit for your odd byteorder.
> >  #endif
> >  
> > +/* IO extension types */
> > +#define IO_EXT_INVALID	(0)
> > +
> > +/* IO extension descriptor */
> > +struct io_extension {
> > +	__u64 ie_size;
> > +	__u64 ie_has;
> > +};
> > +
> >  /*
> >   * we always use a 64bit off_t when communicating
> >   * with userland.  its up to libraries to do the
> > @@ -91,8 +102,8 @@ struct iocb {
> >  	__u64	aio_nbytes;
> >  	__s64	aio_offset;
> >  
> > -	/* extra parameters */
> > -	__u64	aio_reserved2;	/* TODO: use this for a (struct sigevent *) */
> > +	/* aio extensions, only present if IOCB_FLAG_EXTENSIONS */
> > +	__u64	aio_extension_ptr;
> >  
> >  	/* flags for the "struct iocb" */
> >  	__u32	aio_flags;
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] io: define an interface for IO extensions
  2014-04-02 19:49   ` Zach Brown
@ 2014-04-02 22:28     ` Darrick J. Wong
  2014-04-02 22:53       ` Zach Brown
  0 siblings, 1 reply; 21+ messages in thread
From: Darrick J. Wong @ 2014-04-02 22:28 UTC (permalink / raw)
  To: Zach Brown
  Cc: axboe, martin.petersen, JBottomley, jmoyer, bcrl, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

On Wed, Apr 02, 2014 at 12:49:47PM -0700, Zach Brown wrote:
> > @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
> >  	struct io_event	*ev_page, *event;
> >  	unsigned long	flags;
> >  	unsigned tail, pos;
> > +	int ret;
> > +
> > +	ret = io_teardown_extensions(iocb);
> > +	if (ret) {
> > +		if (!res)
> > +			res = ret;
> > +		else if (!res2)
> > +			res2 = ret;
> > +		else
> > +			pr_err("error %d tearing down aio extensions\n", ret);
> > +	}
> 
> This ends up trying to copy the kernel's io_extension copy back to
> userspace from interrupts, which obviously won't fly.
> 
> And to what end?  So that maybe someone can later add an 'extension'
> that can fill in some field that's then copied to userspace?  But by
> copying the entire argument struct back?
> 
> Let's not get ahead of ourselves.  If they're going to try and give
> userspace some feedback after IO completion they're going to have to try
> a lot harder because they don't have acces to the submitting task
> context anymore.  They'd have to pin some reference to a feedback
> mechanism in the in-flight io.  I think we'd want that explicit in the
> iocb, not hiding off on the other side of this extension interface.

I think we'd want to find an extension that really needs this.  PI doesn't.
We can skate by without supporting the teardown errors case for now.

> I'd just remove this generic teardown callback path entirely.  If
> there's PI state hanging off the iocb tear it down during iocb teardown.

Hmm, I thought aio_complete /was/ iocb teardown time.

> > +struct io_extension_type {
> > +	unsigned int type;
> > +	unsigned int extension_struct_size;
> > +	int (*setup_fn)(struct kiocb *, int is_write);
> > +	int (*destroy_fn)(struct kiocb *);
> > +};
> 
> I'd also get rid of all of this.  More below.
> 
> > +static int io_setup_extensions(struct kiocb *req, int is_write,
> > +			       struct io_extension __user *ioext)
> > +{
> > +	struct io_extension_type *iet;
> > +	__u64 sz, has;
> > +	int ret;
> > +
> > +	/* Check size of buffer */
> > +	if (unlikely(copy_from_user(&sz, &ioext->ie_size, sizeof(sz))))
> > +		return -EFAULT;
> > +	if (sz > PAGE_SIZE ||
> > +	    sz > sizeof(struct io_extension) ||
> > +	    sz < IO_EXT_SIZE(ie_has))
> > +		return -EINVAL;
> > +
> > +	/* Check that the buffer's big enough */
> > +	if (unlikely(copy_from_user(&has, &ioext->ie_has, sizeof(has))))
> > +		return -EFAULT;
> > +	ret = io_check_bufsize(has, sz);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/* Copy from userland */
> > +	req->ki_ioext = kzalloc(sizeof(struct kio_extension), GFP_NOIO);
> > +	if (!req->ki_ioext)
> > +		return -ENOMEM;
> > +
> > +	req->ki_ioext->ke_user = ioext;
> > +	if (unlikely(copy_from_user(&req->ki_ioext->ke_kern, ioext, sz))) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> 
> (Isn't there some allocate-and-copy-from-userspace helper now? But..)

<shrug> Is there?  I didn't find one when I looked, but it wasn't an exhaustive
search.

> I don't like the rudundancy of the implicit size requirement by a
> field's flag being set being duplicated by the explicit size argument.
> What does that give us, exactly?

Either another sanity check or another way to screw up, depending on how you
look at it.  I'd been considering shortening the size field to u32 and adding a
magic number field, but I wonder if that's really necessary.  Seems like it
shouldn't be -- if userland screws up, it's not hard to kill the process.
(Or segv it, or...)

> Our notion of the total size only seems to only matter if we're copying
> the entire struct from userspace and I'm don't think we need to do that.
> 
> For each argument, we're translating it into some kernel equivalent,
> right?

Yes.

> Fields in the iocb  As each of these are initialized I'd just
> test the presence bits and __get_user() the userspace arguemnts
> directly, or copy_from_user() something slightly more complicated on to
> the stack.
>
> That gets rid of us having to care about the size at all.  It stops us
> from allocating a kernel copy and pinning it for the duration of the IO.
> We'd just be sampling the present userspace arguments as we initialie
> the iocb during submission.

I like this idea.  For the PI extension, nothing particularly error-prone
happens in teardown, which allows the flexibility to copy_from_user any
arguments required, and to copy_to_user any setup errors that happen.  I can
get rid a lot of allocate-and-copy nonsense, as you point out.

Ok, I'll migrate my patches towards this strategy, and let's see how much code
goes away. :)

I've also noticed a bug where if you make one of these PI-extended calls on a
file living on a filesystem, it'll extend the io request's range to be
filesystem block-aligned, which causes all kinds of havoc with the user
provided PI buffers, since they now need to be extended to fit the added
blocks.  Alternately, one could require PI IOs to be fs-block aligned when
dealing with regular files. 

> > +	/* Try to initialize all the extensions */
> > +	has = 0;
> > +	for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
> > +		if (!(req->ki_ioext->ke_kern.ie_has & iet->type))
> > +			continue;
> > +		ret = iet->setup_fn(req, is_write);
> > +		if (ret) {
> > +			req->ki_ioext->ke_kern.ie_has = has;
> > +			goto out_destroy;
> > +		}
> > +		has |= iet->type;
> > +	}
> 
> So instead of doing all this we'd test explicit bits and act
> accordingly.  If they're trivial translations between userspace fields
> and iocb fields we could just do it inline in this helper that'd be more
> like iocb_parse_more_args(iocb, struct __user *ptr).  For more
> complicated stuff, like the PI page pinning, it could call out to PI.

I'd definitely leave the PI stuff in separate functions to avoid cluttering
io_*_extension().

> > +	user_ext = (struct io_extension __user *)iocb->aio_extension_ptr;
> 
> Need a __force there?  Has this been run through sparse?

Nope, none of the usual quality checks.  Hence DONOTMERGE. :)

--D
> 
> - z
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] aio/dio: enable PI passthrough
  2014-04-02 20:44     ` Darrick J. Wong
@ 2014-04-02 22:33       ` Zach Brown
  2014-04-02 22:55         ` Darrick J. Wong
  0 siblings, 1 reply; 21+ messages in thread
From: Zach Brown @ 2014-04-02 22:33 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, martin.petersen, JBottomley, jmoyer, bcrl, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

> One thing I'm not sure about: What's the largest IO (in terms of # of blocks,
> not # of struct iovecs) that I can throw at the kernel?

Yeah, dunno.  I'd guess big :).  I'd hope that the PI code already has a
way to clamp the size of bios if there's a limit to the size of PI data
that can be managed downstream?

- z

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] io: define an interface for IO extensions
  2014-04-02 22:28     ` Darrick J. Wong
@ 2014-04-02 22:53       ` Zach Brown
  2014-04-02 23:06         ` Darrick J. Wong
  0 siblings, 1 reply; 21+ messages in thread
From: Zach Brown @ 2014-04-02 22:53 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, martin.petersen, JBottomley, jmoyer, bcrl, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

> > I'd just remove this generic teardown callback path entirely.  If
> > there's PI state hanging off the iocb tear it down during iocb teardown.
> 
> Hmm, I thought aio_complete /was/ iocb teardown time.

Well, usually :).  If you build up before aio_run_iocb() then you nead
to teardown in kiocb_free(), which is also called by aio_complete().

> > (Isn't there some allocate-and-copy-from-userspace helper now? But..)
> 
> <shrug> Is there?  I didn't find one when I looked, but it wasn't an exhaustive
> search.

I could have sworn that I saw something.. ah, right, memdup_user().

> > I don't like the rudundancy of the implicit size requirement by a
> > field's flag being set being duplicated by the explicit size argument.
> > What does that give us, exactly?
> 
> Either another sanity check or another way to screw up, depending on how you
> look at it.  I'd been considering shortening the size field to u32 and adding a
> magic number field, but I wonder if that's really necessary.  Seems like it
> shouldn't be -- if userland screws up, it's not hard to kill the process.
> (Or segv it, or...)

I don't think I'd bother.  The bits should be enough and are already
necessary to have explicit indicators of fields being set.

> > Fields in the iocb  As each of these are initialized I'd just
> > test the presence bits and __get_user() the userspace arguemnts
> > directly, or copy_from_user() something slightly more complicated on to
> > the stack.
> >
> > That gets rid of us having to care about the size at all.  It stops us
> > from allocating a kernel copy and pinning it for the duration of the IO.
> > We'd just be sampling the present userspace arguments as we initialie
> > the iocb during submission.
> 
> I like this idea.  For the PI extension, nothing particularly error-prone
> happens in teardown, which allows the flexibility to copy_from_user any
> arguments required, and to copy_to_user any setup errors that happen.  I can
> get rid a lot of allocate-and-copy nonsense, as you point out.
> 
> Ok, I'll migrate my patches towards this strategy, and let's see how much code
> goes away. :)

Cool :).

> I've also noticed a bug where if you make one of these PI-extended calls on a
> file living on a filesystem, it'll extend the io request's range to be
> filesystem block-aligned, which causes all kinds of havoc with the user
> provided PI buffers, since they now need to be extended to fit the added
> blocks.  Alternately, one could require PI IOs to be fs-block aligned when
> dealing with regular files. 

I think, like O_DIRECT, it just has to be aligned or fail :(.

- z

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] aio/dio: enable PI passthrough
  2014-04-02 22:33       ` Zach Brown
@ 2014-04-02 22:55         ` Darrick J. Wong
  0 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2014-04-02 22:55 UTC (permalink / raw)
  To: Zach Brown
  Cc: axboe, martin.petersen, JBottomley, jmoyer, bcrl, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

On Wed, Apr 02, 2014 at 03:33:11PM -0700, Zach Brown wrote:
> > One thing I'm not sure about: What's the largest IO (in terms of # of blocks,
> > not # of struct iovecs) that I can throw at the kernel?
> 
> Yeah, dunno.  I'd guess big :).  I'd hope that the PI code already has a
> way to clamp the size of bios if there's a limit to the size of PI data
> that can be managed downstream?

I guess if we restricted the size of the PI buffer to a page's worth of
pointers to struct page, that limits us to 128M on x64 with DIF and 512b
sectors.  That's not really a whole lot; I suppose one could (ab)use vmalloc.

Yes, blk-integrity clamps the size of the bio to fit the downstream device's
maximum integrity sg size.  See max_integrity_segments for details, or the
mostly-undocumented sg_prot_tablesize sysfs attribute that reveals it.

I don't know what a practical limit is; scsi_debug sets it to 65536.

--D
> 
> - z
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] io: define an interface for IO extensions
  2014-04-02 22:53       ` Zach Brown
@ 2014-04-02 23:06         ` Darrick J. Wong
  0 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2014-04-02 23:06 UTC (permalink / raw)
  To: Zach Brown
  Cc: axboe, martin.petersen, JBottomley, jmoyer, bcrl, viro,
	linux-fsdevel, linux-aio, linux-scsi, linux-mm

On Wed, Apr 02, 2014 at 03:53:33PM -0700, Zach Brown wrote:
> > > I'd just remove this generic teardown callback path entirely.  If
> > > there's PI state hanging off the iocb tear it down during iocb teardown.
> > 
> > Hmm, I thought aio_complete /was/ iocb teardown time.
> 
> Well, usually :).  If you build up before aio_run_iocb() then you nead
> to teardown in kiocb_free(), which is also called by aio_complete().

Oh, yeah.  I handle that by tearing down the extensions if stuff fails, though
I don't remember if that was in this version of the patchset.

> > > (Isn't there some allocate-and-copy-from-userspace helper now? But..)
> > 
> > <shrug> Is there?  I didn't find one when I looked, but it wasn't an exhaustive
> > search.
> 
> I could have sworn that I saw something.. ah, right, memdup_user().

Noted. :)

> > > I don't like the rudundancy of the implicit size requirement by a
> > > field's flag being set being duplicated by the explicit size argument.
> > > What does that give us, exactly?
> > 
> > Either another sanity check or another way to screw up, depending on how you
> > look at it.  I'd been considering shortening the size field to u32 and adding a
> > magic number field, but I wonder if that's really necessary.  Seems like it
> > shouldn't be -- if userland screws up, it's not hard to kill the process.
> > (Or segv it, or...)
> 
> I don't think I'd bother.  The bits should be enough and are already
> necessary to have explicit indicators of fields being set.

<nod>

> > > Fields in the iocb  As each of these are initialized I'd just
> > > test the presence bits and __get_user() the userspace arguemnts
> > > directly, or copy_from_user() something slightly more complicated on to
> > > the stack.
> > >
> > > That gets rid of us having to care about the size at all.  It stops us
> > > from allocating a kernel copy and pinning it for the duration of the IO.
> > > We'd just be sampling the present userspace arguments as we initialie
> > > the iocb during submission.
> > 
> > I like this idea.  For the PI extension, nothing particularly error-prone
> > happens in teardown, which allows the flexibility to copy_from_user any
> > arguments required, and to copy_to_user any setup errors that happen.  I can
> > get rid a lot of allocate-and-copy nonsense, as you point out.
> > 
> > Ok, I'll migrate my patches towards this strategy, and let's see how much code
> > goes away. :)
> 
> Cool :).
> 
> > I've also noticed a bug where if you make one of these PI-extended calls on a
> > file living on a filesystem, it'll extend the io request's range to be
> > filesystem block-aligned, which causes all kinds of havoc with the user
> > provided PI buffers, since they now need to be extended to fit the added
> > blocks.  Alternately, one could require PI IOs to be fs-block aligned when
> > dealing with regular files. 
> 
> I think, like O_DIRECT, it just has to be aligned or fail :(.

Heh.  O_DIRECT is a hilarious maze of twisty unobvious requirements.  Yuck.

#define O_IMNAIVEENOUGHTOTHINKIKNOWWHATTHISDOES	O_DIRECT

--D
> 
> - z
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-04-02 23:06 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-03-24 16:22 [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO Darrick J. Wong
2014-03-24 16:22 ` [PATCH 1/6] fs/bio-integrity: remove duplicate code Darrick J. Wong
2014-04-02 19:17   ` Zach Brown
2014-04-02 20:35     ` Darrick J. Wong
2014-03-24 16:22 ` [PATCH 2/6] io: define an interface for IO extensions Darrick J. Wong
     [not found]   ` <20140324162244.10848.46322.stgit-PTl6brltDGh4DFYR7WNSRA@public.gmane.org>
2014-04-02 19:22     ` Jeff Moyer
2014-04-02 22:08       ` Darrick J. Wong
2014-04-02 19:49   ` Zach Brown
2014-04-02 22:28     ` Darrick J. Wong
2014-04-02 22:53       ` Zach Brown
2014-04-02 23:06         ` Darrick J. Wong
2014-03-24 16:22 ` [PATCH 3/6] aio/dio: enable PI passthrough Darrick J. Wong
2014-04-02 20:01   ` Zach Brown
2014-04-02 20:44     ` Darrick J. Wong
2014-04-02 22:33       ` Zach Brown
2014-04-02 22:55         ` Darrick J. Wong
2014-03-24 16:22 ` [PATCH 4/6] PI IO extension: allow user to ask kernel to fill in parts of the protection info Darrick J. Wong
2014-03-24 16:23 ` [PATCH 5/6] PI IO extension: advertise possible userspace flags Darrick J. Wong
2014-03-24 16:23 ` [PATCH 6/6] blk-integrity: refactor various routines Darrick J. Wong
2014-04-02 19:14 ` [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO Zach Brown
2014-04-02 20:05   ` Zach Brown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).